An open automation system for predatory journal detection

Predatory and legitimate journals

Predatory journals take advantage of scholars' eagerness to publish in order to solicit articles. Their features include rapid review with no professional review mechanism, fraudulent impact factors, fake editorial boards that falsely list respected scientists, an extensive collection of articles, journal titles closely resembling those of legitimate journals, and aggressive spam invitations to submit articles. Furthermore, predatory journals make profits by charging high article processing fees.

As shown in Fig. 1, both predatory and legitimate journal websites commonly display text blocks labeled “Impact factor,” “Editorial board,” “About the journal,” and “Contact us.” Distinguishing between them requires the same machine learning techniques used to solve binary classification problems such as fake social media identities [27], suspicious URLs in social networks, and the hijacking of legitimate websites [25]. In machine learning, the text classification process consists of tag or category assignments based on text content. Although text can offer rich sources of information, extracting insights can be difficult and time-consuming when unstructured data are involved.

Figure 1

Our proposed academic journal predatory checking (AJPC) system identified the first journal, Antarctic Science, as legitimate, and the second, International Journal for Development of Computer Science and Technology, as potentially predatory. Similarities between the two websites are marked by the colored box frames. 1a was captured from and 1b was captured from

Tactics used by predatory publishers include misrepresentations of peer review processes, editorial services, and database indexing statuses [1]. Profit-oriented predatory journals often cut back drastically on editorial and publishing costs by completely eliminating procedures such as referee reviews, addressing academic misconduct issues, flagging possible instances of plagiarism, and confirming author group legitimacy [29]. Nevertheless, a surprising number of predatory journals find it easy to attract scholarly submissions from authors interested in padding their CVs [21,30]. These purposefully deceptive actions can result in incorrect quotes and citations, thus wasting precious research funds and resources while destroying public confidence in university research. Predatory journal websites also tend to lack credible database indexing with agencies such as Journal Citation Reports (JCR) or the Directory of Open Access Journals (DOAJ). Combined, these problems are creating chaos in academic communities, with editors, authors, reviewers, and related individuals pursuing various strategies to protect research quality [31,32].

Since predatory journals tend to falsify their index information and impact values while promoting high acceptance rates [33], researchers interested in avoiding predatory journals need to be familiar with current index rankings, scientific indicators, and announcements from science publication databases. Along with editorial office addresses, words and phrases such as “indexing in [specific] database” and “journal metrics” appear to indicate legitimacy, but they are also used in misleading advertising and promotional emails sent out by predatory journals [34]. Other red flags include promises of fast peer review; the use of informal or personal contact emails that are not associated with a website; journal webpages with multiple spelling, grammar, and content errors; false claims of high impact factors based on self-created indicators; and a lack of publisher listings in widely used databases such as the DOAJ, the Open Access Scholarly Publishers Association, or the Committee on Publication Ethics [13,19,33,34,35,36]. Unintentionally publishing academic research in response to spam and phishing emails can damage careers and cause financial loss. Researchers are troubled by the electronic invitations they receive to submit papers or attend conferences, and they need either good training or a useful evaluation system to assess whether such invitations are predatory.

As Fig. 1 shows, the owners of predatory journal websites are skilled at mimicking the layout styles of legitimate websites. Figure 2 shows the opening lines of letters and emails from predatory journals that scholars regularly receive inviting them to submit manuscripts; it is difficult to distinguish them from communications sent out by legitimate journals [21,30]. Both figures contain examples of text extolling the virtues of the inviting journals, including high h5-index values, high citation rates, and specific indexing (green, red and orange boxes, respectively).

Figure 2

Examples of potentially misleading text in invitations sent to scholars to submit manuscripts.

Classification model

Supervised, unsupervised, and reinforcement machine learning for natural language processing are useful tools for solving numerous text analytics problems. The primary challenge in creating a convenient predatory journal identification system is similar to those for fake news and malicious URL detection problems [28,37]: both problem types involve text variation, confusing or unclear messages, and imitative website layouts. Since predatory journal homepage identification is essentially a classification problem, we set out to modify several algorithms to improve the human-centered machine learning process associated with the Google UX Community [38]. Currently, the most commonly used text evaluation and classification approaches are support vector machine (SVM), Gaussian naïve Bayes, multinomial naïve Bayes, random forest (RF), logistic regression, stochastic gradient descent (SGD), K-nearest neighbor (KNN), and voting [39]. All use finely tuned parameters to select the best configuration for each classification technique. The following are brief descriptions of these approaches.

Frequently used to detect deceptive text, clickbait, and phishing websites, SVMs are practical tools that use decision planes to classify objects according to two categories: expected and non-expected [37,40,41]. One example of an SVM-based approach exploits content-based features to train classifiers that are then used to tag different categories (F1 = 0.93) [40]. The SVM algorithm in that study represented each data item as a vector, plotted it in a high-dimensional space, and constructed a hyperplane to separate classes. The hyperplane maximized the distance to the nearest clickbait and non-clickbait data points.

The RF and two naïve Bayesian (NB) systems are frequently applied to text classification problems due to their computational efficiency and implementation performance [42]. However, the lack of algorithm-specific parameters means that NB system users must have a thorough knowledge of the model being examined, which adds a considerable computational burden for optimization purposes [43]. The RF system relies on randomization governed by specific parameters, for instance the number of trees and the number of variables considered at each split. As long as the overall input size is sufficiently large, its performance is considered suitably robust to parameter changes. In a study designed to detect instances of phishing, the RF classifier had a 98.8% accuracy rate [41], and in a separate study aimed at detecting predatory biomedical journals, it produced an F1 score of 0.93 [26]. The RF system has also been used with decision trees as a method for preventing the indexing of papers published in predatory journals, since some individuals have become skilled at hijacking journal websites and collecting processing and publication fees from unwary authors [25].

Logistic regressions have been used to classify news headlines and content. In one study involving fake and true news stories in Bulgaria, a logistic regression approach achieved 0.75 accuracy for the most difficult dataset [44]. Logistic regressions assign weight factors to features in individual samples, with the predicted outcome derived from the sum of each sample feature value multiplied by its weight (the equation coefficient). Accordingly, classification problems are transformed into coefficient optimization problems.
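The weighted-sum idea can be illustrated with a minimal sketch. The weights, bias, and feature values below are hypothetical and chosen only for illustration; the logistic (sigmoid) function is the standard way such a sum is turned into a class probability, and in practice the coefficients would be learned from training data.

```python
import math

def predict_probability(features, weights, bias=0.0):
    """Return P(class = 1) as the sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for three binary word features (illustrative only).
weights = [1.2, -0.8, 0.5]
site = [1, 0, 1]  # the first and third feature words are present

p = predict_probability(site, weights, bias=-1.0)
label = "predatory" if p >= 0.5 else "legitimate"
print(round(p, 3), label)
```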

SGD has been successfully applied to large-scale and sparse machine learning problems frequently encountered in text classification and natural language processing. It can be used for either classification or regression purposes. In an Indonesian study, an SGD classifier with a modified Huber loss function was used to detect hoaxes on news websites and was reported as having an 86% accuracy rate [35].

KNN is an instance-based or lazy learning method, with local approximations and with all computations deferred until classification time [45]. Considered one of the simplest of all machine learning algorithms, KNN is sensitive to local data structures. This method can be used with a training set to classify journals by identifying the closest groups. Class labels are assigned according to the dominance of a particular class within a group. One study applied heuristic feature representations with the KNN method to classify predatory journals, and reported a 93% accuracy rate [46].
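As a rough illustration of how class labels follow from the dominant class among the nearest neighbors, the sketch below classifies a binary feature vector by majority vote. The training vectors, the query, the choice of k = 3, and the use of Hamming distance are all assumptions made for this example.

```python
from collections import Counter

def hamming(a, b):
    """Count the positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(query, training, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    nearest = sorted(training, key=lambda item: hamming(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical binary feature vectors with known labels.
training = [
    ([1, 1, 1, 0, 0], "predatory"),
    ([1, 1, 0, 0, 0], "predatory"),
    ([0, 1, 1, 0, 1], "legitimate"),
    ([0, 0, 0, 1, 1], "legitimate"),
    ([0, 0, 1, 1, 1], "legitimate"),
]
query = [1, 1, 1, 0, 1]
print(knn_predict(query, training, k=3))
```

No work is done until a query arrives, which is what makes the method "lazy": the training set is simply stored and searched at prediction time.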

Voting is one of the easiest ways to combine predictions from multiple machine learning algorithms. The method does not entail an actual classifier, but rather a set of wrappers trained and evaluated in parallel to take advantage of each algorithm's characteristics.
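A minimal sketch of the idea is given below, assuming hard (majority) voting, which is only one common variant; the text does not state which voting scheme was used. The per-classifier outputs are hypothetical.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of five individual classifiers for one website.
individual_predictions = {
    "svm": "predatory",
    "random_forest": "predatory",
    "logistic_regression": "legitimate",
    "sgd": "predatory",
    "knn": "legitimate",
}
combined = hard_vote(list(individual_predictions.values()))
print(combined)
```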

Classification involves two main objectives: analyzing factors that affect data classification, and assigning elements to pre-established classes via feature predictions [39]. When a classifier has sufficient data, a model can identify the features of expected categories and use them for further data category predictions. For text classification purposes, if word order relationships and grammar structures in a file are not considered, a common vectorization method is bag of words (BOW), which calculates weights associated with the numbers of word occurrences in a text. BOW has frequently been applied to tasks involving restaurant review classification, negative information retrieval, and spam mail filtering [28,37,47]. To make use of machine learning algorithms, individual documents must be transformed into vector representations. Assuming N documents with T words used across all of them, it is possible to convert all documents into a vector matrix. For example, assume a vector N3 = [15, 0, 1, …, 3] with word T1 appearing 15 times, word T3 one time, and word Tt 3 times in document 3. Although BOW is considered a simple method for document transformation, two problems must be resolved, the first being that the total number of words per individual document is not the same. If there are 10,000 total words in document 2 and 50 in document N, and word 3 appears ten times in document 2 but only two times in document N, clearly it will have much greater weight in document N. The other problem is that idiomatic expressions and frequently used words exert significant impacts on individual documents. For example, if a common word such as “the” appears many times in different documents but has the most appearances in one, it becomes a dominant but meaningless vector.
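The count-based representation and the document-length problem can be sketched as follows. The vocabulary and sample sentence are hypothetical; the two relative frequencies use the numbers from the example above (ten occurrences in 10,000 words versus two occurrences in 50 words).

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Return the number of occurrences of each vocabulary word in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def relative_frequency(count, doc_length):
    """Normalize a raw count by the total number of words in the document."""
    return count / doc_length

vocabulary = ["journal", "review", "the"]  # hypothetical vocabulary
doc = "the journal sends the paper to review and the review is fast".split()
vector = count_vector(doc, vocabulary)

# Raw counts favor the long document; relative frequency favors the short one.
weight_long = relative_frequency(10, 10_000)
weight_short = relative_frequency(2, 50)
print(vector, weight_long, weight_short)
```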

Term frequency-inverse document frequency (TF-IDF) is a statistical method commonly used in information retrieval and text-related scenarios to evaluate word importance in documents [43,49,50]. The TF-IDF algorithm separates feature words according to weight and reduces the number of zero-weight words. For the predatory journal website problem, finding better feature word weights can improve discrimination efficiency if words can be identified as appearing more frequently on predatory websites. A short list of feature words that have been identified as potentially meeting this requirement includes “international,” “American,” “British,” “European,” “universal,” and “global,” with some researchers suggesting that they are more likely to appear in predatory journal titles [21,34,51]. Other suspect words are associated with metrics: “quality impact factor,” “global impact factor,” and “scientific journal impact factor” are three examples. Other feature words refer to ideas expressed in an earlier section of this paper: promises of peer review processes and short review cycles ranging from a few days to less than four weeks.

Measuring the prediction performance of classification algorithms

Since early website pattern detection is central to identifying predatory journals, determining model accuracy is a critical task. Four performance metrics have generally been used to evaluate classifiers: accuracy (percentage of correct classification predictions), precision (proportion of positive identifications that are correct), recall (percentage of relevant documents successfully retrieved), and F1 score (harmonic mean of precision and recall, serving as a balanced index). For this study, we used recall and F1 scores as measures of classifier performance. F1 scores can be used to confirm recall and precision levels, with higher scores indicating fewer legitimate journal classification errors. Calculation methods for accuracy, precision, recall, and F1 scores are shown in Table 1.

Table 1 Definitions for the four performance metrics used for model evaluation.
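The four metrics can be computed from confusion-matrix counts as in the sketch below, which assumes the standard definitions (with F1 as the harmonic mean of precision and recall). The counts passed in are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: tp = predatory sites correctly flagged, fp = legitimate
# sites wrongly flagged, fn = predatory sites missed, tn = legitimate sites passed.
metrics = classification_metrics(tp=6, fp=2, fn=4, tn=8)
print(metrics)
```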

System design

Figure 3 presents the AJPC system architecture, built using Flask, a web application framework written in Python. AJPC extracts the content of a URL entered by a user, preprocesses the data, converts website content into word vectors, and applies a classification model for category prediction before sending results to its back end and displaying them. In brief, AJPC consists of three main modules: data collection, feature extraction, and model prediction. Data collection during natural language preprocessing focuses on URL content for feature extraction using the BOW method. During the model prediction stage, eight common classifiers are applied to model training, with the best model selected based on recall rate and F1 score.

Figure 3

Proposed academic journal predatory checking (AJPC) system architecture.

Data collection

A single predatory journal list was established using information collected from the updated Beall's list [19] and the Stop Predatory Journals list [52]. Journals appearing on these lists are screened in terms of credibility as established by the Committee on Publication Ethics, long-term observations, and anonymous community-based feedback [19,52]. Legitimate journal list data were collected from the Berlin Institute of Health (BIH) QUEST website [53], which uses data from the DOAJ and PubMed Central lists of journals. After manually checking all predatory and legitimate journal links to confirm active statuses, a web crawler was applied to create two lists. For this study AJPC identified 833 links to predatory journals and 1,213 to legitimate journals. In supervised machine learning, samples are usually divided into separate training and testing sets, with the first used to train the model and the second used to examine the performance of the model selected as the best.

Data collection preprocessing procedures commonly entail the removal of tags, stop words, and punctuation, and the transformation of words to stems and lower case text [54]. In addition to reducing feature space dimensionality, these procedures promote text classification system efficiency [54,55]. In the example shown as Fig. 4, unnecessary tags (HTML, CSS) and scripts are filtered out, and some of the most commonly used “stop words” are removed, for example “will” and “and” in the sentence, “Information Sciences will publish original, innovative, creative and refereed research articles.” “Publish,” “published” and “publishing” are examples of stem word variants; AJPC retains the stem word “publish” but removes the other two [56]. All text is converted to lower case to reduce the possibility of different treatment for words using mixed upper- and lower-case letters.

Figure 4

AJPC system preprocessing steps.

Feature extraction and data classification

The feature extraction module uses the BOW method, an efficient information retrieval tool for text data [19,57]. BOW converts text into numerical values and vectors that machine learning algorithms can process and use as input. As an example we will use two sentences:

“It was the best time for epidemic control,” (sentence 1)

“It was the time for economic recovery.” (sentence 2)

BOW records all occurrences of words in both sentences in a dictionary of the training corpus. The method looks up the dictionary when a sentence is converted to a vector. If a dictionary word appears in the sentence, the corresponding vector value is stored as 1; otherwise, it is stored as 0. For example, “time” is stored as 1 in each vector, whereas “best,” “epidemic,” and “control” do not appear in sentence 2 and are stored as 0 in its vector. In this example the two binary vectors are represented as [1, 1, 1, 1, 1, 1, 1, 1, 0, 0] and [1, 1, 1, 0, 1, 1, 0, 0, 1, 1]. These vectors are used to create two word sets, one associated with predatory journal websites and the other with legitimate websites. The TF-IDF method uses the sets to evaluate the degree of importance of individual words in a collection of documents. TF-IDF is believed to resolve two problems associated with the BOW algorithm: dealing with differences in total numbers of words in two or more articles, and recurring idiomatic words and expressions that exert significant influence in documents. As explained in an earlier example, if word \({w}_{2}\) appears 9 times in document \({D}_{2}\) and two times in document \({D}_{t}\), but \({D}_{2}\) has 10,000 words and \({D}_{t}\) only 50 words, \({w}_{2}\) is much more important to file \({D}_{t}\).
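The two binary vectors in this example can be reproduced with a short sketch. The helper names are illustrative, and the dictionary is assumed to list words in order of first appearance, which is the ordering that yields the vectors given in the text.

```python
import re

def tokenize(sentence):
    """Lower-case a sentence and split it into alphabetic words."""
    return re.findall(r"[a-z]+", sentence.lower())

def build_dictionary(sentences):
    """Collect every distinct word, in order of first appearance."""
    dictionary = []
    for sentence in sentences:
        for word in tokenize(sentence):
            if word not in dictionary:
                dictionary.append(word)
    return dictionary

def binary_vector(sentence, dictionary):
    """Store 1 for each dictionary word present in the sentence, else 0."""
    present = set(tokenize(sentence))
    return [1 if word in present else 0 for word in dictionary]

sentence_1 = "It was the best time for epidemic control,"
sentence_2 = "It was the time for economic recovery."

dictionary = build_dictionary([sentence_1, sentence_2])
vector_1 = binary_vector(sentence_1, dictionary)
vector_2 = binary_vector(sentence_2, dictionary)
print(dictionary)
print(vector_1)
print(vector_2)
```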

TF refers to the frequency of a given word, with \({tf}_{t,d}\) expressed as

$${tf}_{t,d}= \frac{{q}_{t,d}}{\sum_{k}{q}_{k,d}},$$

where \({q}_{t,d}\) denotes the number of times that word t appears in document \(d\) and \(\sum_{k}{q}_{k,d}\) denotes the total number of words in document \(d\). In other words, the TF method considers the importance of each word in terms of relative frequency rather than total number of appearances, with the most common words handled by IDF. \({idf}_{t}\) denotes a word importance measure, expressed as

$${idf}_{t}= \log\left(\frac{D}{{d}_{t}}\right),$$

where D is the total number of words and \({d}_{t}\) is the number of documents containing word t. \({d}_{t}\) is larger and \({idf}_{t}\) smaller for words appearing in many articles. The value of word t in document d is calculated using a combination of TF and IDF, expressed as

$${score}_{t,d}= {tf}_{t,d} \times {idf}_{t}.$$

The value of \({score}_{t,d}\) is higher when word t appears more frequently in document d (i.e., a larger \({tf}_{t,d}\)) and when it appears infrequently in other documents (i.e., a larger \({idf}_{t}\)). Thus, if a predatory journal website contains “this,” “journal,” “is” and “international” and a legitimate journal website contains “this,” “journal,” “has,” “peer review,” and “step”, then the two websites are said to contain a total of 9 words. On the predatory journal website (d = 1), the \({score}_{2,1}\) assigned to the word “journal” is \(1/4 \times \mathrm{log}(9/1)\), and on the legitimate journal website (d = 2) the \({score}_{2,2}\) assigned to the same word is \(1/5 \times \mathrm{log}(9/1)\).
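The calculation in this example can be traced with the following sketch. The function names are illustrative; D and the document count for the word are passed in explicitly and set to the values used in the worked example (9 and 1), and the natural logarithm is assumed since the text does not specify a base.

```python
import math

def term_frequency(word, doc_words):
    """Occurrences of `word` divided by the total number of words in the document."""
    return doc_words.count(word) / len(doc_words)

def inverse_document_frequency(total, containing):
    """log(D / d_t), with both quantities supplied by the caller."""
    return math.log(total / containing)

def tfidf_score(word, doc_words, total, containing):
    """Product of term frequency and inverse document frequency."""
    return term_frequency(word, doc_words) * inverse_document_frequency(total, containing)

predatory = ["this", "journal", "is", "international"]
legitimate = ["this", "journal", "has", "peer review", "step"]

# The values 9 and 1 are taken from the worked example in the text.
score_predatory = tfidf_score("journal", predatory, total=9, containing=1)
score_legitimate = tfidf_score("journal", legitimate, total=9, containing=1)
print(score_predatory, score_legitimate)
```

Because the same word makes up a larger share of the shorter page, its score is higher on the four-word predatory site than on the five-word legitimate one.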

After building predatory and legitimate journal website datasets for TF-IDF score calculations, diff scores were used to identify feature words. A diff score representing the difference in appearances of word t in documents 1 (predatory) and 2 (legitimate) is calculated as

$${diff}_{t}= {score}_{t,1}- {score}_{t,2}.$$

Using the above example, \({diff}_{2}= 1/4 \times \mathrm{log}(9/1)-1/5 \times \mathrm{log}(9/1)\).

In this case, a larger diff value indicates that word t appears more often on predatory than on legitimate journal websites, and therefore it may have greater utility for identifying the predatory or legitimate status of a website. The rankings of individual words based on their diff scores were used to create a feature word set consisting of n words. Table 2 lists the 20 feature words that appeared most frequently on the predatory journal websites used in this study.

Table 2 Top 20 feature words identified by the proposed academic journal predatory checking (AJPC) system.
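Ranking words by diff score to obtain a feature word set of size n could look like the sketch below. The per-word scores are hypothetical placeholders rather than values from this study, and breaking ties alphabetically is an assumption made to keep the ordering deterministic.

```python
def rank_by_diff(predatory_scores, legitimate_scores, n):
    """Return the n words with the largest predatory-minus-legitimate score."""
    words = set(predatory_scores) | set(legitimate_scores)
    diffs = {
        word: predatory_scores.get(word, 0.0) - legitimate_scores.get(word, 0.0)
        for word in words
    }
    # Sort by descending diff; ties are broken alphabetically (an assumption).
    return sorted(diffs, key=lambda word: (-diffs[word], word))[:n]

# Hypothetical TF-IDF scores for four words on each class of website.
predatory_scores = {"international": 0.40, "journal": 0.30, "issue": 0.25, "review": 0.05}
legitimate_scores = {"international": 0.05, "journal": 0.25, "issue": 0.10, "review": 0.30}

feature_words = rank_by_diff(predatory_scores, legitimate_scores, n=2)
print(feature_words)
```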

The text content of all 833 predatory and 1,213 legitimate journal websites was converted into vectors. Specifically, a 1 × n vector was constructed for each website, with element t set to 1 when the t-th of the top n feature words appeared in the text of journal ji, and to 0 when it did not. For example, if the top five feature words were identified as “journal,” “issue,” “international,” “volume” and “paper,” and the journal ji text content includes “journal,” “research,” “international,” “information” and “paper,” the resulting ji word vector used for model training and prediction was [1, 0, 1, 0, 1]. The primary goal of classification is to determine categories or classes for new data. Classification can be performed with either structured or unstructured data. Each classifier requires parameter optimization to achieve the most accurate results. Following data collection and feature extraction, 80% of the journals in our sample (666 predatory, 970 legitimate) were randomly selected for use as a training set; the remaining 20% (167 predatory, 243 legitimate) was used as a testing set. Model training also utilized the top 50–9,000 feature words.
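The vector construction and the 80/20 split can be sketched as follows. Splitting each class separately and truncating to whole journals is an assumption; it is shown because it reproduces the reported counts (666/167 and 970/243). The integer ranges stand in for journal records, and the seed is arbitrary.

```python
import random

def feature_vector(site_words, feature_words):
    """Build a 1 x n binary vector: 1 if the feature word occurs on the site."""
    present = set(site_words)
    return [1 if word in present else 0 for word in feature_words]

def split_80_20(items, seed=0):
    """Shuffle a copy of `items` and return (training 80%, testing 20%)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

top_words = ["journal", "issue", "international", "volume", "paper"]
site_text = ["journal", "research", "international", "information", "paper"]
vector = feature_vector(site_text, top_words)

# Placeholder identifiers for the 833 predatory and 1,213 legitimate websites.
train_pred, test_pred = split_80_20(range(833))
train_legit, test_legit = split_80_20(range(1213))
print(vector, len(train_pred), len(test_pred), len(train_legit), len(test_legit))
```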
