Journal of Applied Statistics. 2021 Apr 27;50(3):574–591. doi: 10.1080/02664763.2021.1919063

Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling

Anton Thielmann, Christoph Weisser, Astrid Krenz, Benjamin Säfken
PMCID: PMC9930816  PMID: 36819086

Abstract

Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually by humans, which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach also either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. Unsupervised one-class document classification with the integration of out-of-domain training data is achieved, and more than 80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets.

Keywords: Unsupervised document classification, out-of-domain training data, one-class SVM, LDA topic model, web scraping, machine learning

1. Introduction

The classification of large, unlabelled and imbalanced text corpora can be a time-consuming and cost-intensive task. Manual labelling often depends on the availability of experts and their individual assessment of the data. In addition, obtaining adequate representations of underrepresented categories requires the labelling of a very large amount of data. The general approaches to document classification are manifold and often achieve remarkable results. Supervised document classification, using either classical machine learning techniques (e.g. Moraes et al. [27], Ting et al. [41]) or neural networks, achieves accuracies of up to 90% in some cases [2,51]. Unfortunately, these algorithms require large amounts of labelled training data, which often need to be balanced for optimal classification results. Unsupervised document classification, which needs no labelled training data, already achieves sensible results for balanced data sets. Commonly used methods are topic models such as Probabilistic Latent Semantic Analysis (pLSA) [12] and Latent Dirichlet Allocation (LDA) [6], agglomerative hierarchical clustering [19] and K-means [18]. These methods are prone to failure, however, for large and imbalanced data sets with heavily underrepresented categories. Yet real-world problems often come with heavily imbalanced data sets and thus require the classification of underrepresented classes [14,15,22]. The integration of labelled out-of-domain training data into document classification therefore offers a great opportunity, given the large availability of free text data on the Internet. It has been successfully implemented, for example, by Dai et al. [11].

The challenge in our case is to identify documents of an extremely underrepresented class (<1%). To use supervised document classification algorithms, we generate labelled out-of-domain training data via web scraping. However, we use web scraping to generate only the positive target class and subsequently apply supervised one-class document classification [24,25]. An alternative approach would be to also scrape a negative training class so that powerful binary classifiers could be used. Depending on the imbalance of the classes in the target data, common practices to accurately classify underrepresented classes are advanced over- and under-sampling techniques [4,21–23,49]. However, scraping a representative negative class is generally not feasible in the unsupervised case, given that we do not know the other classes in the data set. For extremely unbalanced data, as in our case, one-class classifiers can show better results than oversampling and undersampling techniques and avoid the problem of having to generate a negative class [10,15,29]. Further, supervised one-class document classification is remarkably accurate [24]. The one-class SVM [36] is still considered a benchmark one-class classifier [53]. A one-class Naive Bayes classifier has recently been proposed by Zhang and Jatowt [53], and [25] have developed a neural network based one-class classifier.

In this paper, we use the established one-class SVM [24], since the focus of this research lies mainly on the combination of a one-class classifier with LDA topic models [6] to propose a multi-step classification rule. The adaptation of supervised one-class document classification approaches to the unsupervised case, with the integration of out-of-domain training data, has already shown promising results for the classification of patents in the realm of artificial intelligence [40]. However, Thielmann et al. [40] use the last step of the presented classification rule only as a means to identify sub-categories. The second step of the classification rule presented here, in which LDA topic modelling is used to identify false positive classified data, is thus a novelty. Additionally, Thielmann et al. [40] lack any performance evaluation because they work with unlabelled data.

Our proposed method tackles the problems of unsupervised one-class document classification in heavily imbalanced data sets by combining the statistical LDA model with a machine learning classifier. This combination of a Bayesian statistical model with advanced machine learning techniques, together with web scraping to generate out-of-domain training data, provides a novel classification rule that can be applied easily and effectively while eliminating cost-intensive and time-consuming manual labelling. Web scraping makes it possible to generate large amounts of out-of-domain training data, which can increase classification performance.

Different data sets are used to validate the generalizability of the approach. We find that out-of-domain training data, in the form of publicly available abstracts of scientific papers, can also be used for accurate classification of strongly unrelated text corpora, such as newspaper articles. The advantages of the proposed method even increase with the size of the target data set and the degree of unbalancedness.

The remainder of the paper is structured as follows: First, the data used in the analyses and the generation of out-of-domain training data with web scraping are described. Second, the integration of one-class SVM and LDA topic modelling is introduced. Third, the results arising from analyses of the different data sources are presented, and the performance of our proposed classification rule is compared to that of common statistical and machine learning methods. Finally, a conclusion and possible further developments of the method are given in the last section.

2. Methodology

2.1. Data

The text data used to validate the suggested classification rule are two data sets commonly used in natural language processing (NLP). The first (data set (1)) is a medical transcription data set downloaded from Kaggle [1,8]. It comprises roughly 2350 observations and includes the diagnosis as well as several other variables. For our analysis, we extract the medical transcription, a detailed description of the patients' symptoms, as well as the labelling variable, the so-called medical specialty. It includes 40 different labels relating to medical fields such as surgery, psychotherapy or dentistry. The complete data set is heavily imbalanced: the five most strongly represented classes account for more than 50% of the data set. The initial data set includes multiple labelling, meaning that many observations are contained multiple times in the data set under different labels. Such labelling makes the interpretation of accuracy, precision and other performance scores difficult, and for that reason we decided to work without the duplicates. This makes the classification problem much easier, which is why a second data set is used to validate the obtained results. The second data set (data set (2)) is the Reuters-21578 Distribution 1.0 ApteMod data set, which can be downloaded directly from NLTK [5]. This data set is often used in NLP studies (e.g. Yang and Liu [50], Joachims [16]). It consists of around 11,000 newspaper articles on financial topics. This data set is also heavily imbalanced: about 35% of the data are related to the most prominent topic. We use the ApteMod corpus, wherein each document belongs to one or more categories. In total, there are 90 categories such as earn, acquisition or grain.

For our analysis, we select a strongly underrepresented category (label) from both data sets. In data set (1), the category is dentistry, with 27 occurrences in total. This corresponds to roughly 1% of the observations in the data set. In data set (2), a similarly underrepresented category is selected, namely cotton, which has 61 occurrences overall (0.6%). For this data set, which includes multiple labelling, it has to be noted that the label cotton occurs on its own 24 times, and 37 times in combination with other labels.

2.2. Web scraping

One of the main contributions of our proposed classification rule is the integration of independent, out-of-domain training data. Since we have no labelled training data to train a classifier, we make use of simple web scraping with Python [26,43]. Similar to co-clustering approaches in text classification which use out-of-domain text data [11,47], we use external but labelled data. For both data sets outlined in the previous section, we scrape publicly available abstracts of scientific papers. Firstly, these abstracts are freely accessible on the Internet; secondly, the scientific papers are tagged with author keywords. These keywords can implicitly be seen as the result of manual labelling: they are added by those who have studied the scientific texts most thoroughly, namely the authors of the scientific papers themselves. In order to scrape the out-of-domain training data, we define the target class by at least one term that is as precise as possible and defines the class in the best possible way. Subsequently, we scrape all documents that are tagged with at least one of the chosen author keywords, as sketched below.
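A minimal sketch of this keyword-based collection step is given below. The search URL and HTML selectors are placeholders rather than the actual IEEE Xplore markup (IEEE Xplore in practice offers a metadata API), so the snippet only illustrates the logic of gathering abstracts whose author keywords match the target terms.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder endpoint and CSS selectors -- real sites such as IEEE Xplore
# expose their own search API and markup, so adapt these before use.
SEARCH_URL = "https://example.org/search"
KEYWORDS = {"dentistry", "teeth", "tooth"}

def scrape_abstracts(keywords: set[str]) -> list[str]:
    """Collect abstracts of papers tagged with at least one target author keyword."""
    abstracts = []
    for kw in keywords:
        response = requests.get(SEARCH_URL, params={"queryText": kw}, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        for result in soup.select("div.result"):  # placeholder selector
            tags = {t.get_text(strip=True).lower()
                    for t in result.select("span.author-keyword")}
            if tags & keywords:  # at least one chosen keyword matches
                abstracts.append(result.select_one("p.abstract").get_text(strip=True))
    return abstracts

training_docs = scrape_abstracts(KEYWORDS)
```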

For the presented data sets, all papers are scraped from the website ‘IEEE Xplore’ [48]. For the medical data set, we choose three different keywords and scrape abstracts of scientific papers that are tagged with at least one of the author keywords dentistry, teeth or tooth.1 For the Reuters data set, we choose only a single keyword, namely cotton. We select only one author keyword to show that the described approach also works with very little training data.

With the chosen approach of web scraping out-of-domain training data, we obtain 597 abstracts of scientific papers on the broad topic of dentistry and 182 scientific papers on the topic of cotton. Both scraped data sets are visualized in Figure 1, which shows wordcloud visualizations of two randomly selected topics from the output of an LDA topic model (see Section 2.5.2).

Figure 1. Wordclouds for the web scraped data sets.

2.3. Procedure

The suggested method follows a procedure which incorporates web scraping [26], one-class SVMs introduced by Schölkopf et al. [36] and LDA topic modelling [6]. The combination of these three methods allows for a classification rule that is able to incorporate external, out-of-domain training data. The procedure successfully classifies unlabelled text documents and circumvents the cost- and time-intensive task of manual labelling.

The single steps of the classification rule are described in more detail in the following. An overview of the procedure is visualized in Figure 2. The general idea and what allows us to use out-of-domain training data is to train the one-class SVM classifier in such a way that it deliberately overgeneralizes. This will also lead to false positive classified data, but ensures that the classifier captures most of the relevant data. Further, overfitting problems that might be induced by the out-of-domain training data are mitigated. Subsequently, the classified data is analysed using LDA topic modelling, selecting and filtering out data incorrectly classified as positive. This last step can be repeated, depending on the perceived quality of the prediction, to further reduce the amount of data incorrectly classified as positive.

Figure 2. Classification procedure.

2.4. Input data

In order to use the text data in Support Vector Machines and LDA topic modelling, it is preprocessed and features are extracted. We apply common text preprocessing techniques [42,45] and remove all stop words [39] using NLTK's built-in stop word dictionary [5]. Further, all numbers are removed and all words are lowercased, tokenized and lemmatized.
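A minimal sketch of such a preprocessing pipeline with NLTK is shown below; the exact tokenizer and lemmatizer settings are not specified in the paper, so the details here are assumptions.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources.
for resource in ("stopwords", "punkt", "wordnet"):
    nltk.download(resource)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(document: str) -> list[str]:
    """Lowercase, tokenize, drop stop words and numbers, and lemmatize."""
    tokens = word_tokenize(document.lower())
    return [lemmatizer.lemmatize(tok) for tok in tokens
            if tok.isalpha() and tok not in STOP_WORDS]
```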

As document representations we choose the term frequency-inverse document frequency (tf-idf) representation [35], formally denoted by:

$$\text{tf-idf}(\text{word}) = \text{frequency}(\text{word}) \cdot \left[\log\frac{k}{K(\text{word})} + 1\right], \qquad (1)$$

where k is the total number of words in the dictionary and K(word) is the total number of documents wherein the word appears. The representations are computed with Scikit-learn's [28] built-in tf-idf vectorizer. We use the out-of-domain training data and the data to be classified as a joint corpus. While Manevitz and Yousef [24] found binary representations to be more accurate, the use of out-of-domain data makes tf-idf representations more suitable.

We find no clearly discernible connection between the number of extracted features and classification performance (see supplemental material A.3). However, for computational efficiency we set the maximum number of features to 8000 for the cotton classification. For the classification of dentistry, on the other hand, due to the comparably small size of the data set, we simply select all features (385,192 features). The training data thus consists of the tf-idf feature matrix representing only the out-of-domain training data; the features, however, are calculated on the joint corpus of the target data and the out-of-domain training data.
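A sketch of this feature extraction step with scikit-learn follows; training_docs and target_docs are assumed to hold the scraped abstracts and the unlabelled target documents as raw strings, and preprocess is the tokenizer sketched above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit vocabulary and idf weights on the joint corpus of out-of-domain
# abstracts and target documents; max_features=8000 mirrors the setting
# used for the cotton classification.
vectorizer = TfidfVectorizer(tokenizer=preprocess, max_features=8000)
vectorizer.fit(training_docs + target_docs)  # joint corpus

# The one-class classifier is trained on the out-of-domain rows only.
X_train = vectorizer.transform(training_docs)
X_target = vectorizer.transform(target_docs)
```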

2.5. Classification

2.5.1. One-class support vector machine

The one-class SVM as introduced by Schölkopf et al. [36] closely resembles classical SVMs, which were first introduced by Vapnik [7,44]. The major difference between the two classifiers is that the one-class SVM uses only a single class. The method was originally used for unsupervised outlier and anomaly detection in a single-class data set [3,20,46,52]. The goal is to obtain a decision function that is positive, taking the value +1, in a small region $S$ capturing most of the training data, and negative on the complement $\bar{S}$. Outliers, novelties or anomalies are thus detected by the value −1:

$$f(x) = \begin{cases} +1 & \text{if } x \in S \\ -1 & \text{if } x \in \bar{S} \end{cases} \qquad (2)$$

Suppose we are given a training data set $x_1, \ldots, x_N \in X$ belonging to a single class, with $N$ the number of observations. Then, after mapping the so-called positive training sample [37], which in the current example consists of the scraped out-of-domain scientific papers, into a higher-dimensional feature space $F$ via $\Phi: X \to F$, the data is separated from the origin with maximum margin. The optimization problem, closely resembling the optimization problem in classical Support Vector Machines, becomes:

$$\min_{\omega \in F,\ \xi \in \mathbb{R}_+^N,\ \rho \in \mathbb{R}} \ \frac{1}{2}\|\omega\|^2 + \frac{1}{\nu N}\sum_{i=1}^{N}\xi_i - \rho \quad \text{subject to: } (\omega \cdot \Phi(x_i)) \geq \rho - \xi_i,\ i = 1,\ldots,N,\ \xi_i \geq 0, \qquad (3)$$

where $\omega$ is the normal vector of the hyperplane, $\rho$ is an offset parameterizing the hyperplane in the feature space and representing the threshold of the decision function, the $\xi_i$ are nonnegative slack variables, and $\Phi$ is the described feature map $X \to F$, such that the inner product of $\Phi$ can be computed by evaluating a simple kernel function

$$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle, \qquad (4)$$

with $i, j \in \{1, \ldots, N\}$, which is introduced to account for non-linearly separable cases. Following Manevitz and Yousef [24], we choose the radial basis function as the kernel:

$$k(x_i, x_j) = \exp\left[-\gamma \|x_i - x_j\|^2\right], \qquad (5)$$

with $\gamma$ representing the spread scale of the kernel, which can be adjusted by the user (see supplemental material A.1 on how the radial basis function can be expressed in terms of the inner product of $\Phi(x)$).

The crucial parameter in Equation (3) is the user-specified regularization parameter $\nu$. The trade-off between the decision function $f(x)$ being positive for most of the training data $x_1, \ldots, x_N \in X$ and $\|\omega\|$ being small is controlled by $\nu$, giving Schölkopf's one-class SVM its name, $\nu$-SVM. The regularization parameter $\nu \in (0,1]$ firstly functions as an upper bound on the fraction of outliers [36]. Secondly, it controls the lower bound on the fraction of support vectors in relation to the total number of training observations.

If ω and ρ solve the optimization problem in (3), we obtain a decision function

$$f(x) = \operatorname{sgn}\big((\omega \cdot \Phi(x)) - \rho\big) \qquad (6)$$

that adequately balances the described trade-off. Minimizing the objective function, using two types of positive Lagrange multipliers $\lambda_i$ and $\mu_i$, $i = 1, \ldots, N$, leads to the following Lagrangian:

$$L(\omega, \xi, \rho, \lambda, \mu) = \frac{1}{2}\|\omega\|^2 + \frac{1}{\nu N}\sum_{i=1}^{N}\xi_i - \rho - \sum_{i=1}^{N}\lambda_i\big((\omega \cdot \Phi(x_i)) - \rho + \xi_i\big) - \sum_{i=1}^{N}\mu_i\xi_i. \qquad (7)$$

Setting the derivatives with respect to the relevant variables $\omega$, $\xi$ and $\rho$ equal to zero (see supplemental material A.2) and substituting them into the Lagrangian in Equation (7) – closely following Schölkopf's notation [36] – results in the following dual problem:

$$\min_{\lambda}\ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j k(x_i, x_j) \quad \text{subject to: } 0 \leq \lambda_i \leq \frac{1}{\nu N},\ \sum_{i=1}^{N}\lambda_i = 1,\ i = 1, \ldots, N. \qquad (8)$$

When solving for the optimal value of $\lambda = (\lambda_1, \ldots, \lambda_N)$, the minimum of the objective function (3) is obtained. As the minimization problem in Equation (8) represents a simple quadratic form, the solution can be obtained by using quadratic programming optimization, for example. The optimal value of $\rho$ is thus obtained by the formula:

$$\rho = \sum_{j=1}^{N}\lambda_j k(x_j, x_i), \qquad (9)$$

which exploits that for any λi fulfilling the constraints in Equation (8), the support vectors xi satisfy

$$(\omega \cdot \Phi(x_i)) = \sum_{j=1}^{N}\lambda_j k(x_j, x_i) \quad \text{(see supplemental material A.2)}, \qquad (10)$$

which follows from the first order conditions from the Lagrangian (7, supplemental material A.2). The finally obtained decision function (6) is then used to classify the relevant data.

In experimental results, Schölkopf et al. [36] showed that the interpretation of the regularization parameter $\nu$ can actually be straightforward: setting $\nu$ to 0.9, for example, results in a fraction of roughly 90% outliers. The definition of $\nu$ thus strongly depends on the researcher's belief about the prevalence of outliers in the data set, and setting the correct $\nu$ is important for the quality of the classifier. In the unsupervised case, however, we cannot optimize $\nu$ according to accuracy or other performance scores. Therefore, we define a general decision rule that helps find the right value of the regularization parameter $\nu$. We find that iterating over values of $\nu$ which allow for at least 85% positively classified out-of-domain training data, as well as another user-specified threshold for positively classified in-domain data, leads to the best results. Depending on the quality of the scraped out-of-domain training data, the threshold of 85% needs to be adjusted, i.e. lowered or even increased. Aiming for a positive classification of, say, 95% of the out-of-domain training data when this training data incorporates many documents not capturing the target category will lead to many predictions that do not capture the target data. Note that the user-specified threshold incorporates the user's knowledge of the prevalence of the relevant category in the target data set. When dealing with strongly imbalanced data sets that cannot be adequately classified with k-means or topic modelling from the beginning, setting that threshold no higher than 5% of the total amount of the target data proved reasonable. Selecting a very large regularization parameter $\nu$, however, is not advisable because it increases the risk of defining a too narrow decision boundary and hence missing relevant data.

Setting the parameter of the chosen rbf kernel function, $\gamma$, to Scikit-learn's [28] pre-specified value auto, which corresponds to 1 divided by the number of features, restricts the number of possible classifiers and minimizes the computational effort. Iterating over possible values of $\gamma$ could, however, also be implemented.
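A compact sketch of this selection rule with scikit-learn's OneClassSVM is given below. X_train and X_target are the tf-idf matrices from Section 2.4, and the 0.8–4% band for positively classified target data anticipates the threshold used for the medical data set in Section 3; both are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import OneClassSVM

candidates = []
for nu in np.arange(0.01, 0.9, 0.001):
    clf = OneClassSVM(kernel="rbf", gamma="auto", nu=nu).fit(X_train)
    # Fraction of out-of-domain training documents classified as positive (+1).
    train_pos = np.mean(clf.predict(X_train) == 1)
    # Fraction of unlabelled target documents classified as positive.
    target_pos = np.mean(clf.predict(X_target) == 1)
    # Decision rule: recover at least 85% of the training data while flagging
    # only a small, user-specified share of the target data.
    if train_pos >= 0.85 and 0.008 <= target_pos <= 0.04:
        candidates.append((nu, clf))
```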

2.5.2. Latent Dirichlet allocation topic modelling

The documents classified as positive by the one-class SVM are now processed in the second step of the classification rule, the LDA topic modelling [6]. Similar to the one-class SVM, LDA topic modelling is an unsupervised machine learning technique. Topics are characterized by a distribution over words, independent of the positional occurrence of the words, as defined by Blei et al. [6]. Hence, documents are represented by a random mixture over these latent topics. To find the topics associated with the remaining documents, we are formally looking for:

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta), \qquad (11)$$

with the corpus $D = \{\mathbf{w}_1, \ldots, \mathbf{w}_M\}$ consisting of $M$ documents, each denoted as a sequence of words $\mathbf{w} = (w_1, \ldots, w_N)$. The words themselves are represented by vectors indexed on a predetermined, fixed vocabulary $\{1, \ldots, V\}$. The word probability matrix $\beta \in \mathbb{R}^{K \times V}$ parameterizes the word probabilities, where $K$ is the user-defined number of topics, $k = 1, \ldots, K$. The document topic variables associated with the corresponding words are denoted by $z_n$, $n = 1, \ldots, N$, which, in other words, give the words' topic assignments based on the document-specific distribution over the topics. The $K$-dimensional topic mixture distribution is given by $\theta \sim \text{Dir}(\alpha)$, with $\alpha$ representing the parameters of the Dirichlet distribution. Thus, the probability of document $m$ containing topic $k$ is represented by $\theta_{m,k}$. The goal is to obtain the parameters $\beta$, $\alpha$ and, most importantly, $\theta$ in order to filter out data that does not cover the relevant topic.

As we use the LDA algorithm implemented in gensim [30], the applied inference approach is the online algorithm by Hoffman et al. [13]. This variational inference approach assumes that the posterior distribution can be approximated by other, tractable distributions. The aim is to find the distribution closest to the real posterior distribution as measured by Kullback-Leibler (KL) divergence. Although not a distance metric in the mathematical sense, KL divergence is widely used to measure the distance between distributions and hence is implemented in the gensim package. The parameters of the variational posterior are approximated using the Expectation Maximization algorithm.
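A minimal sketch of this step with gensim follows; positive_docs is assumed to hold the token lists of the documents the one-class SVM classified as positive, and the hyperparameters simply mirror the five-topic setting used in the applications.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(positive_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in positive_docs]

# gensim's LdaModel implements the online variational inference of
# Hoffman et al. [13]; five topics match the applications in Section 3.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=5, random_state=42, passes=10)

# Per-document topic mixtures (theta), later used for filtering.
theta = [lda.get_document_topics(bow, minimum_probability=0.0)
         for bow in bow_corpus]
```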

2.5.3. Integration of one-class SVM and LDA topic modelling

The integration of the topic model's output and the one-class SVM follows a simple logic. The positively classified documents obtained from the one-class SVM classification can be expected to include a large number of false positives. First, false positives can result from the use of out-of-domain training data; second, they might be due to the setting of the regularization parameter $\nu$, chosen such that we do not miss too much of the relevant target data. Depending on the amount of data classified by the one-class SVM, the user needs to define the number of topics for the LDA topic modelling. The obtained topics of the classified data are analysed, for example, with the help of visual representations [9,38]. Depending on the topic model's output, either relevant topics that cover the target topic or topics that are clearly unrelated to the target topic are identified. Subsequently, the user defines a threshold by which documents relating to the bad or good topics are selected. Two possible strategies are introduced in Section 3. Depending on the selected strategy, the relevant documents are extracted according to the prevalence of the chosen topics in each document, i.e. the documents are filtered by their respective $\theta$ values, as sketched below.
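As an illustration, the first strategy of Section 3 (keep documents in which an identified target topic has a prevalence of at least 40%) could be implemented on the theta mixtures from the previous sketch as follows; the topic ids in good_topics are hypothetical and would be chosen after inspecting the wordclouds.

```python
# Topic ids judged to cover the target topic after manual inspection
# of the wordclouds (hypothetical example values).
good_topics = {0, 3}
THRESHOLD = 0.40  # minimum prevalence of a target topic per document

selected = [doc_id
            for doc_id, topic_mix in enumerate(theta)
            if any(topic in good_topics and prob >= THRESHOLD
                   for topic, prob in topic_mix)]
```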

2.6. Performance measures

In order to evaluate the performance of the proposed method in the examples in Section 3, different performance measures are used. This section provides a brief summary of these measures. The performance measures are based on the number of documents that are correctly classified as positive (i.e. belonging to the underrepresented category), called true positives tp, the number of documents that are falsely classified as positive, called false positives fp, the number of documents that are correctly classified as negative (i.e. belonging to the overrepresented category), called true negatives tn and, the number of documents that are falsely classified as negative, called false negatives fn.

The first performance measure of interest is the ratio of documents correctly classified as positive out of all documents classified as positive. This is called the

$$\text{Precision} = \frac{tp}{tp + fp}.$$

Sometimes the precision is called positive predictive value. Only relying on the precision is not adequate as the maximum precision of 1 could be attained by only classifying one positive document as positive. Therefore, a second measure is often introduced, namely the ratio of documents correctly classified as positive to all documents that should have been classified as positive. This is called the

$$\text{Recall} = \frac{tp}{tp + fn}.$$

It is often useful to combine the recall and the precision. This can be done by the harmonic mean of both and is called the

$$\text{F1-score} = \frac{2}{\text{Precision}^{-1} + \text{Recall}^{-1}}.$$

Sometimes the recall is called sensitivity or true positive rate. The proportion of correctly classified documents among all documents is defined as the

$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn},$$

although this measure is not recommended for imbalanced data sets. Another common performance measure is based on the receiver operating characteristic (ROC). The so-called ROC curve plots the recall (or sensitivity) against the false positive rate (i.e. 1-specificity) for different threshold values T. Thus the ROC curve is defined by the pairs

$$\left[\frac{tp(T)}{tp(T) + fn(T)},\ \frac{fp(T)}{fp(T) + tn(T)}\right],$$

for different values of the threshold T. A performance measure can be derived from the ROC curve by calculating the area under the curve. This would be close to 1 for a near perfect classifier and close to 1/2 for a classifier based solely on randomness.
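These measures translate directly into a few lines of Python; the small helper below (an illustration, with the tn count an assumed stand-in for the remaining documents of the roughly 2350-document corpus) reproduces, for example, row (2) of Table 1 from its confusion counts.

```python
def classification_scores(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall, F1 and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Row (2) of Table 1: 22 true positives, 13 false positives, 5 false negatives.
print(classification_scores(tp=22, fp=13, tn=2310, fn=5))
```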

3. Application

For both data sets (which were described in Section 2.1), we train a one-class SVM on the out-of-domain training data. We iterate over regularization parameters $\nu$ in the range from 0.01 to 0.9 in steps of 0.001 and look at all classifiers that accurately predict at least 85% of the out-of-domain training data, as the training data contains virtually no outliers (see Figure 1 in Section 2.2).

In order to further reduce the number of possible classifiers, we set additional constraints on the one-class SVM that depend on the analysed data set. The classification results and a comparison to other classifiers are given in the following.

3.1. Medical transcriptions

We use the abstracts of the scraped scientific papers, tagged with at least one of the three author keywords, dentistry, teeth or tooth and train the one-class SVM on the out-of-domain data. For the medical transcriptions data set, we set a second threshold at values of 0.8–4%. This implies that we only consider classifiers in the further analysis that classify at least 85% of the training data and between 0.8% and 4% of the target data as positive.

Only two possible classifiers fulfil the given constraints. Note that for $\gamma$ in the rbf kernel function, we use Scikit-learn's auto value, thereby reducing the number of possible classifiers. The selected classifiers use $\nu$ values of 0.166 and 0.3, respectively, and predict 576/575 observations of the training data and 79/65 documents of the target data as positive. Based on an analysis of both possible classifiers using LDA topic models, we select the second classifier, with $\nu = 0.3$, predicting 65 medical transcriptions. We find that the topic model's output shows fewer topics not capturing the target topic for this value of the regularization parameter $\nu$. We do not tune any hyperparameters or optimize the topic models with respect to coherence scores in order to keep the steps as simple as possible. A topic model with 5 topics estimated on the 65 medical transcription documents reveals 2 topics that clearly depict the target topic dentistry (see Figure 3) and 3 topics clearly not depicting the target topic (2 of them are shown in Figure 4).

Figure 3. Wordclouds for visualization of data covering the target topic.

Figure 4. Wordclouds for visualization of data not covering the target topic.

The first step of the classification correctly classifies 22 out of 27 (81%) dentistry medical transcriptions. In total, we thus have 43 false positive predictions and the precision is 33.8% (see Table 1). The goal of the following steps is to increase the precision by filtering out false positives. We select a strict prevalence threshold of 40% for the identified good topics: only those documents are further analysed in which one of the target topics has a prevalence of at least 40%. Doing so, we are able to reduce the number of falsely positively classified documents to 13, increasing the precision to 62.9%. If, for example, we instead select only those documents where one of the two identified topics is the most dominantly prevalent topic, we increase the precision even further, to 64.5%. However, such a more restrictive selection would increase the number of missed dentistry medical transcriptions to 7 (compared to previously 5). An overview of the results is given in Table 1. We are able to correctly classify 81.5% of the dentistry medical transcriptions with a precision of 62.9%.

Table 1.

Performance scores for the dentistry prediction.

Method Precis. Recall F1 Acc. ROC Correct False
(1) O-SVM 0.338 0.815 0.478 0.980 0.898 22 43
(2) + Threshold 0.629 0.815 0.710 0.992 0.905 22 13
(3) + Dominant 0.645 0.741 0.690 0.992 0.868 20 13

Notes: Performance scores given for: (1) The initial one-class SVM classifier (trained on out-of-domain training data), (2) the one-class SVM classifier extended with LDA topic modelling and documents selected on a prevalence threshold for identified topics, (3) the one-class SVM classifier extended with LDA topic modelling and documents selected whose dominant topic is one of the identified target topics.

3.1.1. Comparison to other classifiers

In order to compare the proposed method with other algorithms, we classify the medical transcriptions with a Naive Bayes (NB) classifier, a classical SVM and a simple Logistic Regression (Log-Reg). For further comparison, we use two tf-idf representations of the documents: first, we reduce the number of features in the tf-idf transformation to 8000; second, we use the already described preprocessed data with no maximum feature restriction. The results can be seen in Tables 2 and 3. For all three methods, we use one third of the target data as training data (9 documents) and 90 randomly selected documents representing the negative class.

Table 2.

Performance scores of Naive Bayes, SVM and logistic regression on the medical transcription data set.

Algorithm Precision Recall F1-score Accuracy ROC Correct False
NB 0.989 0.994 0.989 0.5 0 0
SVM 0.989 0.994 0.989 0.5 0 0
Log-Reg 1 0.997 0.993 0.995 0.7 11 0

Notes: Performance scores for: A Naive Bayes-, a SVM- and a Logistic Regression Classifier, trained on 9 (randomly sampled) dentistry medical transcriptions and 90 non-dentistry medical transcriptions, using tf-idf max-feature representations.

Table 3.

Performance scores of Naive Bayes, SVM and logistic regression on the medical transcription data set with tf-idf max-features = 8000.

Algorithm Precision Recall F1-score Accuracy ROC Correct False
NB 0.989 0.994 0.989 0.5 0 0
SVM 0.989 0.994 0.989 0.5 0 0
Log-Reg 1 0.997 0.996 0.996 0.815 17 0

Notes: Performance scores for: A Naive Bayes-, a SVM- and a Logistic Regression Classifier, trained on 9 (randomly sampled) dentistry medical transcriptions and 90 non-dentistry medical transcriptions, using tf-idf representations with 8000 features.

Neither classical NB nor SVM classifiers perform well on this small and imbalanced training data set. Logistic regression, however, performs surprisingly well and produces no false positives. Note, however, that a human labeller (who is not allowed to make any mistakes) would have to manually classify roughly 1300 documents to obtain at least 9 documents representing the target class with a probability of 90%. Labelling is a time- and cost-intensive task. In order to get an impression of how severe this problem actually is, we asked an expert – a physician – to label 100 medical transcriptions, which include the 27 relevant dentistry transcriptions.2 It took the expert roughly 50 minutes, and no mistakes were made. Thus, labelling the needed 1300 documents would take the expert approximately 11 hours. Hence, having an expert label such a large amount of data for the classification results in Tables 2 and 3 seems impractical.
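The 1300-document figure can be sanity-checked with a hypergeometric calculation; the sketch below assumes documents are labelled in random order from the 2350-document corpus containing 27 dentistry transcriptions, and confirms that labelling on this scale attains at least the stated 90% probability.

```python
from scipy.stats import hypergeom

N_DOCS, N_POSITIVE, N_LABELLED = 2350, 27, 1300

# P(at least 9 dentistry transcriptions among 1300 randomly labelled documents);
# sf(8) = P(X > 8) = P(X >= 9) for the hypergeometric distribution.
prob = hypergeom.sf(8, N_DOCS, N_POSITIVE, N_LABELLED)
print(f"P(>= 9 positives in {N_LABELLED} labelled documents) = {prob:.3f}")
```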

In order to compare our method with other, less sophisticated approaches that are often used in applications, we perform a keyword search. The relevant keywords are defined by two professional dentists and can be found in the supplemental material (see supplemental material A.4). The keyword search correctly classifies all 27 relevant documents, with a precision of 15.79%; thus, 144 medical transcriptions are falsely classified. Note that including a single additional keyword can significantly change the obtained results. Adding the word dental, for example, leads to 187 falsely classified medical transcriptions and a reduced precision of 12.6%.

3.2. Reuters ApteMod

We use the scientific papers tagged with the author keyword cotton in order to classify the Reuters data set and find the target newspaper documents covering the topic of cotton. Two wordclouds, again representing two randomly selected topics from a LDA topic models output on the complete Reuters newspaper corpus are pictured in Figure 5.

Figure 5. Wordclouds for visualization of the Reuters data set.

Since we suspected that the prevalence of the target topic is very low, we again set a second threshold, this time at 0.45–3%. This implies that we only consider classifiers in the further analysis that classify at least 85% of the training data and between 0.45% and 3% of the target data as positive. We find three possible classifiers and select one for further analyses, again by LDA topic modelling. The three possible $\nu$ values are 0.056, 0.078 and 0.084, respectively classifying 175, 173 and 170 of the training data and 124, 82 and 56 of the target data as positive. The third classifier ($\nu = 0.084$) is selected, as the topic modelling revealed that with this parameter combination fewer topics emerge that do not cover the target topic. A topic model estimated with 5 topics for the 56 newspaper documents reveals only one topic that clearly does not depict the target topic of cotton (see Figure 6, left-hand depiction).

Figure 6. Wordclouds for visualization of the Reuters data set covering the target data (right) and not covering the target data (left).

In the first step of the classification, we find 36 of the 61 (59%) target documents, with a precision of 64%. Note that of the 24 uniquely labelled documents, we find all 24. In total, we have 20 false positive classifications.

Similar to the procedure in the medical transcription classification, we aim to filter out the false positive classifications. This time, we filter the documents only by the prevalence of the identified unrelated topic. As we are interested in those documents in which this topic has a very low prevalence, we set the threshold at <0.05. Thereby, we are able to increase the precision to 87.2% (see Table 4) and again outperform other common classifiers trained on in-domain training data (see Table S.1 in the supplemental material (A.3)).

Table 4.

Performance scores for the cotton prediction.

Method Precis. Recall F1 Acc. ROC Correct False
(1) O-SVM 0.643 0.590 0.615 0.996 0.794 36 20
(2) + Threshold 0.872 0.557 0.680 0.997 0.778 34 5
(3) + Dominant 0.800 0.590 0.679 0.997 0.795 36 9

Notes: Performance scores given for: (1) The initial one-class SVM classifier (trained on out-of-domain training data), (2) the one-class SVM classifier extended with LDA topic modelling and documents selected on a prevalence threshold for the identified bad topic and (3) the one-class SVM classifier extended with LDA topic modelling and documents selected whose dominant topic is not the identified bad topic.

3.3. Comparison of the performance to the literature

For a comparison of the performance of our classification rule, we would need to consider unsupervised classification problems for an extremely underrepresented class (<1%). Given that the generation of labelled data via web scraping turns our unsupervised classification problem into a supervised one-class classification problem, we consider the performance of one-class classifiers from the literature that are applied to unbalanced text data as an appropriate benchmark. Manevitz and Yousef [24] use their one-class SVM to classify data in the Reuters data set, but only for the 10 classes with the largest prevalence instead of extremely underrepresented classes such as cotton. On average, they achieve an F1-score of 0.52 for the 10 different categories, which is substantially lower than the F1-score of around 0.7 that we achieve with our classification rule. This is the case although [24] use in-domain training data and categories that have a higher prevalence. In a recent publication, Zhang and Jatowt [53] propose a one-class Naive Bayes classifier. They achieve an average F1-score of 0.47 for three different data sets; their best F1-score is 0.59.

4. Conclusion

In this contribution, we show that the implementation of unsupervised document classification integrating web scraping, one-class Support Vector Machines and Latent Dirichlet Allocation topic modelling yields very accurate classification results for different data sets. With the proposed method, we are able to circumvent the time- and cost-intensive task of manual labelling and even outperform Naive Bayes and classical SVM classifiers trained on in-domain training data. Note that the advantages of the proposed method increase with the size of the target data set and the degree of unbalancedness. As described, manual labelling would require the entire data set to be manually labelled in order to obtain a sufficiently large training class for the target category. The proposed method offers additional advantages, e.g. that the obtained classified target data can easily be used as training data and incorporated into multi-class classification algorithms. Thus, with the proposed method more suitable training data sets could be generated with the help of out-of-domain data.

The presented approach can be extended in several ways in future research. If there are structural variables within the data sets, these could be included in structural topic models [34]; the one-class SVM could be replaced by other methods; and other approaches, for instance based on decision making [31–33], could be integrated. From a practical perspective, the method could be employed for analysing geospatial Twitter data [17]. The authors plan to investigate further approaches in the future.

Supplementary Material

Supplemental_Material

Acknowledgments

We are grateful to Cornelius Weisser for the data labelling and to Jeanne Micallef and Maximilian Kornhass for helping with the dictionary used in the keyword search. For both tasks their medical expert knowledge was invaluable. We also thank two anonymous reviewers for their many insightful comments and suggestions on the original version of the paper that improved the resulting manuscript a lot.

Notes

1

For medical data, privacy policies make it very difficult to obtain in-domain training data or data that is comparable to in-domain data, such that scientific papers covering the broad topic of dentistry seem like a good representation of this topic. Regarding the Reuters data set, one could argue for scraping newspaper articles on the subject of cotton. However, the difficulty of finding text labels that are accurate enough for classification, and the paywalls of the most popular newspaper websites, constitute problems for newspaper article web scraping that are not easy to overcome. For that reason, taking scientific papers for web scraping proved to be the most practicable solution.

2

Remember from Section 2.1: there were 27 occurrences with the labelling dentistry found in the medical transcriptions data set.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1.MTsamples, MTsamples, preprint (2020). Available at https://www.mtsamples.com/.
  • 2.Afzal M.Z., Capobianco S., Malik M.I., Marinai S., Breuel T.M., Dengel A. and Liwicki M., Deepdocclassifier: Document classification with deep convolutional neural network, in 13th International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2015, pp. 1111–1115.
  • 3.Amer M., Goldstein M. and Abdennadher S., Enhancing one-class support vector machines for unsupervised anomaly detection, in Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, 2013, pp. 8–15.
  • 4.Anand A., Pugalenthi G., Fogel G.B. and Suganthan P., An approach for classification of highly imbalanced data using weighting and undersampling, Amino Acids 39 (2010), pp. 1385–1391. [DOI] [PubMed] [Google Scholar]
  • 5.Bird S., Klein E. and Loper E., Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly Media, Sebastopol, CA, 2009. [Google Scholar]
  • 6.Blei D.M., Ng A.Y. and Jordan M.I., Latent dirichlet allocation, J. Mach. Learn. Res. 3 (2003), pp. 993–1022. [Google Scholar]
  • 7.Boser B.E., Guyon I.M. and Vapnik V.N., A training algorithm for optimal margin classifiers, in Proceedings of the 5th Annual Workshop on Computational Learning Theory, 1992, pp. 144–152.
  • 8.Boyle T., Medical transcriptions, preprint (2020). Available at https://www.kaggle.com/tboyle10/medicaltranscriptions.
  • 9.Chaney A.J.B. and Blei D.M., Visualizing topic models, 6th International AAAI Conference on Weblogs and Social Media, 2012.
  • 10.Chawla N.V., Japkowicz N. and Kotcz A., Editorial: Special issue on learning from imbalanced data sets, SIGKDD Explor. Newsl. 6 (2004), pp. 1–6. Available at 10.1145/1007730.1007733. [DOI] [Google Scholar]
  • 11.Dai W., Xue G.R., Yang Q. and Yu Y., Co-clustering-based classification for out-of-domain documents, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 210–219.
  • 12.Hofmann T., Probabilistic latent semantic indexing, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, Association for Computing Machinery, SIGIR '99, 1999, pp. 50–57. Available at 10.1145/312624.312649. [DOI]
  • 13.Hoffman M., Bach F.R. and Blei D.M., Online learning for latent Dirichlet allocation, Adv. Neural Inf. Process. Syst. (2010), pp. 856–864.
  • 14.Japkowicz N. and Stephen S., The class imbalance problem: A systematic study, Intell. Data Anal. 6 (2002), pp. 429–449. [Google Scholar]
  • 15.Jiang X., Ringwald M., Blake J.A., Arighi C., Zhang G. and Shatkay H., An effective biomedical document classification scheme in support of biocuration: Addressing class imbalance, Database 2019 (2019). [DOI] [PMC free article] [PubMed]
  • 16.Joachims T., Text categorization with support vector machines: Learning with many relevant features, in European Conference on Machine Learning, Springer, 1998, pp. 137–142.
  • 17.Kant G., Weisser C. and Säfken B., Ttlocvis: A twitter topic location visualization package, J. Open Source Softw. 5 (2020), p. 2507. [Google Scholar]
  • 18.Kanungo T., Mount D.M., Netanyahu N.S., Piatko C.D., Silverman R. and Wu A.Y., An efficient k-means clustering algorithm: Analysis and implementation, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002), pp. 881–892. Available at 10.1109/TPAMI.2002.1017616. [DOI] [Google Scholar]
  • 19.Karypis M.S.G., Kumar V. and Steinbach M., A comparison of document clustering techniques, TextMining Workshop at KDD2000 (May 2000), 2000.
  • 20.Li K.L., Huang H.K., Tian S.F. and Xu W., Improving one-class SVM for anomaly detection, in Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 03EX693), Vol. 5, IEEE, 2003, pp. 3077–3081.
  • 21.Liu A., Ghosh J. and Martin C.E., Generative oversampling for mining imbalanced datasets, in DMIN, 2007, pp. 66–72.
  • 22.Luo Y., Feng H., Weng X., Huang K. and Zheng H., A novel oversampling method based on SeqGAN for imbalanced text classification, in IEEE International Conference on Big Data (Big Data), IEEE, 2019, pp. 2891–2894.
  • 23.Maldonado S., López J. and Vairetti C., An alternative smote oversampling strategy for high-dimensional datasets, Appl. Soft. Comput. 76 (2019), pp. 380–389. [Google Scholar]
  • 24.Manevitz L.M. and Yousef M., One-class svms for document classification, J. Mach. Learn. Res. 2 (2001), pp. 139–154. [Google Scholar]
  • 25.Manevitz L. and Yousef M., One-class document classification via neural networks, Neurocomputing 70 (2007), pp. 1466–1481. [Google Scholar]
  • 26.Mitchell R., Web-Scraping with Python: Collecting More Data From the Modern Web, O'Reilly Media, Inc., Sebastopol, CA, 2018. [Google Scholar]
  • 27.Moraes R., Valiati J.F. and Neto W.P.G., Document-level sentiment classification: An empirical comparison between svm and ann, Expert. Syst. Appl. 40 (2013), pp. 621–633. [Google Scholar]
  • 28.Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R. and Dubourg V., et al. Scikit-learn: Machine learning in python, J. Mach. Learn. Res. 12 (2011), pp. 2825–2830. [Google Scholar]
  • 29.Raskutti B. and Kowalczyk A., Extreme re-balancing for svms: A case study, SIGKDD. Explor. 6 (2004), pp. 60–69. [Google Scholar]
  • 30.Řehůřek R. and Sojka P., Software framework for topic modelling with large corpora, in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, 2010, pp. 45–50.
  • 31.Riaz M., Çagman N., Wali N. and Mushtaq A., Certain properties of soft multi-set topology with applications in multi-criteria decision making, Decis. Making Appl. Manag. Eng. 3 (2020), pp. 70–96. [Google Scholar]
  • 32.Riaz M., Çitak F., Wali N. and Mushtaq A., Roughness and fuzziness associated with soft multi-sets and their application to madm, J. New Theory 31 (2020), pp. 1–19. [Google Scholar]
  • 33.Riaz M. and Hashmi M.R., Linear diophantine fuzzy set and its applications towards multi-attribute decision-making problems, J. Intell. Fuzzy Syst. 37 (2019), pp. 5417–5439. [Google Scholar]
  • 34.Roberts M.E., Stewart B.M. and Airoldi E.M., A model of text for experimentation in the social sciences, J. Am. Stat. Assoc. 111 (2016), pp. 988–1003. Available at 10.1080/01621459.2016.1141684. [DOI] [Google Scholar]
  • 35.Salton G. and Buckley C., Term-weighting approaches in automatic text retrieval, Inf. Process. Manag. 24 (1988), pp. 513–523. [Google Scholar]
  • 36.Schölkopf B., Platt J.C., Shawe-Taylor J., Smola A.J. and Williamson R.C., Estimating the support of a high-dimensional distribution, Neural. Comput. 13 (2001), pp. 1443–1471. [DOI] [PubMed] [Google Scholar]
  • 37.Schölkopf B., Smola A.J. and Bach F., et al. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 2002. [Google Scholar]
  • 38.Sievert C. and Shirley K., LDAvis: A method for visualizing and interpreting topics, in Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 2014, pp. 63–70.
  • 39.Silva C. and Ribeiro B., The importance of stop word removal on recall values in text categorization, in Proceedings of the International Joint Conference on Neural Networks, Vol. 3, IEEE, 2003, pp. 1661–1666.
  • 40.Thielmann A., Weißer C. and Krenz A., One-class support vector machine and lda topic model integration – Evidence for ai patents, preprint (2021), to appear in Comput. Intell.
  • 41.Ting S., Ip W. and Tsang A.H., Is Naive Bayes a good classifier for document classification, Int. J. Softw. Eng. Appl. 5 (2011), pp. 37–46. [Google Scholar]
  • 42.Uysal A.K. and Gunal S., The impact of preprocessing on text classification, Inf. Process. Manag. 50 (2014), pp. 104–112. [Google Scholar]
  • 43.Van Rossum G. and Drake Jr F.L., Python Reference Manual, Centrum voor Wiskunde en Informatica Amsterdam, Scotts Valley, CA, 1995. [Google Scholar]
  • 44.Vapnik V.N., The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995. [Google Scholar]
  • 45.Vijayarani S., Ilamathi M.J. and Nithya M., Preprocessing techniques for text mining – An overview, Int. J. Comput. Sci. Commun. Netw. 5 (2015), pp. 7–16. [Google Scholar]
  • 46.Wang Y., Wong J. and Miner A., Anomaly intrusion detection using one-class SVM, in Proceedings from the 5th Annual IEEE SMC Information Assurance Workshop, IEEE, 2004, pp. 358–364.
  • 47.Wang P., Domeniconi C. and Hu J., Using Wikipedia for co-clustering-based cross-domain text classification, in 8th IEEE International Conference on Data Mining, IEEE, 2008, pp. 1085–1090.
  • 48.IEEE Xplore, IEEE Xplore digital library (2020). Available at https://ieeexplore.ieee.org/Xplore/home.jsp.
  • 49.Xu Z., Yu K., Tresp V., Xu X. and Wang J., Representative sampling for text classification using support vector machines, in European Conference on information Retrieval, Springer, 2003, pp. 393–407.
  • 50.Yang Y. and Liu X., A re-examination of text categorization methods, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 42–49.
  • 51.Yang Z., Yang D., Dyer C., He X., Smola A. and Hovy E., Hierarchical attention networks for document classification, in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.
  • 52.Zhang J., Ma K.K., Er M.H. and Chong V., Tumor segmentation from magnetic resonance imaging by learning via one-class support vector machine, 2004.
  • 53.Zhang Y. and Jatowt A., Estimating a one-class Naive Bayes text classifier, Intel. Data Anal. 24 (2020), pp. 567–579. [Google Scholar]
