2022 Dec 19;6(1):91–163. doi: 10.1007/s42001-022-00191-7

A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis

Sandra Wankmüller
PMCID: PMC9762672  PMID: 36568019

Abstract

One of the first steps in many text-based social science studies is to retrieve documents that are relevant for an analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science to address this retrieval task is to apply a set of keywords and to consider those documents to be relevant that contain at least one of the keywords. But the application of incomplete keyword lists carries a high risk of drawing biased inferences. More complex and costly methods such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning could have the potential to more accurately separate relevant from irrelevant documents and thereby reduce the potential size of bias. Yet, whether applying these more expensive approaches increases retrieval performance compared to keyword lists at all, and if so, by how much, is unclear as a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder in SSRN, 2017. 10.2139/ssrn.3026393), the Social Bias Inference Corpus (SBIC) (Sap et al. in Social bias frames: reasoning about social and power implications of language. In: Jurafsky et al. (eds) Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, p 5477–5490, 2020. 10.18653/v1/2020.acl-main.486), and the Reuters-21578 corpus (Lewis in Reuters-21578 (Distribution 1.0). [Data set], 1997. http://www.daviddlewis.com/resources/testcollections/reuters21578/). Results show that query expansion techniques and topic model-based classification rules in most studied settings tend to decrease rather than increase retrieval performance. Active supervised learning, however, if applied to a not too small set of labeled training instances (e.g. 1000 documents), reaches a substantially higher retrieval performance than keyword lists.

Keywords: Imbalanced classification, Boolean query, Keyword lists, Query expansion, Topic models, Active learning

Introduction

When conducting a study on the basis of textual data, researchers are often confronted with a difficulty at the very start of an analysis: online platforms and other sources from which textual data are generated usually cover multiple topics and hence tend to contain textual references to a huge number of different entities. Social scientists, however, are typically interested in text elements that refer to a single entity, e.g. a specific person, organization, object, event, or issue.

Imagine, for example, that a study seeks to examine how rape incidents are framed in newspaper articles [12], or that a study seeks to detect electoral violence based on social media data [84], or that a study seeks to measure attitudes expressed towards further European integration in speeches of political elites [99]. In all these studies, one of the first steps is to extract documents that refer to the entity of interest from a large, multi-thematic corpus of documents.1 That is, researchers have to separate the relevant documents that refer to the entity of interest from the documents that focus on entities irrelevant for the analysis at hand. Newspaper articles reporting about rape incidents have to be separated from those articles that do not. Tweets relating to electoral violence have to be extracted from the stream of all other tweets. And speech elements about European integration have to be separated from elements in which the speaker talks about other entities.

Given a corpus comprising many diverse topics, it is likely that only a small proportion of documents relate to the entity of interest. Hence, the proportion of relevant documents is substantially smaller than the proportion of irrelevant documents in the data, and the task of separating relevant from irrelevant documents turns into an imbalanced classification problem [74, p. 155]. How researchers address this imbalanced classification problem is highly important as the selection of documents affects the inferences drawn. More precisely, if there is a systematic bias in the selection of documents such that the value on a variable of interest is related to the question of whether a document is selected for analysis or not, the inferences that are made on the basis of the documents that have been selected for analysis are likely to be biased. Any method that is applied to the identification of relevant documents can induce a selection bias. Yet the more accurately a method can separate relevant from irrelevant documents, the smaller the potential size of the bias resulting from this selection step.

Despite the relevance of this problem, the question of how best to retrieve documents from large, heterogeneous corpora has so far received little attention in social science research. In many applications, researchers rely on applying human-created sets of keywords and regard those documents as relevant that comprise at least one of the keywords (see, e.g. [12, 14, 27, 47, 57, 84, 99, 122, 135]). Yet, research indicates that humans are not good at generating comprehensive keyword lists and are highly unreliable at the task [61, p. 973–975]. That is, the keyword list generated by one human is likely to contain only a small fraction of the universe of terms one could use to refer to a given entity of interest [61, p. 973–975]. Moreover, the list of keywords that one human comes up with is likely to show little overlap with the keyword list generated by another human [61, p. 973–975]. Joining forces by combining keyword lists that researchers have created independently may alleviate the problem somewhat. But still, the conventional approach of using keywords to identify relevant documents is likely to be unreliable and thus is likely to lead to very different (and potentially biased) conclusions depending on which set of keywords the researchers have used [61, p. 974–976].

Other approaches for identifying relevant documents—such as passive and active supervised learning, query expansion techniques, or the construction of topic model-based classification rules—are less frequently employed in social science applications. These approaches also require human input, but they detect patterns or keywords the researchers do not have to know beforehand. Except for query expansion, these methods require the researchers to recognize documents or terms related to the entity of interest rather than to recall such information a priori [61, p. 972]. This does not preclude these techniques from generating selection biases. A supervised learning algorithm, for example, may systematically misclassify some documents as not being relevant based on word usage that could be correlated with a main variable of the analysis. Yet, as these approaches can extract patterns beyond what a team of researchers may come up with, they have the potential to more precisely separate relevant from non-relevant documents. And the higher the retrieval performance of a method, the smaller the potential for strongly biasing effects due to selection biases.

These other techniques, however, also have a disadvantage: they are much more resource-intensive to implement. Supervised learning algorithms require labeled training documents, query expansion techniques depend on a data source to operate on, and topic model-based classification rules hinge on estimating a topic model. As the identification of relevant documents from a large heterogeneous corpus is likely to only constitute an early small step in an elaborate text analysis, considerations regarding the costs and benefits of a retrieval method have to be taken into account.

Hence, an ideal procedure reliably achieves a high retrieval performance such that it reduces the risk of incurring large selection biases and simultaneously is cost-effective enough to be conducted as a single step of an extensive study. This is the core challenge of retrieving relevant documents for an analysis: addressing the trade-off between identifying relevant documents as exactly as possible in a situation of imbalance on the one hand, and staying within the limits of the resources that are available for the first of several analytical steps in a study on the other hand.

In practice, the performance and the cost-effectiveness of a procedure are likely to depend on the characteristics of an application (such as the length and textual style of documents, the type of the entity of interest, or the heterogeneity vs. homogeneity of the documents in the corpus). If the entity of interest is a person or organization and there is only a small set of expressions that is usually used to refer to this entity, then a list of keywords may lead to a performance similar to that of the resource-intensive application of a supervised learning algorithm. If, on the other hand, the entity of interest is not easily denominated (e.g. a policy issue such as the set of restrictions implemented to address the COVID-19 pandemic), then an acceptable retrieval performance may only be achieved by training a supervised learning algorithm.

So far, however, a systematic comparison of the performances of these different retrieval methods across social science applications is lacking. Thus, it is unclear what, if anything, could be gained in terms of retrieval performance by applying a more elaborate procedure. This study seeks to answer this question by comparing the retrieval performance of a small set of predictive keywords to (1) query expansion techniques extending this initial set, (2) topic model-based classification rules as well as (3) passive and active supervised learning. The procedures are compared on the basis of three retrieval tasks: (1) the identification of tweets referring to refugees, refugee policies, and the refugee crisis from a dataset of 24,420 German tweets [71], (2) the retrieval of posts that are offensive toward mentally or physically disabled people from the Social Bias Inference Corpus (SBIC) [110] that covers 44,671 potentially toxic and offensive posts from various social media platforms, and (3) the extraction of newspaper articles referring to crude oil from the Reuters-21578 corpus [69] that comprises economically focused newspaper articles of which 10,377 are assigned to a topic.

The results show that, with the model settings studied here, query expansion techniques as well as topic model-based classification rules tend to decrease rather than increase retrieval performance compared to sets of predictive keywords. They only yield minimal improvements or acceptable results in specific settings. By contrast, active supervised learning—if implemented with a not too small number of labeled training documents—achieves relatively high retrieval performances across contexts. Moreover, in each application, active learning substantially improves upon the mediocre to fair results reached by the best-performing lists of predictive keywords. The observed differences between the mean F1-Scores of active learning with 1000 labeled training documents and the maximum F1-Scores of keyword lists range from 0.218 to 0.295. Active learning requires considerably more resources than the creation of keyword lists. Yet—compared to the usual passive supervised learning procedure—active learning can reach the same high level of retrieval performance with a substantially smaller number of training documents that have to be annotated by human coders. Also, the achieved performance enhancements are so considerable (and the consequences of selection biases potentially so severe) that researchers should consider spending more of their available resources on the step of separating relevant from irrelevant documents. Therefore, this paper points out that active learning constitutes an effective learning strategy that social scientists can use to decrease the potential magnitude of the selection bias that can result from imperfectly identifying relevant documents.

In the following Sect. 2, the link between retrieval performances and selection bias sizes is explicated. Afterward, the benefits and disadvantages of the usage of keyword lists, query expansion techniques, topic model-based classification rules, and passive as well as active supervised learning are discussed (Sect. 3). Then, the procedures are applied to the datasets and their retrieval performances are inspected (Sect. 4). The final discussion in Sect. 5 summarizes what has been learned and points toward aspects that merit further study.2

Selection bias, recall, precision

Any method that a researcher employs in order to identify the share of documents that are relevant for an analysis at hand from a corpus of otherwise irrelevant documents can cause a selection bias. A selection bias arises if the question of whether or not a text is selected into the sample of units to be analyzed correlates with a property of the text that serves as the outcome variable.3 To illustrate, imagine that a researcher is interested in attitudes toward Joe Biden as expressed in comments on an online platform during a given time period. Assume furthermore that in order to identify the population of comments that are of interest to her analysis (i.e. in order to identify the comments that refer to Joe Biden), the researcher uses a keyword list that contains the single search term ‘Biden’. In this case, a selection bias is induced if the attitudes expressed in comments that the researcher identifies with her selection method (i.e. comments that refer to Joe Biden as ‘Biden’ or ‘Joe Biden’) are systematically more positive or more negative than the attitudes in comments that the researcher does not identify with her selection method (for example comments that refer to Joe Biden as ‘Sleepy Joe’). That is, if truly relevant documents are systematically misclassified—in the sense that the higher (or lower) the value on the variable of interest, the higher (or lower) the probability of being assigned to the relevant category—inferences made on the basis of the set of instances classified into the relevant category are biased.

To keep these biasing effects as minimal as possible, it is very important that researchers identify the population of documents that they are interested in as exactly as possible. Doing so is no guarantee that selection bias will not occur, but the higher the share of documents that have been identified as relevant among all truly relevant documents (i.e. the higher the performance metric of recall), the smaller the maximum magnitude of the biasing effects that can result from imperfectly identifying relevant documents.4 Figure 1 illustrates this connection between recall and the size of bias: high recall values do not guarantee that there are no systematic misclassifications, but the higher the recall, the smaller the maximum size of the bias that can arise from systematic misclassifications of truly relevant documents.

Fig. 1

Recall and the maximum size of bias. This plot is generated from a simulation that assumes the following scenario: among a large corpus of documents, 1000 documents are relevant for an analysis. Among these 1000 relevant documents, 500 documents express a positive attitude toward a political candidate and 500 documents express a negative attitude toward the candidate. The true attitude value of all positive attitude expressing documents is 1 and the true value of all negative documents is 0. Hence, the true mean attitude value in this population of documents is 0.5. Now it is assumed that researchers in a study first apply a selection rule via which they try to identify the relevant documents from the corpus. In a second step, the researchers then compute an estimate for the mean attitude value based on those documents that the selection rule identified as being relevant. These two steps are repeated several times, each time applying a different selection rule. The question addressed in this simulation is the effect that recall (i.e. the share of the 1000 truly relevant documents that a selection rule correctly predicts to be relevant) has on the size of bias in the estimation of quantities (here the mean attitude value) from documents that are predicted to be relevant. In order to examine the effect of recall on bias in isolation from other possible biasing effects, which can arise if truly irrelevant documents are erroneously predicted to be relevant, the assumption here is that for all selection rules precision is 1 (such that actually irrelevant documents are not selected into the study). Furthermore, it is assumed that the researchers are perfectly able to determine the true attitude value of a document. For example, if a selection rule identifies 50 positive and 200 negative attitude expressing documents, then the researchers will conclude that the 50 positive documents have an attitude value of 1 and the 200 negative documents have an attitude value of 0 and hence they estimate the mean attitude value to be 0.2. The difference between such an estimated value and the true mean value of 0.5 here is called bias. The plot shows how this bias depends upon the recall of relevant documents that express a positive attitude (x-axis), the recall of relevant documents that express a negative attitude (y-axis), and the overall recall of relevant documents (indicated by the color of the dots). The plot demonstrates that an increase in overall recall does not necessarily imply that the bias in the estimator decreases. An increase in the overall recall rate can even mean that the bias increases. Note that selection bias arises if the recall of positive relevant documents is higher or lower than the recall rate of negative relevant documents. If an overall increase in recall implies that this imbalance in the recall rates increases further, then an increase in recall causes an increase in bias. Yet the size that this bias can maximally assume decreases with an increasing overall recall: as the color of the dots moves from blue (low overall recall) to red (high overall recall), the maximum magnitude that this bias can reach decreases. If a selection strategy yields high overall recall, selection bias can still occur if the recall rates vary with the value of the outcome variable of interest. But the higher the recall, the weaker this biasing effect can be. (Note that the relationship between recall and the size of this form of bias also holds if the true proportion of positive vs. negative documents is different from 1:1. If among the truly relevant documents there are substantially more positive than negative documents (or the other way around), the precise functional form of the relationship between the recall of positive documents, the recall of negative documents, and bias differs from the function in the presented plot, but the general relationship between recall and bias remains the same.)
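The following Python snippet sketches the simulation described in the caption under its stated assumptions (500 positive and 500 negative relevant documents, precision fixed at 1). The grid of recall values is an illustrative choice and not necessarily the grid used for the published figure; the snippet only demonstrates how the maximum attainable bias shrinks as overall recall grows.

```python
import itertools
import numpy as np

# Simulation as described in the caption: 500 relevant documents with attitude value 1
# and 500 with value 0, precision fixed at 1; only the recall of the two subgroups varies.
n_pos, n_neg = 500, 500
true_mean = n_pos / (n_pos + n_neg)   # 0.5

records = []
for recall_pos, recall_neg in itertools.product(np.linspace(0.05, 1.0, 20), repeat=2):
    selected_pos = round(recall_pos * n_pos)    # retrieved relevant positive documents
    selected_neg = round(recall_neg * n_neg)    # retrieved relevant negative documents
    overall_recall = (selected_pos + selected_neg) / (n_pos + n_neg)
    bias = selected_pos / (selected_pos + selected_neg) - true_mean
    records.append((overall_recall, bias))

# The maximum absolute bias attainable at a given level of overall recall shrinks as recall grows.
for threshold in (0.5, 0.8, 0.95):
    max_bias = max(abs(bias) for overall_recall, bias in records if overall_recall >= threshold)
    print(f"overall recall >= {threshold:.2f}: maximum |bias| = {max_bias:.3f}")
```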

Whereas recall operates on the set of all truly relevant documents and focuses on the inclusion vs. exclusion of relevant documents in the analysis, the performance metric of precision exclusively takes into account the documents that have been assigned to the relevant category by the classification method and informs about the share of truly relevant documents among all documents that are predicted to fall into the relevant category. Precision thus provides no information on the potential for selection bias due to missing truly relevant documents. Nevertheless, precision should also be high: the lower the precision (i.e. the higher the share of documents that are considered to be relevant although they do not belong to the population of interest), the more an analysis is based on documents it does not want to make inferences about. Hence, high precision values are important because the higher the precision, the less estimated values can be biased by documents that are not actually of interest.

However, whereas low precision can be handled by a researcher in subsequent steps, low recall implies that a substantial proportion of truly relevant documents are never considered for analysis. Hence, falsely classifying a truly relevant document as irrelevant can be considered more severe than falsely predicting an irrelevant document to be relevant. Consequently, in this study’s context of identifying relevant documents to be used for further analyses from a set of otherwise irrelevant documents, recall as well as precision should be as high as possible; but recall is the slightly more important metric. Although recall is considered the slightly more important measure here, the harmonic mean of precision and recall, the F1-Score—because it is the measure nearly always reported—will be employed to assess the performances of the retrieval approaches evaluated in the following. Nevertheless, recall and precision values will be reported in the Appendix.
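For reference, the three metrics can be written in terms of the confusion-matrix cells (see Table 4 in Appendix 1):

\[
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]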

Retrieval approaches

In the following, keyword lists, query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning are presented. An overview of the procedural steps involved, the resources required, as well as the advantages and disadvantages of each discussed approach is given in Table 1.

Table 1.

Overview of approaches

Technique: Keyword lists [keyword]
  Procedural steps: At minimum: one human researcher that sets up a keyword list
  Resources and (dis)advantages:
    + Requires few resources
    − High risk of selection bias

Technique: Query expansion techniques [query]
  Procedural steps: Acquiring a data source that provides candidate expansion terms; implementation of an algorithm that ranks and selects expansion terms
  Resources and (dis)advantages:
    − Resources[query] > Resources[keyword]
    + Potential to create more comprehensive keyword lists
    +/− Will likely increase recall but decrease precision
    − Without user feedback: no control over expansion

Technique: Topic model-based classification rules [topic]
  Procedural steps: Estimation of a topic model; implementation of a classification rule building procedure
  Resources and (dis)advantages:
    − Resources[topic] > Resources[keyword]
    + Requires mere recognizing of relevant topics rather than labeling relevant documents
    − Little control over estimated latent topic structure

Technique: Passive and active supervised learning [passive], [active]
  Procedural steps: Labeling of training data; implementation of a supervised learning algorithm (plus active learning mechanism)
  Resources and (dis)advantages:
    − Resources[passive] > Resources[active] > Resources[keyword]
    + Separation between relevant and irrelevant documents is encoded in the training data and learned by the model
    + Validation set to assess classification performance

This table summarizes which procedural steps and resources are required to implement each of the compared approaches and also lists their advantages and disadvantages.

Keyword lists

In social science, a very commonly used approach to identify documents on relevant entities is to set up a set of keywords and to consider those documents as relevant that contain at least one of the keywords (see for example the studies listed in Table 5 in Appendix 2). This procedure in fact is a keyword-based Boolean query in which the keywords are connected with the OR operator [74, p. 4]. Slightly more advanced are Boolean queries in which, in addition to the OR operator, the AND operator is also used.
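As a minimal illustration, the snippet below implements such an OR-connected keyword query: a document is retrieved if it contains at least one keyword. The keywords and documents are hypothetical examples, not the lists used in this study.

```python
import re

# A document is retrieved as relevant if it contains at least one of the keywords
# (an OR-connected Boolean query). Keywords and documents are hypothetical examples.
keywords = ["refugee", "asylum", "migration"]
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE)

documents = [
    "New asylum regulations were announced today.",
    "The match ended in a goalless draw.",
]
relevant = [doc for doc in documents if pattern.search(doc)]
print(relevant)   # only the first document matches
```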

Table 5.

Social science studies applying keyword lists

Study | Number of keywords | How are the keywords selected? | Operators in Boolean query
Puglisi and Snyder [95] | 11+ | Likely by the authors | OR, AND
King et al. [62] | Unspecified | Likely by the authors | Unclear
Burnap et al. [27] | 33 | Likely by the authors | OR
Jungherr et al. [57] | 86 | By the authors | OR
Beauchamp [14] | 36 | Likely by the authors | OR
van Atteveldt et al. [128] | 1 | By the authors |
Baum et al. [12] | 2 | Likely by the authors | OR
Stier et al. [122] | 218 | By the authors | OR
Fogel-Dror et al. [47] | 27–170 | By the authors | OR
Katagiri and Min [58] | Unspecified | From COPDAB data bank | OR, AND
Zhang and Pan [143] | 50 | Empirically; frequency-based | OR
Rauh et al. [99] | 14 | Likely by the authors | OR
Uyheng and Carley [127] | 1 | Likely by the authors |
Reda et al. [100] | 57 | By the authors | OR, AND
Gessler and Hunger [48] | 94 | By the authors; re-usage of lists created by other authors | OR
Muchlinski et al. [84] | 30–38 | By the authors | OR
Watanabe [135] | 2–4 | By the authors | OR

This table lists, as examples, social science studies that employ lists of keywords to retrieve documents or text elements that are relevant for (a part of) their analysis. A similar but older list of studies can be found in Linder [71, p. 5]. Note that the column ‘Number of keywords’ gives the number of keywords the authors in the listed studies use to extract documents relating to one entity of interest. If the authors are interested in several entities, then typically several keyword lists are applied, which is why for some articles a range rather than a single number is given. Note also that Katagiri and Min [58, p. 161] state that the keywords they use come from the Conflict and Peace Data Bank (COPDAB) [5]. They do not specify how they extract keywords from this data bank.

The ways in which social science authors come up with a set of keywords range from simply using the most obvious terms (e.g. [12]), to collecting a set of typical denominations for the entity of interest (e.g. [14, 27, 57]), to carefully thinking about, testing, and revising sets of keywords (e.g. [48, 100, 122]), to collecting keywords empirically based on word-usage in texts known to be about the entity (e.g. [143]). Though these approaches vary in their complexity and costs, they are all still very cheap and relatively fast procedures. Another advantage of the usage of keyword lists for the extraction of relevant documents is that a researcher has full control over the terms that are included—and not included—as keywords.

Yet research suggests that the human construction of keyword lists is not reliable [61, p. 973–975]. If a researcher generates a keyword list, then another researcher or the same researcher at another point in time is likely to construct a very different set of keywords. This is problematic: Depending on which human-generated set of search terms is used to identify relevant documents, inferences drawn may vary greatly [61, p. 974–976]. Moreover, this conventional procedure of human keyword list generation can lead to biased inferences if the terms that are used to denote an entity correlate with the values of the variable of interest (see again the illustrative example regarding Joe Biden in Sect. 2). For keyword-based approaches to avoid such types of selection bias, a researcher has to set up a set of keywords that fully captures the universe of terms and expressions that are used to refer to the entity of interest in the given corpus.5 Yet as humans tend to perform very poorly when it comes to constructing an extensive set of search terms [61, p. 973–975], the application of human-generated keyword lists has a high risk of producing a selection bias.6

Query expansion

By moving beyond keywords that researchers are able to recall a priori, query expansion methods can be employed to create a more comprehensive set of search terms. Query expansion techniques expand the original query (i.e. the original set of keywords) by appending related terms [4, p. 1699–1700]. Here, the focus is on similarity-based automatic query expansion methods that add new terms automatically—i.e. without interactive relevance feedback from the user—and make use of the similarity between the query terms and potential expansion terms [4, p. 1700, 1706]. The underlying hypothesis is the association hypothesis formulated by van Rijsbergen, stating that “[i]f one index term is good at discriminating relevant from non-relevant documents, then any closely associated index term is also likely to be good at this” [129, p. 11]. The specific methods differ regarding

  • the data source to extract candidate terms for the expansion,

  • how candidate terms from this data source are ranked (such that the ranks reflect the relatedness to the original query), and

  • how (many) additional terms are selected and integrated into the original query

[4, p. 1701]. Data sources from which expansion terms are identified can be the corpus from which relevant documents are to be retrieved, the documents retrieved by the initial query, human-created thesauri such as WordNet, knowledge bases as Wikipedia, external corpora such as a collection of web texts, or a combination of these [4, p. 1701–1704].

Thesauri such as WordNet encode the semantic relationships between terms. Those terms that a thesaurus encodes to be related to the query terms can be considered candidate terms for expansion [4, p. 1702]. Path lengths between the synsets (word senses) in a thesaurus then can be used to compute a similarity score between a query term and potential expansion terms [4, p. 1705]. In Wikipedia, the network of hyperlinks between articles can be used to extract articles about concepts related to the query terms [2]. A similarity score, for example, can be computed based on shared ingoing and outgoing hyperlinks between articles [2, p. 6]. If the data source for query expansion is the local corpus from which documents are to be retrieved or if the data source is an external global corpus, then the similarity between terms can be assessed via similarity measures that are computed based on the terms’ vector representations [4, p. 1706]. A frequently used measure is cosine similarity:

\[
\mathrm{sim}_{\cos}(a_1, a_2) = \cos(\theta) = \frac{z[a_1] \cdot z[a_2]}{\lVert z[a_1] \rVert \, \lVert z[a_2] \rVert} \tag{1}
\]

whereby z[a_1] and z[a_2] are the vector representations of terms a_1 and a_2 respectively, ‖z[a_1]‖ and ‖z[a_2]‖ are the lengths of these vectors as computed by the Euclidean norm, and θ is the angle between the vectors. Cosine similarity gives the cosine of the angle between the term representation vectors z[a_1] and z[a_2] [74, p. 122]. If the angle between the vectors equals 0°, meaning that the vectors have the exact same orientation, the cosine is 1 [82, p. 281]. If the angle is 90°, meaning that the vectors are orthogonal to each other, then cos(θ) = 0 [82, p. 281].7

Frequently used term representations are word embeddings (see e.g. [33, 66, 118]). A word embedding is a real-valued vector representation of a term. Important model architectures for learning word embeddings are the continuous bag-of-words (CBOW) and the skip-gram models [79] as well as Global Vectors (GloVe) [91] and fastText [21]. In learning the word embedding for a target term a_t, these architectures make use of the words that occur in a context window around a_t [79, p. 4; 91, p. 1533–1535]. In doing so, these procedures for learning word embeddings implicitly draw on the distributional hypothesis [46], stating that the meaning of a word can be deduced from the words it typically co-occurs with [107, p. 102]. This in turn implies that semantically or syntactically similar terms are likely to have similar word embedding vectors that point in a similar direction ([15, p. 1139–1140]; [80]).

In similarity-based query expansion techniques, the terms that are closest to the query terms are used as query expansion terms. The number of terms added varies from approach to approach, ranging from five to a few hundred [4, p. 1714].

To sum up, researchers who implement query expansion methods require a data source for expansion, a measure that captures the relatedness between terms, and a procedure that determines which and how many terms are added. If they plan to represent terms as word embeddings, then either pretrained word embeddings are required or the embeddings have to be learned. Consequently, considerable resources and expertise are needed. Yet, whereas individuals may fail to create a comprehensive list of search terms, query expansion methods can uncover terms that denote the entity of interest and are used in the corpus at hand. As query expansion techniques have the potential to expand the initial query with synonymous and related terms, recall is likely to increase [67, 74, p. 193]. Precision, however, may decrease—especially if the added terms are homonyms or polysemes (i.e. terms that have different meanings, whereby the meanings can be conceptually distinct (homonyms) or related (polysemes)) [75, p. 110; 74, p. 193]. It thus may be advantageous to use a corpus or thesaurus that is specific to the domain of the retrieval task as the data source for query expansion, rather than a global corpus or general thesaurus [74, p. 193]. Moreover, query expansion techniques require researchers to come up with an initial set of query terms a priori (which will encode the researchers’ assumptions), and there is no guarantee that the expansion starting from the initial set will capture all different denominations of the entity. For example, there is no guarantee that query expansion will succeed in moving from ‘Biden’ to ‘Sleepy Joe’.8
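To make the mechanics concrete, the following sketch ranks candidate terms by their cosine similarity to the query terms and appends the most similar ones. The vocabulary and embedding vectors are random placeholders standing in for pretrained embeddings such as GloVe or fastText, and the selection rule (maximum similarity to any query term, fixed number of added terms) is only one of the many variants discussed above.

```python
import numpy as np

# Candidate expansion terms are ranked by their maximum cosine similarity to any of the
# original query terms; the top n_add terms are appended to the query. The vocabulary and
# the embedding vectors are random placeholders standing in for pretrained embeddings.
rng = np.random.default_rng(0)
vocab = ["refugee", "asylum", "border", "football", "stadium", "migration"]
embeddings = {term: rng.normal(size=50) for term in vocab}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def expand_query(query_terms, n_add=2):
    candidates = [t for t in vocab if t not in query_terms]
    scores = {
        c: max(cosine_similarity(embeddings[c], embeddings[q]) for q in query_terms)
        for c in candidates
    }
    ranked = sorted(candidates, key=scores.get, reverse=True)
    return list(query_terms) + ranked[:n_add]

print(expand_query(["refugee", "asylum"]))
```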

Topic model-based classification rules

Recently, Baden et al. [6] have proposed a procedure in which documents are categorized based on classification rules that are built by researchers on the basis of topics estimated by a topic model. Baden et al. [6] call their procedure Hybrid Content Analysis. The idea is to assign to a pre-defined category those documents for which a considerable share of their content is estimated to stem from topics that the researchers deem related to the category [6].

The family of topic models most widely applied in social science are Bayesian hierarchical mixed membership models that estimate a latent topic structure based on observed word frequencies in text documents [19, p. 993, 995–997; 18, p. 18; 103, p. 988; 144, p. 4713–4714]. This set of topic models (which are here simply referred to as topic models) assumes that each topic is a distribution over the terms in the corpus and each document is characterized by a distribution over topics [19, p. 995–997; 18, p. 18]. Given a corpus of N documents, topic models estimate a latent topic structure defined by the N×K document-topic matrix Θ and the K×U topic-term matrix B. Topic-term matrix B = [β_1, …, β_k, …, β_K] gives for each topic k ∈ {1, …, K} the estimated probability mass function across the U unique terms in the vocabulary: β_k = [β_{k1}, …, β_{ku}, …, β_{kU}], whereby β_{ku} is the probability for the uth term to occur given topic k. Document-topic matrix Θ = [θ_1, …, θ_i, …, θ_N] contains for each document d_i the estimated proportion assigned to each of the K latent topics: θ_i = [θ_{i1}, …, θ_{ik}, …, θ_{iK}], with θ_{ik} being the estimated share of document d_i assigned to topic k.

Given the estimated latent topic structure characterized by K×U topic-term matrix B and N×K document-topic matrix Θ, the topic model-based classification rule building procedure proceeds as follows (see also Fig. 10 in Appendix 3) [6, p. 171–174]:

  1. Based on K×U topic-term matrix B the researcher inspects for each topic the most characteristic terms, e.g. the terms that are most likely to occur in a topic and the terms that are the most exclusive for a topic.9 Given these terms that inform about the content of each topic, the researcher determines which topics refer to the entity of interest. The researcher then creates relevance matrix C of size K×1 whose elements are 1 if the topic is considered relevant and are 0 otherwise.

  2. Then the N×K document-topic matrix Θ is multiplied with C. The resulting vector r = [r_1, …, r_i, …, r_N] gives for each document the sum over those topic shares that refer to relevant topics. r_i can be interpreted as the share of words in document d_i that come from relevant topics.

  3. A threshold value ξ ∈ [0, 1] is set. All documents for which r_i ≥ ξ are considered to be relevant.
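The three steps can be expressed compactly in code. The sketch below uses random placeholder values for the document-topic matrix Θ and an illustrative relevance vector and threshold; in an actual application these would come from the estimated topic model and the researcher's judgments.

```python
import numpy as np

# Steps 1-3 of the classification rule. Theta stands in for the N x K document-topic
# matrix estimated by a topic model; the relevance vector C and the threshold xi
# reflect the researcher's judgments. All values here are illustrative.
rng = np.random.default_rng(1)
N, K = 5, 4
Theta = rng.dirichlet(alpha=np.ones(K), size=N)   # each row sums to 1

C = np.array([0, 1, 0, 1])     # step 1: topics 2 and 4 judged relevant (K x 1 relevance matrix)
r = Theta @ C                  # step 2: share of each document attributed to relevant topics
xi = 0.5                       # step 3: threshold
relevant_docs = np.where(r >= xi)[0]
print(np.round(r, 2), relevant_docs)
```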

The procedure utilizes a topic model as an unsupervised tool to uncover information about the latent topic structure of a corpus. Leveraging this information for the retrieval of relevant documents allows researchers to operate without a set of explicit keywords. Rather than having to come up with information about to be retrieved documents a priori, researchers merely have to recognize topics that refer to relevant entities. As topic models are well known and frequently developed and applied in social science (e.g. [7, 11, 45, 50, 73, 96, 106, 111]) and furthermore are implemented in corresponding software packages (e.g. [51, 105]), the procedure of building classification rules based on topic models seems easily accessible to the social science community.

Fig. 10

Building topic model-based classification rules. Classification rules can be built from any topic model that, on the basis of a corpus comprising N documents, estimates a latent topic structure characterized by two matrices: the N×K document-topic matrix Θ and the K×U topic-term matrix B. β_{ku} is the estimated probability for the uth term to occur given topic k. θ_{ik} is the estimated share assigned to topic k in the ith document. The topic model-based classification rule procedure proceeds as follows: Step 1: Researchers inspect matrix B, determine which topics are relevant, and create the K×1 relevance matrix C. Step 2: Matrix multiplication of Θ with C yields the resulting vector r. Step 3 (not shown): Documents with r_i ≥ threshold ξ ∈ [0, 1] are retrieved

Yet estimating a topic model in the first place induces costs. In particular, the number of topics K has to be set a priori. To set a useful value for K, researchers typically estimate several topic models with varying K and decide on a topic number after manually inspecting the most likely and most exclusive terms per topic as well as computing performance metrics (e.g. held-out likelihood) [104, p. 60–62]. Moreover, as topic models are unsupervised, there is no way for researchers—beyond setting parameters such as K—to guide the estimation process such that the results are related to the concepts of interest. Ideally, one would like to have a topic model that produces one or several topics that refer to the entity of interest and are characterized by high semantic coherence as well as exclusivity.10

It is not guaranteed, however, that there is a topic that distinctly covers the relevant entity. Additionally, topic models can generate topics that relate to several entities rather than a single entity. Consequently, whether the application of topic model-based classification rules will work out in a given application is unclear as the latent topic structure uncovered by the topic model cannot be forced to neatly separate topics referring to relevant entities from topics referring to non-relevant entities. And topic models also cannot be forced to produce coherent topics referring to the entity of interest at all.

Passive and active supervised learning

Supervised learning methods have the advantage that they come with supervision: the separation between relevant and irrelevant documents is encoded in the training data set and then learned by the model. This is a considerable advantage over automatic query expansion methods and topic model-based approaches.

Moreover, as the true class assignments for the training set documents are known, supervised learning approaches allow researchers to use resampling techniques (e.g. cross-validation) in order to assess how well the retrieval of relevant documents works. The values for precision and recall not only provide information about the performance of the retrieval method but also indicate the nature of the (mis)classifications. (Is the model lenient in assigning documents to the positive relevant class and, therefore, most of the relevant documents are retrieved (high recall) but there are many false positives among the retrieved documents (low precision) or is it rather the other way round?)

Furthermore, just like the topic model-based approach, supervised learning techniques depend on recognizing rather than recalling: when creating the training data set, coders read the training documents and assign them to the relevant vs. irrelevant class as specified in the coding instructions. Hence, supervised learning techniques require the coders to merely recognize relevant documents rather than to create information on relevant documents from scratch.

Supervised learning methods, however, also come with disadvantages. First, the labeling of training documents by human coders is extremely costly. Precise coding instructions have to be formulated, the coders have to be trained and paid, and the intercoder reliability (e.g. measured by Krippendorff’s α [65, p. 277–294]) has to be assessed. Reading an adequately large sample of documents and labeling each as relevant vs. irrelevant (or having this done by trained coders) takes time.

Second, in the context of retrieving relevant documents, it is likely that the share of relevant documents is small and thus further problems arise: If the training set documents are randomly sampled from the entire corpus from which relevant documents are to be retrieved and only a small share of documents refer to the entity of interest, then a large number of training documents have to be sampled, read, and coded such that the training data set contains a sufficiently large number of documents falling into the positive relevant class for the supervised method to effectively learn the distinctions between the relevant and the irrelevant class. If, for example, 3% of documents are relevant, then after coding 1000 randomly sampled training documents only about 30 documents will be assigned to the relevant category.11

What is more, if no adjustments are made, then each training set document has the same weight in the calculation of the value of the loss function. That is, the optimization algorithm attaches the same importance to the correct classification of each training set document. Yet in a retrieval situation characterized by imbalance, researchers typically care more about the correct classification of relevant training documents than irrelevant documents (see also the argumentation in Sect. 2 above) [24, p. 2–4]. Put differently, missing a truly relevant document (false negative) is considered more problematic than falsely predicting an irrelevant document to be relevant (false positive) [25]. So, there is the question of what to do to make the supervised learning algorithm focus on correctly detecting relevant documents.

The statistical learning community has devised a large spectrum of approaches to deal with imbalanced classification problems (for an overview see [24]). Among the most common and most easily applicable procedures that are employed to make the optimization algorithm put more weight on the correct classification of instances that are part of the relevant minority class are techniques that adjust the distribution of training set instances [24, p. 7–15, 21–27]. This set of techniques comprises procedures such as random oversampling, random undersampling and the synthetic minority oversampling technique (SMOTE) [24, p. 22; 28].

In random oversampling, instances of the minority class are randomly resampled with replacement and appended as duplicates to the training data set [132, p. 9833]. In random undersampling, randomly selected instances of the majority class are removed from the training set [132, p. 9831]. Both resampling strategies make the training set more balanced and thus put more weight on the minority class than in the original training set distribution. As random oversampling implies that resampled minority instances are added as exact duplicates, random oversampling can lead to overfitting on the training data and reduced generalization performance on the test data [24, p. 22]. In random undersampling, on the other hand, information from removed majority class instances is lost [26].12
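For illustration, the following sketch implements both resampling strategies on a synthetic label vector and returns indices, so that the same resampling can be applied to any feature matrix. It is a bare-bones version of what packages such as imbalanced-learn provide.

```python
import numpy as np

# Random oversampling and undersampling on a synthetic, imbalanced label vector y
# (1 = relevant minority class, 0 = irrelevant majority class).
rng = np.random.default_rng(2)
y = np.array([0] * 970 + [1] * 30)
minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Random oversampling: duplicate minority instances (drawn with replacement)
# until both classes are equally frequent in the training set.
oversampled_idx = np.concatenate([
    majority_idx,
    rng.choice(minority_idx, size=len(majority_idx), replace=True),
])

# Random undersampling: keep only a random subset of majority instances
# equal in size to the minority class.
undersampled_idx = np.concatenate([
    rng.choice(majority_idx, size=len(minority_idx), replace=False),
    minority_idx,
])
print(len(oversampled_idx), len(undersampled_idx))   # 1940 and 60
```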

Beside these techniques that adjust the training set distribution, a second set of methods to address imbalanced classification problems is the usage of cost-sensitive algorithms [24, p. 27 ff.]. A general method for cost-sensitive learning is to set up a cost matrix that specifies which cell in the confusion matrix (see Table 4 in Appendix 1) is associated with which cost [25, 41]. During training, the loss of each training instance takes into account the respective cost depending on which cell the instance is in [41, p. 973]. In this way, higher costs can be specified for false negatives than for false positives and be directly incorporated into the training process.

Table 4.

Confusion matrix

                   | Truly positive       | Truly negative       |
Predicted positive | True Positives (TP)  | False Positives (FP) | TP+FP
Predicted negative | False Negatives (FN) | True Negatives (TN)  | FN+TN
                   | TP+FN                | FP+TN                | N

The idea of the cost matrix also underlies the techniques that modify the distribution of training instances [41, p. 975]. The undersampling rates for the majority class or the oversampling rates for the minority class ideally should reflect the cost induced by misclassifying an instance from the respective class [25]. For example, if falsely predicting an instance from the positive minority class to be negative is considered 10 times more costly than falsely predicting an instance from the negative majority class to be positive, then the cost of a false negative is 10 and the cost for a false positive 1 (and true positives and true negatives induce no costs) [24, p. 36]. Positive minority class instances then could be randomly oversampled such that their number increases by a factor of 10, or the majority class instances could be undersampled such that their number decreases by a factor of 1/10 [24, p. 36].13
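As an example of incorporating such costs directly into training, the sketch below weights the positive minority class 10 times as heavily as the majority class via per-class weights (a feature offered by common learning libraries, here scikit-learn's class_weight argument). The data are random placeholders; this is only one way of operationalizing the cost ratio from the example above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-class weights make misclassifying a relevant (minority, class 1) document during
# training 10 times as costly as misclassifying an irrelevant one, mirroring the cost
# ratio in the example above. X and y are random placeholders for a document-feature
# matrix and its relevance labels.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] > 1.9).astype(int)          # roughly 3% "relevant" documents

clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
print(clf.predict(X).sum(), "documents predicted relevant")
```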

The focus of the methods for imbalanced classification problems mentioned so far has been on the difference in the misclassification costs associated with instances from the positive minority vs. the negative majority class. Yet, there are other types of cost that should also be considered: as elaborated above, the annotation of training documents is costly due to the resources required. And in the context of imbalanced classification problems, annotating a random sample of documents is inefficient, as a disproportionately large number of documents has to be annotated until an acceptable number of instances from the minority class is labeled. These training set annotation costs are the focus of active learning strategies.

Active learning refers to learning techniques in which the learning algorithm itself indicates which training instances should be labeled next [117, p. 4]. The idea is to let the learning algorithm select instances for labeling that are likely to be informative for the learning process [117, p. 5]. The underlying hypothesis is that by letting the learner actively select the instances from which it seeks to learn, a prediction accuracy that is as high as possible can be achieved with a number of annotated training instances that is as small as possible [117, p. 4, 5]. Active learning stands in contrast to the usual supervised learning procedure in which the training set instances are randomly sampled, annotated, and then handed over to the learning algorithm. When juxtaposing active learning with this usual supervised learning procedure, the latter is sometimes called passive learning [81, p. 534].

Active learning is useful in situations in which unlabeled training instances are abundant but the labeling process is costly [117, p. 4]. There are several different scenarios in which active learning can be applied (see [117], p. 8–12). In this study, the focus is on pool-based sampling. In pool-based sampling, a large collection of instances has been collected from some data distribution in one step [117, p. 11]. At the start of the learning algorithm, labels are obtained only for a very small set of instances, denoted I, whilst the other instances are part of the large pool of unlabeled instances U [117, p. 11]. In each iteration of the active learning algorithm, the algorithm is trained on instances in the labeled set I and makes predictions for all instances in pool U [70, p. 4; 117, p. 6, 11]. The instances in pool U then are ranked according to how much information the learner would gather from an instance if it were labeled [117, p. 11–12]. Then the most informative instances in U are selected and labeled (e.g. by human coders) [117, p. 6]. The newly labeled instances are added to set I and a new iteration starts [117, p. 6].14

In the active learning community, several different strategies of how the informativeness of an instance is defined and how the most informative instances are selected have been developed (for an overview see [117], p. 12 ff.). These strategies are termed query strategies [117, p. 12]. Here, the “[p]erhaps [...] simplest and most commonly used query framework” [117, p. 12] will be presented: uncertainty sampling [70]. In uncertainty sampling, those instances are considered to be the most informative about which the learning algorithm expresses the highest uncertainty [70, p. 4]. In the context of the binary document retrieval classification task, the uncertainty could be said to be highest for instances for which the predicted probability to belong to the relevant class is closest to 0.5 [70, p. 4].15
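A minimal pool-based uncertainty sampling loop might look as follows. The data, the simulated labels standing in for human coders, the initial seed set, and the batch size are all illustrative assumptions; in practice the queried documents would be sent to coders for annotation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Pool-based active learning with uncertainty sampling: in each iteration the model is
# fit on the labeled set I, predicts probabilities for the unlabeled pool U, and the
# instances whose predicted probability of relevance is closest to 0.5 are queried.
rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 20))
true_y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 2).astype(int)   # sparse "relevant" class

# Seed the initial labeled set I with a few documents from each class so the first
# model fit is possible; the rest of the corpus forms the unlabeled pool U.
pos_idx, neg_idx = np.where(true_y == 1)[0], np.where(true_y == 0)[0]
labeled = list(rng.choice(pos_idx, size=2, replace=False)) + \
          list(rng.choice(neg_idx, size=18, replace=False))
pool = [i for i in range(len(X)) if i not in labeled]
batch_size, n_iterations = 10, 5

for _ in range(n_iterations):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], true_y[labeled])
    probs = clf.predict_proba(X[pool])[:, 1]        # predicted probability of relevance
    uncertainty = -np.abs(probs - 0.5)              # largest when probability is near 0.5
    query = [pool[i] for i in np.argsort(uncertainty)[-batch_size:]]
    labeled.extend(query)                           # in practice: human coders label these
    pool = [i for i in pool if i not in query]

print(len(labeled), "labeled documents; share relevant in I:", round(true_y[labeled].mean(), 2))
```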

One important aspect to be kept in mind when applying active learning techniques is that because the training instances are not sampled randomly from the underlying corpus but are purposefully selected, the distribution of the class labels in training data set I and in unlabeled pool U is different from the distribution of labels in the entire corpus [81, p. 539]. If the expected generalization error is to be estimated, then one option is to randomly sample a set of instances from the corpus at the very start of the analysis [124, p. 57; 81, p. 539, 541]. This set then is annotated and set aside such that it neither can become part of set I nor set U [124, p. 57; 81, p. 539, 541]. After each learning iteration or a fixed number of iterations, the performance of the active learning algorithm then can be evaluated on this independent test set [124, p. 57; 81, p. 539, 541].

Empirically, in a majority of published works active learning reaches the same level of prediction accuracy with fewer training instances than supervised learning with random sampling of training instances (passive learning) [44, 70, 81, 117, 124]. Especially if data sets are imbalanced, active learning tends to reach the same level of classification performance with a substantially smaller number of labeled training instances than passive learning [44, p. 131; 40, p. 7954; 81, p. 543–544]. Closer inspections show that during the learning process, the training set I, which is selected by the active learning algorithm, is more balanced than the original data distribution [44, p. 133–134; 81, p. 545]. One likely reason for this observation is that active learning algorithms tend to pick instances for labeling from the uncertain region between the classes, and in this region of the feature space the class distribution tends to be more balanced; hence, the class distribution among the instances an active learning algorithm selects is likely to be more balanced as well [44, p. 129, 133–134]. A more balanced distribution implies that more weight is given to the minority class instances. Another likely reason is that because active learning algorithms tend to pick instances close to the boundary between the classes, they are able to learn the class boundary with a smaller number of training instances [117, p. 28].

Note that besides active learning, there exists a large spectrum of semi-supervised learning techniques that seek to make the best possible use of the information available in a small set of labeled and a large set of unlabeled data [117, p. 44]. Self-training [142], for example, is a semi-supervised learning technique that operates in a manner similar to uncertainty sampling—except that the instances the algorithm is most certain about, rather than those it is most uncertain about, are added (with their predicted labels) to the training data [117, p. 44–45]. Multi-view training is another set of semi-supervised learning approaches, comprising e.g. co-training [20] and tri-training [108, p. 97–98; 145]. In multi-view training, different models are trained on different data views, complementing each other [108, p. 97–98]. Again, those unlabeled instances the models are most certain about are added to the training data [117, p. 45]. For a study that applies a combination of these techniques see Khan and Lee [60].

Comparison

In the following section, retrieving documents via keyword lists is compared to a query expansion technique, topic model-based classification rules and active as well as passive supervised learning on the basis of three retrieval tasks. The code to replicate the analysis can be accessed via figshare at https://doi.org/10.6084/m9.figshare.19699840. The analysis is conducted in R [97] and Python [130]. For the analyses pertaining to active and passive supervised learning with the pretrained language representation model BERT (standing for Bidirectional Encoder Representations from Transformers), the Python code is run in Google Colab [49] in order to have access to a GPU.16

Data

Twitter: The first inspected retrieval task operates on a corpus comprising 24,420 German tweets. These tweets are a random sample of all tweets in German language in a larger collection of tweets that has been collected by Barberá [9]. Linder [71] sampled 24,420 German tweets and used CrowdFlower workers to label the sampled tweets. For each tweet, the label indicates whether the tweet refers to refugees, refugee policies, or the refugee crisis and thus is considered relevant or not [71, p. 23–24]. The task of retrieving the relevant tweets from this corpus indeed is an imbalanced classification problem as only 727 out of the 24,420 tweets (2.98%) are labeled to be about the refugee topic.

SBIC: The aim of the second retrieval task is to extract all posts from the Social Bias Inference Corpus (SBIC) [110] that have been labeled to be offensive toward mentally or physically disabled people. The SBIC includes 44,671 potentially toxic and offensive posts from Reddit, Twitter and three websites of online hate communities [110, p. 5480].17 The SBIC was collected with the aim of studying implied—rather than explicitly stated—social biases [110, p. 5477]. The subreddits and websites selected to be included in the SBIC constitute intentionally offensive online communities [110, p. 5480]. The additionally included Reddit comments and tweet data sets were collected such that there is an increased likelihood that the content of the collected posts is offensive (e.g. by selecting tweets that include hashtags known to be racist or sexist) [110, p. 5480]. Sap et al. [110] used Amazon Mechanical Turk for the annotation of the posts. For each post the coders indicated, amongst others, whether the post is offensive and if so, whether the target is an individual (meaning that the post is a personal insult) or a group (implying that the post offends a social group, e.g. women, people of color) [110, p. 5479–5480]. If one or several groups were targeted, the coders were asked to name the targeted group or groups [110, p. 5479–5480]. The authors merge the 1,414 targeted groups into seven larger group categories [110, p. 5481]. One of these group categories is mentally or physically disabled people. 2.15% of the 44,671 posts are annotated as being offensive toward the disabled.18 The category of disabled people is selected as the focus of this study because this group category is the most coherent one, capturing a well-defined group of people.

Reuters: The third retrieval task is to identify all newspaper articles in the Reuters-21578 corpus [69] that refer to the topic surrounding crude oil. Reuters-21578 [69] is a widely used corpus for evaluating retrieval approaches [44, 55, 124]. The corpus contains 21,578 newspaper articles that were published on the Reuters financial newswire service in 1987 [55, 69]. 10,377 articles are assigned to one or several out of 135 economic subject categories called topics [69]. These categories are, e.g. ‘gold’, ‘grain’, ‘cotton’. Here, the 10,377 topic-annotated articles are used for the analysis. The aim is to identify the 566 (5.45%) newspaper articles that are labeled to be about the crude oil topic. The topic is the fourth largest. It is large enough to possibly contain enough documents for the algorithms to learn from and at the same time is small enough such that the identification of crude oil articles can be considered an imbalanced classification problem.

The three data sets employed here are selected to represent various types of retrieval tasks common in social science. Tweets, posts from online platforms, and newspaper articles are types of documents that are often analyzed in social science and whose analysis typically involves some preliminary retrieval step (see, e.g. [12, 14, 47, 62, 84, 122, 135, 143]). The entities of interest in social science studies vary widely with regard to their nature and their level of abstraction. In this study, the entities of interest range from a multi-dimensional topic that includes abstract policies, occurrences as well as a social group (refugee policies, refugee crisis, refugees), to a one-dimensional topic about a single economic product (crude oil), to a specific social group (disabled people) that is referred to in a specific (namely: offending) way.

When working with text data, humans are typically regarded as the ultimate provider of validity [16, p. 470]. Humans are usually seen as best equipped for understanding and decoding the meaning of text and thus are seen as the best tools for making conceptual judgments (e.g. deciding whether a document does or does not refer to an entity and thus is relevant or not) [121, p. 551]. However, as several studies show that humans are not necessarily reliable when coding text data (e.g. [42, 78]), the question arises to what extent human assessments really can be considered valid [121, p. 553]. Human annotations of low quality can imply that what the annotations encode differs from what they actually should capture [121, p. 555]. Moreover, human annotations of low quality can introduce bias in performance assessments [121, p. 555]. Research on how the reliability and validity of text-based human coding can be increased or established [52, 65, 94, 121] is highly important but beyond the scope of this article. Therefore, in this article—in consistency with the literature at large but with this note of caution in view of the unknown biases possibly introduced by human codings—the human annotations that are available for each data set are used as the gold standard and compared against the predictions of the evaluated methods.

Approaches

Keyword lists

In order to compare the retrieval performance of keyword lists with the other discussed methods, keyword lists have to be generated for each of the three retrieval tasks. Research on the human construction of keyword lists shows that keyword lists created by different humans are likely to overlap very little and thus are likely to be unreliable [61, p. 973–975]. This poses a problem for the planned comparison because it would be best to have a challenging and reliable baseline against which the other approaches can be compared. To address this problem, the keyword lists are not constructed by humans but rather from the set of the most predictive keywords for the positive relevant class.

To identify predictive keywords, for each of the three studied corpora, the documents are preprocessed into a document-feature matrix.19 Then, logistic regression with regularization is applied. The regularization is introduced via the least absolute shrinkage and selection operator (LASSO; L1 penalty) or ridge regression (L2 penalty) depending on the outcome of hyperparameter tuning. The model is trained on the entire corpus and then the 50 most predictive terms (i.e. the terms with the highest coefficients) are extracted. The extracted terms are listed in Tables 6, 7, and 8 in Appendix 7. From each set of 50 most predictive terms, 10 keywords are randomly sampled whereby the probability of drawing a term is proportional to the relative size of the term’s coefficient. The 10 sampled keywords constitute one keyword list. The sampling of keywords from the set of predictive terms is repeated 100 times such that for each evaluated corpus there are 100 keyword lists of length 10 that serve as a basis for evaluation and comparison.20
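To illustrate this construction procedure, the following minimal Python sketch (using scikit-learn) mirrors the described logic. The names docs and labels, the shown hyperparameters, and the restriction to the L1 penalty are illustrative assumptions made here for brevity; the replication code may differ in its preprocessing and tuning details.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegressionCV

    # docs: list of raw document strings; labels: binary array (1 = relevant).
    vectorizer = CountVectorizer(lowercase=True, min_df=5)
    X = vectorizer.fit_transform(docs)                      # document-feature matrix
    terms = np.array(vectorizer.get_feature_names_out())

    # Regularized logistic regression; the penalty strength is tuned via cross-validation.
    model = LogisticRegressionCV(penalty="l1", solver="liblinear",
                                 Cs=10, cv=5, scoring="f1").fit(X, labels)

    # The 50 terms with the largest coefficients are taken as the most predictive terms
    # (assumed here to all have positive coefficients).
    coefs = model.coef_.ravel()
    top_idx = np.argsort(coefs)[::-1][:50]
    top_terms, top_coefs = terms[top_idx], coefs[top_idx]

    # Sample 100 keyword lists of 10 terms, with sampling probabilities
    # proportional to the relative size of a term's coefficient.
    rng = np.random.default_rng(1)
    probs = top_coefs / top_coefs.sum()
    keyword_lists = [rng.choice(top_terms, size=10, replace=False, p=probs)
                     for _ in range(100)]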

Table 6.

Most predictive features in the Twitter data set. A: This is a list of the 50 terms that are extracted as the most predictive features for the relevant class of documents that refer to the refugee topic (most predictive term at the top). From this list of terms, 100 samples of 10 keywords are sampled to construct initial keyword lists that then are extended via query expansion using the locally trained GloVe word embeddings. B: This is a list of the 50 terms for which a globally pretrained GloVe word embedding is available that are extracted as the most predictive features for the relevant class of documents that refer to the refugee topic (most predictive term at the top). From this list of terms, 100 samples of 10 keywords are sampled to construct initial keyword lists that then are extended via query expansion using the globally trained GloVe word embeddings

A B
#flüchtlinge migranten
migranten asylanten
#refugeeswelcome flüchtlingen
#refugee asylrecht
asylanten asylunterkunft
flüchtlingen asyl
asylrecht flüchtlinge
#migration asylbewerber
#fluechtlinge ausländer
asylunterkunft flüchtlingsheim
asyl refugees
flüchtlinge flüchtling
migrationshintergrund asylpolitik
asylbewerber ungarn
ausländer refugee
flüchtlingsheim mittelmeer
flüchtlingsheime kritisiert
flüchtlingskrise syrer
#schauhin syrischen
refugees bamberg
flüchtling brandanschlag
asylpolitik behandelt
ungarn merkels
#refugeecamp ermittelt
#bloggerfuerfluechtlinge zusammenhang
#refugees innenminister
#flüchtlingen kundgebung
flüchtlingsheimen pegida
#asyl welcome
refugee unterbringung
mittelmeer migration
kritisiert benötigt
syrer erfahrungen
proasyl sollen
syrischen heimat
bamberg tja
brandanschlag balkanroute
behandelt rechte
merkels dort
ermittelt bürger
zusammenhang merkel
innenminister demo
kundgebung mehrheit
pegida letztes
welcome geplante
#deutschland recht
refugeeswlcm_le hilfe
unterbringung europa
ndaktuell afghanistan
migration islam
Table 7.

Most predictive features in the SBIC. A: This is a list of the 50 terms that are extracted as the most predictive features for the relevant class of documents that are offensive toward disabled people (most predictive term at the top). From this list of terms, 100 samples of 10 keywords are sampled to construct initial keyword lists that then are extended via query expansion using the locally trained GloVe word embeddings. B: This is a list of the 50 terms for which a globally pretrained GloVe word embedding is available that are extracted as the most predictive features for the relevant class of documents that are offensive toward disabled people (most predictive term at the top). From this list of terms, 100 samples of 10 keywords are sampled to construct initial keyword lists that then are extended via query expansion using the globally trained GloVe word embeddings

A B
retard retard
retards retards
retarded retarded
quadriplegic quadriplegic
autistic autistic
paralyzed paralyzed
schizophrenic schizophrenic
vegetables vegetables
wheelchair wheelchair
epileptic epileptic
parkinson’s disabled
disabled anorexic
anorexic stevie
stevie cripples
cripples paralympics
paralympics adhd
adhd paraplegic
paraplegic syndrome
syndrome paralysed
paralysed midget
midget leper
leper amputee
amputee cripple
cripple tons
tons handicapped
handicapped bipolar
bipolar wheelchairs
wheelchairs dyslexic
dyslexic chromosome
chromosome blind
blind suicidal
suicidal crippled
crippled chromosomes
chromosomes vegetable
vegetable challenged
alzheimer’s special
challenged veggie
special cancer
veggie spade
cancer helen
spade jenga
helen autism
jenga medication
autism deaf
medication logan
deaf depressed
logan christopher
depressed mentally
christopher potato
mentally shouted
Table 8.

Most predictive features in Reuters-21578 Corpus. A: This is a list of the 50 terms that are extracted as the most predictive features for the relevant class of documents that are about the crude oil topic (most predictive term at the top). From this list of terms, 100 samples of 10 keywords are sampled to construct initial keyword lists that then are extended via query expansion using the locally trained GloVe word embeddings. B: This is a list of the 50 terms for which a globally pretrained GloVe word embedding is available that are extracted as the most predictive features for the relevant class of documents that are about the crude oil topic (most predictive term at the top). From this list of terms, 100 samples of 10 keywords are sampled to construct initial keyword lists that then are extended via query expansion using the globally trained GloVe word embeddings

A B
oil oil
crude crude
barrels barrels
barrel barrel
exploration exploration
energy energy
petroleum petroleum
production production
drilling drilling
bpd bpd
refinery refinery
opec opec
gulf gulf
tanker tanker
texas texas
offshore offshore
canada canada
resources resources
sea sea
rigs rigs
out out
petrobras petrobras
rig rig
refineries refineries
agency agency
along along
depressed depressed
conoco conoco
raise raise
shelf shelf
iranian iranian
platform platform
day day
maintain maintain
drill drill
total total
well well
deal deal
16 16
fiscal fiscal
upon upon
postings postings
light light
blocks blocks
tuesday tuesday
about about
meters meters
daily daily
future future
equivalent equivalent

In contrast to human-constructed keyword lists, for which it would be difficult to judge whether they perform at the higher or lower end of the range of lists humans might generate for the posed retrieval tasks, the keyword lists constructed here represent a good starting point: the selected keywords are highly indicative of the relevant class.

Query expansion

The keyword lists serve as the starting point for query expansion. Each keyword list is expanded via the following procedure:

  1. Take a set of trained word embeddings, here denoted by {z_1, …, z_u, …, z_U}.

  2. For each keyword s_v in the keyword list {s_1, …, s_V}:
    1. Get the word embedding of the keyword: z[s_v]
    2. Compute the cosine similarity between z[s_v] and each word embedding z_u in the set {z_1, …, z_u, …, z_U}:
      $\mathrm{sim}_{\cos}(s_v, z_u) = \frac{z[s_v] \cdot z_u}{\lVert z[s_v] \rVert \, \lVert z_u \rVert}$  (2)
    3. Take the M terms that are not keyword s_v itself and have the highest cosine similarity with keyword s_v. Add these M terms to the keyword list.

This query expansion strategy makes use of word embedding representations and the cosine similarity, as has been done in previous studies (e.g. [66, 118]). By not merging the keyword list into a single word vector representation but rather expanding the keyword list for each keyword separately, this expansion method allows moving in a different direction for each keyword. This might help in extracting a more varied range of linguistic denominations for the entity of interest and might be especially useful if the entity is abstract or combines several dimensions (as, e.g. is the case with the refugee topic that combines policies, occurrences, and a group of people). A similar procedure for query expansion has been studied by Kuzi et al. [66].
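A minimal sketch of this expansion step is given below. It assumes that embeddings is a Python dictionary mapping vocabulary terms to NumPy vectors (e.g. read from a GloVe text file); the function and variable names are illustrative and not taken from the replication code.

    import numpy as np

    def expand_keyword_list(keywords, embeddings, M=3):
        """Add, for each keyword, the M most cosine-similar terms to the list."""
        vocab = list(embeddings.keys())
        matrix = np.vstack([embeddings[t] for t in vocab])
        norms = np.linalg.norm(matrix, axis=1)
        expanded = list(keywords)
        for s_v in keywords:
            if s_v not in embeddings:
                continue                      # skip keywords without an embedding
            z = embeddings[s_v]
            sims = matrix @ z / (norms * np.linalg.norm(z))   # cosine similarities
            ranked = [vocab[i] for i in np.argsort(sims)[::-1] if vocab[i] != s_v]
            expanded.extend(ranked[:M])       # the M most similar terms
        return expanded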

For each evaluated retrieval task, two different sets of word embeddings are used: embeddings that have been externally pretrained on large global corpora and embeddings trained locally on the corpus from which documents are to be retrieved. With regard to the globally pretrained embeddings for the English SBIC and the Reuters corpus, GloVe embeddings with 300 dimensions that have been trained on CommonCrawl data are used [91].21 For the German Twitter data set, 300-dimensional GloVe embeddings trained on the German Wikipedia are employed.22 To obtain locally trained embeddings, a GloVe embedding model is trained on each corpus examined here. GloVe embeddings with 300 dimensions are obtained for all unigram features that occur at least 5 times in the corpus (see Appendix 4 for details).

The number of expansion terms M is varied from 1 to 9 such that after the expansion the lists of originally 10 keywords comprise between 20 and 100 keywords. The original as well as the expanded keyword lists are applied to the lowercased documents. Following the logic of a Boolean query with the OR operator, a document is predicted to belong to the positive relevant class if it contains at least one of the keywords in the keyword list.
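The retrieval rule itself reduces to a simple OR query. The following sketch assumes substring matching on lowercased documents, which is one possible implementation choice rather than a description of the replication code.

    def predict_relevant(documents, keyword_list):
        """Boolean OR query: a document is predicted relevant if it contains any keyword."""
        keywords = [k.lower() for k in keyword_list]
        return [any(k in doc.lower() for k in keywords) for doc in documents]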

Topic model-based classification rules

When constructing topic model-based classification rules, there are three steps at which researchers have to make decisions that are likely to substantively affect the results. First, after having selected a specific type of topic model that is to be used, the number of topics K to be estimated has to be set. Second, for the construction of a topic model-based classification rule, a researcher has to determine how many and which of the estimated topics are considered to be about the entity of interest (see Step 1 of the procedure described in Sect. 3.3). Finally, a threshold value ξ ∈ [0,1] has to be set. If the sum of a document's topic shares relating to relevant topics is ≥ ξ, the document is predicted to be relevant (see Step 3 of the procedure described in Sect. 3.3).

Whilst in practice a researcher ultimately has to settle for one of the options in each step such that a single classification rule is produced, here the aim rather is to comprehensively evaluate topic model-based classification rules and also to inspect how well such rules can perform if optimal decisions (w.r.t. retrieval performance) are made. Consequently, specific values for the number of topics, the number of relevant topics and the threshold values are set within reasonable ranges a priori. Then, the retrieval performance for all combinations of these values is evaluated. More precisely: On each corpus, seven topic models—each with a different number of topics K ∈ {5, 15, 30, 50, 70, 90, 110}—are estimated. Then, for each estimated topic model with a specific topic number, initially only one topic is considered relevant, then two topics, and then three. For each number of topics considered to be relevant, all possible combinations regarding the question which topics are considered relevant are evaluated. This implies that all ways of choosing one, two and three relevant topics (irrespective of the order in which they are selected) from the overall sets of 5, 15, 30, 50, 70, 90, and 110 topics have to be determined and evaluated. This amounts to 426,725 combinations—all of which are evaluated here.23 Finally, for each of the 426,725 combinations, four different threshold values ξ are inspected: 0.1, 0.3, 0.5, and 0.7. Whereas ξ = 0.7 only considers those documents to be relevant for which at least 70% of the words they contain are estimated to be generated by relevant topics, ξ = 0.1 is the most lenient solution in which all documents are classified as relevant that have at least 10% of their words assigned to relevant topics. As ξ increases, recall is likely to decrease and precision is likely to increase.
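As a concrete illustration, the classification rule and the count of evaluated combinations can be expressed in a few lines of Python. Here, theta denotes the matrix of estimated document-topic proportions; the variable names are assumptions made for this sketch.

    from math import comb
    import numpy as np

    def topic_rule(theta, relevant_topics, xi):
        """Predict a document as relevant if its summed share of relevant topics is >= xi."""
        relevant_share = theta[:, relevant_topics].sum(axis=1)   # theta: D x K matrix
        return relevant_share >= xi

    # All ways of choosing 1, 2, or 3 relevant topics for each evaluated number of topics K.
    n_combinations = sum(comb(K, r)
                         for K in (5, 15, 30, 50, 70, 90, 110)
                         for r in (1, 2, 3))
    assert n_combinations == 426725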

The type of topic model estimated here is a Correlated Topic Model (CTM) [18]. CTM extends the basic Latent Dirichlet Allocation (LDA) [19] by allowing topic proportions to be correlated. For more details on the estimation of the CTMs see Appendix 5 and for more details on the CTM in general see Blei and Lafferty [18].

Active and passive supervised learning

Two types of supervised learning methods are employed. First, Support Vector Machines (SVMs) [23, 30], and second, BERT (standing for Bidirectional Encoder Representations from Transformers) [32].

SVMs have been applied frequently and relatively successfully to text classification tasks in social science [12, 34, 37, 43, 94, 115]; and also in active learning settings [81]. BERT is a deep neural network based on the Transformer architecture [131]. The central element of the Transformer architecture is the (self-)attention mechanism [8, 131]. This mechanism allows the representation of each token to include information from the representations of other tokens (in the same sequence) [131, p. 6001–6002]; thereby enabling the model to produce token representations that encode contextual information and token dependencies. BERT typically is applied in a sequential transfer learning setting [32, p. 4175, 4179]. In sequential transfer learning, a model first is pretrained on a source task [108, p. 64]. In this pretraining phase, the aim is to learn model parameters such that the model can function as a well-generalizing input to a large range of different target tasks [108, p. 64]. Then, in the following adaptation phase, the pretrained model (with its pretrained parameters) serves as the input for the training process on the target task [108, p. 64]. The transferral of information (in the form of pretrained model parameters) to the learning process of a target task tends to reduce the number of training instances required to reach the same level of prediction performance compared to not applying transfer learning and training the model from scratch [53, p. 334].24

This potential of pretrained deep language representation models to reduce the number of required training instances is highly important for the application of deep neural networks in practice: In text classification tasks, deep neural networks tend to outperform conventional machine learning methods (such as SVMs) that often are applied on bag-of-words representations [109, 119]. But deep neural networks have a much higher number of parameters to learn than conventional models and thus require many more training instances. In situations in which the annotation of training instances is expensive or inefficient—such as in the context of retrieval with a strong imbalance between the relevant vs. irrelevant class—applying a deep neural network from scratch may become prohibitively expensive. In a transfer learning setting, however, an already pretrained deep language representation model merely has to be fine-tuned to the target task at hand. If the pretrained model generalizes well, the number of training instances required to reach the same level of performance as a deep neural network that is not used in a transfer learning setting is reduced by several times [53, p. 334]. This allows deep neural networks to be applied to natural language processing tasks for which only relatively few training instances are available. Moreover, Ein-Dor et al. [40] show that especially in imbalanced classification settings active learning strategies can further improve the prediction performance of BERT such that even fewer training instances are needed for the same performance levels.

For all applications, the pretrained BERT models are taken from HuggingFace's Transformers open source library [141]. The BERT model that is used as a pretrained input for the English applications on the SBIC and the Reuters corpus has been pretrained on the English Wikipedia and the BooksCorpus [146], as in the original BERT paper [32]. For the data set of German tweets, a German BERT model pretrained on, amongst others, Wikipedia and CommonCrawl data by the digital library team at the Bavarian State Library is used [85]. All BERT models are employed in the base (rather than the large) model version and operate on lowercased (rather than cased) tokens. (For more details on the implemented text preprocessing steps as well as the training settings used for SVM and BERT see Appendix 6.)
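For orientation, a compact fine-tuning sketch using the Transformers library is given below. The checkpoint name, the hyperparameters, and the data variables (train_texts, train_labels, test_texts, test_labels) are illustrative assumptions and do not reproduce the exact checkpoints or training settings described in Appendix 6.

    import torch
    from torch.utils.data import Dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    checkpoint = "bert-base-uncased"   # an uncased base checkpoint; a German checkpoint would be used for the tweets
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    class TextDataset(Dataset):
        """Wraps tokenized texts and binary relevance labels."""
        def __init__(self, texts, labels):
            self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    args = TrainingArguments(output_dir="bert-retrieval", num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=2e-5)
    trainer = Trainer(model=model, args=args,
                      train_dataset=TextDataset(train_texts, train_labels))
    trainer.train()
    predictions = trainer.predict(TextDataset(test_texts, test_labels)).predictions.argmax(-1)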

For both models, SVM and BERT, an active and a passive supervised learning procedure is implemented. The procedures consist of the following steps (if the procedures differ between the active and the passive learning setting, this is explicitly pointed out):

  • The data are randomly separated into 10 (SBIC, Twitter) or 5 (Reuters) equally sized folds.

  • Then, for each fold g of the 10 (SBIC, Twitter) or 5 (Reuters) folds the data have been separated into, the following steps are conducted:
    1. Fold g is set aside as a test set.
    2. From the remaining folds, 250 instances are randomly sampled to form the initial set of labeled instances I. The other instances constitute the pool of unlabeled instances U.
    3. The model is trained on the instances in set I and afterward makes predictions for all instances in pool U and the set aside test fold g. Recall, precision and the F1-Score for the predictions made for pool U and test fold g are separately recorded. During training in the passive learning setting, random oversampling of the instances falling into the positive relevant class is conducted such that the number of positive relevant instances increases by a factor of 5—thereby reflecting a cost matrix in which the cost of a false negative prediction is set to 5 and the cost of a false positive prediction is set to 1. In the active learning setting, no random oversampling is conducted.
    4. A batch of 50 instances from pool U is added to the set of labeled instances in set I. In passive learning, these 50 instances are randomly sampled from pool U. In active learning, the following query strategies are applied: In the active learning setting with BERT, the 50 instances whose predicted probability to fall into the positive relevant class is closest to 0.5 are selected. When applying an SVM for active learning, the 50 instances with the smallest perpendicular distance to the hyperplane are retrieved and added to I.
    5. Steps 3 and 4 are repeated for 15 iterations, i.e. until set I comprises 1000 labeled instances.

Hence, passive supervised learning with random oversampling and pool-based active learning with uncertainty sampling are applied. As the described learning procedures are repeated for 10 (SBIC, Twitter) or 5 (Reuters) times and are evaluated on each of the 10 (SBIC, Twitter) or 5 (Reuters) folds the data have been separated into, this allows taking the mean of the F1-Scores across the 10 (SBIC, Twitter) or 5 (Reuters) test folds as an estimate of the expected generalization error of the applied models.
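The following sketch illustrates the active learning loop for the SVM variant (pool-based uncertainty sampling via the distance to the hyperplane); the passive variant would instead draw each batch at random and randomly oversample the relevant class. The feature matrices, fold handling, and hyperparameters are simplified assumptions and not the replication code.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score

    def active_learning_svm(X_pool, y_pool, X_test, y_test,
                            n_init=250, batch=50, n_iter=15, seed=0):
        """Pool-based active learning with uncertainty sampling for a linear SVM."""
        rng = np.random.default_rng(seed)
        labeled = list(rng.choice(len(y_pool), size=n_init, replace=False))
        unlabeled = [i for i in range(len(y_pool)) if i not in set(labeled)]
        f1_per_step = []
        for _ in range(n_iter + 1):
            clf = LinearSVC().fit(X_pool[labeled], y_pool[labeled])
            f1_per_step.append(f1_score(y_test, clf.predict(X_test)))
            if not unlabeled:
                break
            # Uncertainty sampling: query the instances closest to the hyperplane.
            dist = np.abs(clf.decision_function(X_pool[unlabeled]))
            queried = [unlabeled[i] for i in np.argsort(dist)[:batch]]
            labeled.extend(queried)
            unlabeled = [i for i in unlabeled if i not in set(queried)]
        return f1_per_step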

Results

The results are presented in Figs. 2, 3, 4, 5, 6, 7, 8, 9 and Tables 2 and 3.

Fig. 2.

F1-Scores for Retrieving Relevant Documents with Keyword Lists and Query Expansion. This plot shows the F1-Scores resulting from the application of the keyword lists of 10 highly predictive terms as well as the evolution of the F1-Scores across the query expansion procedure based on locally trained GloVe embeddings (left column) and globally trained GloVe embeddings (right column). For each of the sampled 100 keyword lists that then are expanded, one light blue line is plotted. The thick dark blue line gives the mean over the 100 lists

Fig. 3.

F1-Scores for retrieving relevant documents with topic model-based classification rules. The height of a bar indicates the F1-Score resulting from the application of a topic model-based classification rule. For each number of topics K ∈ {5, 15, 30, 50, 70, 90, 110}, those two combinations out of all explored combinations regarding the question how many and which topics are considered relevant are shown that reach the highest F1-Score for the given topic number. The labels of the combinations are such that the first number indicates the number of topics K and the second number denotes the number of topics considered to be relevant by the combination. For example, in the Twitter data set the two best performing combinations for a topic model with K = 70 topics both regard three topics as being related to the refugee topic. So the label for both is 70-3. (The two combinations, of course, differ regarding which three topics they assume to be relevant.) For each combination, the F1-Score for each evaluated threshold value ξ ∈ {0.1, 0.3, 0.5, 0.7} is given. Classification rules that assign none of the documents to the positive relevant class have a recall value of 0 and an undefined value for precision and the F1-Score. Undefined values here are visualized by the value 0

Fig. 4.

Retrieving relevant documents with active and passive supervised learning. F1-Scores achieved on the set aside test set as the number of unique labeled documents in set I increases from 250 to 1000. Passive supervised learning results are visualized by blue lines, active learning results are given in red. For each of the 10 (Twitter, SBIC) or 5 (Reuters) conducted iterations, one light-colored line is plotted. The thick and dark blue and red lines give the means across the iterations. If a trained model assigns none of the documents to the positive relevant class, then it has a recall value of 0 and an undefined value for precision and the F1-Score. Undefined values here are visualized by the value 0

Fig. 5.

Share of relevant documents in the training set. Share of documents in the training set that fall into the positive relevant class as the number of unique labeled documents in set I increases from 250 to 1000. Passive supervised learning shares are visualized by blue lines, active learning shares are given in red. For each of the 10 (Twitter, SBIC) or 5 (Reuters) conducted iterations, one light-colored line is plotted. The thick and dark blue and red lines give the means across the iterations. The black dashed line visualizes the share of relevant documents in the entire corpus

Fig. 6.

SVM: Comparing passive learning with two oversampling factors to active learning. Left column: F1-Scores achieved by the SVMs on the set aside test set as the number of unique labeled documents in set I increases from 250 to 1000. Undefined values here are visualized by the value 0. Right column: Share of documents in the training set that falls into the positive relevant class as the number of unique labeled documents in set I increases from 250 to 1000. The black dashed line visualizes the share of relevant documents in the entire corpus. Both columns: The results are presented for passive learning with a random oversampling factor of 5 (blue lines), passive learning with random oversampling factors of 17 (Twitter), 20 (SBIC), and 10 (Reuters) (golden lines), as well as pool-based active learning with uncertainty sampling (red lines). The thick and dark blue, golden, and red lines give the means across the iterations

Fig. 7.

Twitter: comparison of retrieval methods. This plot summarizes the retrieval performances of the here evaluated approaches on the retrieval task associated with the Twitter corpus. a F1-Scores for the 100 lists of 10 predictive keywords that then are expanded in the local and global embedding spaces. A gray transparent dot marks the F1-Score reached by a single (expanded) keyword list. A blue opaque dot marks the mean of the F1-Scores across 100 (expanded) keyword lists that contain the same number of search terms. The larger the number of terms in an (expanded) keyword list, the larger the size and the lighter the color of the printed dots. b F1-Scores of the topic model-based classification rules with different values for threshold ξ. The larger the number of topics in an estimated CTM, the larger the size and the lighter the color of the printed dots. (From all CTMs with a given topic number that have been estimated here, the best two performing combinations regarding the question of how many and which topics are considered relevant are presented.) c F1-Scores for active as well as passive supervised learning with SVM and BERT. A gray transparent dot marks the F1-Score reached by a model that has been trained by a given number of unique training documents and is evaluated on one set aside test set g. A blue opaque dot marks the mean of the F1-Scores of 10 such models that have been trained on the same number of training documents. Each of the 10 models is evaluated on another test set and together the 10 test sets constitute the entire Twitter corpus. The larger the number of unique labeled training instances, the larger the size and the lighter the color of the printed dots

Fig. 8.

SBIC: comparison of retrieval methods. This plot summarizes the retrieval performances of the here evaluated approaches on the retrieval task associated with the SBIC. a F1-Scores for the 100 lists of 10 predictive keywords that then are expanded in the local and global embedding spaces. A gray transparent dot marks the F1-Score reached by a single (expanded) keyword list. A blue opaque dot marks the mean of the F1-Scores across 100 (expanded) keyword lists that contain the same number of search terms. The larger the number of terms in an (expanded) keyword list, the larger the size and the lighter the color of the printed dots. b F1-Scores of the topic model-based classification rules with different values for threshold ξ. The larger the number of topics in an estimated CTM, the larger the size and the lighter the color of the printed dots. (From all CTMs with a given topic number that have been estimated here, the best two performing combinations regarding the question of how many and which topics are considered relevant are presented.) (c) F1-Scores for active as well as passive supervised learning with SVM and BERT. A gray transparent dot marks the F1-Score reached by a model that has been trained by a given number of unique training documents and is evaluated on one set aside test set g. A blue opaque dot marks the mean of the F1-Scores of 10 such models that have been trained on the same number of training documents. Each of the 10 models is evaluated on another test set and together the 10 test sets constitute the entire SBIC. The larger the number of unique labeled training instances, the larger the size and the lighter the color of the printed dots

Fig. 9.

Reuters: comparison of retrieval methods. This plot summarizes the retrieval performances of the here evaluated approaches on the retrieval task associated with the Reuters-21578 corpus. (a) F1-Scores for the 100 lists of 10 predictive keywords that then are expanded in the local and global embedding spaces. A gray transparent dot marks the F1-Score reached by a single (expanded) keyword list. A blue opaque dot marks the mean of the F1-Scores across 100 (expanded) keyword lists that contain the same number of search terms. The larger the number of terms in an (expanded) keyword list, the larger the size and the lighter the color of the printed dots. (b) F1-Scores of the topic model-based classification rules with different values for threshold ξ. The larger the number of topics in an estimated CTM, the larger the size and the lighter the color of the printed dots. (From all CTMs with a given topic number that have been estimated here, the best two performing combinations regarding the question of how many and which topics are considered relevant are presented.) (c) F1-Scores for active as well as passive supervised learning with SVM and BERT. A gray transparent dot marks the F1-Score reached by a model that has been trained by a given number of unique training documents and is evaluated on one set aside test set g. A blue opaque dot marks the mean of the F1-Scores of 5 such models that have been trained on the same number of training documents. Each of the 5 models is evaluated on another test set and together the 5 test sets constitute the entire Reuters-21578 corpus. The larger the number of unique labeled training instances, the larger the size and the lighter the color of the printed dots

Table 2.

Example SBIC expansion terms. This table gives for each of the highly predictive terms ‘retard’, ‘vegetables’, and ‘epileptic’ the nine terms with the highest cosine similarity in the local and global embedding spaces

Embeddings Initial term Terms with the highest cosine similarity to the initial term
local retard anymore, blend, float, sex, arguments, college, 93, meanjokes, fever
local vegetables 100,000, name, U+1f407, knew, combination, traveled, pulled, strip, developed
local epileptic oj, blond, include, tactics, crown, tampons, demands, prostitutes, newspapers
global retard retards, retarded, dumbass, moron, idiot, faggot, fuckin, stfu, stupid
global vegetables veggies, fruits, vegetable, potatoes, carrots, tomatoes, meats, onions, cooked
global epileptic seizures, seizure, psychotic, schizophrenic, epileptics, fainting, migraine, spasms, disorder

Table 3.

Comparison of retrieval methods

Approach | Twitter (F1) | SBIC (F1) | Reuters (F1) | Resources vs. performance
keyword (local) | 0.266 (0.150, 0.417) | 0.281 (0.134, 0.404) | 0.432 (0.264, 0.645) | resources[keyword] = low; perform[keyword] = low
query (global; 10 exp. terms) | 0.309 (0.157, 0.405) | 0.291 (0.122, 0.439) | 0.367 (0.225, 0.606) | resources[query] > resources[keyword]; perform[query] = low
topic (best combination) | 0.253 | 0.175 | 0.685 | resources[topic] > resources[keyword]; perform[topic] = low/medium
passive (BERT; train: 1000) | 0.667 (0.518, 0.760) | 0.216 (0.037, 0.511) | 0.869 (0.820, 0.906) | resources[passive] ≫ resources[keyword]; perform[passive] = low/high
active (BERT; train: 1000) | 0.712 (/, 0.848) | 0.622 (0.558, 0.719) | 0.908 (0.897, 0.922) | resources[active] ≫ resources[keyword]; resources[active] < resources[passive]; perform[active] = high

This table summarizes the retrieval performances of the here evaluated approaches. keyword gives the mean F1-Scores for the 100 lists of 10 predictive keywords; minimum and maximum observed F1-Scores are given in parentheses. query gives the mean F1-Scores for the 100 lists of predictive keywords when expanded by 10 expansion terms based on the global embedding space; minimum and maximum observed F1-Scores are given in parentheses. topic informs about the best-performing topic model-based classification rule observed across all evaluated combinations. passive gives the mean F1-Scores for a BERT model that has been trained on 1000 unique training documents; minimum and maximum observed F1-Scores are given in parentheses. active gives the mean F1-Scores for a BERT model that has been trained in active learning mode on 1000 unique training documents; minimum and maximum observed F1-Scores are given in parentheses

Keyword lists and query expansion

Figure 2 visualizes for each of the three studied retrieval tasks (Twitter, SBIC, Reuters) the F1-Scores resulting from the application of the 100 keyword lists of 10 highly predictive terms as well as the evolution of the F1-Scores across the query expansion procedure based on locally trained GloVe embeddings (left column) and globally trained GloVe embeddings (right column).

In general, the retrieval performances of the initial keyword lists of 10 predictive keywords are mediocre. Only the initial keyword lists for the Reuters corpus achieve what could be called acceptable performance levels. The maximum F1-Scores reached by the initial lists of 10 predictive keywords are 0.417 (Twitter), 0.404 (SBIC), and 0.645 (Reuters).

Interestingly, the applied query expansion technique tends to decrease rather than increase the F1-Score and only shows some improvement of the F1-Score for the Twitter and SBIC data sets—and only if operating on the basis of word embeddings that are trained on large global external corpora rather than the local corpus at hand.

There are several factors that are likely to play a role here. First, when retrieving those terms that have the highest cosine similarity with an initial starting term, the terms retrieved from the global embedding space tend to be semantically or syntactically related to the initial term, whereas this is not the case for the local word embeddings (as an example see Table 2). One reason why the local embedding space does not yield word embeddings that position related terms closely together could be that the three corpora used here are relatively small. The information provided by the context window-based co-occurrence counts of terms thus could be too sparse for the embeddings to be trained effectively.

Second, as is to be expected, query expansion increases recall and decreases precision (see Figs. 11 and 12 in Appendix 8). Hence, in general query expansion is only worthwhile if—and as long as—the increase in recall outweighs the decrease in precision. Applying the initial set of 10 highly predictive keywords to the Twitter and SBIC data sets yields a retrieval result that is characterized by low recall and high precision, whereas applying the initial set of 10 highly predictive keywords to the Reuters corpus leads to very high (sometimes even perfect) recall and low precision (see Figs. 11 and 12 in Appendix 8). Whereas in the second situation there is no room for query expansion to further improve the retrieval performance via increasing recall (and thus the F1-Score for the Reuters corpus moves downward), in the low-recall-high-precision situation of the Twitter and SBIC data sets there is at least the potential for query expansion to increase recall without causing too strong a decrease in precision. This potential is realized in some expansion steps for some keyword lists, but the decrease in precision more often than not tends to outweigh the increase in recall.

Fig. 11.

Recall and precision for retrieving relevant documents with keyword lists and global query expansion. This plot shows recall and precision scores resulting from the application of the keyword lists of 10 highly predictive terms as well as the evolution of the recall and precision scores across the query expansion procedure based on globally trained GloVe embeddings. For each of the sampled 100 keyword lists that then are expanded, one light blue line is plotted. The thick dark blue line gives the mean over the 100 lists

Fig. 12.

Recall and Precision for Retrieving Relevant Documents with Keyword Lists and Local Query Expansion. This plot shows recall and precision scores resulting from the application of the keyword lists of 10 highly predictive terms as well as the evolution of the recall and precision scores across the query expansion procedure based on locally trained GloVe embeddings. For each of the sampled 100 keyword lists that then are expanded, one light blue line is plotted. The thick dark blue line gives the mean over the 100 lists. Note that the strong increase in recall for some keyword lists in the Twitter data set is due to the fact that the textual feature with the highest cosine similarity to the highly predictive initial term ‘flüchtlinge’ (translation: ‘refugees’) is the colon ‘:’

A further reason why query expansion does not perform very well even for global embeddings is the meaning conflation deficiency [93, p. 60]: Because word embedding models such as GloVe represent one term by a single embedding vector, a polyseme or homonym is likely to have the various meanings that it refers to encoded within its single representation vector [86, p. 1059]. The meanings get subsumed into one representation [112, p. 102]. Here, it seems that the conflation of meanings in the GloVe embeddings that have been pretrained on large, global corpora proceeds unequally: The global embedding space tends to position polysemous or homonymous terms close to terms that are semantically or syntactically related to the most common and general meaning of the polysemous or homonymous term (see for example the term ‘vegetables’ in Table 2). Query expansion in the global embedding space thus fails if an initial query term is a polyseme or homonym and its intended meaning is highly context-specific.

One possible solution would be to take embeddings trained on a global corpus and fine-tune them on the local corpus at hand, such that global, general-purpose representations become adapted to the local, task-specific domain. Mittens [35] is an extension of GloVe that allows for fine-tuning general-purpose embeddings obtained from GloVe to task-specific settings. Fine-tuning global GloVe representations to the SBIC corpus via Mittens, however, does not yield substantive improvements in the F1-Score (see Fig. 13 in Appendix 9).

Fig. 13.

Retrieving relevant documents with query expansion based on Mittens embeddings. This plot shows the F1-Scores, as well as the recall and precision values, resulting from the application of the keyword lists of 10 highly predictive terms as well as the evolution of the F1-Scores, recall values, and precision values across the query expansion procedure based on Mittens embeddings. For each of the sampled 100 keyword lists that then are expanded, one light blue line is plotted. The thick dark blue line gives the mean over the 100 lists

Topic model-based classification rules

Figure 3 presents the F1-Scores reached by topic model-based classification rules. The most notable aspect is that the retrieval performance of topic model-based classification rules is low for the Twitter and SBIC corpora and relatively high for the Reuters corpus. The highest F1-Score reached in the Twitter retrieval task is 0.253 and regarding the SBIC is 0.175, whereas on the Reuters corpus a score of 0.685 is achieved.

To better understand this result, for each topic, the terms with the highest occurrence probabilities and the terms with the highest FREX-Scores are inspected. (The FREX-Score is the weighted harmonic mean of a term's occurrence probability and a term's exclusivity; see [103, p. 993] for details.) This inspection (see Tables 9, 10, and 11 in Appendix 10) reveals that whether and to what extent there are exclusive and coherent topics that relate to the entity of interest likely determines whether a topic model-based classification rule can effectively retrieve relevant documents or not.
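To make this criterion concrete, a FREX-style score can be sketched as the weighted harmonic mean of a term's within-topic frequency rank and its exclusivity rank. The following illustration, with beta as the K × V matrix of topic-word probabilities and an assumed weight of 0.5, follows this general definition and may differ in detail from the exact formula in the cited work.

    import numpy as np

    def frex(beta, w=0.5):
        """Weighted harmonic mean of per-topic frequency and exclusivity ranks."""
        exclusivity = beta / beta.sum(axis=0, keepdims=True)   # share of a word's probability mass per topic
        def ecdf(m):                                           # empirical CDF of the values within each topic
            ranks = m.argsort(axis=1).argsort(axis=1) + 1
            return ranks / m.shape[1]
        return 1.0 / (w / ecdf(exclusivity) + (1 - w) / ecdf(beta))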

Table 9.

Twitter: terms with the highest probability and the highest FREX-Score

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
Prob. @therealliont alt zeit klein ne
facebook stund seid zwei woch
lass los scheiss #pegida nach
FREX gepostet fertig zeit #pegida wert
facebook cool #emabiggestfans1d #nopegida passt
monday-giveaway wahrschein gonn setz woch
Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
Prob. toll fall best the eigent
richtig nochmal nacht of sonntag
dabei h onlin hatt eig
FREX toll nochmal schulmobel team eigent
wenigst fall videos artikel lag
manchmal #breslau onlin #techjob sonntag
Topic 11 Topic 12 Topic 13 Topic 14 Topic 15
Prob. wurd xd sich gern @youtube-playlist
viel mensch suss sammelt hinzugefugt
lohnt frag vielleicht kind @youtub
FREX bordelldatenbank.eu wunderschon reinhard_4711 sammelt #younow
erforscht frag mallorcamagazin zeig nightcor
publiziert geil sich glucklich vs
Topic 16 Topic 17 Topic 18 Topic 19 Topic 20
Prob. #iphonegam #iphon weiss lang oh
fahrt steh welt veranstalt schnell
hotel gruss end event nix
FREX antalya #iphon kompakt lkr mag
erendiz #blondin deintraum veranstalt aufgeregt
sightseeing #blondinenwitz haus event sing
Topic 21 Topic 22 Topic 23 Topic 24 Topic 25
Prob. endlich bitt folg beim war
gross abgeschloss warum spass grad
brauch frau #votesami seit sowas
FREX #immortalis mission erfullt abonni geschlecht
immortalis bitt belohn total mutt
pvp-gefecht aufgab international herzlich star
Topic 26 Topic 27 Topic 28 Topic 29 Topic 30
Prob. #kca komm find retweet halt
twitt steht voll leut zuruck
#votedagi klar mach bett uhr
FREX gezuchtet #hamburg wahr @ischtaris anschau
ratselhaft geplant find bett zuruck
#votedagi hamburg zeigt #kca uberhaupt

For each topic in the CTM with 30 topics estimated on the Twitter corpus, this Table presents the 3 terms with the highest probability (Prob.) and the 3 terms with the highest FREX-Score (FREX). See Topic 4 for the here only moderately coherent Pegida topic. Note that German umlauts here are removed as the preprocessing procedure for the CTM involved stemming—which here also implied removing umlauts

Table 10.

SBIC: terms with the highest probability and the highest FREX-Score

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
Prob. U+1F602 go now more sinc as time bitch
rt we right take move year back got
bad out left doe fire had actual hoe
FREX U+1F602 go #releasethememo mani season ago comput U+1F612
U+1F62D let now take U+1F643 best hitler these
bad tonight hillari wors move almost finish retard
Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16
Prob. her day by there don’t dark man guy
she today children eat know movi into gay
make play hit lot women an ask posit
FREX girlfriend april = hide sexist chees bar gay
her fool bottl eat women prostitut walk fag
cri humid mosquito space don’t humor into straight
Topic 17 Topic 18 Topic 19 Topic 20 Topic 21 Topic 22 Topic 23 Topic 24
Prob. well love hate your our black muslim he
made shit who will white call was
friend i’m r their us between red his
FREX oh dirti reason your immigr pizza kid
god love crime peac russia black rose dad
^ yo asshol educ nation common ice father
Topic 25 Topic 26 Topic 27 Topic 28 Topic 29 Topic 30
Prob. look tri off see > would
good start im here < one
incel stori done post s can
FREX hair case piss post > would
look ethiopia youtub see < never
normal touch im pictur number one

For each topic in the CTM with 30 topics estimated on the SBIC, this Table presents the 3 terms with the highest probability (Prob.) and the 3 terms with the highest FREX-Score (FREX). See Topic 8 for a non-coherent and non-exclusive topic that slightly touches disrespectful posts about disabled people

Table 11.

Reuters: terms with the highest probability and the highest FREX-Score

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
Prob. cts profit pct share export bank ec unit
avg oper growth pct nil rate european subsidiari
shrs gain rise stock coffe pct communiti agreement
FREX shrs profit gnp smc seamen 9-13 ec mhi
avg pretax economi ucpb prev bank communiti cetus
cts extraordinari growth calmat ibc 9-7 ecus squibb
Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16
Prob. pct vs trade franc gulf oil offer dollar
billion loss u. french ship price share rate
januari rev japan group u. gas sharehold currenc
FREX unadjust rev lyng ferruzzi missil opec caesar louvr
januari vs chip cgct warship herrington sosnoff miyazawa
fell mths miti cpc tehran bpd cyacq poehl
Topic 17 Topic 18 Topic 19 Topic 20 Topic 21 Topic 22 Topic 23 Topic 24
Prob. share billion tonn ltd price food analyst canadian
stock trade wheat plc contract beef earn canada
dividend reserv export pct cent philippin market credit
FREX dividend fed beet csr octan satur analyst card
payabl surplus cane transcanada kwacha diseas ibm canadian
payout taiwan rapese monier sulphur nppc rumor nova
Topic 25 Topic 26 Topic 27 Topic 28 Topic 29 Topic 30
Prob. quarter price chemic court gold debt
first produc u. file mine payment
earn stock busi general ton loan
FREX fourth cocoa gaf gencorp assay debt
quarter buffer hanson afg uranium payment
earn icco borg-warn court gold repay

For each topic in the CTM with 30 topics estimated on the Reuters corpus, this Table presents the 3 terms with the highest probability (Prob.) and the 3 terms with the highest FREX-Score (FREX). See Topic 14 for the crude oil topic and see Topic 13 for the military topic that at times touches crude oil

The entity of interest in the Twitter data set is multi-dimensional. It includes refugees as a social group, refugee policies as well as actions and occurrences revolving around the refugee crisis. When examining the most likely and exclusive terms for topic models estimated on the Twitter corpus, it becomes clear that not each aspect of this multi-dimensional refugee topic is captured in a coherent and exclusive topic (see for example Table 9 in Appendix 10). In each model with K ≥ 30 there is one relatively coherent topic on Pegida, an anti-Islam and anti-immigration movement that held many demonstrations in the context of the refugee crisis. Besides that, there are further, more or less coherent topics that touch on refugees and refugee policies without, however, being exclusively about these entities. Thus, the topic models do not offer a set of topics that, taken together, cover all dimensions of the refugee topic in an exclusive manner.

Regarding the SBIC, the situation is even more disadvantageous. Across all CTMs estimated on the SBIC there is no topic that identifiably relates to disabled people in a disrespectful way (as an example see Table 10 in Appendix 10). The CTMs with higher topic numbers include some topics that very slightly touch on disabilities, but these topics are not coherent. Applying topic model-based classification rules in this situation is futile. Among all 426,725 × 4 = 1,706,900 evaluated settings, an F1-Score of 0.175 is the best that is achieved.

The situation is entirely different for the crude oil topic. For K ≥ 30, each estimated topic model contains at least one coherent topic that clearly refers to aspects of crude oil (e.g. ‘opec’, ‘bpd’, ‘oil’; see Table 11 in Appendix 10). These coherent crude oil topics are not perfectly, but still relatively, exclusive. Some of the crude oil topics also cover another energy source (namely: ‘gas’) and there is one reappearing conflict topic that refers to military aspects but also touches on crude oil (‘gulf’, ‘missil’, ‘warship’, ‘oil’). Other than that, no other entities are substantially covered by crude oil topics. Building topic model-based classification rules on the basis of these crude oil topics yields relatively high recall and precision values.

Hence, topic model-based classification rules can be a useful tool—but only if the estimated topics coherently and exclusively cover the entity of interest in all its aspects.

In all three applications, and as is to be expected, high recall and low precision values tend to be achieved for topic models with a smaller number of topics and lower values for threshold ξ, whereas low recall and high precision values tend to result from topic models with a higher number of topics and higher values for ξ (see Fig. 14 in Appendix 11).25 Classification rules that use topic models with a higher topic number and a lower threshold ξ tend to exhibit neither the highest recall nor the highest precision values, but they tend to strike the best balance between recall and precision and achieve the highest F1-Scores (see Fig. 3 here and Fig. 14 in Appendix 11).

Fig. 14.

Recall and precision of topic model-based classification rules. The height of a bar indicates the recall values (left column) and precision values (right column) resulting from the application of a topic model-based classification rule. For each number of topics K ∈ {5, 15, 30, 50, 70, 90, 110}, those two combinations out of all explored combinations regarding the question how many and which topics are considered relevant are shown that reach the highest F1-Score for the given topic number. For each combination, recall and precision values for each threshold value ξ ∈ {0.1, 0.3, 0.5, 0.7} are given. Classification rules that assign none of the documents to the positive class have a recall value of 0 and an undefined value for precision and the F1-Score. Undefined values here are visualized by the value 0

Active and passive supervised learning

Figure 4 visualizes for each of the three studied retrieval tasks (Twitter, SBIC, Reuters), for each employed supervised learning model (BERT and SVM), for each applied learning setting (passive learning with random oversampling as well as pool-based active learning with uncertainty sampling), and for each of the 10 (Twitter, SBIC) or 5 (Reuters) conducted iterations, the F1-Score achieved on the set aside test set (fold g) as the number of labeled training documents in set I increases from 250 to 1000. Passive supervised learning results are visualized by blue lines, active learning results are given in red. The thick and dark blue and red lines give the mean F1-Scores across the iterations. They visualize the estimate of the expected generalization error.26

Across all three applications and for BERT as well as SVM, active learning with uncertainty sampling tends to dominate passive learning with random oversampling. Passive learning with random oversampling on average only shows a similar or higher F1-Score for the first learning iteration (i.e. at the start when training is conducted on the randomly sampled training set of 250 labeled instances). Then, however, the active learning retrieval performance strongly increases such that for the same number of labeled training instances, active learning on average produces a higher F1-Score than passive learning.

One likely reason for this difference between passive and active learning is revealed in Fig. 5, which contains the same information as Fig. 4, except that the y-axis shows not the F1-Score but the share of documents from the positive relevant class in the training set. The black dashed line visualizes the share of relevant documents in the entire corpus and thus would be the expected share of relevant documents in a randomly sampled training set if neither random oversampling nor active learning were conducted. In passive learning with random oversampling (shown in blue), the 50 training instances that are added in each step to the set of labeled training instances I are randomly sampled from pool U. Then, the relevant instances in set I are randomly oversampled such that their number increases by a factor of 5. For this reason, the share of positive training instances in the passive learning setting is higher than in the corpus (black dashed line) but remains relatively constant across the training steps. In active learning (shown in red), no random oversampling is conducted—which is why at the beginning the share of relevant documents in set I approximately equals the share of relevant documents in the corpus. Then, however, active learning at each step selects the 50 instances the algorithm is most uncertain about. As has been observed in other studies before [44, p. 133–134; 81, p. 545], this implies that disproportionately many instances from the relevant minority class are selected into set I. The share of positive training instances increases substantively—which in turn tends to increase generalization performance on the test set as shown in Fig. 4.

When decomposing the F1-Score into recall and precision (see Fig. 15 in Appendix 12), it is revealed that the supervised models' recall values gradually improve as the number of training instances increases. The precision values reach higher levels early on and exhibit a more volatile path. The observed retrieval performance enhancements hence are particularly driven by the models becoming better, from step to step, at identifying a larger share of the truly relevant documents in the corpora. Models trained in active rather than passive learning mode tend to yield higher recall values.

Fig. 15.

Recall and Precision of Active and Passive Supervised Learning. Recall values and precision values achieved on the set aside test set as the number of unique labeled documents in set I increases from 250 to 1000. Passive supervised learning results are visualized by blue lines, active learning results are given in red. For each of the 10 (Twitter, SBIC) or 5 (Reuters) conducted iterations, one light colored line is plotted. The thick and dark blue and red lines give the means across the iterations. If a trained model assigns none of the documents to the positive relevant class, then it has a recall value of 0 and an undefined value for precision and the F1-Score. Undefined values here are visualized by the value 0

Yet, there is the question of whether active learning exhibits a superior performance to passive learning with random oversampling simply because after a certain number of training steps the share of positive training instances is higher for active than for passive learning, or whether active learning dominates passive learning (also) because active learning, due to focusing on the uncertain region between the classes and due to operating on unique—rather than duplicated—positive training instances, learns a better generalizing class boundary with fewer training instances [117, p. 28]. To inspect this question, for the SVMs passive learning with random oversampling is repeated whereby positive relevant documents are randomly oversampled such that their number increases by a factor of 10 (Reuters), 17 (Twitter), or 20 (SBIC) (instead of a factor of 5 as before). This results in higher shares of relevant documents in the training set for passive learning (see right column in Fig. 6). Yet, the prediction performance on the test set as measured by the F1-Score either does not increase or increases only minimally compared to the situation of random oversampling by a factor of 5 (see left column in Fig. 6). Moreover, although the share of relevant documents in the more strongly oversampled passive learning training data sets is similar to that of active learning, active learning still yields considerably higher F1-Scores. This indicates that from a certain point, merely duplicating positive instances by random oversampling has no or only a small effect on the class boundary learned by the SVM. The finding also indicates that active learning improves upon passive learning because it is effectively able to select a large share of truly positive documents for training that are not duplicated but unique and because its selection of uncertain documents provides crucial information on the class boundary.

Another important observation concerns the performances’ variability (see again Fig. 4): For all models and learning modes, given a fixed number of labeled training instances, the F1-Scores on the set aside test sets can vary considerably between iterations. Which set of documents is randomly sampled to form the (initial) training set and which set aside test fold is used for evaluation thus can have a profound effect on the measured retrieval performance.

A further observation is that BERT on average tends to outperform the SVM (see also Fig. 16 in Appendix 13). With regard to the Twitter and SBIC retrieval tasks, however, BERT exhibits a higher instability from one learning step to the next as the number of labeled training instances in set I increases by a batch of 50 documents (see Fig. 17 in Appendix 13). After adding a new batch of 50 labeled training instances and fine-tuning BERT on this new, slightly expanded training data set, the F1-Score achieved on the test set may not only increase but also decrease considerably. The strongest decreases and increases can be observed for active learning on the Twitter data set, where drops and rises of the F1-Score by a value of about 0.85 occur. Because of this high degree of variability in performance, monitoring retrieval performance with a set aside test set seems important for researchers to detect situations in which retrieval performance drops to low values (likely due to vanishing gradients [83]). Such situations can be easily fixed by, for example, choosing another random seed for initialization.
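A hedged sketch of such a fix, assuming PyTorch-based fine-tuning: fix the seeds that govern weight initialization and data order, and rerun the fine-tuning step with a different seed if the monitored F1-Score collapses. The routine fine_tune_and_evaluate is a hypothetical placeholder for the training and evaluation loop, not a function from the replication code.

```python
import random
import numpy as np
import torch

def set_seed(seed):
    """Fix the seeds that govern weight initialization and the order of the training data."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

# If the F1-Score on the set aside test set collapses after a fine-tuning run
# (e.g. the model assigns no document to the relevant class), retry with a new seed.
for seed in (42, 43, 44):
    set_seed(seed)
    f1_on_test_set = fine_tune_and_evaluate()   # hypothetical fine-tuning/evaluation routine
    if f1_on_test_set > 0.0:
        break
```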

Fig. 16 Comparing BERT and SVM for Active and Passive Supervised Learning I. F1-Scores achieved on the set aside test set as the number of unique labeled documents in set I increases from 250 to 1000. Active learning results are visualized in the left panels, passive learning results are given in the right panels. F1-Scores of the SVMs are visualized by blue lines, BERT performances are given in red. For each of the 10 (Twitter, SBIC) or 5 (Reuters) conducted iterations, one light colored line is plotted. The thick and dark blue and red lines give the mean across the iterations. If a trained model assigns none of the documents to the positive relevant class, then it has a recall value of 0 and an undefined value for precision and the F1-Score. Undefined values here are visualized by the value 0

Fig. 17 Comparing BERT and SVM for Active and Passive Supervised Learning II. Distribution of the differences in the F1-Scores achieved on the set aside test set as the number of unique labeled documents in set I increases from one training step to the next by a batch of 50 documents. Boxplots visualizing the distribution of differences in F1-Scores of the SVMs are presented in blue. F1-Score differences for BERT are given in red. The mean is visualized by a star dot. The value of the mean as well as the standard deviation (SD) are given below the respective boxplots

Comparison across approaches

The central question this study seeks to answer is as follows: What, if anything, can be gained by applying more costly retrieval approaches, such as query expansion, topic model-based classification rules, or supervised learning, instead of the relatively simple and inexpensive usage of a Boolean query with a keyword list? In order to finally answer this question and compare the approaches against each other, Figs. 7, 8, and 9 summarize the retrieval performance—as measured by the F1-Score—of the evaluated approaches on the retrieval tasks associated with the Twitter corpus (Fig. 7), the SBIC (Fig. 8), and the Reuters-21578 corpus (Fig. 9). In each Figure, the left panel gives the F1-Scores for the lists of 10 predictive keywords that then are expanded in the local and global embedding spaces. The middle panel shows the F1-Scores of topic model-based classification rules with different values for threshold ξ. The right panel visualizes the F1-Scores for active as well as passive supervised learning with SVM and BERT. Moreover, Table 3 gives the F1-Scores of specific learning settings across all evaluated applications.

In general, the direct comparison shows that, when keyword lists comprising 10 empirically predictive terms are taken as the baseline, the application of more complex and more expensive retrieval techniques does not guarantee better retrieval results.

Query expansion techniques here tend to decrease rather than increase the F1-Score. Minimal improvements occur only sporadically, in the embedding space trained on external global corpora, when the increase in recall outweighs the decrease in precision. The farther the expansion extends, the worse the results tend to become.

Topic model-based classification rules work relatively well for the Reuters corpus but not for the other corpora. Hence, if there are no coherent and exclusive topics that cover the entity of interest in all its aspects, topic model-based classification rules exhibit rather poor retrieval performances. If, on the other hand, coherent and exclusive topics relating to the entity of interest exist (as is the case for the Reuters corpus), acceptable retrieval results are possible. Here, gains over the best performing keyword lists are achieved for combinations with larger topic numbers and smaller values for threshold ξ. The best performing topic model-based classification rule on the Reuters corpus is based on a CTM with 70 topics; it considers 2 topics to be relevant and predicts those documents to be relevant that have at least 10% of their words assigned to these two relevant topics. This topic model-based classification rule reaches an F1-Score that is 0.04 higher than the F1-Score of the best performing keyword list (see Table 3).
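As an illustration of how such a rule can be applied once a topic model has been estimated, the following is a minimal sketch, assuming the document-topic proportions are available as a matrix theta (documents in rows, topics in columns); the function and variable names, and the topic indices in the comment, are illustrative.

```python
import numpy as np

def topic_rule_predict(theta, relevant_topics, xi=0.10):
    """Classify documents as relevant via a topic model-based classification rule.

    theta: (n_documents, K) matrix of estimated topic proportions, rows summing to 1.
    relevant_topics: indices of the topics judged to refer to the entity of interest.
    xi: minimum share of a document's words that must be assigned to the relevant topics.
    """
    share_relevant = theta[:, relevant_topics].sum(axis=1)
    return share_relevant >= xi

# Example: a CTM with 70 topics in which topics 12 and 45 (illustrative indices)
# are considered relevant and the threshold is set to 0.10:
# is_relevant = topic_rule_predict(theta, relevant_topics=[12, 45], xi=0.10)
```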

Whereas query expansion techniques and topic model-based classification rules show no or small improvements, supervised learning—if conducted in an active learning mode—has the potential to yield a substantively higher retrieval performance than a list of 10 predictive keywords. The prerequisite for this, however, is that not too few training instances are used. The larger the number of training instances, the higher the F1-Score tends to be. Yet, as has been established above, especially for BERT this relationship is not monotonic and can exhibit considerable variability. What number of training documents is required to produce acceptable retrieval results that are better than what could be achieved with a keyword list, depends on the specifics of the retrieval task at hand and the employed learning mode and model. Nevertheless, across the applications inspected here, applying active learning with BERT until 1000 training instances have been labeled produces a good separation of relevant and irrelevant documents that considerably improves upon the separation achieved by applying a keyword list (see Table 3). The mean F1-Scores of BERT applied in an active learning mode with a training budget of 1000 labeled instances are 0.712 (Twitter), 0.622 (SBIC), and 0.908 (Reuters), whereas the maximum F1-Scores reached by the empirically constructed initial keyword lists are 0.417 (Twitter), 0.404 (SBIC), and 0.645 (Reuters). Hence, the improvements in the F1-Scores that are achieved by applying active learning with BERT rather than the best performing keyword list are 0.295 (Twitter), 0.218 (SBIC), and 0.263 (Reuters).

Note that the performance enhancements of active learning here are observed across applications. Irrespective of document length, textual style, the type of the entity of interest, and the homogeneity or heterogeneity of the corpus from which the documents are retrieved, active learning with 1000 training documents shows superior performance to keyword lists and the other approaches.

Thus, in terms of retrieval performance, supervised learning in an active learning mode, preferably with a pretrained deep neural network, is the procedure to be preferred over all other approaches evaluated here. However, this procedure is also the most expensive of the evaluated methods. Supervised learning requires human, financial, and time resources for annotating the training documents. A training data set comprising 1000 instances is very small by usual supervised learning standards, but even so the coding process has to be monitored and coordinated. Moreover, active learning involves a dynamic labeling process in which after each iteration those documents are annotated for which a label is requested by the model. While active learning reduces the overall number of training instances for which a label is required, the dynamic labeling process may increase coordination costs or the time coders spend on coding as they wait for the model to request the next labels. Yet the precise separation of documents that refer to the entity of interest, and thus are relevant for the planned study, from documents that are irrelevant is an essential analytic step. This step defines the set of documents on which all the following analyses are conducted. Selection biases induced by the applied retrieval method ultimately bias the study’s results. Therefore, attention and care should be taken when it comes to extracting relevant documents. Compared to the creation of a set of keywords, active learning requires substantial amounts of additional resources. But given the considerably higher retrieval performances observed for active learning compared to keyword lists, spending these resources is likely to be worthwhile for the quality of the study.

Conclusion

In text-based analyses, researchers typically are interested in studying documents referring to a particular entity. Yet, textual references to specific entities are often contained within multi-thematic corpora. In consequence, documents that contain references toward the entities of interest have to be separated from those that do not. A very common approach in social science to retrieve relevant documents is to apply a list of keywords. Keyword lists are inexpensive and easy to apply, but they may result in biased inferences if they systematically miss out relevant documents. Query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning constitute alternative, more expensive, more complex, and in social science rarely applied procedures for the retrieval of relevant documents. These more complex procedures theoretically have the potential to reach a higher retrieval performance than keyword lists and thus to reduce the potential size of selection biases. So far, a systematic comparison of these approaches was lacking, and it was therefore unclear whether employing any of these more expensive methods would improve retrieval performance at all and, if so, how large and how consistent across contexts the improvement would be.

This study closed this gap. The comparison of the approaches on the basis of retrieval tasks associated with a data set of German tweets [71], the Social Bias Inference Corpus (SBIC) [110], and the Reuters-21578 corpus [69] shows that none of the applied more complex approaches necessarily enhances the retrieval performance, as measured by the F1-Score, over the application of a keyword list containing 10 empirically predictive terms. Yet whereas, across all settings and combinations evaluated, query expansion techniques and topic model-based classification rules yield at best small increases in the F1-Score, active supervised learning substantially increases the F1-Score across application contexts if the number of labeled training documents used in the active learning process is not too small.

The limitations of this article stem from the inevitable problem that learning approaches can only be compared by applying specific, concrete models. The aim of this study was to compare keyword lists, query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning. Yet such a comparison can only be made on the basis of specific models that follow specific procedures, have specific hyperparameter settings, are trained on a specific finite set of training documents, are evaluated on a specific finite set of test documents, and are initialized with specific random seed values [101]. Here, care was taken to evaluate a broad range of specific models with different settings for each approach. For example, with regard to active and passive supervised learning, two different types of models (SVM and BERT) were applied 10 or 5 times with different random initializations. In each of the 10 or 5 runs, a different initial training set was used that then was enlarged by passive random sampling or active selection in 15 iterations. This broad evaluation setting makes it more likely that the conclusions drawn here on the set of models evaluated for each approach hold and that active learning indeed is superior to keyword lists for the studied tasks.

How do the results obtained here compare to other studies that have used these data sets? Miller et al. [81, Figure 3, p. 544] compare different types of active learning query strategies on the German Twitter corpus and, for the uncertainty sampling query strategy with 1000 labeled training instances, obtain performance levels that are similar to the ones reached here. The SBIC and the Reuters corpus have been used here in a binary classification setting with the aim of identifying only those documents that have been labeled as referring to specific entities (the disabled and crude oil). The SBIC, however, comes with rich annotations and is frequently used for the separation of offensive vs. not offensive comments. The Reuters data set is typically used to evaluate models on a multi-label classification task. In addressing these tasks, researchers explore strategies that, in the future, might also be useful for the low-resource imbalanced binary classification setting studied here: Wang and Chang [133], for example, explore how well prompt-based zero-shot inference works for detecting offensiveness in the SBIC. The F1-Score for the offensive class is marginally higher than for random prediction (0.51), but performance tends to increase with the number of parameters of the pretrained language model [133, p. 1, 5]. Huang et al. [54] show how a modified loss function that accounts for label imbalance and label correlation can increase performance on the multi-label Reuters classification task. They report a macro-averaged F1-Score of 0.645 [54, Table 2].

A further avenue for future applications is the usage of more advanced language representation models. Here, in supervised learning, the iconic BERT model was employed as a pretrained input for all applications. Yet especially with regard to the German Twitter application, a multilingual pretrained model (e.g. XLM-T [10], which is obtained by continuing to pretrain XLM-R [29] on a multilingual Twitter corpus) may constitute an improved foundation for training on the target task.

Appendix 1: Imbalanced classification, precision, recall

Imbalanced classification problems are common in information retrieval tasks [74, p. 155]. They are characterized by an imbalance in the proportions made up by one vs. the other category. When retrieving relevant documents from large corpora typically only a small fraction of documents falls into the positive relevant category, whereas an overwhelming majority of documents is part of the negative irrelevant category [74, p. 155].

When evaluating the performance of a method in a situation of imbalance, the accuracy measure, which gives the share of correctly classified documents, is not adequate [74, p. 155]. The reason is that a method that assigns all documents to the negative irrelevant category would obtain a very high accuracy value [74, p. 155]. Thus, evaluation metrics that allow for a more refined view, such as precision and recall, should be employed [74, p. 155]. Precision and recall are defined as follows:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{3} \]
\[ \text{Recall} = \frac{TP}{TP + FN}, \tag{4} \]

whereby TP, FP and FN are defined in Table 4. Precision and recall are in the range [0, 1]. However, if none of the documents is predicted to be positive, then TP+FP=0 and precision is undefined. If there are no truly positive documents in the corpus, then TP+FN=0 and recall is undefined. The higher precision and recall, the better.

Precision exclusively takes into account all documents that have been assigned to the positive relevant category by the classification method and informs about the share of truly positive documents among all documents that are predicted to fall into the positive category. Recall, on the other hand, exclusively focuses on the truly relevant documents and informs about the share of documents that has been identified as relevant among all truly relevant documents.

There is a trade-off between precision and recall [74, p. 156]. A keyword list comprising many terms or a classification algorithm that is lenient in considering documents to be relevant will likely identify many of the truly relevant documents (high recall). Yet, as the hurdle for being considered relevant is low, they will also classify many truly irrelevant documents into the relevant category (low precision). A keyword list consisting of few specific terms or a classification algorithm with a high threshold for assigning documents to the relevant class will likely miss out many relevant instances (low recall), but among those considered relevant many are likely to indeed be relevant (high precision).

This trade-off between precision and recall is incorporated in the Fω-measure, which is the weighted harmonic mean of precision and recall [74, p. 156]:

\[ F_{\omega} = \frac{(\omega^2 + 1) \cdot \text{Precision} \cdot \text{Recall}}{\omega^2 \cdot \text{Precision} + \text{Recall}} \tag{5} \]

The Fω-measure also is in the range [0, 1]. ω is a real-valued factor balancing the importance of precision vs. recall [74, p. 156]. For ω>1 recall is considered more important than precision and for ω<1 precision is weighted more than recall [74, p. 156]. A very common choice for ω is 1 [74, p. 156]. In this case, the F1-measure (or synonymously: F1-Score) is the harmonic mean between precision and recall [74, p. 156].

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6} \]
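The following is a minimal sketch of how these metrics can be computed from predicted and true labels; it additionally applies the convention used in the figures above, where undefined values are reported as 0. The function name is illustrative.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive relevant class.

    Undefined values (divisions by zero) are reported as 0, mirroring the
    convention used for the figures in this article.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1
```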

Appendix 2: Social science studies applying keyword lists

See Table 5.

Appendix 3: Topic model-based classification rule building procedure

See Fig. 10.

Appendix 4: Details on training local GloVe embeddings

In training, a symmetric context window of six tokens on either side of the target feature as well as a decreasing weighting function are used, such that a token that is q tokens away from the target feature contributes 1/q to the co-occurrence count [91]. After training, following the approach in Pennington et al. [91], the word embedding matrix and the context word embedding matrix are summed to yield the finally applied embedding matrix. Note that in their analysis of a large spectrum of settings for training word embeddings, Rodriguez and Spirling [107] found that the popular setting used here, 300-dimensional embeddings with a symmetric window size of six tokens, tends to yield good performances whilst at the same time being cost-effective regarding the embedding dimensions and the context window size.
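To make the weighting concrete, the following is a minimal sketch of counting weighted co-occurrences with a symmetric window of six tokens, where a context token q positions away contributes 1/q; it covers only the co-occurrence counting, not the subsequent GloVe fitting, and assumes the corpus is already tokenized. The function name is illustrative.

```python
from collections import defaultdict

def weighted_cooccurrence_counts(tokenized_docs, window=6):
    """Count co-occurrences where a context token q positions away adds 1/q."""
    counts = defaultdict(float)
    for tokens in tokenized_docs:
        for i, target in enumerate(tokens):
            for q in range(1, window + 1):
                if i + q < len(tokens):                      # context token to the right
                    counts[(target, tokens[i + q])] += 1.0 / q
                if i - q >= 0:                               # context token to the left
                    counts[(target, tokens[i - q])] += 1.0 / q
    return counts

# Example: weighted_cooccurrence_counts([["european", "integration", "policy"]])
```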

Appendix 5: Details on estimating CTMs

The CTM is estimated via the stm R-package [105], which originally is designed to estimate the Structural Topic Model (STM) [103]. The STM extends the LDA by allowing document-level variables to affect the topic proportions within a document (topical prevalence) or to affect the term probabilities of a topic (topical content) [103, p. 989]. If no document-level variables are specified (as is the case here), the STM reduces to the CTM [103, p. 991]. In estimation, the approximate variational expectation-maximization (EM) algorithm as described in Roberts et al. [103, p. 992–993] is employed. This estimation procedure tends to be faster and to produce higher held-out log-likelihood values than the original variational approximation algorithm for the CTM presented in Blei and Lafferty [18] [105, p. 29–30]. The model is initialized via spectral initialization [3; 104, p. 82–85; 105, p. 11]. The model is considered to have converged if the relative change in the approximate lower bound on the marginal likelihood from one step to the next is smaller than 1e−04 [103, p. 992; 105, p. 10, 28].

Appendix 6: Details on text preprocessing and training settings for SVM and BERT

An SVM operates on a document-feature matrix, X. In a document-feature matrix, each document is represented as a feature vector of length U: $x_i = (x_{i1}, \ldots, x_{iu}, \ldots, x_{iU})$. The information contained in the vector’s entries, $x_{iu}$, typically is based on the frequency with which each of the U textual features occurs in the ith document [125, p. 147]. Given the document feature vectors, $\{x_i\}_{i=1}^{N}$, and corresponding binary class labels, $\{y_i\}_{i=1}^{N}$, whereby $y_i \in \{-1, +1\}$, an SVM tries to find a hyperplane that separates the training documents as well as possible into the two classes [30].

To create the required vector representation for each document, here the following text preprocessing steps are applied: The documents are tokenized into unigram tokens. Punctuation, symbols, numbers, and URLs are removed. The tokens are lowercased and stemmed. Subsequently, terms whose mean tf-idf value across all documents in which they occur belongs to the lowest 0.1% (Twitter, SBIC) or 0.2% (Reuters) of mean tf-idf values of all terms in the corpus are discarded. Terms that occur in only one (Twitter) or two (SBIC, Reuters) documents are also removed. Finally, a Boolean weighting scheme, in which only the absence (0) vs. presence (1) of a term in a document is recorded, is applied to the document-feature matrix. To determine the hyperparameter values for the SVMs, hyperparameter tuning via a grid search across sets of hyperparameter values is conducted in a stratified 5-fold cross-validation setting on one fold of the training data. A linear kernel and a Radial Basis Function (RBF) kernel are tried. Moreover, for the inverse regularization parameter C, which governs the trade-off between the slack variables and the training error, the values {0.1, 1.0, 10.0, 100.0} (linear) and {0.1, 1.0, 10.0} (RBF) are inspected. Additionally, for the RBF’s parameter γ, which governs a training example’s radius of influence, the values {0.001, 0.01, 0.1} are evaluated.27 The folds are stratified such that the share of instances falling into the relevant minority class is the same across all folds. In each cross-validation iteration, in the folds used for training, random oversampling of the minority class is conducted such that the number of relevant minority class examples increases by a factor of 5. Among the inspected hyperparameter settings, the setting that achieves the highest F1-Score regarding the prediction of the relevant minority class and does not exhibit excessive overfitting is selected.
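A minimal sketch of such a tuning setup with scikit-learn and imbalanced-learn is given below. It assumes the texts and labels are available as train_texts and train_labels (illustrative names), omits the stemming and tf-idf-based pruning steps described above, and is not the exact replication code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

# Boolean unigram representation: record only absence (0) vs. presence (1) of a term.
vectorizer = CountVectorizer(lowercase=True, binary=True)
X = vectorizer.fit_transform(train_texts)      # train_texts: illustrative list of documents
y = np.asarray(train_labels)                   # 1 = relevant, 0 = irrelevant

# Oversample the relevant minority class by a factor of 5 within each training fold only.
pipeline = Pipeline([
    ("oversample", RandomOverSampler(
        sampling_strategy=lambda y_fold: {1: 5 * int(np.sum(y_fold == 1))},
        random_state=0)),
    ("svm", SVC()),
])

param_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.1, 1.0, 10.0, 100.0]},
    {"svm__kernel": ["rbf"], "svm__C": [0.1, 1.0, 10.0], "svm__gamma": [0.001, 0.01, 0.1]},
]

search = GridSearchCV(pipeline, param_grid, scoring="f1",
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X, y)
print(search.best_params_)
```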

There are two limiting factors when applying BERT: First, due to memory limitations, BERT cannot process text sequences that are longer than 512 tokens [32, p. 4183]. This poses no problem for the Twitter corpus, which has a maximum sequence length of 73 tokens. In the Reuters news corpus, however, whilst the largest share of articles is shorter than 512 tokens, there is a long tail of longer articles comprising up to around 1500 tokens.28 Following the procedure by Sun et al. [123], Reuters news stories that exceed 512 tokens are reduced to the maximum accepted token length by keeping the first 128 and the last 382 tokens whilst discarding the remaining tokens in the middle.29 The maximum sequence length recorded for the SBIC is 354 tokens. In order to reduce the required memory capacities, the few posts that are longer than 250 tokens are shortened to 250 tokens by keeping the first 100 and the last 150 tokens.
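A minimal sketch of this head-and-tail truncation on an already tokenized document is shown below; 128 + 382 = 510 tokens leave room for the two special tokens BERT requires (see footnote 29). The function name and defaults are illustrative.

```python
def head_tail_truncate(tokens, max_len=510, head=128, tail=382):
    """Keep the first `head` and the last `tail` tokens of over-long documents.

    510 content tokens plus the [CLS] and [SEP] special tokens fill BERT's
    maximum input length of 512 tokens.
    """
    if len(tokens) <= max_len:
        return tokens
    return tokens[:head] + tokens[-tail:]

# For the SBIC, the analogous call would be head_tail_truncate(tokens, 250, 100, 150).
```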

The second limiting factor is that the prediction performance achieved by BERT after fine-tuning on the target task can vary considerably—even if the same training data set is used for fine-tuning and only the random seeds that initialize the optimization process and set the order of the training data differ [32, p. 4176; 92, p. 5–7; 36]. Especially when the training data set is small (e.g. smaller than 10,000 or 5000 documents), fine-tuning with BERT has been observed to yield unstable prediction performances [32, p. 4176; 92, p. 5–7]. Recently, Mosbach et al. [83] established that the variance in the prediction performance of BERT models that have been fine-tuned on the same training data set with different seeds is to a large extent likely due to vanishing gradients in the fine-tuning optimization process. Mosbach et al. [83, p. 5] also note that it is not small training data sets per se that yield unstable performances but rather that, if small data sets are fine-tuned for the same number of epochs as larger data sets (typically 3 epochs), then smaller data sets are fine-tuned for a substantially smaller number of training iterations—which in turn negatively affects the learning rate schedule and the generalization ability [83, p. 4–5]. Finally, Mosbach et al. [83, p. 2, 8–9] show that fine-tuning with a small learning rate (in the paper: 2e−05), with warmup, bias correction, and a large number of epochs (in the paper: 20) not only tends to increase prediction performance but also significantly decreases the performance instability in fine-tuning. Here, the advice of Mosbach et al. [83] is followed. For BERT, the AdamW algorithm [72] with bias correction, a warmup period lasting 10% of the training steps, and a global learning rate of 2e−05 is used. Training is conducted for 20 epochs. Dropout is set to 0.1. The batch size is set to 16.
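A minimal sketch of a fine-tuning setup with these settings, using PyTorch and HuggingFace's Transformers, is given below. The model checkpoint name, train_dataloader, and num_training_documents are illustrative placeholders rather than the exact replication code; torch.optim.AdamW applies bias correction by default, and a dropout of 0.1 is BERT's default configuration.

```python
import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

# Illustrative checkpoint; the German Twitter task would require a German (or multilingual) model.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

epochs, batch_size, learning_rate = 20, 16, 2e-5
num_training_steps = epochs * (num_training_documents // batch_size)  # num_training_documents: illustrative
num_warmup_steps = int(0.1 * num_training_steps)                      # warmup over 10% of the steps

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)

model.train()
for epoch in range(epochs):
    for batch in train_dataloader:          # train_dataloader: illustrative DataLoader of tokenized batches
        outputs = model(**batch)            # batches contain input_ids, attention_mask, and labels
        outputs.loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```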

Appendix 7: Most predictive terms

Note on Tables 6, 7 and 8: The keyword lists comprising empirically highly predictive terms are not only applied on the corpora to evaluate the retrieval performance of keyword lists, but also form the basis for query expansion (see Sect. 4.2.2). The query expansion technique makes use of GloVe word embeddings [91] trained on the local corpora at hand and also makes use of externally obtained GloVe word embeddings trained on large global corpora. In the case of the locally trained word embeddings, there is a learned word embedding for each predictive term. Thus, the set of extracted highly predictive terms can be directly used as starting terms for query expansion. In the case of the globally pretrained word embeddings, however, not all of the highly predictive terms have a corresponding global word embedding. Hence, for the globally pretrained embeddings the 50 most predictive terms for which a globally pretrained word embedding is available are extracted: if a predictive term has no corresponding global embedding, the set of extracted predictive terms is enlarged with the next most predictive term until there are 50 extracted terms. Consequently, two lists of the most predictive features are shown below for each corpus.

Appendix 8: Recall and precision of keyword lists and query expansion

See Figs. 11 and 12.

Appendix 9: Mittens: F1-score, recall, and precision

See Fig. 13.

Appendix 10: Terms with the highest probabilities and the highest FREX-Scores

See Tables 9, 10, and 11.

Appendix 11: Recall and precision of topic model-based classification rules

See Fig. 14.

Appendix 12: Recall and precision of active and passive supervised learning

See Fig. 15.

Appendix 13: Comparing BERT and SVM for active and passive supervised learning

See Figs. 16 and 17.

Data availability

The code that supports the findings of this study is available at https://doi.org/10.6084/m9.figshare.19699840.

Declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Footnotes

1

A corpus is a set of documents. A document is the unit of observation. A document can range from a very short to a very long text (e.g. a sentence, a speech, a newspaper article). Here, the term corpus refers to the set of documents a researcher has collected and from which he or she then seeks to retrieve the relevant documents.

2

Note that the vocabulary used in this study often makes use of the term retrieval. As this study focuses on contexts in which the task is to retrieve relevant documents from corpora of otherwise irrelevant documents, the usage of the term retrieval seems adequate. Yet, the task examined in this study is different from the task that is typically examined in document retrieval. Document retrieval is a subfield of information retrieval in which the task usually is to rank documents according to their relevance for an explicitly stated user query [74, p. 14, 16]. In this study, in contrast, the aim was to classify—rather than rank—documents as being relevant vs. not relevant. Moreover, not all of the approaches evaluated here require the query that states the information need to be expressed explicitly in the form of keywords or phrases.

3

More specifically: Selection biases occur (1) if the mechanism that selects observational units for an analysis from a larger population of units for which inferences are to be made is correlated with the units’ values on the dependent variable, (2) if the assignment of units to the values of the explanatory variables is correlated with the units’ values on the dependent variable, or (3) if the selection rule or the assignment rule are correlated with the size of the causal effect that units will experience [63, p. 115–116, 124–125, 138–139]. Here, the focus is on the first mentioned type of selection bias. If a study seeks to make descriptive rather than causal inferences, a selection bias is produced if the rule for selecting observational units for analysis is correlated with the variable of interest (rather than the dependent variable) [63, p. 135]. For the sake of simplicity, the following illustrations focus on studies whose goal is descriptive rather than causal inference. Accordingly, when referring to the definition of selection bias, here the expressions ‘the outcome variable’ or ‘the variable of interest’ rather than ‘the dependent variable’ are used.

4

Formal definitions of the performance metrics of recall, precision, and the F1-Score are given in Appendix 1.

5

Such a comprehensive list of keywords implies low precision and thus would come with another problem: a large share of false positives. Nevertheless, a comprehensive list would imply perfect recall and thus would preclude selection bias due to false negatives.

6

There are several likely reasons for the problems human researchers encounter when trying to set up an extensive list of keywords. First, language is highly varied [39]. There are numerous ways to refer to the same entity—and entities also can be referred to indirectly without the usage of proper names or well-defined denominations [6, p. 167]. Especially if the entity of interest is abstract and/or not easily denominated, the universe of terms and expressions referring to the entity is likely to be large and not easily captured [6, p. 167]. Such entities are abundant in social science. Typical entities of interest, for example, are policies (e.g. the policies implemented to address the COVID-19 pandemic), concepts (e.g. European integration or homophobia), and occurrences (e.g. the 2015 European refugee crisis or the 2021 United States Capitol riot). A second likely reason for the human inability to come up with a comprehensive keyword list is inhibitory processes [13; 61, p. 974]. After a set of concepts has been retrieved from memory, inhibitory processes suppress the representation of related, non-retrieved concepts in memory and thereby reduce the probability that those concepts are recovered [13]. Query expansion is one method that has the potential to alleviate this second aspect.

7

If the elements of the term vectors are non-negative, e.g. because they indicate the (weighted) frequency with which a term occurs across the documents in the corpus, then the angle between the vectors will be in the range [0,90] and cosine similarity will be in the range [0, 1]. If, on the other hand, elements of term representation vectors can become negative, then the vectors can also point into opposing directions. In the extreme: If the vectors point into diametrically opposing directions, then cos(θ)=-1.
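As a brief illustration of this range behavior, the following is a minimal sketch of cosine similarity between two term vectors; the function name is illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||); lies in [0, 1] for non-negative
    vectors and in [-1, 1] when vector elements can be negative."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Non-negative count vectors: similarity in [0, 1].
# cosine_similarity([1, 2, 0], [2, 4, 1])
# Diametrically opposed vectors: similarity of -1.
# cosine_similarity([1.0, -2.0], [-1.0, 2.0])
```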

8

Note that beside the similarity-based automatic query expansion approaches discussed so far, there are further expansion methods. Most prominently there are query language modeling and operations based on relevance feedback from the user or pseudo-relevant feedback [67; 74, p. 177–188; 4, p. 1709-1713].

9

The most likely terms are the terms with the highest occurrence probabilities, $\beta_{ku}$, for a given topic k. The most exclusive terms refer to highly discriminating terms whose probability to occur is high for topic k but low for all or most other topics. Exclusivity can be measured as: $\text{exclusivity}_{ku} = \beta_{ku} / \sum_{j=1}^{K} \beta_{ju}$ (see for example [105], p. 12).

10

A coherent topic referring to the entity of interest would have high occurrence probabilities for frequently co-occurring terms that refer to the entity [106, p. 1069; 105, p. 10]. It would be clearly about the entity of interest rather than being a fuzzy topic without a nameable content. An exclusive topic would solely refer to the entity of interest and would not refer to any other entities.

11

Note that research suggests that it is rather the number of training examples in the positive relevant class than the number of all documents in the training set that affects the amount of information provided to the learning method [132].

12

Beside these random resampling techniques mentioned here, there are also methods that perform oversampling or undersampling in an informed way; e.g. based on distance criteria (see [24, p. 23–24]).

13

Note that the outlined relationship between cost ratios and over- or undersampling rates only holds if the threshold at which the classifier considers an instance to fall into the positive rather than the negative class is at p=0.5 [41, p. 975]. Note furthermore that although it would be good practice for resampling rates to reflect an underlying distribution of misclassification costs as specified in the cost matrix, resampling with rates reflecting misclassification costs will not yield the same results as incorporating misclassification costs into the learning process [24, p. 36]. One reason, for example, is that in random undersampling instances are removed entirely [24, p. 36]. For information on the relationship between oversampling/undersampling, cost-sensitive learning, and domain adaptation see Kouw and Loog [64, p. 4–5, 7].

14

Ideally, a single instance is selected and labeled in each iteration [70, p. 4]. Yet, re-training a model often is costly and time-consuming. An economic alternative is batch-mode active learning [117, p. 35]. Here a batch of instances is selected and labeled in each iteration [117, p. 35]. When selecting a batch of instances, there is the question of which instances to select. Selecting the K most informative instances is one strategy that, however, ignores the homogeneity of the selected instances [117, p. 35]. Alternative approaches that seek to increase the heterogeneity among the selected instances have been developed (see [117], p. 35).

15

Note that in multi-class classification tasks it is less straightforward to operationalize uncertainty. Here one can distinguish between least confident sampling, margin sampling, and entropy-based sampling (for precise definitions, see [117], p. 12–13). Moreover, the usage of such a definition of uncertainty and informativeness is only possible for learning methods that return predicted probabilities [117, p. 12]. For methods that do not, other uncertainty-based sampling strategies have been developed (see [117], p. 14–15). With regard to SVMs, Tong and Koller [124] have introduced three theoretically motivated query strategies. In their Simple Margin strategy, the data point that is closest to the hyperplane is selected to be labeled next [124, p. 53–54].

16

The employed R packages are data.table [38], dplyr [138], facetscales [88], ggplot2 [136], ggridges [140], lsa [139], plot3D [120], quanteda [17], RcppParallel [1], rstudioapi [126], stm [105], stringr [137], text2vec [116], and xtable [31]. The used Python packages and libraries are Beautiful Soup [102], gdown [59], imbalanced-learn [68], matplotlib [56], NumPy [87], mittens [35], pandas [76], seaborn [77], scikit-learn [90], PyTorch [89], watermark [98], and HuggingFace’s Transformers [141]. If a GPU was used, an NVIDIA Tesla P100-PCIE-16GB was employed.

17

For a detailed elaboration on the exact composition of the SBIC see Sap et al. [110, p. 5480].

18

Note that each post was annotated by three independent coders and that the data shared by Sap et al. [110] lists each annotation separately. Here the SBIC is preprocessed such that a post is considered to be offensive toward a group category if at least one annotator indicated that a group falling into this category was targeted.

19

The documents are preprocessed by tokenization into unigrams, lowercasing, removing terms that occur in fewer than five documents or fewer than five times throughout the corpus, and applying a Boolean weighting on the entries of the document-feature matrix such that a 1 signals the occurrence of a term in a document and a 0 indicates the absence of the term in a document.

20

Note that the keyword lists comprising empirically highly predictive terms are not only applied on the corpora to evaluate the retrieval performance of keyword lists, but also form the basis for query expansion (see Section 4.2.2). The query expansion technique makes use of GloVe word embeddings [91] trained on the local corpora at hand and also makes use of externally obtained GloVe word embeddings trained on large global corpora. In the case of the locally trained word embeddings there is a learned word embedding for each predictive term. Thus, the set of extracted highly predictive terms can be directly used as starting terms for query expansion. In the case of the globally pretrained word embeddings, however, not all of the highly predictive terms have a corresponding global word embedding. Hence, for the globally pretrained embeddings the 50 most predictive terms for which a globally pretrained word embedding is available are extracted. If a predictive term has no corresponding global embedding, the set of extracted predictive terms is enlarged with the next most predictive term until there are 50 extracted terms. Consequently, in Tables 6, 7, and 8 in Appendix 7 for each corpus two lists of the most predictive features are shown. Moreover, for the evaluation of the initial keyword lists of 10 predictive keywords, the local keyword lists have to be used because the global keyword lists have been adapted for the purposes of query expansion on the global word embedding space.

21

The embeddings can be downloaded from https://nlp.stanford.edu/projects/glove/. GloVe embeddings here are used because they tend to be frequently employed in social science [107, p. 104].

22

The embeddings can be downloaded from https://deepset.ai/german-word-embeddings.

23

For example, in a topic model with K=15 topics, there are 15 ways to select one relevant topic from 15 topics (namely: the first, the second, ..., and the 15th); there are $\binom{15}{2} = 105$ ways of choosing two relevant topics from the set of 15 topics, and there are $\binom{15}{3} = 455$ ways to pick three topics from 15 topics.
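These counts are simply binomial coefficients; a quick check using Python's standard library:

```python
from math import comb

# Number of ways to choose 1, 2, or 3 relevant topics out of K = 15 topics.
print(comb(15, 1), comb(15, 2), comb(15, 3))   # 15 105 455
```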

24

Transfer learning with deep neural networks has sparked substantive enhancements in prediction performances across the field of NLP and beyond and now constitutes the foundation of modern NLP [22]. Yet this mode of learning also comes with limitations. One issue is that the models tend to learn the representational biases that are reflected in the texts on which they are pretrained [22, p. 129–131]. For an introduction to transfer learning with Transformer-based language representation models see Wankmüller [134]. For a detailed discussion of the potentials and limitations of the usage of pretrained deep neural networks see Bommasani et al. [22].

25

A too high ξ, however, at times may lead to a classification rule in which none of the documents is assigned to the positive relevant class—thereby producing an undefined value for precision and the F1-Score (here visualized by 0).

26

Note that the x-axis denotes the number of unique labeled instances in set I. As in passive learning with random oversampling the documents from the relevant minority class in set I are randomly resampled with replacement to then form the training set on which the model is trained, in passive learning the size of the training set is larger than the size of set I. Yet, the size of I indicates the number of unique documents on which training is performed and—as only unique documents have to be annotated—it indicates the annotation costs.

27

On the precise definition of C and γ see scikit-learn Developers [114] and scikit-learn Developers [113].

28

There is a single outlier article that is as long as 3,797 tokens.

29

Note that to meet the input format required by BERT in single sequence text classification tasks, two additional special tokens, ‘[CLS]’ and ‘[SEP]’, have to be added [32, p. 4174].

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Allaire, J. J., Francois, R., Ushey, K., Vandenbrouck, G., Geelnard, M., & Intel (2020). RcppParallel: Parallel programming tools for ‘Rcpp’ (Version 5.0.2). [R package]. CRAN. https://CRAN.R-project.org/package=RcppParallel.
  • 2.ALMasri, M., Berrut, C., & Chevallet, J.-P. (2013). Wikipedia-based semantic query enrichment. In: Bennett, P. N., Gabrilovich, E., Kamps, J., & Karlgren, J. (Eds.), Proceedings of the sixth international workshop on exploiting semantic annotations in information retrieval (ESAIR ’13) (pp. 5–8). Association for Computing Machinery.
  • 3.Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., & Zhu, M. (2013). A practical algorithm for topic modeling with provable guarantees. In: Dasgupta, S., & McAllester, D. (Eds.), Proceedings of the 30th international conference on machine learning (pp. 280–288). Proceedings of Machine Learning Research.
  • 4.Azad HK, Deepak A. Query expansion techniques for information retrieval: A survey. Information Processing and Management. 2019;56(5):1698–1735. doi: 10.1016/j.ipm.2019.05.009. [DOI] [Google Scholar]
  • 5.Azar, E. E. (2009). Conflict and Peace Data Bank (COPDAB), 1948-1978. [Data set]. Inter-University Consortium for Political and Social Research. 10.3886/ICPSR07767.v4.
  • 6.Baden C, Kligler-Vilenchik N, Yarchi M. Hybrid content analysis: Toward a strategy for the theory-driven, computer-assisted classification of large text corpora. Communication Methods and Measures. 2020;14(3):165–183. doi: 10.1080/19312458.2020.1803247. [DOI] [Google Scholar]
  • 7.Baerg N, Lowe W. A textual Taylor rule: Estimating central bank preferences combining topic and scaling methods. Political Science Research and Methods. 2020;8(1):106–122. doi: 10.1017/psrm.2018.31. [DOI] [Google Scholar]
  • 8.Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In: Bengio, Y., & LeCun, Y. (Eds.), 3rd International conference on learning representations (ICLR 2015) (pp. 1–15).
  • 9.Barberá, P. (2016). Less is more? How demographic sample weights can improve public opinion estimates based on twitter data. Manuscript. Retrieved June 4, 2021 from http://pablobarbera.com/static/less-is-more.pdf.
  • 10.Barbieri, F., Anke, L. E., & Camacho-Collados, J. (2021). XLM-T: Multilingual language models in twitter for sentiment analysis and beyond. arXiv:2104.12250.
  • 11.Bauer PC, Barberá P, Ackermann K, Venetz A. Is the left-right scale a valid measure of ideology? Political Behavior. 2017;39(3):553–583. doi: 10.1007/s11109-016-9368-2. [DOI] [Google Scholar]
  • 12.Baum M, Cohen DK, Zhukov YM. Does rape culture predict rape? Evidence from U.S. newspapers, 2000–2013. Quarterly Journal of Political Science. 2018;13(3):263–289. doi: 10.1561/100.00016124. [DOI] [Google Scholar]
  • 13.Bäuml K-H. Making memories unavailable: The inhibitory power of retrieval. Journal of Psychology. 2007;215(1):4–11. doi: 10.1027/0044-3409.215.1.4. [DOI] [Google Scholar]
  • 14.Beauchamp N. Predicting and interpolating state-level polls using Twitter textual data. American Journal of Political Science. 2017;61(2):490–503. doi: 10.1111/ajps.12274. [DOI] [Google Scholar]
  • 15.Bengio Y, Ducharme R, Vincent P, Janvin C. A neural probabilistic language model. Journal of Machine Learning Research. 2003;3:1137–1155. [Google Scholar]
  • 16.Benoit, K. (2020). Text as data: An overview. In: Curini, L., & Franzese, R. (Eds.), The SAGE handbook of research methods in political science and international relations (pp. 461–497). London. SAGE Publications. 10.4135/9781526486387.n29.
  • 17.Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, Matsuo A. quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software. 2018;3(30):774. doi: 10.21105/joss.00774. [DOI] [Google Scholar]
  • 18.Blei DM, Lafferty JD. A correlated topic model of science. The Annals of Applied Statistics. 2007;1(1):17–35. doi: 10.1214/07-AOAS114. [DOI] [Google Scholar]
  • 19.Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3:993–1022. [Google Scholar]
  • 20.Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In: Proceedings of the eleventh annual conference on computational learning theory (COLT’98) (pp. 92–100) New York, NY, USA. Association for Computing Machinery. 10.1145/279943.279962.
  • 21.Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics. 2017;5:135–146. doi: 10.1162/tacl\_a_00051. [DOI] [Google Scholar]
  • 22.Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N. S., Chen, A. S., Creel, K., Davis, J. Q., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L., Goel, K., Goodman, N. D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D. E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., & Kuditipudi, R., et al. (2021). On the opportunities and risks of foundation models. arXiv:2108.07258.
  • 23.Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In: Haussler, D. (Ed.), Proceedings of the fifth annual workshop on computational learning theory (COLT ’92) (pp. 144–152). Association for Computing Machinery.
  • 24.Branco P, Torgo L, Ribeiro RP. A survey of predictive modeling on imbalanced domains. ACM Computing Surveys. 2016;49(2):1–50. doi: 10.1145/2907070. [DOI] [Google Scholar]
  • 25.Brownlee, J. (2020). Cost-sensitive learning for imbalanced classification. Machine learning mastery. Retrieved June 9, 2021 from https://machinelearningmastery.com/cost-sensitive-learning-for-imbalanced-classification/.
  • 26.Brownlee, J. (2021). Random oversampling and undersampling for imbalanced classification. Machine learning mastery. Retrieved June 8, 2021, from https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/.
  • 27.Burnap P, Gibson R, Sloan L, Southern R, Williams M. 140 characters to victory?: Using Twitter to predict the UK 2015 General Election. Electoral Studies. 2016;41:230–233. doi: 10.1016/j.electstud.2015.11.017. [DOI] [Google Scholar]
  • 28.Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16(1):321–357. doi: 10.1613/jair.953. [DOI] [Google Scholar]
  • 29.Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In: Jurafsky, D., Chai, J., Schluter, N., & Tetreault, J. (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8440–8451). Stroudsburg, PA, USA. Association for Computational Linguistics. 10.18653/v1/2020.acl-main.747.
  • 30.Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–297. doi: 10.1007/BF00994018. [DOI] [Google Scholar]
  • 31.Dahl, D. B., Scott, D., Roosen, C., Magnusson, A., & Swinton, J. (2019). xtable: Export tables to LaTeX or HTML (version 1.8-4). [R package]. CRAN. https://cran.r-project.org/web/packages/xtable/index.html.
  • 32.Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional Transformers for language understanding. In: Burstein, J., Doran, C., & Solorio, T. (Eds.), Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies (pp. 4171–4186). Association for Computational Linguistics. 10.18653/v1/N19-1423.
  • 33.Diaz, F., Mitra, B., & Craswell, N. (2016). Query expansion with locally-trained word embeddings. In: Erk, K. & Smith, N. A. (Eds.), Proceedings of the 54th annual meeting of the association for computational linguistics (pp. 367–377). Association for Computational Linguistics. 10.18653/v1/P16-1035.
  • 34.Diermeier D, Godbout J-F, Yu B, Kaufmann S. Language and ideology in Congress. British Journal of Political Science. 2011;42(1):31–55. doi: 10.1017/S0007123411000160. [DOI] [Google Scholar]
  • 35.Dingwall, N., & Potts, C. (2018). Mittens: an extension of GloVe for learning domain-specialized representations. In: Proceedings of the 2018 conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2 (Short Papers) (pp. 212–217). Association for Computational Linguistics. 10.18653/v1/N18-2034.
  • 36.Dodge, J., Ilharco, G., Schwartz, R., Farhadi, A., Hajishirzi, H., & Smith, N. (2020). Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv:2002.06305v1 [cs.CL].
  • 37.D’Orazio V, Landis ST, Palmer G, Schrodt P. Separating the wheat from the chaff: Applications of automated document classification using Support Vector Machines. Political Analysis. 2014;22(2):224–242. doi: 10.1093/pan/mpt030. [DOI] [Google Scholar]
  • 38.Dowle, M., & Srinivasan, A. (2020). data.table: Extension of ‘data.frame’ (Version 1.13.0). [R package]. CRAN. https://CRAN.R-project.org/package=data.table.
  • 39.Durrell, M. (2008). Linguistic variable - linguistic variant. In: Ammon, U., Dittmar, N., Mattheier, K. J., & Trudgill, P. (Eds.), Sociolinguistics (pp. 195–200). De Gruyter Mouton. 10.1515/9783110141894.1.2.195.
  • 40.Ein-Dor, L., Halfon, A., Gera, A., Shnarch, E., Dankin, L., Choshen, L., Danilevsky, M., Aharonov, R., Katz, Y., & Slonim, N. (2020). Active learning for BERT: An empirical study. In: Webber, B., Cohn, T., He, Y., & Liu, Y. (Eds.), Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (pp. 7949–7962). Association for Computational Linguistics. 10.18653/v1/2020.emnlp-main.638.
  • 41.Elkan, C. (2001). The foundations of cost-sensitive learning. In: Proceedings of the 17th international joint conference on artificial intelligence (IJCAI ’01), (pp. 973–978). Morgan Kaufmann Publishers Inc.
  • 42.Ennser-Jedenastik L, Meyer TM. The impact of party cues on manual coding of political texts. Political Science Research and Methods. 2018;6(3):625–633. doi: 10.1017/psrm.2017.29. [DOI] [Google Scholar]
  • 43.Erlich, A., Dantas, S. G., Bagozzi, B. E., Berliner, D., & Palmer-Rubin, B. (2021). Multi-label prediction for political text-as-data. Political analysis (pp. 1–18). 10.1017/pan.2021.15.
  • 44.Ertekin, S., Huang, J., Bottou, L., & Giles, L. (2007). Learning on the border: Active learning in imbalanced data classification. In: Proceedings of the sixteenth ACM conference on information and knowledge management (CIKM ’07) (pp. 127–136). Association for Computing Machinery. 10.1145/1321440.1321461.
  • 45.Eshima, S., Imai, K., & Sasaki, T. (2021). Keyword assisted topic models. arXiv:2004.05964v2 [cs.CL].
  • 46.Firth, J. R. (1957). Studies in linguistic analysis. Blackwell, Publications of the Philological Society.
  • 47.Fogel-Dror Y, Shenhav SR, Sheafer T, Atteveldt WV. Role-based association of verbs, actions, and sentiments with entities in political discourse. Communication Methods and Measures. 2019;13(2):69–82. doi: 10.1080/19312458.2018.1536973. [DOI] [Google Scholar]
  • 48.Gessler, T. & Hunger, S. (2021). How the refugee crisis and radical right parties shape party competition on immigration. Political Science Research and Methods, 1–21. 10.1017/psrm.2021.64.
  • 49.Google Colaboratory. (2020). Google colaboratory frequently asked questions. Google Colaboratory. Retrieved October 28, 2020, from https://research.google.com/colaboratory/faq.html.
  • 50.Grimmer J. Appropriators not position takers: The distorting effects of electoral incentives on Congressional representation. American Journal of Political Science. 2013;57(3):624–642. doi: 10.1111/ajps.12000. [DOI] [Google Scholar]
  • 51.Grün B, Hornik K. topicmodels: An R package for fitting topic models. Journal of Statistical Software. 2011;40(13):1–30. doi: 10.18637/jss.v040.i13. [DOI] [Google Scholar]
  • 52.Hayes AF, Krippendorff K. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures. 2007;1(1):77–89. doi: 10.1080/19312450709336664. [DOI] [Google Scholar]
  • 53.Howard, J., & Ruder, S. (2018). Universal language model fine-tuning for text classification. In: Gurevych, I., & Miyao, Y. (Eds.), Proceedings of the 56th annual meeting of the association for computational linguistics, pp. 328–339. Association for Computational Linguistics. 10.18653/v1/P18-1031.
  • 54.Huang, Y., Giledereli, B., Köksal, A., Özgür, A., & Ozkirimli, E. (2021). Balancing methods for multi-label text classification with long-tailed class distribution. In Proceedings of the 2021 Conference on empirical methods in natural language processing (pp. 8153–8161). Association for Computational Linguistics. 10.18653/v1/2021.emnlp-main.643.
  • 55.HuggingFace (2021). Dataset card for reuters21578. Retrieved May 19, 2021, from https://huggingface.co/datasets/reuters21578.
  • 56.Hunter JD. Matplotlib: A 2D graphics environment. Computing in Science & Engineering. 2007;9(3):90–95. doi: 10.1109/MCSE.2007.55. [DOI] [Google Scholar]
  • 57.Jungherr A, Schoen H, Jürgens P. The mediation of politics through Twitter: An analysis of messages posted during the campaign for the German Federal Election 2013. Journal of Computer-Mediated Communication. 2016;21(1):50–68. doi: 10.1111/jcc4.12143. [DOI] [Google Scholar]
  • 58.Katagiri A, Min E. The credibility of public and private signals: A document-based approach. American Political Science Review. 2019;113(1):156–172. doi: 10.1017/S0003055418000643. [DOI] [Google Scholar]
  • 59.Kentaro, W. (2020). gdown: Download a large file from Google Drive. [Python package]. GitHub. https://github.com/wkentaro/gdown.
  • 60.Khan, J., & Lee, Y.-K. (2019). Lessa: A unified framework based on lexicons and semi-supervised learning approaches for textual sentiment classification. Applied Sciences, 9(24). 10.3390/app9245562.
  • 61.King G, Lam P, Roberts ME. Computer-assisted keyword and document set discovery from unstructured text. American Journal of Political Science. 2017;61(4):971–988. doi: 10.1111/ajps.12291.
  • 62.King G, Pan J, Roberts ME. How censorship in China allows government criticism but silences collective expression. American Political Science Review. 2013;107(2):326–343. doi: 10.1017/S0003055413000014.
  • 63.King G, Keohane RO, Verba S. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton: Princeton University Press; 1994.
  • 64.Kouw, W. M., & Loog, M. (2019). A review of domain adaptation without target labels. arXiv:1901.05335.
  • 65.Krippendorff, K. (2013). Content analysis: An introduction to its methodology. Sage Publications, 3rd edition.
  • 66.Kuzi, S., Shtok, A., & Kurland, O. (2016). Query expansion using word embeddings. In: Proceedings of the 25th ACM international on conference on information and knowledge management (CIKM ’16) (pp. 1929–1932). Association for Computing Machinery. 10.1145/2983323.2983876.
  • 67.Lavrenko, V., & Croft, W. B. (2001). Relevance based language models. In: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR ’01) (pp. 120–127). Association for Computing Machinery. 10.1145/383952.383972.
  • 68.Lemaître G, Nogueira F, Aridas CK. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research. 2017;18(17):1–5.
  • 69.Lewis, D. D. (1997). Reuters-21578 (Distribution 1.0). [Data set]. http://www.daviddlewis.com/resources/testcollections/reuters21578/.
  • 70.Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In Croft, B. W. & van Rijsbergen, C. J. (Eds.), Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR ’94) (pp. 3–12). Springer.
  • 71.Linder F. Reducing bias in online text datasets: Query expansion and active learning for better data from keyword searches. SSRN. 2017. doi: 10.2139/ssrn.3026393.
  • 72.Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. In: 7th International conference on learning representations (ICLR 2019). OpenReview.net.
  • 73.Maier D, Waldherr A, Miltner P, Wiedemann G, Niekler A, Keinert A, Pfetsch B, Heyer G, Reber U, Häussler T, Schmid-Petri H, Adam S. Applying LDA topic modeling in communication research: Toward a valid and reliable methodology. Communication Methods and Measures. 2018;12(2–3):93–118. doi: 10.1080/19312458.2018.1430754.
  • 74.Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
  • 75.Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. MIT Press, Cambridge. https://mitpress.mit.edu/books/foundations-statistical-natural-language-processing.
  • 76.McKinney, W. (2010). Data structures for statistical computing in Python. In: van der Walt, S., & Millman, J. (Eds.), Proceedings of the 9th Python in science conference (SciPy 2010) (pp. 56–61). 10.25080/Majora-92bf1922-00a.
  • 77.Michael Waskom and Team. (2020). Seaborn. [Python package]. Zenodo. https://zenodo.org/record/4379347.
  • 78.Mikhaylov S, Laver M, Benoit KR. Coder reliability and misclassification in the human coding of party manifestos. Political Analysis. 2012;20(1):78–91. doi: 10.1093/pan/mpr047.
  • 79.Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781v3 [cs.CL].
  • 80.Mikolov, T., Yih, W.-T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In: Vanderwende, L., Daumé III, H., & Kirchhoff, K. (Eds.), Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 746–751). Association for Computational Linguistics.
  • 81.Miller B, Linder F, Mebane WR. Active learning approaches for labeling text: Review and assessment of the performance of active learning approaches. Political Analysis. 2020;28(4):532–551. doi: 10.1017/pan.2020.4.
  • 82.Moore, W. H., & Siegel, D. A. (2013). A Mathematics Course for Political and Social Research. Princeton University Press.
  • 83.Mosbach, M., Andriushchenko, M., & Klakow, D. (2021). On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. In: International conference on learning representations (ICLR 2021). OpenReview.net.
  • 84.Muchlinski D, Yang X, Birch S, Macdonald C, Ounis I. We need to go deeper: Measuring electoral violence using Convolutional Neural Networks and social media. Political Science Research and Methods. 2021;9(1):122–139. doi: 10.1017/psrm.2020.32.
  • 85.Münchener Digitalisierungszentrum der Bayerischen Staatsbibliothek (dbmdz). (2021). Model card for bert-base-german-uncased from dbmdz. Retrieved May 19, 2021, from https://huggingface.co/dbmdz/bert-base-german-uncased.
  • 86.Neelakantan, A., Shankar, J., Passos, A., & McCallum, A. (2014). Efficient non-parametric estimation of multiple embeddings per word in vector space. In: Moschitti, A., Pang, B., & Daelemans, W. (Eds.), Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1059–1069). Stroudsburg, PA, USA. Association for Computational Linguistics. 10.3115/v1/D14-1113.
  • 87.Oliphant, T. E. (2006). A guide to NumPy. Trelgol Publishing USA.
  • 88.Oller Moreno, S. (2021). facetscales: Facet grid with different scales per facet (Version 0.1.0.9000). [R package]. GitHub. https://github.com/zeehio/facetscales.
  • 89.Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In: H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems 32 (pp. 8024–8035). Curran Associates Inc.
  • 90.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
  • 91.Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Moschitti, A., Pang, B., & Daelemans, W. (Eds.), Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543). Association for Computational Linguistics. 10.3115/v1/D14-1162.
  • 92.Phang, J., Févry, T., & Bowman, S. R. (2019). Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv:1811.01088v2 [cs.CL].
  • 93.Pilehvar, M. T., & Camacho-Collados, J. (2020). Embeddings in natural language processing: theory and advances in vector representations of meaning. Morgan & Claypool Publishers. 10.2200/S01057ED1V01Y202009HLT047
  • 94.Pilny A, McAninch K, Slone A, Moore K. Using supervised machine learning in automated content analysis: An example using relational uncertainty. Communication Methods and Measures. 2019;13(4):287–304. doi: 10.1080/19312458.2019.1650166.
  • 95.Puglisi R, Snyder JM. Newspaper coverage of political scandals. The Journal of Politics. 2011;73(3):931–950. doi: 10.1017/s0022381611000569.
  • 96.Quinn KM, Monroe BL, Colaresi M, Crespin MH, Radev DR. How to analyze political attention with minimal assumptions and costs. American Journal of Political Science. 2010;54(1):209–228. doi: 10.1111/j.1540-5907.2009.00427.x.
  • 97.R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
  • 98.Raschka, S. (2020). watermark. [Python package]. GitHub. https://github.com/rasbt/watermark.
  • 99.Rauh C, Bes BJ, Schoonvelde M. Undermining, defusing or defending European integration? Assessing public communication of European executives in times of EU politicisation. European Journal of Political Research. 2020;59:397–423. doi: 10.1111/1475-6765.12350.
  • 100.Reda, A. A., Sinanoglu, S., & Abdalla, M. (2021). Mobilizing the masses: Measuring resource mobilization on Twitter. Sociological Methods & Research, 1–40.
  • 101.Reimers, N., & Gurevych, I. (2018). Why comparing single performance scores does not allow to draw conclusions about machine learning approaches. arXiv:1803.09578.
  • 102.Richardson, L. (2020). Beautiful Soup 4. [Python library]. Crummy. https://www.crummy.com/software/BeautifulSoup/.
  • 103.Roberts ME, Stewart BM, Airoldi EM. A model of text for experimentation in the social sciences. Journal of the American Statistical Association. 2016;111(515):988–1003. doi: 10.1080/01621459.2016.1141684.
  • 104.Roberts, M. E., Stewart, B. M., & Tingley, D. (2016). Navigating the local modes of big data: The case of topic models. In: Alvarez, R. M. (Ed.), Computational social science: discovery and prediction (pp. 51–97). Cambridge University Press.
  • 105.Roberts ME, Stewart BM, Tingley D. stm: An R package for structural topic models. Journal of Statistical Software. 2019;91(2):1–40. doi: 10.18637/jss.v091.i02.
  • 106.Roberts ME, Stewart BM, Tingley D, Lucas C, Leder-Luis J, Gadarian SK, Albertson B, Rand DG. Structural Topic Models for open-ended survey responses. American Journal of Political Science. 2014;58(4):1064–1082. doi: 10.1111/ajps.12103.
  • 107.Rodriguez PL, Spirling A. Word embeddings: What works, what doesn’t, and how to tell the difference for applied research. The Journal of Politics. 2022;84(1):101–115. doi: 10.1086/715162.
  • 108.Ruder, S. (2019). Neural transfer learning for natural language processing. PhD thesis, National University of Ireland, Galway.
  • 109.Ruder, S. (2020). NLP-Progress. Retrieved June 21, 2021, from https://nlpprogress.com/english/text_classification.html.
  • 110.Sap, M., Gabriel, S., Qin, L., Jurafsky, D., Smith, N. A., & Choi, Y. (2020). Social bias frames: Reasoning about social and power implications of language. In: Jurafsky, D., Chai, J., Schluter, N., & Tetreault, J. (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 5477–5490). Association for Computational Linguistics. 10.18653/v1/2020.acl-main.486.
  • 111.Schulze, P., Wiegrebe, S., Thurner, P. W., Heumann, C., Aßenmacher, M., & Wankmüller, S. (2021). Exploring topic-metadata relationships with the STM: A Bayesian approach. arXiv:2104.02496v1 [cs.CL].
  • 112.Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1), 97–123. https://aclanthology.org/J98-1004.
  • 113.scikit-learn Developers. (2020). 1.4. Support Vector Machines. Retrieved November 23, 2020, from https://scikit-learn.org/stable/modules/svm.html.
  • 114.scikit-learn Developers (2020). RBF SVM Parameters. Retrieved November 23, 2020, from https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html.
  • 115.Sebők, M., & Kacsuk, Z. (2020). The multiclass classification of newspaper articles with machine learning: The hybrid binary snowball approach. Political Analysis, 1–14. 10.1017/pan.2020.27.
  • 116.Selivanov, D., Bickel, M., & Wang, Q. (2020). text2vec: Modern text mining framework for R. [R package]. CRAN. https://CRAN.R-project.org/package=text2vec.
  • 117.Settles, B. (2010). Active learning literature survey. Computer Sciences Technical Report 1648. University of Wisconsin–Madison. http://burrsettles.com/pub/settles.activelearning.pdf.
  • 118.Silva, A., & Mendoza, M. (2020). Improving query expansion strategies with word embeddings. In: Proceedings of the ACM symposium on document engineering 2020 (DocEng ’20) (pp. 1–4). Association for Computing Machinery. 10.1145/3395027.3419601.
  • 119.Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., & Bethard, S. (Eds.), Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 1631–1642). Association for Computational Linguistics.
  • 120.Soetaert, K. (2019). plot3D: Plotting multi-dimensional data (Version 1.3). [R package]. CRAN. https://CRAN.R-project.org/package=plot3D.
  • 121.Song H, Tolochko P, Eberl J-M, Eisele O, Greussing E, Heidenreich T, Lind F, Galyga S, Boomgaarden HG. In validations we trust? The impact of imperfect human annotations as a gold standard on the quality of validation of automated content analysis. Political Communication. 2020;37(4):550–572. doi: 10.1080/10584609.2020.1723752.
  • 122.Stier, S., Bleier, A., Bonart, M., Mörsheim, F., Bohlouli, M., Nizhegorodov, M., Posch, L., Maier, J., Rothmund, T., & Staab, S. (2018). Systematically Monitoring Social Media: the Case of the German Federal Election 2017. GESIS - Leibniz-Institut für Sozialwissenschaften. 10.21241/ssoar.56149.
  • 123.Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to fine-tune BERT for text classification? arXiv:1905.05583v3 [cs.CL].
  • 124.Tong S, Koller D. Support Vector Machine active learning with applications to text classification. Journal of Machine Learning Research. 2002;2:45–66. doi: 10.1162/153244302760185243.
  • 125.Turney PD, Pantel P. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research. 2010;37:141–188. doi: 10.1613/jair.2934.
  • 126.Ushey, K., Allaire, J., Wickham, H., & Ritchie, G. (2020). rstudioapi: Safely Access the RStudio API (Version 0.11). [R package]. CRAN. https://CRAN.R-project.org/package=rstudioapi.
  • 127.Uyheng J, Carley KM. Bots and online hate during the COVID-19 pandemic: Case studies in the United States and the Philippines. Journal of Computational Social Science. 2020;3:445–468. doi: 10.1007/s42001-020-00087-4.
  • 128.van Atteveldt W, Sheafer T, Shenhav SR, Fogel-Dror Y. Clause analysis: Using syntactic information to automatically extract source, subject, and predicate from texts with an application to the 2008–2009 Gaza War. Political Analysis. 2017;25(2):207–222. doi: 10.1017/pan.2016.12.
  • 129.van Rijsbergen, C. J. (2000). Information retrieval—Session 1: Introduction to information retrieval. [Lecture notes]. Universität Duisburg Essen. https://www.is.inf.uni-due.de/courses/dortmund/lectures/ir_ws00-01/folien/keith_intro.ps.
  • 130.Van Rossum, G., & Drake, F. L. (2009). Python 3 Reference Manual. CreateSpace.
  • 131.Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 30 (pp. 5998–6008). Curran Associates Inc.
  • 132.Wang, H. (2020). Logistic regression for massive data with rare events. In Daumé III, H., & Singh, A. (Eds.), Proceedings of the 37th international conference on machine learning (pp. 9829–9836). Proceedings of Machine Learning Research.
  • 133.Wang, Y.-S. & Chang, Y. (2022). Toxicity detection with generative prompt-based inference. arXiv:2205.12390.
  • 134.Wankmüller, S. (2021). Neural transfer learning with Transformers for social science text analysis. arXiv:2102.02111v1 [cs.CL].
  • 135.Watanabe K. Latent Semantic Scaling: A semisupervised text analysis technique for new domains and languages. Communication Methods and Measures. 2021;15(2):81–102. doi: 10.1080/19312458.2020.1832976.
  • 136.Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer.
  • 137.Wickham, H. (2019). stringr: Simple, consistent wrappers for common string operations (Version 1.4.0). [R package]. CRAN. https://CRAN.R-project.org/package=stringr.
  • 138.Wickham, H., François, R., Henry, L., & Müller, K. (2021). dplyr: A grammar of data manipulation (version 1.0.6). [R package]. CRAN. https://CRAN.R-project.org/package=dplyr.
  • 139.Wild, F. (2020). lsa: Latent semantic analysis (Version 0.73.2). [R package]. CRAN. https://CRAN.R-project.org/package=lsa.
  • 140.Wilke, C. O. (2021). ggridges: Ridgeline plots in ’ggplot2’ (version 0.5.3). [R package]. CRAN. https://CRAN.R-project.org/package=ggridges.
  • 141.Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., & Rush, A. M. (2020). HuggingFace’s transformers: state-of-the-art natural language processing. arXiv:1910.03771v5 [cs.CL].
  • 142.Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In: 33rd Annual meeting of the association for computational linguistics (pp. 189–196). Cambridge, Massachusetts, USA. Association for Computational Linguistics. 10.3115/981658.981684.
  • 143.Zhang H, Pan J. CASM: A deep-learning approach for identifying collective action events with text and image data from social media. Sociological Methodology. 2019;49(1):1–57. doi: 10.1177/0081175019860244.
  • 144.Zhao, H., Phung, D., Huynh, V., Jin, Y., Du, L., & Buntine, W. (2021). Topic modelling meets deep neural networks: A survey. In: Zhou, Z.-H. (Ed.), Proceedings of the thirtieth international joint conference on artificial intelligence, IJCAI-21. (pp. 4713–4720). International Joint Conferences on Artificial Intelligence Organization. 10.24963/ijcai.2021/638.
  • 145.Zhou Z-H, Li M. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering. 2005;17(11):1529–1541. doi: 10.1109/TKDE.2005.186.
  • 146.Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the 2015 IEEE international conference on computer vision (ICCV ’15) (pp. 19–27). IEEE Computer Society. 10.1109/ICCV.2015.11.


Data Availability Statement

The code that supports the findings of this study is available at https://doi.org/10.6084/m9.figshare.19699840.

