Abstract
Citations are widely used in scientific literature. The traditional model of referencing considers all citations to be the same; however, semantically, citations play different roles. By studying the context in which citations appear, it is possible to determine the role that they play. Here, we report on the development of an eight-category classification scheme, annotation using that scheme, and development and evaluation of supervised machine-learning classifiers using the annotated data. We annotated 1,710 sentences using the annotation schema and our trained classifier obtained an average F1-score of 76.5%. The classifier is available for free as a Java API from http://citation.askhermes.org.
Introduction
Citations are commonly used in scientific literature; for example, in an analysis of over 150,000 open-access articles deposited in PubMed Central, we found an average of 41 citations per article. These citations serve several purposes, such as acknowledging existing work in the domain, providing additional background resources to the reader, justifying research questions and hypotheses, and placing the study in the context of other works in the domain1. In addition, citations play an important role in representing the semantic content of full-text biomedical articles, and as a result, citation information has been used to assist in biomedical text mining tasks (e.g., 2–7). For example, two articles can be considered "related" if they share a significant set of co-citations, and a recent study incorporating this assumption showed that it improved information retrieval6. The number of times a reference is cited within a paper may also indicate its relevance to the citing paper2,7. Citances (that is, citation sentences) represent the condensed semantic content of the documents they cite3,5 and can be used to extract scientific facts4 and to support summarization tasks5. At the same time, citations can be viewed as links between articles, and these links have been used to create the Science Citation Index, which is used to measure the impact factor of scientific journals and articles8. Impact factors quantified in this way indicate the importance of a journal or an article to its field9.
The above-mentioned applications of citations do not distinguish between citation types. Semantically, however, citations play different roles. For example, if the citing article mentions that one of the cited articles obtained results similar to its own, then that cited article is more relevant to the citing article than one that was cited perfunctorily. In this study, we attempt to determine the type of relation between the cited article and the citing article. We first describe the citation classification schema that we used to annotate articles, and we then describe the supervised machine-learning algorithms we trained to automatically determine the type of relationship between cited and citing articles. We believe that our citation classifier can help refine citation indexes, which will in turn benefit text mining applications that make use of citation indexes.
Related Work
The literature that categorizes the relations between cited and citing articles is rich3,10–22. Garfield (1965) pioneered this line of work, enumerating 15 reasons why an author introduces a citation3, including paying homage to pioneers, paying homage to peers, providing background reading, and criticizing previous work. Moravcsik and Murugesan (1975) classified the role of a citation along multiple dimensions10, including conceptual (is the citation made in connection with a concept or theory?) versus operational (is it made in connection with a technique used in the citing article?), and organic (is the citation truly needed for the understanding of the citing article?) versus perfunctory (is the citation an acknowledgment that some other work in the same general area has been performed?).
Chubin and Moitra (1975) classified cited papers into several categories23, including basic essential citation (the cited paper is declared central to the reported research; the reported findings depend on the cited paper), subsidiary essential citation (the cited paper is not directly connected to the subject of the letter or article but is still essential to the reported research), partial negational citation (the cited paper is erroneous in part, and the author of the citing paper offers a correction), and total negational citation (the cited paper is completely wrong, and the author of the citing paper offers an independent interpretation of the solution). Spiegel-Rosing (1977) defined multiple general categories11, including that the "cited source is positively evaluated." Hanney et al. (2005) divided citations into four categories: limited, peripheral, considerable, and essential12. See the review by Bornmann and Daniel13 for details.
On the other hand, the work on automatic citation classification is limited. Garzone and Mercer treated citation classification as a sentence-categorization task24–26. They extracted sentences that incorporated citations and then applied manually curated lexical and grammatical rules to assign each citation to one of 35 predefined categories. These categories included "citing work totally disputes some aspect of cited work," "citing work totally confirms cited work," and "citing work refers to assumed knowledge which is general background." More recently, Radoulov27 built upon the work of Garzone and Mercer24–26 to develop supervised machine-learning approaches for automatic citation classification. The schema they built comprises two top-level dimensions: reason for citation and what is cited. Reason for citation further incorporates sub-categories such as confirms, supports, interprets results, extends model, contrasts, future research, and uses. What is cited includes the sub-categories general background, specific background, historical account, pioneering work, related work, concept, method, product, and data. They report a performance of approximately 70% F-score for automatic citation classification, although some features (e.g., cue words) were manually identified. However, a fine-grained citation schema such as this can result in inconsistent annotation if the categories overlap, as they seemed to among those defined in Radoulov, 200827. For example, Confirms seems to overlap with Supports, and General Background proved difficult to distinguish from Specific Background.
Development of a simpler, non-overlapping citation classification annotation schema
Building on the non-overlapping categories in existing annotation schemas, we developed the annotation schema described in Yu et al., 200928. Its categories were arranged in a hierarchy. In annotating articles with this schema, we found that the inter-annotator agreement for some categories was poor, and while seeking ways to improve the schema we found that the annotators often reported confusion about certain categories because of its hierarchical structure. To alleviate this problem, we simplified the schema by removing the hierarchy, merging certain categories, and removing others. For example, under the node 'evaluation', the three subcategories 'first to discover', 'positive evaluation', and 'quantitative evaluation' were merged into a single 'evaluation' category. We removed categories such as 'reported speech and discovery' because there was no semantic difference between this category and 'no author attribution'; for example, the sentence "None of the cDNAs was present in a list of known false positives [34]" was annotated as 'no author attribution', whereas the sentence "This missense mutation (Q422R) has been reported in two patients, one affected by anterior segment dysgenesis with uveal ectropion and one with typical aniridia and foveal hypoplasia [27,33]." was annotated as 'reported speech and discovery' solely due to the presence of the cue phrase 'has been reported'. The schema used for annotation in this paper consists of eight independent top-level categories: Background/Perfunctory, Contemporary, Contrast/Conflict, Evaluation, Explanation of results, Material/Methods, Modality, and Similarity/Consistency. Their definitions and examples are shown below (all examples are from29):
Background/Perfunctory: Checks whether the cited article is merely a part of the relevant literature and is not analyzed or compared to other literature. Example: We and others have shown that there is a significant familial risk of breast cancer in Iceland that extends to at least fifth-degree relatives [8, 9].
Contemporary: Checks whether the given citation is explicitly characterized as “recent” by the author. Example: Recently, the Cys557Ser variant was reported at increased frequency in hereditary breast cancer probands from Finland [26].
Contrast/Conflict: Checks whether results or opinions in a given citation show contrast to or conflict with an opinion or result presented by the author of the citing paper. Example: The observation of Cys557Ser risk extending to BRCA2 carriers contrasts markedly with reports of the interactions between the CHEK2 1100delC variant and BRCA mutations [36, 37, 43].
Evaluation: Checks whether the results in the cited study are evaluated in the citing study. Example: The complex is important for the roles of BRCA1 in homologous-recombination-directed DNA repair and transcription-coupled repair [14, 15].
Explanation of results: Checks whether the cited work helps explain the results or hypotheses in the current study. Example: The excess of geographical ancestry of Cys557Ser carriers in S-Múlasýsla indicates that most copies of the variant now present in Iceland originated from a relatively small number of ancestors who resided in a single geographical region prior to the expansion of the Icelandic population from approximately 40,000 at the end of the 18th century to its current size of 300,000 [39].
Material/Methods: Checks whether the cited work provides a methodology that was followed in the citing work (with or without modifications). Example: Phase and haplotype frequencies were determined using deCODE Genetics Allegro and NEMO software [33, 34].
Modality: Checks whether the citing author expresses lack of certainty over a result or opinion in the cited study. Example: There are also indications of an association between medullary cancer and familiality [44, 48].
Similarity/Consistency: Checks whether results or opinions in the given citation are similar or consistent with the given study or another cited work. Example: Karppinen et al. [26] reported that the frequency of the Cys557Ser variant is significantly elevated only in groups of patients with familial breast cancer. These data are similar to the initial reports for CHEK2, where the 1100delC variant was found at significantly increased frequency only in familial breast cancer patients [37, 43].
We asked annotators to assign a citation to categories of the taxonomy based on the following context: the previous sentence, the sentence in which the citation appears, and the following sentence. Every category except Background/Perfunctory was assessed based on the presence of cue words. Each annotator assigned either a 'true' or 'false' value to each category. We did not use a predefined list of cue words; the annotators were asked to identify cues themselves. The category Background/Perfunctory was indicated by the absence of cue words. If a citation appeared as part of a group of citations, then all citations in the group were assigned the same value for each category. For example, in the example sentence for Similarity/Consistency, the citations 37 and 43 appear as a group and were therefore assigned the same values.
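To make the annotation unit concrete, the sketch below shows one way it could be represented in Java. This is our own illustration; the class, enum, and field names are assumptions and are not taken from the released API.

```java
// Illustrative sketch only: class, enum, and field names are ours, not the released API.
import java.util.EnumMap;
import java.util.Map;

public class CitationAnnotation {

    /** The eight independent top-level categories of the annotation schema. */
    public enum Category {
        BACKGROUND_PERFUNCTORY, CONTEMPORARY, CONTRAST_CONFLICT, EVALUATION,
        EXPLANATION_OF_RESULTS, MATERIAL_METHODS, MODALITY, SIMILARITY_CONSISTENCY
    }

    // Context shown to the annotator: previous sentence, citance, and following sentence.
    String previousSentence;
    String citationSentence;
    String followingSentence;

    // Each category receives an independent true/false judgment; all citations
    // in a citation group (e.g., [37, 43]) share the same values.
    Map<Category, Boolean> values = new EnumMap<>(Category.class);
}
```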
Annotation
We constructed an annotation corpus of 43 open-access full-text biomedical articles. Of these, 24 articles were selected from the GENIA corpus, and the remaining 19 articles were selected at random. The first author of this article annotated all 43 articles, while the second author, a graduate in the life sciences, annotated 10 of them. The 43 articles contained a total of 2,977 annotations in 1,710 sentences, and the 10 articles annotated by the second author contained 514 annotations. Both annotators were asked to assign a confidence value of 'High', 'Medium', or 'Low' to each category upon annotation.
To measure inter-annotator agreement, we calculated the kappa value and the F1-score. To calculate the inter-annotator F1-score, we treated the first author's annotations as the gold standard.
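As a concrete reference for these two measures, the following is a minimal sketch that computes Cohen's kappa and the F1-score for one binary category from aligned annotator judgments. It is our own illustrative code, not part of the released classifier, and it assumes at least one positive label on each side.

```java
// Illustrative sketch of the two agreement measures for one binary category.
// For the F1-score, the "gold" array holds the first author's annotations.
public final class Agreement {

    /** Cohen's kappa for two annotators on a binary category. */
    public static double kappa(boolean[] a, boolean[] b) {
        int n = a.length, bothTrue = 0, bothFalse = 0, aTrue = 0, bTrue = 0;
        for (int i = 0; i < n; i++) {
            if (a[i]) aTrue++;
            if (b[i]) bTrue++;
            if (a[i] && b[i]) bothTrue++;
            if (!a[i] && !b[i]) bothFalse++;
        }
        double observed = (bothTrue + bothFalse) / (double) n;          // raw agreement
        double expected = (aTrue * (double) bTrue                       // chance agreement
                + (n - aTrue) * (double) (n - bTrue)) / ((double) n * n);
        return (observed - expected) / (1.0 - expected);
    }

    /** F1-score of one annotator's judgments against the gold standard. */
    public static double f1(boolean[] gold, boolean[] predicted) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < gold.length; i++) {
            if (predicted[i] && gold[i]) tp++;
            else if (predicted[i]) fp++;
            else if (gold[i]) fn++;
        }
        double precision = tp / (double) (tp + fp);
        double recall = tp / (double) (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }
}
```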
Training and Evaluating Supervised Machine-Learning Algorithms
We trained and tested supervised machine-learning algorithms on the data annotated by the first author. A separate model was built for each category. We trained Support Vector Machine (SVM) and Multinomial Naïve Bayes (MNB) models, using the SVM and MNB implementations provided by the open-source Java library Weka30. We used unigrams (individual words) and bigrams (two consecutive words) as features. We trained and tested the algorithms with the top 25, 50, 100, 150, 200, 250, and 500 features, ranked by their mutual information with the category label.
To evaluate the performance of the algorithms, we split the annotated data into 10 equal folds of 171 sentences each. One fold was held out as test data, and the remaining nine folds were used to train the model. The trained model was then used to predict the value of the category on the test data. We calculated the accuracy, recall, precision, and F1-score for each category. A sketch of this training and evaluation pipeline is shown below.
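The following minimal sketch shows how one per-category model could be assembled and evaluated with Weka's Java API. It is our illustrative reconstruction, not the released code: the ARFF file name and attribute layout are assumptions, Weka's SMO implementation stands in for the SVM, and InfoGainAttributeEval (information gain, i.e., the mutual information between a feature and the class) serves as the feature ranker.

```java
import java.util.Random;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CitationModelSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file: one string attribute with the citance text,
        // one nominal true/false class attribute (last attribute).
        Instances data = new DataSource("modality.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Unigram and bigram bag-of-words features.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(2);
        StringToWordVector bow = new StringToWordVector();
        bow.setTokenizer(tokenizer);
        bow.setLowerCaseTokens(true);
        bow.setInputFormat(data);
        Instances vectors = Filter.useFilter(data, bow);

        // Keep only the top-k features ranked by information gain
        // (mutual information between a feature and the class).
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(100); // 25, 50, 100, 150, 200, 250, or 500
        AttributeSelection select = new AttributeSelection();
        select.setEvaluator(new InfoGainAttributeEval());
        select.setSearch(ranker);
        select.setInputFormat(vectors);
        Instances reduced = Filter.useFilter(vectors, select);

        // Train either an SVM (Weka's SMO) or a Multinomial Naive Bayes model
        // and evaluate it with 10-fold cross-validation.
        Classifier classifier = new SMO();              // or new NaiveBayesMultinomial()
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(classifier, reduced, 10, new Random(1));

        // Class value at index 0 is assumed to be the positive ('true') label.
        System.out.printf("Precision %.3f  Recall %.3f  F1 %.3f  Accuracy %.1f%%%n",
                eval.precision(0), eval.recall(0), eval.fMeasure(0), eval.pctCorrect());
    }
}
```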
Results
The annotation agreement for each category is shown in Table 1.
Table 1:
Inter-annotator agreement over the 10 articles that were annotated by the two annotators
Category | Kappa Value | F1-Score (%) |
---|---|---|
Background/Perfunctory | 0.49 | 73.20 |
Contemporary | 0.87 | 86.96 |
Contrast/Conflict | 0.67 | 72.22 |
Evaluation | 0.53 | 59.13 |
Explanation | 0.49 | 53.93 |
Method | 0.89 | 90.65 |
Modality | 0.60 | 65.30 |
Similarity/Consistency | 0.48 | 52.38 |
Average | 0.63 | 69.22 |
The performance of the SVM and MNB models at predicting the value of each category is shown in Table 2. For each category, we report only the performance of the better-performing classifier.
Table 2:
Performance of the SVM and MNB algorithms for each category. The models were evaluated using 10-fold cross-validation. All values are in %. SVM: Support Vector Machine; MNB: Multinomial Naïve Bayes
Category | Number of features | Best model | Recall | Precision | F1-Score | Accuracy |
---|---|---|---|---|---|---|
Background/Perfunctory | 150 | SVM | 66.4±5.4 | 80.5±7.1 | 72.7±5.7 | 78.8±4.4
Contemporary | 25 | SVM | 86.4±16.1 | 95.9±6.6 | 89.7±8.3 | 99.0±0.9 |
Contrast/Conflict | 50 | SVM | 79.1±8.5 | 88.5±6.6 | 83.4±7.1 | 92.9±3.1 |
Evaluation | 200 | SVM | 71.3±9.8 | 46.7±8.1 | 55.8±6.6 | 89.5±1.5 |
Explanation | 100 | SVM | 82.6±11.9 | 58.6±11.8 | 68.0±9.9 | 95.3±1.4 |
Method | 250 | MNB | 89.5±6.4 | 73.3±10.4 | 80.0±5.7 | 94.0±1.4 |
Modality | 100 | SVM | 86.0±8.1 | 78.7±5.1 | 81.9±4.8 | 93.8±1.8 |
Similarity/Consistency | 50 | SVM | 79.2±9.9 | 61.8±6.7 | 68.8±4.6 | 94.6±1.0 |
Average | - | - | 80.1 | 73.0 | 75.0 | 92.2 |
Discussion
We developed a citation classification schema based on eight categories, defined so as to minimize overlap between categories. Our results indicate moderate to substantial agreement on all categories. The disagreements between the two annotators appeared to result from different interpretations of the categories. For example, the following sentence was annotated as Contrast by the first author but not by the second author because the two annotators treated the cue phrase "as compared" differently: "This finding is similar to that seen with VLCAD where mouse VLCAD is most active toward C16 acyl-substrates as compared to human VLCAD with the most enzymatic activity toward C14 acyl-substrates [16]." We also evaluated the models using only annotations with a 'High' confidence value and expected the performance to improve as a result. However, we observed only a very small improvement, which was not statistically significant (average F1-score: 77.1%, accuracy: 93.0%).
We noticed that the performance of the models on the Evaluation, Explanation, and Similarity/Consistency categories was relatively low. We believe that increasing the number of annotated articles could alleviate this problem. Moreover, inter-annotator agreement for these categories was also comparatively low, which indicates that the annotation guidelines for these categories require further development. Specifically, we plan to include a set of representative cue words with examples in the annotation guidelines for these categories.
On analyzing the cases that were misclassified by our system, we noted that in most cases an infrequent cue word was not recognized by the model. For example, in the sentence "It is estimated that approximately 59% of patients presenting clinically between 15 to 26 mo of age die during their first clinical episode [1].", the word "estimated" was not recognized by the model as a modality cue word because it occurred infrequently in the training data. We believe that a larger training dataset could resolve this problem. In some cases, the gold standard itself was inconsistent. For example, the sentence "Recently, CCR3 has been shown to be upregulated on neutrophils and monocytoid U937 cells by interferons in vitro and to be expressed by endothelial cells, epithelial cells and mast cells [11–16]." was annotated as false for Contemporary, whereas the model classified it as true.
Conclusions
Our results show reasonable inter-annotator agreement (kappa 0.48–0.89) on our eight-category citation classification schema. Our classification results show that the best supervised machine-learning classifiers achieved an average F1-score of 75.0% (accuracy of 92.2%). We believe our citation classifier can help other text-mining applications. The classifier is open-source and available as a Java API at http://citation.askhermes.org/.
Acknowledgments
We thank Dr. Lamont Antieau for proofreading this manuscript. We acknowledge support from NLM grant number 1R01LM009836.
References
- 1. Lehnert W, Cardie C, Riloff E. Analyzing research papers using citation sentences. Proceedings of the 12th Annual Conference on Cognitive Science. 1990:511–518.
- 2. Herlach G. Can retrieval of information from citation indexes be simplified? Multiple mention of a reference as a characteristic of the link between cited and citing article. Journal of the American Society for Information Science. 1978;29(6):308–310.
- 3. Garfield E. Can citation indexing be automated? In: Stevens M, Guiliano V, Heilprin L, editors. Statistical association methods for mechanized documentation: Symposium proceedings. 1965. pp. 189–192.
- 4. Nakov P, Schwartz A, Hearst M. Citation sentences for semantic analysis of bioscience text. Proceedings of the SIGIR'04 Workshop on Search and Discovery in Bioinformatics; Sheffield, UK. 2004.
- 5. Schwartz AS, Hearst M. Summarizing key concepts using citation sentences. Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis; New York City, New York: Association for Computational Linguistics; 2006. pp. 134–135.
- 6. Tbahriti I, Chichester C, Lisacek F, Ruch P. Using argumentation to retrieve articles with similar citations: An inquiry into improving related articles search in the MEDLINE digital library. International Journal of Medical Informatics. 2006;75(6):488–495. doi: 10.1016/j.ijmedinf.2005.06.007.
- 7. Voos H, Dagaev KS. Are All Citations Equal? Or, Did We Op. Cit. Your Idem? 1976.
- 8. Garfield E. Citation indexing: its theory and application in science, technology, and humanities. Philadelphia: ISI Press; 1979.
- 9. Garfield E. Citation Analysis as a Tool in Journal Evaluation. 1972.
- 10. Moravcsik MJ, Murugesan P. Some Results on the Function and Quality of Citations. Social Studies of Science. 1975;5(1):86–92.
- 11. Spiegel-Rosing I. Science studies: Bibliometric and content analysis. Social Studies of Science. 1977;7(1):97–113.
- 12. Hanney S, Grant J, Jones T, Buxton M. Categorising citations to trace research impact. Proceedings of the 10th International Conference of the International Society for Scientometrics and Informetrics; Stockholm, Sweden. 2005.
- 13. Bornmann L, Daniel H. What do citation counts measure? A review of studies on citing behavior. Journal of Documentation. 2008;64:45–80.
- 14. Frost CO. The Use of Citations in Literary Research: A Preliminary Classification of Citation Functions. The Library Quarterly. 1979;49(4):399–414.
- 15. Lipetz BA. Problems of citation analysis: Critical review. American Documentation. 1965;16:381–390.
- 16. Cole S. The growth of scientific knowledge: Theories of deviance as a case study. In: The idea of social structure: Papers in honor of Robert K Merton. 1975:175–220.
- 17. Peritz B. A classification of citation roles for the social sciences and related fields. Scientometrics. 1983;5(5):303–312.
- 18. Small HG. Cited Documents as Concept Symbols. Social Studies of Science. 1978;8(3):327–340.
- 19. Weinstock M. Citation indexes. Encyclopedia of Library and Information Science. 1971;5:16–40.
- 20. Finney B. The reference characteristics of scientific texts. Unpublished dissertation, City University of London, Centre for Information Science; 1979.
- 21. Duncan EB, Anderson F, McAleese R. Qualified Citation Indexing: Its Relevance to Educational Technology. Information Retrieval in Educational Technology: Proceedings of the First Symposium on Information Retrieval in Educational Technology; University of Aberdeen. 1981. pp. 70–79.
- 22. Bernstam EV, Herskovic JR, Aphinyanaphongs Y, et al. Using Citation Data to Improve Retrieval from MEDLINE. J Am Med Inform Assoc. 2006;13(1):96–105. doi: 10.1197/jamia.M1909.
- 23. Chubin DE, Moitra SD. Content Analysis of References: Adjunct or Alternative to Citation Counting? Social Studies of Science. 1975;5(4):423–441.
- 24. Garzone M, Mercer R. Towards an Automated Citation Classifier. Advances in Artificial Intelligence. 2000:337–346.
- 25. Di Marco C, Mercer RE. Toward a catalogue of citation-related rhetorical cues in scientific texts. Proceedings of the Pacific Association for Computational Linguistics (PACLING) Conference; Halifax, Canada. 2003.
- 26. Di Marco C, Mercer RE, Rubin VL. A Design Methodology for a Document Indexing Tool Using Pragmatic Evidence in Text. Proceedings of the 3rd Annual Scientific Conference of the Intelligent, Interactive, Learning Object Repositories Network (LORNET Research Network); Montreal, Quebec, Canada. 2006.
- 27. Radoulov R. Exploring automatic citation classification. Computer Science, University of Waterloo; 2008.
- 28. Yu H, Agarwal S, Frid N. Investigating and Annotating the Role of Citation in Biomedical Full-Text Articles. IEEE International Conference on Bioinformatics & Biomedicine; Washington, DC: IEEE; 2009.
- 29. Stacey SN, Sulem P, Johannsson OT, et al. The BARD1 Cys557Ser Variant and Breast Cancer Risk in Iceland. PLoS Med. 2006;3(7):e217. doi: 10.1371/journal.pmed.0030217.
- 30. Hall M, Frank E, Holmes G, et al. The WEKA Data Mining Software: An Update. SIGKDD Explorations. 2009;11(1).