Abstract
We illustrate the use of machine learning techniques to analyze, structure, maintain, and evolve a large online corpus of academic literature. An emerging field of research can be identified as part of an existing corpus, permitting the implementation of a more coherent community structure for its practitioners.
The arXiv (see http://arXiv.org; for general background, see ref. 1) is an automated repository of >250,000 full-text research articles (as of mid-October 2003) in physics and related disciplines, going back more than a decade and growing at a rate of 40,000 new submissions per year. It serves >10 million requests per month (2), including tens of thousands of search queries per day and >20 million full-text downloads during 2002. It is a significant example of a Web-based service that has changed the practice of research in a major scientific discipline. It provides nearly comprehensive coverage of large areas of physics and serves as an on-line seminar system for those areas. It also provides a significant resource for model building and algorithmic experiments in mapping scholarly domains. Together with the SLAC SPIRES-HEP database, it provides a public resource of full-text articles and associated citation trees of many millions of links, with a focused disciplinary coverage and rich usage data. [The Stanford Linear Accelerator Center SPIRES-HEP database has comprehensively catalogued the High Energy Particle Physics (HEP) literature online since 1974, and indexes >500,000 high-energy physics-related articles including their full citation tree (see ref. 3).]
In what follows, we use arXiv data to illustrate how machine learning methods can be used to analyze, structure, maintain, and evolve a large online corpus of academic literature. The specific application will be to train a support vector machine text classifier to extract an emerging research area from a larger-scale resource. The automated detection of such subunits can play an important role in disentangling other subnetworks and associated subcommunities from the global network. Our notion of “mapping” here is in the mathematical sense of associating attributes to objects, rather than in the sense of visualization tools. Although the former underlies the latter, we expect the two will increasingly go hand-in-hand (for a recent review, see ref. 4).
Text Classification
The goal of text classification is the automatic assignment of documents to a fixed number of semantic categories. In the "multilabel" setting, each document can belong to zero, one, or several categories. Efficient automated techniques are essential to avoid tedious and expensive manual category assignment for large document sets. A "knowledge-engineering" approach, involving hand-crafting accurate text-classification rules, is surprisingly difficult and time-consuming (5). We therefore take a machine learning approach to generating text-classification rules automatically from examples.
The machine learning approach can be phrased as a supervised learning problem. The learning task is represented by the training sample

$$S_n = \left((\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_n, y_n)\right) \qquad [1]$$

of size n documents, where $\vec{x}_i$ represents the document content. In the multilabel setting, each category label is treated as a separate binary classification problem. For each such binary task, $y_i \in \{-1,+1\}$ indicates whether a document belongs to a particular class. The task of the learning algorithm, $\mathcal{L}$, is to find a decision rule, $h_{\mathcal{L}}: \vec{x} \mapsto \{-1,+1\}$, based on $S_n$, that classifies new documents, $\vec{x}$, as accurately as possible.
Documents need to be transformed into a representation suitable for the learning algorithm and the classification task. Information retrieval research suggests that words work well as representation units and that for many tasks their ordering can be ignored without losing too much information. This type of representation is commonly called the "bag-of-words" model, an attribute-value representation of text. Each text document is represented by a vector in the lexicon space, i.e., by a "term frequency" (TF) feature vector $\mathrm{TF}(w_i, \vec{x})$, with component values equal to the number of times each distinct word, $w_i$, in the corpus occurs in the document $\vec{x}$. Fig. 1 shows an example feature vector for a particular document.
Fig. 1.
Representing text as a feature vector.
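As a minimal illustration of this representation (the toy documents and names below are ours, not drawn from the arXiv data), the following Python sketch builds TF feature vectors over a small corpus lexicon:

```python
from collections import Counter

# Toy documents standing in for arXiv titles and abstracts.
corpus = [
    "protein folding dynamics in the cell",
    "quantum spin dynamics on a lattice",
]

# Lexicon: every distinct word in the corpus, in a fixed order.
lexicon = sorted({w for doc in corpus for w in doc.split()})

def tf_vector(doc):
    """TF(w_i, x): how often each lexicon word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[w] for w in lexicon]

for doc in corpus:
    print(tf_vector(doc))
```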
This basic representation is ordinarily refined in a few ways.
TF × Inverse Document Frequency (IDF) Weighting. Scaling the components of the feature vector with their IDF(w_i) (6) often leads to improved performance. In general, IDF(w_i) is some decreasing function of the word frequency DF(w_i), equal to the number of documents in the corpus that contain the word w_i. For example,

$$\mathrm{IDF}(w_i) = \log\left(\frac{n}{\mathrm{DF}(w_i)}\right) \qquad [2]$$

where n is the total number of documents. Intuitively, the IDF assumes that rarer terms have more significance for classification purposes and hence gives them greater weight. To compensate for the effect of different document lengths, each document feature vector $\vec{x}$ is normalized to unit length: $\|\vec{x}\| = 1$.
Stemming. Instead of treating each occurrence form of a word as a different feature, stemming is used to project the different forms of a word onto a single feature, the word stem, by removing inflection information (7). For example “computes,” “computing,” and “computer” are all mapped to the same stem “comput.” The terms “word” and “word stem” are used synonymously in the following text.
Stopword Removal. For many classification tasks, common words like “the,” “and,” or “he” do not help discriminate between document classes. Stopword removal describes the process of eliminating such words from the document by matching against a predefined list of stopwords. We use a standard stoplist of ≈300 words.
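These refinements combine into a short preprocessing pipeline. The sketch below implements the TF×IDF weighting of Eq. 2, unit-length normalization, and stopword removal in plain Python; the corpus, the tiny stoplist, and the function names are illustrative stand-ins of ours, and stemming is noted in a comment rather than implemented:

```python
import math
from collections import Counter

# Tiny stand-in for the ~300-word stoplist mentioned above.
STOPWORDS = {"the", "and", "he", "a", "in", "of", "on"}

def tfidf_vectors(corpus):
    """TF x IDF weighting per Eq. 2, followed by unit-length
    normalization. Stemming is omitted for brevity; in the full
    pipeline a Porter stemmer (7) would map each token to its stem."""
    docs = [[w for w in text.lower().split() if w not in STOPWORDS]
            for text in corpus]
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # DF(w_i)
    lexicon = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[w] * math.log(n / df[w]) for w in lexicon]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # ||x|| = 1
        vectors.append([v / norm for v in vec])
    return lexicon, vectors

lexicon, vectors = tfidf_vectors([
    "protein folding in the cell",
    "the quantum spin chain",
])
```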
Support Vector Machines (SVMs)
SVMs were developed by Vapnik and coworkers (8) based on the structural risk minimization principle from statistical learning theory. They have proven to be a highly effective method for learning text-classification rules, achieving state-of-the-art performance on a broad range of tasks (9, 10). Two main advantages of using SVMs for text classification lie in their ability to handle the high-dimensional feature spaces arising from the bag-of-words representation. From a statistical perspective, they are robust to overfitting and are well suited for the statistical properties of text. From a computational perspective, they can be trained efficiently despite the large number of features. A detailed overview of the SVM approach to text classification, with more details on the notation used below, is given in ref. 11.
In their basic form, SVMs learn linear decision rules,

$$h(\vec{x}) = \operatorname{sign}(\vec{w} \cdot \vec{x} + b), \qquad [3]$$

described by a weight vector $\vec{w}$ and a threshold b, from an input sample of n training examples, $(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)$, with $y_i \in \{-1,+1\}$. For a linearly separable $S_n$, the SVM finds the hyperplane with maximum Euclidean distance to the closest training examples. This distance is called the margin δ, as depicted in Fig. 2. Geometrically, the hyperplane is defined by its normal vector, $\vec{w}$, and its distance from the origin, $-b/\|\vec{w}\|$. For nonseparable training sets, the amount of training error is measured by using slack variables, $\xi_i$.
Fig. 2.
A binary classification problem (+ vs. -) in two dimensions. The hyperplane h* separates positive and negative training examples with maximum margin δ. The examples closest to the hyperplane are called “support vectors” (marked with circles).
Computing the position of the hyperplane is equivalent to solving the following convex quadratic optimization problem (8):
Optimization Problem 1 [SVM (Primal)].

$$\min_{\vec{w},\, b,\, \vec{\xi}} \quad \frac{1}{2}\,\vec{w} \cdot \vec{w} + C \sum_{i=1}^{n} \xi_i \qquad [4]$$

$$\text{subject to:} \quad \forall i:\; y_i\left(\vec{w} \cdot \vec{x}_i + b\right) \geq 1 - \xi_i \qquad [5]$$

$$\forall i:\; \xi_i \geq 0 \qquad [6]$$

The margin of the resulting hyperplane is $\delta = 1/\|\vec{w}\|$.
The constraints (Eq. 5) require that all training examples are classified correctly up to some slack $\xi_i$. If a training example lies on the "wrong" side of the hyperplane, then $y_i(\vec{w} \cdot \vec{x}_i + b) \leq 0$, so the corresponding $\xi_i \geq 1$; thus $\sum_{i=1}^{n} \xi_i$ is an upper bound on the number of training errors. The factor C in Eq. 4 is a parameter that allows trading off training error vs. model complexity. The optimal value of this parameter depends on the particular classification task and must be chosen by means of cross-validation or by some other model selection strategy. For text classification, however, the default value of C = 1 (the natural default for unit-length document vectors) has proven to be effective across a large range of tasks (11).
OP1 has an equivalent dual formulation:
Optimization Problem 2 [SVM (Dual)].

$$\min_{\vec{\alpha}} \quad -\sum_{i=1}^{n} \alpha_i + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j \left(\vec{x}_i \cdot \vec{x}_j\right) \qquad [7]$$

$$\text{subject to:} \quad \sum_{i=1}^{n} y_i \alpha_i = 0 \qquad [8]$$

$$\forall i:\; 0 \leq \alpha_i \leq C \qquad [9]$$
From the solution of the dual, the classification rule can be constructed as

$$\vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i \quad \text{and} \quad b = y_{usv} - \vec{w} \cdot \vec{x}_{usv}, \qquad [10]$$

where $(\vec{x}_{usv}, y_{usv})$ is some training example with $0 < \alpha_{usv} < C$ (i.e., an unbounded support vector). For the experiments in this article, SVMlight (11) is used for solving the dual-optimization problem. (SVMlight is available at http://svmlight.joachims.org/.) More detailed introductions to SVMs can be found in refs. 12 and 13.
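The experiments below use SVMlight itself; purely as an illustration of Eqs. 3-10, the following sketch trains a linear SVM with scikit-learn's SVC (which solves the same dual problem) and reconstructs the primal weight vector of Eq. 10 from the dual solution. The data are random stand-ins for TF×IDF vectors, not arXiv data.

```python
import numpy as np
from sklearn.svm import SVC

# Random stand-ins for unit-length TF-IDF document vectors and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.where(X[:, 0] + 0.1 * rng.normal(size=200) > 0, 1, -1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Eq. 10: w = sum_i alpha_i y_i x_i. Here dual_coef_ stores y_i alpha_i
# for the support vectors only, so the sum is a single matrix product.
w = clf.dual_coef_ @ clf.support_vectors_
assert np.allclose(w, clf.coef_)  # matches the solver's own primal w
b = clf.intercept_[0]

# Eq. 3: classify a new document by sign(w . x + b).
x_new = X[0]
print(np.sign(w @ x_new + b)[0])
```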
An alternative to the approach here would be to use Latent Semantic Analysis (14) to generate the feature vectors. Latent Semantic Analysis can potentially give better recall by capturing partial synonymy in the word representation, i.e., by bundling related words into a single feature. The reduction in dimension can also be computationally efficient, facilitating capture of the full-text content of papers, including their citation information. On the other hand, the SVM already scales well to a large feature space and performs well at determining relevant content through suitable choices of weights. On a sufficiently large training set, the SVM might even benefit from access to fine distinctions between words potentially obscured by the latent semantic analysis bundling. It is thus an open question, well worth further investigation, as to whether latent semantic analysis applied to full text could improve performance of the SVM in this application without loss of computational efficiency. Generative models for text classification provide yet another alternative approach, in which each document is viewed as a mixture of topics, as in the statistical inference algorithm for Latent Dirichlet Allocation used in ref. 15. This approach can, in principle, provide significant additional information about document content but does not scale well to a large document corpus, so the real-time applications intended here would not yet be computationally feasible in that framework.
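For concreteness, LSA amounts to a truncated singular value decomposition of the (weighted) term-document matrix; a minimal sketch, assuming scikit-learn and a toy corpus of our own:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for titles and abstracts.
corpus = [
    "protein folding dynamics in the cell",
    "gene expression and regulatory networks",
    "quantum spin dynamics on a lattice",
    "spin glass phase transitions",
]

X = TfidfVectorizer().fit_transform(corpus)

# LSA: truncated SVD of the term-document matrix projects documents
# onto a few latent dimensions, bundling co-occurring (partially
# synonymous) words into single features. k = 2 suits only this toy
# corpus; hundreds of dimensions would be typical at arXiv scale.
lsa = TruncatedSVD(n_components=2, random_state=0)
X_latent = lsa.fit_transform(X)
print(X_latent.shape)  # (4, 2): dense latent vectors, one per document
```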
arXiv Benchmarks
Before using the machine learning framework to identify new subject-area content, we first assessed its performance on the existing (author-provided) category classifications. Roughly 180,000 titles and abstracts were fed to model-building software, which constructed a lexicon of ≈100,000 distinct words and produced training files containing the TF×IDF document vectors for SVMlight. [Although the SVM machinery could easily be used to analyze the full document content, previous experiments (11) suggest that well written titles and abstracts provide a highly focused characterization of content at least as effective for our document classification purposes.] The set of support vectors and weight parameters output by SVMlight was converted into a form specific to the linear SVM, Eq. 3: a weight vector $\vec{w}_c$ and a threshold $b_c$, where c is an index over the categories.
As seen in Fig. 3, the success of the SVM in classifying documents improves as the size of a category increases. The SVM is remarkably successful at identifying documents in large (>10,000 documents) categories and less successful at identifying smaller subject areas (<500 documents). A cutoff was imposed to exclude subject areas with <100 documents. (Some of the smaller subject areas are known to be less topically focused, so the difficulty in recall, based solely on title/abstract terminology, was expected.)
Fig. 3.
Recall and precision as functions of category size for 78 arXiv major categories and minor subject classes. Two thirds of the sample was used as a training set and one third as a test set. The four largest categories, each with >30,000 documents, are cond-mat, astro-ph, hep-th, and hep-ph.
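A hedged sketch of this benchmark protocol, one binary SVM per category ("one-vs-rest") with per-category precision and recall on the held-out third; the data here are random stand-ins, with three mock categories of decreasing size:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Random stand-ins for TF-IDF vectors; category sizes shrink from
# column to column, echoing the size dependence seen in Fig. 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 40))
Y = np.stack([X[:, 0] > 0, X[:, 1] > 1.0, X[:, 2] > 2.0], axis=1).astype(int)

# Two thirds training set, one third test set, as in Fig. 3.
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=1 / 3, random_state=0)

# One binary SVM per category label (the multilabel setting).
clf = OneVsRestClassifier(LinearSVC()).fit(X_tr, Y_tr)
Y_hat = clf.predict(X_te)

for c in range(Y.shape[1]):
    p = precision_score(Y_te[:, c], Y_hat[:, c], zero_division=0)
    r = recall_score(Y_te[:, c], Y_hat[:, c], zero_division=0)
    print(f"category {c} (n={Y[:, c].sum()}): precision {p:.2f}, recall {r:.2f}")
```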
In experiments with small categories (N < 1,000), the SVM consistently chose thresholds that resulted in unacceptably low recall (i.e., missed documents relevant to the category). This is in part because the null hypothesis, that no documents are in the category, describes the data well for a small category; e.g., if only 1,000 of 100,000 documents are in a category, the null hypothesis has 99% accuracy. ("Accuracy" is the percentage of documents correctly classified. We also employ other common terminology from information retrieval: "precision" is the fraction of those documents retrieved that are relevant to a query, and "recall" is the fraction of all relevant documents in the collection that are retrieved.) To counteract the bias against small categories, 10% of the training data was used as a validation set to fit a sigmoid probability distribution 1/[1 + exp(Af + B)] (16). This converts the dot product $f = \vec{w}_c \cdot \vec{x}_i$ between document vector and weight vector into a probability, $P(i \in c \mid \vec{x}_i)$, that document i is a member of category c, given its feature vector $\vec{x}_i$. Use of the probability in place of the uncalibrated signed distance from the hyperplane output by the SVM permitted greater levels of recall for small categories.
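Platt's sigmoid fit (16) is available off the shelf in scikit-learn, whose method="sigmoid" option implements exactly this calibration on held-out folds; the imbalanced toy data below (≈1% positives) are our own illustration:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Illustrative imbalanced task: ~1% positives, like a small subject class.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 30))
y = (X[:, 0] > 2.3).astype(int)

# method="sigmoid" is Platt scaling: on held-out folds it fits A and B
# in 1/[1 + exp(A f + B)], converting the SVM output f into a
# probability of category membership.
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5).fit(X, y)
probs = clf.predict_proba(X)[:, 1]

# Thresholding the probability (illustrative value) recovers more of
# the rare class than the raw sign of the SVM output.
print((probs > 0.3).sum(), "documents assigned to the small category")
```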
Other experiments showed that the use of TF×IDF weighting as in Eq. 2 improved accuracy consistently over pure TF weighting, so TF×IDF weighting was used in the experiments to follow. We also used a document frequency threshold to exclude rare words from the lexicon but found little difference in accuracy between document occurrence thresholds of two and five. (Words that appeared in fewer than two documents constituted ≈50% of the lexicon, and those that appeared in fewer than five documents, ≈70%. Ignoring rare, and consequently uninformative, words hence reduces the computational needs.) Increasing the weight of title words relative to abstract words, on the other hand, consistently worsened accuracy, indicating that words in a well written abstract carry at least as much classification-relevant content as words in the title. Changes in the word tokenization and stemming algorithms did not have a significant impact on overall accuracy. The default value of C = 1 in Eq. 9 was preferred.
q-bio Extraction
Recent anecdotal evidence indicates an intellectual trend among physicists toward work in biology, ranging from biomolecules, molecular pathways and networks, gene expression, cellular and multicellular systems to population dynamics and evolution. This work has appeared in separate parts of the archive, in particular, under “Soft Condensed Matter,” “Statistical Mechanics,” “Disordered Systems and Neural Networks,” “Biophysics,” and “Adaptive and Self-Organizing Systems” (abbreviated cond-mat.soft, cond-mat.stat-mech, cond-mat.dis-nn, physics.bio-ph, and nlin.AO). A more coherent forum for the exchange of these ideas was requested, under the nomenclature “Quantitative Biology” (abbreviated “q-bio”).
To determine whether a real trend to nurture and amplify indeed existed, and to create a training set, volunteers were enlisted to identify the q-bio content from the subject areas above in which it was most highly concentrated. Of 5,565 such articles received from January 2002 through March 2003, 466 (8.4%) were found to have one of the biological topics above as their primary focus. The total number of distinct words in these titles and abstracts, plus author names, was 23,558, of which 7,984 were above the DF = 2 document frequency threshold. (Author names were included in the analysis because they have potential "semantic" content in this context, i.e., are potentially useful for document classification. The SVM algorithm will automatically determine whether to use the information by choosing suitable weights.)
A data-cleaning procedure was used, in which SVMlight was first run with C = 10. Recall from Eqs. 4 and 9 that larger C penalizes training errors more heavily and permits the larger α values needed to fit the data. Inspecting the "outlier" documents (17) with the largest $|\alpha_i|$ then permitted manual cleaning of the training set. Ten documents were moved into q-bio, and 15 documents were moved out, for a net movement to 461 q-bio documents (8.3%) of the 5,565 total. Some of the other flagged documents involved word confusions; e.g., "genetic algorithms" typically involved programming rather than biological applications. Other q-bio words with frequent nonbiological senses were "epidemic" (used for rumor propagation), "evolution" (used also for dynamics of sandpiles), "survival probabilities," and "extinction." "Scale-free networks" was sometimes used for biological applications and sometimes not. To help resolve some of these ambiguities, the vocabulary was enlarged to include a list of the most frequently used two-word phrases with semantic content different from that of their constituent words.
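A sketch of this cleaning step, under the same caveats as before: any dual SVM solver exposes the $\alpha_i$, and in scikit-learn dual_coef_ holds $y_i\alpha_i$ for the support vectors, so the largest magnitudes flag the candidate outliers. The data and the number of deliberately flipped labels are our own illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Random stand-ins for the hand-labeled training vectors; a few labels
# are deliberately flipped to play the role of misfiled documents.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = np.where(X[:, 0] > 0, 1, -1)
y[:10] *= -1  # simulated labeling errors

# Large C penalizes training errors heavily (Eq. 4) and allows the
# correspondingly larger alpha_i (Eq. 9) needed to fit hard examples.
clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors; the largest
# |alpha_i| mark candidate "outlier" documents for manual review (17).
alpha = np.abs(clf.dual_coef_).ravel()
flagged = clf.support_[np.argsort(alpha)[-15:]]
print("inspect training documents:", sorted(flagged))
```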
With a training set fully representative of the categories in question, it was then possible to run the classifier on the entirety of the same subject-area content received from 1992 through April 2003, a total of 28,830 documents. The results are shown in Fig. 4. A total of 1,649 q-bio documents was identified, and the trend toward an increasing percentage of q-bio activity among these arXiv users is evident; individual authors can be tracked as they migrate into the domain. Visibility of the new domain can be further enhanced by referring submitters to it in real time, by means of the automated classifier running behind the submission interface.
Fig. 4.
The number of submissions per year from 1992 through April 2003 in the particular subsets of arXiv subject areas of cond-mat, physics, and nlin most likely to have “quantitative biology” content. The percentage of q-bio content in these areas grew from ≈1% to nearly 10% during this timeframe, suggesting a change in intellectual activity among arXiv-using members of these communities.
Some components of the weight vector generated by the SVM for this training set are shown in Table 1. Because the distance of a document to the classifying hyperplane is determined by taking the dot product with the normal vector, its component values can be interpreted as the classifying weight for the associated words. The approach here illustrates one of the major lessons of the past decade: the surprising power of simple algorithms operating on large datasets.
Table 1. The most positive, a few intermediate, and the most negative components of the q-bio classifying weight vector.

| Word | Weight | Word | Weight | Word | Weight |
| --- | --- | --- | --- | --- | --- |
| protein | +8.57 | ⋮ | ⋮ | point | −1.04 |
| dna | +7.08 | forward | +0.00 | equation | −1.05 |
| biological | +5.06 | minimalist | +0.00 | boundary | −1.06 |
| neuron | +5.05 | region | +0.00 | social | −1.06 |
| rna | +3.30 | confinement | +0.00 | n | −1.09 |
| gene | +3.21 | implies | +0.00 | relaxation | −1.14 |
| mutation | +3.15 | 96 | +0.00 | fluid | −1.15 |
| population | +3.11 | y_togashi | +0.00 | indian | −1.15 |
| epidemic | +3.05 | n_wingreen | +0.00 | spin | −1.17 |
| biology | +3.02 | mean_free | +0.00 | spin_glass | −1.17 |
| disease | +2.93 | narrower | +0.00 | traffic | −1.18 |
| cell | +2.90 | shot | −0.00 | system | −1.30 |
| neural | +2.89 | repton | −0.00 | polymer | −1.33 |
| brain | +2.83 | kyoto | −0.00 | class | −1.35 |
| ecosystem | +2.56 | regular | −0.00 | emerge | −1.36 |
| tissue | +2.52 | generalization | −0.00 | gradient | −1.39 |
| sequence | +2.51 | d_saakian | −0.00 | quantum | −1.43 |
| genetic | +2.51 | conformity | −0.00 | surface | −1.43 |
| bacterial | +2.48 | aware | −0.00 | synchronization | −1.45 |
| blood | +2.43 | even_though | −0.00 | market | −1.47 |
| genome | +2.37 | practitioner | −0.00 | particle | −1.52 |
| peptide | +2.37 | permittivity | −0.00 | polyelectrolyte | −1.53 |
| infection | +2.34 | ⋮ | ⋮ | world | −1.57 |
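Weight components like those in Table 1 fall directly out of any linear classifier; a sketch with scikit-learn, where the corpus and labels are toy stand-ins for the hand-labeled training set described above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus and labels standing in for the 5,565-document training set.
corpus = [
    "protein dna gene expression in the cell",
    "rna folding and mutation dynamics",
    "spin glass relaxation in a quantum fluid",
    "polymer surface boundary dynamics",
]
y = [1, 1, -1, -1]  # +1 = q-bio, -1 = other

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(corpus), y)

# Components of the weight vector are the classifying weights of the
# associated words, as in Table 1.
order = np.argsort(clf.coef_.ravel())
words = vec.get_feature_names_out()
print("most negative:", list(words[order[:5]]))
print("most positive:", list(words[order[-5:]]))
```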
Although implemented as a passive-dissemination system, the arXiv has also played a social engineering role, with active research users developing an affinity to the system and adjusting their behavior accordingly. They scan new submissions on a daily basis, assume others in their field do so, are consequently aware of anything relevant that has appeared there (whereas anything that doesn't may as well not exist), and use it to stake intellectual priority claims in advance of journal publication. We see further that machine learning tools can characterize a subdomain and thereby help accelerate its growth by the interaction of an information resource with the social system of its practitioners. [Some other implications of these methods, including potential modification of the peer review system, are considered elsewhere (2).]
Postscript
The q-bio extraction described above was not just a thought experiment, but a prelude to an engineering experiment. The new category went online in mid-September 2003 (see http://arXiv.org/new/q-bio.html), at which time past submitters flagged by the system were asked to confirm the q-bio content of their submissions and to send future submissions to the new category. The activity levels at the outset corresponded precisely to the predictions of the SVM text classifier and later began to show indications of growth catalyzed by the public existence of the new category.
Acknowledgments
This work was supported by National Science Foundation Agreement 0132355 (to P.G.), funding from the Cornell University Library (to P.H.), National Science Foundation Career Award 0237381 (to T.J.), and the National Science Foundation KD-D Project (to T.J.).
This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “Mapping Knowledge Domains,” held May 9-11, 2003, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA.
Abbreviations: SVM, support vector machine; TF, term frequency; IDF, inverse document frequency.
References
- 1. Ginsparg, P. (2001) in Electronic Publishing in Science II, Proceedings of the Joint ICSU Press/UNESCO Conference, Paris, France (ICSU Press, Paris). Available at http://users.ox.ac.uk/icsuinfo/ginspargfin.htm and at http://arXiv.org/blurb/pg01unesco.html. Accessed January 22, 2004.
- 2. Ginsparg, P. (2003) Sci. Technol. Libraries 22, 5-18; preprint available at http://arXiv.org/blurb/pg02pr.html.
- 3. O'Connell, H. B. (2002) http://arXiv.org/physics/0007040.
- 4. Börner, K., Chen, C. & Boyack, K. (2003) Annu. Rev. Inf. Sci. Technol. 37, 179-255.
- 5. Hayes, P. & Weinstein, S. (1990) in Proceedings of IAAI-90, 2nd Conference on Innovative Applications of Artificial Intelligence, eds. Rappaport, A. & Smith, R. (AAAI Press, Menlo Park, CA), pp. 49-66.
- 6. Salton, G. & Buckley, C. (1988) Inf. Processing Manage. 24, 513-523.
- 7. Porter, M. (1980) Program (Autom. Libr. Inf. Syst.) 14, 130-137.
- 8. Vapnik, V. (1998) Statistical Learning Theory (Wiley, Chichester, U.K.).
- 9. Dumais, S., Platt, J., Heckerman, D. & Sahami, M. (1998) in Proceedings of CIKM-98, Seventh International ACM Conference on Information and Knowledge Management (Association for Computing Machinery, New York), pp. 148-155.
- 10. Joachims, T. (1998) in Proceedings of the European Conference on Machine Learning (Springer, Berlin), pp. 137-142.
- 11. Joachims, T. (2002) Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms (Kluwer, Dordrecht, The Netherlands).
- 12. Burges, C. (1998) Data Mining Knowledge Discovery 2, 121-167.
- 13. Cristianini, N. & Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (Cambridge Univ. Press, New York).
- 14. Landauer, T. K. & Dumais, S. T. (1997) Psychol. Rev. 104, 211-240.
- 15. Griffiths, T. L. & Steyvers, M. (2004) Proc. Natl. Acad. Sci. USA 101, 5228-5235.
- 16. Platt, J. (1999) in Advances in Large Margin Classifiers (MIT Press, Cambridge, MA), pp. 61-74.
- 17. Guyon, I., Matic, N. & Vapnik, V. (1996) in Advances in Knowledge Discovery and Data Mining (MIT Press, Cambridge, MA; AAAI Press, Menlo Park, CA), pp. 181-203.