Proceedings of the National Academy of Sciences of the United States of America. 2010 Nov 15;107(49):20899–20904. doi: 10.1073/pnas.1013452107

Reconceptualizing the classification of PNAS articles

Edoardo M Airoldi a, Elena A Erosheva b, Stephen E Fienberg c,d,1, Cyrille Joutard e, Tanzy Love f, Suyash Shringarpure d
PMCID: PMC3000298  PMID: 21078953

Abstract

PNAS article classification is rooted in long-standing disciplinary divisions that do not necessarily reflect the structure of modern scientific research. We reevaluate that structure using latent pattern models from statistical machine learning, also known as mixed-membership models, that identify semantic structure in the co-occurrence of words in abstracts and of references. Our findings suggest that the latent dimensionality of patterns underlying PNAS research articles in the Biological Sciences is only slightly larger than the number of categories currently in use, although the content of the inferred categories differs substantially from that of the current ones. Further, the number of articles listed under multiple categories is only a small fraction of what it arguably should be. These findings, together with our sensitivity analyses, suggest ways to reconceptualize the organization of papers published in PNAS.

Keywords: text analysis, hierarchical modeling, Markov chain Monte Carlo, variational inference, Dirichlet process


The Proceedings of the National Academy of Sciences (PNAS) is indexed by Physical, Biological, and Social Sciences categories, and, within these, by subclassifications that correspond to traditional disciplinary topics. When submitting a paper, authors classify it by selecting a major and a minor category. Although authors may opt to have dual or even triple indexing, only a small fraction of published PNAS papers do so. How well does the current classification scheme capture modern interdisciplinary research? Could some alternative structure better serve PNAS in fostering publication and visibility of the best interdisciplinary research? These questions may be thought of as falling under the broad umbrella of “knowledge mapping.”

A special 2004 supplement of PNAS, based on the Arthur M. Sackler Colloquium on Mapping Knowledge Domains, presented a number of articles that applied various knowledge mapping techniques to the contents of PNAS itself (1). What was striking about the issue is that two articles, by Erosheva et al. (2, henceforth EFL) and by Griffiths and Steyvers (3, henceforth GS), based on similar statistical machine learning models, made statements about the number of inferred categories needed to describe semantic patterns in PNAS articles that differed by more than an order of magnitude (10 versus 300). Here we revisit these earlier analyses in the light of a new one and attempt (i) to understand the differences between them and (ii) to estimate the minimal number of latent categories necessary to describe modern, often interdisciplinary, scientific research as reported in PNAS.

To set the stage, we provide a brief overview of the relevant models and summarize the similarities and differences between the two approaches and corresponding analyses presented in refs. 2 and 3. Using the same database as in EFL (2), we explore a wide range of analytic and modeling choices in our attempt to reconcile the differences in prior analyses. We approach the choice of the number of “latent categories,” which are inferred from data, with multiple strategies including one similar to that used by GS (3). Our findings suggest that 20 to 40 latent categories suffice to describe PNAS Biological Sciences publications, 1997–2001. Thus a reconceptualization of the indexing for PNAS Biological Sciences articles would require at most doubling the 19 traditional disciplinary categories. Because the true number of underlying semantic patterns is unknown and unknowable, we also report on a simulation study that confirms that, were there as few as 20 topics, our methodology would come close to estimating this number in a reasonable way. We also suggest some implications of our reconceptualization for the multiple indexing of interdisciplinary research in PNAS and elsewhere.

Overview of the Earlier Analyses

EFL (2) and GS (3) both analyzed data extracted from PNAS articles from an overlapping time period using versions of mixed-membership models (4). A distinctive feature of mixed-membership models for documents is the assumption that articles may combine words (plus any other attributes such as references) from several latent categories according to proportions of the article’s membership in each category. The latent categories are not observable. They are typically estimated from data together with the proportions. The latent categories need not correspond to existing PNAS disciplinary classifications. Rather, each category can be thought of as a probability distribution over document-specific attributes that specifies which set of, say, words and references, co-occur frequently. The latent categories are often a quantitative by-product of concepts and semantic patterns that are used in a specific disciplinary area more than in others.

A mixed-membership structure allows for a parsimonious representation of interdisciplinary research without the need to create separate categories to accommodate both existing disciplinary links and new forms of collaborative research. Mixed-membership models achieve this through specifying article-level membership parameter vectors. In general, formulating mixed-membership models requires a combination of assumptions at the population level (e.g., PNAS Biological Sciences), subject level (individual articles), latent variable level (article’s membership vector), and the sampling scheme for generating subject’s attributes (article’s words and/or references). Variations of these assumptions can easily produce different mixed-membership models, and the models used by EFL and GS are special cases of the general mixed-membership model framework presented by EFL.

We summarize other aspects of the analytic choices, model fitting, and model selection strategies of EFL and GS in Table 1. We believe that analytic decisions, such as working with the Biological Sciences articles* versus all PNAS articles, including commentaries and reviews in the database, or excluding rare words from the analysis, cannot account for the order-of-magnitude difference in the most likely number of latent categories inferred from similar data. Given that the models were so similar, we questioned the discrepancy between the 8 to 10 latent categories used by EFL and the 300 likely latent categories reported by GS. Why was there such a large difference in this key feature, around which all other results revolve? More importantly, in light of this issue, can this type of statistical model support a substantive reconceptualization of the classification scheme in use by PNAS?

Table 1.

Comparison of analytic choices in previous analyses

2004 analyses                 | Erosheva et al.                      | Griffiths and Steyvers
PNAS database                 |                                      |
  Years                       | 1997–2001                            | 1991–2001
  Scope                       | Biological Sciences                  | All areas
  Article type                | Only research articles               | All publications
Article data                  |                                      |
  Data source                 | Words (abstract) and references      | Words (abstract)
  Types of words included     | Frequent, rare, “stop”               | Only frequent
Model structure               |                                      |
  Number of latent categories | K                                    | K
  Mixed membership            | λ ∼ Dirichlet(α1,…,αK)               | λ ∼ Dirichlet(α1,…,αK), αk = α ∀ k
  Distribution for words      | Multinomial(θk1)                     | Multinomial(θk1)
  Distribution for references | Multinomial(θk2)                     | None
Estimation                    |                                      |
  Strategy                    | Variational expectation–maximization | Gibbs sampler
  Hyperparameters             | Estimated α1,…,αK                    | Set α = 50/K
Dimensionality selection      |                                      |
  Main objective              | Descriptive model                    | Predictive model
  Dimensions considered       | K = 8, 10                            | 50 ≤ K ≤ 1000

Below, we report on new analyses and results for the PNAS data and offer evidence in support of the utility of mixed-membership analysis for grounding considerations about a useful reconceptualization of PNAS categories.

Main Analysis

Mixed-Membership Models.

We attempted to reconcile the differences in the original analyses of EFL and GS as follows: First, we used a common database for all models considered in this paper. Second, we varied data sources and hyperparameter estimation strategies to closely match those of the original analyses. Third, we remedied the absence of a dimensionality selection strategy in EFL by allowing the number of latent categories, K, to vary between 2 and 1,000 and comparing goodness of fit for different values of K.

Table 2 summarizes the resulting four mixed-membership models in a 2 × 2 layout. Model 3 is the closest to EFL’s model except that we now employ a symmetric Dirichlet distribution (αk = α for all k) that matches GS’s assumption. Model 2 uses the same data source and hyperparameter estimation strategy as in GS. We include models 1 and 4 to complement the other two by balancing the choice of data and estimation strategies.

Table 2.

Mixed-membership models in our analysis

Data source(s)          | Hyperparameter α estimated | Hyperparameter α set at 50/K
Abstract                | Model 1                    | Model 2
Abstract + bibliography | Model 3                    | Model 4

Let x1 be the observed words in the article’s abstract and x2 be the observed references in the bibliography. We assume that words and references come from finite discrete sets (vocabularies) of sizes V1 and V2, respectively. For simplicity, we assume that the vocabulary sets are common to all articles, independent of the publication time. We assume that the distribution of words and references in an article is driven by the article’s membership in each of K latent categories, λ = (λ1,…,λK), representing the proportions of attributes that arise from a given latent pattern; λk ≥ 0 for k = 1,2,…,K and λ1 + ⋯ + λK = 1. We denote the probabilities of the V1 words and the V2 references in the kth pattern by θk1 and θk2, for k = 1,2,…,K. These vectors of probabilities define multinomial distributions over the two vocabularies of words and references for each latent category. We assume that the article-specific (latent) vectors of mixed-membership scores are realizations from a symmetric Dirichlet distribution. For an article with R1 words in the abstract and R2 references in the bibliography, the generative sampling process for the mixed-membership model is as follows:

Mixed-Membership Models: Generative Process.

  1. Sample λ ∼ Dirichlet(α1,α2,…,αK), where αk = α, for all k.

  2. Sample x1 ∼ Multinomial(p1, R1), where p1 = ∑k λk θk1.

  3. Sample x2 ∼ Multinomial(p2, R2), where p2 = ∑k λk θk2.

This process corresponds to models 3 and 4 in Table 2. The process for models 1 and 2 relies on steps 1 and 2 where only words in abstracts, x1, are sampled. The conditional probability of words and references in an article is then

Pr(x1, x2 | λ) = ∏r (∑k λk θk1[x1r]) · ∏r (∑k λk θk2[x2r]),

where the first product runs over the R1 word positions and the second over the R2 reference positions.
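As a concrete illustration, the three sampling steps can be sketched in a few lines of NumPy. The dimensions and the function name `sample_article` are ours, chosen for illustration; only the generative structure follows the model described above.

```python
import numpy as np

def sample_article(theta1, theta2, alpha, R1, R2, rng):
    """Sample one article from the mixed-membership generative process
    (models 3 and 4): a membership vector, then words and references."""
    K = theta1.shape[0]
    # Step 1: symmetric Dirichlet membership vector lambda.
    lam = rng.dirichlet(np.full(K, alpha))
    # Mixture probabilities over the two vocabularies: p_j = sum_k lam_k theta_kj.
    p1 = lam @ theta1          # length V1
    p2 = lam @ theta2          # length V2
    # Steps 2-3: multinomial counts of words and references.
    x1 = rng.multinomial(R1, p1)
    x2 = rng.multinomial(R2, p2)
    return lam, x1, x2

rng = np.random.default_rng(0)
K, V1, V2 = 20, 1000, 500
theta1 = rng.dirichlet(np.full(V1, 0.01), size=K)  # word distributions per category
theta2 = rng.dirichlet(np.full(V2, 0.01), size=K)  # reference distributions per category
lam, x1, x2 = sample_article(theta1, theta2, alpha=0.5, R1=100, R2=30, rng=rng)
```

For models 1 and 2, the same sketch applies with the reference step dropped.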

Estimation and Posterior Inference.

Given a collection of articles, we treat the pattern-specific distributions of words and references, {θk1} and {θk2}, as constant quantities to be estimated, and the article-specific proportions of membership λk as incidental parameters whose posterior distributions we compute. We assume that the hyperparameter α is unknown and estimate it from the data in models 1 and 3; we fix the value of α at 50/K following GS's heuristic in models 2 and 4. We carry out estimation and inference using the variational expectation–maximization algorithm (5, 6). Variational methods provide an approximation to a joint posterior distribution when the likelihood is intractable. When we fix the hyperparameter α, as in models 2 and 4, we can use a Gibbs sampler to sample from the exact joint posterior distribution, as implemented by GS in their original analysis. When we estimate α, however, we rely on variational approximations for estimation and inference. Simulation studies for the Grade of Membership model have shown that results obtained from the two estimation methods are similar (7). We give full details in SI Text.

Dimensionality Selection.

Each time we fit a mixed-membership model to data, we must specify the number of latent categories, K, in the model. The goal of dimensionality selection is to identify a number of latent categories K that is optimal in some sense. We identify the number of latent categories that leads to an optimal model-based summary of the database of scientific articles in a predictive sense, by means of a battery of out-of-sample experiments involving a form of cross-validation. We use 5-fold cross-validation, common in the machine learning literature, e.g., ref. 8, and explain the rationale for this choice in SI Text. Each out-of-sample experiment consists of five model fits for a given value of K. First, we split the N articles into five batches. Then, in turn, we estimate the model parameters using the articles in four batches, and we compute the likelihood of the articles in the fifth held-out batch. This leads to mean and variability estimates of quantities that summarize the goodness of fit of the model for a given K, on a batch of articles not included in the estimation. We consider a grid of values for K that range from a small to a large number of latent categories; namely, K = 2,…,5,10,…,45,50,75,100,200,…,900,1,000.
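A minimal sketch of this out-of-sample procedure follows, with placeholder `fit` and `heldout_loglik` callables standing in for the actual model-fitting and held-out-likelihood routines; the function name and the toy usage are ours.

```python
import numpy as np

def five_fold_heldout(docs, K_grid, fit, heldout_loglik, seed=0):
    """For each K: fit on four batches, score the held-out fifth,
    and report the mean and standard deviation over the five folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(docs))
    folds = np.array_split(idx, 5)
    results = {}
    for K in K_grid:
        scores = []
        for f in range(5):
            held = set(folds[f].tolist())
            train = [docs[i] for i in idx if i not in held]
            test = [docs[i] for i in folds[f]]
            model = fit(train, K)
            scores.append(heldout_loglik(model, test))
        results[K] = (float(np.mean(scores)), float(np.std(scores)))
    return results

# Toy usage: a dummy "model" that just records K, scored by -K.
docs = list(range(50))
res = five_fold_heldout(docs, [2, 5, 10],
                        fit=lambda train, K: K,
                        heldout_loglik=lambda model, test: -float(model))
```

In the real analysis, `heldout_loglik` would return the (variational lower bound on the) log probability of the held-out batch.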

Sensitivity Analyses

Fitting the four mixed-membership models from Table 2 to the PNAS dataset allows us to examine the impact of using references and estimating the hyperparameter α in a 2 × 2 design. We examine the sensitivity of empirical PNAS results obtained with mixed-membership models by considering the impact on model fit and selection of (i) our key assumption of mixed membership and our simple bag-of-references model. In addition, (ii) we use a simulation study to investigate the methodological issue of the potential impact on dimensionality selection due to fixing hyperparameter α, following the strategy of GS. Finally, to address interpretation issues, we study (iii) the distributions of shared memberships for different values of K and investigate (iv) whether increases in model dimension K beyond some optimal value change the macrostructure of the latent categories.

Table 3.

Summary statistics for dual-classified articles and predictions based on mixed-membership model applied to 5 years of PNAS data

Category                    | Primary | Sec. | Model 1 | Model 2 | Model 3 | Model 4
Biochemistry                |   2,580 |   33 |      51 |       8 |     230 |     109
Medical Sciences            |   1,547 |   13 |      13 |       2 |      39 |      18
Neurobiology                |   1,343 |   10 |      65 |      29 |     104 |      10
Cell Biology                |   1,230 |   10 |       5 |       3 |      10 |       3
Genetics                    |     980 |   14 |      20 |       2 |      55 |      15
Immunology                  |     865 |    9 |      43 |       1 |      80 |      45
Biophysics                  |     637 |   40 |     139 |      37 |     231 |     131
Evolution                   |     510 |   12 |     101 |       6 |     103 |     133
Microbiology                |     498 |   11 |       8 |       3 |      13 |       9
Plant Biology               |     488 |    4 |       2 |       0 |       8 |       1
Developmental Biology       |     367 |    2 |       2 |       0 |       3 |       1
Physiology                  |     341 |    2 |       0 |       2 |      17 |       3
Pharmacology                |     189 |    2 |       0 |       1 |       9 |       2
Ecology                     |     133 |    5 |      49 |       3 |      34 |      42
Applied Biological Sciences |      95 |    6 |       5 |       0 |       2 |       1
Psychology                  |      88 |    1 |      34 |      14 |      52 |       1
Agricultural Sciences       |      43 |    2 |       2 |       0 |       4 |       0
Population Biology          |      43 |    5 |      10 |       3 |      12 |      13
Anthropology                |      10 |    0 |       5 |       0 |       2 |       1
Total                       |  11,988 |  181 |     554 |     114 |   1,008 |     538

(i). Alternative Models.

To study sensitivity of our latent dimensionality results to the key assumption of mixed membership, we implement another mixture model assuming that research reports belong to only one of the latent categories. This full-membership model can be thought of as a special case of the mixed-membership model where, for each article, all but one of the membership scores are restricted to be zero. As opposed to traditional finite mixture models that are formulated conditional on the number of latent categories K, this model variant allows the joint estimation of the latent categories, θ, and of the model dimension K.

We assume an infinite number of categories and implement this assumption through a Dirichlet process prior, Dα; for λ, e.g., see refs. 9 and 10. The distribution Dα models the prior probabilities of latent pattern assignment for the collection of documents. In particular, for the nth article, given the set of assignments for the remaining articles, λ-n, this prior puts a probability mass on the kth pattern (out of K-n distinct patterns observed in the collection of documents excluding the nth one), which is proportional to the number of documents associated with it. The prior distribution also puts a probability mass on a new, (K-n + 1)th latent semantic pattern, which is distinct from the patterns (1,…,K-n) observed in λ-n. That is, Dα entails prior probabilities for each component of λn as follows:

Pr(λn[k] = 1 | λ-n) = m(-n,k) / (N − 1 + α), for existing patterns k = 1,…,K-n, and
Pr(λn[K-n + 1] = 1 | λ-n) = α / (N − 1 + α), for a new pattern,

where m(-n,k) is the number of documents that are associated with the kth latent pattern, excluding the nth document, i.e., m(-n,k) = ∑j≠n λj[k].
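These prior weights are the familiar Chinese-restaurant-process probabilities; a minimal sketch (the function name `crp_prior` is ours):

```python
import numpy as np

def crp_prior(counts, alpha):
    """Prior probabilities for a document's pattern assignment under the
    Dirichlet process: proportional to m(-n, k) for each existing pattern
    and to alpha for a new pattern. `counts` holds m(-n, k), k = 1..K_-n."""
    counts = np.asarray(counts, dtype=float)
    probs = np.append(counts, alpha)   # last entry = new-pattern mass
    return probs / probs.sum()

# Three existing patterns associated with 5, 3, and 2 other documents; alpha = 1.
p = crp_prior([5, 3, 2], alpha=1.0)
```

The normalizing constant is N − 1 + α, as in the expression above.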

The generative sampling process for this full-membership model is as follows:

  1. Sample λ ∼ Dirichlet Process(α)

  2. Sample x1 ∼ Multinomial(θk1,R1), where λn[k] = 1.

  3. Sample x2 ∼ Multinomial(θk2,R2), where λn[k] = 1.

As with mixed-membership models, we considered two versions of the data: words from the abstract and references from the bibliography of the collection of articles. Model 5 corresponds to this process with steps 1 and 2, where we sample only words, x1. Model 6 corresponds to this process with steps 1–3, where we sample words and references, x1 and x2. We provide full details about estimation and inference via Markov chain Monte Carlo methods in SI Text.

Additionally, we fit a mixed-membership model with a time-dependent bag of references. This confirmed that giving up the time resolution of the articles in our database has a negligible impact on model selection results.

(ii). Simulation Study.

We simulated data from a mixed-membership model with K* = 20 latent categories to obtain a corpus of documents we could use as ground truth. We used a vocabulary of 1,000 words and simulated 5,000 documents. We sampled the length of each document from a Poisson distribution with a mean of 100 words. We set the hyperparameter controlling mixed membership equal to α = 0.01, and we sampled the multinomial parameter vectors corresponding to the latent patterns from a symmetric Dirichlet with hyperparameter 0.01. We then treated K* as unknown and approached model estimation in two ways: (i) by estimating the hyperparameter α as in our main analysis and (ii) by setting the hyperparameter α = 50/K, for a given K, according to the ad hoc strategy implemented by GS.

For model selection purposes, we considered a grid for K as follows: increments of 4 for 10 ≤ K ≤ 50, increments of 10 for 60 ≤ K ≤ 100, and increments of 50 for 150 ≤ K ≤ 500. Thus, we fit the model 25 times for each of 24 values of K.
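A scaled-down sketch of such a ground-truth simulation follows (the function name `simulate_corpus` is ours, and the toy usage uses a much smaller corpus than the one in the study for speed):

```python
import numpy as np

def simulate_corpus(K=20, V=1000, n_docs=5000, mean_len=100,
                    alpha=0.01, beta=0.01, seed=0):
    """Simulate a corpus as in the study: K latent categories, a V-word
    vocabulary, document lengths Poisson(mean_len), memberships from a
    symmetric Dirichlet(alpha), and per-category word distributions drawn
    from a symmetric Dirichlet(beta)."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(V, beta), size=K)   # K x V word distributions
    docs = []
    for _ in range(n_docs):
        lam = rng.dirichlet(np.full(K, alpha))        # membership vector
        length = max(1, rng.poisson(mean_len))        # document length
        docs.append(rng.multinomial(length, lam @ theta))
    return theta, docs

# Small illustration run.
theta, docs = simulate_corpus(K=5, V=50, n_docs=20, mean_len=30, seed=1)
```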

(iii). Shared Memberships.

Assume that a document is associated with latent category k∈{1,…,K} if and only if its membership score for this category is greater than sd + 1/K, where sd is the posterior standard deviation of the membership scores. For each value of K in our grid, we computed the number of documents associated with exactly k∈{1,…,K} latent categories. We then examined these distributions of the shared membership for a range of models with up to K = 300.
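The association rule and the resulting counts can be sketched as follows; here `sd` is taken as the standard deviation of all membership scores in the fitted matrix, which is our reading of the rule above.

```python
import numpy as np

def shared_membership_counts(Lam, K):
    """Count, for each k, how many documents are associated with exactly
    k latent categories, using the rule: score > sd + 1/K."""
    Lam = np.asarray(Lam)
    threshold = Lam.std() + 1.0 / K
    n_assoc = (Lam > threshold).sum(axis=1)           # categories per document
    return np.bincount(n_assoc, minlength=K + 1)      # index k = count of docs

# Three documents, K = 3 latent categories.
Lam = np.array([[0.80, 0.10, 0.10],
                [0.50, 0.50, 0.00],
                [0.34, 0.33, 0.33]])
counts = shared_membership_counts(Lam, K=3)
```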

(iv). Macrostructure.

Attempting to interpret the latent categories manually for all values of K in our grid is unreasonable. Hence, we analyzed computationally whether increases in model dimension K destroy the macrostructure and reorganize the latent categories by comparing multinomial probabilities for the latent patterns from the model with a smaller dimension K* with the closest-matching ones of the model with a larger dimension.

We provide details for sensitivity analyses (iii) and (iv) in SI Text.

Main Results

Dimensionality.

Our primary goal is to assess qualitatively and quantitatively a reasonable range for the number of latent categories underlying the PNAS database. Our analysis offers some insights into the impact on the results from differences in the models and the inference strategies. The simulation study also investigates the impact of such differences on model fit and dimension selection in a controlled setting.

Dimensionality for Mixed-Membership Models.

To provide a quantitative assessment of model fit in terms of the number of latent categories K, we relied on their predictive performance with out-of-sample experiments, as described above. Recall that for mixed-membership analysis with models 1–4, we assume that K is an unknown constant. We split the articles into five batches to be used for all values of K. We considered values of K on a grid, spanning a range between 2 and 1,000. To summarize goodness of fit of the model in a predictive sense, we examine the held-out probability, that is, the probability computed on the held-out batch of articles.§

For each value of K on the grid, we computed the average held-out log-probability value over the five model fits. Fig. 1 summarizes predictive performance of the mixed-membership models 1–4, for values of K = 2,…,100 (the average log-probability values continued to decline gradually for K greater than 100). The goodness of fit improves when we estimate the hyperparameter α (solid lines); however, all plots suggest an optimal choice of K falls in the range of 20–40, independent of the estimation strategy for α and of references inclusion. Values of K that maximize the held-out log probability are somewhat greater when the database includes references. We obtained similar dimensionality results using the Bayesian information criterion (11).

Fig. 1.

Average held-out log probability corresponding to the four mixed-membership models we fit (Table 2) to the PNAS Biological Sciences articles, 1997–2001, using words from article abstracts (Left) and words and references (Right). Solid lines correspond to models fitted by estimating the hyperparameter α; dashed lines correspond to models fitted by setting the hyperparameter equal to α = 50/K.

Dimensionality for Full-Membership Models.

Although we base the choice of K for mixed-membership models 1–4 on their predictive performance, the semiparametric full-membership models 5 and 6 allow us to examine the posterior distribution of K.

Fig. 2 shows the posterior distributions on K—density on the Y axis versus values of K on the X axis—obtained by fitting data to semiparametric models with words only (model 5, solid line) and words and references (model 6, dashed line). The maximum a posteriori estimate of K is smaller for the model including references compared to the model with words only. Further, the posterior range of K is smaller for the model including references. Thus adding references to the models reduces the posterior uncertainty about K.

Fig. 2.

Posterior distribution of the number of mixture components K for full-membership models for the PNAS Biological Sciences articles, 1997–2001, using words from article abstracts (solid line) and words and references (dashed line).

Dimensionality: Overall.

Our simulation showed that setting the hyperparameter α as a function of K in the same way as GS did had the greatest impact on estimates of the document-specific mixed-membership vectors, leading to a modest upward bias in the choice of an optimal K, but it did not produce an order-of-magnitude difference. We provide more detailed results on our simulation study in SI Text.

Overall, for all six models, values of K in the range of 20–40 are plausible choices for the number of latent categories in PNAS Biological Sciences research reports, 1997–2001.

Qualitative and Quantitative Analysis of Inferred Categories.

For illustrative purposes, we consider K* = 20 for the mixed-membership model with words and references. We obtain qualitative descriptions of the latent categories using two approaches: via examining high probability words and references in each category and via comparing the model-based inferred article categories with the original PNAS classifications.

Studying the lists of words and references that are most likely to occur according to the distribution of each latent category, we see some interesting patterns that are distinct from current PNAS classifications. For example, category 5 focuses on the process of apoptosis and genetic nuclear activity in general. Category 12 concerns peptides. Several categories relate to protein studies including pattern 8 that deals with protein structure and binding. We offer an interpretation of all the topics in SI Text in an effort to demonstrate what a reasonable model fit should look like.

To examine the relationship between the 20 inferred categories and the 19 original PNAS categories in the Biological Sciences, we plot in Fig. 3 the average membership of the set of documents in the ith PNAS class (row) in the kth latent category (column). We threshold the average membership scores so that small values (less than 10%) would not distract from the visual pattern.
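The thresholded averages plotted in Fig. 3 can be computed as follows; the function name and the toy data are ours.

```python
import numpy as np

def average_membership_table(Lam, labels, n_classes, floor=0.10):
    """Average membership in each latent category by original PNAS class
    (rows = classes, columns = latent categories), zeroing averages below
    the 10% floor used for the plot."""
    Lam = np.asarray(Lam)
    labels = np.asarray(labels)
    table = np.vstack([Lam[labels == c].mean(axis=0)
                       for c in range(n_classes)])
    table[table < floor] = 0.0
    return table

# Four documents in two PNAS classes, three latent categories.
Lam = np.array([[0.90, 0.05, 0.05],
                [0.70, 0.25, 0.05],
                [0.10, 0.80, 0.10],
                [0.00, 0.90, 0.10]])
tab = average_membership_table(Lam, [0, 0, 1, 1], n_classes=2)
```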

Fig. 3.

Estimated average membership of articles in the 20 latent categories by PNAS classifications for mixed-membership models 1–4 in Table 2, Left to Right.

Fig. 3 (Left to Right) details results for models 1 and 2 (words only) and models 3 and 4 (words and references). The results reveal the impact of expanding the database to include references and of setting the hyperparameter at α = 50/K (models 2 and 4). When we include the references, the relationship between the estimated latent categories and the designated PNAS classifications becomes more composite for each estimation method. When we estimate the hyperparameter α, we observe better agreement between the estimated latent categories and the original PNAS classifications. A greater number of darker blocks indicates more articles with substantial estimated membership in just a few latent categories for the α-estimated models. Lighter blocks for the constrained-α models may reflect either more spread-out membership (small membership values across all articles) or disagreement among the estimated membership vectors of articles from the same original PNAS classification. Either explanation leads us to conclude that estimating the hyperparameter yields a model with a better connection to the original PNAS classification.

From an inspection of the estimated categories, we see that small subclassifications such as Anthropology do not result in separate categories and broad ones such as Microbiology and Pharmacology have distinct subpatterns within them. Nearly all of the PNAS classifications are represented by several word-and-reference co-occurrence patterns, consistently across models.

Fig. 4 shows the distributions of shared memberships for varying values of K based on model 3. Overall, no matter what the dimensionality of the model, most articles tend to be associated with about five or fewer latent categories. For K* = 20, 37% of articles are associated with two latent categories and 2% with three categories, the theoretical upper bound on the number of associations in this case. SI Text provides further details.

Fig. 4.

Distribution of shared membership across the latent categories for different values of K using model with words and references. Black solid line indicates the upper bound on the number of associations for each K.

When we investigated the impact of increases in dimensionality K on interpretation, we found substantial reorganization among distributions of words and references in the latent categories. We compared estimated multinomial distributions for words and references between categories from pairs of models of dimensions K1 and K2, where K1 < K2, by computing correlations between all pairs of vectors. We found that correlations between the K1 vectors in the smaller model and K1 best-matching vectors from the larger model tend to diminish as K2 increases, indicating that the macrostructure is not preserved. As expected, we also found that correlations between the K1 vectors in the smaller model and additional vectors from the larger model were small.
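The correlation-based matching of categories between a smaller and a larger model can be sketched as below; the greedy with-replacement matching and the function name are ours.

```python
import numpy as np

def best_match_correlations(theta_small, theta_big):
    """For each category vector of the smaller model (rows of theta_small),
    return its maximum correlation with any category vector of the larger
    model, as a rough check of preserved macrostructure."""
    K1 = len(theta_small)
    # Rows are variables; slice out the K1 x K2 cross-correlation block.
    corr = np.corrcoef(theta_small, theta_big)[:K1, K1:]
    return corr.max(axis=1)

rng = np.random.default_rng(0)
small = rng.dirichlet(np.ones(30), size=3)            # 3 categories, 30-word vocab
big = np.vstack([small + rng.normal(0, 1e-4, small.shape),  # near-copies
                 rng.dirichlet(np.ones(30), size=4)])       # 4 unrelated categories
r = best_match_correlations(small, big)
```

Here the near-copies are recovered with correlations close to 1; when the larger model reorganizes the categories, these best-match correlations drop.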

Predictions.

Recall that our database includes 11,988 articles, classified by the authors into 19 subcategories of the Biological Sciences section. Of these, 181 were identified by their authors as having dual classifications. Here, we identify publications whose membership vectors are similar to those of dual-classified articles, i.e., single-classified articles that may also have been suitable for cross-classification. Table 3 summarizes these results. The parametric models 1–4 predict that, respectively, 554, 114, 1,008, and 538 additional articles were similar to the author-identified dual-classified articles. By similar, we mean that their mixed-membership vectors in the 20 latent semantic patterns match a membership vector of a dual-classified article to the first significant digit. Of particular interest is the large proportion of Biochemistry, Neurobiology, Biophysics, and Evolution articles that our analyses suggest as potential dual-classified articles.
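One way to operationalize matching "to the first significant digit" is to round each membership score to its leading digit and compare the resulting signatures; the helper names below and this reading of the rule are ours.

```python
import numpy as np

def first_sig_digit_key(v):
    """Round each membership score to its first significant digit,
    giving a coarse signature for a membership vector."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    nz = v > 0
    exp = np.floor(np.log10(v[nz]))                   # decade of each score
    out[nz] = np.round(v[nz] / 10**exp) * 10**exp     # keep one significant digit
    return tuple(out)

def find_similar(singles, duals):
    """Indices of single-classified membership vectors whose signature
    matches some dual-classified article's signature."""
    dual_keys = {first_sig_digit_key(d) for d in duals}
    return [i for i, s in enumerate(singles) if first_sig_digit_key(s) in dual_keys]

duals = [np.array([0.31, 0.69, 0.0])]
singles = [np.array([0.29, 0.71, 0.0]),   # same signature as the dual article
           np.array([0.50, 0.50, 0.0])]   # different signature
hits = find_similar(singles, duals)
```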

Discussion

We have focused on alternative specifications for mixed-membership models to explore ways to classify papers published in PNAS that capture, in a more salient fashion, the interdisciplinary nature of modern science. Through the data analysis of 5 y of PNAS Biological Sciences articles, we have demonstrated that a small number of classification topics does an adequate job of capturing the semantic structure of the published articles. These topics also correspond reasonably well to the current PNAS classification structure.

The machine learning literature contains many variants of mixed-membership models for classification and clustering problems. For example, Blei and Lafferty (12) describe a dynamic topic model and apply it to data from 125 y of Science. A different approach to references might exploit the network structure of authors with the mixed-membership stochastic block model of ref. 13 or the author–topic model of ref. 14; see also a review of such models in the psychological literature (15). The selection of an appropriate dimension for the number of latent categories, K, is often hidden behind the scenes in applications, with some exceptions, such as models that place a probability distribution over the number of dimensions, including the Dirichlet process (16) and its many variants (17–20).

Here we provide an extended analysis of dimensionality in a database of PNAS publications, contrasting our findings with earlier published ones (2, 3). The consistency of our results across multiple variants of mixed-membership models indicates that this type of statistical analysis, when done carefully, could support a substantive reconceptualization of the classification scheme used by PNAS. A more in-depth study of semantic patterns, inferred from data extracted from papers published in PNAS using tools such as those described here, would also assist in the review process and in the indexing of published papers, so as to reflect modern, overlapping, and interdisciplinary scientific publications. Finally, instead of relying solely on citations, readers could automatically be pointed to related work via the articles with the most similar semantic patterns.

Supplementary Material

Supporting Information

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1013452107/-/DCSupplemental.

*Of 13,008 research articles published during this five-year period, 12,036 or 92.53% were in the Biological Sciences.

A multinomial distribution quantifies the intuition that words (or references) occur at each position in an abstract (or a bibliography) with different probabilities. The data suggest which words and references are most popular in articles that express each latent category.

A symmetric Dirichlet distribution quantifies the intuition that an article tends to belong to a few latent categories when α < 1. As α grows above 1, an article belongs to more and more latent categories. The data suggest that the estimated α is less than 1 for articles in the Biological Sciences, implying that each research article covers only a few scientific areas.

§Technically, the held-out probability is a variational lower bound on the likelihood of the held-out documents, as we detail in SI Text.

We show only the 13 most frequently used disciplinary categories here, but we provide the complete figure in SI Text.

References

  • 1. Shiffrin RM, Börner K. Mapping knowledge domains. Proc Natl Acad Sci USA. 2004;101:5183–5185. doi: 10.1073/pnas.0307852100.
  • 2. Erosheva EA, Fienberg SE, Lafferty J. Mixed-membership models of scientific publications. Proc Natl Acad Sci USA. 2004;101:5220–5227. doi: 10.1073/pnas.0307760101.
  • 3. Griffiths TL, Steyvers M. Finding scientific topics. Proc Natl Acad Sci USA. 2004;101:5228–5235. doi: 10.1073/pnas.0307752101.
  • 4. Erosheva EA, Fienberg SE. In: Classification—The Ubiquitous Challenge. Weihs C, Gaul W, editors. Berlin: Springer; 2005. pp. 11–26.
  • 5. Jordan MI, Ghahramani Z, Jaakkola T, Saul L. Introduction to variational methods for graphical models. Mach Learn. 1999;37:183–233.
  • 6. Airoldi EM. Getting started in probabilistic graphical models. PLoS Comput Biol. 2007;3(12):e252. doi: 10.1371/journal.pcbi.0030252.
  • 7. Erosheva EA, Fienberg SE, Joutard C. Describing disability through individual-level mixture models for multivariate binary data. Ann Appl Stat. 2007;1:502–537. doi: 10.1214/07-aoas126.
  • 8. Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001.
  • 9. Ferguson TS. A Bayesian analysis of some nonparametric problems. Ann Stat. 1973;1:209–230.
  • 10. Neal R. Markov chain sampling methods for Dirichlet process mixture models. J Comput Graph Stat. 2000;9:249–265.
  • 11. Schwarz GE. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
  • 12. Blei DM, Lafferty JD. Dynamic topic models. In: Cohen WW, Moore A, editors. Proceedings of the Twenty-Third International Conference on Machine Learning; Pittsburgh: ACM; 2006. pp. 113–120.
  • 13. Airoldi EM, Blei DM, Fienberg SE, Xing EP. Mixed membership stochastic blockmodels. J Mach Learn Res. 2008;9:1981–2014.
  • 14. Rosen-Zvi M, Chemudugunta C, Griffiths TL, Smyth P, Steyvers M. Learning author-topic models from text corpora. ACM T Inform Syst. 2010;28:1–38.
  • 15. Griffiths TL, Steyvers M, Tenenbaum J. Topics in semantic representation. Psychol Rev. 2007;114:211–244. doi: 10.1037/0033-295X.114.2.211.
  • 16. Escobar M, West M. Bayesian density estimation and inference using mixtures. J Am Stat Assoc. 1995;90:577–588.
  • 17. Griffiths TL, Ghahramani Z. Infinite latent feature models and the Indian buffet process. Technical Report GCNU-TR 2005–001. London: University College; 2005.
  • 18. Griffin J, Steel M. Order-based dependent Dirichlet processes. J Am Stat Assoc. 2006;101:179–194.
  • 19. Teh YW, Jordan MI, Beal MJ, Blei DM. Hierarchical Dirichlet processes. J Am Stat Assoc. 2006;101:1566–1581.
  • 20. Duan JA, Guindani M, Gelfand AE. Generalized spatial Dirichlet process models. Biometrika. 2007;94:809–825.
