Abstract
Annual meeting abstracts published by scientific societies often contain rich arrays of information that can be computationally mined and distilled to elucidate the state and dynamics of the subject field. We extracted and processed abstract data from the Society for Neuroscience (SFN) annual meeting abstracts during the period 2001–2006 in order to gain an objective view of contemporary neuroscience. An important first step in the process was the application of data cleaning and disambiguation methods to construct a unified database, since the data were too noisy to be of full utility in the raw form initially available. Using natural language processing, text mining, and other data analysis techniques, we then examined the demographics and structure of the scientific collaboration network, the dynamics of the field over time, major research trends, and the structure of the sources of research funding. Some interesting findings include a high geographical concentration of neuroscience research in the north eastern United States, a surprisingly large transient population (66% of the authors appear in only one out of the six studied years), the central role played by the study of neurodegenerative disorders in the neuroscience community, and an apparent growth of behavioral/systems neuroscience with a corresponding shrinkage of cellular/molecular neuroscience over the six year period. The results from this work will prove useful for scientists, policy makers, and funding agencies seeking to gain a complete and unbiased picture of the community structure and body of knowledge encapsulated by a specific scientific domain.
Introduction
Continuing exponential growth in the volume of science as measured by number of scientists or by publications has made it virtually impossible for individual researchers to keep track of the totality of knowledge and major progress areas in a research field using the traditional modes of scholarly research. This is individually frustrating for researchers not satisfied with exploring increasingly hyper-specialized niches, and also has negative implications for broader questions relating to the efficiency of the research enterprise and for science policy. Automated or semi-automated methods using natural language processing, applied to the scientific literature, provide a potential avenue to address this problem. Indeed, such bibliometric analysis forms the groundwork for search engines such as Google. However, most of the scientific literature exists behind a series of online firewalls which prevent efficient utilization of automated tools by the average researcher. Meeting abstracts published by scientific societies are often available freely in electronic form on the web or in the form of media distributed at annual meetings, and these form an attractive starting point for the construction and mining of knowledge bases about specific scientific domains. In particular, the annual meeting of the Society for Neuroscience (SFN) is a large-scale, international event that is arguably the most influential single meeting within the subject. The abstracts of presentation at this meeting are not peer-reviewed publications, but nonetheless, due to their volume and diversity, provide a unique global survey of the state of the subject of neuroscience each year.
Much recent work has used textual data in the form of abstracts or full-text publications in order to draw inferences about the structure of a research domain. Many such efforts have been focused on citation analysis [e.g. 1], including reputational indices such as highly cited papers [2]–[4] or scientist and journal impact factors [5]–[7]. An additional area of large concentration has been in the extraction of graphs or networks representing scientific collaborations or co-citations, and analysis of the overall statistical properties of these networks [8]–[12]. To a lesser extent researchers have worked on the problem of visualizing complex knowledge spaces through the creation of visual maps [13]–[17] and literature navigation tools [18], [19]. In contrast, there has been relatively less work on community demographics, the dynamics of fields over time to examine major research trends, or the structure of the sources of research funding. In this paper we examined each of these areas in order to obtain a broad and objective overview of contemporary neuroscience research. We believe that the volume and breadth of topics covered in SFN abstracts is likely to give a more complete and objective view of the field than would abstracts or articles from a single journal. Furthermore, no such survey of neuroscience is currently available from previous efforts, which have focused either on scientific research more generally or on other non-neuroscientific domains. Although our analyses were applied to the neuroscience community, the methodologies presented here are well suited for constructing knowledge bases and mining information about any scientific communities or social network.
We extracted and processed data from the annual SFN meeting planners to build databases of SFN abstracts and their authors. Maintaining an accurate count of the total number of authors was a challenging task complicated by two types of ambiguities: (1) different authors may share the same name and initials, and (2) the same author may use a different number of initials in different abstracts. In this study, we used a combination of string matching, entity matching, and co-authorship patterns to disambiguate unique authors (see Materials and Methods for details of these processes). We created one database for each year between 2001 and 2006, as well as a consolidated database encompassing authors and abstracts from all 6 years. The information contained in these databases allowed us to perform a variety of analyses to elucidate the structure and evolution of the neuroscience landscape.
The remainder of this paper is organized as follows. First, we present the geographical distribution of the SFN authors, followed by basic statistics and demographics of the SFN annual meetings. We constructed a graph of co-authors on abstracts and applied graph theoretic algorithms to investigate patterns of connection and communication between neuroscientists. Next, we used computational techniques in natural language processing to cluster the abstracts into neuroscience topics and studied their dynamics and concordance of these discovered topic clusters with the thematic organization provided by the SFN. Finally, we studied the distribution of funding inferred from the abstract database across these topics.
Results and Discussion
1. Geographical Distribution of SFN Abstract Authors
To explore the geographical distribution and dynamics of neuroscience research, the city, state (for US and Canada), and country of each author's home institution were extracted. The number of authors associated with each unique location was then tabulated for each year between 2001 and 2006. Table 1 shows the top 10 cities with the highest SFN representation during this time frame. Based on these data, the global “hubs” for neuroscience research seem to be concentrated in the northeast region of the United States (Boston, New York, Philadelphia, Baltimore/DC vicinity), Southern California, Tokyo, Montreal, and London. These representations remained fairly static over the years, indicating the stable presence of prominent and well-funded neuroscience research centers in these regions. From Figure 1, it is evident that New York City consistently ranks as the top producer of neuroscience research as measured by the number of authors who submitted abstracts. This finding signifies the number and caliber of academic institutions, research centers, and hospitals in the New York metropolitan area but is not surprising given the city's population.
Table 1. Top 10 cities for SFN representation in terms of raw number of abstract authors between 2001 and 2006.
2001 (San Diego) | 2002 (Orlando) | 2003 (New Orleans) | 2004 (San Diego) | 2005 (Wash. DC) | 2006 (Atlanta) |
New York (0.014) | New York (0.013) | New York (0.016) | New York (0.016) | New York (0.014) | New York (0.014) |
Boston (0.150) | Baltimore (0.137) | Bethesda (-) | Los Angeles (0.024) | Baltimore (0.139) | Baltimore (0.133) |
Baltimore (0.131) | Bethesda (-) | Baltimore (0.132) | Boston (0.164) | Bethesda (-) | Boston (0.134) |
Los Angeles (0.021) | Boston (0.129) | Boston (0.148) | Bethesda (-) | Boston (0.146) | Bethesda (-) |
Bethesda (-) | Los Angeles (0.017) | Los Angeles (0.020) | La Jolla (-) | Los Angeles (0.016) | Los Angeles (0.017) |
La Jolla (-) | Tokyo (0.006) | La Jolla (-) | Baltimore (0.124) | Philadelphia (0.039) | Chicago (0.023) |
Tokyo (0.007) | Chicago (0.018) | Chicago (0.021) | Philadelphia (0.042) | Chicago (0.019) | Atlanta (0.138) |
London (0.007) | Philadelphia (0.034) | Philadelphia (0.040) | Chicago (0.020) | La Jolla (-) | Philadelphia (0.039) |
Montreal (0.015) | La Jolla (-) | London (0.006) | Tokyo (0.007) | Atlanta (0.106) | Tokyo (0.006) |
Chicago (0.019) | Montreal (0.012) | Tokyo (0.006) | Pittsburgh (0.159) | Tokyo (0.006) | La Jolla (-) |
The meeting location for each year is highlighted in parenthesis in the first row. For each city in the table, the corresponding per capita participation (in percentage) is also included in parenthesis, if available.
To examine geographical areas that were disproportionately represented at SFN meetings, we computed per capita participation for large cities using population census data from the United Nations Statistics Division (see Materials and Methods). The per capita attendance for the largest contributing cities is shown in parentheses in Table 1. Furthermore, Table 2 shows the 10 cities with the highest per capita participation between 2001 and 2006. Again, the representations remained fairly static over the years we analyzed. It should be noted here that, because population data were not available on a yearly basis, we have assumed that each city's population was constant over the six-year period and equal to the latest figure available. We believe that this assumption is acceptable for the short time period analyzed here. Not surprisingly, many of the cities with high per capita participation are relatively small and home to prominent academic institutions. In addition, Boston and Baltimore appear to be particularly invested in neuroscience research as they rank high in both the raw number of authors and per capita participation from the SFN meetings.
Table 2. Top 10 cities for per-capita SFN representation between 2001 and 2006.
2001 (San Diego) | 2002 (Orlando) | 2003 (New Orleans) | 2004 (San Diego) | 2005 (Wash. DC) | 2006 (Atlanta) |
Ann Arbor, MI, USA | Cambridge, MA, USA | Cambridge, MA, USA | Cambridge, MA, USA | Ann Arbor, MI, USA | Cambridge, MA, USA |
New Haven, CT, USA | Ann Arbor, MI, USA | New Haven, CT, USA | Ann Arbor, MI, USA | Cambridge, MA, USA | Ann Arbor, MI, USA |
Cambridge, MA, USA | New Haven, CT, USA | Ann Arbor, MI, USA | New Haven, CT, USA | New Haven, CT, USA | New Haven, CT, USA |
Gainesville, FL, USA | Gainesville, FL, USA | Gainesville, FL, USA | Gainesville, FL, USA | Gainesville, FL, USA | Gainesville, FL, USA |
Cambridge, UK | Irvine, CA, USA | Irvine, CA, USA | Irvine, CA, USA | Boston, MA, USA | Irvine, CA, USA |
Irvine, CA, USA | Baltimore, MD, USA | Pittsburgh, PA, USA | Boston, MA, USA | Irvine, CA, USA | Pittsburgh, PA, USA |
Boston, MA, USA | Pittsburgh, PA, USA | Boston, MA, USA | Berkeley, CA, USA | Berkeley, CA, USA | Atlanta, GA, USA |
Pittsburgh, PA, USA | Durham, NC, USA | Durham, NC, USA | Pittsburgh, PA, USA | Baltimore, MD, USA | Durham, NC, USA |
Baltimore, MD, USA | Boston, MA, USA | Berkeley, CA, USA | Cambridge, UK | Durham, NC, USA | Boston, MA, USA |
Oxford, UK | Berkeley, CA, USA | Baltimore, MD, USA | Charleston, SC, USA | Pittsburgh, PA, USA | Baltimore, MD, USA |
The meeting location for each year is highlighted in parenthesis in the first row.
To determine whether the location in which the SFN annual meeting took place had any significant effect on the number of participating authors from the nearby region, we calculated the change in the fraction of all authors attending the meeting who were from within 100, 300, and 500 mile radii of the event location in the year of each meeting relative to the fraction of all authors who came from the same areas in years in which the meeting was held elsewhere (see Materials and Methods for detailed methodology). Figure 2 shows that the meeting location had a moderate effect on the fraction of participating authors from the surrounding area, but this effect varied from year to year. The increase in nearby contributors was minimal (less than 20%) for the 2001, 2002, and 2005 meetings, which were held in San Diego, Orlando, and Washington, D.C., respectively. This is in contrast to a somewhat larger change for the 2004 meeting in San Diego, and for the 2003 meeting in New Orleans and 2006 meeting in Atlanta, which resulted in a considerable surge of local scientists who submitted their abstracts.
The top five countries represented in the SFN annual meetings between 2001 and 2006 were: USA (56.6%), Japan (7.3%), Canada (5.2%), Germany (5.05%), and the United Kingdom (4.5%). It is interesting to compare the cities and countries with large neuroscience communities with historical and modern statistics about the geographical distribution of scientific research in general. For example, the top ten cities in terms of scientific publications in 1967 [Table 7.2 in 20] were, in descending order, Moscow, London, New York, Paris, Tokyo, Washington, Boston, Philadelphia, Chicago, and St Petersburg (Leningrad). At the country level, the leading producers of worldwide science and engineering articles in 2003 were EU-15 (31.5%), USA (30.3%) and Japan (8.6%) [21]. For comparison, the EU-15 nations together contributed 20.2% of SFN abstracts from 2001–2006. The United States, thus, appears to play an exaggerated role in neuroscience compared to all of science and engineering, at least as measured by representation at the SFN meetings.
The advent of web mapping technologies such as Google Maps (http://maps.google.com) and Yahoo Maps (http://maps.yahoo.com) provides capabilities to generate, visualize, and navigate high quality geographical maps on the World Wide Web. In order to visualize the geographical distribution of the home institutions of abstract authors on a map, the latitude and longitude of each author's location as extracted from the abstracts were fetched using Yahoo's GeoCode Web Service (http://developer.yahoo.com/maps/rest/V1/geocode.html). The quantitative distribution of these geographical data can then be plotted on different map templates using the application programming interface (API) provided by the mapping engine. For example, Figure 3 shows the geographical distribution of the raw number of SFN abstract authors for 2006 on a Google Map (generated by www.gpsvisualizer.com).
2. Basic Statistics and Demographics
A number of simple but informative measures that describe authorship and meeting attendance patterns could be easily calculated from the abstracts database. The accuracy of such measures depends on the accuracy of the database itself, the construction of which had to meet challenges such as the disambiguation of individual authors. The upper bound for the total number of authors in the six-year database is 197429; this number was obtained by parsing the data from SFN abstracts without applying any of the disambiguation or entity matching schemes described in the Materials and Methods section. The number of unique author names in the database was 99410, which represents the lower bound for total author count. After applying the disambiguation strategies to our database, the total number of authors was reduced by approximately 35% to 128553. This final tally falls between the upper and lower bounds and gives a reasonable estimate of the true number of unique authors in the database. The problem of author disambiguation could be avoided if, for example, each author was assigned a unique identifier at the time of submission of his or her first abstract; then future submissions could be associated with this identifier, removing ambiguity (see e.g. the “WikiAuthors” project at http://meta.wikimedia.org/wiki/WikiAuthors).
Between 2001 and 2006, the average number of abstracts per author was 2.93, and the average number of authors per abstract was 4.31. Looking at the statistics on a year by year basis (Table 3), it is apparent that the number of abstracts per author, number of authors per abstract, and average number of collaborators in any given year remained roughly constant during the six year span. This suggests that the neuroscience community produces research results at a relatively constant rate and that most research projects in the field are conducted by a small to moderately sized team of scientists. The average number of authors on Science and Engineering articles worldwide in 2003 was reported to be 4.22 and the corresponding number for the United States was 4.42 [21], suggesting that the team sizes represented in SFN abstracts are consistent with other areas of science.
Table 3. Basic statistics of SFN data for the 6-year period between 2001 and 2006.
Year | Number of Authors | Number of Abstracts | Avg. Abstracts Per Author | Avg. Authors Per Abstract | Avg. Num. of Collaborators |
2001 | 42318 | 15340 | 1.55 | 4.28 | 5.82 |
2002 | 37129 | 13307 | 1.53 | 4.21 | 5.51 |
2003 | 41349 | 15261 | 1.58 | 4.29 | 5.90 |
2004 | 43853 | 15987 | 1.59 | 4.37 | 6.09 |
2005 | 39622 | 13669 | 1.50 | 4.35 | 5.88 |
2006 | 39645 | 13979 | 1.54 | 4.33 | 5.96 |
2001–2006 | 128553 | 87543 | 2.93 | 4.31 | 8.62 |
To further elucidate the collaboration patterns of neuroscientists, we plotted the histograms of the number of co-authors for abstracts and the number of abstracts submitted by authors. As highlighted in Figure 4A, most SFN meeting abstracts contain two to five authors. Very few abstracts are associated with only one author or more than 10 authors. This may again imply that most research projects in neuroscience are carried out by a few scientists instead of large teams of people.
Figure 4B shows the histograms of the number of abstracts associated with each author. The majority of the authors (∼60%) had only one abstract over the span of six years. This number may reflect a large group of “transients” comprising mostly undergraduates, graduate students, and perhaps post-docs who entered and exited the neuroscience field in a short period of time. The transient population was generally “mixed” with a more permanent population because individual abstracts often contained authors from both sub-populations. Given the increasingly blurred boundaries between different disciplines of biomedical sciences, it is possible that many of these scientists simply shifted their focus to a different aspect of biomedical research, i.e. from cellular neuroscience to genomics, or from cognitive neuroscience to psychology. The histograms also highlight a few individuals who are associated with very large numbers of abstracts (some have over 100).
In Figure 5A, we plotted the histograms of the number of years in which authors were represented between 2001 and 2006. As the figure shows, approximately 60% of the authors made presence in only one SFN meeting within the six-year period. Again, we speculated that this high turnover rate is the direct manifestation of many transients who entered and exited the field in a relatively short time frame. The phenomena of a high transient rate, reflecting a sort of “infant mortality rate” for first time authors was first analyzed by Price [20], who estimated a 22% transient rate for paper authorship from a database consisting of a statistical sample of papers published between 1964 and 1970. Although we do not pursue it in detail, it should be straightforward to extend or implement Price's model of transients and continuants to the SFN abstracts database, particularly if data from a longer period of time becomes available.
To correlate these data with the demographics of actual SFN meeting attendance, we downloaded the annual meeting attendance statistics from 1971 to 2006 from the SFN website (http://www.sfn.org). These data are plotted in Figure 5B using a base 2 logarithm. The SFN meeting attendance has shown an overall slowing growth rate in the past 3 decades. As evident from the graph, the first doubling took approximately 5 years. The next two doublings occurred at a steady exponential rate between 1975 and 1995, with a doubling period of about 8 years. The growth slowed after 1995 and the current doubling rate is projected to be about 15 years.
What are the causes of the exponential growth, and what is causing the rates to slow down? To put the numbers in perspective, the number of life scientists employed in the Science and Technology workforce in the US for the years 1970, 1980, 1990 and 2000 were 55, 102, 139 and 226 (in thousands). These numbers also show an initial doubling period of 10 years and a subsequent slowing. The exponential increase in the number of scientists and scientific publication over the last three centuries has been studied systematically [20]. Interestingly, Price's estimate of the doubling times of 10–15 years is consistent with the estimated growth of SFN meeting attendees over the last three decades. However, this growth has also slowed down and may continue to fall further. It is interesting to speculate about what is slowing the growth in meeting attendance. A number of limitations come to mind: a reflection of an overall slowdown in growth of the science and technology workforce in general or of biomedical scientists in particular, perhaps due to saturating funding rates; maturation of the research field and a shift in scientific talent to other growth areas such as information technology, or perhaps non-scientific factors such as capacity of the convention center and the number of hotel rooms in the cities where the meetings are held.
To determine if the participation level of the annual SFN meetings might be linked to the amount of funding available to the scientists, we plotted in Figure 5C the budgets for National Institute of Health (NIH), one of the largest funding agencies for biomedical sciences, from 1976 to 2006 (Source: Historical Table 2: R&D by agency, AAAS website: http://www.aaas.org/spp/rd/guihist.htm). Although the NIH budget has grown steadily during this period, there does not seem to be a detailed correlation between NIH funding and SFN meeting attendance. In fact, the growth in meeting attendance slowed down precisely when the NIH budget was doubled from 1995 to 2005.
Exponential growths do not continue forever, and the increase in the number of SFN attendees is no exception. Price has pointed out that a doubling time of 10–15 years is much faster than the doubling time of the human population (which is currently around 50 years, and slowing), and has predicted a period of transition to a steady state where the number of scientists per capita reaches a stable value. In Price's estimate, we are either at the inflection point of the corresponding logistic curve, or have passed it already. It is to be noted that the percentage of the gross national product devoted to R&D in developed nations has remained steady between 2–3% since the 1970's [22], and other subject areas in science such as physics or electrical engineering also showed sharp growth followed by saturation within recent history. Unfortunately, despite such historical data and exhortations by Price and others about the necessity to manage the transition from rapid exponential growth to slower growth or a relatively steady state, there is little evidence for forward planning by the biomedical community in trying to manage the coming demographic transition by practicing stricter scientific “birth control” [23]. Absent such planning, the danger is that Malthusian factors will make the transition significantly more painful than necessary.
3. Analysis of Co-authorship Graphs
More detailed inferences about authorship patterns and the structure of the neuroscience community as a whole can be inferred from an analysis of a collaboration or co-authorship network [e.g. 10]. A co-author graph, G: = (V, E), was constructed from the preprocessed database by representing each author as a vertex on a graph, v∈V. Two authors were connected by an undirected edge, e∈E, if they have co-authored at least one abstract in the database. Matrix representations of the graph can then be used to analyze the structure of the underlying community. In addition, by integrating the data with a graph visualization package, such as Graphviz (www.graphviz.org) or JUNG (jung.sourceforge.net), one can visualize, explore, and navigate the network interactively.
A fundamental measure used in graph theory is the shortest path between a pair of connected vertices. In the context of the network under study, this measures the number of steps it takes to go from one author to another through intermediate collaborators. From the multi-year SFN database, the lengths of shortest paths between all pairs of authors for whom a connection exists were calculated exhaustively using a breadth-first search algorithm. These numbers were then averaged to yield the mean distance between authors in the entire network. Table 4 shows that the authors in the SFN community are separated from one another by an average distance of 6.09. A similar observation of “six degree of separation” has been reported previously for abstracts in the MEDLINE database [10], suggesting that neuroscience and the greater biomedical science community share similar connection patterns. The diameter of the graph, or the maximum distance between any two authors in the network for whom a connection exists, is 20, which also closely matches the result from Newman's MEDLINE analysis.
Table 4. Some graph analysis results for multi-year SFN data.
Average Distance | 6.09 |
Graph Diameter | 20 |
Mean Clustering Coefficient | 0.7724 |
Number of Connected Components | 2650 |
Size of Largest Connected Component | 116716 |
As a percentage | 90.79% |
Size of Second Largest Connected Component | 56 |
As a percentage | 0.0436% |
We also computed the clustering coefficient, which provides a measure of cliquishness [24]. Suppose that a vertex v in a graph has kv neighbors; then at most kv(kv-1)/2 edges can exist between them (this occurs when every neighbor of v is connected to every other neighbor of v). Let Cv denote the fraction of possible edges for the neighborhood around v that actually exist. The clustering coefficient of a graph is the average of Cv for all v. The mean clustering coefficient for the SFN network between 2001 and 2006 is 0.7724. In other words, two authors in the network have a 77.24% or greater probability of being collaborators if they have both collaborated with a third author.
A large sparse graph such as the one created from the SFN database may not be connected (i.e. there may not exist a path from each vertex to every other vertex in the graph). Finding the set of individual connected components in the graph may provide another insight into community structure. The SFN co-author graph for 2001–2006 was found to contain 2650 connected components (Table 4). Most authors belong to a single large connected component which comprises more than 90% of the entire network. The remaining connected components in the graph are significantly smaller, each accounting for less than 1% of the vertices of the entire graph. Some of these small connected components represent research groups from pharmaceutical companies or other commercial entities, while some others belong to laboratories from countries with a relatively low SFN presence.
Another interesting aspect of the graph is the relative importance of each vertex as measured by the betweenness centrality of the vertex [25], [26]. The betweenness centrality for a given vertex BC(v) is defined as:
where σst is the number of shortest paths between s∈V and t∈V, and σst(v) is the number of shortest paths between s and t that pass through v. In other words, betweenness centrality measures the frequency with which a vertex falls on one of the shortest paths between any other pair of vertices in the graph.
Vertices with large betweenness have more influence over the information flow in the graph and can thus be considered to represent authors playing central roles in the SFN co-author network. Analysis of the multi-year SFN data revealed that only a few individuals in the network have disproportionately large betweenness centrality measures (Figure 6A). In addition, Figure 6B shows that on average the distribution of the betweenness centrality of an author and his/her number of abstracts closely follow a power law. However, the authors possessing the largest betweenness centrality, and thus the most influence over the network, were not necessarily associated with the largest number of abstracts. To better elucidate the roles of these brokering members of the SFN network, the research profiles of these individuals were located from the World Wide Web and qualitatively assessed. Most of the authors with high betweenness centrality conduct research in the field of neurodegenerative diseases, such as Alzheimer's disease (AD) and Parkinson's disease (PD). Research related to AD, PD, and other neurodegenerative diseases is highly multidisciplinary in nature, and scientists engaging in this type of research will likely employ techniques and methodologies spanning multiple different sub-disciplines of neuroscience and other biomedical sciences, which might explain the high values of betweenness centrality. Another possible reason is the comparatively high funding rates for neurodegenerative disorders (discussed in Section 5).
4. Topic Modeling
The sheer number and diversity of the annual SFN meeting attendees indicate that the text corpora from the abstracts provide an illustrative view of the current state and dynamics of the neuroscience research landscape. One can perform a variety of text mining and natural language processing (NLP) techniques to exploit topic information from the syntaxes and semantics of the text corpora. The information gained from topic modeling can be used to classify abstracts into different categories, chart the rise and fall of research topics over time, measure the popularity of specific fields, and facilitate document retrieval.
We explored the utility of Latent Semantic Analysis (LSA) [27], [28] to describe the topic space spanned by the SFN abstract set. Briefly, LSA is a dimensionality reduction technique that projects terms and documents (abstracts) into a lower dimensional space. The reduced dimensionality vector space captures most of the important underlying structure in the association of terms and documents, while at the same time removing the noise or variability in word usage [29]. In the reduced vector space, terms that occur in similar documents are located near one another even if they never co-occur in the same document, and topically related documents are grouped near one another based on their semantic relatedness.
Figure 7 shows the projections of the terms used in SFN abstracts in a reduced two-dimensional vector space. The terms with the highest frequencies of occurrence are labeled. It can be seen from the figure that some terms representing similar concepts are located near one another in this reduced vector space. For example, many terms on the left side of the figure are related to sensory and motor systems (“task”, “stimulus”, “movement”, “visual”), terms at the bottom of the figure are related to cellular neuroscience (“potential”, “current”, “axon”, “channel”, “synaptic”, “neuron”), and many terms on the right side of the figure are related to molecular biology (“protein”, “gene”, “regulatory”, “bind”, “express”, “pathway”). This representation provides a map of the topic space in neuroscience, but does not reveal a tremendous amount of apparent structure. To try to better understand the structure of the topic space, we thus employed an additional strategy to uncover topic clusters.
4.1 Topic Clusters
After LSA was performed using 100 dimensions, we constructed a new sparse graph defined across abstracts, where the edges between abstracts were weighted by abstracts' cosine similarity (see Materials and Methods). We then applied the Normalized Cuts (NCuts) algorithm [30] to automatically partition this graph, and thus cluster the abstracts into different topic groups. The number of topic clusters was chosen by evaluating the concordance between the topic classification found using NCuts to the eight SFN theme labels (Table 5) which were available in the database for a subset of abstracts. Concordance was measured using the Adjusted Rand Index [31], which quantifies the agreement between two data partitions. The number of topic clusters that maximized concordance was found to be 10 (Refer to the Materials and Methods section for detailed descriptions of these algorithms).
Table 5. Themes used by SFN to categorize abstracts submitted for the 2006 meeting.
Theme A | Development |
Theme B | Neural Excitability, Synapses, and Glia: Cellular Mechanisms |
Theme C | Sensory and Motor Systems |
Theme D | Homeostatic and Neuroendocrine System |
Theme E | Cognition and Behavior |
Theme F | Disorders of the Nervous System |
Theme G | Techniques in Neuroscience |
Theme H | History and Teaching of Neuroscience |
To understand the content of the resulting topic clusters, we found the 20 most frequent words used in each cluster. The lists of frequent words, along with the complete collections of the abstracts, were also distributed to laboratory members working in neuroscience for subjective labeling. Among the 10 topic clusters, half of them were readily identified for their distinct and coherent themes. For example, all abstracts in Cluster 3 deal with research in songbirds. Abstracts in Cluster 6 frequently contain such words as “amyloid beta”, “abeta”, “tau protein”, and other terms relevant to Alzheimer's disease. Cluster 7 is distinct from all other clusters in that it contains mostly education and informatics related work. Cluster 8 groups together abstracts related to biological rhythms, which is evident from the abundance of the following words: “circadian”, “melatonin”, “clock”, “phase”, and “suprachiasmatic nucleus” or “SCN”. Finally, Cluster 10 contains mostly abstracts dealing with the structures and mechanisms of sleep. The remaining 5 clusters, which tend to be larger in size, were not as readily identifiable and required more thorough investigation of the abstracts themselves. Table 6 shows cluster sizes, lists of frequent words, and the labels qualitatively assigned to each cluster. For illustrative purpose, only the 7 most distinguishing words taken from each cluster's list of 20 most frequent words are shown. Complete lists of the 20 most frequent words for each cluster are available as supplementary materials in Table S1.
Table 6. The 10 clusters produced by the NCuts algorithm performed on the nearest-neighbor graph from year 2001–2006 (see Materials and Methods).
Cluster ID | Size of Cluster | Topic Label | Most Frequently Used Words |
1 | 16729 | Substance Abuse & Addiction | BEHAVIOR, LEVEL, COCAINE, DOSE, DRUG, INJECT, TREATMENT |
2 | 22647 | Cellular Neuroscience | SYNAPTIC, PROTEIN, CURRENT, CHANNEL, POTENTIAL, DENDRITIC, SUBUNIT |
3 | 492 | Behavior of Song Birds | SONG, HVC, BIRD, VOCAL, AUDITORY, FINCH, SING |
4 | 7210 | Pain & Trauma | SPINAL, PAIN, RECEPTOR, MUSCLE, INJURIES, DORSAL, MORPHINE |
5 | 19988 | Proteins, Gene Expression & Molecular Biology | CELL, NEURON, EXPRESS, ACTIVE, BRAIN, GENE, RECEPTOR |
6 | 3609 | Alzheimer's Disease | AD, AMYLOID, TAU, ALZHEIMER, PEPTIDE, PLAQUE, ABETA |
7 | 736 | Education & Informatics | STUDENT, DATA, LEARN, PROGRAM, MODEL, SCHOOL, INFORMATICS |
8 | 794 | Biological Rhythms | CIRCADIAN, SCN, LIGHT, RHYTHM, PHASE, CLOCK, CYCLE |
9 | 14192 | Visual & Motor Systems | RESPONSE, TASK, VISUAL, SUBJECT, CORTEX, MOVEMENT, STIMULUS |
10 | 1146 | Sleep | SLEEP, WAKE, REM, EEG, DEPRIVATION, PERIOD, WAVE |
Abbreviations: HVC = “High Vocal Center”; AD = “Alzheimer's Disease”; REM = “Rapid Eye Movement”; SCN = Suprachiasmatic Nucleus”.
The third column of the table shows the subjective topic label assigned by domain experts to each cluster. The last column shows the 7 most distinguishing words found in the 20 most frequently used words in each cluster.
To visualize the 10 topic clusters on a high level “conceptual map”, the abstracts from all six years were plotted as points in a 2D space formed by the 2nd and 3rd smallest eigenvectors of the graph Laplacian defined on the abstract similarity graph. The projection of each abstract as a point in this space was color coded based on the topic cluster to which it belongs. The resulting topic map is presented in Figure 8. In this representation we are able to see a large degree of separation of the topic clusters, while also revealing the analog within-category variation in abstract similarity. That is, abstracts that appear as points in close proximity to one another are likely to be more similar than those that are more distant.
4.2 Concordance with SFN Themes
While the Adjusted Rand Index provides a global measure of similarity between partitions, the individual abstract clusters derived from NCuts partitioning can also be compared pairwise with the SFN theme clusters. A concordance matrix, C, between the two classification systems was constructed in which each element Cij indicates the number of abstracts from 2006 that belonged to cluster i and theme j (i = [1..10], j = [A..H]).
Figure 9A shows the relative distribution of abstracts in each cluster across the SFN themes, after dividing each matrix element Cij by the total number of abstracts in cluster i. Thus these matrix entries represent the proportion of abstracts from cluster i that are classified as theme j. Some observations of good concordance can be made:
Most of the abstracts from Cluster 7 (“Education and Informatics”) are labeled as Theme G (Techniques in Neuroscience) or Theme H (History and Teaching of Neuroscience), with a higher percentage in the latter.
Cluster 6, which represents Alzheimer's disease, is almost wholly contained in Theme F (Disorders of the Nervous System).
Cluster 3, which corresponds to behavior of song birds, is mostly captured by Theme E (Cognition and Behavior).
There is fairly good concordance between Cluster 4, which represents topics related to pain and trauma, and Theme C (Sensory and Motor Systems).
Good concordance is also observed between Cluster 8 (“Biological Rhythms”) and Theme D (Homeostatic and Neuroendocrine Systems).
Similarly, by dividing each matrix element Cij by the total number of abstracts in theme j, the resulting matrix (Figure 9B) gives the proportion of abstracts from theme j that are classified as cluster i. There are some interesting observations as well:
Theme H (History and Teaching of Neuroscience) is almost entirely contained in Cluster 7.
Theme G (Techniques in Neuroscience) is spread between Cluster 2 (“Cellular Neuroscience”) and Cluster 9 (“Visual and Motor Systems”). This illustrates that while SFN groups together techniques used in kinematics, imaging, and cellular neuroscience, unsupervised clustering classified these abstracts according to their target applications.
There is very good concordance between Theme B (Neural Excitability, Synapses, and Glia: Cellular Mechanisms) and Cluster 2.
Many abstracts from Theme D (Homeostatic and Neuroendocrine Systems) belong to Cluster 1 (“Substance Abuse and Addiction”), suggesting that mechanisms of addiction to various psychoactive substances (i.e. alcohol, tobacco, drugs) are important elements of homeostatic and neuroendocrine research.
4.3 Dynamics of Topics
Analyzing the dynamics of scientific topics provides interesting insights into the rise and fall of different research subjects and methodologies. The amount of scientific interest generated by different topics has both sociological and economical implications, and tracking their changes can potentially prove useful for policy making, research planning, and funding allocation. Since the topic clustering performed in the previous section was applied to a corpus of abstracts spanning 6 years, it is straightforward to study short-term trends in neuroscience research by examining how the distribution of abstracts across the topic clusters changes from year to year. Detailed descriptions of our methodology are outlined in Materials and Methods. There have been several previous efforts to measure topic dynamics using text corpora [32]–[36]. These have often been limited to comparing the frequency of occurrence of particular keywords in one time period to another (although see [32], [34]). We examined the time series of abstracts contributed by topic and of word-frequency over the full six year period, giving a more detailed view of temporal dynamics than is offered by a simple comparison of pairs of temporal windows.
Among the 10 topic clusters, Cluster 9, which corresponds to visual and motor systems, is shown to have consistently increased in representation over the six year span (Figure 10A). On the other hand, Cluster 2, which corresponds to cellular neuroscience, exhibits the most significant decrease in representation over the same period (Figure 10B). These results suggest that there is a shift in general scientific interest from cellular-level work such as ion channel, synapse, and membrane physiology, towards more system level research incorporating such topics as vision, kinematics, motor processing, and imaging. We speculated this trend is reflective of the heavy reliance of neuroscience research on animal models and invasive techniques. The use of animal model systems continues to be the most prevalent way of studying the pathophysiologic mechanisms of neurodegenerative diseases, which is an area that is both well funded and well represented in the SFN abstracts database. This may explain the rise of macro-level study in favor of cell-based and molecular techniques. In addition, neuroimaging technologies have in recent years become indispensable tools in various aspects of neuroscience research. It is therefore not surprising to observe a surge of activities related to this subject matter.
In addition to charting the temporal changes in the distribution of abstracts across topic clusters, we also performed analysis of word frequency dynamics using principal components analysis (see Materials and Methods). The results indicated that a large fraction of the changes could be accounted for by a nearly linear component in time, which intuitively corresponds simply to some words becoming more frequent and some becoming less frequent. The corresponding word-space vector was examined to see which words contributed to the increase and which to the decrease. Figure 11 (bottom) shows the 25 terms with the largest positive and negative projections on this component. These terms seem to roughly correspond to the domains of cellular neuroscience (decreasing) and systems neuroscience (increasing). This finding is consistent with the analysis of topic clustering dynamics (above), and appears to indicate a significant shift in the topics being addressed at the Society for Neuroscience conference between the years 2001 and 2006.
5. NIH Funding Analysis
The National Institutes of Health (NIH) is the largest funding agency for biomedical research in the world, currently investing over $28 billion each year for conducting and supporting medical research in the United States and around the world (from NIH website: http://www.nih.gov/about/budget.htm). The NIH is made up of 27 different institutes and centers, each of which manages research activities related to specific topics (see http://grants.nih.gov/grants/glossary.htm). Much of the research showcased in SFN meetings is supported completely or partially by the NIH institutes. The correspondence between research dollars allocated from individual NIH institute and topic clusters provides another interesting perspective of the current neuroscience landscape. As a caveat to this section, it should be noted that the derivation of the funding information from the abstracts is inferential, since no dollar figures are provided in the abstracts, and we did not make any attempt to fine tune our analysis to individual funding mechanisms but counted each listed grant equally. Nevertheless, no comparably comprehensive database of neuroscience funding is publicly available, and we considered it valuable to perform such inferential analysis.
We anticipated a correspondence between certain topic clusters and specific NIH institutes. For example, Figure 12A shows the NIH funding breakdown among the 8 themes created by SFN for the 2006 meeting abstracts. The majority of the work categorized as Theme A (“Disorders of the Nervous System) was supported by the National Institute of Neurological Disorders and Stroke (NINDS) and National Institute of Drug Abuse (NIDA). If we further explore the funding distributions among the subthemes of Theme F (Figure 12B), it is clear that neurodegenerative disorders and addiction and drugs of abuse indeed represent the majority of the work classified as Theme F. Applying the same analysis to the NCuts-derived topic clusters, one might expect to find many abstracts from Cluster 1 (subjectively labeled “Substance abuse and addiction”) to be supported by the National Institute of Drug Abuse (NIDA), and most of the work supported by the National Eye Institute (NEI) to be captured by Cluster 9 (“Visual and Motor Systems”).
The funding information associated with each abstract between 2001 and 2006 was parsed from the original XML data file. If the NIH was designated as one of the funding sources, the specific institute was determined from the two-letter organization code preceding the grant number. For abstracts supported by more than one grant, an appropriate fraction was assigned to each institute by dividing the number of grants from each institute by the total number of grants listed. It should be pointed out that not all abstracts provided support information, and not all of those that did provided a grant number. However, considering the size of the database, the result is likely to be representative of the overall funding breakdown among the institutes. The breakdown of funding across the topics derived from NCuts and the NIH institutes is illustrated in Figure 12C.
As an example of an inference that may be drawn from these visualizations, note that a large fraction of neuroscience research, both at the cellular and system level, is supported by NINDS. This observation is consistent with the expectation that, regardless of techniques or methodologies, one of the ultimate goals of many neuroscience investigations is to further the understanding of the causes, prevention, diagnostics, and treatment of various disorders of the nervous systems. If more detailed information can be extracted from the specific grants referenced, one might further break down NINDS funding among different types of neurological disorders. These types of information can be useful for research planning and analysis of the societal costs of neurological diseases.
There is good concordance between the funding distributions of several NIH institutes and our topic clusters. For example, most abstracts from Cluster 6, which corresponds to Alzheimer's disease (AD), are supported by the National Institute on Aging (NIA). Similarly, a significant portion of the abstracts funded by the NIA are from Cluster 6, suggesting that AD is probably the top neurological health priority for the aging population. As anticipated, another example of good concordance is the fact that most of the work supported by NEI is associated with Cluster 9, which encompasses visual and motor systems. Finally, it makes intuitive sense that NIDA and National Institute on Alcohol Abuse and Alcoholism (NIAAA) would apportion most resources to support works related to substance abuse and addiction, which is captured by Cluster 1.
6. Related Work – Computational Linguistics in Research
The application of several methods from the field of computational linguistics (CL) to the body of neuroscience abstracts described here has revealed a number of interesting perspectives on contemporary neuroscience. Currently few efforts have been undertaken to leverage such techniques to help neuroscientists in their scholarly research [for examples of such work, see 37,38], although the array of available methods continues to grow. Several fields within computational linguistics use topic modeling, clustering and large-scale visualization efforts to analyze text corpora of varying degrees of size. Typically these text collections are non-scientific (either using sources such as Wikipedia with over 2 million pages, large-scale crawls of the world-wide-web or newstext). The National Library of Medicine's MEDLINE corpus is the standard data of choice for biomedical text mining [39]. MEDLINE contains roughly 16 million documents and requires large-scale supercomputing methods to analyze using these methods [40, Personal Communication].
A number of techniques provide an alternative methodology to LSA for the analysis of topics and topic signatures (the associations between words within clusters) within text, these include the log-likelihood ratio [41], a variety of clustering methods [See 42 for one example], and Latent Dirichlet Allocation (LDA) [43]. One refinement of LDA uses Gibbs sampling as an efficient methodology to discover topics [40], [44]. The complexity of the data may be explored with advanced graph visualization techniques to assist the analysis [45]. Recent studies include analyses of the 20 years of abstracts from the Proceedings of the National Academy of Sciences (PNSA) [15], and from publications concerned with Melanoma research [46].
Unlike massive resources such as MEDLINE, the SFN annual meeting abstract data provides an ideal ‘laboratory’ for the use of these techniques on a small, focused document set in the service of a relatively small specific community. As a well-established method to investigate topics for our specific domain, we focused on the use of LSA to provide a clear high-level overview of the whole subject and to investigate detailed trends and issues concerning policy and the informational needs of neuroscientists. We envisage that the SFN abstracts can provide a valuable resource and application domain for the CL community since neuroscientists need efficient computational tools to assist them in their scholarly work.
Materials and Methods
Sources
The annual Society for Neuroscience (SFN) meeting abstracts from the years 2001 through 2006 were available as XML files on CD-ROMs during the annual meetings of the society. These XML files were parsed to extract tagged attributes associated with each abstract. Each of these attributes was further processed in order to extract specific types of data. For example, the XML files provide attributes corresponding to authors' full names; these attributes were tokenized in order to separate last name from first and middle initials. Similar processing was applied to institution affiliations in which department name, institution name, city, state (for US and Canada), and country are identified. Furthermore, each author was linked to her respective institution based on annotated superscript numbers supplied during abstract submission. The postprocessed data were added into persistent storage in a MySQL database. The database contains three entity tables: author, institution, and paper. Since each author can be affiliated with multiple institutions and can produce one or more papers, these entities are mapped using many-to-many relationships in the database. For this study, we created one database for each year between 2001 and 2006, as well as a consolidated database encompassing data from all 6 years.
Author Disambiguation
As is the case in many bibliographical resources, each author in an SFN abstract is identified by last name followed by one or more initials. Such an identification system is inherently ambiguous and can impact the quality zof the database as more abstracts are pooled from multiple years. Two types of name ambiguities were observed during the parsing process. The first type results from the same author using a different number of initials in different abstracts. For example, Partha Mitra from Cold Spring Harbor Laboratory has been identified as “Mitra, P.” and “Mitra, P. P.” in different abstracts. Because such inconsistencies could lead to falsely identifying the same author as two unique individuals, only the last name and first initial were compared by default. Middle initials were used if and only if the two author names being compared both contained a middle initial. The second type of ambiguity arises when different authors actually share the same name and initials (e.g. “Brown, S.” from the University of Tennessee in Memphis and “Brown, S.” from Columbia University). To resolve this scenario, authors were identified as different individuals if their affiliations were different, regardless of name identities. This heuristic, of course, assumes that no two authors sharing the same name work in the same department of an institution, which is reasonable given the nature and size of the SFN data.
The method employed to distinguish authors by straightforward comparison of institution strings inevitably results in a large number of duplicates. This is because institution entities usually have many name variants. Syntactic differences (“Memorial Sloan Kettering Cancer Center” and “Sloan Kettering Institute for Cancer Research”), the use of abbreviations or acronyms (“New York State Psychiatric Institute” and “NYS Psychiatric Institute”), and even misspellings (“University of Pittsburg” instead of “University of Pittsburgh”, and “Wilfred Laurier University” in Ontario, Canada instead of “Wilfrid Laurier University”) were present due to the lack of a controlled vocabulary in abstract submission. Given this situation, a strategy that relies on exact string matching might suffer from low recall [47]. This problem of determining whether different names refer to the same entity, or entity matching, has been addressed extensively in the field of information integration, and numerous solutions have been developed [48]. Here, the following procedure was used to resolve semantic ambiguities for institution entities:
Break all institution affiliations, which consist of department name, institution name, city, state (for US and Canada), and country, into “bags of words” (or tokens). Convert all words to upper case.
Remove “stop words” from the token sets. Stop words are words that do not carry any weight in distinguishing different named entities. The initial stop list was downloaded from the Cornell SMART project (ftp://ftp.cs.cornell.edu/pub/smart/english.stop), and was supplemented by institution specific stop words such as “college”, “clinic”, “center”, “laboratory”, “program”, “campus”, etc.
- Perform token based name matching using Jaccard similarity, which is defined as:
where S and T are token sets of two arbitrary strings s and t, respectively. Two institutions are considered identical if their Jaccard similarity is 1. This step resolves institution names with different word orders such as “Weill Medical College of Cornell University” and “Cornell University Weill Medical College”. Edit distance is used as a metric to resolve syntactic variations in institution names (e.g. “UC Berkeley” versus “University California Berkeley”, or “Mount Sinai” versus “Mt. Sinai”). The edit distance between strings s and t is the cost of the best sequence of edit operations that convert s to t [49]. If the distance between two names is less than a certain threshold, the two are considered aliases of the same entity and are thus merged into one representation.
In addition to institution entities, co-authorship patterns were also used to detect authors who moved between affiliations, further reducing duplicate author instances. For simplicity, authors who share the same name and have at least one common co-author were considered to be the same individual. The workflow for disambiguating and matching author entities is summarized in Figure 13.
Geographical Distribution of SFN Abstract Authors
For each annual meeting between 2001 and 2006, the city, state (for US and Canada), and country of each author's first institution were extracted, and the total number of authors from each city in each year was calculated. The longitude and latitude coordinates of each of these locations were then obtained from the Yahoo GeoCode Web Service (http://developer.yahoo.com/maps/rest/V1/geocode.html). Per capita participation for each major city was computed using population data from the United Nations (UN) Statistics Division (http://unstats.un.org/unsd/demographic/sconcerns/densurb/urban.aspx). The raw number of authors associated with each major city was normalized by its total population to obtain the per capita rate. The UN data include the latest available population census for capital cities and cities of 100,000 or more inhabitants, and thus the reference years vary for different countries. These populations were assumed to be approximately correct for each city over the entire six-year period.
To estimate the effect of the location of the annual meeting (available on http://www.sfn.org) on the number of contributing authors from nearby regions, we tabulated the number of participating authors whose address was within 100, 300, and 500 mile radii of the meeting location for each year between 2001 and 2006. The distance, d, in miles between two locations was calculated using the Great Circle Distance Formula:
where φ1, μ1 and φ2, μ2 are the latitude and longitude pairs (in radians) of the two geographical locations, and r≈3963 is the equatorial radius of the earth in miles. Let equal to the fraction of all contributing authors for year i who came from within an n mile radius of meeting location m. Then, the effect of meeting location m (in year i) on contributions by nearby authors, controlling for overall meeting attendance, was calculated as:
where indicates the average fraction of authors from the same n-mile radius around location m in all years k in which the meeting was not held at location m. Thus, this quantity gives the percent change in relative attendance from the area surrounding the meeting compared to the same area's relative attendance when the meeting was held elsewhere. Multiple n values are used to examine how the effect of proximity falls off as a function of distance from the meeting.
Graph Analysis
The breadth-first search algorithm used to calculate lengths of shortest paths between all pairs of authors for whom a connection exists was implemented in Perl. Refer to Introduction to Algorithms [50] for detailed descriptions of the algorithm.
All other graph analyses (connected component, betweenness centrality, and clustering coefficient) were performed in MATLAB 7.3.0 (R2006b) using the MATLAB Boost Graph Library (MatlabBGL) written by David Gleich (http://www.stanford.edu/dgleich/programs/matlab_bgl).
Topic Modeling
Latent Semantic Analysis
The first step of latent semantic analysis (LSA) was to construct a term-by-document matrix, A, in which each row corresponds to a unique term and each column to a unique document (abstract). Entry Aij contains the number of times term i appeared in abstract j. The full text from the 87543 SFN meeting abstracts from years 2001 to 2006 were first parsed into tokens. All punctuations, numbers, and other special characters were discarded. In addition, common English words that do not carry semantic value were eliminated based on a “stop word” list from the Cornell SMART project (ftp://ftp.cs.cornell.edu/pub/smart/english.stop). To further reduce the size of the resulting “bag of words”, all terms that appeared in only one abstract were eliminated. Word stemming algorithms from Snowball (http://snowball.tartarus.org) were also applied to all tokens so that morphologically similar words sharing the same root (e.g. “neuron”, “neurons”, “neuronal”) were collapsed into one (“neuron”). Previous studies have indicated that the use of stemming can result in some improvement of the precision and recall of information retrieval [51].
The preprocessing steps resulted in 87543 documents and 35943 terms. The term-by-document matrix A was constructed by counting the number of occurrences of each term in each document. In LSA, it is customary to transform this frequency matrix by some weight function to give better interrelations between term and document. In this work, the matrix A was weighted using the log entropy function [52]. The log entropy weight of each term i is the product of its local weight lij and global weight gi computed as
where Aij is the frequency of the ith term in the jth document, pij isthe probability of the ith term occurring in the jth document, and n is the total number of documents in the corpus. The weighted frequency of each element from A is then calculated by multiplying its local component by its global component. In other words, the weighted m×n term-by-document matrix, F, is defined as
The goal of using a weighting scheme is to assign less weight to terms that appear in many documents while awarding more weight to less frequent terms because the latter presumably have more differentiating power.
The weighted m×n term-by-document matrix, F, was factored into the product of 3 matrices using the singular value decomposition (SVD):
where U is the m×r orthogonal matrix containing the left (term) singular vectors, WT is the r×n orthogonal matrix containing the right (document) singular vectors, and Σ is the r×r diagonal marix of singular values of A [53]. The number of singular values computed for the matrix F, denoted by r, was set to 100 in this work.
In the reduced dimensionality vector space created by truncating the SVD, terms that occur in similar documents are located near one another even if they never co-occur in the same document. Topically related documents are also grouped near one another in the reduced vector spaces. The similarity between any pair of documents x and y can be measured by their cosine similarity, which is computed as:
where x and y are the r-dimensional projections of the two documents in the reduced space.
Topic Clustering
After LSA was completed, topic clustering of the documents proceeded as follows. First, cosine similarities were computed exhaustively for all pairs of documents. For each document, a sorted list of nearest neighbors was identified as those having the highest cosine similarity scores. To reduce computational complexity, we identified only the top 100 nearest neighbors. Next, these data were represented as an undirected, weighted graph G = (V, E) where each vertex, v∈V, denotes a document and each edge, e(i, j)∈E, connects a document i with one of its nearest neighbors j, i≠j. The weight associated with each edge e(i, j) was simply set to cos(i, j). Given the resulting sparse, connected graph, clustering could be performed using graph partitioning algorithms that segment the vertices of a graph into n disjoint sets, V1, V2,…,Vn, such that document similarity is high within a set Vi and lower across different sets Vi and Vj.
In this study, we applied the Normalized Cuts (NCuts) algorithm originally proposed by Shi and Malik (2000) to partition the full nearest neighbors graph. Unlike many other graph partitioning methods, the NCuts algorithm avoids the bias of separating out small sets of isolated points by considering the global properties of the graph instead of focusing on local features [30]. The algorithm attempts to partition G into n set of disjoint clusters by minimizing the normalized cut cost between any two partitions Vi, Vj, Vi∪Vj = V, :
where is the sum of the weights of the edges that are removed between Vi and Vj, is the sum of the weights of edges connecting vertices in Vi to all vertices in the graph, and assoc(Vj,V) is similarly defined. Therefore, the NCuts algorithm not only evaluates the total edge weight connecting two partitions, but also computes the cut cost as a fraction of the total edge connections to all vertices in the graph [30] in order to produce globally optimal partitions. NCuts was applied to cut the full graph into n connected components; the number of components or “clusters” is a parameter that required specification by some objective means.
Estimate Number of Clusters
Since the SFN theme labels and assignments were produced by scientists with domain expertise, we used this categorization as an evaluation benchmark to estimate the optimal number of clusters, n. The goal was to find the clustering of abstracts based on the NCuts algorithm that best matched globally the clustering based on SFN theme labels for the year 2006; this value n could then be assumed to be an appropriate number of clusters across the full 6-year data set. By varying the number of clusters, n, different degrees of cluster agreement were obtained. We used the Adjusted Rand Index to quantify the agreement between NCuts clustering and the SFN theme labels. The Adjusted Rand Index is defined as follows [54]: Given two partitions X and Y of a common set of data points, the quantities a, b, c, and d are computed for all possible pairs of data points i and j, and their respective cluster assignments, CX(i), CX(j), CY(i), CY(j), where
In the present context, X represents SFN theme labels and Y represents the NCuts cluster assignment. The quantity a is the number of document pairs from the same SFN theme that are assigned to the same cluster in Y, d is the number of document pairs from different themes that are assigned to different clusters, b is the number of document pairs from the same theme that are assigned to different clusters, and c is the number of document pairs from different themes that are assigned to the same cluster.
The Rand Index [55] is then the fraction of all document pairs for which the clusterings agree:
The Rand Index lies between 0 and 1. When the partitions X and Y agree perfectly, the Rand Index is 1. The Adjusted Rand Index was devised by Hubert and Arabie [31] to correct for the fact that the expected value of R for random partitions is not constant. The Adjusted Rand Index linearly transforms the Rand Index such that its expected value is 0, and maximum value is 1. The Adjusted Rand Index comparing NCuts clusters with SFN themes was calculated for n = [5..20], a range intentionally chosen to be similar to the number of distinct SFN themes.
The Adjusted Rand Index versus the numbers of NCuts clusters is shown in Figure 14. The plot suggests that NCuts produces the clustering that is most similar to the SFN theme categorization whenthe number of clusters is 10, which was used throughout this work.
Dynamics of Topics
A 10×6 matrix, D, was constructed, where each element Dij denotes the number of abstracts from cluster i and year j. The matrix columns were normalized by the total number of abstracts in each year. To find topic clusters that demonstrate consistent and noteworthy rise or decline in popularity, we applied linear regression fit to the normalized frequency of each cluster by year.
An additional analysis of dynamics was performed using a term-frequency by year matrix, H. Entries of H count the occurrences of each term in abstracts, normalized by the total number of words in all abstracts for each year. Only those terms that appeared in more than one abstract were included in H. The row-wise mean, which indicates the average frequency of a given term across years, was removed. The singular value decomposition of this matrix was performed to reveal the principal temporal components and associated term-space components of change in the six year data set.
Supporting Information
Footnotes
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work was supported in part by a grant (#061984) from the W. M. Keck Foundation and the Crick-Clay Professorship from Cold Spring Harbor Laboratory. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1.Garfield E. Citation Indexes for Science. Science. 1955;122:108–111. doi: 10.1126/science.122.3159.108. [DOI] [PubMed] [Google Scholar]
- 2.Ioannidis JPA. Concentration of the Most-Cited Papers in the Scientific Literature: Analysis of Journal Ecosystems. PLoS ONE. 2006;1:e5. doi: 10.1371/journal.pone.0000005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Redner S. How popular is your paper? An empirical study of the citation distribution. European Physical Journal B. 1998;4:131–134. [Google Scholar]
- 4.Redner S. Citation statistics from 110 years of Physical Review. Physics Today. 2005;58:49–54. [Google Scholar]
- 5.Hirsch JE. An index to quantify an individual's scientific research output. Proc Natl Acad Sci U S A. 2005;102:16569–16572. doi: 10.1073/pnas.0507655102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Bollen J, Rodriguez MA, Van De Sompel H. Journal status. Scientometrics. 2006;69:669–687. [Google Scholar]
- 7.Egghe L. Theory and practise of the g-index. Scientometrics. 2006;69:131–152. [Google Scholar]
- 8.Barabasi AL, Jeong H, Neda Z, Ravasz E, Schubert A, et al. Evolution of the social network of scientific collaborations. Physica A. 2002;311:590–614. [Google Scholar]
- 9.Bilke S, Peterson C. Topological properties of citation and metabolic networks. Physical Review E. 2001;6403 doi: 10.1103/PhysRevE.64.036106. [DOI] [PubMed] [Google Scholar]
- 10.Newman M. The structure of scientific collaboration networks. Proc Natl Acad Sci U S A. 2001;98 doi: 10.1073/pnas.021544898. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Newman MEJ. Scientific collaboration networks. I. Network construction and fundamental results. Physical Review E. 2001;6401 doi: 10.1103/PhysRevE.64.016131. [DOI] [PubMed] [Google Scholar]
- 12.Newman MEJ. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E. 2001;6401 doi: 10.1103/PhysRevE.64.016132. [DOI] [PubMed] [Google Scholar]
- 13.Börner K, Chen C, Boyack KW. Visualizing knowledge domains. Annual Review of Information Science and Technology. 2003;37:179–255. [Google Scholar]
- 14.Kaski S, Honkela T, Lagus K, Kohonen T. WEBSOM - Self-organizing maps of document collections. Neurocomputing. 1998;21:101–117. [Google Scholar]
- 15.Boyack KW. Mapping knowledge domains: characterizing PNAS. Proc Natl Acad Sci USA. 2004;101:5192–5199. doi: 10.1073/pnas.0307509100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Boyack KW, Klavans R, Borner K. Mapping the backbone of science. Scientometrics. 2005;64:351–374. [Google Scholar]
- 17.Chen CM, Paul RJ. Visualizing a knowledge domain's intellectual structure. IEEE Computer. 2001;34:65–71. [Google Scholar]
- 18.Douglas SM, Montelione GT, Gerstein M. PubNet: a flexible system for visualizing literature derived networks. Genome Biology. 2005;6 doi: 10.1186/gb-2005-6-9-r80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Plake C, Schiemann T, Pankalla M, Hakenberg J, Leser U. ALIBABA: PubMed as a graph. Bioinformatics. 2006;22:2444–2445. doi: 10.1093/bioinformatics/btl408. [DOI] [PubMed] [Google Scholar]
- 20.Price DDS. Big Science, Little Science and Beyond. New York: Columbia University Press; 1986. [Google Scholar]
- 21.Science and Engineering Indicators 2006, National Science Board. Arlington, VA: National Science Foundation; [Google Scholar]
- 22.Ziman J. What is happening to science? 1990; Dodrecht, The Netherlands. Kluwer Academic Publishers.
- 23.Martinson B. Universities and the money fix. Nature. 2007;449:141–142. doi: 10.1038/449141a. [DOI] [PubMed] [Google Scholar]
- 24.Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–442. doi: 10.1038/30918. [DOI] [PubMed] [Google Scholar]
- 25.Anthonisse JM. The rush in a directed graph; 1971; Stichting Mahtematisch Centrum, Amsterdam.
- 26.Freeman LC. A set of measures of centrality based on betweenness. Sociometry. 1977;40:35–41. [Google Scholar]
- 27.Deerwester S, Dumais ST, Landauer TK, Furnas GW, Harshman RA. Indexing by latent semantic analysis. Journal of the Society for Information Science. 1990;41:391–407. [Google Scholar]
- 28.Landauer TK, Laham D, Derr M. From paragraph to graph: Latent semantic analysis for information visualization. Proc Natl Acad Sci U S A. 2004;101:5214–5219. doi: 10.1073/pnas.0400341101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Berry MW, Dumais ST, O'Brien GW. Using linear algebra for intelligent information retrieval. SIAM Review. 1995;37:573–595. [Google Scholar]
- 30.Shi J, Malik J. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;22:888–905. [Google Scholar]
- 31.Hubert L, Arabie P. Comparing Partitions. Journal of the Classification. 1985;2:193–218. [Google Scholar]
- 32.Hoffmann R, Valencia A. Life cycles of successful genes. Trends Genet. 2003;19:79–81. doi: 10.1016/s0168-9525(02)00014-8. [DOI] [PubMed] [Google Scholar]
- 33.Perez-Iratxeta C, Andrade-Navarro MA, Wren JD. Evolving research trends in bioinformatics. Briefings in Bioinformatics. 2007;8:88–95. doi: 10.1093/bib/bbl035. [DOI] [PubMed] [Google Scholar]
- 34.Pfeiffer T, Hoffmann R. Temporal patterns of genes in scientific publications. Proc Natl Acad Sci U S A. 2007;104:12052–12056. doi: 10.1073/pnas.0701315104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Rebholz-Schuhman D, Cameron G, Clark D, van Mulligen E, Coatrieux JL, et al. SYMBIOmatics: synergies in Medical Informatics and Bioinformatics–exploring current scientific literature for emerging topics. BMC Bioinformatics. 2007;8(Suppl 1):S18. doi: 10.1186/1471-2105-8-S1-S18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Viator JA, Pestorius FM. Investigating trends in acoustics research from 1970–1999. Journal of the Acoustical Society of America. 2001;109:1779–1783. doi: 10.1121/1.1366711. [DOI] [PubMed] [Google Scholar]
- 37.Crasto C, Marenco L, Migliore M, Mao B, Nadkami P, et al. Text mining neuroscience journal articles to populate neuroscience databases. Neuroinformatics. 2003;1:215–237. doi: 10.1385/NI:1:3:215. [DOI] [PubMed] [Google Scholar]
- 38.Nielsen F, Hansen L, Balsley D. Mining for associations between text and brain activation in a functional neuroimaging database. Neuroinformatics. 2004;2:369–380. doi: 10.1385/NI:2:4:369. [DOI] [PubMed] [Google Scholar]
- 39.Kim JD, Tsujii J. Corpora and their annotation. In: Ananiadou S, McNaught J, editors. Text Mining for Biology and Biomedicine. Boston: Artech House; 2006. pp. 179–212. [Google Scholar]
- 40.Newman D, Chemudugunta C, Smyth P, Steyvers M. Statistical entity-topic models 2006.
- 41.Lin CY, Hovy E. The Automated Acquisition of Topic Signatures for Text Summarization.; 2000; COLING Conference, Strasbourg, France
- 42.Pantel P, Lin D. Document clustering with committees; 2002; Tampere, Finland.
- 43.Blei DM, Ng AY, Jordan MI. Latent Dirichlet Allocation. Journal of Machine Learning Research. 2003;3:993–1022. [Google Scholar]
- 44.Griffiths TL, Steyvers M. Finding scientific topics. Proc Natl Acad Sci USA. 2004;101:5228–5235. doi: 10.1073/pnas.0307752101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Shiffrin RM, Borner K. Mapping knowledge domains. Proc Natl Acad Sci USA. 2004;101:5183–5185. doi: 10.1073/pnas.0307852100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Boyack KW, Mane K, Borner K. Mapping Medline Papers, Genes and Proteins related to Melanoma Research; 2004; London, UK.
- 47.Cohen W, Sarawagi S. Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods; 2004. pp. 89–98.
- 48.Shen W, DeRose P, Vu L, Doan AH, Ramakrishnan R. Source-aware Entity Matching: A Compositional Approach; 2007. pp. 196–205.
- 49.Bilenko M, Mooney R, Cohen W, Ravikumar P, Fienberg S. Adaptive name matching in information integration. IEEE Intelligent Systems. 2003;18:16–23. [Google Scholar]
- 50.Cormen TH, Leiserson CE, Rivest RL, Stein C . Introduction to Algorithms: MIT Press and McGraw-Hill; 2001. [Google Scholar]
- 51.Hull D. Stemming Algorithms: A case study for detailed evauation. Journal of the American Society for Information Science. 1996;47:70–84. [Google Scholar]
- 52.Berry MW, Browne M. Understanding Search Engines: Mathematical Modeling and Text Retrieval; 1999; Philadelphia, PA.
- 53.Golub G, Van Loan C. Matrix Computations. Baltimore, MD: The Johns Hopkins University Press; 1996. [Google Scholar]
- 54.Handl J, Knowles J, Kell DB. Computational cluster validation in post-genomic data analysis. Bioinformatics. 2005;21:3201–3212. doi: 10.1093/bioinformatics/bti517. [DOI] [PubMed] [Google Scholar]
- 55.Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971;66:846–850. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.