Abstract
Objective
Translational science aims at “translating” basic scientific discoveries into clinical applications. Identifying translational science has practical value, such as in evaluating the effectiveness of investments made in large programs like the Clinical and Translational Science Awards. Despite several proposed methods that group publications—the primary unit of research output—into discrete categories, we still lack a quantitative way to place articles onto the full, continuous spectrum from basic research to clinical medicine.
Materials and Methods
I learn vector representations of controlled vocabularies assigned to Medline articles to obtain a translational axis that points from basic science to clinical medicine. The projected position of a term on the translational axis, expressed as a continuous quantity, indicates the term’s “appliedness.” The position of an article, determined by the average location of its terms, quantifies its degree of appliedness, which I term the level score.
Results
I validate the present method by comparing it with previous techniques, showing excellent agreement while uncovering significant variations in the scores of articles within previously defined categories. The measure allows us to characterize the standing of journals, disciplines, and the entire biomedical literature along the basic-applied spectrum. Analysis of a large-scale citation network reveals 2 main findings. First, direct citations occur mainly between articles with similar scores. Second, shortest paths are more likely to end at an article closer to the basic end of the spectrum, regardless of where the starting article is on the spectrum.
Conclusions
The proposed method provides a quantitative way to identify translational science.
Keywords: translational science, Medical Subject Heading, citation analysis, science of science
INTRODUCTION
Translational science, research that translates basic scientific discoveries (“bench”) into clinical applications (“bedside”), has received much emphasis recently, as exemplified by the Clinical and Translational Science Awards program launched by the National Institutes of Health. A consensual definition and consequent identification of translational science are helpful in evaluating the effectiveness of these programs and understanding translational pathways of drug development.1–6
However, such a consensus has yet to be reached, and existing proposals have limitations in several ways. For example, the 4-phase (T1-T4) continuum of translational research in genomics,7 which moves from the T1 phase that seeks a health application from a genome-based discovery to T4 that evaluates the health impact of the application in practice, is constrained within translational science and fails to provide ways to operationalize the 4 phases on research output. In a seminal study back in 1976, Narin et al8 presented 4 levels of research activities, namely basic research, clinical investigation, clinical mix, and clinical observation, and assigned each biomedical journal to 1 of the 4 levels. The method, though later used for research evaluation9,10 and mapping exercises,11 did not operate at the article level—my main focus here—and was largely based on the authors’ judgment, lacking proper justification. Later methods classified articles into Narin et al’s 4 categories based on title words.12,13 Recently, Weber14 introduced a novel cell-animal-human triangle and placed articles onto 1 of the 7 positions (3 corners, 3 midpoints, and the center) in the triangle. The placement is based on expert-assigned keywords called Medical Subject Headings (MeSH),15 the controlled vocabulary thesaurus maintained by the U.S. National Library of Medicine (NLM). He then considered articles with only cell and nonhuman animal (hereafter animal) related MeSH terms to be the most basic, articles with only human related terms the most applied, and articles with both types of terms in between. A similar idea was applied to National Institutes of Health grants in a more recent work by Li et al,16 who made the basic-applied dichotomy based on several dimensions such as model organisms studied and whether a grant is disease-oriented, as determined by MeSH terms associated with its abstract.
Supervised learning-based approaches have also been employed to classify articles, based on conventional bag-of-words representations17 or modern dense vector representations18 of titles and abstracts. These methods require labor-intensive manual labeling of articles. All of these methods suffer from a major limitation: they treat all articles (or grants) in the same defined category as having the same extent of “basicness,” yet there is large variation in that extent, as I shall show and elaborate subsequently.
By contrast, here I present a quantitative method to measure the degree of basicness of an article. The method results in a continuous measure, which I call the level score (LS), ranging from –1 to 1, with values closer to –1 meaning that the article is more oriented toward basic research and values closer to 1 more applied. The present proposal is inspired by 2 lines of literature. The first is Weber’s14 and Li et al’s16 studies, which leverage the fact that certain MeSH keywords indicate whether the research is done at the cell, animal, or human level as a hint for determining whether it is basic. The mere appearance of these terms, however, cannot fully distinguish how basic the research is. For example, articles with human related terms can be about nursing research on patient care, or clinical studies testing the safety and effectiveness of a new drug product on humans. The former case is often considered less basic than the latter,2,4,7 a distinction that previous methods fail to capture. This shortcoming is remedied by the second line of literature, on learning vector representations of entities (eg, words), which allows one to explore relationships between them via simple arithmetic vector operations.19,20 I apply these learning methods to MeSH terms to obtain, using the example mentioned previously, similarities between human related terms and the rest, which together determine the basicness of articles.
Extensive validations show that my results are consistent with previous methods. I apply the measure to more than 15 million articles published between 1980 and 2013 and find that the LS is able to characterize to what extent articles associated with a journal, field, or the entire biomedical literature are oriented toward basic research (or clinical medicine).
MATERIALS AND METHODS
Data
I use a snapshot of Medline, a large-scale bibliographic dataset for the biomedical research literature. Maintained by NLM, it is the primary database behind the widely used PubMed search engine. The data is publicly available21 and contains over 25 million articles.
One prominent feature of Medline is that each and every article indexed there is associated with expert-assigned keywords (ie, MeSH) (Figure 1A), controlled medical vocabularies organized into a hierarchical tree with 16 branches (A–N, V, and Z). As Medline also indexes nonbiomedical articles, I only consider the following 8 branches: A, B, C, D, E, G, M, and N, which are the most relevant to biomedical research, the most used, and contain the majority of MeSH terms (Supplementary Table S1).
Figure 1.
Schematic illustration of the calculation of the level score (LS) of Medical Subject Headings (MeSH) terms and articles. (A) Articles indexed in Medline are associated with MeSH terms. The table lists selected articles indicated by their PubMed ID (PMID), publication year, and MeSH terms separated by semicolons. (B) For each year $t$, I compute the co-occurrence matrix $M_t$ between MeSH terms based on articles published in the 5-year time window $[t-4, t]$, where the entry $M_t(i,j)$ is the number of articles whose MeSH terms contain both term $i$ and term $j$. The figure illustrates selected entries in $M_{1980}$ as a network, where the edge width reflects the co-occurrence count. (C) I embed each $M_t$ into vector space, using an embedding method called LINE,20 which assigns each term $i$ a vector $v_i(t)$. For illustration, I show their positions at 1980 in 2-dimensional space using t-SNE,22 a dimension reduction technique. In the vector space, I get the centroids of basic (red dots) and applied (blue dots) terms, marked as the large red and blue crosses, respectively. Basic terms are those related to cell, molecular, and animal topics, whereas applied terms are related to humans14 (Supplementary Table S2). I then get the translational axis (TA) vector that points from the centroid of basic terms to that of applied terms. The LS of a MeSH term is the cosine similarity between its vector and the TA vector, with larger values indicating the term is more applied. (D) The LS of selected terms in 1980. (E) The LS of an article is the average of the LS of its MeSH terms. The table reports the scores of selected articles. Supplementary Figures S11–S18 test the robustness of the present results by using another embedding method called GloVe,19 which is widely used for embedding co-occurrence statistics.
I assign journals in Medline to different fields based on the NLM Catalog data,23 which contains 120 Broad Subject Terms such as biochemistry, general surgery, nursing, etc.
Finally, as citation data in Medline are available only for PubMed Central articles, I turn to the Web of Science (WoS) database for such data. To match a Medline article in WoS, I use a lookup table that maps PubMed ID to Accession Number, the primary identifier used in WoS.
Placing articles onto the basic-applied spectrum
The present method relies on MeSH terms associated with articles and is bottom-up: to quantify the basicness of an article, I first place MeSH terms onto the basic-applied spectrum. Note that they are the most atomic pieces of information available; there are no lower-level “sub-MeSH” terms that would allow us to accomplish the quantification task. I therefore use a priori coding, in a minimal way, in that the coding applies only to a subset of all terms. Leveraging the coding schema introduced by Weber,14 I consider basic terms to be those related to the topics of (1) cell and molecular and (2) animal, treating them as equally basic; and applied terms to be those related to the topic of human. Operationally, cell and molecular terms are located in the subtrees rooted at the nodes cells, archaea, bacteria, viruses, molecular structure, and chemical processes; animal terms in the subtree rooted at the node eukaryota (except humans); and human terms in the subtrees rooted at the nodes humans and persons (Supplementary Table S2). The rationale for this assignment is the commonly adopted definition of clinical research—research that involves human subjects.24
I then assign each term a score, also called its LS, based on how similar it is to the basic and applied terms. This is achieved in 2 steps. First, I employ representation learning methods to obtain vector representations (embeddings) of terms, so that their pairwise similarities can be easily calculated using cosine similarity—a commonly used measure for such a task. In doing so, I compute time-evolving co-occurrence matrices between MeSH terms (Figure 1B), as biomedical knowledge may have been evolving over time. Specifically, for each year $t$, the entry $M_t(i,j)$ represents the number of articles published in the 5-year time window from $t-4$ to $t$ whose associated MeSH terms contain both term $i$ and term $j$. Co-occurrences capture how terms are used together among different types of articles. For example, freeze fracturing and cytochalasin B are frequently used in articles about cell-level experiments, patient participation and professional-patient relations co-occur in articles about patient care, but cytochalasin B and patient participation have not been co-used (Figure 1B).
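The co-occurrence counting step can be sketched as follows. The input format (a list of MeSH-term lists, one per article in a window) and the sample records are hypothetical; the real pipeline operates on millions of Medline records per 5-year window.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(articles):
    """Count how often each unordered pair of MeSH terms is co-assigned
    across a window of articles. Returns a Counter keyed by sorted pairs."""
    counts = Counter()
    for terms in articles:
        # deduplicate terms within an article, then count each pair once
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts

# toy window of 3 articles (hypothetical term assignments)
window = [
    ["Humans", "Patient Participation"],
    ["Humans", "Patient Participation", "Professional-Patient Relations"],
    ["Cytochalasin B", "Freeze Fracturing"],
]
counts = cooccurrence(window)
```

Pairs that never appear together, such as cytochalasin B and patient participation in the example above, simply receive no entry, mirroring the sparse structure of the real matrix.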
Next, I embed each matrix $M_t$ into $d$-dimensional vector space using LINE,20 a popular network embedding technique that seeks to minimize the Kullback-Leibler divergence between the connection probability in the vector space and the empirical one. Here I set the embedding dimension $d$ to 10. After embedding, each term $i$ at time $t$ is associated with a real-valued vector $v_i(t)$ (Figure 1C). Note that I tested the robustness of the results by using another embedding method called GloVe,19 which has been widely used in natural language processing to learn vector representations of words based on their co-occurrences and has also been used to study language bias.25 All the results hold when using GloVe (Supplementary Figures S11–S18).
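LINE itself is a dedicated network-embedding method and is not reimplemented here. As an illustrative stand-in only (not the paper's implementation), a truncated SVD of the log-scaled co-occurrence matrix likewise yields low-dimensional term vectors from co-occurrence statistics:

```python
import numpy as np

def embed(cooc, dim=2):
    """Illustrative stand-in for LINE/GloVe: embed a dense term-term
    co-occurrence matrix via truncated SVD of its log(1 + count) values.
    The paper uses LINE with dim=10; this sketch only shows the idea of
    turning co-occurrence statistics into one vector per term."""
    m = np.log1p(np.asarray(cooc, dtype=float))
    u, s, _ = np.linalg.svd(m, full_matrices=False)
    # scale the left singular vectors by sqrt of the singular values
    return u[:, :dim] * np.sqrt(s[:dim])  # one row vector per term

# toy symmetric co-occurrence matrix for 3 terms (hypothetical counts)
cooc = np.array([[0, 5, 0],
                 [5, 0, 1],
                 [0, 1, 0]])
vecs = embed(cooc, dim=2)  # shape: (3 terms, 2 dimensions)
```

Any embedding that preserves co-occurrence similarity would serve the downstream cosine-similarity computations equally well, which is consistent with the reported robustness to swapping LINE for GloVe.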
In the second step, I find in the vector space an imaginary vector called the translational axis (TA) that points from basic terms to applied terms (Figure 1C). TA is represented by the vector $v_{TA} = c_A - c_B$, where $c_A$ and $c_B$ are the centroids of applied and basic terms, respectively, obtained by averaging their constituent vectors. I then project each term onto TA, and its LS is the cosine similarity between its vector and the TA vector (Figure 1D). The score ranges from –1 to 1, and by construction, terms with larger scores are more oriented toward the applied end of the basic-applied spectrum. Supplementary Tables S3 and S4 provide the top 10 most basic and applied terms in 1980.
Once the LS of terms is obtained, the score of an article published in year $t$ is the average of the scores of its MeSH terms at $t$ (Figure 1E). Out of all the 17 362 010 articles published between 1980 and 2013, I only consider the 15 693 562 (90.4%) articles for which the majority of MeSH terms are included in the analysis.
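The axis construction and scoring steps above can be sketched with toy 2-dimensional vectors (hypothetical values; the real vectors come from the 10-dimensional LINE embedding, and the term names are only examples):

```python
import numpy as np

def level_score(vec, ta):
    """LS of a term: cosine similarity between its vector and the TA vector."""
    return float(vec @ ta / (np.linalg.norm(vec) * np.linalg.norm(ta)))

# hypothetical 2-d term vectors standing in for learned embeddings
term_vecs = {
    "Cells": np.array([-1.0, 0.2]),
    "Humans": np.array([1.0, -0.1]),
    "Patient Care": np.array([0.9, 0.3]),
}

# translational axis: centroid of applied terms minus centroid of basic terms
basic = [term_vecs["Cells"]]
applied = [term_vecs["Humans"]]
ta = np.mean(applied, axis=0) - np.mean(basic, axis=0)

# term-level scores, then an article's score as the mean over its terms
term_ls = {t: level_score(v, ta) for t, v in term_vecs.items()}
article_ls = np.mean([term_ls[t] for t in ["Humans", "Patient Care"]])
```

Because LS is a cosine, both term and article scores stay within [–1, 1], matching the range stated above.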
RESULTS
Validation
As simple validations of the embedding and the LS of MeSH terms, I find that for the 2 (cell/animal and human) categories of terms, within-category pairs of terms have significantly higher cosine similarity than between-category ones (Supplementary Figure S1) and that the distributions of LS of terms in the 2 categories are well-separated (Supplementary Figure S2), which is consistent across years (Supplementary Figure S3).
Before proceeding to validating the LS of articles, I stress that all previous methods classified articles into predefined categories, whereas the present method assigns a continuous value to each article. As a result, the validation exercise compares the distributions of LS of articles within those categories. First, I test the effectiveness of the present method on a particular type of article—articles flagged as clinical trials. These articles report results from studies that evaluate the effectiveness of interventions on humans, and are therefore expected to come from the applied side. Supplementary Figure S4A confirms this, showing that the vast majority of clinical trial publications have an LS >0, with a median of 0.42 (cf Figure 2A). Note that the information on whether articles are about clinical trials was not used during the quantification process. Supplementary Figure S4B–E further display, respectively, the distributions of LS of phase I, II, III, and IV clinical trial studies, indicating that the 4 phases of studies are progressively more oriented toward applied research, with their medians getting significantly larger (permutation test; Supplementary Table S5).
Figure 2.
Histogram of level score of articles. (A) All Medline articles included in the present analysis. The red dashed line marks the median (0.27) and the blue dotted line indicates the score (0.16) corresponding to the local minimum of density, empirically estimated as the midpoint of the bin that achieves such minimum. The right figure is a duplication of the left one, for ease of comparison. (B) Articles published in different journals. JBC: Journal of Biological Chemistry; JCI: Journal of Clinical Investigation; Nat. Rev. Drug Discov.: Nature Reviews Drug Discovery; NEJM: New England Journal of Medicine. (C) Articles in different disciplines, categorized based on journals where they were published and journal-discipline designations. Each tick on the y-axis represents 1 unit, with tick labels omitted for clarity.
Second, Weber14 classified articles into 7 categories based on whether their MeSH terms contain cell-, animal-, and human-related ones, and considered articles with only cell and animal terms to be most basic, articles with only human terms most applied, and articles with both types of terms in between. The present results agree with Weber’s analysis: I arrange these categories of articles based on their LS in the same order as Weber did (Table 1 and Supplementary Figure S5).
Table 1.
Comparison between Weber’s14 and the present results
| Weber’s Results | | Present Results | | Articles | |
|---|---|---|---|---|---|
| Category | Research level | Category | Median level score | n | % |
| C | 3.78 | CA | –0.19 | 729 412 | 4.65 |
| CA | 3.68 | C | –0.15 | 1 914 634 | 12.20 |
| CAH | 3.40 | CAH | –0.10 | 826 219 | 5.26 |
| A | 3.15 | A | –0.06 | 1 495 234 | 9.53 |
| CH | 2.85 | CH | 0.10 | 1 674 517 | 10.67 |
| AH | 2.10 | AH | 0.14 | 594 467 | 3.79 |
| H | 1.59 | H | 0.48 | 7 881 325 | 50.22 |
The first column lists the 7 classes of articles, categorized based on whether their MeSH terms contain the ones related to cell and molecular (C), animal (A), and human (H) (Supplementary Table S2). I arrange the categories in the order from most basic to most applied, according to the values of research level shown in the second column, which are taken from Weber’s analysis (Table 1a in Weber).14 The research level of a category was defined as the weighted average of the research levels of the 4 prototype journals selected by Narin et al, and the weight was the inverse of the number of articles in each journal.14 In the third column, I show the present results for the same 7 categories, based on the median level score. Supplementary Figure S5 shows the distributions of level score of articles in each category. The last 2 columns report the number and percentage of articles in each category. The remaining 577 754 (3.68%) articles belong to none of the 7 categories.
Research level of journals and biomedical fields
I examine the LS of articles in different journals and fields to characterize their standing along the basic-applied spectrum. To explore this, Figure 2B shows distributions of LS of articles in selected journals. The present results are in full agreement with the 4-level (L4-L1) categorization proposed by Narin et al8: the 4 selected prototype journals—Journal of Biological Chemistry (JBC) (L4), Journal of Clinical Investigation (L3), New England Journal of Medicine (L2), and Journal of the American Medical Association (L1)—are placed onto the basic-applied spectrum in the order from L4 to L1 by the present method (Figure 2B), and the median LS of articles published in these journals are –0.22, –0.11, 0.41, and 0.52, respectively. Figure 2B also suggests that there are many “additional levels” among the 4 journals. Articles published in Cell have a distribution of LS similar to that of JBC. Multidisciplinary journals such as Nature and Science, though they publish articles from the applied side, cover more articles from the basic side, making their modal score comparable to that of JBC and Cell. Between Journal of Clinical Investigation and New England Journal of Medicine are Nature Medicine, a preclinical medicine journal; Neuropsychopharmacology, a journal covering both basic and clinical research on the brain and behavior; and Nature Reviews Drug Discovery, a journal on drug discovery and development. These results are consistent with a previous study that conceptually placed journals along the pipeline of translational medicine.26
Figure 2C focuses on the basicness of biomedical research fields, as defined by NLM as Broad Subject Terms, based on the designation of journals to fields. Cell biology, biochemistry, and molecular biology are close to the basic end of the basic-applied spectrum, whereas nursing and health services research are close to the applied end. Between the 2 ends are, in the order from basic to applied, allergy and immunology, physiology, psychopharmacology, neoplasms, and general surgery, among others. What is also noticeable from Figure 2C are the bimodal distributions of LS of articles from disciplines like brain, endocrinology, psychopharmacology, neurology, and neoplasms. This matches the intuition that research in these fields may pursue 2 goals at the same time—scientific discoveries and therapeutics. Neoplasms, for example, can cover both fundamental research on the identification of tumor suppressor genes and clinical research on the development of cancer drugs.
Finally, the entire biomedical research literature, as recorded in the Medline dataset, exhibits a bimodal distribution of the score (Figure 2A). This is interesting in several respects. First, it suggests that there exists a robust threshold (LS = 0.16, the local minimum of the density) separating the 2 modes, therefore providing a natural classification between basic and applied articles. This may be appealing in many lines of enquiry, such as examining the types of research conducted,27 where the basic-applied dichotomy is often used. Second, the fact that the median (0.27) is greater than the threshold indicates that biomedical research in its entirety leans toward the applied side, with the score greater than the threshold for 57.3% of articles. Third, the bimodality indicates that published translational science—research that is essential for translating basic scientific discoveries into clinical medicine—may be in short supply.
In summary, while the present results are consistent with previous qualitative studies, they uncover significant variations in the scores of articles within previously defined categories, therefore highlighting the necessity of quantifying research level at the individual article level. The proposed measure also allows us to characterize the standing of journals, disciplines, and the entire biomedical literature along the basic-applied spectrum.
Changes over time
In the previous section, I pooled all articles together to examine the basicness of different categories. As biomedical knowledge evolves, the research level of MeSH terms may change accordingly. The sliding-window approach naturally captures this, thus allowing us to detect temporal changes in research level. Figure 3 illustrates some examples. I observe a steady decrease in LS for the term cloning, organism (Figure 3A), indicating that the research has been moving in the direction of basic science. This may be due to the trend that experiments about organism cloning originally focused more on humans and later more on animals. The LS of articles with this descriptor, correspondingly, has been decreasing (Figure 3B). On the other hand, research about hepatitis A vaccines has been evolving toward human research. The term adipose tissue, brown and its associated articles have been moving in the human direction since the 1990s, when the tissue was found in adult humans.14
Figure 3.
Evolution of research level of MeSH terms and articles. (A) Level score (LS) of different terms. The black line is the average LS of all terms. (B) Mean LS of articles containing the terms. The shaded region covers 1 SD.
Citation linkage between basic and applied research
Next, I study how articles with different LS are connected in the citation network. Are basic articles more likely to be cited by basic or applied ones? Where do citations originating from applied articles go? In this section, I focus only on the 10 118 672 Medline articles included in the present analysis that are also present in WoS; there are 200 359 263 citation pairs between those articles. Figure 4A plots, in the form of a heatmap, the LS of the citing article against that of the cited article for all 200 million citation pairs, where the color encodes the number of pairs. I observe 2 regions of very high density, corresponding to the case where citing and cited articles have similar scores. This means that basic research is more likely to cite other basic research, while applied research cites other applied research—a homophilous pairing of citing and cited articles with respect to their research levels. This observation cannot be explained by the bimodal distribution of LS of articles (Supplementary Section S1; Supplementary Figure S7A).
Figure 4.
Homophilous pairing of citing and cited articles with respect to level score. (A) Heatmap of level scores of citing and cited articles, where color encodes the number of pairs. (B) For each article, I compute the average difference, denoted as $\delta$, between its level score and the scores of its references. The figure shows how $\delta$ is distributed for articles with different scores, where color encodes the number of articles.
I further quantify the pattern of homophilous pairing at the article level. I calculate, for each article, the mean difference $\delta$ between its LS and the scores of its references, capturing the average direction from which it absorbs previous knowledge. Figure 4B shows that $\delta$ is around 0 for most articles regardless of their LS, which cannot be explained by the bimodality of LS (Supplementary Figure S7B). Another noticeable observation from Figure 4B is an asymmetric dispersion of $\delta$ around 0 for articles at both ends of the spectrum but a symmetric one for articles in the middle. This indicates that there are basic articles that cite much more applied ones, applied articles that cite much more basic ones, and articles in between that cite articles from both directions.
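The per-article quantity can be sketched in a few lines (function name and sample scores are hypothetical; the sign convention follows the definition above, so a positive value means the article is more applied than the work it cites):

```python
import numpy as np

def mean_level_gap(article_ls, reference_ls):
    """delta: an article's level score minus the mean score of its
    references. Positive -> the article draws on more-basic work."""
    return float(article_ls - np.mean(reference_ls))

# toy example: an applied-leaning article citing slightly more basic work
delta = mean_level_gap(0.4, [0.35, 0.45, 0.30])
```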
The results so far suggest that direct citations rarely occur between basic and applied research. This raises the questions of whether they operate in separate spheres and how basic knowledge can subsequently be used in applied research. To answer them, I go beyond direct citations and characterize citation connectivity between articles that are multiple steps away. To illustrate the present method, let us for now focus on a single article $i$ published in year $y_i$. I calculate the distances from $i$ to all other reachable articles $j$ in the entire citation network. The potentially reachable articles are those published until $y_i$, denoted as $T_i$, as an article can only cite previously published ones. For visualization purposes, I discretize LS and let $l_i$ ($l_j$) be the binned LS value of article $i$ ($j$), that is, $l_i = B(s_i)$, where $s_i$ is the LS of article $i$ and $B(\cdot)$ is the binning function. I define the reachability $r_i(l)$ of node $i$ to articles with binned value $l$ as the fraction of target articles that can be reached from $i$:

$$r_i(l) = \frac{|\{j \in T_i : l_j = l,\ d_{ij} < \infty\}|}{|\{j \in T_i : l_j = l\}|}.$$
This measure may provide an answer to a question related to the life cycle of translational science: how many basic scientific discoveries have resulted in marketed drugs?2 The present answer is not an exact one, which would require case-by-case studies, but it provides an upper bound reflected in the citation network. I also introduce 2 further measures: $d_i(l)$, the average distance to reachable articles:

$$d_i(l) = \frac{\sum_{j \in T_i : l_j = l,\ d_{ij} < \infty} d_{ij}}{|\{j \in T_i : l_j = l,\ d_{ij} < \infty\}|},$$
where $d_{ij}$ is the distance from $i$ to $j$; and $\Delta y_i(l)$, the average publication year difference between $i$ and the reachable articles $j$:

$$\Delta y_i(l) = \frac{\sum_{j \in T_i : l_j = l,\ d_{ij} < \infty} (y_i - y_j)}{|\{j \in T_i : l_j = l,\ d_{ij} < \infty\}|}.$$
$\Delta y_i(l)$ is motivated by previous studies that have examined the number of years taken for drug development.2,4
As I am interested in how the 3 measures vary across source articles with different LS, I randomly selected 101 184 (1%) articles, denoted as $S$, and repeated the above calculations for each $i \in S$, as doing so for millions of articles is computationally burdensome. Aggregating all source nodes, I obtain 3 matrices, $R$, $D$, and $\Delta Y$, respectively representing the average reachability, distance, and publication year difference between nodes with different LS, defined as:

$$R(l', l) = \frac{1}{|\{i \in S : l_i = l'\}|} \sum_{i \in S : l_i = l'} r_i(l),$$

and analogously for $D(l', l)$ and $\Delta Y(l', l)$.
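The per-source computation reduces to a breadth-first search over the citation graph. A minimal sketch, under an assumed adjacency-list input format (article → list of its references) and with binning by level score omitted:

```python
from collections import deque

def reachability(source, cites, targets):
    """Fraction of `targets` reachable from `source` by following citation
    links, plus the average citation distance to the reached targets.
    `cites` maps each article ID to the IDs it references (assumed format)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # standard BFS over the directed citation graph
        node = queue.popleft()
        for ref in cites.get(node, []):
            if ref not in dist:
                dist[ref] = dist[node] + 1
                queue.append(ref)
    reached = [t for t in targets if t in dist and t != source]
    frac = len(reached) / len(targets) if targets else 0.0
    avg_dist = (sum(dist[t] for t in reached) / len(reached)
                if reached else float("inf"))
    return frac, avg_dist

# toy citation graph: p4 cites p2 and p3, which each cite p1
cites = {"p4": ["p2", "p3"], "p3": ["p1"], "p2": ["p1"]}
frac, avg = reachability("p4", cites, ["p1", "p2", "p3"])
```

In the full analysis this search is repeated for each sampled source article, and the per-source results are averaged within bins of source and target level score.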
Figure 5A, which visualizes the matrix $R$, demonstrates the connectivity advantage of basic research. Target articles located at the basic end of the basic-applied spectrum have a larger portion of paths reaching them regardless of the research level of the source article. An applied article can reach, on average, 40% of the basic articles published before it and 20% of the applied ones.
Figure 5.
Characterization of the citation network with respect to level score. (A) The fraction of source-target pairs of articles where the target article can be reached from the source one. (B) The average path length. (C) The average year difference.
Figure 5B shows the matrix $D$. On average, the shortest paths among all sampled pairs of articles occur between those both located at the basic end, which are 7.5 steps away from each other. Reaching articles at the basic end from the applied end requires 10 steps. Figure 5C plots the matrix $\Delta Y$. The publication year difference in an applied-to-basic pair of articles is about 17 years, resonating with previous results.4
DISCUSSION
The main purpose of this work was to propose a method to place publications onto the translational spectrum by leveraging recent advances in representation learning. One advantage of the present method over previous ones is that rather than grouping articles into different categories, it assigns continuous scores to articles, therefore allowing me to capture the varying basicness of articles in the same group. The introduced measure quantifies research along the spectrum from basic science to clinical medicine to health practice. It may also be useful for policymakers in measuring the returns of science investments.28–31
Throughout the work, I have adopted a working definition that cell- and animal-related MeSH terms are basic and human-related ones applied. However, not all biomedical research involving humans is applied—neuroscience research that advances the understanding of the nervous system using human subjects has been considered basic.32 Yet this limitation operates at the term level rather than at the article level, and the present bottom-up approach, in which the position of an article on the basic-applied spectrum is based on the positions of its MeSH terms, may have partially accounted for it. This is based on the consideration that the types of research for articles with the term humans can be diverse, from basic neuroscience research to clinical trials to health practice, and the basicness of these articles should not be determined solely by that single term. To support this, Supplementary Figure S6 shows that articles containing both the humans and magnetic resonance imaging terms, though still located on the applied side, are more basic than health practice articles.
A related note concerns the definitions of basic, translational, and applied research. On one hand, not only has the definition of translational science been evolving,33 but consensus definitions of basic and applied research have also not been reached. Basic research has been defined as, among others, fundamental, curiosity-driven research that leads to scientific discoveries,34 or use-inspired research (eg, disease oriented).32,35 On the other hand, the computational identification of articles from these categories requires some operational definition, as many previous studies provided by looking at journals, title words, or MeSH terms. Here I followed this line and focused on MeSH terms. As such, the proposed LS is simply an attempt to quantify 1 of the many possible dimensions of translational science. The measure should therefore always be used with caution, as with any other indicator. When possible, it should be used together with other approaches, such as careful reading by domain experts.32 Future work could also combine LS with other measures, such as whether the research is disease oriented, and develop multidimensional analyses of translational science.
The key idea of the present method is the construction of an imaginary TA that points from basic terms to applied terms. With such a TA, there may be other ways to get the LS of an article. In Supplementary Section S2, I provide an alternative one. Specifically, I first get the TA vector as described previously and obtain the vector of an article by averaging the vectors of its MeSH terms. The LS of an article is then defined as the cosine similarity between the TA vector and the article vector. Supplementary Figures S8–S10 demonstrate that the results remain essentially the same using this formulation.
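A minimal sketch of this alternative formulation, with hypothetical toy vectors (the real computation uses the learned 10-dimensional term vectors):

```python
import numpy as np

def article_ls_alt(term_vectors, ta):
    """Alternative formulation: average an article's MeSH-term vectors
    into one article vector, then take its cosine similarity with the
    translational-axis (TA) vector."""
    v = np.mean(term_vectors, axis=0)
    return float(v @ ta / (np.linalg.norm(v) * np.linalg.norm(ta)))

# toy TA and two term vectors whose off-axis components cancel
ta = np.array([1.0, 0.0])
score = article_ls_alt([np.array([0.8, 0.6]), np.array([0.8, -0.6])], ta)
```

Averaging before the cosine weighs terms by vector magnitude rather than equally, which is one intuition for why the two formulations can differ slightly while producing essentially the same results.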
I relied on MeSH terms assigned to articles. Terms are updated yearly in various ways, such as additions and deletions. This, however, may not affect the present results, as I used a sliding-window approach, which examines the usage of terms in articles one period at a time, over successive periods.
Limitations remain. First, I examined citation linkages between articles of different LS, looking at the number and length of paths connecting basic and applied articles, as well as the year difference between them. Although a similar approach has been used to study the development of drugs,6 citations represent only codified knowledge flows, and a connected path from basic research to clinical medicine does not imply that translation actually occurred. I believe, however, that the measures I introduced provide a lower bound. Second, when performing citation analysis, the reliance on WoS for citation data means that I missed a significant portion (35.5%) of articles that are indexed in Medline but not present in WoS. Those articles were excluded because I have no data about their references. This may bias the present results to some extent, but WoS is the most comprehensive data source I can access.
Despite these limitations, the availability of a quantitative measure of the degree to which an article is basic paves the way for a number of systematic investigations, opening possibilities for future work. One can, for example, study the association between the LS and the number of citations; understand how funding is allocated along the basic-applied spectrum;36,37 characterize types of research conducted27 and the output of researchers, funding agencies, and research institutes; and examine how articles are cited outside the scientific domain, such as in patents, drug products, and clinical guidelines, given the availability of such linkage information. These studies would further advance the present understanding of the biomedical enterprise.
CONCLUSIONS
I proposed a method to place publications onto the translational spectrum by learning embeddings of controlled vocabularies. The introduced measure is consistent with previous qualitative results and allows us to characterize the basicness of journals, fields, and the entire biomedical literature. I found that articles with similar research levels tend to cite each other directly, yet articles located at the basic end of the spectrum are more likely to be reached regardless of the research level of the source article.
AUTHOR CONTRIBUTIONS
QK designed research, performed research, analyzed data, and wrote the article.
ACKNOWLEDGMENTS
I thank Pik-Mai Hui for initial collaboration; Cassidy R. Sugimoto for helpful comments on the manuscript; Steven N. Goodman and Griffin M. Weber for discussions; Filippo Radicchi for providing excellent computing resources; Rodrigo Costas for sharing matching data between Web of Science and Medline; and the authors of the LINE and GloVe algorithms for open sourcing their code. This work uses Web of Science data provided by the Indiana University Network Science Institute. This work is partially supported by the National Natural Science Foundation of China (NSFC No. 71874077).
REFERENCES
- 1. Zerhouni E. The NIH roadmap. Science 2003; 302 (5642): 63–72. [DOI] [PubMed] [Google Scholar]
- 2. Contopoulos-Ioannidis DG, Alexiou GA, Gouvias TC et al. Life cycle of translational research for medical interventions. Science 2008; 321 (5894): 1298–9. [DOI] [PubMed] [Google Scholar]
- 3. Khoury MJ, Gwinn M, Ioannidis JPA. The emergence of translational epidemiology: from scientific discovery to population health impact. Am J Epidemiol 2010; 172 (5): 517–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Morris ZS, Wooding S, Grant J. The answer is 17 years, what is the question: understanding time lags in translational research. J R Soc Med 2011; 104 (12): 510–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Collins FS. Reengineering translational science: the time is right. Sci Transl Med 2011; 3: 90cm17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Williams RS, Lotia S, Holloway AK et al. From scientific discovery to cures: bright stars within a galaxy. Cell 2015; 163 (1): 21–3. [DOI] [PubMed] [Google Scholar]
- 7. Khoury MJ, Gwinn M, Yoon PW et al. The continuum of translation research in genomic medicine: how can we accelerate the appropriate integration of human genome discoveries into health care and disease prevention? Genet Med 2007; 9 (10): 665–74. [DOI] [PubMed] [Google Scholar]
- 8. Narin F, Pinski G, Gee HH. Structure of the biomedical literature. J Am Soc Inf Sci 1976; 27 (1): 25–45. [Google Scholar]
- 9. Lewison G, Dawson G. The effect of funding on the outputs of biomedical research. Scientometrics 1998; 41 (1–2): 17–27. [Google Scholar]
- 10. Lewison G, Devey ME. Bibliometric methods for the evaluation of arthritis research. Rheumatology (Oxford) 1999; 38 (1): 13–20. [DOI] [PubMed] [Google Scholar]
- 11. Cambrosio A, Keating P, Mercier S et al. Mapping the emergence and development of translational cancer research. Eur J Cancer 2006; 42 (18): 3140–8. [DOI] [PubMed] [Google Scholar]
- 12. Lewison G, Paraje G. The classification of biomedical journals by research level. Scientometrics 2004; 60 (2): 145–57. [Google Scholar]
- 13. Boyack KW, Patek M, Ungar LH et al. Classification of individual articles from all of science by research level. J Informetrics 2014; 8 (1): 1–12. [Google Scholar]
- 14. Weber GM. Identifying translational science within the triangle of biomedicine. J Transl Med 2013; 11: 126. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.US National Library of Medicine. Medical subject headings; 2018. https://www.nlm.nih.gov/mesh/ Accessed September 2, 2018.
- 16. Li D, Azoulay P, Sampat BN. The applied value of public investments in biomedical research. Science 2017; 356 (6333): 78–81. [DOI] [PubMed] [Google Scholar]
- 17. Surkis A, Hogle JA, DiazGranados D et al. Classifying publications from the clinical and translational science award program along the translational research spectrum: a machine learning approach. J Transl Med 2016; 14: 235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Major V, Surkis A, Aphinyanaphongs Y. Utility of general and specific word embeddings for classifying translational stages of research. arXiv preprint 2017; arXiv:1705.06262. Accessed February 16, 2018. [PMC free article] [PubMed] [Google Scholar]
- 19. Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing; Stroudsburg, PA: Association for Computational Linguistics; 2014: 1532–43. [Google Scholar]
- 20. Tang J, Qu M, Wang M et al. LINE: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web; New York, NY: ACM; 2015: 1067–77. [Google Scholar]
- 21.U.S. National Library of Medicine. Download Medline/PubMed Data; 2018. https://www.nlm.nih.gov/databases/download/pubmed_medline.html Accessed July 28, 2018.
- 22. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008; 9: 2579–605. [Google Scholar]
- 23.U.S. National Library of Medicine. NLM catalog; 2018. https://www.ncbi.nlm.nih.gov/nlmcatalog Accessed July 28, 2018.
- 24.National Institutes of Health. Glossary & acronym list; 2013. https://grants.nih.gov/grants/glossary.htm#ClinicalResearch Accessed September 28, 2018.
- 25. Caliskan A, Bryson JJ, Narayanan A. Semantics derived automatically from language corpora contain human-like biases. Science 2017; 356 (6334): 183–6. [DOI] [PubMed] [Google Scholar]
- 26. Pasterkamp G, Hoefer I, Prakken B. Lost in the citation valley. Nat Biotechnol 2016; 34 (10): 1016–8. [DOI] [PubMed] [Google Scholar]
- 27. Zinner DE, Campbell EG. Life-science research within US academic medical centers. JAMA 2009; 302 (9): 969–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Moses H III, Martin JB. Biomedical research and health advances. N Engl J Med 2011; 364 (6): 567–71. [DOI] [PubMed] [Google Scholar]
- 29. Lane J, Bertuzzi S. Measuring the results of science investments. Science 2011; 331 (6018): 678–80. [DOI] [PubMed] [Google Scholar]
- 30. Toole AA. The impact of public basic research on industrial innovation: evidence from the pharmaceutical industry. Res Policy 2012; 41 (1): 1–12. [Google Scholar]
- 31. Press WH. What’s so special about science (and how much should we spend on it?). Science 2013; 342 (6160): 817–22. [DOI] [PubMed] [Google Scholar]
- 32. Landis S. Back to basics: a call for fundamental neuroscience research; 2014. https://blog.ninds.nih.gov/2014/03/27/back-to-basics/ Accessed September 10, 2018.
- 33. Fort DG, Herr TM, Shaw PL et al. Mapping the evolving definitions of translational research. J Clin Transl Sci 2017; 1 (1): 60–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.National Institute of General Medical Sciences. Curiosity creates cures: the value and impact of basic research; 2018. https://www.nigms.nih.gov/Education/Pages/factsheet_CuriosityCreatesCures.aspx Accessed September 29, 2018. [Google Scholar]
- 35. Stokes DE. Pasteur’s Quadrant: Basic Science and Technological Innovation. Washington, DC: Brookings Institution Press; 1997. [Google Scholar]
- 36. Moses H, Dorsey ER, Matheson DHM et al. Financial anatomy of biomedical research. JAMA 2005; 294 (11): 1333–42. [DOI] [PubMed] [Google Scholar]
- 37. Levitt M, Levitt JM. Future of fundamental discovery in US biomedical research. Proc Natl Acad Sci USA 2017; 114 (25): 6498–503. [DOI] [PMC free article] [PubMed] [Google Scholar]