Visualizing the knowledge structure and evolution of bioinformatics

Jiaqi Wang; Zeyu Li; Jiawan Zhang

doi:10.1186/s12859-022-04948-9

. 2022 Sep 30;23(Suppl 8):404. doi: 10.1186/s12859-022-04948-9

Visualizing the knowledge structure and evolution of bioinformatics

Jiaqi Wang ^1,^✉, Zeyu Li ^1,^✉, Jiawan Zhang ^1,^✉

PMCID: PMC9523889 PMID: 36180852

Abstract

Background

Bioinformatics has gained much attention as a fast growing interdisciplinary field. Several attempts have been conducted to explore the field of bioinformatics by bibliometric analysis, however, such works did not elucidate the role of visualization in analysis, nor focus on the relationship between sub-topics of bioinformatics.

Results

First, the hotspot of bioinformatics has moderately shifted from traditional molecular biology to omics research, and the computational method has also shifted from mathematical model to data mining and machine learning. Second, DNA-related topics are bridge topics in bioinformatics research. These topics gradually connect various sub-topics that are relatively independent at first. Third, only a small part of topics we have obtained involves a number of computational methods, and the other topics focus more on biological aspects. Fourth, the proportion of computing-related topics hit a trough in the 1980s. During this period, the use of traditional calculation methods such as mathematical model declined in a large proportion while the new calculation methods such as machine learning have not been applied in a large scale. This proportion began to increase gradually after the 1990s. Fifth, although the proportion of computing-related topics is only slightly higher than the original, the connection between other topics and computing-related topics has become closer, which means the support of computational methods is becoming increasingly important for the research of bioinformatics.

Conclusions

The results of our analysis imply that research on bioinformatics is becoming more diversified and the ranking of computational methods in bioinformatics research is also gradually improving.

Keywords: Bioinformatics, Visualization, Knowledge structure, Text-mining

Background

Bioinformatics originated as a cross-disciplinary field because of the increasing need for computational solutions to research problems in biomedicine [1]. Since the field developed in leaps and bounds, researchers have shown an increasing interest in summing up the development and evolution of the entire discipline. However, the relationship and evolution of bioinformatics’s subtopics have unique characteristics due to its interdisciplinary nature. Most previous works failed to further explore this relationship and did not realize the great potential of visualization in exploring and displaying the evolution of this relationship.

Traditional bibliometric analysis

Bibliometrics is the application of mathematics and statistical methods to evaluate the literature in different disciplines. Patra et al. [2] analyzed the growth of the scientific literature in bioinformatics. They applied Bradford’s law which estimates the exponentially diminishing returns of searching for references in science journals and Lotka’s law which is used to describe the frequency of publication by authors in any given field [3] to identify core journals and analyze author’s productivity pattern. W Glänzel et al. [4] proposed a novel subject-delineation strategy to retrieve of the core literature in bioinformatics. They then analyzed the core literature with bibliometric analysis tools such as co-author citation analysis, national publication activity, citation impact etc. Song et al. [5] conducted a bibliometric analysis of bioinformatics by extracting citation data from PubMed Central full-text. They focused on evaluating the productivity and influence of bioinformatics. Four measures were used to identify productivity: most productive authors, most productive countries, most productive organizations, and most popular subject terms. Research impact was analyzed based on the measures of most cited papers, most cited authors, emerging stars, and leading organizations.

Text mining applied to bioinformatics bibliometrics

The development of text-mining techniques provides a new perspective for bibliometric analysis. Topic model is the most prevalent approach among techniques applied for bibliometric analysis. Latent Dirichlet Allocation(LDA) is one of the most popular models applied for bibliometric analysis. Song et al. [6] attempted to detect the knowledge structure of bioinformatics by applying LDA to a large set of bioinformatics full-text articles for topic model generation. Author-Conference-Topic (ACT) [7] model (an extension of LDA that can incorporate the paper, author, and conference into the topic distribution simultaneously.) was adopted by Heo et al. [8] to study the field of bioinformatics from the perspective of key phrases, authors, and journals. Heo analyzed the ACT Model results in each period to explore the development trend of bioinformatics

Traditional bibliometric analysis lacks thematic descriptions of bioinformatics subtopics, and the above-mentioned works filled that gap. However, LDA is a typical bag-of-words model that has two major weaknesses: it loses the ordering of the words and ignore their semantics. We choose another technical scheme: paragraph embedding [9], dimension reduction and clustering. Paragraph Embedding is an unsupervised framework that learns continuous distributed vector representations for pieces of texts which will take the ordering of words in to account. Its construction gives this algorithm the potential to overcome the weaknesses of bag-of-words models and allows us to capture more accurate features of documents to benefit document clustering. Thus, we can better map knowledge structure of bioinformatics.

Visualization applied to bioinformatics bibliometrics

Scholarly data visualization enables scientists to have a better way to represent the structure of data sets and reveal hidden patterns in the data [10]. However, most previous studies did not realize the benefit of using visualization for exploring knowledge structure of bioinformatics. Visual mapping intuitively shows the overall knowledge structure, research framework, and development trends of a discipline that is very helpful for researchers to rapidly comprehend the overall research status and hotspots [11]. Although basic charts are widely used to display changes in the number of literatures [12] or the number of citation count [6], those charts lack of description of the overall knowledge structure and interaction to explore more information. Therefore, the advantages of visualization are not fully utilized. Another visualization tool used in bibliometrics analysis is network, which is commonly used for co-analysis, such as co-author analysis [13], co-citation analysis [6] and term co-occurrence analysis [14]. This type of network focuses on only one specific relationship, such as co-author focus on relationships between scholars. These diverse network structures are difficult to combine for comparative analysis. Furthermore, when dealing with big data set, the structure of graph may look like a hairball which is incomprehensible for analysts.

In this study, we fully utilized the ability of text mining technology to extract abstract paradigms from massive data and the ability of visualization to display complex information. First, we revealed the distribution of topics in two-dimensional document space by drawing scientific maps of bioinformatics, and then checked the evolution of topic relationships with time filters. Finally, to further understand the interdisciplinary nature of bioinformatics, we combined themeriver with words co-occurrence network to elaborate the evolution of computing-related topics.

Compared with previous works, present study has the following contributions: (1) A science map of bioinformatics was drawn to depict the intersection and evolution of sub-topics of bioinformatics. (2) The interdisciplinary nature of bioinformatics was explored emphatically analyzing the evolution of computing-related topics. (3) The validity and necessity of visualization in analysis were pointed out and proved.

Results

In this section, we will introduce the final visualization results. The goals of our visualization are twofold: (1) intuitively show the change in the popularity of different topics over time; and (2) clearly demonstrate the relationship between topics and how this relationship changes over time

Theme river

Goal 1 can be achieved by applying themeriver [15], where the color band represents clustering, and the width of the color band represents the number of articles in the clustering.

As shown in Fig. 1, our theme river has two forms of expression. (a) The middle ordinate represents the actual number of documents, and the development of the origin of the overall discipline can be clearly seen. However, the number of documents is too far away from the origin stage because the development of the discipline has shown a violent growth trend in middle and late stages(From 72 papers in 1960 to more than 16,000 papers in 2010). Under such a quantitative gap, the category information from 1960 to 1990 is compressed by the scale, so that its internal category composition cannot be clearly seen in Fig. 1a. Therefore, we designed the second form of expression. The ordinate no longer represents the true value but the proportion of sub-topics. The value range is from 0 to 1. From the Figure, we can clearly see that the hot topics also change over time. Moreover, some topics are slowly falling out of sight over time.

Science map

Science map originated from a traditional scientific notation called “knowledge tree” [16]. According to this tree structure, knowledge is divided into branches, which are then merged into main disciplines, and further divided into molecular disciplines and different specialties. However, with the continuous increase in the speed of knowledge dissemination, the exchanges between disciplines have become more intense, and more interdisciplinary disciplines have emerged. The tree structure has been difficult to meet this demand for discipline description. We need to use a more vivid and intuitive way to show the structure of bioinformatics and the impact of exchanges between different disciplines on bioinformatics’ structure.

To show the relationship between bioinformatics papers in a two-dimensional space, we need to perform dimensionality reduction again on the document vectors obtained in the previous step. This time we still chose UMAP to reduce the vectors’ dimension to 2, which will enable us to project all vectors into two-dimensional space. We choose UMAP for two reasons: first, according to Espadoto et al. [17],UMAP performs well; second, it can support supervised/semi-supervised dimensionality reduction. To ensure that the data points that belong to two-dimensional space continue to maintain the structural information in the high-dimensional space, we first filtered the soft clustering results and selected those data labels whose category probability is greater than 0.9. We then used UMAP with those labels for semi-supervised dimensionality reduction. Finally, we projected the obtained vector into a two-dimensional space and presented it as a scatter plot shown in Fig. 2

Each point represents a paper, and color codes the cluster it belongs to. However, when there is a larger number of points scatter plots may cause a misleading that the greater scope covered by an area, the more is the quantity of its papers. To eliminate this misleading and make the overall structure more visible, we implemented a contour line, where color depth is proportional to the density of papers, and line spacing is inversely proportional to the density gradient. The final science map of bioinformatics is shown in Fig. 3. We also supported interactive operation to help in-depth analysis, The high-frequency MESH phrases of the papers in the region of interest can be obtained by selecting the region.

Fig. 3 — Knowledge structure of bioinformatics. Color opacity is proportional to the density of papers and the darker the color, the higher the density. Line spacing is inversely proportional to the density gradient. Descriptions for each cluster are summarized by TF-IDF

Discussion

In this section, we analyzed the results from three different perspectives: knowledge structure of bioinformatics, evolution of knowledge structure, and evolution of computing-related topics.

Knowledge structure of bioinformatics

As shown in Fig. 1b, the popularity of topic 1(mathematical models, theoretical biology) and topic 2(DNA replication, transcription and expression), which are the initial main topics, have gradually decreased since 1980; meanwhile, topic 4(molecular biology) and topic 5(Molecular dynamics) remain relatively stable with limited change. Topic 6(genomics and proteomics) and Topic 8(data-related) have developed rapidly after 2000 and became the main topics of bioinformatics. Topic 3(system biology) and Topic 9(proteomics) were developed around 2000; although they did not receive too much attention, they have maintained a small but stable growth since 2000. The mode of Topic 7(HGP: The Human Genome Project) is somewhat special. This topic has been growing steadily since it appeared in 1980, reaching a peak in about 1995, but began to decrease greatly after 2000. This phenomenon can be well explained by combining the start and end time of the human genome project (1990-2003).

Although themeriver can describe the changes in research hotspots, it cannot sketch the relationship between topics and thus cannot accomplish our goal 2. Thus, we need a science to map the whole knowledge structure of bioinformatics. By analyzing Fig. 3, we obtain the following conclusions about the relationship between bioinformatics’ topics. (1) Topics 2, 4 and 6 are bridges connecting other topics. And they all deal with DNA indicating that DNA research is the ’skeleton’ of bioinformatics. (2) Among the topics distributed around the structure diagram, Topic 5 is related to Topic 2 and 4, while Topic 7 is mainly related to Topic 2 and 6. Topic 1 is relatively independent from other topics. (3) The three topics 3, 8 and 9 related to computing methods are all located at the top of the structural diagram. Topic 8 is at the center, which means that data are the core of computing-related topics of bioinformatics. (4) The closest non-computational topic to Topic 8 is Topic 6; hence, topic discussed in Topic 6 may use more methods of data science than other topics.

Evolution of knowledge structure

With the continuous development of bioinformatics, the relationship between topics is also changing. From 1960 to 2019, the change in the number of papers on different topics is shown in Fig. 1a. We selected several time periods when the number of papers varied greatly.

At the beginning of the development of bioinformatics, the relationship between the main topics 1, 2, 4 and 5 was relatively weak. However, from 1980 to 2000, the 2,4,5 topics began to merge to a certain extent. We selected the fusion part interactively and examined at the high-frequency MESH phrases discussed in the merged papers. The top six phrases are: escherichia coli, amino acids, binding site, plasma membrane, binding sites, and signal transduction, which are all related to biological macromolecules (Fig. 4).

Fig. 4 — Evolution of bioinformatics from 1960 to 2000

From 2000 to 2005, the number of substructures of bioinformatics increased, but the upper and lower sides are relatively independent. Bridge topic 6(genomics and proteomics) has not become the main hot topic at that time, and has not produced more contact with Topic 3 and 8, but produced more fusion with topic 2, indicating that omics research began with DNA. The period from 2005 to 2010 is a period of vigorous development of bioinformatics. Topic 6, as a bridge topic and a hot topic, began to connect various parts (Fig. 5).

Fig. 5 — Evolution of bioinformatics from 2000 to 2010

After 2010, the hotspot began to move upward and the distance between Topic 6 and Topics 8,9 was shortened in this period. This finding indicates that omics research increasingly relied more on the support of computational methods (Fig. 6).

Fig. 6 — Evolution of bioinformatics from 2010 to 2019

Evolution of computing-related topics

Understanding the evolution of topics related to computation is crucial to understand the whole bioinformatics structure due to the special interdisciplinary nature of bioinformatics. We tried to answer two questions through visual analysis: (1) What is the proportion of computing-related topics in bioinformatics? Is this ratio stable? (2) What changes have taken place in the biological topics and computational methods involved in the evolution of these topics?

Goal 1 can be achieved by simply applying themeriver. We will mainly introduce the solution to Goal 2.

As shown in Fig. 7, themeriver shows the change of the proportion of four computing-related topics and the word co-occurrence network elaborates the change of topics in each period. By default, the word co-occurrence network of all topics is displayed. Click the category label to hide other categories and explore the selected topic in depth. MESH phrases with high frequency will be displayed by default. Moving the mouse over the circle will display its corresponding MESH phrase. The terms related to computation are represented by stroked circles and marked with orange.

The proportion of papers that related to computation evidently changes. The proportion was close to 0.3 in the early stage of the development of bioinformatics, began to decline in the early 1980s, and only accounted for about 0.1 of the total papers in the early 1990s. The proportion then began to rise again in the middle and late 1990s, surpassing the initial proportion and occupying nearly half of the bioinformatics topics after 2005. During such changes, an obvious pattern is observed: although the proportion of computing-related topics is only a little higher than the original, the internal structure of those topics has undergone earth-shaking changes.

Through further analysis of the co-occurrence word network, we conclude the following. (1) Prior to 1980, Topic 1 occupied an absolute dominant position. The computational methods involved were mainly mathematical models, and computer simulation appeared after 1970. The main biological topics discussed during this period include: active transport, cell, etc. (2) The period of 1990-2000 was a turning point for major changes in the internal structure of computing-related topics. The overall proportion began to rise, but the popularity of Topic 1 remained the same; and Topic 8 gradually replaced Topic 1 to occupy the dominant position. No new computational method was found in Topic 8, but the discussion on sequence increased significantly. (3)The period of 2000 - 2010 is an era of data and genes. The popularity of topics 3 and 8 continues to rise, but the content of the discussion has undergone great changes. The computational topics are all centered on data, data analysis, and the popularity of gene expression is the highest in biological topics while discussions on sequences began to decrease. Topic 9 has also seen a significant increase, and the computational methods involved are mainly related to data; however, biological topics focus on magnetic resonance and images. (4) In 2010 - 2019, the overall structure has not changed, and the proportion of each topic has remained relatively stable, however, the contents they discussed are different from those before. During this period, more computing methods are involved, such as machine learning, data mining, text mining, cloud computing and so on. Thus the interaction between computing methods and bioinformatics becomes increasingly closer and closer.

Conclusion

As an interdisciplinary and fast-growing field of science, bibliometric analysis of bioinformatics has attracted the attention of many researchers. In this study, we collected 330192 bioinformatics papers and applied Doc2vec combined with clustering and dimension reduction technology to detect the knowledge structure of bioinformatics. And then we focus on the role of visual analysis in exploring this structure. Unlike previous works, we focus more on substructures’ relationship. The evolution process of computing-related topics was emphatically analyzed which is vital for understanding the interdisciplinary nature of bioinformatics. The results of our analysis imply that research on bioinformatics is becoming more diversified; the ranking of computational methods in bioinformatics research is also gradually improving. In the future, we plan to enrich and complete the knowledge structure diagram of bioinformatics by applying visualization to explore other aspects of bibliometric analysis, such as author analysis, organization analysis, citation analysis, etc.

Methods

In this section, we will introduce the overall procedure of the proposed approach for visualizing the knowledge structure of bioinformatics. Figure 8 shows the pipeline of our methods.

Fig. 8 — Pipeline of the proposed methods. a Shows the process of data collection and topic extraction. b Is our final visualization results. Science map shows the knowledge structure of bioinformatics and it can be filtered by time to show evolution.Themeriver shows the change in the proportion of papers and word cooccurrence network illustrates the change of topics in each period

Data collection and preprocessing

Data were obtained from Microsoft Academic Graph (MAG) [18]. MAG is a heterogeneous graph comprising more than 120 million publication entities and related authors, institutions, venues and fields of study. Papers in MAG that belongs to the union of two venue sets were analyzed: (1) 47 bioinformatics journals offered in Song [6], and 33 conferences used in Song [19]; (2) venues whose title contains ‘bioinformatics’ or ‘computational biology’ in MAG (“Appendix 1 and 2”). Corpus generation is mainly based on bioinformatics related venues. Papers of these venues have been strictly screened by review experts, thus they can better represent the knowledge structure of bioinformatics. This is also the reason why venue-filtered method is adopted in many previous works [5, 6, 13, 19].

A total of 330,192 papers were analyzed in the following discussion. We then extracted the title and abstract of these papers from the MAG to form a corpus. Detecting phrases in the corpus is crucial because the semantics of these phrases change once they are split into words. So after removing the stop words, we used phrases in Medical Subject Headings(MESH) to annotate phrases in our corpus. These phrases used to analyze bioinformatics literatures in many previous studies [2, 4, 6] to detect phrases in the corpus.

Topic modeling

Paragraph embedding

Paragraph embedding is an unsupervised framework that learns continuous distributed vector representations for documents. Semantic similarity between documents can be obtained through calculating cosine similarity between vectors. Paragraph embedding is implemented in the gensim package (called doc2vec in gensim), in which the representation vector of each word in the corpus can be obtained at the same time as the document vector by setting parameter ‘dm’ to 1, which is used to calculate the semantic similarity between words. The selection of parameters is based on the best practice given by Lau [20]: window size is equals to 5 and epoch is equals to 600;

Clustering

To have a preliminary understanding of the knowledge structure of bioinformatics, we analyzed MESH term vectors first which is also better for subsequent document classification and interpretation. We used HDBSCAN [21], an outstanding soft clustering algorithm, to determine the clusters for MESH vectors. According to McInnes, HDBSCAN has the following characteristics compared to those parameter-sensitive algorithms: Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning. Phrases tend to be more specific than single word, so we chose to check on the top 10 most representative phrases in each category. Among nine clusters, only one cluster is related to computation, and the other clusters are all terms related to biology. Biology terms are mainly related to genes, proteins and sequences.

Clustering the document vectors can help us to obtain the topic distribution of bioinformatics and further understand the knowledge structure. Compared with LDA, our method can flexibly select the clustering method and the number of categories according to the clustering results after obtaining the vectors. After several rounds of attempts, we chose Gaussian Mixture Model (GMM). GMM is similar to K-Means, but it can be used as soft clustering, giving the probability of data points being assigned to a category.

To avoid high computational cost and remove noises of data, we used PCA to reduce the dimension of document vector from initial 300 to 50. To preserve data’s nonlinearity, we used the nonlinear dimensionality reduction algorithm UMAP [22] to further reduce the dimension of the document vector to 10. The parameters we used are as follows: n-neighbors is equal to 100 and dimension is equal to 10. These 10-dimensional vectors were used as the input of GMM. Basing on the clustering experience of MESH phrase with HDBSCAN, we selected the number of categories(the number of topics) to be 9. To explain the semantic meaning of each cluster, for each cluster, we used tf-idf algorithm to sort the words in the cluster, and selected the top 10 words as the descriptors of the cluster. Specifically, the tf-idf algorithm is as follows:

\begin{matrix} t f_{i} d f_{i, j} = \frac{n_{i, j}}{Σ_{k} n_{k, j}} \times l o g \frac{| D |}{| j : t_{i} \in d_{j} |} \end{matrix}

where $n_{i} j$ is the count of word $t_{i}$ in cluster $d_{j}$ , $Σ_{k} n_{k} j$ is the total count of all words in cluster $d_{j}$ , |D| is the number of clusters and $| j : t_{i} \in d_{j} |$ is the number of clusters containing $t_{i}$ .

Results are shown in Table 1. ‘compute’ means the words belongs to the category related to computation in the word vector classification, and ‘biology’ is the word in the eight other categories.

Table 1.

Top 10 tf-idf words in each cluster

Topic1

Topic2

Topic3

Compute:

mathematical- model

Biology:

Population- dynamics; gene expression; cell cycle; natural selection; population size; immune system; basic reproduction- number;escherichia coli systems biology ndtabular

Compute:

Biology:

Escherichia coli; gene expression; binding site; e coli; dna replication; binding sites; amino acids; rna polymerase; cell cycle; dna damage

Compute:

Cloud computing; data analysis; data mining; data sets; machine learning; semantic web

biology:

Computational- biology;systems biology; gene expression; molecular biology

Topic4

Topic5

Topic6

Compute:

Biology:

Escherichia coli; e coli; amino acids; gene expression; saccharomyces cerevisiae; mass spectrometry; molecular weight; risk factorspublic health; heavy metals

Compute:

Biology:

Active site; escherichia coli; binding site; amino acids; circular dichroism; hydrogen bonds; hydrogen bond; cytochrome c; e coli; protein folding

Compute:

Biology:

Mass spectrometry; gene expression; breast cancer; cell lines; dna methylation; transcription- factors; liquid- chromatography; amino acids; oxidative stress; protein spots

Topic7

Topic8

Topic9

Compute:

Biology:

Amino acids; gene expression; x chromosome; skeletal muscle; reading frame; human genome; alternative splicing; breast cancer; linkage- disequilibrium; amino acid- sequence

compute:

Data sets; machine learning; data set; data analysis

biology:

Gene expression; protein structure; supplementary- information; sequence alignment; protein sequences; dna sequences

Compute:

Deep learning; wavelet transform; machine learning; data sets

Biology:

Magnetic resonance- imaging; magnetic resonance white matter; image- reconstruction visual cortex; breast cancer

Open in a new tab

Co-occurrence word network

To further investigate bioinformatics topics related to computational methods, we combine topic clustering and co-occurrence words analysis to construct a co-occurrence word network. The calculation method is as follows:

The data set of each topic was divided into six time slices to show the changes in each topic over time. From 1960 to 2019, each time slice represents 10 years. We calculated the tf-idf value of MESH phrases in different time slices in each topic, and then divided top 10 phrases into computational phrases and biology phrases according to MESH classification results. A co-occurrence word network between computational phrases and biology phrases is constructed to illustrate this time slice.

Acknowledgements

Not applicable.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 23 Supplement 8, 2022: Selected articles from the 16th International Symposium on Bioinformatics Research and Applications (ISBRA-20): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-23-supplement-8.

Appendix 1: List of bioinformatics journals

BMC Bioinformatics
BMC Genomics
PLoS Biology
Genome Biology
PLoS Genetics
PLoS Computational Biology
BMC Research Notes
Bioinformatics
Molecular Systems Biology
BMC Systems Biology
Comparative and Functional Genomics
Bioinformation
Human Molecular Genetics
The Embo Journal
Cancer Informatics
Genome Medicine
Evolutionary Bioinformatics
Biochemistry
Algorithms for Molecular Biology
Eurasip Journal on Bioinformatics and Systems Biology
Journal of Molecular Biology
Molecular & Cellular Proteomics
Mammalian Genome
Source Code for Biology and Medicine
Biodata Mining
Journal of Computational Neuroscience
Journal of Proteome Research
Journal of Biomedical Semantics
Journal of Molecular Modeling
Bulletin of Mathematical Biology
Pharmacogenetics and Genomics
Statistical Methods in Medical Research
Neuroinformatics
Genomics
Protein Science
Physiological Genomics
Trends in Genetics
Journal of Proteomics
Proteomics
Trends in Biochemical Sciences
Journal of Biotechnology
trends in Biotechnology
Journal of theoretical Biology
Journal of Integrative Bioinformatics
Computational Biology and Chemistry
International Journal of Data Mining and Bioinformatics
MOJ Proteomics & Bioinformatics
Journal of Proteomics & Bioinformatics
Network Modeling Analysis in Health Informatics and Bioinformatics
The Open Bioinformatics Journal
Advances in Bioinformatics
Dictionary of Bioinformatics and Computational Biology
IEEE ACM Transactions on Computational Biology and Bioinformatics
Current Bioinformatics
Bioinformatics and Biology Insights
biometrics and Bioinformatics
Journal of Clinical Bioinformatics
International Journal of Bioinformatics Research and Applications
International Journal of Computational Biology and Drug Design
International Journal of Bioscience Biochemistry and Bioinformatics
Applied Bioinformatics
International Journal of knowledge Discovery in Bioinformatics
Mathematical Biology and Bioinformatics
Journal of Bioinformatics and Computational Biology
trends in Bioinformatics
Journal of Computational Biology
briefings in Bioinformatics
biotechnologia Journal of Biotechnology Computational Biology and Bionanotechnology
IPSJ Transactions on Bioinformatics
China Journal of Bioinformatics
Genomics Proteomics & Bioinformatics

Appendix 2: List of bioinformatics conferences

APBC
BIBE
BIBM
B-interface
Biodevices
Bioinformatics
Biosig
Biosignals
Biostec
BMEI
BSBT
CIBCB
CMSB
CSB
DNA computing
ECCB
EuroGP
EvoBio
FOGA
GCB
GECCO
ISBI
ISBRA
ISMB
PRIB
PSB
RECOMB
WABI
WBIR
Brazilian symposium on bioinformatics
Data mining in bioinformatics
Computational intelligence methods for bioinformatics and biostatistics
Data and text mining in bioinformatics
International conference computational systems-biology and bioinformatics
Bioinformatics research and development
International conference on bioinformatics and biomedical engineering
International joint conferences on bioinformatics, systems biology and intelligent computing
International workshop on practical applications of computational biology and bioinformatics
International symposium health informatics and bioinformatics
Biocomputation, bioinformatics, and biomedical technologies
International conference bioscience, biochemistry and bioinformatics

Author contributions

JW: Visualization, data analysis, Writing—Original Draft; ZL: Conceptualization, Methodology; JZ: Supervision, Writing—Review Editing; All authors have read and approved the final manuscript.

Funding

No funding.

Availability of data and materials

Data were obtained from Microsoft Academic Graph (MAG) : https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Jiaqi Wang, Email: jqw.skye@gmail.com.

Zeyu Li, Email: lzytianda@tju.edu.cn.

Jiawan Zhang, Email: jwzhang@tju.edu.cn.

References

1.Roos DS. Bioinformatics-trying to swim in a sea of data. Science. 2001;291(5507):1260–1261. doi: 10.1126/science.291.5507.1260. [DOI] [PubMed] [Google Scholar]
2.Patra SK, Mishra S. Bibliometric study of bioinformatics literature. Scientometrics. 2006;67(3):477–489. doi: 10.1556/Scient.67.2006.3.9. [DOI] [Google Scholar]
3.Chen Y-S, Leimkuhler FF. A relationship between Lotka’s law, Bradford’s law, and Zipf’s law. J Am Soc Inf Sci. 1986;37(5):307–314. doi: 10.1002/(SICI)1097-4571(198609)37:5<307::AID-ASI5>3.0.CO;2-8. [DOI] [Google Scholar]
4.Glänzel W, Janssens F, Thijs B. A comparative analysis of publication activity and citation impact based on the core literature in bioinformatics. Scientometrics. 2009;79(1):109–129. doi: 10.1007/s11192-009-0407-1. [DOI] [Google Scholar]
5.Song M, Kim S, Zhang G, Ding Y, Chambers T. Productivity and influence in bioinformatics: a bibliometric analysis using pubmed central. J Am Soc Inf Sci. 2014;65(2):352–371. [Google Scholar]
6.Song M, Kim SY. Detecting the knowledge structure of bioinformatics by mining full-text collections. Scientometrics. 2013;96(1):183–201. doi: 10.1007/s11192-012-0900-9. [DOI] [Google Scholar]
7.Tang J, Zhang J, Yao L, Li J, Zhang L, Su Z. Arnetminer: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, 2008;990–998.
8.Heo GE, Kang KY, Song M, Lee JH. Analyzing the field of bioinformatics with the multi-faceted topic modeling technique. BMC Bioinform 2017;18(Suppl 7). [DOI] [PMC free article] [PubMed]
9.Le Q, Mikolov T. Distributed representations of sentences and documents. In: International conference on machine learning, 2014;1188–1196.
10.Liu J, Tang T, Wang W, Xu B, Kong X, Xia F. A survey of scholarly data visualization. IEEE Access. 2018;6:19205–19221. doi: 10.1109/ACCESS.2018.2815030. [DOI] [Google Scholar]
11.Gu D, Li J, Li X, Liang C. Visualizing the knowledge structure and evolution of big data research in healthcare informatics. Int J Med Inform. 2017;98:22–32. doi: 10.1016/j.ijmedinf.2016.11.006. [DOI] [PubMed] [Google Scholar]
12.Wu H, Wang M, Feng J, Pei Y. Research topic evolution in“bioinformatics”. In: 2010 4th international conference on bioinformatics and biomedical engineering, 2010;1–4.
13.Song M, Yang CC, Tang X. Detecting evolution of bioinformatics with a content and co-authorship analysis. Springerplus. 2013;2(1):186. doi: 10.1186/2193-1801-2-186. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Liao H, Tang M, Luo L, Li C, Chiclana F, Zeng XJ. A bibliometric analysis and visualization of medical big data research. Sustainability (Switzerland) 2018;10(1):1–18. [Google Scholar]
15.Havre S, Hetzler B, Nowell L. Themeriver: visualizing theme changes over time. In: IEEE symposium on information visualization 2000. INFOVIS 2000. Proceedings, 2000;115–123.
16.Rafols I, Porter AL, Leydesdorff L. Science overlay maps: a new tool for research policy and library management. J Am Soc Inform Sci Technol. 2010;61(9):1871–1887. doi: 10.1002/asi.21368. [DOI] [Google Scholar]
17.Espadoto M, Martins RM, Kerren A, Hirata NS, Telea AC. Towards a quantitative survey of dimension reduction techniques. IEEE Trans Visual Comput Graph 2019. [DOI] [PubMed]
18.Wang K, Shen Z, Huang C, Wu C-H, Dong Y, Kanakia A. Microsoft academic graph: when experts are not enough. Quant Sci Stud. 2020;1(1):396–413. doi: 10.1162/qss_a_00021. [DOI] [Google Scholar]
19.Song M, Heo GE, Kim SY. Analyzing topic evolution in bioinformatics: investigation of dynamics of the field with conference data in DBLP. Scientometrics. 2014;101(1):397–428. doi: 10.1007/s11192-014-1246-2. [DOI] [Google Scholar]
20.Lau JH, Baldwin T. An empirical evaluation of doc2vec with practical insights into document embedding generation 2016. arXiv preprint arXiv:1607.05368
21.McInnes L, Healy J, Astels S. hdbscan: Hierarchical density based clustering. J Open Source Softw. 2017;2(11):205. doi: 10.21105/joss.00205. [DOI] [Google Scholar]
22.McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction 2018. arXiv preprint arXiv:1802.03426

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Data were obtained from Microsoft Academic Graph (MAG) : https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/

[CR1] 1.Roos DS. Bioinformatics-trying to swim in a sea of data. Science. 2001;291(5507):1260–1261. doi: 10.1126/science.291.5507.1260. [DOI] [PubMed] [Google Scholar]

[CR2] 2.Patra SK, Mishra S. Bibliometric study of bioinformatics literature. Scientometrics. 2006;67(3):477–489. doi: 10.1556/Scient.67.2006.3.9. [DOI] [Google Scholar]

[CR3] 3.Chen Y-S, Leimkuhler FF. A relationship between Lotka’s law, Bradford’s law, and Zipf’s law. J Am Soc Inf Sci. 1986;37(5):307–314. doi: 10.1002/(SICI)1097-4571(198609)37:5<307::AID-ASI5>3.0.CO;2-8. [DOI] [Google Scholar]

[CR4] 4.Glänzel W, Janssens F, Thijs B. A comparative analysis of publication activity and citation impact based on the core literature in bioinformatics. Scientometrics. 2009;79(1):109–129. doi: 10.1007/s11192-009-0407-1. [DOI] [Google Scholar]

[CR5] 5.Song M, Kim S, Zhang G, Ding Y, Chambers T. Productivity and influence in bioinformatics: a bibliometric analysis using pubmed central. J Am Soc Inf Sci. 2014;65(2):352–371. [Google Scholar]

[CR6] 6.Song M, Kim SY. Detecting the knowledge structure of bioinformatics by mining full-text collections. Scientometrics. 2013;96(1):183–201. doi: 10.1007/s11192-012-0900-9. [DOI] [Google Scholar]

[CR7] 7.Tang J, Zhang J, Yao L, Li J, Zhang L, Su Z. Arnetminer: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, 2008;990–998.

[CR8] 8.Heo GE, Kang KY, Song M, Lee JH. Analyzing the field of bioinformatics with the multi-faceted topic modeling technique. BMC Bioinform 2017;18(Suppl 7). [DOI] [PMC free article] [PubMed]

[CR9] 9.Le Q, Mikolov T. Distributed representations of sentences and documents. In: International conference on machine learning, 2014;1188–1196.

[CR10] 10.Liu J, Tang T, Wang W, Xu B, Kong X, Xia F. A survey of scholarly data visualization. IEEE Access. 2018;6:19205–19221. doi: 10.1109/ACCESS.2018.2815030. [DOI] [Google Scholar]

[CR11] 11.Gu D, Li J, Li X, Liang C. Visualizing the knowledge structure and evolution of big data research in healthcare informatics. Int J Med Inform. 2017;98:22–32. doi: 10.1016/j.ijmedinf.2016.11.006. [DOI] [PubMed] [Google Scholar]

[CR12] 12.Wu H, Wang M, Feng J, Pei Y. Research topic evolution in“bioinformatics”. In: 2010 4th international conference on bioinformatics and biomedical engineering, 2010;1–4.

[CR13] 13.Song M, Yang CC, Tang X. Detecting evolution of bioinformatics with a content and co-authorship analysis. Springerplus. 2013;2(1):186. doi: 10.1186/2193-1801-2-186. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.Liao H, Tang M, Luo L, Li C, Chiclana F, Zeng XJ. A bibliometric analysis and visualization of medical big data research. Sustainability (Switzerland) 2018;10(1):1–18. [Google Scholar]

[CR15] 15.Havre S, Hetzler B, Nowell L. Themeriver: visualizing theme changes over time. In: IEEE symposium on information visualization 2000. INFOVIS 2000. Proceedings, 2000;115–123.

[CR16] 16.Rafols I, Porter AL, Leydesdorff L. Science overlay maps: a new tool for research policy and library management. J Am Soc Inform Sci Technol. 2010;61(9):1871–1887. doi: 10.1002/asi.21368. [DOI] [Google Scholar]

[CR17] 17.Espadoto M, Martins RM, Kerren A, Hirata NS, Telea AC. Towards a quantitative survey of dimension reduction techniques. IEEE Trans Visual Comput Graph 2019. [DOI] [PubMed]

[CR18] 18.Wang K, Shen Z, Huang C, Wu C-H, Dong Y, Kanakia A. Microsoft academic graph: when experts are not enough. Quant Sci Stud. 2020;1(1):396–413. doi: 10.1162/qss_a_00021. [DOI] [Google Scholar]

[CR19] 19.Song M, Heo GE, Kim SY. Analyzing topic evolution in bioinformatics: investigation of dynamics of the field with conference data in DBLP. Scientometrics. 2014;101(1):397–428. doi: 10.1007/s11192-014-1246-2. [DOI] [Google Scholar]

[CR20] 20.Lau JH, Baldwin T. An empirical evaluation of doc2vec with practical insights into document embedding generation 2016. arXiv preprint arXiv:1607.05368

[CR21] 21.McInnes L, Healy J, Astels S. hdbscan: Hierarchical density based clustering. J Open Source Softw. 2017;2(11):205. doi: 10.21105/joss.00205. [DOI] [Google Scholar]

[CR22] 22.McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction 2018. arXiv preprint arXiv:1802.03426

PERMALINK

Visualizing the knowledge structure and evolution of bioinformatics

Jiaqi Wang

Zeyu Li

Jiawan Zhang

Conference

Abstract

Background

Results

Conclusions

Background

Traditional bibliometric analysis

Text mining applied to bioinformatics bibliometrics

Visualization applied to bioinformatics bibliometrics

Results

Theme river

Fig. 1.

Science map

Fig. 2.

Fig. 3.

Discussion

Knowledge structure of bioinformatics

Evolution of knowledge structure

Fig. 4.

Fig. 5.

Fig. 6.

Evolution of computing-related topics

Fig. 7.

Conclusion

Methods

Fig. 8.

Data collection and preprocessing

Topic modeling

Paragraph embedding

Clustering

Table 1.

Co-occurrence word network

Acknowledgements

About this supplement

Appendix 1: List of bioinformatics journals

Appendix 2: List of bioinformatics conferences

Author contributions

Funding

Availability of data and materials

Declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Footnotes

Contributor Information

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases