Humans and machines in biomedical knowledge curation: hypertrophic cardiomyopathy molecular mechanisms’ representation

Mila Glavaški; Lazar Velicki

doi:10.1186/s13040-021-00279-2

2021 Oct 2;14:45

PMCID: PMC8487578 PMID: 34600580

Abstract

Background

Biomedical knowledge is dispersed across the scientific literature and is growing constantly. Curation is the extraction of knowledge from unstructured data into a computable form and can be done manually or automatically. Hypertrophic cardiomyopathy (HCM) is the most common inherited cardiac disease, with genotype–phenotype associations still incompletely understood. We compared human- and machine-curated models of HCM molecular mechanisms and examined the performance of different machine approaches for that task.

Results

We created six models representing HCM molecular mechanisms using different approaches and made them publicly available, analyzed them as networks, and tried to explain the models’ differences by the analysis of factors that affect the quality of machine-curated models (query constraints and reading systems’ performance). A result of this work is also the Interactive HCM map, the only publicly available knowledge resource dedicated to HCM. Sizes and topological parameters of the networks differed notably, and a low consensus was found in terms of centrality measures between networks. Consensus about the most important nodes was achieved only with respect to one element (calcium). Models with a reduced level of noise were generated and cooperatively working elements were detected. REACH and TRIPS reading systems showed much higher accuracy than Sparser, but at the cost of extraction performance. TRIPS proved to be the best single reading system for text segments about HCM, in terms of the compromise between accuracy and extraction performance.

Conclusions

Different approaches in curation can produce models of the same disease with diverse characteristics, and they give rise to utterly different conclusions in subsequent analysis. The final purpose of the model should direct the choice of curation techniques. Manual curation represents the gold standard for information extraction in biomedical research and is most suitable when only high-quality elements for models are required. Automated curation provides more substance, but a high level of noise is expected. Different curation strategies can reduce the level of human input needed. Given its rapid growth, biomedical knowledge would benefit greatly if computers could assist in its analysis on a larger scale.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13040-021-00279-2.

Keywords: Data mining, Curation, Automated curation, Hypertrophic cardiomyopathy, Signaling pathways, Knowledge graphs, Disease maps

Background

Biomedical knowledge is dispersed across scientific papers and databases and is growing constantly. Biomedical literature can be seen as a large, unstructured data repository [1]. PubMed is a biomedical literature database and supports the search and retrieval of the literature [2]. Filters are used to narrow the search by different criteria (publication date, species, etc.). Each publication in the database has a unique PubMed Identifier (PMID). Medical Subject Headings (MeSH) is a vocabulary thesaurus used for indexing articles for PubMed [3]. Combinations of these and other approaches (e.g., using keywords and key phrases) can be used to constrain database queries. There are also other biomedical databases such as Pathway Commons [4], DrugBank [5], ChEMBL [6], CTDbase [7], miRTarBase [8], and many more.

Curation is the extraction of knowledge from unstructured data into a structured, computable form [9]. Molecular mechanisms can be extracted from biomedical knowledge resources by manual or automated curation [10, 11]. Manual curation consists of the synthesis and integration of information from the literature, large-scale projects, and databases [9] and represents the gold standard for information extraction in biomedical research [12]. The extracted information about molecular mechanisms can be subsequently visually represented using visual pathway editors such as CellDesigner [10]. One example of an automated approach is the “Integrated Network and Dynamical Reasoning Assembler” (INDRA), which extracts molecular mechanisms from text and biomedical databases and assembles them into executable models [13]. It contains a number of clients for accessing and using resources from biomedical databases (e.g., Pathway Commons database) and literature clients for retrieving the literature. For the extraction of molecular mechanisms from text, INDRA uses reading systems such as REACH [14], TRIPS [15], Sparser [16], ISI [17], RLIMS-P [18], and Eidos [19]. They extract INDRA statements, intermediate knowledge representations of extracted molecular mechanisms [13]. INDRA statements are then assembled into models [13]. The INDRA Database is built with INDRA, combining content from numerous readers and databases [20].

When the information is combined, its value increases [9]. Disease maps are comprehensive, knowledge-based representations of disease mechanisms [21]. Biomedical knowledge in the form of graphs facilitates the study of complex processes, both as visual and thereby more intuitive representations, as well as a standardized data structure that is human- and computer-readable [22].

Hypertrophic cardiomyopathy (HCM) is the most common genetic cardiac disease [23–25], with a prevalence of 1 in 500 people worldwide [23, 26–29]. It is characterized by marked variability in expression, ranging from asymptomatic to sudden cardiac death or heart failure [30]. In addition to the direct effects of underlying mutations, gene expression is altered by micro and small noncoding RNAs, and secondary molecular changes occur in many signaling pathways [31]. Many studies have been conducted to decipher the molecular mechanisms underlying HCM; however, genotype–phenotype associations remain incompletely understood [32].

Models made exclusively by manual curation or by automated curation have never been compared. Automated biomedical knowledge curation policies that produce disease models of higher quality are still not known.

Our aims were to compare human- and machine-curated HCM models, as well as to examine the performance of different machine approaches for the same task.

Results

Constructed models

We created six models representing HCM molecular mechanisms using different approaches and made them publicly available (Table 1). The Manual HCM model was constructed by a human, based on an extensive literature search in PubMed, using CellDesigner. The Tabular manual HCM model was created by manual transcription of species and reactions from the original Manual HCM model CellDesigner XML file to nodes and interactions of a network table in XLSX format. The INDRA-assembled PubMed HCM model was assembled automatically, using INDRA’s PubMed literature client. The INDRA-assembled PubMed+PathwayCommons HCM model was assembled automatically, using INDRA’s PubMed literature client and the Pathway Commons database via INDRA’s BioPAX API. The Truncated INDRA DB HCM model was created using the INDRA Database. Only statements that were extracted from the text completely correctly were incorporated into the Truncated INDRA DB HCM model. After applying the criteria for correctness, 9.27% of statements remained for inclusion in this model. The INDRA DB HCM model was also created using the INDRA Database, but all statements returned by the query were incorporated into it.

Table 1.

Constructed models

Model	Number of elements	Number of interactions	Number of compartments	Available at
Manual HCM model	440^a	509^a	0^a	https://bit.ly/3s47FyA
Tabular manual HCM model	175	278	0	https://bit.ly/3saXwR2
INDRA-assembled PubMed HCM model	435	451	0	https://bit.ly/3blm2rB
INDRA-assembled PubMed+PathwayCommons HCM model	1883	3642	0	https://bit.ly/2OLxJQM
Truncated INDRA DB HCM model	77	59	0	https://bit.ly/2ZKypbD
INDRA DB HCM model	546	638	0	https://bit.ly/3upHsga

^aAs estimated by Cytoscape. The original Manual HCM model consisted of 207 elements, 233 reactions, and 11 compartments

The numbers of elements and interactions differ markedly between the models, even though they all represent the same disease (HCM). Models created by automated curation contain no compartments (Table 1).

Network analysis of the generated models

Topological analysis

Topological parameters for the networks (Table 2) and network diameter per element (Table 3) were computed.

Table 2.

Topological parameters for HCM models obtained with Network Analyzer

	Manual HCM model	Tabular manual HCM model	INDRA-assembled PubMed HCM model	INDRA-assembled PubMed+PathwayCommons HCM model	Truncated INDRA DB HCM model	INDRA DB HCM model
Average number of neighbors	2.309^a	2.789	1.917	3.582	1.455	2.059
Network diameter	1^a	12	6	8	3	9
Network radius	1^a	1	1	1	1	1
Characteristic path length	1.000^a	4.334	2.541	2.395	1.299	3.900
Clustering coefficient	0.000^a	0.054	0.007	0.006	0.014	0.028
Network density	0.003^a	0.009	0.002	0.001	0.010	0.002
Connected components	26^a	11	58	51	23	101
Multi-edge node pairs	1^a	24	21	213	3	48
Number of self-loops	0^a	4	7	14	0	6

^a Owing to the incompatibility of the CellDesigner XML file, we suspect that some or all topological measures for the Manual HCM model were calculated incorrectly by Cytoscape
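As a rough consistency check, the network density values in Table 2 agree, to three decimals, with the ratio E / (N (N − 1)) computed from the element counts (N) and interaction counts (E) in Table 1. The sketch below assumes that formula; whether Network Analyzer computes density in exactly this way is an assumption, and agreement at three decimals is weak evidence at best.

```python
# Element (N) and interaction (E) counts copied from Table 1.
COUNTS = {
    "Manual HCM model": (440, 509),
    "Tabular manual HCM model": (175, 278),
    "INDRA-assembled PubMed HCM model": (435, 451),
    "INDRA-assembled PubMed+PathwayCommons HCM model": (1883, 3642),
    "Truncated INDRA DB HCM model": (77, 59),
    "INDRA DB HCM model": (546, 638),
}

# Network density as reported in Table 2.
REPORTED_DENSITY = {
    "Manual HCM model": 0.003,
    "Tabular manual HCM model": 0.009,
    "INDRA-assembled PubMed HCM model": 0.002,
    "INDRA-assembled PubMed+PathwayCommons HCM model": 0.001,
    "Truncated INDRA DB HCM model": 0.010,
    "INDRA DB HCM model": 0.002,
}


def density(n_nodes, n_edges):
    """Interactions divided by the number of ordered node pairs (assumed formula)."""
    return n_edges / (n_nodes * (n_nodes - 1))


computed = {name: round(density(n, e), 3) for name, (n, e) in COUNTS.items()}

for name, value in computed.items():
    print(f"{name}: computed {value:.3f}, reported {REPORTED_DENSITY[name]:.3f}")
```

Because the Manual HCM model values are flagged as unreliable in the footnote, its row is included only for completeness.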

Table 3.

Network diameter per element

	Manual HCM model	Tabular manual HCM model	INDRA-assembled PubMed HCM model	INDRA-assembled PubMed+PathwayCommons HCM model	Truncated INDRA DB HCM model	INDRA DB HCM model
Network diameter/number of elements	0.0023^a; 0.0048	0.0686	0.0138	0.0042	0.0390	0.0165

^aNumber of elements estimated using Cytoscape
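The values in Table 3 can be reproduced by dividing each network diameter (Table 2) by the corresponding number of elements (Table 1). The sketch below does that arithmetic. Pairing the second Manual HCM model value (0.0048) with the original 207-element count from the Table 1 footnote is inferred here from the arithmetic and is not stated explicitly in the text.

```python
# (network diameter from Table 2, number of elements from Table 1)
MODELS = {
    "Manual HCM model (Cytoscape estimate)": (1, 440),
    "Manual HCM model (original file, inferred pairing)": (1, 207),
    "Tabular manual HCM model": (12, 175),
    "INDRA-assembled PubMed HCM model": (6, 435),
    "INDRA-assembled PubMed+PathwayCommons HCM model": (8, 1883),
    "Truncated INDRA DB HCM model": (3, 77),
    "INDRA DB HCM model": (9, 546),
}


def diameter_per_element(diameter, n_elements):
    """Network diameter normalised by the size of the network."""
    return diameter / n_elements


ratios = {
    name: round(diameter_per_element(d, n), 4) for name, (d, n) in MODELS.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

Normalising by size makes the diameters comparable across networks that differ by more than an order of magnitude in element count.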

Nodes’ centrality scores

The intersections of sets containing the top 10% elements by centrality measures for each network showed low consensus in terms of centrality measures between networks (Fig. 1). The elements ranked in the top 10% by different centrality measures for each network were visualized (Table 4). Network centrality scores could not be determined for the CellDesigner XML file.

Fig. 1 — Intersections of sets containing top 10% elements ranked by centrality measures for each network. Top 10% elements were determined for each network by: a-betweenness, b-bottleneck, c-closeness, d-clustering coefficient, e-degree, f-DMNC, g-eccentricity, h-EPC, i-MCC, j-MNC, k-radiality, l-stress
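The comparison behind Fig. 1 can be illustrated with a minimal sketch: rank the nodes of each network by a centrality measure, keep the top 10%, and intersect the resulting sets. The example uses degree only, and the three toy networks, their node names, and their edges are made up for illustration; they are not taken from the HCM models. The helper names `degree_scores` and `top_percent` are likewise hypothetical.

```python
from math import ceil


def degree_scores(edges):
    """Count the distinct neighbours of every node in an undirected edge list."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set())
        neighbours.setdefault(b, set())
        if a != b:  # self-loops do not add a neighbour
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(nb) for node, nb in neighbours.items()}


def top_percent(scores, percent=10):
    """Return the highest-scoring nodes (at least one); ties are broken by name."""
    k = max(1, ceil(len(scores) * percent / 100))
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    return set(ranked[:k])


# Hypothetical toy networks with ten nodes each.
networks = {
    "a": [("calcium", "n1"), ("calcium", "n2"), ("calcium", "n3"), ("n3", "n4"),
          ("n4", "n5"), ("n5", "n6"), ("n6", "n7"), ("n7", "n8"), ("n8", "n9")],
    "b": [("calcium", "x1"), ("calcium", "x2"), ("calcium", "x3"), ("calcium", "x4"),
          ("x1", "x2"), ("x5", "x6"), ("x6", "x7"), ("x7", "x8"), ("x8", "x9")],
    "c": [("y1", "y2"), ("y1", "y3"), ("y1", "y4"), ("y1", "y5"),
          ("calcium", "y5"), ("y6", "y7"), ("y7", "y8"), ("y8", "y9")],
}

tops = {name: top_percent(degree_scores(edges)) for name, edges in networks.items()}
consensus_ab = tops["a"] & tops["b"]
consensus_all = set.intersection(*tops.values())

print(tops)
print("a and b:", consensus_ab, "all three:", consensus_all)
```

Adding a single network with a different hub empties the overall intersection, which is the kind of low consensus the figure summarises.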

Table 4.

Elements ranked as top 10% by centrality measures for each network

Model	Link to folder with top 10% elements for each of centrality measures for the model
Tabular manual HCM model	https://bit.ly/3s7PQyO
INDRA-assembled PubMed HCM model	https://bit.ly/3k6Dmon
INDRA-assembled PubMed+PathwayCommons HCM model	https://bit.ly/3s9Wc0x
Truncated INDRA DB HCM model	https://bit.ly/3s6uqSL
INDRA DB HCM model	https://bit.ly/37Kqlfc

The most important nodes

Consensus about the most important nodes was achieved only with respect to one element (calcium), while consensus for other most and least important nodes was lacking (Fig. 2).

Fig. 2 — The most important elements of networks (left) and the least important elements of networks (right)

Each network was represented as a packed concentric ring sorted by k-shell and gradient of nodes’ color applied based on k-shell (Fig. 3, Additional file 1). Rank and k-shell for each node of each network were calculated (Additional file 2). Cytoscape Wk-decomposition [33] could not be performed on the CellDesigner XML file.
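For readers unfamiliar with k-shell values, the following is a minimal sketch of plain, unweighted k-shell decomposition. It is a simpler stand-in for the Cytoscape app used in the study, not a reimplementation of it, and the small example graph is hypothetical.

```python
def k_shell_decomposition(adjacency):
    """Assign each node its k-shell index by iterative peeling.

    `adjacency` maps every node to the set of its neighbours; it must be
    symmetric and every neighbour must also appear as a key.
    """
    # Work on a copy and drop self-loops so they do not inflate degrees.
    remaining = {node: set(nb) - {node} for node, nb in adjacency.items()}
    shell = {}
    k = 0
    while remaining:
        peel = [node for node, nb in remaining.items() if len(nb) <= k]
        if not peel:
            k += 1  # nothing left at this level, move to the next shell
            continue
        for node in peel:
            for neighbour in remaining[node]:
                remaining[neighbour].discard(node)
            shell[node] = k
            del remaining[node]
    return shell


# Hypothetical graph: a triangle (a, b, c), a pendant node d, an isolated node e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}

shells = k_shell_decomposition(graph)
print(shells)
```

Nodes in the innermost shell (here the triangle) sit in the most densely connected core of the network, which is why k-shell is used as an indicator of node importance.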

Reliability of interactions

A separate reliability threshold was estimated and applied for each model; as a result, models with reduced levels of noise were generated (Table 5).

Table 5.

Estimated best reliability threshold for each network and models with reduced level of noise

Model	Estimated best reliability threshold	Models with reduced level of noise
Manual HCM model	–	https://bit.ly/3qDFZ3g
Tabular manual HCM model	0.15	https://bit.ly/3qBzv59
INDRA-assembled PubMed HCM model	0.15	https://bit.ly/3bBKFkf
INDRA-assembled PubMed+PathwayCommons HCM model	0.60	https://bit.ly/3s6ALO3
Truncated INDRA DB HCM model	0.02	https://bit.ly/3k9iH2T
INDRA DB HCM model	0.50	https://bit.ly/3pFqo1Y

Cooperatively working elements

The number of detected cooperatively working elements (functional modules) differed widely between networks (Table 6). Models made by machines without later human intervention contained ambiguous and exogenous elements in the detected functional modules (Table 6, Additional file 3). We have proposed likely implications for the detected functional modules in HCM (Additional file 3). The Manual HCM model could not be analyzed using the NCMine app [34].

Table 6.

Functional modules

Model	Criterion for near-clique mining	Number of functional modules detected	Functional modules with ambiguous elements (%)	Functional modules with exogenous elements (%)
Tabular manual HCM model	Page Rank	17	0.00	0.00
Tabular manual HCM model	Node Degree	18	0.00	0.00
INDRA-assembled PubMed HCM model	Page Rank	6	50.00	16.67
INDRA-assembled PubMed HCM model	Node Degree	5	60.00	20.00
INDRA-assembled PubMed+PathwayCommons HCM model	Page Rank	61	4.92	77.05
INDRA-assembled PubMed+PathwayCommons HCM model	Node Degree	60	5.00	80.00
Truncated INDRA DB HCM model	Page Rank	2	0.00	0.00
Truncated INDRA DB HCM model	Node Degree	2	0.00	0.00
INDRA DB HCM model	Page Rank	27	22.22	18.52
INDRA DB HCM model	Node Degree	33	21.21	15.15

Factors that affect the quality of machine-curated models

Query constraints in machine-curated models

Keyword-based queries returned considerably more results than MeSH-based queries (Table 7).

Table 7.

Number of results as a consequence of different query constraints

Query type	Query term	Filter	Search details	Number of results
MeSH	Cardiomyopathy, Hypertrophic, Familial	10 years	MeSH Term	265
MeSH	Cardiomyopathy, Hypertrophic, Familial	10 years	MeSH Major Topic	232
keywords	familial hypertrophic cardiomyopathy	10 years	–	562
keywords	“familial hypertrophic cardiomyopathy”	10 years	Exact match	336
keywords	hypertrophic cardiomyopathy	10 years	–	7952
keywords	“hypertrophic cardiomyopathy”	10 years	Exact match	7390

The average year of publication for papers returned by the INDRA Database [20] query using the MeSH term, which was used for the INDRA DB HCM model, was x̅=2010.27, with 43.75% of the papers describing research conducted on human material, 15.97% on human and other species material, and the rest being animal studies.
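The query constraints in Table 7 can be expressed as PubMed search terms. The sketch below only builds request URLs for the NCBI E-utilities ESearch endpoint and sends nothing over the network. It rests on several assumptions: the source does not say whether the searches were run through the web interface or an API, the "10 years" filter is approximated as 3650 days of publication date, and the helper `build_query_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query_url(term, field=None, exact=False, years=10):
    """Build an ESearch URL that asks only for the number of matching records."""
    if exact or field:
        term = f'"{term}"'  # phrase search; also used for MeSH headings
    if field:
        term = f"{term}[{field}]"  # PubMed field tag, e.g. [MeSH Terms]
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",      # filter on publication date
        "reldate": years * 365,  # assumption: 10 years taken as 3650 days
        "rettype": "count",
    }
    return f"{ESEARCH}?{urlencode(params)}"


mesh_heading = "Cardiomyopathy, Hypertrophic, Familial"
urls = {
    "MeSH Term": build_query_url(mesh_heading, field="MeSH Terms"),
    "MeSH Major Topic": build_query_url(mesh_heading, field="MeSH Major Topic"),
    "keywords": build_query_url("hypertrophic cardiomyopathy"),
    "keywords, exact match": build_query_url("hypertrophic cardiomyopathy", exact=True),
}

for label, url in urls.items():
    print(label, "->", parse_qs(urlparse(url).query)["term"][0])
```

Result counts obtained this way would depend on the date of the search and need not match those in Table 7.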

Reading systems’ performance

The dominant reading system for the extraction of statements for the INDRA DB HCM model was Sparser, followed by RLIMS-P, REACH, and TRIPS/DRUM (Fig. 4). The reading systems’ extraction performance differed markedly between reaction types (Table 8). The most extractions per statement were found for different versions of phosphorylation and translocation (Fig. 5).

Fig. 4 — Reading systems’ contribution to extraction of statements for INDRA DB HCM model

Table 8.

Percent of reading systems’ extractions by different reaction types in INDRA DB HCM model

Reaction types	ISI/AMR (%)	RLIMS-P (%)	Eidos (%)	TRIPS/DRUM (%)	Sparser (%)	REACH (%)
Activation, 2 elements	0.01	0.00	0.20	0.05	22.57	77.16
Activation, when binding	0.00	0.00	0.00	0.00	0.00	100.00
Activation, when carrying	0.00	0.00	100.00	0.00	0.00	0.00
Activation, when occurring	0.00	0.00	100.00	0.00	0.00	0.00
Autophosphorylation	0.00	0.00	0.00	1.57	98.43	0.00
Binding inhibits	0.00	0.00	0.00	0.00	0.00	100.00
Binding, 2 elements	0.04	0.00	0.00	0.03	58.36	41.57
Binding, more than 2 elements	0.00	0.00	0.00	0.00	99.07	0.93
Complex	0.00	0.00	100.00	0.00	0.00	0.00
Decreasing the amount, 2 elements	0.00	0.00	0.00	0.00	0.00	100.00
Dephosphorylation, 2 elements	0.00	0.00	0.00	0.00	0.00	100.00
Dephosphorylation, 2 elements, precise	0.00	0.00	0.00	0.00	0.00	100.00
Increasing the amount, 2 elements	0.00	0.00	0.00	0.00	0.00	100.00
Inhibition observed in	0.00	0.00	100.00	0.00	0.00	0.00
Inhibition, 2 elements	0.01	0.00	0.53	0.17	3.27	96.02
Inhibition, when binding	0.00	0.00	0.00	0.00	0.00	100.00
Object dephosphorylated	0.00	0.00	0.00	0.00	99.95	0.05
Object phosphorylated	0.00	18.80	0.00	0.71	80.25	0.24
Object phosphorylated, precise	0.00	9.15	0.00	0.00	90.79	0.06
Object produced	0.00	0.00	0.00	60.00	0.00	40.00
Phosphorylation increases amount	0.00	0.00	0.00	0.00	0.00	100.00
Phosphorylation, 2 elements	0.05	5.08	0.00	0.23	43.26	51.38
Phosphorylation, 2 elements, precise	0.00	1.51	0.00	0.00	43.22	55.28
Subject leads to dephosphorylation of object	0.00	0.00	0.00	0.00	0.00	100.00
Subject leads to phosphorylation of object	0.00	0.00	0.00	0.00	0.00	100.00
Translocation, destination precise	0.00	0.00	0.00	0.08	82.95	16.97
Translocation, starting point precise	0.00	0.00	0.00	0.00	0.00	100.00
Ubiquitination, 2 elements	0.00	0.00	0.00	0.00	0.00	100.00

Fig. 5 — Number of extractions per statement for 28 reaction types in INDRA DB HCM model
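To make Table 8 easier to read, the sketch below picks the reading system with the largest share of extractions for a handful of reaction types. The rows are copied from Table 8; the selection of rows and the helper name `dominant_reader` are choices made for this illustration only.

```python
READERS = ("ISI/AMR", "RLIMS-P", "Eidos", "TRIPS/DRUM", "Sparser", "REACH")

# Share of extractions (%) per reading system, selected rows of Table 8.
SHARES = {
    "Activation, 2 elements": (0.01, 0.00, 0.20, 0.05, 22.57, 77.16),
    "Autophosphorylation": (0.00, 0.00, 0.00, 1.57, 98.43, 0.00),
    "Binding, 2 elements": (0.04, 0.00, 0.00, 0.03, 58.36, 41.57),
    "Complex": (0.00, 0.00, 100.00, 0.00, 0.00, 0.00),
    "Object phosphorylated": (0.00, 18.80, 0.00, 0.71, 80.25, 0.24),
    "Object produced": (0.00, 0.00, 0.00, 60.00, 0.00, 40.00),
    "Phosphorylation, 2 elements": (0.05, 5.08, 0.00, 0.23, 43.26, 51.38),
}


def dominant_reader(shares):
    """Return the reader with the largest share and that share."""
    best = max(range(len(READERS)), key=lambda i: shares[i])
    return READERS[best], shares[best]


dominant = {reaction: dominant_reader(row) for reaction, row in SHARES.items()}

for reaction, (reader, share) in dominant.items():
    print(f"{reaction}: {reader} ({share:.2f}%)")
```

Even in this small subset, four different reading systems come out on top depending on the reaction type, which is why combining readers changes the composition of the assembled model.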

For all reading systems, the most common problem was the presence of two or more critical issues in the same statement (a combination of wrong element, misleading element label, wrong interaction, or wrong direction of the interaction). For the Sparser and TRIPS reading systems, the next most common problems were a wrong element and a wrong direction of the interaction (Fig. 6).

Fig. 6 — Specific issues found in the statements extracted by reading systems. Count of correct statements is shown as a reference point. The “not correct” issue was assigned in cases where two or more critical issues were found. Wrong element, misleading element label, wrong interaction, wrong direction of the interaction were designated as critical issues
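The labelling rule described in the caption of Fig. 6 can be written down as a small function. This is a sketch of the stated rule only: the function name `issue_label` and the fallback label "non-critical issue" are hypothetical, and the source does not enumerate the non-critical issue types.

```python
# Critical issues as listed in the caption of Fig. 6.
CRITICAL_ISSUES = {
    "wrong element",
    "misleading element label",
    "wrong interaction",
    "wrong direction of the interaction",
}


def issue_label(issues):
    """Map the issues found in one extracted statement to a single label."""
    found = set(issues)
    critical = sorted(found & CRITICAL_ISSUES)
    if not found:
        return "correct"
    if len(critical) >= 2:
        return "not correct"  # two or more critical issues in one statement
    if len(critical) == 1:
        return critical[0]
    return "non-critical issue"  # hypothetical label, not used in the source


print(issue_label([]))
print(issue_label(["wrong element"]))
print(issue_label(["wrong element", "wrong interaction"]))
```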

REACH and TRIPS showed much higher accuracy than Sparser (Table 9) but at the cost of extraction performance (Fig. 4, Table 9). The TRIPS reading system proved to be the best single reading system for text segments about HCM when considering a compromise between accuracy and extraction performance (Fig. 4, Table 9).

Table 9.

Accuracy of Sparser, REACH, and TRIPS reading systems

	Sparser	REACH	TRIPS
Tolerably accurate^a (%)	41.02	83.59	84.38
Not tolerably accurate, not inaccurate (%)	12.89	8.01	6.64
Inaccurate^b (%)	46.09	8.40	8.98
No extraction (%)	–	68.16	38.48

Accuracy has been determined for all text segments for which Sparser, as the most dominant reading system, extracted a statement. ^a Tolerably accurate: correct statement or no extraction; ^b Inaccurate: contains critical issue(s)
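The trade-off between accuracy and extraction performance can be made explicit from Table 9. Because "tolerably accurate" covers both correct statements and text segments with no extraction, subtracting the "no extraction" share gives an estimate of the share of segments that yielded a correct statement. The sketch below performs that subtraction. It assumes that the "no extraction" share lies entirely within "tolerably accurate", and that Sparser has no "no extraction" cases because the evaluation set was defined by Sparser extractions.

```python
# Values (%) copied from Table 9; Sparser's "no extraction" is "-" in the table
# and is treated as 0.0 here (assumption based on how the set was defined).
TABLE_9 = {
    "Sparser": {"tolerable": 41.02, "intermediate": 12.89, "inaccurate": 46.09, "no_extraction": 0.0},
    "REACH": {"tolerable": 83.59, "intermediate": 8.01, "inaccurate": 8.40, "no_extraction": 68.16},
    "TRIPS": {"tolerable": 84.38, "intermediate": 6.64, "inaccurate": 8.98, "no_extraction": 38.48},
}


def correct_statement_share(row):
    """Share of text segments that yielded a correct statement (derived value)."""
    return round(row["tolerable"] - row["no_extraction"], 2)


correct = {reader: correct_statement_share(row) for reader, row in TABLE_9.items()}
totals = {
    reader: round(row["tolerable"] + row["intermediate"] + row["inaccurate"], 2)
    for reader, row in TABLE_9.items()
}
best = max(correct, key=correct.get)

for reader in TABLE_9:
    print(f"{reader}: correct {correct[reader]:.2f}%, "
          f"inaccurate {TABLE_9[reader]['inaccurate']:.2f}%")
print("largest derived share of correct statements:", best)
```

Under these assumptions, REACH is accurate largely because it extracts nothing for most segments, Sparser extracts for every segment but is inaccurate for almost half of them, and TRIPS combines the largest derived share of correct statements with a low share of inaccurate ones.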

For the INDRA DB HCM model, 44.19% of the statements extracted by the Eidos reading system (the result of 20.65% of total extractions by Eidos) were meaningless and inapplicable (Additional file 4). These statements were structurally complex and introduced confusing noise into the model. For statements representing simple interactions (consisting of one subject, one object, and an interaction between them), Eidos extracted plausible and applicable statements.

Interactive HCM map

The Interactive HCM map is available at https://silicofcm.eu/interactive-map/. It is hosted on the MINERVA (Molecular Interaction NEtwoRks VisuAlization) platform [35–37] which interfaces with DrugBank [5], ChEMBL [6], CTDbase [7], and miRTarBase [8]. The majority of the proteins that have a 3D structure already resolved and available in the Protein Data Bank can be directly visualized and explored using MolArt [38], a built-in MINERVA platform visualization tool.

Plugins enable additional on-site analysis. In maps with defined pathway areas, the Gene set enrichment analysis (GSEA) plugin [37] retrieves active data overlays and performs enrichment analysis, highlighting pathways significantly enriched for data overlays. These data can be user-provided. The Adverse drug reactions plugin [37] links an external data file to the corresponding map elements. Targets of drugs with identified adverse reactions are shown in the map and can be filtered. The Disease-variant associations plugin [37] indicates genes with variants associated with a given disease. The Map exploration plugin [37] enables focused exploration of the molecular interaction network (e.g., of the neighborhood of a molecule appearing multiple times in a network). The Centrality plugin [39] calculates network topology values. The Overlays plugin [39] automatically creates, displays, or removes multiple overlays from uploaded data files.