AMIA Annual Symposium Proceedings. 2025 May 22;2024:1206–1214.

A Comprehensive System for Searching and Evaluating Genomic Variant Evidence Using AI and Knowledge Bases to Support Personalized Medicine

Jinlian Wang 1, Hui Li 1, Hongfang Liu 1
PMCID: PMC12099401  PMID: 40417484

Abstract

We introduce an automated system for searching and assessing genetic variant evidence, aligned with ACMG guidelines. By combining artificial intelligence (AI), Elasticsearch, and an extensive knowledge base, the system improves the efficiency and accuracy of genetic variant interpretation. Unlike existing methods, it features a literature filtering mechanism that automates the identification and relevance ranking of scientific articles, significantly reducing the time spent on literature evidence search and streamlining the evidence assessment process. Implemented and tested by a commercial company’s hereditary cancer variant curation team, the system demonstrated its effectiveness and scalability by processing over 3 million PMIDs and 1.8 million full-text articles. This period of active use yielded insights into the system’s real-world impact and user experience and supported its robustness. Our comparative analysis with Mastermind 2.0 shows that the system produces fewer false positives for nonsense and frameshift variants. The core AI model achieves precision, recall, and F1 scores above 0.8, indicating that it reliably selects pertinent literature for variant classification. The experience gained from deploying the system in a commercial setting provides a distinctive view of its practicality and of directions for future development. Integrating AI with traditional genetic variant curation processes promises further advances and broader applications in the field.

Introduction

The emergence of high-throughput sequencing technologies has been a watershed moment for genetics, enabling the rapid detection of genomic variants linked to an array of diseases and traits. This influx of genomic information underscores the urgent need for sophisticated methods to interpret these variants for personalized medicine applications [1, 2]. Variant interpretation is a nuanced task that requires combining information from genetic databases, scientific literature, and clinical reports to establish each variant’s relevance to specific health outcomes or treatment efficacies [3].

A single patient’s exome might reveal 200-500 rare variants after analysis, which makes pinpointing causative variants amid the available data laborious [4]. The task is compounded by the manual work of sifting through, evaluating, and synthesizing evidence from many sources, a process that can take several hours or days for just one variant [5]. Finding previously documented disease-causing variants within a patient’s genome can significantly expedite this process, which underscores the value of ongoing research that adds to the growing number of articles in literature databases such as the National Library of Medicine’s MEDLINE/PubMed. A single article’s findings can shift a variant’s classification from uncertain to likely pathogenic, highlighting the critical role of comprehensive literature review in clinical variant interpretation.

The variant curation process from literature encompasses two primary activities: the retrieval of variant evidence and the assessment of this evidence’s validity and strength, culminating in diagnostic assertions at both the variant and gene levels. Despite the growing body of literature on human genetic diseases and the arduous nature of manual variant curation, the advent of enhanced computational access to primary literature underscores the potential benefits of automating parts of the variant curation pipeline [6]. Our focus is on automating the retrieval of variant evidence from primary literature concerning Mendelian diseases, without delving into the automation of variant evaluation. This involves identifying papers that mention specific variants, linking these mentions to the correct gene transcript, and translating them into genomic coordinates for downstream use.

With PubMed hosting over 33 million biomedical abstracts and PubMed Central’s Open Access Subset providing full-text articles, the extraction of genomic variant evidence from this vast corpus is a significant bottleneck in designing diagnostic sequencing panels and interpreting variants for genetic disease diagnoses [7]. The search process is further complicated by the diversity and complexity of genetic variant nomenclatures and by the challenge of accurately identifying diseases, genes, and drug compounds within these texts. Current approaches, such as manual searches through PubMed or Google Scholar, often return many false positives because they cannot reconcile different nomenclatures for the same biological entity, which reduces specificity. Inconsistencies in genomic variant assessment across laboratories also remain a challenge, even though the American College of Medical Genetics and Genomics (ACMG) criteria [8] were introduced to harmonize pathogenicity determinations. Studies reveal varying degrees of concordance in variant interpretations among clinical laboratories, pointing to the need for more systematic evidence provision to ensure consistent interpretations.

In response to these challenges, we introduce a variant evidence search and assessment system designed to automate evidence collection and improve the precision of variant interpretation in precision medicine. The system lets users navigate a large dataset of medically annotated literature through a user-friendly interface and an evaluation tool based on ACMG guidelines. It not only automates evidence gathering but also prioritizes literature by relevance, significantly lightening the curator’s load and fostering more accurate and consistent variant classifications. This paper describes the system’s design principles, architecture, and functionalities, and discusses its potential to improve variant curation and personalized medicine.

Methods

Our approach uses automation to streamline the gathering and analysis of evidence from diverse sources, facilitating the classification of genetic variants in line with ACMG guidelines. It integrates text mining, Elasticsearch, artificial intelligence, and comprehensive knowledge bases to pinpoint evidence of pathogenicity in germline variants, greatly reducing the need for time-consuming manual curation. Elasticsearch indexes supplementary files as well as main texts, giving comprehensive evidence coverage. In addition, APIs for Google and Google Scholar allow our search to encompass the latest and most pertinent literature. By consolidating data from various sources and refining the process of evidence evaluation, our system enhances the accuracy and efficiency of variant interpretation. The workflow, illustrated in Figure 1, unfolds in three primary phases: (1) Evidence Collection, which uses text mining, APIs, and machine learning to identify relevant mentions within full-text articles, supplementary materials, and structured knowledge bases; (2) Evidence Integration, where all gathered data and literature are compiled into a centralized database or S3 bucket, with each piece of evidence labeled by its source and tallied by variant occurrence; and (3) Evidence Sorting and Ranking, where evidence is organized based on its relevance to the specific gene and variant under investigation.
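The Evidence Integration phase can be pictured with a small sketch. This is only an illustration under our own assumptions: the record layout `(variant, pmid, source)`, the function name `integrate`, and the placeholder identifiers are hypothetical and are not taken from the system itself.

```python
from collections import Counter, defaultdict


def integrate(evidence):
    """Group evidence records by (variant, PMID) and tally papers per variant.

    `evidence` is an iterable of (variant, pmid, source) tuples; this layout
    is an assumption made for illustration only.
    """
    sources_by_pair = defaultdict(set)
    for variant, pmid, source in evidence:
        # Label each variant-paper pair with every source that reported it.
        sources_by_pair[(variant, pmid)].add(source)

    # Count distinct papers per variant (the "tally by variant occurrence").
    papers_per_variant = Counter(variant for variant, _ in sources_by_pair)
    return dict(sources_by_pair), papers_per_variant
```

Keeping the source labels as a set per variant-paper pair means that the same paper reported by several sources is counted once per variant while still recording where it came from.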

Figure 1.

Workflow diagram of the system.

To enhance the precision of our search queries, we fine-tune a binary BioBERT classifier [9], tailored for clinical relevance, to filter papers by their pertinence to genetic variant research. This includes filtering by journal names, keywords, and other related fields to incrementally improve the precision of automated classification and the ranking of evidence. The process begins when users input one or more genetic variants of interest, specifying details such as the gene name, nucleotide change, or the effect on the protein. The system then automatically searches across multiple knowledge bases, including our internal database and Elasticsearch, and conducts real-time searches on Google Scholar and Google for any evidence linked to the specified variants.

We then apply preliminary filtering criteria of our own design, which consider publication date, relevance, and source credibility, to narrow the search results to the most applicable evidence. All gathered evidence is consolidated into a comprehensive table that highlights key details such as the article title, publication year, and authors. An algorithm then automatically ranks this compiled evidence, prioritizing it by relevance, the credibility of the source, and the strength of its association with the variant in question. Consistent with ACMG/AMP guidelines, our system emphasizes the importance of incorporating functional studies from the literature into the clinical interpretation of sequenced variants.

The system integrates ranked publications with a sophisticated machine learning algorithm to estimate their relevance to ACMG classification criteria, facilitating an initial categorization of literature evidence into categories such as PS3, PS4, PM3, and PM4. For instance, PS3 encompasses research articles that provide functional studies demonstrating a variant’s negative impact on gene functionality or protein product. This evidence could emerge from biochemical assays, animal models, or cell-based assays clarifying the variant’s pathogenic mechanisms. PS4 focuses on evidence showing a variant’s significant prevalence in affected individuals compared to controls. PM3 emphasizes genetic evidence, supported by literature documenting instances where a variant was seen in trans with another pathogenic variant, indicative of a recessive disorder. PM4 includes literature detailing the functional or clinical outcomes of in-frame deletions/insertions or stop-loss variants within the same gene, aiding in understanding the pathogenicity mechanisms of these variants. To facilitate curator interaction, the system presents a user-friendly interface showcasing the sorted evidence, preliminary classifications, and tools for manual evaluation and modifications. Curators are empowered to refine the automated classifications and evidence rankings based on their expertise, culminating in the final variant classification in accordance with ACMG standards. The literature and database entries are regularly updated, incorporating new research findings and user feedback to continuously enhance the accuracy of evidence ranking and variant classification processes.

MEDLINE XML files were retrieved and stored in an S3 bucket; these files, together with the baseline data, were parsed into a database table of baseline records. PDFs and supplementary materials (such as .doc, .xls, and .ppt files) were acquired from PubMed Central (PMC). A specialized pipeline running on the AWS ECS platform manages and updates all obtained files. Our analysis encompasses approximately 33 million PubMed abstracts, around 1.8 million full-text articles from PMC, and an additional 60,000 articles manually sourced outside of PMC. These documents were processed by our proprietary text mining pipeline to identify mentions of genes and variants.

To extract disease and mutation details from the collected publications, we used dbSNP for mutation data, along with phenotype keywords, a mutation list, and a Conditional Random Field (CRF) model [10]. Our disease terminology draws on UMLS (Unified Medical Language System), HPO (Human Phenotype Ontology), and the London Dysmorphology Database (LDDB) [11]. The database stores each gene and variant mention together with the sentence and location in the text where it occurs and an evaluative score for the mention. Variant detection uses 1,007 regular expressions designed to recognize a range of mutation notations, including “c.1054C>G,” “g.58354491G>C,” “g.58865857G>C,” “n.1438G>C,” “Arg648Stop,” and “Ser49Cys.”
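To give a sense of what regular-expression-based variant detection looks like, the sketch below covers only the notations quoted above. The two patterns, the names `PATTERNS` and `find_variant_mentions`, and the returned tuple layout are our own illustrative assumptions; they are not the system’s 1,007 expressions and would miss many real-world notations.

```python
import re

# Three-letter amino acid codes used in protein-level notations.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|"
       "Thr|Trp|Tyr|Val")

# Illustrative patterns only (assumed, not taken from the system).
PATTERNS = {
    # Coding (c.), genomic (g.), or non-coding (n.) substitutions: c.1054C>G
    "dna_substitution": re.compile(r"\b[cgn]\.\d+[ACGT]>[ACGT]\b"),
    # Protein changes in three-letter form: Ser49Cys, Arg648Stop
    "protein_3letter": re.compile(
        rf"\b(?:p\.)?(?:{AA3})\d+(?:{AA3}|Stop|Ter)\b"
    ),
}


def find_variant_mentions(text):
    """Return (start_offset, kind, matched_text) tuples in document order."""
    mentions = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            mentions.append((match.start(), kind, match.group(0)))
    return sorted(mentions)
```

Returning the character offset alongside each match mirrors the idea of recording where in the text a mention occurs, so that the surrounding sentence can be recovered later.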

In our Elasticsearch setup, we index every article file, including the main text and supplementary materials, stored within S3 buckets. For each article, identified by its PMID, we gather meta-information such as the title, publication year, journal, authors, article type, keywords, and abstract. This data is retrieved from the corresponding XML file in the baseline database table. If a PMID is not found in this table, we fall back on the PubMed API.

To index the files associated with each PMID within the S3 bucket, we first locate the primary text document. It is identified in one of two ways: if a .nxml file is present, the file sharing the same name but with a .pdf extension is designated as the main text; if no such file exists, the file named “<PMID>.pdf” assumes this role. At this phase, we index the following file types: .doc/.docx, .pdf, .rtf, .html/.htm, .pptx, .xls/.xlsx, and .csv/.tsv. The next step converts these files to text using various methods: .doc/.docx, .rtf, .pdf, and .html/.htm files through textract; .xls files using pandas.read_csv; and .xlsx files with openpyxl’s load_workbook.

For structured files (.csv, .tsv, .xls, and .xlsx), and Excel files in particular, we index each line or row individually to reduce the risk of false-positive results. This ensures that a gene and a variant are matched only when they appear together, and not when they occur in different rows or lines of the same file. The resulting documents are then sent for indexing in batches of 5,000.
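The main-text selection rule, row-level indexing, and batching described above can be sketched as follows. This is a minimal illustration using only the standard library and delimited text; the function names, the document fields, and the file names in the usage note are hypothetical, and the conversion of other file types and the actual submission to the search index are deliberately left out.

```python
import csv
import io
from itertools import islice
from pathlib import PurePosixPath

BATCH_SIZE = 5000  # batch size stated in the text


def pick_main_text(filenames, pmid):
    """Apply the two-step rule: a .pdf twin of a .nxml file, else <PMID>.pdf."""
    names = set(filenames)
    for name in filenames:
        path = PurePosixPath(name)
        if path.suffix == ".nxml":
            candidate = str(path.with_suffix(".pdf"))
            if candidate in names:
                return candidate
    fallback = f"{pmid}.pdf"
    return fallback if fallback in names else None


def row_documents(pmid, filename, text, delimiter=","):
    """Yield one index document per non-empty row of a delimited file."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        content = " ".join(cell.strip() for cell in row if cell.strip())
        if content:
            yield {"pmid": pmid, "file": filename,
                   "row": row_number, "text": content}


def batched(docs, size=BATCH_SIZE):
    """Group an iterable of documents into lists of at most `size` items."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

With a supplementary table whose rows are `BRAF,p.V600E` and `TP53,p.R213*`, each row becomes its own document, so a query for BRAF together with p.R213* matches neither row, whereas indexing the whole file as one document would return it as a hit.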

To determine the relevance of each article, we employ a weighted voting strategy that assigns a score to each of the criteria below (the weights can be customized for different content), so that each paper’s relevance to the search query can be assessed consistently:

  1. Match in the title: Score of 2.0

  2. Match with our internal knowledge base: Score of 1.5

  3. Publication in the most recent year: Score of 1.5

  4. Match in the abstract: Score of 1.0

  5. Match with Human Gene Mutation Database: Score of 1.0

  6. Match with ClinVar database: Score of 1.0

  7. Match with LitVar database: Score of 1.0

  8. Match found by our internal text mining tool: Score of 1.0

  9. Match found through Google Scholar searches: Score of 0.5

  10. Match found through Google searches: Score of 0.5
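The weighted voting scheme above can be sketched in a few lines. The weights are those listed; summing the weights of all matched criteria is our assumption about how the votes are combined, and the criterion keys and function names are hypothetical labels chosen for illustration.

```python
# Weights copied from the list above; the keys are illustrative labels.
WEIGHTS = {
    "title": 2.0,
    "internal_knowledge_base": 1.5,
    "most_recent_year": 1.5,
    "abstract": 1.0,
    "hgmd": 1.0,
    "clinvar": 1.0,
    "litvar": 1.0,
    "internal_text_mining": 1.0,
    "google_scholar": 0.5,
    "google": 0.5,
}


def relevance_score(matched, weights=WEIGHTS):
    """Sum the weights of the criteria a paper satisfies (assumed combination)."""
    matched = set(matched)
    unknown = matched - set(weights)
    if unknown:
        raise KeyError(f"unknown criteria: {sorted(unknown)}")
    return sum(weights[criterion] for criterion in matched)


def rank(papers, weights=WEIGHTS):
    """Return (paper_id, score) pairs sorted from most to least relevant.

    `papers` maps a paper identifier to the set of criteria it matched.
    """
    scored = [(pid, relevance_score(hits, weights)) for pid, hits in papers.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Passing a different `weights` dictionary is one way to realize the customization mentioned above, for example to down-weight web search hits for a particular content type.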

The user interface prioritizes and lists publications based on the above scoring system to ensure that users can readily identify the most significant papers. It labels each paper with scores indicating its predicted relevance to PS3, PS4, PM3, and PM4 categories as per ACMG guidelines, employing AI to make these distinctions based on patient/functional/segregation data. The system offers functionalities like ranking, filtering, and sorting for efficient literature access. It features a journal ranking diagram and an author network to display the influence and connections of the research, alongside a variant map viewer for an overview of pathogenic variants’ distribution within genes. A standout feature is the ability to annotate PDF texts directly from the interface, enabling users to extract, comment, and highlight essential information.

Our search strategy incorporates both strict and loose matching techniques to balance specificity and sensitivity. The strict approach, which relies on exact matches and phrase quoting, is preferred for identifying novel or rare variants and ensures high specificity. Conversely, the loose approach uses Boolean logic to combine various gene and mutation identifiers, including legacy names, for a wider search scope that is better suited to common variants. This strategy includes forming complex search strings for different search engines, accounting for their specific Boolean operators. Table 1 gives examples of how these search queries are constructed.
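A minimal sketch of how such query strings might be assembled is shown below, following the layout of the examples in Table 1. The function names are hypothetical, and the rule of quoting any term that contains a space is our simplifying assumption; engine-specific operator syntax is not handled.

```python
def quote(term):
    """Quote multi-word terms so they are searched as exact phrases."""
    return f'"{term}"' if " " in term else term


def or_group(terms):
    """Join synonyms into a parenthesized OR group."""
    return "(" + " OR ".join(quote(term) for term in terms) + ")"


def strict_query(gene, variant):
    """Strict strategy: one exact variant notation plus the gene symbol."""
    return f"{variant} ({gene})"


def loose_query(gene_names, variant_names):
    """Loose strategy: any gene alias AND any variant notation."""
    return f"{or_group(gene_names)} AND {or_group(variant_names)}"
```

For the BRAF example, `strict_query("BRAF", "p.V600E")` gives `p.V600E (BRAF)`, while the loose form built from the aliases `BRAF` and `v-raf` and the three variant notations reproduces the corresponding Table 1 entry with straight quotes.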

Our system was fully implemented and used daily within the hereditary cancer testing project of a firm specializing in genomic research, until the project stopped running in 2022 for market-related reasons. This period of active usage provided valuable insights into the system’s operational efficacy, user interface design, and overall impact on streamlining the variant evidence curation process. Despite the company’s financial outcome, deploying the system in a real-world environment contributed significantly to our understanding of its practical applications and of areas for further enhancement.

Results

Our analysis identified a total of 262,933 unique PMIDs and 1,374,202 distinct variants, culminating in 2,866,397 unique PMID-variant pairings. From the ClinVar database [12] (data as of October 2021), we extracted 59,181 unique PMIDs and 691,709 variants, resulting in 1,050,373 PMID-variant pairs. Additionally, we processed the human subset of DisGeNET v.7.0 [13], which comprises 30,170 entries, isolating associated publications to achieve a non-redundant collection of 881,185 articles. By June 2022, we had indexed a total of 4,158,287 PMIDs covering main texts and supplementary materials.

We developed and launched the entire system on the AWS cloud platform. It was used by a company’s hereditary cancer curation team for clinical diagnosis reporting, in conjunction with the variant interpretation tool (VarInt), a separate system for clinical variant management and reporting that we developed. Our system integrates with VarInt, automatically populating evidence into ACMG categories for variant classification. Figure 2 shows the system’s main features and user interface.

Figure 2.1.

The master table compiles evidence from Google, Google Scholar, MEDLINE, ClinVar, the internal database, and LitVar, organizing the data according to assigned weights.

Figure 2.2.

Variant browser: a variant exploration tool that displays pathogenicity information from ClinVar and the internal knowledge base along with the genomic position of variants, enabling curators to understand the context of the variants of interest.

Figure 2.3.

ACMG Guidelines-Driven Variant Auto-Classification Interface: A backend pipeline annotates and classifies variants according to collected evidence from all accessible resources. Users can customize the color scheme, display options, and individual forms.

Figure 2.4.

Co-Author Network: Selecting an author’s name highlights their co-authors across both the network diagram on the left and the demographic graph.

Figure 2.5.

PDF Annotator: Genes and variants are automatically highlighted, allowing users to extract evidence, highlight further, or add comments. All modifications can be saved or exported to VarInt.

Figure 2.6.

Journal Impact Ordering and Source Filtering: Users can refine the master table by selecting or deselecting sources, which reorganizes the table based on journal impact and chosen resources.

To evaluate the efficacy of our method, we compared it with Mastermind 2.0. We searched for missense, nonsense, and frameshift variants that had already been manually curated by experts, using precisely defined search terms. We selected 568 expert-chosen PMIDs, the most important papers expected to be returned by both Mastermind and our system. Mastermind returned 349 papers: 92 overlapped with the 295 PMIDs, 94 of the PMIDs were not found, and 163 fell outside the 295 PMIDs, giving a precision of 92/(92 + 163) = 36% and a recall of 92/(92 + 94) = 49%. Our system retrieved 734 PMIDs: 95 overlapped with the 295 PMIDs, 91 were not found, and 401 fell outside the 295 PMIDs, giving a precision of 95/(95 + 401) = 19.2% and a recall of 95/(95 + 91) = 51%. Our primary focus was on minimizing false positives, because irrelevant papers require significant curator effort to filter out. The comparison showed that our approach yielded fewer false positives than Mastermind for nonsense and frameshift mutations. For missense mutations, however, we observed a higher rate of false positives, which we attribute to our comprehensive search of full texts and supplementary files, including Excel and Word documents, in which the variant was mentioned. The results of this false positive comparison are detailed in Table 2.

Table 2.

False positive (FP) comparison between Mastermind and our system for three variant types.

| Variant Type | Mastermind FP | Our system FP |
|---|---|---|
| Missense | 30 | 1100 |
| Nonsense | 70 | 20 |
| Frameshift | 144 | 123 |
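The precision and recall figures reported for the Mastermind comparison follow the standard definitions, precision = TP/(TP + FP) and recall = TP/(TP + FN). The short sketch below recomputes them from the counts given in the text; the dictionary layout and function names are our own, and the counts are simply those reported, taken at face value.

```python
def precision(tp, fp):
    """Fraction of returned papers that were expected papers."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of expected papers that were returned."""
    return tp / (tp + fn)


# Counts as reported in the comparison: overlaps (tp), returned papers
# outside the expected set (fp), and expected papers not found (fn).
REPORTED = {
    "Mastermind": {"tp": 92, "fp": 163, "fn": 94},
    "Our system": {"tp": 95, "fp": 401, "fn": 91},
}

for name, counts in REPORTED.items():
    p = precision(counts["tp"], counts["fp"])
    r = recall(counts["tp"], counts["fn"])
    print(f"{name}: precision={p:.1%}, recall={r:.1%}")
```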

Regarding the AI model employed to identify related papers, we randomly selected 5,000 papers for performance testing. The results, illustrated in Figure 3.1, demonstrate that our model achieves precision, F1 score, and recall rates all above 0.8. Figure 3.2 presents the confusion matrix for the fine-tuned model, providing further insights into its performance accuracy.

Figure 3.1.

AI model performance assessment

Figure 3.2.

Confusion matrix of AI model

Discussion and Conclusions

Drawing on the methodologies and findings detailed in our study, we conclude that our integrated system, combining text mining, Elasticsearch, artificial intelligence, and comprehensive knowledge bases, offers a significant advancement in the automated classification of genetic variants. Our approach not only streamlines the collection and assessment of evidence from multiple sources in accordance with ACMG guidelines but also markedly reduces the time and effort traditionally required for manual curation.

The system’s capability to efficiently manage and index vast amounts of literature, evidenced by our processing of over 3 million PMIDs, including main texts and supplementary files, underscores its potential to enhance genetic variant research. In the comparison with Mastermind 2.0, our relevance scoring approach produced fewer false positives for nonsense and frameshift mutations (Table 2). This gain is balanced against a higher rate of false positives for missense mutations, which results from searching full texts and supplementary files comprehensively. Additionally, the AI model demonstrates strong performance, with precision, F1 score, and recall all above 0.8, supporting its reliability in identifying relevant literature.

Our findings suggest that the integration of AI and advanced text mining techniques can substantially improve the accuracy and efficiency of genetic variant interpretation. This progression not only supports researchers and clinicians in making more informed decisions but also paves the way for future developments in personalized medicine. As the system continues to evolve, regular updates and refinements based on new literature, database entries, and user feedback will further enhance its utility and precision.

In conclusion, our study contributes to the growing body of knowledge on genetic variant classification and underscores the transformative potential of leveraging technology to refine and expedite this process. The system’s high degree of accuracy, combined with its capacity to reduce manual workload, positions it as a valuable tool in the field of genomics and personalized medicine.

Figures & Tables

Table 1.

Search string examples with strict and loose strategies.

| Variant Type | Strict | Loose |
|---|---|---|
| Missense | p.V600E (BRAF) | (BRAF OR v-raf) AND (c.1799T>A OR p.V600E OR “valine to glutamate”) |
| Frameshift | c.35delG (GJB2) | (GJB2 OR Connexin 26) AND (c.35delG OR “35delG”) |
| Inframe-del | p.K650_T653del (FGFR3) | (FGFR3 OR Fibroblast growth factor receptor 3) AND (c.1949_1958del OR p.K650_T653del OR “lysine 650 to threonine 653 deletion”) |
| Nonsense (Stop) | p.R213* (TP53) | (TP53 OR p53) AND (c.637C>T OR p.R213* OR “arginine to stop”) |

References

  • 1. Petralia F, Ma W, Yaron TM, Caruso FP, et al. Clinical Proteomic Tumor Analysis Consortium. Pan-cancer proteogenomics characterization of tumor immunity. Cell. 2024 Feb 29;187(5):1255–1277.e27. doi: 10.1016/j.cell.2024.01.027. Epub 2024 Feb 14. PMID: 38359819.
  • 2. Taylor JC, Martin HC, Lise S, et al. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nat Genet. 2015;47(7):717–726. doi: 10.1038/ng.3304.
  • 3. Lek M, Karczewski KJ, Minikel EV, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285–291. doi: 10.1038/nature19057.
  • 4. Dewey FE, Grove ME, Pan C, et al. Clinical interpretation and implications of whole-genome sequencing. JAMA. 2014;311(10):1035. doi: 10.1001/jama.2014.1717.
  • 5. Smedley D, Jacobsen JOB, Jäger M, et al. Next-generation diagnostics and disease-gene discovery with the Exomiser. Nat Protoc. 2015;10(12):2004–2015. doi: 10.1038/nprot.2015.124.
  • 6. Jagadeesh KA, Birgmeier J, Guturu H, et al. Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization. Genet Med Off J Am Coll Med Genet. July 2018.
  • 7. Deisseroth CA, Birgmeier J, Bodle EE, et al. ClinPhen extracts and prioritizes patient phenotypes directly from medical records to expedite genetic disease diagnosis. Genet Med. December 2018. p. 1.
  • 8. Richards S, Aziz N, Bale S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med Off J Am Coll Med Genet. 2015;17(5):405–424. doi: 10.1038/gim.2015.30.
  • 9. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36:1234–1240. doi: 10.1093/bioinformatics/btz682.
  • 10. Zhang N, Cao X, Lin R, Wang B, Shi H, Zhou H, et al. Research on the normalization of traditional Chinese medicine symptom terms in epilepsy. World Sci. Technol.-Modernization Tradit. Chin. Med. 2020;22.
  • 11. Guest SS, Evans CD, Winter RM. The Online London Dysmorphology Database. Genet Med. 1999 Jul-Aug;1(5):207–12. doi: 10.1097/00125817-199907000-00007.
  • 12. https://www.ncbi.nlm.nih.gov/clinvar/
  • 13. https://www.disgenet.org/home/

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
