
Figure 1.

(a) An overview of the text processing and record sampling performed ahead of fine-tuning BERT models with ClinVar submission text summaries. (b) An example submission summary (SCV002749858): in this submission, the lab describes the variant (gray highlighting) and classifies it as pathogenic (pink highlighting). We trained a sentence classifier to identify and filter out these description and conclusion sentences so that only sentences containing evidence (blue highlighting) are used in model training. We also show the text filtering statistics for the SentenceClassifier on the ClinVar dataset. (c) Sentence type proportion distribution for the three submission classification labels (B/LB, VUS, and P/LP) in the training data. Text summaries from the B/LB and P/LP classes have much larger fractions of evidence-labeled sentences, whereas VUS-labeled samples have a much larger share of description-labeled sentences. (d) Sentence type proportion distribution by ClinVar submission creation year, with pre-2019 years grouped together and individual years shown from 2019 through 2024.
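As a rough illustration of the filtering step in panel (b) and the proportion summaries in panels (c) and (d), the sketch below shows how evidence-only text and sentence-type proportions might be computed. The `classify_sentence` heuristic is a hypothetical stand-in for the trained SentenceClassifier described in the Methods; all names here are illustrative and not the authors' implementation.

```python
from collections import Counter

SENTENCE_TYPES = ("description", "evidence", "conclusion")

def classify_sentence(sentence: str) -> str:
    """Placeholder for the trained sentence-type classifier (keyword heuristic only)."""
    lowered = sentence.lower()
    if "classified as" in lowered or "interpreted as" in lowered:
        return "conclusion"
    if lowered.startswith("this variant is") or "missense change" in lowered:
        return "description"
    return "evidence"

def filter_evidence(summary: str) -> str:
    """Keep only evidence sentences from a submission text summary (panel b)."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    return ". ".join(s for s in sentences if classify_sentence(s) == "evidence")

def type_proportions(summaries: list) -> dict:
    """Sentence-type proportions across a set of summaries (panels c and d)."""
    counts = Counter()
    for summary in summaries:
        for s in (x.strip() for x in summary.split(".") if x.strip()):
            counts[classify_sentence(s)] += 1
    total = sum(counts.values()) or 1
    return {t: counts[t] / total for t in SENTENCE_TYPES}

if __name__ == "__main__":
    example = ("This variant is a missense change. "
               "It was observed in three affected probands. "
               "Functional assays show reduced activity. "
               "The variant is classified as pathogenic.")
    print(filter_evidence(example))      # evidence sentences only
    print(type_proportions([example]))   # per-type proportions
```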

Data source: ClinVar data obtained from NCBI 1 (accessed June 2023). The data and text corpus were processed using the pipeline described in the Methods section.