Implementing the VMC Specification to Reduce Ambiguity in Genomic Variant Representation

Michael Watkins; Shawn Rynearson; Alex Henrie; Karen Eilbeck

2020 Mar 4;2019:1226–1235.

PMCID: PMC7153148 PMID: 32308920

Abstract

Current methods used for representing biological sequence variants allow flexibility, which has created redundancy within variant archives and discordance among variant representation tools. While research methodologies have been able to adapt to this ambiguity, strict clinical standards make it difficult to use this data in what would otherwise be useful clinical interventions. We implemented a specification developed by the GA4GH Variant Modeling Collaboration (VMC), which details a new approach to unambiguous representation of variants at the allelic level, as a haplotype, or as a genotype. Our implementation, called the VMC Test Suite (http://vcfclin.org), offers web tools to generate and insert VMC identifiers into a VCF file and to generate a VMC bundle JSON representation of a VCF file or HGVS expression. A command line tool with similar functionality is also introduced. These tools facilitate use of this standard—an important step toward reliable querying of variants and their associated annotations.

Introduction

As we near twenty years since the completion of the human genome and ten years since the breaking of the sequencing cost barrier¹, the amount of genomic variant data which has been generated and made available for research and clinical use is staggering. These variants are represented using nomenclatures and file types which are designed with enough extensibility to account for the growing complexity of variant data.

The most common variant file type is the Variant Call Format (VCF), which stores all variants from a single reference sequence in a tab-delimited format based on genomic coordinates. The most common variant representation nomenclature is called HGVS after the Human Genome Variation Society that develops it. This nomenclature specifies how to represent individual instances of genomic variation in a way that is human-readable.

Although typical analyses using variant data utilize both HGVS and VCF, there are a few fundamental differences between the two methodologies. For example, each methodology uses a different variant normalization method. Normalization refers to the choice of position for a variant that falls within a nucleotide repeat, where it can be shifted left or right without changing the resulting sequence. HGVS requires variants to be right-justified while VCF requires left-justification. This is further complicated as both HGVS and VCF continue to evolve to accommodate the growing complexity of variants²,³. Versioning differences between tools implementing HGVS or VCF cause ambiguity in how these variants are represented, leading to complicated querying and annotation in later use⁴,⁵.
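The effect of left versus right justification can be illustrated with a minimal sketch. The toy reference sequence and the function names below are hypothetical and chosen only for illustration; this is not the algorithm used by any particular HGVS or VCF tool.

```python
def shift_deletion(ref: str, start: int, length: int, direction: str) -> int:
    """Return the 0-based interbase start of a deletion after shifting it
    as far as possible to the left or to the right.

    The deletion removes ref[start:start + length]. Moving the window by one
    base leaves the resulting sequence unchanged only when the base entering
    the window equals the base leaving it.
    """
    if direction == "left":
        while start > 0 and ref[start - 1] == ref[start + length - 1]:
            start -= 1
    elif direction == "right":
        while start + length < len(ref) and ref[start] == ref[start + length]:
            start += 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    return start


def delete(ref: str, start: int, length: int) -> str:
    """Apply a deletion of `length` bases beginning at interbase position `start`."""
    return ref[:start] + ref[start + length:]


# Toy reference with a CA repeat; delete one CA unit from the middle of the repeat.
reference = "TGCACACAG"
left = shift_deletion(reference, 4, 2, "left")
right = shift_deletion(reference, 4, 2, "right")

print("left-justified start:", left)
print("right-justified start:", right)
print("same resulting sequence:", delete(reference, left, 2) == delete(reference, right, 2))
```

Both placements describe the same biological change, yet they yield different coordinates, which is why two tools following different justification rules can disagree on how to write one variant.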

Research institutions have been able to adapt their individual methods in response to this ambiguity⁶. However, this has not been the case for clinical institutions that rely on strict informatics standards⁷,¹¹. Reproducible variant research and full clinical utilization of variant data all require a fundamental change in variant representation methods to address ambiguity. Individuals and institutions must be able to exchange variants and communicate about them with a surety that they are referring to the same genomic change on the same genomic reference sequence.

Ambiguity in Variant Representation

Ambiguous variant representation, caused by inconsistency in syntax structure, results in two or more representations of a single variant. Even the change of one letter in the syntax used to represent a variant can cause inconsistency and result in ambiguous representations. This causes downstream complications in associating annotations with that variant⁵. Here we outline some causes of ambiguity in two popular variant representation methods.

VCF

The VCF standard has de facto become the principal variant file format in clinical practice, surpassing others such as genomeVCF and the genome variation format (GVF)¹². This is more likely due to its long history and vast set of supporting software tools than to its ability to avoid ambiguity. In fact, the VCF specification is extremely flexible, with many optional parameters (local phasing information, reference calls, no-calls, quality, likelihood, etc.) and many recommended, but not required, parameters (accession version, VCF version, and HGNC gene identification)³,¹³. While this flexibility allows VCF to be used for a wide variety of applications, it creates two problems when translating to the clinical space. The first is that clinical decision support applications require reliable data fields in the source data over which they compute, and the variable nature of most of the fields in a typical VCF file leaves it unsuitable¹². The second is ambiguity in the representation of the individual variants themselves. A simple example is acknowledged in the latest VCF specification³ and is shown in Figure 1.

Figure 1. — Different representations of a two base genomic deletion.

Though difficult to detect at first, this entry shows an allelic variant which could occur at two different locations on the reference sequence. While this example is a simple two base pair difference in location, it would result in redundant entries to a variant archive and would complicate future extraction and annotation of the variant.
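A small sketch makes this kind of redundancy concrete. The records below are hypothetical and are not the ones shown in Figure 1; they simply follow the VCF convention that a deletion is written with an anchor base in both REF and ALT. The helper name `apply_vcf_record` is an assumption introduced for this example.

```python
def apply_vcf_record(reference: str, pos: int, ref: str, alt: str) -> str:
    """Apply a single VCF-style record (1-based POS) to a reference sequence."""
    start = pos - 1  # convert the 1-based POS to a 0-based index
    if reference[start:start + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    return reference[:start] + alt + reference[start + len(ref):]


reference = "TGCACACAG"

# Two hypothetical records for a two-base deletion inside the CA repeat.
record_a = {"pos": 2, "ref": "GCA", "alt": "G"}
record_b = {"pos": 4, "ref": "ACA", "alt": "A"}

result_a = apply_vcf_record(reference, **record_a)
result_b = apply_vcf_record(reference, **record_b)

print(result_a, result_b, result_a == result_b)
```

The two records differ in POS, REF, and ALT, so a naive string comparison treats them as distinct variants even though they produce the same sequence.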

There are many more instances of ambiguity in the representation of variants which are possible because of the flexibility of the VCF standard. Most arise unexpectedly and, as mentioned previously, are accounted for in the research space by adjusting methods and processes in a way which is not feasible in clinical application.

HGVS

The HGVS nomenclature has been designed to describe the wide array of sequence variation including not only sequence changes, but biological mechanism, predicted events, and complex states¹⁴. There is an important distinction to be made between variant representation and variant annotation. A variant representation captures information related to the position of the variant with regard to a particular reference, whereas an annotation refers to additional information such as interpretations made about the variant. Annotating a particular variant depends very much on the unambiguous representation of that variant. One study demonstrates how difficult it is for several institutions to implement the HGVS nomenclature correctly when representing variants¹⁵, and many other studies show downstream complications in annotating variants that are represented ambiguously in HGVS within variant archives¹⁶,²².

A study by Yen et al. (2017) identified ambiguous HGVS representations both generated by certain tools and found in different variant archives. The study compared them to the preferred HGVS representation which was compiled using the most recent HGVS specification version. These examples are included in Figure 2 and show the ambiguity and discord which results from tools and archives not being able to keep up with a constantly evolving nomenclature. Interestingly, the study also identified ambiguous representations which each comply with the most recent HGVS specification version. These are shown in the last two rows of Figure 2 and demonstrate that ambiguity is an issue in the nomenclature itself and not just a result of differing versions being used by tools and archives. This issue must be addressed before reliable and standardized computational approaches to using variant data can be implemented.

Figure 2. — Examples of multiplicity of variant representation within HGVS nomenclature.

General sources of ambiguity

While fundamental aspects of VCF and HGVS make them particularly prone to ambiguous variant representation, there are additional more generalized sources of ambiguity that vary between tools and archives. These include whether coordinates are 0-based (as they are in the UCSC genome browser²³) or 1-based (as they are in the Ensembl genome browsers²⁴), which reference sequence is used, and whether or not the coordinates used are inclusive or interbase.
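The difference between inclusive and interbase coordinates can be sketched as follows. The function names are hypothetical, and the conversion shown is the conventional one between 1-based inclusive positions and 0-based interbase (half-open) positions; individual tools may define their systems differently.

```python
from typing import Tuple


def one_based_inclusive_to_interbase(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive span to 0-based interbase coordinates."""
    return start - 1, end


def interbase_to_one_based_inclusive(start: int, end: int) -> Tuple[int, int]:
    """Convert 0-based interbase coordinates to a 1-based inclusive span.

    A zero-width interbase interval (an insertion point between two bases)
    has no inclusive equivalent, so it is rejected here.
    """
    if start == end:
        raise ValueError("zero-width interval cannot be written as an inclusive span")
    return start + 1, end


reference = "TGCACACAG"

# Bases 3 through 4 in 1-based inclusive terms are the first CA unit.
interbase = one_based_inclusive_to_interbase(3, 4)
bases = reference[interbase[0]:interbase[1]]  # Python slices are interbase already

print(interbase, bases)
print(interbase_to_one_based_inclusive(*interbase))
```

The same two bases are addressed as (3, 4) in one system and (2, 4) in the other, so exchanging bare coordinates without stating the system invites off-by-one errors.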

Clinical effects of ambiguous variant representation

Two independent studies have produced technical desiderata for the integration of genomic data into a clinical setting⁸,¹⁰. These describe a future of clinical systems which leverage information from many different genomic and non-genomic data sources in order to increase the ability of clinicians to use genomic data to make relevant treatment changes or recommendations. However, the studies conclude that this ideal is only possible through certain technical changes in how genomic data is represented and stored. For example, the actual variant data should be separated from clinical observations and should support lossless data compression⁸. Variants should also be able to be classified into groups of common clinical impact while still supporting the ability to reference at the individual variant level when necessary¹⁰. One desideratum also calls for a common knowledge base which is deployed at and developed by multiple independent organizations. This would allow affordable access to comprehensive genomic data¹⁰. However, such a federated approach would likely have a high tolerance for redundant submissions and ambiguous entries unless a more reliable representation standard is adopted.

Although the opportunity for clinical genomic innovations and tools to provide recommendations and decision support to clinicians is constantly growing⁹,²⁵,²⁷, their implementation is stymied by the issue of ambiguity in variant representation²⁸,²⁹. Mismatched variant annotations and subsequent false treatment changes could be disastrous¹⁶. These issues have begun to be addressed by a GA4GH task force, which assigned a sub-group, the VMC, to develop a variant model and a defined specification to facilitate variant representation and reliable exchange.

Variant Modelling Collaboration

The purpose of the Variant Modelling Collaboration (VMC) is to address the issue of unreliable genomic data exchange by developing a fundamentally different approach to representation. The workgroup developed new data models to represent not only allelic variations, but variation at the genotype and haplotype level. These models are straightforward and well-adapted to future modifications¹⁴. The data models themselves are computationally digested to generate unique identifiers to serve as a machine-readable representation of the variant. Thus, the VMC data model unambiguously names variation, with respect to reference sequence, and is well-equipped to foster reliable exchange between institutions or queries from external variant knowledge bases.

The VMC specification breaks variant data into fundamental components. These components are represented in specific data models and then run through the VMC digest algorithm (a combination of hash and encoding algorithms) to create unique identifiers which represent those specific components¹⁴. The following is an example, shown visually in Figure 3, of creating a VMC identifier for an allelic variant:

Figure 3. — Identifier generation for data types needed for variant representation.

1. Take the raw sequence data for the reference sequence and run it through the digest algorithm. This returns a VMC sequence identifier.
2. Take the start and end coordinates for the allelic variant and use them, according to the VMC interval model, to create a VMC interval object.
3. Run both the VMC sequence identifier and VMC interval object through the digest as a VMC location object, according to the model, to generate a VMC location identifier.
4. Use the VMC location identifier and the alternate base(s) found in the previously specified interval (referred to as state in the VMC specification) to create a VMC allele object according to the model. Then digest that object to generate the ultimate goal of a unique VMC allele identifier.
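The chain of digests described in these steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the normative algorithm: the text above says only that the digest combines a hash and an encoding, so the choice of a truncated SHA-512 with base64url encoding, the serialization strings, and the `VMC:GS_`, `VMC:GL_`, and `VMC:GA_` prefixes are all assumptions made for this example. The VMC repository¹⁴ holds the authoritative definitions.

```python
import base64
import hashlib


def digest(blob: bytes, n_bytes: int = 24) -> str:
    """Assumed digest: SHA-512, truncated to n_bytes, base64url-encoded."""
    truncated = hashlib.sha512(blob).digest()[:n_bytes]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def sequence_id(sequence: str) -> str:
    """Step 1: digest the raw reference sequence."""
    return "VMC:GS_" + digest(sequence.encode("ascii"))


def location_id(seq_id: str, start: int, end: int) -> str:
    """Steps 2 and 3: combine the sequence identifier with an interval and digest."""
    serialized = f"<Location:<Identifier:{seq_id}>:<Interval:{start}:{end}>>"
    return "VMC:GL_" + digest(serialized.encode("utf-8"))


def allele_id(loc_id: str, state: str) -> str:
    """Step 4: combine the location identifier with the state and digest."""
    serialized = f"<Allele:<Identifier:{loc_id}>:{state}>"
    return "VMC:GA_" + digest(serialized.encode("utf-8"))


# Toy reference and a substitution of the base at interbase interval (3, 4).
toy_reference = "TGCACACAG"
gs = sequence_id(toy_reference)
gl = location_id(gs, 3, 4)
ga = allele_id(gl, "T")

print(gs)
print(gl)
print(ga)
```

Because each identifier is derived only from the content beneath it, two parties who start from the same sequence, interval, and state arrive at the same allele identifier without coordinating with each other.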

The current VMC specification also has data models to represent and digest identifiers for haplotypes and genotypes. A VMC haplotype is used to designate multiple VMC alleles as being in phase or “cis”. A VMC genotype is used to designate a group of VMC haplotypes. Both models are included in Figure 3.

In this way, fundamental components of variant data can be abstracted, compartmentalized, and used to generate unique representations. The algorithm used to generate these identifiers was designed to drop the probability of collision in a corpus of 10³⁰ objects to less than a 1 in 10²⁷ chance¹⁴. This straightforward approach of a strictly standardized VMC data model digested into a unique identifier removes variant representation ambiguity and fosters reliable variant matching.

Implementation

We have developed a VMC Test Suite as an implementation of the schemas and digest algorithm explained in this first version of the specification. The Test Suite has four tools which have been made available as publicly accessible web tools hosted at http://vcfclin.org. The code can be found at https://github.com/eilbecklab/VMC-Software-Suite.

The first tool, shown in Figure 4, accepts a user-uploaded VCF file. It then goes through each variant of the file and pulls location intervals and alternate bases in order to generate a VMC allele identifier for each entry. It stores that identifier, along with the VMC sequence identifier and VMC location identifier used in the process, in the info field of that variant entry. It also adds requisite header lines which explain the additions. The modified file can then be downloaded back by the user.
The second tool also takes a user-uploaded VCF file as input. However, rather than give back a modified VCF file for download with VMC identifiers added in, the user is given a VMC bundle object, represented in the JSON format, which contains each of the VMC objects and identifiers generated from the file. The VMC bundle is envisaged as a mechanism to integrate complex variant data such as haplotypes into a workflow.
The third tool implements existing code found in the VMC GitHub repository¹⁴ which converts an HGVS string into a VMC bundle. This VMC bundle is a JSON object which holds each of the VMC identifiers generated for the variant encoded by the HGVS string.
A command line tool has also been developed with very similar functionality to each of the three tools. It is a lightweight implementation of the core data models and digest algorithm included in the VMC specification. It is publicly available through a GitHub repository³⁰.
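The behaviour of the first tool can be approximated with the sketch below. It is not the Test Suite code: the INFO keys (`VMCGSID`, `VMCGLID`, `VMCGAID`), the header wording, the digest construction, and the placeholder sequence identifier are assumptions for illustration. The sketch also simplifies by treating the whole REF span as the interval and the whole ALT string as the state, and it handles only a single ALT allele per line.

```python
import base64
import hashlib


def _digest(text: str) -> str:
    """Assumed digest: truncated SHA-512, base64url-encoded."""
    raw = hashlib.sha512(text.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(raw).decode("ascii")


# Hypothetical header lines describing the added INFO keys.
HEADER_LINES = [
    '##INFO=<ID=VMCGSID,Number=1,Type=String,Description="VMC sequence identifier">',
    '##INFO=<ID=VMCGLID,Number=1,Type=String,Description="VMC location identifier">',
    '##INFO=<ID=VMCGAID,Number=1,Type=String,Description="VMC allele identifier">',
]


def annotate_vcf_line(line: str, seq_id: str) -> str:
    """Append VMC identifiers to the INFO column of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    pos, ref, alt = int(fields[1]), fields[3], fields[4]

    # Interbase interval covering the REF allele (VCF POS is 1-based).
    start, end = pos - 1, pos - 1 + len(ref)

    loc_id = "VMC:GL_" + _digest(f"<Location:<Identifier:{seq_id}>:<Interval:{start}:{end}>>")
    all_id = "VMC:GA_" + _digest(f"<Allele:<Identifier:{loc_id}>:{alt}>")

    extra = f"VMCGSID={seq_id};VMCGLID={loc_id};VMCGAID={all_id}"
    fields[7] = extra if fields[7] in ("", ".") else fields[7] + ";" + extra
    return "\t".join(fields)


# Placeholder identifier; a real one would come from digesting the reference sequence.
example_seq_id = "VMC:GS_EXAMPLE_PLACEHOLDER"
example_line = "1\t100\trs1\tA\tG\t.\tPASS\tDP=10"
annotated = annotate_vcf_line(example_line, example_seq_id)
print(annotated)
```

Keeping the original INFO content and appending to it means the annotated file remains usable by tools that know nothing about the added keys.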

Figure 4. — One of the VMC Test Suite tools which adds VMC identifiers to a VCF file.

Evaluation

In order to confirm correct implementation of the VMC data models, digest algorithm, and output format, three validation tests were performed on the VMC Test Suite. These tests focused on the three main processes which enable the functionality of the tools in the suite.

The first test was to validate our process of generating sequence identifiers. SeqRepo is a large collection of biological sequences which is made available by Biocommons³¹. The VMC developers were granted access to SeqRepo and generated a sequence identifier for each entry. We used SeqRepo as the gold standard for this test.
The second test was to validate our process of generating allele identifiers. Since allele identifiers require a location identifier (as shown in Figure 3), this would also test our process of generating location identifiers. The VMC developers created Python code which generates VMC identifiers for HGVS expressions. We used outputs of that code as the gold standard for this test.
The third test was to validate our implementation of the VMC bundle JSON schema. This schema is available on the VMC GitHub repository and was used as the gold standard for this test.

Test 1: Sequence identifiers

The tools in the suite each draw from a custom database of pre-digested sequence identifiers which grows as new variants are encountered. This cuts down on future processing time. The suite must use information from the VCF entry to download the appropriate reference FASTA file for that entry from NCBI. This FASTA file is then run through the VMC digest algorithm in order to generate the sequence identifier. Since a difference of even one base in the FASTA file used would result in non-identical sequence identifiers, our process of locating and digesting the appropriate FASTA file had to be validated. To do this, we simply compared the sequence identifiers in our database to those found in a gold standard database. Both associated an accession number with the sequence identifier, making the look-up process simple.
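The caching behaviour described here might look like the following sketch. The class and function names are hypothetical, the digest construction is an assumption, and the download step is replaced by an injected callable so that the example runs without network access. Upper-casing the sequence before digesting is likewise an assumption about how the FASTA content is prepared.

```python
import base64
import hashlib
from typing import Callable, Dict


def fasta_to_sequence(fasta_text: str) -> str:
    """Drop FASTA header lines and join the remaining lines into one sequence."""
    lines = fasta_text.splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">")).upper()


def sequence_digest(sequence: str) -> str:
    """Assumed digest: truncated SHA-512, base64url-encoded, with a VMC:GS_ prefix."""
    raw = hashlib.sha512(sequence.encode("ascii")).digest()[:24]
    return "VMC:GS_" + base64.urlsafe_b64encode(raw).decode("ascii")


class SequenceIdCache:
    """Remember sequence identifiers by accession so each FASTA is fetched once."""

    def __init__(self, fetch_fasta: Callable[[str], str]) -> None:
        self._fetch_fasta = fetch_fasta
        self._ids: Dict[str, str] = {}
        self.fetch_count = 0

    def get(self, accession: str) -> str:
        if accession not in self._ids:
            self.fetch_count += 1
            sequence = fasta_to_sequence(self._fetch_fasta(accession))
            self._ids[accession] = sequence_digest(sequence)
        return self._ids[accession]


def fake_fetch(accession: str) -> str:
    """Stand-in for a download; returns a toy FASTA record."""
    return f">{accession} toy record\nTGCAC\nACAG\n"


cache = SequenceIdCache(fake_fetch)
first = cache.get("TOY_ACCESSION.1")
second = cache.get("TOY_ACCESSION.1")
print(first, cache.fetch_count)
```

The second lookup is served from the cache, which is the saving in processing time that the pre-digested database is meant to provide.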

We found that, of the 1720 accession numbers which were shared between the databases, each one had identical sequence identifiers. There were 204,819 accession numbers in our database which weren’t found in SeqRepo but since the process used to generate their corresponding sequence identifiers was the same as the 1720 matches, these can be assumed to be accurate sequence identifiers as well.
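The comparison itself reduces to a keyed look-up. A minimal sketch follows, with a hypothetical function name and made-up accessions and identifiers standing in for the two databases:

```python
from typing import Dict, List, Union


def compare_identifier_maps(
    ours: Dict[str, str], gold: Dict[str, str]
) -> Dict[str, Union[int, List[str]]]:
    """Compare two accession -> sequence identifier mappings."""
    shared = ours.keys() & gold.keys()
    mismatched = sorted(acc for acc in shared if ours[acc] != gold[acc])
    return {
        "shared": len(shared),
        "mismatched": mismatched,
        "only_ours": len(ours.keys() - gold.keys()),
    }


# Toy data; these accessions and identifiers are invented for the example.
ours = {"ACC_1": "VMC:GS_aaa", "ACC_2": "VMC:GS_bbb", "ACC_3": "VMC:GS_ccc"}
gold = {"ACC_1": "VMC:GS_aaa", "ACC_2": "VMC:GS_bbb", "ACC_9": "VMC:GS_zzz"}

report = compare_identifier_maps(ours, gold)
print(report)
```

An empty `mismatched` list corresponds to the outcome reported above, where every shared accession carried an identical sequence identifier.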

Test 2: Allele identifiers

As explained previously, VMC allele identifiers are each generated by digesting a data model which includes a VMC sequence identifier, a VMC location identifier, and an allelic change. After getting the right sequence identifier for a given VCF entry, the suite extracts all the information needed to fill the location and allele data models. It digests the location data model and uses it to complete the allele data model. This model is then digested into an allele identifier.

To test these processes, we needed a VCF file which had HGVS expressions for each entry. This would provide an HGVS expression to use as input for the HGVS conversion code written by the VMC developers. Allele identifiers generated with this code would serve as a gold standard. We would then be able to compare that identifier to the one generated by the suite (generated from the other fields of that VCF entry). Identical allele identifiers would show that the suite is generating allele identifiers (and, by extension, location identifiers) properly.

ClinVar provides a weekly release VCF file of their variants³² which have HGVS expressions included in the INFO field for each entry. An example entry from the file, with portions of the INFO field omitted to highlight the included HGVS expression and rsID, is shown in Figure 5. The included rsID was also important because it allowed the suite to select the appropriate sequence identifier.

Figure 5. — Entry from the ClinVar weekly release VCF file.

Of the 393403 variants in the file, the suite generated allele identifiers for 381257 of them (96.91%) which were identical to the gold standard allele identifiers. Of the non-identical allele identifiers, 11450 of them (2.91% of total) came from deletions, 682 of them (0.17% of total) came from insertions, and 14 of them (0.0036% of total) came from indels. On closer inspection, these non-identical allele identifiers represent an assortment of edge cases where the coordinates of the VCF entry do not match the location in the corresponding HGVS expression.
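The reported counts and percentages are internally consistent, as a short check using only the figures stated above shows:

```python
total = 393403
identical = 381257
mismatches = {"deletion": 11450, "insertion": 682, "indel": 14}


def percent_of_total(count: int) -> float:
    """Express a count as a percentage of all variants in the file."""
    return 100 * count / total


# Every variant is either identical or falls into one mismatch category.
accounted_for = identical + sum(mismatches.values())

print("accounted for:", accounted_for, "of", total)
print("identical: %.2f%%" % percent_of_total(identical))
for kind, count in mismatches.items():
    print("%s: %.4f%%" % (kind, percent_of_total(count)))
```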

The identical allele identifiers include a high percentage of all forms of variation accepted by the gold standard HGVS conversion code (substitutions, insertions, deletions, and indels). We believe this test validates our approach for generating correct VMC identifiers and points to rare situations where VCF and HGVS are not aligned.

Test 3: VMC bundle JSON

To ascertain that the JSON representation of the tools' results was true to the VMC bundle JSON schema included in the VMC specification, we used the jsonschema Python tool³³ to check each field. This tool confirmed that the output of our second VCF tool (which creates a VMC bundle in JSON from a VCF file) matches the schema.

Discussion

The VMC specification provides an important building block in the effort to overcome ambiguity in variant representation and resulting discordance in variant archives. It also provides a backbone for naming complex variants such as haplotypes, a feature necessary for communication of pharmacogenomic and immunologic variants. It takes a fundamentally different approach to the process of representing variants and provides unique identifiers to enable reliable querying and annotation of those variants. However, the next important problem which must be considered is that of equivalence.

General sources of ambiguity such as different reference sequences, different coordinate systems, alternate transcripts, etc., do not go away with the VMC data models. However, VMC does provide more stringent representation standards and as a result, the concept of equivalence definitions is now a possibility. Four specific types of equivalence are discussed here with the acknowledgement that additional considerations will continue to become more apparent as the VMC standard is adopted.

Normalization

Different normalization strategies and tools can result in slightly different location intervals. These differing location intervals could lead to different VMC identifiers being generated which should be considered equivalent. Anticipating this, the VMC specification requires that the VT normalization algorithm³⁴ be used to normalize variants prior to their inclusion in the data model and digestion into a VMC identifier. By requiring all variants to be normalized using the same algorithm, it will be possible to computationally determine normalization equivalence and account for it in future implementations of the VMC specification.

Projection

A variant may be represented on a genomic sequence, on a transcript, or on a protein sequence. With the volume of genomic data now available it is likely that any particular variant could be represented in all three ways. Location coordinates and reference sequences will be different for each representation but, in the end, they are referring to the same variant and should be considered equivalent. This would allow the combination and utility of all three levels of annotation associated with those sequence types.

Alternate transcripts

A common source of ambiguity is different variant locations arising from alternate transcripts. Biologically, the different splice sites of a particular genomic region can result in a variant being found on two non-identical transcripts. While the transcripts themselves are non-identical, the variant is functionally the same. The different transcripts lead to different variant locations which ultimately would result in different VMC identifiers. However, these different identifiers should be considered equivalent because they refer to a functionally-equivalent variant. Establishing this equivalency would require computation over gene annotations and genomic-level interpretations. While the computation would be complex, VMC makes this equivalence possible to determine in future implementations.

Lift-over

The idea of lift-over is purely systematic and not biological. It refers to the ability to take a variant backward and forward between different genome builds and versions while maintaining equivalence. Each genome build and version will generate different VMC identifiers which are functionally identical and should be considered equivalent. NCBI currently hosts a lift-over tool called Remap which projects annotation data between different coordinate systems³⁵. Providing this functionality for variant representations on a larger scale will be a possibility as VMC implementation continues to grow and many more VMC identifiers are generated and associated.
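One way such associations between identifiers could be recorded is as equivalence classes. The sketch below uses a small union-find structure; it is purely illustrative, is not part of the VMC specification, and uses invented placeholder identifiers. Deciding which identifiers are in fact equivalent is the hard part and is assumed to happen elsewhere.

```python
from typing import Dict


class EquivalenceRegistry:
    """Group identifiers that have been declared equivalent (union-find)."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}

    def _find(self, item: str) -> str:
        """Return the representative of the class containing item."""
        self._parent.setdefault(item, item)
        while self._parent[item] != item:
            # Path halving keeps later look-ups short.
            self._parent[item] = self._parent[self._parent[item]]
            item = self._parent[item]
        return item

    def declare_equivalent(self, a: str, b: str) -> None:
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def are_equivalent(self, a: str, b: str) -> bool:
        return self._find(a) == self._find(b)


registry = EquivalenceRegistry()

# Placeholder identifiers for one variant on two builds and on a transcript.
registry.declare_equivalent("VMC:GA_build_one_example", "VMC:GA_build_two_example")
registry.declare_equivalent("VMC:GA_build_two_example", "VMC:GA_transcript_example")

print(registry.are_equivalent("VMC:GA_build_one_example", "VMC:GA_transcript_example"))
print(registry.are_equivalent("VMC:GA_build_one_example", "VMC:GA_unrelated_example"))
```

Equivalence declared pairwise propagates through the class, so an annotation attached to any one identifier can be found from any of the others.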

Conclusion

We have presented an implementation of the GA4GH VMC specification, with the hope that these tools will facilitate the use of this variant representation standard. Using VMC identifiers to represent variant data will reduce redundancy within variant archives. The computationally stringent implementation requirements will also significantly reduce discord among tools used to generate VMC identifiers. This reduction in ambiguity in variant representation will allow more reliable and precise data queries, resulting in more reproducible research methods and many useful clinical applications which were not possible before.

Acknowledgments

GA4GH is increasingly involved in standards development for biological data. The VMC group provides an open environment for discussion and development of specifications and tools for variant representation. We are thankful for the productive weekly discussion and innovation. We want to highlight Reece Hart for initiating this group and for his leadership. This work was supported by the National Institutes of Health: R01HG008628 to KE, and NLM T15-LM007124 training predoctoral slot to MW.

Figures & Table

Figure 6. — Example depictions of the four types of equivalence discussed here.

References

1.Bonetta L. Whole-Genome Sequencing Breaks the Cost Barrier. Cell. 2010 Jun;141(6):917–919. doi: 10.1016/j.cell.2010.05.034. [DOI] [PubMed] [Google Scholar]
2.den Dunnen JT, Dalgleish R, Maglott DR, Hart RK, Greenblatt MS, McGowan-Jordan J, et al. HGVS Recommendations for the Description of Sequence Variants: 2016 Update. Human Mutation. 2016 Jun;37(6):564–569. doi: 10.1002/humu.22981. [DOI] [PubMed] [Google Scholar]
3.The Variant Call Format (VCF) Version 4.2 Specification. 2018 [Google Scholar]
4.McCarthy DJ, Humburg P, Kanapin A, Rivas MA, Gaulton K, Cazier J, et al. Choice of transcripts and software has a large effect on variant annotation. Genome medicine. 2014;6(3):26. doi: 10.1186/gm543. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Yen JL, Garcia S, Montana A, Harris J, Chervitz S, Morra M, et al. A variant by any name: quantifying annotation discordance across tools and clinical databases. Genome Medicine. 2017 Dec;9(1):7. doi: 10.1186/s13073-016-0396-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Dalgleish R. LSDBs and How They Have Evolved. Human Mutation. 2016 Jun;37(6):532–539. doi: 10.1002/humu.22979. [DOI] [PubMed] [Google Scholar]
7.Vijay P, McIntyre ABR, Mason CE, Greenfield JP, Li S. Clinical Genomics: Challenges and Opportunities. Critical Reviews in Eukaryotic Gene Expression. 2016;26(2):97–113. doi: 10.1615/CritRevEukaryotGeneExpr.2016015724. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Masys DR, Jarvik GP, Abernethy NF, Anderson NR, Papanicolaou GJ, Paltoo DN, et al. Technical desiderata for the integration of genomic data into Electronic Health Records. Journal of Biomedical Informatics. 2012 Jun;45(3):419–422. doi: 10.1016/j.jbi.2011.12.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Ullman-Cullere MH, Mathew JP. Emerging landscape of genomics in the electronic health record for personalized medicine. Human Mutation. 2011 May;32(5):512–516. doi: 10.1002/humu.21456. [DOI] [PubMed] [Google Scholar]
10.Welch BM, Eilbeck K, Fiol GD, Meyer LJ, Kawamoto K. Technical desiderata for the integration of genomic data with clinical decision support. Journal of Biomedical Informatics. 2014 Oct;51:3–7. doi: 10.1016/j.jbi.2014.05.014. [DOI] [PubMed] [Google Scholar]
11.Li MM, Datto M, Duncavage EJ, Kulkarni S, Lindeman NI, Roy S, et al. Standards and Guidelines for the Interpretation and Reporting of Sequence Variants in Cancer: A Joint Consensus Recommendation of the Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists. The Journal of molecular diagnostics : JMD. 2017;19(1):4–23. doi: 10.1016/j.jmoldx.2016.10.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Lubin IM, Aziz N, Babb LJ, Ballinger D, Bisht H, Church DM, et al. Principles and Recommendations for Standardizing the Use of the Next-Generation Sequencing Variant File in Clinical Settings. The Journal of Molecular Diagnostics. 2017;19(3):417–426. doi: 10.1016/j.jmoldx.2016.12.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics (Oxford, England) 2011 Aug;27(15):2156–8. doi: 10.1093/bioinformatics/btr330. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.VMC Github Repository. Available at: https://github.com/ga4gh/vmc. [Google Scholar]
15.Tack V, Deans ZC, Wolstenholme N, Patton S, Dequeker EMC. What’s in a Name? A Coordinated Approach toward the Correct Use of a Uniform Nomenclature to Improve Patient Reports and Databases. Human Mutation. 2016 Jun;37(6):570–575. doi: 10.1002/humu.22975. [DOI] [PubMed] [Google Scholar]
16.Vail PJ, Morris B, van Kan A, Burdett BC, Moyes K, Theisen A, et al. Comparison of locus-specific databases for BRCA1 and BRCA2 variants reveals disparity in variant classification within and among databases. Journal of Community Genetics. 2015 Oct;6(4):351–359. doi: 10.1007/s12687-015-0220-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Pepin MG, Murray ML, Bailey S, Leistritz-Kessler D, Schwarze U, Byers PH. The challenge of comprehensive and consistent sequence variant interpretation between clinical laboratories. Genetics in Medicine. 2016 Jan;18(1):20–24. doi: 10.1038/gim.2015.31. [DOI] [PubMed] [Google Scholar]
18.Balmaa J, Digiovanni L, Gaddam P, Walsh MF, Joseph V, Stadler ZK, et al. Conflicting Interpretation of Genetic Variants and Cancer Risk by Commercial Laboratories as Assessed by the Prospective Registry of Multiplex Testing. Journal of Clinical Oncology. 2016 Dec;34(34):4071–4078. doi: 10.1200/JCO.2016.68.4316. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Gradishar W, Johnson K, Brown K, Mundt E, Manley S. Clinical Variant Classification: A Comparison of Public Databases and a Commercial Testing Laboratory. The Oncologist. 2017 Jul;22(7):797–803. doi: 10.1634/theoncologist.2016-0431. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Mitropoulou C, Webb AJ, Mitropoulos K, Brookes AJ, Patrinos GP. Locus-specific database domain and data content analysis: evolution and content maturation toward clinical usea. Human Mutation. 2010 Sep;31(10):1109–1116. doi: 10.1002/humu.21332. [DOI] [PubMed] [Google Scholar]
21.Knoppers BM. Framework for responsible sharing of genomic and health-related data. The HUGO journal. 2014 Dec;8(1):3. doi: 10.1186/s11568-014-0003-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Global Alliance for Genomics and Health. A federated ecosystem for sharing genomic, clinical data. Science. 2016 Jun;352(6291):1278–1280. doi: 10.1126/science.aaf6162. [DOI] [PubMed] [Google Scholar]
23.UCSC Genome Browser Home. Available at: https://genome.ucsc.edu/ [Google Scholar]
24.Ensembl genome browser 95. Available at: https://uswest.ensembl.org/index.html. [Google Scholar]
25.Overby CL, Kohane I, Kannry JL, Williams MS, Starren J, Bottinger E, et al. Opportunities for genomic clinical decision support interventions. Genetics in medicine : official journal of the American College of Medical Genetics. 2013 Oct;15(10):817–23. doi: 10.1038/gim.2013.128. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Cutting E, Banchero M, Beitelshees AL, Cimino JJ, Fiol GD, Gurses AP, et al. User-centered design of multigene sequencing panel reports for clinicians. Journal of Biomedical Informatics. 2016 Oct;63:1–10. doi: 10.1016/j.jbi.2016.07.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Welch BM, Kawamoto K. The need for clinical decision support integrated with the electronic health record for the clinical application of whole genome sequencing information. Journal of personalized medicine. 2013 Dec;3(4):306–25. doi: 10.3390/jpm3040306. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Krier JB, Kalia SS, Green RC. Genomic sequencing in clinical practice: applications, challenges, and opportunities. Dialogues in clinical neuroscience. 2016;18(3):299–312. doi: 10.31887/DCNS.2016.18.3/jkrier. [DOI] [PMC free article] [PubMed] [Google Scholar]
29. Shirts BH, Salama JS, Aronson SJ, Chung WK, Gray SW, Hindorff LA, et al. CSER and eMERGE: current and potential state of the display of genetic information in the electronic health record. Journal of the American Medical Informatics Association. 2015 Jul:ocv065. doi: 10.1093/jamia/ocv065.
30. VMCCL GitHub Repository. Available at: https://github.com/srynobio/vmccl.
31. SeqRepo GitHub Repository. Available at: https://github.com/biocommons/biocommons.seqrepo.
32. ClinVar weekly VCF release (GRCh37). Available at: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/
33. Jsonschema GitHub Repository. Available at: https://github.com/Julian/jsonschema.
34. Tan A, Abecasis GR, Kang HM. Unified representation of genetic variants. Bioinformatics. 2015 Jul;31(13):2202–2204. doi: 10.1093/bioinformatics/btv112.
35. What is NCBI Remap? Available at: https://www.ncbi.nlm.nih.gov/genome/tools/remap/docs/whatis.
