Bioinformatics. 2019 Nov 18;36(6):1902–1907. doi: 10.1093/bioinformatics/btz856

SPDI: data model for variants and applications at NCBI

J Bradley Holmes 1, Eric Moyer 1, Lon Phan 1, Donna Maglott 1, Brandi Kattman 1
Editor: Jonathan Wren
PMCID: PMC7523648  PMID: 31738401

Abstract

Motivation

Normalizing sequence variants on a reference, projecting them across congruent sequences and aggregating their diverse representations are critical to the elucidation of the genetic basis of disease and biological function. Inconsistent representation of variants among variant callers, local databases and tools results in discrepancies that complicate analysis. NCBI’s genetic variation resources, dbSNP and ClinVar, require a robust, scalable set of principles to manage asserted sequence variants.

Results

The SPDI data model defines variants as a sequence of four attributes: sequence, position, deletion and insertion, and can be applied to nucleotide and protein variants. NCBI web services convert representations among HGVS, VCF and SPDI and provide two functions to aggregate variants. One, based on the NCBI Variant Overprecision Correction Algorithm, returns a unique, normalized representation termed the ‘Contextual Allele’. The SPDI data model, with its four operations, defines exactly the reference subsequence affected by the variant, even in repeat regions, such as homopolymer and other sequence repeats. The second function projects variants across congruent sequences and depends on an alignment dataset of non-assembly NCBI RefSeq sequences (prefixed NM, NR and NG), as well as inter- and intra-assembly-associated genomic sequences (NCs, NTs and NWs), supporting robust projection of variants across congruent sequences and assembly versions. The variant is projected to all congruent Contextual Alleles. One of these Contextual Alleles, typically the allele based on the latest assembly version, represents the entire set, is designated the unique ‘Canonical Allele’ and is used directly to aggregate variants across congruent sequences.

Availability and implementation

The SPDI services are available for open access at: https://api.ncbi.nlm.nih.gov/variation/v0.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Critical to the advancement of genetic research and precision medicine is an improved knowledge of the genetic contribution to diseases, cancers and drug responses (Carter and He, 2016). In the next decade, the sequencing of millions of genomes will produce billions of called variants requiring analysis. A crucial step in analyzing the raw data is aggregating disparate variant representations into standardized reports in large public databases, such as dbSNP (Sherry et al., 2001) and ClinVar (Landrum et al., 2018). The aggregation step can be challenging because any variant can be described using different formats [variant call format (VCF), HGVS, RefSNP and SequenceLocation in ClinVar’s submission XML], different reference sequence types (mRNA, protein or genomic) and different assembly versions (hg18/NCBI36, hg19/GRCh37 or hg38/GRCh38).

Previously, NCBI’s dbSNP, a database of small genetic variations, addressed this problem by aggregating variant data from disparate submitters by projecting (or remapping) the variant to the genome using defined sequences flanking the variant site. All novel and existing variant flanking sequences were mapped to the latest genomic assembly by BLAST (Altschul et al., 1990) and variants of the same type [single nucleotide variation (SNV), deletion, etc.] and genomic location were aggregated and assigned a unique dbSNP Reference SNP cluster ID (RSID) to create a non-redundant set of reference variants (RefSNP). However, the algorithm employed by dbSNP was overly precise for variants in repetitive regions where reported variant position can be ambiguous. RefSNP clustering artifacts could result from errors in submitted flanking sequence, ambiguous or imperfect sequence alignments in repetitive genomic regions, and overprecision of reported variants in those same regions (Assmus et al., 2013; Li et al., 2014).

In the 20 years since dbSNP began, several formats for communicating variants have come into prominence, including HGVS nomenclature (den Dunnen et al., 2016), VCF (Danecek et al., 2011) and more recently the GA4GH Variant Representation (https://vr-spec.readthedocs.io/). Several strategies have been employed to overcome the imperfect aggregation (Deans et al., 2016; Yen et al., 2017) as well as to support the latest representations of variant data, and are discussed in detail elsewhere (Yen et al., 2017) and summarized in Kanterakis et al. (2018). More recent examples of normalization and aggregation tools include the ClinGen Allele Registry (Pawliczek et al., 2018), VariantValidator (Freeman et al., 2018), the HGVS Python Package (Wang et al., 2018), Ensembl’s Variant Recorder (McLaren et al., 2016), TransVar (Zhou et al., 2015), normalize-vt (Tan et al., 2015) and Mutalyzer (den Dunnen, 2016) and are compared in detail in the Supplementary Material.

Here, we document the four simplifying principles adopted by NCBI’s dbSNP and ClinVar resources to support variant aggregation and searching. First, sequence changes are modeled as a collection of alleles and for ClinVar, observations are associated with alleles. Second, all dbSNP and most of ClinVar alleles have known endpoints, and therefore can be represented by the SPDI data model. Third, application of a modified algorithm used previously (Assmus et al., 2013; Tan et al., 2015), and in limited coordination with the ClinGen Allele Registry (Pawliczek et al., 2018) produces a non-parsimonious, unambiguous representation on the submitted sequence, an identifier named the Contextual Allele. Fourth and finally, Contextual Alleles may be aggregated into a Canonical Allele, through a defined set of pairwise sequence alignments. Associated services that support these operations are available via a web API (https://api.ncbi.nlm.nih.gov/variation/v0) to aid users in variant analysis and interpretation. These four simplifying principles now underpin the dbSNP and ClinVar dataflows to aggregate and annotate submissions, and have resulted in the removal of millions of merged, redundant variants.

2 Materials and methods

2.1 Variant model

For nucleotides, SPDI represents all variants as a sequence of four operations (Supplementary Table S1): start at the boundary before the first nucleotide in the sequence S, advance P nucleotides, delete D nucleotides, then insert the nucleotides in the string I. This model has four parameters:

  1. (S)equence: (a string): reference sequence identifier as accession and version (any identifier space supported by the implementer is allowed),

  2. (P)osition: (a non-negative integer) the number of nucleotides to advance on the reference sequence from the boundary before the first nucleotide, which can be thought of as a 0-based interbase coordinate for the variant start,

  3. (D)eletion: (a non-negative integer) deletion length, i.e. the reference allele length

  4. (I)nsertion: (a string) the inserted variant sequence.

Notably, the position for coding transcripts is like ‘r.’ HGVS notation (from the beginning of the sequence), not ‘c.’ (from the start codon). SPDI also supports an alternate representation in which the Deletion field is a string containing the literal sequence to delete. This is redundant information but, for efficiency, it facilitates the transformation of Contextual Alleles to VCF and HGVS without having to refer to the reference sequences. It also enables some error detection for strandedness and off-by-one errors.
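The four operations can be sketched in a few lines of Python. This is a minimal illustration rather than NCBI's implementation: the class name `Spdi`, its methods and the toy accession `toy.1` are assumptions made for the example, and the colon-delimited text form follows the canonical SPDI shown in Table 2. The sketch accepts either form of the Deletion field and uses the literal form for the error detection described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spdi:
    """Hypothetical container for the four SPDI parameters."""
    sequence: str   # reference sequence identifier (accession.version)
    position: int   # 0-based interbase coordinate of the variant start
    deletion: str   # literal deleted sequence, or its length as digits
    insertion: str  # inserted sequence

    @classmethod
    def parse(cls, text):
        # Split "S:P:D:I"; tolerate stray spaces around the fields.
        seq, pos, deletion, insertion = (part.strip() for part in text.split(":"))
        return cls(seq, int(pos), deletion, insertion)

    def deletion_length(self):
        # The Deletion field may be a length or the literal sequence.
        return int(self.deletion) if self.deletion.isdigit() else len(self.deletion)

    def apply(self, reference):
        """Advance P, delete D, insert I; return the variant sequence."""
        start = self.position
        end = start + self.deletion_length()
        if end > len(reference):
            raise ValueError("deletion runs past the end of the reference")
        # A literal deletion string lets us catch strand and off-by-one errors.
        if not self.deletion.isdigit() and reference[start:end] != self.deletion:
            raise ValueError("deletion string does not match the reference")
        return reference[:start] + self.insertion + reference[end:]
```

For example, on a toy reference `"ACG" + "T" * 7 + "AAC"`, both `toy.1:3:TTTTTTT:TTTTTTTTT` and `toy.1:3:7:TTTTTTTTT` yield a sequence with nine Ts, while a literal deletion that disagrees with the reference raises an error instead of silently producing a wrong allele.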

2.2 Two functions to support aggregation: contextual and canonical

NCBI web services based on the SPDI data model supply two key functions to aggregate alleles: the Contextual Allele and the Canonical Allele.

The Contextual Allele is the representation of the variant on a defined reference sequence after correcting for overprecision in repeat regions. Overprecision occurs when gaps in the alignment between observed sequence reads and the reference produce multiple alignment choices but only one is used to calculate the reported variant. It is not possible to report which nucleotide was deleted or inserted in repeat regions. Multiple distinct answers resolve to the same observation. Therefore, it is overly precise to ascribe the variation to one particular change from a group of equal choices. (As a corollary, this manuscript rejects parsimony as a virtue for shiftable insertions and deletions. Shiftable variants written in parsimonious form are, by definition, overly precise).

The reference sequence (S) can be of any type: transcript, protein or genomic. Because it depends only upon the variant and the reference sequence, the Contextual Allele is stable over time. However, it is unique only within a sequence. For more on the algorithm that identifies the Contextual Allele, see Supplementary Figure S1.
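The idea of correcting overprecision can be illustrated with a short sketch. The code below is not the NCBI algorithm (which is given in Supplementary Figure S1); it is a simplified stand-in that assumes the common approach of trimming shared flanks and then growing a pure insertion or deletion to cover every position to which it could be shifted. The function names and the toy reference are hypothetical.

```python
def trim_common(deleted, inserted):
    """Strip the shared prefix, then the shared suffix.

    Returns (prefix_length_removed, deleted, inserted).
    """
    offset = 0
    while deleted and inserted and deleted[0] == inserted[0]:
        deleted, inserted = deleted[1:], inserted[1:]
        offset += 1
    while deleted and inserted and deleted[-1] == inserted[-1]:
        deleted, inserted = deleted[:-1], inserted[:-1]
    return offset, deleted, inserted


def expand_shiftable(reference, position, deleted, inserted):
    """Return (position, deleted, inserted) widened over the shiftable region."""
    if reference[position:position + len(deleted)] != deleted:
        raise ValueError("deleted sequence does not match the reference")
    if deleted == inserted:
        return position, deleted, inserted  # identity: nothing to correct
    offset, deleted, inserted = trim_common(deleted, inserted)
    position += offset
    if deleted and inserted:
        return position, deleted, inserted  # substitution-like: not shiftable
    unit = deleted or inserted  # the sequence that is purely removed or added
    n = len(unit)
    # Count how far the change can roll to the left ...
    left = 0
    while position - 1 - left >= 0 and reference[position - 1 - left] == unit[(n - 1 - left) % n]:
        left += 1
    # ... and to the right.
    end = position + len(deleted)
    right = 0
    while end + right < len(reference) and reference[end + right] == unit[right % n]:
        right += 1
    start = position - left
    new_deleted = reference[start:end + right]
    new_inserted = reference[start:position] + inserted + reference[end:end + right]
    return start, new_deleted, new_inserted
```

On the toy reference `"ACG" + "T" * 7 + "AAC"`, inserting `TT` at boundary 5, inserting `TT` at boundary 10, and replacing `T` with `TTT` at position 3 all expand to the same result: position 3, seven Ts deleted, nine Ts inserted. The result deliberately gives up parsimony in exchange for a single unambiguous form.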

The Canonical Allele extends identification across related, or congruent, sequences, taking into account sequence changes (see Section 2.5). For the purposes of producing a reference catalog, all Contextual Alleles that are placed together in a canonical set are considered the same allele because they result in the same local sequence in a congruent region by alignment. That is, the Canonical Allele represents a set of congruent Contextual Alleles. One contextual representation is chosen as a Canonical Allele Representative and we use its Contextual SPDI as the identifier for the Canonical Allele. The chosen Canonical Allele is based on the latest genomic assembly or version of an mRNA if the genomic position is unknown. Because there is a one-to-one correspondence between the Canonical Allele Representative and the Canonical Allele, we frequently use only the term Canonical Allele. The Canonical Allele depends strongly on accurate determination of which regions of which sequences are congruent. Because the sequence alignment can change over time with improved algorithms and alignment tools, we encapsulate congruent interpretations in a versioned alignment dataset (ADS). Though obvious congruences are stable, difficult alignments such as those in low-complexity or paralogous regions may be improved over time. Thus, some Canonical Allele Representatives may differ as new ADS versions are released. The Canonical Allele Representatives provide (such as for dbSNP) (i) a single exemplar from a set of identical variants to compare against a novel variant and (ii) a non-redundant dataset for variant annotation on sequence viewers and for data exchange.

2.3 Projection

Projection (or remapping or lifting over) is a process for calculating congruent coordinates across sequences using alignments, whether it is between different versions of the same sequence accession, or across sequence types (genomic versus mRNA). The NCBI Variation Remapping Service (https://www.ncbi.nlm.nih.gov/variation/services/remapping/) was originally based on the existing assembly-assembly NCBI Genome Remapping Service (https://www.ncbi.nlm.nih.gov/genome/tools/remap) but now includes additional paired alignment sets to provide support for robust remapping and annotation of variants on different sequences. The ADS includes multiple types of pairwise alignments generated by various NCBI processes including NCBI Splign (Kapustin et al., 2008) for cDNA-to-Genomic, NG aligner and NCBI’s assembly-assembly aligner. See Supplementary Material for more information about the types of alignments and the algorithm used to remap variants.

2.4 Alignment datasets encode sequence relationships

Calculating the Canonical Allele from Contextual Alleles based on different reference sequences requires a set of invertible relationships between sequences in the ADS. The current ADS covers most scenarios for reporting the billions of variants collected by NCBI dbSNP and ClinVar from thousands of different submission sources. However, additional alignment pairs can be created for new reference sequences and added to ADS should the need arise.

The ADS currently consists of over 2.5 million aligned sequence intervals for over 350 000 distinct input sequences. It is stored either as a 187 MB compressed set of annotation-rich, binary seq-align objects or as a 77 MB compressed, stripped-down text file of pairwise regions. It is regularly updated with new sequences as a result of improved alignment heuristics and software updates. Since the alignments in the ADS encode the sequence relationships, it follows that the Canonical Alleles that depend on those alignments are also regularly recomputed. Therefore, any allele set or aggregation system, such as dbSNP or ClinVar, must likewise support a regular mechanism to recompute sets and allow for their merging and splitting.

Consider the pairwise alignments used for remapping an SNV [dbSNP: rs756655831 (www.ncbi.nlm.nih.gov/snp/rs756655831)] (Fig. 1). The ADS heuristic assumes that alignment is transitive between (i) the sequence and its different versions and (ii) between the sequences, NG and NM, that are part of the annotation set for a particular assembly version. Thus, the number of sequence pairs to project between sequences is minimized. Alignment connects the positions on the older sequence models and assembled chromosome sequences through other, more recent, sequences (e.g. NM_003193.4).

Fig. 1.


For rs756655831, a representation of the alignments between various sequences, and the resulting SPDIs. Notably, this RefSNP maps to two RefSeqGenes (TBCE, NG_009230.1 and B3GALNT2, NG_033219.2). Each has its own set of transcripts, of which all current ones align to the GRCh38 chromosomal sequence NC_000001.11. All previous versions are aligned only to the current version of the sequence. NW_014040927.1 is a novel patch to chromosome 1 and aligns to only one RefSeqGene, but all transcripts. The location with the red outline is the canonical representation of this set of variants, which allows submissions on any of the sequences to be grouped together.

2.5 Alignment projection special cases

Two special cases must be considered carefully when projecting through alignments: alignment orientation changes and gaps in the alignment. The forward orientation of mRNA is determined by the direction of transcription (5′–3′), and the associated genes and their transcript products may be in reverse orientation relative to the chromosome. When mapping variants across such alignments, it is important to reverse-complement the inserted sequence, as the SPDI model uses only the positive strand. In addition, extra care must be taken when mapping an insertion variant with a zero-length deletion sequence, because alignments use base coordinates, not SPDI’s interbase coordinates. It is usually convenient to convert to an insert-before semantic (which does not alter the numbering system) when computing the mapping. However, because directionality changes when strand orientation changes, insert-before becomes insert-after. In order to return to the insert-before semantic, the position must be increased by one (Fig. 2A).

Fig. 2.


Examples of (A) reverse orientation and (B) indels in alignments special cases. Coordinates are 0-based interbase coordinates for two fictitious, congruent sequences. (A) For reverse orientation, boundary 85 of the chromosome corresponds to boundary 28 of the gene. But, nucleotide 85 is aligned to nucleotide 27. This must be incremented by one in order to adjust for the change in orientation. (B) In some cases, indels exist in one sequence, but not another. Remapping just the interval may not return any result, as it remaps into a gap. In this example, NC_1.1 and NC_1.2 refer to sequential versions of the same chromosome sequence, not two different chromosomes. See Supplementary Figure S2 for additional remapping examples
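The reverse-orientation adjustment in panel (A) can be sketched as follows. This is an illustrative simplification that assumes a single ungapped alignment block in reverse orientation; the function names and the `base_sum` parameter (the sum of the 0-based base coordinates of any aligned pair in the block, here 85 + 27) are assumptions made for the example and are not part of the NCBI services.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def project_reverse(position, deletion_length, insertion, base_sum):
    """Project an interbase SPDI interval across a reversed, ungapped block.

    In a reverse alignment, source base c pairs with target base
    base_sum - c. The deleted bases position .. position+deletion_length-1
    therefore land on base_sum-(position+deletion_length-1) .. base_sum-position,
    and the interbase start is the lowest of those target bases. For a
    zero-length deletion this reduces to base_sum - position + 1, which is
    the "increase by one" adjustment described for Fig. 2A.
    """
    new_position = base_sum - position - deletion_length + 1
    return new_position, deletion_length, reverse_complement(insertion)
```

With the numbers from the caption, `project_reverse(85, 0, "AC", 85 + 27)` gives boundary 28 with the insertion reverse-complemented to `GT`, whereas a single-base change at nucleotide 85 (`deletion_length=1`) maps to position 27 with no extra increment.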

In the second case, a disagreement between sequence models may represent an insertion and deletion (indel) variant that is absent in one sequence model and present in another (Fig. 2B). A variant at such a site is described as the reference on one sequence model and as an insertion, deletion or (when ambiguities exist) indel on the other. In general, remapping to other sequences can completely change the type of the variant, between any of the six variant types represented by dbSNP.

2.6 NCBI API variation services

We implemented public-access API services for the solution presented in this document (https://api.ncbi.nlm.nih.gov/variation/v0). These services will continue to be improved, e.g. by returning canonical SPDI, congruent contextual SPDI and HGVS based on a single request as HGVS.
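A request to the service can be composed as in the sketch below. The base URL and the `/hgvs/{expression}/contextuals` path pattern are taken from URLs quoted in this article; the helper names, the percent-encoding of the HGVS expression and the assumption that the response body is JSON are illustrative choices that should be checked against the service documentation before use.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://api.ncbi.nlm.nih.gov/variation/v0"


def contextuals_url(hgvs):
    """Build the URL for the Contextual Allele(s) of an HGVS expression."""
    # Percent-encode the whole expression so ':' , '[' and '>' are URL-safe.
    return f"{BASE_URL}/hgvs/{quote(hgvs, safe='')}/contextuals"


def fetch_contextuals(hgvs, timeout=30):
    """Call the service and decode the body (assumed here to be JSON)."""
    with urlopen(contextuals_url(hgvs), timeout=timeout) as response:
        return json.load(response)
```

The layout of the returned document is not described in this article, so the sketch returns the decoded object as is rather than picking out specific fields.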

3 Results

SPDI was developed to meet the NCBI’s need to represent variants consistently and accurately across resources including dbSNP and ClinVar and for broader use by the community. dbSNP and ClinVar have tested the SPDI data model and associated algorithms thoroughly and incorporated them into their respective workflows. In so doing, we have improved identification of observations submitted with different representations as the same allele and provided feedback to submitters.

3.1 ClinVar

A major function of ClinVar is the aggregation of data from diverse submitters so that each submitter, and the community at large, can determine whether there is a consensus in understanding the clinical significance of an allele (Landrum et al., 2018). That function requires robust normalization and aggregation of submissions that may define an allele by non-standard HGVS, by standard HGVS (e.g. right-justified, with duplication having precedence over insert), by VCF or by representation on current or previous versions of any of several reference sequences. From its inception, ClinVar converted all submissions to HGVS and normalized by determining the corresponding HGVS expression on the reference assembly, currently GRCh38. However, the HGVS conversion did not first standardize the HGVS, so that if one submitter reported an allele as a left-justified insert, another as a duplication and a third as a right-justified insert, aggregation of the three submissions required manual intervention. As SPDI was adopted into ClinVar’s data flow, these cases could be identified easily by establishing the representation of the Canonical Allele. Retrospective correction of alleles by application of the NCBI Variant Overprecision Correction Algorithm (VOCA) has resulted in 1400 alleles identified as duplicates and merged (Table 1). The majority (about 1200) are assessed as pathogenic, such as the well-studied 5/7/9 T alleles (Kiesewetter et al., 1993) in intron 9 of CFTR that may affect pathogenicity. ClinVar received diverse submitted variant descriptions (Table 2), all of which resolve to the same variant, NM_000492.3:c.1210-12T[9] [or NM_000492.3(CFTR):c.1210-7_1210-6dup, both of which are equally supported by the HGVS specification] (https://www.ncbi.nlm.nih.gov/clinvar/variation/161188/).

Table 1.

Summary of ClinVar allele merges after the adoption of the VOCA

| Variant type | Number of merged ClinVar alleles |
|---|---|
| Total | 1400 |
| Deletion | 761 (54.4%) |
| Insertion/duplication | 604 (43.1%) |
| Indel | 35 (2.5%) |

Table 2.

ClinVar submissions for the same allele

| Description | Submitted variant |
|---|---|
| Submissions | NM_000492.3:c.1210-6_1210-5insTT |
| | NM_000492.3:c.1210-7_1210-6dupTT |
| | NC_000007.13:g.117188689_117188690insTT |
| | NC_000007.14:g.117548635_117548636insTT |
| Normalized, canonical SPDI | NC_000007.14:117548628:TTTTTTT:TTTTTTTTT |
| Corresponding HGVS | NC_000007.14:g.117548629T[9] |
| | NM_000492.3:c.1210-12T[9] |
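The relationship between the canonical SPDI and its genomic HGVS repeat expression in Table 2 comes down to a coordinate shift: SPDI positions are 0-based interbase, while HGVS `g.` positions are 1-based. The sketch below illustrates only that relationship; the function names are hypothetical, and it assumes the SPDI is already in its expanded (VOCA-style) form so that the insertion spans the whole repeat tract.

```python
def repeat_unit(seq):
    """Return (unit, count) for the shortest unit that tiles seq exactly."""
    if not seq:
        raise ValueError("an empty sequence has no repeat unit")
    for size in range(1, len(seq) + 1):
        if len(seq) % size == 0 and seq == seq[:size] * (len(seq) // size):
            return seq[:size], len(seq) // size


def spdi_to_hgvs_repeat(sequence, position, insertion):
    """Format an expanded repeat SPDI as a genomic HGVS repeat expression."""
    unit, count = repeat_unit(insertion)
    # Interbase position P sits just before 1-based nucleotide P + 1.
    return f"{sequence}:g.{position + 1}{unit}[{count}]"
```

Applied to the canonical SPDI in Table 2, `spdi_to_hgvs_repeat("NC_000007.14", 117548628, "T" * 9)` produces `NC_000007.14:g.117548629T[9]`, matching the genomic HGVS listed in the table.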

3.2 dbSNP

Over the past few years, dbSNP has undergone a significant redesign of its data aggregation and product generation process. For over 15 years, an RDBMS-based solution had served dbSNP well. But with the desire to apply increasingly complex algorithms to identify identical variants and manage over 2 billion human variation submissions (ss IDs), a new system was designed and implemented. The new aggregation and reporting pipeline is based on a MapReduce framework built on the foundation of the SPDI data model and VOCA. A set of submitted variants (ss) are aggregated together to form a reference SNP (RefSNP) cluster if they meet two conditions when remapped to a common genomic sequence: on the same deletion interval and same type. dbSNP recognizes six types: identity (observed variant matches the reference sequence), SNV, multiple nucleotide variant, deletion, insertion and small indel.
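The two clustering conditions (same deletion interval, same type) can be sketched as a grouping key. The code below is an illustration and not the dbSNP pipeline: the function names, the toy accession and ss identifiers are hypothetical, and the type definitions (derived from the alleles after trimming shared flanks) are an assumption, since the article names the six types without specifying how they are computed.

```python
from collections import defaultdict


def classify(deleted, inserted):
    """Assign one of six variant types from literal deleted/inserted strings."""
    if deleted == inserted:
        return "identity"
    # Trim the shared prefix, then the shared suffix, to expose the net change.
    while deleted and inserted and deleted[0] == inserted[0]:
        deleted, inserted = deleted[1:], inserted[1:]
    while deleted and inserted and deleted[-1] == inserted[-1]:
        deleted, inserted = deleted[:-1], inserted[:-1]
    if not inserted:
        return "deletion"
    if not deleted:
        return "insertion"
    if len(deleted) == len(inserted):
        return "SNV" if len(deleted) == 1 else "MNV"
    return "indel"


def cluster(submissions):
    """Group (ss_id, sequence, position, deleted, inserted) records.

    Records are assumed to be already remapped to a common genomic sequence
    and corrected for overprecision, so equal intervals compare equal.
    """
    clusters = defaultdict(list)
    for ss_id, sequence, position, deleted, inserted in submissions:
        key = (sequence, position, len(deleted), classify(deleted, inserted))
        clusters[key].append(ss_id)
    return dict(clusters)
```

Under these assumptions, two deletions of different size within the same expanded homopolymer interval share a key and fall into one cluster, while an insertion on the same interval and an SNV elsewhere each form their own.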

All variants in repeat regions are now modeled as indels and corrected using VOCA, with the span of the deletion representing the maximum level of precision allowed. This has the particular effect of grouping alleles of varying deletion size in repeat regions into a single RefSNP cluster using VOCA results. When applied to dbSNP, 6 423 296 deletions and 2 341 877 insertion variants merged with already existing records (Table 3). All told, 8 945 252 variants merged into 4 493 144 extant RefSNPs, nearly all receiving just one RefSNP.

Table 3.

Summary of dbSNP allele merges after the adoption of the VOCA

| Variant type | Number of merged RefSNPs | Percent change (%) |
|---|---|---|
| All types | 8 945 252 | −1.35 |
| Deletion | 6 423 296 (71.8%) | −36.80 |
| Insertion/duplication | 2 341 877 (26.2%) | −42.24 |
| Indel | 72 587 (0.75%) | −0.18 |

Percent change reflects decrease of RefSNPs of that type in human dbSNP Build 152 compared to the previous one.

In the most extreme case, one extant RefSNP, rs55883101, received 42 merged RefSNPs (Fig. 3). In this example, a few of the merged RefSNPs were the result of collapsing identical alleles (see alleles marked as * and † in Fig. 3A). For most of the alleles, deletions of a variable number of A nucleotides were collapsed into a single RefSNP that now has 43 unique alleles. Notably, rs869211356 did not join this cluster, as its deletion allele begins with a ‘T’, firmly anchoring the subsequence that is removed. This deletion is precise and uncorrected by the VOCA. The same is true for each SNV in the region (red boxes).

Fig. 3.


(A) Example set of 42 RefSNPs from the previous dbSNP Build that (B) are now merged into one (rs55883101) in the new distributed dbSNP build that uses SPDI and VOCA, with a reduction in the number of RefSNPs annotated on the sequence. The SNP track coloring scheme has been updated between the two tracks, but in both cases, SNVs are red. In (A), deletions are blue with a downward triangle; in (B), purple with a downward triangle. * and † marked variants in (A) are two pairs of identical alleles.

4 Discussion

In their archival function, the dbSNP and ClinVar databases accept and store submissions that were ascertained by different projects using different sequencing technologies and variant calling pipelines. Hence the submitted variants can be asserted on different assembly and sequence versions, including older sequence versions from the past 20 years, and represented in different formats including the common HGVS and VCF formats. In addition, there are often redundant variants submitted across submitters and projects that need to be aggregated. The databases process the heterogeneous submitted variants and provide non-redundant variant annotation and reporting on the latest sequence version (RefSNP and RCV/VCV). This is critical to provide the most accurate view of the variant in sequence context and for efficient data exchange, integration and reporting. The processing requires projecting all submitted variants, whether defined on genomic or spliced sequences, to a common sequence space in order to generate aggregated clusters. This can be computationally challenging, particularly for dbSNP, which was using a pipeline based on SQL databases and BLAST technologies that were not efficient or scalable for processing billions of records. Therefore, we developed the SPDI data model and VOCA to provide a robust and consistent representation of sequence variants. The model supports pipelines for accurate and consistent processing and annotation of variants at NCBI, using a modern computing framework (Hadoop) and process control (Airflow) that scale and integrate with other NCBI resources. The processes based on SPDI include:

  • Validate submitted variant allele and position and convert from HGVS and asserted location (VCF) format to standard Contextual Alleles.

  • Map disparate submissions to NCBI standard top-level sequence on latest genomic assembly version using ADS and alleles corrected using VOCA (Canonical Allele).

  • Retrieve all congruent Contextual Alleles on mRNA, protein and genomic sequences.

  • Convert variant representations to standard VCF and other formats for export and data exchange.

We also made these functions open access as API calls for users to analyze and aggregate their variants across congruent sequences that are consistent with NCBI.

4.1 SPDI compared to HGVS

SPDI does not require that the submitter specify a class of variant in its notation; for SPDI, all variants are defined by simple operations of deletion and insertion, focusing on the variant sequence itself, without annotating a mechanism or effect. HGVS notation, however, supports multiple ways of describing the same simple allele, thus a decision must be made as to how a variant is named, according to a priority order (den Dunnen et al., 2016). These priorities are not always followed, so NCBI often receives submissions for inversions as deletion/insertions or duplications as insertions. Note this list of priorities does not include repeat regions, so there is even more variability within the community for describing alleles such as the polymorphism in intron 9 of the CFTR gene. The standard is to report as a repeat (http://varnomen.hgvs.org/recommendations/DNA/variant/repeated/), yet most submissions are received with HGVS for an insertion or a duplication (Table 2). In addition, HGVS requires variants to be right-shifted, but again, submissions are received in a variety of shifted states. With SPDI notation there is no interpretation of the type of allele and it is algorithmically unambiguous to generate a standard representation without maintaining complex parsing logic. SPDI currently supports a subset of variants that can be represented by HGVS nomenclature (see SPDI limitations below). NCBI API Variation Services is compared with other public tools (see Supplementary Table S2).

4.2 Design trade-offs of the SPDI data model and associated NCBI variation services

While SPDI provides a powerful representation of precise sequence changes relative to a reference, we purposefully limited the scope to trade-off computational complexity and cost against additional features.

  • SPDI does not support reporting a position offset from the reference such as used in HGVS expressions c.88+2T>G or c.89-1G>T. Instead, SPDI enforces the use of a reference sequence that includes the variant, in order to reduce computational complexity and the ambiguity introduced by sequence-feature context, such as CDS annotations, that can change.

  • Variants without precise breakpoints (such as large structural variants detected by paired-end mapping) cannot be specified in the model.

  • NCBI variation services partially support variants with a protein reference. While the services do compute a Contextual Allele, they are not remapped as we have no support for codon degeneracy (https://api.ncbi.nlm.nih.gov/variation/v0/hgvs/NP_001161461.1:p.Gly113_Arg119del/contextuals).

  • NCBI variation services support variants reported only on NCBI reference sequence [RefSeq (O’Leary et al., 2016)] which may not represent all common and alternate haplotypes.

4.3 Using the Contextual and Canonical Allele

The Canonical Allele representations are essential for determining if two variants represented on different reference sequences refer to the same genomic change. Yet this determination is an interpretation based on the preferred reference sequence and available alignments, and is subject to change and refinement over time. As we mentioned in the Materials and methods section, the VOCA-corrected Contextual Allele on a particular sequence accession and version is stable over time in dbSNP and ClinVar. It serves as the archival summary evidence for the variant observed. In contrast, the Canonical Allele can change over time because it is a function of the supplied Contextual Allele and the ADS, which changes with new sequence versions or alignment algorithm improvements. Therefore, we propose that whenever the Canonical Allele is used, the Contextual Allele should also be made available to allow users to see the supporting observed variant as well as the derived interpretation. These rules would be important for medical records or other specialized variant resources, such as disease or locus-specific variant databases, to apply so that their data can be easily accessed, interpreted and shared.

5 Conclusion

The SPDI data model, VOCA and associated variation services were developed to address the challenges of processing, annotating and exchanging the growing volume of variation data in dbSNP and ClinVar databases. This work has resulted in improving the identification and validation of submissions and standardizing their representation. The APIs are provided as a public service that could benefit the community by providing a standard variant representation based on the SPDI model, normalization by VOCA and computation of an aggregable Canonical Allele that is consistent with the usage by dbSNP and ClinVar.

Supplementary Material

btz856_Supplementary_Data

Acknowledgements

We would like to thank the GA4GH GKS-VR members, Sarah Hunt, Raymond Dalgleish and Peter Causey-Freeman for their helpful discussions and feedback.

Funding

This work was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.

Conflict of Interest: none declared.

References

  1. Altschul S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
  2. Assmus J. et al. (2013) Equivalent indels–ambiguous functional classes and redundancy in databases. PLoS One, 8, e62803.
  3. Carter T.C., He M.M. (2016) Challenges of identifying clinically actionable genetic variants for precision medicine. J. Healthc. Eng., 2016, 1.
  4. Danecek P. et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
  5. Deans Z.C. et al. (2016) HGVS nomenclature in practice: an example from the United Kingdom National External Quality Assessment Scheme. Hum. Mutat., 37, 576–578.
  6. den Dunnen J.T. et al. (2016) HGVS recommendations for the description of sequence variants: 2016 update. Hum. Mutat., 37, 564–569.
  7. den Dunnen J.T. (2016) Sequence variant descriptions: HGVS nomenclature and mutalyzer. Curr. Protoc. Hum. Genet., 90, 7.13.1–7.13.19.
  8. Freeman P.J. et al. (2018) VariantValidator: accurate validation, mapping, and formatting of sequence variation descriptions. Hum. Mutat., 39, 61–68.
  9. Kanterakis A. et al. (2018) A review of tools to automatically infer chromosomal positions from dbSNP and HGVS genetic variants. In: Lambert, C.G. et al. (eds) Human Genome Informatics. Elsevier, Amsterdam, pp. 133–156.
  10. Kapustin Y. et al. (2008) Splign: algorithms for computing spliced alignments with identification of paralogs. Biol. Direct, 3, 20.
  11. Kiesewetter S. et al. (1993) A mutation in CFTR produces different phenotypes depending on chromosomal background. Nat. Genet., 5, 274–278.
  12. Landrum M.J. et al. (2018) ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res., 46, D1062–D1067.
  13. Li Z. et al. (2014) Vindel: a simple pipeline for checking indel redundancy. BMC Bioinformatics, 15, 359.
  14. McLaren W. et al. (2016) The Ensembl variant effect predictor. Genome Biol., 17, 122.
  15. O’Leary N.A. et al. (2016) Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res., 44, D733–D745.
  16. Pawliczek P. et al. (2018) ClinGen Allele Registry links information about genetic variants. Hum. Mutat., 39, 1690–1701.
  17. Sherry S. et al. (2001) dbSNP: The NCBI Database of Genetic Variation. Nucleic Acids Res., 29, 308–311.
  18. Tan A. et al. (2015) Unified representation of genetic variants. Bioinformatics, 31, 2202–2204.
  19. Wang M. et al. (2018) hgvs: a Python package for manipulating sequence variants using HGVS nomenclature: 2018 update. Hum. Mutat., 39, 1803–1813.
  20. Yen J.L. et al. (2017) A variant by any name: quantifying annotation discordance across tools and clinical databases. Genome Med., 9, 7.
  21. Zhou W. et al. (2015) TransVar: a multilevel variant annotator for precision genomics. Nat. Methods, 12, 1002–1003.
