Abstract
Locus Reference Genomic (LRG; http://www.lrg-sequence.org/) records contain internationally recognized stable reference sequences designed specifically for reporting clinically relevant sequence variants. Each LRG is contained within a single file consisting of a stable ‘fixed’ section and a regularly updated ‘updatable’ section. The fixed section contains stable genomic DNA sequence for a genomic region, essential transcripts and proteins for variant reporting and an exon numbering system. The updatable section contains mapping information, annotation of all transcripts and overlapping genes in the region and legacy exon and amino acid numbering systems. LRGs provide a stable framework that is vital for reporting variants, according to Human Genome Variation Society (HGVS) conventions, in genomic DNA, transcript or protein coordinates. To enable translation of information between LRG and genomic coordinates, LRGs include mapping to the human genome assembly. LRGs are compiled and maintained by the National Center for Biotechnology Information (NCBI) and European Bioinformatics Institute (EBI). LRG reference sequences are selected in collaboration with the diagnostic and research communities, locus-specific database curators and mutation consortia. Currently >700 LRGs have been created, of which >400 are publicly available. The aim is to create an LRG for every locus with clinical implications.
INTRODUCTION
Accurate and unambiguous annotation of disease-causing variants is essential. Central to this is the reference DNA sequence with respect to which a variant is reported. Previously, the lack of accepted and stable reference sequences for variant reporting resulted in the use of different sequences and inconsistency of variant reporting over time. The Locus Reference Genomic (LRG) project addresses these problems (1). Specifically designed for the reporting of diagnostically relevant variants, an LRG provides a stable reference sequence record for a particular genomic locus: genomic DNA, transcript and protein sequences are all included in one record, thereby providing a concise ‘one-stop’ record for variant reporting in all coordinates. The sequences defined by an LRG accession will never change. Mapping of the LRG to current and previous genome assemblies is included in the LRG record, thus overcoming difficulties for the user associated with updates to the genome assembly.
The LRG project is a joint National Center for Biotechnology Information (NCBI)/European Bioinformatics Institute (EBI) project, and responsibility for the creation, publication and maintenance of each LRG is shared (Figure 1). This enables integration of data from both NCBI and EBI within the LRG record. All LRG records are available from the LRG website (http://www.lrg-sequence.org/) and are viewable from multiple sites at the NCBI (http://www.ncbi.nlm.nih.gov/refseq/rsg/lrg/) and Ensembl (http://www.ensembl.org/) genome browsers (2).
Adoption of LRGs as the universal reference sequences for variant reporting will facilitate data exchange, make variant reports easier to compare and eliminate errors previously caused by imprecise use of reference sequences. This will enable increased submission of disease-causing variants to public databases, improving the accuracy of information available for use in clinical diagnosis, treatment decisions and research.
CREATION OF LRGS
Community collaboration
Each LRG is created in collaboration with the community, defined as the research and diagnostic laboratories, locus specific database (LSDB) curators and mutation consortia with expertise in that locus. Working with the community ensures that each LRG record is ideally suited to the demands of variant reporting at a specific locus.
LRG records are created in response to requests from members of the community. Guidelines for making requests are available at http://www.lrg-sequence.org/lrg-request. Once a request is received, a pending LRG is created using the reference sequences specified by the original requester (Figure 1). Each pending LRG record is then reviewed by members of the community for that specific locus. Requesters are asked to provide information on any other members of the community who should be contacted, while additional collaborators are identified from GeneTests (http://genetests.org/), NIH’s Genetic Testing Registry (GTR) (http://www.ncbi.nlm.nih.gov/gtr) and the Leiden University Medical Center LSDB list (http://www.lovd.nl/LSDBs), where available. Additional information on the identification and selection of collaborators is provided on the LRG website (http://www.lrg-sequence.org/lrg-collaborators). Collaborators are asked to review the pending LRG record, provide information on any alternative reference sequences they currently use and are advised on how to switch to reporting variants using LRGs. During the creation process, additional transcripts can be added to the LRG record at the recommendation of a member of the community. Any collaborators with significant involvement in the selection of reference sequences for inclusion in the LRG are asked if they would like to be added as an additional requester. Ultimately, LRGs are created to meet the needs of the community, with final advice and authority resting with them. LRGs can be requested by e-mailing request@lrg-sequence.org.
LRG fixed section
LRGs are divided into two main sections: ‘fixed’ and ‘updatable’ (Figure 2). The fixed section contains the core information for the LRG (Figure 2), including requester information, the LRG reference sequences, a unique identifier in the format ‘LRG_[number]’, the HUGO Gene Nomenclature Committee (HGNC) (3) gene identifier and LRG-specific exon numbering.
The LRG reference sequences are composed of a genomic sequence, transcripts required for variant reporting and any corresponding proteins. Sequences submitted for inclusion in the LRG record may be based on RefSeqGene or RefSeq records, or alternative nucleotide sequences in FASTA, GenBank or European Nucleotide Archive format. Most LRG genomic sequences extend 5 kb upstream of the first exon and 2 kb downstream of the last exon, or to the extent necessary to cover all relevant components (i.e. promoters or other regulatory elements). The genomic, transcript and protein sequences of the LRG exactly match those of the corresponding versions of RefSeqGene and RefSeq (4) records (excluding the poly-A tail, which is removed from all LRG transcripts). If RefSeqGene or RefSeq (4) records do not exist to match the sequences requested, the RefSeq curators create such records. Therefore, variant coordinates can easily be translated between these and LRGs. Currently, all LRGs created have been for genic loci; however, they can be created for any locus with clinical implications.
Transcripts for inclusion in the fixed section are initially suggested by the original requester or collaborators. The LRG curators then check NCBI and Ensembl for the most up-to-date biological information for this locus. For example, they carry out alignments to ensure that all coding sequence with evidence of expression is represented in the transcripts proposed. If there are transcripts with additional sequence and evidence of expression, the curators discuss with the requesters whether these should also be included. As the project aims to reduce ambiguity, only transcripts that are well characterized and are deemed by the community to be essential for variant reporting are included. In practice this means that the majority of LRGs have only one transcript. Limiting the transcripts to those deemed essential by community experts provides guidance to other users with regard to the transcripts that should be used for variant reporting. In rare cases, the collaborators have requested the use of an idealized transcript (for example, one containing all exons of the gene) as the reporting standard, even though the existence of such a transcript is not supported by biological evidence. Examples of such cases are described in the Supplementary Data. In such cases, the proposed sequences are also reviewed by RefSeq curators and a corresponding RefSeq transcript is created.
Each LRG has a fixed LRG-specific exon numbering system based on the transcript(s) included in the fixed section. Each distinct exon is numbered consecutively 5′–3′, and then the numbering is applied to individual transcripts (Figure 3). Any transcripts added after the LRG is made public will be assigned exon numbers in collaboration with the community, but will not change the existing LRG-specific exon numbering (Figure 3).
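The numbering principle described above can be sketched in a few lines of Python. This is only an illustration, assuming that exons are supplied as (start, end) pairs in LRG genomic coordinates and that the LRG genomic sequence is oriented so that ascending coordinates run 5′–3′. The function name, the input format and the toy coordinates are hypothetical and are not taken from any LRG record. The sketch treats any exon with distinct coordinates as a distinct exon and does not cover the rule that transcripts added after release must leave existing numbers unchanged.

```python
def number_exons(transcripts):
    """Assign one shared number to each distinct exon, ordered 5'->3'.

    transcripts: dict mapping a transcript name to a list of (start, end)
    tuples in LRG genomic coordinates (hypothetical input format).
    Returns a dict mapping each transcript name to its list of exon numbers.
    """
    # Collect the distinct exons across all transcripts and sort them 5'->3'.
    distinct = sorted({exon for exons in transcripts.values() for exon in exons})
    # Number each distinct exon consecutively, starting from 1.
    numbering = {exon: i for i, exon in enumerate(distinct, start=1)}
    # Apply the shared numbering to each individual transcript.
    return {
        name: [numbering[exon] for exon in sorted(exons)]
        for name, exons in transcripts.items()
    }


# Toy data: two transcripts sharing their first and last exons.
toy = {
    "t1": [(5001, 5200), (6000, 6150), (8000, 8300)],
    "t2": [(5001, 5200), (7000, 7100), (8000, 8300)],
}
labels = number_exons(toy)
```

With this toy input, transcript t1 carries exon numbers 1, 2 and 4, and t2 carries 1, 3 and 4, so a given exon number always refers to the same stretch of sequence regardless of transcript.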
LRG updatable section
The updatable section contains the mapping of the LRG to the most recent genome build and the most advanced and up-to-date biological knowledge for the LRG region from Ensembl and NCBI. To ensure that each LRG record is kept current, the curators update all existing LRGs twice a year and when a new genome build is released.
The LRG is aligned to the reference genome to determine its mapping coordinates. Each area of contiguous alignment is described separately, along with any differences between the LRG and the reference.
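To illustrate how such a mapping supports translation between LRG and genomic coordinates, a minimal Python sketch follows. It assumes that each area of contiguous alignment can be reduced to matching start and end coordinates plus a strand. The `Span` type, the function name and the toy numbers are all hypothetical and do not reflect the actual LRG XML schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """One gap-free alignment block (hypothetical representation)."""
    lrg_start: int     # first LRG position in the block (1-based)
    lrg_end: int       # last LRG position in the block (inclusive)
    genome_start: int  # lowest assembly position in the block
    genome_end: int    # highest assembly position in the block
    strand: int        # +1 if the LRG runs with the assembly, -1 if reversed


def lrg_to_genome(pos: int, spans: List[Span]) -> Optional[int]:
    """Translate a 1-based LRG position to an assembly position.

    Returns None when the position lies in no aligned block, for example
    inside sequence that is absent from the reference assembly.
    """
    for span in spans:
        if span.lrg_start <= pos <= span.lrg_end:
            offset = pos - span.lrg_start
            if span.strand == 1:
                return span.genome_start + offset
            # On the reverse strand, LRG positions count down the assembly.
            return span.genome_end - offset
    return None


# Toy mapping: two forward-strand blocks, with LRG positions 1001-1002
# left unaligned to mimic a difference from the reference.
forward = [
    Span(1, 1000, 50001, 51000, 1),
    Span(1003, 2000, 51001, 51998, 1),
]
# Toy mapping: a single block for an LRG on the reverse strand.
reverse = [Span(1, 100, 900, 999, -1)]
```

Here `lrg_to_genome(1003, forward)` returns 51001, `lrg_to_genome(1001, forward)` returns `None`, and `lrg_to_genome(1, reverse)` returns 999.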
Biological knowledge from NCBI and Ensembl for the LRG region is included in separate sections. Both detail the official names and synonyms of genes, transcripts and proteins contained within the LRG region. Mapping of these sequences in LRG coordinates and any mapping discrepancies are included. Furthermore, there are links from the sequences to the NCBI or Ensembl browsers where additional information, such as expression patterns or biological evidence for each sequence, can be found. The transcript(s) included in the fixed section are also listed in the updatable section, but marked with a note to distinguish them from the other transcripts included in the record.
The final section is for additional data on the LRG locus, namely, a link to the LSDB list (http://www.lovd.nl/LSDBs) for the locus hosted at Leiden University Medical Center. Alternate or legacy numbering systems (exon and amino acid), determined prior to the existence of the LRG and widely used by the community, may also be included in this part of the updatable section to enable easy comparison between different numbering systems.
NCBI BIOLOGICAL KNOWLEDGE
Annotation provided by NCBI in the updatable section includes RefSeq-related data (accessions, stable sequence identifiers [GIs]), other database identifiers [Consensus CDS identifiers (CCDS) for coding regions (5,6), MIM numbers and GeneIDs] and alternate names and symbols. Any genes partially or completely overlapping the LRG are annotated.
ENSEMBL BIOLOGICAL KNOWLEDGE
The Ensembl data include protein-coding genes from the Ensembl transcript annotation process (7). All transcripts are based on mRNA and proteins in public scientific databases, and also include the CCDS transcripts. The Ensembl gene set also includes automatically-annotated pseudogenes and non-coding RNAs.
LRG release
Once the review process is complete, a pending LRG is made public and requesters are notified. From this point, sequences included in the fixed section will not change. Additional transcripts can be added to the fixed section of a public LRG in the future should this be necessary for reporting variants at the locus. If this should happen, the original information in this section, e.g. existing exon numbering, will remain unchanged. LRGs are established to provide a consistent reporting framework. Because of this policy, should an error in an LRG sequence be identified after release, the sequence in the LRG will not be changed. Instead, a second LRG for the locus will be created if the community so requests, with an independent accession number, and the updatable layer of the problematic LRG will be annotated to inform users of the discrepancy.
Reporting variants on LRGs
All sequences in the LRG fixed section can be used for stable reporting of variants using their stable identifiers. In accordance with HGVS conventions, variant reporting using LRGs as a reference standard is possible in genomic DNA (e.g. LRG_1:g.8463G>C), mRNA (e.g. LRG_1t1:c.572G>C), non-coding RNA (e.g. LRG_163t1:n.5C>T) or protein (e.g. LRG_1p1:p.Gly191Ala) coordinates. Should the need arise in the future to create multiple LRGs for the same locus, these will have different accession numbers, rather than versions of the same accession number. This eliminates the ambiguity caused by versioning in variant reporting. LRG sequences have been endorsed by HGVS (http://www.hgvs.org/mutnomen/) and in the European Molecular Genetics Quality Network best practice guidelines (8). LRGs are also recommended as the reference sequence of choice for LSDBs (9,10). Conversion of existing variant data into LRG coordinates is facilitated using the NCBI Genome Remapping Service (http://www.ncbi.nlm.nih.gov/genome/tools/remap/, Clinical Remap tab). In short, locations or HGVS expressions submitted on a selected genomic assembly or RefSeqGene will be converted to locations on an LRG if one is publicly available for that region.
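The structure of the four example descriptions above can be made explicit with a small Python sketch. It is not a full HGVS parser: it only splits an LRG-based description into its reference and change parts and checks that the coordinate type (g, c, n or p) is consistent with the transcript or protein suffix, as in the examples shown. The function and type names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# LRG accession, optional transcript (t) or protein (p) suffix,
# coordinate type and the description of the change itself.
_PATTERN = re.compile(r"^(LRG_\d+)(?:([tp])(\d+))?:([gcnp])\.(\S+)$")

# Coordinate types permitted for each kind of LRG sequence:
# no suffix = genomic, t = transcript, p = protein.
_ALLOWED = {None: {"g"}, "t": {"c", "n"}, "p": {"p"}}


class LrgVariant(NamedTuple):
    lrg_id: str             # e.g. "LRG_1"
    feature: Optional[str]  # e.g. "t1" or "p1"; None for genomic DNA
    coord_type: str         # "g", "c", "n" or "p"
    change: str             # e.g. "572G>C"


def parse_lrg_variant(text: str) -> LrgVariant:
    """Split an LRG-based variant description into its components."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an LRG variant description: {text!r}")
    lrg_id, kind, index, coord_type, change = match.groups()
    if coord_type not in _ALLOWED[kind]:
        raise ValueError(f"coordinate type {coord_type!r} does not fit {text!r}")
    feature = f"{kind}{index}" if kind else None
    return LrgVariant(lrg_id, feature, coord_type, change)
```

For example, `parse_lrg_variant("LRG_1t1:c.572G>C")` yields the accession `LRG_1`, the feature `t1`, the coordinate type `c` and the change `572G>C`, whereas a mismatched description such as `LRG_1t1:p.Gly191Ala` raises a `ValueError`.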
Submission of variants
Variants can be submitted to dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) and ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) via the LRG submission process. This process was ratified at the 2010 Human Variome Project Meeting (11) and is described on the LRG website. A submission template was designed to capture the essential elements to describe genotypic variation and its impact on phenotype and disease. Submissions are loaded to ClinVar. If the variant is not yet represented in dbSNP or dbVar/DGVa (12), then ClinVar reports the variant to the appropriate database at NCBI for assignment of identifiers in those databases. These data are then reported to the submitter and are made publicly available.
Availability of data
LRG website
Each LRG exists as a single file in extensible markup language (XML) format (1). All LRG records (public and pending) are viewable and downloadable on the LRG website (http://www.lrg-sequence.org/), along with all LRG reference sequences in FASTA format and schema documentation (ftp://ftp.ebi.ac.uk/pub/databases/lrgex/docs/LRG_XML_schema_documentation_1_8.pdf). A search box on the LRG home page allows searching by LRG identifier, HGNC symbol, NCBI and Ensembl accession numbers, gene synonym or LRG status. A unique web page for each LRG renders the content of the corresponding XML file into a user-friendly display.
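Because each record is a single XML file, its fixed section can be read with any standard XML parser. The Python sketch below shows the idea on a toy record embedded as a string. The element and attribute names used here (`lrg`, `fixed_annotation`, `id`, `sequence`, `transcript`, `name`, `updatable_annotation`) are assumptions made for illustration and should be checked against the schema documentation; the identifier `LRG_0` and the sequence are placeholders, not real data.

```python
import xml.etree.ElementTree as ET

# Toy record: element names are assumed, not copied from the LRG schema.
TOY_RECORD = """\
<lrg>
  <fixed_annotation>
    <id>LRG_0</id>
    <sequence>ACGTACGTAC</sequence>
    <transcript name="t1"/>
  </fixed_annotation>
  <updatable_annotation/>
</lrg>
"""


def summarise_fixed_section(xml_text):
    """Return the identifier, genomic length and transcript names."""
    root = ET.fromstring(xml_text)
    fixed = root.find("fixed_annotation")
    if fixed is None:
        raise ValueError("no fixed section found")
    return {
        "id": fixed.findtext("id"),
        "genomic_length": len(fixed.findtext("sequence", default="")),
        "transcripts": [t.get("name") for t in fixed.findall("transcript")],
    }


summary = summarise_fixed_section(TOY_RECORD)
```

For a file downloaded from the LRG website, the same function could be given the file contents read as text, provided the element names match the schema version in use.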
The LRG view on the website includes additional features not available in XML format. For public LRGs, links to Ensembl’s summary LRG page (http://www.ensembl.org/Homo_sapiens/LRG/Summary?lrg=LRG_5) and LRG variation table (http://www.ensembl.org/Homo_sapiens/LRG/Variation_LRG/Table?lrg=LRG_5) are included in the Ensembl annotation section.
LRG web services
The LRG web services allow programmatic access to the data for one or more LRGs (see full details on the LRG website: http://www.lrg-sequence.org/web-service). These are based on the web service part of the EB-eye (13) search engine (http://www.ebi.ac.uk/Tools/webservices/services/eb-eye) and use the XML-RPC (http://xmlrpc.scripting.com/default.html) protocol. It is possible, for example, to retrieve the LRG identifiers for a list of HGNC gene symbols, or the genomic sequence for a given LRG. An authentication key is required to access the web service, which can be freely requested from help@lrg-sequence.org.
LRG support
LRGs can be viewed in the Ensembl and NCBI genome browsers, usually within 2 months of the LRG being made public (Supplementary Table S1). LRGs are also supported by external software: the Variant Effect Predictor (14) for variation analysis, e.g. from exome sequencing; Mutalyzer (15) for checking sequence variant nomenclature; Alamut (Interactive Biosoftware: http://www.interactive-biosoftware.com/) for variation interpretation; Variobox (16) for annotation, analysis and comparison of human genes; and the Leiden Open Variation Database (17) DNA variation database system.
Case studies
Detailed manual curation and extensive collaboration with experts enable the creation of custom-made records that are optimal for reporting variants. Currently >700 records have been created, including some with unique features and for genes with known complexity. Several case studies are described in the Supplementary Data to illustrate how the LRG project has addressed these challenging regions.
Further information
Communication is an essential part of the LRG project. To facilitate this the LRG website includes a ‘News’ page (http://www.lrg-sequence.org/news) for important developments or schema changes. News items can be received via e-mail by subscribing to the mailing list (e-mail contact@lrg-sequence.org) or via the RSS feed, which publicises changes to the status of LRGs. The LRG website also contains background information on the project and the complete LRG specification, along with instructions on how to request an LRG and submit variants.
Future developments
The long-term goal of the LRG project is to create an LRG record for every clinically relevant locus. Creation of LRGs is being prioritized according to demand, with LRGs directly requested by the diagnostic community taking precedence. Collaborations are underway to create LRGs for the genes involved in inherited bleeding and platelet disorders and for the genes included in the United Kingdom National External Quality Assessment Service scheme. LRGs are also being created for genes organized by type of disease as included in commercial diagnostic testing panels, e.g. those available from Cegat (http://www.cegat.de/), GeneDx (http://www.genedx.com/) and the Illumina TruSIGHT panels.
Whenever a new genome assembly is available, and the NCBI and Ensembl databases have been updated, the mapping information and annotation of all LRGs will be updated to the new assembly. Mapping of the LRG genomic sequence to both the current and penultimate assembly will be included. HGVS-compliant variant descriptions based on the LRG will be added to ClinVar to support searching and reporting.
In the future, access to LRG data will also be available through a REST API. Development of this will enable retrieval of LRG data through specific URLs without the need for an authentication key.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR online, including [18–25].
FUNDING
The Wellcome Trust [WT095908]; British Heart Foundation [SP/10/10/28431]; European Molecular Biology Laboratory; European Community’s Seventh Framework Programme [FP7/2007-2013] under grant agreement number 200754 (the GEN2PHEN project). Work at NCBI is supported by the National Institutes of Health Intramural Research Program and the National Library of Medicine. Funding for open access charge: The Wellcome Trust.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENT
The authors would like to thank the requesters and collaborators of all LRGs. In particular, James Ware and Roddy Walsh for genes involved in inherited cardiac conditions; Nazneen Rahman and Shazia Mahamdallie for cancer predisposition genes; Jean-Christophe Bourdon, Thierry Soussi and Magali Olivier for TP53; and Johan den Dunnen for NEB. They also thank members of the GEN2PHEN Consortium and Ewan Birney for providing valuable inputs throughout the LRG project.
REFERENCES
- 1. Dalgleish R, Flicek P, Cunningham F, Astashyn A, Tully RE, Proctor G, Chen Y, McLaren WM, Larsson P, Vaughan BW, et al. Locus Reference Genomic sequences: an improved basis for describing human DNA variants. Genome Med. 2010;2:24. doi: 10.1186/gm145.
- 2. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, et al. Ensembl 2013. Nucleic Acids Res. 2013;41:D48–D55. doi: 10.1093/nar/gks1236.
- 3. Gray KA, Daugherty LC, Gordon SM, Seal RL, Wright MW, Bruford EA. Genenames.org: the HGNC resources in 2013. Nucleic Acids Res. 2013;41:D545–D552. doi: 10.1093/nar/gks1066.
- 4. Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012;40:D130–D135. doi: 10.1093/nar/gkr1079.
- 5. Harte RA, Farrell CM, Loveland JE, Suner MM, Wilming L, Aken B, Barrell D, Frankish A, Wallin C, Searle S, et al. Tracking and coordinating an international curation effort for the CCDS Project. Database (Oxford). 2012:bas008. doi: 10.1093/database/bas008.
- 6. Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, et al. The consensus coding sequence (CCDS) project: identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009;19:1316–1323. doi: 10.1101/gr.080531.108.
- 7. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22:1760–1774. doi: 10.1101/gr.135350.111.
- 8. Van Dijk FS, Byers PH, Dalgleish R, Malfait F, Maugeri A, Rohrbach M, Symoens S, Sistermans EA, Pals G. EMQN best practice guidelines for the laboratory diagnosis of osteogenesis imperfecta. Eur. J. Hum. Genet. 2012;20:11–19. doi: 10.1038/ejhg.2011.141.
- 9. Celli J, Dalgleish R, Vihinen M, Taschner PEM, den Dunnen JT. Curating gene variant databases (LSDBs): toward a universal standard. Hum. Mutat. 2012;33:291–297. doi: 10.1002/humu.21626.
- 10. Vihinen M, den Dunnen JT, Dalgleish R, Cotton RG. Guidelines for establishing locus specific databases. Hum. Mutat. 2012;33:298–305. doi: 10.1002/humu.21646.
- 11. Kohonen-Corish MR, Al-Aama JY, Auerbach AD, Axton M, Barash CI, Bernstein I, Béroud C, Burn J, Cunningham F, Cutting GR, et al. How to catch all those mutations–the report of the third Human Variome Project Meeting, UNESCO Paris, May 2010. Hum. Mutat. 2010;31:1374–1381. doi: 10.1002/humu.21379.
- 12. Lappalainen I, Lopez J, Skipper L, Hefferon T, Spalding JD, Garner J, Chen C, Maguire M, Corbett M, Zhou G, et al. DbVar and DGVa: public archives for genomic structural variation. Nucleic Acids Res. 2013;41:D936–D941. doi: 10.1093/nar/gks1213.
- 13. Valentin F, Squizzato S, Goujon M, McWilliam H, Paern J, Lopez R. Fast and efficient searching of biological data resources–using EB-eye. Brief. Bioinform. 2010;11:375–384. doi: 10.1093/bib/bbp065.
- 14. McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics. 2010;26:2069–2070. doi: 10.1093/bioinformatics/btq330.
- 15. Wildeman M, van Ophuizen E, den Dunnen JT, Taschner PEM. Improving sequence variant descriptions in mutation databases and literature using the Mutalyzer sequence variation nomenclature checker. Hum. Mutat. 2008;29:6–13. doi: 10.1002/humu.20654.
- 16. Gaspar P, Lopes P, Oliveira J, Santos R, Dalgleish R, Oliveira JL. Variobox: automatic detection and annotation of human gene variants. Hum. Mutat. 2013. doi: 10.1002/humu.22474.
- 17. Fokkema IFAC, Taschner PEM, Schaafsma GCP, Celli J, Laros JFJ, den Dunnen JT. LOVD v.2.0: the next generation in gene variant databases. Hum. Mutat. 2011;32:557–563. doi: 10.1002/humu.21438.
- 18. Herman DS, Lam L, Taylor MR, Wang L, Teekakirikul P, Christodoulou D, Conner L, DePalma SR, McDonough B, Sparks E, et al. Truncations of titin causing dilated cardiomyopathy. N. Engl. J. Med. 2012;366:619–628. doi: 10.1056/NEJMoa1110186.
- 19. Hackman P, Vihola A, Haravuori H, Marchand S, Sarparanta J, De Seze J, Labeit S, Witt C, Peltonen L, Richard I, et al. Tibial muscular dystrophy is a titinopathy caused by mutations in TTN, the gene encoding the giant skeletal-muscle protein titin. Am. J. Hum. Genet. 2002;71:492–500. doi: 10.1086/342380.
- 20. Daba A, Koromilas AE, Pantopoulos K. Alternative ferritin mRNA translation via internal initiation. RNA. 2012;18:547–556. doi: 10.1261/rna.029322.111.
- 21. Grover R, Candeias MM, Fåhraeus R, Das S. p53 and little brother p53/47: linking IRES activities with protein functions. Oncogene. 2009;28:2766–2772. doi: 10.1038/onc.2009.138.
- 22. Ossipow V, Descombes P, Schibler U. CCAAT/enhancer-binding protein mRNA is translated into multiple proteins with different transcription activation potentials. Proc. Natl Acad. Sci. USA. 1993;90:8219–8223. doi: 10.1073/pnas.90.17.8219.
- 23. Ware JS, Walsh R, Cunningham F, Birney E, Cook SA. Paralogous annotation of disease-causing variants in long QT syndrome genes. Hum. Mutat. 2012;33:1188–1191. doi: 10.1002/humu.22114.
- 24. Grist SA, Dubowsky A, Suthers G. Evaluating DNA sequence variants of unknown biological significance. Clinical Bioinformatics (Methods in Molecular Medicine) 2008;141:199–217. doi: 10.1007/978-1-60327-148-6_11.
- 25. Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, et al. Ensembl 2011. Nucleic Acids Res. 2011;39:D800–D806. doi: 10.1093/nar/gkq1064.