Version Changes
Revised. Amendments from Version 1
This revision addresses comments raised by reviewers. The most significant changes are 1) the addition of a Limitations section, 2) a comparison to the limma package's alias2Symbol and alias2SymbolTable functions, and 3) improved readability of the figure.
Abstract
Gene symbols are recognizable identifiers for gene names but are unstable and error-prone due to aliasing, manual entry, and unintentional conversion to date format by spreadsheets. Official gene symbol resources such as the HUGO Gene Nomenclature Committee (HGNC) for human genes and the Mouse Genome Informatics project (MGI) for mouse genes provide authoritative sources of valid, aliased, and outdated symbols, but lack a programmatic interface and do not correct symbols converted by spreadsheets. We present HGNChelper, an R package that identifies known aliases and outdated gene symbols based on the HGNC human and MGI mouse gene symbol databases, in addition to common mislabeling introduced by spreadsheets, and provides corrections where possible. HGNChelper identified invalid gene symbols in the most recent Molecular Signatures Database (MSigDB 7.0) and in platform annotation files of the Gene Expression Omnibus, with prevalence ranging from ~3% in recent platforms to 30–40% in the earliest platforms from 2002–03. HGNChelper is installable from CRAN.
Keywords: gene symbols, molecular biology, HGNC, MGI
Introduction
Gene symbols are widely used in biomedical research because they provide descriptive and memorable nomenclature for communication. However, gene symbols are constantly updated through the discovery and re-identification of genes, resulting in new names or aliases. For example, GCN5L2 (General Control of amino acid synthesis protein 5-Like 2) is a gene symbol that was later discovered to function as a histone acetyltransferase and was therefore renamed KAT2A (K(lysine) Acetyl Transferase 2A) 1. In addition to the rapid and constant updates to valid gene symbols, commonly used spreadsheet software, such as Microsoft Excel, modifies some gene symbols, converting them into dates or floating-point numbers 2, 3. For example, ‘DEC1’, the symbol for the ‘Deletion in Esophageal Cancer 1’ gene, can be exported in date format as ‘1-DEC’.
There have been attempts to rectify gene symbol issues, but they have largely been limited to Excel-modified gene symbols. Also, the suggested solutions often reference static files with corrections curated at the time of publication 3, or comprise scripts that detect Excel-modified gene symbols without correcting them 2. In recognition of the importance of the spreadsheet modification issue, HGNC offers its own symbol correction tool, the Multi-symbol checker, and also recently announced that all symbols that auto-convert to dates in Excel have been changed 4. However, much of the literature and public data still contain outdated and incorrect gene symbols, motivating a convenient method of systematic detection and correction.
To systematically identify historical aliases, correct for capitalization differences, and simultaneously correct spreadsheet-modified gene symbols, we built the HGNChelper R package. HGNChelper maps different aliases and spreadsheet-modified gene symbols to approved gene symbols maintained by the HUGO Gene Nomenclature Committee (HGNC) database 5. HGNChelper also supports mouse gene symbol correction based on the Mouse Genome Informatics (MGI) database 6.
Methods
Implementation
Source data. Human gene symbols are accessed from the HGNC database FTP site (ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt) 7 and mouse gene symbols are acquired from the MGI database (http://www.informatics.jax.org/downloads/reports/MGI_EntrezGene.rpt) 6. HGNChelper handles these URLs, and access to and processing of the files, so the user does not interact with them directly.
Algorithm. Human gene symbol correction proceeds in three steps. First, capitalization is fixed: all letters are converted to upper-case, except in the open reading frame (orf) nomenclature, where "orf" is written in lower-case. Second, dates or floating-point numbers generated by Excel modification are corrected using a custom index, which was generated by importing all human gene symbols into Excel, exporting them in all available date formats, and collecting any gene symbols that differed from the originals. In the last and most commonly applied step, aliases are updated to approved gene symbols in the HGNC database. Mouse gene symbol correction follows the same steps as human gene symbol correction, except for the capitalization step, since mouse gene symbols begin with an uppercase character followed by all lowercase.
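The three steps can be illustrated with a minimal Python sketch. This is not the package's R implementation: the lookup tables below are small hypothetical stand-ins built from the examples in this article (they are not a snapshot of HGNC), and the function names and the regular expression used for the orf convention are assumptions made for illustration.

```python
import re

# Hypothetical toy tables; the real package derives its map from the HGNC database.
APPROVED = {"KAT2A", "DEC1", "C2orf49"}
EXCEL_MAP = {"1-DEC": "DEC1"}        # spreadsheet-modified form -> original symbol
ALIAS_MAP = {"GCN5L2": "KAT2A"}      # outdated alias -> approved symbol


def fix_case(symbol):
    """Step 1: upper-case everything, then restore lower-case 'orf'."""
    upper = symbol.upper()
    # Assumed pattern for open reading frame symbols such as C2orf49 or CXorf21.
    return re.sub(r"^(C(?:\d+|X|Y))ORF(\d+)$", r"\1orf\2", upper)


def correct_symbol(symbol):
    """Apply the three steps in order; return None if no approved symbol is found."""
    s = fix_case(symbol.strip())
    s = EXCEL_MAP.get(s, s)          # Step 2: undo spreadsheet date conversion
    s = ALIAS_MAP.get(s, s)          # Step 3: update aliases to approved symbols
    return s if s in APPROVED else None


for raw in ["gcn5l2", "1-Dec", "c2ORF49", "NOTAGENE"]:
    print(raw, "->", correct_symbol(raw))
```

Applying the steps in a fixed order means that a symbol damaged in more than one way (for example, a lower-cased, date-converted symbol) can still be recovered in a single pass.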
User interface. The user interface of HGNChelper does not include any local input or output files; instead, it uses R data structures as function arguments and output. Base R data export functions such as write.table can be used to write results to file in whichever format is required. The input arguments to the main function, checkGeneSymbols, are:
1. x: A character vector of gene symbols to check for modified or outdated values
2. chromosome: An optional integer vector the same length as x, providing chromosome numbers for each gene
3. unmapped.as.na: A logical value, if TRUE (default), unmapped symbols will appear as NA in the Suggested.Symbol output column. If FALSE, the original unmapped symbol will be kept.
4. map: An optional user-updated or non-standard gene map. The default maps can be updated by running the interactive example provided in the help page to checkGeneSymbols.
5. species: A required character vector of length 1, either "human" (default) or "mouse".
checkGeneSymbols returns an R data.frame with one row per input gene and three columns:
1. The first column of the data frame shows the input gene symbols.
2. The second column indicates whether the input symbols are valid.
3. The third column provides a corrected gene symbol where possible.
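The shape of this output, and the effect of unmapped.as.na, can be sketched in Python as follows. This is an illustration only, not the R function: the toy map is hypothetical, Python's None stands in for R's NA, and the column names "x" and "Approved" are assumptions (only Suggested.Symbol is named in the text above).

```python
# Hypothetical toy map: input symbol -> approved symbol.
TOY_MAP = {"KAT2A": "KAT2A", "GCN5L2": "KAT2A"}


def check_symbols(symbols, unmapped_as_na=True):
    """Return one row per input symbol with three columns."""
    rows = []
    for s in symbols:
        suggestion = TOY_MAP.get(s)
        if suggestion is None and not unmapped_as_na:
            suggestion = s               # keep the original unmapped symbol
        rows.append({
            "x": s,                      # column 1: the input symbol
            "Approved": TOY_MAP.get(s) == s,   # column 2: is the input already valid?
            "Suggested.Symbol": suggestion,    # column 3: correction, or None (NA)
        })
    return rows


for row in check_symbols(["KAT2A", "GCN5L2", "FOO"]):
    print(row)
```

Keeping one row per input, in input order, lets the result be joined back onto the original table without re-matching symbols.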
A message is printed indicating when the package’s built-in map was last updated. Because the gene symbol databases are updated as frequently as every day, we provide the getCurrentHumanMap and getCurrentMouseMap functions for updating the reference map without requiring an HGNChelper software update. These functions fetch the most up-to-date version of the map from HGNC and MGI, respectively, and users can provide the output of these functions through the map argument of the checkGeneSymbols function. However, fetching a new map requires internet access and takes longer than using the package’s built-in index.
Operation
HGNChelper is an R package installable from CRAN on Linux, Windows, and OSX. It requires a base installation of R (> 3.5.0) and no other dependencies, and has minimal hardware requirements that should be met by any computer capable of installing the R dependency.
Results
To evaluate the performance of HGNChelper, we quantified the extent of invalid gene symbols present in platform annotation files in the Gene Expression Omnibus (GEO) database from 2002 to 2020. We downloaded 20,716 GEO platform annotation (GPL) files using GEOquery::getGEO 8, of which 2,044 platforms were suspected to contain gene symbol information based on matching to valid symbols. The proportion of invalid gene symbols clearly increases with the age of the platform submission (Figure 1), from an average of ~3% for recent platforms to ~20% in 2010 and 30–40% in the earliest platforms from 2002–03. The overall proportion of valid gene symbols was 79%, increasing to 92% after HGNChelper correction. We also checked the validity of gene symbols in the Molecular Signatures Database (MSigDB 7.0) 9. Of the 38,040 gene symbols used in MSigDB version 7.0, 850 were invalid; this number was reduced to 453 after HGNChelper correction, most of which were lncRNA symbols, along with a few withdrawn symbols.
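To put the MSigDB counts in proportion, the short Python calculation below derives percentages from the three numbers reported above. The derived percentages are simple arithmetic on those counts and do not appear in the original analysis; the helper name pct is hypothetical.

```python
# Counts reported in the text for MSigDB 7.0.
total_symbols = 38040
invalid_before = 850
invalid_after = 453


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


corrected = invalid_before - invalid_after

print("invalid before correction (%):", pct(invalid_before, total_symbols))
print("invalid after correction (%):", pct(invalid_after, total_symbols))
print("symbols corrected:", corrected)
print("share of invalid symbols corrected (%):", pct(corrected, invalid_before))
```

In other words, roughly 2% of MSigDB 7.0 symbols were invalid, and a little under half of those were corrected.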
The limma 10 Bioconductor package provides related functionality; however, limma::alias2Symbol and limma::alias2SymbolTable are intended only to translate known gene aliases, whereas HGNChelper is intended for heterogeneous input that may include aliases, valid symbols, Excel-modified symbols, incorrect capitalization, and unmappable symbols, and to provide a map between input and output. limma::alias2SymbolTable keeps the output vector the same length as the input, but if there are multiple aliases, it displays only the one with the lowest Entrez ID number, whereas HGNChelper returns a delimited vector of all aliases.
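The difference between the two strategies for multiple matches can be sketched in Python. Neither function reproduces limma or HGNChelper code: the alias, the candidate symbols, and the Entrez IDs are made up, and the " /// " delimiter is an assumption chosen for illustration.

```python
# Hypothetical alias with two candidate symbols, each paired with a made-up Entrez ID.
CANDIDATES = {"ALIAS1": [("GENEB", 200), ("GENEA", 100)]}


def lowest_entrez(alias):
    """Keep only the candidate with the lowest Entrez ID (one value per input)."""
    hits = CANDIDATES.get(alias, [])
    return min(hits, key=lambda hit: hit[1])[0] if hits else None


def all_delimited(alias, sep=" /// "):
    """Keep every candidate, joined into a single delimited string."""
    hits = CANDIDATES.get(alias, [])
    return sep.join(sorted(symbol for symbol, _ in hits)) if hits else None


print(lowest_entrez("ALIAS1"))
print(all_delimited("ALIAS1"))
```

Both strategies return one value per input, so the output stays aligned with the input; the delimited form leaves the ambiguity visible so that the user can resolve it.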
Discussion
Gene symbols are error-prone and unstable, but remain in common use for their memorability and interpretability. Our analysis of public databases containing gene symbols emphasizes the need for gene symbol correction particularly when using symbols from older datasets and reported results. Such correction should be routinely done when gene symbols are part of high-throughput analysis, such as re-analysis of targeted gene panels for precision medicine, which tend to be annotated with gene symbols (e.g. 11), in Gene Set Enrichment Analysis using the gene symbol versions of popular databases such as MSigDB 9 or GeneSigDB 12 , or when performing systematic review or meta-analysis of published multi-gene signatures (e.g. 13). HGNChelper implements a programmatic and straightforward approach to the routine identification and correction of invalid gene symbols.
Limitations
HGNChelper reduced the fraction of invalid gene symbols in GPL files (Figure 1), but 8% of gene symbols remained invalid after correction. We further investigated the cases that HGNChelper failed to fix and identified the following situations:
1. Long non-coding RNAs (e.g. “lnc-ARMCX4-1”, “lnc-SOX11-1”)
2. Withdrawn symbol (e.g. “OCLM”)
3. Uncharacterized gene (e.g. “LOC644669”): Symbols beginning with LOC. When a published symbol is not available, and orthologs have not yet been determined, this may be represented as ‘LOC’ + the GeneID.
4. Non-human gene symbol
5. Missing data
6. Commercial product name (e.g. Probe ID)
Another limitation of HGNChelper is that it cannot always determine which gene a symbol refers to. For example, FHL1 is both an approved symbol and an alias of CFH, so unless the chromosome of CFH is specified, FHL1 will simply be returned as a valid symbol. We therefore recommend that users provide as much information as possible and remain cautious when interpreting the output.
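The FHL1 case can be illustrated with a minimal Python sketch of chromosome-based disambiguation. This is a hypothetical illustration of the idea, not the package's logic: the two toy tables, the function name resolve, and the rule of accepting a candidate only when exactly one matches the supplied chromosome are all assumptions.

```python
# Toy records (hypothetical tables): approved symbol -> chromosome,
# and alias -> approved symbols that have used that alias.
APPROVED_CHROMOSOME = {"FHL1": "X", "CFH": "1"}
ALIAS_TO_APPROVED = {"FHL1": ["CFH"]}


def resolve(symbol, chromosome=None):
    """Return the approved symbol consistent with `chromosome`, or None."""
    if chromosome is None:
        # Without location information, an approved symbol is accepted as-is.
        if symbol in APPROVED_CHROMOSOME:
            return symbol
        candidates = ALIAS_TO_APPROVED.get(symbol, [])
        return candidates[0] if len(candidates) == 1 else None
    chromosome = str(chromosome)
    if APPROVED_CHROMOSOME.get(symbol) == chromosome:
        return symbol
    # The symbol is not approved on this chromosome: try it as an alias instead.
    matches = [c for c in ALIAS_TO_APPROVED.get(symbol, [])
               if APPROVED_CHROMOSOME.get(c) == chromosome]
    return matches[0] if len(matches) == 1 else None


print(resolve("FHL1"))       # no chromosome given
print(resolve("FHL1", 1))    # chromosome points to the alias target
```

Without the chromosome, nothing in the symbol itself signals that a different gene was intended, which is why the extra information matters.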
Software availability
Package available from CRAN: https://cran.r-project.org/package=HGNChelper
Source code available from: https://github.com/waldronlab/HGNChelper/
Archived source code as at time of publication: https://doi.org/10.5281/zenodo.4309985 13
License: GPL (≥ 2.0)
Acknowledgements
An earlier version of this article can be found on bioRxiv (doi: https://doi.org/10.1101/2020.09.16.300632)
This work was supported by National Cancer Institute (NCI) grant U24-CA180996 to L.W.
Funding Statement
This work was supported by the National Cancer Institute (NCI) grant U24-CA180996 to LW.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 2; peer review: 3 approved]
References
- 1. Poux AN, Cebrat M, Kim CM, et al.: Structure of the GCN5 histone acetyltransferase bound to a bisubstrate inhibitor. Proc Natl Acad Sci U S A. 2002;99(22):14065–70. 10.1073/pnas.222373899
- 2. Zeeberg BR, Riss J, Kane DW, et al.: Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004;5:80. 10.1186/1471-2105-5-80
- 3. Ziemann M, Eren Y, El-Osta A: Gene name errors are widespread in the scientific literature. Genome Biol. 2016;17(1):177. 10.1186/s13059-016-1044-7
- 4. Bruford EA, Braschi B, Denny P, et al.: Guidelines for human gene nomenclature. Nat Genet. 2020;52(8):754–758. 10.1038/s41588-020-0669-3
- 5. Yates B, Braschi B, Gray KA, et al.: Genenames.org: the HGNC and VGNC resources in 2017. Nucleic Acids Res. 2017;45(D1):D619–D625. 10.1093/nar/gkw1033
- 6. Bult CJ, Blake JA, Smith CL, et al.: Mouse Genome Database (MGD) 2019. Nucleic Acids Res. 2019;47(D1):D801–D806. 10.1093/nar/gky1056
- 7. Home | HUGO Gene Nomenclature Committee. [cited 2 May 2020].
- 8. Davis S, Meltzer PS: GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 2007;23(14):1846–1847. 10.1093/bioinformatics/btm254
- 9. Liberzon A, Subramanian A, Pinchback R, et al.: Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27(12):1739–1740. 10.1093/bioinformatics/btr260
- 10. Ritchie ME, Phipson B, Wu D, et al.: limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. 10.1093/nar/gkv007
- 11. McCabe MJ, Gauthier MEA, Chan CL, et al.: Development and validation of a targeted gene sequencing panel for application to disparate cancers. Sci Rep. 2019;9(1):17052. 10.1038/s41598-019-52000-3
- 12. Culhane AC, Schwarzl T, Sultana R, et al.: GeneSigDB--a curated database of gene expression signatures. Nucleic Acids Res. 2010;38(Database issue):D716–25. 10.1093/nar/gkp1015
- 13. Waldron L, Haibe-Kains B, Culhane AC, et al.: Comparative meta-analysis of prognostic gene signatures for late-stage ovarian cancer. J Natl Cancer Inst. 2014;106(5):dju049. 10.1093/jnci/dju049