Nucleic Acids Research. 2022 Nov 1;51(D1):D994–D1002. doi: 10.1093/nar/gkac970

HGD: an integrated homologous gene database across multiple species

Guangya Duan, Gangao Wu, Xiaoning Chen, Dongmei Tian, Zhaohua Li, Yanling Sun, Zhenglin Du, Lili Hao, Shuhui Song, Yuan Gao, Jingfa Xiao, Zhang Zhang, Yiming Bao, Bixia Tang, Wenming Zhao
PMCID: PMC9825607  PMID: 36318261

Abstract

Homology is fundamental to inferring the evolutionary processes of genes and their relationships through shared ancestry. Existing homologous gene resources differ in inference methods, homology relationships and identifiers, which makes it difficult to choose among homology results and to map them from one resource to another. Here, we present HGD (Homologous Gene Database, https://ngdc.cncb.ac.cn/hgd), a comprehensive homolog resource that integrates multiple species, multiple resources and multi-omics data, as a complement to existing resources providing a public, one-stop data service. Currently, HGD houses a total of 112 383 644 homologous pairs for 37 species, including 19 animals, 16 plants and 2 microorganisms. Meanwhile, HGD integrates various annotations from public resources, including 16 909 homologs with traits, 276 670 homologs with variants, 398 573 homologs with expression and 536 852 homologs with gene ontology (GO) annotations. HGD provides a wide range of omics gene function annotations to help users gain a deeper understanding of gene function.

INTRODUCTION

Homology is defined as a common genealogical relationship between and within organisms, which is fundamental to deciphering evolutionary processes and inferring genes’ potential functions (1–4). Genes with shared ancestry are generally referred to as homologs, which can be further divided into two types, namely, orthologs, which arise from speciation, and paralogs, which arise from duplication (5–8). With the advent of comparative genomic research, sequence-based identification and inference of homologous relationships have been developed for years, resulting in a large number of practical approaches and databases (9–11), which facilitate functional studies in genomics and systems biology.

In general, methods for predicting homologous genes are mostly based on protein sequences and can be grouped into three categories. The first is graph-based methods, which infer gene relationships by calculating pairwise sequence similarity (12); representative databases include COG (13), InParanoid (14), eggNOG (15), Hieranoid (16), OMA (17) and HomoloGene (18). The second is phylogenetic tree-based methods, which rely on the reconciliation of gene trees and species trees (19–21) and are adopted by Panther (22), TreeFam (23) and Ensembl Compara (24). Methods in these two categories use a variety of criteria to organize homologs. For instance, OMA defines grouped homologs, whereas Ensembl Compara uses pairwise homologs. The third category is integrated homology prediction methods, which combine multiple homology inference algorithms, such as DIOPT (25) and Alliance (26). In addition to methods based on protein sequences, synteny is another approach that identifies homologs using genomic context (11,27,28), for example, ATGC (29).
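To make the graph-based idea concrete, the following minimal sketch applies reciprocal best hits, a simple and widely known heuristic built on pairwise similarity scores. It is offered only as an illustration: it is not claimed to be the algorithm of any database named above, and the gene names and scores are invented.

```python
def reciprocal_best_hits(scores):
    """Return sorted (a, b) pairs where a and b are each other's best hit.

    scores maps (gene_in_species_A, gene_in_species_B) to a similarity
    score, for example an alignment bit score (hypothetical values here).
    """
    best_for_a = {}  # gene in A -> (best partner in B, score)
    best_for_b = {}  # gene in B -> (best partner in A, score)
    for (a, b), score in scores.items():
        if a not in best_for_a or score > best_for_a[a][1]:
            best_for_a[a] = (b, score)
        if b not in best_for_b or score > best_for_b[b][1]:
            best_for_b[b] = (a, score)
    # Keep a pair only if the best-hit relation holds in both directions.
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


# Invented similarity scores between genes of two species.
toy_scores = {
    ("a1", "b1"): 950.0,
    ("a1", "b2"): 400.0,
    ("a2", "b1"): 500.0,
    ("a2", "b2"): 420.0,
    ("a3", "b2"): 300.0,
}
rbh_pairs = reciprocal_best_hits(toy_scores)
print(rbh_pairs)
```

In this toy input only a1 and b1 select each other, so the other candidate pairs are discarded. Real graph-based pipelines add clustering and in-paralog handling on top of such pairwise scores.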

The databases mentioned above all provide ongoing maintenance and public data access services except ATGC, but a few issues still require improvement. One issue is that there is no uniform standard for homology identifiers and cross-reference identifiers across databases, so identical homologous genes retrieved from different resources are difficult to link. Some databases, for example OMA, use their own protein identifier format as the homology identifier or use a single cross-reference identifier such as a Uniprot ID, Ensembl ID or NCBI ID, while others use a combination of Ensembl IDs, NCBI gene/protein IDs or Uniprot IDs as homology identifiers. These divergences complicate the mapping of homologs and hinder easy access to the most accurate homologs of interest (30). Despite the efforts of the Quest for Orthologs consortium to promote a unified standard for homologs, this remains a challenging task due to the rapid pace of both new genome releases and algorithm updates (30,31). Another issue is that the functional annotations of most homology databases focus mainly on GO, pathways and/or protein domains, which is not conducive to comprehensive research on homologs based on next-generation sequencing technologies, such as the evolutionary conservation of gene expression (32). For instance, although some resources, such as Ensembl Compara, support multi-omics functional annotations like variation and expression, they allow browsing of variation, expression or phenotype information for individual genes but do not support the comparison of homologs across species in a single panel. Additionally, comprehensive homology resources such as DIOPT and Alliance integrate gene symbols, gene identifiers and more functional information about homologs, but focus only on model organisms. Therefore, it is necessary to construct a comprehensive homology resource that integrates multiple homology inference results as well as multi-omics functional annotations and incorporates both model and non-model organisms for the worldwide research communities.

Here, we present the Homologous Gene Database (HGD, https://ngdc.cncb.ac.cn/hgd), a comprehensive homology resource that integrates public homology resources for multiple species, incorporates multi-omics gene annotations including traits, variations, gene expression and gene functional annotations, and provides free public data services for browsing, retrieval, comparison and downloading.

MATERIALS AND METHODS

Data source

The inferred homologs, IDs and gene annotations were collected from public resources worldwide. Specifically, HGD integrates predictions from five of the top-performing methods according to the most recent Quest for Orthologs benchmarking assessment (33): eggNOG (version 5.0, http://eggnog5.embl.de/#/app/home), Panther (version 17.0, http://www.pantherdb.org), TreeFam (version 4.5.1, http://www.treefam.org), Hieranoid (version 2, https://hieranoid.sbc.su.se) and InParanoid (version 8, https://inparanoid.sbc.su.se/cgi-bin/index.cgi). Currently, the inclusion criterion is a higher number of predicted homology relationships together with a higher positive predictive value. For ID mapping, a large batch of files of various versions was downloaded and manually curated from Uniprot (https://www.uniprot.org) (34), Ensembl (http://www.ensembl.org) (35) and NCBI (https://www.ncbi.nlm.nih.gov) (36). Gene functional annotations were collected from GWAS Atlas (https://ngdc.cncb.ac.cn/gwas) (37) for trait annotations, GVM (https://ngdc.cncb.ac.cn/gvm) (38) for variant annotations, GEN (https://ngdc.cncb.ac.cn/gen) (39) for expression annotations and Ensembl (http://www.ensembl.org) for GO (40) annotations.

Data processing

The entire data processing procedure includes homology data pre-processing, ID mapping and homologous gene annotation. The collected original homology data were first filtered to the 37 species and then converted into homologous pairs. After that, a batch mapping among Ensembl Protein ID, UniProt ID and NCBI Protein ID was performed. The basic principle of data integration is that, when each homologous pair is compared with the others, conflicting data are retained and duplicated data are merged into a single record. The original homolog ID (e.g. Group ID/Cluster ID/Tree ID) is recorded along with the data sources from which the homolog comes, in case users need to trace the data. As a result, 112 383 644 non-redundant homologous pairs were obtained. Then, 1 138 192 unique homologous proteins were identified and supplemented with basic gene information, including gene identifier, gene symbol, gene synonym, gene type, position and gene description. Subsequently, extensive gene function annotations from GWAS Atlas, GVM, GEN and Ensembl GO were mapped onto this homologous gene list, resulting in homolog annotations for trait, variant, expression and GO, respectively. During data processing, the NumPy and Pandas libraries of Python were used with multi-threaded parallel processing to accelerate the handling of hundreds of millions of homology pairs. The whole process described above is shown in Figure 1.
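The ID mapping and merging steps described above can be sketched with Pandas and NumPy as follows. This is a minimal illustration under stated assumptions, not the HGD pipeline itself: the column names, the `id_map` dictionary, the toy identifiers and the choice of UniProt-style IDs as the common identifier are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per homologous pair reported by one source.
records = pd.DataFrame(
    [
        ("ENSP_TOY_1", "Q1", "Panther", "PTHR00001"),
        ("Q1", "P1", "eggNOG", "3AAAA"),        # same pair, reversed order
        ("P1", "Q1", "Panther", "PTHR00001"),   # duplicate once IDs are mapped
        ("P2", "Q2", "InParanoid", "17"),
    ],
    columns=["protein_a", "protein_b", "source", "source_homolog_id"],
)

# Hypothetical cross-reference table mapping other IDs to one common ID type.
id_map = {"ENSP_TOY_1": "P1"}


def integrate_pairs(df, id_map):
    """Map IDs, order each pair canonically, then merge duplicated pairs."""
    df = df.copy()
    for col in ("protein_a", "protein_b"):
        # Keep the original ID when no mapping is available.
        df[col] = df[col].map(id_map).fillna(df[col])
    a, b = df["protein_a"], df["protein_b"]
    first = np.where(a <= b, a, b)   # (A, B) and (B, A) become the same pair
    second = np.where(a <= b, b, a)
    df = df.assign(protein_a=first, protein_b=second).drop_duplicates()
    # One record per pair, keeping the sources and original homolog IDs.
    return df.groupby(["protein_a", "protein_b"], as_index=False).agg(
        sources=("source", lambda s: ";".join(sorted(set(s)))),
        source_homolog_ids=("source_homolog_id", lambda s: ";".join(sorted(set(s)))),
        n_sources=("source", "nunique"),
    )


integrated = integrate_pairs(records, id_map)
print(integrated)
```

Keeping the source names and original homolog IDs on the merged record preserves traceability, and the `n_sources` column corresponds to the idea of counting how many resources support a pair.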

Figure 1.


Overview of data sources, data processing and database contents of Homologous Gene Database.

Database implementation

HGD was implemented using Spring Boot (https://spring.io/projects/spring-boot; a framework that makes it easy to create standalone Java applications) as the back-end framework. All data were stored and managed using MySQL (https://dev.mysql.com; a free and popular relational database management system). To provide user-friendly and highly interactive web applications, web pages were constructed using Vue3 (https://v3.cn.vuejs.org/; an approachable, high-performance and versatile framework for building web user interfaces). Front-end interfaces were built using Element UI (https://element.eleme.cn/; a Vue3 component library for designers and developers). Furthermore, data visualization was implemented with ECharts (https://www.echarts.com; a JavaScript plug-in for creating interactive charts), D3.js (https://d3js.org/; a JavaScript library for manipulating documents based on data) and DataTables (https://datatables.net; a plug-in for the jQuery JavaScript library to render HTML tables).

DATABASE CONTENTS AND USAGE

Homology collection

HGD features a comprehensive collection of homology data from diverse resources and the integration of multi-omics annotations for multiple species. In the current version, HGD houses 37 species (19 animals, 16 plants and 2 microorganisms) with 112 383 644 homologous pairs. Notably, 10 of the 37 species are model organisms. Meanwhile, HGD integrates various annotations from public resources including 16 909 homologs with traits, 276 670 homologs with variants, 398 573 homologs with expression and 536 852 homologs with GO. The data statistics are summarized in Table 1.

Table 1.

Annotation statistics for homologs of each species

Organism | Common name | Taxon ID | #Trait | #Variant | #Expression | #GO
Triticum aestivum | Bread wheat | 4565 | 1102 | 22 364 | 25 561 | 16 602
Manihot esculenta | Cassava | 3983 | - | - | - | 22 121
Gossypium hirsutum | Cotton | 3635 | - | - | 51 141 | -
Cucumis sativus | Cucumber | 3659 | - | - | - | 15 636
Phoenix dactylifera | Date palm | 42345 | - | - | - | -
Vitis vinifera | Grape | 29760 | - | 94 | - | 94
Zea mays | Maize | 4577 | 7575 | - | 15 819 | 26 537
Capsicum annuum | Pepper | 4072 | - | 7951 | - | 20 356
Populus trichocarpa | Poplar | 3694 | - | 27 182 | - | 23 541
Solanum tuberosum | Potato | 4113 | - | - | - | 21 352
Brassica napus | Rapeseed | 3708 | 154 | 7276 | 8466 | 33 517
Oryza sativa | Rice | 4530 | 6083 | 24 796 | 24 849 | 20 000
Sorghum bicolor | Sorghum | 4558 | 1130 | 26 940 | 13 776 | 21 932
Glycine max | Soybean | 3847 | 320 | 46 583 | 31 573 | 38 385
Arabidopsis thaliana | Thale cress | 3702 | - | - | 24 160 | 22 026
Solanum lycopersicum | Tomato | 4081 | - | 3754 | 2140 | 3965
Felis catus | Cat | 9685 | - | 17 764 | - | 16 686
Bos taurus | Cattle | 9913 | 236 | 17 738 | 20 721 | 17 672
Gallus gallus | Chicken | 9031 | 214 | 13 327 | 15 497 | 13 034
Pan troglodytes | Chimpanzee | 9598 | - | - | - | 19 479
Canis familiaris | Dog | 9615 | - | - | 1077 | 973
Drosophila melanogaster | Fruit fly | 7227 | - | - | 12 116 | 10 628
Ailuropoda melanoleuca | Giant panda | 9646 | - | 12 580 | - | 13 252
Apis mellifera | Honey bee | 7460 | - | - | - | 211
Equus caballus | Horse | 9796 | - | 16 037 | - | 16 594
Homo sapiens | Human | 9606 | - | 18 969 | 19 923 | 19 208
Mus musculus | Mouse | 10090 | - | - | 21 709 | 14 495
Sus scrofa | Pig | 9823 | 95 | 13 315 | 19 493 | 11 352
Rattus norvegicus | Rat | 10116 | - | - | 19 633 | 17 727
Macaca mulatta | Rhesus monkey | 9544 | - | - | 19 245 | 16 325
Caenorhabditis elegans | Roundworm | 6239 | - | - | 13 926 | 10 595
Bombyx mori | Silkworm | 7091 | - | - | - | 7066
Xenopus tropicalis | Tropical clawed frog | 8364 | - | - | 14 064 | 8427
Meleagris gallopavo | Turkey | 9103 | - | - | - | 10 207
Danio rerio | Zebrafish | 7955 | - | - | 23 415 | 21 987
Saccharomyces cerevisiae | Brewer's yeast | 4932 | - | - | 269 | 4295
Escherichia coli | E. coli | 562 | - | - | - | 575
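
As a quick consistency check, the per-species counts in Table 1 can be summed and compared with the totals quoted in the text. The sketch below does this for the trait and variant columns, with values transcribed from Table 1; the dictionary layout is only an illustrative choice, and the same check can be applied to the expression and GO columns.

```python
# Counts transcribed from the #Trait column of Table 1 (species with "-" omitted).
trait_counts = {
    "Bread wheat": 1102, "Maize": 7575, "Rapeseed": 154, "Rice": 6083,
    "Sorghum": 1130, "Soybean": 320, "Cattle": 236, "Chicken": 214, "Pig": 95,
}

# Counts transcribed from the #Variant column of Table 1.
variant_counts = {
    "Bread wheat": 22364, "Grape": 94, "Pepper": 7951, "Poplar": 27182,
    "Rapeseed": 7276, "Rice": 24796, "Sorghum": 26940, "Soybean": 46583,
    "Tomato": 3754, "Cat": 17764, "Cattle": 17738, "Chicken": 13327,
    "Giant panda": 12580, "Horse": 16037, "Human": 18969, "Pig": 13315,
}

trait_total = sum(trait_counts.values())
variant_total = sum(variant_counts.values())

# Totals and species counts stated in the text.
print(trait_total == 16909, len(trait_counts) == 9)
print(variant_total == 276670, len(variant_counts) == 16)
```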

Homologs with annotated traits

HGD integrates the trait annotations from GWAS Atlas to provide a more comprehensive understanding of the effects of gene function on traits (Figure 2A). According to the trait terms in GWAS Atlas, the trait annotations were organized by trait ontology terms initially obtained from the Animal Trait Ontology for Livestock (https://bioportal.bioontology.org/ontologies/ATOL) and the Plant Trait Ontology (41). After mapping the trait terms to homologous genes, 15 trait ontology terms covering 16 909 homologous genes for 9 species (3 animals and 6 plants) were retained. Users can select a trait term of interest to view the trait annotations of multi-species homologs, represented by coloured icons. In particular, the green icon indicates that the homologs play the same functional role in determining a given trait. Users can obtain further details by clicking on the green icon, which shows a list of integrated homologs, with the number of data sources as a quantitative measure of confidence in each homologous gene, and a genotype-phenotype list containing detailed genotype information for further research on the gene function of homologs.

Figure 2.


Screenshots of HGD. (A) Homology data with trait information and detailed information on homologous genes and GWAS Atlas traits. The blue icon represents homologs between two species, the orange icon represents homologs that have ontology term annotations, and the green icon represents two homologs with the same ontology term annotation. (B) Homology data with variation information. (C) Homology data with expression information. (D) Homology data with GO information. (E) A heat map of homologous pairs for 37 species.
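
The icon colours described in the caption can be read as a simple rule over the ontology terms of two homologous genes. The sketch below is one possible reading, not HGD's code: in particular, treating orange as "at least one homolog is annotated but no term is shared" is an assumption, and the function name and example terms are hypothetical.

```python
def icon_colour(terms_a, terms_b):
    """Classify a homologous pair by the ontology terms of its two genes.

    terms_a and terms_b are sets of ontology term names for the two genes.
    """
    if terms_a & terms_b:
        return "green"   # both homologs carry the same ontology term
    if terms_a or terms_b:
        return "orange"  # annotated, but no shared term (assumed reading)
    return "blue"        # homologous pair without ontology annotations


print(icon_colour({"plant morphology trait"}, {"plant morphology trait", "yield trait"}))
print(icon_colour({"yield trait"}, set()))
print(icon_colour(set(), set()))
```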

Homologs with associated variants

HGD integrates the variant annotations from GVM and provides a function for comparing the variations of homologous genes (Figure 2B). According to the variant annotation results produced in GVM by the Ensembl Variant Effect Predictor (42), the variant annotation data were organized by variant ontology terms initially selected from the Sequence Ontology database (http://www.sequenceontology.org) (43). Mapping the variant annotation data to homologous genes resulted in 29 variation terms with 276 670 homologs for 16 species (7 animals and 9 plants). Users can select a variant ontology term of interest to view the distinct variant annotations of multi-species homologs, indicated by coloured icons. Clicking a green icon presents a list of homologs of the selected gene with detailed information, along with a list of variants with detailed alleles and positions for that gene, which supports further research on the impact of variation in homologs.

Homologs with related expressions

HGD integrates the expression data from GEN to visualize expression profiles of homologs in multiple tissues across species (Figure 2C). According to the expression dataset classification in GEN, the expression data were organized by ontology terms initially selected from Disease Ontology (DO, https://disease-ontology.org) (44), BRENDA Tissue Ontology (BTO, http://www.ontobee.org/ontology/bto) and the biological contexts defined by GEN. Mapping the expression profiles to homologs resulted in 53 expression terms with 398 573 homologs for 22 species (12 animals, 9 plants and 1 microorganism). When an expression term of interest is selected, a list of the homologs is displayed. Coloured icons represent the expression status of homologs across species. The green icon indicates that the homologs may share the same expression pattern; clicking it displays the homolog list and the average transcripts per million (TPM) values across tissues, shown as a boxplot.
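
The average TPM per tissue mentioned above can be computed from sample-level values with a simple group-by. The sketch below assumes a long-format table with hypothetical column names and invented TPM values; it does not reproduce GEN's or HGD's actual data model.

```python
import pandas as pd

# Hypothetical TPM values: two RNA-seq samples per gene and tissue.
expression = pd.DataFrame(
    {
        "gene": ["geneA"] * 4 + ["geneB"] * 4,
        "tissue": ["panicle", "panicle", "root", "root"] * 2,
        "tpm": [120.0, 140.0, 10.0, 30.0, 80.0, 100.0, 5.0, 15.0],
    }
)

# Average TPM per gene and tissue, arranged as a gene x tissue table.
mean_tpm = expression.groupby(["gene", "tissue"])["tpm"].mean().unstack("tissue")
print(mean_tpm)
```

Each row of `mean_tpm` then holds the tissue averages for one homolog, which is the kind of series a boxplot per gene would summarize.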

Homologs with annotated GO terms

HGD integrates GO annotation data from Ensembl to provide a functional comparison of genes among homologs (Figure 2D). The GO module houses 60 GO terms covering 536 852 homologs of 35 species (19 animals, 14 plants and 2 microorganisms). Users can select the GO term of interest to view the corresponding homologs across multiple species. A list of integrated homologs and a GO list with detailed sub-GO terms are available for further gene function research.

Homologous pairs between species

HGD provides a heat map of homologous pairs for all 37 species (Figure 2E). The heat map shows a large number of homologous pairs among animals and among plants, whereas the two microorganisms have the fewest homologous pairs. Meanwhile, there are a certain number of homologous pairs between animals and plants, which may be valuable for researchers interested in studies across plants and animals. By clicking on the blocks of the heat map, users can directly access a detailed list of homologous genes between any two species.
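
The matrix behind such a heat map is a symmetric table of pair counts between species. A minimal sketch of how it could be derived from a pair list is shown below; the species labels, column names and counts are toy values, and the real heat map is of course built from the full set of homologous pairs.

```python
import numpy as np
import pandas as pd

# Hypothetical homologous pairs, labelled only by the species of each gene.
pairs = pd.DataFrame(
    {
        "species_a": ["rice", "sorghum", "rice", "human", "human", "human"],
        "species_b": ["sorghum", "rice", "human", "mouse", "mouse", "mouse"],
    }
)

species = sorted(set(pairs["species_a"]) | set(pairs["species_b"]))

# Directed counts (species_a -> species_b) on a square species x species grid.
counts = pd.crosstab(pairs["species_a"], pairs["species_b"]).reindex(
    index=species, columns=species, fill_value=0
)

# Symmetrise so that (A, B) and (B, A) land in the same cells, without
# double-counting within-species pairs on the diagonal.
m = counts.to_numpy()
heatmap_matrix = pd.DataFrame(
    m + m.T - np.diag(np.diag(m)), index=species, columns=species
)
print(heatmap_matrix)
```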

Retrieval of homologous genes

HGD provides a basic search function and an advanced data filter function for retrieving homologous genes. Users can search for homologs using various keywords, including gene symbols, gene synonyms, UniProt ID, Ensembl protein ID, Ensembl gene ID, NCBI gene ID, species common name, protein biotype, gene description and protein name; the latter two support fuzzy matching. Once the search results are returned, users can filter them further by conditions such as trait ontology, variation ontology, expression term, GO term and species, which can be easily added or removed through the web interface. For each resulting homolog, users can view the gene symbol, Ensembl protein ID and gene description, together with a list of its homologs in other species. Users can also view the number of gene annotations, including GO, expression, variation and trait, and click through for more detailed information. The results can be downloaded for further data analysis.
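
The search-then-filter workflow can be illustrated on a downloaded result table. The sketch below is not HGD's search implementation: fuzzy matching is approximated by a case-insensitive substring match on the description, and the column names and gene records are hypothetical.

```python
import pandas as pd

# Hypothetical gene records; symbols and descriptions are invented.
genes = pd.DataFrame(
    [
        ("GENE1", "Rice", "hypothetical kinase protein"),
        ("GENE2", "Sorghum", "putative Kinase"),
        ("GENE3", "Rice", "transporter"),
    ],
    columns=["symbol", "species", "description"],
)


def search(df, keyword, species=None):
    """Exact match on symbol or substring match on description, then filter."""
    kw = keyword.lower()
    exact = df["symbol"].str.lower() == kw
    # Stand-in for fuzzy matching: case-insensitive substring search.
    fuzzy = df["description"].str.lower().str.contains(kw, regex=False)
    hits = df[exact | fuzzy]
    if species is not None:
        hits = hits[hits["species"] == species]  # optional species filter
    return hits


print(search(genes, "kinase"))
print(search(genes, "kinase", species="Rice"))
```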

An example of using HGD

EP2 is reported to regulate panicle erectness, panicle length and grain size in rice (45). Searching for EP2 in HGD by gene symbol returns a result page showing that EP2 is from Oryza sativa and has 45 homologous genes across species (Figure 3A), one of which is SORBI_3002G374400 of sorghum. Wang et al. reported that the EP2 ortholog is a candidate gene for the panicle compactness locus of sorghum and that its function needs to be further examined (46). Clicking on EP2 (Oryza sativa) opens a new page with six sections: basic gene information, homologs, GO, trait, variants and expression. The basic gene information section shows the gene location, description and various cross-reference IDs of EP2 (Figure 3B). In the homologs section (Figure 3C), all the homology information for EP2, corresponding to 45 homologs, is displayed by default. Filtering by species for sorghum in the search box shows that there are two homologous pairs in sorghum. One is Uniprot A0A1B6QFJ9, with family ID PTHR31008 in the homology inference source Panther and multiple IDs such as 3I77Q in the source eggNOG, and the other is C5XE12, with cluster ID 339 in the homology inference source InParanoid (Figure 3C). Each homology inference source comes with a web link to the corresponding homology database. In the GO section, a colored gene function profile comparing the homologs, normalized by the number of GO annotations (Figure 3D), shows that the EP2 homologs of sorghum function in nucleotide binding, catalytic activity and oxidoreductase activity. In the variation section, a colored variation profile comparing the homologs, normalized by the number of variants (Figure 3E), shows that EP2 has missense, splice region and synonymous variants. Clicking on a colored block opens a table listing the detailed variant alleles, positions, molecular consequences, allele changes and amino acid residue changes, which is useful for further research. In the trait section, a colored trait profile normalized by the number of trait annotations (Figure 3F) shows that both EP2 and its homologs have plant morphology traits. Clicking on a colored block opens a GWAS table showing that EP2 affects flag leaf lamina width (47), grain length and grain length-width ratio (48), whereas the EP2 homologs of sorghum may be associated with panicle morphology (46). In the expression section, a colored expression profile comparing the homologs, normalized by the number of RNA-seq datasets (Figure 3G), shows that EP2 is expressed in a variety of biological contexts, including temporal, spatial, phenotypic, genetic and environmental contexts. Clicking on the tissue term shows that the EP2 gene is expressed in 31 high-quality RNA-seq datasets and has a high expression level in internode, panicle, embryo (49), shoot (50), coleoptile, root (51), seed, leaf (52,53) and floret, with an average TPM value above 100, which is consistent with the reported high expression of EP2 in internodes and panicles both temporally and spatially during the heading stage (45). Clicking on the bar-plot icon opens a box plot that visualizes the average TPM values of homologs in the current RNA-seq dataset, which can be used to compare the expression levels of homologs within the same RNA-seq dataset (Figure 3H).

Figure 3.


Screenshots of EP2 in HGD. (A) A search result for EP2 gene. (B) The gene basic information of EP2. (C) The homologous gene list of EP2 in rice. (D) The GO pattern of EP2 and the homologs, with a list of GO. (E) The variation pattern of EP2 and the homologs, with a list of variants. (F) The trait pattern of EP2 and the homologs, with a detailed list of GWAS. (G) The expression pattern of EP2 and the homologs, with a list of expression values for EP2 in shoot, internode and panicle tissues. (H) The bar graph indicating the expression levels of homologous genes in the same RNA-seq dataset.

DISCUSSION AND FUTURE PLANS

Homologs are genes with shared ancestry (5) and play a crucial role in comparative, developmental and molecular biology. Homolog databases, as curated knowledgebases, also play an important role in genome-related research, and many homologous gene databases have already been released (13–18,22–26,29). In contrast to these databases (Table 2), HGD systematically integrates homologs from five public individual homolog resources, namely eggNOG, Panther, TreeFam, Hieranoid and InParanoid, and offers several specific features. HGD uses a homologous gene naming rule to display homologs primarily by gene symbol. By handling a wide range of ID mappings, HGD supports searching for homologs by various keywords, including gene symbol, gene synonym, protein name, Uniprot ID, Ensembl protein ID, Ensembl gene ID and NCBI gene ID. Meanwhile, HGD collects multi-omics data, including trait, expression and variation data, from three public resources of NGDC (54), namely GWAS Atlas (37), GVM (38) and GEN (39), and provides a comparison function for browsing homologs of multiple species simultaneously, together with search functions by gene, species and ontology term to facilitate convenient access to data of interest. In addition, HGD houses a number of species, including animals, plants and microorganisms, which helps to extend homologous gene research to non-model organisms. Since HGD has integrated homologous genes with multi-omics annotation data, users can explore the functional effects of homologous genes on traits in different species, compare the variations of homologous genes and examine differences in the expression levels of homologs in multiple tissues across species. All these features set HGD apart from existing homology resources and make HGD, as a complement to them, an important homologous gene resource for the community.

Table 2.

The overall features of existing homology resources

Gene annotation types Homologs view mode
Database Homology inferring method Homology relationship #Organisms/ species Homologs’ ID type GO Pathway Protein/ domain ID Trait Variation Expression Single gene By sequence By comparing
COG Graph-based Pairwise, group 1309 NCBI Protein ID × × × × × ×
InParanoid Graph-based Group 273 Uniprot ID × × × × × ×
eggNOG Graph-based Group 5090 NCBI/Ensembl/ Uniprot Protein ID × × × ×
OMA Graph-based Pairwise, group 2326 Self-defined Protein ID, Uniprot ID/ Ensembl ID/ NCBI ID × × × × ×
HomoloGene Graph-based Group 21 Gene symbol/NCBI Gene ID × × ×
Panther Tree-based Group - Ensembl/NCBI Gene ID,Protein ID × × × × ×
Hieranoid Tree-based Group 66 Uniprot ID × × × × × × ×
TreeFam Tree-based Pairwise, group 109 Ensembl Gene ID/Gene Name/ Uniprot ID × × × × × ×
Ensembl Compara Tree-based Pairwise - Ensembl Gene ID × ×
DIOPT Integrated Pairwise 10a Ensembl/NCBI Gene ID × × × × × ×
Alliance Integrated Pairwise 7a Gene symbol × ×
a Mainly for model organisms.

In the future, we plan to continuously update HGD and integrate homologs from high-quality resources such as OMA (17) and OrthoDB (55) to enlarge the homology resource, and to curate reported or validated homologs from published papers to provide more high-confidence homologous relationships. Meanwhile, we will add more organisms, including crops such as sweet potato, rye and green gram, to fulfil various research requirements. In addition, we will develop online tools such as homology visualization and BLAST (56) to help users retrieve and browse homologs with annotated data in a more user-friendly manner.

DATA AVAILABILITY

HGD is available online for free at https://ngdc.cncb.ac.cn/hgd and does not require user registration.

ACKNOWLEDGEMENTS

We thank the GWAS Atlas, GVM and GEN teams at the National Genomics Data Center (NGDC) for providing data retrieval interfaces, and the high-performance computing platform of NGDC for providing powerful computational resources.

Contributor Information

Guangya Duan, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.

Gangao Wu, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.

Xiaoning Chen, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.

Dongmei Tian, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China.

Zhaohua Li, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.

Yanling Sun, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China.

Zhenglin Du, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China.

Lili Hao, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China.

Shuhui Song, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.

Yuan Gao, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.

Jingfa Xiao, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.

Zhang Zhang, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.

Yiming Bao, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.

Bixia Tang, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.

Wenming Zhao, National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China; University of Chinese Academy of Sciences, Beijing 100049, China.

FUNDING

Strategic Priority Research Program of the Chinese Academy of Sciences [XDB38050300]; National Key R&D Program of China [2018YFD1000505 to W.Z.]; Genomics Data Center Operation and Maintenance of Chinese Academy of Sciences [CAS-WX2022SDC-XK05]; National Natural Science Foundation of China [32100506, 32170678, 32100511]. Funding for open access charge: National Natural Science Foundation of China.

Conflict of interest statement. None declared.

REFERENCES

  • 1. Koonin E.V. Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 2005; 39:309–338.
  • 2. Descorps-Declere S., Lemoine F., Sculo Q., Lespinet O., Labedan B. The multiple facets of homology and their use in comparative genomics to study the evolution of genes, genomes, and species. Biochimie. 2008; 90:595–608.
  • 3. Brigandt I. Homology in comparative, molecular, and evolutionary developmental biology: the radiation of a concept. J. Exp. Zool. B Mol. Dev. Evol. 2003; 299:9–17.
  • 4. Sommer R.J. Homology and the hierarchy of biological systems. Bioessays. 2008; 30:653–658.
  • 5. Wu H., Mao F., Olman V., Xu Y. Hierarchical classification of functionally equivalent genes in prokaryotes. Nucleic Acids Res. 2007; 35:2125–2140.
  • 6. Fitch W.M. Distinguishing homologous from analogous proteins. Syst. Zool. 1970; 19:99–113.
  • 7. Chen R., Jeong S.-S. Functional prediction: identification of protein orthologs and paralogs. Protein Sci. 2000; 9:2344–2353.
  • 8. Studer R.A., Robinson-Rechavi M. How confident can we be that orthologs are similar, but paralogs differ? Trends Genet. 2009; 25:210–216.
  • 9. Gabaldón T., Koonin E.V. Functional and evolutionary implications of gene orthology. Nat. Rev. Genet. 2013; 14:360–366.
  • 10. Altenhoff A.M., Dessimoz C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput. Biol. 2009; 5:e1000262.
  • 11. Shao Y., Chen C., Shen H., He B.Z., Yu D., Jiang S., Zhao S., Gao Z., Zhu Z., Chen X. et al. GenTree, an integrated resource for analyzing the evolution and function of primate-specific coding genes. Genome Res. 2019; 29:682–696.
  • 12. Kuzniar A., van Ham R.C., Pongor S., Leunissen J.A. The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 2008; 24:539–551.
  • 13. Galperin M.Y., Wolf Y.I., Makarova K.S., Vera Alvarez R., Landsman D., Koonin E.V. COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Res. 2021; 49:D274–D281.
  • 14. Sonnhammer E.L., Ostlund G. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res. 2015; 43:D234–D239.
  • 15. Huerta-Cepas J., Szklarczyk D., Heller D., Hernandez-Plaza A., Forslund S.K., Cook H., Mende D.R., Letunic I., Rattei T., Jensen L.J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019; 47:D309–D314.
  • 16. Kaduk M., Riegler C., Lemp O., Sonnhammer E.L. HieranoiDB: a database of orthologs inferred by hieranoid. Nucleic Acids Res. 2017; 45:D687–D690.
  • 17. Altenhoff A.M., Train C.M., Gilbert K.J., Mediratta I., Mendes de Farias T., Moi D., Nevers Y., Radoykova H.S., Rossier V., Warwick Vesztrocy A. et al. OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. 2021; 49:D373–D379.
  • 18. Wheeler D.L., Barrett T., Benson D.A., Bryant S.H., Canese K., Chetvernin V., Church D.M., DiCuccio M., Edgar R., Federhen S. et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2006; 34:D173–D180.
  • 19. Roth A.C., Gonnet G.H., Dessimoz C. Algorithm of OMA for large-scale orthology inference. BMC Bioinf. 2008; 9:518.
  • 20. Tekaia F., Yeramian E. SuperPartitions: detection and classification of orthologs. Gene. 2012; 492:199–211.
  • 21. Tulpan D., Leger S. The plant orthology browser: an orthology and gene-order visualizer for plant comparative genomics. Plant Genome. 2017; 10: 10.3835/plantgenome2016.08.0078.
  • 22. Mi H., Ebert D., Muruganujan A., Mills C., Albou L.P., Mushayamaha T., Thomas P.D. PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Res. 2021; 49:D394–D403.
  • 23. Schreiber F., Patricio M., Muffato M., Pignatelli M., Bateman A. TreeFam v9: a new website, more species and orthology-on-the-fly. Nucleic Acids Res. 2014; 42:D922–D925.
  • 24. Vilella A.J., Severin J., Ureta-Vidal A., Heng L., Durbin R., Birney E. EnsemblCompara genetrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009; 19:327–335.
  • 25. Hu Y., Flockhart I., Vinayagam A., Bergwitz C., Berger B., Perrimon N., Mohr S.E. An integrative approach to ortholog prediction for disease-focused and other functional studies. BMC Bioinf. 2011; 12:357.
  • 26. The Alliance of Genome Resources Consortium. Alliance of genome resources portal: unified model organism research platform. Nucleic Acids Res. 2020; 48:D650–D658.
  • 27. Jun J., Mandoiu I.I., Nelson C.E. Identification of mammalian orthologs using local synteny. BMC Genomics. 2009; 10:630.
  • 28. Kristensen D.M., Wolf Y.I., Mushegian A.R., Koonin E.V. Computational methods for gene orthology inference. Brief Bioinform. 2011; 12:379–391.
  • 29. Novichkov P.S., Ratnere I., Wolf Y.I., Koonin E.V., Dubchak I. ATGC: a database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes. Nucleic Acids Res. 2009; 37:D448–D454.
  • 30. Gabaldón T., Dessimoz C., Huxley-Jones J., Vilella A.J., Sonnhammer E.L., Lewis S. Joining forces in the quest for orthologs. Genome Biol. 2009; 10:403.
  • 31. Linard B., Ebersberger I., McGlynn S.E., Glover N., Mochizuki T., Patricio M., Lecompte O., Nevers Y., Thomas P.D., Gabaldón T. et al. Ten years of collaborative progress in the quest for orthologs. Mol. Biol. Evol. 2021; 38:3033–3045.
  • 32. Liao B.Y., Zhang J. Evolutionary conservation of expression profiles between human and mouse orthologous genes. Mol. Biol. Evol. 2006; 23:530–540.
  • 33. Nevers Y., Jones T.E.M., Jyothi D., Yates B., Ferret M., Portell-Silva L., Codo L., Cosentino S., Marcet-Houben M., Vlasova A. et al. The quest for orthologs orthology benchmark service in 2022. Nucleic Acids Res. 2022; 50:W623–W632.
  • 34. UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021; 49:D480–D489.
  • 35. Cunningham F., Allen J.E., Allen J., Alvarez-Jarreta J., Amode M.R., Armean I.M., Austine-Orimoloye O., Azov A.G., Barnes I., Bennett R. et al. Ensembl 2022. Nucleic Acids Res. 2022; 50:D988–D995.
  • 36. O’Leary N.A., Wright M.W., Brister J.R., Ciufo S., Haddad D., McVeigh R., Rajput B., Robbertse B., Smith-White B., Ako-Adjei D. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44:D733–D745.
  • 37. Tian D., Wang P., Tang B., Teng X., Li C., Liu X., Zou D., Song S., Zhang Z. GWAS atlas: a curated resource of genome-wide variant-trait associations in plants and animals. Nucleic Acids Res. 2020; 48:D927–D932.
  • 38. Li C., Tian D., Tang B., Liu X., Teng X., Zhao W., Zhang Z., Song S. Genome variation map: a worldwide collection of genome variations across multiple species. Nucleic Acids Res. 2021; 49:D1186–D1191.
  • 39. Zhang Y., Zou D., Zhu T., Xu T., Chen M., Niu G., Zong W., Pan R., Jing W., Sang J. et al. Gene expression nebulas (GEN): a comprehensive data portal integrating transcriptomic profiles across multiple species at both bulk and single-cell levels. Nucleic Acids Res. 2022; 50:D1016–D1024.
  • 40. Gene Ontology Consortium. The gene ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021; 49:D325–D334.
  • 41. Cooper L., Meier A., Laporte M.A., Elser J.L., Mungall C., Sinn B.T., Cavaliere D., Carbon S., Dunn N.A., Smith B. et al. The planteome database: an integrated resource for reference ontologies, plant genomics and phenomics. Nucleic Acids Res. 2018; 46:D1168–D1180.
  • 42. McLaren W., Gil L., Hunt S.E., Riat H.S., Ritchie G.R., Thormann A., Flicek P., Cunningham F. The ensembl variant effect predictor. Genome Biol. 2016; 17:122.
  • 43. Eilbeck K., Lewis S.E., Mungall C.J., Yandell M., Stein L., Durbin R., Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 2005; 6:R44.
  • 44. Schriml L.M., Munro J.B., Schor M., Olley D., McCracken C., Felix V., Baron J.A., Jackson R., Bello S.M., Bearer C. et al. The human disease ontology 2022 update. Nucleic Acids Res. 2022; 50:D1255–D1261.
  • 45. Zhu K., Tang D., Yan C., Chi Z., Yu H., Chen J., Liang J., Gu M., Cheng Z. Erect panicle2 encodes a novel protein that regulates panicle erectness in indica rice. Genetics. 2010; 184:343–350.
  • 46. Wang L., Upadhyaya H.D., Zheng J., Liu Y., Singh S.K., Gowda C.L.L., Kumar R., Zhu Y., Wang Y.H., Li J. Genome-wide association mapping identifies novel panicle morphology loci and candidate genes in sorghum. Front. Plant Sci. 2021; 12:743838.
  • 47. Lu Q., Zhang M., Niu X., Wang S., Xu Q., Feng Y., Wang C., Deng H., Yuan X., Yu H. et al. Genetic variation and association mapping for 12 agronomic traits in indica rice. BMC Genomics. 2015; 16:1067.
  • 48. Begum H., Spindel J.E., Lalusin A., Borromeo T., Gregorio G., Hernandez J., Virk P., Collard B., McCouch S.R. Genome-wide association mapping for yield and other agronomic traits in an elite breeding population of tropical rice (Oryza sativa). PLoS One. 2015; 10:e0119873.
  • 49. Narsai R., Secco D., Schultz M.D., Ecker J.R., Lister R., Whelan J. Dynamic and rapid changes in the transcriptome and epigenome during germination and in developing rice (Oryza sativa) coleoptiles under anoxia and re-oxygenation. Plant J. 2017; 89:805–824.
  • 50. Locke A.M., Barding G.A. Jr, Sathnur S., Larive C.K., Bailey-Serres J. Rice SUB1A constrains remodelling of the transcriptome and metabolome during submergence to facilitate post-submergence recovery. Plant Cell Environ. 2018; 41:721–736.
  • 51. Yuan J., Li J., Yang Y., Tan C., Zhu Y., Hu L., Qi Y., Lu Z.J. Stress-responsive regulation of long non-coding RNA polyadenylation in Oryza sativa. Plant J. 2018; 93:814–827.
  • 52. Wilkins K.E., Booher N.J., Wang L., Bogdanove A.J. TAL effectors and activation of predicted host targets distinguish Asian from African strains of the rice pathogen Xanthomonas oryzae pv. oryzicola while strict conservation suggests universal importance of five TAL effectors. Front. Plant Sci. 2015; 6:536.
  • 53. Dossa G.S., Quibod I., Atienza-Grande G., Oliva R., Maiss E., Vera Cruz C., Wydra K. Rice pyramided line IRBB67 (Xa4/Xa7) homeostasis under combined stress of high temperature and bacterial blight. Sci. Rep. 2020; 10:683.
  • 54. CNCB-NGDC Members and Partners. Database resources of the national genomics data center, china national center for bioinformation in 2022. Nucleic Acids Res. 2022; 50:D27–D38.
  • 55. Zdobnov E.M., Kuznetsov D., Tegenfeldt F., Manni M., Berkeley M., Kriventseva E.V. OrthoDB in 2020: evolutionary and functional annotations of orthologs. Nucleic Acids Res. 2021; 49:D389–D393.
  • 56. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990; 215:403–410.

Data Availability Statement

HGD is available online for free at https://ngdc.cncb.ac.cn/hgd and does not require user registration.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
