STAR Protocols. 2023 Jan 11;4(1):102014. doi: 10.1016/j.xpro.2022.102014

HSDecipher: A pipeline for comparative genomic analysis of highly similar duplicate genes in eukaryotic genomes

Xi Zhang, Yining Hu, Zhenyu Cheng, John M Archibald
PMCID: PMC9852648  PMID: 36633953

Summary

Many tools have been developed to measure the degree of similarity between gene duplicates within and between species. Here, we present HSDecipher, a bioinformatics pipeline to assist users in the analysis and visualization of highly similar duplicate genes (HSDs). We describe the steps for analyzing HSD statistics, expanding HSD gene sets, and visualizing the results of comparative genomic analyses. HSDecipher represents a useful tool for researchers exploring the evolution of duplicate genes in select eukaryotic species.

For complete details on the use and execution of this protocol, please refer to Zhang et al. (2021)1 and Zhang et al. (2022).2

Subject areas: Bioinformatics, Genetics, Genomics, Evolutionary biology


Highlights

  • The HSDecipher pipeline analyzes highly similar duplicate genes (HSDs) in eukaryotes

  • HSDecipher pipeline statistics can be acquired using custom scripts

  • A larger dataset of HSDs is acquired by using a series of combination thresholds

  • Groups of HSDs can be visualized and compared within or between species


Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.



Before you begin

Gene duplication has long been recognized as an important process in molecular evolution. Due to interest in identifying the possible role of duplicate genes in organismal adaptation,3 bioinformatics tools have been developed to aid in their detection.4 However, distinguishing orthologs (i.e., genes that differ due to speciation) from recently-evolved paralogs (genes that arose by gene duplication) can still be difficult. Hundreds of highly similar duplicate genes (HSDs) were recently identified in the genome of an Antarctic green alga Chlamydomonas sp. UWO241 (renamed Chlamydomonas priscuii).5,6,7 The HSDs were found using HSDFinder1 and characterized alongside those in other eukaryotic genomes in HSDatabase.2 In a previously published protocol the use and application of HSDFinder was presented.8 Here we describe the step-by-step use of custom scripts in HSDecipher, which allows researchers to carry out a downstream analysis of HSDs in eukaryotic species of interest.

Requirements for setting up the pipeline

HSDecipher contains a set of custom Python scripts to visualize and interpret the data generated by HSDFinder. Sample output files can be found at the following link: https://github.com/zx0223winner/HSDecipher. Here, five Python scripts are presented and applied sequentially in a pipeline. To run locally, pre-installed Python (preferably Python 3) and Linux (e.g., Ubuntu 20.04 LTS) environments are required. The other necessary scripts and data can be accessed via the links in the key resources table.

Note: To allow the comparative analysis data to be visualized in a heatmap, the minimum specification is a computer with 2 cores, 4 GB of RAM and 128 GB storage.

Key resources table

RESOURCE | SOURCE | IDENTIFIER∗

Deposited data

Chlamydomonas reinhardtii | GenBank: GCA_000002595.3 (ref. 9) | https://www.ncbi.nlm.nih.gov/genome/?term=txid3055[orgn]
Arabidopsis thaliana | GenBank: GCA_000001735.2 (refs. 10, 11) | https://www.ncbi.nlm.nih.gov/genome/?term=GCA_000001735.2
Homo sapiens | GenBank: GCF_000001405.39 (refs. 12, 13) | https://www.ncbi.nlm.nih.gov/genome/?term=txid63221[Organism:noexp]

Software and algorithms

Python 3 | The Python community | SCR_008394; https://www.python.org/downloads/
pandas v1.2.2 | Python Data Analysis Library | https://pandas.pydata.org
HSD_statistics.py, HSD_categories.py, HSD_add_on.py, HSD_batch_run.py, and HSD_heatmap.py | This study | https://github.com/zx0223winner/HSDecipher

∗Note: Identifier is used from the RRID portal (https://scicrunch.org/resources).

Materials and equipment

The software implementation was written in Python 3 and consists of the following custom scripts: HSD_statistics.py, HSD_categories.py, HSD_add_on.py, HSD_batch_run.py, and HSD_heatmap.py.

(1) HSD_statistics.py is a Python script that calculates the statistics of HSDs across a variety of HSDFinder thresholds. The output file is written as a table with the following headers: File name, Candidate HSDs, Non-redundant gene copies, Gene copies, True HSDs, Space, Incomplete HSDs, Capturing value, and Performance score. ‘Capturing value’ and ‘Performance score’ are two parameters used to evaluate the HSD results.1

(2) HSD_categories.py counts the number of HSDs containing two, three, or four or more gene copies, which is helpful when evaluating the distribution of duplicate groups within HSDs.

(3) Since the similarity of duplicate genes within and among genomes can vary significantly, the scripts HSD_add_on.py and HSD_batch_run.py allow users to add newly curated HSDs using a combination of thresholds to assemble a larger dataset of HSD candidates.

(4) HSD_heatmap.py visualizes the collected HSDs in a heatmap and compares HSDs sharing the same predicted biochemical pathway function. The KEGG database has been used to provide KO accession numbers for each gene model identifier.14

In this step-by-step protocol, we use the results of an HSD analysis of the genomes of Chlamydomonas reinhardtii, Arabidopsis thaliana and Homo sapiens to illustrate how to perform downstream analysis with these custom Python scripts.

Step-by-step method details

Downstream analysis of HSD statistics

Timing: ∼2 min (depending on file sizes and internet speed)

This step performs a preliminary evaluation of the HSD results obtained using the HSDFinder tool.8

Note: By using different thresholds for amino acid length and pairwise identities, users can filter the groups of HSDs from the all-against-all BLAST protein sequence similarity search (E-value cut-off ≤1e-10). We used a short form of the sequence similarity assessment metrics, such as 90%_10aa, which refers to amino acid pairwise identity ≥90%, and amino acid aligned length variance ≤10. When naming the files, users should adhere to this format (species_name.identity_length.txt; e.g., “Chlamydomonas_reinhardtii.90_10.txt”), thereby allowing recognition of the output by downstream scripts.
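The naming convention can be checked before the downstream scripts are run. The following is a minimal sketch and not part of HSDecipher; the helper name `parse_hsd_filename` and the regular expression are assumptions derived from the format described in the note above.

```python
import re
from typing import NamedTuple

# Pattern assumed from the convention species_name.identity_length.txt (or .tsv)
_HSD_NAME = re.compile(
    r"^(?P<species>[^.]+)\.(?P<identity>\d+)_(?P<length>\d+)\.(?:txt|tsv)$"
)


class HSDFileName(NamedTuple):
    species: str
    identity: int    # minimum pairwise identity, in percent
    length_var: int  # maximum aligned length variance, in amino acids


def parse_hsd_filename(name: str) -> HSDFileName:
    """Split an HSDFinder result file name into its parts (hypothetical helper)."""
    match = _HSD_NAME.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow species_name.identity_length.txt")
    return HSDFileName(match["species"], int(match["identity"]), int(match["length"]))


print(parse_hsd_filename("Chlamydomonas_reinhardtii.90_10.txt"))
```

File names that carry more than one threshold pair, such as a combined output, do not match this pattern and would need a separate rule.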

Note: In the HSDs folder, we have prepared an HSDFinder analysis for three model species: C. reinhardtii, A. thaliana, and H. sapiens. To save processing time, the comparatively small genome of C. reinhardtii is used as the study case.

# Clone the package and move to the HSDecipher/ directory

>git clone https://github.com/zx0223winner/HSDecipher

>cd HSDecipher

# Install the required Python library (Python 3 itself must already be installed; see the key resources table)

>pip3 install pandas

# Chlamydomonas_reinhardtii.90_10.txt

XP_001689821.1 XP_001689821.1; XP_001690281.2 241; 241 Pfam PF00011; PF00011 Hsp20/alpha crystallin family; Hsp20/alpha crystallin family 2.0E-10; 2.8E-10 IPR002068; IPR002068 Alpha crystallin/Hsp20 domain; Alpha crystallin/Hsp20 domain

  • 2.

    Users can run the Python scripts HSD_statistics.py and HSD_categories.py, which can be found in the GitHub main directory.

# HSD_statistics.py

>python3 HSD_statistics.py <path to HSD species folder> <format of HSD file. e.g., 'txt' or 'tsv'> <output file name. e.g., species_stat.tsv>

# In our case of HSD data in C. reinhardtii genome

>python3 HSD_statistics.py /HSDs_folder/Chlamydomonas_reinhardtii txt Chlamy_stat.tsv

# HSD_categories.py

>python3 HSD_categories.py <path to HSD species folder> <format of HSD file. e.g., 'txt' or 'tsv'> <output file name. e.g., species_groups.tsv>

# In our case of HSD data in C. reinhardtii genome

>python3 HSD_categories.py /HSDs_folder/Chlamydomonas_reinhardtii txt Chlamy_groups.tsv

CRITICAL: In Table 1, ‘Candidate HSDs’ is the number of highly similar gene duplicate candidates; ‘True HSDs’ are duplicate groups that satisfy the respective thresholds and whose gene copies contain the same domain(s); ‘Non-redundant gene copies’ is the number of unique gene copies in each group of HSDs; ‘Gene copies’ is the total number of gene copies in each group of HSDs; ‘Space’ is the number of gene copies encoding a putative function without any conserved domain(s) with hits to the Pfam database (e.g., hypothetical proteins); ‘Capturing value’ indicates the levels of predicted HSDs; ‘Performance score’ is a value that allows users to assess the performance of the HSD retrieval process. Troubleshooting 1.

CRITICAL: In Table 2, ‘2-group HSDs’ refers to the number of HSD groups containing exactly two gene copies. Troubleshooting 2.

Table 1.

Example of HSDecipher statistics file based on the output file from HSDFinder

File_name Candidate_HSDs# Non-redundant gene copies# Gene copies# True HSDs# Space# Incomplete HSDs# Capturing value Performance score
Chlamydomonas_reinhardtii.50_10 577 1625 1662 508 301 69 88.04 11.2
Chlamydomonas_reinhardtii.50_100 1089 3746 4954 915 487 174 84.02 8.67
Chlamydomonas_reinhardtii.50_30 822 2518 2676 710 380 112 86.37 10.19
Chlamydomonas_reinhardtii.50_50 942 3037 3462 814 422 128 86.41 10.34
Chlamydomonas_reinhardtii.50_70 1029 3389 4168 873 452 156 84.84 9.24
Chlamydomonas_reinhardtii.60_10 475 1380 1388 423 267 52 89.05 11.91
Chlamydomonas_reinhardtii.60_100 864 2910 3109 753 437 111 87.15 10.54
Chlamydomonas_reinhardtii.60_30 649 2034 2036 575 339 74 88.6 11.8
Chlamydomonas_reinhardtii.60_50 741 2409 2431 661 374 80 89.2 12.69
Chlamydomonas_reinhardtii.60_70 809 2657 2791 711 402 98 87.89 11.29
Chlamydomonas_reinhardtii.70_10 405 1204 1210 365 242 40 90.12 12.88
Chlamydomonas_reinhardtii.70_100 694 2355 2566 623 394 71 89.77 12.82
Chlamydomonas_reinhardtii.70_30 538 1704 1704 486 307 52 90.33 13.53
Chlamydomonas_reinhardtii.70_50 599 1981 1999 549 333 50 91.65 15.98
Chlamydomonas_reinhardtii.70_70 649 2161 2181 590 361 59 90.91 14.63
Chlamydomonas_reinhardtii.80_10 335 1030 1082 307 213 28 91.64 14.79
Chlamydomonas_reinhardtii.80_100 533 1910 1922 483 329 50 90.62 13.47
Chlamydomonas_reinhardtii.80_30 444 1450 1456 402 270 42 90.54 13.4
Chlamydomonas_reinhardtii.80_50 474 1648 1750 435 287 39 91.77 15.55
Chlamydomonas_reinhardtii.80_70 499 1772 1826 457 306 42 91.58 15.12
Chlamydomonas_reinhardtii.90_10 293 858 867 275 196 18 93.86 19.58
Chlamydomonas_reinhardtii.90_100 409 1486 2103 364 269 45 89 10.96
Chlamydomonas_reinhardtii.90_30 347 1128 1519 316 229 31 91.07 13.56
Chlamydomonas_reinhardtii.90_50 372 1281 1774 337 240 35 90.59 13.03
Chlamydomonas_reinhardtii.90_70 391 1380 1968 352 252 39 90.03 12.28
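In the rows of Table 1, Candidate HSDs equals True HSDs plus Incomplete HSDs, and the Capturing value matches the percentage of candidates that are true HSDs. This relationship is read off the table values rather than taken from a formula stated in this protocol, so the sketch below should be treated as an assumption. The function name is hypothetical, and the Performance score is not reproduced here.

```python
def capturing_value(true_hsds: int, candidate_hsds: int) -> float:
    """Percentage of candidate HSDs that are true HSDs (inferred from Table 1)."""
    return round(100 * true_hsds / candidate_hsds, 2)


# (file name, candidate HSDs, true HSDs, incomplete HSDs, reported capturing value)
rows = [
    ("Chlamydomonas_reinhardtii.50_10", 577, 508, 69, 88.04),
    ("Chlamydomonas_reinhardtii.90_10", 293, 275, 18, 93.86),
    ("Chlamydomonas_reinhardtii.90_100", 409, 364, 45, 89.0),
]

for name, candidates, true_hsds, incomplete, reported in rows:
    # Candidates split into true and incomplete groups in every row checked
    assert candidates == true_hsds + incomplete
    print(name, capturing_value(true_hsds, candidates), reported)
```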

Table 2.

Example of HSDecipher categories result file based on the output file from HSDFinder

File_name 2-group_HSDs# 3-group_HSDs# >=4-group_HSDs#
Chlamydomonas_reinhardtii.50_10 420 87 70
Chlamydomonas_reinhardtii.50_100 689 185 215
Chlamydomonas_reinhardtii.50_30 554 137 131
Chlamydomonas_reinhardtii.50_50 622 157 163
Chlamydomonas_reinhardtii.50_70 662 168 199
Chlamydomonas_reinhardtii.60_10 344 69 62
Chlamydomonas_reinhardtii.60_100 589 131 144
Chlamydomonas_reinhardtii.60_30 437 107 105
Chlamydomonas_reinhardtii.60_50 509 112 120
Chlamydomonas_reinhardtii.60_70 553 125 131
Chlamydomonas_reinhardtii.70_10 291 63 51
Chlamydomonas_reinhardtii.70_100 482 94 118
Chlamydomonas_reinhardtii.70_30 369 85 84
Chlamydomonas_reinhardtii.70_50 413 90 96
Chlamydomonas_reinhardtii.70_70 449 92 108
Chlamydomonas_reinhardtii.80_10 232 57 46
Chlamydomonas_reinhardtii.80_100 357 80 96
Chlamydomonas_reinhardtii.80_30 297 76 71
Chlamydomonas_reinhardtii.80_50 316 78 80
Chlamydomonas_reinhardtii.80_70 331 82 86
Chlamydomonas_reinhardtii.90_10 210 47 36
Chlamydomonas_reinhardtii.90_100 267 57 85
Chlamydomonas_reinhardtii.90_30 220 56 71
Chlamydomonas_reinhardtii.90_50 246 51 75
Chlamydomonas_reinhardtii.90_70 259 53 79
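The three categories in Table 2 partition the candidate HSDs by the number of gene copies per group, so each row sums to the Candidate HSDs value in Table 1 (e.g., 210 + 47 + 36 = 293 for 90_10). A minimal sketch of such a count follows. It is not the HSD_categories.py source, and it assumes that the HSDFinder file is tab-delimited with the gene copies of each group in the second column, separated by semicolons.

```python
from collections import Counter
from typing import Iterable


def count_group_sizes(lines: Iterable[str]) -> Counter:
    """Bin HSD groups into 2-, 3-, and >=4-copy categories (hypothetical helper)."""
    counts = Counter({"2-group": 0, "3-group": 0, ">=4-group": 0})
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        copies = [gene.strip() for gene in fields[1].split(";") if gene.strip()]
        if len(copies) == 2:
            counts["2-group"] += 1
        elif len(copies) == 3:
            counts["3-group"] += 1
        elif len(copies) >= 4:
            counts[">=4-group"] += 1
    return counts


# Only the first two columns are shown; the tab delimiter is an assumption
example = [
    "XP_001689821.1\tXP_001689821.1; XP_001690281.2",
    "XP_001700318.1\tXP_001700318.1; XP_001700659.1; XP_001701797.1",
]
print(count_group_sizes(example))
```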

Using a series of combination thresholds to expand an HSD gene dataset

Timing: ∼2 min (depending on file sizes, computing power, and internet speed) (for step 3)

Users will require the Python scripts HSD_add_on.py and HSD_batch_run.py to run the following analysis.

  • 3.

    HSD_add_on.py can add newly acquired HSD data to original HSD output, thereby enlarging the HSD candidate dataset.

# HSD_add_on.py

>python3 HSD_add_on.py -i <inputfile> -a <adding_file> -o <output file>

# In our case of HSD data in C. reinhardtii genome

>python3 HSD_add_on.py -i /HSDs_folder/Chlamydomonas_reinhardtii/Chlamydomonas_reinhardtii.90_10.txt -a Chlamydomonas_reinhardtii.90_30.txt -o Chlamydomonas_reinhardtii.90_10_90_30.txt

Note: For example, HSDs identified at a threshold of 90%_30aa were added to those identified at a threshold of 90%_10aa (denoted as “90%_30aa+90%_10aa”).

CRITICAL: Redundant candidate HSDs are removed at each combination step: if the more relaxed threshold (e.g., 90%_30aa) retrieves genes identical to those already retrieved at the stricter cut-off (e.g., 90%_10aa), the redundant candidates are discarded.

Troubleshooting 3.
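The add-on step can be pictured as merging two sets of HSD groups while skipping redundant ones. The sketch below is not the HSD_add_on.py source. It assumes that a group from the relaxed threshold counts as redundant when any of its gene copies already occurs in a retained group, and the actual script may apply a different rule. The function name and the gene identifiers are placeholders.

```python
from typing import Dict, List


def add_on(strict: Dict[str, List[str]], relaxed: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Add groups from a relaxed threshold to those from a stricter one.

    Assumption: a relaxed group is redundant if any of its gene copies
    already occurs in a group that has been retained.
    """
    merged = dict(strict)
    seen = {gene for genes in strict.values() for gene in genes}
    for hsd_id, genes in relaxed.items():
        if seen.intersection(genes):
            continue  # already represented at the stricter threshold
        merged[hsd_id] = genes
        seen.update(genes)
    return merged


# Placeholder identifiers, for illustration only
strict_groups = {"geneA": ["geneA", "geneB"]}
relaxed_groups = {"geneA": ["geneA", "geneB"], "geneD": ["geneD", "geneE"]}
print(add_on(strict_groups, relaxed_groups))
```

Processing the stricter set first means that its groups always take precedence, which mirrors the order described in the note above.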

# HSD_batch_run.py

>python3 HSD_batch_run.py -i <inputfolder>

# In our case of HSD data

>python3 HSD_batch_run.py -i /HSDs_folder/

CRITICAL: HSD_batch_run.py can execute a series of combination threshold analyses at once. Users should back up the original HSDs folder before running the HSD_batch_run.py script. To minimize redundancy and to acquire a larger dataset of HSD candidates, we processed each selected species with the combination of thresholds given in Troubleshooting 4.

# Chlamydomonas_reinhardtii.90_10.txt

XP_001689450.1 XP_001689450.1; XP_001700901.1 280; 276 Pfam PF01459; PF01459 Eukaryotic porin; Eukaryotic porin 1.3E-34; 3.5E-39 IPR027246; IPR027246 Eukaryotic porin/Tom40; Eukaryotic porin/Tom40

XP_001689455.1 XP_001689455.1; XP_001698498.1 194; 161 Pfam PF08534; PF08534 Redoxin; Redoxin 4.2E-35; 1.1E-36 IPR013740; IPR013740 Redoxin; Redoxin

Note: The resulting output file of HSDs based on a combination of thresholds will appear in HSDs_folder, e.g., “Chlamydomonas_reinhardtii.90_10.txt”, “Arabidopsis_thaliana.90_10.txt” and “Homo_sapiens.90_10.txt”.

Downstream comparative genomic analysis of HSDs in eukaryotic genomes

Timing: ∼4 min (depending on the size of the data, computing power, and internet speed) (for step 4)

In this step, users can apply the HSD_heatmap.py script to the previously generated HSD results to perform a comparative analysis.

  • 4.

    Users can compare different thresholds of HSDs in one genome or HSDs retrieved from different genomes in a heatmap (Figures 1 and 2).

Note: The data can be derived from multiple genomes within a species or from the genomes of different species. The generated tabular file (Table 3) collects gene duplicates predicted to be involved in the same biological process or biochemical pathway, which can be used for natural selection analysis.

# HSD_heatmap.py

# For intra-species

>python3 HSD_heatmap.py -f <HSD file folder> -k <KO file folder> -r <width of output heatmap, e.g., 30 pixels> -c <height of output heatmap, e.g., 20 pixels>

>python3 HSD_heatmap.py -f /HSDs_folder/Chlamydomonas_reinhardtii/ -k /ko/ -r 30 -c 20

Note: The generated examples can be found in the heatmap folder under the HSDecipher main directory, such as the high resolution heatmap file “Chlamydomonas_reinhardtii_output_heatmap.eps” and the tabular file “Chlamydomonas_reinhardtii_output_heatmap.tsv”.

# HSD_heatmap.py, for inter-species analysis

>python3 HSD_heatmap.py -f <HSD file folder> -k <KO file folder> -r <width of output heatmap, e.g., 30 pixels> -c <height of output heatmap, e.g., 20 pixels>

>python3 HSD_heatmap.py -f /HSDs_folder/ -k /ko/ -r 30 -c 20

Note: The inter-species analysis example can be found in the heatmap folder under the HSDecipher main directory with the name “test_output_heatmap.eps” and “test_output_heatmap.tsv”.

CRITICAL: It is important to name the KEGG pathway KO file and the HSD result file correctly so that they can be recognized by the HSDecipher scripts. The KO information file for each species should be named “species_name.ko.txt” (e.g., Chlamydomonas_reinhardtii.ko.txt); the HSD result file should be named “species_name.identity_length.txt” (e.g., Chlamydomonas_reinhardtii.90_10.txt).
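A quick check that every species with HSD result files also has a KO file can catch naming problems before the heatmap is drawn. The sketch below is not part of HSDecipher; the helper name `missing_ko_files` is hypothetical, and it relies only on the file-name patterns given above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def missing_ko_files(file_names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each species lacking a species_name.ko.txt file to its HSD result files."""
    hsd_files = defaultdict(list)
    ko_species = set()
    for name in file_names:
        # The species name is everything before the first dot
        species, _, rest = name.partition(".")
        if rest == "ko.txt":
            ko_species.add(species)
        elif rest.endswith(".txt"):
            hsd_files[species].append(name)
    return {sp: files for sp, files in hsd_files.items() if sp not in ko_species}


names = [
    "Chlamydomonas_reinhardtii.90_10.txt",
    "Chlamydomonas_reinhardtii.ko.txt",
    "Homo_sapiens.90_10.txt",
]
print(missing_ko_files(names))
```

In practice the list of names could be gathered from the HSD and KO folders, for example with `os.listdir`.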

Figure 1.

Heatmap illustrating results obtained using different thresholds for the detection of highly similar duplicates (HSDs) in the genome of C. reinhardtii using the HSDecipher pipeline

The matrix in the heatmap shows the number of HSDs retrieved using different thresholds (e.g., 90_10, which refers to amino acid pairwise identity ≥90% and amino acid aligned length variance ≤10), classified by KEGG functional category.

Figure 2.

Heatmap showing results of running the HSDecipher pipeline on the predicted proteomes of C. reinhardtii, A. thaliana and H. sapiens

The matrix in the heatmap shows the number of HSDs in each of the three eukaryotic species, classified by KEGG functional category.

Table 3.

Example of HSDecipher heatmap tabular file based on the output file from HSDFinder

ko_id | category1 | category2 | Function | Chlamydomonas_reinhardtii.90_10_hsds_id | Chlamydomonas_reinhardtii.90_10_hsds_genes | Chlamydomonas_reinhardtii.90_10_hsds_num | Arabidopsis_thaliana.90_10_hsds_id | Arabidopsis_thaliana.90_10_hsds_genes | Arabidopsis_thaliana.90_10_hsds_num | Homo_sapiens.90_10_hsds_id | Homo_sapiens.90_10_hsds_genes | Homo_sapiens.90_10_hsds_num
K00850 | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] | pfkA, PFK; 6-phosphofructokinase 1 | XP_001694148.1 | XP_001694148.1; XP_001696305.1 | 1 | NP_194651.1 | NP_194651.1; NP_567742.1; NP_568842.1; NP_195010.1; NP_200966.2; NP_199592.1; NP_850025.1 | 1 | NP_001341664.1 | NP_001341664.1; XP_005252522.1; XP_016883857.1 | 1
K01623 | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] | ALDO; fructose-bisphosphate aldolase, class I | XP_001700318.1 | XP_001700318.1; XP_001700659.1; XP_001701797.1 | 1 | NP_178224.1 | NP_178224.1; NP_568049.1; NP_565508.1; NP_001328708.1; NP_181187.1; NP_190861.1; NP_568127.1; NP_001329763.1 | 1 | NP_000026.2 | NP_000026.2; NP_005156.1; NP_001230106.1 | 1
K01006 | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] | ppdK; pyruvate, orthophosphate dikinase | XP_042914963.1 | XP_042914963.1; XP_042919927.1 | 1 | NA | NA | NA | NA | NA | NA
K00627 | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] | DLAT, aceF, pdhC; pyruvate dehydrogenase E2 component (dihydrolipoamide acetyltransferase) | XP_001696403.1 | XP_001696403.1; XP_042920693.1 | 1 | NP_564654.1 | NP_564654.1; NP_566470.1; NP_189215.1; NP_174703.1 | 1 | NA | NA | NA
K00121 | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] | frmA, ADH5, adhC; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase | XP_042919155.1 | XP_042919155.1; XP_042919157.1 | 1 | NP_173660.1 | NP_173660.1; NP_001031079.1; NP_567645.1; NP_199040.1; NP_568453.1; NP_177837.1; NP_176652.3; NP_564409.1; NP_001190468.1 | 1 | NP_000658.1 | NP_000658.1; NP_000659.2; NP_000660.1; NP_001159976.1; NP_001095940.1; NP_000662.3; NP_001293100.1 | 1
K12957 | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] | ahr; alcohol/geraniol dehydrogenase (NADP+) | XP_001692728.2 | XP_001692728.2; XP_042921179.1 | 1 | NA | NA | NA | NA | NA | NA
K01895 | 09101 Carbohydrate metabolism | 00010 Glycolysis / Gluconeogenesis [PATH:ko00010] | ACSS1_2, acs; acetyl-CoA synthetase | XP_001700210.2 | XP_001700210.2; XP_001700230.1; XP_001702039.1 | 1 | NA | NA | NA | NA | NA | NA
K00026 | 09101 Carbohydrate metabolism | 00020 Citrate cycle (TCA cycle) [PATH:ko00020] | MDH2; malate dehydrogenase | XP_001693118.1 | XP_001693118.1; XP_001703167.2; XP_001702586.1 | 1 | NP_179863.1 | NP_179863.1; NP_001119199.1; NP_188120.1; NP_564625.1; NP_190336.1 | 1 | NA | NA | NA
K00012 | 09101 Carbohydrate metabolism | 00040 Pentose and glucuronate interconversions [PATH:ko00040] | UGDH, ugd; UDPglucose 6-dehydrogenase | XP_001698004.1 | XP_001698004.1; XP_001703656.1 | 1 | NP_173979.1 | NP_173979.1; NP_189582.1; NP_197053.1; NP_198748.1 | 1 | NA | NA | NA

Troubleshooting 5.

Expected outcomes

HSDecipher is a set of custom scripts for users who are interested in performing downstream analysis of highly similar gene duplicates obtained using HSDFinder. In the first two steps, analysis of HSD statistics and categories can help users evaluate the distribution and quality of HSD data. The third step allows users to expand their dataset of HSDs by considering multiple sequence similarity assessment metrics; in other words, HSD datasets can be enlarged by adding data retrieved at more relaxed thresholds and then removing the duplicates retrieved at more than one threshold. For example, HSDs identified at a threshold of 80%_50aa can be added to those identified at a threshold of 80%_30aa (denoted as “80%_50aa+80%_30aa”); if the more relaxed threshold (i.e., 80%_50aa) retrieves genes identical to those acquired using the stricter cut-off (i.e., 80%_30aa), the combined HSD candidates are filtered to remove the redundancy. In the last step, users can carry out an intra- or inter-genomic comparative analysis of HSD data using a heatmap, which shows the functional distribution of HSDs or the levels of HSD sequence similarity shared between different species. Users can easily visualize and compare significantly enriched HSDs. Users are also provided with a tabular file to compare HSDs with the same KEGG pathway function, thereby allowing them to conveniently choose HSDs for downstream comparative genomic analysis (e.g., identification of signatures of natural selection).
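Choosing HSDs for downstream analysis from the tabular file can be scripted. The sketch below keeps only the KO entries for which every species has at least one HSD. It assumes a tab-delimited file whose count columns end in `_hsds_num` and use `NA` for missing values, as in Table 3; the function name is hypothetical, and only a reduced set of columns from Table 3 is used as input.

```python
import csv
import io
from typing import Dict, List


def shared_ko_rows(tsv_text: str) -> List[Dict[str, str]]:
    """Return rows in which every *_hsds_num column holds a value other than 'NA'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    num_columns = [c for c in (reader.fieldnames or []) if c.endswith("_hsds_num")]
    return [
        row
        for row in reader
        if all((row.get(c) or "NA") != "NA" for c in num_columns)
    ]


# Reduced excerpt of Table 3: KO identifier plus the three count columns
table = (
    "ko_id\tChlamydomonas_reinhardtii.90_10_hsds_num"
    "\tArabidopsis_thaliana.90_10_hsds_num\tHomo_sapiens.90_10_hsds_num\n"
    "K00850\t1\t1\t1\n"
    "K01006\t1\tNA\tNA\n"
    "K00627\t1\t1\tNA\n"
)
print([row["ko_id"] for row in shared_ko_rows(table)])  # prints ['K00850']
```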

Limitations

There is a steep learning curve for researchers with limited knowledge of bioinformatics, especially those who are not familiar with basic command-line usage and the shell in a Linux/Unix environment. At present, a “one-click solution” does not exist because we wish to retain flexibility in how the scripts can be used for different purposes. That said, HSDecipher is easier to use than some of the other options currently available, such as PhylomeDB15 and OrthoFinder.16,17 At present there are very few tools that can execute downstream comparative genomics analysis of highly similar duplicate gene data. HSDecipher thus fills a need for the bioinformatics and genomics community.

Since there is no golden rule for distinguishing partial duplicates from more complete ones, a combination of thresholds is used to acquire a larger dataset of HSD candidates. A limitation of this strategy is that some large groups of HSD candidates in the database likely contain members that have diverged in function from one another. Users should thus proceed with caution when working with these types of datasets.

Troubleshooting

Problem 1

Why are non-redundant gene copies (i.e., gene duplicates) listed as a column in Table 1? (step 2).

Potential solution

Since the HSDs are filtered based on the all-against-all BLAST protein similarity search and the BLAST algorithms can limit the maximum target hits by default, it is possible that not all HSD gene copies group together based on a simple transitive link between the remaining genes, especially for genomes with many gene duplicates. Users can manually increase the setting of maximum target hits in their BLAST searches to solve this problem.

Problem 2

Do the HSD datasets include protein sequences from alternative splicing? (step 2).

Potential solution

To count the genuine gene copies in each group of HSDs, we suggest that users remove isoforms derived from alternative splicing and keep the one with the longest transcript as the primary protein sequence. This is because conserved sequences derived from alternative splicing can have similar functional domains, resulting in the misprediction of gene duplicates. We have developed a custom script called isoform2one (https://github.com/zx0223winner/isoform2one) to carry out this type of filtering before running the BLAST all-against-all search.
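The idea of keeping one isoform per gene can be illustrated as follows. This is a minimal sketch and not the isoform2one script; it assumes that the user already has (gene, protein, length) records, and the function name and identifiers are placeholders.

```python
from typing import Dict, Iterable, Tuple


def longest_isoform(records: Iterable[Tuple[str, str, int]]) -> Dict[str, str]:
    """Pick, for each gene, the protein ID of its longest isoform.

    records: (gene_id, protein_id, length) tuples; on ties the first seen is kept.
    """
    best: Dict[str, Tuple[str, int]] = {}
    for gene_id, protein_id, length in records:
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (protein_id, length)
    return {gene: protein for gene, (protein, _) in best.items()}


# Placeholder records, for illustration only
records = [
    ("gene1", "gene1.t1", 350),
    ("gene1", "gene1.t2", 420),
    ("gene2", "gene2.t1", 200),
]
print(longest_isoform(records))
```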

Problem 3

Will the original HSD file be modified after running the HSD_batch_run.py script? (step 3).

Potential solution

Yes. Because the HSD_batch_run.py script automatically runs the HSD_add_on.py script multiple times based on a series of combination thresholds, the original files in the HSDs folder will be modified. Users should back up a copy of the HSD files before running the HSD_batch_run.py script.

Problem 4

What criteria were used to collect the HSDs in HSD_batch_run.py script? (step 3).

Potential solution

Although there is no easy rule for distinguishing partial duplicates from complete duplicates, candidate HSDs generally have less than 50% amino acid length difference and similar predicted functions of conserved domains. To balance HSD detection sensitivity and accuracy, we suggest using a series of thresholds with pairwise identities from 90% down to 50% (in steps of 10%), each combined with aligned length variances of 10, 30, 50, 70, and 100 aa (i.e., from 90%_10aa to 90%_100aa through to 50%_10aa to 50%_100aa). The combined dataset is assembled by nesting these thresholds as E + (D + (C + (B + A))), where:

A = 90%_100aa+(90%_70aa+(90%_50aa+(90%_30aa+90%_10aa))).

B = 80%_100aa+(80%_70aa+(80%_50aa+(80%_30aa+80%_10aa))).

C = 70%_100aa+(70%_70aa+(70%_50aa+(70%_30aa+70%_10aa))).

D = 60%_100aa+(60%_70aa+(60%_50aa+(60%_30aa+60%_10aa))).

E = 50%_100aa+(50%_70aa+(50%_50aa+(50%_30aa+50%_10aa))).
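The nesting above implies an order in which the 25 threshold files are combined. The sketch below lists that order, assuming the brackets are evaluated from the innermost term outward (90%_10aa first, 50%_100aa last). The names are hypothetical and the code is not taken from HSD_batch_run.py.

```python
from typing import List

IDENTITIES = [90, 80, 70, 60, 50]         # percent identity, strictest first (A to E)
LENGTH_VARIANCES = [10, 30, 50, 70, 100]  # aligned length variance in aa, strictest first


def combination_order() -> List[str]:
    """Return threshold labels in the order they are merged, innermost bracket first."""
    return [
        f"{identity}_{length}"
        for identity in IDENTITIES
        for length in LENGTH_VARIANCES
    ]


order = combination_order()
print(order[0], order[-1], len(order))  # prints 90_10 50_100 25
```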

Problem 5

How can I acquire the KEGG pathway KO information for each genome? (step 4).

Potential solution

The detailed KO accession with each gene model identifier can be retrieved from the KEGG database. Our previous protocol provides a step-by-step guide.8

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to John M. Archibald (john.archibald@dal.ca) and Technical Contact Xi Zhang (xi.zhang@dal.ca).

Materials availability

This study did not generate new unique reagents.

Acknowledgments

This work was supported by a Gordon and Betty Moore Foundation grant (GBMF5782) and a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2019- 05058) awarded to J.M.A. This work was also supported by a Discovery Grant (RGPIN 04912) from the Natural Sciences and Engineering Research Council of Canada to Z.C. We thank David R. Smith for useful discussion of the manuscript.

Author contributions

The study was conceptualized by X.Z. The data and manuscript were analyzed and written by X.Z. Y.N.H. assisted with bioinformatics analysis and debugged the HSDecipher pipeline. Z.C. and J.M.A. edited the manuscript. All authors read, revised, and approved the final manuscript for peer review.

Declaration of interests

The authors declare no competing interests.

Contributor Information

Xi Zhang, Email: xi.zhang@dal.ca.

John M. Archibald, Email: john.archibald@dal.ca.

Data and code availability

The HSDecipher source code has been deposited at https://github.com/zx0223winner/HSDecipher. The archived version at Zenodo is https://doi.org/10.5281/zenodo.7437886.

References

  • 1. Zhang X., Hu Y., Smith D.R. HSDFinder: a BLAST-based strategy for identifying highly similar duplicated genes in eukaryotic genomes. Front. Bioinform. 2021;1:803176. doi: 10.3389/fbinf.2021.803176.
  • 2. Zhang X., Hu Y., Smith D.R. HSDatabase - a database of highly similar duplicate genes from plants, animals, and algae. Database. 2022;2022:baac086. doi: 10.1093/database/baac086.
  • 3. Kondrashov F.A. Gene duplication as a mechanism of genomic adaptation to a changing environment. Proc. Biol. Sci. 2012;279:5048–5057. doi: 10.1098/rspb.2012.1108.
  • 4. Zhang X., Smith D.R. An overview of online resources for intra-species detection of gene duplications. Front. Genet. 2022;13:1012788. doi: 10.3389/fgene.2022.1012788.
  • 5. Zhang X., Cvetkovska M., Morgan-Kiss R., Hüner N.P., Smith D.R. Draft genome sequence of the Antarctic green alga Chlamydomonas sp. UWO241. iScience. 2021;24:102084. doi: 10.1016/j.isci.2021.102084.
  • 6. Cvetkovska M., Szyszka-Mroz B., Possmayer M., Pittock P., Lajoie G., Smith D.R., Hüner N.P.A. Characterization of photosynthetic ferredoxin from the Antarctic alga Chlamydomonas sp. UWO241 reveals novel features of cold adaptation. New Phytol. 2018;219:588–604. doi: 10.1111/nph.15194.
  • 7. Stahl-Rommel S., Kalra I., D’Silva S., Hahn M.M., Popson D., Cvetkovska M., Morgan-Kiss R.M. Cyclic electron flow (CEF) and ascorbate pathway activity provide constitutive photoprotection for the photopsychrophile, Chlamydomonas sp. UWO 241 (renamed Chlamydomonas priscuii). Photosynth. Res. 2022;151:235–250. doi: 10.1007/s11120-021-00877-5.
  • 8. Zhang X., Hu Y., Smith D.R. Protocol for HSDFinder: identifying, annotating, categorizing, and visualizing duplicated genes in eukaryotic genomes. STAR Protoc. 2021;2:100619. doi: 10.1016/j.xpro.2021.100619.
  • 9. Merchant S.S., Prochnik S.E., Vallon O., Harris E.H., Karpowicz S.J., Witman G.B., Terry A., Salamov A., Fritz-Laylin L.K., Maréchal-Drouard L., et al. The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science. 2007;318:245–250. doi: 10.1126/science.1143609.
  • 10. Lamesch P., Berardini T.Z., Li D., Swarbreck D., Wilks C., Sasidharan R., Muller R., Dreher K., Alexander D.L., Garcia-Hernandez M., et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 2012;40:D1202–D1210. doi: 10.1093/nar/gkr1090.
  • 11. Rhee S.Y., Beavis W., Berardini T.Z., Chen G., Dixon D., Doyle A., Garcia-Hernandez M., Huala E., Lander G., Montoya M., et al. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 2003;31:224–228. doi: 10.1093/nar/gkg076.
  • 12. Lander E.S., Linton L.M., Birren B., Nusbaum C., Zody M.C., Baldwin J., Devon K., Dewar K., Doyle M., FitzHugh W., et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
  • 13. Venter J.C., Adams M.D., Myers E.W., Li P.W., Mural R.J., Sutton G.G., Smith H.O., Yandell M., Evans C.A., Holt R.A., et al. The sequence of the human genome. Science. 2001;291:1304–1351. doi: 10.1126/science.1058040.
  • 14. Kanehisa M., Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
  • 15. Huerta-Cepas J., Capella-Gutiérrez S., Pryszcz L.P., Marcet-Houben M., Gabaldón T. PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome. Nucleic Acids Res. 2014;42:D897–D902. doi: 10.1093/nar/gkt1177.
  • 16. Emms D.M., Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16:157. doi: 10.1186/s13059-015-0721-2.
  • 17. Emms D.M., Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20:238. doi: 10.1186/s13059-019-1832-y.


Articles from STAR Protocols are provided here courtesy of Elsevier
