3 Biotech. 2024 Oct 23;14(11):276. doi: 10.1007/s13205-024-04121-4

Genome sequencing of Caridina pseudogracilirostris and its comparative analysis with malacostracan crustaceans

NandhaGopal SoundharaPandiyan 1, Carlton Ranjith Wilson Alphonse 1, Subramoniam Thanumalaya 2, Samuel Gnana Prakash Vincent 3, Rajaretinam Rajesh Kannan 4
PMCID: PMC11499489  PMID: 39464522

Abstract

Caridina pseudogracilirostris is a shrimp commonly found in the brackish waters of the southwestern coastal regions of India. This study provides a genomic investigation of C. pseudogracilirostris, offering insights into its genetic makeup, evolutionary dynamics, and functional annotations. Genomic DNA was isolated from tissue samples and sequenced using next-generation sequencing (NGS), and the data were deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database (Accession No: PRJNA847710). De novo sequencing indicated a genome size of 1.31 Gbp with a low heterozygosity of about 0.81%. Repeat masking and annotation revealed that repeated elements constitute 24.60% of the genome, with simple sequence repeats (SSRs) accounting for 7.26%. Gene prediction identified 14,101 genes, with functional annotations indicating involvement in critical biological processes such as development, cellular function, immunological responses, and reproduction. Furthermore, phylogenetic analysis revealed genomic links among Malacostraca species, indicating gene duplication as a strategy for genetic diversity and adaptation. C. pseudogracilirostris has 1,856 duplicated genes, reflecting a distinct genomic architecture and evolutionary strategy within the Malacostraca branch. These findings enhance our understanding of the genetic characteristics and evolutionary relationships of C. pseudogracilirostris and provide insights into the overall evolutionary dynamics of the Malacostraca group.

Supplementary Information

The online version contains supplementary material available at 10.1007/s13205-024-04121-4.

Keywords: Caridina pseudogracilirostris, Comparative genomics, Crustacean genome, Orthofinder, Atyidae, NGS

Introduction

Arthropods (chelicerates, myriapods, crustaceans, and hexapods) are the most diverse and species-rich animals on Earth, adapted to all major habitats in all major ecosystems. They are recognizable by their articulated limbs and chitin-based cuticle, often mineralized with calcium compounds. Decapod crustaceans, part of the class Malacostraca, include several well-known groups such as crabs, lobsters, crayfish, shrimp, and prawns (Mente 2008). The first group of decapods to diverge was the Dendrobranchiata (prawns), which appeared in the Late Ordovician, approximately 455 million years ago (Wolfe et al. 2019). High species diversity first emerged during the Jurassic and Cretaceous periods, when modern coral reefs first appeared and spread worldwide. Marine decapods rely on coral reefs as a habitat (Lloyd et al. 2008). The order Decapoda is divided into two suborders, Dendrobranchiata and Pleocyemata, based on gill and leg structures and larval development. Various species of prawns in the suborder Dendrobranchiata are called “shrimp,” including the “white shrimp,” Litopenaeus setiferus, and the tiger shrimp, Penaeus monodon.

Decapods exhibit tremendous diversity due to various genetic alterations and innovations that have been favored throughout their evolutionary history. However, linking the diversity of phenotypes to fundamental genetic alterations remains challenging (Thomas et al. 2020). These innovations could arise from various genomic mechanisms, but a comprehensive phylogenetic analysis of the underlying molecular changes has not yet been conducted. It is essential to link the whole genome data to a reliable phylogenetic framework to track these transitions at the genomic level. Currently, crayfish serve as the primary foundation for most scientific knowledge regarding the ecology and general biology of freshwater decapods, likely due to their frequent occurrence (Thorp and Rogers 2011) and their status as a favorite food in some regions of the world. Interestingly, despite certain species being able to live for more than 100 years, decapods, bivalves, and echinoderms rarely acquire neoplastic and age-related disorders. Their successful adaptations must have resulted from genome rearrangements, indicating the presence of undiscovered advantageous traits that could lead to novel perspectives on various topics. Discovering the underlying molecular and regulatory systems may inspire fresh concepts for creating commercially advantageous and highly resilient organisms.

The primary objectives are the discovery, curation, and comparison of genes that significantly advance our understanding of this species, particularly in comparison to other species. We selected a widely available freshwater caridean shrimp, C. pseudogracilirostris, for our investigation and developed practical methods to maintain it in the laboratory for extended periods. This species, a member of the family Atyidae in the infraorder Caridea, has an extensive natural range stretching from Kanyakumari to Cochin (Thomas et al. 1973; Soundharapandiyan et al. 2022). C. pseudogracilirostris, an algae-eating species, inhabits mangroves and marshes. Its small size, ease of capture, transparent body, strong tolerance of fluctuating environmental conditions, year-round reproductive capability, and rapid embryonic development make it a viable experimental model for studies on crustacean adaptation, development, and other areas. Samples for this investigation were gathered from different sampling sites. This species occurs in environments where salinity changes periodically and shows remarkable adaptability to osmotic conditions, as evidenced by continuous breeding activity in the Rajakkamangalam estuary. Given this adaptability to estuarine conditions, a molecular study of the physiological mechanisms controlling growth, metabolism, and reproduction is desirable. Therefore, this study aims to identify, for the first time, the genomic regions associated with different physiological processes in the draft genome sequence of this species. It also seeks to identify genes that have already been localized in the whole genomes of highly evolved Malacostraca. Few genomic analyses have been conducted on Caridina species (Yuan et al. 2017). Furthermore, this study aims to characterize the whole genome sequence of C. pseudogracilirostris and compare it to the genomes of related species.

In recent decades, sequencing technologies, specifically next-generation sequencing (NGS), have been widely used in scientific research and clinical applications. NGS offers higher sequencing throughput at lower cost. With well-developed and optimized experimental and data analysis methods, the results can be highly accurate. Quality control and data preprocessing are therefore crucial for obtaining high-quality, high-confidence analytical data with fewer false positives and false negatives (He et al. 2020). Many software programs are currently available for data quality preprocessing. Trimmomatic (Bolger et al. 2014) includes various processing steps for read trimming and filtering, with significant algorithmic innovations in the identification of adapter sequences and quality filtering. Selecting an optimal k-mer size is essential for achieving a high-quality de novo assembly. Chikhi and Medvedev (2014) devised a method for selecting the best k-mer size for de novo genome assembly. Other proposed methods include estimating, from an FM index over the reads, the number of features of interest such as paths with variations or repeats for different k-mer sizes (Simpson 2014), and using an optimal k-mer range for de novo read error correction (Ilie et al. 2011; Schulz et al. 2014). Eukaryotic genomes, in particular, contain a varying but significant proportion of repeated elements. These repeats, which can originate from transposons or viral insertions, can directly affect gene expression (Muñoz-López and García-Pérez 2010). They also pose challenges in genomic data analysis, which can be mitigated by various software tools. After these steps, sequences are subjected to gene prediction and annotation.
Phylogenetic orthology inference for comparative genomics provides a framework for understanding the evolution and diversity of life on Earth and enables the extrapolation of biological knowledge between organisms. Most orthology inference tools attempt to infer phylogenetic relationships between gene sequences through experimental analyses of pairwise sequence similarity scores obtained from an all-vs-all BLAST (Camacho et al. 2009) search, or from enhanced alternatives to BLAST such as DIAMOND (Buchfink et al. 2015) or MMseqs2 (Steinegger and Söding 2017). Widely used methods include InParanoid (Östlund et al. 2010), OrthoMCL (Li et al. 2003), OMA (Altenhoff et al. 2011), and OrthoFinder (Emms and Kelly 2015). Each application takes a different approach to cross-examining sequence similarity scores and produces different outputs, including orthogroups, paralogs, and orthologs, and in some cases all three. This work aims to understand the genomic and phylogenetic relationship of C. pseudogracilirostris with other species of Malacostraca.

Materials and method

Shrimp collection and rearing

Live shrimp were collected from various locations within the sampling site at Rajakkamangalam estuary in Kanyakumari, Tamil Nadu, India (8°07′17.9″N, 77°22′19.3″E). The study samples were collected using the convenience sampling method. Fresh shrimp were collected, euthanized using tricaine (MS-222), and fixed in 4% paraformaldehyde. Shrimp identification was based on the morphological features described earlier (Thomas et al. 1973). The Minimum Information about any (x) Sequence (MIxS) data are presented in Table 1.

Table 1.

MIxS mandatory information for samples

Item Definition
Investigation type Eukaryote
Project name NGS whole genome sequencing
Organism Caridina pseudogracilirostris
Classification Animalia (kingdom); Arthropoda (phylum); Crustacea (subphylum); Malacostraca (class); Eucarida (superorder); Decapoda (order); Pleocyemata (suborder); Caridea (infraorder); Atyidae (family); Caridina (genus); Caridina pseudogracilirostris (species)
Submitted_to SRA database; BioProject: PRJNA847710; BioSample: SAMN28951408: NGS_1004 (TaxID: 1042303)
Tissue type Whole shrimp
Geographic location Rajakkamangalam estuary, Kanyakumari District, India
Environment Brackish water
Collection date 2019-07
Sequencing technology Illumina NovaSeq6000
Assembly ABySS de novo assembler
Annotation source BLASTx, KEGG

Further, a purposive sampling method was employed to collect Caridina while avoiding other types of shrimp. Samples were collected using a long aquarium net (30.5 × 30.5 cm) and immediately packed in large polyethylene bags containing fresh water. Packed samples were brought to the laboratory within 20 h and transferred to laboratory tanks at the aquaculture facility (Aquaneering USA). The handling of and experimentation on animals followed the ARRIVE guidelines and the UK Animals (Scientific Procedures) Act, 1986, and associated guidelines.

DNA extraction

A sample for genome sequencing was collected from an adult C. pseudogracilirostris maintained at the Aquatic Physiology facility of Sathyabama Institute of Science and Technology, Chennai. The shrimp was acclimated in tap water aquaria at 26 ± 2 °C for 1 month. The live shrimp was euthanized using MS-222 and immediately dissected under ice-cold conditions, and the tissue was transferred immediately into 100% ethanol. DNA isolation was carried out using the Qiagen DNeasy Blood & Tissue DNA isolation kit as per the manufacturer’s instructions. DNA quantity was measured using a Nanodrop (Thermo, USA) and confirmed by 1% agarose gel electrophoresis.

Library preparation and genome sequencing

Genomic DNA used for the TruSeq Nano DNA protocol was quantified using the TapeStation (Agilent) and diluted. The genomic DNA was sheared using an S2 Ultrasonicator following the settings provided in the TruSeq DNA protocol. Library preparation was performed according to the manufacturer’s instructions. Adaptor enrichment was performed through multiple cycles of PCR as per the manufacturer’s guidelines.

Technical validation

The raw reads were quality filtered using Trimmomatic version 0.39 (http://www.usadellab.org/cms/?page=trimmomatic), a flexible read trimming tool (Bolger et al. 2014). This tool employs a pipeline-based architecture, allowing individual “steps” (adapter removal, quality filtering, and so on) to be applied to each read or read pair in the order specified by the user. Each step can operate on the reads in isolation or on the combined pair, as appropriate. The tool tracks read pairing and stores “paired” and “single” reads separately. The raw sequences were also checked for adapter contamination and poor-quality bases using FASTQC v0.11.8 (Simon Andrews 2010).
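The step-wise architecture described above can be illustrated with a minimal Python sketch. This is not Trimmomatic's code or interface: the function names (`sliding_window`, `min_length`, `run_pipeline`), the window size of 4, and the toy reads are assumptions chosen for illustration, and only the Phred threshold of 30 follows the filtering criterion stated in this paper.

```python
from typing import Callable, List, Optional, Tuple

Read = Tuple[str, List[int]]          # (bases, per-base Phred scores)
Step = Callable[[Read], Optional[Read]]


def sliding_window(window: int, min_mean_q: float) -> Step:
    """Cut the read at the first window whose mean quality is below the threshold."""
    def step(read: Read) -> Optional[Read]:
        seq, quals = read
        for start in range(0, max(len(seq) - window + 1, 1)):
            chunk = quals[start:start + window]
            if chunk and sum(chunk) / len(chunk) < min_mean_q:
                return seq[:start], quals[:start]
        return read
    return step


def min_length(n: int) -> Step:
    """Drop reads shorter than n bases (None signals that the read is removed)."""
    def step(read: Read) -> Optional[Read]:
        return read if len(read[0]) >= n else None
    return step


def run_pipeline(read: Read, steps: List[Step]) -> Optional[Read]:
    """Apply the steps in the user-specified order; stop once a read is dropped."""
    current: Optional[Read] = read
    for step in steps:
        if current is None:
            return None
        current = step(current)
    return current


# Hypothetical step order and toy reads, for illustration only.
steps = [sliding_window(window=4, min_mean_q=30), min_length(5)]
good = ("ACGTACGTAC", [38, 38, 37, 36, 36, 35, 35, 20, 12, 8])
bad = ("ACGTAC", [12, 10, 9, 8, 8, 7])
print(run_pipeline(good, steps))   # low-quality 3' end is trimmed away
print(run_pipeline(bad, steps))    # read is dropped entirely
```

Because each step is an independent function, reordering the list changes the behaviour, which mirrors the user-specified step order described for the tool.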

Kmer size selection and genome size estimation

The best k-mer length was estimated using KmerGenie v. 1.7051 (http://kmergenie.bx.psu.edu/). The best k-mer identified by this algorithm was used for genome size estimation with Jellyfish v.2.3.0 (https://github.com/gmarcais/Jellyfish) and GenomeScope v. 1.0 (http://qb.cshl.edu/genomescope/).
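The idea behind k-mer-based genome size estimation can be sketched as follows. This is a simplified illustration and not the Jellyfish or GenomeScope implementation: it counts k-mers on one strand only, ignores sequencing-error modelling beyond a crude minimum-depth cutoff, and estimates size as the total number of usable k-mers divided by the depth of the main histogram peak. The function names and the toy histogram are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Usable k-mer total divided by the depth of the main peak.

    K-mers below min_depth are treated as likely sequencing errors.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram: an error peak at depth 1, a main peak at depth 10,
# and a small two-copy repeat peak at depth 20.
toy_hist = {1: 500, 9: 10, 10: 100, 11: 10, 20: 5}
size, peak = estimate_genome_size(toy_hist)
print(size, peak)
```

Dedicated tools fit a statistical model to the whole histogram rather than reading off a single peak, which is what allows them to report heterozygosity and repeat content as well.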

De novo assembly and annotation

De novo genome assembly of the shrimp was performed from the Illumina reads using ABySS v.2.1.5 (https://www.bcgsc.ca/resources/software/abyss) (Simpson et al. 2009), a de Bruijn graph-based, single-k de novo assembler designed for short paired-end reads and suitable for larger eukaryotic genomes. The raw Illumina reads were quality filtered using stringent filtering criteria (Phred score > 30) and used for the best k-mer prediction and k-mer-based genome size estimation. The reads were then assembled into contigs and scaffolds with ABySS using the selected best k-mer. The scaffolds were subjected to gap filling and error correction. The resulting scaffold-level genome was used for repeat masking and genome annotation, and the predicted genes were assigned putative functions. The complete workflow for the genome assembly and annotation is presented in Fig. 1.
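To illustrate what a single-k de Bruijn graph assembler does conceptually, the following sketch builds a graph whose nodes are (k-1)-mers and walks one unambiguous path to produce a contig. It is a toy under strong assumptions (error-free reads, one strand, no bubble or tip handling) and does not reflect how ABySS is implemented; the function names and the two example reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer contributes an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_contig(graph, start):
    """Extend from start while the current node has exactly one unvisited successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:          # stop on cycles instead of looping forever
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Two overlapping toy reads; k = 4 is far smaller than the k = 69 used in this study.
reads = ["ACGTTGCA", "GTTGCAGG"]
graph = build_de_bruijn(reads, k=4)
print(walk_contig(graph, "ACG"))
```

The choice of k matters because a small k collapses repeats into branching nodes, whereas a large k needs higher coverage to keep the graph connected.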

Fig. 1.

Workflow chart showing the methods and tools used for the de novo assembly and analysis

Genome evaluation

Genome assembly and annotation completeness was assessed quantitatively against evolutionarily informed expectations of gene content, using the open-source BUSCO software and its sets of Benchmarking Universal Single-Copy Orthologs (Simão et al. 2015).

Repeat masking and repeat annotation

Whole genome sequences can have a high proportion of repetitive elements (∼80%) and high heterozygosity (Yu et al. 2015). These repetitive elements should be identified and masked before proceeding to genome annotation. Transposable elements (TEs) were identified using RepeatModeler V 1.08 by building a de novo repeat library, which was then used to mask and annotate the genome with RepeatMasker V 4.1.1. RepeatModeler is a de novo TE family identification and modeling package that includes three de novo repeat finding programs (RECON, RepeatScout, and LtrHarvest/Ltr_retriever), which employ complementary computational methods to identify repeat element boundaries and family relationships from the provided sequence data. RepeatModeler automates the runs of these algorithms on a genomic database, clusters redundant results, refines and classifies the families, and produces a high-quality library of TE families suitable for use with RepeatMasker. RepeatMasker is a program that screens DNA sequences for interspersed repeats and low-complexity DNA sequences (Smit 1996) by aligning the input genome sequence against a library of known repeats (Tarailo-Graovac and Chen 2009). Its output includes a detailed annotation of the repeats present in the query sequence and a modified version of the query sequence in which all annotated repeats are masked. Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith–Waterman–Gotoh algorithm (Tarailo-Graovac and Chen 2009).
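What "masking" means for the downstream gene predictor can be shown with a short sketch. It assumes the repeat annotations are already available as 0-based, end-exclusive intervals and applies either soft masking (lower case) or hard masking (N). The function names and the toy interval are hypothetical, and this is not RepeatMasker's code.

```python
def mask_sequence(seq, intervals, hard=False):
    """Mask annotated repeats: lower case (soft) or 'N' (hard).

    intervals: iterable of (start, end), 0-based and end-exclusive.
    """
    out = list(seq.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(out))):
            out[i] = "N" if hard else out[i].lower()
    return "".join(out)


def masked_fraction(masked):
    """Fraction of bases that are masked (note: pre-existing N gaps also count)."""
    hits = sum(1 for c in masked if c == "N" or c.islower())
    return hits / len(masked) if masked else 0.0


# Toy scaffold with an (AC)n simple repeat annotated at positions 4..11.
scaffold = "ACGTACACACACGGTT"
repeats = [(4, 12)]
print(mask_sequence(scaffold, repeats))             # soft-masked
print(mask_sequence(scaffold, repeats, hard=True))  # hard-masked
print(masked_fraction(mask_sequence(scaffold, repeats)))
```

Soft masking keeps the underlying bases available to tools that can use them, whereas hard masking removes them from consideration altogether.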

Gene prediction

The first step of genome annotation is to identify all the genes in a given genomic sequence. Genes were predicted using the ab initio method from the repeat-masked genome using Augustus v.2.5.5 (http://augustus.gobics.de/). AUGUSTUS is based on a generalized hidden Markov model (GHMM) that defines probability distributions for various sections of genomic sequences. Introns, exons, intergenic regions, etc. correspond to states in the model and the purpose of each state is to generate DNA sequences with certain pre-defined emission probabilities (Stanke and Morgenstern 2005). Augustus was used to create de novo models from scaffolds, which were later identified or confirmed by searching in BLAST.
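The state-based view of a sequence can be illustrated with Viterbi decoding on a plain two-state hidden Markov model. This is a deliberate simplification: AUGUSTUS uses a generalized HMM with many states and explicit length distributions, whereas the sketch below has only "intergenic" and "coding" states, and every probability in it is an invented toy value rather than a trained parameter.

```python
import math

STATES = ("intergenic", "coding")
START = {"intergenic": 0.5, "coding": 0.5}
# Toy transition and emission probabilities (assumptions, not trained values).
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "coding": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


toy = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
print(viterbi(toy))
```

In this toy model the GC-rich block is labelled "coding" only because the invented emission probabilities favour G and C in that state; a real gene finder combines splice-site, codon-usage, and length signals.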

Gene annotation

The functions of the predicted genes were annotated using BLASTx, developed by the National Center for Biotechnology Information (NCBI), against the previously characterized non-redundant protein database. BLASTx compares a nucleotide query sequence, translated in all six reading frames (three on each strand), against the amino acid sequence database.
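The six-frame translation that precedes the protein-level comparison can be sketched in a few lines. The code below uses the standard genetic code and is only an illustration of the translation step, not of BLASTx itself; the function names and the nine-base example are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translations is then searched against the protein database, so a gene is detectable regardless of the strand or frame in which it lies on the scaffold.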

Functional annotation and pathway analysis

eggNOG-mapper is a tool for fast functional annotation of novel sequences (genes or proteins) using precomputed eggNOG-based orthology assignments. The GO and KO annotations were retrieved, and genes were sorted using the PANTHER database. Based on the PANTHER analysis, the genes involved in developmental processes, cellular processes, immune system processes, and reproductive processes were predicted. These genes were analyzed for their roles in biological pathways using the web-based application Pathway Commons.

Comparative genomics and phylogenetic studies

This analysis was carried out using OrthoFinder (https://github.com/davidemms/OrthoFinder), which provides high-accuracy orthogroup inference together with phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomics statistics. The protein sequences of Armadillidium nasatum, Armadillidium vulgare, Cherax quadricarinatus, Eriocheir sinensis, Gammarus roeselii, Homarus americanus, Hyalella azteca, Macrobrachium nipponense, Penaeus indicus, Penaeus japonicus, Penaeus monodon, Penaeus vannamei, Portunus trituberculatus, Procambarus clarkii, and Trinorchestia longiramus, along with those of Caridina pseudogracilirostris, were used for the phylogenetic and orthology studies.

Phylogenetic analysis was conducted on the above-mentioned 16 species to generate a species tree. This analysis was based on a STAG (Species Tree inference from All Genes) species tree inferred from all orthogroups, containing STAG support values at internal nodes and rooted using STRIDE (Species Tree Root Inference from Gene Duplication Events). The species tree was visualized and exported using the online tool iTOL (Interactive Tree of Life; https://itol.embl.de/).

Results

Library preparation and genome sequencing

The genomic DNA extracted from the tissue sample had a concentration of 116.724 ng/µl. The sample passed next-generation sequencing (NGS) library quality control with a concentration of 65.13 ng/µl, assessed on the TapeStation (TruSeq Nano DNA, 350). The library was prepared using the Illumina TruSeq Nano DNA Library (350 bp PE insert), and sequencing was performed on a NovaSeq6000 platform with 2 × 150 bp paired-end reads, generating a total of 160 GB of data. The whole genome sequence data have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database under accession number PRJNA847710, and the data were published on June 10, 2022.

Kmer size selection and genome size estimation

The raw whole genome data had an estimated coverage of 56x, with a total of 489,387,876 Illumina raw sequencing reads. These raw reads comprised 73,897,569,276 bases, with a GC content of 38.01% and an AT content of 61.99%. The percentages of bases reaching Phred quality scores of Q20 and Q30 were 94.62% and 88.13%, respectively. Table 2 delineates the specifics of total and quality-filtered reads, including their lengths and the percentage of reads surpassing Q30. KmerGenie identified the best k-mer length as 69, which was then used for genome size estimation with Jellyfish and GenomeScope (Table 3, Fig. 2). GenomeScope2 model fitting of the k-mer distribution indicated a genome size of 1.31 Gbp with a low heterozygosity of approximately 0.81% (Fig. 2). Unique sequences accounted for 77.9% of the genome, totaling about 1.02 Gbp (Table 3).

Table 2.

Summary of Kmer analysis and genome size estimation

SI. no Particulars Number
1 Estimated K (best) 69
2 Estimated heterozygosity 0.81%
3 Estimated genome haploid length 1,308,850,805 bp
4 Estimated genome repeat length 288,637,448 bp
5 Estimated genome unique length 1,020,213,357 bp
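
The figures in this table and the reported sequencing depth are arithmetically related, which can be checked with a short calculation. The sketch below uses only numbers stated in this section; the variable names are ours, and coverage is computed from raw (unfiltered) bases, which is an assumption about how the reported 56x was derived.

```python
# Values copied from the k-mer summary table and the raw read statistics.
haploid_length = 1_308_850_805    # bp, estimated genome haploid length
repeat_length = 288_637_448       # bp, estimated genome repeat length
unique_length = 1_020_213_357     # bp, estimated genome unique length
total_raw_bases = 73_897_569_276  # bp, cumulative bases in the raw reads

# Repeat and unique lengths should partition the haploid estimate.
assert repeat_length + unique_length == haploid_length

unique_pct = 100 * unique_length / haploid_length
repeat_pct = 100 * repeat_length / haploid_length
coverage = total_raw_bases / haploid_length

print(f"unique fraction: {unique_pct:.1f}%")
print(f"repeat fraction: {repeat_pct:.1f}%")
print(f"raw coverage:    {coverage:.1f}x")
```

The unique fraction reproduces the 77.9% quoted in the text, and the raw coverage falls between 56x and 57x, consistent with the stated estimate.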

Table 3.

Summary of repeats identified from the C. pseudogracilirostris genome

Elements Number of elements* Length occupied Percentage of sequence
Retroelements 119,335 31,987,404 bp 3.19%
SINEs 0 0 bp 0.00%
Penelope 20,195 4,832,234 bp 0.48%
LINEs 99,932 23,743,059 bp 2.37%
CRE/SLACS 3264 651,164 bp 0.06%
L2/CR1/Rex 43,007 10,558,906 bp 1.05%
R1/LOA/Jockey 8571 2,391,652 bp 0.24%
R2/R4/NeSL 0 0 bp 0.00%
RTE/Bov-B 14,949 2,720,789 bp 0.27%
L1/CIN4 1227 61,535 bp 0.01%
LTR elements 19,403 8,244,345 bp 0.82%
BEL/Pao 126 37,428 bp 0.00%
Ty1/Copia 0 0 bp 0.00%
Gypsy/DIRS1 17,150 7,908,568 bp 0.79%
Retroviral 1009 203,099 bp 0.02%
DNA transposons 27,558 6,549,101 bp 0.65%
hobo-Activator 10,032 2,782,116 bp 0.28%
Tc1-IS630-Pogo 8197 1,985,750 bp 0.20%
En-Spm 0 0 bp 0.00%
MuDR-IS905 0 0 bp 0.00%
PiggyBac 131 34,120 bp 0.00%
Tourist/Harbinger 1404 306,575 bp 0.03%
Other (Mirage, P-element, Transib) 243 69,465 bp 0.01%
Rolling circles 7479 1,076,387 bp 0.11%
Unclassified 886,502 120,058,713 bp 11.96%
Total interspersed repeats 158,595,218 bp 15.80%
Small RNA 29,150 4,517,743 bp 0.45%
Satellites 74 14,407 bp 0.00%
Simple repeats 1,148,445 72,870,646 bp 7.26%
Low complexity 130,707 9,779,534 bp 0.97%

*Most repeats fragmented by insertions or deletions were counted as one element
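
As a consistency check on the repeat summary, the top-level classes can be summed and compared with the reported total of about 247 Mb (24.60%). The sketch assumes that rolling circles, small RNA, satellites, simple repeats, and low complexity regions are counted in addition to the "Total interspersed repeats" row, which is how we read the table; all values are copied from it.

```python
# Components of the "Total interspersed repeats" row (bp).
retroelements = 31_987_404
dna_transposons = 6_549_101
unclassified = 120_058_713
total_interspersed = 158_595_218
assert retroelements + dna_transposons + unclassified == total_interspersed

# Top-level classes assumed to make up the total masked length (bp).
masked_bp = {
    "Total interspersed repeats": total_interspersed,
    "Rolling circles": 1_076_387,
    "Small RNA": 4_517_743,
    "Satellites": 14_407,
    "Simple repeats": 72_870_646,
    "Low complexity": 9_779_534,
}
total_masked = sum(masked_bp.values())
print(f"total masked: {total_masked:,} bp (~{total_masked / 1e6:.0f} Mb)")

# Share of each class within the masked fraction.
for name, bp in masked_bp.items():
    print(f"{name:28s} {100 * bp / total_masked:5.1f}% of masked bases")
```

The sum comes to roughly 247 Mb, in line with the figure reported in the text, and it shows that unclassified interspersed repeats and simple repeats together make up most of the masked sequence.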

Fig. 2.

GenomeScope profile based on the histogram generated using a k-mer length of 69

Using the optimal k-mer, ABySS v.2.1.5 was employed for de novo genome assembly, as outlined in Table 2. The BUSCO scores for the final assembly indicated a relatively complete genome assembly, with a BUSCO completeness of 52.8%. The minor disparity between the assembled genome size and the initial estimate is unlikely to stem from erroneous assembly duplication, given the duplicated BUSCO score of 2.0%. The quality of the genome assembly is supported by the Jellyfish and GenomeScope analyses, which together indicate that the C. pseudogracilirostris draft genome assembly is comprehensive, non-redundant, and valuable for diverse applications (Fig. 2).

Repeat masking and repeat annotation

Repeats in the assembled genome were masked and annotated using a de novo repeat library and RepeatMasker, as outlined in Table 3. Repetitive elements identified with the de novo library built by RepeatModeler constituted 24.60% (247 Mb) of the assembled genome. Simple sequence repeats (SSRs) were notably abundant, comprising 7.26% of the genome. Other noteworthy repeat classes were retroelements (3.19%), LINEs (2.37%), and L2/CR1/Rex (1.05%). Minor repeat families included low complexity regions (0.97%), LTR elements (0.82%), Gypsy/DIRS1 (0.79%), DNA transposons (0.65%), Penelope (0.48%), small RNA (0.45%), hobo-Activator (0.28%), RTE/Bov-B (0.27%), R1/LOA/Jockey (0.24%), and Tc1-IS630-Pogo (0.20%) (Fig. 3).

Fig. 3.

Phylogenetic analysis of 16 species of Malacostraca showed that Macrobrachium nipponense was closely linked to Caridina pseudogracilirostris, with the two forming sister taxa

Comparisons with other shrimp genomes revealed distinct characteristics in the C. pseudogracilirostris genome. Notably, P. chinensis and P. vannamei exhibited a higher proportion of DNA transposons and low complexity repeats. In addition, the proportion of SSRs in C. pseudogracilirostris was lower than that reported for the P. indicus genome (49.31%). This analysis characterizes the repetitive landscape of the C. pseudogracilirostris genome and differentiates it from other shrimp species.

Gene prediction and gene annotations

Gene prediction on the repeat-masked genome was performed using the ab initio method with Augustus v.2.5.5 (Supplementary Table S1). A total of 14,101 genes were identified, covering a total gene length of 63,100,629 bases. To assign putative functions to the predicted genes, BLASTx was run against the non-redundant protein database (Table 4).

Table 4.

Summary of BLASTX annotation

SI. no Particulars Number
1 Total number of genes subjected to BLASTx 14,101
2 Total number of genes assigned with function 12,345
3 Total number of genes with no similarity 1756

Functional annotation and pathway analysis using eggNOG-mapper, PANTHER, and Pathway Commons

A total of 9395 transcripts, representing 66.62% of the total sequences, exhibited hits in the EggNOG database. Among these, 5231 (55.67%), 6476 (68.93%), and 3867 (41.16%) were associated with relevant Gene Ontology (GO) keywords, KEGG orthology functional annotations, and gene symbols, respectively. Mapping these genes with the Panther database revealed 19 biological processes, including the developmental process, multicellular organismal process, cellular process, reproduction, localization, reproductive process, biological adhesion, immune system process, biological regulation, growth, signaling, metabolic process, biological process involved in interspecies interaction between organisms, pigmentation, response to stimulus, biological phase, behaviour, rhythmic process, and locomotion. Furthermore, genes were clustered based on their involvement in four major processes: developmental (31 genes), cellular (30 genes), immune system (20 genes), and reproduction (24 genes). Notably, specific pathways were identified for each cluster, revealing the intricate molecular processes governing various aspects of the organism’s biology (Table 5).

Table 5.

Orthologous genes analysis between 16 species of Malacostraca

Overall statistics
Number of species 16
Number of genes 317,043
Number of genes in orthogroups 281,390
Number of unassigned genes 35,653
Percentage of genes in orthogroups 88.8
Percentage of unassigned genes 11.2
Number of orthogroups 21,685
Number of species-specific orthogroups 5294
Number of genes in species-specific orthogroups 26,377
Number of C. pseudogracilirostris-specific orthogroups 7396
Number of genes in C. pseudogracilirostris-specific orthogroups 11,619
Percentage of genes in species-specific orthogroups 8.3
Percentage of genes in C. pseudogracilirostris orthogroups 3.7
Mean orthogroup size 13
Median orthogroup size 9
G50 (assigned genes) 20
G50 (all genes) 18
O50 (assigned genes) 3524
O50 (all genes) 4453
Number of orthogroups with all species present 1
Number of single-copy orthogroups 0
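
The summary statistics in this table can be partly reproduced from the counts themselves, and the G50 and O50 values follow a definition analogous to N50 for assemblies. The sketch below assumes, based on our reading of the OrthoFinder documentation, that G50 is the orthogroup size at which half of the genes lie in orthogroups of that size or larger, and O50 is the number of orthogroups needed to reach that half. The function name and the toy size list are hypothetical; the totals are copied from the table.

```python
def g50_o50(orthogroup_sizes):
    """Return (G50, O50) for a list of orthogroup sizes (genes per orthogroup)."""
    sizes = sorted(orthogroup_sizes, reverse=True)
    half = sum(sizes) / 2
    running = 0
    for rank, size in enumerate(sizes, start=1):
        running += size
        if running >= half:
            return size, rank
    raise ValueError("no orthogroups given")


# Toy example: six orthogroups holding 100 genes in total.
print(g50_o50([40, 30, 20, 5, 3, 2]))

# Consistency checks on the reported totals.
total_genes = 317_043
assigned = 281_390
unassigned = 35_653
orthogroups = 21_685
assert assigned + unassigned == total_genes
pct_assigned = 100 * assigned / total_genes
mean_size = assigned / orthogroups
print(f"{pct_assigned:.1f}% of genes in orthogroups, mean size {mean_size:.1f}")
```

The assigned and unassigned gene counts sum to the total, the assigned share rounds to the reported 88.8%, and the mean orthogroup size of about 13 matches the table.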

For instance, the 31 genes associated with the developmental process, including actS, APC, APC2, CNTN5, DAG1, DDR2, and others, were found to be involved in pathways such as SUMOylation of intracellular receptors, nuclear receptor transcription pathway, and thyroid hormone-mediated signaling pathway (Supplementary Table SII). Similarly, genes linked to cellular processes (e.g., ABHD12, ACLY, APBB2) and immune system processes (e.g., CLEC7A/inflammasome pathway, toll-like receptor cascades) were identified with corresponding pathway associations (Supplementary Tables SIII and SIV).

In the reproductive process cluster, comprising 24 genes such as DDX3X, DMRT2, and RAD50, pathways such as HDR through homologous recombination, meiotic cell cycle, oocyte construction, and regulation of TP53 activity through phosphorylation were discerned (Supplementary Table S5.1, S5.2, S5.3, S5.4, and S5.5). This comprehensive analysis provides a detailed insight into the functional annotation of genes and their involvement in key biological processes, shedding light on the molecular intricacies governing development, cellular functions, immune responses, and reproductive processes in the studied organism.

Species conservation and phylogenetic analysis

We analyzed the protein sequences of Malacostraca species to uncover the genomic relationships and evolutionary dynamics within this diverse group (Table 5). One notable finding concerns gene duplication, which sheds light on the genomic features of the studied species. We identified gene duplications in several Malacostraca species. C. pseudogracilirostris exhibited 1,856 duplicated genes, underscoring its genetic complexity. Homarus americanus showed a substantial number of duplicated genes (39,322), while Cherax quadricarinatus displayed around 91 instances of gene duplication. C. pseudogracilirostris is closely related to M. nipponense, and the two are represented as sister taxa (Fig. 3).

Discussion

The genomic analysis of C. pseudogracilirostris provides valuable insights into its genetic makeup and functional elements. The high-quality sequencing data have been deposited in the NCBI SRA database under accession number PRJNA847710. The quality of the FASTQ data was improved and assessed using Trimmomatic (Bolger et al. 2014), a popular adapter trimming tool, and FASTQC (Simon Andrews 2010), a Java-based QC tool that provides per-base and per-read quality profiling. The raw reads were quality-filtered with Trimmomatic, which removes Illumina sequencing adapters and low-quality sequences so that only reads meeting the filtering criterion (Phred score > 30) are retained, and were checked for adapter contamination and poor-quality bases using FASTQC. The GenomeScope estimate was comparatively lower than the values previously reported by Swathi et al. (2018) and Kawato et al. (2021).

The de novo sequencing analysis revealed robust characteristics of the C. pseudogracilirostris genome. The estimated genome size of 1.31 Gbp, obtained through KmerGenie and validated by GenomeScope2, was accompanied by low heterozygosity. Using ABySS v.2.1.5 for de novo genome assembly resulted in a relatively complete genome with a BUSCO completeness score of 52.8%. The minor disparity in genome size compared to the initial estimate is unlikely to reflect erroneous assembly duplication, given the low duplicated BUSCO score (2.0%).

Repeat masking and annotation unveiled intriguing aspects of the genome’s repetitive landscape, with repetitive elements comprising 24.60% of the genome. Notably, simple sequence repeats (SSRs) constituted 7.26%, a distinctive feature. Comparative analysis with other shrimp genomes highlighted unique characteristics in C. pseudogracilirostris, such as lower SSR proportions compared to P. indicus and variations in DNA transposons and low-complexity repeats in comparison to P. chinensis and P. vannamei. In general, large-scale DNA editing of retrotransposons, by simultaneously generating large numbers of mutations, may have accelerated their exaptation during mammalian evolution (Carmi et al. 2011). Similarly, inverted SINE repeats promote RNA editing by adenosine-to-inosine deamination, creating potential novelties in both coding and regulatory sequences (Daniel et al. 2014). The role of SSRs in adaptive evolution was recently demonstrated for shrimp (Yuan et al. 2021). Transposable elements (TEs) were once thought to make up only a negligible fraction of eukaryotic genomes, but were later found to account for a major proportion (Britten and Kohne 1968). The proportion of TEs varies widely among organisms, from about 3% in yeast (Carr et al. 2012) to more than 80% in maize (Meyers et al. 2001). The human genome, for example, is rich in repetitive sequences, which account for about 45% of it (International Human Genome Sequencing Consortium 2001).
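
To illustrate what an SSR proportion measures, the sketch below scans a sequence for perfect tandem repeats of short motifs with a regular expression and reports the share of bases they cover. The motif length limit, the minimum copy number, and the toy sequence are assumptions chosen for the example; dedicated repeat annotation tools use more elaborate, motif-specific thresholds and also detect imperfect repeats.

```python
import re


def find_ssrs(sequence, max_motif=6, min_repeats=4):
    """Find perfect tandem repeats of motifs 1..max_motif bases long.

    Returns (start, motif, copies) tuples. The lazy quantifier makes the
    regex prefer the shortest motif that satisfies the copy threshold.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


def ssr_fraction(sequence, hits):
    """Fraction of the sequence covered by the reported SSRs."""
    covered = sum(len(motif) * copies for _, motif, copies in hits)
    return covered / len(sequence)


# Toy sequence (assumed): an (AT)6 and an (AAG)5 repeat with short flanks.
toy = "GGC" + "AT" * 6 + "CGTCT" + "AAG" * 5 + "TTGC"
hits = find_ssrs(toy)
print(hits, round(ssr_fraction(toy, hits), 3))
```

Applied genome-wide, the same ratio (SSR bases over total bases) is what a figure such as 7.26% expresses.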

Gene prediction and annotation further enriched the understanding of the genome. A total of 14,101 genes were identified, with a combined gene length of 63,100,629 bases. Functional annotation using BLASTx and the non-redundant protein database provided valuable insights into the putative functions of these genes. The comprehensive transcriptomic analysis of C. pseudogracilirostris provides valuable insights into the functional landscape of its genome. A significant portion of the transcripts, 66.62% of the total sequences, had hits in the EggNOG database, underscoring the presence of evolutionarily conserved genes. Further annotation revealed that a substantial proportion of these transcripts were associated with Gene Ontology (GO) terms, KEGG orthology (KO) functional annotations, and gene symbols, contributing to a more nuanced understanding of the molecular functions encoded in the genome.
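
Two derived figures follow directly from the numbers reported above and can be checked with a few lines of arithmetic. The sketch assumes that the 63,100,629 bases are the summed length of all predicted gene models and uses the 1.31 Gbp estimated genome size as the denominator; both choices are interpretations made for this example.

```python
# Values quoted in the text.
gene_count = 14_101
total_gene_length = 63_100_629      # bases, summed over predicted genes (assumed)
estimated_genome_size = 1.31e9      # bases, k-mer based estimate

# Mean length of a predicted gene.
mean_gene_length = total_gene_length / gene_count

# Share of the estimated genome occupied by predicted genes.
genic_percent = 100 * total_gene_length / estimated_genome_size

print(f"mean gene length: {mean_gene_length:.1f} bases")   # about 4474.9
print(f"genic share of genome: {genic_percent:.1f}%")      # about 4.8%
```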

Mapping these genes to the Panther database uncovered 19 distinct biological processes, illuminating the multifaceted molecular activities governing crucial aspects of the organism’s biology. Clustering genes based on their involvement in key processes, such as development, cellular functions, immune system responses, and reproduction, allowed for a more focused exploration of their functional roles. In-depth pathway analysis within each cluster revealed specific molecular mechanisms at play. For instance, genes associated with developmental processes were found to participate in pathways like SUMOylation of intracellular receptors, the nuclear receptor transcription pathway, and the thyroid hormone-mediated signaling pathway, underscoring the intricate regulatory networks governing developmental events.

Similarly, cellular processes were linked to pathways such as cellular-modified amino acid catabolic process, providing insights into the metabolic activities crucial for cellular homeostasis. Immune system processes were associated with pathways like CLEC7A/inflammasome pathway and toll-like receptor cascades, shedding light on the molecular defenses and recognition mechanisms in the organism. The reproductive process cluster exhibited genes involved in pathways critical for reproductive events, including HDR through homologous recombination and the meiotic cell cycle. These findings underscore the importance of these molecular processes in ensuring successful reproduction.

Next-generation sequencing (NGS) technologies and genomic sequence information aid in our goal of decoding unknown genes and their evolutionary secrets. Innate immunity-related molecules such as cytokines, toll-like receptors, and the complement family, as well as acquired immunity-related molecules such as MHC and antibody receptors, are well known to be expressed in the brain and play key roles in brain development (Morimoto and Nakajima 2019). Generally, shrimps lack a classical adaptive immune system, with T cells and antigen-specific memory, to help them survive under poor environmental conditions (Hoffmann et al. 1999; Hauton and Smith 2007). Crustaceans such as shrimp, prawns, crayfish, lobster, and crabs are farmed widely to meet global demand through intensive aquaculture techniques (Hauton and Smith 2007). Specialized adaptive functions, such as digestive functions, can be revealed through transcriptomic analysis (Wang et al. 2021). Further analysis of the genome and RNA sequences will reveal the physiological functions and their role in evolutionary adaptations.

Phylogenetic analysis was based on a rooted tree, which shows the evolutionary relationships of lineages stemming from the root of the tree. The species tree was rooted using the STRIDE method (Emms and Kelly 2017). STRIDE is a fast, effective, and outgroup-free method for species tree root inference from gene duplication events: it identifies the set of well-supported in-group gene duplication events in a set of unrooted gene trees and analyzes those events to infer a probability distribution over the unrooted species tree for the location of its root. OrthoFinder automatically derived the rooted species tree from the raw amino acid sequence data, which in turn allowed the complete set of orthogroup gene trees for the input species to be rooted. C. pseudogracilirostris is closely related to M. nipponense and is placed as its sister taxon because both belong to the same infraorder, Caridea. Its longer branch length indicates that more genetic change (divergence) has occurred than in M. nipponense and the other species, suggesting that this divergence contributes to its high adaptability to varied environmental stress conditions (Yuan et al. 2024; Soundharapandiyan et al. 2022).

Thus, starting from protein sequences alone, OrthoFinder inferred the orthogroups, the orthologs, the complete set of gene trees for all orthogroups, the rooted species tree, and all gene duplication events, and computed the comparative genomic statistics (Emms and Kelly 2019). Gene duplication events are used to distinguish orthologs from paralogs (Yuan et al. 2024). Orthologs diverge when species diverge, so they are as closely related as the two species themselves (Kristensen et al. 2011). Paralogs, by contrast, diverge from a shared ancestral gene at a gene duplication event. Because such a duplication must predate the species divergence, paralogs between two species are always less closely related than orthologs. Further comparative transcriptomic studies can provide insights into conservation and epigenetic information, delivering genetic resources to improve biomass production, growth rate, or adaptation. These can then be applied to higher organisms with vastly different needs.
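
The ortholog and paralog distinction described above can be sketched on a toy gene tree. In this minimal example, each internal node carries a hypothetical event label, and a pair of genes is called orthologous when the node where their lineages meet is a speciation and paralogous when it is a duplication. The class, the labels, and the gene names are assumptions for illustration; OrthoFinder's own inference from real gene trees is considerably more involved.

```python
from itertools import combinations


class Node:
    """Gene tree node: leaves carry a gene name, internal nodes an event label."""

    def __init__(self, event=None, gene=None, children=()):
        self.event = event          # "speciation" or "duplication"
        self.gene = gene            # set only for leaves
        self.children = list(children)


def leaves(node):
    """Collect the gene names under a node."""
    if not node.children:
        return [node.gene]
    names = []
    for child in node.children:
        names.extend(leaves(child))
    return names


def classify_pairs(node, result=None):
    """Label each gene pair by the event at its last common ancestor."""
    if result is None:
        result = {}
    if not node.children:
        return result
    label = "ortholog" if node.event == "speciation" else "paralog"
    for left, right in combinations(node.children, 2):
        for a in leaves(left):
            for b in leaves(right):
                result[frozenset((a, b))] = label
    for child in node.children:
        classify_pairs(child, result)
    return result


# Toy tree: one duplication (copies g1 and g2) that predates the split of
# two hypothetical species, spA and spB.
copy1 = Node("speciation", children=[Node(gene="spA_g1"), Node(gene="spB_g1")])
copy2 = Node("speciation", children=[Node(gene="spA_g2"), Node(gene="spB_g2")])
tree = Node("duplication", children=[copy1, copy2])

pairs = classify_pairs(tree)
for pair, label in sorted((sorted(p), l) for p, l in pairs.items()):
    print(pair, label)
```

In the toy tree, spA_g1 and spB_g1 meet at a speciation node and are orthologs, whereas spA_g1 and spB_g2 only meet at the older duplication node, which is why between-species paralogs are the more distantly related pair.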

Conclusion

Here, we provided a genome assembly of the highly adaptable crustacean shrimp, C. pseudogracilirostris. The 1.3 Gbp draft genome assembly has a BUSCO score indicating 52.8% completeness. We anticipate that comprehensive sequencing datasets will be a valuable tool for basic evolutionary and comparative studies among crustaceans. Phylogenetic relationship studies, comparing 14 other genera, have provided insights into the evolution and diversity among malacostracan species. Further studies on expression profiling using transcriptomics of C. pseudogracilirostris will offer essential genetic information necessary for understanding decapod evolution and its adaptation to brackish water environments.

Supplementary Information

Below is the link to the electronic supplementary material.

Acknowledgements

The authors acknowledge the support from NEXTGEN Lab facility, Sathyabama Institute of Science and Technology.

Funding

The corresponding author RRK acknowledges the Department of Science and Technology (Govt. of India) for financial support (EMR/2014/000630).

Data availability

The datasets generated and/or analyzed during the current study are available in the NCBI SRA repository, accession number: SRR19611691.

Declarations

Conflict of interest

The authors declare no potential conflicts of interest concerning the research, authorship, and/or publication of this article.

References

  1. Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C (2011) OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res 39:D289. 10.1093/NAR/GKQ1238
  2. Andrews S (2010) FastQC: a quality control tool for high throughput sequence data. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 8 Jun 2022
  3. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120. 10.1093/BIOINFORMATICS/BTU170
  4. Britten RJ, Kohne DE (1968) Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms. Science 161:529–540. 10.1126/SCIENCE.161.3841.529
  5. Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods 12:59–60. 10.1038/NMETH.3176
  6. Camacho C, Coulouris G, Avagyan V et al (2009) BLAST+: architecture and applications. BMC Bioinform 10:1–9. 10.1186/1471-2105-10-421
  7. Carmi S, Church GM, Levanon EY (2011) Large-scale DNA editing of retrotransposons accelerates mammalian genome evolution. Nat Commun. 10.1038/NCOMMS1525
  8. Carr M, Bensasson D, Bergman CM (2012) Evolutionary genomics of transposable elements in Saccharomyces cerevisiae. PLoS One 7:e50978. 10.1371/JOURNAL.PONE.0050978
  9. Chikhi R, Medvedev P (2014) Informed and automated k-mer size selection for genome assembly. Bioinformatics 30:31–37. 10.1093/BIOINFORMATICS/BTT310
  10. Daniel C, Silberberg G, Behm M, Öhman M (2014) Alu elements shape the primate transcriptome by cis-regulation of RNA editing. Genome Biol. 10.1186/GB-2014-15-2-R28
  11. Emms DM, Kelly S (2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol 16:1–14. 10.1186/S13059-015-0721-2
  12. Emms DM, Kelly S (2017) STRIDE: species tree root inference from gene duplication events. Mol Biol Evol 34:3267–3278. 10.1093/MOLBEV/MSX259
  13. Emms DM, Kelly S (2019) OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 20:1–14. 10.1186/S13059-019-1832-Y
  14. Hauton C, Smith VJ (2007) Adaptive immunity in invertebrates: a straw house without a mechanistic foundation. BioEssays 29:1138–1146. 10.1002/BIES.20650
  15. He B, Zhu R, Yang H et al (2020) Assessing the impact of data preprocessing on analyzing next generation sequencing data. Front Bioeng Biotechnol 8:817. 10.3389/FBIOE.2020.00817
  16. Hoffmann JA, Kafatos FC, Janeway CA, Ezekowitz RAB (1999) Phylogenetic perspectives in innate immunity. Science 284:1313–1318. 10.1126/SCIENCE.284.5418.1313
  17. Ilie L, Fazayeli F, Ilie S (2011) HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics 27:295–302. 10.1093/BIOINFORMATICS/BTQ653
  18. Kawato S, Nishitsuji K, Arimoto A et al (2021) Genome and transcriptome assemblies of the kuruma shrimp, Marsupenaeus japonicus. G3 Genes, Genomes, Genet. 10.1093/g3journal/jkab268
  19. Kristensen DM, Wolf YI, Mushegian AR, Koonin EV (2011) Computational methods for gene orthology inference. Brief Bioinform 12:379. 10.1093/BIB/BBR030
  20. Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178. 10.1101/GR.1224503
  21. Lloyd GT, Davis KE, Pisani D et al (2008) Dinosaurs and the Cretaceous terrestrial revolution. Proc R Soc B Biol Sci 275:2483. 10.1098/RSPB.2008.0715
  22. Mente E (2008) Reproductive biology of crustaceans: case studies of decapod crustaceans
  23. Meyers BC, Tingey SV, Morgante M (2001) Abundance, distribution, and transcriptional activity of repetitive elements in the maize genome. Genome Res 11:1660–1676. 10.1101/GR.188201
  24. Morimoto K, Nakajima K (2019) Role of the immune system in the development of the central nervous system. Front Neurosci. 10.3389/FNINS.2019.00916
  25. Muñoz-López M, García-Pérez JL (2010) DNA transposons: nature and applications in genomics. Curr Genomics 11:115. 10.2174/138920210790886871
  26. Östlund G, Schmitt T, Forslund K et al (2010) InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res 38:D196. 10.1093/NAR/GKP931
  27. Schulz MH, Weese D, Holtgrewe M et al (2014) Fiona: a parallel and automatic strategy for read error correction. Bioinformatics 30:i356–i363. 10.1093/BIOINFORMATICS/BTU440
  28. Simão FA, Waterhouse RM, Ioannidis P et al (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31:3210–3212. 10.1093/BIOINFORMATICS/BTV351
  29. Simpson JT (2014) Exploring genome characteristics and sequence quality without a reference. Bioinformatics 30:1228–1235. 10.1093/BIOINFORMATICS/BTU023
  30. Simpson JT, Wong K, Jackman SD et al (2009) ABySS: a parallel assembler for short read sequence data. Genome Res 19:1117. 10.1101/GR.089532.108
  31. Smit AFA (1996) The origin of interspersed repeats in the human genome. Curr Opin Genet Dev 6:743–748. 10.1016/S0959-437X(96)80030-X
  32. Soundharapandiyan N, Thanumalayaperumal S, Rajaretinam RK (2022) Real-time imaging and developmental biochemistry analysis during embryogenesis of Caridina pseudogracilirostris. J Exp Zool Part A Ecol Integr Physiol 337:206–220. 10.1002/JEZ.2556
  33. Stanke M, Morgenstern B (2005) AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 10.1093/NAR/GKI458
  34. Steinegger M, Söding J (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35:1026–1028. 10.1038/nbt.3988
  35. Swathi A, Shekhar MS, Katneni VK, Vijayan KK (2018) Genome size estimation of brackishwater fishes and penaeid shrimps by flow cytometry. Mol Biol Rep. 10.1007/s11033-018-4243-3
  36. Tarailo-Graovac M, Chen N (2009) Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinforma Chapter. 10.1002/0471250953.BI0410S25
  37. Thomas MM, Pillai VK, Pillai NN (1973) Caridina pseudogracilirostris sp.nov. (Atyidae: Caridina) from the Cochin backwater. J Mar Biol Assoc India 15:871–872
  38. Thomas GWC, Dohmen E, Hughes DST et al (2020) Gene content evolution in the arthropods. Genome Biol 21:1–14. 10.1186/S13059-019-1925-7
  39. Thorp JH, Rogers DC (2011) Crayfish, Crabs, and Shrimp: Subphylum Crustacea, Class Malacostraca, Order Decapoda. F Guid to Freshw Invertebr North Am 157–168. 10.1016/B978-0-12-381426-5.00018-1
  40. Wang Z, Tang D, Shen C, Wu L (2021) Identification of genes involved in digestion from transcriptome of Parasesarma pictum and Parasesarma affine hepatopancreas. Thalass an Int J Mar Sci 38:93–101. 10.1007/S41208-021-00296-2
  41. Wolfe JM, Breinholt JW, Crandall KA et al (2019) A phylogenomic framework, evolutionary timeline and genomic resources for comparative studies of decapod crustaceans. Proceed Biol Sci. 10.1098/RSPB.2019.0079
  42. Yu Y, Gu J, Jin Y et al (2015) Panoramix enforces piRNA-dependent cotranscriptional silencing. Science 350:339–342. 10.1126/SCIENCE.AAB0700
  43. Yuan J, Gao Y, Zhang X et al (2017) Genome sequences of marine shrimp Exopalaemon carinicauda Holthuis provide insights into genome size evolution of Caridea. Mar Drugs. 10.3390/MD15070213
  44. Yuan J, Zhang X, Wang M et al (2021) Simple sequence repeats drive genome plasticity and promote adaptive evolution in penaeid shrimp. Commun Biol 4:1–14. 10.1038/s42003-021-01716-y
  45. Yuan H, Cai P, Zhang W et al (2024) Identification of genes regulated by 20-Hydroxyecdysone in Macrobrachium nipponense using comparative transcriptomic analysis. BMC Genomics 25:35. 10.1186/S12864-023-09927-9
