Abstract
Genome sequencing has greatly contributed to our understanding of parasitic protozoa. This is particularly the case for Cryptosporidium species (phylum Apicomplexa) which are difficult to propagate. Because of their polymorphic nature, simple sequence repeats have been used extensively as genotypic markers to differentiate between isolates, but no global analysis of amino acid repeats in Cryptosporidium genomes has been reported. Taking advantage of several newly sequenced Cryptosporidium genomes, a comparative analysis of single-amino-acid repeats (SAARs) in seven species was undertaken. This analysis revealed a striking difference between the SAAR profile of the gastric and intestinal species which infect mammals and one species which infects birds. In average, total SAAR length in gastric species is only 25% of the cumulative SAAR length in the genome of C. parvum, C. hominis and C. meleagridis, species infectious to humans. The SAAR profile in the avian parasite C. baileyi stands out due to the presence of long asparagine repeats. C. baileyi proteins with repeats with ≥20 residues are significantly enriched in regulatory functions. As postulated for the related apicomplexan species Plasmodium falciparum, these observations suggest that Cryptosporidium SAARs evolve in response to selective pressure. The putative selective mechanisms driving SAAR evolution in Cryptosporidium species are unknown.
Keywords: Cryptosporidium, cryptosporidiosis, single-amino-acid repeat, simple sequence repeat, mucin, microsatellite
1. Introduction
Tandem repeats are a source of genetic variation for evolutionary adaptation. In humans, single-amino-acid repeats (SAARs) are of particular interest due to the observation that repeat expansions are associated with a growing list of genetic diseases (Karlin et al. 2002; van Eyk et al. 2012), mostly neurodegenerative conditions (Paulson, 2000). Strand slippage during DNA replication may cause repetitive sequences to expand or contract at a faster rate than single nucleotide polymorphisms and other indels accumulate in the genome. Evidence for repeat length evolving in response to selective pressure has been observed. One of the argument in support of this view is the differential evolution of human short tandem repeats (syn. microsatellites) located in genes and in intergenic regions. (Shimada et al. 2016).
The phylum Apicomplexa includes several species of human and animal pathogens, including the agents of malaria, toxoplasmosis, coccidiosis and cryptosporidiosis. SAARs in the genome of the malaria parasite Plasmodium falciparum have been extensively researched. The abundance of asparagine (N) SAARs in P. falciparum has been linked to the AT-richness of the genome (Zilversmit et al. 2010). Compared to homologous proteins in other eukaryotes, P. falciparum proteins are enriched in low-complexity amino acid regions (Aravind et al. 2003) that are predicted to have no secondary structure (Wootton, 1994). Almost one quarter of P. falciparum proteins contain N- and Q-rich protein domains potentially capable of forming self-propagating amyloid fibers (Singh et al. 2004). To put this number in context, the proportion of such proteins in the human proteome is only 0.3%. Several mechanisms favoring high abundance of N repeats have been proposed, including modulation of mRNA stability and rate of translation (Frugier et al. 2010), protein-protein interaction and, in P. falciparum, antigen diversification (Ferreira et al. 2003).
Due to extensive length polymorphism, many repetitive DNA sequences in the genome of Cryptosporidium parasites have been used as genetic markers (Feng et al. 2014; Tanriverdi et al. 2008). In the absence of morphological traits to differentiate Cryptosporidium species and genotypes, length polymorphisms have played an important role in elucidating the epidemiology of human and animal cryptosporidiosis. Although less attention has focused on SAARs, two genes possessing such repeats, gp900 and gp60 (Barnes et al. 1998; Strong et al. 2000), have been extensively studied. Gp60 in particular, has found wide application as a genetic marker. The proteins encoded by these two genes have been shown to be extensively O-glycosylated and to have mucin-like properties. As with P. falciparum SAARs, no evidence for an association between SAAR profile and phenotype has been reported.
The evolutionary process which leads to speciation in the genus Cryptosporidium is unknown. The parasitic life style of these species and the observation that certain Cryptosporidium species or genotypes appear to have a narrow host range, has led to the assumption that parasite populations infecting different host species have evolved separately and differentiated into genetically distinct populations and/or have co-evolved with the host (Garcia & Hayman, 2016). The existence of Cryptosporidium species with distinct organ tropism, together with the recent expansion of good quality Cryptosporidium genome sequences, provides new opportunities to query genomes for polymorphisms which may reflect adaptation. Because of the evolutionary plasticity of SAARs and other types of repeats, as well as the extracellular location of glycosylated proteins, such as mucins, the comparative analysis of SAARs may shed light on Cryptosporidium speciation and on the emergence of the human pathogen C. hominis and of C. parvum genotypes which are thought to be transmitted predominantly among humans (Mallon et al. 2003).
This study was initially motivated by the observation that S and T SAARs were much rarer and shorter in the newly sequenced genome of C. muris as compared to C. parvum and C. hominis. S and T-rich proteins may be extensively glycosylated and form mucins (Johansson et al. 2013). The low abundance of C. muris mucins inferred from the initial analysis led to the hypothesis that the interaction of Cryptosporidium parasites with the host is mediated by different surface proteins, depending on the organ infected (stomach or small/large intestine) and that SAAR abundance and length in different Cryptosporidium species reflects adaptation to a different host environment. The analysis of SAARs in the genome of eight Cryptosporidium species presented here reveals a distinct SAAR profile in the avian parasite C. baileyi and a large difference in SAAR abundance between gastric and intestinal Cryptosporidium species.
2. Material and methods
2.1. Genomes and annotation
Annotated protein sequences from the following isolates were downloaded from CryptoDB.org (Heiges et al. 2006): C. parvum IOWA (Abrahamsen et al. 2004), C. hominis UdeaA01 (Isaza et al. 2015), C. muris RN66. Eimeria sequences were obtained from ToxoDB.org (Gajria et al. 2008). Genomes for which the annotation was not available were annotated using Companion (Steinbiss et al. 2016). The C. muris RN66 genome was used as reference to annotate C. andersoni 30847. The C. parvum IOWA genome was used as reference to annotate C. hominis TU502_new, C. baileyi TAMU-09Q1, C. meleagridis UKMEL1 (Ifeonu et al. 2016) and C. ubiquitum 39726. Information on the Cryptosporidium genome projects is summarized in Supplementary Table S1.
2.2. Repeat analyses
Two methods were used to identify single-amino acid repeats. In the first methods Microsoft Excel was used to identify uninterrupted repeats ≥5 amino acid residues in length. Line breaks were first removed from the annotated protein FASTA files using Fasta Width (FASTX toolkit; http://hannonlab.cshl.edu/fastx_toolkit/) in galaxy (Afgan et al. 2016). The sequences and corresponding gene IDs were imported into Excel one gene per row. Genes containing perfect repeats of ≥5 residues were then identified using excel functions countif or search.
In the second method, perfect and imperfect SAARs were identified using the web interface of XSTREAM (Newman & Cooper, 2007). Except for Min Period = 1, Max Period = 1, Min Copy No. = 20 and Min Domain Length = 20 default values were used. With these settings, the program returned SAARs of minimum length of 20 residues including imperfect repeats interrupted by insertions of one different amino acid. The same settings were used for all species. Except for the data presented in Fig. 2 below, the analyses are based on SAARs with a minimum length of 20 amino acids.
Fig. 2.

Abundance of uninterrupted repeats in C. parvum and C. muris. Genes featuring SAARs of the length shown on the x axis are plotted on the y axis, ranked by abundance. Repeats of potentially O-glycosylated threonine and serine are more abundant in C. parvum than in C. muris. Repeats of other amino acids do not show the same difference as exemplified by A and L SAARs.
The cumulative repeat length was extracted from the XSTREAM output for each species and each SAAR type, i.e., repeats of S, repeats of T, etc.. Based on these data, the distance between pairs of species was calculated in GenAlEx (Peakall & Smouse, 2012) using the SSR distance metric. The SSR distance is the sum of the squared difference in repeat length, summed over all the repeat types. For example, XSTREAM detected a total of 1128 T residues in SAARs ≥20 amino acids long in the genome of C. baileyi, and 793 T residues is SAAR in C. meleagridis. The contribution of T repeats to the total SSR distance between these two species is therefore (1128-793)2. The analogous calculations were performed to obtain SSR distances based on each repeat type. The distance between SAAR profiles is thus not based on a comparison of SAARs in orthologous genes.
The genetic distance between species was calculated from aligned 18S rRNA gene sequences approximately 1725 nucleotides (nt) in length. The following sequences, identified here using the NCBI accession number or CryptoDB gene ID, were included: C. muris, AF093497; C. parvum, AF093493; C. meleagridis, AF112574; C. ubiquitum, LRBP01000038; C. baileyi, AF093495; C. hominis, rrn016 and rrn022 (gene IDs from CryptoDB.org); C. andersoni AY954885 and AB089285. The distance value represents the proportion of divergent nucleotide positions.
The correlation between the abundance of each SAAR type across the eight Cryptosporidium genomes was visualized by Principal Components Analysis (PCA). The analysis was run in CANOCO (Braak & Šmilauer, 2002) and plotted with the same program. For this analysis, SAAR data were centered and standardized across species, i.e., expressed in units of standard deviation. Principal Coordinates Analysis (PCoA) was used to visualize genetic distances or SAAR distances between pairs of species calculated as described in section 2.2. above.
2.3. PCR confirmation of C. parvum repeats
Because repetitive sequences increase the probability of genome miss-assembly, we PCR amplified two SAARs to test the accuracy of the genome sequence. We picked two genes, designated cgd5_450 and cgd8_1160 in the C. parvum IOWA genome, to verity the accuracy of the C. parvum IOWA and C. hominis TU502 genome sequence at these two repetitive loci. A portion of gene comprising the SAAR of interest was amplified on an MJ Research PTC-100 thermal cycler from DNA extracted from oocysts of C. parvum isolate MD (Okhuysen et al. 2002) using Red Taq Mastermix (Sigma). The following primers were used: cgd5_450 primers 3F and 3R (GAAGGTCACACAGAAAATGGTG and GTGATGCCCACAACGTGTAA); cgd8_1160 primer 2F and 2R (AAGGACTCTAATGCCACCAC and TACTTGCTTCTTCCTGGGGAT). Amplicons were Sanger-sequenced in both directions. cgd8_1160 amplicons were cloned in a pCR2.1-TOPO plasmid using procedures recommended by Life Technologies for the TOPO TA Cloning kit.
2.4. Analysis of gene function
C. parvum orthologs of 61 C. baileyi genes with N SAARs ≥20 were identified using Companion (Steinbiss et al. 2016). Enriched cellular and molecular functions in 43 C. parvum orthologs were identified using the Gene Ontology Enrichment tool in CryptoDB.org. The Bonferroni and Benjamini-corrected significance threshold was set at 0.05. A similar analysis was performed with C. parvum genes containing uninterrupted T SAARs ≥20 residues and S SAARs of the same length.
2.5. O-glycosylation prediction
Out of the 40 C. parvum genes with the longest uninterrupted T repeats, 17 had C. muris orthologs in OrthoMCL (Li et al. 2003). The glycosylation prediction software NetOGlyc (Steentoft et al. 2013) was used to predict O-glycosylation sites on these 17 C. parvum - C. muris orthologous pairs. The program returns a glycosylation score ranging from 0 to 1 for each amino acid position. To avoid ascertainment bias, that could arise from the fact that C. muris genes were included in the analysis based on orthology to C. parvum genes with the longest T repeats, C. muris genes with the longest T repeats (≥10 residues) were selected to run the analogous O glycosylation prediction analysis. The C. parvum and C. muris amino acid sequences were concatenated in the same order. Dot plots were created for these concatenated sequences using Gepard (Krumsiek et al. 2007). A 4-residue sliding window was used to visualize conservation of sequence motifs.
3. Results
3.1. Global analyses
The genomes of seven Cryptosporidium species vary in SAAR abundance and length (Fig. 1). The two species belonging to the gastric clade included in the analysis, C. muris and C. andersoni, have a much smaller SAAR inventory. C. baileyi stands out for two reasons; its large cumulative SAAR length totaling 3383 amino acid residues, and the abundance of asparagine (N) repeats. I tested whether the cumulative length of SAARs is correlated to genome length or number of protein coding genes. Repeat length and number of protein coding genes were found to be negatively correlated (r = −0.78, n=8, p=0.02). A similar negative but non-significant correlation was also found between repeat length and genome size (r = −0.69, n=8, p=0.057). As expected, genome size and number of genes were positively correlated (r=0.95, n=8, p=3.1 × 10−4). Thus, SAAR abundance is independent of genome length, but is lower in genomes encoding more genes. The negative correlation between the number of protein-coding genes and repeat length argues against the data being biased by genome coverage. The two variables would likely be positively correlated if the differences between species would have resulted from gaps in the sequence.
Fig. 1.

Cumulative length of single-amino acid repeats by repeat type. Perfect and interrupted repeats longer than 20 amino acid identified with XSTREAM are included. Colors indicate amino acid as shown in key. C. muris (Cmur) and C. andersoni (Cand) infect the stomach and abomasum, respectively, whereas C. baileyi (Cbai) is found exclusively in birds. Two C. hominis (Chom) genomes were analyzed to provide an estimate of within-species variation. Cpar, C. parvum; Cmel, C. meleagridis; Cubi, C. ubiquitum.
The histograms in Fig. 2 compare ranked abundances of four types of uninterrupted SAARs ≥5 residues in C. parvum and C. muris. Because of the importance of O-glycosylated mucins for the interaction between Cryptosporidium parasites and the host (Barnes et al. 1998; Bhalchandra et al. 2013; Chatterjee et al. 2010; Petersen et al. 1991), this analysis is focused on T and S SAARs. Consistent with the histogram shown in Fig. 1, abundance of T and S SAARs is higher in C. parvum than in C. muris. The number of C. parvum genes with uninterrupted T and S SAARs ≥5 residues (55 and 100 genes, respectively) is 4.2 and 1.6-fold, respectively, the C. muris gene count with such repeats (13 and 64 genes, respectively). The longest uninterrupted T SAAR in C. parvum IOWA is 307 residues long and is located in gene cgd3_720, which is annotated as probable mucin with signal peptide. In contrast, in C. muris the longest uninterrupted T repeat, located in CMU_014140, is 54 residues long. CMU_014140 and cgd3_720 are not orthologous; CMU_014140 is orthologous to C. parvum cgd7_4020, which is a mucin with signal peptide, and in the literature has been referred to as GP900 (Barnes et al., 1998). The uninterrupted T SAAR in cgd7_4020 is 155 residues long, the second longest in the C. parvum IOWA genome. The rank-abundance analysis (Fig. 2) was extended to alanine and lysine repeats to assess to what extent this observation also applies to SAARs unrelated to mucins. An excess of C. parvum over C. muris genes was also found for repeats of A and L ≥5 residues long but, with 15% and 71%, the difference between the two species was smaller.
To explore whether the difference in SAAR profile originates primarily from gene loss or from differential SAAR evolution, I tabulated 19 C. parvum genes with perfect and imperfect T repeats ≥20 amino acids in length and their C. muris orthologs (Table 1). C. muris orthologs were found for 11 (57%) of these genes. This ratio deviates significantly from the genome-wide expectation based on 3363 C. parvum genes with C. muris orthologs out of 3805 annotated genes (88%; Chi2=17.3; p<0.001). For S SAARs, 11 C. parvum genes were found based on the same SAAR length threshold (Table 2); eight of these (72%) had C. muris orthologs. This proportion does not deviate significantly from the genome-wide expectation (Chi2=2.5; p=0.11). Consistent with the global analysis shown in Figs. 1, in most C. parvum - C. muris orthologous pairs, C. muris SAARs are shorter or absent. This analysis points to shortening of SAARs in gastric species, or lengthening in intestinal species, and possibly to gene loss, as causes of the smaller C. muris SAAR inventory.
Table 1.
Threonine SAAR repeat length in orthologous C. parvum and C. muris genes
| C. parvum | C. muris | annotation | ||
|---|---|---|---|---|
|
| ||||
| gene ID | SAAR length | gene ID | SAAR length* | |
| cgd1_3550 | 34 | CMU_020040 | 11 | PAN domain-containing protein |
| cgd2_3140 | 28 | CMU_005070 | 0 | hypothetical |
| cgd3_1540 | 44 | CMU_043200 | 0 | hypothetical |
| cgd3_440 | 30 | CMU_038310 | 38 | hypothetical |
| cgd3_720 | 389 | CMU_037990 | 6 | hypothetical |
| cgd4_30 | 20 | CMU_030750 | 28 | hypothetical |
| cgd4_3550 | 20 | CMU_009430 | 0 | kazal-type serine protease inhibitor domain- containing protein |
| cgd6_710 | 34 | CMU_027520 | 0 | hypothetical |
| cgd7_4020 | 293 | CMU_014140 | 170 | hypothetical |
| cgd7_4660 | 22 | CMU_014820 | 0 | hypothetical |
| cgd8_2800 | 48 | CMU_026830 | 0 | notch domain-containing protein |
minimum C. muris SAAR length = 6 residues
Table 2.
Serine SAAR repeat length in orthologous C. parvum and C. muris genes
| gene ID | SAAR length | gene ID | SAAR length | Product description |
|---|---|---|---|---|
| cgd2_3540 | 52 | CMU_003850 | 0 | PHD-finger domain-containing protein |
|
| ||||
| cgd3_1030 | 40 | CMU_042720 | 9 | hypothetical protein, conserved |
| cgd5_210 | 27 | CMU_031880 | 0 | hypothetical protein, conserved |
| cgd5_450 | 28 | CMU_041480 | 0 | zinc finger, C3HC4 type domain-containing protein |
| cgd6_4290 | 30 | CMU_020470 | 0 | hypothetical protein, conserved |
| cgd6_580 | 29 | CMU_027660 | 0 | hypothetical protein, conserved |
| cgd6_650 | 25 | CMU_027590 | 0 | 5′-AMP-activated protein kinase catalytic subunit alpha-1, putative |
| cgd6_830 | 22 | CMU_027400 | 0 | PH domain-containing protein |
minimum C. muris SAAR length = 6 residues
3.2. PCR verification of SAARs
Highly repetitive sequences are prone to sequence errors and can be difficult to assemble. To ensure that SAAR length information obtained from the annotated Cryptosporidium genomes is accurate, PCR amplicons derived from long SAAR were Sanger-sequenced. The sequences were then aligned to the homologous genomic sequences downloaded from CryptoDB.org. Although the number of SAARs that can be analyzed in this manner is small compared to the entire inventory, this analysis is adequate to detect a possible general bias. Such a bias could have resulted from the fact that different Cryptosporidium genomes were sequenced and assembled in different laboratories. A gel analysis of amplicons from genes cgd8_1160 and cgd5_450 spanning a 165-nt (55 amino acids) T SAAR and a 84-nt (28 amino acids) S SAAR, respectively, in the C. parvum IOWA genome is shown in Supplementary Fig. 1. The homologous loci were also amplified from oocyst DNA from a recent propagation of C. hominis TU502 isolate. The length of the cgd8_1160 amplicons in IOWA and TU502 predicted by the genome sequence is 457 nt and 411 nt, respectively. For cgd5_450, the genomic sequences predict amplicon sizes of 726 nt and 681 nt in IOWA and TU502, respectively. As seen in Supplementary Fig. 1, the gel analysis is consistent with the amplicon sizes predicted by the genome. Further confirming the accuracy of the C. parvum IOWA and C. hominis TU502 genome sequence at these loci, the cgd5_450 amplicon sequence encodes a 84-nt repeat encoding 28 S residues in IOWA and a 33-nt sequences encoding 11 S residues in TU502. Multiple attempts to sequence the cgd8_1160 amplicons failed, including attempts to sequence cgd8_1160 amplicons cloned in a pCR-TOPO plasmid. This observation suggests that the 165-nt T SAAR interferes with the sequencing reaction. The sequences of the cgd5_450 amplicon included a second, shorter S SAAR located 219 nt downstream. In the IOWA cgd5_450 amplicon, the repeat was identical to that found in the IOWA genome. In contrast, in C. hominis isolate TU502 a length polymorphism was detected. The sequence of the amplicon was two S residues longer than the reference TU502 genome sequenced in 2004 (Xu et al. 2004) and one residue longer than the newly sequenced TU502 genome from 2012 (Ifeonu et al. 2016). In two additional C. hominis genome sequences (isolates UKH1 and 37999) the repeat sequences were identical as found in the amplicon sequenced here. These results indicate that no systematic bias is present in SAAR sequences in the genomes. Minor differences in SAAR length in C. hominis are consistent with intra-species sequence polymorphism (Ifeonu et al. 2016; Widmer et al. 2012).
3.3. Correlation between genetic distance and SAAR distance
The phylogeny of the genus Cryptosporidium is defined by two distinct clades comprising species infecting the stomach and species infecting the intestine and colon, respectively (Slapeta, 2013). Within the latter group, the avian parasite C. baileyi stands out for its divergence from species infecting mammals. The genome-wide SAAR analysis shown in Fig. 1 suggests that genetic distance and SAAR divergence are correlated. To further explore this apparent trend, the genetic and SAAR distances between eight genomes from seven Cryptosporidium species were quantified and visualized on separate PCoA plots (Fig. 3 A, B). Both plots reveal a similar topology; C. baileyi does not cluster with any other species, whereas gastric and intestinal species fall into two distinct groups. Based on the phylogeny of the genus Cryptosporidium, the 18S topology was expected. The similar PCoA topology obtained with the SAAR data, indicates that divergence in SAAR length to some extent parallels divergence between species. However, the correlation between the two distance metrics (Fig. 3 C, D) shows that pairwise SAAR distances between C. baileyi and the other species exceeds by a factor >3 what would be expected from the genetic distances. A numerical evaluation of pairwise 18S and SAAR distances confirms this observation. The mean 18S distance between C. baileyi and any other species equals 0.048 (n=8), which is 1.2 times higher than the mean 18S distance for all other (excluding C. baileyi) pairwise comparisons (n=28) of 0.040. In contrast, the analogous calculation for pairwise SAAR distances (SSR distance as defined in Materials and Methods) shows that the average SAAR distance between C. baileyi and any other species (2.88 x 106, n=7) is 4.3 times higher than the mean distance for all other pairwise comparisons (6.7 x 105, n=21). Compared to the 18S distances, C. baileyi’s SAAR distances to other species are thus in average more than three times larger. This observation indicates that C. baileyi SAARs have diverged at a faster rate than expected from the 18S genetic distance, suggesting that in C. baileyi selection favors longer repeats. Conversely, the data could also indicate selection for shorter SAARs in species other than C. baileyi. Supplementary Fig. 2 shows that the excess SAAR distance is driven primarily by N SAARs, as expected from Fig. 1. Whereas for T, Q and S genetic distance and SSR distance are strongly correlated, this is not the case for N, D and E SAARs.
Fig. 3.

Correlation between genetic distance and SAAR distance. A. PCoA visualizing pairwise distances between 9 18S sequence pertaining to 7 Cryptosporidium species. The genetic distances were calculated from the aligned 18S rRNA sequence approximately 1725 nt in length. Intestinal species and gastric species are represented with green and brown data points, respectively. To provide a measure of intra-species distance, two C. andersoni (Cand) and two C. hominis (Chom) sequences were included. The accession numbers of the sequences are listed in Materials and Methods. B. Analogous PCoA analysis as shown in panel A but based on pairwise distance between SAAR profiles. The distances were calculated using the SSR metric as described in Materials and Methods. As represented by modified symbols, two C. hominis data sets are included to visualize intra-species distance. C, D. Correlation analysis of 18S and SAAR data shown in A and B reveals high divergence of C. baileyi SAARs. Colored symbols in panel C (n=12) indicate distance between a gastric species, C. muris or C. andersoni, and one of the six genomes from intestinal species. The same data points are depicted in panel D, but in this panel colored symbols (n=7) represent distances between C. baileyi the other species. Crossed, semi-filled and hourglass symbols show replicate distances obtained from the analysis of two C. hominis genomes (TU502 and Udea).
To gain further insight into evolutionary forces shaping Cryptosporidium SAARs, I analyzed the correlation between SAAR length across seven Cryptosporidium species (Supplementary Table S2). A Principal Components Analysis (PCA) focused on SAAR types was used to visualize the data (Fig. 4). As indicated in the biplot by a small angle between the S and T SAAR arrows, the two repeat types found in mucins are correlated across species; longer S SAARs tend to occur in species with longer T repeats. In contrast T and S repeat length across species are uncorrelated with N and D SAARs as indicated by the ~90° between T/S and N/D arrows. As expected from Fig. 1, the N SAAR arrow points in the direction of C. baileyi and in the opposite direction of C. meleagridis.
Fig 4.

Principal Components Analysis biplot representing SAAR length across species. SAAR arrows point in the direction of the steepest increase in SAAR length. The smaller the angle between arrows, the better the correlation between SAAR types across species. Projection of species symbols onto the arrow or their extension approximates the cumulative SAAR abundance in each species. Distance between species symbols approximates difference in SAAR profile, but the species topology in relation to the SAAR arrows only approximately corresponds to the histogram shown in Fig. 1 because the PCA is focused on SAARs. Horizontal and vertical ordination axis explain 41% and 25% of variance, respectively.
3.4. Gene-specific analysis: comparison of C. parvum - C. muris orthologs containing long T SAARs
As illustrated in Fig. 2, the difference in abundance and length of uninterrupted T and S in intestinal and gastric species is striking. For instance the 10 longest T SAAR in C. parvum together total 840 residues, whereas in C. muris the total is 171 residues. To better understand the cause of the differences in amino acid sequence underlying this observation, 17 C. parvum genes with the longest T SAARs were aligned to orthologous C. muris genes. The concatenated amino acid sequence of these 17 pairs of orthologs was 32,766 residues long in C. parvum but only 22,515 residues in C. muris. A dot plot based on a 4 amino acid sliding window revealed one of the reasons for this difference. A ~650-amino motif is repeated eight time in C. parvum gene cgd3_720, but only once in C. muris homolog CMU_037990 (Fig. 5). As a result, C. parvum gene cgd3_720 encodes 11696 amino acids, but the C. muris ortholog is only 3523 amino acids long, i.e., 30% of the homologous C. parvum gene. As shown by the T frequency histograms in Fig. 5, many T-rich motifs present in C. parvum are absent from the C. muris homologs. Many of these repeats are predicted to be heavily O-glycosylated as shown by the O-glyc histograms in Fig. 5. To avoid ascertainment bias, the symmetric analysis starting with the 17 C. muris genes with the longest T SAARs was also performed (Supplementary Fig. 3). Even though the set of orthologous genes was different, this analysis gave similar results.
Fig. 5.

Dotplot analysis of threonine-rich C. parvum - C. muris homologs. Seventeen C. parvum genes with the longest uninterrupted T SAARs, and for which C. muris homologs were found, were concatenated and compared to the homologous C. muris sequences arranged in the same order. Dots mark the coordinates of conserved 4-amino acid motifs present in both species. Long regions of sequence conservation between the species appear as an interrupted diagonal line. C. parvum gene cgd3_720 (coordinates 1941 - 13618 on the x axis) indicated with the horizontal grey bar is much longer than the C. muris homolog CMU_037990 due to the presence of a T-rich ~650 amino acid repeat present multiple times in C. parvum but only once in C. muris. The length of the concatenated C. parvum and C. muris amino acid sequences are indicated top right and bottom left, respectively. The number of Ts in a 10-residue running window is indicated for each species on the plots labelled “T10”. O-glycosylation prediction scores on a scale of 0-1 are shown in the lowermost and rightmost plots labelled “O-glyc”.
3.5. Functional analyses
A striking feature of the C. baileyi SAAR profile is the abundance of N repeats, and to a lesser extent repeats of aspartic acid (D) (Figs. 1, 5). Added together, N SAARs represents 1692 residues out of 3383 (50%) residues of the entire C. baileyi SAAR repertoire. In comparison, in the C. muris genome the cumulative length of SAARs is 730 residues. Thus, added together, C. baileyi N SAARs are 2.3 times longer than all C. muris SAARs together. This observation prompted me to investigate the function of C. baileyi genes with long N SAARs. For this analysis, the 61 C. baileyi genes which encode uninterrupted N SAARs ≥20 in length were selected. Because the C. baileyi genome has not been annotated, the C. baileyi genes were transformed by orthology to C. parvum using OrthoMCL (Li et al. 2003). Orthologs were identified for 43 (70%) genes. An analysis of these genes for significantly enriched annotations was then performed. Significantly enriched molecular functions included “transcription factor activity”, “sequence-specific DNA binding” and “nucleic acid binding transcription factor activity”. Fifteen out of 19 enriched cellular functions were related to regulation, such as “regulation of nucleic acid-templated transcription”, “regulation of RNA metabolic process”, “regulation of RNA biosynthetic process”, “regulation of transcription, DNA-templated” or “regulation of gene expression” (Supplementary Table S3).
Analogous analyses were run with C. parvum T and S SAARs ≥20 and ≥15 residues, respectively. T SAARs were picked because they are by far the most abundant in C. parvum (see Supplementary Table S2). As for T SAARs, S SAARs are important for their potential to form mucins and play an important role in the interaction of the parasite with the host. Among the 17 C. parvum genes with T homopolymers ≥20 residues in length, 12 genes (70%) have a signal peptide. Among these 12 genes, 7 (41%) also have transmembrane domains. The genome-wide proportion of C. parvum genes with signal peptide is 683/3886 (18%) and with signal peptide and transmembrane domains is 379/3886 (10%). The association between the presence of T SAAR and signal peptide is significant (Chi square = 33.1, p<0.0001). The same holds for the proportion of genes with signal peptide and transmembrane domain (Chi square = 19.2, p<0.0001). A functional enrichment analysis for the 17 C. parvum genes revealed a single significantly enriched GO term (0005576, extracellular region; Bonferroni-corrected p = 0.0044). No enriched molecular function terms were identified. In the analysis of C. parvum S SAARs, since only eight genes with uninterrupted repeats of ≥20 residues were found, 18 C. parvum genes with S homopolymers ≥15 residues were analyzed. Two of the 18 genes have a signal peptide and the same two genes also possess ≥1 trans-membrane domains. This number is not significantly different from the genome-wide proportion p=0.47 and p=0.8, respectively. However, ten significantly enriched metabolic processes and seven significantly enriched molecular functions were identified (Supplementary Table S4). Consistent with the predicted low proportion of proteins with signal peptide/trans-membrane domain in this group, all enriched metabolic processes are related to cytoplasmic pathways such as oligosaccharide/carbohydrate metabolism.
3.6. N SAARs in Eimeria species infecting birds
The difference between the repeat profile in C. baileyi and in non-avian intestinal Cryptosporidium species, led me to investigate whether N SAAR abundance, or SAAR abundance in general, is associated with species adapted to birds. As no other genome from avian Cryptosporidium species other than C. baileyi TAMU-09Q1 has been sequenced, I compared four Eimeria genomes downloaded from the ToxoDB database (Gajria et al. 2008); E. falciformis, E. tenella, E. maxima and E. necatrix. E. falciformis is a parasite of mice and the only non-avian species for which the genome was available. The results (Supplementary Fig. 4) show that cumulative SAAR length in the three species infecting birds exceeds that of the mouse parasite E. falciformis. However, N SAARs are not overrepresented, indicating that N SAARs are unlikely to represent an adaptation of enteric Apicomplexa to birds. Similarly, comparing P. gallinaceum, an avian malaria parasite to P. falciparum did not reveal a higher abundance of N SAARs in the former species (not shown). To the contrary, P. falciparum N SAARs total 9135 residues, or 91% of the total length of SAARs identified using XSTREAM as described in Material and Methods. In P. gallinaceum, N SAARs were much less abundant (3909 residues) and represented only 66% of the cumulative SAAR length (data not shown).
Discussion
The primary motivation for this study was to extend an initial observation showing that in the newly sequenced genome of C. muris S and T SAARs are fewer and shorter compared to C. parvum and C. hominis. S and T-rich proteins are potentially extensively glycosylated, forming mucins which play an important role in the interaction between host and Cryptosporidium parasites. Mucins may also be recognized by the immune response (Barnes et al. 1998; Cevallos et al. 2000; Chatterjee et al. 2010; O’Connor et al. 2009). Intra-species polymorphism in SAAR length has been attributed to immune-mediated selection for antigenic diversity (Gatei et al. 2007). The low abundance of mucins in gastric Cryptosporidium species, confirmed here by the analysis of the C. andersoni genome, led to the hypothesis that the parasite’s interaction with the host in the stomach and the small intestine differs, and that SAAR abundance in different species reflects adaptation to different host environments. This putative adaptive process appears to have primarily led to a shortening of C. muris SAARs in relation to C. parvum, as opposed to selective C. muris gene loss. However, the results from the analysis of C. parvum - C. muris orthologs (see Tables 1, 2) cannot exclude that C. muris genes orthologous to C. parvum genes with long T SAARs may have been lost following the branching into a gastric and an intestinal Cryptosporidium clade. To investigate a possible adaptive process, the SAAR analysis was expanded to other sequenced Cryptosporidium species and to other SAAR types, as new genomes sequence became available. Stark differences in SAAR profile between related species is not unique to the genus Cryptosporidium. Repeats of N are very abundant in P. falciparum and P. reichnowi, but are rare in other Plasmodium species (Muralidharan & Goldberg, 2013; Muralidharan et al. 2011).
The functional significance of the small SAAR inventory in C. muris and C. andersoni is unknown. Since extracellular stages of gastric species are exposed to acidity in the stomach and abomasum, it is conceivable that a reduced SAAR inventory reflects adaptation to the extreme acidic environment. The mechanism selecting against long SAARs in these species, if it exists, remains unknown. Weighing the relative importance of SAAR adaptation vs. random contraction/expansion during divergence between species requires the analysis of additional genomes. As no other apicomplexan protozoa are know to colonize the stomach, extending SAAR analyses to other genera inhabiting the same environment is not feasible, limiting our ability to identify mechanisms driving the evolution of SAARs in these species. In the genus Helicobacter, the species H. pylori infects the stomach, whereas other species multiply in other organs. The analysis of the genome of H. pylori and H. cineadi, an enterohepatic species, revealed no difference in SAAR abundance (data not shown). This comparison is far from ideal, because SAARs in prokaryotes are rare (Sim & Creamer, 2002) and may be subject to very different evolutionary constraints. Also, in contrast to Cryptosporidium, Helicobacter species are primarily extracellular. Moreover, H. pylori may not be directly exposed to stomach acidity because it is known to neutralize the pH of the mucus in which the bacteria multiply. In spite of these differences, the limited analysis of Helicobacter suggest that SAAR contraction is not a feature shared by gastric pathogens.
In addition to revealing clear differences in the SAAR profile of gastric and intestinal Cryptosporidium species, the analyses presented here highlight unique and previously unknown features in the genome of the avian parasite C. baileyi. In comparison to other intestinal species, the C. baileyi genome has accumulated a large number of proteins with long N SAARs, and to a lesser extent, D SAARs. Three observations are consistent with an adaptive significance of the expansion of these types of SAARs; 1) repeats of other amino acids have not expanded in C. baileyi; 2) the divergence between the C. baileyi SAAR profile and that of other species exceeds the genetic distance estimated from the 18S rRNA gene (Fig. 3, Supplementary Fig. 2); 3) in terms of function, N-rich C. baileyi proteins are significantly enriched for regulatory functions (Supplementary Table 3). Interestingly, the latter observation also applies to the yeast proteome (Young et al. 2000). Although Eimeria species infecting birds are more SAAR-rich (Supplementary Fig. 4), the difference between avian Eimeria and the murine parasite E. falciformis lies primarily in the abundance of glutamine (Q) repeats, not N repeats as in C. baileyi. A preliminary analysis of GO term enrichment in C. baileyi genes with N SAARs and E. tenella genes with Q SAARs, did however uncover a statistically significant number of shared enriched GO terms (data not shown). Arguing against a simple adaptive model driving N SAAR expansion in avian parasites is the observation that the SAAR profile of C. meleagridis, a species with a wide host range that includes birds (Morgan et al. 2001; Slavin, 1955), clusters with the intestinal species, rather than with C. baileyi. However, as only one avian Cryptosporidium genome and only one non-avian Eimeria genome have been sequenced to date, firmer conclusions on the adaptive significance of sequence repeats, and on the evolution of C. baileyi SAARs in particular, will require additional genome sequences. The analysis of additional genomes of C. baileyi and other Cryptosporidium genotypes infecting birds (Holubova et al. 2016) is clearly warranted. In the meantime, selected loci with long SAARs in C. baileyi can be amplified from other avian Cryptosporidium isolates and sequenced to assess whether N SAARs are conserved among avian Cryptosporidium species and genotypes.
The genome of P. falciparum is rich in amino acid and nucleotide repeats. Repeats of N are the most common type (Singh et al. 2004). Several interesting examples of malaria phenotypes linked to repeat length or repeat presence/absence have recently been reported (McHugh et al., 2015; Davies et al., 2016; Cutts et al., 2017). Some authors have suggested that SAARs in P. falciparum have an adaptive significance, such as favoring immune evasion (Hughes, 2004). However, this view is not universally shared (DePristo et al. 2006). Because of the tendency of SAARs, microsatellites and minisatellites to contract or expand, it has been proposed that variation in intragenic repeat length provides antigenic and functional diversity that, in fungi and other pathogens, allows rapid adaptation to the host environment and immune evasion (Ferreira et al. 2003; Verstrepen et al. 2005). The observations by Verstrepen et al. also support the view that intragenic repeat number polymorphism affects cell surface properties in a quantitative manner. Another yeast phenotype modulated by the length of a T and S-rich mucin-like protein has been described (Fidalgo et al., 2006). The length of a putatively O-glycosylated surface protein was found to be linked to cell buoyancy. The significant association in C. parvum proteins of a signal peptide or transmembrane domain with the presence of T SAARs is consistent with such repeats mediating the interaction between the parasite and the host cell. N repeats have attracted interest because of the potential of N-rich proteins to form amyloid and prions (An et al. 2016). To what extent such structures may occur in C. baileyi, and what evolutionary significance they might have, is unknown. A practical implication of these observations is the possibility that depletion of intracellular N may inhibit C. baileyi multiplication without impacting the host. This is supported by reports on anti-tumor activity of asparaginase in mice (Capizzi et al. 1970) and the in clinic (Koprivnikar et al. 2017). These observation raise the possibility that asparaginase may be useful to control cryptosporidiosis in birds. P. berghei without functional asparaginase displayed a delayed development in the mouse and mosquito (Nagaraj et al. 2015).
Other types of SAARs which have attracted attention due to their biological properties are Q repeats. PolyQ domains are found in eukaryotic transcription factors (Escher et al. 2000). An analysis of human protein interactomes revealed that proteins with Q SAARs engage in more interactions with other proteins compared to proteins lacking such motifs. Expansion of Q repeats results in abnormal interactions, and may favor protein aggregation. (Schaefer et al. 2012). As shown above (Figs. 1, 4), Q SAARs are more abundant in intestinal Cryptosporidium species, a species distribution which is almost completely opposite of that of D and N SAARs (Fig. 4). This observation again raises the possibility of SAAR profiles being shaped by selective pressure, possibly related to host range.
To provide an estimate of variability of the SAAR profile within species, two C. hominis genomes were included in the analyses. The close similarity between the two C. hominis SAAR profiles (Figs. 1, 3, 4) indicates that the sequencing strategy (Supplementary Table S1) did not majorly impact the SAAR profile. The potential for sequencing and assembly artefacts to generate errors in repetitive sequences is a concern, particularly with short read sequencing. The two C. hominis projects used different sequencers and different assembly programs, MaSuRCA and NEWBLER, respectively. The data provide no evidence that SAAR polymorphism between species can be explained by the sequencing and assembly strategy. Moreover, Supplementary Table S1 shows no association between the predominance of short SAARs in gastric species and the use of a particular sequencing platform. The presence of different T and S codons within Cryptosporidium repetitive sequences would also be expected to reduce the likelihood of misassembly.
In conclusion, a global analysis of SAARs in eight genomes pertaining to seven Cryptosporidium species has revealed different SAAR signatures in species infecting the stomach, species infecting the small intestine and colon and the only sequenced avian species. The rapidly expanding volume of apicomplexan sequence data will support more comprehensive analyses and enable statistical tests which are not supported by the small number of genomes currently sequenced. More extensive analyses of rapidly evolving sequences such as simple repeats, micro- and minisatellites are expected to converge on explaining specific Cryptosporidium phenotypes, like host range and virulence.
Supplementary Material
Fig. S1. PCR amplification of oocyst DNA confirms the length of two SAARs in C. parvum and C. hominis. DNA from C. parvum isolate IOWA and C. hominis isolate TU502 oocysts was analyzed. The two SAARs are located in the genes designated using C. parvum gene IDs as cgd8_1160 and cgd5_450. The genome sequence predicts cgd8_1160 amplicons of 411 nt and 457 nt, and cgd5_450 amplicons of 681 nt and 726 nt in TU502 and IOWA, respectively. Lane 1, 100 bp ladder; lanes 4, 7, negative PCR controls. The size of the 1 kbp, 500 bp and 100 bp band is indicated.
Fig. S2. Correlation between genetic distance and 18S distance by amino acid. Triangles denote distances between C. baileyi and any other species.
Fig. S3. Dotplot analysis of threonine-rich C. muris - C. parvum homologs. Seventeen C. muris genes with the longest uninterrupted T SAARs, and for which C. parvum homologs were found, were concatenated and compared to the homologous C. parvum sequences as described in Fig. 5. Horizontal axis, C. parvum; vertical axis, C. muris.
Fig. S4. Cumulative length of single-amino-acid repeats by repeat type in four Eimeria genomes. Perfect and imperfect repeats longer than 20 amino acids were identified with XSTREAM. Colors indicate amino acid as shown in key. All species except E. falciformis infect birds. Left, raw amino acid counts; right, counts normalized to 107 nt coding sequence. Efal, E. falciformis; Eten, E. tenella; Emax, E. maxima; E. nec, E. necatrix.
Table S1. Summary of genome sequencing projects.
Table S2. Cumulative SAAR length for 7 species and eight genomes calculated by amino acid using XSTREAM.
Table S3. Enriched functions in C. baileyi N-rich proteins.
Table S4. Enriched functions in C. parvum T and S-rich proteins.
Key findings.
The analysis of single-amino-acid repeats (SAARs) in Cryptosporidium genomes revealed a striking difference between gastric and intestinal species.
Cumulative SAAR length in gastric species is only about 25% of cumulative SAAR length in intestinal species.
The SAAR profile in the genome of the avian parasite C. baileyi stands out due to the presence of long asparagine repeats.
The evolutionary significance of the different SAAR profiles in the genus Cryptosporidium is unknown.
Acknowledgments
I thank Julia Dilo for expert assistance with PCR analyses and Alex Grinberg for valuable comments and suggestions. Saul Tzipori and Donna Akiyoshi generously provided C. hominis oocysts. Refaat Ras analyzed the Helicobacter genomes.
Financial support
Financial support from the National Institute of Allergy and Infectious Diseases (grants R21AI125891, R15AI122152) is gratefully acknowledged.
Abbreviations
- SAAR
single-amino-acid repeat
- nt
nucleotide
- Cpar
Cryptosporidium parvum
- Chom
Cryptosporidium hominis
- Cmel
Cryptosporidium meleagridis
- Cubi
Cryptosporidium ubiquitum
- Cand
Cryptosporidium andersoni
- Cmur
Cryptosporidium muris
- Cbai
Cryptosporidium baileyi
- bgd
background
- GO
gene ontology
- PCoA
principal coordinates analysis
- PCA
principal component analysis
References
- Abrahamsen MS, Templeton TJ, Enomoto S, Abrahante JE, Zhu G, Lancto CA, Deng M, Liu C, Widmer G, Tzipori S, Buck GA, Xu P, Bankier AT, Dear PH, Konfortov BA, Spriggs HF, Iyer L, Anantharaman V, Aravind L, Kapur V. Complete genome sequence of the apicomplexan, Cryptosporidium parvum. Science. 2004;304:441–445. doi: 10.1126/science.1094786. [DOI] [PubMed] [Google Scholar]
- Afgan E, Baker D, van den Beek M, Blankenberg D, Bouvier D, Cech M, Chilton J, Clements D, Coraor N, Eberhard C, Gruning B, Guerler A, Hillman-Jackson J, Von Kuster G, Rasche E, Soranzo N, Turaga N, Taylor J, Nekrutenko A, Goecks J. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Research. 2016;44:W3–W10. doi: 10.1093/nar/gkw343. [DOI] [PMC free article] [PubMed] [Google Scholar]
- An L, Fitzpatrick D, Harrison PM. Emergence and evolution of yeast prion and prion-like proteins. BMC Evolution Biol. 2016;16:24. doi: 10.1186/s12862-016-0594-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Aravind L, Iyer LM, Wellems TE, Miller LH. Plasmodium biology: genomic gleanings. Cell. 2003;115:771–785. doi: 10.1016/s0092-8674(03)01023-7. [DOI] [PubMed] [Google Scholar]
- Barnes DA, Bonnin A, Huang JX, Gousset L, Wu J, Gut J, Doyle P, Dubremetz JF, Ward H, Petersen C. A novel multi-domain mucin-like glycoprotein of Cryptosporidium parvum mediates invasion. Molecular Biochemical Parasitology. 1998;96:93–110. doi: 10.1016/s0166-6851(98)00119-4. [DOI] [PubMed] [Google Scholar]
- Bhalchandra S, Ludington J, Coppens I, Ward HD. Identification and characterization of Cryptosporidium parvum Clec, a novel C-type lectin domain-containing mucin-like glycoprotein. Infection and Immunity. 2013;81:3356–3365. doi: 10.1128/IAI.00436-13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Braak Ct, Šmilauer P. CANOCO reference manual and CanoDraw for Windows user’s guide: software for canonical community ordination (version 4.5) Microcomputer Power; Ithaca, New York: 2002. [Google Scholar]
- Capizzi RL, Bertino JR, Handschumacher RE. L-asparaginase. Annual Revue in Medicine. 1970;21:433–444. doi: 10.1146/annurev.me.21.020170.002245. [DOI] [PubMed] [Google Scholar]
- Cevallos AM, Bhat N, Verdon R, Hamer DH, Stein B, Tzipori S, Pereira ME, Keusch GT, Ward HD. Mediation of Cryptosporidium parvum infection in vitro by mucin-like glycoproteins defined by a neutralizing monoclonal antibody. Infection and Immunity. 2000;68:5167–5175. doi: 10.1128/iai.68.9.5167-5175.2000. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chatterjee A, Banerjee S, Steffen M, O’Connor RM, Ward HD, Robbins PW, Samuelson J. Evidence for mucin-like glycoproteins that tether sporozoites of Cryptosporidium parvum to the inner surface of the oocyst wall. Eukaryotic Cell. 2010;9:84–96. doi: 10.1128/EC.00288-09. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cutts EE, Laasch N, Reiter DM, Trenker R, Slater LM, Stansfeld PJ, Vakonakis I. Structural analysis of P. falciparum KAHRP and PfEMP1 complexes with host erythrocyte spectrin suggests a model for cytoadherent knob protrusions. PLoS Pathogens. 2017;13(8):e1006552. doi: 10.1371/journal.ppat.1006552. [DOI] [PMC free article] [PubMed] [Google Scholar]
- DePristo MA, Zilversmit MM, Hartl DL. On the abundance, amino acid composition, and evolutionary dynamics of low-complexity regions in proteins. Gene. 2006;378:19–30. doi: 10.1016/j.gene.2006.03.023. [DOI] [PubMed] [Google Scholar]
- Escher D, Bodmer-Glavas M, Barberis A, Schaffner W. Conservation of glutamine-rich transactivation function between yeast and humans. Molecular Cell Biol. 2000;20:2774–2782. doi: 10.1128/mcb.20.8.2774-2782.2000. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Feng Y, Tiao N, Li N, Hlavsa M, Xiao L. Multilocus sequence typing of an emerging Cryptosporidium hominis subtype in the United States. Journal of Clinical Microbiology. 2014;52:524–530. doi: 10.1128/JCM.02973-13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ferreira MU, Ribeiro WL, Tonon AP, Kawamoto F, Rich SM. Sequence diversity and evolution of the malaria vaccine candidate merozoite surface protein-1 (MSP-1) of Plasmodium falciparum. Gene. 2003;304:65–75. doi: 10.1016/s0378-1119(02)01180-0. [DOI] [PubMed] [Google Scholar]
- Fidalgo M, Barrales RR, Ibeas JI, Jimenez J. Adaptive evolution by mutations in the FLO11 gene. Proceedings of the National Academy of Sciences of the USA. 2006;103:11228–112233. doi: 10.1073/pnas.0601713103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Frugier M, Bour T, Ayach M, Santos MA, Rudinger-Thirion J, Theobald-Dietrich A, Pizzi E. Low Complexity Regions behave as tRNA sponges to help co-translational folding of plasmodial proteins. FEBS Letters. 2010;584:448–454. doi: 10.1016/j.febslet.2009.11.004. [DOI] [PubMed] [Google Scholar]
- Gajria B, Bahl A, Brestelli J, Dommer J, Fischer S, Gao X, Heiges M, Iodice J, Kissinger JC, Mackey AJ, Pinney DF, Roos DS, Stoeckert CJ, Jr, Wang H, Brunk BP. ToxoDB: an integrated Toxoplasma gondii database resource. Nucleic Acids Research. 2008;36:D553–556. doi: 10.1093/nar/gkm981. doi: gkm981 [pii] 10.1093/nar/gkm981. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Garcia RJ, Hayman DT. Origin of a major infectious disease in vertebrates: The timing of Cryptosporidium evolution and its hosts. Parasitology. 2016;143:1683–1690. doi: 10.1017/s0031182016001323. [DOI] [PubMed] [Google Scholar]
- Gatei W, Das P, Dutta P, Sen A, Cama V, Lal AA, Xiao L. Multilocus sequence typing and genetic structure of Cryptosporidium hominis from children in Kolkata, India. Infection, Genetics and Evolution. 2007;7:197–205. doi: 10.1016/j.meegid.2006.08.006. [DOI] [PubMed] [Google Scholar]
- Heiges M, Wang H, Robinson E, Aurrecoechea C, Gao X, Kaluskar N, Rhodes P, Wang S, He CZ, Su Y, Miller J, Kraemer E, Kissinger JC. CryptoDB: a Cryptosporidium bioinformatics resource update. Nucleic Acids Research. 2006;34:D419–422. doi: 10.1093/nar/gkj078. doi: 34/suppl_1/D419 [pii] 10.1093/nar/gkj078. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Holubova N, Sak B, Horcickova M, Hlaskova L, Kvetonova D, Menchaca S, McEvoy J, Kvac M. Cryptosporidium avium n. sp. (Apicomplexa: Cryptosporidiidae) in birds. Parasitology Research. 2016;115:2243–2251. doi: 10.1007/s00436-016-4967-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hughes AL. The evolution of amino acid repeat arrays in Plasmodium and other organisms. Journal of Molecular Evolution. 2004;59:528–535. doi: 10.1007/s00239-004-2645-4. [DOI] [PubMed] [Google Scholar]
- Ifeonu OO, Chibucos MC, Orvis J, Su Q, Elwin K, Guo F, Zhang H, Xiao L, Sun M, Chalmers RM, Fraser CM, Zhu G, Kissinger JC, Widmer G, Silva JC. Annotated draft genome sequences of three species of Cryptosporidium: Cryptosporidium meleagridis isolate UKMEL1, C. baileyi isolate TAMU-09Q1 and C. hominis isolates TU502_2012 and UKH1. Pathogy of Disease. 2016;74 doi: 10.1093/femspd/ftw080. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Isaza JP, Galvan AL, Polanco V, Huang B, Matveyev AV, Serrano MG, Manque P, Buck GA, Alzate JF. Revisiting the reference genomes of human pathogenic Cryptosporidium species: reannotation of C. parvum Iowa and a new C. hominis reference. Science Reports. 2015;5:16324. doi: 10.1038/srep16324. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Johansson ME, Sjovall H, Hansson GC. The gastrointestinal mucus system in health and disease. Nature Review in Gastroenterology and Hepatology. 2013;10:352–361. doi: 10.1038/nrgastro.2013.35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Karlin S, Brocchieri L, Bergman A, Mrazek J, Gentles AJ. Amino acid runs in eukaryotic proteomes and disease associations. Proceedings of the National Academy of Sciences U S A. 2002;99:333–338. doi: 10.1073/pnas.012608599. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Koprivnikar J, McCloskey J, Faderl S. Safety, efficacy, and clinical utility of asparaginase in the treatment of adult patients with acute lymphoblastic leukemia. Oncology Targets and Therapy. 2017;10:1413–1422. doi: 10.2147/ott.s106810. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Krumsiek J, Arnold R, Rattei T. Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics. 2007;23:1026–1028. doi: 10.1093/bioinformatics/btm039. [DOI] [PubMed] [Google Scholar]
- Li L, Stoeckert CJ, Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research. 2003;13:2178–2189. doi: 10.1101/gr.1224503. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mallon ME, MacLeod A, Wastling JM, Smith H, Tait A. Multilocus genotyping of Cryptosporidium parvum Type 2: population genetics and sub-structuring. Infection Genetics and Evolution. 2003;3:207–218. doi: 10.1016/s1567-1348(03)00089-3. [DOI] [PubMed] [Google Scholar]
- McHugh E, Batinovic S, Hanssen E, McMillan PJ, Kenny S, Griffin MD, Crawford S, Trenholme KR, Gardiner DL, Dixon MW, Tilley L. A repeat sequence domain of the ring-exported protein-1 of Plasmodium falciparum controls export machinery architecture and virulence protein trafficking. Molecular Microbiology. 2015;98:1101–1114. doi: 10.1111/mmi.13201. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Morgan UM, Monis PT, Xiao L, Limor J, Sulaiman I, Raidal S, O’Donoghue P, Gasser R, Murray A, Fayer R, Blagburn BL, Lal AA, Thompson RC. Molecular and phylogenetic characterisation of Cryptosporidium from birds. International Journal for Parasitology. 2001;31:289–296. doi: 10.1016/s0020-7519(00)00164-8. [DOI] [PubMed] [Google Scholar]
- Muralidharan V, Goldberg DE. Asparagine repeats in Plasmodium falciparum proteins: good for nothing? PLoS Pathology. 2013;9:e1003488. doi: 10.1371/journal.ppat.1003488. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Muralidharan V, Oksman A, Iwamoto M, Wandless TJ, Goldberg DE. Asparagine repeat function in a Plasmodium falciparum protein assessed via a regulatable fluorescent affinity tag. Proceedings of the National Academy of Sciences U S A. 2011;108:4411–4416. doi: 10.1073/pnas.1018449108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nagaraj VA, Mukhi D, Sathishkumar V, Subramani PA, Ghosh SK, Pandey RR, Shetty MC, Padmanaban G. Asparagine requirement in Plasmodium berghei as a target to prevent malaria transmission and liver infections. Nature Communications. 2015;6:8775. doi: 10.1038/ncomms9775. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Newman AM, Cooper JB. XSTREAM: a practical algorithm for identification and architecture modeling of tandem repeats in protein sequences. BMC Bioinformatics. 2007;8:382. doi: 10.1186/1471-2105-8-382. [DOI] [PMC free article] [PubMed] [Google Scholar]
- O’Connor RM, Burns PB, Ha-Ngoc T, Scarpato K, Khan W, Kang G, Ward H. Polymorphic mucin antigens CpMuc4 and CpMuc5 are integral to Cryptosporidium parvum infection in vitro. Eukaryotic Cell. 2009;8:461–469. doi: 10.1128/EC.00305-08. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Okhuysen PC, Rich SM, Chappell CL, Grimes KA, Widmer G, Feng XC, Tzipori S. Infectivity of a Cryptosporidium parvum isolate of cervine origin for healthy adults and interferon-gamma knockout mice. Journal of Infectious Diseases. 2002;185:1320–1325. doi: 10.1086/340132. [DOI] [PubMed] [Google Scholar]
- Paulson HL. Toward an understanding of polyglutamine neurodegeneration. Brain Pathology. 2000;10:293–299. doi: 10.1111/j.1750-3639.2000.tb00263.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Peakall R, Smouse PE. GenAlEx 6.5: genetic analysis in Excel. Population genetic software for teaching and research–an update. Bioinformatics. 2012;28:2537–2539. doi: 10.1093/bioinformatics/bts460. doi: bts460 [pii] 10.1093/bioinformatics/bts460. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Petersen C, Gut J, Nelson RG, Leech JH. Characterization of a Cryptosporidium parvum sporozoite glycoprotein. Journal of Protozoology. 1991;38:20S–21S. [PubMed] [Google Scholar]
- Schaefer MH, Wanker EE, Andrade-Navarro MA. Evolution and function of CAG/polyglutamine repeats in protein-protein interaction networks. Nucleic Acids Research. 2012;40:4273–4287. doi: 10.1093/nar/gks011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shimada MK, Sanbonmatsu R, Yamaguchi-Kabata Y, Yamasaki C, Suzuki Y, Chakraborty R, Gojobori T, Imanishi T. Selection pressure on human STR loci and its relevance in repeat expansion disease. Molecular and Genetic Genomics. 2016;291:1851–1869. doi: 10.1007/s00438-016-1219-7. [DOI] [PubMed] [Google Scholar]
- Sim KL, Creamer TP. Abundance and distributions of eukaryote protein simple sequences. Molecular and Cellular Proteomics. 2002;1:983–995. doi: 10.1074/mcp.m200032-mcp200. [DOI] [PubMed] [Google Scholar]
- Singh GP, Chandra BR, Bhattacharya A, Akhouri RR, Singh SK, Sharma A. Hyper-expansion of asparagines correlates with an abundance of proteins with prion-like domains in Plasmodium falciparum. Molecular and Biochemical Parasitology. 2004;137:307–319. doi: 10.1016/j.molbiopara.2004.05.016. [DOI] [PubMed] [Google Scholar]
- Slapeta J. Cryptosporidiosis and Cryptosporidium species in animals and humans: a thirty colour rainbow? Int J Parasitology. 2013;43:957–970. doi: 10.1016/j.ijpara.2013.07.005. [DOI] [PubMed] [Google Scholar]
- Slavin D. Cryptosporidium meleagridis (sp. nov.) Journal of Comparative Pathology. 1955;65:262–266. doi: 10.1016/s0368-1742(55)80025-2. [DOI] [PubMed] [Google Scholar]
- Steentoft C, Vakhrushev SY, Joshi HJ, Kong Y, Vester-Christensen MB, Schjoldager KT, Lavrsen K, Dabelsteen S, Pedersen NB, Marcos-Silva L, Gupta R, Bennett EP, Mandel U, Brunak S, Wandall HH, Levery SB, Clausen H. Precision mapping of the human O-GalNAc glycoproteome through SimpleCell technology. EMBO Journal. 2013;32:1478–1488. doi: 10.1038/emboj.2013.79. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Steinbiss S, Silva-Franco F, Brunk B, Foth B, Hertz-Fowler C, Berriman M, Otto TD. Companion: a web server for annotation and analysis of parasite genomes. Nucleic Acids Research. 2016;44:W29–34. doi: 10.1093/nar/gkw292. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Strong WB, Gut J, Nelson RG. Cloning and sequence analysis of a highly polymorphic Cryptosporidium parvum gene encoding a 60-kilodalton glycoprotein and characterization of its 15- and 45-kilodalton zoite surface antigen products. Infection and Immunity. 2000;68:4117–4134. doi: 10.1128/iai.68.7.4117-4134.2000. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tanriverdi S, Grinberg A, Chalmers RM, Hunter PR, Petrovic Z, Akiyoshi DE, London E, Zhang L, Tzipori S, Tumwine JK, Widmer G. Inferences about the global population structures of Cryptosporidium parvum and Cryptosporidium hominis. Applied and Environmental Microbiology. 2008;74:7227–7234. doi: 10.1128/AEM.01576-08. [DOI] [PMC free article] [PubMed] [Google Scholar]
- van Eyk CL, McLeod CJ, O’Keefe LV, Richards RI. Comparative toxicity of polyglutamine, polyalanine and polyleucine tracts in Drosophila models of expanded repeat disease. Human Molecular Genetics. 2012;21:536–547. doi: 10.1093/hmg/ddr487. [DOI] [PubMed] [Google Scholar]
- Verstrepen KJ, Jansen A, Lewitter F, Fink GR. Intragenic tandem repeats generate functional variability. Nature Genetics. 2005;37:986–990. doi: 10.1038/ng1618. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Widmer G, Lee Y, Hunt P, Martinelli A, Tolkoff M, Bodi K. Comparative genome analysis of two Cryptosporidium parvum isolates with different host range. Infection Genetics and Evolution. 2012;12:1213–1221. doi: 10.1016/j.meegid.2012.03.027. doi: S1567-1348(12)00108-6 [pii] 10.1016/j.meegid.2012.03.027. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wootton JC. Non-globular domains in protein sequences: automated segmentation using complexity measures. Computational Chemistry. 1994;18:269–285. doi: 10.1016/0097-8485(94)85023-2. [DOI] [PubMed] [Google Scholar]
- Xu P, Widmer G, Wang Y, Ozaki LS, Alves JM, Serrano MG, Puiu D, Manque P, Akiyoshi D, Mackey AJ, Pearson WR, Dear PH, Bankier AT, Peterson DL, Abrahamsen MS, Kapur V, Tzipori S, Buck GA. The genome of Cryptosporidium hominis. Nature. 2004;431:1107–1112. doi: 10.1038/nature02977. [DOI] [PubMed] [Google Scholar]
- Young ET, Sloan JS, Van Riper K. Trinucleotide repeats are clustered in regulatory genes in Saccharomyces cerevisiae. Genetics. 2000;154:1053–1068. doi: 10.1093/genetics/154.3.1053. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zilversmit MM, Volkman SK, DePristo MA, Wirth DF, Awadalla P, Hartl DL. Low-complexity regions in Plasmodium falciparum: missing links in the evolution of an extreme genome. Molecular Biology and Evolution. 2010;27:2198–2209. doi: 10.1093/molbev/msq108. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Fig. S1. PCR amplification of oocyst DNA confirms the length of two SAARs in C. parvum and C. hominis. DNA from C. parvum isolate IOWA and C. hominis isolate TU502 oocysts was analyzed. The two SAARs are located in the genes designated using C. parvum gene IDs as cgd8_1160 and cgd5_450. The genome sequence predicts cgd8_1160 amplicons of 411 nt and 457 nt, and cgd5_450 amplicons of 681 nt and 726 nt in TU502 and IOWA, respectively. Lane 1, 100 bp ladder; lanes 4, 7, negative PCR controls. The size of the 1 kbp, 500 bp and 100 bp band is indicated.
Fig. S2. Correlation between genetic distance and 18S distance by amino acid. Triangles denote distances between C. baileyi and any other species.
Fig. S3. Dotplot analysis of threonine-rich C. muris - C. parvum homologs. Seventeen C. muris genes with the longest uninterrupted T SAARs, and for which C. parvum homologs were found, were concatenated and compared to the homologous C. parvum sequences as described in Fig. 5. Horizontal axis, C. parvum; vertical axis, C. muris.
Fig. S4. Cumulative length of single-amino-acid repeats by repeat type in four Eimeria genomes. Perfect and imperfect repeats longer than 20 amino acids were identified with XSTREAM. Colors indicate amino acid as shown in key. All species except E. falciformis infect birds. Left, raw amino acid counts; right, counts normalized to 107 nt coding sequence. Efal, E. falciformis; Eten, E. tenella; Emax, E. maxima; E. nec, E. necatrix.
Table S1. Summary of genome sequencing projects.
Table S2. Cumulative SAAR length for 7 species and eight genomes calculated by amino acid using XSTREAM.
Table S3. Enriched functions in C. baileyi N-rich proteins.
Table S4. Enriched functions in C. parvum T and S-rich proteins.
