Abstract
Clustered regularly interspaced short palindromic repeats (CRISPRs) and their associated genes (cas) are essential components of adaptive immune systems that protect bacteria and archaea from viral infection. CRISPR-Cas systems are found in about 40% of bacterial and 85% of archaeal genomes, but not in eukaryotic genomes. Recently, an article published in Communications Biology reported the identification of 12,572 putative CRISPRs in the human genome, which they call “hCRISPR.” In this study, we attempt to reproduce this analysis and show that repetitive elements identified as putative CRISPR loci in the human genome contain neither the repeat-spacer-repeat architecture nor the cas genes characteristic of functional CRISPR systems.
Introduction
Repetitive DNA sequences are diverse and abundant in a broad range of species, from bacteria to mammals, where they participate in many different biological functions.1 Clustered regularly interspaced short palindromic repeats (CRISPRs) belong to a specific family of DNA repeats found in bacteria and archaea that all share a common architecture.2,3 Each CRISPR locus consists of a series of short repeat sequences (20–50 bps) separated by unique spacer sequences of a similar length (Fig. 1A).4 The repeat sequences within a CRISPR locus are generally conserved, but repeats between different CRISPR loci can vary in both sequence and length.2 The diversity of these elements makes them challenging to confidently identify, especially when the number of repeat sequences within a locus is small.
FIG. 1.
Architectural features of prokaryotic CRISPRs and proposed human CRISPR loci. (A) Structure of typical CRISPR-Cas locus in bacteria and archaea. (B) Structure of a recently reported human CRISPR (hCRISPR).
The discovery of CRISPR-based adaptive immunity and the ongoing adaptation of these systems for applications in genome editing have engaged scientists from nearly every discipline and many perspectives. Thus, the recent article by van Riet et al., which reported the discovery of >12,500 CRISPR loci in the human reference genome (GRCh38.p13; GCF_000001405.39), comes as a surprise.5 They report that hCRISPRs are similar to prokaryotic CRISPRs in terms of array architecture, repeat length, and repeat sequence, and that hCRISPRs are flanked by CRISPR-associated (cas) genes (Fig. 1B).5
Moreover, they claim that hCRISPRs are found across all human chromosomes and that the distribution is directly correlated to chromosome length. Collectively, these claims indicate that CRISPR loci are preserved in the human genome and their association with cas genes implies a functional connection that will have profound biological and biotechnological implications. In this study, we attempt to reproduce this analysis and show that repetitive elements identified in the human genome contain neither the repeat-spacer-repeat architecture nor the cas genes characteristic of functional CRISPR systems.
Methods
CRISPR predictions
The human reference genome (GRCh38.p13; GCF_000001405.39) was downloaded from the NCBI RefSeq Assembly database (193 sequences) (August 22, 2022).6 Local versions of CRISPRCasFinder7 (v4.2.20; CasFinder v2.0.3), CRISPRDetect8 (v3), and CRISPRCasTyper9 (v1.6.4) were used to identify putative CRISPR arrays in the downloaded human reference genome. We used the search parameters “-minSP 21 -maxSP 72 -fast” for CRISPRCasFinder and “-array_quality_score_cutoff 0 -minimum_word_repeatation 2 -minimum_no_of_repeats 2” for CRISPRDetect. Default parameters were used for CRISPRCasTyper.
Predicted putative arrays from CRISPRCasFinder, CRISPRDetect, and CRISPRCasTyper were sorted based on the assigned confidence score. Evidence-level scores 3 or 4 by CRISPRCasFinder, array quality score >4 by CRISPRDetect, and probability score >0.75 by CRISPRCasTyper were considered as high-confidence arrays. Results from each CRISPR detection tool were compared to identify shared hits. Twenty candidate arrays predicted by CRISPRDetect were removed from the list of high-confidence arrays due to low confidence scores for these same arrays by CRISPRCasFinder. The remaining 72 arrays were then considered candidate arrays. Two of these arrays with the highest confidence scores from each of the three detection methods were manually inspected.
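The high-confidence cutoffs described above can be written down as a small lookup, shown here as a minimal sketch. The thresholds come from the text; the dictionary keys, the function name, and the assumption that each tool's score has already been parsed into a number are illustrative and are not part of any tool's API.

```python
# Thresholds are taken from the Methods text; names are hypothetical.
HIGH_CONFIDENCE_RULES = {
    "CRISPRCasFinder": lambda score: score >= 3,   # evidence level 3 or 4
    "CRISPRDetect": lambda score: score > 4,       # array quality score > 4
    "CRISPRCasTyper": lambda score: score > 0.75,  # probability score > 0.75
}


def is_high_confidence(tool, score):
    """Return True if a predicted array passes the cutoff for the given tool."""
    return HIGH_CONFIDENCE_RULES[tool](score)


# Example with made-up scores: only the first and third entries pass.
predictions = [("CRISPRCasFinder", 4), ("CRISPRDetect", 3.2), ("CRISPRCasTyper", 0.9)]
high_confidence = [p for p in predictions if is_high_confidence(*p)]
```

Keeping the cutoffs in one table makes it easy to see that the three tools score arrays on different scales, which is why their results are compared by shared hits rather than by raw score.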
In addition, the Telomere-to-Telomere (T2T) assembly of the human genome with the Y chromosome (GCF_009914755.1 T2T-CHM13v2.0) was downloaded from the NCBI RefSeq Assembly database (January 10, 2023).10 CRISPRDetect was used to predict putative CRISPR arrays in the T2T assembly of the human genome using the same search parameters described earlier. Predicted putative arrays by CRISPRDetect were sorted by quality score.
Genomes of Aedes aegypti (GCF_002204515.2),11 Arabidopsis thaliana (GCF_000001735.4),12 and Caenorhabditis elegans (GCF_000002985.6)13 were downloaded from the NCBI RefSeq Assembly database (August 22, 2022). CRISPRDetect was used to predict putative CRISPR arrays in the downloaded genomes with the same search parameters described earlier. Arrays identified by CRISPRDetect were sorted by quality score.
The mitochondrial genome of Vicia faba (X59246.1)14 and the assembly of the Acanthamoeba polyphaga mimivirus genome (GCF_000888735.1)15 were downloaded from the NCBI RefSeq database (January 12, 2023). Previously reported CRISPR-like sequences from each genome were extracted and manually inspected.
Identification of cas genes in the human genome
CRISPRCasFinder and CRISPRCasTyper were used to identify cas genes in the human reference genome. CRISPRCasFinder failed to predict any putative cas gene in the human reference genome. Putative cas genes predicted by CRISPRCasTyper were subjected to a filtering step according to E-value (<10⁻⁵) and sequence coverage cutoff (sequence coverage >50%). In addition, translated coding sequences of both the human reference genome (n = 60,090) and the T2T assembly of the human genome (n = 65,591) were downloaded from the NCBI RefSeq Assembly database.6,10
One hundred twenty-one cas gene hidden Markov model (HMM) profiles from CRISPRCasdb were used to query the coding sequences using MacSyFinder (v1.0.5; HMMER 3.2.1).16,17 Queries were performed using the following parameters: “macsyfinder --sequence-db <open_reading_frames> --db-type gembase -d <CRISPR_subtype_definitions> -p <HMM_profiles> -w 50 -vv all.” Results from the HMMER analysis were filtered with the same E-value threshold and sequence coverage criteria described earlier.
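The filtering step applied to candidate cas gene hits (E-value <10⁻⁵ and sequence coverage >50%) can be sketched as follows. This is an illustration only: the `Hit` record and its field names are hypothetical, the example values are invented, and the text does not specify how each tool defines coverage, so it is treated here simply as a fraction between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one profile-versus-protein match."""
    profile: str     # name of the cas HMM profile
    target: str      # identifier of the matched coding sequence
    evalue: float    # E-value reported by the search
    coverage: float  # fraction covered (0..1); definition assumed, not from the text


def significant_hits(hits, max_evalue=1e-5, min_coverage=0.5):
    """Keep hits with E-value below the cutoff and coverage above the cutoff."""
    return [h for h in hits if h.evalue < max_evalue and h.coverage > min_coverage]


# Invented example values: only the first hit satisfies both criteria.
example = [
    Hit("cas1_profile", "protein_A", 1e-20, 0.80),
    Hit("cas2_profile", "protein_B", 1e-3, 0.90),   # E-value too high
    Hit("cas9_profile", "protein_C", 1e-12, 0.30),  # coverage too low
]
kept = significant_hits(example)
```

Requiring both criteria at once guards against short, locally similar domains that score well by E-value but cover only a small part of the sequence.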
Comparison of repeat length distribution of CRISPRs from prokaryotes and hCRISPRs
The repeat sequences from 19,483 unique putative hCRISPR arrays were extracted using a Python v2.7 script that uses the Biopython Bio.SeqIO module.18 In addition, the repeats of bacterial and archaeal CRISPRs were downloaded from CRISPRCasdb (n = 28,712). The distribution of repeat lengths was visualized in RStudio by comparing the list of repeats from CRISPRs and hCRISPRs using ggplot2.19
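Tabulating repeat lengths amounts to counting sequence lengths in a FASTA file of repeats. The sketch below is not the script used in this study (which relied on Biopython); it uses only the Python standard library so that it is self-contained, and the function names and toy sequences are illustrative.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def repeat_length_counts(lines):
    """Count how many repeat sequences occur at each length."""
    return Counter(len(seq) for _, seq in read_fasta(lines))


# Toy input; real input would be the lines of a repeat FASTA file.
toy = [">repeat_1", "ACGTACGT", ">repeat_2", "ACGT", "ACGT", ">repeat_3", "GGCC"]
counts = repeat_length_counts(toy)
```

The resulting length-to-count mapping is the kind of table that can then be plotted as a histogram for each data set.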
Results and Discussion
To understand how the putative CRISPR arrays, reported by van Riet et al.,5 may have been missed by the rest of the scientific community and to learn what these new CRISPRs might teach us about the evolution and function of CRISPR systems, we searched the human reference genome (GRCh38.p13; GCF_000001405.39) for CRISPRs using a combination of CRISPRCasFinder,7 CRISPRDetect,8 and CRISPRCasTyper9 according to the search parameters defined by the authors.5 Collectively, we identified a total of 19,483 unique putative arrays in the human reference genome (Supplementary Data S1).
Results from CRISPRCasFinder (n = 12,616) and CRISPRCasTyper (n = 28) were nearly identical to those reported by van Riet et al.,5 whereas CRISPRDetect (n = 10,056) identified 6,403 additional arrays (Fig. 2A). The exact search parameters used for CRISPRDetect were not provided by the authors, so the additional CRISPR loci identified in this study (using -array_quality_score_cutoff 0 -minimum_word_repeatation 2 -minimum_no_of_repeats 2) are anticipated to reflect differences in the search criteria.
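One way to compare predictions across tools is to treat each array as a genomic interval and call two predictions shared when they overlap on the same sequence. The text does not state the exact matching rule that was used, so the overlap criterion, the tuple layout, and the coordinates below are assumptions made for illustration.

```python
def overlaps(a, b):
    """True if two (contig, start, end) intervals share at least one position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def shared_arrays(arrays_a, arrays_b):
    """Return every pair of arrays, one from each tool, that overlap."""
    return [(a, b) for a in arrays_a for b in arrays_b if overlaps(a, b)]


# Invented coordinates for two tools.
tool_one = [("chr1", 100, 250), ("chr2", 500, 640)]
tool_two = [("chr1", 200, 320), ("chr2", 700, 800), ("chr3", 100, 250)]
shared = shared_arrays(tool_one, tool_two)
```

This all-against-all comparison is quadratic in the number of arrays; for tens of thousands of predictions, sorting intervals per contig first would be a more efficient design.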
FIG. 2.
Repetitive elements in the human genome do not contain architectural features consistent with CRISPR loci. (A) The human reference genome was queried using CRISPRCasFinder, CRISPRDetect, and CRISPRCasTyper. A total of 19,483 unique putative arrays were predicted; 3,214 arrays were detected by both CRISPRCasFinder and CRISPRDetect, whereas only three of the arrays were predicted by both CRISPRDetect and CRISPRCasTyper. None of the arrays predicted by CRISPRCasFinder and CRISPRCasTyper were the same. (B) CRISPRCasFinder predicted only three arrays with high confidence (evidence-level 3 or 4), whereas CRISPRDetect predicted 61 candidate arrays (array quality score >4) and CRISPRCasTyper assigned eight arrays as high confidence (probability >75%). None of these predicted arrays are shared between all three CRISPR detection methods. (C) Architecture of two arrays presented by van Riet et al.5 (top), followed by the top two most confident predictions from each of the three detection methods. The long array identified by van Riet et al.5 contains repeats that are only 29% identical and spacers that range in size. The other array is too short to predict with confidence. The top two arrays predicted by CRISPRCasFinder contain spacer sequences that are degenerate versions of the repeat. The other four arrays identified by CRISPRDetect and CRISPRCasTyper contain spacers that are related, resulting in arrays with tandem repeats. (D) The direct repeats of bacterial and archaeal CRISPRs compared with lengths of the repeats in hCRISPR. All repeat sequences were extracted and the number of each repeat corresponding to a particular length is plotted for bacterial and archaeal genomes (top, n = 28,712) and the human reference genome (bottom, n = 19,483).
CRISPR prediction tools assign confidence scores based on a combination of parameters. These parameters include the number of repeats in an array, conservation of repeat sequences, and the presence of nonidentical spacer sequences.7–9,20 CRISPRDetect incorporates additional factors, such as the identification of previously described repeat sequences and detection of conserved cas genes (i.e., cas1 or cas2), which are used to calculate confidence scores. Although confidence scores are not mentioned by van Riet et al.,5 we sorted the arrays according to scores calculated by each algorithm. CRISPRCasFinder identified a total of three arrays with high confidence scores, CRISPRDetect identified 61 arrays with high confidence, and CRISPRCasTyper assigned eight arrays with high confidence (Fig. 2B).
To further evaluate these predictions, we manually inspected the two examples presented by van Riet et al.5 (neither of which earned a high confidence score by any of the search methods), as well as two arrays with the highest confidence scores from each of the three detection methods (Fig. 2C; Supplementary Data S2). The two hCRISPR examples presented by van Riet et al. include one long and one short locus (Fig. 2C, top).5 The long array contains eight “repeat” sequences that are 29% identical to each other and are separated by unique spacer sequences that vary in length (8–75 bps) (Fig. 2; Supplementary Data S1 and S2). The short “hCRISPR” contains two identical 28 bp repeats and a single 39 bp spacer sequence.
We cannot definitively eliminate the possibility that the short array (composed of two repeats) is a CRISPR, but no cas genes are found flanking the CRISPR, the repeats do not correspond to any of the repeats in CRISPRCasdb,16 and the spacer sequence only matches to itself. In addition to the two examples provided by van Riet et al.,5 we selected two arrays with high confidence scores from each of the three different CRISPR detection algorithms (Fig. 2C, middle). High-confidence arrays identified by CRISPRCasFinder contain “spacer” sequences that are degenerate versions of the repeat sequences, whereas putative arrays identified using CRISPRDetect and CRISPRCasTyper all contain “spacer” sequences that are 60% to 72% identical to one another. Thus, these arrays contain an alternating series of two repeats (i.e., Repeat A and Repeat B), rather than repeats separated by unique spacer sequences (Fig. 2C, bottom).
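The architectural argument above rests on two quantities: how similar the repeats in an array are to one another, and how similar the spacers are to one another. A rough way to compute both is sketched below. It is not the method used for the manual inspection in this study: it uses the `ratio()` similarity from Python's `difflib.SequenceMatcher`, which is a generic string similarity and not an alignment-based percent identity, and the function names and toy sequences are illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(seqs):
    """Average SequenceMatcher ratio over all pairs; None if fewer than two sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return None
    total = sum(SequenceMatcher(None, a, b, autojunk=False).ratio() for a, b in pairs)
    return total / len(pairs)


def summarize_array(repeats, spacers):
    """Report repeat and spacer similarity for one putative array."""
    return {
        "repeat_similarity": mean_pairwise_similarity(repeats),
        "spacer_similarity": mean_pairwise_similarity(spacers),
    }


# Toy array: identical repeats separated by unrelated spacers.
summary = summarize_array(
    repeats=["GTTTTAGAGC", "GTTTTAGAGC", "GTTTTAGAGC"],
    spacers=["AAAAAAAA", "CCCCCCCC"],
)
```

In a canonical CRISPR array, repeat similarity is expected to be high and spacer similarity low; an array in which both values are high looks more like a tandem repeat of two alternating units.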
To determine if the repeat sequences in the putative human CRISPR loci are similar in length to those typically found in prokaryotic CRISPRs, we downloaded the direct repeats of bacterial and archaeal CRISPRs from CRISPRCasdb16 (n = 28,712) and compared them with the distribution of repeat lengths from putative CRISPR loci in the human genome (Fig. 2D). CRISPR repeats identified in bacterial and archaeal genomes are primarily 27–28 bp or 36–37 bp in length, whereas repeats identified in the human reference genome are primarily 23 bp and decrease in frequency as the repeat length increases. Although the average length of the repeat sequences is similar (30 bp as identified by van Riet et al.5), the distributions are very different.
In addition to CRISPR detection, the authors also report the detection of cas genes in the flanking regions of 336 hCRISPRs.5 However, neither the sequences for these ORFs nor significance values for these cas genes are provided. Our efforts to reproduce the identification of cas genes in the human reference genome using CRISPRCasFinder and CRISPRCasTyper failed. CRISPRCasFinder failed to predict any putative cas gene, whereas CRISPRCasTyper failed to predict significant hits (E-value <10⁻⁵ and sequence coverage >50%). To further verify these results, we downloaded 121 profile HMMs for cas genes from CRISPRCasdb16 and queried the human genome using MacSyFinder (v1.0.5).21 Consistent with CRISPRCasFinder and CRISPRCasTyper, we find no significant hits (E-value <10⁻⁵ and sequence coverage >50%) to cas genes in the human genome.
The human reference genome (GRCh38.p13) screened by van Riet et al.5 was originally assembled nearly 20 years ago.6 Although early drafts of the human genome provided critical new insight into human biology, these assemblies preferentially excluded many repetitive sequences. However, the recently released T2T assembly of the human genome has resolved this problem.10 Owing to the repetitive nature of the human genome, we hypothesized the T2T genome might contain CRISPRs not found in earlier versions. To test this hypothesis, we searched the T2T genome (GCF_009914755.1 T2T-CHM13v2.0) using the methods described earlier. Collectively, we identified a total of 10,481 putative CRISPR arrays in the T2T genome (Supplementary Data S3). CRISPRDetect identified 78 putative CRISPRs with high confidence (52 arrays were identical to those predicted in human reference genome); however, none of these arrays maintain the architectural characteristics of a CRISPR.
Repetitive sequences make up ∼50% of the human genome.10,22 This is not a unique feature of the human genome, but characteristic of many eukaryotic genomes, where repetitive elements make up 1–80% of genomes.23–25 Several annotation tools have been developed to classify eukaryotic repetitive elements and notably, none have found evidence for CRISPRs.26,27 Using the same methods described in this study for finding CRISPRs in prokaryotic genomes, we found 3,498 putative CRISPRs in the A. aegypti genome (310 Mb), 1,419 putative CRISPRs in the A. thaliana genome (30 Mb), and 490 putative CRISPRs in the C. elegans genome (15 Mb). In this small nonrepresentative sample of genomes, the number of putative CRISPRs identified correlates with the size of the genome, which is similar to the distribution of putative CRISPRs identified by van Riet et al.5 (“hCRISPRs” are found across all chromosomes and their distribution is directly correlated to the chromosome length).
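The relationship between genome size and the number of putative arrays can be illustrated with a Pearson correlation over the three genomes listed above. This is a sketch for illustration, not an analysis performed in this study: the input values are the ones quoted in the text, the function name is an assumption, and three data points are far too few to support any statistical conclusion.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Values as quoted in the text: A. aegypti, A. thaliana, C. elegans.
genome_size_mb = [310, 30, 15]
putative_arrays = [3498, 1419, 490]
r = pearson_r(genome_size_mb, putative_arrays)
```

A positive coefficient here is what one would expect if hits arise by chance in repetitive sequence, because larger genomes simply offer more sequence in which short repeated motifs can occur.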
Apart from the work by van Riet et al.,5 CRISPR loci have been previously reported in the mitochondrial genome of a plant (V. faba) and the genome of a giant mimivirus that infects single-celled eukaryotes.3,15 However, neither of these repeats meet the CRISPR criteria described earlier (Supplementary Fig. S1).2,28
Overall, the claim that functional CRISPR-Cas systems exist in the human genome and other eukaryotic genomes remains unsupported. Although the known phylogenetic and functional diversity of CRISPRs continues to expand, CRISPR-Cas adaptive immune systems remain, to date, restricted to prokaryotes.
Supplementary Material
Acknowledgment
The authors thank Dr. Andrew Santiago-Frangos for bringing the Communications Biology paper to our attention.
Authors' Contributions
M.B. and B.W. designed research. M.B., W.S.H., and B.W. wrote the article. M.B. performed CRISPRCasFinder, CRISPRDetect, and CRISPRCasTyper searches. B.W. supervised experiments. All authors contributed to data analysis and editing the article.
Data and Code Availability
The data sets and code generated during this study are available in the published article or at https://github.com/WiedenheftLab.
Author Disclosure Statement
B.W. is the founder of SurGene, LLC and VIRIS Detection Systems Inc., and is an inventor of patent applications related to CRISPR-Cas systems.
Funding Information
Research in the Wiedenheft laboratory is supported by National Institutes of Health (R35GM134867), the M.J. Murdock Charitable Trust, a young investigator award from Amgen, a generous gift from the Rosolowsky family, and the Montana State University Agricultural Experimental Station (USDA NIFA).
References
- 1. Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: Computational challenges and solutions. Nat Rev Genet 2011;13(1):36–46; doi: 10.1038/nrg3117
- 2. Jansen R, van Embden JDA, Gaastra W, et al. Identification of genes that are associated with DNA repeats in prokaryotes. Mol Microbiol 2002;43(6):1565–1575; doi: 10.1046/j.1365-2958.2002.02839.x
- 3. Mojica FJ, Diez-Villasenor C, Soria E, et al. Biological significance of a family of regularly spaced repeats in the genomes of Archaea, Bacteria and mitochondria. Mol Microbiol 2000;36(1):244–246; doi: 10.1046/j.1365-2958.2000.01838.x
- 4. Ishino Y, Krupovic M, Forterre P. History of CRISPR-Cas from encounter with a mysterious repeated sequence to genome editing technology. J Bacteriol 2018;200(7):e00580-17; doi: 10.1128/JB.00580-17
- 5. van Riet J, Saha C, Strepis N, et al. CRISPRs in the human genome are differentially expressed between malignant and normal adjacent to tumor tissue. Commun Biol 2022;5(1):338; doi: 10.1038/s42003-022-03249-4
- 6. Schneider VA, Graves-Lindsay T, Howe K, et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res 2017;27(5):849–864; doi: 10.1101/gr.213611.116
- 7. Couvin D, Bernheim A, Toffano-Nioche C, et al. CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins. Nucleic Acids Res 2018;46(W1):W246–W251; doi: 10.1093/nar/gky425
- 8. Biswas A, Staals RH, Morales SE, et al. CRISPRDetect: A flexible algorithm to define CRISPR arrays. BMC Genomics 2016;17:356; doi: 10.1186/s12864-016-2627-0
- 9. Russel J, Pinilla-Redondo R, Mayo-Munoz D, et al. CRISPRCasTyper: Automated identification, annotation, and classification of CRISPR-Cas Loci. CRISPR J 2020;3(6):462–469; doi: 10.1089/crispr.2020.0059
- 10. Nurk S, Koren S, Rhie A, et al. The complete sequence of a human genome. Science 2022;376(6588):44–53; doi: 10.1126/science.abj6987
- 11. Matthews BJ, Dudchenko O, Kingan SB, et al. Improved reference genome of Aedes aegypti informs arbovirus vector control. Nature 2018;563(7732):501–507; doi: 10.1038/s41586-018-0692-z
- 12. Pucker B, Holtgrawe D, Stadermann KB, et al. A chromosome-level sequence assembly reveals the structure of the Arabidopsis thaliana Nd-1 genome and its gene set. PLoS One 2019;14(5):e0216233; doi: 10.1371/journal.pone.0216233
- 13. C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 1998;282(5396):2012–2018; doi: 10.1126/science.282.5396.2012
- 14. Flamand MC, Goblet JP, Duc G, et al. Sequence and transcription analysis of mitochondrial plasmids isolated from cytoplasmic male-sterile lines of Vicia-faba. Plant Mol Biol 1992;19(6):913–923; doi: 10.1007/Bf00040524
- 15. Levasseur A, Bekliz M, Chabriere E, et al. MIMIVIRE is a defence system in mimivirus that confers resistance to virophage. Nature 2016;531(7593):249–252; doi: 10.1038/nature17146
- 16. Pourcel C, Touchon M, Villeriot N, et al. CRISPRCasdb a successor of CRISPRdb containing CRISPR arrays and cas genes from complete genome sequences, and tools to download and query lists of repeats and spacers. Nucleic Acids Res 2020;48(D1):D535–D544; doi: 10.1093/nar/gkz915
- 17. Prakash A, Jeffryes M, Bateman A, et al. The HMMER web server for protein sequence similarity search. Curr Protoc Bioinformatics 2017;60:3.15.1–3.15.23; doi: 10.1002/cpbi.40
- 18. Cock PJA, Antao T, Chang JT, et al. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009;25(11):1422–1423; doi: 10.1093/bioinformatics/btp163
- 19. Villanueva RAM, Chen ZJ. ggplot2: Elegant Graphics for Data Analysis, 2nd edition. Meas-Interdiscip Res 2019;17(3):160–167; doi: 10.1080/15366367.2019.1565254
- 20. Alkhnbashi OS, Meier T, Mitrofanov A, et al. CRISPR-Cas bioinformatics. Methods 2020;172:3–11; doi: 10.1016/j.ymeth.2019.07.013
- 21. Abby SS, Neron B, Menager H, et al. MacSyFinder: A program to mine genomes for molecular systems with an application to CRISPR-Cas systems. PLoS One 2014;9(10):e110726; doi: 10.1371/journal.pone.0110726
- 22. Venter JC, Adams MD, Myers EW, et al. The sequence of the human genome. Science 2001;291(5507):1304–1351; doi: 10.1126/science.1058040
- 23. Goemann CLC, Wilkinson R, Henriques W, et al. Genome sequence, phylogenetic analysis, and structure-based annotation reveal metabolic potential of Chlorella sp. SLA-04. Algal Res 2023;69:102943; doi: 10.1016/j.algal.2022.102943
- 24. Hufford MB, Seetharam AS, Woodhouse MR, et al. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science 2021;373(6555):655–662; doi: 10.1126/science.abg5289
- 25. Udriste AA, Iordachescu M, Ciceoi R, et al. Next-generation sequencing of local Romanian tomato varieties and bioinformatics analysis of the Ve Locus. Int J Mol Sci 2022;23(17):9750; doi: 10.3390/ijms23179750
- 26. Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 2015;6:11; doi: 10.1186/s13100-015-0041-9
- 27. Hubley R, Finn RD, Clements J, et al. The Dfam database of repetitive DNA families. Nucleic Acids Res 2016;44(D1):D81–D89; doi: 10.1093/nar/gkv1272
- 28. Mohanraju P, Makarova KS, Zetsche B, et al. Diverse evolutionary roots and mechanistic variations of the CRISPR-Cas systems. Science 2016;353(6299):aad5147; doi: 10.1126/science.aad5147