Genetics. 2014 Jul 9;198(1):167–170. doi: 10.1534/genetics.114.166769

A Defined Zebrafish Line for High-Throughput Genetics and Genomics: NHGRI-1

Matthew C LaFave *, Gaurav K Varshney *, Meghana Vemulapalli , James C Mullikin †,, Shawn M Burgess *,1
PMCID: PMC4174928  PMID: 25009150

Abstract

Substantial intrastrain variation at the nucleotide level complicates molecular and genetic studies in zebrafish, such as the use of CRISPRs or morpholinos to inactivate genes. In the absence of robust inbred zebrafish lines, we generated NHGRI-1, a healthy and fecund strain derived from founder parents we sequenced to a depth of ∼50×. Within this strain, we have identified the majority of the genome that matches the reference sequence and documented most of the variants. This strain has utility for many reasons, but in particular it will be useful for any researcher who needs to know the exact sequence (with all variants) of a particular genomic region or who wants to be able to robustly map sequences back to a genome with all possible variants defined.

Keywords: zebrafish, SNV, genome sequence, CRISPR, variants


The zebrafish (Danio rerio) is a powerful tool for understanding vertebrate biology. The usefulness of this model organism is bolstered by the availability of a “finished” sequenced and annotated genome (Howe et al. 2013; Flicek et al. 2014). As a natural extension of this resource, there are several high-throughput efforts to systematically mutagenize all zebrafish protein-coding genes (Moens et al. 2008; Kettleborough et al. 2013; Varshney et al. 2013a,b).

In addition to such projects, the combination of a sequenced genome and developments in targeted nuclease technology means that the zebrafish community is now able to rapidly take advantage of custom genome-editing technologies (Doyon et al. 2008; Bedell et al. 2012; Hruscha et al. 2013; Hwang et al. 2013; Jao et al. 2013). CRISPRs in particular provide an efficient, easy, and inexpensive means of manipulating and interrogating the genome (Jinek et al. 2012; Cong et al. 2013; Mali et al. 2013). However, because there are very few hardy inbred zebrafish lines (overinbreeding tends to result in unhealthy stocks) and polymorphism rates are close to 1 in every 100 bases, variants frequently have the potential to interfere with target site design (Stickney et al. 2002; Guryev et al. 2006; Bowen et al. 2012) or with regions of homology used for homologous recombination. In general, genome targeting is heavily dependent on an exact match to the primary sequence. Depending on the sequence, even a single mismatch can severely reduce the cutting efficiency (Hsu et al. 2013). In addition, other techniques such as RNA-Seq or ChIP-Seq are substantially less accurate when the variants in the background strain have not been fully characterized. Therefore, it is preferable to carry out studies in a zebrafish strain in which the regions of invariant sequence are known with a high degree of confidence and all variants are categorized to allow for robust genomic mapping.

With these concerns in mind, we derived the zebrafish line NHGRI-1. NHGRI-1 fish were derived from an original strain known as “TAB-5” made from a hybrid cross between fish from two of the most commonly used zebrafish lines: Tübingen and AB (Streisinger et al. 1981; Haffter et al. 1996). The F1 fish from this cross were inbred and screened to be clear of any mutations affecting the first 5 days of development. Since its initial isolation in 1997, we have carried the strain in the laboratory until the present day without introducing other outside genetic diversity. We selected several mating pairs from the TAB-5 pool, and the most robust mating pair was chosen as the founding pair for NHGRI-1. We are now on the third generation of NHGRI-1 and their fecundity and overall health remain strong.

We carried out high-throughput sequencing to a depth of ∼50× for each parent. The male and female sequencing libraries had a combined 1,289,142,362 nonduplicate reads, with a median coverage of 52× and 47×, respectively. By doing so, we identified >10 million previously unreported single-nucleotide variants (SNVs). The raw sequence data have been deposited in the NCBI Sequence Read Archive [BioProject ID: 246102]. In addition, we have identified nearly all the regions of the genome that are invariant relative to the Zv9 reference sequence. We generated a browser extensible data (BED) file of invariant nucleotides, which indicates the regions in which there were both a lack of alternative alleles and sufficient read depth and genotype confidence to call bases as invariant (Figure 1). Seventy-one percent of the genome fits these criteria. The invariant file is hosted on the NHGRI-1 website at http://research.nhgri.nih.gov/manuscripts/Burgess/zebrafish/download.shtml; a University of California, Santa Cruz (UCSC) data hub called “ZebrafishGenomics” has been established at http://genome.ucsc.edu/cgi-bin/hgHubConnect; and data have been transferred to http://zfin.org/. Information on the variants themselves can be downloaded from dbSNP (submitter handle, NHGRI_DGS; submitter batch ID, NHGRI-1_founders). The invariant regions are easily identified by using the BED file, which simplifies the design of CRISPR targets, amplicon primers, and morpholinos, the selection of regions for homologous recombination, and essentially any experiment that requires high confidence in the exact sequence of the genomic region of interest.
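As an illustration of how the invariant BED file might be used during target design, the following minimal Python sketch checks whether a candidate target site lies entirely within a single invariant interval. The function names, the example coordinates, and the assumption that the BED intervals are non-overlapping and already merged are ours and are not taken from the NHGRI-1 files themselves.

```python
import bisect

def load_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)}.

    BED coordinates are 0-based and half-open: [start, end).
    """
    regions = {}
    for line in lines:
        # Skip blank lines and optional BED header/comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions

def is_invariant(regions, chrom, start, end):
    """Return True if [start, end) falls inside one invariant interval.

    Assumes intervals on a chromosome do not overlap and that adjacent
    intervals have been merged (an assumption of this sketch).
    """
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= the query start.
    i = bisect.bisect_right(intervals, (start, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= start and end <= intervals[i][1]

# Hypothetical example: two invariant intervals on one chromosome.
example = load_bed(["chr1\t100\t200", "chr1\t300\t400"])
print(is_invariant(example, "chr1", 120, 143))  # 23-bp site inside an interval
print(is_invariant(example, "chr1", 190, 213))  # site runs past the interval end
```

A site that fails this check either overlaps a known variant or falls in a region where read depth or genotype confidence was insufficient, so it would need to be sequenced directly before use.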

Figure 1.


Screenshot of the UCSC browser custom tracks for NHGRI-1. Twenty mating pairs from 6-month-old TAB-5 fish were screened to select a robust founding pair with good clutch size and healthy progeny; the most fecund pair was renamed NHGRI-1. Fin clips from the NHGRI-1 male and female were prepared as separate genomic DNA libraries and sequenced on the Illumina HiSeq 2000 by the National Institutes of Health (NIH) Intramural Sequencing Center. Both libraries were subjected to paired-end sequencing with 101-bp reads. We aligned the sequence to the zebrafish genome [Zv9 (Howe et al. 2013)] with Novoalign version 2.08.02 (http://www.novocraft.com/). We removed PCR duplicates via SAMtools version 0.1.18 (Li et al. 2009). We used bam2mpg to identify the most probable genotype (MPG) for nucleotides in both parents (Teer et al. 2010). Bases that did not have an MPG score of at least 10, coverage of at least 20×, and a ratio of MPG score to coverage >0.5 were discarded. Regions of low sequence complexity were not specifically excluded from the analysis unless they failed to meet these criteria. The bases that matched the reference and met the above criteria in both fish were used to build the BED track of invariant nucleotides. The top track indicates the bases that were invariant in both fish sequenced. The white regions indicate either variation in at least one fish or insufficient read depth to confidently call the region as invariant. The second track indicates two nonsense mutations detected in this region. The letter indicates the alternative allele, and the color indicates whether the mutation was homozygous (red) or heterozygous (blue) in the NHGRI-1 population. Both tracks are available on the ZebrafishGenomics track hub, which is hosted at http://research.nhgri.nih.gov/manuscripts/Burgess/zebrafish/downloads/NHGRI-1/hub.txt and accessible through http://genome.ucsc.edu/cgi-bin/hgHubConnect.
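The filtering and track-building steps described in this legend can be sketched as follows. This is a simplified illustration, not the pipeline used for NHGRI-1: the per-base record layout (`matches_ref`, `score`, `cov`) and the function names are hypothetical, and only the three thresholds (MPG score of at least 10, coverage of at least 20×, score-to-coverage ratio >0.5) come from the text.

```python
def passes_filter(mpg_score, coverage, min_score=10, min_cov=20, min_ratio=0.5):
    """Apply the three thresholds stated in the legend to one base call."""
    return (
        mpg_score >= min_score
        and coverage >= min_cov
        and mpg_score / coverage > min_ratio
    )

def invariant_intervals(positions):
    """Collapse sorted 0-based positions into half-open (start, end) runs."""
    intervals = []
    for pos in positions:
        if intervals and intervals[-1][1] == pos:
            intervals[-1][1] = pos + 1      # extend the current run
        else:
            intervals.append([pos, pos + 1])  # start a new run
    return [tuple(iv) for iv in intervals]

def shared_invariant(male, female):
    """Keep positions that match the reference and pass the filter in both fish.

    `male` and `female` map a 0-based position on one chromosome to a dict
    with hypothetical keys: matches_ref (bool), score (MPG score), cov (depth).
    """
    keep = [
        pos
        for pos in sorted(male.keys() & female.keys())
        if all(
            call["matches_ref"] and passes_filter(call["score"], call["cov"])
            for call in (male[pos], female[pos])
        )
    ]
    return invariant_intervals(keep)
```

Each resulting (start, end) pair corresponds to one line of a BED track for that chromosome.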

We detected >17 million total variants upon merging the variant calls from the two libraries. Of that total, 236,301 were in exons of Ensembl transcripts (Table 1). Variants were called as homozygous only if they were homozygous in both fish; such variants will stably retain the variant allele in future generations.

Table 1. Raw counts of variants in NHGRI-1.

Variants                  SNV         DIV       Total
Total variants     14,917,339   2,210,080  17,127,419
  Heterozygous     12,245,715   1,953,277  14,198,992
  Homozygous        2,642,908     225,347   2,868,255
  Unknown              28,716      31,456      60,172
Exon variants         233,141       3,160     236,301
  Heterozygous        190,626       2,815     193,441
  Homozygous           42,153         311      42,464
  Unknown                 362          34         396

Single-nucleotide variants and deletion and insertion variants were annotated using ANNOVAR version 2012-10-16 (Wang et al. 2010). Our annotation used the ensGene track hosted on the UCSC genome browser, which corresponded to Ensembl release 74 (Flicek et al. 2014). We annotated the male and female fish separately and then combined the ANNOVAR output to determine overall homozygosity and heterozygosity. Variants were considered homozygous in NHGRI-1 only if they were independently called as homozygous in both sexes. We identified a variant as unknown if it was called as (1) unknown in both sexes or (2) unknown in one fish and homozygous reference in the other. All remaining variants were considered to be heterozygous in NHGRI-1, even if they were called as homozygous in one of the sexes. In cases in which deletion or insertion variants (DIVs) of different lengths were reported at the same position, both were counted as separate variants.
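The rules for combining the two per-fish calls into a single NHGRI-1 zygosity can be written as a small decision function. The sketch below is illustrative only: the genotype labels (`hom_ref`, `het`, `hom_alt`, `unknown`) are hypothetical names rather than ANNOVAR output values, and the `not_variant` result for two homozygous-reference calls is our addition to make the function total.

```python
def combine_zygosity(male_call, female_call):
    """Combine per-fish genotype calls into one strain-level zygosity.

    Calls are one of: "hom_ref", "het", "hom_alt", "unknown"
    (hypothetical labels chosen for this sketch).
    """
    calls = {male_call, female_call}
    # Homozygous only if independently homozygous for the variant in both sexes.
    if calls == {"hom_alt"}:
        return "homozygous"
    # Unknown if unknown in both, or unknown in one and reference in the other.
    if calls == {"unknown"} or calls == {"unknown", "hom_ref"}:
        return "unknown"
    # Neither fish carries a variant; such a site would not be in the variant list.
    if calls == {"hom_ref"}:
        return "not_variant"
    # Everything else counts as heterozygous, even if one fish is homozygous.
    return "heterozygous"

print(combine_zygosity("hom_alt", "hom_alt"))
print(combine_zygosity("hom_alt", "hom_ref"))
print(combine_zygosity("unknown", "hom_ref"))
```

Under these rules a variant that is homozygous in one fish and absent from the other is treated as heterozygous at the strain level, because it is expected to segregate in later generations.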

To underscore the issues related to background variation in the commonly used zebrafish lines, we detected 669 variants that formed premature stop codons in at least one transcript, 105 of which were homozygous mutant in both sexes (Table 2). We have generated a BED track of these variants, indicating the location, the alternative allele, and the homo/heterozygosity. This track is available on the ZebrafishGenomics hub and the NHGRI-1 website (Figure 1). A list of affected genes can also be found in supporting information, Table S1.

Table 2. Mutations introduced by variants in NHGRI-1.

Annotations                    Total
SNV annotation
  Nonsynonymous               77,791
  Synonymous                 149,378
  Stop gain                      640
  Stop loss                       90
  Unknown                      5,242
DIV annotation
  Frameshift deletion            638
  Frameshift insertion           540
  Nonframeshift insertion        944
  Nonframeshift deletion         872
  Stop gain                       29
  Stop loss                        8
  Unknown                        129

We detected 3160 deletion or insertion variants (DIVs) in exons. DIVs of a length divisible by three were highly represented and comprised ∼60% of the exonic DIVs (Figure 2A). Presumably, this is because the resultant nonframeshift mutations would be less likely to be selected against than those that produce frameshifts. A similar profile has been reported in human indels (Chen et al. 2007). This trend is not present in the genome-wide set of 2,210,080 NHGRI-1 DIVs (Figure 2B).
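The link between DIV length and reading frame can be checked against the counts in Table 2. The short sketch below classifies a DIV by whether its length is a multiple of three and computes the share of exonic DIVs annotated as nonframeshift. It is a back-of-the-envelope check under the assumption that the stop-gain, stop-loss, and unknown DIV categories (whose lengths are not given in the table) are left out of the numerator.

```python
def is_frameshift(div_length):
    """A coding insertion or deletion shifts the frame unless its length is a multiple of 3."""
    return div_length % 3 != 0

# Exonic DIV counts taken from Table 2.
nonframeshift_insertions = 944
nonframeshift_deletions = 872
exonic_divs = 3160

nonframeshift_fraction = (
    nonframeshift_insertions + nonframeshift_deletions
) / exonic_divs

print(is_frameshift(3), is_frameshift(4))
print(f"{nonframeshift_fraction:.1%} of exonic DIVs are annotated as nonframeshift")
```

The result, a little over 57%, is in line with the ∼60% figure quoted above once the unclassified categories are taken into account.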

Figure 2.


Deletion and insertion variant length distribution within exons. (A) The 3160 DIVs in exons. (B) The 2,210,080 DIVs detected genome-wide. Red bars indicate the number of deletions of a given length; blue bars represent insertions.

We compared the SNVs identified in NHGRI-1 with dbSNP (Build ID: 139) and a publicly available data set obtained from low-coverage sequencing of multiple zebrafish lines (Sherry et al. 2001; Bowen et al. 2012). For simplicity, we compared only biallelic SNVs for which the reference sequence is known (i.e., no “N”s). The majority of NHGRI-1 SNVs had not been previously reported in either data set (Figure 3). We find that the rate of SNVs per sequenced base in NHGRI-1 is 0.01, or ∼12.5–20× higher than the rate in humans (Kidd et al. 2008). It is important to note that, while the 0.01 number is relevant for NHGRI-1, the regions of homozygosity created by inbreeding mean that it almost certainly underestimates the SNV load in zebrafish as a whole.
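A comparison of this kind can be sketched with plain set operations. The example below is a minimal illustration, not the procedure used for Figure 3: the record layout `(chrom, pos, ref, alts)` and the function names are hypothetical, and matching on the alternative allele as well as the position is an assumption of the sketch.

```python
BASES = {"A", "C", "G", "T"}

def comparable_snvs(records):
    """Keep biallelic SNVs whose reference base is an unambiguous A, C, G, or T.

    `records` is an iterable of (chrom, pos, ref, alts), where `alts` is a
    list of alternative alleles (hypothetical layout for this sketch).
    """
    return {
        (chrom, pos, ref, alts[0])
        for chrom, pos, ref, alts in records
        if ref in BASES and len(alts) == 1 and alts[0] in BASES
    }

def novel_fraction(query, *known_sets):
    """Fraction of SNVs in `query` that appear in none of the known sets."""
    known = set().union(*known_sets)
    return len(query - known) / len(query) if query else 0.0

# Hypothetical toy data: one ambiguous-reference site and one triallelic site
# are dropped before the comparison.
strain = comparable_snvs([
    ("chr1", 10, "A", ["G"]),
    ("chr1", 20, "N", ["T"]),
    ("chr1", 30, "C", ["T", "G"]),
    ("chr2", 5, "G", ["A"]),
])
previously_reported = {("chr1", 10, "A", "G")}
print(novel_fraction(strain, previously_reported))
```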

Figure 3.


SNV overlap with publicly available data sets. This comparison incorporates only SNVs that were biallelic and for which the reference base was an unambiguous A, C, G, or T. The Bowen et al. (2012) SNVs were downloaded from http://fishbonelab.org/harris/Resources_files/parental_variants.tar; both data sets were downloaded on March 12th, 2014.

We also compared the mutational profile of NHGRI-1 to that reported for a zebrafish captured from the wild and sequenced at 39× coverage (Patowary et al. 2013). Different cutoffs had been applied for variant calling in said study, such as a minimum of 32 reads to call an SNV and 5 reads to call a DIV, but the ratios of variant types can be compared. The differences are statistically significant, but small. Among the SNVs in the wild zebrafish, 22.3% were reported as being homozygous, compared to 17.8% in NHGRI-1 (Fisher’s exact test, P < 2.2 × 10^−16). Deletions are more prevalent than insertions in both studies, with the wild zebrafish reported as having 53.9% deletions, compared to 51.6% in NHGRI-1 (P < 2.2 × 10^−16).

This fish line will have utility in terms of automated design for targeted nucleases, as well as for studies such as ChIP-Seq or RNA-Seq where SNVs or DIVs might reduce the accuracy of mapping the raw sequence data. In addition, techniques such as homologous recombination are very sensitive to variants (te Riele et al. 1992), and NHGRI-1 will allow researchers to target genomic regions that do not contain any variant nucleotides. Thus, NHGRI-1 will prove useful in a variety of circumstances where absolute knowledge of the possible sequence variation is needed. The line will be distributed by the Zebrafish International Resource Center (http://zebrafish.org) and the European Zebrafish Resource Center (http://www.ezrc.kit.edu).

Supplementary Material

Supporting Information

Acknowledgments

This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.

Footnotes

Available freely online through the author-supported open access option.

Communicating editor: D. Parichy

Literature Cited

  1. Bedell V. M., Wang Y., Campbell J. M., Poshusta T. L., Starker C. G., et al., 2012. In vivo genome editing using a high-efficiency TALEN system. Nature 491: 114–118
  2. Bowen M. E., Henke K., Siegfried K. R., Warman M. L., Harris M. P., 2012. Efficient mapping and cloning of mutations in zebrafish by low-coverage whole-genome sequencing. Genetics 190: 1017–1024
  3. Chen F. C., Chen C. J., Li W. H., Chuang T. J., 2007. Human-specific insertions and deletions inferred from mammalian genome sequences. Genome Res. 17: 16–22
  4. Cong L., Ran F. A., Cox D., Lin S., Barretto R., et al., 2013. Multiplex genome engineering using CRISPR/Cas systems. Science 339: 819–823
  5. Doyon Y., McCammon J. M., Miller J. C., Faraji F., Ngo C., et al., 2008. Heritable targeted gene disruption in zebrafish using designed zinc-finger nucleases. Nat. Biotechnol. 26: 702–708
  6. Flicek P., Amode M. R., Barrell D., Beal K., Billis K., et al., 2014. Ensembl 2014. Nucleic Acids Res. 42: D749–D755
  7. Guryev V., Koudijs M. J., Berezikov E., Johnson S. L., Plasterk R. H., et al., 2006. Genetic variation in the zebrafish. Genome Res. 16: 491–497
  8. Haffter P., Granato M., Brand M., Mullins M. C., Hammerschmidt M., et al., 1996. The identification of genes with unique and essential functions in the development of the zebrafish, Danio rerio. Development 123: 1–36
  9. Howe K., Clark M. D., Torroja C. F., Torrance J., Berthelot C., et al., 2013. The zebrafish reference genome sequence and its relationship to the human genome. Nature 496: 498–503
  10. Hruscha A., Krawitz P., Rechenberg A., Heinrich V., Hecht J., et al., 2013. Efficient CRISPR/Cas9 genome editing with low off-target effects in zebrafish. Development 140: 4982–4987
  11. Hsu P. D., Scott D. A., Weinstein J. A., Ran F. A., Konermann S., et al., 2013. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol. 31: 827–832
  12. Hwang W. Y., Fu Y., Reyon D., Maeder M. L., Tsai S. Q., et al., 2013. Efficient genome editing in zebrafish using a CRISPR-Cas system. Nat. Biotechnol. 31: 227–229
  13. Jao L. E., Wente S. R., Chen W., 2013. Efficient multiplex biallelic zebrafish genome editing using a CRISPR nuclease system. Proc. Natl. Acad. Sci. USA 110: 13904–13909
  14. Jinek M., Chylinski K., Fonfara I., Hauer M., Doudna J. A., et al., 2012. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 337: 816–821
  15. Kettleborough R. N., Busch-Nentwich E. M., Harvey S. A., Dooley C. M., de Bruijn E., et al., 2013. A systematic genome-wide analysis of zebrafish protein-coding gene function. Nature 496: 494–497
  16. Kidd J. M., Cooper G. M., Donahue W. F., Hayden H. S., Sampas N., et al., 2008. Mapping and sequencing of structural variation from eight human genomes. Nature 453: 56–64
  17. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., et al., 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079
  18. Mali P., Yang L., Esvelt K. M., Aach J., Guell M., et al., 2013. RNA-guided human genome engineering via Cas9. Science 339: 823–826
  19. Moens C. B., Donn T. M., Wolf-Saxon E. R., Ma T. P., 2008. Reverse genetics in zebrafish by TILLING. Brief. Funct. Genomics Proteomics 7: 454–459
  20. Patowary A., Purkanti R., Singh M., Chauhan R., Singh A. R., et al., 2013. A sequence-based variation map of zebrafish. Zebrafish 10: 15–20
  21. Sherry S. T., Ward M. H., Kholodov M., Baker J., Phan L., et al., 2001. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29: 308–311
  22. Stickney H. L., Schmutz J., Woods I. G., Holtzer C. C., Dickson M. C., et al., 2002. Rapid mapping of zebrafish mutations with SNPs and oligonucleotide microarrays. Genome Res. 12: 1929–1934
  23. Streisinger G., Walker C., Dower N., Knauber D., Singer F., 1981. Production of clones of homozygous diploid zebra fish (Brachydanio rerio). Nature 291: 293–296
  24. Teer J. K., Bonnycastle L. L., Chines P. S., Hansen N. F., Aoyama N., et al., 2010. Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Res. 20: 1420–1431
  25. te Riele H., Maandag E. R., Berns A., 1992. Highly efficient gene targeting in embryonic stem cells through homologous recombination with isogenic DNA constructs. Proc. Natl. Acad. Sci. USA 89: 5128–5132
  26. Varshney G. K., Huang H., Zhang S., Lu J., Gildea D. E., et al., 2013a The Zebrafish Insertion Collection (ZInC): a web based, searchable collection of zebrafish mutations generated by DNA insertion. Nucleic Acids Res. 41: D861–D864
  27. Varshney G. K., Lu J., Gildea D. E., Huang H., Pei W., et al., 2013b A large-scale zebrafish gene knockout resource for the genome-wide study of gene function. Genome Res. 23: 727–735
  28. Wang K., Li M., Hakonarson H., 2010. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38: e164.


Articles from Genetics are provided here courtesy of Oxford University Press
