A SNP resource for Douglas-fir: de novo transcriptome assembly and SNP detection and validation

Glenn T Howe; Jianbin Yu; Brian Knaus; Richard Cronn; Scott Kolpak; Peter Dolan; W Walter Lorenz; Jeffrey FD Dean

doi:10.1186/1471-2164-14-137

BMC Genomics. 2013 Feb 28;14:137. doi: 10.1186/1471-2164-14-137

PMCID: PMC3673906 PMID: 23445355

Abstract

Background

Douglas-fir (Pseudotsuga menziesii), one of the most economically and ecologically important tree species in the world, also has one of the largest tree breeding programs. Although the coastal and interior varieties of Douglas-fir (vars. menziesii and glauca) are native to North America, the coastal variety is also widely planted for timber production in Europe, New Zealand, Australia, and Chile. Our main goal was to develop a SNP resource large enough to facilitate genomic selection in Douglas-fir breeding programs. To accomplish this, we developed a 454-based reference transcriptome for coastal Douglas-fir, annotated and evaluated the quality of the reference, identified putative SNPs, and then validated a sample of those SNPs using the Illumina Infinium genotyping platform.

Results

We assembled a reference transcriptome consisting of 25,002 isogroups (unique gene models) and 102,623 singletons from 2.76 million 454 and Sanger cDNA sequences from coastal Douglas-fir. We identified 278,979 unique SNPs by mapping the 454 and Sanger sequences to the reference, and by mapping four datasets of Illumina cDNA sequences from multiple seed sources, genotypes, and tissues. The Illumina datasets represented coastal Douglas-fir (64.00 and 13.41 million reads), interior Douglas-fir (80.45 million reads), and a Yakima population similar to interior Douglas-fir (8.99 million reads). We assayed 8067 SNPs on 260 trees using an Illumina Infinium SNP genotyping array. Of these SNPs, 5847 (72.5%) were called successfully and were polymorphic.

Conclusions

Based on our validation efficiency, our SNP database may contain as many as ~200,000 true SNPs, and as many as ~69,000 SNPs that could be genotyped at ~20,000 gene loci using an Infinium II array—more SNPs than are needed to use genomic selection in tree breeding programs. Ultimately, these genomic resources will enhance Douglas-fir breeding and allow us to better understand landscape-scale patterns of genetic variation and potential responses to climate change.

Background

The availability of high-throughput sequencing methods has led to the discovery of thousands to millions of single nucleotide polymorphisms (SNPs) in diverse organisms, particularly humans, model experimental organisms, and agriculturally important plants and animals. Combined with high-throughput genotyping platforms, SNP markers are having substantial impacts on human medicine as well as plant and animal breeding [1-3]. They are also being used to provide detailed insights into the population genetics of natural populations, and are likely to help elucidate the functional basis of simply inherited traits. In addition, they are frequently cited as the solution for understanding the explicit genetic basis of quantitative traits [4], although prospects for the latter remain uncertain [5].

Our main goal was to develop and test a large number of SNP markers for Douglas-fir (Pseudotsuga menziesii (Mirb.) Franco) that could be used to enhance and accelerate Douglas-fir breeding via genomic selection. Tree breeders typically make breeding decisions based on an individual’s breeding value, which is the average performance of an individual’s offspring. Currently, breeding values are estimated from measurements made in progeny tests containing thousands to tens of thousands of trees. Genomic selection, or whole-genome marker-assisted selection [6], could revolutionize tree breeding by allowing breeders to dramatically reduce the generation interval and extent of progeny testing. Genomic selection has been widely adopted in livestock breeding [7], where empirical studies suggest that accuracies of genomic selection are often 70% or more, compared to accuracies of 30 to 40% for breeding values estimated from parental performance, and accuracies of about 85% for breeding values estimated from progeny testing, which is both time-consuming and costly [3]. However, these encouraging results required SNP resources consisting of thousands to tens of thousands of SNPs—numbers that far exceed what is available for Douglas-fir. In addition to genomic selection, SNP markers are expected to replace simple sequence repeats (SSRs) for routine, automated uses of markers for other breeding program applications. The high variability of SSR markers makes them ideal for many applications, but automated marker scoring is often challenging. In seed orchards, genetic markers (mostly SSRs) are routinely used to confirm the identity of seed orchard trees, measure pollen contamination, assess the effectiveness of pollen management techniques, measure and manage inbreeding and genetic diversity, determine parental contributions to open-pollinated seedlots (i.e., progeny populations), and verify seedlot integrity [8,9]. 
Highly informative genetic markers may also allow breeders to combine simple, cost-effective mating designs (e.g., polymix or open-pollinated designs) with parental analysis to reduce breeding costs, speed breeding progress, and increase genetic gains [10,11].

Douglas-fir is one of the most ecologically and economically important tree species in the world. It occupies diverse habitats from central British Columbia to Mexico, and from the Pacific Ocean to the eastern slopes of the Rocky Mountains. In the Pacific Northwest, coastal Douglas-fir (var. menziesii) forms ancient forests that serve as key habitats for endangered species, and is widely grown in plantations that form the foundation of a multi-billion dollar forest products industry. In the Rocky Mountains, the interior or Rocky Mountain variety (var. glauca (Beissn.) Franco) occupies mostly drier and colder sites, and has a more varied impact on the ecology and economy of the region. In Mexico, Douglas-fir exists as widely dispersed ‘sky-island’ populations that are typically considered extensions of var. glauca, but may deserve their own varietal status [12,13]. Overall, Douglas-fir is ecologically, physiologically, and genetically diverse, within and among varieties (reviewed in [14]). Because of its economic importance, Douglas-fir has one of the largest tree breeding programs in the world. The Northwest Tree Improvement Cooperative program for coastal Douglas-fir has nearly 4 million tested trees, including more than 31,000 first-generation parents tested on 1,016 progeny test sites, and 2,980 second-cycle crosses tested on 129 sites [14] (K. Jayawickrama, personal communication). Smaller breeding programs exist for interior Douglas-fir in the United States and Canada (reviewed in [14]). In coastal Douglas-fir, breeding focuses on improving growth, stem form, and wood properties while maintaining climatic adaptability.

Our primary goal was to greatly expand the SNP resources for Douglas-fir beyond the 200–300 validated SNPs that were previously available [15]. Therefore, we combined two high-throughput sequencing technologies (454 pyrosequencing and Illumina sequencing-by-synthesis) to sequence the transcriptomes of diverse tissues and Douglas-fir genotypes. Our objectives were to (1) develop a reference transcriptome for coastal Douglas-fir by combining existing Sanger sequences with new 454 sequences, (2) annotate and evaluate the quality of the reference transcriptome, (3) map 454 and Illumina short-read sequences to the reference and identify SNPs, and (4) construct and test a high-density Infinium genotyping array. In addition to the SNP markers we developed, our reference transcriptome will facilitate studies of gene expression and function, and will aid efforts to assemble and annotate reference genome sequences of Douglas-fir and other conifers (http://pinegenome.org/pinerefseq/).

Results

Pre-assembly sequence processing for the reference transcriptome

We used long reads from three datasets as the basis for de novo assembly of a reference transcriptome for coastal Douglas-fir. Prior to the final assembly, we cleaned and filtered these datasets as shown in Figure 1 (Steps 1–5). These datasets included 454 sequences from a single genotype (SG₄₅₄ = 1.241 M reads) and sequences from two multi-genotype pools produced using 454 pyrosequencing (MG2₄₅₄ = 1.709 M reads) and Sanger sequencing (MG1_SANG = 12,157 reads). Our initial pool of 2.96 × 10⁶ reads was reduced to 2.78 × 10⁶ reads after filtering using the SnoWhite pipeline (Table 1). The percentage of filtered sequences was substantially smaller for the normalized than for the non-normalized 454 dataset (2.4% for MG2₄₅₄ versus 11.2% for SG₄₅₄), and this effect was most pronounced for the rRNA and retrotransposon-like sequences (Table 1). After removing additional fungal and bacterial sequences, and excluding reads shorter than 50 nt, 2.76 × 10⁶ sequences were available to assemble the reference transcriptome (Table 2).

**Strategy for assembling the Douglas-fir reference transcriptome and detecting SNPs.** We used one Sanger sequence dataset (MG1_SANG) and two 454 sequence datasets (MG2₄₅₄ and SG₄₅₄) to assemble the reference transcriptome. We then used these same datasets plus four Illumina short-read datasets (MG2_IL, CB_IL, YK_IL, INT_IL) to detect SNPs. Orange boxes represent Sanger and 454 datasets, blue boxes represent Illumina short-read datasets, green boxes represent the reference transcriptome, red boxes represent SNP filtering steps, and yellow boxes represent SNP genotyping and analytical steps. The number of SNPs for which Infinium genotyping assays were successfully designed (Assay Design Tool score ≥ 0.6) depends on the probability used for filtering the target SNPs (P_S < 0.01, 0.001, and 0.0001) and the probability used to mask nucleotides in the flanking regions (P_F = 0.1, 0.01, 0.001, and 0.0001). Larger P_F values resulted in more flanking variants and fewer target SNPs with successful designs.

Table 1.

Sequence datasets used to construct the Douglas-fir reference transcriptome*

Plant materials (dataset ID)	Method^†	Total reads in dataset (%)	Number of reads filtered from the input dataset (% of library total)
Collection information	cDNA library		Short or low-quality	Adapter or vector	Chloroplast	Mitochondrial	rRNA	Retrotransposon
Multi-genotype #1 (MG1_SANG)	Sanger	12,157 (100)	57 (0.47)	0 (0.00)	2 (0.02)	2 (0.02)	0 (0.00)	1 (0.01)
Cold season	Normalized
Greenhouse	Non-normalized
Multi-genotype #2 (MG2₄₅₄)	GS-FLX Titanium	1,709,211 (100)	6649 (0.39)	1893 (0.11)	8570 (0.50)	5519 (0.32)	7264 (0.42)	11,114 (0.65)
Cold and warm seasons	Normalized
Single-genotype (SG₄₅₄)	GS-FLX Titanium	1,241,260 (100)	6582 (0.53)	1826 (0.15)	11,070 (0.89)	10,463 (0.84)	86,828 (7.00)	21,849 (1.76)
July 8, 2008	Non-normalized
All datasets		2,962,628 (100)	13,288 (0.45)	3719 (0.13)	19,642 (0.66)	15,984 (0.54)	94,092 (3.18)	32,964 (1.11)


* For each dataset, the numbers of reads filtered using the SnoWhite pipeline (Figure 1, Step 3) are shown by sequence type.

^† GS-FLX Titanium is the Roche 454 sequencing platform.
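
The filtering percentages quoted in the text (2.4% for MG2₄₅₄ versus 11.2% for SG₄₅₄, and roughly 2.78 × 10⁶ reads remaining) can be recomputed from Table 1. The following is a minimal sketch, assuming that the six filter categories are mutually exclusive so that each read is counted once; the dictionary layout and function names are illustrative and are not part of the SnoWhite pipeline.

```python
# Read counts transcribed from Table 1 (SnoWhite filtering, Figure 1, Step 3).
# Filter order: short/low-quality, adapter/vector, chloroplast,
# mitochondrial, rRNA, retrotransposon.
TABLE1 = {
    "MG1_SANG": {"total": 12_157, "filtered": [57, 0, 2, 2, 0, 1]},
    "MG2_454": {"total": 1_709_211,
                "filtered": [6649, 1893, 8570, 5519, 7264, 11_114]},
    "SG_454": {"total": 1_241_260,
               "filtered": [6582, 1826, 11_070, 10_463, 86_828, 21_849]},
}


def percent_filtered(entry):
    """Percentage of a dataset's reads removed across all filter categories.

    Assumption: categories do not overlap, so the counts can be summed.
    """
    return 100.0 * sum(entry["filtered"]) / entry["total"]


def reads_remaining(table):
    """Total reads left after filtering, summed over all datasets."""
    return sum(e["total"] - sum(e["filtered"]) for e in table.values())


for name, entry in TABLE1.items():
    print(f"{name}: {percent_filtered(entry):.1f}% filtered")
print(f"Reads remaining: {reads_remaining(TABLE1):,}")
```
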

Table 2.

Characteristics of the Douglas-fir transcriptome assembly using Newbler v2.3

		Length (nt)
Statistic	Number	Mean	Median	N50	Total
Reads used by Newbler^*	2,764,549	360	392	416	996,614,802
Reads assembled by Newbler^†	2,544,087	364	394	416	925,577,338
Isotigs^§	38,589	1390	1141	1883	53,622,767
Isogroups	25,002	1443	1181	1864	36,069,331
Isogroups with 1 isotig (I1)	18,774	1334	1053	1750	25,046,862
Isogroups with >1 isotig (IM)^‡	6228	1770	1547	2141	11,022,469
Singletons	102,623	356	384	413	36,504,221
Total (isogroups + singletons)	127,625	569	413	517	72,573,552


^* The input number of reads is less than the total in Table 1 (2.96 × 10⁶) because filtered reads, contaminant reads, and reads shorter than 50 nt were excluded. Statistics were calculated using the sequences actually used in the assembly after applying a default minimum length of 40 for reads trimmed by Newbler.

^† Includes reads that assembled as complete reads or as partial reads.

^§ Isotigs ≥ 200 nt were deposited at DDBJ/EMBL/GenBank under accession GAEK01000000.

^‡ Statistics for the IM isogroups were calculated using the longest isotig in each isogroup.
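
Table 2 reports N50 alongside the mean and median lengths. As a reminder of what that statistic measures, here is a minimal sketch of the usual N50 definition (the length at which sequences of that length or longer contain at least half of the total bases). It is an illustration only; the exact procedure used by Newbler or by the authors may differ in detail, and the toy lengths are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and their lengths are
    accumulated; N50 is the length of the sequence at which the running
    total first reaches half of the summed length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Invented example lengths (nt), not data from the assembly.
toy_lengths = [2000, 1500, 1000, 500, 400, 300, 300]
print(n50(toy_lengths))
```

Because N50 weights long sequences more heavily than the mean or median does, it is typically larger than both for assembled isotigs, as in Table 2.
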

Assembly of the reference transcriptome

In this section, we describe the preliminary and final assemblies of the reference transcriptome (Figure 1, Steps 4–6), and analyses we used to infer the orientation of the resulting isotigs and singletons. Different assembly parameters resulted in few differences in the number of resulting isogroups (overlap length = 35 or 45; overlap identity = 82 to 98%; overlap difference score = −2 or −6). In particular, there was almost no increase in the total number of isogroups when the overlap identity was increased from 82% to 90%, and only a slight increase from 90% to 98%. The final de novo assembly was constructed using a minimum overlap length of 45 nt, minimum overlap identity of 96%, and an alignment difference score of −6. However, before conducting the final assembly, we assembled the 454 datasets (MG2₄₅₄ and SG₄₅₄) separately, and then used BLASTN to compare the resulting isotigs and singletons to a series of databases to identify and remove sequences from contaminating fungal and bacterial organisms (Figure 2). After the final assembly, we used Vmatch to eliminate redundant sequences from 40,010 assembled isotigs, resulting in 38,589 non-redundant isotigs with an average length of 1,390 nt and N50 of 1,883 nt (Table 2). The resulting reference transcriptome consisted of 25,002 isogroups (unique gene models) and 102,623 singletons. Of these 25,002 isogroups, 18,774 were represented by a single isotig (transcript variant), and are inferred to correspond to a single transcript. These isogroups and isotigs are referred to as the ‘I1’ (Isogroups with 1 isotig) subset in the following analyses. The remaining 6,228 isogroups were represented by multiple isotigs, which suggests they represent alternatively spliced transcripts from the same gene. These isogroups and isotigs are subsequently referred to as the ‘IM’ (Isogroups with Multiple isotigs) subset.
The reference transcriptome (i.e., 37,177 isotigs ≥ 200 nt) has been deposited at DDBJ/EMBL/GenBank under accession GAEK01000000. The characteristics of the transcriptome isotigs and singletons are described in Additional files 1 and 2.

**Taxonomic distributions of Douglas-fir sequences identified as bacterial or fungal contaminants.** We used preliminary assemblies of the SG₄₅₄ and MG2₄₅₄ datasets and BLAST searches to identify isotigs and singletons resulting from bacterial or fungal contamination (see Methods). Reads corresponding to these singletons and isotigs were removed prior to the final assembly. Numbers in parentheses are the total number of sequences (isotigs and singletons) in each category.

Mapping of strand-oriented reads from the CB_IL and YK_IL libraries allowed us to infer the orientations of 73.4% of the isotigs and 9.5% of the singletons (Additional file 1: Tables S1 and S3). The orientations of the remaining isotigs and singletons were ambiguous because the binomial test was non-significant or no data were available (i.e., no Illumina reads were successfully mapped). Other assembly characteristics are reported in Table 2.
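
The orientation calls described above rest on a binomial test of strand-specific read counts. The sketch below shows one way such a test could be set up: reads mapping to each strand are counted, and an exact two-sided binomial test against a 50:50 expectation decides whether the orientation is supported. The function names, the 0.05 significance level, and the two-sided formulation are assumptions for illustration; the parameters actually used are given in the Methods and are not reproduced here.

```python
from math import exp, lgamma, log


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial P-value for k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one. Log-space arithmetic avoids overflow for large n.
    """
    def log_prob(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))

    observed = log_prob(k)
    total = sum(exp(log_prob(i)) for i in range(n + 1)
                if log_prob(i) <= observed + 1e-9)
    return min(1.0, total)


def infer_orientation(forward_reads, reverse_reads, alpha=0.05):
    """Classify a sequence as 'forward', 'reverse', or 'ambiguous'.

    alpha is a hypothetical significance level chosen for this sketch.
    """
    n = forward_reads + reverse_reads
    if n == 0:
        return "ambiguous"  # no mapped reads, so no evidence either way
    if binomial_two_sided_p(forward_reads, n) >= alpha:
        return "ambiguous"  # strand counts are consistent with 50:50
    return "forward" if forward_reads > reverse_reads else "reverse"


print(infer_orientation(30, 2))
print(infer_orientation(6, 5))
print(infer_orientation(0, 0))
```

The two "ambiguous" branches correspond to the two cases named in the text: a non-significant test, and sequences with no mapped Illumina reads.
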

Comparison to white spruce and loblolly pine

To evaluate our assembled isotigs, we compared them to a well-characterized set of white spruce unigenes (Figure 1, Step 8). Isotigs with clear, interpretable relationships to these unigenes were assigned high confidence scores, and were preferentially included on the SNP array. We categorized the isotigs into confidence classes (C1-C7) based on their relationships to the white spruce unigenes described by Rigault et al. [16] (Table 3). Lower numbers represent simpler relationships and (hypothetically) greater confidence that the assembly is correct. Although the percentages of unmatched isotigs (Class C7) were roughly equal for the two subsets (I1 = 28% and IM = 24%), the other classes differed dramatically between the I1 and IM subsets (Table 3). This reflects the more complex relationships that are possible for isogroups with multiple isotigs (IM subset), and shows how this leads to generally lower confidence scores for this group.

Table 3.

Comparison between Douglas-fir isotigs and white spruce unigenes [16]

					Number of isotigs
Class^*	No. of WS matches^†	Do other DF match the same WS?^§	Do other matching DF overlap?^‡	Isotig confidence^#	I1 subset (1 isotig per isogroup) (18,774)	IM subset (>1 isotig per isogroup) (19,815)	Example visual representations^@
C1	1	No	-	Highest	5140	261
C2	2+	No	-	Higher	896	88
C3	1	Yes	No	Higher	1767	577
C4	2+	Yes	No	Medium	586	159
C5	1	Yes	Yes	Lower	1736	6974
C6	2+	Yes	Yes	Lowest	3405	7040
	Subtotal	-	-	-	13,530	15,099
C7	No matches	-	-	Unknown	5244	4716


^*Douglas-fir (DF) isotigs were categorized into seven classes (C1-C7) and three levels of confidence based on their relationships to white spruce (WS) contigs using the SCARF program [68].

^†Number of white spruce contigs that matched the Douglas-fir query.

^§‘Yes’ indicates that at least one non-query isotig also matched the same white spruce contig.

^‡‘Yes’ indicates that the query and at least one non-query isotig matched the same region of the white spruce contig (overlapped).

^#Subjective level of confidence in the isotig assembly based on the information presented in columns 2–4.

^@Cross-hatched bars represent white spruce contigs, black bars represent query Douglas-fir isotigs, and white bars represent non-query Douglas-fir isotigs.
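
The class definitions in Table 3 amount to a small decision table over three observations per isotig. The sketch below encodes columns 2 to 4 of Table 3 as a function; the function and argument names are hypothetical, and the inputs would in practice come from the SCARF comparison rather than being supplied by hand.

```python
# Confidence labels as listed in the "Isotig confidence" column of Table 3.
CONFIDENCE = {
    "C1": "Highest", "C2": "Higher", "C3": "Higher", "C4": "Medium",
    "C5": "Lower", "C6": "Lowest", "C7": "Unknown",
}


def confidence_class(n_ws_matches, other_df_match=False, other_df_overlap=False):
    """Assign a Douglas-fir isotig to class C1-C7 following Table 3.

    n_ws_matches     -- number of white spruce contigs matched by the query
    other_df_match   -- another Douglas-fir isotig matches the same contig
    other_df_overlap -- that other isotig overlaps the query's matched region
    """
    if n_ws_matches == 0:
        return "C7"  # no white spruce match
    multiple = n_ws_matches >= 2
    if not other_df_match:
        return "C2" if multiple else "C1"
    if not other_df_overlap:
        return "C4" if multiple else "C3"
    return "C6" if multiple else "C5"


for args in [(1, False, False), (3, True, False), (1, True, True), (0,)]:
    cls = confidence_class(*args)
    print(args, cls, CONFIDENCE[cls])
```
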

When we conducted the same SCARF analysis using 35,550 loblolly pine contigs, the results were nearly identical to those of white spruce (data not shown). For example, the correlation between the numbers of isotigs in each confidence class was 0.96 between spruce and pine (i.e., 0.96 for both the I1 and IM isotigs). In addition, 67% of the no-hit isotigs found using spruce as the reference (n = 9960) were also no-hits using pine as the reference (n = 6651). Conversely, 80% of the no-hit isotigs found using pine as the reference (n = 8293) were also no-hits using spruce (n = 6651).

Annotation

We annotated the isotigs and singletons (Figure 1, Step 8), and then used this information to select SNPs for the SNP array. For all three protein databases, the percentages of sequences with matches were highest for the I1 subset, moderate for the IM subset, and lowest for the singletons (S subset) (Table 4). The Annot8r annotation tool creates subsets of selected UniProt databases that only include protein sequences with GO, EC, or KEGG annotations. In contrast, many of the proteins in the Uniref50 database, and some of the proteins in the TAIR10 database, have unknown functions. Thus, the results from Annot8r provide the percentages of Douglas-fir sequences that can be annotated by function (62.5% for the I1 subset, 46.0% for the IM subset, and 14.5% for the singletons). We subsequently used these annotations to target SNPs associated with growth, phenological traits, stress resistance, or adaptation to temperature or drought. In contrast, the distribution of matches among taxonomic groups did not differ substantially among subsets I1, IM, and S (Table 5). Small percentages (I1 = 0.98%, IM = 0.35%, S = 5.18%) of the matched Douglas-fir sequences matched fungal, bacterial, and viral sequences at an E-value < 10^-5, a threshold that is much less stringent than the 10^-10 E-value we used to identify contaminating isotigs and singletons during the filtering that preceded our final assembly.

Table 4.

Numbers and percentages of Douglas-fir sequences with matches to sequences in three protein databases*

	Isogroups (25,002)^†				Singletons (102,623)^§
	Isogroups with 1 isotig (I1 = 18,774)		Isogroups with >1 isotig (IM = 6228)		Singletons (S = 102,623)
Database	Number	Percent	Number	Percent	Number	Percent
Uniref50	15,054	80.2	3446	55.3	25,757	25.1
TAIR10	13,749	73.2	3260	52.3	15,907	15.5
Annot8r	11,733	62.5	2862	46.0	14,836	14.5


^*Matches were recorded for isogroups and singletons at a tBLASTX E-value < 10^-5.

^†Isogroups are Newbler v2.3 isogroups. For the isogroups with more than 1 isotig (IM subset), a hit was counted only if all isotigs matched the same protein in the database.

^§Singletons are 454 reads that did not assemble with any other reads.
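
The counting rule for the IM subset (a hit is counted only if all isotigs in the isogroup match the same protein) can be written as a short predicate. This is a sketch under one reading of that rule, namely that each isotig's best hit must be the same database protein; the identifiers below are invented, and the authors' implementation may have compared hits differently.

```python
def isogroup_has_hit(best_hits):
    """Return True if every isotig in the isogroup matches the same protein.

    best_hits maps isotig ID -> best-matching protein ID, or None when the
    isotig had no match below the E-value threshold (10^-5 in Table 4).
    For an I1 isogroup (one isotig) this reduces to "the isotig has a hit".
    """
    proteins = set(best_hits.values())
    return len(proteins) == 1 and None not in proteins


def count_isogroup_hits(isogroups):
    """Count isogroups that satisfy the rule above."""
    return sum(isogroup_has_hit(hits) for hits in isogroups.values())


# Hypothetical isogroups and protein identifiers, for illustration only.
example = {
    "isogroup_a": {"isotig_1": "protein_X"},                         # I1, hit
    "isogroup_b": {"isotig_2": "protein_Y", "isotig_3": "protein_Y"},  # IM, hit
    "isogroup_c": {"isotig_4": "protein_Y", "isotig_5": "protein_Z"},  # IM, mixed
    "isogroup_d": {"isotig_6": "protein_X", "isotig_7": None},         # IM, partial
}
print(count_isogroup_hits(example))
```

Under this rule an IM isogroup is harder to score as a hit than an I1 isogroup, which is one reason the IM percentages in Table 4 should not be compared to the I1 percentages without that caveat.
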

Table 5.

Numbers and percentages of Douglas-fir sequences with matches to sequences in the Uniref50 protein database*

	Isogroups (25,002)^†				Singletons (102,623)^§
	Isogroups with 1 isotig (I1 = 18,774)		Isogroups with >1 isotig (IM = 6228)		Singletons (S = 102,623)
Taxonomic category	Number	Percent of matches	Number	Percent of matches	Number	Percent of matches
Conifers	4088	27.16	1073	31.14	6486	25.18
Other plants	9713	64.52	2047	59.40	16,061	62.36
Other Eukaryotes	582	3.87	182	5.28	658	2.55
Invertebrates	487	3.24	120	3.48	1087	4.22
Bacteria	123	0.82	8	0.23	830	3.22
Environmental	21	0.14	6	0.17	37	0.14
Vertebrates	17	0.11	6	0.17	92	0.36
Fungi	19	0.13	4	0.12	487	1.89
Viruses	4	0.03	0	0.00	19	0.07
Total matches	15,054	100.00	3446	100.00	25,757	100.00
Unmatched	3720	-	2782	-	76,866	-
Percent matched	80.2	-	55.3	-	25.1	-


^*Matches are grouped by taxonomic affiliation and percentages are relative to the total number of matches (tBLASTX E-value < 10^-5). Numbers of input Douglas-fir sequences are in parentheses.

^†Isogroups are Newbler v2.3 isogroups. For the isogroups with more than 1 isotig (IM subset), a hit was counted only if all isotigs matched the same protein in the database.

^§Singletons are 454 reads that did not assemble with any other reads.

The differences in the distributions of GO slim classifications among the three types of Douglas-fir sequences (I1, IM, and S) were small (Figure 3). Compared to Douglas-fir, many more Arabidopsis sequences fell into the “unknown cellular components” and “unknown molecular functions” classes. This indicates that Douglas-fir sequences were less likely to match these classes of Arabidopsis sequences than others, suggesting that they tend to exhibit species-specific characteristics (i.e., are more highly diverged or absent from Douglas-fir). Presumably, many of the unmatched Douglas-fir genes would fall into these GO slim classes had we used a less stringent E-value.

**Distributions of Douglas-fir sequences and** ***Arabidopsis*** **genes by GO slim terms.** Distributions are shown for *Arabidopsis* genes (TAIR10 accessions), two types of Douglas-fir isogroups (I1 subset = isogroups with one isotig and IM subset = isogroups with more than one isotig), and Douglas-fir singletons.

SNP detection

Two criteria are important for selecting SNPs for an Infinium genotyping array. First, the target SNP should have a high probability (i.e., low P-value, P_S) of being a true variant. Second, the target SNP should have no variants in its flanking sequences where the genotyping primers must hybridize. Therefore, the P-values for flanking variants (SNPs and indels, P_F) should also be considered. A high SNP conversion rate is expected when a very high P-value (permissive probability threshold) is used for flanking variants, and a very low P-value (stringent probability threshold) is used for target SNPs. However, this approach will dramatically reduce the number of SNPs that can be detected and assayed. In this section, we describe how we filtered all potential target SNPs based on P_S, P_F, and other criteria (Figure 1, Steps 9–10).

We used a permissive probability threshold (P_F = 0.10) to detect potential SNPs and indels in the flanking regions of target SNPs (Figure 1, Step 9). These positions were then excluded (masked) from consideration when the genotyping primers were designed. Out of a total assembly of 72.6 × 10⁶ nucleotides, we masked 820,253 SNPs and 119,728 indel positions. In the subsequent filtering step to identify target SNPs, we identified bi-allelic SNPs that were not near high-quality indels (i.e., indels with scores ≥ 25), had a mapping quality score > 40 in at least one dataset, and target SNP probabilities (P_S) of < 10^-2, 10^-3, or 10^-4 in at least one dataset (Figure 1, Step 10). For the most stringent (10^-4) level of probability, these criteria resulted in 278,979 potential SNPs (Additional file 3), 183,380 of which were detected in more than one dataset (Table 6). Many SNPs were detected in both the coastal and interior datasets—151,014 shared SNPs in 17,361 isogroups. On average, these shared SNPs represented 74% of all coastal SNPs and 67% of all interior SNPs. Not surprisingly, more SNPs were detected in the larger datasets (Table 6).
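
The target-SNP criteria listed above can be summarized as a filter over candidate positions. The sketch below is illustrative only: the data classes and field names are hypothetical, the definition of "near" an indel is left to an upstream step because the distance is not stated here, and the mapping-quality and P_S conditions are treated as independent "at least one dataset" tests, which is an assumption about how the two criteria combine.

```python
from dataclasses import dataclass


@dataclass
class DatasetCall:
    p_value: float          # target SNP probability (P_S) in one dataset
    mapping_quality: float  # mapping quality score in that dataset


@dataclass
class Candidate:
    alleles: tuple                         # observed alleles, e.g. ("A", "G")
    calls: dict                            # dataset ID -> DatasetCall
    near_high_quality_indel: bool = False  # indel with score >= 25 nearby


def passes_target_filter(snp, p_threshold=1e-4, min_mapq=40):
    """Apply the Step 10 criteria as described in the text."""
    if len(set(snp.alleles)) != 2:
        return False  # keep bi-allelic SNPs only
    if snp.near_high_quality_indel:
        return False
    has_mapq = any(c.mapping_quality > min_mapq for c in snp.calls.values())
    has_p = any(c.p_value < p_threshold for c in snp.calls.values())
    return has_mapq and has_p


candidate = Candidate(("A", "G"), {
    "MG2_IL": DatasetCall(p_value=1e-6, mapping_quality=60),
    "YK_IL": DatasetCall(p_value=0.2, mapping_quality=35),
})
for threshold in (1e-2, 1e-3, 1e-4):
    print(threshold, passes_target_filter(candidate, p_threshold=threshold))
```

Tightening `p_threshold` from 10^-2 to 10^-4 can only shrink the set of passing SNPs, which is the trade-off between confidence and yield discussed in this section.
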

Table 6.

Numbers of potential SNPs detected in Douglas-fir using an individual dataset probability value of 10^-4

				No. of unique or shared SNPs^*: all isotigs (1 isotig/isogroup (I1))
Plant materials (dataset ID)	Seed source	Sequencing platform	No. of reads (× 10⁶)	Unique	Coastal	Yakima	Interior	Total^†
Multi-genotype #1 (MG1_SANG)	Coastal	Sanger	2.77	3982 (2606)	101,089 (85,635)	29,922 (25,523)	81,633 (69,109)	107,884 (90,487)
Multi-genotype #2 (MG2₄₅₄)	Coastal	Roche 454
Single-genotype (SG₄₅₄)	Coastal	Roche 454
Multi-genotype #2 (MG2_IL)	Coastal	Illumina	64.00	18,694 (15,617)	192,693 (162,560)	41,952 (35,700)	146,242 (123,503)	192,693 (162,560)
Coos Bay (CB_IL)	Coastal	Illumina	13.41	1044 (895)	66,304 (56,547)	29,051 (24,703)	53,275 (45,437)	66,304 (56,547)
Yakima (YK_IL)	Yakima	Illumina	8.99	638 (545)	43,066 (36,621)	-	40,840 (34,750)	47,573 (40,505)
Interior (INT_IL)	Interior	Illumina	80.45	71,241 (61,334)	151,014 (127,403)	40,840 (34,750)	-	226,124 (192,076)
Total			169.62


^*The number of unique SNPs and the number of SNPs shared in other datasets of the coastal, Yakima, and interior seed sources are presented for all isogroups (I1 + IM) and for the 1 isotig per isogroup subset (I1) (in parentheses). The total number of unique SNPs detected in all datasets was 278,979.

^†SNP totals are not the sums of the values in the same row because SNPs are double-counted. For example, we detected 66,304 SNPs in the CB_IL dataset, 29,051 of which were detected in the YK_IL dataset and 53,275 of which were detected in the INT_IL dataset.
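
The unique and shared counts in Table 6 follow from set membership: a SNP is unique if it appears in exactly one dataset and shared otherwise. The sketch below illustrates that bookkeeping on invented SNP identifiers, and then checks that the "Unique" column of Table 6 is consistent with the 183,380 SNPs reported as detected in more than one dataset. The function name and toy data are assumptions.

```python
from collections import Counter


def unique_and_shared(datasets):
    """Summarize SNP sharing among datasets.

    datasets maps a dataset ID to the set of SNP identifiers detected in it.
    Returns (unique counts per dataset, total distinct SNPs, SNPs found in
    more than one dataset).
    """
    counts = Counter(snp for snps in datasets.values() for snp in snps)
    unique = {name: sum(1 for s in snps if counts[s] == 1)
              for name, snps in datasets.items()}
    n_total = len(counts)
    n_shared = sum(1 for c in counts.values() if c > 1)
    return unique, n_total, n_shared


# Invented identifiers, for illustration only.
toy = {"A": {"s1", "s2", "s3"}, "B": {"s2", "s3", "s4"}, "C": {"s3", "s5"}}
print(unique_and_shared(toy))

# Consistency check using the "Unique" column of Table 6 (all isotigs).
TABLE6_UNIQUE = [3982, 18_694, 1044, 638, 71_241]
TOTAL_SNPS = 278_979
print(TOTAL_SNPS - sum(TABLE6_UNIQUE))
```
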

SNP array

In this section, we describe other criteria, including Infinium design scores, which were used to select the final set of SNPs to test on the genotyping array. Design scores are values generated by the Infinium Assay Design Tool that are associated with the performance of SNP assays. Design scores could be generated for only 34% (95,478/278,979) of the target SNPs submitted to Illumina (Figure 1, Step 11), primarily because of the permissive probability threshold (P_F = 0.10) used for calling variants in the flanking sequences. That is, assays were not possible for 66% of the target SNPs because of flanking SNPs and indels in the assay design region (failure code 106). We then selected 8769 SNPs to test using an Infinium genotyping array (Figure 1, Step 12) [18]. Selection criteria included differential expression, annotations, target SNP probabilities, minor allele frequencies (MAF), Illumina design scores, and SNP array capacity. Of the 8769 attempted SNPs, only 8067 (92%) were assayed because of the normal loss of bead types that occurs during array manufacture (Figure 1, Step 13). Of these, 7256 SNPs had call frequencies ≥ 85%, and 5847 of these were polymorphic in a sample of 260 trees (i.e., successful SNPs in Figure 1 and Table 7). Of the 5847 successful SNPs, 263 (4.5%) had significant deviations from Hardy-Weinberg equilibrium (HWE). The characteristics of the successfully called and polymorphic SNPs are described in Additional file 4 and summarized in Table 8.

Table 7.

Douglas-fir SNPs detected using an Illumina Infinium SNP array (n = 260 trees)

		No. of SNPs in category/No. of SNPs attempted or assayed (%)^*
SNP category	Number of SNPs	Attempted (n=8769)	Assayed (n=8067)
SNPs attempted	8769	100.0	108.7
SNPs assayed	8067	92.0	100.0
Called SNPs (call frequency > 0.85)^†	7256	82.7	89.9
Called SNPs that are polymorphic (MAF > 0)	5847	66.7	72.5
Percent of called SNPs that are polymorphic (5847/7256) = 80.6


^*The number of SNPs in each category is expressed as a percentage of the total number of SNPs attempted (n = 8769) and number of SNPs successfully assayed on the array (n = 8067).

^†Successful calls are those with a GenCall score ≥ 0.15 [19].
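
The percentages in Table 7, and the projection in the Abstract that the database may contain about 200,000 true SNPs and about 69,000 genotypable SNPs, can be reproduced with simple arithmetic. The sketch below assumes that the projection is obtained by multiplying the validation rate among assayed SNPs (5847/8067) by the number of candidate SNPs and by the number of SNPs with design scores; that is our reading of "based on our validation efficiency", and the variable names are illustrative.

```python
ATTEMPTED = 8769       # SNPs submitted for the array
ASSAYED = 8067         # SNPs present after bead-type loss
CALLED = 7256          # call frequency above the threshold
POLYMORPHIC = 5847     # called and MAF > 0 ("successful" SNPs)
TARGET_SNPS = 278_979  # candidate SNPs at P_S < 10^-4
DESIGNABLE = 95_478    # candidate SNPs with an Infinium design score


def pct(part, whole):
    """Percentage rounded to one decimal place, as in Table 7."""
    return round(100 * part / whole, 1)


print("assayed / attempted:", pct(ASSAYED, ATTEMPTED))
print("called / assayed:", pct(CALLED, ASSAYED))
print("polymorphic / assayed:", pct(POLYMORPHIC, ASSAYED))
print("polymorphic / called:", pct(POLYMORPHIC, CALLED))

# Extrapolation under the stated assumption.
validation_rate = POLYMORPHIC / ASSAYED
projected_true = TARGET_SNPS * validation_rate
projected_genotypable = DESIGNABLE * validation_rate
print(f"projected true SNPs: {projected_true:,.0f}")
print(f"projected genotypable SNPs: {projected_genotypable:,.0f}")
```

The extrapolation treats the assayed SNPs as representative of the whole database, although they were selected partly on probability, design score, and annotation, so the projected values are best read as upper bounds, consistent with the "as many as" wording in the Abstract.
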

Table 8.

Characteristics of 5847 successful SNPs based on data from an Illumina Infinium SNP array^*

SNP characteristic	Mean	Median	Range
GenTrain score	0.81	0.84	0.35-0.98
GC50 score (median GenCall score)	0.78	0.87	0.15-0.99
Call frequency^†	0.99	1.00	0.85-1.00
Minor allele frequency (MAF)	0.24	0.24	0.002-0.5
Heterozygosity (observed)	0.33	0.36	0.00-1.00
Heterozygosity (expected)	0.32	0.36	0.004-0.5
Number of SNPs with a significant HWE deviation = 263 (4.5%)^§


^*Successful SNPs are those with a call frequency > 0.85 and MAF > 0 based on an analysis of 260 trees.

^† Successful calls are those with a GenCall score ≥ 0.15 [19].

^§ Tested using an exact test of HWE and a probability level of 0.9 × 10^-5 (i.e., Bonferroni-corrected P-value of 0.05 based on 5847 SNPs).
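
The significance level in the footnote above is a Bonferroni correction (0.05 divided by 5847 tests). The sketch below computes that threshold and pairs it with one common formulation of an exact HWE test, in which the P-value is the summed probability of all heterozygote counts that are no more likely than the observed count, given the allele counts. Whether this matches the exact test used by the authors is an assumption, and the genotype counts in the example are invented.

```python
from math import exp, lgamma, log


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact test of Hardy-Weinberg equilibrium for a bi-allelic SNP.

    Conditions on the observed allele counts and sums the probabilities of
    every possible heterozygote count that is no more probable than the
    observed one.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n_bb + n_ab  # copies of allele B

    def log_prob(hets):
        homs_a = (n_a - hets) // 2
        homs_b = (n_b - hets) // 2
        return (lgamma(n + 1) - lgamma(homs_a + 1) - lgamma(hets + 1)
                - lgamma(homs_b + 1) + hets * log(2)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    observed = log_prob(n_ab)
    total = 0.0
    # Heterozygote counts must have the same parity as the allele counts.
    for hets in range(n_a % 2, min(n_a, n_b) + 1, 2):
        lp = log_prob(hets)
        if lp <= observed + 1e-9:
            total += exp(lp)
    return min(1.0, total)


threshold = bonferroni_threshold(0.05, 5847)
print(f"Bonferroni threshold: {threshold:.2e}")

# Invented genotype counts (AA, AB, BB), for illustration only.
for counts in [(25, 50, 25), (50, 0, 50)]:
    p = hwe_exact_p(*counts)
    print(counts, "deviates" if p < threshold else "consistent with HWE")
```
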

Using logistic regression, we identified eight bioinformatic characteristics significantly related to the ability to distinguish the 5847 successful SNPs from the remaining 2220 SNPs that were assayed (P < 0.05). The order in which the variables entered the model reflects their relative importance: (1) number of datasets in which the SNP was detected, (2) mean number of reads across datasets, (3) number of contigs per isotig, (4) minimum SNP probability across datasets, (5) isotig length, (6) isotig type (i.e., single isotig/isogroup, longest of multiple isotigs/isogroup, or one of multiple isotigs/isogroup), (7) mean SNP probability across datasets, and (8) confidence group (C1-C7). The four variables that did not enter the model were the mean minor allele frequency across datasets, number of isotigs per isogroup, SNP IUPAC code, and Illumina design score.

Discussion

We developed a reference transcriptome and large SNP database for Douglas-fir that will serve as a resource for a variety of research and breeding applications. We detected SNPs by aligning 454 and Illumina short-read sequences to a reference transcriptome, and then identifying SNP and indel polymorphisms. During this process, we incorporated steps specifically designed to sequence transcripts from diverse genotypes, tissues, and environmental conditions; identify highly likely SNP positions; and maximize the number of SNPs that can be reliably assayed using an Illumina Infinium II SNP array. A thorough evaluation of the reference transcriptome provides information on the sequences from which the SNPs were derived, including annotations. A set of 278,979 SNPs was deposited in the dbSNP database with submitted SNP ID numbers (ss#) ranging from 523,746,501 to 524,245,331.