Abstract
Until very recently, complete characterization of the megagenomes of conifers has remained elusive. Sugar pine (Pinus lambertiana Dougl.), a diploid species, has a highly repetitive, 31 billion bp genome. It is the largest genome sequenced and assembled to date, and the first from the subgenus Strobus, or white pines, a group that is notable for having the largest genomes among the pines. The genome represents a unique opportunity to investigate genome “obesity” in conifers and white pines. Comparative analysis of P. lambertiana and P. taeda L. reveals new insights on the conservation, age, and diversity of the highly abundant transposable elements, the primary factor determining genome size. As for most North American white pines, the principal pathogen of P. lambertiana is white pine blister rust (Cronartium ribicola J.C. Fischer ex Raben.). Identification of candidate genes for resistance to this pathogen is of great ecological importance. The genome sequence afforded us the opportunity to make substantial progress on locating the major dominant gene for simple resistance hypersensitive response, Cr1. We describe new markers and gene annotation that are both tightly linked to Cr1 in a mapping population, and associated with Cr1 in unrelated sugar pine individuals sampled throughout the species’ range, creating a solid foundation for future mapping. The genomic variation and annotated candidate genes characterized in our study of the Cr1 region are resources for future marker-assisted breeding efforts as well as for investigations of fundamental mechanisms of invasive disease and evolutionary response.
Keywords: conifer genome, transposable elements, white pine blister rust
The gymnosperm genus Pinus is diverse and ubiquitous in temperate zones (Critchfield and Little 1966; Farjon and Filer 2013). Pines are often the keystone trees of terrestrial ecosystems (Richardson and Rundel 1998; Keane et al. 2012, and citations therein). Typical of conifers, pines have megagenomes that vary greatly in size among species, yet their karyotype is highly conserved. Pinus is divided into two large, ancient monophyletic subgenera, Strobus and Pinus, “white pines” and “yellow pines,” respectively (Critchfield and Little 1966; Gernandt et al. 2005). The first Pinus genome sequence (22 Gbp) was recently reported for Pinus taeda L. (Zimin et al. 2014), a yellow pine commonly known as loblolly pine. The genomes of white pines are larger and more variable in size (Tomback 1982). Fossils allied with Strobus are known from the early Tertiary and late Cretaceous (Millar 1998), consistent with molecular phylogenetic dating of the crown group Strobus at 45–85 MYA (Willyard et al. 2007; DeGiorgio et al. 2014). Populations of a number of the majestic white pines of North America, and their associated ecosystems, have been devastated over the last century by white pine blister rust, WPBR (Kinloch 1992), caused by a highly pathogenic and invasive fungus, Cronartium ribicola J.C. Fischer ex Raben. While major gene resistance to this disease has been discovered in several species, and loci have been placed on the genetic maps of Pinus lambertiana Dougl. (Harkins et al. 1998; Jermstad et al. 2011) and P. monticola Dougl. ex D.Don (Liu et al. 2006), the discovery of the underlying genes, and of markers serviceable for genetic improvement in reforestation, may be greatly accelerated by the genome sequence itself.
P. lambertiana, commonly known as sugar pine, is a white pine native to western North America that is distributed from northern Oregon to Baja California at a wide span of altitudes. It is currently the tallest pine species, with heights reaching 76 m. The female cones of sugar pine are also gigantic, often longer than 600 mm (Kinloch and Scheuner 1990; Van Pelt 2001; American Forests 2015). P. lambertiana trees may live > 500 years, and the onset of the species’ sexual reproduction is delayed compared to other pines, possibly due to the height and girth needed to support these massive strobili. Paralleling these oversized dimensions, the genome of P. lambertiana was estimated from cytometry to be 31 Gbp (see below), nearly 50% larger than that of P. taeda and ten times the size of the human genome. While P. lambertiana was historically a significant timber source, heavy harvesting and the arrival of the devastating white pine blister rust in its range have changed the management focus. Since this species plays important ecological roles in the maintenance of biodiversity, carbon sequestration, soil stabilization, and watershed protection (Maloney 2012), considerable effort and resources have been deployed both by the US Forest Service and the private sector to structure the genetics of reforestation to fit the ecological factors, especially WPBR (reviewed in Waring and Goodrich 2012). In particular, the screening by progeny testing of diverse seed sources for individual trees carrying the major gene for WPBR resistance, Cr1 (Kinloch 1992), has been ongoing for more than a decade. The extra costs of collecting seeds from candidate trees throughout the species range, of progeny testing for WPBR resistance (requiring several years), and of deploying resistant seedlings are significant components of forest management.
Genotyping by markers with strong associations to WPBR resistance has the potential to greatly reduce both the effort and time required by the ongoing approach, and could open new strategies. Here, we demonstrate that the sequencing, assembly, and annotation of the genome sequence of P. lambertiana greatly accelerates the discovery of such genetic tools.
Conifer evolution and genome size
All members of the genus Pinus have 12 chromosomes (Saylor 1960) and are considered to be karyotypically stable throughout their evolutionary history (Sax 1960; Saylor 1964). With the exception of a potential event preceding the radiation (Li et al. 2015), whole genome polyploidy is thought to be absent among the ≥100 species. However, the amount of nuclear DNA that comprises a single copy of a pine genome can vary widely between species. Flow cytometric estimates for the genus Pinus in the C-values database (Bennett and Leitch 2012) range from a low of 20 Gbp for P. muricata D. Don, to a high of 35 Gbp for P. ayacahuite Ehrenb. ex Schltdl. (Figure 1B). The correlates and causes of this variation in genome size, including in Pinus, are an open topic of speculation and investigation (Williams et al. 2002; Grotkopp et al. 2004; Ahuja and Neale 2005; Morse et al. 2009).
The two subgenera of Pinus diverged ∼45–85 MYA (Figure 1A) (see also Willyard et al. 2007). Members of Strobus have an average genome size 5.2 Gbp larger than the subgenus Pinus (Figure 1B) (Grotkopp et al. 2004). The majority of sequenced conifer megagenomes are composed of interspersed repetitive sequences, with estimates ranging from 69% for Picea abies (L.) H. Karst. (Nystedt et al. 2013) to 80% for P. taeda (Wegrzyn et al. 2014). The evolutionary dynamics of transposable elements (TEs) have long been suspected to shape genomic change, including overall genome size, in numerous species (Orgel and Crick 1980; Hawkins 2006; Piegu et al. 2006; Tenaillon et al. 2011), including conifers (Nystedt et al. 2013). In contrast to angiosperms, where genome duplication events and LTR retrotransposon bursts are frequent, and account for most of the genome size expansions, a continual accretion of repeats may provide a better explanation of genome size variation within the genus Pinus (Morse et al. 2009). The genome sequence of P. lambertiana presents a new opportunity to address elements of the hypothesis that TE dynamics are behind these significant changes in genome size.
White pine blister rust
WPBR is caused by the non-native heteroecious fungus Cronartium ribicola, which infects North American pines of the Strobus subgenus. An invasive species, C. ribicola has devastated populations of five-needle pines, including P. strobus L. (eastern white pine), P. monticola (western white pine), P. lambertiana (sugar pine), P. flexilis James (limber pine), P. albicaulis Engelm. (whitebark pine), and foxtail pine, along with the closely related bristlecone pines (subgenus Pinus subsection Balfourianae), since its introduction from Asia or Europe a century ago. Damage from C. ribicola is known to reduce reproduction and survival of the majority of white pine species (Kinloch 1970; Waring and Goodrich 2012). Exacerbated by recent outbreaks of the mountain pine beetle, decreasing pine populations have affected wildlife, biodiversity, watershed, and timber potential. Rare individuals among the white pine species exhibit innate and heritable resistance that forms the basis for various selective reforestation efforts (Kinloch 2003). A major “gene” of resistance (MGR) to WPBR was mapped in P. lambertiana over 40 years ago (Kinloch 1970). An apparently biallelic locus, Cr1R/Cr1r, has been mapped in several P. lambertiana families (Devey et al. 1995; Harkins et al. 1998; Jermstad et al. 2011). In this work, we leverage these markers and the assembled P. lambertiana genome to identify large genomic scaffolds tightly linked to Cr1 and SNPs in strong association with Cr1R. We discuss possible Cr1 candidates among the annotated genes.
Sequencing and assembly
The sequencing and assembly approach used here for P. lambertiana is an adaptation of the approach successfully used for P. taeda (Neale et al. 2014; Zimin et al. 2014). We have found that the haploid DNA obtainable from a single megagametophyte from the target genotype is sufficient to form the basis of a high quality whole genome shotgun assembly. For additional contiguity, haploid megagametophyte coverage is supplemented with longer linking mate pair libraries using DNA isolated from abundantly available diploid needle tissue of the maternal parent. For additional contiguity of the gene space, we performed transcriptome-based scaffolding using deep coverage RNA-Seq data. The nearly 50% larger size of the P. lambertiana genome required changes to the previous software methods to make assembly tractable. The resulting draft genome sequence described here has an N50 scaffold size of 246.6 kbp and a total estimated genome size of 31 Gbp, making it the largest genome sequenced and assembled to date.
Materials and Methods
Plant material
Our target tree for reference genome sequencing was P. lambertiana genotype 5038 in the collection of the United States Department of Agriculture (USDA) Forest Service, which is in the public domain. Haploid megagametophyte tissue was sourced from wind-pollinated seeds from grafted ramets collected in 1994 from the USDA Forest Service Badger Hill site. P. lambertiana needle tissue was collected in August 2011 from a ramet at the same location.
DNA isolation
As described by Zimin et al. (2014), our sole source of haploid DNA was a single megagametophyte. Prior to DNA extraction, seeds were immersed in water for 4 days, after which individual haploid megagametophytes were dissected from each seed. DNA was subsequently extracted from dissected megagametophytes as described in Zimin et al. (2014). For diploid DNA, large-scale extractions were prepared from P. lambertiana needles. For long insert mate pair libraries, nuclei were isolated and DNA was extracted and quantified at University of California (UC) Davis using the methods previously reported (Zimin et al. 2014). The resulting DNA was treated with 0.33 μl PreCR Repair Mix (New England Biolabs) per microgram DNA prior to use in library construction. DNA for the P. lambertiana fosmid pools was isolated and quantified at CHORI using the slightly modified method previously reported (Zimin et al. 2014). Further details can be found in the Supplemental Material, File S1 (Supplementary Methods).
Error correction
Paired end reads were error corrected using QuorUM (Marçais et al. 2013), as packaged in the MaSuRCA 2.3.0 assembly pipeline (Zimin et al. 2013). Only k-mers from the haploid sequences were used in constructing the error correction database. Detailed error correction results are given in Tables S1–S3 in File S1.
Super-read construction
Error-corrected data were used to construct super-reads (Zimin et al. 2013), which are longer, nonredundant, and overall much more compact than the original read data. For the P. lambertiana paired end sequence data, the super-reads procedure reduced the 6.36 billion error-corrected read pairs to 148 million super reads (Figure 2). The average length of the super-reads was 502 bp with a total length of 75 Gbp. By comparison, the average super-read length for P. taeda was 362 bp (Zimin et al. 2014).
Mate pair cleaning and filtering
Mate pairs from diploid libraries were cleaned and filtered as follows. (1) Mate pair sequences were error corrected by QuorUM, using a k-mer database from the haploid data. This step had the secondary effect of enriching for our target haplotype. (2) Nonjunction fragments, “short innies,” were detected and removed using a procedure that attempted to connect pairs by k-mer extensions (again using k-mers from the haploid data) off the “wrong” ends. (3) Reads <100 bp were extended via unique k-mers to a length of 64–100 bp. If both reads in a pair could not be extended to at least 64 bp, the pair was discarded.
Initial assembly
The preprocessed reads from both the haploid and diploid libraries were then assembled with SOAPdenovo2 (Luo et al. 2012) using a k-mer size of 99. Paired end libraries (Table S2 in File S1) were divided into three progressively less reliable fragment sizes: <200, 200–400, and >400 bp. Mate-pair libraries (Table S4 in File S1) were divided into two groups: <10 and >10 kbp.
Gap closing
To increase contiguity, gap closing was performed on the output of the SOAPdenovo assembler using the MaSuRCA gap closer, plus the super-read sequences to “patch” gaps in the SOAP assembly.
Transcriptome scaffolding
Additional scaffolding steps used a set of transcript sequences assembled from Pacific Biosciences (PacBio) and Illumina RNA-seq data (Table S7) from Gonzalez-Ibeas et al. (2016). We aligned transcript sequences to the whole genome shotgun (WGS) scaffolds using both nucmer (-maxmatch -nosimplify -l 45 -c 4) (Kurtz et al. 2004) and bwa-mem (-k 45 -O 60 -E 10) (Li 2013). We then merged alignments that were adjacent on both the transcript and the corresponding scaffold. For pairs of scaffolds that were aligned adjacently to the same transcript, we subsequently created a link. We sorted the links in descending order according to intron size. Next, we built a graph by visiting links in order. Each link corresponds to a potential edge in the graph between vertices corresponding to scaffolds. We added a link/edge to the graph if it did not create a cycle or a vertex degree >2. Upon completion, the graph consisted only of paths, which we converted to superscaffolds that contained one or more of the original assembly scaffolds.
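The link-selection step described above can be sketched as a greedy path builder. This is a minimal illustration, not the pipeline's actual code: the function name `build_paths`, the `(scaffold_a, scaffold_b, intron_size)` tuple layout, and the use of a union-find structure for cycle detection are all assumptions made for the example, and scaffold orientation is ignored.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical link record: two scaffold IDs plus the implied intron size.
Link = Tuple[Hashable, Hashable, int]


def build_paths(links: Iterable[Link], descending: bool = True) -> List[List[Hashable]]:
    """Visit links in sorted order; keep a link unless it would create
    a cycle or raise any vertex degree above 2. Return the resulting paths."""
    parent: Dict[Hashable, Hashable] = {}
    neighbors: Dict[Hashable, List[Hashable]] = {}

    def find(x: Hashable) -> Hashable:
        # Union-find root lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _size in sorted(links, key=lambda link: link[2], reverse=descending):
        if a == b:
            continue  # a self-link carries no scaffolding information
        na = neighbors.setdefault(a, [])
        nb = neighbors.setdefault(b, [])
        if len(na) >= 2 or len(nb) >= 2:
            continue  # would push a vertex degree above 2
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # both ends already on the same path: would close a cycle
        parent[root_a] = root_b
        na.append(b)
        nb.append(a)

    # Every connected component is now a simple path; walk each from an endpoint.
    paths: List[List[Hashable]] = []
    seen = set()
    for start in neighbors:
        if start in seen or len(neighbors[start]) != 1:
            continue
        path, prev, cur = [], None, start
        while cur is not None:
            path.append(cur)
            seen.add(cur)
            nxt = [n for n in neighbors[cur] if n != prev]
            prev, cur = cur, (nxt[0] if nxt else None)
        paths.append(path)
    return paths


example_links = [
    ("s1", "s2", 500), ("s2", "s3", 400), ("s3", "s1", 300),
    ("s2", "s4", 200), ("s5", "s6", 100),
]
superscaffolds = build_paths(example_links)
```

In this toy input, the `s3`-`s1` link is rejected because it would close a cycle and the `s2`-`s4` link because `s2` already has two neighbors, leaving two paths that would each become one superscaffold.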
Transcriptome assembly
Thirty-one tissue-specific samples, including needle, root, stem, pollen, cone, strobili, and embryonic tissues, were used for the construction of cDNA libraries. A variety of treatments were applied to seedlings before RNA extraction, including: cold/heat shock, flood/drought stress, wounding, and salicylic or methyl jasmonate exposure. RNA sequencing was done using Illumina MiSeq and HiSeq to generate short (100–300 bp) reads, and PacBio Iso-Seq reads, which range from 1000 to over 6000 bp. Seven MiSeq libraries, nine HiSeq libraries, and 18 PacBio libraries were created, and a total of 40 SMRT cells (1–4 SMRT cells per library) were sequenced. Quality trimmed reads were used for assembly with Trinity (Haas et al. 2013), and protein coding sequences (CDS) were identified with TransDecoder (Haas et al. 2013). All CDSs were clustered at 95% sequence identity with Uclust (Edgar 2010) (usearch v8.1.1861) to generate a nonredundant set of transcripts.
Identification of genomic scaffolds and mapping in the Cr1 region
Jermstad et al. (2011) reported the sequences from cloned RAPD bands OP_G16 and BC_432 that were linked to Cr1. To identify these genomic loci, the representative consensus sequence for each RAPD band was aligned to the P. lambertiana genome assembly using gmap (Wu and Watanabe 2005). In both cases, a unique top hit (path1) was observed identifying target scaffolds, which we used to develop new markers.
Target scaffolds were masked for annotated simple and interspersed repeats (see Supplementary Methods in File S1). We designed pairs of nested PCR primers using PRIMER3 (Rosen and Skaletsky 1999) for unique regions in these two target scaffolds. All of the PCR assays used standard PCR reaction conditions: 2.0 mM MgCl2, 0.2 mM each of dNTPs, 0.5 mM each of forward and reverse primers, 1 U of Taq polymerase, and 50 ng of DNA. For validation purposes, we used the available primer sequences of PCR amplicon, UMN_3258_01 (http://treegenesdb.org/ftp/CRSP/) to develop a new marker, cr1lC.
Gene annotation
Annotations were generated using the automated genome annotation pipeline MAKER-P (Campbell et al. 2014). Inputs and training sets for MAKER-P included the P. lambertiana genome assembly, a P. lambertiana transcriptome assembly (see Supplementary Methods in File S1), ESTs from spruce and pine (1,027,297 downloaded from GenBank), protein sequence data from Vitis vinifera L. (25,665), Amborella trichopoda Baill. (25,354), Populus trichocarpa Torr. and A.Gray ex Hook (38,655), Picea abies (22,721), Picea sitchensis (Bong.) Carrière (17,841), Pinus taeda (34,059), and RNA-seq data from P. lambertiana. Default MAKER-P mapping parameters were used (80% coverage and 85% identity threshold for EST-genome alignments, and 50% coverage and 40% identity for protein-genome alignments). More details can be found in the Supplementary Methods in File S1.
Interspersed repeat annotation
To find interspersed repeat elements, we used both similarity and de novo based approaches (Figure S3 in File S1). RepeatModeler combines two complementary de novo repeat element prediction algorithms: RECON (Bao et al. 2002) and RepeatScout (Price et al. 2005). To make the RepeatModeler computation tractable, we used only the Illumina sequenced fosmid pools (above), along with the longest 2.5% of genomic scaffolds. We also used a combination of TEclass (Abrusán et al. 2009), CENSOR (Kohany et al. 2006), and manual characterization to identify the uncharacterized elements from the repeat library produced by RepeatModeler. We used this library, along with the plant Repbase library (Jurka et al. 2005) (plant component only, v19.01) as the reference database for RepeatMasker (Tarailo-Graovac et al. 2009). Full-length elements were determined by applying a cut-off of 80-80-80 (80% sequence similarity over at least 80% of the element length, and 80 bp minimum length) (Wicker et al. 2007).
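As an illustration of the 80-80-80 cut-off as it is commonly stated (Wicker et al. 2007), the filter can be written as a three-part test. This is a sketch only: the function name, its arguments, and the way coverage is measured against the reference element length are assumptions, and the text does not specify how the rule was applied to the RepeatMasker output.

```python
def passes_80_80_80(identity_pct: float, aligned_len: int, ref_len: int) -> bool:
    """Return True if a hit satisfies all three 80-80-80 criteria.

    identity_pct: percent identity of the alignment (0-100)
    aligned_len:  aligned length in bp
    ref_len:      length in bp of the reference element (assumed denominator)
    """
    if ref_len <= 0:
        return False
    coverage_pct = 100.0 * aligned_len / ref_len
    return identity_pct >= 80.0 and coverage_pct >= 80.0 and aligned_len >= 80
```

For example, a 4,500 bp alignment at 85% identity against a 5,000 bp reference element would pass, while the same alignment at 75% identity would not.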
Data availability
The P. lambertiana assembly and annotation are available from GenBank as accession GCA_001447015.2 and BioProject 174450, and also from http://www.pinegenome.org/pinerefseq. Genomic DNA and RNA reads are also available under BioProject 174450.
Results
Sequencing
Our sequencing strategy for conifer genomes has taken advantage of the haploid tissue of the conifer megagametophyte (Neale et al. 2014; Zimin et al. 2014). Fortunately, the observed correlation over the evolutionary diversity of gross seed weight with genome size (Wakamiya et al. 1993; Grotkopp et al. 2004) in the genus Pinus worked to our advantage. Our collection of P. lambertiana megagametophytes had an average weight of 225 mg compared to only 23.5 mg for P. taeda. This translated into substantially larger yields of haploid genomic DNA from single seeds. From our target P. lambertiana megagametophyte, we were able to obtain 36.2 μg of DNA, from which we generated 1.91 trillion base pairs of sequence (Figure 2 and Table 1), representing ∼62× coverage of the 31 Gbp haploid genome.
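The coverage figure is the ratio of sequenced bases to haploid genome size. A minimal sketch of that arithmetic, using the rounded values quoted in the text (the function name is illustrative):

```python
def fold_coverage(total_bases: float, genome_size_bp: float) -> float:
    """Sequencing depth expressed as multiples of the haploid genome."""
    return total_bases / genome_size_bp


# Rounded inputs from the text: 1.91 trillion bp of sequence, 31 Gbp genome.
paired_end_coverage = fold_coverage(1.91e12, 31e9)
print(f"{paired_end_coverage:.1f}x")
```

Because the inputs are rounded, the result can differ in the last digit from the per-library totals reported in Table 1.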
Table 1. Characteristics of the P. lambertiana sequence data and 1.0 assembly, compared to known cytometric and cytological properties.
Cytometric Genome Size | 31 Gbp |
---|---|
Chromosome number | 12 |
Assembly V1.0 | |
Total size | |
Scaffolds ≥ 200 bp | 4,259,911 scaffolds |
27.6 Gbp including gaps | |
25.5 Gbp without gaps | |
Scaffolds ≥ 500 bp | 1,089,992 scaffolds |
26.9 Gbp including gaps | |
24.7 Gbp without gaps | |
54,147,744 contigs | |
Contigs < 200 bp (“chaff”) | 6.5 Gbp |
N50 scaffold size (31 Gb) | 246.6 kbp |
N50 contig size (31 Gb) | 4.25 kbp |
Sequence data | |
Number of paired-end libraries | 56 |
Paired end sequencing depth | 1,910 Gbp (61.5×) |
By platform | |
HiSeq 2000 (125 bp + 125 bp) | 2.8 × 10¹¹ bp (9.0×) |
HiSeq 2500 (150 bp + 150 bp) | 1.4 × 10¹² bp (45.1×) |
GAIIx (160 bp + 156 bp) | 1.8 × 10¹¹ bp (5.8×) |
MiSeq (255 bp + 255 bp) | 4.7 × 10¹⁰ bp (1.5×) |
By fragment size | |
[200 bp, 400 bp] | 9.6 × 10¹¹ bp (31.0×) |
[400 bp, 600 bp] | 4.6 × 10¹¹ bp (15.0×) |
[600 bp, 900 bp] | 4.8 × 10¹¹ bp (15.6×) |
Long fragment libraries (1.5–25 kbp) | 34 |
Long fragment coverage | |
Illumina TruSeq | 22.5× physical coverage |
Nextera mate pair | 71.2× physical coverage |
N50 statistics were calculated using an estimated genome size of 31 Gbp. Paired end sequencing depth represents the raw output prior to error correction. Physical coverage estimated by MaSuRCA (including the inferred DNA fragment) is reported here for all libraries by chemistry (see Supplementary Methods in File S1).
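Because the N50 values in Table 1 are computed against the estimated genome size rather than the assembly total, they are smaller than a conventional N50 would be for the same scaffolds. The sketch below shows the difference; the function name and the `genome_size` argument are illustrative, and the toy lengths are not taken from the assembly.

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], genome_size: Optional[int] = None) -> int:
    """Length L such that sequences of length >= L cover half the target.

    The target is half of genome_size when given, otherwise half of the
    summed lengths. Returns 0 if the sequences never reach the target.
    """
    ordered = sorted(lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(ordered)
    target = total / 2
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0


toy_scaffolds = [10, 8, 6, 4, 2]              # sums to 30
plain_n50 = n50(toy_scaffolds)                # target is 15
genome_n50 = n50(toy_scaffolds, genome_size=40)  # target is 20
```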
Estimating genome size
We analyzed the k-mer distribution of the paired reads to derive an independent estimate of the haploid size of the genome for coverage estimates. Using the jellyfish program (Marçais and Kingsford 2011), we computed distributions of k-mer depth for k = 24 and k = 31 for all the paired sequences derived from our megagametophyte. We estimated genome size from the k-mer distribution as described previously (Zimin et al. 2014), using both the mean and the mode of the distributions for k = 24 and k = 31. As shown in Table 2, all four estimates of the genome size are in close agreement, ranging from 30.9 to 31.9 Gbp.
Table 2. Estimates of the genome size of P. lambertiana based on the distribution of k-mers in the paired read data.
 | k = 24 | k = 31 |
---|---|---|
Total k-mers | 1.56 × 10¹² | 1.47 × 10¹² |
Erroneous k-mers | 1.20 × 10¹⁰ | 2.20 × 10¹⁰ |
Total correct k-mers | 1.55 × 10¹² | 1.45 × 10¹² |
E(unique k-mer depth) mode | 49.72 | 46.77 |
Estimated genome size | 31.1 Gbp | 30.9 Gbp |
E(unique k-mer depth) mean | 48.53 | 46.02 |
Estimated genome size | 31.9 Gbp | 31.4 Gbp |
Erroneous k-mers refer to k-mers that were identified as likely to contain errors, and these were removed from the calculation.
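The estimates in Table 2 are consistent with a simple relation: the number of correct k-mers divided by the expected depth of unique k-mers. The sketch below applies that relation to the rounded table values. It is an assumption that this matches the exact procedure of Zimin et al. (2014), and the function name is illustrative.

```python
def genome_size_from_kmers(total_kmers: float,
                           erroneous_kmers: float,
                           unique_kmer_depth: float) -> float:
    """Estimate haploid genome size (bp) as correct k-mers / unique k-mer depth."""
    return (total_kmers - erroneous_kmers) / unique_kmer_depth


# Rounded inputs from Table 2: (total, erroneous, depth mode, depth mean).
table2 = {
    24: (1.56e12, 1.20e10, 49.72, 48.53),
    31: (1.47e12, 2.20e10, 46.77, 46.02),
}

estimates_gbp = {}
for k, (total, bad, mode_depth, mean_depth) in table2.items():
    estimates_gbp[(k, "mode")] = genome_size_from_kmers(total, bad, mode_depth) / 1e9
    estimates_gbp[(k, "mean")] = genome_size_from_kmers(total, bad, mean_depth) / 1e9
```

Since the table inputs are rounded to three significant figures, the recomputed values agree with the reported estimates only to within about 0.1 Gbp.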
Our haploid library based estimates were in the range of previous experimental estimates in the literature. The Gymnosperm DNA C-values Database release 6.0 (Bennett and Leitch 2001) contains three flow cytometry-based estimates for the genome size of P. lambertiana: 33.4 Gbp (Grotkopp et al. 2004); 31.1 Gbp (Williams et al. 2002); and 29.4 Gbp (Wakamiya et al. 1993). The authors of the 33.4 Gbp estimate noted that their genome size estimates of various species were consistently higher than values already in the literature. The mean of these experimental estimates, 31 Gbp, is in close agreement with our sequence-based estimates, and therefore we chose this value as the estimated total size of the genome.
Assembly
Super-reads (Zimin et al. 2013) played a fundamental role in the assembly of P. lambertiana, where they allowed us to dramatically reduce the size of the input to subsequent assembly steps (Figure 2). Nevertheless the CABOG assembler (Miller et al. 2008) used for the 22 Gbp genome of P. taeda could not process the larger P. lambertiana genome, so we instead used the de Bruijn graph-based SOAPdenovo 2 assembler (Luo et al. 2012) for initial contig and scaffold construction. Following this step, we reassembled the contigs, with SOAPdenovo 2, adding the 93× coverage from long-fragment libraries, yielding scaffolds with an N50 size of 196 kbp. We then ran a separate gap-closing procedure to reduce the number of intrascaffold gaps, which closed 12.6 million out of 26.2 million gaps in the assembly. This reduced the total gap length by ∼780 Mbp, and increased the N50 contig size to 3.4 kbp.
Finally, we used transcript sequences to improve contiguity in the vicinity of genes. We aligned a set of 17,167 assembled transcripts (see Materials and Methods) to the scaffolds. We joined scaffolds together if the links created were consistent with a colinear transcript alignment. In total, 32,619 scaffolds were merged during this step. The resulting assembly (version 1.0) has an N50 scaffold length of 246.6 kbp. The combined length of the assembly, including all scaffolds and contigs >200 bp, is 27.6 Gbp (Table 1). The assembly contains another 6.48 Gbp in contigs, and scaffolds ≤200 bp that were not considered for most analyses.
Validation
As an independent assessment of assembly quality, we sequenced four pools of 48 fosmids each using the PacBio RS II platform (see Supplementary Methods in File S1). We collected deep coverage (>250×) of each pool. The vector-trimmed HGAP3-assembled pools are reported in Table 3. Most of the assembled contigs appeared to span the full length of a fosmid, ∼40 kbp (Table 3, and Table S6 in File S1). Overall, the PacBio fosmid assemblies were 98.8% identical to the WGS assembly, which covered >95% of their total length. Because the haploid fosmids were constructed from diploid needle tissue, at most half were expected to match exactly. Thus, the 1.2% divergence represents an upper bound on alignment and assembly errors, or, alternatively, half the heterozygosity rate.
Table 3. Assemblies of the four fosmid pools sequenced with PacBio technology.
Pool | Contigs | Minimum (bp) | Mean (bp) | Maximum (bp) | Total length (bp) | N50 (bp) |
---|---|---|---|---|---|---|
SPPB1 | 61 | 979 | 31983 | 45177 | 1950994 | 34685 |
SPPB2 | 54 | 586 | 33949 | 44946 | 1833274 | 35595 |
SPPB3 | 58 | 586 | 29525 | 43039 | 1712462 | 35375 |
SPPB4 | 73 | 551 | 27960 | 43934 | 2041131 | 35324 |
Each pool contained 48 fosmids.
As a measure of the correctness of the WGS assembly, we looked for large insertions, deletions, or rearrangements between the PacBio and WGS assemblies. The comparison yielded only one noncolinear alignment, and one WGS scaffold with a large 7.6 kbp deletion, for which we could not rule out haplotype differences. A second scaffold with a 5.3 kbp deletion was clearly a heterozygous insertion of an LTR element in the assembled fosmid.
For further validation, we examined the alignment of the WGS scaffolds just prior to transcriptome scaffolding to our collection of 12,533 PacBio and 4634 Illumina assembled transcripts; >99% of these alignments were consistent. When examining the 1% that were not colinear, we found that these were dominated by Illumina-based transcripts, leading to the conclusion that most of these represented errors in the transcript assembly rather than the WGS assembly.
Gene content
Annotation yielded 13,936 high-quality gene models and 71,117 low-quality models, the presence of direct RNA evidence being the primary distinction between the two classes (Supplementary Methods in File S1). A total of 11,769 scaffolds were annotated with at least one high-quality gene model, ranging from one to eight models per scaffold (1.2 models/scaffold on average). Only 33 scaffolds were annotated with five or more models. Completeness of the gene space evaluated with BUSCO (Simão et al. 2015) was 53% when using the high-quality models, and 58% when the low-quality models were included. Alternatively, DOGMA (Dohmen et al. 2016) estimated a coverage of 94% for its Conserved Domain Arrangements. For comparison, when run on the complete set of P. taeda gene models, BUSCO estimated 50% completeness and DOGMA estimated 61% (Table S8 in File S1).
In total, 11,595 of the 13,936 gene models were functionally annotated with a characterized plant protein sequence. A total of 2041 were classified as uninformative (protein alignment with no functional assignment), and 300 showed no homology to characterized proteins. As expected, Vitis vinifera, Arabidopsis thaliana (L.) Heynh., and Ricinus communis L. were the species that contributed the most to the functional annotations. The largest P. lambertiana intron, at 578 kbp, is the second largest (after one in P. taeda) found in a plant genome to date (Table 4), although the draft state of the genome means that larger introns are highly likely to be scattered among multiple scaffolds.
Table 4. Comparison of gene metrics among sequenced conifer genomes and select angiosperms.
 | Pinus taeda | Picea abies | Pinus lambertiana | Picea glauca | Arabidopsis thaliana | Populus trichocarpa | Vitis vinifera | Amborella trichopoda |
---|---|---|---|---|---|---|---|---|
Genome size (Mbp) | 20,148 | 19,600 | 31,000 | 20,000 | 135 | 423 | 487 | 706 |
Chromosomes | 12 | 12 | 12 | 12 | 5 | 19 | 19 | 13 |
G+C content (%) | 38.2 | 37.9 | 35.1 | 31.1 | 35 | 33.3 | 36.2 | 35.5 |
TE content (%) | 74 | 70 | 79 | N/A | 15.3 | 42 | 41.4 | N/A |
Number of genes | 9,024 | 26,359a | 13,936 | 14,462 | 27,160 | 36,393 | 25,663 | 25,347 |
Average CDS length (bp) | 1,562 | 931 | 1,330 | 1,421 | 1,102 | 1,143 | 1,095 | 969 |
Average intron length (bp) | 12,875 | 1,020 | 8,039 | 603 | 182 | 366 | 933 | 1,538 |
Maximum intron length (bp) | 891,919 | 68,269 | 578,081 | 119,319 | 10,234 | 4,698 | 38,166 | 175,748 |
a High-confidence genes from the Congenie project.
Transposable elements
TE sequences constitute 79% of the P. lambertiana genome, higher than the 74% found in P. taeda (see Supplementary Methods in File S1). Of these, 67% of the transposable sequences in P. lambertiana are LTR retrotransposons. The distribution of transposable element families is very similar in the two species (see Figure 3). The most substantial difference in repeat content observed between the genomes is a 35% greater proportion of Gypsy elements in P. lambertiana. The distributions of estimated insertion times among LTR retrotransposons are congruent with those reported for spruce in Nystedt et al. (2013) (Figure S5). The median LTR insertion time for P. lambertiana (16.0 MYA) is younger than that of P. taeda (17.4 MYA). As a class, P. lambertiana Gypsy elements are significantly younger (14.5 MYA; P < 1.5 × 10⁻¹²), consistent with their increased numbers and a lineage-specific expansion. These observations are consistent with the hypothesis that TEs make up the bulk of the enlarged genomes of subgenus Strobus, with much of the expansion in P. lambertiana attributable to Gypsy.
The similarity in TEs among the sequenced conifer genomes supports the hypothesis that conifers have experienced massive expansion of TEs throughout their history (Neale et al. 2014), likely including the period prior to the radiation of Pinus, yielding their large and varied sizes. The bulk of TE sequences are ancient and diverged. Consistent with this, we observed that partial elements are far more abundant than full-length sequences in P. lambertiana, representing 67.3% of the genome, and 87% of the total repetitive content. And while the vast majority of LTRs were ancient and inactive, we did find evidence of recent transposition in the form of a recently inserted heterozygous TE. We observed a complete heterozygous insertion of a PARTC element in a genomic segment captured in an assembled fosmid clone. Heterozygosity is inferred from the insertion of the element, and the presence of a target site duplication in the alignment to the alternate haplotype (Figure S6). Previous analysis of the many copies of the PARTC subfamily suggested that it was dead (Zuccolo et al. 2015). However, this copy has identical LTR sequences, and apparently functional proteins.
Identification of genomic scaffolds and mapping in the Cr1 region
Using the draft WGS assembly, we succeeded in anchoring the cloned RAPD sequences, scarOPG16_950 and scarBC432_1110, which had previously been mapped near the Cr1 locus (Jermstad et al. 2011), to two distinct scaffolds (Table S13). No longer limited to designing PCR primers within those cloned sequences, we utilized the entire repeat masked scaffolds as a resource, and were able to identify many clear single nucleotide polymorphisms (SNPs) in each flanking amplicon, including that adjacent to scarBC432_1110, which had previously yielded no scorable SNPs (Jermstad et al. 2011).
PCR primers were designed to amplify two small genomic loci, one in scaffold 223,058 and the other in scaffold 370,413 (Table S10). The amplicons of successful primer pairs were sequenced and tested for segregation in a small sample of both Cr1R and Cr1r segregant megagametophytes from maternal tree 5701 (Cr1R/Cr1r), for which the rescued embryos were genotyped for Cr1. Note that the pollen parent of the rescued embryos was assumed to be Cr1r/Cr1r because Cr1R is assumed to be rare. Alternative haplotype sequences that segregated were found for both amplicons (see Supplementary Methods in File S1 for the FASTA sequence), and these appeared to be linked to one another and to the Cr1 locus.
A large sample of megagametophytes was efficiently genotyped using Cleaved Amplified Polymorphic Sequence (CAPS) assays (Konieczny and Ausubel 1993). We developed two new CAPS markers, cr1lA and cr1lB, based on the sequence variation in these two amplicons physically linked via the assembly to the previously reported RAPD markers (see Table S13 in File S1). Genotyping of cr1lA and cr1lB on a sample of 245 megagametophytes from maternal tree 5701 yielded two apparent single crossovers between markers cr1lA and cr1lB (both cr1lAR - cr1lBr), and 225 nonrecombinants (Table S12 in File S1). We were not able to confirm the Harkins et al. (1998) gene order BC_432_1110 – Cr1 – OPG_16_950. For the RAPD markers BC_432_1110 and OPG_16_950, Harkins et al. (1998) reported recombination fractions of 3%, and genetic map distances of 1.2 cM between both markers and Cr1 for maternal tree 5701. For our data, Harkins et al. (1998) gene order results in 12 putative double recombinants, which can alternatively be interpreted as Cr1 genotyping error.
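As a worked illustration of the arithmetic behind such counts, the sketch below converts the two crossovers observed among 245 megagametophytes into a recombination fraction and an approximate map distance. The Haldane and Kosambi mapping functions are standard textbook formulas; they are used here as an assumption, since the text does not state which mapping function was applied.

```python
import math

def recombination_fraction(recombinants: int, total: int) -> float:
    """Fraction of haploid megagametophytes that are recombinant between two markers."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("counts must satisfy 0 <= recombinants <= total, total > 0")
    return recombinants / total

def haldane_cm(r: float) -> float:
    """Haldane map distance in cM (assumes no crossover interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)

def kosambi_cm(r: float) -> float:
    """Kosambi map distance in cM (allows for crossover interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

# Counts taken from the text: two cr1lA - cr1lB crossovers in 245 megagametophytes.
r = recombination_fraction(2, 245)
distance_haldane = haldane_cm(r)
distance_kosambi = kosambi_cm(r)
```

At recombination fractions this small the two mapping functions agree closely, and both give a distance a little under 1 cM between cr1lA and cr1lB.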
Our two crossovers between cr1lA and cr1lB indicate that Cr1 is closer to cr1lB. To validate this result, and place it in a slightly broader mapping context, we added a new marker cr1lC to the genetic map, derived from a previously characterized PCR amplicon (Jermstad et al. 2011; UMN_3258_01; http://treegenesdb.org/ftp/CRSP/). Genotyping with SNPs derived from this amplicon placed it closest to scarOPG_16_950 on the side away from Cr1 in the Jermstad et al. (2011) map for maternal tree 5701 at a distance of ∼12 cM. To further refine marker order, we genotyped cr1lC in the two cr1lA - cr1lB recombinant megagametophytes from 5701, in three randomly selected Cr1r nonrecombinants, and in four randomly selected Cr1R nonrecombinants. Two distinct cr1lC haplotypes were determined among these progeny. None were recombinant between cr1lA and cr1lC, thus placing Cr1 outside of these loci (Figure 4, left), consistent with the gene order (cr1lC – cr1lA) – (Cr1 – cr1lB).
Increasing the Cr1 genomic region
To expand our annotated intervals linked to Cr1, we walked outward from the two marker-anchored scaffolds using physical linkage inferred from one or more aligned fosmid DiTag reads not included in the assembly. Using this approach, an additional gene-containing scaffold was physically linked to one of our anchored scaffolds by two fosmid DiTags (Figure 4).
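The linking step described above can be sketched as a simple tally: each fosmid DiTag whose two ends align to different scaffolds is one piece of evidence that those scaffolds are physically adjacent. This is a minimal illustration only; the input format, the scaffold names, and the `min_links` threshold are hypothetical and do not represent the pipeline used in the study.

```python
from collections import Counter

def linked_scaffolds(ditag_hits, min_links=1):
    """Count DiTag pairs whose two ends align to different scaffolds.

    ditag_hits: iterable of (scaffold_of_end_1, scaffold_of_end_2) tuples.
    Returns {(scaffold_x, scaffold_y): n_links} for pairs with n_links >= min_links.
    """
    links = Counter()
    for end_a, end_b in ditag_hits:
        if end_a != end_b:
            # Sort so that (x, y) and (y, x) count as the same scaffold pair.
            links[tuple(sorted((end_a, end_b)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_links}

# Hypothetical alignments; scaffold names are placeholders.
hits = [
    ("anchored_1", "neighbor_7"),
    ("neighbor_7", "anchored_1"),
    ("anchored_1", "anchored_1"),  # both ends on one scaffold: no new link
    ("anchored_2", "neighbor_9"),
]
result = linked_scaffolds(hits, min_links=2)
```

Requiring more than one supporting DiTag, as in the usage above, trades sensitivity for fewer spurious joins caused by repeats or chimeric clones.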
The genome assembly allowed a more targeted identification of potential gene candidates for Cr1. Figure 4 shows a total of 14 gene annotations on the two scaffolds genetically linked to Cr1, and a third scaffold that was physically linked by fosmid DiTags. Of the 14 linked genes, PILA_lg017786 stands out as a candidate because it contains both the NB-ARC and LRR domains that are common elements of disease-resistance genes. We looked for direct evidence of expression in transcriptome assemblies and found only one transcript (TR43508|c1_g1_i2|m.82078; see Supplementary Methods in File S1) assembled from a library constructed from a WPBR resistant tree. The transcript overlaps two exons of the candidate gene (red bar above the gene in Figure 4). The most similar known gene is in P. monticola (Western white pine), a TIR-NBS-LRR protein (GI:321530320). The closest well-annotated gene appears to be the disease resistance protein RGA2 in the grass Aegilops tauschii Coss. (GI:475615320).
Cr1 association
To both confirm the tight linkage of cr1lB to Cr1 and to provide a potential resource for marker-assisted selection, small, representative samples of Cr1-genotyped trees were genotyped by sequencing the amplicon from a single megagametophyte. A total of six Cr1R/Cr1R, 12 Cr1R/Cr1r, and 22 Cr1r/Cr1r genotyped sugar pine seed trees from the center of the species’ range were assayed (Table S15 in File S1). Genotyping of the diploid parent for Cr1 was done by the Forest Service at the El Dorado National Forest, Placerville Nursery, using their standard protocol of germinating, and scoring at least 56 exposed seed trees for WPBR resistance. These Cr1R/Cr1R and Cr1R/Cr1r trees were previously reported in Vangestel et al. (2016).
We selected one megagametophyte each from a maternal parent that had been genotyped for resistance. The cr1lB primers appeared to work well outside of their original context (maternal genotype 5701) and the haploid nature of the DNA afforded additional confirmation of the sequencing results. Evaluating only sequences associated with known Cr1 alleles (i.e., transmitted from Cr1r/Cr1r, Cr1R/Cr1R, or phenotyped progeny of 5701) we identified a five-site motif that predicted the Cr1 allele nearly completely (see Figure 5). All seven of our Cr1R associated haplotypes (six transmitted from Cr1R/Cr1R and the Cr1R linked haplotype of 5701) had the motif “TTACT.” Furthermore, 23 out of 24 of our Cr1r/Cr1r transmitted haplotypes had the alternate Cr1r linked motif “GCGGC.” The association is almost complete; the difference in the frequencies of the two haplotypes transmitted with known Cr1 genotypes is statistically significant, P < 10⁻⁵ by χ² with 2 d.f. Both motifs segregated in the progeny of one Cr1r/Cr1r parent. The observation of this single heterozygous tree is consistent with a low frequency of “recombinant” haplotypes. Still, the association of Cr1 with SNPs in the cr1lB amplicon on scaffold 370,314 is strong.
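The strength of this association can be illustrated with a small exact test on the haplotype counts given above. The sketch below uses a two-sided Fisher's exact test on a 2 × 2 table, which is a different test from the χ² with 2 d.f. reported in the text and is substituted here only because it is simple to compute for small counts. It also assumes that the one Cr1r-associated haplotype lacking “GCGGC” carried “TTACT.”

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more probable than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Rows: Cr1R-associated and Cr1r-associated haplotypes.
# Columns: motif TTACT, motif GCGGC.
# Assumption: the single non-GCGGC haplotype among the 24 carried TTACT.
p_value = fisher_exact_two_sided(7, 0, 1, 23)
```

Under these assumptions the exact test also gives a P value below 10⁻⁵, in line with the significance level stated in the text.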
Discussion
A key step in the sequencing strategy for P. lambertiana was the generation of deep sequencing coverage of the haploid genome. Even so, the unprecedented amount of data, two trillion bases, required an alternative strategy in order to assemble the genome in a reasonable time frame. The contiguity of the P. lambertiana assembly, as measured by the N50 scaffold size, is higher than that of previous conifer genome assemblies (Birol et al. 2013; Nystedt et al. 2013; Neale et al. 2014; Warren et al. 2015). A combination of factors, including deeper sequence coverage, more physical coverage from new linking mate pair library chemistries, and better computational methods, likely contributed to the advance. As in other conifers, a critical biological feature of the P. lambertiana genome that allows it to be assembled is the accumulated divergence among the ancient repeats that make up the majority of the genome. The increased contiguity of the P. lambertiana assembly suggests that the contiguity of conifer genome assemblies will continue to increase as scalable, long-range linking methods become available.
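For readers unfamiliar with the contiguity measure used above, the N50 scaffold size is the length L such that scaffolds of length L or greater contain at least half of the assembled bases. A minimal sketch of the calculation follows; the scaffold lengths are toy values, not figures from the assembly.

```python
def n50(lengths):
    """Largest length L such that scaffolds of length >= L hold at least half the total bases."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Toy scaffold lengths in bp (illustrative only).
toy_lengths = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_lengths)
```

Because N50 depends on the total assembled length, comparisons between assemblies are most meaningful when the genome sizes and minimum scaffold-length cutoffs are similar.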
The characterization of P. lambertiana transposable element sequences supports the hypothesis advanced by Nystedt et al. (2013) that an ancient accretion of mostly inactive TEs, at a rate faster than they are removed, explains the majority of the increased genome size observed in the Pinus subgenus Strobus. Given the huge genome sizes, the time scale involved, and the still sparse sampling of genome sequences of conifer species, recent TE dynamics (if such exist) are difficult to detect. Nevertheless, we made two observations relevant to the hypothesis. First, sequences of Gypsy families are more abundant in the P. lambertiana genome lineage, and this likely contributed to the increase in genome size. This hypothesis is supported by Gypsy families having increased fractions of repeat sequences with younger age. Second, we detected what appears to be an actively transposing PARTC element, based on its fully intact coding genome and its heterozygous insertion state. These observations are consistent with the simplest hypothesis that the many transposon families remain an active but small cohort, and that their sequences accumulate over millions of years because their replicative transposition rate exceeds their removal rate. So far, there is no evidence for any very recent huge expansion of specific families. We did detect the signature of recent duplication in the P. lambertiana genome in the k-mer distribution, perhaps evidence of nonhomologous crossover. However, such duplications were not abundant enough to explain the difference in genome sizes. While ancient genome duplication (Li et al. 2015) may also have played a role, the hypothesized event predates the radiation of Pinus.
The immense size and repetitive nature of the conifer genome, especially that of P. lambertiana, has been, and remains, a daunting barrier to genetic analyses, especially the investigation of pathogen resistance. This challenge, compounded by those inherent to the long generation time and by resource requirements, has meant that strenuous efforts have yielded only modest advances in understanding and in impact on the genetics of reforestation. This reference genome brings powerful new tools to genetic and genomic research in P. lambertiana. We sought to apply the new reference genome sequence to the characterization of the genetics of resistance to WPBR, building on the rich previous research, and indeed the availability of genomic samples from now classic efforts to genetically map a major disease resistance gene. Also (as discussed above), strong ecological and economic considerations motivate the pursuit of both new knowledge and effective practical tools that can be applied to forest management (Waring and Goodrich 2012). Large scaffolds in the assembly of P. lambertiana bearing short sequences previously linked to Cr1 (Harkins et al. 1998; Jermstad et al. 2011) were identified, validated as linked to Cr1, and annotated as containing a promising candidate gene. Of substantial immediate practical relevance is the strong association between SNPs anchored in one of these scaffolds and Cr1 in natural populations. Genotyping with such SNPs is a long-sought-after tool that will increase the efficiency of ongoing and future WPBR-resistant reforestation. The present expensive and time-consuming process of identifying candidate trees, collecting seed (during a narrow period), and waiting 2 years for infection bioassay results does ultimately identify trees heterozygous (or rarely homozygous) for Cr1R that can then be harvested for seeds to go into reforestation. But the efficiency is low, and the cost to identify a single such tree is thousands of dollars [see the estimated replacement costs in a 2013 supplement to a US Forest Service Handbook (page 5), available at http://www.fs.fed.us/im/directives/field/r5/fsh/2409.18/r5U2409U18U50U2013U1.doc]; furthermore, the supply is not always adequate or ecologically optimal. Ongoing efforts to develop these and other SNPs as practical tools for sugar pine forest management have great promise, and may lead the way to similar tools for other white pines.
Acknowledgments
We thank Carson Holt and Mark Yandell for their modifications to their MAKER-P pipeline to support conifer genomes. Funding for this project was provided through a United States Department of Agriculture/National Institute of Food and Agriculture (USDA/NIFA) (2011-67009-30030) award to D.B.N. at University of California, Davis.
Note added in proof: See Gonzalez-Ibeas et al. 2016 (pp. 3787–3802) in G3: Genes, Genomes, Genetics for a related work.
Footnotes
Communicating editor: S. C. Gonzalez-Martinez
Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.193227/-/DC1.
Literature Cited
- Abrusán G., Grundmann N., DeMester L., Makalowski W., 2009. TEclass—a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 25(10): 1329–1330.
- Ahuja M. R., Neale D. B., 2005. Evolution of genome size in conifers. Silvae Genet. 54(3): 126–137.
- American Forests, 2015 This Is It! The Quest for a New Champion Sugar Pine. Available at: http://www.americanforests.org/blog/quest-for-a-new-champion-sugar-pine/.
- Bao Z., Eddy S. R., 2002. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12(8): 1269–1276.
- Bennett, M. D., and I. J. Leitch, 2012 Plant DNA C-values database, release 6.0, Dec. 2012. Available at: http://data.kew.org/cvalues/.
- Birol I., Raymond A., Jackman S. D., Pleasance S., Coope R., et al., 2013. Assembling the 20 Gb white spruce (Picea glauca) genome from whole-genome shotgun sequencing data. Bioinformatics 29: 1492–1497.
- Campbell M. S., Law M., Holt C., Stein J. C., Moghe G. D., et al., 2014. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol. 164(2): 513–524.
- Critchfield, W. B. and E. L. Little, Jr. 1966. Geographic distribution of the pines of the world. USDA Forest Service Miscellaneous Publication 991. US Department of Agriculture, Washington DC.
- DeGiorgio M., Syring J., Eckert A. J., Liston A., Cronn R., et al., 2014. An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines. BMC Evol. Biol. 14(1): 67.
- Devey M. E., Delfino-Mix A., Kinloch B. B., Neale D. B., 1995. Random amplified polymorphic DNA markers tightly linked to a gene for resistance to white pine blister rust in P. lambertiana. Proc. Natl. Acad. Sci. USA 92(6): 2066–2070.
- Dohmen E., Kremer L. P., Bornberg-Bauer E., Kemena C., 2016. DOGMA: domain-based transcriptome and proteome quality assessment. Bioinformatics 32: 2577–2581.
- Eckert A. J., Wegrzyn J. L., Liechty J. D., Lee J. M., Cumbie W. P., et al., 2013a. The evolutionary genetics of the genes underlying phenotypic associations for loblolly pine (Pinus taeda, Pinaceae). Genetics 195: 1353–1372.
- Eckert A. J., Bower A. D., Jermstad K. D., Wegrzyn J. L., Knauss B. J., et al., 2013b. Multilocus analyses reveal little evidence for lineage wide adaptive evolution within major clades of soft pines (Pinus subgenus Strobus). Mol. Ecol. 22: 5635–5650.
- Edgar R. C., 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19): 2460–2461.
- Farjon A., Filer D., 2013. An Atlas of the World’s Conifers, Brill Publishing, Leiden, The Netherlands.
- Fattash I., Rooke R., Wong A., Hui C., Luu T., et al., 2013. Miniature inverted-repeat transposable elements: discovery, distribution, and activity. Genome 56(9): 475–486.
- Gernandt D. S., López G. G., García S. O., Liston A., 2005. Phylogeny and classification of Pinus. Taxon 54(1): 29–42.
- Gonzalez-Ibeas D., Martínez-García P. J., Famula R. A., Delfino-Mix A., Stevens K. A., et al., 2016. Assessing the gene content of the megagenome: sugar pine (Pinus lambertiana). G3 (Bethesda) 6: 3787–3802.
- Grotkopp E., Rejmánek M., Sanderson M. J., Rost T. L., 2004. Evolution of genome size in pines (Pinus) and its life-history correlates: supertree analysis. Evolution 58(8): 1705–1729.
- Harkins D. M., Skaggs P. A., Mix A. D., Dupper G. E., Devey M. E., et al., 1998. Saturation mapping of a major gene for resistance to white pine blister rust in P. lambertiana. Theor. Appl. Genet. 97(8): 1355–1360.
- Haas B. J., Papanicolaou A., Yassour M., Grabherr M., Blood P. D., et al., 2013. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Prot. 8(8): 1494–1512.
- Hawkins J. S., Kim H., Nason J. D., Wing R. A., Wendel J. F., 2006. Differential lineage-specific amplification of transposable elements is responsible for genome size variation in Gossypium. Genome Res. 16(10): 1252–1261.
- Jermstad K. D., Eckert A. J., Wegrzyn J. L., Delfino-Mix A., Davis D. A., et al., 2011. Comparative mapping in Pinus: P. lambertiana (Pinus lambertiana Dougl.) and P. taeda (Pinus taeda L.). Tree Genet. Genomes 7(3): 457–468.
- Keane, R. E., D. F. Tomback, C. A. Aubry, A. D. Bower, E. M. Campbell et al., 2012 A range-wide restoration strategy for whitebark pine (Pinus albicaulis). Gen. Tech. Rep. RMRS-GTR-279. Fort Collins, CO: U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station. p. 108.
- Kinloch B. B., Jr, 1992. Distribution and frequency of a gene for resistance to white pine blister rust in natural populations of P. lambertiana. Can. J. Bot. 70(7): 1319–1323.
- Kinloch B. B., Jr, 2003. White pine blister rust in North America: past and prognosis. Phytopathology 93(8): 1044–1047.
- Kinloch B. B., Jr, Scheuner W. H., 1990. Pinus lambertiana Dougl., P. lambertiana. Agric. Handb 654: 370–378.
- Kinloch B. B., Jr, Parks G. K., Fowler C. W., 1970. White pine blister rust: simply inherited resistance in sugar pine. Science 167(3915): 193–195.
- Kohany O., Gentles A. J., Hankus L., Jurka J., 2006. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics 7(1): 474.
- Konieczny A., Ausubel F. M., 1993. A procedure for mapping Arabidopsis mutations using co-dominant ecotype-specific PCR-based markers. Plant J. 4(2): 403–410.
- Kurtz S., Phillippy A., Delcher A. L., Smoot M., Shumway M., et al., 2004. Versatile and open software for comparing large genomes. Genome Biol. 5(2): R12.
- Li Z., Baniaga A. E., Sessa E. B., Scascitelli M., Graham S. W., et al., 2015. Early genome duplications in conifers and other seed plants. Science Advances 1(10): e1501084.
- Li H., 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997.
- Liu J. J., Ekramoddoullah A. K., Hunt R. S., Zamani A., 2006. Identification and characterization of random amplified polymorphic DNA markers linked to a major gene (Cr2) for resistance to Cronartium ribicola in Pinus monticola. Phytopathology 96(4): 395–399.
- Luo R., Liu B., Xie Y., Li Z., Huang W., et al., 2012. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1(1): 18.
- Maloney P. E., Vogler D. R., Eckert A. J., Jensen C. E., Neale D. B., 2011. Population biology of sugar pine (Pinus lambertiana Dougl.) with reference to historical disturbances in the Lake Tahoe Basin: implications for restoration. Forest Ecology and Management 262: 770–779.
- Marçais, G., J. A. Yorke, and A. Zimin, 2013 QuorUM: an error corrector for Illumina reads. arXiv preprint arXiv:1307.3515.
- Millar C. I., 1998. Early evolution of pines, pp. 69–94 in Ecology and Biogeography of Pinus, edited by Richardson D. M. Cambridge University Press, Cambridge, UK.
- Morse A. M., Peterson D. G., Islam-Faridi M. N., Smith K. E., Magbanua Z., et al., 2009. Evolution of genome size and complexity in Pinus. PLoS One 4(2): e4332.
- Neale D. B., Wegrzyn J. L., Stevens K. A., Zimin A. V., Puiu D., et al., 2014. Decoding the massive genome of P. taeda using haploid DNA and novel assembly strategies. Genome Biol. 15(3): R59.
- Nystedt B., Street N. R., Wetterbom A., Zuccolo A., Lin Y.-C., et al., 2013. The Norway spruce genome sequence and conifer genome evolution. Nature 497(7451): 579–584.
- Orgel L. E., Crick F. H., 1980. Selfish DNA: the ultimate parasite. Nature 284(5757): 604.
- Piegu B., Guyot R., Picault N., Roulin A., Saniyal A., et al., 2006. Doubling genome size without polyploidization: dynamics of retrotransposition-driven genomic expansions in Oryza australiensis, a wild relative of rice. Genome Res. 16(10): 1262–1269.
- Richardson D. M., Rundel P. W., 1998. Ecology and biogeography of Pinus: an introduction, pp. 3–48 in Ecology and Biogeography of Pinus, edited by Richardson D. M. Cambridge University Press, Cambridge, UK.
- Rosen S., Skaletsky H. J., 1999. Primer3 on the WWW for general users and for biologist programmers, pp. 365–386 in Bioinformatics Methods and Protocols: Methods in Molecular Biology, edited by Krawetz S., Misener S. Humana Press, Totowa.
- Sax K., 1960. Meiosis in intraspecific pine hybrids. For. Sci. 6: 135–138.
- Saylor, L. C., 1961 A karyotypic analysis of selected species of Pinus. Master’s Thesis, North Carolina State University. Genetica 10: 77–84.
- Simão F. A., Waterhouse R. M., Ioannidis P., Kriventseva E. V., Zdobnov E. M., 2015. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31: 3210–3212.
- Tenaillon M. I., Hufford M. B., Gaut B. S., Ross-Ibarra J., 2011. Genome size and transposable element content as determined by high-throughput sequencing in maize and Zea luxurians. Genome Biol. Evol. 3: 219–229.
- Tomback D. F., 1982. Dispersal of whitebark pine seeds by Clark’s nutcracker: a mutualism hypothesis. J. Anim. Ecol. 51: 451–467.
- Van Pelt R., 2001. Forest giants of the Pacific coast. University of Washington Press, Seattle.
- Vangestel C., Vázquez-Lobo A., Martínez-García P. J., Calic I., Wegrzyn J. L., et al., 2016. Patterns of neutral and adaptive genetic diversity across the natural range of sugar pine (Pinus lambertiana Dougl.). Tree Genet. Genomes 12(3): 1–10.
- Wakamiya I., Newton R. J., Johnston J. S., Price H. J., 1993. Genome size and environmental factors in the genus Pinus. Am. J. Bot. 80: 1235–1241.
- Waring K. M., Goodrich B. A., 2012. Artificial regeneration of five-needle pines of western North America: a survey of current practices and future needs. Tree Planters Notes 55: 55–71.
- Warren R. L., Keeling C. I., Yuen M. M., Raymond A., Taylor G. A., et al., 2015. Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism. Plant J. 83(2): 189–212.
- Wegrzyn J. L., Lin B. Y., Zieve J. J., Dougherty W. M., Martínez-García P. J., et al., 2013. Insights into the P. taeda genome: characterization of BAC and fosmid sequences. PLoS One 8(9): e72439.
- Wegrzyn J. L., Liechty J. D., Stevens K. A., Wu L. S., Loopstra C. A., et al., 2014. Unique features of the P. taeda (Pinus taeda L.) megagenome revealed through sequence annotation. Genetics 196(3): 891–909.
- Williams C. G., Joyner K. L., Auckland L. D., Johnston S., Price H. J., 2002. Genomic consequences of interspecific Pinus spp. hybridization. Biol. J. Linn. Soc. Lond. 75(4): 503–508.
- Willyard A., Syring J., Gernandt D. S., Liston A., Cronn R., 2007. Fossil calibration of molecular divergence infers a moderate mutation rate and recent radiations for Pinus. Mol. Biol. Evol. 24(1): 90–101.
- Wu T. D., Watanabe C. K., 2005. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21(9): 1859–1875.
- Zimin A. V., Marçais G., Puiu D., Roberts M., Salzberg S. L., et al., 2013. The MaSuRCA genome assembler. Bioinformatics 29(21): 2669–2677.
- Zimin A., Stevens K. A., Crepeau M. W., Holtz-Morris A., Koriabine M., et al., 2014. Sequencing and assembly of the 22-Gbp P. taeda genome. Genetics 196(3): 875–890.
- Zuccolo A., Scofield D. G., De Paoli E., Morgante M., 2015. The Ty1-copia LTR retroelement family PARTC is highly conserved in conifers over 200MY of evolution. Gene 568(1): 89–99.
Data Availability Statement
The P. lambertiana assembly and annotation are available from GenBank as accession GCA_001447015.2 and BioProject 174450, and also from http://www.pinegenome.org/pinerefseq. Genomic DNA and RNA reads are also available under BioProject 174450.