Abstract
Cassava (Manihot esculenta Crantz) is a major staple crop in Africa, Asia, and South America, and its starchy roots provide nourishment for 800 million people worldwide. Although native to South America, cassava was brought to Africa 400–500 years ago and is now widely cultivated across sub-Saharan Africa, but it is subject to biotic and abiotic stresses. To assist in the rapid identification of markers for pathogen resistance and crop traits, and to accelerate breeding programs, we generated a framework map for M. esculenta Crantz from reduced representation sequencing [genotyping-by-sequencing (GBS)]. The composite 2412-cM map integrates 10 biparental maps (comprising 3480 meioses) and organizes 22,403 genetic markers on 18 chromosomes, in agreement with the observed karyotype. We used the map to anchor 71.9% of the draft genome assembly and 90.7% of the predicted protein-coding genes. The chromosome-anchored genome sequence will be useful for breeding improvement by assisting in the rapid identification of markers linked to important traits, and in providing a framework for genomic selection-enhanced breeding of this important crop.
Keywords: genotyping-by-sequencing, SNP, composite genetic map, pseudomolecules, F1 cross
Cassava (Manihot esculenta Crantz) is cultivated as a staple in much of Africa, South America, and Asia because it is easy to grow with limited inputs (Howeler et al. 2013). Smallholder farmers typically grow cassava in plots of a hectare or less. Its starchy roots can be left in the ground until they are needed, making cassava an excellent food security crop. Because cassava can be clonally propagated, desirable varieties can be genetically fixed immediately and multiplied for distribution. The crop is relatively drought-tolerant, and therefore likely robust to climate change. Cassava is also grown for industrial starch production and biofuel applications, particularly in Southeast Asia (Howeler et al. 2013).
Despite its advantages, cassava faces several biotic and abiotic challenges. Clonal propagation facilitates the rapid spread of bacterial and viral diseases. Furthermore, roots of most farmer-preferred varieties are nutrient-poor and deteriorate rapidly after harvest, preventing farmers from generating income from the sale of excess crop.
The use of modern genetic and genomic techniques, such as quantitative trait locus (QTL) mapping, genomic selection, genome-wide association studies (GWAS), and genetic engineering, can accelerate the pace of disease resistance locus identification and trait improvement. However, these require a high-quality genome assembly and a dense genetic map. A draft cassava genome assembly was generated and covers 532.5 Mb (69%) of the estimated 770 Mb cassava genome (Awoleye et al. 1994). This assembly captures half of the genome sequence in the 487 largest scaffolds, all longer than 258 kb, and 90% of the assembly is accounted for in 2654 scaffolds all longer than 23 kb; however, these are not linked to chromosomes (Prochnik et al. 2012). To date, a number of genetic maps have been generated for cassava using different marker systems: restriction fragment length polymorphism (RFLP), random amplified polymorphic DNA (RAPD), microsatellite, and isoenzyme (Fregene et al. 1997); simple sequence repeat (SSR) (Okogbenin et al. 2006); amplified fragment length polymorphism (AFLP) and SSR (Kunkeaw et al. 2010); expressed sequence tag (EST) and EST-SSR (Sraphet et al. 2011); SSR and EST-derived single nucleotide polymorphism (SNP) (Rabbi et al. 2012); and GBS-SNP (Rabbi et al. 2014a,b). These maps have low marker resolution and/or do not resolve a complete set of linkage groups (LGs) representative of the 2n = 36 karyotype of cassava (De Carvalho and Guerra 2002). Furthermore, the densest GBS-derived SNP map anchors only 313.3 Mb (58.7%) of the reference genome assembly (Rabbi et al. 2014b). Because there are many cassava breeding programs, working with varying accessions and trait(s) of interest, a broadly useful genetic map should include markers that segregate in diverse populations.
A single biparental cassava cross rarely yields enough progeny to make a dense map and, in any event, would only capture markers from a small sample of haplotypes segregating in the species. We therefore merged 10 maps derived from diverse parents to produce a composite genetic map. To generate such a composite map, we obtained one S1 and nine F1 populations (14 parents total) from African cassava breeding projects. Markers were generated via GBS (Elshire et al. 2011) and a map was constructed from each of the 10 crosses with JoinMap (Van Ooijen 2011). These maps were merged with LPmerge (Endelman and Plomion 2014) to generate a 2412-cM genetic map comprising 18 LGs, in agreement with the number of chromosomes found cytogenetically (De Carvalho and Guerra 2002). Furthermore, 71.9% of the genome assembly was anchored to the genetic map. The resulting chromosome-scale assembly will accelerate the application of a wide variety of modern tools for crop improvement.
Materials and Methods
Generation of mapping populations
Nine biparental (F1) crosses and one self-pollinated (S1) cross were performed (Table 1). Biparental populations NxA, KAR, NCAR, MT, NDLAR, and ARAL were generated from crossing blocks planted in four locations in Tanzania: Naliendele [10°23′ S, 40°09′ E, altitude ∼137 m above sea level (ASL)], Kibaha (6°46′ S, 38°58′ E, ∼162 m ASL), Ukiriguru (2°43′23′′ S, 33°1′39′′ E, 1229 m ASL), and Maruku (1°24′55′′ S; 31°46′48′′ E, 1340 m ASL). Populations 412×425, MP4, and MP5 were generated at the IITA crossing sites in Nigeria. Pedigrees of parents are shown in Supporting Information, Table S2. Cassava stakes for planting were obtained from research stations or farmers’ fields. For populations developed in Tanzania (with the exception of ARAL), more than 4000 hand pollinations were performed according to Kawano (1980). More than 10,000 seeds were generated with more than 1000 seeds per population except for population MT. After 3 months, seeds were sown in trays and raised in a screen house for 1 month before transplanting into the field at Makutupora research station in Dodoma, Tanzania. Poor germination rates resulted in a total of approximately 3500 seedlings, with further losses incurred during field establishment.
Table 1. Mapping populations used in this study.
Population | Female Parent | Male Parent | Cross Type | No. of Individuals Sequenced | No. of Validated Progeny | Purpose of Cross (Segregating Traits) |
---|---|---|---|---|---|---|
ARALa | AR40-6 | Albert | F1 | 154 | 129 | CBSD and green mite resistance |
KARa | Kiroba | AR37-80 | F1 | 192 | 132 | CBSD and green mite resistance |
MP4b | TMS-IBA30001 | TMS-IBA961089A | F1 | 190 | 177 | Starch, dry matter content, CMD resistance, and root rot |
MP5 b | TMS-IBA961089A | TMS-IBA30001 | F1 | 187 | 162 | Starch, dry matter content, CMD resistance, and root rot |
MTa | Mkombozi | Unknown | F1 | 157 | 135 | CBSD resistance |
NCARa | Nachinyaya | AR37-80 | F1 | 240 | 233 | CBSD and green mite resistance |
NDLARa | NDL06/132 | AR37-80 | F1 | 247 | 244 | CBSD and green mite resistance |
NxAa | Namikonga | Albert | F1 | 303 | 256 | CBSD resistance |
TMEB419-S1c | TMEB419 | TMEB419 | S1 | 149 | 117 | Starch content |
412×425b | TMS-IBA4(2)1425 | TMS-IBA011412 | F1 | 177 | 155 | Root carotenoid content, CMD resistance |
Nine biparental (F1) and one self-pollinated (S1) populations were generated in which a variety of disease and agronomic traits were segregating. After sequencing, individuals that were not full sibs and/or had insufficient read depth for accurate variant calling were removed prior to map construction. CBSD, cassava brown streak disease; CMD, cassava mosaic disease.
Ferguson laboratory, International Institute of Tropical Agriculture (IITA) Nairobi, Kenya.
Rabbi laboratory, IITA Ibadan, Nigeria.
Egesi laboratory, National Root Crops Research Institute (NRCRI) Umudike, Nigeria.
For the TMEB419 S1 cross, cassava stakes for planting were derived from the NRCRI cassava breeding experimental plots and planted at a hybridization plot at Ubiaja, Nigeria. Up to 5000 hand pollinations were performed by selfing the same variety, using it as both male and female parents but with flowers and pollen from different plants. Approximately 800 seeds were generated. Seeds were sown in trays and raised in a screen house for 1 month before transplanting into the field at NRCRI Umudike experimental field in Nigeria. Poor germination rates resulted in a total of approximately 200 seedlings, with further losses incurred during field establishment.
DNA isolation
Genomic DNA was isolated from the F1 populations according to Dellaporta et al. (1983), with some modifications to allow for processing of many samples in parallel in small volumes. DNA was extracted from population TMEB419-S1 using the DNeasy Plant Mini Kit (Qiagen) following manufacturer protocol. In all cases, young apical leaves were first freeze-dried and then ground using a GenoGrinder beadmill at 1500 strokes/min for 2 min. The NxA, KAR, NCAR, MT, NDLAR, and ARAL populations were initially screened as described by Kawuki et al. (2013), with approximately 12 SSR markers that were polymorphic among the parents, to detect off-types and selfs. These were removed from the population prior to GBS. Additional off-types, half-sibs, and selfs were later detected by GBS and removed from mapping populations (see below).
Genotyping-by-sequencing library preparation and sequencing
GBS library construction was performed at the University of California, Berkeley (UC Berkeley) using a protocol adapted from Elshire et al. (2011). The restriction enzyme ApeKI [New England Biolabs (NEB)] was used in conjunction with the barcode sequences included in the Elshire article. Differences from their protocol following protocol optimization are described below and discussed further in the Results and Discussion section.
Y-shaped adapters were designed based on the Y-shaped Illumina DNA paired-end (PE) adapters (Illumina, Inc.; Figure S1). “Forward” adapter oligos had the sequence 5′ ACACTCTTTCCCTACACGACGCTCTTCCGATCTxxxx, and “reverse” oligos had the sequence 5′ CWGyyyyAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG; where xxxx represents the 4- to 8-bp barcode and yyyy represents its reverse complement. CWG is the ApeKI-specific overhang. Reverse oligos were phosphorylated at the 5′ end. Adapters were ordered from IDT as a “primer premix plate,” with standard desalting, in a total volume of 50 µl with each oligo at a concentration of 200 µM. Adapters were annealed at 50 µM in TE using a thermocycler: 95° 4 min, ramp −70° at 0.1°/5 sec, 25° 5 min, hold at 4°. Annealing was confirmed by running adapters on a 4% agarose gel next to single-stranded oligos of similar length, as annealed adapters run at a larger size. Annealed adapters were diluted 1:10 and then to 12 ng/µl in TE, and finally to 1.25 ng/µl in 10 mM Tris pH 8. Adapter volumes for the last two dilutions were based on quantitation with Picogreen reagent (Invitrogen) and an FLx800 microplate reader (Biotek Instruments, Inc.).
Libraries comprised 63–96 samples. Sample DNA quantitation was performed with Picogreen. Typically, DNA samples were diluted in water to approximately 20 ng/µl and re-quantitated before library preparation. Digests were performed on a 96-well plate, and each consisted of 100 ng DNA in 20 µl 1× NEBuffer #3 (NEB) with 5 U ApeKI. Three microliters of the desired pre-annealed and diluted adapter was added to each well of digested DNA, followed by ligation mix containing 720 cohesive end units T4 DNA ligase and 5 µl T4 DNA Ligase Reaction Buffer (NEB), to a final ligation volume of 50 µl. Ligations were pooled such that each offspring sample contributed an equal amount of DNA. For parental DNA samples, to ensure adequate sequence coverage, a greater amount of digested/ligated DNA was added to the pool. Pools contained a total amount of 1–2 µg DNA and were purified and concentrated using the MinElute PCR Purification Kit (Qiagen).
Size selection was performed on pooled libraries using a 2% agarose gel run at 140 V. A size fraction of 400–800 bp was excised from the gel and purified via the MinElute Gel Extraction Kit (Qiagen), melting the gel at room temperature to avoid G-C bias (Quail et al. 2008). Most libraries were amplified with five PCR cycles using Phusion polymerase (NEB), 460 ng of each PCR primer, and an extension time of 45 sec. After PCR amplification, libraries were cleaned using 0.7 volumes AMPure XP SPRI beads (Beckman Coulter, Inc.). Size distributions of Illumina libraries were assayed by the Vincent J. Coates Genomic Sequencing Laboratory (VCGSL) at UC Berkeley using a 2100 Bioanalyzer (Agilent Technologies, Inc.). Library concentrations were determined by the VCGSL using a Qubit (Life Technologies) and quantitative PCR. Quantitative PCR was performed with Kapa Biosystems’ Illumina Library Quantification Kits and Roche LightCycler 480, following all kit protocols. One hundred–basepair paired-end sequencing was performed by the VCGSL on HiSeq 2000 or 2500 instruments (Illumina, Inc.). Some libraries were sequenced more than once, usually because the first run was suboptimal. Sequence read totals for each population are given in Table 2.
Table 2. Sequence reads and variability.
Population Name | Total Reads | Average no. of reads/barcode | Coefficient of Variation |
---|---|---|---|
ARAL | 1,004,686,146 | 6,440,295 | 0.3689 |
KAR | 913,517,846 | 3,713,487 | 0.3669 |
MP4 | 734,460,036 | 3,672,300 | 0.6334 |
MP5 | 735,540,482 | 3,733,708 | 0.7023 |
MT | 609,010,716 | 3,421,408 | 0.7470 |
NCAR | 1,076,417,780 | 4,375,682 | 0.3830 |
NDLAR | 1,329,433,056 | 5,113,204 | 0.3741 |
NxA | 1,189,124,090 | 3,823,550 | 0.4403 |
TMEB419-S1 | 693,520,182 | 4,532,811 | 0.3051 |
412×425 | 1,183,937,274 | 6,399,660 | 0.5171 |
Total or average | 9,406,878,938 | 3,770,292 | 0.4853 |
Summary statistics are shown for the 10 mapping populations used in this study. The coefficient of variation for a given population is an average of libraries sequenced for that population. The total (for reads) or average (for average reads per barcode and coefficient of variation) are shown on the last line (bold).
Genotype calling and filtering
GBS data were analyzed by pipelining several widely used sequence analysis tools with custom scripts to extract markers with parental genotype combinations useful for the cross-pollinated (CP) genetic mapping strategy implemented by JoinMap (v4.1 2013, July 11 release) (Van Ooijen 2011). An outline of the pipeline is shown in Figure 1, and step-by-step command-line instructions are available at https://bitbucket.org/rokhsar-lab/gbs-analysis.
Figure 1.
Data analysis pipeline used in this study. Using a combination of publicly available and custom tools (gray text), the pipeline starts with sequence data and generates a map for each population (Pop) through a series of analyses (white text on blue). Finally, the maps are merged using LPmerge to generate a single composite map (Endelman and Plomion 2014).
To ensure high-quality data for genetic map estimation, the raw read data were trimmed of adapter sequences with the fastq-mcf tool from the ea-utils package (Aronesty 2011), demultiplexed with an allowance of one mismatch in the barcode sequence using a custom script, and then base-quality–trimmed (Q = 28) and aligned to the cassava v4.1 draft genome assembly using BWA (Li and Durbin 2009). Variants and genotypes were called using the HaplotypeCaller tool from the GATK (v2.7-2) (McKenna et al. 2010) with a minimum mapping quality threshold of 25. Because map estimation software can be sensitive to excessive missing data and genotyping error, genotypes with quality scores and read depths less than 30 and 10×, respectively, were marked as missing data and sites with more than 20% missing genotypes were discarded. To avoid spurious genotype calls within repeat regions, sites with average depth more than approximately 120 reads per individual or with log10(GATK HaplotypeScore+1) values more than 0.5 were removed. A maximum log10(GATK HaplotypeScore+1) value of 1.0 was enforced for the ARAL and 412×425 populations because they had been sequenced more deeply.
Further filtering removed individuals with insufficient data or half-sib, off-type, or (for biparental crosses) self-pollinated individuals (see below). The chi-squared P value for F1 Mendelian ratios was then calculated for each variant site, and sites with P < 0.05 were discarded. Parental genotypes were then inferred from the segregation pattern of each marker in the progeny and this information was used to impute missing parental genotypes in either parent. Markers were then grouped into nonoverlapping 50-kb bins and one marker of each segregating type (i.e., lm × ll, nn × np, hk × hk, ef × eg, and ab × cd) with the most genotyped individuals was chosen to represent each bin.
Identification of off-types, half-sibs, and selfs in the progeny
Cassava farmers prefer to grow several varieties in a field at one time, and this often leads to volunteer seedlings and, hence, genetic variability within nominally clonally propagated cassava. This can create spurious genetic variation within varieties used to generate populations. We therefore checked the progeny from each biparental mapping population for the presence of off-types, half-sibs, and selfs. We used metrics of relatedness to determine whether a progeny was a full-sib F1 (or S1). First, we used the frequency of genotypes violating, or inconsistent with, the expected Mendelian patterns of segregation. We defined this as the fraction of genotypes homozygous for the minor/unshared parental allele (or contained nonparental alleles) per individual. Individuals with a rate of Mendelian inconsistency in excess of 0.005 at a minimum genotype quality of 30 were discarded (Figure 2A).
Figure 2.
Analysis of parentage and sibling relationships. (A) The fraction of non-Mendelian genotypes detected in each individual in a population is plotted as a function of the minimum genotype quality (GQ) threshold. Off-types can be detected by a consistently high rate of Mendelian violation (black dotted lines), while, for legitimate progeny (solid gray), the non-Mendelian genotype rate is consistently lower than that of the putative parents (solid black lines). (B) A bivariate clustering analysis was performed on each population to verify parentage and full-sibling relationships. In this plot, we show the ARAL population as an example. Pairwise kinship coefficients phi are calculated between progeny and parents (Manichaikul et al. 2010). Each putative progeny is represented as a point in two-dimensional space colored by its cluster assignment: full-sib F1 progeny are in red and S1 progeny of AR40-6 (therefore unrelated to Albert) are in green; individuals consistent with being progeny from a full sib of Albert crossed to AR40-6 are in purple; individuals consistent with being the progeny from an S1 of AR40-6 being crossed to Albert are in blue.
Second, a bivariate clustering analysis was performed with the Mclust (v4.2) R package using as the similarity measure the kinship coefficient phi calculated by vcftools (v0.1.12) (Manichaikul et al. 2010; Danecek et al. 2011; Fraley et al. 2012; R Core Team 2014). Values of phi were calculated between all individuals and the putative parents, with each parent constituting an axis of potential genetic contribution (Figure 2B). Mclust calculates the optimal number of clusters under a Bayesian Information Criterion (BIC) regime for a given set of candidate multivariate Gaussian mixture models, the parameters for which are estimated via the Expectation-Maximization algorithm. Individuals are then classified into clusters by choosing the maximum conditional probability among the Gaussian distributions of the chosen mixture model (Fraley et al. 2012). Individuals belonging to a single cluster, or several overlapping clusters, near (X,Y) = (0.25,0.25) were kept for downstream analysis. We performed a univariate analysis of the S1 population data because Mclust assumes that the two variables of the bivariate normal distribution do not have covariance near unity, and performing a bivariate analysis on an S1 population would violate this assumption.
Calculating genetic maps for single populations
A genetic map was produced for each mapping population using JoinMap software to estimate linkage, map order, and distance. To make mapping calculations more tractable, only one marker was retained from a set of markers at the same genetic position and individuals with more than 50% missing data were removed. For each population, the minimum LOD threshold for grouping was determined by identifying grouping tree branches with stable marker numbers over increasing consecutive LOD values. Groups with three or more markers at the chosen minimum LOD threshold were retained for mapping. Specific LOD thresholds applied to each map are available in Table 3. Marker order and distances were determined using JoinMap’s Maximum Likelihood mapping algorithm for cross-pollinated (CP) populations. Default mapping parameters were assumed with the following modifications: spatial sampling thresholds reduced to 0.050, 0.025, 0.015, 0.010, and 0.005, and the number of Monte Carlo EM cycles increased to a value of 7. Maps were then compared with each other as a check for internal consistency (see next section) and genetically redundant markers that had been removed earlier were reincorporated into the component maps by assigning each marker to the LG and genetic position of the marker that was physically closest to it. The scaffolds in the draft assembly were used to determine physical distance. Finally, each map was converted from Haldane to Kosambi map units prior to merging.
Table 3. Mapping parameters and statistics.
Population | No. of Markers | No. of framework markers | LOD Threshold | LGs |
---|---|---|---|---|
ARAL | 6765 | 3657 | 13.0 | 21 |
KAR | 3047 | 1952 | 8.0 | 19 |
MP4 | 3392 | 1903 | 6.0 | 19 |
MP5 | 3388 | 1803 | 6.0 | 21 |
MT | 3991 | 2301 | 10.0 | 18 |
NCAR | 5192 | 2894 | 16.0 | 20 |
NDLAR | 4460 | 2385 | 12.0 | 20 |
NxA | 3940 | 2241 | 15.0 | 18 |
TMEB419-S1 | 4340 | 1943 | 10.0 | 21 |
412×425 | 5942 | 2975 | 7.0 | 18 |
Composite map | 22,403 | NA | NA | 18 |
For each population and the composite map, we report the total number of markers and the number of markers used by JoinMap for map estimation, the minimum LOD threshold for defining LG, and the number of LGs output from the grouping procedure. NA, not applicable
Quality control of component maps
Maps containing inter-marker genetic distances of 50 cM or greater were remapped with increased simulated annealing chain length and stopping chains. This process was iterated until maps no longer contained such distances or JoinMap ran out of memory. If these distances could not be removed before memory failure, then individual markers near the interval with extreme values of “Nearest Neighbor Fit” (a measure reported by JoinMap to indicate the compatibility of one marker between two flanking markers) were removed by trial and error (recalculating the genetic map after each iteration) (removed markers are listed in File S1). If this failed to fix the gap, then the LG was split manually into two sets of parent-specific markers (i.e., lm × ll and nn × np). Separate linkage maps were calculated for each parent using the methods described above. Single-parent maps that broke at the location of the initial marker gap were included for merging.
The relative correspondences and orientations of component LGs between populations were determined by examining dot plots, plotting the genetic positions of markers between pairs of maps (Figure 3). LGs were split or joined by following majority rule of the set of cognate LGs. For example, a component LG sharing markers with two LGs in each of the other component maps was split; multiple smaller LGs in one component map sharing markers with a single LG in each of the other component maps were joined. In joining, an interval was inserted equal to the average genetic distance found in the other corresponding component groups.
Figure 3.
Pairwise comparison of single population maps. (A) An example of a pairwise comparison between maps. Every marker is plotted at a position corresponding to its genetic distance in each map (for shared markers) or along the axis (for markers unique to one map). Shared markers reveal the correspondences and relative orientations between LGs produced by JoinMap (arbitrarily numbered). Runs of shared markers appearing as approximately straight, continuous lines demonstrate marker orders consistent between the two maps; a positive or negative gradient indicates identical or opposite orientation, respectively, in the two maps. In addition, LGs to be joined or split are revealed, respectively, as multiple unusually small LGs in one map corresponding to a single LG in another map (NCAR LG 3, and KAR LGs 13 and 14) and as a single unusually large LG in one map corresponding to multiple LGs in the other (NCAR LGs 7 and 9, and KAR LG 1). (B) An example of a “V”-shaped dot plot pattern, typically observed in LGs, that required (and could be corrected with) JoinMap parameters that increased the sensitivity (see Materials and Methods).
Merging maps with LPmerge
We chose the ARAL map to be the reference for numbering and orienting LGs during merging (Table S1) because it was deeply sequenced, and this map had the best agreement between genetic and physical marker order. The initial numbering of LGs followed that produced by JoinMap. As described above (Figure 3), a set of corresponding LGs and orientations was constructed for each of the 18 chromosomes in cassava. Each of these sets of corresponding LGs was merged into a composite map with LPmerge v1.4 (Endelman and Plomion 2014), with maps being weighted by population size. LPmerge was run 10 times with the maximum interval parameter set to each of the values in the range 1–10. We then chose the merged map with the value of maximum interval that gave the shortest total composite map length, as described in the documentation for LPmerge. For most LGs, the value of maximum interval was 1 (Table S1).
Each merged map was plotted along with all its component maps on a plot of marker number against genetic map distance. We noticed that markers at the end of some LGs (Figure S2) were tens of cM distant from adjacent markers. These markers were found to be present in only one contributing map and were distorting the merged map distances calculated by LPmerge. To fix this, terminal markers from single contributing maps were removed and another round of merging with LPmerge was performed. This process was repeated until there were no singleton markers at the end of a LG.
Correcting scaffolding mis-assemblies
Mapped markers from all 10 maps were considered simultaneously. Scaffolds from the v4.1 assembly (Prochnik et al. 2012) containing markers that mapped to different LGs were broken as long as the markers on the different LGs were derived from at least two maps or there were at least two markers from a single map. Mean pairwise linkage disequilibrium (LD) r2 statistics (Purcell et al. 2007) were calculated using vcftools (Danecek et al. 2011) for all variants derived from putatively mis-assembled scaffolds. Scaffolds were broken at scaffolding gaps within each region flanked by markers of different LG assignments if no evidence of LD was observed. If multiple scaffolding (i.e., sequence) gaps existed between flanking markers, then the gap with the fewest supporting mate-pairs used for scaffolding the v4.1 assembly was broken. If multiple, equally supported gaps existed, then the scaffold was broken at the largest gap. Minority markers from unbroken candidate mis-assembled scaffolds were removed.
Scaffold anchoring and orienting
After breaking mis-assembled scaffolds, the v4.1 genome assembly was anchored to the genetic map and joined with 1000 Ns to produce pseudomolecules. Scaffold order was determined by median genetic position and orientation by the sign of Kendall’s rank correlation coefficient (tau) between physical and genetic positions. Scaffolds that could not be assigned to any LG (due to ambiguity in grouping or lack of markers) were not incorporated into the pseudomolecules and retained their original identifiers from the v4.1 assembly. Scaffolds that were anchored but could not be oriented were incorporated in their original (i.e., arbitrary) orientations. The LGs were then ordered by decreasing genetic size and numbered with Roman numerals to provide a canonical nomenclature (Figure 4).
Figure 4.
Marker distribution in the composite map. The scale on the left shows the map distance in cM. Our composite map consists of 18 LGs with a marker density of 1.95 nonredundant markers per cM. Previous maps did not recapitulate 18 LGs with a clear one-to-one mapping to our map, so we adopted a Roman numeral numbering system, ordered by decreasing genetic size.
Results and Discussion
To produce a robust and dense genetic map that can be used to follow traits segregating in diverse mapping populations, we generated 10 mapping populations (Table 1) and genotyped them using a modified GBS approach (Elshire et al. 2011). We were stringent in selecting markers to both mitigate the effects of missing data inherent in a GBS approach and to ensure that variants were segregating in a Mendelian fashion. We then generated 10 independent genetic maps from these populations. Comparison between them allowed us to identify cases where the standard map estimation approach was unable to resolve consistent LGs, and reconciling these differences produced a highly concordant set of maps. After constructing maps using a reduced set of stringently selected markers, we reintroduced genetically redundant markers into the maps, and these individual component maps were then integrated into a single composite framework map with the expected 18 LGs. Finally, we used this map to organize the v4.1 draft cassava genome sequence (Prochnik et al. 2012) into 18 pseudomolecules, anchoring 90.7% of the predicted genes onto LGs.
Mapping populations
We analyzed nine F1 and one S1 mapping populations derived from 14 diverse cultivated varieties of cassava segregating for agriculturally important traits (Table 1). One population included here was recently used for a QTL analysis for cassava mosaic disease (CMD) resistance (Rabbi et al. 2014b). Due to their desirable agricultural properties, some cultivars were used as parents in multiple crosses. Six populations from the Ferguson group at IITA were selected specifically for segregation of resistance to cassava brown streak disease. Several of these parents have a known wild species component in their pedigree (Table S2).
Reduced representation sequencing
Several groups have developed inexpensive methods for generating a reduced representation of a genome with a restriction enzyme and sequencing to obtain genotypes (genotyping-by-sequencing or GBS) (Altshuler et al. 2000; Baird et al. 2008; Davey et al. 2011; Elshire et al. 2011). To generate mapping markers in cassava, we adapted the GBS approach of Elshire et al. (2011) (Materials and Methods). We used Y-shaped GBS adapters to ensure that all pieces of DNA with adapter ligated to both ends could cluster and be sequenced. Because the same barcode was added to both ends of the DNA extracted from a given individual (Figure S1), checking for the identity of barcodes from both ends provided additional quality control. To save time, we dispensed with the DNA and adapter drying step before digestion. To allow for robust ligation, we phosphorylated the “reverse” oligos of our adapters at the 5′ end (Materials and Methods, Figure S1); although this can increase the number of adapter dimers, these dimers are effectively removed by size selection as noted below. To maximize the number of DNA fragments that are ligated to adapters, we used adapters in excess. This also prevents the formation of DNA concatemers, which would confound downstream analysis of reads. To increase the accuracy of mapping reads to the genome, the probability of detecting variation adjacent to a cut site, and the amount of data, we performed 100-bp paired-end sequencing, rather than the single-end 86-bp sequencing of Elshire et al. (2011).
During protocol optimization, we found that a gel-based size selection step led to more robust amplification of the desired size fraction. Fragments of sizes 400–800 bp cluster well on the flow cell but are unlikely to contain adapter-mers. A gel-based size selection also provides for modularity in the GBS approach: excising a narrower size range from the gel can be a simple way to sample fewer sites in the genome if fewer markers or more depth per site are needed, and/or to facilitate multiplexing more samples per lane. In the PCR, we used Phusion polymerase rather than Taq to minimize error, and we found that increasing the amount of PCR primer yielded more amplified library. We reduced the number of PCR cycles to five to reduce bias that can interfere with accurate variant calling. The 0.7× ratio of SPRI beads to DNA removes fragments less than 300 bp in length (Quail et al. 2008; Lennon et al. 2010), and thus effectively removes any remaining adapter-mers and excess PCR primer (neither of which are detected in appreciable amounts by Bioanalyzer on completed libraries).
Genotyping-by-sequencing performance in cassava
Our adapted GBS protocol makes use of the ApeKI restriction enzyme used in the protocol developed by Elshire et al. (2011) and recommended by Hamblin & Rabbi (2014) for cassava. This enzyme and our method of library preparation allowed us to sample, per population, an estimated 85.5k restriction cut sites (with 10 reads or more) distributed throughout the genome on 5900 scaffolds (approximately 173 cut sites/Mbp) and sampling approximately 42k variable loci. Multiplexing up to 96 samples per lane of sequencing was effective for our outbred populations: we obtained an average of approximately 3.7 million reads per individual in a typical 96-plex sequencing run.
Genotype calling, filtering, and quality control
We genotyped all individuals against the currently available v4.1 draft cassava genome assembly with the GATK (Mckenna et al. 2010). Based on an examination of the relationship between minimum genotype quality (GQ) threshold and the rate of violation of Mendelian segregation, we set a minimum GQ of 30, which effectively filtered low-confidence genotype calls (Figure 2A). In addition, we removed 44 individuals with excessive genotyping error or in which the pattern of marker segregation was inconsistent with that expected in an F1 or S1 population.
Because botanical seed-based fecundity is low in cassava, generating large mapping populations often requires crossing multiple cloned parental genotypes over an extended period of time, increasing the likelihood of error in tracking parental or progeny genotypes. We therefore performed a relatedness analysis (Materials and Methods) to identify and remove individuals that were not part of the intended cross (Figure 2B). Across the 10 populations, we found 14 individuals that were unrelated to both parents and 68 related to only a single parent (26 selfs, 2 clones, 40 half-sib offspring), and identified 119 additional individuals with more complex relationships not suitable for mapping. Individuals identified by the relatedness and/or the Mendelian violation analyses were removed from further analysis. The number of progeny sequenced and used for map construction from each population is listed in Table 1. In one extreme case, none of the offspring in the MT population was related to the pollen parent that was sequenced [putatively TMS4(2)1425] (Figure S3); furthermore, neither these offspring nor the putative TMS4(2)1425 parent matched the TMS4(2)1425 parent of the 412×425 population. We could, however, confirm that most members of the MT population were full sibs. Given this fact, it was nevertheless possible to statistically infer reliable genotypes for the true pollen parent and estimate the combined genetic map for the population, as was performed for all other populations included in this study (see next section). Each population of verified full-sib progeny was then subjected to a significance threshold (P < 0.05) for deviation from Mendelian genotype frequencies, from which we obtained approximately 4400 segregating sites per population (Table 3).
Calculating genetic maps
Genetic maps were estimated for each population using the Maximum Likelihood mapping algorithm for CP populations implemented by JoinMap (v4.1). Forty-four (22.6%) component LGs contained intervals 50 cM or larger that could not be corrected by increasing the simulated annealing parameters and required manual curation. In this curation, markers that were causing the large gaps were identified by trial-and-error and removed. A total of 179 (out of 24,403 mapped) markers were discarded (File S1). After this step, one LG still contained a large gap. For this LG, a separate map was made from the markers unique to each parent and the LGs split into two pieces at the position of the gap. Subsequently, shared markers were reincorporated based on their scaffold coordinates.
Resolving discrepancies between maps, refining maps, and gauging accuracy
Comparing the 10 component linkage maps with each other allowed us to identify and correct inconsistencies, including incorrectly split or joined LGs. We were able to verify long-range colinearity as the median number of markers shared between any two maps was 793 (quartiles Q1 = 656 and Q3 = 983), or approximately 44 per LG (Figure 3A). Occasionally, the dot plot pattern of a LG was “V”-shaped (Figure 3B), indicating inconsistent marker order in one of the maps. Because we could compare 10 independently constructed maps with each other, it was straightforward to identify those specific maps that differed from the consensus. In all but two cases (see below), the discrepant “V”-shaped dot plots were resolved into colinear relationships simply by extending the simulated annealing in JoinMap. Thus, comparison among the various maps allowed us to identify cases where JoinMap had not yet converged to a final marker ordering, and to ultimately reach consistent convergence relative to other maps.
Most LGs had a one-to-one correspondence with an LG in each of the other populations. However, in a total of 11 cases, this was not true and we used a majority rule approach to decide whether the component LGs should be split (two cases) or joined (nine cases). In one additional case, the decision to break or join was ambiguous, because five populations each contained two LGs that together corresponded to single LGs in the other five populations. We resolved this ambiguity by jointly examining the dot plots for the two-LG component maps against a corresponding single-LG map. From this, we discovered that none of the breaks in the different two-LG component maps occurred at the same genetic position. It is therefore likely that the LGs should be joined in the two-LG component maps to generate single LGs. Finally, three component LGs were not included in the merging because they could not be resolved into a map with a linear relationship with the corresponding LG from the other maps, and four were discarded because, although they could be matched with LGs in other maps, they contained too few markers to be confidently oriented.
Comparing maps with one another using dot plots provides a visual measure of internal consistency, whereas accuracy of marker order may be estimated by comparison to the physical sequence. We calculated Kendall’s rank correlation coefficient (tau) between physical and genetic positions on scaffolds that contained 10 or more markers and were not broken (described below). Across all maps and scaffolds examined, the median value tau was 0.84 (quartiles Q1 = 0.748 and Q3 = 0.908), indicating good agreement between physical and genetic positions.
Reincorporation of genetically redundant markers
To more fully represent the genetic diversity found within the mapping populations, 8418 markers that had been excluded earlier based on their genetic redundancy (Materials and Methods) were reincorporated into the composite map. By creating datasets with the redundant markers included before and after map integration, and then by comparing their median Kendall’s rank correlation coefficients tau (for 46 scaffolds with 50 or more markers, totaling 64.5 Mb of sequence), we found that including the redundant markers prior to map integration increased the marker order accuracy (tau = 0.765 vs. tau = 0.726). This improvement arises because the same genetically redundant marker is not always chosen to represent its genetic position in all component maps, and reincorporating redundant such markers increases the number of marker order constraints used by LPmerge (Endelman and Plomion 2014) when a marker order conflict is encountered during map integration.
An integrated framework map
With the exception of the three LGs noted above, the component maps were colinear with each other (Figure 3A). We next combined them into a single composite map. We merged genetic maps by finding shared markers between individual maps. Ideally, all shared markers would be colinear, but in many situations discrepancies needed to be resolved. Sometimes these discrepancies could be due to rearrangements in the genome of one of the parents, but most often they are due to inaccurate genetic distances between markers because of insufficient recombinations, the stochastic nature of recombination, and errors introduced in individual map construction by missing and/or erroneous marker data. We used LPmerge (Endelman and Plomion 2014) to generate a composite map from the 10 component maps we had generated because it merges maps without recalculating recombination frequencies, drastically reducing the computational time required. LPmerge generates a consensus map with the minimum absolute error relative to the component maps while preserving the marker order by imposing linear inequality constraints and deleting a minimum number of constraints if a marker order conflict is observed (Endelman and Plomion 2014).
Combining maps from multiple independent crosses has the advantage of increasing the genetic diversity that is captured in the map, increasing support for marker order and position, and allowing markers from a single map to be placed relative to other markers. Composite maps have been generated from six Diversity Array Technology [DArT]-based and RFLP-based maps for sorghum (Mace et al. 2009), six SSR-based maps for pigeonpea (Bohra et al. 2012), and 11 SSR-based maps for groundnut (Gautami et al. 2012), but, to our knowledge, no attempt has been made to integrate this many GBS-SNP–based genetic maps.
After merging, we noticed that markers at the ends of LGs that belonged to a single map were placed on the merged map tens of cM from their neighbors (Figure S2). To improve the merging, we removed terminal markers belonging to a single map and repeated the merging until no further singleton markers remained. This removed a total of 182 markers (Table S1) to generate a 2412-cM map containing 22,403 markers, averaging 134 cM and 1245 markers per LG (Table 4, File S2). The SNP density (∼9 SNPs per cM) is substantially higher than that of the previous densest map (mean inter-SNP distance = 0.52 cM) (Rabbi et al. 2014b), although this straightforward statistic does not account for the fact that many of the markers are genetically redundant (Table 4). For the first time, a cassava genetic map recapitulates the coherent set of 18 LGs matching the cassava chromosomes (Figure 4).
Table 4. Linkage groups in the composite map.
Chromosome (LG) | No. of Markers | Length (cM) | Average Marker Density (markers/cM) | Maximum Inter-marker Distance (cM) | No. of Scaffolds Anchored | No. of Bases Anchored |
---|---|---|---|---|---|---|
I | 2323 | 164.78 | 2.72 | 3.63 | 90 | 26,714,966 |
II | 1366 | 164.22 | 2.40 | 7.85 | 81 | 24,343,195 |
III | 1326 | 155.60 | 2.04 | 5.97 | 116 | 22,858,152 |
IV | 1459 | 148.73 | 1.64 | 18.41 | 93 | 21,649,806 |
V | 1330 | 146.87 | 1.57 | 6.12 | 81 | 23,125,959 |
VI | 1462 | 144.73 | 2.56 | 3.42 | 79 | 22,319,908 |
VII | 848 | 141.96 | 1.65 | 6.73 | 88 | 18,888,952 |
VIII | 1212 | 137.48 | 2.01 | 7.21 | 111 | 23,111,533 |
IX | 1207 | 137.35 | 1.68 | 6.01 | 92 | 20,667,830 |
X | 1011 | 133.31 | 2.03 | 5.58 | 95 | 20,387,986 |
XI | 1330 | 132.16 | 2.12 | 2.56 | 96 | 20,727,479 |
XII | 863 | 128.85 | 1.30 | 10.09 | 103 | 22,667,256 |
XIII | 865 | 125.09 | 1.89 | 4.92 | 107 | 20,100,115 |
XIV | 1346 | 120.26 | 1.48 | 11.74 | 88 | 17,859,824 |
XV | 1548 | 117.89 | 2.54 | 2.56 | 72 | 20,107,995 |
XVI | 920 | 117.13 | 1.56 | 5.71 | 75 | 18,834,636 |
XVII | 974 | 104.95 | 1.71 | 7.48 | 95 | 20,464,108 |
XVIII | 1013 | 90.99 | 2.41 | 8.46 | 90 | 17,820,998 |
Total | 22,403 | 2412.35 | 35.31 | 124.45 | 1652 | 382,650,698 |
Average | 1245 | 134.02 | 1.96 | 6.91 | 91.78 | 21,258,372 |
The number of markers and genetic distances are shown for the 18 LGs. The average marker density is calculated for genetically nonredundant markers only, i.e., only one marker at a given genetic position was included.
We observed a steady increase (albeit with diminishing returns) in the number of markers and amount of assembled sequence that can be placed as a function of the number of mapping populations that were used (Figure 5). We note, however, that even with the addition of the tenth component map, we were still adding markers and anchoring more genome sequence (Figure 5), an observation that may be of interest to scientists building genetic maps for diverse plants. Future work will merge additional maps to take advantage of this feature. However, because the majority of markers were present in only one map (12,474 out of 22,403), they may be placed inaccurately in the merged map because they are only constrained by a single component map. This will be improved by the generation of a less fragmented genome assembly.
Figure 5.
Additional maps incorporate more markers, scaffolds, and anchored bases. These plots show the effects of adding maps to the framework map. Each additional map incorporates more genetically nonredundant markers (A) into the framework map, but the number of scaffolds incorporated is saturating (B) and the number of mapped bases (C) is reaching a plateau. This is because the scaffolds being added in later maps are getting smaller and smaller and, hence, adding ever fewer bases.
All maps were, to a large extent, colinear with the integrated framework map (Figure S4); however, component LGs from NxA (e.g., Chromosome II; Figure S4B) as well as 412×425 (e.g., Chromosome XVII; Figure S4Q) displayed notable divergence from this, likely because these maps were of lower quality. Finally, across the whole genome, we observed four gaps larger than 10 cM (terminal portion of Chromosome V; two on Chromosome III and one on Chromosome VIII). These could be recombination hotspots or regions that are identical-by-descent and thus lack polymorphisms.
Generation of chromosome pseudomolecules
One of our goals was to organize the fragmented draft genome assembly (Prochnik et al. 2012) into chromosome-scale sequence; therefore, it was necessary to first detect possible discrepancies between the genetic map and physical sequence. Of the 1347 scaffolds that contained multiple markers (median size = 174 kb), 41 scaffolds contained markers along their lengths that mapped to different LGs. Because all component maps used to generate the composite map were internally consistent, these discrepancies were unlikely to be due to errors in the genetic map and suggested a sequence mis-assembly in the scaffolds in question. Manual review, along with the aid of linkage disequilibrium and scaffolding information, allowed us to break likely misjoins where weak scaffolding linkages had been made, often in regions with a high density of scaffolding gaps. After breaking, we then organized the sequence assembly into pseudomolecules by anchoring the scaffolds onto LGs using their median genetic position. The resulting chromosome-scale sequences incorporate 1608 scaffolds and 382 Mb (71.9%) of the v4.1 draft genome sequence (Table 4). This includes 462 (94.9%) of the N50 scaffolds, 1430 (53.9%) of the N90 scaffolds, and 27,825 (90.7%) of the 30,666 predicted protein-coding genes. Of the scaffolds anchored onto the assembly, we could also orient 1024 (63.7%) of them by calculating the Kendall rank correlation coefficient tau between physical and genetic positions. Together, these oriented scaffolds comprise 315 Mb of sequence.
Concluding remarks
Here, we present a consensus genetic map of cassava that combines 10 mapping populations. Unlike many previous maps (Fregene et al. 1997; Okogbenin et al. 2006; Kunkeaw et al. 2010; Sraphet et al. 2011; Rabbi et al. 2012, 2014a,b), we recovered the 18 LGs predicted by the karyotype of M. esculenta (De Carvalho and Guerra 2002) and at the same time dramatically increased the number of markers to more than 22,400. The majority of genes in the current draft genome were placed on a LG and at their approximate chromosomal position, resulting in the first chromosome-scale assembly of cassava and a map that will serve as a valuable guide for future genome assembly improvements. Our approach can be readily extended to include future mapping populations, although, as seen here, care must be taken to explore discrepancies between biparental or selfed maps that may arise from mistaken parentage, missing or erroneous data, and the details of map construction. Using multiple maps at the same time allows such individual component map problems to be identified and corrected (or excluded from merger).
Within the context of disease threat, climate change and food insecurity, improving cassava for smallholder farmers in developing countries is becoming more urgent. A number of projects and collaborations are already engaged in this task. These initiatives are rapidly incorporating modern genetic tools and techniques such as genomic selection and QTL analysis, techniques that depend on an accurate genetic map and genome assembly. The chromosome-scale genome sequence and composite map we report here will allow cassava geneticists and breeders to generate the robust cassava varieties needed both for food security and to improve the economic conditions of smallholder farmers around the world.
Additional Information
The composite genetic map is available in File S2, and at CassavaBase (http://cassavabase.org/cview/map.pl?map_id=3). The pseudomolecule assembly can be downloaded from Phytozome at http://genome.jgi.doe.gov/pages/dynamicOrganismDownload.jsf?organism=Mesculenta under the directory “v5.1_unreleased”. Demultiplexed sequence reads, with the barcode and ApeKI cutsite removed, are available in the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) via Study Accession Number SRP051207.
Members of the International Cassava Genetic Map Consortium
Consortium members are listed alphabetically, with their affiliations and contributions. Oluwafemi A. Alaba [O.Alaba@cgiar.org; International Institute of Tropical Agriculture (IITA), PMB 5320, Ibadan, Oyo State, Nigeria]: DNA extraction, GBS library preparation. Jessen V. Bredeson [jessenbredeson@berkeley.edu; University of California, Berkeley, Department of Molecular and Cell Biology, Berkeley, CA, USA]: Design of bioinformatics pipeline, generation of GBS-SNPs, genetic maps, anchored chromosome sequence, project coordination, wrote the paper. Chiedozie N. Egesi [cegesi@yahoo.com; National Root Crops Research Institute (NRCRI), Umudike, Nigeria]: NRCRI Breeding team leadership and generation of S1 population, edited the manuscript, contributed writing. Williams Esuma [esumawilliams@yahoo.co.uk; National Crops Resources Research Institute (NaCCRI), Namulonge, Uganda]: Development of mapping populations, DNA extractions. Lydia Ezenwaka [lydiaezenwaka@yahoo.com; National Root Crops Research Institute (NRCRI), Umudike, Nigeria]: MSc Student. Morag E. Ferguson [M.Ferguson@cgiar.org; International Institute of Tropical Agriculture (IITA) Nairobi, Kenya]: Management of IITA Nairobi crosses, scientific leadership, edited the manuscript, contributed writing. Cindy M. Ha (cindy.ha@ucdenver.edu; University of California, Berkeley, Department of Molecular and Cell Biology, Berkeley, CA. *Current institution Anschutz Medical Campus, University of Colorado, Denver, CO): Generating GBS-SNPs and component genetic maps. Megan Hall (mhall45@gmail.com; University of California, Berkeley, Department of Molecular and Cell Biology, Berkeley, CA): GBS capacity building at UC Berkeley. Liezel Herselman (HerselmanL@ufs.ac.za; Department of Plant Sciences, University of the Free State, Bloemfontein, South Africa): PhD supervisor. Andrew Ikpan (A.Ikpan@cgiar.org; Headquarters IITA-Nigeria, PMB 5320, Ibadan, Oyo State, Nigeria): Generated IITA biparental mapping populations. Elly Kafiriti (ekafiriti@gmail.com; Naliendele Agricultural Research Institute, PO Box 509, Mtwara, Tanzania): Zonal Agricultural Research Director, Southern Zone. Edward Kanju (e.kanju@cgiar.org; IITA Mikocheni, PO Box 34441, Dar es Salaam, Tanzania): Selection of parents of mapping populations. Fortunus Kapinga (fakapinga@yahoo.com; Naliendele Agricultural Research Institute, Box 509, Mtwara, Tanzania): Development of mapping populations, DNA extractions. Arthur Karugu (a.karugu@cgiar.org; CGIAR/CIMMYT Nairobi, Kenya): DNA extractions. Robert Kawuki [kawukisezi@yahoo.com; National Crops Resources Research Institute (NaCCRI), Namulonge, Uganda]: Development of mapping populations, DNA extractions. Bernadetha Kimata (bkimatha@yahoo.co.uk; Naliendele Agricultural Research Institute, Box 509, Mtwara, Tanzania): DNA extraction. Paul Kimurto (kimurtopk@gmail.com; Egerton University Box 536, Njoro, Kenya): MSc supervisor. Peter Kulakow (P.Kulakow@cgiar.org; Headquarters IITA-Nigeria, PMB 5320, Ibadan, Oyo State, Nigeria): Mapping population development. Heneriko Kulembeka (kulembeka@yahoo.com; Lake Zonal Agricultural Research and Development Institute; Box 1433; Mwanza; Tanzania): Selection of parents of mapping population, development of mapping populations. Paul Kusolwa (kusolwap@gmail.com; Sokoine University of Agriculture, Box 3005, Morogoro, Tanzania): MSc supervisor. Jessica B. Lyons (jblyons@berkeley.edu; University of California, Berkeley, Department of Molecular and Cell Biology, Berkeley, CA): Optimization of GBS protocol, GBS library preparation and sequencing, project coordination, wrote the paper. Esther Masumba (emasumba@yahoo.com; Sugarcane Agricultural Research Institute, Box 30031 Kibaha, Tanzania): Development of mapping populations, DNA extractions. Albe van de Merwe (albe.vdMerwe@up.ac.za; Department of Genetics, University of Pretoria, South Africa): PhD supervisor. Geoffrey Mkamilo [geoffreymkamilo@yahoo.co.uk; Naliendele Agricultural Research Institute (ARI), Box 509, Mtwara, Tanzania]: Head of Roots and Tuber Crops program at ARI. Alexander A. Myburg (zander.myburg@fabi.up.ac.za; Department of Genetics, Forestry and Biotechnology University of Pretoria, South Africa): PhD supervisor. Ahamefule Nwaogu [ahamefulenwaogu@yahoo.com; National Root Crops Research Institute (NRCRI), Umudike, Nigeria]: Field management. Inosters Nzuki [i.nzuki@cgiar.org; Department of Genetics, University of Pretoria, South Africa 0002, Pretoria; and Biosciences Eastern and Central Africa (BecA) 30709-00100 Nairobi Kenya]: Development of mapping populations, DNA extraction. Bunmi Olasanmi (bunminadeco@yahoo.com; University of Ibadan, Ibadan, Nigeria): Generation of S1 mapping population. Emmanuel Okogbenin [eokogbenin@cassavacop.org; National Root Crops Research Institute (NRCRI), Umudike, Nigeria]: MSc supervisor. Onyeyirichi Onyegbule [princessonyin@yahoo.com; National Root Crops Research Institute (NRCRI), Umudike, Nigeria]: DNA extractions. James O. Owuoche (owuoche@hotmail.com; Egerton University Box 536, Njoro, Kenya): MSc supervisor. Anthony Pariyo [tkakau@yahoo.co.uk; National Crops Resources Research Institute (NaCCRI), Namulonge, Uganda]: Development of mapping populations. Simon E. Prochnik [seprochnik@lbl.gov; DOE Joint Genome Institute (JGI), Walnut Creek, CA]: Generation of genetic maps, generation of merged composite map, deposition of sequence data in SRA, project coordination, wrote the paper. Ismail Y. Rabbi (I.RABBI@cgiar.org; Headquarters IITA-Nigeria, PMB 5320, Ibadan, Oyo State, Nigeria): Development and management of mapping populations IITA Ibadan, edited the manuscript, contributed writing. Daniel S. Rokhsar [dsrokhsar@gmail.com; University of California, Berkeley, Department of Molecular and Cell Biology, Berkeley, CA; and Department of Energy Joint Genome Institute (JGI), Walnut Creek, CA]: Scientific and project leadership, wrote the paper. Steve Rounsley (rounsley@email.arizona.edu; University of Arizona, Tucson, AZ; and Dow Agrosciences, Indianapolis, IN): Scientific and project leadership. Kasele Salum (kasele_salum@yahoo.com; Lake Zonal Agricultural Research and Development Institute, Box 1433, Mwanza, Tanzania): DNA extraction. Kahya S. Shuaibu [kahyass@yahoo.com; National Root Crops Research Institute (NRCRI), Umudike, Nigeria]: DNA extraction, laboratory management, GBS library preparation. Caroline Sichalwe (carosicha@gmail.com; Sugarcane Agricultural Research Institute, Box 30031 Kibaha, Tanzania): DNA extraction. Mary Stephen (marystephen49@yahoo.com; Makutupora Agricultural Research Station, Tanzania): Field technician, managing field cassava seedling nursery.
Supplementary Material
Acknowledgments
The authors thank Richard Harland and Lisa Brunet for early input into optimization of the GBS protocol; Craig Miller lab/Andrew Glazer for reagents and helpful discussion; Minyong Chung, Justin Choi, Karen Lundy, and other members of the VCGSL at UC Berkeley for advice and technical assistance with sequencing; Jeff Endelman for advice on using LPmerge; and Ed Buckler, Jean-Luc Jannink, and Martha Hamblin, for helpful discussion. J.B.L., J.V.B., C.M.H., and work at UC Berkeley were funded by Bill and Melinda Gates Foundation (BMGF) Grant OPPGD1493 to S.R., D.S.R., and the University of Arizona. Development of populations at IITA Ibadan was funded by the CGIAR Research Program on Roots, Tubers, and Bananas. Next Generation Cassava Breeding grant OPP1048542 from BMGF and the United Kingdom Department for International Development supported S.E.P. and work at NRCRI and IITA. Development of the mapping populations in Tanzania was funded by BMGF grant OPPGD1016 to IITA. The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This work used the VCGSL at UC Berkeley, supported by National Institutes of Health S10 Instrumentation Grants S10RR029668 and S10RR027303.
Footnotes
Supporting information is available online at http://www.g3journal.org/lookup/suppl/doi:10.1534/g3.114.015008/-/DC1
Communicating editor: S. A. Jackson
Literature Cited
- Altshuler D., Pollara V. J., Cowles C. R., Van Etten W. J., Baldwin J., et al. , 2000. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407: 513–516. [DOI] [PubMed] [Google Scholar]
- Aronesty, E., 2011 Command-line tools for processing biological sequencing data. http://code.google.com/p/ea-utils.
- Awoleye F., Duren M., Dolezel J., Novak F. J., 1994. Nuclear DNA content and in vitro induced somatic polyploidization cassava (Manihot esculenta Crantz) breeding. Euphytica 76: 195–202. [Google Scholar]
- Baird N. A., Etter P. D., Atwood T. S., Currey M. C., Shiver A. L., et al. , 2008. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE 3: e3376. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bohra A., Saxena R. K., Gnanesh B. N., Saxena K., Byregowda M., et al. , 2012. An intra-specific consensus genetic map of pigeonpea [Cajanus cajan (L.) Millspaugh] derived from six mapping populations. Theor. Appl. Genet. 125: 1325–1338. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Danecek P., Auton A., Abecasis G., Albers C. A., Banks E., et al. , 2011. The variant call format and VCFtools. Bioinformatics 27: 2156–2158. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Davey J. W., Hohenlohe P. A., Etter P. D., Boone J. Q., Catchen J. M., et al. , 2011. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat. Rev. Genet. 12: 499–510. [DOI] [PubMed] [Google Scholar]
- de Carvalho R., Guerra G., 2002. Cytogenetics of Manihot esculenta Crantz (cassava) and eight related species. Hereditas 136: 159–168. [DOI] [PubMed] [Google Scholar]
- Dellaporta S. L., Wood J. J., Hicks J. B., 1983. A plant DNA minipreparation: version II. Plant Mol. Biol. Rep. 1: 19–21. [Google Scholar]
- Elshire R. J., Glaubitz J. C., Sun Q., Poland J. A., Kawamoto K., et al. , 2011. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE 6: e19379. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Endelman J. B., Plomion C., 2014. LPmerge: an R package for merging genetic maps by linear programming. Bioinformatics 30: 1623–1624. [DOI] [PubMed] [Google Scholar]
- Fraley, C., A. E. Raftery, T. B. Murphy, and L. Scrucca, 2012 MCLUST Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report no. 597, Department of Statistics, University of Washington.
- Fregene M., Angel F., Gomez R., Rodriguez F., Chavarriaga-Aguirre P., et al. , 1997. A molecular genetic map of cassava (Manihot esculenta Crantz). Theor. Appl. Genet. 95: 431–441. [Google Scholar]
- Gautami B., Fonceka D., Pandey M. K., Moretzsohn M. C., Sujay V., et al. , 2012. An international reference consensus genetic map with 897 marker loci based on 11 mapping populations for tetraploid groundnut (Arachis hypogaea L.). PLoS ONE 7: e41213. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hamblin M. T., Rabbi I. Y., 2014. The effects of restriction-enzyme choice on properties of genotyping-by-sequencing libraries: A study in cassava (Manihot esculenta). Crop Sci. 54: 1–6. [Google Scholar]
- Howeler R., Lutaladio N., Thomas G., 2013. Save and Grow: Cassava, Food and Agriculture Organization of the United Nations, Rome. [Google Scholar]
- Kawano K., 1980. Cassava, pp. 225–233 in Hybridization of Crop Plants, edited by Fehr W. R., Hadley H. H. American Society of Agronomy, Crop Science Society of America, Madison, WI. [Google Scholar]
- Kawuki R. S., Herselman L., Labuschagne M. T., Nzuki I., Ralimanana I., et al. , 2013. Genetic diversity of cassava (Manihot esculenta Crantz) landraces and cultivars from southern, eastern and central Africa. Plant Genetic Resources 11: 170–181. [Google Scholar]
- Kunkeaw S., Tangphatsornruang S., Smith D. R., Triwitayakorn K., 2010. Genetic linkage map of cassava (Manihot esculenta Crantz) based on AFLP and SSR markers. Plant Breed. 129: 112–115. [Google Scholar]
- Lennon N. J., Lintner R. E., Anderson S., Alvarez P., Barry A., et al. , 2010. A scalable, fully automated process for construction of sequence-ready barcoded libraries for 454. Genome Biol. 11: R15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H., Durbin R., 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mace E. S., Rami J. F., Bouchet S., Klein P. E., Klein R. R., et al. , 2009. A consensus genetic map of sorghum that integrates multiple component maps and high-throughput Diversity Array Technology (DArT) markers. BMC Plant Biol. 9: 13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Manichaikul A., Mychaleckyj J. C., Rich S. S., Daly K., Sale M., et al. , 2010. Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., et al. , 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20: 1297–1303. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Okogbenin E., Marin M., Fregene M., 2006. An SSR-based molecular genetic map of cassava. Euphytica 147: 433–440. [Google Scholar]
- Prochnik S., Marri P. R., Desany B., Rabinowicz P. D., Kodira C., et al. , 2012. The cassava genome: Current progress, future directions. Trop Plant Biol 5: 88–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M. A., et al. , 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81: 559–575. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Quail M. A., Kozarewa I., Smith F., Scally A., Stephens P. J., et al. , 2008. A large genome center’s improvements to the Illumina sequencing system. Nat. Methods 5: 1005–1010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- R Core Team , 2014. R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. [Google Scholar]
- Rabbi I., Hamblin M., Gedil M., Kulakow P., Ferguson M., et al. , 2014a Genetic Mapping Using Genotyping-by-Sequencing in the Clonally Propagated Cassava. Crop Sci. 54: 1384–1396. [Google Scholar]
- Rabbi I. Y., Hamblin M. T., Kumar P. L., Gedil M. A., Ikpan A. S., et al. , 2014b High-resolution mapping of resistance to cassava mosaic geminiviruses in cassava using genotyping-by-sequencing and its implications for breeding. Virus Res. 186: 87–96. [DOI] [PubMed] [Google Scholar]
- Rabbi I. Y., Kulembeka H. P., Masumba E., Marri P. R., Ferguson M., 2012. An EST-derived SNP and SSR genetic linkage map of cassava (Manihot esculenta Crantz). Theor. Appl. Genet. 125: 329–342. [DOI] [PubMed] [Google Scholar]
- Sraphet S., Boonchanawiwat A., Thanyasiriwat T., Boonseng O., Tabata S., et al. , 2011. SSR and EST-SSR-based genetic linkage map of cassava (Manihot esculenta Crantz). Theor. Appl. Genet. 122: 1161–1170. [DOI] [PubMed] [Google Scholar]
- van Ooijen J. W., 2011. Multipoint maximum likelihood mapping in a full-sib family of an outbreeding species. Genet. Res. 93: 343–349. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.