Skip to main content
Bioinformatics Advances logoLink to Bioinformatics Advances
. 2024 Jul 23;4(1):vbae107. doi: 10.1093/bioadv/vbae107

Genotype imputation in F2 crosses of inbred lines

Saul Pierotti 1, Bettina Welz 2, Mireia Osuna-López 3, Tomas Fitzgerald 4, Joachim Wittbrodt 5, Ewan Birney 6,
Editor: Guoqiang Yu
PMCID: PMC11286293  PMID: 39077633

Abstract

Motivation

Crosses among inbred lines are a fundamental tool for the discovery of genetic loci associated with phenotypes of interest. In organisms for which large reference panels or SNP chips are not available, imputation from low-pass whole-genome sequencing is an effective method for obtaining genotype data from a large number of individuals. To date, a structured analysis of the conditions required for optimal genotype imputation has not been performed.

Results

We report a systematic exploration of the effect of several design variables on imputation performance in F2 crosses of inbred medaka lines using the imputation software STITCH. We determined that, depending on the number of samples, imputation performance reaches a plateau when increasing the per-sample sequencing coverage. We also systematically explored the trade-offs between cost, imputation accuracy, and sample numbers. We developed a computational pipeline to streamline the process, enabling other researchers to perform a similar cost–benefit analysis on their population of interest.

Availability and implementation

The source code for the pipeline is available at https://github.com/birneylab/stitchimpute. While our pipeline has been developed and tested for an F2 population, the software can also be used to analyse populations with a different structure.

1 Introduction

Pure inbred lines and structured crosses have been the bedrock of early genetics (Mendel 1866) and they have been used in the earliest organisms explored in the field (Morgan 1910, Aida 1921). Over time, phenotype-based markers have been replaced by anonymous DNA variation such as single nucleotide polymorphisms (SNPs) (Lander 1996). With the advent of cheap high throughput sequencing, one can use low-pass sequencing of subsequent generations (e.g. F2 generation) to accurately determine the genotype of an individual. This process is a type of imputation (i.e. the statistical technique to infer missing data), and statistical models have been specifically developed for the linked inheritance patterns of DNA sequence variation. Imputation approaches have been used for plant breeding (Bhattarai et al. 2020), farm animal agricultural studies (Li et al. 2022), and model organism studies (Yao et al. 2021).

Most imputation methods in current use rely on a hidden Markov model (HMM) where the sample genotypes represent the emission space, and the reference panel haplotypes define a 2D grid of hidden states where each row is a reference haplotype, and each column is a marker. This is known as the Li and Stephens model after their foundational work on statistical modelling of linkage disequilibrium and recombination rates (Li and Stephens 2003). In populations where the founding haplotypes are known, e.g. F2 crosses from inbred mice, implementations of the Li and Stephens model such as R/qtl2 (Broman et al. 2019) can be used as an effective solution to the imputation problem. However, often the founding haplotypes are not available or too costly to obtain.

The success of genotype imputation in humans has been made possible by the construction of large reference panels of high-quality genotypes. Milestones in this direction are the HapMap Project (Altshuler et al. 2010), the 1000 Genomes Project (The 1000 Genomes Project Consortium 2015), and the Haplotype Reference Consortium (McCarthy et al. 2016). Some of the imputation tools in widespread use that take advantage of reference panels are GLIMPSE (Rubinacci et al. 2021), QUILT (Davies et al. 2021), ShapeIT (Delaneau et al. 2019), Beagle (Browning et al. 2018), miniMAC (Fuchsberger et al. 2015), and Impute2 (Howie et al. 2009).

The reliance on reference panels or high-quality founder haplotypes has hindered the application of genotype imputation in less-studied organisms that lack extensive reference data. STITCH (Davies et al. 2016) overcomes this limitation by recognizing that in populations founded by a limited number of individuals, the unknown ancestral haplotypes are sequenced at relatively high depth even though each single sample is sequenced at low depth. STITCH can impute genotypes from low-depth sequencing data by alternating the Li and Stephens HMM with an expectation maximization (EM) step where randomly initialized ancestral haplotypes are tuned to maximize the expectation under the model of the observed sample genotypes. This alternating procedure is repeated a parameterized number of times, leading to the reconstruction of the founder haplotypes and the imputation of the sample genotypes. Several hyperparameters need to be optimized for the reconstruction to be optimal, key among them the number of ancestral haplotypes to be reconstructed (K parameter). This optimization can be performed either by relying on quality metrics that are internal to the model (info score) or on external validation (correlation of imputed and known genotypes). Although STITCH has been used successfully in a number of diverse populations (Nicod et al. 2016, Zan et al. 2019, Scott et al. 2021, Zha et al. 2023, Blain et al. 2024, Li et al. 2024), often the parameterization and the assessment of the required level of coverage are presented as a single best choice and with little exploration of alternative parameters and experimental designs. Previous work explored some of the parameters that influence STITCH accuracy in an Angora rabbit population (Wang et al. 2022), but this work did not explore the complex interplay between sequencing depth, population size, and cross structure.

We are studying phenotypes in the medaka fish (Oryzias latipes), which is a small teleost fish native to Japan, Korea, and eastern China (Wittbrodt et al. 2002). It has long been used as a model organism for genetic research thanks to its comparatively small genome (700.4 Mb) (Kasahara et al. 2007) and to its economic husbandry and tolerance to inbreeding. Taking advantage of these characteristics, an extensive medaka inbred panel has been established from a wild population (Fitzgerald et al. 2022). Inspired by some of the plant breeding strategies, we have been using eight parental lines, in 10 different cross combinations, for phenotype mapping. A recent example, which we will use in this manuscript, generated a total of 2219 samples (of which 2177 were used in the final analysis).

A practical problem that we encountered was the optimal imputation of the DNA sequence of each individual. Given that our medaka lines exhibit some residual genetic variability, we reasoned that STITCH would be the ideal tool for this task. We optimized STITCH hyperparameters for this type of multi-parent cross by assessing the imputation accuracy with a sample of high-coverage genomes and we explored the optimal parameters for future cost-effective cross designs. We show that a correctly parameterized STITCH pipeline can effectively impute sample genotypes and show that there is a trade-off between the number of individuals sequenced and the per-sample coverage needed.

2 Methods

2.1 Fish maintenance

All fish stocks were maintained and bred (fish husbandry, permit number 35–9185.64/BH Wittbrodt) in constant recirculating systems at 26°C on a 14 h light/10 h dark cycle. Fish husbandry and experiments were performed in accordance with local animal welfare standards (Tierschutzgesetz §11, Abs. 1, Nr. 1) and with European Union animal welfare guidelines (Bert et al. 2016). The fish facility is under the supervision of the local representative of the animal welfare agency.

2.2 DNA extraction

The genomic DNA of the F1 individuals was extracted from whole brain samples. Brains were digested overnight at 60°C with proteinase K (1 mg/ml) in DNA extraction buffer (0.4 M Tris/HCl pH 8.0, 0.15 M NaCl, 0.1% SDS, 5 mM EDTA pH 8.0). Subsequently, the DNA was purified by phenol–chloroform isoamyl alcohol [phenol:chloroform:isoamyl (25:24:1) pH 8.0] extraction using phase lock gel (PLG) tubes. After centrifugation, the aqueous phase was mixed with chloroform:isoamyl alcohol (24:1) in a new pre-spun PLG tube and centrifuged again. The genomic DNA was precipitated with isopropanol and washed twice with 70% ethanol. The precipitated DNA was resuspended in TE buffer (10 mM Tris pH 8.0, 1 mM EDTA in RNAse-free water) and stored at 4°C. For the genomic DNA extraction of the F2 individuals, hatched or manually dechorionated embryos were snap-frozen and stored in deep well plates at −80°C. The fish were lysed in 40 µl DNA extraction buffer (2M Tris pH 8.0, 5M NaCl, 20% SDS, 0.5M EDTA pH 8.0) overnight at 60°C. After a 1:2 dilution of the samples with nuclease-free water, the samples were used for library preparation.

2.3 Sequencing library preparation

For the high sequencing coverage samples, a ligation-based library preparation was performed using the NEBNext Ultra II DNA Library Prep Kit (New England Biolabs) with NEBNext Multiplex Oligos for Illumina (Unique Dual Index UMI Adaptors DNA Set 1) according to the manufacturer’s instructions without PCR. About 750 ng DNA were used as input. The libraries were selected for a fragment size of 500 bp and quantified with the Qubit High-Sensitivity kit. Further, the quality and the molarity of the library were assessed using the Bioanalyzer with the DNA HS Assay kit following the manufacturer’s protocol allowing equimolar pooling of libraries for sequencing. For the low sequencing coverage samples, a tagmentation-based library was prepared. The sequencing library preparation was done as described previously (Picelli et al. 2014). For the tagmentation reaction, 1.25 µl DNA was mixed with 1.25 µl of dimethylformamide, 1.25 µl tagmentation buffer (40 mM Tris-HCl pH 7.5, 40 mM MgCl2) and 1.25 µl of an in-house generated and purified Tn5 (Hennig et al. 2018) and incubated at 55°C for 3 min. To inactivate the tagmentation reaction the samples were cooled down to 10°C and 1.25µl 0.2% SDS was added followed by an incubation for 5 min at room temperature. The resulting fragments were amplified by PCR (72°C for 3 min, 95°C for 30 s, 12 cycles of 98°C for 20 s, 58°C for 15 s, 72°C for 30 s, and a final extension time of 3 min at 72°C) using 6.75 µl of KAPA 2 X HiFi master mix, 0.75 µl of dimethyl sulfoxide and 2.5 µL of dual indexed primers. PCR products were pooled and selected for an average size of 450 bp using two rounds of magnetic SPRI beads cleanup (0.7X and 0.6X, respectively). Finally, the library was quantified with the Qubit High-Sensitivity kit and the fragment size was assessed using the high-sensitivity assay in the Bioanalyzer.

2.4 Sample exclusions

Samples sequenced to an overall mean depth lower than 0.1× were excluded from the analysis. This reduced the sample set from 2219 to 2177 samples.

2.5 F2 cross design

The population that we designed was founded by eight inbred medaka lines from the Medaka Inbred Kiyosu Karlsruhe panel (MIKK panel) (Fitzgerald et al. 2022). The lines used were 72–2, 55–2, 139–4, 15–1, 62–2, 68–1, 79–2, and 22–1. Those lines were crossed in a set of 10 F2 crosses (72–2 × 79–2: 153 samples, 72–2 × 15–1: 141 samples, 72–2 × 139–4: 153 samples, 139–4 × 72–2: 149 samples, 72–2 × 55–2: 481 samples, 72–2 × 62–2: 142 samples, 72–2 × 68–1: 164 samples, 68–1 × 79–2: 147 samples, 15–1 × 62–2: 158 samples, 55–2 × 139–4: 373 samples, 72–2 × 22–1: 148 samples). The cross 72–2 × 139–4 was performed reciprocally with males from the 72–2 line and females from the 139–4 line, and vice-versa. The other crosses were performed with males from the first line in the cross name, and females from the second line. One male and multiple females were used for founding each cross.

2.6 Sequencing

A total of 2219 medaka samples were sequenced at the Genomics Core facility of the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany (https://www.embl.org/groups/genomics). We performed 150 bp paired-end Illumina (https://www.illumina.com/) short-read sequencing on a NextSeq2000 machine, with a variable number of samples per flow cell. In total, we sequenced our samples on nine P3 Illumina flow cells and two Illumina P2 flow cells. Twelve samples were sequenced at high depth (33× to 61×) to be used as a ground truth for validating the quality of the imputation. Of these, 10 were F1 samples (one per cross, multiplexed on a P3 flow cell) while the remaining two were F2 samples (multiplexed on a P2 flow cell). The 2207 low-coverage samples were sequenced at lower depth (1.4× overall and 9.5× maximum per sample) on eight P3 flowcells (multiplexed with 271, 276, 276, 287, 288, 359, 90, and 90 samples, respectively) and one P2 flow cell (270 samples). The mean per-sample sequencing depth distribution for all the samples is reported in Supplementary Fig. S1.

2.7 Alignment and preprocessing

Both low-coverage and high-coverage (ground truth) samples were pre-processed using the nf-core/sarek (Hanssen et al. 2024) pipeline. Reads were mapped to the Hdr-R medaka reference genome (ENSEMBL ID: ASM223467v1) using bwa-mem2 (Vasimuddin et al. 2019) and deduplicated using GATK MarkDuplicates (Van der Auwera and O’Connor 2020).

2.8 Sequencing depth calculations

All the sequencing depths reported in this work have been calculated on the output of GATK MarkDuplicates using Mosdepth (Pedersen and Quinlan 2017), as part of the nf-core/sarek pipeline. The coverage was obtained from the Mosdepth summary output by summing the total number of bases sequenced for each chromosome and dividing by the sum of chromosome lengths. The mitochondrial chromosome was excluded from this calculation.

2.9 Sequencing depth downsampling

The original mean depth for each sample (doriginal) was calculated as described above. Given a desired downsampled mean depth (dtarget), the downsampling factor (f) to be used was calculated as f=dtarget/doriginal.

Samtools view (Danecek et al. 2021) was then used to downsample the reads with the calculated downsampling factor using the --subsample flag.

Given the variable original mean depth of the low-coverage samples, not all the samples always had an original depth higher than the target depth. In such cases, no downsampling was performed. As a consequence of this, the depth values indicated represent an upper bound on the real mean depth across the cohort.

When assessing the imputation performance on the dataset with original sequencing depth, the ground truth samples were downsampled to a mean depth of 0.5× before being imputed, so as to avoid inflated performance figures driven by the relatively high coverage. When the depth was downsampled, the ground truth samples were also downsampled at the same target depth used for the rest of the cohort. Nonetheless, we observed that the specific coverage of the ground truth samples had a negligible effect on the imputation performance, while the overall coverage of the imputed cohort was much more important.

2.10 Sample size downsampling

The number of samples was downsampled in such a way that the ground truth samples were always retained. Among the remaining samples, the fish to be included were chosen randomly until the sample set (complete of ground truth samples) reached the desired size. In the overall downsampling, samples were chosen randomly without considering F2 cross membership (Fig. 3b).

Figure 3.

Figure 3.

Imputation performance in different scenarios. All plots show the squared Pearson correlation for imputation compared to the truth set for different minor allele frequency (MAF) bins. (a) Varying sequencing depth. (b) Varying the total number of samples jointly imputed. (c) Restricting the imputation to samples derived from a limited number of founder lines and F2 crosses. (d) Imputing a single F2 cross and varying the number of samples within that cross.

2.11 F2 crosses downsampling

In the downsampling performed in Fig. 2c, all the samples belonging to certain F2 crosses were removed from the sample set (including ground truth samples belonging to such crosses). The F2 crosses and founder lines were iteratively downsampled in the following order: (1) start from the full sample set (eight founders, 10 crosses); (2) remove all the crosses involving line 22–1 (seven lines, nine crosses); (3) remove all the crosses involving line 68–1 (six lines, seven crosses); (4) remove all the crosses involving line 79–2 (five lines, six crosses); (5) remove all the crosses involving line 15–1 (four lines, four crosses); (6) remove all the crosses involving line 62–2 (three lines, three crosses); (7) remove all the crosses involving line 139–4 (two lines, one cross).

Figure 2.

Figure 2.

(a) Refinement of the imputed marker set showing the trade-off between the number of markers retained and imputation performance. The numeric labels refer to the refinement iteration. (b) Imputation performance for different hyperparameter combinations. K is the number of ancestral haplotypes and nGen is the number of generations since the founding of the imputed population. In our analysis, K = 16 is optimal and nGen has only a modest effect on the results.

2.12 Single F2 cross downsampling

The single-cross downsampling in Fig. 3d starts from the subset remaining at the end of the F2 cross downsampling in Fig. 3c. This sample set includes only the samples belonging to the cross 72–2 × 55–2. The samples in this cross were then randomly selected while retaining the ground truth samples, until the desired sample size was reached.

2.13 Ground truth genotyping

The 12 ground truth samples were genotyped using the nf-core/sarek pipeline. The reads were aligned to the reference genome using bwa-mem2 and deduplicated using GATK MarkDuplicates. Genotypes were obtained with GATK via joint germline variant calling (Poplin et al. 2018).

2.14 Imputation

Imputation was performed using STITCH (Davies et al. 2016). The following parameters were used: K = 16, nGen = 2, expRate = 2, niterations = 100, shuffleHaplotypeIterations = seq(4, 88, 4), refillIterations = c(6, 10, 14, 18), shuffle_bin_radius = 1000. The K parameter was chosen so as to maximize the squared Pearson correlation with the ground truth on the full medaka genome. The other parameters were set by inspecting diagnostic plots produced by the software when imputing a small genomic region in an initial exploratory phase of the project. These other parameters were determined to have a far smaller impact on imputation accuracy than the K parameter.

When downsampling the number of crosses and founder lines, the K parameter was adjusted proportionally to reflect the fewer number of founder lines present in the population. Consistently with the choice of K = 16 determined to work best in the original dataset that was composed of eight founder lines, K was set to twice the number of founder lines in each downsampling iteration. When downsampling the number of samples in a single cross, K was set to 4 to reflect the presence of two founder lines.

A graphical representation of the ancestral haplotype usage along the genome for the optimal parameter set is reported in Supplementary Fig. S2.

2.15 SNP set definition

The set of SNPs used in the imputation was defined iteratively. We started from a set containing all the biallelic SNPs called by GATK joint calling in the ground truth sample set with a minor allele count greater or equal to 1 (6.2 million SNPs). To obtain this we used bcftools view (Danecek et al. 2021) and applied in series the following filtering criteria: -i ‘CHROM!=“MT”’, -i ‘TYPE==“snp”’, -i ‘N_ALT >= 1’, -i ‘MAC >= 1’. We ran the imputation jointly on the low-coverage and downsampled ground truth samples (0.5× mean depth) and measured the SNP-wise squared Pearson correlation against the ground truth genotypes. We retained SNPs only above a certain threshold of squared correlation and then repeated the imputation on such SNP set. We iteratively refined the SNP set in this way for a total of five iterations, using filter values of 0.5, 0.5, 0.75, 0.9, and 0.9. The final SNP set consisted of 3.2 million SNPs. This filtering procedure follows the approach suggested in the original STITCH publication (Davies et al. 2016).

2.16 Imputation quality calculations

The quality of the imputation was calculated in terms of squared Pearson correlation (r2) between the imputed genotypes and the genotypes obtained via joint germline GATK calling on the ground truth samples. GLIMPSE2 Concordance (Rubinacci et al. 2021) was used for this task. The performance was calculated jointly for SNPs in minor allele frequency (MAF) bins of size 0.05. The plotted MAF of each bin represents the average observed MAF of the SNPs that it includes. The performance was also calculated jointly for all the SNPs of a certain ground truth sample. This per-sample performance was averaged across samples to obtain single performance figures per imputation run like the one used in Fig. 3.

2.17 Relative cost calculations

The relative cost estimates reported in Fig. 3 represent the fraction of the original cost that would have approximately been spent for a certain downsampling scenario. These were calculated from the sequencing and library preparation cost at the Genomics Core facility of the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. The following formula was used to compute the cost: ((n*d*g)/f) * cf+n * cp.

In the above formula, n is the number of samples, d is the mean sequencing depth, g is the haploid genome size, f is the mean output per sequencing flow cell, and cf and cp are the unitary costs for sequencing (per flow cell) and library preparation (per sample).

The mean output per flow cell was estimated from the mean observed output in our experiment for P3 Illumina flow cells (f=273542732637 bp after deduplication). The medaka haploid genome size (g=734040372 bp) was obtained by summing the chromosome lengths of the Hdr-R reference genome, mitochondrion excluded. The unitary costs had a ratio cf/cp=683.89.

From this, the relative cost was computed by dividing the result by the cost estimated using the number of samples and the mean depth of the original sample set.

2.18 Additional software used

Plots were produced using ggplot2 (Wickham 2016) and cowplot (Wilke 2020). JupyterLab was used for interactive explorations (Kluyver et al. 2016). The R programming language was used throughout the pipeline (R Core Team 2023).

3 Results

We first developed a framework for the optimization and assessment of imputation results from F2 crosses. We performed high-coverage sequencing on one F1 individual per cross (10 overall) and two F2 individuals (from one cross only). We reasoned that we could use F1 individuals as high-coverage “truth sets” for the imputation assessment to maximize the probability of observing all the haplotypes from a given cross in the smallest possible number of validation samples. We show that this choice does not affect model selection and does not inflate performance in Supplementary Fig. S3. The F1 individuals are also useful to determine the set of truly polymorphic markers in the cross. Although we have sequenced individuals from the F0 founder lines (Fitzgerald et al. 2022) both previous work (Jaegle et al. 2023) and our own investigations suggested that there will be apparent deviations from the expected inheritance patterns (e.g. in cryptic segmental duplications and in genomic regions that are not fully inbred), and so we preferred to develop a system where no assumptions about the founders’ haplotypes are made. Before using them for imputation, we downsampled the F1 and high-coverage F2 samples to obtain approximately equivalent coverage to the low-depth sequencing data across all F2s (Fig. 1). Thus, we had 12 samples imputed from low-coverage data for which we have also a high-quality, high-confidence genotype call. Using this truth set we explored STITCH parameters.

Figure 1.

Figure 1.

(a) Experimental design. Ten crosses of eight inbred MIKK panel lines were performed. Among the resulting F1 progeny, 10 samples were sequenced at high coverage (see Section 2). The F2 cohort was mostly sequenced at low coverage (2207 samples), but two F2 samples were sequenced at high coverage. WGS: whole-genome sequencing. (b) Schematic illustration of the analysis performed. High-depth samples were aligned to the medaka reference genome and jointly genotyped. Mapped reads were downsampled to the desired depth (see main text). Low-coverage samples were also aligned to the medaka reference genome. The low-coverage mapped reads and the mapped downsampled reads for the high-coverage samples were then jointly used for imputation with STITCH. The imputed genotypes for the high-coverage samples were compared to the ground truth genotypes that had been obtained with GATK joint calling for evaluating imputation performance (using the Pearson r2 as a metric).

In our hands, the two key parameters for a successful imputation were the number of ancestral haplotypes (K parameter) and the selection of valid SNPs to impute. To select the imputable SNPs, we started with the SNPs that are polymorphic in the 10 F1 individuals (one F1 per cross) and the two high-coverage F2s. We then refined this set by running the imputation including the downsampled high-coverage datasets, computing the correlation of the imputed SNPs to the high-coverage calls, and rejecting SNPs with a low correlation (Fig. 2a). We repeated this process for five iterations, progressively increasing the stringency of the correlation threshold (see Section 2.15). We show that this procedure substantially improves imputation performance and does not result in a biased dropout of low-frequency SNPs or SNPs present only in one founder line in Supplementary Figs S4–S6. For the selection of the best setting for K, we tested a range of values and selected the one that maximized the mean squared Pearson correlation across the 12 truth set samples (Fig. 2b). As both K and the final SNP set parameters are linked, our procedure is to first determine a plausible value for K using the unfiltered SNP set, then doing the SNP filtering (using the value of K determined in the initial assessment), and finally re-assessing the choice of K on the filtered SNP set.

Overall, our F2-STITCH pipeline proved to be highly successful in imputing the genotypes of our medaka cohort. We observed a mean per-sample squared Pearson correlation coefficient (r2) of 0.996 between the imputed and ground truth genotypes. When binning the SNPs according to their MAF, the squared Pearson correlation was as expected higher for bins containing more common SNPs (most likely due to being polymorphic in more than one cross) and decaying for rarer SNPs. Nonetheless, even the bin containing the least common SNPs (MAF from 0 to 0.05) still showed an r2 of 0.977.

3.1 Parameters that influence accurate imputation

Using the pipeline, we then explored downsampling our overall F2 dataset in a variety of dimensions: in the sequencing depth, in the number of individuals overall, in the number of crosses, and in the number of individuals in a single cross. This explores much of the design space for practical cross strategies.

As expected, we observed less accurate imputation for lower sequencing depth and lower sample numbers (Fig. 3a and b), though there is a non-linear response in both cases. For example, we observed a large difference in imputation performance between a coverage of 0.5× and 0.25×, but a comparatively smaller difference when the coverage was changed from 1× to 0.5×. We observed a similar trend when reducing the total number of samples while retaining the original sequencing depth (Fig. 3b). In this case, the performance started to drop significantly for cohorts that were smaller than 1000 samples.

On the contrary, reducing the sample numbers by reducing the number of F2 crosses involved (and hence the number of distinct haplotypes to be reconstructed; we changed K proportionally, see Section 2.14) did not significantly affect the quality of the imputation (Fig. 3c). When the dataset was reduced to a single F2 cross (and so two F0 founder lines) for a total of 474 samples, the r2 value was observed to be above 0.964 for all the MAF bins except the one with the least common SNPs (which had a much lower r2 of 0.694). This compares to an r2 range of 0.777–0.915 when using 500 samples randomly chosen from the whole dataset (contrast Fig. 3b, “500 random samples” to Fig. 3c, “two founder lines, one cross”, with 474 samples).

Finally, when we explored reducing the number of samples within a single F2 cross (Fig. 3d), we observed that when the sample number fell below 200, the r2 sharply decreased. However, it should be noted that in a population similar to the one presented in this study, different F2 crosses share some common founder lines. As such, STITCH would be able to more accurately impute the genotypes compared to this simplified single-cross scenario by borrowing sequencing coverage of the founders from the other F2 crosses that have common parents.

3.2 Cost–benefit analysis

To determine what would be the ideal sequencing depth to aim for when designing similar experiments, we performed a combined downsampling of mean sequencing depth and sample numbers. We observed that for the original cohort size (2177 samples) a maximum sequencing depth of 0.5× was sufficient for obtaining an imputation result that is only marginally worse than the original (Pearson r2 0.981 versus 0.996) while almost halving the financial burden (53.7% of the original).

For lower sample numbers, the required minimum sequencing depth proportionally increased (Fig. 4). For 1500 and 1000 samples, a depth of 1× was needed to avoid strong degradation of the imputation quality. For sample sizes lower than 500 any sequencing depth downsampling led to much worse imputation results. Also, we noticed that the Pearson r2 reached a plateau at different levels for different cohort sizes, with larger cohorts reaching the plateau earlier and ending at higher Pearson r2 values, though the difference between the best achievable correlations is marginal at high sample numbers.

Figure 4.

Figure 4.

Relative cost of library preparation and sequencing versus imputation accuracy for different combinations of mean sequencing depth and number of samples. Estimated costs are relative to the cost sustained for obtaining the original dataset (see Section 2). Depending on the number of samples, the imputation accuracy reaches a plateau at a certain level of mean sequencing depth, where increasing the depth further does not bring appreciable increases in imputation accuracy.

4 Discussion

In this work, we have systematically explored the imputation accuracy from low-coverage sequencing for the commonplace F2 cross designs used in agriculture, model organisms, and plant breeding. We also developed an economic model for the sequencing cost, on the assumption that the sample numbers would be fixed by other constraints (e.g. husbandry capacity). There is a relationship between the optimal sequencing depth and the number of samples for accurate imputation, as expected by the coverage of the underlying ancestral haplotypes. In practice, for sample sizes of 1000 aiming for a coverage of 1× at least for this cross structure provided nearly equivalent accuracy compared to our full set. In contrast for 2000 samples a similar level of accuracy would be achieved with 0.5×. An alternative way of viewing this is that for sample sizes of 2000, there is little to be gained by raising the sequencing depth from 0.5× to 1×.

However, it is important to note that the optimal coverage and sample size cannot be determined a priori for an arbitrary population. Our results shall serve only as a guideline for similar experimental designs, as the genetic diversity and homozygosity level of the founders will have an impact on the optimal experimental design. In general, more diverse populations will require higher levels of coverage and/or larger sample sizes (see Supplementary Notes). Nonetheless, our pipeline can be used for exploring context-specific experimental and software parameters in such scenarios.

An important feature in both the practical F2 imputation and our assessment was the sequencing of F1 individuals at high coverage. In our experience, this was critical to select an initial set of SNPs which were polymorphic in the crosses and to assess imputation accuracy for different parameter settings. In practice, this means it is important to track and store F1 individuals during the breeding. As not all inbred lines are fully inbred, it is also advantageous to set up F2 crosses between single F0 individuals and track the relationship between F2 individuals and their specific F0 parents wherever possible. A benefit of STITCH is that it does not require a reference panel and will impute variants which are present in the F2 regardless of whether they were fixed in the F0 or not. While the info score produced by STITCH could be used for performance assessment in place of an external validation set to reduce cost, we suggest using an external validation when possible (see Supplementary Notes).

Another point to note is that we used a different DNA extraction and sequencing library preparation strategy for the high sequencing coverage and low sequencing coverage samples (ligation-based and PCR-free in the first case, tagmentation-based in the second). Tagmentation-based libraries are cost-effective for large numbers of samples, while ligation-based PCR-free workflows are free from transposase binding site bias and PCR artefacts, and so are more suitable for obtaining a high-quality validation set. While this differential processing could introduce systematic differences in sensitivity and precision in detecting variants (Ribarska et al. 2022), we think it does not substantially influence the parameter selection and coverage/sample size assessment performed in this study because: (1) a predetermined set of variants is used for imputation, and so false positives in the low-coverage samples have no impact on the imputation process; (2) the high-coverage samples are only used to validate the imputation process, and are not directly compared to the low-coverage samples; (3) the relative imputation accuracy between two imputation runs depends on the respective ancestral haplotype reconstructions, which are the same for both sets of samples. However, it must be noted that the r2 values reported in this work refer to the high-coverage samples only and may not be fully representative of the imputation performance in samples prepared with a different methodology.

A consequence of imputing only the highly confident polymorphic SNPs is that many other variants will not be imputed, including copy number variants, structural variants, insertions, deletions and hard-to-call SNPs. This does not invalidate the accurate imputation of the markers we have, but for downstream applications such as association studies, analysts should remember that the causal variant is not necessarily among the imputed set of markers. In our hands, a successful exploration technique is to group samples by the associated marker of interest and treat the grouped samples as a high-coverage meta-sample to explore other forms of genetic variation linked around a selected marker.

We have made this STITCH-based imputation scheme into a Nextflow (Di Tommaso et al. 2017) pipeline inspired by the nf-core template (Ewels et al. 2020) and available at https://github.com/birneylab/stitchimpute. This pipeline can take as inputs high-coverage and low-coverage samples, with a specification of any sequencing depth downsampling to be performed (e.g. for later assessment) and will automatically iterate the imputation and performance assessment for building a final SNP set. The pipeline also allows the exploration of the STITCH parameters K (number of haplotypes) and nGen (number of generations) in a systematic manner. It can also be used in a more basic way to run a single imputation with a pre-defined SNP and parameter set. We have successfully used this pipeline in a number of F2 crosses with different operators and hope it will be useful to a broader set of research groups. We are currently processing a number of phenotypes on this and other complex cross designs and are aiming for a comprehensive association study using these imputed genotypes.

Although we have explored parameters and downsampling for this particular cross structure in this specific species (medaka fish), different cross structures with different divergence levels of parents and different recombination rates will mean that some level of parameter adjustment will be needed for each system. Our recommendations are: (a) to ensure that there are high-coverage F1 individuals, (b) to carefully consider the sequencing depth compared to the number of samples in the F2, and (c) to consider the number of parents and the number of founding haplotypes.

One aspect that we did not explore in detail in this work is the definition of the initial set of SNPs used in the imputation and filtering process. A possible extension that could be implemented in a future version of our pipeline is the use of a variant caller designed for (ultra) low-coverage sequencing data such as BaseVar (Liu et al. 2023) for defining this initial variant set without relying on the high-coverage samples.

Supplementary Material

vbae107_Supplementary_Data

Acknowledgements

We thank all the Birney and the Wittbrodt labs members for their support and constructive feedback on the work and the manuscript. We wish to acknowledge Amos Pierotti for providing the medaka fish illustrations used in Fig. 1. We thank Vladimir Benes, Ferris Jung, and all members of the EMBL Genomics Core Facility (GeneCore) for their support in the library preparation and the whole-genome sequencing.

Contributor Information

Saul Pierotti, European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, Cambridge CB101SD, United Kingdom.

Bettina Welz, Centre for Organismal Studies (COS), Heidelberg University, Heidelberg 69120, Germany.

Mireia Osuna-López, Genomics Core Facility, European Molecular Biology Laboratory (EMBL), Heidelberg 69117, Germany.

Tomas Fitzgerald, European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, Cambridge CB101SD, United Kingdom.

Joachim Wittbrodt, Centre for Organismal Studies (COS), Heidelberg University, Heidelberg 69120, Germany.

Ewan Birney, European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, Cambridge CB101SD, United Kingdom.

Supplementary data

Supplementary data are available at Bioinformatics Advances online.

Conflict of interest

None declared.

Funding

This research was supported by the European Research Council Synergy Grant IndiGene [grant number 810172].

Data availability

The sequencing data underlying this article will be shared on reasonable request to the corresponding author.

References

  1. Aida T. On the inheritance of color in a fresh-water fish, aplocheilus latipes temmick and schlegel, with special reference to sex-linked inheritance. Genetics 1921;6:554–73. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Altshuler DM, Gibbs RA, Peltonen L, International HapMap 3 Consortium et al. Integrating common and rare genetic variation in diverse human populations. Nature 2010;467:52–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bert B, Chmielewska J, Bergmann S. et al. Considerations for a European animal welfare standard to evaluate adverse phenotypes in teleost fish. EMBO J 2016;35:1151–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Bhattarai G, Shi A, Feng C. et al. Genome wide association studies in multiple spinach breeding populations refine downy mildew race 13 resistance genes. Front Plant Sci 2020;11:563187. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Blain SA, Justen HC, Easton W. et al. Reduced hybrid survival in a migratory divide between songbirds. Ecol Lett 2024;27:e14420. [DOI] [PubMed] [Google Scholar]
  6. Broman KW, Gatti DM, Simecek P. et al. R/qtl2: software for mapping quantitative trait loci with high-dimensional data and multiparent populations. Genetics 2019;211:495–502. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Browning BL, Zhou Y, Browning SR. et al. A one-penny imputed genome from next-generation reference panels. Am J Hum Genet 2018;103:338–48. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Danecek P, Bonfield JK, Liddle J. et al. Twelve years of SAMtools and BCFtools. Gigascience 2021;10:giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Davies RW, Kucka M, Su D. et al. Rapid genotype imputation from sequence with reference panels. Nat Genet 2021;53:1104–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Davies RW, Flint J, Myers S. et al. Rapid genotype imputation from sequence without reference panels. Nat Genet 2016;48:965–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Delaneau O, Zagury J-F, Robinson MR. et al. Accurate, scalable and integrative haplotype estimation. Nat Commun 2019;10:5436. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Di Tommaso P, , ChatzouM, , Floden EW. et al. Nextflow enables reproducible computational workflows. Nat Biotechnol 2017;35:316–9. [DOI] [PubMed] [Google Scholar]
  13. Ewels PA, , PeltzerA, , Fillinger S. et al. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol 2020;38:276–8. [DOI] [PubMed] [Google Scholar]
  14. Fitzgerald T, Brettell I, Leger A. et al. The Medaka Inbred Kiyosu-Karlsruhe (MIKK) panel. Genome Biol 2022;23:59. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Fuchsberger C, Abecasis GR, Hinds DA. et al. minimac2: faster genotype imputation. Bioinformatics 2015;31:782–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Hanssen F, , GarciaMU, , Folkersen L. et al. Scalable and efficient DNA sequencing analysis on different compute infrastructures aiding variant discovery. NAR Genom Bioinform 2024;6:lqae031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Hennig BP, Velten L, Racke I. et al. Large-scale low-cost NGS library preparation using a robust Tn5 purification and tagmentation protocol. G3 (Bethesda) 2018;8:79–89. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Howie BN, Donnelly P, Marchini J. et al. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 2009;5:e1000529. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Jaegle B, Pisupati R, Soto-Jiménez LM. et al. Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity. Genome Biol 2023;24:44. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Kasahara M, Naruse K, Sasaki S. et al. The medaka draft genome and insights into vertebrate genome evolution. Nature 2007;447:714–9. [DOI] [PubMed] [Google Scholar]
  21. Kluyver T, Ragan-Kelley B, Perez F. et al. Jupyter notebooks—a publishing format for reproducible computational workflows. In: Loizides F, Scmidt B (eds.), Positioning and Power in Academic Publishing: Players, Agents and Agendas. Netherlands: IOS Press, 2016, 87–90. [Google Scholar]
  22. Lander ES. The new genomics: global views of biology. Science 1996;274:536–9. [DOI] [PubMed] [Google Scholar]
  23. Li N, Stephens M.. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 2003;165:2213–33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Li W, Li W, Song Z. et al. Marker density and models to improve the accuracy of genomic selection for growth and slaughter traits in meat rabbits. Genes (Basel) 2024;15:454. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Li Y, Liu X, Bai X. et al. Genetic parameters estimation and genome-wide association studies for internal organ traits in an F2 chicken population. J Anim Breed Genet 2022;139:434–46. [DOI] [PubMed] [Google Scholar]
  26. Liu S, , HuangS, , Liu Y. et al. Utilizing non-invasive prenatal test sequencing data resource for human genetic investigation. bioRχiv 2023. 10.1101/2023.12.11.570976. preprint: not peer reviewed. [DOI] [Google Scholar]
  27. McCarthy S, Das S, Kretzschmar W, Haplotype Reference Consortium et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet 2016;48:1279–83. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Mendel G. Versuche über plflanzenhybriden. Verhandlungen des naturforschenden Vereines in Brünn. Bd. IV für das Jahr 1866;1865:3–47. [Google Scholar]
  29. Morgan TH. Sex limited inheritance in drosophila. Science 1910;32:120–2. [DOI] [PubMed] [Google Scholar]
  30. Nicod J, Davies RW, Cai N. et al. Genome-wide association of multiple complex traits in outbred mice by ultra-low-coverage sequencing. Nat Genet 2016;48:912–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Pedersen BS, Quinlan AR.. Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics 2017;34:867–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Picelli S, Faridani OR, Björklund AK. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat Protoc 2014;9:171–81. [DOI] [PubMed] [Google Scholar]
  33. Poplin R, , Ruano-RubioV, , De Pristo MA. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRχiv 2018. 10.1101/201178. preprint: not peer reviewed. [DOI] [Google Scholar]
  34. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2023. [Google Scholar]
  35. Ribarska T, , BjørnstadPM, , Sundaram AYM. et al. Optimization of enzymatic fragmentation is crucial to maximize genome coverage: A comparison of library preparation methods for illumina sequencing. BMC Genomics 2022;23:92. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Rubinacci S, Ribeiro DM, Hofmeister RJ. et al. Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat Genet 2021;53:120–6. [DOI] [PubMed] [Google Scholar]
  37. Scott MF, Fradgley N, Bentley AR. et al. Limited haplotype diversity underlies polygenic trait architecture across 70 0.167emyears of wheat breeding. Genome Biol 2021;22:137. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. The 1000 Genomes Project Consortium; Auton A, Brooks LD, Durbin RM, et al. A global reference for human genetic variation. Nature 2015;526:68–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Van der Auwera GA, O’Connor BD.. Genomics in the Cloud. Sebastopol, CA, USA: O’Reilly Media, Inc., 2020. [Google Scholar]
  40. Vasimuddin M, Misra S, Li H, Aluru S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil. 2019, 314–24. doi: 10.1109/IPDPS.2019.00041. [DOI]
  41. Wang D, Xie K, Wang Y. et al. Cost-effectively dissecting the genetic architecture of complex wool traits in rabbits by low-coverage sequencing. Genet Sel Evol 2022;54:75. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Wickham H. 2016. ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag. [Google Scholar]
  43. Wilke CO. 2020. cowplot: Streamlined Plot Theme and Plot Annotations for ‘ggplot2’. R package version 1.1.1. https://CRAN.R-project.org/package=cowplot.
  44. Wittbrodt J, Shima A, Schartl M. et al. Medaka–a model organism from the far east. Nat Rev Genet 2002;3:53–64. [DOI] [PubMed] [Google Scholar]
  45. Yao EJ, Babbs RK, Kelliher JC. et al. Systems genetic analysis of binge-like eating in a C57BL/6J × DBA/2J-F2 cross. Genes Brain Behav 2021;20:e12751. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Zan Y, Payen T, Lillie M. et al. Genotyping by low-coverage whole-genome sequencing in intercross pedigrees from outbred founders: a cost-efficient approach. Genet Sel Evol 2019;51:44. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Zha C, Liu K, Wu J. et al. Combining genome-wide association study based on low-coverage whole genome sequencing and transcriptome analysis to reveal the key candidate genes affecting meat color in pigs. Anim Genet 2023;54:295–306. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

vbae107_Supplementary_Data

Data Availability Statement

The sequencing data underlying this article will be shared on reasonable request to the corresponding author.


Articles from Bioinformatics Advances are provided here courtesy of Oxford University Press

RESOURCES