Fast and accurate imputation of genotypes from noisy low-coverage sequencing data in bi-parental populations

Cécile Triay; Alice Boizet; Christopher Fragoso; Anestis Gkanogiannis; Jean-François Rami; Mathias Lorieux

doi:10.1371/journal.pone.0314759

. 2025 Jan 30;20(1):e0314759. doi: 10.1371/journal.pone.0314759

Fast and accurate imputation of genotypes from noisy low-coverage sequencing data in bi-parental populations

Cécile Triay ^1,^#, Alice Boizet ^2,^#, Christopher Fragoso ^3,⁴, Anestis Gkanogiannis ⁵, Jean-François Rami ², Mathias Lorieux ^1,^5,^*

Editor: Andrea Tangherloni⁶

PMCID: PMC11781708 PMID: 39883620

Abstract

Motivation

Genotyping of bi-parental populations can be performed with low-coverage next-generation sequencing (LC-NGS). This allows the creation of highly saturated genetic maps at reasonable cost, precisely localized recombination breakpoints (i.e., the crossovers), and minimized mapping intervals for quantitative-trait locus analysis. The main issues with these low-coverage genotyping methods are (1) poor performance at heterozygous loci, (2) high percentage of missing data, (3) local errors due to erroneous mapping of sequencing reads and reference genome mistakes, and (4) global, technical errors inherent to NGS itself. Recent methods like Tassel-FSFHap or LB-Impute are excellent at addressing issues 1 and 2, but nonetheless perform poorly when issues 3 and 4 are persistent in a dataset (i.e., “noisy” data). Here, we present a new algorithm for imputation of LC-NGS data that eliminates the need of complex pre-filtering of noisy data, accurately types heterozygous chromosomal regions, precisely estimates crossover positions, corrects erroneous data, and imputes missing data. The imputation of genotypes and recombination breakpoints is based on maximum-likelihood estimation. We compare its performance with Tassel-FSFHap and LB-Impute using simulated data and two real datasets. NOISYmputer is consistently more efficient than the two other software tested and reaches average breakpoint precision of 99.9% and average recall of 99.6% on illumina simulated dataset. NOISYmputer consistently provides precise map size estimations when applied to real datasets while alternative tools may exhibit errors ranging from 3 to 1845 times the real size of the chromosomes in centimorgans. Furthermore, the algorithm is not only highly effective in terms of precision and recall but is also particularly economical in its use of RAM and computation time, being much faster than Hidden Markov Model methods.

Availability

NOISYmputer and its source code are available as a multiplatform (Linux, macOS, Windows) Java executable at the URL https://gitlab.cirad.fr/noisymputer/noisymputerstandalone/-/tree/1.0.0-RELEASE?reftype=tags.

Introduction

In genetic studies, bi-parental genetic populations can be created from inbred parental lines using various crossing systems, e.g., F₂ intercross issued from F₁ self-pollination (F₂) and recombinant inbred lines by single seed descent (SSD). These populations are used to create recombination maps and, if phenotypes are available, to find gene or quantitative-trait locus (QTL) genomic positions.

To do so, each individual of the population under study has to be characterized for its genomic content—or “genotyped” at many loci. This can be done using different molecular biology techniques, including various types of molecular markers. The gold standard for genetic variant discovery is obtained by different next-generation sequencing (NGS) techniques like restriction site-associated DNA sequencing (RADseq) [1], genotyping by sequencing (GBS) [2], and whole-genome sequencing (WGS) [3]. These techniques provide very large numbers of markers and therefore facilitate the construction of highly saturated genetic maps. This provides accurate locations of recombination breakpoints in each individual, which is important for a number of applications, e.g., studies of local recombination rate, genetic maps comparison, or QTL detection. Though NGS is less and less expensive to implement, sequencing a large number of samples can still be costly, and is commonly applied via reduced representation (RRS-NGS) or low-coverage (LC-NGS) strategies to reduce genotyping costs.

Reducing sequencing costs through minimized per-sample coverage has an important experimental downside: LC-NGS mechanically introduces a series of issues, the main ones being:

Issue 1: Low power to detect heterozygosity under low coverage: For example, if only one sequencing read is generated at a locus, only one of the two alleles is revealed. As each additional read has a 0.5 probability of detecting the second allele, even 3 reads have only 0.75 probability of detecting a heterozygous call. Spread over thousands of sites, extensive inaccuracy in heterozygous regions becomes highly problematic.
Issue 2: Extensive genotype missingness: The sparse distribution of reads at low coverage (3X coverage, for example, only implies an average of 3 reads per site) results in a complete lack of reads at some variant loci. Even in plants, which contain more genetic variation than humans, there are 6–22 SNPs per 1 Kb, resulting in abundant opportunity for non-reference variant missingness under low coverage [4].

And these issues are further compounded by those common to all types of NGS:
Issue 3: Errors due to erroneous mapping of sequencing reads: NGS technologies are based on short reads (e.g., 150 base pair, paired-end Illumina technology). Due to the combinatorial limitation of the sequence contained in short reads, multiple mapping locations may be identified, especially in plant genomes which exhibit much more repetitive content than human genomes. Additionally, in plants, such as rice, structural variation specific to subpopulations may be completely missing in single reference genomes. These assembly errors, omissions, and challenges posed by repetitious regions are sources of erroneous variants. Moreover, outright assembly errors may cause consistent, yet locally encountered genotyping errors.
Issue 4: Technical errors inherent to NGS methodology: Sequencing errors may be globally introduced at a variety of stages in the NGS pipeline, from errors incurred in PCR-dependent library construction to NGS sequencing itself. The initial GBS protocol is known to generate libraries contaminated by chimeric inserts [5]. Although rare, these errors may become problematic at low coverage, as additional reads refuting an erroneous call may not be available at a given locus.

Common imputation algorithms implemented in computer programs like Beagle [6, 7] or Impute2 [8], although very accurate in diversity panels, are not well adapted to the bi-parental context since they rely on large databases to infer haplotypes. Efficient methods have been recently developed to impute genotypic data derived from LC-NGS assays in bi-parental populations. For instance, Tassel-FSFHap (thereafter simply FSFHap) [9] and LB-Impute [10] can all address issues 1 and 2 accurately. Yet, these methods can produce inaccurate results when the errors mentioned in issues 3 and 4—thereafter called “noisy data”—are too frequent. Thus, these methods might require additional bioinformatic steps to filter out low-quality markers before and after imputation. Even then, troublesome markers might not be detected easily and could alter dramatically the quality of the imputation and the final genetic map.

In this work, we present NOISYmputer, a maximum likelihood estimation algorithm for imputation of LC-NGS data that eliminates the need of complex pre-filtering of noisy data, accurately finds heterozygous chromosomal regions, corrects erroneous data, imputes missing data and precisely locates the recombination breakpoints (i.e., the meiotic crossovers). We test its accuracy using simulated data and we compare its performance with FSFHap, LB-Impute using three datasets: (1) a rice F₂ population sequenced by WGS, (2) a maize F₂ population sequenced by GBS and (3) 84 simulated F₂ populations with controlled depth, error rate and marker density. The algorithm is implemented in NOISYmputer, a multiplatform Java command line program (see “Availability” section).

Design and implementation

Imputation method

In this section we describe the main imputation algorithm, which is applied separately to each chromosome. The imputation can be preceded or followed by different filtering options in NOISYmputer (details in next section) that can be applied to reduce or eliminate the noise in the data (Fig 1).

Fig 1 — It is composed of three major phases: pre-imputation, imputation and post-imputation. Some steps are optional (dashed borders) while others are required for the algorithm to complete.

By imputation, we mean here guessing, confirming or correcting the genotype at a SNP site in a sample. LC-NGS generates poor information in heterozygous regions (see explanation on the confounding effect in SNPs with one or few reads—issue 1 of the Introduction section). Conversely, homozygous regions are much less prone to these confounding effects. Yet, missing data (issue 2), noisiness (issue 3) and sequencing errors (issue 4) can lower the power to identify homozygous diplotypes (i.e., the combination of two gametic haplotypes). The general idea of the algorithm is, as in Hidden Markov Model (HMM), to use information of various SNPs around the imputed SNP, leaving the regions surrounding the recombination breakpoints unimputed lying between the two diplotypes. The locations of the recombination breakpoints are then inferred. Furthermore, instead of modeling error rates, we take an iterative approach to estimate them (Fig 1).

Imputation—Step 1: Genotype calling

Let’s consider a chromosome of an F₂ individual with one single recombination breakpoint that separates a homozygous diplotype (AA; BB) from a heterozygous diplotype (AB, or BA, equivalent thereafter). Let’s also consider a set of SNPs evenly dispersed on the physical genome, say, every 500 base pairs (bp). In the AA diplotype, and far from the breakpoint location, all SNPs should be genotyped as AA, except from the different kinds of errors cited above. To determine the genotype of a particular SNP, and due to these errors, one must consider not only its score in the VCF, but also its immediate “environment”, that is, the SNPs that are located just before and just after it along the chromosome. Those surrounding SNPs help identify a potential error in the SNP scoring. Different approaches can be taken to look at the SNP environment. In segregating populations, the vast majority of the genome is exempt from crossing overs. Indeed, when implementing a sliding window method like described hereby, the expected proportion of the genome with no recombination in the window is $P_{n o X O} ≃ 1 - (\frac{1}{100 N}) D (8 m - 2)$ , where m is the number of SNPs in the sliding window, N is the total number of SNPs, and D is the expected genome size in centimorgans (cM) (S2 File). Hence, in almost the entire genome except the breakpoint regions there are only two or three possible diplotypes, depending on the population type. Thus, instead of calculating all the likelihoods of possible paths (like in Hidden Markov Model methods), the problem is reduced to calculate the likelihoods of the data for the three possible diplotypes. Furthermore, there is no need to include transition (i.e., recombination) probabilities. The main advantage of this approach is its computation time, which increases linearly according to the diplotype size, in O(T), since the log-likelihood of a diplotype is simply the sum of individual values for each site, while the time complexity is O(T × S²) for the Viterbi algorithm applied to resolve fully connected Hidden Markov Model processes, with T being the length of the sequence of observations and S being the number of hidden states. We now describe the algorithm with the example of an F₂ population.

In practice, one defines starting values for error rates for reads A (e_A) and B (e_B), being respectively the probabilities of observing a B read (O_B) whereas the genotype is truly AA and observing an A read (O_A) whereas the genotype is truly a BB

e_{A} = p (O_{B} | A A) e_{B} = p (O_{A} | B B)

We allow different error rates for A and B reads since the A and B parents are generally not equally (i.e., genetically) distant from the reference genome. For example, one could set e_A = 0.005 and e_B = 0.003 if Parent B is closer genetically to the Reference genome than Parent A is. Those values will be automatically refined after one or several rounds of imputation.

Thus, at homozygous sites, the probability of observing an A read if the true genotype is AA is

p (O_{A} | A A) = 1 - e_{A}

and the probability of observing a B read if the true genotype is BB is

p (O_{B} | B B) = 1 - e_{B}

At heterozygous (AB) sites, and assuming that the A and B reads have the same chance to occur, the probabilities of observing A and B reads are

p (O_{A} | A B) = \frac{1}{2} p (O_{A} | A A) + \frac{1}{2} p (O_{A} | B B) = \frac{1}{2} (1 - e_{A}) + \frac{1}{2} e_{B}

p (O_{B} | A B) = \frac{1}{2} p (O_{B} | B B) + \frac{1}{2} p (O_{B} | A A) = \frac{1}{2} (1 - e_{B}) + \frac{1}{2} e_{A}

Let’s consider a chromosome with n SNPs. For each site SNP_j of the chromosome, we define a symmetrical window (W_j) containing the SNP_j at its center, m SNPs before it in the sequence and m SNPs after it (with read count >0). SNPs that are located in chromosome ends are omitted, since it is not possible to define symmetrical windows around them. This case is discussed later on.

For each site SNP_i of the W_j window three situations are possible: i) the genotype G_i of the SNP_i is AA (homozygous for parent A allele), ii) the genotype G_i is BB (homozygous for parent B allele) or iii) the genotype G_i is AB (heterozygous).

We then estimate the likelihood of observing a given combination of A reads (n_Ai) and B reads (n_Bi) at site SNP_i, given that the total number of reads (n_i) at this site is equal to n_Ai + n_Bi. To do so, we use the binomial distribution, with sample size at SNP_i equal to n_i, the number of successes equal to n_Ai, and thus of fails equal to n_Bi. Knowing already the probability of observing A reads under the three possible genotypes (AA, BB and AB) we obtain the following:

\begin{matrix} P [n_{A i} | p (O_{A} | A A)] & = (\binom{n_{i}}{n_{A i}}) p {(O_{A} | A A)}^{n_{A i}} {(1 - p (O_{A} | A A))}^{n_{i} - n_{A i}} \\ = (\binom{n_{i}}{n_{A i}}) p {(O_{A} | A A)}^{n_{A i}} p {(O_{B} | A A)}^{n_{B i}} \end{matrix}

\begin{matrix} P [n_{A i} | p (O_{A} | B B)] & = (\binom{n_{i}}{n_{A i}}) p {(O_{A} | B B)}^{n_{A i}} {(1 - p (O_{A} | B B))}^{n_{i} - n_{A i}} \\ = (\binom{n_{i}}{n_{A i}}) p {(O_{A} | B B)}^{n_{A i}} p {(O_{B} | B B)}^{n_{B i}} \end{matrix}

\begin{matrix} P [n_{A i} | p (O_{A} | A B)] & = (\binom{n_{i}}{n_{A i}}) p {(O_{A} | A B)}^{n_{A i}} {(1 - p (O_{A} | A B))}^{n_{i} - n_{A i}} \\ = (\binom{n_{i}}{n_{A i}}) p {(O_{A} | A B)}^{n_{A i}} p {(O_{B} | A B)}^{n_{B i}} \end{matrix}

Since the binomial factor is the same for the three possible genotypes, it can be omitted in the calculations. Then, individual relative probabilities that the genotype G_i of the SNP_i is AA, BB or AB are defined as:

p (G_{i} = X) = \frac{P [n_{A i} | p (O_{A} | X)]}{\sum_{X} P [n_{A i} | p (O_{A} | X)]}, with X = A A, B B, A B

The probabilities for the window’s diplotype around the SNP_j to be AA, BB or AB are obtained by multiplying the individual probabilities of all the SNPs in the window. As multiplication of probabilities can result in very small numbers, we add their logarithms instead to avoid reaching the precision limit of the computer:

ρ_{X} = \sum_{i = S N P_{j - m}}^{S N P_{j + m}} log [p (G_{i} = X)], with X = A A, B B, A B

Finally, the relative probabilities for the window’s W_j around the SNP_j to be AA, BB or AB are defined as:

P (W_{j} = A A) = \frac{exp (ρ_{A A})}{exp (ρ_{A A}) + exp (ρ_{B B}) + exp (ρ_{A B})}

P (W_{j} = B B) = \frac{exp (ρ_{B B})}{exp (ρ_{A A}) + exp (ρ_{B B}) + exp (ρ_{A B})}

P (W_{j} = A B) = \frac{exp (ρ_{A B})}{exp (ρ_{A A}) + exp (ρ_{B B}) + exp (ρ_{A B})}

A genotype is assigned to the SNP j if the relative probability of its surrounding window is superior to a given threshold α. To guarantee that no SNP is falsely genotyped, the threshold is set to a very stringent value (0.999 by default). SNPs with P(W_j) < α for all genotypes are assigned a missing data value. We repeat the process for each SNP_j of the chromosome. For chromosome ends, the procedure is similar except that the half-window on the end side is smaller due to the lack of sites available to the left or right of SNP_j. This leaves two types of chromosomal regions unimputed and filled with missing data: 1) regions between imputed chromosome segments with identical diplotypes and for which none of the criteria are matched to assign a genotype, and 2) regions near recombination breakpoints.

Imputation—Step 2: Gap filling and error rate estimation

Step 2 consists in i) filling the unimputed regions with the surrounding genotype, with the condition that they are surrounded (left and right) by identical imputed genotypes, then ii) re-estimating error rates. The filling procedure assumes that a double recombination event is very unlikely. The maximum region size that is allowed for data filling can be calculated using the local recombination rate, which is calculated from the data of the entire F₂ population, imputed from Step 1. So, regions larger than the maximum size are left unimputed. It is desirable to use an interference model to estimate the distances (in cM), for instance the one implemented in the Kosambi mapping function [11]. The method employed in NOISYmputer to estimate recombination fractions in F₂ populations is the standard Expectation-Maximization algorithm [12].

Let’s take the example of two SNPs A and C that define the bounds of such a region. They are separated by the genetic distance d (cM). The maximum probability of a double crossover can be calculated as follows. We first search for the SNP B that is the closest to the middle point between A and C (in cM). Then, we calculate the recombination fractions r_AB and r_BC from d_AB and d_BC using the inverse of the Kosambi mapping function

r = \frac{1}{2} tanh (\frac{2 d}{100})

Note that $r \approx \frac{d}{100}$ when d < 15 cM. In the case of highly saturated maps, this formula can be used in most intervals. Then the maximum probability of the missing data to be different to the surrounding genotype is

r_{A B C} = r_{A B} r_{B C} + {r_{A B}}^{2} {r_{B C}}^{2} \approx r_{A B} r_{B C} if SNPs A and C are homozygous

r_{A B C} = r_{A B} r_{B C} + {r_{A B}}^{2} {r_{B C}}^{2} \approx 2 r_{A B} r_{B C} if SNPs A and C are heterozygous

The regions for which r_ABC ≤ α are filled with the surrounding genotype; α is set to 0.001 by default. This step leaves the breakpoint regions unimputed.

We can then estimate new values for e_A and e_B by comparing the observed data with the newly imputed regions. This is done by simply counting the proportion of A reads in BB-imputed segments, and the proportion of B reads in AA-imputed segments.

Imputation—Step 3: Locating recombination breakpoints

Step 3 consists in imputing the SNP genotypes in the regions near the recombination breakpoints—i.e., between diplotypes of different states. The general idea is to determine an interval of high probability of presence (loose support interval) of the breakpoint, then to calculate the likelihood of the data under the hypothesis of a recombined segment. This procedure allows determining with high confidence a loose support interval where the recombination breakpoint is located. Here we take the example of a segment BB to the left of the breakpoint and a segment AB to the right. Since we already know from Step 1 which are the two genotypes at the left and the right of the breakpoint, we only need to consider the only two possible diplotypes, BB and AB. This saves one degree of freedom. If k defines the closest SNP position to the point where p(W_j = BB) = p(W_j = AB) in Step 1, we take k − 2m and k + 2m as starting points to guarantee that the breakpoint is covered by the interval. Then, for each SNP_j of the scanned area, we recalculate p(W_j = BB) and p(W_j = AB), but this time in asymmetric windows of size m, that is, for BB, we define a window from SNP_j to SNP_j−m and for AB a window from SNP_j+m to SNP_j. And then, following calculations similar to Step 1 but omitting the probabilities for the AA genotype:

p (W_{j} = B B) = \frac{exp (ρ_{B B})}{exp (ρ_{B B}) + exp (ρ_{A B})} in the B window

p (W_{j} = A B) = \frac{exp (ρ_{A B})}{exp (ρ_{B B}) + exp (ρ_{A B})} in the H window

Starting from k − 2m, and progressing to the right, we look for the first site SNP_j for which:

P_{S I} = (1 - p (W_{j} = B B)) ((1 - p (W_{j} = A B)) > α_{S I}

with SI = 0.05 by default.

The explanation for the calculation of PSI is provided in S3 File.

The breakpoint loose support interval is defined between the first position from the left (k_L) and from right (k_R) where P_SI > SI. The breakpoint support interval and position are then estimated within the loose support interval. To do so, for each SNP_j in the breakpoint interval k_L to k_R, a probability P_bkp that the diplotype’s window contains a breakpoint in its middle is estimated. We define a left window for p_bkp(W_j = BB) that includes the SNP_j and goes to the left until the window’s data count reaches m/2 SNPs with at least one read (the left boundary of this window is called m_L) and a right window for p_bkp(W_j = AB) that starts at SNP_j+1 and goes to the right until the window’s data count reaches m/2 SNPs with at least one read (the right boundary of this window is called m_R). Values of m_L and m_R are recalculated for each SNP_j.

The log-probabilities for the left and right segments are:

ρ_{b k p} (W_{j} = B B) = \sum_{i = m_{L}}^{j} log [p (G_{i} = B B)]

ρ_{b k p} (W_{j} = A B) = \sum_{i = j + 1}^{m_{R}} log [p (G_{i} = A B)]

Then, the probability that the SNP_j and SNP_j+1 are surrounding the breakpoint is:

p_{b k p} (B K_{j}) = exp (ρ_{b k p} (W_{j} = B B) + ρ_{b k p} (W_{j} = A B)

And after normalization:

P_{b k p} (B K_{j}) = \frac{p_{b k p} (B K_{j})}{m a x (p_{b k p} (B K_{z}) : z = k_{L}, \dots, k_{R}}

The breakpoint is estimated in the middle of the interval defined by the SNP having the maximal P_bkp(BK_j) and the next SNP to its right.

Finally, the unimputed genotypes in the breakpoint area are completed in assigning the BB genotype to the SNPs to the left of the SNP with the max P_bkp(BKj) (included) and AB to the right. Imputation of breakpoint positions for the other types of homozygous-heterozygous transitions (AB→BB, AA→AB, AB→AA) are easily derived from the example beforehand. The support interval for the breakpoint around its most likely position can be defined in searching for the SNPs (left and right starting from the SNP with the maximum P_bkp(BK_j) for which −log₁₀(P_bkp(BK_j) ≥ α_drop, where α_drop is the dropping value of P_bkp. α_drop is set to 1 by default, corresponding to ten-fold decrease of P_bkp compared with P_bkp(BK_j).

Filtering options—Before imputation

Genotypic frequencies, heterozygosity, missing data

The program can filter out SNPs for parental genotypes, and progeny heterozygosity, percentage of missing data and parental genotypic frequencies. Min and max filtering values can be manually entered (though usually not recommended), or the program can calculate them from the genotype matrix imported from the VCF. In this case, genotypic frequencies are calculated for each SNP, and the filter values are derived from the extreme percentiles of the frequency distribution. Correction factors can be applied to the percentiles, to avoid too small or too large values.

Read counts

SNPs with too few or too many reads can be eliminated. This can be useful to, for instance, remove SNPs in duplicated regions. By default, when reading the input VCF file, if the depth of a site for a sample is at least 10 times superior to the average depth of this same sample (across all sites), the genotype at this site is set to a missing data. Also by default, no filter is applied on the minimum number of reads required to consider a site in a sample.

Incoherent SNPs

In sequence-based genetic mapping, it is common to observe SNPs that do not segregate the same way as their immediate environment, indicating a probable mapping error due to, for instance, structural variation between the reference genome and the population parents, or between the parents, or both. As segregation distortion is a frequent phenomenon in many organisms, the Mendelian expected frequencies cannot be used to analyze the SNP segregation. Instead, the procedure defines a window of n SNPs around each tested locus. By default, n = 1% the number of SNPs in the largest chromosome. For each window/SNP couple, it calculates the genotypes AA, BB and AB frequencies and the reads A and B frequencies across the population from the genotypes called in the VCF and compares the SNP with the window segregation of genotypes and reads using a chi-square test, where expected counts are the observed frequencies in the window multiplied by the population size. It then filters out SNPs for which the chi-square statistic exceeds a defined threshold for genotypes or reads frequencies.

Filtering options—After imputation

Incoherent chromosome segments (single individual)

Even after imputation and the different filtering operations, some few, improbable chromosome short diplotypes can still remain in the imputed matrix—we call them “small chunks”. The procedure identifies each small chunk composed of identical alleles, embedded in a homogeneous genomic environment that has a different allele. The method resembles the one used in Imputation—Step2.

Consider two SNPs A and C that define the bounds of a region imputed as H and surrounded by regions imputed as A or B. Search for the SNP B that is the closest to the middle point between A and C (in cM). Also search for an SNP D before the SNP A so that d_DA ≈ d_AB, and an SNP E after the SNP C so that d_CE ≈ d_BC.

Then, calculate the recombination fractions r_DB and r_BE from d_DB and d_BE using the inverse of the Kosambi mapping function. Then the maximum probability of the “chunk” to be different to the surrounding genotype is

r_{A B C} = r_{D B} r_{B E}

The chunks for which r_ABC ≤ α are restored with the surrounding genotype; is set to 0.001 by default.

Incoherent chromosome segments (cross-population)

Entire chromosome segments can be misplaced due to different kinds of genomic structural variation such as translocation, or duplication in one of the two parents that is not present in the reference genome. Such segments are called “aliens” in the program. If their size is too large, the chi-square procedure that filters out the incoherent SNPs may fail to identify them since it is run before the imputation. Alien segments are easily detected, as they produce severe map expansion. The procedure searches for SNPs that mark rapid changes in the slope of the cumulated centimorgans of the genetic map calculated from the imputed matrix. If a SNP marker is detected, the procedure then searches for the next SNP that is closely linked (by default r < 0.01) to the SNP located just before the slope change. It then eliminates all the SNPs that are in-between.

Running the program

Algorithm implementation

The program is implemented in Java, as a Spring Boot (v2.6.7) project. Spring Boot is an open-source Java framework used to create standalone java applications. The executable .jar has been built with JDK 8 using Maven (v3.9.6), an open-source build tool.

Paths to datafiles and working folders paths, as well as parameters for imputation and filtering can be entered in a config file or directly in the command line. A “NOISYmputerResults” folder is automatically created, where the program writes all the output files.

Data specifications

In this current version, NOISYmputer is built and extensively tested to perform on F₂ intercross data, that is, the progeny from F₁ self-fertilization (F₂). NOISYmputer can also be used on recombinant inbred lines by single seed descent from the F₂ (SSD).

Input data for NOISYmputer are standard Variant Call Format (VCF) files, with chromosome coordinates. Genotypes (GT field) and allele depths (AD field) must be present in the VCFs. The data can be low coverage, that is, the sum of all sequences produced per sample is equivalent to 1–3 times (1–3 X) the size of the reference genome used. Ideally, the VCF should contain only bi-allelic single-nucleotide polymorphisms (SNPs), however NOISYmputer automatically filters out the other types of sites. Small indels are not handled. Parental lines need to be included in the VCF file with the prefix “Parent” in their name. Compressed “.gz” VCFs are accepted.

Results

NOISYmputer, FSFHap and LB-Impute were run on the IFB Core cluster (specs. available at https://ifb-elixirfr.gitlab.io/cluster/doc/cluster-desc/ and in details in S1 File) with one allocated node per job and 32GB to 64GB of RAM to make sure that the tested programs are fully efficient.

Details on parameters used for the three imputation methods are provided in S1 File.

Using simulations for calibration

To test NOISYmputer’s efficiency in precision and recall of breakpoints estimation, we used simulated F₂ datasets generated using PopSimul (https://forge.ird.fr/diade/recombination_landscape/popsimul). A set of 84 VCFs with n = 300 samples and varying values of marker density, mean depth and error rate were generated for a final expected map size of 180 cM (corresponding to an average of 3.6 breakpoints per sample) to mimic the chromosome 1 of rice (available at https://zenodo.org/records/13381283). Using five different imputation window sizes, we compared the outputs of NOISYmputer to the known positions of breakpoints in the simulated data. In total, a set of 420 combinations were analyzed. All combinations and tested parameters are listed in Table 1. The results of these analyses confirmed that NOISYmputer efficiently detects the recombination breakpoints and precisely estimates their positions.

Table 1. Parameter values used in PopSimul to generate simulated F₂ VCFs: Marker density, mean depth and error rate.

All possible combinations of these parameters were tested and imputed using a range of imputation windows in NOISYmputer.

Parameters	Marker density (in number of markers along the chromosome)	Mean depth (in X)	Error rate	NOISYmputer impute half window size
Tested values	220,000	0.5	0.05	15
	180,000	1	0.01	20
	100,000	1.5	0.005	30
	66,000	2		50
		2.5		100
		3
		4

Open in a new tab

Breakpoint precision-recall

We assessed NOISYmputer’s ability to correctly detect all breakpoints within samples by comparing positions of breakpoints found by NOISYmputer to those of simulated datasets. We considered a breakpoint correct when the simulated breakpoint position falls within NOISYmputer’s loose support interval, along with the correct transition type.

Across all 420 VCFs, representing an average of 455,000 breakpoints, NOISYmputer demonstrated robust recall score, correctly finding 99.5% of simulated breakpoints (median at 99.6%). NOISYmputer also displayed high precision as, on average, 98.9% of breakpoints identified correspond to actual breakpoints (with a median at 100%). Thus, NOISYmputer presents an overall excellent precision and recall in detecting breakpoints.

To better understand the impact of each parameter and their interaction on NOISYmputer performance, we performed a principal component analysis (PCA) on parameters and performance indicators. Precision was primarily influenced by error rates, but was also affected by the imputation window size when excessively large. Conversely, smaller window sizes enhanced recall. Also, higher marker density correlated with improved recall, as lower densities limit NOISYmputer’s ability to identify breakpoints in regions with high recombination rates.

Some specific combinations decreased NOISYmputer precision and/or recall but overall the lower performances were still good. For instance, the lowest precision was of 72.3% (with error rates at 0.05 and smaller imputation window size of 15), and the lowest recall was of 96.9% (with larger imputation window size of 100). This is expected as small windows with high levels of noise are prone to false positive breakpoints. Nevertheless, when setting the window size to a larger value—30 –, NOISYmputer achieves excellent results even with error rates as high as 5%. As such, the minimum precision is 96% and the recall is 99%, regardless of marker density or depth, even with reads with 5% error rate (S1 Table). On the other hand, large windows (especially if coupled with low depth or marker density) may miss double recombination events, leading to false negatives (Fig 2A and 2C). In more realistic conditions, error rates as high as 0.05 are not typically observed in Illumina sequencing and alignments. When removing runs with the 0.05 error rate, the average breakpoint precision reached 99.9%, with a median of 100%. Similarly, the average recall was 99.6%, with a median of 99.5% (Fig 2B, S1 Fig).

Fig 2 — A) Representation of NOISYmputer’s precision (proportion of NOISYmputer-identified breakpoints being actual breakpoints from simulated data) in function of NOISYmputer’s recall (proportion of simulated breakpoints correctly identified by NOISYmputer). NOISYmputer shows excellent recall and precision with at least 72.3% and 96.9% respectively. B) Zoom on the upper part of the A plot of precision and recall, ignoring the error rate of 0.05. C) PCA Biplot of NOISYmputer showing VCFs characteristics and imputation window size influencing precision and recall with simulated VCFs. The lowest recalls are observed when high error rates are coupled with a small imputation half-window size in NOISYmputer. The lowest precisions correspond to VCFs imputed with a large imputation half-window size in NOISYmputer and can be accentuated by very low depth (≤ 1X) and/or low marker density (< 66, 000 sites /44 Mb).

The data in the VCF files, such as sequencing depth or marker density or species model, depend on the model species or sequencing type and are generally not under the user’s control. We thus looked for the imputation window size producing the best results for both breakpoint precision and recall with the VCF that mimicked best the real F₂ rice data we had. In both cases, the optimal results were obtained by the imputation half window size of 30. Thus, we used this value of 30 later on when exposing NOISYmputer to real datasets.

Precision of breakpoint position

NOISYmputer’s precision of breakpoint position was estimated by computing the difference between the simulated breakpoints positions and the estimated ones by NOISYmputer. We considered the size of the support interval and its marker density to estimate discrepancy (in number of SNPs) with the actual breakpoint position.

Across all 420 VCFs, a difference of 1,427 bp on average (equivalent to a discrepancy of ∼ 2 SNPs) was observed. The median difference was even lower, with only 245 bp (< 1 SNP discrepancy). This disparity between the median and mean is mainly due to extreme combinations, particularly low depth combined with high error rate. Notably, variance is higher in 0.5X coverage VCFs, becoming more homogeneous at 1X coverage.

Regarding the imputation window, smaller half-windows resulted in lower average differences between NOISYmputer and simulated positions but increased the median difference. Consequently, smaller windows enhanced overall precision of breakpoint position while potentially increasing the occurrence of extreme discrepancies.

Error rate estimations

Error rates (e_A and e_B), are recalculated after a first iteration of imputation step 1. NOISYmputer correctly estimated the error rates in 100% of the cases, with an average difference between simulated and estimated error rates of 9.8 10^-7(standard deviation 4.8 10^-5) (S2 Table). This reflects the accuracy of the imputation, even with starting values for error rates far from the true values.

Confirmed efficiency on real data and comparison with other methods

We assessed the performance of NOISYmputer on two real datasets: i) a maize F₂ population in GBS with 91 samples, including the parents, and ii) a rice F₂ population with 3X coverage in whole-genome sequencing (WGS) comprising 222 samples, including the parents sequenced at ∼ 30X. Details of how the real dataset for rice was generated are summarized in the S1 File and the data are available at https://doi.org/10.23708/8FXUNC). The maize dataset is described in the LB-Impute publication [10].

In real data, direct estimation of imputation accuracy may be challenging due to the unknown true state at each locus. However, it is possible to assess the quality of the imputation indirectly by comparing the final genetic map to, for instance, existing high-quality maps. A correctly imputed dataset should yield a map size—in centimorgans (cM)—consistent with those derived from high-quality marker data. Conversely, datasets with a high rate of genotyping errors will exhibit map expansion, resulting in a longer genetic map due to falsely imputed recombination breakpoints.

Using map size estimates in centimorgans (cM) of chromosome 1 of these datasets, we compared the results of NOISYmputer to those of LB-Impute and FSFhap (Fig 3 and Table 2). Concerning the maize GBS dataset, LB-Impute and FSFhap strongly overestimated the map size expected from high-quality datasets (respectively 633 cM and 13,271 cM), whereas NOISYmputer’s map was in range with the expected map size (203 cM). Regarding the Rice WGS dataset, while both LB-Impute and FSFhap yielded maps much larger than expected (23,436 cM and 337,750 cM respectively), NOISYmputer estimated a map size close to the expected value (208 cM). The lack of prefiltering only partially explains the poor performance of LB-Impute and FSFhap: even when providing a VCF prefiltered with NOISYmputer, they still produce extremely long maps (11,571cM and 47,319 cM respectively) (S3 Table). Also, results from FSFhap on PopSimul data produced extremely large map size estimations for high error rates (5%) VCF (1,397,752 cM on average), whereas NOISYmputer estimates the map size to be 180 cM compared to the 180 cM expected. Results were qualitatively similar between NOISYmputer and FSFHap for lower (1% and .5%) error rates (S1 Table).

Fig 3 — Rice_WGS is an F₂ Rice WGS dataset with n = 222 samples including parents; PopSimul_84 values are averages across 84 VCFs generated with PopSimul, each VCF containing n = 300 samples including parents, simulated using ranges of depth, marker density and error rate to mimic different characteristics of F₂ VCFs; Maize_GBS is an F₂ Maize GBS dataset with n = 91 samples including parents (with a lower marker density than Rice_WGS). NOISYmputer is overly faster and more RAM-efficient in all conditions than FSFHap and LB-Impute, with the exception for RAM usage on simulated VCF files of PopSimul. No data is shown for PopSimul_84/LB-Impute, as LB-Impute was not benchmarked due to excessive CPU time.

Table 2. Comparison of estimated and expected map sizes for three different datasets using NOISYmputer, FSFHap and LB-Impute.

The 84 VCFs generated using PopSimul have varying numbers of markers (66,000, 100,000, 180,000 or 220,000), depending on the settings used to generate the VCFs. Overall, NOISYmputer is showing considerably higher accuracy in map size estimation compared to FSFHap and LB-Impute. Estimated map size is given as average in the case of the 84 simulated populations.

Dataset	Software	Estimated map size (cM)	Expected map size (cM)	Initial number of markers in VCF
F₂ Maize GBS n = 91 samples including parents	NOISYmputer	203
	FSFhap	13,271	200	17,945
	LB-Impute	633
F₂ Rice WGS n = 222 samples including parents	NOISYmputer	208
	FSFhap	337,750	183	254,095
	LB-Impute	23,436
84 PopSimul F₂ populations n = 300 samples each including parents	NOISYmputer	180
	FSFhap	479,399	180	Different numbers of markers depending on the simulation settings
	LB-Impute	N/A

Open in a new tab

To further estimate the performance of NOISYmputer on real datasets we also performed comparisons on precision, recall, and breakpoint position estimate in the F₂ Rice population (n = 222). This dataset includes 20 samples sequenced at ∼20X depth, and artificially subsetted to 3X (that we call pseudo-3X). These 20 samples allow for a more robust evaluation as their breakpoints are well estimated thanks to their better depth. We processed similarly to the simulated analyses and compared breakpoint precision, recall, and precision of breakpoint position for NOISYmputer against the accurately estimated breakpoints at 20X coverage. Unfortunately, we were not able to compare NOISYmputer results to those of FSFHap and LB-impute as, even if we managed to retrieve each breakpoint position estimate, we could not easily check which were actual breakpoints and which were false positives, as TASSEL FSFHap and LB-impute do not provide support intervals for breakpoints.

Overall, NOISYmputer demonstrated excellent results with, on average, 99% precision and 97% recall. Regarding precision, on average the difference in position was of 10,219 bp, while the median was of only 415 bp. The large difference between the average and the median is due to a few breakpoints estimated far from their true position. Indeed, 80% of the breakpoints were still estimated at less than 1,669 bp from their true position. In terms of number of SNPs, the discrepancy was of 2 SNPs on average (median: 1) (S4 Table).

Overestimation of map sizes was mostly due to misinterpretation of noisy data by FSFHap and LB-Impute. These discrepancies frequently arise in regions corresponding to structural variations between parental genomes. Such variations can occur, for instance, when attempting to map onto regions found exclusively in the Parent A genome, which serves as the reference. In such cases, reads from B regions might map to the most similar A regions available resulting in false recombination events according to imputation softwares. This phenomenon is accentuated in WGS data compared to GBS data as the complete genome is sequenced and mapped, thus increasing the number of markers. Including more sites, inducing sites belonging to peculiar genomic structures, can hinder the quality of imputation if the software does not take into account the coherence of a marker with its surrounding environment in the population. Though FSFHap and LB-impute might be precise in the estimated breakpoints positions, their lack of precision in breakpoints detection leads to results, on whole genome datasets, difficult to use without the help of complex filtering steps. NOISYmputer, on the contrary, is very efficient at correcting mapping issues or divergence between parental genome structures. Moreover, the filters applied by NOISYmputer allow keeping substantial amounts of data, as shown in S5 Table. For instance, in the case of the Rice WGS dataset, nearly 150,000 SNPs are kept for chromosome 1, allowing the generation of an extremely saturated map.

A resource-optimized software, CPU- and RAM-efficient

In our comparative analysis of the NOISYmputer with established counterparts, we conducted comprehensive benchmarks, focusing on execution time and RAM usage (Table 3 and Fig 3). To do so, we ran NOISYmputer, FSFHap and LB-Impute on simulated and real datasets. We then retrieved their CPU time, “wall clock” execution time and RAM usage using the seff command on the IFB computing cluster.

Table 3. CPU and RAM usage of NOISYmputer, FSFHap and LB-Impute for three datasets based on the output of the seff command on the IFB cluster.

NOISYmputer 1st and 2nd Runs are displayed as NOISYmputer shows better CPU time usage for the second run since the conversion of the raw VCF file has already been done. For LB-Impute, as imputation is processed in two steps, CPU time and execution time results are the sum of the two steps; RAM usage corresponds to the highest RAM usage of the two steps (offspring imputation). For the 84 PopSimul VCFs section, results correspond to the average of resource usage for each of the 84 PopSimul VCFs for an imputation half-window of 30 SNPs with NOISYmputer and the default window size (50) of FSFHap. All tests were conducted on the IFB Core cluster. *LB-Impute showed excessive time/RAM use so it could not be evaluated.

Dataset	Software	CPU time (h:m:s)	Total execution time (h:m:s)	RAM (GB)
F₂ Maize GBS n = 91 samples including parents	NOISYmputer—1st run	00:00:27	00:00:28	1.00
	NOISYmputer—2nd run	00:00:07	00:00:09	1.00
	FSFhap	00:06:49	00:04:35	3.42
	LB-Impute	600:20:59	00:21:04	21.61
F₂ Rice WGS n = 222 samples including parents	NOISYmputer—1st run	00:09:47	00:06:44	3.36
	NOISYmputer—2nd run	00:06:49	00:04:35	3.42
	FSFhap	01:19:00	01:19:06	4.02
	LB-Impute	16:18:52	16:19:21	31.48
84 PopSimul F₂ populations n = 300 samples each including parents	NOISYmputer—1st run	00:05:18	00:05:39	3.61
	FSFhap	00:09:20	00:09:24	1.93
	LB-Impute*	N/A	N/A	N/A

Open in a new tab

Concerning the F₂ Maize GBS dataset, NOISYmputer ran ∼10 and ∼45 times faster than FSFhap and LB-Impute, respectively. It also used less RAM (∼3.4 GB), ∼3 times less than FSFHap and ∼21 times less than LB-Impute.

Regarding the F₂ Rice WGS dataset, NOISYmputer used slightly less RAM than FSFHap and was ∼13 times faster (<6 min vs. 1h19m). LB-Impute showed poor CPU and RAM efficiency as NOISYmputer used ∼9 times less RAM and ran ∼145 times faster.

Due to the excessive computation time on this single smaller dataset, LB-Impute was excluded from the remaining comparisons with the 84 PopSimul VCFs with 300 samples. It is interesting to note that FSFHap resource efficiency is better on simulated than on real datasets even though they have more samples. Indeed, FSFHap used on average 1.93 GB of RAM, whereas NOISYmputer was stable at 3.61 GB. NOISYmputer was still faster than FSFHap on average, with ∼5 min, while FSFHap ran in ∼9 min. This underlies the difficulty that FSFHap has to impute noisy data, partly due to structural variants and calling errors. These results underscore NOISYmputer’s efficiency improvement in processing imputation tasks, especially compared to existing software for bi-parental population imputation.

Availability and future directions

Availability

NOISYmputer is available as a multiplatform (Linux, macOS, Windows) Java executable at the URL https://gitlab.cirad.fr/noisymputer/noisymputerstandalone/-/tree/1.0.0-RELEASE?ref_type=tags. The source code and the documentation are available at the same URL. A Quarto markdown companion (compatible with R markdown and Jupyter notebooks IDE) that allows to display graphics of statistics (e.g., genotypic frequencies on SNPs and samples) and graphical genotypes from NOISYmputer output files was developed and is also available.

NOISYmputer and its companion are distributed under the GNU Affero General Public License V3.0.

The 84 simulated VCFs that mimic the chromosome 1 of rice are available at https://zenodo.org/records/13381283.

Future directions

NOISYmputer’s strengths

Although previous methods have made significant advances in addressing the challenges listed above, the noisiness of imputed datasets are still producing expanded genetic maps, excess heterozygosity, and probabilistically unlikely recombination events contained within a short physical interval. Here, we introduce an algorithm which, in a series of steps, addresses each source of error to create higher-quality datasets for improved trait mapping and genomics-assisted breeding. Our algorithm represents a step to systematically address all sources of NGS genotyping error and even errors in the reference genome, and hopefully the corrections brought here will be integrated into future algorithm development. Indeed, key features of NOISYmputer are its pre- and post-filtering steps that other currently available software does not perform. In filtering SNPs and segments that are incoherent with their environment and with the population local recombination landscape, NOISYmputer efficiently eliminates errors of genotype calling, sequencing errors, or errors generated by structural variants. The pre-imputation and post-imputation stages of NOISYmputer, in particular, address artifacts of imputation caused by presence-absence variation misrepresented by the reference assembly and assembly errors from inaccurate or misordered contigs. These imputation artifacts, such as those caused by collapsed structural variants (incoherent sites or false heterozygosity) or misassembled “chunks”, are not systematically addressed by other imputation methods, such as LB-Impute [10], and otherwise must be parsed through manual filtering of the imputed dataset.

NOISYmputer is a resource-effective software developed in Java, allowing its integration in bioinformatics pipelines. NOISYmputer is parallelizing computation at the sample level in several steps of the algorithm, which increases its speed considerably. The use of a Java standalone executable also allows to simulate parallelization in running each chromosome on a separate core of a server/cluster. Moreover, NOISYmputer employs a maximum likelihood method, instead of hidden Markov models, which considerably reduces computational complexity, compared to FSFHap [9] and LB-impute [10], while enhancing result accuracy and flexibility across diverse datasets. Indeed, NOISYmputer is less sensitive to noisy regions (due to mapping artifacts for example) as it can handle large windows without being greedy in RAM and computation time to overpass complex regions.

Notably, NOISYmputer’s speed allows iterative refinement of parameter settings. For example, the size of the imputation window (in number of SNPs), like in other imputation programs (e.g., FSFhap, LB-Impute), is arbitrarily fixed by the user. The most appropriate value for m depends on several factors, including depth and SNP density. A convenient way to determine which value for m to use is to run the imputation several times with different values until reaching the expected distribution of the number of recombination breakpoints per sample across the population (if previously known). Often, saturated genetic maps generated with other types of markers are available in the literature, from which the expected distribution is easily derived. With our rice data, the imputation algorithm gave the best results with m = 30, so even a few runs should provide a satisfying window size.

Furthermore, NOISYmputer generates a .json file from the VCF during the initial run, that is used by the consecutive runs, eliminating the redundant tasks of converting the input VCF file, thus enhancing speed for subsequent launches on the same dataset.

Its robust performance extends to various VCF characteristics, accommodating differences in SNP quality, marker density, error rates, and sequencing depths. This is partly due to its low sensitivity to the SNP calling step used to generate the input VCF, as NOISYmputer is re-estimating the probabilities of genotypes using the allele depth at each site, along with information of the surrounding environment and of the whole population. This results in maintenance of overall excellent precision, recall and position precision on recombination breakpoints even with very low coverage datasets (≤1X). However, users should exercise caution in selecting an appropriate imputation window size to mitigate the risk of false positives and negatives.

In addition to its performance benefits, NOISYmputer provides users with several comprehensive breakpoint confidence information allowing to further filter the identified breakpoints. This is a feature that is innovative and useful and not available in other software, to our knowledge. NOISYmputer also outputs statistics on genotypic/allelic frequencies, samples and genetic map among others.

Suggestions for improvement

NOISYmputer could benefit from several improvements. The first one is including more population types. In the next version, we will implement F₂ backcross, or BC₁F₁, the progeny of the F₁ hybrid crossed with one of the parents (BC₁); doubled haploid of F₁ gametes (DH); F₂ intercross, that is, the progeny from F₁ self-fertilization (F₂); recombinant inbred lines by single seed descent from the the BC₁F₁ (BCSSD); the unconventional mating design (UMD) BC₁F₃, derived by two generations of self-fertilization of BC₁F₁ individuals. For now, it has been extensively tested and optimized for F₂ crosses between distant parents which might be one of the hardest designs to estimate breakpoints from. We thus are confident that the algorithm can be adapted to these other types of crosses.

Breakpoint precision and recall could benefit from a more complex modeling of the likelihood. Currently, we test for the existence of a single transition within the loose support interval in imputation Step 3. Testing for one, two or even three transitions in a single interval could increase the probability of finding close double recombination events if they happened to have a higher probability in the tested region. Breakpoint position estimation, on the other hand, might be improved by using a combination of NOISYmputer’s current algorithm with a hidden Markov model occurring in the Step 3 of imputation. This way, a smaller window size could be applied and the region to scan would be reduced to a very limited percentage of the genome only, resulting in a considerable gain of time.

NOISYmputer is robust on a broad range of samples and its computation time makes it very convenient. Part of the success of NOISYmputer lies in the fact that it performs pre- and post-imputation filtering steps that remove, among other things, incoherent SNPs, meaning SNPs that do not segregate the same way as its immediate environment, often indicating mapping errors. This filtering of incoherent SNPs step uses a Chi-square test to evaluate if the observed pattern is reasonable. Unfortunately, Chi-square test thresholds are dependent on sample sizes. Thus, when imputing many samples (e.g., m = 2000) with NOISYmputer, the user has to adapt the Chi-square threshold to the sample size, which is not convenient. A solution to this would be to use a “Cramér’s V” statistic instead [13], which would be independent of the sample number in the VCF.

NOISYmputer estimates global error rates for A and B reads for a dataset. However, mapping and sequencing errors are highly variant-specific. So, it might be desirable to estimate A and B reads error rates for each and every variant. This is what GBScleanR, for instance, implements for Genotyping-by-Sequencing data [14]. Nevertheless, in any case, NOISYmputer is already very robust over a wide range of sequencing error rates, while mapping errors are filtered before imputation.

Unlike FSFHap or LB-Impute, NOISYmputer does not impute the parental genotypes, which might result in the loss of SNPs, especially in datasets derived from very low-coverage sequencing. Although we recommend sequencing the parents at high coverage (≥ 20X), it is not always possible—for instance, when re-analyzing historical data. The next version of NOISYmputer will impute the parental genotypes when necessary.

Finally, as pointed in the Results section, the imputation half-window size can have an impact on the outputs of NOISYmputer. NOISYmputer could benefit from an iterative process that would check for different window sizes and analyze the convergence of the results to select the appropriate window size and thus to achieve the best compromise between precision and recall, along with precision of breakpoints positions.

Supporting information

S1 File. Options used for imputation with LB-impute, TASSEL-FSFHap and NOISYmputer; Generation of Rice F₂ VCF dataset; Cluster description.

(PDF)

pone.0314759.s001.pdf^{(76.2KB, pdf)}

S2 File. What is the proportion of recombined versus plain diplotypes in an F₂ individual?.

(XLSX)

pone.0314759.s002.xlsx^{(11.7KB, xlsx)}

S3 File. Calculation details of the loose support interval (PSI).

(XLSX)

pone.0314759.s003.xlsx^{(202.8KB, xlsx)}

S1 Fig. Precision and Recall without the parameter values window = 15 and error rate = 0.05.

(PDF)

pone.0314759.s004.pdf^{(55.2KB, pdf)}

S1 Table. Genetic map sizes of 84 bi-parental populations, varying in marker density, depth and error rate, after imputation with FSFHap and NOISYmputer with default parameters.

(XLSX)

pone.0314759.s005.xlsx^{(13.5KB, xlsx)}

S2 Table. Error rate estimation accuracy in 420 simulated VCFs with variable marker density, depth and error rate.

(XLSX)

pone.0314759.s006.xlsx^{(54.5KB, xlsx)}

S3 Table. Comparison of estimated with expected map sizes using NOISYmputer, FSFHap and LB-Impute using the VCF filtered by NOISYmputer as input.

(XLSX)

pone.0314759.s007.xlsx^{(9.2KB, xlsx)}

S4 Table. Evaluation of NOISYmputer outputs for 20 pseudo 3X samples F₂ samples compared to their 20X counterparts.

Pseudo 3X samples are derived from artificially subsetted 20X samples. These pseudo 3X samples were part of the 222 F₂ Rice WGS samples imputed using NOISYmputer. Number of SNPs differences are estimated based on the position difference compared to the marker density in the loose support interval.

(XLSX)

pone.0314759.s008.xlsx^{(28KB, xlsx)}

S5 Table. Comparison of numbers of remaining markers after the different filtering steps of NOISYmputer pre-imputation filtering.

(XLSX)

pone.0314759.s009.xlsx^{(9.6KB, xlsx)}

Acknowledgments

We thank Karine Labadie (CEA, Institut de Génomique, Genoscope, Evry, France) for sharing the WGS data for the sequencing of rice populations, Christine Tranchant-Dubreuil (IRD, Montpellier, France) for her help with retrieving the Rice_WGS data and François Sabot (IRD, Montpellier, France) for coordinating the IRIGIN project. We are grateful to the Institut Français de Bioinformatique (IFB) for providing computing resources. We also thank the Yale Center for Research Computing for guidance and use of the research computing infrastructure.

Data Availability

All VCF data files are available at: - https://doi.org/10.23708/8FXUNC (real data) - https://zenodo.org/records/13381283 (simulated data).

Funding Statement

- The French ANR project "LANDSREC" 385 (ANR-21-CE20-0012-03) - The French Government’s France Génomique program through its International RIce 485 Genome INitiative “IRIGIN” project - The CGIAR Research Program “RICE” We also declare that the funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Davey JW, Blaxter ML. RADSeq: next-generation population genetics. Briefings in Functional Genomics. 2010;9(5-6):416–423. doi: 10.1093/bfgp/elq031 [DOI] [PMC free article] [PubMed] [Google Scholar]
2. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, et al. A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species. PLoS ONE. 2011;6(5):e19379. doi: 10.1371/journal.pone.0019379 [DOI] [PMC free article] [PubMed] [Google Scholar]
3. Huang X, Feng Q, Qian Q, Zhao Q, Wang L, Wang A, et al. High-throughput genotyping by whole-genome resequencing. Genome Research. 2009;19(6):1068–1076. doi: 10.1101/gr.089516.108 [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Xu C, Ren Y, Jian Y, Guo Z, Zhang Y, Xie C, et al. Development of a maize 55 K SNP array with improved genome coverage for molecular breeding. Molecular Breeding. 2017;37(3):20. doi: 10.1007/s11032-017-0622-z [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Heffelfinger C, Fragoso CA, Moreno MA, Overton JD, Mottinger JP, Zhao H, et al. Flexible and scalable genotyping-by-sequencing strategies for population studies. BMC Genomics. 2014;15(1):979. doi: 10.1186/1471-2164-15-979 [DOI] [PMC free article] [PubMed] [Google Scholar]
6. Browning SR, Browning BL. Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering. The American Journal of Human Genetics. 2007;81(5):1084–1097. doi: 10.1086/521987 [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Browning BL, Tian X, Zhou Y, Browning SR. Fast two-stage phasing of large-scale sequence data. The American Journal of Human Genetics. 2021;108(10):1880–1890. doi: 10.1016/j.ajhg.2021.08.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics. 2012;44(8):955–959. doi: 10.1038/ng.2354 [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Swarts K, Li H, Romero Navarro JA, An D, Romay MC, Hearne S, et al. Novel Methods to Optimize Genotypic Imputation for Low-Coverage, Next-Generation Sequence Data in Crop Plants. The Plant Genome. 2014;7(3):plantgenome2014.05.0023. doi: 10.3835/plantgenome2014.05.0023 [DOI] [Google Scholar]
10. Fragoso CA, Heffelfinger C, Zhao H, Dellaporta SL. Imputing Genotypes in Biallelic Populations from Low-Coverage Sequence Data. Genetics. 2016;202(2):487–495. doi: 10.1534/genetics.115.182071 [DOI] [PMC free article] [PubMed] [Google Scholar]
11. Kosambi DD. The Estimation of Map Distances from Recombination Values. In: Ramaswamy R, editor. D.D. Kosambi: Selected Works in Mathematics and Statistics. New Delhi: Springer India; 1944. p. 125–130. [Google Scholar]
12. Dempster AP, Laird NM, Rubin DB. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society Series B (Methodological). 1977;39(1):1–38. doi: 10.1111/j.2517-6161.1977.tb01600.x [DOI] [Google Scholar]
13. Cramér H. Mathematical Methods of Statistics. Princeton University Press; 1999. [Google Scholar]
14. Furuta T, Yamamoto T, Ashikari M. GBScleanR: robust genotyping error correction using a hidden Markov model with error pattern recognition. Genetics. 2023;224(2):iyad055. doi: 10.1093/genetics/iyad055 [DOI] [PMC free article] [PubMed] [Google Scholar]

PLoS One. doi: 10.1371/journal.pone.0314759.r001

Decision Letter 0

Andrea Tangherloni

23 Jul 2024

PONE-D-24-15414Fast and accurate imputation of genotypes from noisy low-coverage sequencing data in bi-parental populationsPLOS ONE

Dear Dr. Lorieux,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The reviewers' comments pointed out that a revision is required to improve the current version of the manuscript. One of the most critical concerns regards the filtering step. It also needs to be clarified if the input of the three tested tools is the same. Please refer to the reviewers’ reports and the Reviewer’s Responses to Questions section for detailed comments, which could help you improve your manuscript. Please carefully address (and reply to) all the comments raised by all reviewers (this is mandatory).

Please submit your revised manuscript by Sep 06 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Andrea Tangherloni

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

3. Thank you for stating the following financial disclosure:

"The French ANR project "LANDSREC" 385 (ANR-21-CE20-0012-03), , the French Government France Génomique program through its International RIce 485 Genome INitiative “IRIGIN” project, and the CGIAR Research Program “RICE”"

Please state what role the funders took in the study. If the funders had no role, please state: ""The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.""

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript:

"We thank Karine Labadie (CEA, Institut de Génomique, Genoscope, Evry, France) for sharing the WGS data for

the sequencing of rice populations, Christine Tranchant-Dubreuil (IRD, Montpellier, France) for her help with

retrieving the Rice_WGS data and François Sabot (IRD, Montpellier, France) for coordinating the IRIGIN

project. We are grateful to the Institut Français de Bioinformatique (IFB) for providing computing resources.

We also thank the Yale Center for Research Computing for guidance and use of the research computing

infrastructure. The following programs supported parts of this initiative: the French ANR project "LANDSREC"

(ANR-21-CE20-0012-03), the French Government France Génomique program through its International RIce

Genome INitiative “IRIGIN” project, and the CGIAR Research Program “RICE”."

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

5. When completing the data availability statement of the submission form, you indicated that you will make your data available on acceptance. We strongly recommend all authors decide on a data sharing plan before acceptance, as the process can be lengthy and hold up publication timelines. Please note that, though access restrictions are acceptable now, your entire data will need to be made freely accessible if your manuscript is accepted for publication. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If you are unable to adhere to our open data policy, please kindly revise your statement to explain your reasoning and we will seek the editor's input on an exemption. Please be assured that, once you have provided your new statement, the assessment of your exemption will not hold up the peer review process.

6. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Additional Editor Comments:

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

Reviewer #3: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors propose a method to genotype bi-parental populations with low-coverage NGS technology and compare their approach against FSFhap and LB-Impute. An MLE algorithm as compared to HMM makes sense with the assumptions of the bi-parental populations with inbred and exclusively homozygous parental lines and the low running time and memory consumption looks good. However, I have some comments/questions as given below:

Major Comments

-------------

- There is a heavy filtering of SNPs before being used by the tool. Filtering multiallelic SNPs and small indels is common in genomic tools (but newer versions of tools are updating their pipelines to incorporate them now) but further filtering of SNPs based on HWE parameters, read counts, etc might be a bit excessive. Do you have any explanation about this? How many SNPs are removed through these filtering criteria (in terms of percentage of all SNPs)?

- Do you compare NOISYmputer to FSFhap and LB-Impute with the input of these filtered variants to all three tools or are the filtering criteria for the three tools different? If variants are filtered differently for the three methods, is it possible to run the three tools on one VCF and compare the results? Is it possible that the high accuracy, better map size estimation, and breakpoint precision of the tool are the outcome of the heavily filtered set of SNPs?

- Line 259: How do you define too many or too few? Is it based on the average read depth or a constant value of X?

Also it is possible that the SNPs with too many reads might be duplicated regions but this information seems to be lost when they are removed. Any comments on this?

- What is the reason that windows in this work are based on the number of variant positions around a central SNP and not base pair distance-based? More clearly, why is an m-window around SNPj defined as SNPj-m to SNPj+m and not all the variants whose base pair distance is less than m from SNPj. Most references have gaps in their sequences and the existing window-system in the manuscript will include variants which can be large distances away from the central SNP if such a gap exists between the two variants.

Why was such a window-system used instead of a base pair distance-based?

- Why was the comparison done with FSFhap and also not with FILLIN (as the main publication of FSFhap does)? FILLIN has imputation capabilities that can also be compared with the imputation done by NOISYmputer.

I could not find a direct link to the simulated and real dataset used in the experiments. Each dataset used in the experiments should be stated clearly with a proper link. For simulations, an open source repository (e.g., Zenodo) might be used.

Minor Comments

------------

This is not a very big issue since the tool primarily deals with Illumina reads, which have very low error rates (as stated in lines 354-357). But future ideas related to extending these tools to error-prone long reads will be unsuccessful based on these results. Is it possible to generalize the model further to allow high error rates?

- I would expect to see some stats about the performance of the algorithm in the abstract.

- Line 48 states that “Reducing sequencing costs through minimized per-sample coverage has an important experimental downside: LC-NGS mechanically introduces a series of issues, the main ones being:”

However, I can't see any connection of issue 3 listed in line 60 with the data being low coverage. It's a problem with short-read sequencing.

- Line 117: Where does the equation for the proportion of the genome with no recombination come from? Reference?

- Line 123: “computation time, which increases linearly according to the diplotype size” - Is there any time complexity analysis given as part of the work to show this?

- Line 153-155: Are nAi and nBi numbers or is it a multiplication between n and Ai and Bi. If they are numbers, consider superscipting the A and B. It is not clear how they are being obtained. The binomial expansion has them as variables and their values can range from 0 to ni. But in Line 162 in the equation, they are no longer variables on the LHS of the equation while they exist on the RHS of the equation. Which value is chosen for nAi and nBi and how are they chosen?

- Line 210-225: If BB is the segment to the left of position j and AB is to the right, why is the BB window defined for j to j+m and vice versa for the AB window? Since BB is to the left, the window is for j-m to j. Further definition of PSI as a product of 1-probabilities for the two regions. There is no explanation on why this approach has been taken.

- Line 244-245: The formulation of homozygous to heterozygous transitions seems to be easily generalizable for double recombination events of transitioning from homozygous to homozygous. Why not allow such events as default and allow users to choose whether to consider the double recombination transitions?

- Line 320: Does PoPSimul also simulate gaps in the reference or is it a uniform distribution of SNPs?

- Line 338-342: It can also be considered to report the results using other metrics like precision-recall or Matthews Correlation Co-efficient etc. Maybe even with a confusion matrix.

- Line 316-317: Specs of the cluster are given in the link but which of these machines did you use for the experiments. This information can also be added as a single line regarding the specs of the machine used.

Typos

-----

- Line 102: “leaving unimputed the regions"

- Line 132: “once could set”

- Line 157-159: End bracket missing in the expression in the middle.

- Line 218: Should this be SNPj+m and SNPj-m respectively, where “m” is subscripted?

Reviewer #2: The authors propose a specific scheme for handling imputation of individuals where the original genetic stock come from two completely inbred lines. They propose to compute the total number of reads supporting the two generalized alleles (A and B) and make the primary imputation decisions based on these counts. They claim that this technique will be more tolerant to sequencing and mapping errors compared to other prevalent approaches.

The authors claim that this approach is more efficient than a Hidden Markov Model, but in my mind, what's proposed is similar to a HMM where the emitted symbols are the read counts in each window. Combining windowing with transition probabilities for long range haplotype consistency wouldn't be impossible.

I would also be somewhat weary against the argument that it makes sense to have a single global error rate per origin line, since mapping and sequencing errors are highly variant-specific. On the other hand, I can of course agree that two parameters would maybe be better than a single global epsilon.

The use of the reconstructed chromosome length as a metric for qaulity is ingenious, but it takes some time to wrap your head around it.

Overall, I find the contribution to be interesting, but there are several assumptions made on the type of dataset and I think the manuscript would benefit from more clearly spelling these out and also contrast to related scenarios where they would not be applicable.

Reviewer #3: The paper is written well, I understood the problem of imputation and authors' approach to the maximum likelihood estimation. The results provided are understandable. However, related work or literature review could be improved.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Carl Nettelblad

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2025 Jan 30;20(1):e0314759. doi: 10.1371/journal.pone.0314759.r002

Author response to Decision Letter 0

10 Sep 2024

PLOS ONE – PONE-D-24-15414

Manuscript: “Fast and accurate imputation of genotypes from noisy low-coverage sequencing data in bi-parental populations”

Response to Reviewers

Editor’s comments

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

> We carefully checked the styles requirements, and hope everything is fine.

2. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

> We updated the manuscript using the PLOS LaTeX template.

3. Thank you for stating the following financial disclosure:

"The French ANR project "LANDSREC" 385 (ANR-21-CE20-0012-03), the French Government France Génomique program through its International RIce 485 Genome INitiative “IRIGIN” project, and the CGIAR Research Program “RICE”"

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

> We included the amended role of funders in the cover letter.