Abstract
In diploid species, many multiparental populations have been developed to increase genetic diversity and quantitative trait loci (QTL) mapping resolution. In these populations, haplotype reconstruction has been used as a standard practice to increase the power of QTL detection in comparison with the marker-based association analysis. However, such software tools for polyploid species are few and limited to a single biparental F1 population. In this study, a statistical framework for haplotype reconstruction has been developed and implemented in the software PolyOrigin for connected tetraploid F1 populations with shared parents, regardless of the number of parents or mating design. Given a genetic or physical map of markers, PolyOrigin first phases parental genotypes, then refines the input marker map, and finally reconstructs offspring haplotypes. PolyOrigin can utilize single nucleotide polymorphism (SNP) data coming from arrays or from sequence-based genotyping; in the latter case, bi-allelic read counts can be used (and are preferred) as input data to minimize the influence of genotype calling errors at low depth. With extensive simulation we show that PolyOrigin is robust to the errors in the input genotypic data and marker map. It works well for various population designs with offspring per parent and for sequences with read depth as low as 10x. PolyOrigin was further evaluated using an autotetraploid potato dataset with a 3 × 3 half-diallel mating design. In conclusion, PolyOrigin opens up exciting new possibilities for haplotype analysis in tetraploid breeding populations.
Keywords: hidden Markov model, multiparental population, tetraploid potato, double reduction, QTL mapping, MPP, Multiparent Advanced Generation Inter-Cross (MAGIC)
Introduction
Polyploid species have more than two sets of chromosomes and are especially common in flowering plants. As with diploid species, breeding progress in economically important tetraploid crops, such as alfalfa, potato, and blueberry, is advanced by studying the genetic architecture of complex traits. However, inbred lines do not exist for these crops, and therefore biparental F1 populations derived from outbred, highly heterozygous clones are typically used for QTL mapping. QTL discovered in a single biparental population can lose their predictive ability when applied more broadly in breeding populations. To overcome this limitation in diploid species, multiparental populations have been developed and used for QTL mapping (see review by Huang et al. 2015). Several software tools are available for haplotype reconstruction in diploid multiparental populations (Mott et al. 2000; Broman et al. 2003, 2019; Zheng et al. 2015), but there is no such tool for polyploid multiparental populations. The primary aim of this work is to develop a statistical framework, called PolyOrigin, for haplotype reconstruction in connected tetraploid F1 populations with shared parents.
Several methods have been developed for haplotype reconstruction in a single F1 polyploid population. Conditional on parental phases, Xie and Xu (2000) developed a hidden Markov model (HMM) for ancestral inference, although the model does not represent biological processes in a tetraploid F1 (Hackett 2001). Luo et al. (2001) developed a heuristic algorithm for parental phasing in a tetraploid F1, based on two-point linkage analyses. Hackett et al. (2003) modified the phasing algorithm (Luo et al. 2001) for analyzing SNP dosage data, and developed an HMM for ancestral inference by assuming only bivalent formation. Zheng et al. (2016) developed the integrated HMM framework TetraOrigin for parental phasing and ancestral inference, accounting for both bivalent and quadrivalent formations in meiosis. One phenomenon resulting from quadrivalent formation is double reduction, which is the inheritance of both sister chromatids (for a particular locus) in the diploid gamete. The MAPpoly software (Mollinari and Garcia 2019; Mollinari et al. 2020) uses two-point procedures and HMMs for parental phasing and ancestral inference in polyploids up to 8x, assuming only bivalent formation.
Compared to earlier software, the principal innovation of our new software PolyOrigin is the ability to analyze multiple, connected 4x F1 populations, regardless of the number of parents or mating design. In addition, PolyOrigin has the following desirable features (some of which can be found in earlier packages): (1) it can analyze SNPs from both arrays and sequencing data, particularly at low read depth; (2) it accounts for quadrivalent formation during haplotype reconstruction of the progeny; (3) it can use a physical map as input and estimate genetic distances; and (4) it has improved robustness to errors in the input map and marker data.
One source of error is the uncertainty in the dosage calling from intensity signals of a SNP array or allele counts of sequencing data. We account for parental dosage errors by a correction procedure during ancestral inference in PolyOrigin, whereas TetraOrigin introduced a parental error parameter (Zheng et al. 2016). In addition, PolyOrigin includes a procedure for marker deletion during parental phasing, and the markers with parental errors are thus likely to be removed. On the other hand, since it has been shown that read depths of 60–80 are required for accurately inferring dosage in autotetraploids (Uitdewilligen et al. 2013; Matias et al. 2019), PolyOrigin can account for the dosage uncertainties by using read count data directly.
Another source of error is the input marker map. The marker deletion procedure during parental phasing can also remove those markers that are misgrouped or long-range misordered, in addition to parental errors. The map construction packages such as MAPpoly (Mollinari and Garcia 2019; Mollinari et al. 2020) and polyMapR (Bourke et al. 2018) order markers by the multidimensional scaling algorithm (Preedy and Hackett 2016), based on two-point linkage analyses. Such input genetic maps can be improved by an extra step of map refinement using a multi-locus HMM approach. This map refinement, implemented in PolyOrigin, consists of local marker reordering and inferring inter-marker genetic distance—the latter becoming necessary when the input marker map is a physical map.
We evaluate PolyOrigin by extensive simulation studies and with a real tetraploid potato dataset. For the simulation studies, we compare PolyOrigin with TetraOrigin and MAPpoly and investigate the effect of mating design, such as the number of parents. We also investigate the robustness to low depth sequencing and errors in the input dosage data and marker linkage map. We demonstrate that PolyOrigin is a robust new software for haplotype reconstruction in connected tetraploid F1 populations.
Methods
Overview of PolyOrigin
An overview of PolyOrigin is shown in Figure 1. Each F1 population can be either a cross between two parents or a self-fertilized population from a single parent. The set of populations can be represented by an un-directed graph (e.g., Figure 1A), with nodes representing parents and edges representing the crosses or selfings. We call this set a connected F1, since two populations can be connected by sharing one of their parents. PolyOrigin requires two inputs: (1) a mating design describing the parents of each F1 offspring, and (2) a genotypic data matrix for all parents and offspring for a set of SNP markers. The genotypic data include a genetic map or physical map of the SNP markers. We assume that all markers are bi-allelic and denote the two alleles by 1 and 2, with genotype dosage (0–K) defined as the count of allele 2. Here, ploidy K = 4 for tetraploids. We model the marker data independently across linkage groups, and thus describe the model for only one linkage group.
Figure 1.
Overview of PolyOrigin. (A) Mating design of the three F1 populations derived from three parents: P1, P2, and P3, where population 3 was derived by self-pollinating P3. (B) The directed acyclic graph of the PolyOrigin model for the connected F1 populations in (A). The symbol Offspring$^{ij}$ denotes offspring j of population i. The squares denote the input marker data, the circles denote random variables to be inferred, and the arrows denote probabilistic relationships to be modeled. This panel is adapted from Figure 1 of Zheng et al. (2016). (C) The workflow consists of three steps. The second step is necessary only if the input map is a physical map.
Notations for the PolyOrigin model are introduced in the following description and summarized in Table 1. We use t to index a marker, p for a parent, i for an F1 population, and j for an individual in a given F1 population. Let L denote the number of parents and M the number of markers. Denote by $y_t^p$ the observed genotypic data for parent p at marker t, and by $y_t^{ij}$ the genotypic data for individual j of F1 population i at marker t. The PolyOrigin model has three kinds of hidden variables (Figure 1B): $h_t^p$ denotes the phased genotype for parent p at locus t, $x_t^{ij}$ denotes the phased origin-genotype for offspring (i, j) at marker t, and $v^{ij}$ denotes the valent (bivalent or quadrivalent) formation for producing offspring (i, j) from its parents. Here, the term origin-genotype denotes a combination of K parental origins, specifying the parental homologs from which the marker alleles in an offspring are derived; each parental homolog is regarded as a distinct origin, and there are in total KL possible parental origins in the population. The genotypic data $y_t^{ij}$ are modeled by applying a genotyping error model to the true phased genotype, which is obtained by replacing the parental homolog origins in the phased origin-genotype $x_t^{ij}$ with the corresponding marker alleles in the phased genotypes $h_t^p$.
Table 1.
List of symbols used in PolyOrigin and their brief descriptions
| Symbol | Description |
|---|---|
| t | Subscript for a marker |
| p | Superscript for a parent |
| (i, j) | Superscript for an offspring, individual j of F1 population i |
| L | Number of parents |
| N | Total number of offspring in connected F1 |
| M | Number of markers |
| D | Average number of sequence reads for an individual at a marker |
| S | Number of selfing populations in a mating design |
| K | Ploidy level, K = 4 for tetraploid |
| $P(x)$, $P(y \mid x)$ | Probability of x, conditional probability of y given x |
| $y_t^p$ | Observed genotypic data of parent p at marker t, $y^p = \{y_t^p\}_{t=1}^{M}$ |
| $y_t^{ij}$ | Observed genotypic data of offspring (i, j) at marker t, $y^{ij} = \{y_t^{ij}\}_{t=1}^{M}$ |
| $h_t^p$ | Hidden phased genotype of parent p at marker t, $h^p = \{h_t^p\}_{t=1}^{M}$ |
| $x_t^{ij}$ | Hidden origin-genotype of offspring (i, j) at marker t, $x^{ij} = \{x_t^{ij}\}_{t=1}^{M}$ |
| $v^{ij}$ | Hidden valent (bivalent or quadrivalent) formation of offspring (i, j) |
| $\varepsilon_t$ | Genotyping error probability at marker t, $\varepsilon = \{\varepsilon_t\}_{t=1}^{M}$ |
| ϵ | Sequencing read error probability |
| $h_t^{ij}$ | $h_t^p$ for the parents of offspring (i, j) |
| $z_t^{ij}$ | True dosage of offspring (i, j) at marker t, determined by $x_t^{ij}$ and $h_t^{ij}$ |
| d | Genetic distance in Morgans |
| $r_{bi}$ | Recombination fraction for bivalent formation, $r_{bi} = \frac{1}{2}\left(1 - e^{-2d}\right)$ |
| $r_{quad}$ | Recombination fraction for quadrivalent formation, $r_{quad} = \frac{3}{4}\left(1 - e^{-4d/3}\right)$ |
| $P(y_t^{ij} \mid x_t^{ij})$ | Likelihood of genotype; depends implicitly on $h_t^{ij}$ |
| $logl^{ij}$ | Individual marginal log-likelihood of offspring (i, j) |
| logl | Marginal log-likelihood |
| T | Temperature in the simulated annealing for refining local ordering |
| $P(x_t^{ij} \mid y^{ij})$ | Posterior probability of phased origin-genotype |
| | Posterior probability of unphased origin-genotype |
| fexch | Fraction of long-range disturbed markers |
| σlocal | Intensity of local disturbances in marker ordering |
The PolyOrigin algorithm consists of three steps (Figure 1C): parental phasing, map refinement, and ancestral inference. In the third step, HMM decoding and parental error correction are iterated until no errors can be detected. Parental phasing corresponds to the maximum likelihood estimation of $h^p$, and HMM decoding corresponds to the estimation of $x_t^{ij}$, averaging over all possible values of $v^{ij}$. In the following, we describe the HMM-based PolyOrigin model and the three steps of the PolyOrigin algorithm.
PolyOrigin model
Conditional on phased parental genotypes, offspring are independent of each other. For a given offspring (i, j) and its valent formation $v^{ij}$, the genotypic data $y^{ij}$ can be modeled by an HMM. The HMM is described by a genotype model specifying the probability of $y_t^{ij}$ given a hidden origin-genotype $x_t^{ij}$, conditionally independent among markers, and a parental origin process specifying the joint prior probability of $x^{ij} = \{x_t^{ij}\}_{t=1}^{M}$.
Genotype model:
At a locus t, the genotype likelihood $P(y_t^{ij} \mid x_t^{ij})$ depends implicitly on parental phases via the unknown true dosage $z_t^{ij}$, which is a deterministic function of the hidden origin-genotype $x_t^{ij}$ and the phased parental genotypes $h_t^{ij}$ for the parents of offspring (i, j). We consider three possible representations of the genotypic data $y_t^{ij}$. First, $y_t^{ij}$ is represented by a dosage with the likelihood given by

$$P(y_t^{ij} \mid x_t^{ij}) = (1 - \varepsilon_t)\, I(y_t^{ij} = z_t^{ij}) + \frac{\varepsilon_t}{K}\, I(y_t^{ij} \neq z_t^{ij}) \tag{1}$$

where ploidy level K = 4, the indicator function I(s) equals 1 if statement s is true and 0 otherwise, and $\varepsilon_t$ denotes the dosage error probability at marker t. If a dosage error occurs, the resulting dosage is randomly drawn from the other K possible dosages.

Second, $y_t^{ij}$ is represented by a pair of read counts. Let $r_1$ and $r_2$ be the counts of sequence reads for alleles 1 and 2, respectively, at marker t for offspring (i, j). Assume that the read counts $r_1$ and $r_2$ are generated by an unknown dosage $z'$ that is different from $z_t^{ij}$ with error probability $\varepsilon_t$, for example, because of the misalignment of reads to the reference genome. We sum out $z'$ to obtain the read count likelihood

$$P(y_t^{ij} \mid x_t^{ij}) = \sum_{z'=0}^{K} P(r_1, r_2 \mid z')\, P(z' \mid x_t^{ij}) \tag{2}$$

where $P(z' \mid x_t^{ij})$ can be obtained from Equation (1) by replacing $y_t^{ij}$ with $z'$, and $P(r_1, r_2 \mid z')$ can be obtained from the following binomial model

$$P(r_1, r_2 \mid z') = \binom{r_1 + r_2}{r_1}\, q^{r_1} (1 - q)^{r_2} \tag{3}$$

$$q = \left(1 - \frac{z'}{K}\right)(1 - \epsilon) + \frac{z'}{K}\, \epsilon \tag{4}$$

where q denotes the probability of a sampled read being allele 1, $z'/K$ denotes the probability of allele 2 being sampled, and ϵ denotes the sequencing error probability of observing the incorrect allele. The value of ϵ is fixed at a default setting, and the dependence of the likelihood on ϵ is not shown in Equation (3).
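Third, $y_t^{ij}$ is represented by the vector of probabilities $(p_0, \ldots, p_K)$ over the K + 1 possible dosages, a generalization of the first and second representations, and the data likelihood is given by

$$P(y_t^{ij} \mid x_t^{ij}) = \sum_{z'=0}^{K} p_{z'}\, P(z' \mid x_t^{ij}) \tag{5}$$

according to Equations (1) and (2), where $P(z' \mid x_t^{ij})$ is the dosage likelihood of Equation (1) evaluated at dosage $z'$. The probability vector can be calculated from Equations (3) and (4) for the read counts.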
For example, suppose that and for the two parents of offspring (i, j), and denotes that the offspring is descended from homologs 1 and 2 of parent P1 and homologs 6 and 8 of parent P2; we denote the four homologous chromosomes of the first parent P1 by , and for the second parent P2. Thus, the true phased genotype is 1112 and the true dosage is 1. If dosage . If read count , the probability vector is for , and thus . If probability vector . If is a missing value, .
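The genotype model can be illustrated with a minimal Python sketch. It assumes the error model stated above (an erroneous dosage is drawn uniformly from the other K dosages) and a binomial read model in which a read shows allele 1 either because allele 1 was sampled without error or because allele 2 was sampled with a sequencing error. The parental phased genotypes, the error probabilities, and all function names are hypothetical values chosen for illustration; they are not taken from PolyOrigin.

```python
from math import comb

K = 4  # ploidy level (tetraploid)

def true_dosage(origins, phased_parents):
    """Count allele 2 among the parental homologs named in the origin-genotype.

    origins: K homolog labels (1..4 for the first parent, 5..8 for the second).
    phased_parents: string of 8 alleles ('1' or '2') for homologs 1..8.
    """
    return sum(phased_parents[o - 1] == "2" for o in origins)

def dosage_likelihood(y, z, eps):
    """Probability of observed dosage y given true dosage z and error rate eps."""
    return 1.0 - eps if y == z else eps / K

def read_likelihood(r1, r2, z_prime, seq_err):
    """Binomial probability of read counts (r1, r2) given generating dosage z'."""
    q = (1 - z_prime / K) * (1 - seq_err) + (z_prime / K) * seq_err
    return comb(r1 + r2, r1) * q**r1 * (1 - q)**r2

def read_count_likelihood(r1, r2, z, eps, seq_err):
    """Sum over the unknown dosage z' that generated the reads."""
    return sum(
        read_likelihood(r1, r2, zp, seq_err) * dosage_likelihood(zp, z, eps)
        for zp in range(K + 1)
    )

# Hypothetical phased parents: P1 = 1122 (homologs 1-4), P2 = 1112 (homologs 5-8).
parents = "1122" + "1112"
z = true_dosage((1, 2, 6, 8), parents)  # homologs 1, 2 of P1 and 6, 8 of P2
print(z, dosage_likelihood(1, z, eps=0.01))
print(read_count_likelihood(r1=5, r2=1, z=z, eps=0.01, seq_err=0.001))
```

With these hypothetical parents, the offspring descended from homologs 1, 2, 6, and 8 has phased genotype 1112 and true dosage 1, so an observed dosage of 1 receives likelihood 1 − ε.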
Parental origin process:
Zheng et al. (2016) have described a discrete time Markov chain model for the parental origin process along the four homologs of an offspring in an F1 population. The same model can be used for an offspring resulting from selfing, except that the state space is different. A discrete time Markov chain model consists of two components: a discrete distribution of the states at the first marker t = 1, and a transition probability matrix describing how the states change from marker t to the next marker t + 1 for $t = 1, \ldots, M - 1$, so that the joint prior distribution is given by $P(x^{ij} \mid v^{ij}) = P(x_1^{ij} \mid v^{ij}) \prod_{t=1}^{M-1} P(x_{t+1}^{ij} \mid x_t^{ij}, v^{ij})$, because of the Markov approximation. Here, we summarize the two components.
The two gametes in an offspring are assumed to be produced independently. The initial distribution for a zygote can be obtained by the Kronecker product between the two initial distributions, one for each of the two gametes. Similarly, the transition probability matrix for a zygote can be obtained by the Kronecker product between the transition probability matrices for the two gametes. Denote by $v^{ij} = (v_1, v_2)$ the valent formation of offspring (i, j), with $v_1$ ($v_2$) for the first (second) gamete. We describe the parental origin process in a gamete, for example, the first gamete, conditional on a given value of $v_1$.
Denoting the four homologs of the gamete parent by (1, 2, 3, 4), $v_1$ can take four possible values: (1-2, 3-4), (1-3, 2-4), (1-4, 2-3), and (1-2-3-4), where the first three values denote bivalent formation and the last value denotes quadrivalent formation. For example, if $v_1$ = (1-2, 3-4), the initial distribution is assumed to be discrete uniform among gamete states (1, 3), (1, 4), (2, 3), and (2, 4). The transition probability matrix is given by the Kronecker product $T_{12} \otimes T_{34}$, where

$$T_{12} = \begin{pmatrix} 1 - r_{bi} & r_{bi} \\ r_{bi} & 1 - r_{bi} \end{pmatrix}$$

describes the transition between origins 1 and 2 along the homolog produced by the parental homolog pair (1, 2), and $T_{34}$ has the same form and refers to the transition between origins 3 and 4 for the homolog pair (3, 4). Along the resulting homolog, one parental origin changes into the other with probability $r_{bi}$, the inter-marker recombination fraction assuming bivalent formation. For quadrivalent formation $v_1$ = (1-2-3-4), the initial distribution is assumed to be discrete uniform among the 16 possible pairs of origins $(o_1, o_2)$ with $o_1, o_2 \in \{1, 2, 3, 4\}$, and the transition probability matrix is given by $T_{1234} \otimes T_{1234}$, where

$$T_{1234} = \begin{pmatrix} 1 - r_{quad} & r_{quad}/3 & r_{quad}/3 & r_{quad}/3 \\ r_{quad}/3 & 1 - r_{quad} & r_{quad}/3 & r_{quad}/3 \\ r_{quad}/3 & r_{quad}/3 & 1 - r_{quad} & r_{quad}/3 \\ r_{quad}/3 & r_{quad}/3 & r_{quad}/3 & 1 - r_{quad} \end{pmatrix}$$

describes the transition among origins 1–4 along each homolog produced by quadrivalent formation. This transition matrix is a statistical, rather than biological, model: along each homolog in the gamete, the parental origin changes randomly into one of the other three origins with equal probability $r_{quad}/3$, where $r_{quad}$ denotes the inter-marker recombination fraction. We assume there is no crossover interference and use Haldane’s map function (Haldane 1919; Luo et al. 2006),

$$r_{bi} = \frac{1}{2}\left(1 - e^{-2d}\right), \qquad r_{quad} = \frac{3}{4}\left(1 - e^{-4d/3}\right)$$

where d is the inter-marker genetic distance in Morgans.
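The construction of the transition matrices can be sketched in a few lines of Python. The sketch assumes the usual Haldane form for bivalents and its tetrasomic analog for quadrivalents, a 2 × 2 matrix per bivalent with off-diagonal entry r_bi, and a 4 × 4 matrix per quadrivalent homolog with off-diagonal entries r_quad/3; the function names are hypothetical and the code is not taken from PolyOrigin.

```python
from math import exp

def r_bi(d):
    """Recombination fraction under bivalent formation, d in Morgans (assumed Haldane form)."""
    return 0.5 * (1 - exp(-2 * d))

def r_quad(d):
    """Recombination fraction under quadrivalent formation (assumed tetrasomic analog)."""
    return 0.75 * (1 - exp(-4 * d / 3))

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def bivalent_matrix(d):
    """2 x 2 transition between the two origins of one bivalent."""
    r = r_bi(d)
    return [[1 - r, r], [r, 1 - r]]

def quadrivalent_matrix(d):
    """4 x 4 transition among origins 1-4 along one homolog of a quadrivalent."""
    r = r_quad(d)
    return [[1 - r if a == b else r / 3 for b in range(4)] for a in range(4)]

def gamete_transition(d, quadrivalent=False):
    """Transition matrix for the two homologs carried by one gamete."""
    m = quadrivalent_matrix(d) if quadrivalent else bivalent_matrix(d)
    return kron(m, m)

def zygote_transition(d, quad1=False, quad2=False):
    """Two gametes are produced independently, hence another Kronecker product."""
    return kron(gamete_transition(d, quad1), gamete_transition(d, quad2))

T = zygote_transition(0.01)  # both gametes from bivalents, markers 1 cM apart
print(len(T), len(T[0]))     # 16 x 16 for bivalent-bivalent formation
```

A gamete from a bivalent pairing has 4 states and one from a quadrivalent has 16, so the zygote matrix ranges from 16 × 16 to 256 × 256 depending on the valent formation.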
If an offspring is produced by crossing two different parents, with the four homologs of the second parent denoted by (5, 6, 7, 8), the valent formation $v_2$ takes four possible values: (5-6, 7-8), (5-7, 6-8), (5-8, 6-7), and (5-6-7-8). If the offspring is self-fertilized, $v_2$ takes the same set of values as those of $v_1$. The HMM state space for a selfed offspring is thus different from that of a cross-fertilized offspring, but the size of the state space and the transition probability matrix are the same.
The prior distribution for the valents assumes that the three possible bivalent pairings and the quadrivalent have equal probability 1/4. Preferential pairing and the frequency of quadrivalents can be estimated from the posterior distribution of valent formation $P(v^{ij} \mid y)$, conditional on all marker data.
PolyOrigin algorithm
Step 1 parental phasing:
We extend the phasing algorithm described in Zheng et al. (2016) from a single biparental F1 population to connected F1 populations. The phasing algorithm optimizes the log-likelihood logl, which combines the individual log-likelihoods $logl^{ij} = \log P(y^{ij} \mid h^{ij}, v^{ij}, \varepsilon)$ of all offspring. Here, $\varepsilon = \{\varepsilon_t\}_{t=1}^{M}$ denotes the genotyping error probabilities at all markers and $h^p = \{h_t^p\}_{t=1}^{M}$ the hidden phased genotypes of parent p at all markers, while the hidden origin-genotypes $x^{ij}$ are summed out in $logl^{ij}$. Note that logl depends implicitly on the marker ordering and inter-marker distances.
The phasing algorithm starts with the initialization of $h^p$ for each parent by randomly drawing $h_t^p$ from its prior distribution given $y_t^p$. For example, if dosage $y_t^p = 1$, $h_t^p$ follows a discrete uniform prior distribution among the four possible phased genotypes: 1112, 1121, 1211, and 2111. If the probability vector $y_t^p = (0.2, 0.5, 0.3, 0, 0)$, $h_t^p$ takes 1111 with probability 0.2, takes one of the four phased genotypes 1112, 1121, 1211, and 2111 with equal probability 0.125, and takes one of the six phased genotypes 1122, 1212, 1221, 2112, 2121, and 2211 with equal probability 0.05. If $y_t^p$ is a pair of read counts, it can first be transformed into a probability vector. If $y_t^p$ is a missing value, $h_t^p$ takes one of the 16 phased genotypes with equal probability 1/16.
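The initialization prior can be reproduced with a short Python sketch: each dosage probability is split equally among the phased genotypes compatible with that dosage. The function name and the list-based representation of the probability vector are assumptions made for illustration.

```python
from itertools import product
from math import comb

K = 4  # ploidy level (tetraploid)

def phased_genotype_prior(dosage_probs):
    """Map a probability vector over dosages 0..K to a prior over phased genotypes.

    A phased genotype is a string of K alleles ('1' or '2'); the probability of
    each dosage is divided equally among its comb(K, dosage) phased genotypes.
    """
    prior = {}
    for alleles in product("12", repeat=K):
        genotype = "".join(alleles)
        dosage = genotype.count("2")
        prior[genotype] = dosage_probs[dosage] / comb(K, dosage)
    return prior

# An observed dosage of 1 is a one-hot vector; a missing value corresponds to
# equal weight on all 2**K phased genotypes.
dosage_one = [0.0, 1.0, 0.0, 0.0, 0.0]
missing = [comb(K, z) / 2**K for z in range(K + 1)]

prior = phased_genotype_prior([0.2, 0.5, 0.3, 0.0, 0.0])
print(prior["1111"], prior["1112"], prior["1122"])
```

For the probability vector (0.2, 0.5, 0.3, 0, 0) this yields 0.2 for 1111, 0.125 for each single-dose phase, and 0.05 for each double-dose phase, up to floating-point rounding.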
After initialization, each phasing iteration alternates between maximizing over valent formations and maximizing over phased parental genotypes. First, independently for each offspring, the valent formation $v^{ij}$ is obtained by maximizing the individual log-likelihood $logl^{ij}$ with respect to $v^{ij}$, conditional on the phased parental genotypes $h^{ij}$. We consider only bivalent formation for computational efficiency, and it has been shown that allowing quadrivalent formation barely improves parental phasing (Zheng et al. 2016). We calculate the individual log-likelihood for a given $v^{ij}$ by the forward algorithm for HMM (Rabiner 1989). Second, sequentially for each parent $p = 1, \ldots, L$, we obtain the maximum possible $h^p$, conditional on valent formation for all offspring and phased genotypes for all the other parents. Specifically, we calculate a proposed phase $h^p$ that approximates the maximum possible phase, accept it if the target function logl is increased, and otherwise reject it and keep the current phase. We obtain the proposed $h^p$ in a forward-backward procedure, which can be adapted from the detailed description for a single F1 population (Zheng et al. 2016).
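The forward algorithm used for the individual log-likelihood can be sketched generically as follows. This is a textbook scaled forward recursion in the sense of Rabiner (1989), not the PolyOrigin implementation; the argument layout (one transition matrix per marker interval, one emission vector per marker) is an assumption made for illustration.

```python
from math import log

def forward_loglik(init, trans, emis):
    """Log-likelihood of an HMM by the scaled forward algorithm.

    init[s]:        prior probability of state s at the first marker
    trans[t][a][b]: probability of moving from state a at marker t to b at t + 1
    emis[t][s]:     likelihood of the data at marker t given state s
    """
    n_states = len(init)
    alpha = [init[s] * emis[0][s] for s in range(n_states)]
    loglik = 0.0
    for t in range(len(emis)):
        if t > 0:
            alpha = [
                sum(alpha[a] * trans[t - 1][a][b] for a in range(n_states)) * emis[t][b]
                for b in range(n_states)
            ]
        scale = sum(alpha)          # rescale to avoid numerical underflow
        loglik += log(scale)
        alpha = [a / scale for a in alpha]
    return loglik

# Toy two-state chain with three markers (hypothetical numbers).
init = [0.5, 0.5]
step = [[0.9, 0.1], [0.1, 0.9]]
emis = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
print(forward_loglik(init, [step, step], emis))
```

In the phasing step, such a log-likelihood would be evaluated once per candidate bivalent pairing of an offspring, and the pairing with the largest value retained.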
When the phasing iteration gets stuck such that the proposed parental phase for every parent is rejected, we delete markers that do not fit into the marker sequence. Because the number of markers deleted is negatively correlated with genotyping error probability, we estimate ε by maximizing the target function logl, prior to marker deletion, assuming that ε does not vary with markers. We perform the estimation of ε and marker deletion only once for the sake of computational efficiency. We delete markers using Vuong’s closeness test, a likelihood-ratio-based test that can be used for comparing two nonnested models (Vuong 1989). We calculate the Vuong test statistic for all markers simultaneously and delete those markers with P-values significant at 0.05.
A single phasing run stops if the parental phases do not change for 5 consecutive iterations, or the number of iterations reaches 30. To find the global maximum, we perform multiple phasing runs independently and select the one with the largest logl. We repeat phasing runs until the so-far maximum phases have been obtained 3 times or the number of runs reaches 10. In comparison with the TetraOrigin algorithm (Zheng et al. 2016), we decrease some default values such as the maximum number of phasing runs, because the differences among phasing runs may be caused by the parental errors, and the PolyOrigin algorithm has additional error correction in the ancestral inference step.
Step 2 map refinement:
The optional step of map refinement refines local marker ordering and inter-marker genetic distances. If the input map is a physical map, it is necessary to estimate genetic distances. If the input map was obtained from two-point linkage analyses, it can be further refined by the multi-point HMM analysis in this step, although refining the local ordering usually more than doubles the computational time.
Prior to map refinement, ancestral inference with parental error correction is performed to correct parental genotype errors and exclude outlier offspring. Conditional on the phased parental genotypes, map refinement iteratively updates local marker ordering, inter-marker genetic distance, valent formation $v^{ij}$, and marker-specific error probability $\varepsilon_t$. The estimation of $v^{ij}$ is the same as that in the parental phasing, except that quadrivalent formation is allowed. To decrease the effect of offspring genotyping errors, $\varepsilon_t$ is estimated by maximizing logl using the local Brent method (Brent 1973), sequentially for markers $t = 1, \ldots, M$, and markers whose estimated $\varepsilon_t$ exceeds a threshold are deleted. Similarly, each inter-marker distance is estimated by maximizing logl using the local Brent method (Brent 1973).
In each iteration, the local marker ordering is refined by sliding a window along the chromosome; the ordering refinement starts with window size 2, and the window size increases until no proposed ordering at the given window size is accepted during a scan along the chromosome. The ordering of markers within a sliding window is reversed with probability $\min\{1, \exp(\Delta logl / T)\}$, where $\Delta logl$ is the increase of logl due to the order reversal and T is the temperature in the simulated annealing (Kirkpatrick et al. 1983). The temperature T is set to 4 in the first iteration and is halved after each iteration.
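The window-reversal move can be sketched as follows. The sketch assumes the standard Metropolis acceptance rule of simulated annealing, in which a reversal that increases logl is always accepted and one that decreases it is accepted with probability exp(Δlogl / T); the function names and the list-based marker order are hypothetical.

```python
import math
import random

def temperature_schedule(n_iterations, t_start=4.0):
    """Temperature per iteration: starts at t_start and is halved each time."""
    return [t_start / 2**k for k in range(n_iterations)]

def accept_reversal(delta_logl, temperature, rng=random):
    """Metropolis rule: accept with probability min(1, exp(delta_logl / T))."""
    if delta_logl >= 0:
        return True
    return rng.random() < math.exp(delta_logl / temperature)

def reverse_window(order, start, size):
    """Return a copy of the marker order with one window of markers reversed."""
    window = order[start:start + size]
    return order[:start] + window[::-1] + order[start + size:]

order = [1, 2, 3, 4, 5, 6]
proposal = reverse_window(order, start=2, size=3)   # [1, 2, 5, 4, 3, 6]
print(temperature_schedule(4), proposal)
```

In an actual refinement loop, Δlogl would come from re-evaluating the HMM likelihood under the proposed order; that evaluation is omitted here.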
The map refinement can be divided into three stages with decreasing number of updating variables. The first stage updates local ordering, inter-marker distance, , and , and it changes into the second stage when and the maximum sliding window size equals 2. The second stage consists of two iterations: it updates only inter-marker distance and strips markers at a chromosome end if there exists a distance jump greater than 20 cM and the fraction of markers deleted is less than 5%. The third stage estimates inter-marker distances for selected skeleton markers in five iterations. The chromosome is divided into 50 segments, and the marker with the smallest is selected in each segment. The inter-marker distances in the final map are re-scaled piece-wisely, based on the estimated skeleton marker map.
Step 3 ancestral inference:
Conditional on phased parental genotypes and the refined genetic map, each offspring is analyzed independently with an HMM. The ancestral inference step iterates the estimation of the marker-specific $\varepsilon_t$, HMM decoding, and parental error correction until no further error corrections are made. The estimation of $\varepsilon_t$ is conditional on valent formation $v^{ij}$ for all offspring, and the estimations of $\varepsilon$ and $v^{ij}$ are the same as those in map refinement.
In the HMM decoding, the posterior probability $P(x_t^{ij} \mid y^{ij}, v^{ij})$ and the individual marginal likelihood $P(y^{ij} \mid v^{ij})$ are calculated by the forward-backward algorithm for HMM (Rabiner 1989), conditional on each of the 16 possible values of $v^{ij}$, allowing for quadrivalent formation. Assuming a discrete uniform prior distribution of $v^{ij}$, we can obtain the posterior distribution $P(v^{ij} \mid y^{ij})$ from the individual marginal likelihood according to Bayes' theorem (Gelman et al. 2013). Finally, we can obtain the phased origin-genotype probability

$$P(x_t^{ij} \mid y^{ij}) = \sum_{v^{ij}} P(x_t^{ij} \mid y^{ij}, v^{ij})\, P(v^{ij} \mid y^{ij}) \tag{6}$$

where the summation is over the 16 possible values of $v^{ij}$, and the dependencies on phased genotypes $h^{ij}$ for the parents of offspring (i, j) are not shown.
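Averaging over valent formations can be sketched in Python as follows. With a uniform prior, the posterior weight of each valent formation is proportional to its marginal likelihood, and the origin-genotype posterior is the weighted mixture of the per-valent posteriors. The dictionary representation of states and the toy numbers are assumptions made for illustration.

```python
from math import exp

def valent_posterior(logliks):
    """Posterior over valent formations under a uniform prior.

    logliks: marginal log-likelihood of the offspring data for each valent formation.
    """
    largest = max(logliks)                      # subtract the maximum for stability
    weights = [exp(value - largest) for value in logliks]
    total = sum(weights)
    return [w / total for w in weights]

def mix_over_valents(post_x_given_v, logliks):
    """Mixture of per-valent origin-genotype posteriors at one marker.

    post_x_given_v: one dict per valent formation, mapping an origin-genotype
    (a tuple of K homolog labels) to its posterior probability.
    """
    mixed = {}
    for weight, post_x in zip(valent_posterior(logliks), post_x_given_v):
        for state, prob in post_x.items():
            mixed[state] = mixed.get(state, 0.0) + weight * prob
    return mixed

# Toy example with two valent formations of equal marginal likelihood.
per_valent = [
    {(1, 3, 5, 7): 0.8, (1, 4, 5, 7): 0.2},
    {(1, 3, 5, 7): 0.4, (2, 3, 5, 7): 0.6},
]
print(mix_over_valents(per_valent, [-10.0, -10.0]))
```

Dictionaries are used because the set of admissible origin-genotypes differs between valent formations, so the per-valent posteriors do not share a common array index.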
In the parental error correction, we first perform dosage calling based on the HMM decoding. Specifically, we calculate the posterior probability of each dosage z by summing the conditional probability in Equation (6) over all $x_t^{ij}$ such that $z_t^{ij} = z$. The dosage is called as the most probable one if its posterior probability is larger than 0.5; otherwise, it is set to missing. Second, we detect suspicious markers at which the fraction of mismatches between called genotypes and observed offspring genotypes is larger than 0.15. Here, a mismatch refers to the input dosage being different from the called dosage, or the input probability of the called dosage being less than 0.01. Finally, at each suspicious marker t and for each parent p, we replace the current value of $h_t^p$ by the one with the minimum number of mismatches in offspring dosages, among all the 16 possible values of $h_t^p$, if the number of mismatches is decreased by at least 3.
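The dosage calling and the detection of suspicious markers can be sketched as follows, using the thresholds 0.5 and 0.15 quoted above. The sketch handles only dosage input, and it assumes that offspring with a missing call or a missing observation are left out of the mismatch fraction; that treatment of missing values, and the function names, are assumptions.

```python
def call_dosage(posterior, threshold=0.5):
    """Return the most probable dosage, or None (missing) if it is not confident.

    posterior: posterior probabilities of dosages 0..K for one offspring and marker.
    """
    best = max(range(len(posterior)), key=lambda z: posterior[z])
    return best if posterior[best] > threshold else None

def is_suspicious(called, observed, max_fraction=0.15):
    """Flag a marker whose offspring show too many called/observed mismatches."""
    pairs = [(c, o) for c, o in zip(called, observed) if c is not None and o is not None]
    if not pairs:
        return False
    mismatches = sum(c != o for c, o in pairs)
    return mismatches / len(pairs) > max_fraction

# Toy marker with five offspring (hypothetical posteriors and observations).
posteriors = [
    [0.0, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.4, 0.45, 0.15, 0.0],   # no dosage exceeds 0.5, so the call is missing
    [0.0, 0.1, 0.8, 0.1, 0.0],
    [0.7, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.95, 0.05, 0.0],
]
called = [call_dosage(p) for p in posteriors]
observed = [1, 2, 3, 0, None]
print(called, is_suspicious(called, observed))
```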
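The final output of ancestral inference is the unphased origin-genotype probability for all offspring at all markers, obtained by summing the corresponding phased origin-genotype probabilities $P(x_t^{ij} \mid y^{ij})$, where the unphased origin-genotype is given by the sorted value of $x_t^{ij}$. For example, an unphased origin-genotype of a cross-fertilized offspring (i, j) in which the two origins within each gamete differ corresponds to four phased origin-genotypes, which differ only in the order of the two origins within each gamete.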
In addition, we detect outlier offspring according to the estimated distribution of the number of recombination breakpoints. Specifically, for each offspring at each marker, the unphased origin-genotype is called as the most probable one if its posterior probability is larger than 0.6; otherwise, it is set to missing. For an offspring, we count the number of changes in origin-genotype along the four homologs of a linkage group after skipping the missing genotype calls, and obtain the number of recombination breakpoints by summing the number of changes over all linkage groups. An offspring is labeled as an outlier if $a > Q_3 + 3\,(Q_3 - Q_1)$, where $a = 2\sqrt{b + 3/8}$ is the Anscombe transform with b being the number of breakpoints in the offspring (Anscombe 1948), Tukey’s fence is set to 3 (Tukey 1977), and $Q_1$ and $Q_3$ are the lower and upper quartiles of the transformed values.
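The outlier rule can be sketched with the Python standard library. The Anscombe transform is the usual 2√(b + 3/8); applying only the upper Tukey fence (too many breakpoints) and using the inclusive quartile definition are assumptions of this sketch, and the breakpoint counts are made up.

```python
from math import sqrt
from statistics import quantiles

def anscombe(count):
    """Variance-stabilizing transform for a Poisson-like count."""
    return 2 * sqrt(count + 3 / 8)

def flag_outliers(breakpoints, fence=3.0):
    """Flag offspring whose transformed breakpoint count exceeds Q3 + fence * IQR."""
    transformed = [anscombe(b) for b in breakpoints]
    q1, _, q3 = quantiles(transformed, n=4, method="inclusive")
    upper = q3 + fence * (q3 - q1)
    return [value > upper for value in transformed]

# Hypothetical breakpoint counts for nine offspring; the last one is extreme.
counts = [10, 12, 11, 9, 13, 10, 11, 12, 60]
print(flag_outliers(counts))
```

An offspring with far more breakpoints than its siblings, as in the last entry, is the kind of individual that a sample mix-up or an unintended parent would produce.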
Algorithm evaluation
We evaluated the performance and robustness of PolyOrigin by extensive simulation using PedigreeSim (Voorrips and Maliepaard 2012) and updog (Gerard et al. 2018) with a custom-made R package wrapper called PedigreeSimR available at https://github.com/rramadeu/PedigreeSimR. We quantified parental phasing error as the fraction of estimated parental phases different from the true phases, and ancestral inference error was defined as 1 minus the posterior probability of the true unphased origin-genotype, averaged over offspring and markers.
Baseline setup:
We first set up default parameter values as a baseline. We simulated only one linkage group and first specified the true parental haplotypes. In the scenarios with fixed number of markers, the true parental haplotypes were given by the 32 real potato haplotypes described in Real Potato datasets. The genetic length is 149 cM, with the number of polymorphic markers varying from M = 201 in the first two parents (L = 2) to M = 258 for L = 8. In the scenarios with varying number of markers, we first simulated a genetic map and then true parental haplotypes at each marker. The inter-marker distances were first simulated from a Poisson distribution and then re-scaled to obtain the total genetic length of 100 cM, and the four homologous haplotypes of a parent were simulated by first randomly sampling a dosage and then randomly sampling a phased genotype compatible with the sampled dosage, independently at each marker.
We simulated two kinds of polysomic inheritance: (1) both preferential bivalent and quadrivalent formations were allowed, prefPairing = 0.5 and quadrivalents = 0.5, so that double reduction is possible; (2) only random bivalent formation was allowed, prefPairing = 0 and quadrivalents = 0, so that double reduction is not possible.
The true offspring genotypes were obtained by combining true parental haplotypes and simulated inheritance, from which observed genotypic data were obtained by applying an error model and a missing pattern. For SNP array dosage data, an error occurred in each parental or offspring dosage with probability , and the resulting dosage was set to one of the other dosages with equal probability. Each parental or offspring dosage was missing with probability 0.1. Sequencing data were simulated with average depth D = 5, 10,…, 80, sequencing error rate 0.005, allelic bias 0.7, and over-dispersion 0.005 (Gerard et al. 2018). A read depth equaled zero (i.e., missing data) with probability 0.1 and otherwise followed a Poisson distribution with mean .
The default mating design was a half-diallel design with L = 5 parents, where all 10 possible combinations of parents were crossed, and each cross produced an equal number of offspring.
Simulation scenarios:
We simulated four scenarios, according to their study purposes: (1) comparisons with previous methods, (2) effect of population design, (3) effect of genotyping design, and (4) robustness to errors in the marker map. In each scenario, a few parameters were varied while keeping the others at the baseline. For each parameter combination in the first scenario, we simulated 50 replicates to obtain confidence intervals of the estimation errors for the comparisons. For the other three scenarios, we simulated only 5 replicates because there are too many combinations, and 5 was sufficient to generate smooth trends.
To compare with MAPpoly (Mollinari and Garcia 2019; Mollinari et al. 2020) and TetraOrigin (Zheng et al. 2016), we simulated biparental F1 populations. Missing dosages in parents were not allowed, which is required by MAPpoly. We simulated SNP array data with population size varying from N = 10 to N = 200 and two kinds of polysomic inheritance: one with quadrivalents and the other without quadrivalents.
To study the effect of population design, we simulated SNP array data for four mating designs: linear design where each parent was crossed with the next, circular design differing from the linear design by an extra cross between the first and the last parents, star design where the first parent is crossed with each of the other parents, and diallel design where all pairs of parents were crossed. The naming of mating design is based on the un-directed graph representation of the connected F1 populations. We varied three parameters: the number L of parents, the number S of selfing populations, and the total population size N, one at a time, while keeping all other parameter values at the baseline. When increasing S from 1 to 5, the selfing population was created in order from parents 1 to 5.
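The four mating designs can be written down as edge lists of the graph representation, with a selfing shown as an edge from a parent to itself. The function below is a hypothetical helper for illustration (parents are numbered from 1, and the circular design assumes at least three parents); it is not part of PolyOrigin or PedigreeSim.

```python
from itertools import combinations

def mating_design(kind, n_parents, n_selfing=0):
    """List the F1 populations of a design as (parent, parent) pairs.

    kind: 'linear', 'circular', 'star', or 'diallel'.
    n_selfing: selfing populations are added in order from parent 1 onward.
    """
    parents = list(range(1, n_parents + 1))
    chain = [(a, a + 1) for a in parents[:-1]]
    if kind == "linear":
        crosses = chain
    elif kind == "circular":
        crosses = chain + [(parents[0], parents[-1])]
    elif kind == "star":
        crosses = [(parents[0], b) for b in parents[1:]]
    elif kind == "diallel":
        crosses = list(combinations(parents, 2))
    else:
        raise ValueError(f"unknown design: {kind}")
    selfings = [(p, p) for p in parents[:n_selfing]]
    return crosses + selfings

for kind in ("linear", "circular", "star", "diallel"):
    print(kind, mating_design(kind, n_parents=5))
```

With five parents, the diallel design yields the 10 crosses of the baseline setup, while the linear and star designs each yield 4 and the circular design yields 5.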
To study the effect of genotyping design, we simulated genotyping by SNP array and sequencing in the diallel designs with no selfings (S = 0), using simulated true parental haplotypes with various marker densities. The SNP array design aimed to study the robustness to genotyping error probability ε for two population sizes N = 50 and 200, with L = 5 parents. The sequencing design aimed to study the effect of read depth D and the number M of markers for three diallel designs with L = 2, 5, and 10 parents, the number of offspring per parent being fixed to 90 so that N = 180, 450, and 900, respectively.
To study the robustness to errors in the input marker map, we first simulated SNP array data in the diallel design with no selfings (S = 0) and L = 5 parents for two population sizes N = 50 and 200, using the true parental haplotypes with M = 242 markers. To study the effect of markers that are wrongly positioned in long range, we disturbed marker ordering by randomly selecting markers on one chromosome arm and markers on the other arm, and then exchanging them between the two arms. To study the effect of erroneous local marker ordering, we obtained a disturbed genetic map by ordering markers according to the sum of the true marker index t and a normally distributed random variable with mean 0 and standard deviation σlocal, while keeping the original marker locations.
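The two kinds of map disturbance can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the simulation code used in the study: the function names, the arm boundary index, the number of exchanged markers (`n_swap`), and the value of σlocal in the usage lines are all hypothetical.

```python
import random


def disturb_local(n_markers, sigma_local, rng):
    """Return a locally disturbed marker order (0-based true indices).

    Markers are re-ordered by the key t + Normal(0, sigma_local), where t is
    the true marker index; the map positions themselves are left unchanged.
    """
    keys = [t + rng.gauss(0.0, sigma_local) for t in range(n_markers)]
    return sorted(range(n_markers), key=lambda t: keys[t])


def disturb_long_range(n_markers, arm_boundary, n_swap, rng):
    """Exchange n_swap randomly chosen markers between two chromosome arms.

    Markers with index < arm_boundary are assumed to lie on the first arm.
    """
    order = list(range(n_markers))
    left = rng.sample(range(arm_boundary), n_swap)
    right = rng.sample(range(arm_boundary, n_markers), n_swap)
    for i, j in zip(left, right):
        order[i], order[j] = order[j], order[i]
    return order


if __name__ == "__main__":
    rng = random.Random(2021)  # fixed seed for reproducibility
    print(disturb_local(20, sigma_local=2.0, rng=rng))
    print(disturb_long_range(20, arm_boundary=10, n_swap=3, rng=rng))
```

Setting `sigma_local` to 0 returns the true order, and each long-range exchange displaces exactly two markers.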
Real potato datasets:
A set of 32 chromosome-length SNP haplotypes from potato were used as the true parental haplotypes to simulate populations and evaluate algorithm performance; see Supplementary Table S1. The 32 haplotypes correspond to chromosome group 4 of 8 tetraploid potato clones, genotyped with version 2 of the potato SNP array, which had 12 K markers (Hamilton et al. 2011; Felcher et al. 2012). The eight clones were mated in pairs to create four F1 populations (Endelman et al. 2018), and the software MAPpoly (Mollinari and Garcia 2019; Mollinari et al. 2020) was used for parental phasing.
In addition, a 3 × 3 half-diallel population in potato was used for evaluation; see Supplementary Table S2 for the dosage data with physical map, and Supplementary Table S3 for the mating design. Three parents (W6511-1R, W9914-1R, and Villetta Rose) were mated in all three pairwise combinations to create a total population of 434 clones (individual F1 population sizes of 162, 155, and 117). Clones were genotyped with version 3 of the potato SNP array, which had an additional 9 K markers from Vos et al. (2015) compared to version 2. Allele dosage was assigned using R package fitPoly (Voorrips et al. 2011; Zych et al. 2019) and 5078 markers distributed across all 12 chromosome groups remained after curation. Physical positions for the input map were based on the potato DMv4.03 reference genome (Potato Genome Sequencing Consortium 2011; Sharma et al. 2013).
Parameter setup:
For simulated data, local ordering and inter-marker distances were refined only when studying the robustness to errors in the input genetic map. For real potato data, PolyOrigin estimated the inter-marker distances, conditional on the input marker ordering. We set up TetraOrigin to have the same option values as those of PolyOrigin. The parameter setup of MAPpoly was provided by the first author of MAPpoly (M. Mollinari, personal communication). See the Supplementary materials for the detailed description of the parameter setup for running PolyOrigin, TetraOrigin, and MAPpoly. All analyses were performed on a desktop computer with an Intel(R) Xeon(R) W-2133 CPU @ 3.60 GHz processor and 32 GB of RAM.
Data availability
PolyOrigin has been implemented in Julia 1.5.3, and is freely available under the GNU General Public License 3.0 from the website: https://github.com/chaozhi/PolyOrigin.jl. See https://github.com/chaozhi/PolyOrigin_Evaluate for the scripts and the real and simulated data for comparing PolyOrigin with the previous methods; the real potato data are also available in Supplementary Tables S1–S3. Supplementary material is available at figshare: https://doi.org/10.25386/genetics.14745417.
Results
Comparisons with previous methods
PolyOrigin was compared with the existing software TetraOrigin and MAPpoly for a single F1 population considering quadrivalent formation, under which double reduction is possible (Figure 2). Both PolyOrigin and MAPpoly have very small phasing error when population size (Figure 2A), whereas the phasing error for TetraOrigin was around 0.02 because of the parental dosage errors in the simulated data (). PolyOrigin deleted those markers with parental dosage errors, while TetraOrigin has no function for removing markers and MAPpoly falsely removed more markers with decreasing population size (Figure 2B). Note that TetraOrigin may account for parental dosage errors by assuming a nonzero parental genotyping error probability, but this leads to much longer computation time.
Figure 2.
Comparisons of PolyOrigin with TetraOrigin and MAPpoly in a single F1 population considering quadrivalent formation (double reduction is possible). The error bars denote the 95% confidence intervals obtained from 50 replicates. (A, C) Estimation errors in parental phasing and ancestral inference, respectively. (B) Fraction of markers deleted. TetraOrigin has no marker deletion. The dashed lines denote the fraction of markers that are deleted and have no parental dosage errors. (D) Computational time in minutes.
TetraOrigin has slightly worse performance in ancestral inference than PolyOrigin (Figure 2C), resulting from its higher parental phasing error (Figure 2A). On the other hand, the poor performance of MAPpoly when compared to TetraOrigin and PolyOrigin is mainly because MAPpoly cannot correctly reconstruct progeny in regions with double reduction. When double reduction is not considered, MAPpoly has a smaller ancestral inference error, but it is still larger than that of PolyOrigin, mainly because of its larger phasing error (Supplementary Figure S2).
The computational time of TetraOrigin is around 4–6 times as long as that of PolyOrigin (Figure 2D). In comparison, the computational time of MAPpoly is around 4 times that of PolyOrigin for N = 200, and it becomes even longer with decreasing population size (Figure 2D).
Effect of population design
The effect of population design on parental phasing, considering the four design parameters—mating design, population size N, number of selfings S, and number of parents L, is summarized through the number of gametes contributed by each parent (Figure 3). Note that the number of gametes is the same as the number N/L of offspring produced by each parent in the case of no selfings (S = 0). The parental phasing error becomes very small when the number of gametes from each parent exceeds 30 (Figure 3). No noticeable differences between a single F1 population of size N and the collection of two independent selfing populations of size were observed for (Figure 3D).
Figure 3.
Effect of population design on parental phasing. Each dot denotes the result for each parent in each of five replicates, given each combination of the design parameter values. (A–C) Effect for the populations with varying number L of parents, number S of selfings, and population size N, respectively, for each of the four mating designs. Panels A and B include the results for two sizes of 50 and 100. (D) Effect for a biparental F1 and two independent selfing populations.
The effect of each design parameter on parental phasing is noticeable only for the small population size (Supplementary Figure S3). It is not unexpected that the parental phasing error increases with the number L of parents (Supplementary Figure S3A) and decreases with the total population size N (Supplementary Figure S3E). The star mating design performs much worse than the linear, circular, and diallel designs, particularly at the medium number S of selfings (Supplementary Figure S3C), where the numbers of gametes contributed by parents are more unequal than at the two extreme values of S. The effect of population design on ancestral inference mainly results from its effect on parental phasing (Supplementary Figure S4).
Effect of genotyping design
SNP array design:
The effect of dosage error probability ε is shown in Figure 4, A, C, and E for the diallel populations with sizes N = 50 and 200. PolyOrigin is robust to ε, except for small N = 50 and large (Figure 4, A and C). The fraction of markers deleted increases gradually with ε but it is always smaller than ε (Figure 4E), indicating that both marker deletion and parental error correction contribute to the robustness.
Figure 4.
Effect of dosage error probability ε and the number M of markers for SNP array dosage data in the diallel populations with no selfing (S = 0) and L = 5 parents. (A, C, and E) Effect of ε on parental phasing, ancestral inference, and marker deletion, respectively, with M = 200. (B, D, and F) Effect of M on parental phasing, ancestral inference, and marker deletion, respectively, with . The dashed lines in (E, F) denote the fraction of markers that are deleted and have no parental dosage errors.
The effect of the number M of markers is shown in Figure 4, B, D, and F. The parental phasing is robust to M except for small N = 50 and (Figure 4B), and the ancestral inference error decreases with M (Figure 4D). The fraction of markers deleted is independent of M and is always smaller than ε (Figure 4F).
Sequencing design:
The effect of read depth D (number of reads per marker per individual) and number M of markers is shown in Figure 5 for sequencing data in the diallel populations with L = 2, 5, and 10 parents, the total population size N being adjusted so that the number N/L of offspring per parent is fixed. The parental phasing is robust to read depth D and number M of markers, except for low D < 10 and small M < 250 (Figure 5, A and C). The ancestral inference error decreases with M up to 2000 and with D up to 20, and it levels off when D > 20 (Figure 5, B and D). The number L of parents has little effect on parental phasing and ancestral inference (Figure 5, C–F), if the population size N is increased proportionally, although the parental phasing error for L = 2 is slightly greater than that for L = 5 and 10 (Figure 5, C and E).
Figure 5.
Effect of read depth D and the number M of markers for sequencing data in the diallel populations with no selfing (S = 0) and L = 2, 5, and 10 parents. (A) Contour plot of the parental phasing error as a function of the number M of markers and read depth D. (B) Contour plot of the ancestral inference error as a function of M and D. (C, D) Effect of read depth D on parental phasing and ancestral inference, respectively, with M = 500. (E, F) Effect of read depth D on parental phasing and ancestral inference, respectively, with .
The effect of D and M, under the constraint that , is shown in Figure 5, E and F, where the product D × M denotes the total number of reads, or the sequencing cost per individual. The optimal strategy for decreasing ancestral inference error is to increase M instead of D under the cost constraint (Figure 5F); although the parental phasing error increases with M, it is still very small at M = 2000 or D = 5 (Figure 5E).
Robustness to errors in input map
The parental phasing is robust to long-range or local disturbances, especially for the larger populations (N = 200) (Figure 6, A and B). Although the parental phasing error gradually increases with the strength of long-range disturbance (dashed lines in Figure 6A), it can be greatly reduced (Figure 6A) by the default marker deletion procedure in the parental phasing step, which removed most long-range disturbed markers (Figure 6E). On the other hand, the robustness of parental phasing to local disturbances is not because of many disturbed markers being removed (Figure 6, B and F).
Figure 6.
Effect of disturbances in the input genetic maps in the diallel populations with no selfings (S = 0) and L = 5 parents. The left and right panels denote the effect of long-range and local disturbances, respectively. (A, B) Effect on parental phasing. The solid (or dashed) lines denote the results with (or without) marker deletion in the parental phasing step. (C, D) Effect on ancestral inference. The solid (or dashed) lines denote the results with (or without) refining marker ordering and spacing in the map refinement step, conditional on the default marker deletion in the parental phasing step. (E, F) Fraction of markers deleted in the parental phasing step. The dotted line denotes y = x.
In the presence of long-range or local disturbances, the effect of map refinement on ancestral inference was conditional on the parental phasing with marker deletion. In the presence of only long-range disturbances, the map refinement has essentially no effect on ancestral inference in the large populations, and even a negative effect in the small populations (N = 50) (Figure 6C). Nevertheless, in the presence of local disturbances, the map refinement can greatly improve ancestral inference, especially in the large populations (Figure 6D). These results are consistent with the changes in marker ordering accuracy due to the map refinement (Supplementary Figure S5, A–D), the length of the refined map being close to the true value except for strong disturbances (Supplementary Figure S5, E and F).
Evaluation with real data
PolyOrigin was applied to a 3 × 3 half-diallel population of autotetraploid potato. The inferred frequency of quadrivalents is 20% on average, ranging from 9 to 41% across the 12 chromosomes, and the frequencies of the three possible bivalent pairings for each parent were very similar, as expected for a true autopolyploid (Supplementary Figure S6). Of the 5078 markers, 18 were discarded due to poor fit, and 15 genotype errors were detected in the parents, 14 of which involved an allelic dosage difference of magnitude 1. Even though all 434 progeny had passed sensitive quality control tests for parentage based on the genome-wide markers (Endelman et al. 2017), PolyOrigin flagged 19 outlier offspring due to an excessive number of haplotype switches (Supplementary Figure S7).
PolyOrigin is capable of predicting a double reduction in offspring since it accounts for quadrivalent formation in meiosis. Figure 7A shows one such example offspring, and the double reduction events are visible as dark blue segments in linkage groups 2, 5, and 6. The predicted haplotypes from MAPpoly (Figure 7B) are similar to PolyOrigin except in regions of double reduction, where the MAPpoly solution tends to show a large number of haplotype switches (Supplementary Figure S7). The fraction of gametes with double reduction obtained by PolyOrigin increases from almost 0 at centromeres to the maximum 0.08 at telomeres (Figure 7C).
Figure 7.
Comparison of PolyOrigin with MAPpoly for the 3 × 3 potato diallel population. Dashed vertical lines denote chromosome boundaries. (A) Posterior probabilities obtained by PolyOrigin for the example offspring (W15268-27R). The darker the color, the higher the probability. (B) Posterior probabilities obtained by MAPpoly for the same example offspring. (C) Variation of double reduction along chromosome obtained by PolyOrigin. The y-axis denotes the fraction of gametes having two copies of the same parental haplotype, based on the maximum possible origin-genotypes of offspring at a given marker. (D) Comparison of the estimated genetic maps with the physical map. On the y-axis of (A, B), h1-h4 denote the homologs from the first parent (W6511-1R) of the offspring, and h5-h8 for the second parent (Villetta Rose).
Another notable difference between the PolyOrigin and MAPpoly solutions is the length of the genetic map (Figure 7D). The MAPpoly map was 18.1 Morgans compared to 12.2 Morgans for PolyOrigin, which is more similar to the estimates of 10–11 Morgans published in biparental linkage mapping studies (Massa et al. 2015; Bourke et al. 2016; Da Silva et al. 2017). One source of map inflation with MAPpoly appears to be elevated estimates of recombination frequency in the pericentromeric regions (Figure 7B). Even when the three F1 populations were analyzed separately with PolyOrigin, more accurate map lengths were obtained (Supplementary Figure S8).
PolyOrigin was faster than MAPpoly in analyzing the real potato data (N = 434 and M = 5078). The computational times were 7.5 hours for MAPpoly, 6.4 hours for PolyOrigin analyzing the three F1 populations jointly, and 3.1 hours for PolyOrigin analyzing the data separately. We did not use parallel computation in the analysis, although both PolyOrigin and MAPpoly can perform parallel computation at the chromosome level.
Discussion
We have developed a new method, implemented in PolyOrigin, for haplotype reconstruction in connected tetraploid F1 populations. PolyOrigin extends the previous HMM framework TetraOrigin (Zheng et al. 2016) from a single F1 to multiple F1 populations, including the possibility of selfed populations. Both PolyOrigin and TetraOrigin use a forward-backward procedure for parental phasing, whereas MAPpoly (Mollinari and Garcia 2019; Mollinari et al. 2020) uses two-point linkage analyses and a forward procedure for parental phasing in an F1 cross. This algorithmic difference may explain why MAPpoly did not work well for small population sizes.
In comparison to the basic steps of parental phasing and ancestral inference in TetraOrigin, PolyOrigin has added a procedure of marker deletion in the step of parental phasing. The marker deletion is based on Vuong's closeness test (Vuong 1989) with the likelihood calculated from the multi-locus HMM approach, which has been shown to be very effective at removing long-range misplaced markers and some markers with parental errors. In comparison, MAPpoly performs parental phasing by sequentially adding markers and removing markers that cannot be fitted into the chain, according to some limit parameters such as the maximum number of phase configurations to be tested. In general, the markers would be increasingly difficult to build into the chain, resulting in a nonuniform distribution of deleted markers.
PolyOrigin has also added a procedure of parental error correction in the step of ancestral inference. The procedure corrects parental dosages and phases by minimizing the number of mismatches between the observed and estimated genotypes in offspring, conditional on phased parent genotypes, which is computationally more efficient than introducing a parental dosage error parameter as in TetraOrigin. Not surprisingly, the error correction procedure is not effective in small populations, particularly at low sequencing depth.
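For readers unfamiliar with Vuong's closeness test, the following Python sketch computes the generic Vuong z-statistic for two non-nested models from per-observation log-likelihoods. It is not PolyOrigin's implementation: the function names, the inputs, and the choice of what counts as one observation (for example, one offspring) are illustrative assumptions, and no correction for model complexity is applied.

```python
import math
import statistics


def vuong_z(loglik_a, loglik_b):
    """Vuong z-statistic from per-observation log-likelihoods.

    Positive values favor model A, negative values favor model B. Under the
    null hypothesis that both models are equally close to the truth, the
    statistic is asymptotically standard normal.
    """
    if len(loglik_a) != len(loglik_b):
        raise ValueError("both models need the same observations")
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    n = len(diffs)
    omega = statistics.pstdev(diffs)  # standard deviation of the differences
    if omega == 0.0:
        raise ValueError("log-likelihood differences have zero variance")
    return sum(diffs) / (math.sqrt(n) * omega)


def two_sided_p(z):
    """Two-sided p-value of z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical per-observation log-likelihoods under two models.
    model_a = [-1.0, -2.0, -1.5, -1.0]
    model_b = [-2.0, -2.5, -3.0, -1.5]
    z = vuong_z(model_a, model_b)
    print(z, two_sided_p(z))
```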
Another quality-control feature implemented in PolyOrigin is the automated outlier detection of progeny with an excessive number of haplotype switches. In the simulated datasets, very few outliers were ever detected, which suggests a very small false discovery rate. However, we are unable to explain why 19 of the 434 potato progeny were outliers. The potato SNP array has been shown to be a powerful tool for detecting pedigree errors (Endelman et al. 2017), and all 434 progeny passed these quality control measures. Perhaps some of the complex chromosomal behavior possible in meiosis I is poorly captured by the genetic model in PolyOrigin. The average frequency of 20% quadrivalents in the potato population, with some variation between parents and chromosomes, is consistent with previous studies based on marker data (Bourke et al. 2015) and cytological techniques (Choudhary et al. 2020).
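To make the idea of a switch-based outlier check concrete, the sketch below counts haplotype switches along an inferred sequence of origin states and flags offspring whose counts exceed an upper Tukey fence. The fence rule, the multiplier `k`, and the function names are assumptions chosen for illustration; the rule actually applied by PolyOrigin is not described in this section.

```python
import statistics


def count_switches(path):
    """Count changes between consecutive origin states along a chromosome."""
    return sum(1 for prev, cur in zip(path, path[1:]) if prev != cur)


def flag_outliers(switch_counts, k=3.0):
    """Return indices of offspring whose switch count exceeds Q3 + k * IQR.

    The upper Tukey fence and the default k are illustrative assumptions.
    """
    q1, _, q3 = statistics.quantiles(switch_counts, n=4)
    fence = q3 + k * (q3 - q1)
    return [i for i, c in enumerate(switch_counts) if c > fence]


if __name__ == "__main__":
    # Hypothetical origin states (pairs of homolog labels) along one chromosome.
    path = [(1, 2), (1, 2), (1, 3), (1, 3), (1, 2)]
    print(count_switches(path))
    # Hypothetical switch counts for eight offspring; the last one is extreme.
    print(flag_outliers([2, 3, 2, 4, 3, 2, 3, 40]))
```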
To increase the robustness to dosage uncertainties at low sequencing depth, PolyOrigin has integrated a dosage calling procedure by analyzing read counts directly, where the probabilities of read counts given all possible dosages are calculated. These probabilities can also be provided by posterior dosage probabilities exported by software such as polyRAD (Clark et al. 2019) for sequencing data and fitPoly (Voorrips et al. 2011; Zych et al. 2019) for SNP array data. In comparison, TetraOrigin can analyze only dosage data, and MAPpoly cannot analyze read counts directly, relying instead on an input file with genotype probabilities.
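The probabilities of read counts given each possible dosage can be illustrated with a simple binomial model. The sketch below is an assumption-laden example rather than PolyOrigin's model: it assumes bi-allelic reads, a symmetric sequencing error rate (`seq_error`, with a hypothetical default), and no allelic bias or overdispersion.

```python
from math import comb


def read_count_likelihoods(n_ref, n_alt, ploidy=4, seq_error=0.005):
    """Return P(read counts | dosage d) for d = 0..ploidy.

    Dosage d is the number of copies of the alternative allele. Each read is
    assumed to show the alternative allele with probability
    (d / ploidy) * (1 - seq_error) + (1 - d / ploidy) * seq_error.
    """
    n_reads = n_ref + n_alt
    likelihoods = []
    for dosage in range(ploidy + 1):
        frac = dosage / ploidy
        p_alt = frac * (1.0 - seq_error) + (1.0 - frac) * seq_error
        lik = comb(n_reads, n_alt) * p_alt**n_alt * (1.0 - p_alt) ** n_ref
        likelihoods.append(lik)
    return likelihoods


if __name__ == "__main__":
    # Hypothetical marker with 7 reference reads and 3 alternative reads.
    for dosage, lik in enumerate(read_count_likelihoods(7, 3)):
        print(dosage, lik)
```

At low depth several dosages can have comparable likelihoods, which is why carrying these probabilities forward is more informative than a single hard genotype call.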
PolyOrigin allows flexibility in the mating and genotyping designs for linkage mapping projects. Our results show that the parental phasing error is less than 0.02 when the number of offspring per parent is . This implies that incomplete diallel designs, such as linear or star, can be used with similar performance to a complete diallel, which can be difficult to create due to reproductive limitations of the parents. We also show that because PolyOrigin effectively pools data across the entire chromosome, reliable genotype calls can be made in autotetraploids with much less read depth per marker, such as 10 or 20x, compared with values of 60–80x when genotype calls are made independently for each marker (Uitdewilligen et al. 2013; Matias et al. 2019). For the design of sequence-based genotyping platforms with a fixed number of markers (e.g., baits or amplicons) and reads per sample, we have shown that increasing the number of markers leads to more accurate results even though the number of reads per marker decreases.
Computationally, PolyOrigin scales linearly with the number of parents, population size, and the number of markers (Supplementary Figure S9). PolyOrigin is faster than TetraOrigin, mainly because TetraOrigin is implemented in Mathematica (Wolfram Research 2016) while PolyOrigin is implemented in Julia (Bezanson et al. 2017). MAPpoly was increasingly slower than PolyOrigin with decreasing population size, although MAPpoly is implemented in R (R Core Team 2019) and C/C++. This is probably because the computational time of MAPpoly critically depends on some parameter values such as “extend.tail,” the length of the chain’s tail that is used for likelihood calculation. Note that the parameter values of MAPpoly have been adjusted by the first author of MAPpoly, according to the information from two-point linkage analyses, such that both “extend.tail” and “phase.number.limit” (the maximum number of phase configurations) increase with decreasing population size (M. Mollinari, personal communication).
PolyOrigin has been implemented and tested for tetraploids, and most parts of the algorithm can be extended easily to higher ploidy levels. However, a stochastic algorithm would be needed to infer valent formation for hexaploids or higher, because the number of possible valents increases rapidly with ploidy level and the current implementation of PolyOrigin considers all possible configurations. For example, there are 105 possible bivalent pairings in an octoploid and thus 105² = 11,025 combinations for biparental populations, not to mention the demanding modeling and computational requirements for multivalent formation.
In conclusion, we have developed a novel method, PolyOrigin, for haplotype reconstruction in connected tetraploid F1 populations, which opens up exciting new possibilities for haplotype-based QTL mapping in such populations. It is recommended to design populations with offspring per parent, and thus unbalanced crosses such as the star design are not recommended in the case of small population size. Extensive evaluations have shown that PolyOrigin is robust to various sources of errors in input genetic data and that it works well for sequencing data with read depth as low as 10x.
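The count of 105 follows from the number of ways to split 8 homologs into 4 unordered pairs, which is the double factorial 7 × 5 × 3 × 1. The short Python sketch below (with a hypothetical function name) reproduces this count for several ploidy levels, including the three bivalent pairings of a tetraploid.

```python
def n_bivalent_pairings(ploidy):
    """Number of ways to pair `ploidy` homologs into bivalents: (ploidy - 1)!!"""
    if ploidy < 2 or ploidy % 2 != 0:
        raise ValueError("ploidy must be an even number >= 2")
    count = 1
    for k in range(ploidy - 1, 0, -2):
        count *= k
    return count


if __name__ == "__main__":
    for ploidy in (4, 6, 8):
        per_parent = n_bivalent_pairings(ploidy)
        # Two parents pair independently, so the configurations multiply.
        print(ploidy, per_parent, per_parent**2)
```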
Acknowledgments
The authors thank Marcelo Mollinari for help with using MAPpoly.
C.Z. developed the PolyOrigin model and algorithm and wrote the first draft of the manuscript. R.R.A. wrote the wrapper PedigreeSimR and helped with using MAPpoly. J.B.E. provided potato data and helped reshape the manuscript. All authors participated in study design and provided critical feedback. All authors read and approved the final manuscript.
Funding
Financial support was provided by USDA NIFA Award No. 2019-67013-29166.
Conflicts of interest
None declared.
Literature cited
- Anscombe FJ. 1948. The transformation of Poisson, binomial and negative-binomial data. Biometrika. 35:246–254.
- Bezanson J, Edelman A, Karpinski S, Shah VB. 2017. Julia: a fresh approach to numerical computing. SIAM Rev. 59:65–98.
- Bourke PM, van Geest G, Voorrips RE, Jansen J, Kranenburg T, et al. 2018. polymapR-linkage analysis and genetic map construction from F1 populations of outcrossing polyploids. Bioinformatics. 34:3496–3502.
- Bourke PM, Voorrips RE, Kranenburg T, Jansen J, Visser RGF, et al. 2016. Integrating haplotype-specific linkage maps in tetraploid species using SNP markers. Theor Appl Genet. 129:2211–2226.
- Bourke PM, Voorrips RE, Visser RGF, Maliepaard C. 2015. The double-reduction landscape in tetraploid potato as revealed by a high-density linkage map. Genetics. 201:853–863.
- Brent RP. 1973. Algorithms for Minimization without Derivatives. Courier Corporation, Chelmsford, MA.
- Broman K, Wu H, Sen S, Churchill G. 2003. R/qtl: QTL mapping in experimental crosses. Bioinformatics. 19:889–890.
- Broman KW, Gatti DM, Simecek P, Furlotte NA, Prins P, et al. 2019. R/qtl2: software for mapping quantitative trait loci with high-dimensional data and multiparent populations. Genetics. 211:495–502.
- Choudhary A, Wright L, Ponce O, Chen J, Prashar A, et al. 2020. Varietal variation and chromosome behaviour during meiosis in Solanum tuberosum. Heredity (Edinb). 125:212–215.
- Clark LV, Lipka AE, Sacks EJ. 2019. polyRAD: Genotype calling with uncertainty from sequencing data in polyploids and diploids. G3 (Bethesda). 9:663–673.
- Da Silva WL, Ingram J, Hackett CA, Coombs JJ, Douches D, et al. 2017. Mapping loci that control tuber and foliar symptoms caused by PVY in autotetraploid potato (Solanum tuberosum L.). G3 (Bethesda). 7:3587–3595.
- Endelman JB, Carley CAS, Bethke PC, Coombs JJ, Clough ME, et al. 2018. Genetic variance partitioning and genome-wide prediction with allele dosage information in autotetraploid potato. Genetics. 209:77–87.
- Endelman JB, Carley CAS, Douches DS, Coombs JJ, Bizimungu B, et al. 2017. Pedigree reconstruction with genome-wide markers in potato. Am J Potato Res. 94:184–190.
- Felcher KJ, Coombs JJ, Massa AN, Hansey CN, Hamilton JP, et al. 2012. Integration of two diploid potato linkage maps with the potato genome sequence. PLoS One. 7:e36347.
- Gelman A, Stern HS, Carlin JB, Dunson DB, Vehtari A, et al. 2013. Bayesian Data Analysis. Chapman and Hall/CRC, Boca Raton, FL.
- Gerard D, Ferrao LFV, Garcia AAF, Stephens M. 2018. Genotyping polyploids from messy sequencing data. Genetics. 210:789–807.
- Hackett CA. 2001. A comment on Xie and Xu: ’mapping quantitative trait loci in tetraploid species’. Genet Res. 78:187–189.
- Hackett CA, Pande B, Bryan GJ. 2003. Constructing linkage maps in autotetraploid species using simulated annealing. Theor Appl Genet. 106:1107–1115.
- Haldane J. 1919. The combination of linkage values and the calculation of distances between the loci of linked factors. J Genet. 8:299–309.
- Hamilton JP, Hansey CN, Whitty BR, Stoffel K, Massa AN, et al. 2011. Single nucleotide polymorphism discovery in elite North American potato germplasm. BMC Genomics. 12:302.
- Huang BE, Verbyla KL, Verbyla AP, Raghavan C, Singh VK, et al. 2015. MAGIC populations in crops: current status and future prospects. Theor Appl Genet. 128:999–1017.
- Kirkpatrick S, Gelatt CD, Vecchi MP. 1983. Optimization by simulated annealing. Science. 220:671–680.
- Luo Z, Hackett C, Bradshaw J, McNicol J, Milbourne D. 2001. Construction of a genetic linkage map in tetraploid species using molecular markers. Genetics. 157:1369–1385.
- Luo ZW, Zhang Z, Leach L, Zhang RM, Bradshaw JE, et al. 2006. Constructing genetic linkage maps under a tetrasomic model. Genetics. 172:2635–2645.
- Massa AN, Manrique-Carpintero NC, Coombs JJ, Zarka DG, Boone AE, et al. 2015. Genetic linkage mapping of economically important traits in cultivated tetraploid potato (Solanum tuberosum L.). G3 (Bethesda). 5:2357–2364.
- Matias FI, Xavier Meireles KG, Nagamatsu ST, Lima Barrios SC, Borges do Valle C, et al. 2019. Expected genotype quality and diploidized marker data from genotyping-by-sequencing of Urochloa spp. tetraploids. Plant Genome. 12:1–9.
- Mollinari M, Garcia AAF. 2019. Linkage analysis and haplotype phasing in experimental autopolyploid populations with high ploidy level using hidden Markov models. G3 (Bethesda). 9:3297–3314.
- Mollinari M, Olukolu BA, Pereira GD, Khan A, Gemenet D, et al. 2020. Unraveling the hexaploid sweetpotato inheritance using ultra-dense multilocus mapping. G3 (Bethesda). 10:281–292.
- Mott R, Talbot C, Turri M, Collins A, Flint J. 2000. A method for fine mapping quantitative trait loci in outbred animal stocks. Proc Natl Acad Sci U S A. 97:12649–12654.
- Potato Genome Sequencing Consortium. 2011. Genome sequence and analysis of the tuber crop potato. Nature. 475:189–195.
- Preedy KF, Hackett CA. 2016. A rapid marker ordering approach for high-density genetic linkage maps in experimental autotetraploid populations using multidimensional scaling. Theor Appl Genet. 129:2117–2132.
- R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
- Rabiner L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 77:257–286.
- Sharma SK, Bolser D, de Boer J, Sønderkær M, Amoros W, et al. 2013. Construction of reference chromosome-scale pseudomolecules for potato: integrating the potato genome with genetic and physical maps. G3 (Bethesda). 3:2031–2047.
- Tukey JW. 1977. Exploratory Data Analysis, Vol. II. Addison-Wesley, Reading, MA.
- Uitdewilligen JGAML, Wolters A-MA, D'hoop BB, Borm TJA, Visser RGF, et al. 2013. A next-generation sequencing method for genotyping-by-sequencing of highly heterozygous autotetraploid potato. PLoS One. 8:e62355.
- Voorrips RE, Gort G, Vosman B. 2011. Genotype calling in tetraploid species from bi-allelic marker data using mixture models. BMC Bioinformatics. 12:172.
- Voorrips RE, Maliepaard CA. 2012. The simulation of meiosis in diploid and tetraploid organisms using various genetic models. BMC Bioinformatics. 13:248.
- Vos PG, Uitdewilligen JG, Voorrips RE, Visser RG, van Eck HJ. 2015. Development and analysis of a 20k SNP array for potato (Solanum tuberosum): an insight into the breeding history. Theor Appl Genet. 128:2387–2401.
- Vuong QH. 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica. 57:307–333.
- Wolfram Research, Inc. 2016. Mathematica, version 11.0. Champaign, IL: Wolfram Research, Inc.
- Xie CG, Xu SH. 2000. Mapping quantitative trait loci in tetraploid populations. Genet Res. 76:105–115.
- Zheng C, Boer MP, van Eeuwijk FA. 2015. Reconstruction of genome ancestry blocks in multiparental populations. Genetics. 200:1073–1087.
- Zheng C, Voorrips RE, Jansen J, Hackett CA, Ho J, et al. 2016. Probabilistic multilocus haplotype reconstruction in outcrossing tetraploids. Genetics. 203:119–131.
- Zych K, Gort G, Maliepaard CA, Jansen RC, Voorrips RE. 2019. FitTetra 2.0-improved genotype calling for tetraploids with multiple population and parental data support. BMC Bioinformatics. 20:148.