BMC Plant Biology. 2025 Aug 25;25:1125. doi: 10.1186/s12870-025-07242-x

Benchmarking of low coverage sequencing workflows for precision genotyping in eggplant

Virginia Baraja-Fonseca 1, Andrea Arrones 1, Santiago Vilanova 1, Mariola Plazas 1, Jaime Prohens 1, Aureliano Bombarely 2, Pietro Gramazio 1
PMCID: PMC12379343  PMID: 40855264

Abstract

Background

Low-coverage whole-genome sequencing (lcWGS) presents a cost-effective solution for genotyping, particularly in applications requiring high marker density and reduced costs. In this study, we evaluated lcWGS for eggplant genotyping using eight founder accessions from the first eggplant MAGIC population (MEGGIC). We tested various sequencing coverages and minimum depth of coverage thresholds with two SNP callers, Freebayes and GATK. Reference SNP panels were used to estimate the percentage of common biallelic SNPs (i.e., true positives) relative to the low coverage datasets (accuracy) and the SNP panels themselves (sensitivity). Furthermore, the percentage of true positives with the same genotype across both datasets was calculated to assess genotypic concordance.

Results

Sequencing coverages as low as 1X and 2X achieved high accuracy but lacked sufficient sensitivity and genotypic concordance. However, 3X sequencing reached approximately 10% less sensitivity than 5X while maintaining genotypic concordance above 90% at any depth of coverage threshold. Freebayes outperformed GATK in terms of sensitivity and genotypic concordance. Therefore, we used this software to conduct a pilot test with some MEGGIC lines from the fifth generation of selfing, comparing their datasets with a gold standard. Sequencing coverages as low as 1X identified a substantial number of true positives, with 3X significantly increasing the yield, particularly at moderate depth of coverage thresholds. Additionally, at least 30% of the true positives were consistently genotyped in all lines when using coverages greater than 2X, regardless of the depth of coverage threshold applied.

Conclusions

This study highlights the importance of using a gold standard to reduce false positives and demonstrates that lcWGS, with proper filtering, is a valuable alternative to high-coverage sequencing for eggplant genotyping, with potential applications to other crops.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12870-025-07242-x.

Keywords: Eggplant (Solanum melongena), Genotyping, Low-coverage whole-genome sequencing (lcWGS), Bioinformatic pipeline, Benchmarking analysis, Gold standard (GS)

Background

Plant genomic characterization is a critical step in modern breeding programs, and essential for studying diversity, understanding domestication and recombination events, and identifying candidate regions linked to important agronomic traits [1]. Current methods for high-throughput genotyping in plants primarily involve two approaches: reduced representation sequencing (RRS) and whole-genome sequencing (WGS) [2]. However, low-coverage whole-genome sequencing (lcWGS) is gaining popularity as a genotyping strategy that combines the broad genomic coverage of WGS with the cost-efficiency of RRS [3, 4]. lcWGS allows for the identification of a dense marker set with a comprehensive representation of the entire genome, using very low sequencing coverage (< 10X) [5, 6]. This strategy has opened new possibilities for genotyping studies in various crops, including chickpea [7], rice [8], canola [5], soybean [6], tomato [9], radish [10], wheat [11] and potato [12], among others.

The primary strength of lcWGS lies in its cost-effectiveness, with costs scaling down in tandem with reduced sequencing coverage [4, 13]. Another valuable feature of this strategy is the concurrent reduction in data volume as sequencing coverage diminishes, expediting and streamlining bioinformatic analysis [5, 14]. Nevertheless, low sequencing coverages can lead to erroneous conclusions due to the limited information provided by the low number of reads. Potential challenges include: (1) genotype misclassification, (2) loss of genuine polymorphism, and (3) sequencing errors being erroneously classified as genetic variants [15]. To address these drawbacks, rigorous SNP filtering steps are critical [13, 16]. Furthermore, additional procedures are advised to eliminate false positive calls, such as using more than one SNP calling software [17, 18] and validating polymorphisms against a set of genetic variants assumed to be true (gold standard; GS), supported by a higher number of reads or validated in various independent studies [19–21].

Eggplant (Solanum melongena L.) is an economically significant crop ranking as the third most important solanaceous crop after potato and tomato in global production and the fifth among all vegetable crops [22]. Despite its economic importance, available genetic and genomic resources of eggplant have traditionally lagged behind those of other important vegetable crops [23]. However, noteworthy progress includes the development of the first and only multiparent advanced generation inter-cross (MAGIC) population in eggplant, known as MEGGIC [24]. To fully leverage MAGIC populations as valuable next-generation genomic resources, genotypic characterization is essential. The 5k Single Primer Enrichment Technology (SPET) genotyping platform [25], developed from the resequencing at 20X of its eight founders [26], was used to genotype 420 individuals from the third generation of selfing (S3MEGGIC), resulting in 7724 high-confidence SNPs [24]. Even though the SPET genotyping allowed the dissection of key genes for eggplant genetics and breeding [24, 27], the genetic characterization of the segregating individuals through variant identification and haplotype resolution was not fully comprehensive. Two principal limitations of this genotyping approach by amplicon sequencing are the limited number of variants that can be interrogated and their distribution, which are preferentially selected in the gene-rich chromosome arms to assess gene allelic diversity [25]. These issues can be addressed by lcWGS, as demonstrated in recent specific eggplant studies that created high-density recombination bin-based genetic maps and improved QTL mapping resolution [28, 29]. However, there remains a limited understanding of the impact of several parameters during data processing on its accuracy and sensitivity.

Thus, the principal aim of this study is to establish an optimized workflow for analysing low-coverage genomic data in eggplant. We used the MEGGIC founders to benchmark different combinations of sequencing coverages, minimum depth of coverage (DP) thresholds and SNP callers. A proof-of-concept validation of these findings was conducted on lines from the fifth generation of selfing of the eggplant MAGIC population (S5MEGGIC), which will facilitate the optimization of genomic diversity analyses in eggplant collections and populations. Furthermore, this study provides guidelines for selecting appropriate parameters in eggplant genomics analysis and presents a protocol that can be broadly applied across various crops and research objectives.

Methods

Plant materials, library preparation and resequencing

The plant materials used for the low-coverage sequencing benchmarking were the eight founders of the MEGGIC population, consisting of seven Solanum melongena accessions and one accession of the wild relative S. incanum (Additional file 1) [24]. All the accessions are maintained at the Universitat Politècnica de València (UPV) germplasm bank. In addition, to improve the precision of the genomic characterization of the MEGGIC population and to validate the lcWGS benchmarking results of this study, we used four randomly selected recombinant lines from the S5 generation (labelled for this study as S5-1, S5-2, S5-3 and S5-4). The S5 lines were obtained following a funnel scheme as described by Mangino et al. [24] (Additional file 1).

Seeds from the twelve samples (the eight founders and the four S5MEGGIC lines) were germinated in Petri dishes, following the protocol developed by Ranil et al. [30]. We then transferred them to seedling trays in a climatic chamber under a photoperiod and temperature regime of 16 h light (25 °C, 100–112 µmol m⁻² s⁻¹) and 8 h dark (18 °C). Total genomic DNA was extracted from approximately 100 mg of young leaves following the SILEX protocol described by Vilanova et al. [31]. We assessed the integrity and quality of the extracted DNA through agarose electrophoresis and a NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Wilmington, Delaware, USA). DNA concentration was determined with a Qubit® 2.0 fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). High-quality DNA samples (260/280 and 260/230 ratios > 1.8) were then shipped to the Beijing Genomics Institute (BGI Genomics, Hong Kong, China) for the construction of 150 bp paired-end libraries and subsequent sequencing using the DNBseq platform. The lcWGS targeted around 6.3 Gb of high-quality sequence reads per sample, approximating 5X genome coverage. Raw reads were filtered using SOAPnuke software [32] to remove adapters and low-quality reads (-n 0.001 -l 10 -q 0.4 --adaMR 0.25 --ada_trim). After receiving the trimmed reads, we performed quality control using FastQC (version 0.11.9) [33] to assess the effectiveness of the quality filtering (Fig. 1A and B).

Fig. 1

Bioinformatics pipeline for low-coverage genomic data analysis. Squares represent files and circles indicate procedures during the bioinformatic analysis. A Low-coverage whole genome sequencing benchmark workflow. MEGGIC founders’ 5X cleaned data were downsampled to 1X, 2X, 3X and 4X. Five replicates were generated for each downsampled level (R1-R5). After the mapping step, SNP calling was performed for each low-coverage dataset and founder using Freebayes (F) and GATK (G), followed by variant filtration based on the minimum depth of coverage (DP 1 to DP 10). The SNP datasets were validated using the corresponding reference SNP panel. B Proof-of-concept validation of the benchmark study. S5MEGGIC lines’ 5X cleaned data were downsampled to 1X-4X, and five replicates were generated for each level (R1-R5). After the mapping step, Freebayes was used to perform population-level SNP calling on the combined BAM files from all lines for each sequencing coverage and replicate. Variant filtration was based on the minimum depth of coverage (DP 1 to DP 10). The SNP datasets were validated using the gold standard (GS). Output datasets were labelled according to the sequencing coverage and the applied filter (e.g., 1X DP 1 indicates a sequencing coverage of 1X and a minimum depth of coverage threshold set at 1). C Reference SNP panels and GS preparation from MEGGIC founders’ 20X data. One reference SNP panel per genotype-SNP caller (Freebayes, F; GATK, G) combination was obtained. Additionally, a unique GS was obtained by performing population-level SNP calling using Freebayes followed by variant filtration

Downsampling and mapping

The original fastq files obtained from the 5X sequencing were utilized to simulate different lower-depth samples using seqtk tool (version 1.3-r106, https://github.com/lh3/seqtk). Each sample was computationally subsetted based on the number of reads. Average depths of 1X, 2X, 3X and 4X were produced (Fig. 1A and B). The same random seed (-s) was employed to preserve read pairing. For each simulated dataset, five replicates (R) were randomly generated (Fig. 1A and B).
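Why a shared seed keeps mates together can be illustrated with a minimal Python sketch. This is not seqtk's implementation; it only mimics the principle that an identical seed yields an identical selection of record positions in both mate files. The function names and the toy read labels are hypothetical.

```python
import random

def subsample_indices(n_reads, fraction, seed):
    """Choose which read positions to keep; deterministic for a given seed."""
    rng = random.Random(seed)
    k = round(n_reads * fraction)
    return sorted(rng.sample(range(n_reads), k))

def subsample_pairs(mate1, mate2, fraction, seed):
    """Apply the same seeded selection to both mate lists so pairs stay in sync."""
    assert len(mate1) == len(mate2), "paired files must hold the same number of reads"
    keep1 = subsample_indices(len(mate1), fraction, seed)
    keep2 = subsample_indices(len(mate2), fraction, seed)
    return [mate1[i] for i in keep1], [mate2[i] for i in keep2]

# Toy 5X dataset of 10 read pairs reduced to 2X, i.e. 2/5 of the reads.
r1 = [f"read{i}/1" for i in range(10)]
r2 = [f"read{i}/2" for i in range(10)]
sub1, sub2 = subsample_pairs(r1, r2, fraction=2 / 5, seed=11)
```

Under this scheme, replicates would be produced by changing the seed between runs while keeping it identical for the two mate files within a run.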

We mapped clean reads from the five gradient sequencing coverages for each sample against the v3.0 “67/3” high-quality eggplant reference genome [34] using BWA with its maximal exact match algorithm (BWA-MEM) (version v.0.7.17–r1188; Fig. 1A and B) [35]. The resulting alignment data were subsequently converted into BAM format using the SAMtools package (version 1.13) [36]. Mapping statistics, including the number of mapped and unmapped reads and the average depth of coverage, were recorded in the output file generated by the QualiMap application (version 2.2.1) [37]. Moreover, we used the ‘coverage’ function within the SAMtools package (version 1.13) [38] to calculate genome coverage. In-depth analysis of the spatial distribution of mapping depth across the genome, with a window size of 10 Kbp, was performed with the bamCoverage tool (version 3.5.1) [39]. Output data were visualized and graphically represented using the ‘plot’ function (version 3.6.2) in R (version 4.3.2) [40]. Finally, we marked PCR duplicates using the MarkDuplicates tool from Picard software (version 1.119; https://broadinstitute.github.io/picard/; Fig. 1A and B).
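The two summary quantities used here, mean depth per 10 Kbp window and the fraction of positions covered by at least one read, can be sketched in a few lines of Python. This is an illustration of the calculations only, not the internals of bamCoverage or SAMtools; the per-base depth vector and function names are hypothetical.

```python
def windowed_mean_depth(per_base_depth, window=10_000):
    """Mean depth for consecutive, non-overlapping windows along one sequence."""
    means = []
    for start in range(0, len(per_base_depth), window):
        chunk = per_base_depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means

def covered_fraction(per_base_depth):
    """Share of positions covered by at least one read."""
    return sum(1 for d in per_base_depth if d > 0) / len(per_base_depth)

# Toy sequence: an uncovered window, a typical window and a short over-sequenced tail.
depth = [0] * 10_000 + [2] * 10_000 + [30] * 5_000
window_means = windowed_mean_depth(depth)
fraction_covered = covered_fraction(depth)
```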

Polymorphism detection and data filtering

Variant calling in low-coverage founders’ data was carried out at the sample level (i.e., each sample independently for each sequencing coverage) using two different Bayesian-based software: Freebayes (version 1.3.6) [41] and GATK HaplotypeCaller (version 4.3.0.0) (Fig. 1A) [42]. BAM files from the mapping step were provided as input to both tools individually, using the -b and -I options, respectively. We ran both callers with default configuration settings, except for the minimum quality requirements for mapping and base, for which we set the threshold at 20. Biallelic SNPs, excluding monomorphic ones, were kept using BCFtools (version 1.13; https://samtools.github.io/bcftools/bcftools.html). To assess the impact of the minimum depth of coverage thresholds on polymorphism detection, we tested a range of thresholds from DP 1 to DP 10. This range was selected to capture sufficient genomic variation while balancing the need for accuracy in identifying true positives and minimizing false positives. This process generated 400 list-based SNP sets per low-coverage level-SNP caller combination. Specifically, this includes eight samples, five replicates, and 10 different minimum depth of coverage thresholds. For the 5X coverage level, 80 SNP sets were generated per SNP caller, as no replicates were included for this coverage (Fig. 1A).
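The dataset counts above follow directly from the experimental design, as the following short Python enumeration shows. The founder and replicate labels are placeholders; only the counts are taken from the text.

```python
from itertools import product

founders = [f"founder_{i}" for i in range(1, 9)]  # placeholder labels for the eight founders
replicates = [f"R{i}" for i in range(1, 6)]       # R1-R5
dp_thresholds = range(1, 11)                      # DP 1 to DP 10
downsampled_levels = ["1X", "2X", "3X", "4X"]

# One SNP set per founder x replicate x DP threshold at each downsampled level.
sets_per_downsampled_level = len(list(product(founders, replicates, dp_thresholds)))

# No replicates at 5X: one SNP set per founder x DP threshold.
sets_at_5x = len(list(product(founders, dp_thresholds)))

# Totals per SNP caller, and per founder within one caller.
sets_per_caller = len(downsampled_levels) * sets_per_downsampled_level + sets_at_5x
sets_per_founder_and_caller = sets_per_caller // len(founders)
```

The per-founder figure of 210 is the number of SNP datasets evaluated per sample-SNP caller combination in the Results.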

Variants of the four S5MEGGIC lines were identified at the population level (i.e., the four samples together) using Freebayes (Fig. 1B). BAM files from the mapping step were provided as a list of inputs, using the -L option. This process created a single output file containing information for all the samples. While we maintained the default configuration settings, we specifically set the thresholds for mapping and base quality to a minimum of 20. As with the founders’ data, biallelic SNPs were filtered by the minimum depth of coverage, ranging from DP 1 to DP 10. Finally, only polymorphic variants among the four lines were retained (Fig. 1B).
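The final step, retaining only sites that are polymorphic among the lines, can be sketched as follows. The exact criterion applied in the study is not spelled out, so this sketch assumes that a site is polymorphic when at least two distinct called genotypes occur among the samples and that missing calls ("./.") are ignored; the function name is hypothetical.

```python
def is_polymorphic(genotypes):
    """True if at least two distinct called genotypes occur among the samples.

    Genotypes are VCF-style strings such as "0/0", "0/1" or "1/1".
    Assumption: missing calls are skipped and allele order is irrelevant.
    """
    called = {
        tuple(sorted(gt.replace("|", "/").split("/")))
        for gt in genotypes
        if "." not in gt
    }
    return len(called) > 1

# Four S5 lines at three hypothetical sites.
site_a = ["0/0", "1/1", "0/0", "./."]  # differs between lines: kept
site_b = ["1/1", "1/1", "1/1", "1/1"]  # monomorphic among lines: dropped
site_c = ["0/1", "1/0", "0/1", "0/1"]  # same genotype, different allele order: dropped
```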

MEGGIC founders derived reference SNP panels and gold standard

To benchmark each MEGGIC founder’s combinations of lc-DP (low-coverage dataset and minimum depth of coverage threshold), two reference SNP panels were established for each MEGGIC founder from its 20X resequencing dataset (SRA BioProject PRJNA392603) (Fig. 1C) [26]. The cleaned reads underwent a quality control assessment using FastQC (version 0.11.9) [33]. Then, we conducted the mapping and polymorphism detection at the sample level using Freebayes (Freebayes SNP panel) and GATK (GATK SNP panel) as previously detailed for the low-coverage founders’ data. Biallelic SNPs supported by at least 20 reads were retained to ensure the most informative markers, while monomorphic ones were excluded (Fig. 1C).

In addition, we established a unique gold standard to select common biallelic SNPs identified in the S5MEGGIC lines. This process validated the benchmark results and determined the optimal combination of parameters for the genomic characterization of the S5MEGGIC population (Fig. 1C). The GS was derived by performing SNP calling at the population level, as previously described for the low-coverage S5MEGGIC lines’ data, using the 20X resequencing datasets from all eight founders with Freebayes (-q 20 -m 20 --limit-coverage 800). The resulting VCF file included both common and unique variants across all founder genomes. Biallelic SNPs supported by at least 20 reads were retained, and monomorphic sites among samples were removed (Fig. 1C).

Genotyping accuracy, sensitivity and concordance evaluation

To determine the optimal combination of tools and parameters in terms of variant discovery, we compared the low-coverage datasets (lc-datasets) from each founder to their respective Freebayes and GATK SNP panels, with the latter serving as the reference due to their superior read support (Fig. 1A). These comparisons were performed with the isec tool from the BCFtools package (version 1.13; https://samtools.github.io/bcftools/bcftools.html) with the default configuration, yielding only records with identical alleles between the reference SNP panel and the lc-dataset (bcftools isec -c none). The output files allowed the identification of polymorphic sites: (I) common between the reference SNP panel and the lc-dataset (true positives; TP), (II) private to the lc-dataset (PlcD), and (III) private to the reference SNP panel (PP). The isec tool was also used to accurately determine the common variants (TP) between the S5MEGGIC lines and the GS, assuming the genotypes in the full 20X data were correct due to the high number of reads supporting each variant (Fig. 1B).
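The three site classes can be illustrated with plain set operations in Python. This is a sketch of the logic only, not of bcftools itself: sites are keyed by chromosome, position, reference allele and alternate allele, so that only records with identical alleles count as shared, mirroring the behaviour described for `-c none`. The site tuples are invented for illustration.

```python
def partition_sites(lc_sites, panel_sites):
    """Split sites into TP (shared), PlcD (private to lc-dataset) and PP (private to panel)."""
    lc, panel = set(lc_sites), set(panel_sites)
    return {"TP": lc & panel, "PlcD": lc - panel, "PP": panel - lc}

# Each site is (chromosome, position, REF, ALT).
lc_dataset = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("1", 300, "G", "A")]
reference_panel = [("1", 100, "A", "G"), ("1", 200, "C", "G"), ("1", 400, "T", "C")]

classes = partition_sites(lc_dataset, reference_panel)
```

In this toy case the site at position 200 is not a true positive, because the alternate alleles differ between the two datasets; it is counted once as PlcD and once as PP.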

The first metric evaluated was accuracy, defined as the capability to correctly filter out potential false polymorphisms or identification errors. It was calculated as the ratio of true positives to the total number of polymorphisms identified in the sample (1).

\[ \text{Accuracy (\%)} = \frac{TP}{TP + PlcD} \times 100 \qquad (1) \]

Sensitivity, in turn, was assessed as the capability to correctly identify genuine polymorphisms within the genome. It was calculated as the ratio of true positives to the total number of polymorphisms identified in the reference SNP panels (2).

\[ \text{Sensitivity (\%)} = \frac{TP}{TP + PP} \times 100 \qquad (2) \]

Finally, genotypic concordance was calculated as the percentage of true positives in the lc-dataset that exhibited complete allele matches (homozygous-homozygous or heterozygous-heterozygous) with the genotype at the same site sequenced at 20X coverage. This measure reflected the ability to accurately assign genotypes even at low coverages.
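The three metrics can be expressed compactly in Python. The sketch below assumes that the total number of polymorphisms in the lc-dataset equals TP + PlcD and that the total in the reference SNP panel equals TP + PP, as implied by the site classes defined above; all counts and genotypes are invented for illustration.

```python
def accuracy(tp, plcd):
    """Percentage of lc-dataset SNPs that are also in the reference panel (Eq. 1)."""
    return 100 * tp / (tp + plcd)

def sensitivity(tp, pp):
    """Percentage of reference panel SNPs recovered in the lc-dataset (Eq. 2)."""
    return 100 * tp / (tp + pp)

def genotypic_concordance(lc_genotypes, ref_genotypes):
    """Percentage of true positives whose genotype matches the 20X genotype.

    Both arguments map a true-positive site to a VCF-style genotype string.
    """
    same = sum(
        sorted(lc_genotypes[site].split("/")) == sorted(ref_genotypes[site].split("/"))
        for site in lc_genotypes
    )
    return 100 * same / len(lc_genotypes)

# Hypothetical counts: 50 TP, 50 PlcD and 450 PP.
acc = accuracy(tp=50, plcd=50)
sens = sensitivity(tp=50, pp=450)

lc_gt = {"s1": "0/1", "s2": "1/1", "s3": "1/1", "s4": "0/1"}
ref_gt = {"s1": "0/1", "s2": "1/1", "s3": "0/1", "s4": "0/1"}
conc = genotypic_concordance(lc_gt, ref_gt)
```

Here site s3 is heterozygous at 20X but called homozygous at low coverage, so three of the four true positives are concordant.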

Statistical analysis

Multifactorial analysis of variance (ANOVA) was used to assess the effects of SNP caller, coverage level, and minimum depth of coverage threshold on accuracy, sensitivity and genotypic concordance. Then, we conducted post-hoc pairwise comparisons using Tukey’s Honest Significant Difference (HSD) test to identify specific differences between the levels of each factor. The significance level was set at p < 0.05. All analyses were carried out using Statgraphics Centurion 19 software (Statgraphics Technologies, Inc., The Plains, VA, USA).

Results

LcWGS and mapping

The 5X lcWGS of the seven S. melongena and one S. incanum founders of the MEGGIC population, along with four recombinant S5MEGGIC lines, yielded an average of 19.70 M reads per sample (Additional file 2). Of these, 96.38% were high-quality nucleotide bases, with a Q score greater than 20 (> Q20) (Additional file 2). To benchmark the impact of varying sequencing coverages, subsets ranging from 1X to 4X were simulated (Fig. 1A and B). The original 5X and low-coverage datasets were aligned against the “67/3” eggplant reference genome using BWA-MEM software. The average percentage of mapped reads was 96.40%, with no significant differences observed between the relative data of the 5X and low-coverage sets (Additional file 3). The BAM files derived from the original 5X sequences exhibited an average depth of coverage of 4.79, and the 1X, 2X, 3X, and 4X subsets of 0.96, 1.92, 2.87, and 3.83, respectively (Additional file 3).

As the average sequencing coverage increased, the proportion of the reference genome covered also expanded, although the increments became progressively smaller at higher coverages (Fig. 2A and Additional file 3). At 1X, 39.06% of the reference genome was covered, increasing to 55.20% at 2X and 63.05% at 3X. However, the gains diminished significantly at higher coverages, with only 66.54% and 68.52% covered at 4X and 5X, respectively (Fig. 2A). Surprisingly, a marginal increment of 4.58% was observed from 5X to 20X coverage (Fig. 2A).
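The diminishing returns can be checked directly from the reported percentages. The short Python calculation below uses only the values given in the text; the 20X figure is derived by adding the reported 4.58 percentage-point increment to the 5X value.

```python
# Percentage of the reference genome covered by at least one read (values from the text).
coverage_pct = {"1X": 39.06, "2X": 55.20, "3X": 63.05, "4X": 66.54, "5X": 68.52}

levels = list(coverage_pct)
increments = {
    f"{low}->{high}": round(coverage_pct[high] - coverage_pct[low], 2)
    for low, high in zip(levels, levels[1:])
}

# Genome coverage implied at 20X by the reported 4.58-point gain over 5X.
coverage_20x = round(coverage_pct["5X"] + 4.58, 2)
```

Each step adds roughly half or less of the gain of the previous one, and quadrupling the sequencing effort from 5X to 20X adds less than the step from 2X to 3X.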

Fig. 2

A Percentage of reference genome coverage at different sequencing coverages (1X to 5X, and 20X) and the increments (yellow bars) from one coverage to the next. Genome coverage refers to the percentage of the genome covered by at least one read. The data shown are averages calculated from five replicates of each genotype for 1-4X sequencing coverage and from the original datasets for each genotype for 5X and 20X coverage. Standard deviation is indicated by error bars (n = 60 for 1-4X, n = 12 for 5X and 20X). B Example of the distribution of mapped reads across chromosome 1 of parental A by sequencing coverage. The maximum cutoff is set at DP 20 for low coverages and DP 60 for 20X data. Peaks represent regions of high sequencing depth within 10 Kbp windows

The distribution of mapped reads across the sequencing coverages displayed a consistent pattern, characterized by regions of the genome lacking sequencing data (underserved regions) alongside areas subjected to excessive sequencing (over-sequenced regions) (Fig. 2B and Additional file 4).

In silico evaluation of LcWGS under varying parameters

To assess the effectiveness of lcWGS for polymorphism identification, we systematically tested several combinations of sequencing coverages (1X to 5X), depth thresholds (DP 1 to DP 10), and two widely-used SNP callers, Freebayes and GATK (Additional file 5). These comparisons aimed to determine which SNP caller and parameters combination provided the best balance between polymorphism yield and resource efficiency.

Regarding SNP caller performance, five key trends emerged from our analysis: (I) Freebayes identified more polymorphisms than GATK, particularly at higher sequencing coverages and lower DP thresholds (Additional file 6 and Additional file 7); (II) while Freebayes showed no significant differences between filtering at DP 1 and DP 2, GATK exhibited a slight reduction in biallelic SNPs, particularly at lower sequencing coverages (Fig. 3A and Additional file 7); (III) the reduction in the number of SNPs identified from DP 1 to DP 10 was more pronounced in GATK (88.92%) compared to Freebayes (81.63%), indicating a more aggressive filtering effect in GATK at higher DP thresholds (Fig. 3A); (IV) the depth of coverage plateau was reached earlier with GATK than with Freebayes (e.g., for 1X, the difference between filtering at DP 6 and DP 10 was 5.60% for GATK and 10.91% for Freebayes) (Fig. 3A), which aligns with the lower number of reads used by GATK (Additional file 8); and (V) GATK exhibited a higher scaling factor between sequencing coverages, meaning it identified a larger proportion of biallelic SNPs at each coverage level relative to the previous one (e.g., 2X:1X). Specifically, GATK identified a larger proportion of new SNPs as coverage increased from 2X to 3X (2.16) and from 4X to 5X (1.45), compared to Freebayes (1.97 and 1.34, respectively) (Fig. 3B). However, Freebayes slightly surpassed GATK when moving from 1X to 2X (3.31 vs. 3.29) (Fig. 3B).

Fig. 3

A Variation in the percentage of total biallelic SNPs when using different minimum depth of coverage thresholds (from DP 2 to DP 10) compared to DP 1 (set at 100%) for different sequencing coverages (1-5X) using Freebayes and GATK. Percentages represent the average across the eight founder accessions, with SD indicated by error bars (n = 40 for 1-4X and n = 8 for 5X). B Variation in the scaling factor, indicating the change in the number of total biallelic SNPs when increasing the sequencing coverage from one coverage to the next, using different minimum depth of coverage thresholds (from DP 1 to DP 10) with Freebayes and GATK. Values represent the average across the eight founder accessions, with SD indicated by error bars (n = 40)

Both sequencing coverage and depth of coverage threshold had a significant influence on the number of polymorphisms identified, with higher coverage and lower thresholds consistently yielding more biallelic SNPs (Additional file 7). However, the impact of the depth of coverage threshold was dependent on the sequencing coverage. Specifically, the differences could be attributed to how SNPs were distributed across different depth levels (Additional file 8). At lower coverage, the SNP reduction was more pronounced at lower DP thresholds than at higher ones (Fig. 3A). For example, at 1X with Freebayes (median = DP 3, and 3rd quartile ≤ DP 4), the SNP reduction from DP 2 to DP 3 was 38.54%, whereas the difference between DP 9 and DP 10 was only 1.15% (Fig. 3A). Conversely, at higher sequencing coverages, the impact of DP thresholds on SNP reduction was less significant. For instance, at 5X with Freebayes (median = DP 8, and 3rd quartile ≤ DP 11), filtering at DP 3 resulted in a 5.71% reduction compared to DP 2, which was closer to the 7.48% difference observed between filtering at DP 9 and DP 10 (Fig. 3A).

This trend was also observed in the scaling factor when transitioning from one sequencing coverage level to the next. When comparing the number of SNPs identified at 1X and 2X (2X:1X), the scaling factor was around 2 at DP 1 and DP 2, indicating that at 2X, the number of SNPs identified was double that at 1X (Fig. 3B). At higher depth of coverage thresholds, the scaling factor increased to 3 or 4 with both Freebayes and GATK (Fig. 3B). In contrast, the scaling factor was more consistent across DP thresholds at higher sequencing coverages. For example, when moving from 4X to 5X, the scaling factor ranged from 1.16 at DP 1 to 1.59 at DP 10 using Freebayes, and from 1.18 to 1.80 with GATK (Fig. 3B). Additionally, at lower sequencing coverages, a plateau was achieved for the scaling factor, but this was not observed at higher coverages (5X:4X), where the scaling factor increased steadily across depth of coverage thresholds.
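The scaling factor is simply the ratio of SNP counts between consecutive coverage levels at a fixed DP threshold. A minimal Python sketch follows; the SNP counts are hypothetical and serve only to show the calculation.

```python
def scaling_factors(snp_counts):
    """Ratio of SNP counts between consecutive coverage levels.

    snp_counts maps coverage labels to SNP counts, listed in ascending coverage order.
    """
    levels = list(snp_counts)
    return {
        f"{high}:{low}": snp_counts[high] / snp_counts[low]
        for low, high in zip(levels, levels[1:])
    }

# Hypothetical biallelic SNP counts at a single DP threshold.
counts_at_dp1 = {"1X": 100_000, "2X": 210_000, "3X": 315_000}
factors = scaling_factors(counts_at_dp1)
```

A factor of 2 for 2X:1X means that doubling the coverage doubled the number of SNPs; values approaching 1 at higher coverages indicate that additional sequencing yields few new SNPs.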

Prediction of the optimal combination of factors to perform LcWGS

Comparisons between the reference SNP panels and each SNP dataset across lc-DP combinations allowed for the determination of true positives, missing data, and genotype assignment errors, as well as the estimation of accuracy and sensitivity. These panels were generated from the 20X resequencing dataset of the MEGGIC founders [26], assuming that the genotypes were more accurately characterized due to the higher coverage. The total cohorts of biallelic SNPs constituting the Freebayes SNP panels ranged from 3.68 M to 7.59 M (Additional file 9). On average, these were 1.58 times greater than those identified by GATK, which ranged from 1.86 M to 4.93 M (Additional file 9).

Per sample-SNP caller combination, we evaluated 210 SNP datasets generated from 5X downsampling, observing a consistent increase in true positives with higher sequencing coverage and a reduction with stricter depth of coverage thresholds (Additional file 7). Freebayes consistently identified more true positives than GATK across all coverage levels, which aligned with results observed at 20X (Additional file 9). For instance, at DP 1, Freebayes outperformed GATK, identifying an additional 231.20 k TP at 1X, 574.71 k at 3X and 3.26 M at 5X. Similarly, at the more stringent DP 10 threshold, Freebayes detected more TP than GATK across coverages, with differences ranging from 11.20 k at 1X to 425.03 k at 5X (Additional file 7).

Accuracy trends differed between the two SNP callers (Additional file 10). At low depth of coverage thresholds (DP 1 to DP 4), Freebayes achieved higher average accuracy than GATK. While Freebayes exhibited values that ranged from 44.82% at 5X DP 1 to 49.87% at 1X DP 1, the average accuracy obtained with GATK at the same depth of coverage threshold showed less variability among sequencing coverages, ranging from 39.07% at 5X to 39.41% at 1X (Fig. 4A). However, at higher depth of coverage thresholds, GATK not only surpassed Freebayes in accuracy but also exhibited increased accuracy variability among sequencing coverages (Additional file 10). This trend was evident from DP 6 at 1X, continuing to increase up to DP 10 for 5X coverage (Fig. 4). Notably, Freebayes achieved its highest values (> 59%) at specific thresholds: DP 5–6 at 1X, DP 6–9 at 2X and DP 9–10 at 3X. In contrast, GATK did not reach a plateau, as its accuracy continued to increase at higher thresholds, although the rate of improvement diminished as the threshold rose (Fig. 4A and Additional file 10). Overall, the peak accuracy for Freebayes was observed at 3X DP 10, while GATK’s best performance was at 1X DP 10.

Fig. 4

Accuracy, sensitivity, and genotypic concordance and discordance achieved for each combination of sequencing coverage (1-5X) and minimum depth of coverage threshold (from DP 1 to DP 10) using Freebayes and GATK. Percentages represent the average across the eight founder accessions, with SD indicated by error bars (n = 40 for 1-4X and n = 8 for 5X). A Accuracy refers to the ability to correctly filter out potential false positives or identification errors, calculated as the ratio of true positives to the total number of identified polymorphisms. B Sensitivity represents the ability to detect genuine polymorphisms within the genome, calculated as the ratio of true positives to the total number of polymorphisms identified in reference SNP panels. C Percentage of true positives with identical genotypes between the lc-datasets and the reference SNP panels. D Percentage of heterozygous true positives misclassified as homozygous in the lc-datasets. E Percentage of homozygous true positives misclassified as heterozygous in the lc-datasets

In contrast, the trend in average sensitivity was consistent for both SNP callers, showing a decrease at higher depth of coverage thresholds and lower sequencing coverage (Fig. 4B and Additional file 10). In terms of performance, Freebayes outperformed GATK in sensitivity (Additional file 10). At 1X, Freebayes’ average sensitivity varied from 9.88% at DP 1 to 2.85% at DP 5 and 0.54% at DP 10, whereas GATK sensitivity ranged from 8.50% at DP 1 to 1.83% at DP 5 and 0.51% at DP 10. At 5X coverage, Freebayes achieved an even better performance than GATK with 2.30% higher sensitivity values at DP 1 (35.27% vs. 32.97% with GATK), 4.30% at DP 5 (30.45% vs. 26.16%) and 4.22% at DP 10 (15.36% vs. 11.14%) (Fig. 4B).

Among the common variants identified between the lcWGS datasets and the corresponding reference SNP panels, genotypic concordance was influenced by both SNP caller and sequencing parameters (Additional file 10). Freebayes consistently showed higher genotypic concordance under low coverage conditions compared to GATK (Fig. 4C). For instance, at 1X coverage, Freebayes peaked at 93.86% at DP 7, while GATK lagged behind, reaching a maximum concordance of around 85.90% at DP 6 for the same coverage. At higher coverages (5X), Freebayes maintained superior genotypic concordance, achieving a maximum of 98.72% at DP 10, while GATK’s maximum concordance was 1.45% lower than that of Freebayes (Fig. 4C).

Discrepancies between the genotypes of true positives and those in the reference SNP panels were primarily due to heterozygous loci being misclassified as homozygous in the lc-dataset (Fig. 4D). This error decreased with increasing minimum depth of coverage thresholds for both callers. With Freebayes, the percentage of misclassified heterozygous loci fell below 5% starting at DP 5 across all coverages, whereas GATK required higher coverage (3X and above) to achieve similar levels of concordance (Fig. 4D). Homozygous misclassifications were minimal for both callers (Fig. 4E).
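A simple sampling argument suggests why heterozygous sites are the main source of discordance at low depth. Assuming both alleles are sampled with equal probability and ignoring sequencing errors, the chance that all d reads at a truly heterozygous site carry the same allele is 2 × 0.5^d = 0.5^(d − 1). The Python sketch below computes this idealized expectation and classifies the type of discordance; it is not how Freebayes or GATK assign genotypes, and the function names are hypothetical.

```python
def p_het_seen_as_hom(depth):
    """Probability that all reads at a true heterozygous site show a single allele.

    Assumes equal allele sampling and no sequencing errors.
    """
    return 0.5 ** (depth - 1)

def _alleles(genotype):
    return sorted(genotype.replace("|", "/").split("/"))

def is_het(genotype):
    first, second = _alleles(genotype)
    return first != second

def classify(lc_genotype, ref_genotype):
    """Label a true positive by comparing its low-coverage and 20X genotypes."""
    if _alleles(lc_genotype) == _alleles(ref_genotype):
        return "concordant"
    if is_het(ref_genotype) and not is_het(lc_genotype):
        return "het_as_hom"
    if not is_het(ref_genotype) and is_het(lc_genotype):
        return "hom_as_het"
    return "other"

expected = {depth: p_het_seen_as_hom(depth) for depth in (1, 2, 5, 10)}
```

Under these assumptions a single read can never reveal heterozygosity, while at five reads the expected misclassification rate is already about 6%, which is in line with the drop observed at higher DP thresholds.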

Validation of lcWGS results with MAGIC lines

As a proof-of-concept validation of the benchmark performed with the MEGGIC founders, the assessment was extended to four S5MEGGIC lines (Fig. 1B), characterized by an intricate mosaic genome background from the MEGGIC founders (Additional file 1). As with the founders, the four S5 lines were sequenced at 5X and downsampled from 4X to 1X with five replicates (Fig. 1B). For the SNP calling, Freebayes was used, guided by insights obtained from the benchmarking analysis of the founders’ data. To assess the extent of potential false positives identified at each lc-DP combination, a comparative analysis was conducted using a gold standard (GS) as a reference (Fig. 1B). The GS comprised a combination of shared and private founder variants from the 20X resequencing dataset, obtained by performing a Freebayes SNP calling at the population level (Fig. 1C). This process yielded a total of 17,069,371 biallelic SNPs, considered reliable polymorphisms due to high read support, with their distribution provided in Additional file 11.

Shared polymorphisms between S5MEGGIC lc-datasets and the gold standard (i.e., true positives) are the most informative in determining the most convenient lc-DP combination to maximize characterization accuracy and resource optimization. As expected, the number of true positives identified increased with the sequencing coverage and decreased at higher depth of coverage thresholds (Fig. 5 and Additional file 12). As in the founders’ benchmark (Fig. 3), the decline in true positives slowed at higher sequencing coverage. Fixing DP 1 as 100% of TP for each sequencing coverage, 5X DP 10 still retained 21.67% of the TP versus 8.26% at 3X DP 10 and only 1.59% at 1X DP 10 (Additional file 12). Nevertheless, the proportion of true positives relative to the total biallelic SNPs identified for each lc-DP combination (%TP) did not follow the same trend and differed for each sequencing coverage (Additional file 13). The %TP increased slightly at higher sequencing coverage, and its plateau shifted towards higher depth of coverage thresholds as coverage was added. Thus, at 1X, the highest %TP was 51.76%, observed at DP 3, while at 5X it was 61.70%, observed at DP 10 (Additional file 13). Consideration should therefore be given to whether applying higher DP thresholds is advantageous. While this approach may reduce false positives and increase confidence in calling heterozygous loci, it could also result in a lower proportion of true positives relative to the total biallelic SNPs identified.

Fig. 5.

Variation in the total number of polymorphisms when using different minimum depth of coverage thresholds (from DP 1 to DP 10) across different sequencing coverages (1-5X) using Freebayes. The total number of polymorphisms is broken down into potential false positives and true positives, with true positives being those shared between the samples and the gold standard. True positives are categorized based on genotyping completeness in the four S5MEGGIC lines: fully genotyped (0% missing data), partially genotyped in three lines (25% missing data), and in two lines (50% missing data). Values represent the average across the four S5MEGGIC lines, with SD indicated by error bars (n = 20 for 1-4X and n = 4 for 5X)

Additionally, missing data for each lc-DP combination was assessed, as it strongly affects downstream analysis (Fig. 5 and Additional file 14). SNPs with 100% missing data after the filtering step were excluded from the total SNP count used to calculate the percentage of TP with 0%, 25% and 50% missing data. At low depth of coverage thresholds (DP ≤ 7), sequencing at 5X achieved the highest percentage of true positives without any missing data. Beyond this threshold, the highest rate was obtained with 1X sequencing coverage, even though it was the coverage at which the fewest true positives were identified (Fig. 5). At 1X, the %TP genotyped in all individuals varied from 39.52% at DP 1 to 45.96% at DP 10, with a minimum value of 19.40% at DP 3, where sites genotyped in two out of four individuals peaked at 51.17% (Additional file 15). As sequencing coverage increased, the minimum percentage of sites genotyped in all individuals was achieved with higher depth of coverage thresholds, with a similar trend for the peak of sites genotyped in two individuals. For example, at 3X coverage, the %TP without missing data varied from 83.94% at DP 1 to 37.20% at DP 10, with a minimum value of 32.17% achieved at DP 6. At 5X coverage, it ranged from 95.84% at DP 1 to 37.59% at DP 10 (Additional file 15). The percentage of sites genotyped in three individuals remained more constant across all thresholds, except for high coverages and low depth of coverage thresholds (Additional file 15).
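The missing-data categories described above can be sketched as follows. This is an illustrative Python example rather than the code used in the study: the function name `genotyping_completeness`, the site identifiers, and the use of `None` for a genotype lost after depth filtering are assumptions. Sites missing in every line are dropped before percentages are computed, as stated in the text.

```python
from collections import Counter
from typing import Dict, List, Optional


def genotyping_completeness(
    calls: Dict[str, List[Optional[str]]], n_lines: int = 4
) -> Dict[int, float]:
    """Percentage of true positives genotyped in n lines, for n = n_lines down to 1.

    `calls` maps a site identifier to one genotype per line; None marks a
    genotype that is missing after the depth filter.
    """
    counts: Counter = Counter()
    for genotypes in calls.values():
        if len(genotypes) != n_lines:
            raise ValueError("every site needs one entry per line")
        n_called = sum(g is not None for g in genotypes)
        if n_called > 0:  # sites with 100% missing data are excluded from the total
            counts[n_called] += 1
    total = sum(counts.values())
    return {
        n: (100.0 * counts[n] / total if total else 0.0)
        for n in range(n_lines, 0, -1)
    }


# Toy data for four lines: 4 called = 0% missing, 3 = 25% missing, 2 = 50% missing.
toy_calls = {
    "site1": ["1/1", "1/1", "0/0", "1/1"],
    "site2": ["1/1", None, "0/0", "1/1"],
    "site3": [None, None, "0/0", "1/1"],
    "site4": [None, None, None, None],  # dropped: missing in all lines
    "site5": ["1/1", "0/0", "1/1", "0/0"],
}
completeness = genotyping_completeness(toy_calls)
```

Because the denominator shrinks as sites fall out entirely, the share of fully genotyped sites can rise at stringent thresholds even while the absolute number of true positives falls, which is the behaviour reported for 1X.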

Discussion

Accurately and comprehensively identifying genetic variation is crucial for advancing studies in genetic diversity, trait mapping, and breeding within plant genomics. Experimental populations, such as MAGIC lines, are particularly valuable tools for generating genetic diversity and identifying genomic regions linked to traits of interest [24, 43, 44]. To fully exploit these populations, it is essential to achieve an unbiased and accurate genetic representation of the mosaic genomes present in each line, a goal often constrained by the limitations of the genotyping technology employed [24, 45]. In this study, we benchmark the application of lcWGS for high-throughput eggplant genotyping, focusing on both characterizing experimental populations and applying this method to broader genotyping efforts. As lcWGS is relatively novel in plant genomics, we evaluated key parameters—such as sequencing coverage, depth thresholds, and SNP callers—to assess their influence on polymorphism identification. This analysis benefited from the data generated from the WGS of the eight MEGGIC founders, providing a robust foundation for comparison and optimization [24, 26]. The results were further validated using S5MEGGIC recombinant fixed lines, offering valuable insights into the broader utility of lcWGS for high-resolution genotyping.

Our results support lcWGS as a promising strategy to address the technical and economic limitations inherent in the main high-throughput genotyping methods used in eggplant, such as SPET [25], GBS [46], and WGS [47], offering a synthesis of their respective advantages [13]. Despite its use of low sequencing coverage, lcWGS provides comprehensive genome representation and identifies a significant number of variants, largely due to the absence of a genome complexity reduction step [48, 49]. We observed that an increase in the sequencing coverage above 5X resulted in only small increments in genome coverage. Specifically, the difference in reference genome coverage between 5X and 20X was only 4.58%. This implies the presence of regions in the genome that were not sequenced, even at high coverage, or could not be aligned against the reference genome. This may be due to unassembled regions in the reference genome or the presence of repetitive sequences, which correspond to 12% and 73% of the v3.0 eggplant reference genome “67/3”, respectively [34]. These findings align with the presence of under-sequenced and over-sequenced regions, which were consistent across the lc-datasets and the 20X data, despite being generated in different experiments.

Several studies have demonstrated that genome coverages as low as 1X in lcWGS are sufficient for association analysis, identifying a significant number of high-quality polymorphisms [12, 50, 51]. Even lower sequencing coverages, such as 0.20X and 0.03X, have been successfully used in wheat for precision mapping of key traits [52], and as low as 0.02X was applied in rice for population characterization and QTL analysis [53]. In our study, sequencing coverages from 1X to 5X also enabled the identification of a substantial number of SNPs, making it a viable approach for diverse biological analyses, despite the significant percentage of missing loci compared to higher sequencing coverage. As noted in previous work [5, 11, 14, 54], the number of detected SNPs increased with coverage, although the total number varied depending on the applied minimum depth of coverage threshold. Raising the DP threshold reduced the number of SNPs, as variants with low read support were filtered out, improving reliability [13]. However, the effect of the applied threshold on the percentage of SNPs removed depends on the sequencing coverage. For instance, applying a DP 3 filter has a more pronounced effect at low coverage (e.g., 1X) than at high coverage (e.g., 5X), because, at lower coverage, a larger proportion of variants will be supported by 3 or fewer reads compared to higher coverage levels. This highlights the importance of understanding the distribution of SNPs across varying depths of coverage to determine optimal DP thresholds that balance coverage depth and the number of variants identified.
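The dependence of a DP filter on sequencing coverage can be illustrated with an idealized model. The sketch below assumes that per-site read depth follows a Poisson distribution whose mean equals the sequencing coverage. That is a textbook simplification, not a result of this study, and the uneven coverage noted above means real data will deviate from it. The function name is hypothetical.

```python
import math


def fraction_below_threshold(mean_coverage: float, min_dp: int) -> float:
    """Share of covered sites (depth >= 1) whose depth is below `min_dp`.

    Assumes depth ~ Poisson(mean_coverage), an idealization of real data.
    """
    if mean_coverage <= 0 or min_dp < 1:
        raise ValueError("mean_coverage must be > 0 and min_dp >= 1")
    probs = [
        math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
        for k in range(min_dp)
    ]
    covered = 1.0 - probs[0]        # probability that a site has at least one read
    return sum(probs[1:]) / covered  # depths 1 .. min_dp - 1, among covered sites


# Expected share of covered sites that a DP 3 filter would discard.
removed_at_1x = fraction_below_threshold(1.0, 3)
removed_at_5x = fraction_below_threshold(5.0, 3)
```

Under this model a DP 3 filter discards most covered sites at 1X but only a small minority at 5X, in line with the qualitative argument made in the paragraph above.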

Sequencing coverage represents a compromise between cost and the number of polymorphisms detected [13, 55]. The additional cost must be weighed against the gain in polymorphisms, which is influenced by the chosen depth of coverage threshold. Along with a fixed library preparation cost, the price of sequencing a genome size of ~3 Gb at 1X coverage is at present around $18, based on [56]. For the 1.21 Gb eggplant genome [57], this translates to around $7 at 1X coverage. Doubling the coverage to 2X doubles the cost, but our results show that this at least doubled the number of identified polymorphisms across all depth of coverage thresholds. Further increases in coverage, such as from 3X to 4X or from 4X to 5X, yielded proportional increases in polymorphism detection starting from a certain DP threshold, but did not double the number of SNPs identified. This aligns with findings by Liu et al. [54], where the rate of SNP detection slowed as coverage increased. Although increasing coverage does provide more data, this does not necessarily lead to increased accuracy. In fact, we observed similar accuracy across different coverage levels, with a plateau occurring at specific DP thresholds depending on the coverage. This observation is consistent with the work of Song et al. [58], who assessed genotype accuracy under depth of coverage thresholds ranging from DP 5 to DP 15 at low coverage levels of 5X and 10X in Crassostrea gigas. They found that accuracy exceeded 95% with DP 5 and reached up to 97% with DP 10 for both coverage levels. However, they also observed that accuracy at 5X coverage started to decrease beyond DP 12.
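The cost figures above follow from a linear scaling with genome size and coverage. The short Python sketch below reproduces that arithmetic using only the values quoted in the text ($18 for ~3 Gb at 1X, and a 1.21 Gb eggplant genome). The function name is hypothetical, and the fixed library preparation cost is deliberately left out because the text does not quantify it.

```python
def sequencing_cost_usd(
    genome_size_gb: float,
    coverage: float,
    ref_price_usd: float = 18.0,      # price quoted in the text for ~3 Gb at 1X
    ref_genome_size_gb: float = 3.0,
) -> float:
    """Sequencing cost excluding library preparation, assuming linear scaling."""
    price_per_gb_at_1x = ref_price_usd / ref_genome_size_gb
    return price_per_gb_at_1x * genome_size_gb * coverage


# Eggplant genome (1.21 Gb) at the coverages evaluated in the study.
eggplant_costs = {cov: sequencing_cost_usd(1.21, cov) for cov in (1, 2, 3, 4, 5)}
```

At 1X this gives about $7.26, matching the "around $7" stated above, and each additional 1X of coverage adds the same amount under the linear assumption.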

One of the primary challenges in lcWGS is the discrimination between homozygous and heterozygous sites. As noted by Bayer et al. [7], the correlation between the number of aligned reads and the number of heterozygous SNPs for an individual increased with sequencing coverage. We evaluated genotypic concordance for each lc-DP combination with both SNP callers and observed an improvement as sequencing coverage and minimum depth of coverage threshold increased, up to a certain point. This is consistent with findings in animal genomics, observed in species such as Canis lupus and C. familiaris [48, 59]. With Freebayes, concordance exceeded 90.00% at 3X, 4X and 5X coverage when filtered by any DP threshold. Using GATK, genotypic concordance exceeded 80.00% at these coverages. We found that non-concordant sites were typically misclassified as homozygous when their true genotype was heterozygous, due to the failure to detect both alleles in a small number of sequence reads. Kardos and Waples [59] estimated that, at 3X, 4X and 5X coverages, the likelihood of failing to detect one of the two alleles at a heterozygous locus is approximately 0.25, 0.12 and 0.06 in C. lupus. Thus, the percentage of heterozygous genotypes at 1X was lower than at 5X, which suggests that heterozygosity was underestimated at low coverages.
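The probabilities cited from Kardos and Waples [59] are consistent with a simple binomial argument, sketched below in Python. The assumptions are that exactly n reads cover a heterozygous site and that each read samples either allele with probability 0.5. Whether [59] derived the values this way is not stated here, so the sketch should be read as one model that reproduces them. The function name is hypothetical.

```python
def allele_dropout_probability(n_reads: int) -> float:
    """Probability that all n reads at a heterozygous site show the same allele.

    Each read is assumed to sample one of the two alleles with probability 0.5,
    so P = 2 * 0.5**n = 0.5**(n - 1).
    """
    if n_reads < 1:
        raise ValueError("at least one read is required")
    return 0.5 ** (n_reads - 1)


# Exact values for 3, 4 and 5 reads: 0.25, 0.125 and 0.0625.
dropout = {n: allele_dropout_probability(n) for n in (1, 2, 3, 4, 5)}
```

With a single read the second allele can never be observed, which is why heterozygous sites are expected to be called as homozygous most often at the lowest coverages and depth thresholds.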

When it comes to bioinformatic tools used in lcWGS, the choice of read aligner, while less influential on the accuracy of variant discovery than variant callers, still plays a crucial role in overall data quality [60]. Among the available options, BWA-MEM has stood out in different benchmarking studies as one of the best read mappers [18, 61, 62]. However, the debate regarding the optimal variant caller in terms of performance remains unresolved [63]. In this study, we compared two widely used SNP callers, Freebayes and GATK, to determine the best one in terms of the number of TP identified and genotypic concordance when working with lc-data. Freebayes operates as a haplotype-based variant caller detecting polymorphisms based on the sequence content of reads aligned to particular genomic targets [41]. On the other hand, GATK functions as an alignment-based variant caller that detects polymorphisms by locally assembling haplotypes in active regions, relying on precise read alignment to a reference genome [42]. Both tools use probabilistic models, incorporating Bayesian inference to assess the likelihood of genotypes at each position. In comparison, SAMtools, another widely used variant calling software, is a heuristic tool that relies on predefined rules to call variants based on read alignments and other basic parameters (e.g. read depth) [38]. This approach does not capture the complexity of variant calling as effectively as probabilistic methods. In our set of materials, Freebayes identified more TP than GATK at both high and low sequencing coverages, which is in agreement with other studies [18, 64]. However, this trend is not always consistent. For example, Ni et al. [65] and Liu et al. [54] identified 1.31 M and 142.45 k more SNPs using GATK than Freebayes within 8X and 10X whole-genome sequencing data of chicken, respectively. Additionally, the impact of increasing DP threshold on the reduction of the percentage of SNPs was less pronounced for Freebayes, likely due to its use of a higher number of reads compared to GATK [64].

Regarding the accuracy, sensitivity, and genotypic concordance achieved by each tool with various combinations of lc-DP, we found significant differences between the two SNP callers. Our findings indicate that Freebayes is more inclusive in detecting true positives, thus increasing the sensitivity of variant calling, consistent with findings by Yao et al. [18]. The trade-off, however, is that Freebayes also reported more potential false positives, which could reduce its overall accuracy when compared to GATK. This higher number of variants detected by Freebayes, while beneficial for sensitivity, highlights the need for filtering potential false positives, an aspect that was addressed in our study using a gold standard for variant validation. Our results align with Stegemiller et al. [64], showing that Freebayes assigned more correct genotypes compared to the 20X data than GATK. However, all these parameters evaluated are also dependent on the specific data used. For example, Ni et al. [65] found that the set of SNPs obtained from whole-genome sequence data in chicken achieved higher accuracy and genotypic concordance using GATK compared to Freebayes. Similarly, comparisons of variant calling tools for the analysis of Arabidopsis thaliana NGS data [61] and microbial genomes [66] showed that GATK demonstrated greater accuracy and sensitivity than the other programs evaluated. This is consistent with the study carried out by Liu et al. [54], which found GATK to be superior at both low and high coverages. In summary, the selection of SNP calling programs should be assessed individually for each specific case. Based on our results, we chose Freebayes for SNP calling the S5MEGGIC lines. We prioritized higher sensitivity and genotypic concordance, as well as time efficiency [67], due to the nature of the analysis we conducted.

We also evaluated the use of lcWGS for the genomic characterization of the S5MEGGIC lines. Due to the additional challenges associated with lc-data, following the same pipeline as with high-coverage data is not advisable. Although the DP threshold allowed us to balance retaining TP with complete genotypic concordance and discarding potential false positives, it may not be sufficient. In the clinical sector, the use of a GS to filter lc-data is widely employed and benefits from a plethora of benchmark datasets [68, 69]. Conversely, its adoption in the breeding sector has been gradual, with reference standards mainly accessible for model species like rice [70], Arabidopsis [71], corn [72], soybean [73], and wheat [74]. Given the absence of an established gold standard for eggplant, we developed our own, taking advantage of the available 20X data from the MEGGIC founders [26], which represents the genetic diversity of the population. However, the genetic diversity among the founders was not uniform, which led us to exclude multiallelic SNPs, ensuring that only the most consistent polymorphisms, shared across the majority of the founders, were included in the GS. Unlike SNP calling for the development of reference SNP panels, variant identification within the founders for the GS preparation was conducted in a single step, utilizing all the information collectively. This approach benefited our data as Freebayes leverages information from multiple samples to confidently call variants and address gaps where data from a single sample may be insufficient or ambiguous. More importantly, this method allowed us to obtain complete genotype information for all the samples evaluated from any polymorphic site [41]. The final dataset used to identify TP in the four S5MEGGIC lines contained 17.07 M SNPs, providing a substantial amount of high-quality data to serve as the gold standard.

The comparison between the GS and the four S5MEGGIC lines revealed a significant presence of potential false positives at low-coverage levels, highlighting the necessity of employing a GS to retain only TP for subsequent analysis [11, 50]. While the accuracy values, or %TP, were similar between the lc-data from the founders and the lines, the reduction at higher depth of coverage thresholds was more pronounced in the lines’ data. This highlights the benefits of considering all S5MEGGIC lines collectively during SNP calling for the accurate identification of both true and false positives. On the other hand, the application of a depth of coverage threshold helped to remove potential false positives, but, beyond a certain threshold, true positives were also lost. For 1X coverage, a threshold of DP 3 is recommended to capture 69,741 TP with significant genotypic concordance (80–90%) and no missing genotypes. For 3X coverage, minimum depth of coverage thresholds between DP 3 and DP 7 were efficient, capturing between 902,465 and 169,157 TP with genotypic concordance greater than 90% and no missing genotypes. Although these thresholds may appear low, the use of a gold standard for filtering allowed us to effectively eliminate potential false positives. If the GS-based filtering does not adequately remove these variants, a higher depth of coverage threshold may be necessary to ensure more reliable results. However, applying a high threshold increased the likelihood of loci lacking genotype information across all lines, especially at low coverage, where each locus is supported by only a few reads [5, 59]. On the other hand, at 1X coverage, the percentage of true positives without missing data increased beyond DP 3, as the total number of true positives began to decline sharply. In contrast, at 5X coverage, a higher threshold, untested in this study, would be required compared to 1X coverage to increase the proportion of true positives without missing data.
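The two filtering layers discussed above, a minimum depth threshold followed by comparison with the gold standard, can be sketched in Python as below. This is a simplified illustration and not the study's pipeline: the `Call` record, the function names, and the choice to match variants on chromosome, position, reference and alternative alleles are all assumptions made for the example.

```python
from typing import List, NamedTuple, Set, Tuple


class Call(NamedTuple):
    """One biallelic SNP call in a low-coverage dataset (hypothetical record)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    depth: int


VariantKey = Tuple[str, int, str, str]


def split_by_gold_standard(
    calls: List[Call], gold_standard: Set[VariantKey], min_dp: int
) -> Tuple[List[Call], List[Call]]:
    """Apply the DP threshold, then separate true positives from potential false positives."""
    passed = [c for c in calls if c.depth >= min_dp]
    true_pos = [c for c in passed if (c.chrom, c.pos, c.ref, c.alt) in gold_standard]
    potential_fp = [c for c in passed if (c.chrom, c.pos, c.ref, c.alt) not in gold_standard]
    return true_pos, potential_fp


def percent_tp(true_pos: List[Call], potential_fp: List[Call]) -> float:
    """%TP: true positives relative to all SNPs retained by the DP filter."""
    total = len(true_pos) + len(potential_fp)
    return 100.0 * len(true_pos) / total if total else 0.0


# Toy data: the call on chromosome 2 has an alternative allele absent from the GS.
toy_calls = [
    Call("1", 100, "A", "G", 4),
    Call("1", 200, "C", "T", 1),
    Call("1", 300, "G", "A", 3),
    Call("2", 50, "T", "C", 5),
]
toy_gs = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "T", "G")}

tp_dp3, fp_dp3 = split_by_gold_standard(toy_calls, toy_gs, min_dp=3)
tp_dp1, fp_dp1 = split_by_gold_standard(toy_calls, toy_gs, min_dp=1)
```

In the toy data, raising the threshold from DP 1 to DP 3 removes a genuine variant supported by a single read, so %TP drops from 50% to about 33%. This is the kind of trade-off the paragraph above describes: a stricter threshold does not always improve the proportion of true positives.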

Our results have direct implications for key biological applications in eggplant genetics and breeding, and are broadly applicable to other crops. The demonstrated accuracy, sensitivity, and genotypic concordance at depths as low as 1X suggest that high-density genotype data can be generated reliably and affordably. This is particularly relevant for genome-wide association studies (GWAS) and QTL mapping, which require dense marker coverage across diverse individuals. Notably, Arrones et al. [75] successfully genotyped the S5MEGGIC population at 3X coverage, yielding 293,783 SNPs filtered with a minimum depth of coverage threshold of DP 3. These data enabled the identification of candidate genes associated with root architectural traits — key targets for improving abiotic stress resilience. Similarly, studies aiming to implement genomic selection can benefit from our approach, especially in early-generation selection stages where high-throughput, low-cost genotyping is critical to train the predictive models [76, 77].

Conclusions

Genotyping by lcWGS presents a promising alternative to overcome the limitations of low marker density in GBS and the high costs associated with WGS. While lcWGS is challenged by the identification of numerous potential false positives, our study highlights the efficacy of using a gold standard to mitigate this drawback. Additionally, the choice of bioinformatic tools can significantly influence the results. Future research should focus on optimizing these methodologies and developing tailored bioinformatic tools to fully leverage the potential of lcWGS in various genomic studies. Our findings indicate that Freebayes outperforms GATK in terms of sensitivity and genotypic concordance. Furthermore, selecting an appropriate sequencing coverage requires careful consideration of multiple variables, including economic constraints, study goals, and species characteristics. Coverages as low as 1X, combined with a gold standard and a depth threshold of DP 3, can yield a substantial number of true positives, making them suitable for large-scale, cost-sensitive genomic applications. However, for highly heterozygous species, higher coverage and more stringent depth thresholds are advised to avoid underrepresentation of heterozygosity. As a broad recommendation, we propose that sequencing at 3X or 4X, with a gold standard and depth thresholds of DP 3 and DP 4, respectively, provides an optimal balance between cost, sensitivity, missing data, and genotypic concordance. These settings achieve genotypic concordance comparable to 5X while minimizing the loss of TP and the proportion of missing genotypes. In conclusion, with appropriate optimization, lcWGS presents a cost-effective and scalable tool for genotyping. Successful implementation of lcWGS has the potential to revolutionize genotyping practices not only in eggplant breeding programs but also across other crops and genomic studies, enabling the efficient identification of valuable genetic variants at a reduced cost and accelerating genetic improvement endeavours.

Supplementary Information

Below is the link to the electronic supplementary material.

12870_2025_7242_MOESM1_ESM.pdf (86.1KB, pdf)

Additional file 1. A Founders of the MEGGIC population including their country of origin and code assessed in this study. B Funnel breeding design for developing the S5MEGGIC population. The process began with eight parent lines (G0), labelled A to H. These parents were crossed in pairs (A x B, C x D, E x F, G x H) to produce four simple hybrids (G1). Next, these simple hybrids were crossed in pairs (AB x CD, EF x GH) to create two double hybrids (G2). The double hybrids were then crossed (ABCD x EFGH) to form the quadruple hybrids (G3). Following this, the quadruple hybrids underwent intercrossing and were subjected to five rounds of single seed descent (SSD). Adapted from Mangino et al. [1]. (PDF)

12870_2025_7242_MOESM2_ESM.xlsx (18.9KB, xlsx)

Additional file 2. Statistics of the lcWGS of the seven Solanum melongena and one S. incanum founders of the MEGGIC population and for the four S5MEGGIC lines. (XLSX)

12870_2025_7242_MOESM3_ESM.xlsx (30.8KB, xlsx)

Additional file 3. Mapping statistics for the individuals assessed in this study. Mean values across five replicates for each level of skim coverage (1-4X) ± SD (n = 5). (XLSX)

12870_2025_7242_MOESM4_ESM.pdf (6.6MB, pdf)

Additional file 4. Distribution of mapped reads across the founder accession A by sequencing coverage. The maximum cutoff is set at DP 20 for low coverages and DP 60 for 20X data. Peaks represent regions of high sequencing within 10 Kbp windows. (PDF)

12870_2025_7242_MOESM5_ESM.xlsx (18.3KB, xlsx)

Additional file 5. Widely used variant calling software tools, their publication year, the number of citations according to Google Scholar, and the number of articles referencing these tools when searching on Google Scholar using the terms: “software name”, “low coverage sequencing”, “plant”, and “calling”. (XLSX)

12870_2025_7242_MOESM6_ESM.pdf (52.2KB, pdf)

Additional file 6. Comparative heatmaps depicting the average number of biallelic SNPs identified among the MEGGIC founders across varying sequencing coverages (1X to 5X) and minimum depth of coverage thresholds (from DP 1 to DP 10) using Freebayes and GATK. The colour of the squares reflects the number of polymorphisms identified. (PDF)

12870_2025_7242_MOESM7_ESM.xlsx (72.9KB, xlsx)

Additional file 7. Total number of polymorphic biallelic SNPs for each founder downsampling subset (1-5X) and minimum depth of coverage threshold (from DP 1 to DP 10), and shared polymorphisms (TP) with the reference SNP panels from the same founder using Freebayes and GATK. Data from 1X to 4X are the mean values of the five replicates, with the standard deviation (SD) reported in the columns to the right. (XLSX)

12870_2025_7242_MOESM8_ESM.pdf (352.9KB, pdf)

Additional file 8. Distribution of SNPs supported by different coverage depths before (left) and after (right) gold standard filtering using Freebayes (blue) and GATK (orange) at different sequencing coverages (1-5X and 20X). The data represent one SNP dataset from each coverage level of the parental A. For each sequencing coverage-SNP caller combination, the lowest and highest values, as well as the first, second (median), and third quartiles, along with the mean, are reported. (PDF)

12870_2025_7242_MOESM9_ESM.xlsx (19.8KB, xlsx)

Additional file 9. Reference SNP panels generated by Freebayes and GATK from the 20X resequencing data [26]. “Unfiltered” refers to the biallelic SNPs identified by each SNP caller without applying any filter. Polymorphic biallelic SNPs supported by at least 20 reads constituted the reference SNP panels. The ratio was calculated to compare Freebayes results relative to those of GATK-HC in both cases. Percentages reflect filtered versus total polymorphisms. (XLSX)

12870_2025_7242_MOESM10_ESM.xlsx (23.3KB, xlsx)

Additional file 10. Multifactorial ANOVA comparing the effect of SNP caller, coverage level, and minimum depth of coverage threshold on accuracy, sensitivity, and genotypic concordance. Post-hoc Tukey’s HSD tests were performed for pairwise comparisons between levels of each factor, evaluating their effects on accuracy, sensitivity, and genotypic concordance. The level of significance was set at p < 0.05. (XLSX)

12870_2025_7242_MOESM11_ESM.xlsx (18.9KB, xlsx)

Additional file 11. Number of missing, reference and alternative genotypes comprising the gold standard for each founder accession. Percentages are calculated relative to the total of 17,069,371 biallelic SNPs. (XLSX)

12870_2025_7242_MOESM12_ESM.xlsx (21.4KB, xlsx)

Additional file 12. Total TP identified in the four S5MEGGIC lines and variation in the percentage of total TP (set at 100% at DP 1) across different low sequencing coverage levels (1-5X) and minimum depth of coverage thresholds (from DP 1 to DP 10) using Freebayes. Data for sequencing coverages from 1X to 4X represents the mean value of five replicates. Standard deviations (SD) are shown in the second column. (XLSX)

12870_2025_7242_MOESM13_ESM.pdf (120.6KB, pdf)

Additional file 13. Percentage of true positive variants when using different minimum depth of coverage thresholds (from DP 1 to DP 10) across different sequencing coverages (1-5X) using Freebayes. True positive polymorphisms were those shared between the samples and the gold standard. Percentages represent the average across the four S5MEGGIC lines, with SD indicated by error bars (n = 20 for 1-4X and n = 4 for 5X). (PDF)

12870_2025_7242_MOESM14_ESM.xlsx (22.2KB, xlsx)

Additional file 14. Total TP identified in the four S5MEGGIC lines across different percentages of allowed missing data (0%, 25% and 50%), skim sequencing coverage levels (1-5X) and minimum depth of coverage thresholds (from DP 1 to DP 10) using Freebayes. Data for sequencing coverages from 1X to 4X represents the mean value of five replicates. Standard deviations (SD) are shown in the second column. (XLSX)

12870_2025_7242_MOESM15_ESM.pdf (117.6KB, pdf)

Additional file 15. Percentage of true positive variants identified in the four S5MEGGIC lines (0% missing data), in three lines (25% missing data), and in two lines (50% missing data). This was evaluated for each combination of minimum depth of coverage thresholds (from DP 1 to DP 10) and sequencing coverages (1-5X). (PDF)

Acknowledgements

Not applicable.

Abbreviations

lcWGS

Low-coverage whole-genome sequencing

DP

Depth of coverage

TP

True positives

GS

Gold standard

RRS

Reduced representation sequencing

WGS

Whole-genome sequencing

GBS

Genotyping-by-sequencing

QTLs

Quantitative trait loci

ILs

Introgression lines

MAGIC

Multiparent advanced generation intercross

SPET

Single primer enrichment technology

Author contributions

SV, JP and PG conceived the study. VBF, AA, MP and AB contributed to the data curation. VBF, AB and PG contributed to the formal analysis. JP and PG acquired funding. PG supervised the study. AA and MP validated the pipeline. VBF wrote the original draft. VBF, AA, SV, MP, JP, AB and PG reviewed the manuscript. All authors read and approved the final version of the manuscript.

Funding

Open access funding supported by Universitat Politècnica de València. This work was supported by grant PID2021-128148OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU, grant CIPROM/2021/020 from Conselleria d’Educació, Cultura, Universitats i Ocupació (Generalitat Valenciana), and by the Horizon Europe programme, project number 101094738 (“Promoting a Plant Genetic Resource Community for Europe”; PRO-GRACE). Pietro Gramazio is grateful for the post-doctoral grant RYC2021-031999-I funded by MICIU/AEI/10.13039/501100011033 and the European Union through NextGenerationEU/PRTR.

Data availability

The raw data have been submitted to the NCBI Short Read Archive under the Bioproject identifier PRJNA1174391. Accessions are indexed with BioSample IDs from SAMN44339997 to SAMN44340008. VCF files with the corresponding variants identified are available upon request to the corresponding author.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Virginia Baraja-Fonseca, Email: vbarfon@posgrado.upv.es.

Pietro Gramazio, Email: piegra@upv.es.

References

  1. Song B, Ning W, Wei D, Jiang M, Zhu K, Wang X, et al. Plant genome resequencing and population genomics: current status and future prospects. Mol Plant. 2023;16:1252–68.
  2. Scheben A, Batley J, Edwards D. Genotyping-by-sequencing approaches to characterize crop genomes: choosing the right tool for the right application. Plant Biotechnol J. 2017;15:149–61.
  3. Golicz AA, Bayer PE, Edwards D. Skim-based genotyping by sequencing. Methods Mol Biol. 2015;1245:257–70.
  4. Kumar P, Choudhary M, Jat BS, Kumar B, Singh V, Kumar V, et al. Skim sequencing: an advanced NGS technology for crop improvement. J Genet. 2021;100:38.
  5. Malmberg MM, Barbulescu DM, Drayton MC, Shinozuka M, Thakur P, Ogaji YO, et al. Evaluation and recommendations for routine genotyping using skim whole genome re-sequencing in canola. Front Plant Sci. 2018;9:1809.
  6. Happ MM, Wang H, Graef GL, Hyten DL. Generating high density, low cost genotype data in soybean [Glycine max (L.) Merr.]. G3 Genes Genomes Genet. 2019;9:2153–60.
  7. Bayer PE, Ruperao P, Mason AS, Stiller J, Chan CKK, Hayashi S, et al. High-resolution skim genotyping by sequencing reveals the distribution of crossovers and gene conversions in Cicer arietinum and Brassica napus. Theor Appl Genet. 2015;128:1039–47.
  8. Wang H, Xu X, Vieira FG, Xiao Y, Li Z, Wang J, et al. The power of inbreeding: NGS-based GWAS of rice reveals convergent evolution during rice domestication. Mol Plant. 2016;9:975–85.
  9. Gonda I, Ashrafi H, Lyon DA, Strickler SR, Hulse-Kemp AM, Ma Q, et al. Sequencing-based bin map construction of a tomato mapping population, facilitating high-resolution quantitative trait loci detection. Plant Genome. 2019;12:180010.
  10. Luo X, Xu L, Wang Y, Dong J, Chen Y, Tang M, et al. An ultra-high-density genetic map provides insights into genome synteny, recombination landscape and taproot skin colour in radish (Raphanus sativus L). Plant Biotechnol J. 2020;18:274–86.
  11. Adhikari L, Shrestha S, Wu S, Crain J, Gao L, Evers B, et al. A high-throughput skim-sequencing approach for genotyping, dosage estimation and identifying translocations. Sci Rep. 2022;12:17583.
  12. Clot CR, Wang X, Koopman J, Navarro AT, Bucher J, Visser RG, et al. High-density linkage map constructed from a skim sequenced diploid potato population reveals transmission distortion and QTLs for tuber and pollen production. Potato Res. 2024;67:139–63.
  13. Lou RN, Jacobs A, Wilder AP, Therkildsen NO. A beginner’s guide to low-coverage whole genome sequencing for population genomics. Mol Ecol. 2021;30:5966–93.
  14. Deng XL, Frandsen PB, Dikow RB, Favre A, Shah DN, Shah RDT, et al. The impact of sequencing depth and relatedness of the reference genome in population genomic studies: a case study with two caddisfly species (Trichoptera, Rhyacophilidae, Himalopsyche). Ecol Evol. 2022;12:e9583.
  15. Meisner J, Albrechtsen A. Inferring population structure and admixture proportions in low-depth NGS data. Genetics. 2018;210:719–31.
  16. O’Leary SJ, Puritz JB, Willis SC, Hollenbeck CM, Portnoy DS. These aren’t the loci you’re looking for: principles of effective SNP filtering for molecular ecologists. Mol Ecol. 2018;27:3193–206.
  17. Wickland DP, Battu G, Hudson KA, Diers BW, Hudson ME. A comparison of genotyping-by-sequencing analysis methods on low-coverage crop datasets shows advantages of a new workflow, GB-eaSy. BMC Bioinform. 2017;18:586.
  18. Yao Z, You FM, N’Diaye A, Knox RE, McCartney C, Hiebert CW, et al. Evaluation of variant calling tools for large plant genome re-sequencing. BMC Bioinform. 2020;21:360.
  19. Hardwick SA, Deveson IW, Mercer TR. Reference standards for next-generation sequencing. Nat Rev Genet. 2017;18:473–84.
  20. Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv. 2017:201178.
  21. Zook JM, Hansen NF, Olson ND, Chapman L, Mullikin JC, Xiao C, et al. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol. 2020;38:1347–55.
  22. FAOSTAT. FAOSTAT. 2024. http://www.fao.org/faostat/en/#data/QCL. Accessed 9 Jan 2024.
  23. Gramazio P, Alonso D, Arrones A, Villanueva G, Plazas M, Toppino L, et al. Conventional and new genetic resources for an eggplant breeding revolution. J Exp Bot. 2023;74:6285–305.
  24. Mangino G, Arrones A, Plazas M, Pook T, Prohens J, Gramazio P, et al. Newly developed MAGIC population allows identification of strong associations and candidate genes for anthocyanin pigmentation in eggplant. Front Plant Sci. 2022;13:847789.
  25. Barchi L, Acquadro A, Alonso D, Aprea G, Bassolino L, Demurtas O, et al. Single primer enrichment technology (SPET) for high-throughput genotyping in tomato and eggplant germplasm. Front Plant Sci. 2019;10:1005.
  26. Gramazio P, Yan H, Hasing T, Vilanova S, Prohens J, Bombarely A. Whole-genome resequencing of seven eggplant (Solanum melongena) and one wild relative (S. incanum) accessions provides new insights and breeding tools for eggplant enhancement. Front Plant Sci. 2019;10:1220.
  27. Arrones A, Mangino G, Alonso D, Plazas M, Prohens J, Portis E, et al. Mutations in the SmAPRR2 transcription factor suppressing chlorophyll pigmentation in the eggplant fruit peel are key drivers of a diversified colour palette. Front Plant Sci. 2022;13:1025951.
  28. Qian Z, Zhang B, Chen H, Lu L, Duan M, Zhou J, et al. Identification of quantitative trait loci controlling the development of prickles in eggplant by genome re-sequencing analysis. Front Plant Sci. 2021;12:731079.
  29. Guan W, Ke C, Tang W, Jiang J, Xia J, Xie X, et al. Construction of a high-density recombination bin-based genetic map facilitates high-resolution mapping of a major QTL underlying anthocyanin pigmentation in eggplant. Int J Mol Sci. 2022;23:10258.
  30. Ranil RHG, Niran HML, Plazas M, Fonseka RM, Fonseka HH, Vilanova S, et al. Improving seed germination of the eggplant rootstock Solanum torvum by testing multiple factors using an orthogonal array design. Sci Hortic. 2015;193:174–81.
  31. Vilanova S, Alonso D, Gramazio P, Plazas M, García-Fortea E, Ferrante P, et al. SILEX: a fast and inexpensive high-quality DNA extraction method suitable for multiple sequencing platforms and recalcitrant plant species. Plant Methods. 2020;16:110.
  32. Chen Y, Chen Y, Shi C, Huang Z, Zhang Y, Li S, et al. SOAPnuke: a mapreduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. Gigascience. 2018;7:gix120.
  33. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc. Accessed 28 Jan 2023.
  34. Barchi L, Pietrella M, Venturini L, Minio A, Toppino L, Acquadro A, et al. A chromosome-anchored eggplant genome sequence reveals key events in Solanaceae evolution. Sci Rep. 2019;9:11769.
  35. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv. 2013:1303.3997 [q-bio.GN].
  36. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and samtools. Bioinformatics. 2009;25:2078–9.
  37. García-Alcalde F, Okonechnikov K, Carbonell J, Cruz LM, Götz S, Tarazona S, et al. Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics. 2012;28:2678–9.
  38. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–93.
  39. Ramírez F, Dündar F, Diehl S, Grüning BA, Manke T. DeepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 2014;42:W187–91.
  40. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2021.
  41. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. ArXiv. 2012:1207.3907v2 [q-bio.GN].
  42. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.
  43. Kumar N, Boatwright JL, Brenton ZW, Sapkota S, Ballén-Taborda C, Myers MT, et al. Development and characterization of a sorghum multiparent advanced generation intercross (MAGIC) population for capturing diversity among seed parent gene pool. G3 Genes Genomes Genet. 2023;13:jkad037.
  44. Thudi M, Samineni S, Li W, Boer MP, Roorkiwal M, Yang Z, et al. Whole genome resequencing and phenotyping of MAGIC population for high resolution mapping of drought tolerance in chickpea. Plant Genome. 2024;17:e20333.
  45. Dell’Acqua M, Gatti DM, Pea G, Cattonaro F, Coppens F, Magris G, et al. Genetic properties of the MAGIC maize population: a new platform for high definition QTL mapping in Zea mays. Genome Biol. 2015;16:1–23.
  46. Peterson GW, Dong Y, Horbach C, Fu YB. Genotyping-by-sequencing for plant genetic diversity analysis: a lab guide for SNP genotyping. Diversity. 2014;6:665–80.
  47. Kumawat S, Raturi G, Dhiman P, Sudhakarn S, Rajora N, Thakral V, et al. Opportunity and challenges for whole-genome resequencing-based genotyping in plants. In: Sonah H, Goyal V, Shivaraj SM, Deshmukh RK, editors. Genotyping by sequencing for crop improvement. John Wiley & Sons, Ltd.; 2022. pp. 38–51.
  48. Wragg D, Zhang W, Peterson S, Yerramilli M, Mellanby R, Schoenebeck JJ, et al. A cautionary tale of low-pass sequencing and imputation with respect to haplotype accuracy. Genet Sel Evol. 2024;56:1–19.
  49. Jiang Y, Jiang Y, Wang S, Zhang Q, Ding X. Optimal sequencing depth design for whole genome re-sequencing in pigs. BMC Bioinform. 2019;20:556.
  50. Bhattarai G, Shi A, Mou B, Correll JC. Skim resequencing finely maps the downy mildew resistance loci RPF2 and RPF3 in spinach cultivars Whale and Lazio. Hortic Res. 2023;10:uhad076.
  51. Sapkota S, Zou C, Ledbetter C, Underhill A, Sun Q, Gadoury D, et al. Discovery and genome-guided mapping of REN12 from Vitis amurensis, conferring strong, rapid resistance to grapevine powdery mildew. Hortic Res. 2023;10:uhad052.
  52. Saripalli G, Adhikari L, Amos C, Kibriya A, Ahmed HI, Heuberger M, et al. Integration of genetic and genomics resources in Einkorn wheat enables precision mapping of important traits. Commun Biol. 2023;6:1–14.
  53. Huang X, Feng Q, Qian Q, Zhao Q, Wang L, Wang A, et al. High-throughput genotyping by whole-genome resequencing. Genome Res. 2009;19:1068–76.
  54. Liu J, Shen Q, Bao H. Comparison of seven SNP calling pipelines for the next-generation sequencing data of chickens. PLoS One. 2022;17:e0262574.
  55. Martin AR, Atkinson EG, Chapman SB, Stevenson A, Stroud RE, Abebe T, et al. Low-coverage sequencing cost-effectively detects known and novel variation in underrepresented populations. Am J Hum Genet. 2021;108:656–68.
  56. Watowich MM, Chiou KL, Graves B, Montague MJ, Brent LJN, Higham JP, et al. Best practices for genotype imputation from low-coverage sequencing data in natural populations. Mol Ecol Resour. 2023;00:1–13.
  57. Barchi L, Rabanus-Wallace MT, Prohens J, Toppino L, Padmarasu S, Portis E, et al. Improved genome assembly and pan-genome provide key insights into eggplant domestication and breeding. Plant J. 2021;107:579–96.
  58. Song K, Li L, Zhang G. Coverage recommendation for genotyping analysis of highly heterologous species using next-generation sequencing technology. Sci Rep. 2016;6:35736.
  59. Kardos M, Waples RS. Low-coverage sequencing and Wahlund effect severely bias estimates of inbreeding, heterozygosity and effective population size in North American wolves. Mol Ecol. 2024;00:e17415.
  60. Musich R, Cadle-Davidson L, Osier MV. Comparison of short-read sequence aligners indicates strengths and weaknesses for biologists to consider. Front Plant Sci. 2021;12:657240.
  61. Schilbert HM, Rempel A, Pucker B. Comparison of read mapping and variant calling tools for the analysis of plant NGS data. Plants. 2020;9:439.
  62. Wu X, Heffelfinger C, Zhao H, Dellaporta SL. Benchmarking variant identification tools for plant diversity discovery. BMC Genomics. 2019;20:701.
  63. Barbitoff YA, Abasov R, Tvorogova VE, Glotov AS, Predeus AV. Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics. 2022;23:1–17.
  64. Stegemiller MR, Redden RR, Notter DR, Taylor T, Taylor JB, Cockett NE, et al. Using whole genome sequence to compare variant callers and breed differences of US sheep. Front Genet. 2023;13:1060882.
  65. Ni G, Strom TM, Pausch H, Reimer C, Preisinger R, Simianer H, et al. Comparison among three variant callers and assessment of the accuracy of imputation from SNP array data to whole-genome sequence level in chicken. BMC Genomics. 2015;16:824.
  66. Bhadhadhara K, Balamurugan M, Bharti N, Banerjee R, Kasibhatla SM, Joshi R. Performance Evaluation of Variant Calling Tools for Human and Microbial Genomes. In: 2023 International Conference on Emerging Trends in Networks and Computer Communications (ETNCC). Namibia: Institute of Electrical and Electronics Engineers Inc. 2023. pp. 235–42.
  67. Bu M, Xu M, Tao S, Cui P, He B. Evaluation of different SNP analysis software and optimal mining process in tree species. Life. 2023;13:1069.
  68. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–73.
  69. Espejo Valle-Inclan J, Besselink NJM, de Bruijn E, Cameron DL, Ebler J, Kutzera J, et al. A multi-platform reference for somatic structural variation detection. Cell Genomics. 2022;2:100139.
  70. The 3,000 rice genomes project. The 3,000 rice genomes project. Gigascience. 2014;3:7.
  71. 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell. 2016;166:481–91.
  72. Bukowski R, Guo X, Lu Y, Zou C, He B, Rong Z, et al. Construction of the third-generation Zea mays haplotype map. Gigascience. 2018;7:1–12.
  73. Torkamaneh D, Laroche J, Valliyodan B, O’Donoughue L, Cober E, Rajcan I, et al. Soybean (Glycine max) haplotype map (GmHapMap): a universal resource for soybean translational and functional genomics. Plant Biotechnol J. 2021;19:324–34.
  74. Jordan KW, Bradbury PJ, Miller ZR, Nyine M, He F, Fraser M, et al. Development of the wheat practical haplotype graph database as a resource for genotyping data storage and genotype imputation. G3 Genes Genomes Genet. 2022;12:jkab390.
  75. Arrones A, Baraja-Fonseca V, Solana A, Plazas M, Soler S, Prohens J, et al. Resequencing and phenotyping of the first highly inbred eggplant multiparent population reveal SmLBD13 as a key gene associated with root morphology. Hortic Res. 2025;6:uhaf157.
  76. Zhang W, Li W, Liu G, Gu L, Ye K, Zhang Y, et al. Evaluation for the effect of low-coverage sequencing on genomic selection in large yellow croaker. Aquaculture. 2021;534:736323.
  77. Ye H, Ji C, Liu X, Bello SF, Guo L, Fang X, et al. Improvement of the accuracy of breeding value prediction for egg production traits in Muscovy duck using low-coverage whole-genome sequence data. Poult Sci. 2025;104:104812.

Supplementary Materials

12870_2025_7242_MOESM1_ESM.pdf (86.1KB, pdf)

Additional file 1. (A) Founders of the MEGGIC population, including their country of origin and the code used in this study. (B) Funnel breeding design for developing the S5MEGGIC population. The process began with eight parent lines (G0), labelled A to H. These parents were crossed in pairs (A x B, C x D, E x F, G x H) to produce four simple hybrids (G1). Next, these simple hybrids were crossed in pairs (AB x CD, EF x GH) to create two double hybrids (G2). The double hybrids were then crossed (ABCD x EFGH) to form the quadruple hybrids (G3). Following this, the quadruple hybrids underwent intercrossing and were subjected to five rounds of single seed descent (SSD). Adapted from Mangino et al. [1]. (PDF)

12870_2025_7242_MOESM2_ESM.xlsx (18.9KB, xlsx)

Additional file 2. Statistics of the lcWGS of the seven Solanum melongena and one S. incanum founders of the MEGGIC population and for the four S5MEGGIC lines. (XLSX)

12870_2025_7242_MOESM3_ESM.xlsx (30.8KB, xlsx)

Additional file 3. Mapping statistics for the individuals assessed in this study. Mean values across five replicates for each level of skim coverage (1-4X) ± SD (n = 5). (XLSX)

12870_2025_7242_MOESM4_ESM.pdf (6.6MB, pdf)

Additional file 4. Distribution of mapped reads across founder accession A by sequencing coverage. The maximum cutoff is set at DP 20 for low coverages and DP 60 for 20X data. Peaks represent regions of high sequencing depth within 10 Kbp windows. (PDF)

12870_2025_7242_MOESM5_ESM.xlsx (18.3KB, xlsx)

Additional file 5. Widely used variant calling software tools, their publication year, the number of citations according to Google Scholar, and the number of articles referencing these tools when searching on Google Scholar using the terms: “software name”, “low coverage sequencing”, “plant”, and “calling”. (XLSX)

12870_2025_7242_MOESM6_ESM.pdf (52.2KB, pdf)

Additional file 6. Comparative heatmaps depicting the average number of biallelic SNPs identified among the MEGGIC founders across varying sequencing coverages (1X to 5X) and minimum depth of coverage thresholds (from DP 1 to DP 10) using Freebayes and GATK. The colour of the squares reflects the number of polymorphisms identified. (PDF)

12870_2025_7242_MOESM7_ESM.xlsx (72.9KB, xlsx)

Additional file 7. Total number of polymorphic biallelic SNPs for each founder across downsampled subsets (1-5X) and minimum depth of coverage thresholds (from DP 1 to DP 10), and shared polymorphisms (TP) with the reference SNP panels from the same founder, using Freebayes and GATK. Data from 1X to 4X are the mean values of the five replicates, with the standard deviation (SD) reported in the columns to the right. (XLSX)

12870_2025_7242_MOESM8_ESM.pdf (352.9KB, pdf)

Additional file 8. Distribution of SNPs supported by different coverage depths before (left) and after (right) gold standard filtering using Freebayes (blue) and GATK (orange) at different sequencing coverages (1-5X and 20X). The data represent one SNP dataset from each coverage level of parent A. For each sequencing coverage-SNP caller combination, the lowest and highest values, as well as the first, second (median), and third quartiles, along with the mean, are reported. (PDF)

12870_2025_7242_MOESM9_ESM.xlsx (19.8KB, xlsx)

Additional file 9. Reference SNP panels generated by Freebayes and GATK from the 20X resequencing data [26]. “Unfiltered” refers to the biallelic SNPs identified by each SNP caller without applying any filter. Polymorphic biallelic SNPs supported by at least 20 reads constituted the reference SNP panels. The ratio was calculated to compare Freebayes results relative to those of GATK-HC in both cases. Percentages reflect filtered versus total polymorphisms. (XLSX)

12870_2025_7242_MOESM10_ESM.xlsx (23.3KB, xlsx)

Additional file 10. Multifactorial ANOVA comparing the effect of SNP caller, coverage level, and minimum depth of coverage threshold on accuracy, sensitivity, and genotypic concordance. Post-hoc Tukey’s HSD tests were performed for pairwise comparisons between levels of each factor, evaluating their effects on accuracy, sensitivity, and genotypic concordance. The level of significance was set at p < 0.05. (XLSX)

12870_2025_7242_MOESM11_ESM.xlsx (18.9KB, xlsx)

Additional file 11. Number of missing, reference and alternative genotypes comprising the gold standard for each founder accession. Percentages are calculated relative to the total of 17,069,371 biallelic SNPs. (XLSX)

12870_2025_7242_MOESM12_ESM.xlsx (21.4KB, xlsx)

Additional file 12. Total TP identified in the four S5MEGGIC lines and variation in the percentage of total TP (set at 100% at DP 1) across different low sequencing coverage levels (1-5X) and minimum depth of coverage thresholds (from DP 1 to DP 10) using Freebayes. Data for sequencing coverages from 1X to 4X represent the mean value of five replicates. Standard deviations (SD) are shown in the second column. (XLSX)

12870_2025_7242_MOESM13_ESM.pdf (120.6KB, pdf)

Additional file 13. Percentage of true positive variants when using different minimum depth of coverage thresholds (from DP 1 to DP 10) across different sequencing coverages (1-5X) using Freebayes. True positive polymorphisms were those shared between the samples and the gold standard. Percentages represent the average across the four S5MEGGIC lines, with SD indicated by error bars (n = 20 for 1-4X and n = 4 for 5X). (PDF)

12870_2025_7242_MOESM14_ESM.xlsx (22.2KB, xlsx)

Additional file 14. Total TP identified in the four S5MEGGIC lines across different percentages of allowed missing data (0%, 25% and 50%), skim sequencing coverage levels (1-5X) and minimum depth of coverage thresholds (from DP 1 to DP 10) using Freebayes. Data for sequencing coverages from 1X to 4X represent the mean value of five replicates. Standard deviations (SD) are shown in the second column. (XLSX)

12870_2025_7242_MOESM15_ESM.pdf (117.6KB, pdf)

Additional file 15. Percentage of true positive variants identified in the four S5MEGGIC lines (0% missing data), in three lines (25% missing data), and in two lines (50% missing data). This was evaluated for each combination of minimum depth of coverage thresholds (from DP 1 to DP 10) and sequencing coverages (1-5X). (PDF)

Articles from BMC Plant Biology are provided here courtesy of BMC