Abstract
Whole exome sequencing (WES) technology has become a prevalent methodology in the field of human genetics research, providing an effective and affordable alternative to identify causative genetic mutations in genomic exon regions. This study focuses on the comparative assessment of four commercially available WES platforms on the DNBSEQ-Series sequencer, a platform that has not been extensively evaluated in the literature. The study provides a comprehensive comparison of data quality, capture specificity, coverage uniformity and variants detection accuracy across these platforms. The results indicate that these platforms exhibit comparable reproducibility and superior technical stability and detection accuracy on the DNBSEQ-T7 sequencer. Furthermore, the study establishes a robust workflow for probe hybridization capture that is compatible with the four commercial exome kits and the DNBSEQ-Series sequencer, offering uniform and outstanding performance, thus enhancing broader compatibility regardless of probe brand. This study contributes to filling a significant gap in the literature regarding the performance evaluation of WES platforms on the DNBSEQ-Series sequencer and provides valuable insights for researchers in human genetics research.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12864-025-12104-9.
Keywords: Whole exome sequencing, Probe hybridization capture, Uniformity, Efficiency, Variants detection
Introduction
Massively parallel sequencing (MPS) technology has emerged as an indispensable tool for genomics, transcriptomics and epigenomics research. Over the past decade, MPS has undergone significant technological advancements, becoming a cornerstone of modern biological research [1]. Notably, the capacity of MPS has increased exponentially, and numerous high-throughput sequencers are now commercially available, including the HiSeq X Ten System, NovaSeq 6000, BGISEQ-500, DNBSEQ-G400 and DNBSEQ-T7 [2–5]. In recent years, MGI MPS sequencers have been shown to combine cost-effectiveness, high data quality and flexible throughput. MGI’s products are now used worldwide across different applications and research fields [6–8].
To date, whole exome sequencing (WES) technology has become a prevalent methodology in the field of human genetics research, providing an effective and affordable alternative for identifying causative genetic mutations in protein-coding exon regions of the genome. Several WES platforms designed for massively parallel sequencers have been developed and widely used in numerous studies, showing strong overall performance on Illumina platforms [2, 9–12], BGISEQ-500 [13] and DNBSEQ-G400 [14, 15]. Nevertheless, the performance of WES platforms on the DNBSEQ-T7 sequencer has not yet been comprehensively evaluated.
In this study, four commercially available WES platforms were selected for comparative assessment on the DNBSEQ-T7 sequencer. The design and workflow of the comparative study are shown in Table 1. The four exome capture platforms evaluated are the TargetCap Core Exome Panel v3.0 from BOKE Bioscience (hereafter BOKE), the xGen Exome Hyb Panel v2 from Integrated DNA Technologies (hereafter IDT), the EXome Core Panel from Nanodigmbio Biotechnology (hereafter Nad) and the Twist Exome 2.0 from Twist Bioscience (hereafter Twist). This study presents a comprehensive comparison of data quality, capture specificity, uniformity of coverage, GC content bias, and the efficiency, accuracy and concordance of variant detection across the four probe platforms. Our findings indicate that these platforms have comparable reproducibility and high technical stability and detection accuracy on the DNBSEQ-T7 sequencer. Moreover, we have established a robust workflow for probe hybridization capture that is broadly compatible with four distinct commercial exome probe sets and the DNBSEQ-Series sequencers. Compared with data generated by following each manufacturer's protocol, our method delivers uniform and strong performance across the probe capture kits on DNBSEQ sequencers, thereby broadening compatibility regardless of probe brand.
Table 1.
Design and workflow of the study
| gDNA | NA12878 | | |
|---|---|---|---|
| Library preparation | A total of 72 libraries prepared with the MGIEasy UDB Universal Library Prep Set | | |
| Pre-capture pooled libraries | Duplicates of 1-plex hybridization for each probe | 8-plex hybridization of one pool for each probe | 8-plex hybridization of one pool for each probe |
| Enrichment protocol | BOKE | BOKE | MGI |
| | IDT | IDT | |
| | Nad | Nad | |
| | Twist | Twist | |
| Probes | TargetCap Core Exome Panel v3.0 | | |
| | xGen Exome Hyb Panel v2 | | |
| | EXome Core Panel | | |
| | Twist Exome 2.0 | | |
| Sequencing | All 72 samples pooled on one lane of DNBSEQ-T7, PE150 | | |
| Bioinformatics | Exome data processing and analysis | | |
Materials and methods
Samples
DNA samples of HapMap-CEPH NA12878 were purchased from the Coriell Institute. NA12878 is the most thoroughly studied human genome, having been repeatedly sequenced on different massively parallel sequencing platforms and subjected to whole-genome screening in the International HapMap Project. The PancancerLight 800 gDNA Reference Standard (IB-GW-OGTM800, hereafter G800) was purchased from Genewell; it is well characterized and contains more than 720 variants across 330 key cancer genes. Library construction reagents were obtained from MGI, while the four exome capture probe sets were purchased from their respective manufacturers.
gDNA fragmentation
Genomic DNA samples were physically fragmented into small fragments primarily ranging from 100 to 700 bp using a Covaris E210 ultrasonicator following the manufacturer’s recommended procedure. The DNA fragments were then size-selected using MGIEasy DNA Clean Beads to obtain 220 to 280 bp fragments prior to pre-capture PCR.
Library construction
As part of the comparative analysis, we generated a total of 72 DNA libraries from NA12878. After the fragmented gDNA samples were normalized, 50 ng of DNA was added to each of 72 wells of a 96-well PCR plate. All samples were then processed using the MGIEasy UDB Universal Library Prep Set (MGI) reagents and library construction protocol. The procedure included end repair, adapter ligation, purification and pre-capture PCR (pre-PCR) amplification, all performed under identical conditions on the MGISP-960 High-throughput Automated Sample Preparation System. To facilitate subsequent library pooling for enrichment and sequencing, each sample was uniquely dual-indexed during PCR amplification using 72 UDB primers from the MGIEasy UDB Primers Adapter Kit Set A. After eight amplification cycles, pre-PCR products with a predominant size distribution of 350 to 450 base pairs (bp) were obtained. The concentrations of the pre-PCR libraries were quantified using the Qubit dsDNA HS Assay (ThermoFisher Scientific), and the pre-PCR library yields were calculated from these results (Table S1). The average yield of the 72 libraries exceeded 1500 ng, and the coefficient of variation (CV) was less than 10%, indicating good uniformity across all samples. G800 gDNA underwent a similar library construction workflow, performed manually.
Pre-capture library pooling and enrichment method
Exome capture was performed using four different enrichment probe sets: the TargetCap Core Exome Panel v3.0 from BOKE, the xGen Exome Hyb Panel v2 from IDT, the EXome Core Panel from Nanodigmbio and the Twist Exome 2.0 from Twist. The enrichment procedures diverged from the step at which the libraries were concentrated using a Speedvac evaporator.
The hybridization sample arrangement is shown in Table S2. In the first column of the 96-well plate, 8 pre-capture libraries were individually enriched with the 4 different probes (1-plex hybridization), with an input of 1000 ng per sample; each probe had two replicate samples. Samples in columns 2 to 9 of the PCR plate were combined into 8 library pools for 8-plex hybridization. The input of each library was 250 ng, so the total mass per 8-plex hybridization reaction was 2000 ng. For 4 of the library pools (columns 2 to 5), we performed exome capture using the reagents and protocol provided by the corresponding probe manufacturer. Because each manufacturer's enrichment reagents and protocol had to be followed separately, this part of the experiment spanned four days. In contrast, the other 4 library pools (columns 6 to 9) were captured with the 4 probes using a single set of MGI enrichment reagents (MGIEasy Fast Hybridization and Wash Kit) and a single workflow. Because the workflow was the same for all 4 probes, this part of the experiment was completed within one day (Fig. 1, Table S2). In addition, 1000 ng of the G800 library was used for one 1-plex capture with the BOKE TargetCap Core Exome Panel v3.0, following the 1-h hybridization of the MGI protocol. The MGI enrichment workflow consists of four main steps (Figure S1). It is a solution-based whole exome capture method whose workflow is highly similar to most commercially available exome capture protocols [9]. Although the enrichment workflows differ slightly, the probe hybridization step was standardized to a 1-h incubation for all capture methods in this study.
Fig. 1.
The timeline of experiments. Under the MGI hybridization protocol, sample enrichment with all probes was completed within a single day. The original manufacturers' protocols required four days in total, with one probe hybridized per day
After that, a total of 16 captured DNA libraries were amplified with 12 cycles of PCR using the reagents and protocol of the MGIEasy Dual Barcode Exome Capture Accessory Kit. The post-capture yields are shown in Table S3.
Finally, the 16 target-enriched DNA libraries, comprising 72 samples, were converted into single-stranded DNA circles, and 40 fmol of all libraries were processed as one pool for DNA Nanoball (DNB) generation. The DNB pool was loaded into one patterned flowcell and sequenced on DNBSEQ-T7, yielding paired-end 150 base pair reads (PE150) in a single sequencing run. Each sample was sequenced to a depth providing over 100 × mapped coverage on targeted regions. The final enriched G800 library was likewise converted to DNBs and sequenced on a DNBSEQ-G400 instrument using PE150.
Bioinformatics analysis
Paired-end reads were processed in accordance with the best practices recommended by the Genome Analysis Toolkit (GATK) using MegaBOLT v2.3.0.0, which integrates and accelerates algorithms such as BWA and GATK HaplotypeCaller (https://en.mgi-tech.com/products/software_info/6). MegaBOLT provides fast, accurate and cost-effective analysis of whole genome and whole exome sequencing data. Quality control, alignment and variant calling metrics for all 72 WES datasets (approximately 1.2 T bases) were collected using in-house scripts. To enhance the accuracy of variant calling, public variant datasets for hg19 and dbSNP build 151 were applied to all samples for Base Quality Score Recalibration (BQSR).
Two metrics were used to measure the uniformity of coverage. Uniformity refers to the proportion of target-region bases whose sequencing depth exceeds 20% of the average depth:

$$\mathrm{Uniformity} = \frac{N_{d > 0.2\,\bar{d}}}{N_{\mathrm{total}}}$$

where $\bar{d}$ is the average depth over the target region, $N_{d > 0.2\,\bar{d}}$ is the number of target bases with depth above $0.2\,\bar{d}$, and $N_{\mathrm{total}}$ is the total number of bases in the target region. FOLD_80_BASE_PENALTY is a metric reported by Picard to assess the homogeneity of coverage. It measures the fold over-coverage required to raise 80% of the bases to the mean coverage level.
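As an illustration of the two uniformity metrics, the following minimal Python sketch computes them from a list of per-base depths over a target region. It is not the pipeline used in this study: the function names and toy depths are hypothetical, and the fold-80 value is approximated here as the mean depth divided by a nearest-rank 20th-percentile depth, whereas Picard additionally restricts the calculation to targets with non-zero coverage.

```python
import math


def uniformity(depths, fraction=0.2):
    """Proportion of target bases whose depth exceeds `fraction` x mean depth."""
    mean_depth = sum(depths) / len(depths)
    threshold = fraction * mean_depth
    return sum(1 for d in depths if d > threshold) / len(depths)


def fold_80_base_penalty(depths):
    """Approximate fold-80 penalty: mean depth / 20th-percentile depth."""
    ordered = sorted(depths)
    # Nearest-rank 20th percentile (an assumption of this sketch).
    rank = max(1, math.ceil(len(ordered) * 20 / 100))
    p20 = ordered[rank - 1]
    if p20 == 0:
        return float("inf")
    return (sum(ordered) / len(ordered)) / p20


# Toy per-base depths (hypothetical values, not study data).
toy_depths = [100, 110, 90, 95, 105, 10, 120, 80, 100, 190]
print(uniformity(toy_depths))            # 0.9
print(fold_80_base_penalty(toy_depths))  # 1.25
```

A value of 1.0 for the fold-80 penalty would correspond to perfectly even coverage; larger values indicate that more sequencing is needed to bring poorly covered bases up to the mean.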
For variant concordance analysis, Jaccard similarity was employed to measure the concordance between variants datasets. This metric is defined as the size of the intersection of two sets divided by the size of their union (the number of genotype-concordant variants detected in both datasets divided by the number of variants occurring in at least one of the two datasets):
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ are both sample sets.
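The Jaccard calculation can be sketched in a few lines of Python. This is an illustrative example rather than the in-house script: the function name, the tuple key (chromosome, position, reference allele, alternate allele, genotype) and the toy calls are all assumptions made here to show genotype-aware concordance.

```python
def jaccard_similarity(calls_a, calls_b):
    """Jaccard similarity between two collections of genotype-aware variant keys."""
    set_a, set_b = set(calls_a), set(calls_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(set_a & set_b) / len(union)


# Each variant is keyed by (chrom, pos, ref, alt, genotype); toy values only.
replicate_1 = {
    ("chr1", 1000, "A", "G", "0/1"),
    ("chr1", 2000, "C", "T", "1/1"),
    ("chr2", 3000, "G", "A", "0/1"),
}
replicate_2 = {
    ("chr1", 1000, "A", "G", "0/1"),
    ("chr1", 2000, "C", "T", "0/1"),  # same site, discordant genotype
    ("chr2", 3000, "G", "A", "0/1"),
}
print(jaccard_similarity(replicate_1, replicate_2))  # 2 shared keys / 4 distinct keys = 0.5
```

Because the genotype is part of the key, a site called in both replicates with different genotypes counts twice in the union and not at all in the intersection, which lowers the similarity.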
To identify target cancer mutations in the G800 library, we began with quality control and alignment using MegaBOLT. Because this standard contains more rare mutations, we subsequently used VarDict (version 1.8.2) to detect mutations from the BAM files. Prior to this mutation analysis, we performed Base Quality Score Recalibration (BQSR); duplicate reads were removed and VarDict was run with the following parameters: '-f 0.01 --dedup -u -c 1 -S 2 -E 3 -g 4'. These settings were chosen to support accurate and reliable mutation detection.
Results
Comparison of four human exome capture platforms
Although the BOKE, IDT, Nad and Twist exome capture platforms do not have identical target regions, they all employ biotinylated DNA baits complementary to the target exome, which are hybridized to genomic fragment libraries. A distinctive feature of exome capture is its ability to interrogate many targets simultaneously, which depends directly on the coding sequences (CDS) targeted by the capture probes. To assess CDS coverage among the four platforms, the CDS was annotated using the CCDS database (release of 7 September 2011), NCBI RefSeq (release of 17 May 2021) and GENCODE (release of 1 March 2023). Although the four target regions differed in coverage, the majority of coding regions were present on all platforms. The IDT target regions showed higher coverage of CCDS (99.73% of CDS) than the other three platforms. The Nad platform showed the highest coverage of NCBI RefSeq CDS (98.64%). As anticipated, coverage of GENCODE CDS was similar among the four platforms (96.21% to 96.80%) (Table 2).
Table 2.
Comparison of the target region size and CDS coverage rate of the four exome capture platforms for CCDS, GENCODE and NCBI RefSeq annotations
| Kits | Region size (Mb) | CCDS | GENCODE | NCBI RefSeq |
|---|---|---|---|---|
| BOKE | 42.0 | 99.21% | 96.21% | 97.85% |
| IDT | 34.0 | 99.73% | 96.61% | 98.41% |
| Nad | 42.0 | 99.03% | 96.80% | 98.64% |
| Twist | 36.0 | 97.89% | 96.66% | 97.11% |
The four platforms shared approximately 32.8 million bases (Mb) of target region, accounting for 78.49%, 95.57%, 79.20% and 90.40% of the total target bases of BOKE, IDT, Nad and Twist, respectively. The Twist platform had the most extensive platform-specific target regions compared with the other platforms (BOKE-specific 2.13 Mb, Nad-specific 1.59 Mb, IDT-specific 0.3 Mb) (Figure S2).
Data summary of four whole exome capture platforms sequencing
We aimed to thoroughly evaluate the performance of the four exome capture platforms using the protocols and experimental design tailored to each platform. Replicates were included for each platform to assess the reliability and reproducibility of data production. Each platform had two replicate 1-plex captures and one 8-plex capture of eight samples, giving ten datasets per platform. Consequently, a total of 40 libraries were constructed for the four platforms, with each library averaging about 116 million (M) reads. These reads were then normalized to an average sequencing depth of 100 × and analyzed by MegaBOLT.
Because low-quality reads would affect the subsequent variant calling results, the raw FASTQ files of each dataset were filtered and trimmed by SOAPnuke with default parameter settings. For comparison, each dataset was downsampled to 100 × sequencing depth. Across all datasets, the clean read proportion (the percentage of filtered reads relative to raw reads) reached up to 99% (Fig. 2a), and the percentage of bases at Q30 consistently exceeded 95% (Fig. 2b). On average, 99.7% of the filtered reads were mapped in paired-end mode to the human reference genome (hg19) (Fig. 2c). Although all samples showed high alignment rates for filtered reads, the paired mapping rates varied slightly across platforms: BOKE (99.69%), Nad (99.67%), IDT (99.76%) and Twist (99.77%). Unique mapping rates varied considerably more across the four platforms. The mean unique mapping rates were 89.16% (BOKE), 80.29% (Nad), 78.90% (IDT) and 94.93% (Twist), with the Nad and IDT platforms showing substantially lower unique mapping rates than BOKE and Twist (Fig. 2d, details in Table S4). Notably, 1-plex samples consistently showed higher unique mapping rates than 8-plex samples on all platforms: BOKE (91.68% vs 89.16%), Nad (84.50% vs 79.72%), IDT (87.22% vs 76.82%) and Twist (95.87% vs 94.70%).
Fig. 2.
Data production, mapping rate and exome capture efficiency for samples from the four exome platforms sequenced on DNBSEQ-T7. Exome capture efficiency indicates the precision with which reads align to the targeted region; a higher percentage of captured reads indicates higher capture precision. (a) The percentage of filtered reads relative to raw reads. (b) The percentage of bases at Q30. (c) The percentage of filtered reads mapped in paired-end mode to the human reference genome. (d) The percentage of reads uniquely mapped to the human reference genome. (e) The proportion of reads uniquely mapped to target regions.
Exome capture specificity
To investigate the performance of exome enrichment across the platforms, we compared their capture efficiency, defined as the percentage of reads mapped to target regions (Capture Rate on Reads). The proportions of reads uniquely mapped to target regions were broadly comparable among the four platforms, ranging from 65.24% to 83.38% for both 1-plex and 8-plex configurations (Fig. 2e, Table S4). The Twist platform showed a relatively lower target enrichment efficiency than the other platforms, and its efficiency differed between the 1-plex and 8-plex setups. In particular, the 8-plex configuration showed a higher capture rate than the 1-plex configuration on all platforms except Nad.
Uniformity of coverage
The uniformity of sequence coverage across targeted regions is crucial for determining genotype detection sensitivity at a fixed sequencing depth in exome capture experiments. A more consistent sequencing depth across a platform's targets increases the likelihood of achieving the desired genotype sensitivity. With data downsampled to an average depth of 100 ×, all four platforms showed extensive coverage of their target regions at the different depth thresholds (1x, 10x, 20x, 30x). At 20 × depth with the 1-plex configuration, the Nad, Twist and BOKE platforms showed comparable coverage proportions (98.84 ± 0.00%, 98.70 ± 0.15% and 98.68 ± 0.01%, respectively), while IDT was slightly lower (96.44 ± 0.41%) (Fig. 3e, Table S5). Although Nad, Twist and BOKE showed comparable mean coverage and uniformity at 1x, 10 × and 20 × depth, Twist outperformed the other platforms at 30 × depth, achieving both the highest coverage proportion (98.00 ± 0.25%) and the lowest FOLD_80_BASE_PENALTY score (1.31 ± 0.06) (Fig. 3b, f). Furthermore, Twist reached the maximum fraction of target bases covered at about 100 × depth (Figure S3). Together, these results indicate that Twist provided the most uniform coverage.
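The coverage proportions at fixed depth thresholds can be illustrated with a short Python sketch. The function name and the toy depths below are hypothetical and serve only to show the calculation; the values reported in this study were produced by the analysis pipeline described in the Methods.

```python
def coverage_proportions(depths, thresholds=(1, 10, 20, 30)):
    """Fraction of target bases covered at or above each depth threshold."""
    total = len(depths)
    return {t: sum(1 for d in depths if d >= t) / total for t in thresholds}


# Toy per-base depths over a small target region (not study data).
toy_depths = [0, 5, 12, 25, 31, 40, 100, 150]
print(coverage_proportions(toy_depths))
# {1: 0.875, 10: 0.75, 20: 0.625, 30: 0.5}
```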
Fig. 3.
The uniformity of sequence coverage across targeted regions of the four human exome capture platforms. (a) The uniformity for the four platforms. (b) The FOLD_80_BASE_PENALTY score for the four platforms. (c), (d), (e) and (f) show the average coverage proportion of the target region for the four platforms at sequencing depths of 1x, 10x, 20x and 30x, respectively.
We also compared the four platforms under 8-plex hybridization. The mean coverage proportions at 20 × depth were 98.66 ± 0.02% for BOKE, 96.81 ± 0.2% for IDT, 98.86 ± 0.02% for Nad and 98.47 ± 0.04% for Twist. The uniformity of sequencing depth was similar to that of 1-plex (Fig. 3, Figure S3), and Twist again had the lowest FOLD_80_BASE_PENALTY (1.48 ± 0.01). Thus, Twist had the best overall uniformity of sequencing depth on DNBSEQ-T7 among the four platforms, and the 8-plex coverage uniformity of the four platforms was slightly lower than that of 1-plex.
GC content bias among four platforms
GC content is a recognized source of coverage bias in WES and has been shown to affect hybridization efficiency, thereby influencing the uniformity of coverage in target regions [2, 16–18]. Uniform coverage of both high-GC and low-GC regions indicates that the targeted regions are covered more comprehensively, with lower dropout rates.
We therefore assessed the relationship between mean sequencing depth in target regions and GC content for each platform. For each 100-base window along each target region, we determined the GC content and average read depth, and assessed the extent of GC bias in the ten datasets from each of the four platforms under 1-plex and 8-plex hybridization. The results showed a broad spectrum of GC bias across the datasets. As expected, all platforms performed similarly in the extreme regions of low GC content (< 20%) and high GC content (> 80%) (Fig. 4). In addition, the average sequencing depth of the Twist platform was more evenly distributed across the target region than that of the other platforms, especially between 40 and 60% GC content.
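The windowed GC analysis described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the script used in this study: the function names, the 10% GC bin width and the toy sequence and depths are hypothetical, and windows are taken as non-overlapping 100-base blocks of a single region.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_depth_by_gc(sequence, depths, window=100, bin_width=10):
    """Mean depth of fixed-size windows, grouped into GC-content bins (percent)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for start in range(0, len(sequence) - window + 1, window):
        win_depth = sum(depths[start:start + window]) / window
        gc_percent = 100 * gc_fraction(sequence[start:start + window])
        # Place 100% GC in the top bin instead of opening a bin of its own.
        gc_bin = min(int(gc_percent // bin_width) * bin_width, 100 - bin_width)
        sums[gc_bin] += win_depth
        counts[gc_bin] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Toy region: one AT-only, one GC-only and one balanced 100-base window.
toy_sequence = "AT" * 50 + "GC" * 50 + "ATGC" * 25
toy_depths = [80] * 100 + [20] * 100 + [100] * 100
print(mean_depth_by_gc(toy_sequence, toy_depths))
# {0: 80.0, 50: 100.0, 90: 20.0}
```

Dividing the summed window depths by the number of windows in each GC bin gives the per-bin mean depth that a plot such as Fig. 4 summarizes.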
Fig. 4.
The density plot shows the correlation between the mean read depth across target regions and the GC content for each platform, that is, the mean depth of coverage in different GC content contexts. The mean depth was obtained by summing the window depths and dividing by the number of windows at each GC content
Furthermore, Twist had higher coverage of sequences with low GC content, and all platforms showed a sharp drop in depth in GC-rich regions from 60 to 80% (Fig. 4); for all platforms, performance was similar between 1-plex and 8-plex (Figure S4). Extreme GC content evidently remains a significant challenge for exome capture.
Variants detection
Accurate and comprehensive identification of variants in the human exome, especially single-nucleotide polymorphisms (SNPs), is a primary objective of exome sequencing [16, 19]. The ability to detect variants is therefore a key measure of platform performance. Although each replicate achieved near-complete target coverage (> 99%) at minimal sequencing depth (≥ 1x), high-quality genotype calls were obtained for only a limited proportion of genomic loci.
To evaluate variant detection performance, we compared the number of variants identified in each sample using the datasets downsampled to 100 × depth. Variant detection was performed across the 40 datasets from the four platforms and two hybridization configurations (1-plex and 8-plex), considering sites with at least 10 × depth. The mean variant counts (1-plex, 8-plex) for SNPs were (29,771, 29,784) for BOKE, (29,231, 29,325) for Nad, (22,567, 22,645) for IDT and (26,844, 26,798) for Twist; for INDELs they were (2208, 2198) for BOKE, (2176, 2196) for Nad, (719, 722) for IDT and (1358, 1347) for Twist. Although the total number of variants discovered differed among the four platforms because of their different target regions, the number of variants was approximately the same among replicates and between the two hybridization configurations (1-plex and 8-plex) for every platform (Figure S5, Table S6).
We also compared the SNPs and INDELs with known variants in dbSNP151 [20, 21]. Approximately 99.5% of the detected SNVs were found in dbSNP151 (the dbSNP rate; Fig. 5a, Table S6), although Twist had a relatively lower dbSNP rate than the other three platforms owing to its designed target regions. The transition/transversion (Ti/Tv) ratios over the whole target region ranged from 2.66 to 2.97 among these platforms (Table S6). The dbSNP rate of INDELs was lower than that of SNPs (Fig. 5b, Table S6), because INDELs are much harder to detect from short reads using current methods [22]. Taken together, these findings show that the four platforms have a high and generally comparable degree of technical reproducibility and can interrogate a similarly high number of SNPs and INDELs in their target regions.
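For readers less familiar with these two summary statistics, the Python sketch below shows how a dbSNP rate and a Ti/Tv ratio can be computed from a list of calls. The function names, the (chromosome, position, reference, alternate) key and the toy variants are assumptions made for illustration; the sketch expects single-base substitutions whose reference and alternate alleles differ.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Transition/transversion ratio for (ref, alt) single-base substitutions."""
    ti = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")


def dbsnp_rate(called, known):
    """Fraction of called variants that are present in a known-variant set."""
    called = set(called)
    if not called:
        return 0.0
    return len(called & set(known)) / len(called)


# Toy substitutions: four transitions and two transversions.
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ti_tv_ratio(toy_snvs))  # 2.0

# Toy variant keys (chrom, pos, ref, alt); three of the four calls are "known".
toy_called = {("chr1", 10, "A", "G"), ("chr1", 20, "C", "T"),
              ("chr2", 30, "G", "A"), ("chr2", 40, "A", "C")}
toy_known = {("chr1", 10, "A", "G"), ("chr1", 20, "C", "T"),
             ("chr2", 30, "G", "A"), ("chr3", 50, "T", "C")}
print(dbsnp_rate(toy_called, toy_known))  # 0.75
```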
Fig. 5.
Variation estimation by comparison with dbSNP151 and variation accuracy against the GIAB dataset. (a) The proportion of detected SNVs found in dbSNP151 (dbSNP rate). (b) The proportion of detected INDELs found in dbSNP151. The precision of SNP (c) and INDEL (d) calls for the four platforms against the GIAB dataset. The sensitivity of SNP (e) and INDEL (f) calls for the four platforms against the GIAB dataset.
Variants accuracy
To assess the accuracy of the identified variants, we used sample NA12878, which has been well characterized by the Genome in a Bottle (GIAB) project [23]. We compared the genotypes from each replicate of the four platforms against the GIAB dataset [24], which is typically used as the standard for benchmarking variant accuracy. We created a high-confidence BED file for each platform: using BEDTools intersect, we identified the overlapping regions between each platform's target regions and the NA12878 WGS high-confidence regions. Variants called from each platform were then intersected with the NA12878 WGS variants within these high-confidence regions. Finally, we calculated precision, sensitivity and F-measure for each platform's variant calls using RTG Tools vcfeval.
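The overlap step can be pictured with a small pure-Python sketch of interval intersection. It only illustrates the idea behind restricting target regions to high-confidence regions and is not a substitute for BEDTools: the function name and the toy coordinates are hypothetical, a single chromosome is assumed, and both inputs are assumed to be sorted, non-overlapping, half-open (start, end) intervals as in BED files.

```python
def intersect_intervals(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


# Toy intervals on one chromosome (not real target or GIAB coordinates).
targets = [(100, 200), (300, 400), (500, 600)]
high_confidence = [(150, 350), (380, 520)]
overlap = intersect_intervals(targets, high_confidence)
print(overlap)                          # [(150, 200), (300, 350), (380, 400), (500, 520)]
print(sum(e - s for s, e in overlap))   # 140 bases retained
```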
These four platforms showed comparable precision for SNP calling. Except for the IDT platform, which had lower SNP sensitivity and F-measure, the platforms also showed similar SNP sensitivity and F-measure (Table 3, Fig. 5c and e, Table S6). Precision and sensitivity for INDELs were lower than for SNPs on all four platforms, because INDELs are much harder to detect from short reads using current methods [22]; the IDT platform differed from the other three, with higher INDEL precision and sensitivity. As expected, the precision and sensitivity of both SNPs and INDELs were concordant across replicates and between the two configurations (1-plex and 8-plex) for each platform. Thus, variant calls from the four platforms on DNBSEQ-T7 were reliable.
Table 3.
Comparative analysis of variant detection accuracy for the NA12878 DNA standard sample against GIAB. True positive (TP): SNPs/INDELs present in both VCF files (standard WGS NA12878 and experimental WES NA12878). False positive (FP): SNPs/INDELs found only in the experimental NA12878 VCF file. False negative (FN): SNPs/INDELs found only in the standard WGS NA12878 VCF file and not in the experimental NA12878 VCF file
| Kit | Hybrid method | SNP TP | SNP FP | SNP FN | SNP precision (%) | SNP sensitivity (%) | SNP F-measure (%) | INDEL TP | INDEL FP | INDEL FN | INDEL precision (%) | INDEL sensitivity (%) | INDEL F-measure (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BOKE | 1-plex | 24,710 | 148 | 213 | 99.40 | 99.15 | 99.27 | 1358 | 122 | 221 | 91.76 | 86.00 | 88.79 |
| BOKE | 8-plex | 24,713 | 164 | 209 | 99.34 | 99.16 | 99.25 | 1360 | 138 | 218 | 90.79 | 86.19 | 88.43 |
| Nad | 1-plex | 24,350 | 109 | 192 | 99.55 | 99.22 | 99.39 | 1319 | 133 | 220 | 90.84 | 85.71 | 88.20 |
| Nad | 8-plex | 24,355 | 116 | 186 | 99.53 | 99.24 | 99.38 | 1328 | 140 | 212 | 90.46 | 86.23 | 88.30 |
| IDT | 1-plex | 18,515 | 117 | 310 | 99.37 | 98.35 | 98.86 | 408 | 21 | 55 | 95.10 | 88.12 | 91.48 |
| IDT | 8-plex | 18,561 | 127 | 264 | 99.32 | 98.60 | 98.96 | 414 | 22 | 49 | 94.95 | 89.42 | 92.10 |
| Twist | 1-plex | 22,362 | 168 | 168 | 99.25 | 99.25 | 99.25 | 789 | 65 | 111 | 92.39 | 87.67 | 89.97 |
| Twist | 8-plex | 22,350 | 174 | 180 | 99.23 | 99.20 | 99.21 | 775 | 68 | 124 | 91.93 | 86.21 | 88.98 |
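The accuracy columns of Table 3 follow from the TP, FP and FN counts. The Python sketch below reproduces the BOKE 1-plex SNP row as a worked example; the function name is hypothetical, and the sketch assumes a single TP count as tabulated, whereas RTG Tools vcfeval can count true positives separately for the baseline and call sets.

```python
def accuracy_metrics(tp, fp, fn):
    """Precision, sensitivity (recall) and F-measure from TP, FP and FN counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f_measure


# BOKE 1-plex SNP counts from Table 3.
precision, sensitivity, f_measure = accuracy_metrics(tp=24710, fp=148, fn=213)
print(f"{100 * precision:.2f} {100 * sensitivity:.2f} {100 * f_measure:.2f}")
# 99.40 99.15 99.27
```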
Variants concordance
We further investigated the intra-platform concordance of SNPs and INDELs separately for each of the four platforms among replicates from the two hybridization configurations, using Jaccard similarity to assess concordance among datasets [13]. SNPs showed at least 97% intra-platform concordance (Fig. 6), demonstrating the reliable reproducibility of each platform. For INDELs, the intra-platform concordance was > 70% for all four kits; INDELs with identical positions and matching genotypes were regarded as concordant. This concordance is lower than that of SNPs, probably because the concordance metric for INDELs depends on the definition of concordance employed, including how variant location and genotype agreement are considered. The INDEL concordance of IDT, however, exceeded 83% (Fig. 6b), possibly because the IDT target regions contained the fewest INDELs (Table S6). These results suggest that the platforms have high intra-platform concordance of variant detection on DNBSEQ-T7, particularly for SNPs.
Fig. 6.
Concordance of variant detection. The Jaccard similarity for SNPs (bottom-right triangle) and INDELs (upper-left triangle) among replicates was calculated separately for each platform: (a) BOKE, (b) IDT, (c) Nad, (d) Twist. SNP detection showed excellent intra-platform concordance, whereas INDEL detection showed lower concordance
Features comparison of MGI and other four commercial hybridization processes
The probe designs of different vendors vary significantly, so each normally requires its own hybridization reagents and process. Additionally, differences among exome probe or sequencer vendors introduce bias when assessing the reproducibility of capture performance [17]. To streamline and expedite operations, we implemented a standardized hybridization method for all four exome probes. This approach uses a single set of reagents and protocols across multiple commercial probe brands, reducing the need for specialized equipment and training.
By streamlining the hybridization process, we aim to improve efficiency and consistency in our experiments while also reducing time and costs associated with purchasing and maintaining multiple sets of reagents.
Here we investigated the performance of MGI’s in-house hybridization protocol with the selected probes. For each probe, we pooled 8 libraries (250 ng each) into one enrichment reaction and enriched the same library pools with a 1-h hybridization using two different protocols and their reagents (the MGI protocol and the commercial protocol from each probe manufacturer, Table S1). The MGI protocol was compatible with all four probes. The data from all 8-plex enrichment libraries were compared with each other (Figure S6, Table S8).
We also compared the concordance of SNP/INDEL detection between the two hybridization methods for each sample on the four platforms. All four platforms showed stable SNP concordance of at least 97% for each sample in the two 8-plex pools processed by the commercial and MGI methods (Fig. 7, Figure S7). BOKE and Nad gave > 70% INDEL concordance, both among samples within one 8-plex hybridization by a single method and between samples from the two methods. Twist and IDT gave higher INDEL concordance, > 75% for Twist and > 80% for IDT (Fig. 7, Figure S7), the latter possibly because the IDT target regions contained fewer INDELs. Moreover, concordance among samples from the MGI method did not differ from that among samples from the commercial methods, even when the two methods were cross-compared. These results suggest that MGI hybridization yields consistent variant detection on all four platforms.
Fig. 7.
Variation accuracy estimation by comparison with dbSNP151. Both hybridization methods gave excellent SNP/INDEL calling results, with no notable differences between them
While the four platforms showed comparable SNP detection precision between the commercial and MGI hybridization methods, the results differed for INDELs: with MGI hybridization, IDT showed higher precision and BOKE lower precision (Table 4, Fig. 7, Table S9, Table S10). For both SNPs and INDELs, sensitivity and F-measure varied between the commercial and MGI hybridization methods, and both were lower for BOKE and Nad with MGI hybridization.
Table 4.
Comparative analysis of variant detection accuracy for the NA12878 DNA standard sample between MGI hybridization and commercial hybridization in 8-plex hybridization, benchmarked against GIAB
| Kit | Hybrid Method | SNP TP | SNP FP | SNP FN | SNP Precision (%) | SNP Sensitivity (%) | SNP F-measure (%) | INDEL TP | INDEL FP | INDEL FN | INDEL Precision (%) | INDEL Sensitivity (%) | INDEL F-measure (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BOKE | MGI | 24,654 | 164 | 266 | 99.34 | 98.93 | 99.14 | 1344 | 141 | 234 | 90.51 | 85.17 | 87.76 |
| BOKE | Commercial | 24,713 | 164 | 209 | 99.34 | 99.16 | 99.25 | 1360 | 138 | 218 | 90.79 | 86.19 | 88.43 |
| Nad | MGI | 24,301 | 126 | 240 | 99.48 | 99.02 | 99.25 | 1298 | 154 | 241 | 89.39 | 84.34 | 86.79 |
| Nad | Commercial | 24,355 | 116 | 186 | 99.53 | 99.24 | 99.38 | 1328 | 140 | 212 | 90.46 | 86.23 | 88.3 |
| IDT | MGI | 18,643 | 110 | 182 | 99.41 | 99.03 | 99.22 | 441 | 20 | 22 | 95.66 | 95.25 | 95.45 |
| IDT | Commercial | 18,561 | 127 | 264 | 99.32 | 98.6 | 98.96 | 414 | 22 | 49 | 94.95 | 89.42 | 92.1 |
| Twist | MGI | 22,363 | 131 | 167 | 99.42 | 99.26 | 99.34 | 775 | 61 | 124 | 92.7 | 86.21 | 89.33 |
| Twist | Commercial | 22,350 | 174 | 180 | 99.23 | 99.2 | 99.21 | 775 | 68 | 124 | 91.93 | 86.21 | 88.98 |
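As a minimal sketch, the precision, sensitivity and F-measure columns of Table 4 can be recomputed from the TP, FP and FN counts. This assumes the standard definitions (precision = TP/(TP+FP), sensitivity = TP/(TP+FN), F-measure = their harmonic mean), which this section does not state explicitly; the function name `accuracy_metrics` is illustrative and not part of the authors' pipeline.

```python
def accuracy_metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Return precision, sensitivity and F-measure as percentages (2 decimals).

    Assumes the standard definitions; the paper's exact benchmarking
    tool and settings are not reproduced here.
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "precision": round(100 * precision, 2),
        "sensitivity": round(100 * sensitivity, 2),
        "f_measure": round(100 * f_measure, 2),
    }


# Counts taken from Table 4.
boke_mgi_snp = accuracy_metrics(tp=24654, fp=164, fn=266)
idt_mgi_indel = accuracy_metrics(tp=441, fp=20, fn=22)
print(boke_mgi_snp)
print(idt_mgi_indel)
```

With these counts the sketch reproduces the tabulated values for BOKE (MGI, SNP) and IDT (MGI, INDEL).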
In addition, in terms of library preparation time, cost, and DNA input, MGI's method offers a degree of flexibility. The yields from the MGI enrichment method all exceeded 1000 ng for the four probes, and three of them (BOKE, IDT and Nad) gave higher yields than with the manufacturers’ methods (Table S3), indicating uniformity and stability under the MGI method. As shown in Fig. 1, following the original manufacturers' protocols for each probe individually required four days in total, with one probe hybridized per day. In contrast, under the MGI hybridization protocol, sample enrichment with all probes is completed within a single day. The advantage of MGI’s method lies in standardizing the hybridization steps for these four probes, which enables simultaneous hybridization with DNA probes from different brands and facilitates the automation of large-scale hybridization with multiple probes.
Performance in detecting cancer-relevant mutations
For WES, the cancer exome offers important insights into the coding mutations that drive tumor progression, so the sensitivity and accuracy of detecting cancer-causing mutations are of particular importance. We therefore investigated the consistency of the WES process and its performance in detecting cancer-relevant mutations present at allelic frequencies (AF) between 1 and 100%, using MGI's in-house hybridization protocol with the BOKE probe on the PancancerLight 800 gDNA Reference Standard (hereafter G800) [25]. We constructed a library and sequenced it on DNBSEQ-G400 to about 942 million (M) reads; the data were then normalized to 500 × average sequencing depth and analyzed with MegaBOLT to obtain BAM files. Because the sample contains many rare mutations, we used VarDict [26] to detect mutations. As expected, all 793 associated mutation sites were detected, and their frequencies were consistent with the standards (Fig. 8, Table S11). In particular, 12 of the 16 ddPCR variants were detected with the same frequency, whereas two fusion sites and one CNV were not detected because they were not covered by the BOKE probe (Table S12). Overall, this outcome highlights the strong capability of the DNBSEQ sequencer to cover cancer-relevant mutations.
Fig. 8.

Consistency of variants detected in the standard sample (G800)
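To illustrate what "frequencies consistent with the standards" can mean in practice, the sketch below computes an observed allelic frequency from read counts and checks it against the expected value. The relative tolerance of 50%, the function names and the example read counts are all hypothetical; they are not taken from the G800 documentation or from the authors' analysis.

```python
def allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Observed allelic frequency = alt-supporting reads / total reads at the site."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def is_consistent(observed_af: float, expected_af: float, rel_tol: float = 0.5) -> bool:
    """True if the observed AF lies within rel_tol * expected_af of the expected AF.

    rel_tol is an illustrative threshold, not a value used in the study.
    """
    return abs(observed_af - expected_af) <= rel_tol * expected_af


# Hypothetical site: expected AF of 5%, 27 alt reads out of 500 (about 500x depth).
observed = allele_frequency(alt_reads=27, total_reads=500)
print(observed, is_consistent(observed, expected_af=0.05))
```

A fixed relative tolerance is only one possible choice; at low depth, a binomial confidence interval around the expected AF would be a more principled criterion.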
Discussion
In recent years, massively parallel sequencing technologies have developed rapidly and have been widely applied across various fields [27–29]. WES has become the primary method for detecting variants in both public and personal genomics [30–32].
In this study, we present a comprehensive comparison of four currently available human WES platforms from four manufacturers, sequenced on DNBSEQ-T7. These performance data should help researchers choose the appropriate platform for their studies. The extent to which protein-coding regions are covered is important for human genetic research. The current versions of the four platforms provide high coverage of CCDS (at least 98%), NCBI RefSeq CDS (at least 96%) and GENCODE CDS (at least 97.1%). These figures reflect the intrinsic power of exome capture.
The post-capture PCR yields of all 16 enrichment samples are shown in Table S3. For all four probes under their own manufacturer’s capture method, the yield of 8-plex hybridization was consistently higher than that of 1-plex hybridization, as expected given the larger input amount used in the 8-plex reactions.
We sequenced one individual (NA12878) from the 1000 Genomes Project in parallel on the four platforms with DNBSEQ-T7 and measured the base quality (Q30), mapping rate and capture rate for each dataset. Overall, all platforms performed well.
Sequencing depth is an important consideration for variant detection because it affects the coverage of the target regions. For each dataset generated from both single (1-plex) and eight-library (8-plex) hybridizations, coverage was high: over 98.5% of the bases in the target region were covered by at least one read, more than 96.5% were covered by 20 or more reads, and uniformity exceeded 98% for each platform, suggesting that the entire target region is thoroughly and consistently captured on each platform. Furthermore, we found that Twist exhibited better coverage uniformity than the others.
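The coverage figures above can be illustrated with a short sketch that summarizes per-base depths over a target region. The uniformity definition used here (fraction of target bases with depth of at least 20% of the mean depth) is an assumption, as this section does not define the metric; the function name and the toy depth values are likewise illustrative.

```python
from statistics import mean


def coverage_summary(depths: list[int], uniformity_fraction: float = 0.2) -> dict[str, float]:
    """Summarize per-base depths over a target region.

    uniformity_pct is the share of bases with depth >= uniformity_fraction * mean
    depth; this definition is assumed, not taken from the paper.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    n = len(depths)
    avg = mean(depths)
    return {
        "mean_depth": avg,
        "pct_ge_1x": 100 * sum(d >= 1 for d in depths) / n,
        "pct_ge_20x": 100 * sum(d >= 20 for d in depths) / n,
        "uniformity_pct": 100 * sum(d >= uniformity_fraction * avg for d in depths) / n,
    }


# Toy example: ten target bases with hypothetical depths.
toy_depths = [0, 15, 25, 40, 60, 80, 100, 120, 100, 60]
summary = coverage_summary(toy_depths)
print(summary)
```

In a real analysis, the per-base depths would typically come from a depth-reporting tool run on the BAM file and restricted to the probe's target BED file.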
Capture efficiency is a crucial factor for exome capture platforms. In our hands, the capture rate differed among the four platforms. Although Twist showed a lower and more variable capture rate in both 1-plex and 8-plex hybridization, it showed better coverage uniformity.
Both high and low GC content of the target regions correlated with low sequencing coverage in all exome capture methods; that is, GC content affected the depth of the target regions. GC bias did not differ between 1-plex and 8-plex hybridization on DNBSEQ-T7. However, the Twist platform appears better suited to the targeted capture of genomic regions with lower GC content, achieving consistent, comprehensive coverage breadth along with the coverage depth needed for reliable variant calling.
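The relationship between GC content and depth can be examined by binning target regions by GC percentage and averaging their depth per bin. The following is a minimal sketch under stated assumptions: the 10% bin width, the function names and the toy sequences and depths are hypothetical and do not reproduce the authors' analysis.

```python
from collections import defaultdict
from statistics import fmean


def gc_percent(sequence: str) -> float:
    """GC content (%) over called bases; N and other symbols are ignored."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100 * (seq.count("G") + seq.count("C")) / called


def mean_depth_by_gc_bin(regions, bin_width: int = 10) -> dict[int, float]:
    """Group (sequence, mean_depth) pairs into GC bins and average the depth.

    Keys are the lower bound of each bin in percent.
    """
    bins = defaultdict(list)
    for sequence, depth in regions:
        # Clamp 100% GC into the top bin so that all bins have equal width.
        start = min(int(gc_percent(sequence) // bin_width) * bin_width, 100 - bin_width)
        bins[start].append(depth)
    return {start: fmean(depths) for start, depths in sorted(bins.items())}


# Hypothetical target regions: (sequence, mean depth).
toy_regions = [
    ("ATATATATAT", 30.0),
    ("ATGCATGCAT", 100.0),
    ("ATGCATGCTA", 90.0),
    ("GCGCGCGCGC", 20.0),
]
depth_by_gc = mean_depth_by_gc_bin(toy_regions)
print(depth_by_gc)
```

In this toy data, the extreme GC bins show lower mean depth than the mid-GC bin, which is the pattern described in the text.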
For variant detection, we found that IDT detected fewer SNPs and INDELs, as it has smaller target regions than the others. We also found that Twist detected relatively fewer SNVs, which may be due to its lower capture efficiency. Finally, as anticipated, each platform demonstrated high variant detection sensitivity and accuracy.
We also evaluated the consistency of SNP and INDEL identification for each platform among 1-plex and 8-plex duplicate samples. Each platform showed broad agreement in the identification of SNPs. However, the concordance rate of INDELs was always lower than that of SNPs, which can be attributed to the technical challenges of accurately detecting INDELs from short-read sequencing data using current methodologies. Concordance metrics for INDELs may also vary depending on how position and genotype concordance are defined.
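The difference between position and genotype concordance can be made concrete with a small sketch. Here concordance is computed as shared calls divided by the union of calls from two samples; this denominator, the function name and the toy call sets are assumptions for illustration, since the exact definition used in the study is not given in this section.

```python
Site = tuple[str, int]        # (chromosome, position)
Call = tuple[str, str, str]   # (ref allele, alt allele, genotype)


def concordance(calls_a: dict[Site, Call], calls_b: dict[Site, Call],
                require_genotype: bool) -> float:
    """Percent concordance between two call sets (shared / union of sites).

    Position concordance counts a site as shared if both samples call it;
    genotype concordance additionally requires identical alleles and genotype.
    """
    union = calls_a.keys() | calls_b.keys()
    if not union:
        raise ValueError("both call sets are empty")
    shared = calls_a.keys() & calls_b.keys()
    if require_genotype:
        shared = {site for site in shared if calls_a[site] == calls_b[site]}
    return 100 * len(shared) / len(union)


# Hypothetical calls from two replicate samples.
sample_a = {
    ("chr1", 100): ("A", "G", "0/1"),
    ("chr1", 200): ("AT", "A", "0/1"),
    ("chr2", 50): ("C", "T", "1/1"),
}
sample_b = {
    ("chr1", 100): ("A", "G", "0/1"),
    ("chr1", 200): ("AT", "A", "1/1"),   # same site, different genotype
    ("chr2", 75): ("G", "GA", "0/1"),
}
position_conc = concordance(sample_a, sample_b, require_genotype=False)
genotype_conc = concordance(sample_a, sample_b, require_genotype=True)
print(position_conc, genotype_conc)
```

The same pair of samples yields a lower value under the stricter genotype definition, which is one reason reported INDEL concordance can differ between studies.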
Although the current MGI method cannot yet achieve the same quality standards as the established commercial platforms, we are committed to further optimization and improvement in future research. Sequencing on the MGI platform streamlines operations through its integrated workflow of fast library preparation and hybridization. This is particularly beneficial for large-scale hybridization, especially when multiple probes are used in hybridization experiments on different samples, as it reduces both time and cost for laboratories.
In addition, many exome sequencing applications are now used in tumor detection, particularly for low-frequency variants, which are crucial for tumor prediction and treatment. The ability to detect low-frequency variant sites in exons is therefore essential. In our study, we proposed a new hybridization workflow (MGI’s hybridization). Combined with commercial probes, this method detected low-frequency variants quickly and effectively, even at a lower depth of 200 ×, while being largely consistent with the vendors’ own hybridization methods. Although our current validation is limited to standard SNVs on the BOKE chip, we plan to perform extensive compatibility testing to explore the potential application areas of this new hybridization method in the future.
Acknowledgements
The authors are grateful to the people in their research group for support and valuable suggestions.
Authors’ contributions
M.L. and X.Y. devised the study and drafted the manuscript. M.L., X.L. and Z.J. analyzed the data and interpreted the results. X.Y. and S.Z. conducted the experiments. Y.Z., P.L., J.H. and F.C. supervised the study and reviewed the manuscript. All authors read and approved the final manuscript.
Funding
National Key Research and Development Program of China (No. 2022YFF1202203).
Data availability
The raw sequence data generated in this study have been deposited in the Genome Sequence Archive in National Genomics Data Center, China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (accession number GSA-Human: HRA010877) and are publicly accessible at https://ngdc.cncb.ac.cn/gsa-human.
Declarations
Ethics approval and consent to participate
No ethics approval or consent to participate was required for this study.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Meiyan Li, Xinshi Yang, Xinming Liang and Zheng Jia contributed equally to this work and shared the co-first authorship.
Contributor Information
Jie Huang, Email: jhuang5522@nifdc.org.cn.
Ping Liu, Email: liuping3@mgi-tech.com.
Fang Chen, Email: fangchen@mgi-tech.com.
References
- 1. Zhang X, et al. Double-layer focal plane microscopy for high throughput DNA sequencing. Opt Express. 2022;30(11):18496–504.
- 2. Zhou J, et al. Performance comparison of four types of target enrichment baits for exome DNA sequencing. Hereditas. 2021;158(1):10.
- 3. Huang J, et al. A reference human genome dataset of the BGISEQ-500 sequencer. Gigascience. 2017;6(5):1–9.
- 4. NovaSeq™ 6000, Illumina Inc., https://www.illumina.com/systems/sequencing-platforms/novaseq.html.
- 5. Genetic sequencer DNBSEQ™-T7, MGI Tech Co., Ltd., https://en.mgi-tech.com/products/instruments_info/5.
- 6. Bai RQ, et al. A novel FAM83H variant causes familial amelogenesis imperfecta with incomplete penetrance. Mol Genet Genomic Med. 2022;10(4):e1902.
- 7. Sehar S, et al. Pan-transcriptomic profiling demarcates Serendipita indica-phosphorus mediated tolerance mechanisms in rice exposed to arsenic toxicity. Rice. 2023;16(1):28.
- 8. Wei X, et al. Comprehensive analysis of transcriptomic profiling of 5-methylcytosin modification in placentas from preeclampsia and normotensive pregnancies. FASEB J. 2023;37(2):e22751.
- 9. Sulonen AM, et al. Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol. 2011;12(9):R94.
- 10. Shigemizu D, et al. Performance comparison of four commercial human whole-exome capture platforms. Sci Rep. 2015;5:12742.
- 11. Yaldiz B, et al. Twist exome capture allows for lower average sequence coverage in clinical exome sequencing. Hum Genomics. 2023;17(1):39.
- 12. Iadarola B, et al. Whole-exome sequencing of the mummified remains of Cangrande della Scala (1291–1329 CE) indicates the first known case of late-onset Pompe disease. Sci Rep. 2021;11(1):21070.
- 13. Xu Y, et al. A new massively parallel nanoball sequencing platform for whole exome research. BMC Bioinformatics. 2019;20(1):153.
- 14. Belova V, et al. System analysis of the sequencing quality of human whole exome samples on BGI NGS platform. Sci Rep. 2022;12(1):609.
- 15. Belova V, et al. Comparative evaluation of four exome enrichment solutions in 2024: Agilent, Roche, Vazyme and Nanodigmbio. BMC Genomics. 2025;26(1):76.
- 16. Clark MJ, et al. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol. 2011;29(10):908–14.
- 17. Meienberg J, et al. New insights into the performance of human whole-exome capture platforms. Nucleic Acids Res. 2015;43(11):e76.
- 18. Barbitoff YA, et al. Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage. Sci Rep. 2020;10(1):2057.
- 19. Abecasis GR, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–73.
- 20. Sherry ST. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29(1):308–11.
- 21. Phan L, et al. The evolution of dbSNP: 25 years of impact in genomic research. Nucleic Acids Res. 2025;53(D1):D925–31.
- 22. Jiang Y, Turinsky AL, Brudno M. The missing indels: an estimate of indel variation in a human genome and analysis of factors that impede detection. Nucleic Acids Res. 2015;43(15):7217–28.
- 23. Zook JM, et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014;32(3):246–51.
- 24. Zook JM, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3:160025.
- 25. G800, https://us.gene-well.com/Product/Oncology/tumor/800/IB-GW-OGTM800.
- 26. Lai Z, et al. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 2016;44(11):e108.
- 27. Lu R, et al. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. Lancet. 2020;395(10224):565–74.
- 28. Lu M, et al. Changes in ploidy drive reproduction transition and genomic diversity in a polyploid fish complex. Mol Biol Evol. 2022. 10.1093/molbev/msac188.
- 29. Tang D, et al. Genome evolution and diversity of wild and cultivated potatoes. Nature. 2022;606(7914):535–41.
- 30. Qi Q, et al. Whole-genome sequencing analysis in fetal structural anomalies: novel phenotype-genotype discoveries. Ultrasound Obstet Gynecol. 2024;63(5):664–71.
- 31. Hu FY, et al. ABCA4 gene screening in a Chinese cohort with Stargardt disease: identification of 37 novel variants. Front Genet. 2019;10:773.
- 32. Liang C, et al. Identification of novel EXT mutations in patients with hereditary multiple exostoses using whole-exome sequencing. Orthop Surg. 2020;12(3):990–6.