Briefings in Bioinformatics. 2024 Jun 5;25(4):bbae266. doi: 10.1093/bib/bbae266

NIPT-PG: empowering non-invasive prenatal testing to learn from population genomics through an incremental pan-genomic approach

Zhengfa Xue 1,2,#, Aifen Zhou 3,4,#, Xiaoyan Zhu 5,6, Linxuan Li 7,8, Huanhuan Zhu 9, Xin Jin 10,11, Jiayin Wang 12,13
PMCID: PMC11151788  PMID: 38836702

Abstract

Non-invasive prenatal testing (NIPT) is a widely used approach for detecting fetal genomic aneuploidies. However, limited sequencing read length and coverage create a bottleneck for NIPT, both in further improving performance and in enabling earlier detection. The errors mainly come from reference biases and population polymorphism. To break this bottleneck, we propose NIPT-PG, which enables the NIPT algorithm to learn from population data. A pan-genome model is introduced to incorporate variant and polymorphic loci information from the tested population. We then propose a sequence-to-graph alignment method that considers read mismatch rates during mapping, together with an indexing method that uses hash indexing and adjacency lists to accelerate read alignment. Finally, by integrating multi-source aligned reads and polymorphic sites across the pan-genome, NIPT-PG obtains a more accurate z-score, thereby improving the accuracy of chromosomal aneuploidy detection. We tested NIPT-PG on two simulated datasets and 745 real-world cell-free DNA sequencing datasets from pregnant women. The results demonstrate that NIPT-PG outperforms the standard z-score test. Furthermore, combining experimental and theoretical analyses, we demonstrate the probably approximately correct learnability of NIPT-PG. In summary, NIPT-PG provides a new perspective on the detection of fetal chromosomal aneuploidies. NIPT-PG may have broad applications in clinical testing, and its detection results can serve as a reference for false positive samples approaching the critical threshold.

Keywords: non-invasive prenatal testing, chromosomal aneuploidies, pan-genomic graph, sequence-to-graph alignment

Introduction

Non-invasive prenatal testing (NIPT) technology, based on high-throughput sequencing, enables the detection of fetal chromosomal aneuploidies [1–4]. NIPT compares the test sample with a normal (euploid) control group. Fetal aneuploidy is detected and quantified by observing whether the z-score falls within the normal range [5]. Numerous studies indicate that NIPT exhibits higher accuracy than traditional serological screening, with excellent sensitivity and specificity for Trisomy 21, 18, and 13 [6–8]. However, due to limitations in materials and methods, some studies have reported erroneous results, primarily false-positive outcomes [9, 10].

The core of chromosomal aneuploidy detection lies in accurately calculating the read counts for each chromosome [11]. This process can be subdivided into two crucial components: accurately aligning reads to a suitable reference genome and selecting a suitable z-score threshold. Existing research has focused primarily on the latter, e.g. GC content correction and z-score threshold optimization [12–14], while overlooking the potential benefits of controlling alignment errors.

Reducing alignment errors is a challenging problem. Constrained by the proportion of fetal DNA in maternal plasma and by sequencing costs, NIPT sequencing data typically exhibit extremely low coverage (generally 0.1–1×) and very short read length (~35 bp) [15]. The extremely low coverage means that most genomic regions in the sample lack sufficient reads to support alignment, thereby reducing the reliability and stability of subsequent detection. Additionally, shorter sequencing reads are more likely to align to other positions in the reference genome, leading to alignment errors. Therefore, NIPT requires high accuracy in read alignment. However, current methods for aneuploidy detection have not taken into consideration the impact of alignment error on the results. From the perspective of alignment methods, whether in the field of NIPT or mutation detection, existing bioinformatics alignment methods, represented by software such as BWA, align reads to the reference genome based on rules determined by the mapping between the base arrangement of the sequencing data and the reference genome. Rule-based alignment software produces deterministic results but lacks the capacity to learn. Even though some alignment software provides alternative potential matching positions for some reads, no related mutation detection software currently considers the impact of these multi-source aligned reads (reads aligned to multiple positions) on detection outcomes. Furthermore, from the perspective of the reference genome, current human reference genomes are predominantly constructed from a few Caucasian samples, making it challenging to represent the genomic diversity of non-Caucasian populations [16]. Therefore, NIPT sequencing data are highly susceptible to the polymorphism preferences of the reference genome during alignment, resulting in reference bias. Reference bias is one of the causes of false positives and false negatives in fetal aneuploidy detection.

Recent research in pan-genomes has provided new insights into addressing reference bias issues [17]. In contrast to the reference genome, the pan-genome models the entire set of genomic elements within a given species or clade. The additional information it provides allows for outstanding performance in various bioinformatics tasks, including read alignment, variant calling, and gene typing [18–20]. Estimates based on short-read sequencing data suggest that the human pan-genome may be 1% [21] to 10% [22] larger than the human reference genome (GRCh38). Sequences totaling up to several mega-base pairs per individual are not included in the reference genome [23, 24]. Therefore, a pan-genome constructed based on the reference genome may be better at identifying the true alignment positions of short-reads than the reference genome alone. Moreover, pan-genome is endowed with learning capabilities, allowing it to incorporate a greater number of samples to enhance the stability of alignments. Currently, the increasing availability of pan-genomic references for humans [25] and other organisms is making the use of a single reference genome increasingly less optimal. However, utilizing pan-genomes effectively necessitates the development of new bioinformatic methods capable of rapidly constructing, querying, and operating on the pan-genome, and tailoring the pan-genome to specific biological questions.

In this study, we explored a novel framework termed NIPT-PG, which integrates a pan-genomic graph with calibrated z-scores to identify high-risk fetal chromosomal aneuploidies. Within this framework, we have improved both the reference genome and the alignment methods involved in the detection of aneuploidies. To enhance detection efficiency and accuracy, we concurrently designed a sequence-to-graph read alignment method that considers misalignment. Results indicate that NIPT-PG exhibits higher specificity than traditional NIPT technology based on the standard z-score approach, providing a more accurate basis for clinical genetic counseling. Importantly, NIPT-PG exhibits the probably approximately correct (PAC) learnable property, serving as a reference for other variant detection software.

Materials and methods

The standard process for NIPT is illustrated in Fig. 1A. NIPT-PG introduces three additional steps to the standard NIPT process: (i) the design of the pan-genome, (ii) sequence-to-graph alignment, and (iii) a z-score calculation model based on multi-source aligned reads (Fig. 1B). NIPT-PG eliminates reference bias by realigning sequencing samples to the pan-genome, with its core components detailed in the following text. The usage methodology is illustrated in Supplement 1.

Figure 1. (A) The standard process for NIPT; (B) NIPT-PG introduces three additional steps to the standard NIPT process.

The design of the pan-genome

The first step involves building a pan-genome based on a substantial cohort of individuals undergoing testing. The pan-genome model (Fig. 2) is a data structure that represents the genomic sequences of a population, a species, an evolutionary branch, or even a metagenome [26]. Pan-genome model serves as a central coordinating entity to describe the collection of sequences and genomes within the pan-genome. The pan-genome model can take various forms, including a collection of unaligned sequences or a learned sequence model. Here, we employ a graph-based model to establish direct relationships among all genomes, referring to this type of graph as a sequence graph. The main chain in pan-genome is considered to be the shared portion among all samples. For the parts of the sequence with differences, they exist as branches in the pan-genome. Therefore, sequence graphs are employed to compress numerous redundant input sequences into a smaller data structure that still represents the complete collection [27].

Figure 2. The design of the pan-genome. (A) In genomic analysis based on a reference genome, all genomes (G1, G2, G3, G4) are compared with the reference genome R (I). In the pan-genome context, we aim to simulate direct relationships among all genomes involved in the analysis (II). As the analysis extends to a new genome Δ, we augment the genome model by comparing it to the reference genome R (III); simultaneously, incorporating the new genome into the pan-genomic analysis involves direct comparisons with all other genomes in the model (IV). (B) The visualization of the pan-genome structure. The graphical model of genomes allows for direct all-to-all comparisons, capturing all sequence relationships among them.

We define a sequence graph as a tuple G = (V, E, σ), where V = {v1, v2, …, vn} represents nodes, E ⊆ V × V is a set of directed edges, and σ: V → Σ assigns one character from the alphabet Σ = {A, G, C, T} to each node. Initially, the reference genome is transformed into a unidirectional sequence graph, serving as the starting pan-genome. It is a long-chain sequence graph without any branching paths. Subsequently, we incorporate the individual-specific genomes into the pan-genomic graph model, comparing each with all other genomes in the model (Fig. 2A). For genomic segments in the individual genome that overlap with the initial pan-genome, no modifications are made. Segments containing variant information exist in the pan-genomic graph as branching paths (Fig. 2B). The region where multiple paths connect a common head and tail node in a graph is commonly referred to as a bubble [28], indicative of variation. For fetal chromosomal aneuploidy detection, the pan-genome represents the collective genome of all individuals in the current batch of testing, encompassing all potential genetic variations and genome rearrangements. Therefore, the pan-genome can capture more comprehensive genetic variation information, effectively enhancing the accuracy of whole-genome data analysis.
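The definition above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the NIPT-PG implementation: the class name `SequenceGraph` and its methods are hypothetical, nodes are numbered integers, and a variant is added by hand as a branch between an existing head and tail node, forming a bubble.

```python
from collections import defaultdict


class SequenceGraph:
    """Toy sequence graph G = (V, E, sigma) with one base per node (hypothetical helper)."""

    def __init__(self):
        self.base = []                # sigma: node id -> base
        self.out = defaultdict(list)  # adjacency list: node id -> successor ids

    def add_node(self, base):
        if base not in "ACGT":
            raise ValueError(f"unexpected base: {base!r}")
        self.base.append(base)
        return len(self.base) - 1

    def add_edge(self, u, v):
        if v not in self.out[u]:
            self.out[u].append(v)

    def add_chain(self, seq):
        """Add seq as a linear chain (the starting pan-genome); return its node ids."""
        ids = [self.add_node(b) for b in seq]
        for u, v in zip(ids, ids[1:]):
            self.add_edge(u, v)
        return ids

    def add_branch(self, head, tail, alt_seq):
        """Add an alternative path head -> alt_seq -> tail, i.e. one arm of a bubble."""
        ids = [self.add_node(b) for b in alt_seq]
        path = [head] + ids + [tail]
        for u, v in zip(path, path[1:]):
            self.add_edge(u, v)
        return ids

    def spell_paths(self, start, end):
        """Return every path sequence from start to end (small acyclic graphs only)."""
        results = []
        stack = [(start, self.base[start])]
        while stack:
            node, seq = stack.pop()
            if node == end:
                results.append(seq)
                continue
            for nxt in self.out[node]:
                stack.append((nxt, seq + self.base[nxt]))
        return sorted(results)


# Reference ACGTA as a chain, plus a G->T substitution at the third base.
graph = SequenceGraph()
ref = graph.add_chain("ACGTA")
graph.add_branch(ref[1], ref[3], "T")
haplotypes = graph.spell_paths(ref[0], ref[4])
```

In this toy graph the main chain carries the shared sequence and the extra node carries the variant, so both haplotypes are spelled by paths through the same structure.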

Sequence-to-graph alignment

To identify multi-source aligned reads, the sequencing data must be realigned to the pan-genome. Here, each read is considered as a path sequence. We provide the following definition for path sequence: Let p = (p1, …, pk) be a path in the sequence graph G = (V, E, σ); that is, pi ∈ V for i ∈ {1, …, k} and (pi, pi+1) ∈ E for i ∈ {1, …, k−1}. Then, the path sequence of p is given by σ(p1), σ(p2), …, σ(pk). Therefore, the read alignment problem can be transformed into the problem of finding sub-paths in the graph. Building upon depth-first search, we use a hash index to expedite the entire alignment process. Initially, we search the entire pan-genome, recording all starting nodes for each base segment of length k (k-mer) in the pan-genome, and construct a hash table (Fig. 3A and B). The storage bucket position for key-value pairs in the hash table is determined by the index function

Figure 3. Sequence-to-graph alignment. (A) A local graph of the pan-genomic model. Contiguous DNA sequence fragments of length k are first generated from the pan-genome, and an index table (B) is constructed accordingly; the index table documents the positions of these fragments within the pan-genome, and the fragments are rapidly retrieved through a hash function-based mapping to indices. (C) The adjacency table documents the out-degree nodes of each node within the pan-genome. (D) The query sequence is placed into a stack, with its starting node at the top of the stack; using information from the index table and adjacency table, the nodes within the pan-genome (current nodes) are sequentially aligned with the top node of the stack (current top node), and if the base of the current node matches that of the top node, the top node is popped; this repeats until the stack is empty or no node can be matched with the top node.

$$\mathrm{index} = f(k, n) \tag{1}$$

where k is the key, representing the k-mer base sequence (such as “AGCAT”), and n is the size of the hash bucket array.

The position of the key in the storage bucket is located by processing the key k with a hash function, which converts it into a scalar value. This value is then scaled to a valid array index using the modulo operation, denoted by the function ƒ, which can be expressed as

$$f(k, n) = h(k) \bmod n \tag{2}$$

Here, h represents the hash function. The hash function maps the original input to an integer, and its expression is given by

$$h: K \rightarrow S \tag{3}$$

where K represents the full set of keywords, and S = {0, 1, …, m-1} represents the set of all possible values for the hash function. Subsequently, we use an adjacency list to record the outdegree nodes v' for each node v (Fig. 3C).
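The indexing step can be sketched as follows. This is a minimal illustration rather than the authors' code: the base-4 encoding used for `h`, the bucket count, and the helper names (`bucket_index`, `kmers_from`, `build_index`, `lookup`) are all assumptions made for the example, and the graph is a six-node toy with one bubble.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def h(kmer):
    """Map a k-mer to an integer (base-4 encoding; an illustrative choice of hash)."""
    value = 0
    for b in kmer:
        value = value * 4 + BASE_CODE[b]
    return value


def bucket_index(kmer, n):
    """Scale the hash value to a valid bucket position with the modulo operation."""
    return h(kmer) % n


def kmers_from(node, k, base, out):
    """Yield every k-mer spelled by a path of k nodes that starts at `node`."""
    stack = [(node, base[node])]
    while stack:
        cur, seq = stack.pop()
        if len(seq) == k:
            yield seq
            continue
        for nxt in out.get(cur, []):
            stack.append((nxt, seq + base[nxt]))


def build_index(base, out, k, n):
    """Record the starting node of every k-mer in an array of n buckets."""
    buckets = [[] for _ in range(n)]  # each bucket holds (kmer, start_node) pairs
    for node in range(len(base)):
        for kmer in set(kmers_from(node, k, base, out)):
            buckets[bucket_index(kmer, n)].append((kmer, node))
    return buckets


def lookup(buckets, kmer):
    """Return the sorted starting nodes recorded for `kmer` (empty list if absent)."""
    bucket = buckets[bucket_index(kmer, len(buckets))]
    return sorted(start for key, start in bucket if key == kmer)


# Toy graph: chain A-C-G-T-A (nodes 0..4) plus node 5 ("T") as an alternative to node 2.
base = ["A", "C", "G", "T", "A", "T"]
out = {0: [1], 1: [2, 5], 2: [3], 3: [4], 5: [3]}
index = build_index(base, out, k=3, n=8)
```

Distinct k-mers can share a bucket, which is why each bucket stores the key next to the start node and `lookup` compares keys before returning positions.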

We utilize the information from the hash table and adjacency list to sequentially align the nodes in the pan-genome (starting nodes recorded in the hash table) with the top node of the stack (starting node of the path sequence). Subsequently, we query the adjacency list to find the next aligned node. If the current node's base matches the top node's base in the stack, then we pop the top node at the stack until the stack is empty or a matching node cannot be found in the adjacency list (Fig. 3D). Research indicates that the average and worst-case time complexity of a hash table are O (1) and O (n), respectively [29], which significantly reduces the time required for sequence-to-graph alignment. Consequently, DNA sequence segments can be rapidly located on the pan-genomic graph based on the mapping provided by the hash function. Finally, we recorded all alignment positions of each read on the pan-genome chromosomes for each sample and generated a new alignment file.
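The matching loop can be sketched in the same spirit. This simplified version handles exact matches only (it does not model the mismatch handling of NIPT-PG), tries every node as a candidate start unless a candidate list is supplied, and uses an explicit stack of (node, read position) frames so that branches in the adjacency list are explored depth-first. The function names and the toy graph are assumptions for illustration.

```python
def align_from(start, read, base, out):
    """Return True if `read` spells a path in the graph that begins at node `start`."""
    if not read:
        return False
    # Each frame pairs a graph node with the index of the read base still to be matched.
    frames = [(start, 0)]
    while frames:
        node, pos = frames.pop()
        if base[node] != read[pos]:
            continue  # mismatch: abandon this branch
        if pos == len(read) - 1:
            return True  # every base of the read has been matched
        for nxt in out.get(node, []):
            frames.append((nxt, pos + 1))
    return False


def align(read, base, out, candidates=None):
    """Return all start nodes at which `read` aligns exactly."""
    if candidates is None:
        candidates = range(len(base))  # in practice these would come from the k-mer index
    return [s for s in candidates if align_from(s, read, base, out)]


# Toy graph: chain A-C-G-T-A (nodes 0..4) plus node 5 ("T") as an alternative to node 2.
base = ["A", "C", "G", "T", "A", "T"]
out = {0: [1], 1: [2, 5], 2: [3], 3: [4], 5: [3]}

hits_ref = align("CGTA", base, out)  # follows the main chain
hits_alt = align("CTTA", base, out)  # follows the branch through node 5
```

Both reads start at the same node but follow different arms of the bubble, which is the case a linear reference cannot represent without a mismatch.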

Z-score test based on multi-source aligned read

The standard z-score test for chromosomal aneuploidy in NIPT is defined as

$$x = \frac{r_{c}}{\sum_{i} r_{i}} \tag{4}$$
$$z = \frac{x - \mu}{\sigma} \tag{5}$$

where $x$ represents the proportion of reads for the current chromosome in the test sample relative to all chromosomes ($r_c$ is the read count of the current chromosome and the sum runs over all chromosomes), $\mu$ signifies the average proportion of reads for this chromosome in the control group samples relative to all chromosomes, and $\sigma$ denotes the standard deviation of the proportion of reads for this chromosome in the control group samples. In typical circumstances, results with a z-score greater than 3 or less than −3 are considered potential NIPT positive results. In such cases, further invasive testing is required to confirm the results.
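As a worked illustration of the standard test, the sketch below computes a z-score from per-chromosome read counts. The function names and the toy counts are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def read_fraction(counts, chrom):
    """Proportion of a sample's reads that fall on `chrom`."""
    return counts[chrom] / sum(counts.values())


def z_score(test_counts, control_counts, chrom):
    """Standard z-score of `chrom` in the test sample against euploid controls."""
    control = [read_fraction(c, chrom) for c in control_counts]
    mu = mean(control)
    sigma = stdev(control)  # assumption: sample standard deviation
    return (read_fraction(test_counts, chrom) - mu) / sigma


# Hypothetical counts: three euploid controls and one test sample.
controls = [
    {"chr21": 10, "other": 990},
    {"chr21": 12, "other": 988},
    {"chr21": 8, "other": 992},
]
test_sample = {"chr21": 20, "other": 980}

z = z_score(test_sample, controls, "chr21")
flagged = abs(z) > 3  # threshold described in the text
```

With these toy numbers the control fractions have mean 0.01 and standard deviation 0.002, so a test fraction of 0.02 lies five standard deviations above the mean and is flagged.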

Next, we have made modifications to the standard z-score test. Firstly, we define multi-source aligned read as those reads that have multiple alignment positions across the pan-genome. In the previous step, we tallied all multi-source aligned reads and generated the new alignment file. Each multi-source aligned read falls into one of the following two scenarios.

  • (I) The first scenario occurs when reads misalign to other positions on the same chromosome. For example, Readchr1_100 →  Readchr1_500, which indicates that the read has two alignment positions on chromosome 1. Readchr1_100 represents the original alignment position at position 100 on chromosome 1, and Readchr1_500 indicates that, after pan-genomic correction, the read may also align to position 500 on chromosome 1 (Fig. 4A). It is noteworthy that, as the calculation principle of the z-score is based on the proportion of reads across the entire chromosome, the first scenario does not impact the detection results.

  • (II) The second scenario arises when the reads misalign to other chromosomes. For instance, Readchr1_350 → Readchr2_300, which implies that the read simultaneously aligns to both chromosome 1 and chromosome 2 in the pan-genome (Fig. 4A). The second scenario introduces reference bias, as in the context of considering population polymorphism, we cannot determine which position is the true matching position for the read.

Figure 4. Fetal aneuploidy detection. (A) Two potential misalignment scenarios of reads during pan-genome alignment: misalignment to the same chromosome (Read 1) and misalignment to other chromosomes (Read 3); (B) schematic for calculating the maximum and minimum read count; (C) after calculating the z-score based on the maximum and minimum read count, the z-score is compared to the threshold.

Therefore, we devised a novel z-score calculation method that accounts for reference bias. As illustrated in Fig. 4B, considering the second scenario, we computed the potential maximum and minimum count of reads on a chromosome. For a given chromosome, if any of the alignment positions for a multi-source aligned read is on that chromosome, it is considered a valid count for that chromosome, enabling the calculation of the potential maximum count of reads. Conversely, if any of the alignment positions for a multi-source aligned read is not on the same chromosome, it is deemed an invalid count for that chromosome, facilitating the calculation of the potential minimum count of reads. Finally, by averaging the potential maximum and minimum count, we obtain the average count of reads. The modified formula for z-score is

$$\bar{R} = \frac{R_{\max} + R_{\min}}{2} \tag{6}$$
$$z' = \frac{\bar{x} - \mu}{\sigma} \tag{7}$$

where $\bar{R}$ represents the average read count on a particular chromosome, obtained from the potential maximum count $R_{\max}$ and minimum count $R_{\min}$, and $\bar{x}$ is the chromosome's read proportion computed from $\bar{R}$ in place of the raw read count. The newly calculated z-scores exhibit slight discrepancies compared with the standard method, but this does not impact the identification of negative samples (Fig. 4C). However, for aneuploid samples near the z-score threshold, correction for reference bias may correctly identify false positive and false negative samples.
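The counting rule described above can be sketched as follows. Each read is represented by the list of chromosomes of all its alignment positions; the function names are hypothetical, and dividing the averaged count by the total number of reads to obtain a proportion is an assumption made for this example.

```python
def count_bounds(read_alignments, chrom):
    """Return (minimum, maximum) possible read counts for `chrom`.

    A read counts toward the maximum if any of its alignments is on `chrom`,
    and toward the minimum only if all of its alignments are on `chrom`.
    """
    max_count = sum(1 for chroms in read_alignments if chrom in chroms)
    min_count = sum(1 for chroms in read_alignments if set(chroms) == {chrom})
    return min_count, max_count


def modified_z(read_alignments, chrom, mu, sigma):
    """z-score based on the average of the minimum and maximum read counts."""
    low, high = count_bounds(read_alignments, chrom)
    average = (low + high) / 2
    fraction = average / len(read_alignments)  # assumption: normalize by total reads
    return (fraction - mu) / sigma


# Hypothetical reads: unique, same-chromosome multi-hit, cross-chromosome multi-hit, unique.
reads = [
    ["chr21"],
    ["chr21", "chr21"],  # scenario (I): does not change the chromosome count
    ["chr21", "chr2"],   # scenario (II): ambiguous between chromosomes
    ["chr2"],
]
bounds = count_bounds(reads, "chr21")
z_mod = modified_z(reads, "chr21", mu=0.5, sigma=0.05)
```

Only the cross-chromosome read separates the two bounds, which matches the observation that scenario (I) leaves the result unchanged.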

Why does the pan-genome work in NIPT?

The DNA genotypes of newborns are determined by the haplotypes of their parents. This process and the evolution of genotype frequencies across multiple generations align with the assumptions of the Wright–Fisher model [30]. The Wright–Fisher model is a classical framework for describing the stochastic reproduction and evolutionary changes in genotype frequencies, applicable to population genetics research. According to the theoretical framework of the Wright–Fisher model, in the process of population development, gene mutation is a random process primarily driven by stochastic sampling due to the finite population size. Assuming a diploid population consisting of N individuals, consider a given locus with two allelic genes, Inline graphicand Inline graphic. The population reproduces in discrete time steps, and individuals in each generation inherit one allelic gene from their parents. At time k, the quantity of allelic gene Inline graphic in the population is denoted as Inline graphic, with its possible values ranging from 0 to 2N. The transition probability can be expressed using the following formula:

graphic file with name DmEquation8.gif (8)

where N is the population size, and i and j represent the quantities of allelic genes Inline graphic and Inline graphic, respectively. Furthermore, in each subsequent generation, the genotype frequencies will be redistributed according to a binomial distribution. The evolutionary equation is given by

$$p' = \frac{p\,w}{\bar{w}} \tag{9}$$

where $p'$ is the frequency of genotypes in the next generation, $p$ is the frequency in the current generation, $w$ is the fitness of the current genotype, and $\bar{w}$ is the average fitness of the population. The frequency of genotypes in the next generation is influenced by both $w$ and $\bar{w}$, so genotypes with higher fitness are more likely to be selected in the corresponding environment, while genotypes with lower fitness may gradually decrease. Moreover, genotypes may fixate or disappear within a short time when N is small. We computed that under ideal conditions, when N = 10 and the initial genotype frequency is P = 0.5 (i = j = 10), it takes only 100 generations for the frequency of this genotype to reach 0.4958, eventually either fixating or disappearing. In a population with N = 100 (i = j = 100), the frequency of fixation or disappearance of this genotype after 100 generations is only 0.0704. Therefore, in the process of a small population gradually developing into a large population, the impact of randomness on gene frequency can be significant. In the absence of selection, genetic drift is an irreversible process. Once an allele is lost from the population, it is difficult for it to become dominant again. Consequently, the genomes of different populations gradually become differentiated.
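The effect of population size on drift can be explored with a short calculation. The sketch below assumes the textbook neutral Wright–Fisher model, in which the 2N allele copies of each generation are drawn binomially from the previous generation's allele frequency. It is not the authors' computation, makes no claim to reproduce the figures quoted above, and its function names are hypothetical.

```python
from math import comb


def transition_matrix(n_diploid):
    """Row i gives the distribution of next-generation allele counts given count i."""
    m = 2 * n_diploid
    rows = []
    for i in range(m + 1):
        p = i / m
        rows.append([comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(m + 1)])
    return rows


def distribution_after(n_diploid, start_count, generations):
    """Exact distribution of the allele count after the given number of generations."""
    m = 2 * n_diploid
    matrix = transition_matrix(n_diploid)
    dist = [0.0] * (m + 1)
    dist[start_count] = 1.0
    for _ in range(generations):
        new = [0.0] * (m + 1)
        for i, weight in enumerate(dist):
            if weight:
                row = matrix[i]
                for j in range(m + 1):
                    new[j] += weight * row[j]
        dist = new
    return dist


# Start at frequency 0.5 and follow 100 generations for a small and a larger population.
small = distribution_after(10, 10, 100)
large = distribution_after(100, 100, 100)
p_fixed_small, p_lost_small = small[-1], small[0]
p_fixed_large = large[-1]
```

The last and first entries of each distribution are the probabilities that the allele has been fixed or lost by that generation; comparing them across population sizes shows how much faster drift removes variation when N is small.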

NIPT data exhibit typical characteristics of population data and are influenced by factors such as geography and population migration. In other words, variations exist in NIPT data across different regions and ethnicities. The traditional linear reference genome is highly effective for some typical and highly conserved gene sequences. However, it has limitations in capturing genetic diversity within populations and individual differences. In contrast, the pan-genome encompasses differences among individuals and genetic variations between different populations, providing a more comprehensive reflection of the overall structure of the genome. Therefore, the pan-genome works for NIPT.

Results

We developed NIPT-PG, a novel NIPT framework based on a tested population for detecting fetal aneuploidy. NIPT-PG initially constructs a pan-genomic graph. For simulated data, 1000 euploid samples were utilized to build the pan-genomic graph. Real-world data involved 684 euploid samples for pan-genomic graph construction. Subsequently, individual sample sequencing data are realigned to the pan-genomic graph to generate new alignment files. The new alignment files document all matching positions of reads in the sample. Finally, based on the method described earlier, we calculate the average reads counts for each chromosome in the samples using the newly generated alignment files and compute new z-score to quantify and identify aneuploidies. By correcting reference biases through the polymorphic positions in the pan-genome, NIPT-PG further enhances the accuracy of aneuploidy detection. Additionally, we optimized the alignment process of sequencing data to improve detection efficiency. Using both simulated and real-world data, we compared the performance of NIPT-PG and the standard z-score test. The mean and standard deviation of the control samples remained consistent for both methods.

Study participants and data production

We validated NIPT-PG on both simulated data and real-world sequencing data from pregnant women (Table 1). It is noteworthy that the NIPT cohort typically originates from a specific geographic region. Shared variant sites within the population are recognized by NIPT-PG as polymorphic sites rather than mismatched sites, representing a major advantage of NIPT-PG. Hence, we generated two types of simulated data, namely population data (MS + Seq-Gen + ART) and random mutation data (ART-Random). The generation steps and details of simulated data can be found in Supplement 2.

Table 1. Overview of BGI-NIPT and simulated data

Groups            NIPT+ (a)                                Non-trisomy   Total
                  Trisomy 13   Trisomy 18   Trisomy 21
Population        156          131          148            1424          1859
Random mutation   151          140          148            1408          1847
BGI-NIPT          15           14           32             684           745

(a) NIPT+, NIPT positive results; NIPT, Non-invasive Prenatal Testing.

For population data, we used the MS [31], SEQ-GEN [32], and ART [33] software to generate the simulations (Supplement 3, Supplemental Table 1). The program MS can be employed to generate numerous independent replicate samples under various assumptions regarding migration, recombination rates, and population size, aiding polymorphism studies. SEQ-GEN is capable of simulating the evolution of sequences under various mutation models. ART is a suite of simulation tools designed to generate next-generation sequencing data from a given reference genome. Initially, using the MS software, we extracted samples from a population evolving under the Wright–Fisher neutral model through a Monte Carlo simulation, obtaining evolutionary trees for individual samples. Subsequently, SEQ-GEN was used to generate reference genomes for individual samples based on the evolutionary trees. Finally, the ART software was employed to generate sequencing files for individual samples based on their reference genomes. Additionally, coverage was increased randomly on chromosomes 13, 18, or 21 for some samples to simulate trisomy samples.

For random mutation data, we directly employed the ART software for simulation (Supplement 4, Supplemental Table 2). The read length for both types of simulated data was consistently set at 35 bp, with the sequencing coverages ranging from 0.1× to 0.5×. The insertion rate and deletion rate were both set at 0.005. Samples with an amplification of chromosomes 13, 18, or 21 exceeding 1.4 were considered trisomy samples. The reference genome length was uniformly set at 2.4 million bp. Consequently, we were able to precisely analyze the performance of NIPT-PG from various perspectives.

Furthermore, we collected 745 real NIPT sequencing samples, sourced from the Beijing Genomics Institute (BGI) in Shenzhen, China (BGI-NIPT). The ages of the pregnant women ranged from 20 to 45 years. Gestational ages varied from 10 to 34 weeks, covering early to late pregnancy stages. Based on the amniotic fluid karyotyping analysis results, 684 fetuses were found to be euploid, while 61 exhibited aneuploidies. Aneuploid cases included 15 with trisomy 13, 14 with trisomy 18, and 32 with trisomy 21 (Supplemental Table 3).

Evaluating NIPT-PG with simulation samples

We compared the performance of NIPT-PG and the standard z-score test on two simulated datasets (Table 2, Supplemental Tables 4 and 5). Both methods exhibited 100% sensitivity on both types of datasets, as NIPT aims to detect every aneuploid sample whenever possible. For population data, the average specificity of the standard z-score test was 98.813% (with specificity for trisomy 13 at 98.708%, trisomy 18 at 99.074%, and trisomy 21 at 98.656%). In comparison, NIPT-PG showed an average specificity of 99.260% (with specificity for trisomy 13 at 99.060%, trisomy 18 at 99.537%, and trisomy 21 at 99.182%), representing an average improvement of ~0.447%. For random mutation data, the average specificity of the standard z-score test was 99.151% (with specificity for trisomy 13 at 99.351%, trisomy 18 at 99.121%, and trisomy 21 at 98.999%). In comparison, NIPT-PG demonstrated an average specificity of 99.373% (with specificity for trisomy 13 at 99.646%, trisomy 18 at 99.297%, and trisomy 21 at 99.176%), representing a marginal average improvement of only 0.222%.
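The specificity figures for the population data follow directly from the true negative and false positive counts reported in Table 2, using specificity = TN / (TN + FP). The short sketch below redoes that arithmetic; the dictionary layout and variable names are chosen here for illustration only.

```python
def specificity(true_negatives, false_positives):
    """Fraction of non-trisomy samples that are correctly called negative."""
    return true_negatives / (true_negatives + false_positives)


# Population data from Table 2: (true negatives, false positives) per trisomy type.
standard = {"T13": (1681, 22), "T18": (1712, 16), "T21": (1688, 23)}
nipt_pg = {"T13": (1687, 16), "T18": (1720, 8), "T21": (1697, 14)}

standard_pct = {k: 100 * specificity(tn, fp) for k, (tn, fp) in standard.items()}
nipt_pg_pct = {k: 100 * specificity(tn, fp) for k, (tn, fp) in nipt_pg.items()}

# Unweighted means over the three trisomy types, as reported in the text.
standard_mean = sum(standard_pct.values()) / 3
nipt_pg_mean = sum(nipt_pg_pct.values()) / 3
improvement = nipt_pg_mean - standard_mean
```

Rounded to three decimals, the two means are 98.813 and 99.260, a difference of about 0.447 percentage points.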

Table 2. Performance comparisons between NIPT-PG and standard z-score test on simulated data

                          Standard z-score test                              NIPT-PG
Data (number of cases)    Sensitivity   Specificity           FP count (a)   Sensitivity   Specificity           FP count
Population
  T13 (156)               100%          98.708% (1681/1703)   22             100%          99.060% (1687/1703)   16
  T18 (131)               100%          99.074% (1712/1728)   16             100%          99.537% (1720/1728)   8
  T21 (148)               100%          98.656% (1688/1711)   23             100%          99.182% (1697/1711)   14
  Total (435)             100%          98.813%               61             100%          99.260%               38
Random mutation
  T13 (151)               100%          99.351% (1685/1696)   11             100%          99.646% (1692/1696)   4
  T18 (140)               100%          99.121% (1692/1707)   15             100%          99.297% (1696/1707)   11
  T21 (148)               100%          98.999% (1682/1699)   17             100%          99.176% (1689/1699)   10
  Total (439)             100%          99.151%               43             100%          99.373%               25

(a) FP, False Positive.

The results demonstrate that NIPT-PG outperforms the standard z-score test on both types of datasets, with superior performance in population data compared to random mutation data. The underlying cause of the differences lies in the distribution of common polymorphic loci in the genomes of the two datasets. In contrast to population data, the distribution of polymorphic loci in random mutation data is irregular, with some individual samples containing unique variant information that may act as noise affecting the overall genomic stability. However, it is noteworthy that NIPT-PG still outperformed the standard z-score test in random mutation data. Therefore, the novel historical testing populations based NIPT framework can to some extent reduce false positive rates, even across populations with different geographical locations and ethnic compositions.

Evaluating NIPT-PG with real-world NIPT data

We further validated NIPT-PG on 745 real NIPT samples (BGI-NIPT) from the testing center in Wuhan, China. Due to computational time constraints, we selected 61 trisomy cases and randomly sampled a subset of euploid cases from tens of thousands of real NIPT samples provided by BGI, spanning the period from 2017 to 2023. The advantage of NIPT-PG lies in its ability to detect samples near the z-score thresholds. Since the real BGI-NIPT data did not include explicitly reported false positive or false negative samples, and most trisomy samples had z-scores far from the threshold, we anticipated that the detection results of NIPT-PG on the real NIPT data would align with the testing outcomes from BGI. We calculated the average and standard deviation of the percentage of reads for each chromosome using 1000 euploid samples, and used these values as input for the z-score test. We compared the detection results of the standard z-score test and NIPT-PG on BGI-NIPT and found consistent trisomy outcomes between the two methods. The only observed difference was in the z-scores, which varied slightly (Supplemental Table 6). For testing populations from more diverse regions, such differences might be further amplified.

NIPT-PG enables earlier and more accurate detection of aneuploidies

Currently, in NIPT, a fetal fraction exceeding 4% is defined as the benchmark concentration. When the cell-free fetal DNA (cffDNA) proportion exceeds 4%, downstream comprehensive screenings can generally yield reliable results. We collected clinical data from 246 pregnant women from BGI and fitted the relationship between gestational age (weeks) and fetal concentration using locally weighted regression. The results indicate a positive correlation between fetal concentration and gestational age overall (Fig. 5A, Supplemental Table 7). As gestational age increases, fetal concentration gradually rises, leading to more accurate detection results. However, with advancing gestational age, the risk of managing fetuses with abnormalities also increases. Therefore, under the premise of ensuring accuracy, early NIPT is helpful in reducing the risk of managing fetuses with abnormalities. We further simulated, under ideal conditions, the sensitivity differences in detecting aneuploidies between NIPT-PG and the standard z-score test at different fetal concentrations (details of simulated data are provided in Supplement 5 and Supplemental Table 8). The results show that the sensitivity of both methods decreases with a decrease in fetal concentration, and the sensitivity decline trend of NIPT-PG, based on pan-genomic analysis, is slightly better than that of the standard z-score test (Fig. 5B, Supplemental Table 9). The sensitivity of the standard z-score test is 0.948 at a fetal concentration of 16%, while NIPT-PG achieves the same sensitivity at a fetal concentration of 14%. Therefore, compared with the standard z-score test, NIPT-PG can perform non-invasive prenatal testing earlier and more accurately, providing robust support for early intervention and reducing the risk of managing fetuses with abnormalities.

Figure 5.

(A) The correlation between fetal concentration and gestational weeks; (B) the sensitivity variations of NIPT-PG and the standard z-score test under different fetal concentrations.

The impact of pan-genomic graph scale on NIPT-PG

We calculated the number of false positive samples for NIPT-PG at pan-genome scales of 200, 400, 600, …, 2000 (Supplement 6). In population data, the number of false positive samples tends to decrease as the pan-genome scale grows, and stabilizes beyond a certain scale (Fig. 6A). In random mutation data, the number of false positive samples also decreases initially as the pan-genome scale grows (Fig. 6B). However, after it reaches a minimum, further increasing the pan-genome scale increases the number of false positive samples.

Figure 6.

The impact of pan-genomic graph scale on NIPT-PG; pan-genomic graphs were constructed with varying numbers of euploid samples to assess the accuracy changes of NIPT-PG on both population data (A) and random mutation data (B).

Why do such differences exist? The performance of NIPT-PG is closely tied to the number of shared polymorphic loci on the pan-genomic graph, which directly determines the number of multi-source aligned reads that NIPT-PG can identify. In population data, the number of polymorphic loci remains stable. Therefore, once the pan-genomic graph reaches a certain size, further scaling does not significantly enhance the performance of NIPT-PG. In random mutation data, by contrast, the irregular mutations added as the graph grows slightly disturb the stability of the pan-genomic graph, causing reads that were originally mapped correctly to be matched to incorrect positions. Therefore, to some extent, the performance of NIPT-PG depends on the scale of the pan-genomic graph. Additionally, NIPT-PG appears to be probably approximately correct (PAC) learnable [34]; we provide the relevant discussion (Supplemental Fig. 1, see online supplementary material for a color version of this figure) and evidence in Supplement 7.

Discussion and conclusion

In recent years, the significant reduction in the cost of high-throughput sequencing has markedly increased its utility in clinical practice [35, 36]. In this paper, we proposed a novel framework that incorporates historical data into fetal aneuploidy detection and integrates a pan-genomic graph to enhance the accuracy of NIPT. By constructing a pan-genomic graph from information on the historically tested population, NIPT-PG addresses reference biases arising from population polymorphism in the fetal aneuploidy detection process. In this study, we observed that when reads of length 35 were aligned to the pan-genome, ~5%–10% of them were multi-source aligned reads. In contrast to the “perfect” alignment positions on a linear reference genome, these multi-mapped positions on the pan-genome capture population polymorphism more realistically. It is this “perfection” of the linear reference genome that leads to the imperfections in the detection results. Furthermore, the accuracy of NIPT is highly dependent on the quality and concentration of cffDNA in maternal blood samples [37]. When the concentration of cffDNA drops below 3.5%, the required count of unique reads grows exponentially [38]. Reference bias may reduce the accuracy of NIPT on data with low cffDNA content, as its impact on results becomes more pronounced with fewer total reads in the sample. Lastly, it is crucial to emphasize that NIPT is not a diagnostic test for fetal aneuploidy. Therefore, positive NIPT results require further invasive testing for confirmation. However, invasive testing carries increased risks, so more accurate detection results are needed to keep unnecessary invasive testing to a minimum.

While NIPT-PG detects aneuploidies more accurately than the standard z-score test, the generation and alignment processes of the pan-genome in this study are a limiting factor in terms of running time. Despite the data structure optimizations we implemented, these processes still take longer than standard NIPT detection. Fortunately, the pan-genome only needs to be generated once and can then be reused for subsequent population screenings. As shown in Fig. 6, the performance of NIPT-PG gradually stabilizes as the scale of the pan-genome grows. Therefore, when dealing with large datasets, once a pan-genome reaches a certain scale, the impact of computation time becomes almost negligible. Additionally, based on the experimental results, the improvement of NIPT-PG over the standard z-score test is not particularly pronounced (specificity increases of only 0.222% and 0.447% for the two datasets, respectively). However, the accuracy of NIPT has been quite high since its inception and has hardly changed over the past decade. For countries with large newborn populations such as China, even a slight increase in specificity could have significant implications, e.g. it could save millions to billions in follow-up diagnostic costs. We used pan-genomics to address alignment bias in NIPT, which we expect will contribute to future improvements in NIPT technology. We also discuss the differences between NIPT-PG and the human reference genome, as well as other pan-genome construction tools, in Supplements 8 and 9. In the near future, we anticipate that complete and large-scale pan-genome assembly technologies will become easily accessible at low time and cost.
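To make the scale argument concrete, the arithmetic behind a small specificity gain can be sketched as below. The cohort size, prevalence, and per-case follow-up cost are hypothetical placeholders rather than figures from this study, and the reported gains (0.222% and 0.447%) are assumed here to be absolute percentage-point differences in specificity.

```python
def false_positives_avoided(n_screened, prevalence, specificity_gain):
    """Expected reduction in false positives among unaffected pregnancies.

    specificity_gain is an absolute difference, e.g. 0.00222 for 0.222 points.
    """
    unaffected = n_screened * (1.0 - prevalence)
    return unaffected * specificity_gain


def followup_savings(n_screened, prevalence, specificity_gain, cost_per_followup):
    """Follow-up diagnostic cost avoided, in the currency unit of cost_per_followup."""
    avoided = false_positives_avoided(n_screened, prevalence, specificity_gain)
    return avoided * cost_per_followup


if __name__ == "__main__":
    # Hypothetical cohort: one million screened pregnancies, 0.3% affected,
    # and a placeholder cost of 1000 currency units per confirmatory procedure.
    for gain in (0.00222, 0.00447):
        avoided = false_positives_avoided(1_000_000, 0.003, gain)
        saved = followup_savings(1_000_000, 0.003, gain, 1000.0)
        print(f"gain={gain:.5f}: {avoided:.0f} false positives avoided, {saved:.0f} saved")
```

The estimate scales linearly with the number of pregnancies screened, which is why a sub-percent change in specificity can still matter for a large population.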

In NIPT-PG, we treated each chromosome as a whole entity, allowing us to focus on the detection of aneuploidies. In principle, the impact of misalignment of reads due to population polymorphism is more pronounced on sub-chromosomal structural variations. We anticipate that in the near future, the concept of pan-genome based on extensive samples will remain applicable for detecting other chromosomal abnormalities that can lead to serious consequences, such as copy number variations, microdeletions, and microduplications.

In summary, based on pan-genomics, we propose a novel architecture for integrating historical data into fetal aneuploidy detection. NIPT-PG leverages the variation and polymorphism information from extensive historical data to correct reference biases and exhibits a learnable property. Importantly, the pan-genomic structure can be extended to address other variation detection challenges, providing a versatile solution for a range of genomic issues.

Key Points

  • We develop a novel non-invasive prenatal testing method (NIPT-PG) that learns from population data to overcome limitations in traditional NIPT caused by reference biases and population polymorphism.

  • NIPT-PG considers read mis-match rates and employs hash indexing and adjacency lists to achieve faster and more precise read alignment.

  • NIPT-PG exhibits probably approximately correct learnability and surpasses the standard z-score test in detecting chromosomal aneuploidies, demonstrating its potential for clinical applications.

Supplementary Material

Supplementary_bbae266
Supplemental_Table1_bbae266
Supplemental_Table2_bbae266
Supplemental_Table3_bbae266
Supplemental_Table4_bbae266
Supplemental_Table5_bbae266
Supplemental_Table6_bbae266
Supplemental_Table7_bbae266
Supplemental_Table8_bbae266
Supplemental_Table9_bbae266

Author Biographies

Zhengfa Xue is a PhD candidate in the School of Computer Science and Technology, Xi’an Jiaotong University. His research interests include bioinformatics management and computational biology.

Aifen Zhou is the Director of Wuhan Children’s Hospital and a Chief Physician. She has a solid theoretical foundation and rich clinical experience in prenatal care, genetic counseling, prenatal screening and diagnosis, and mother-to-child transmission prevention of Hepatitis B, Syphilis, and AIDS.

Xiaoyan Zhu is working as an associate professor in the School of Computer Science and Technology, Xi’an Jiaotong University. Her current research interests include data mining and machine learning.

Linxuan Li is a PhD student at the College of Life Sciences, University of Chinese Academy of Sciences. He is currently engaged in research on bioinformatics and human genetics data analysis at BGI Research.

Huanhuan Zhu is a research assistant at BGI Research, Shenzhen, China. Her research interests include large-scale genome data analysis and the genetic basis of complex traits and diseases.

Xin Jin is the Chief Scientist in the field of Population Genomics at BGI Research, a Doctoral Supervisor at the University of Chinese Academy of Sciences, and a Professor at the South China University of Technology. His primary research directions are genomic big data, liquid biopsy, precision medicine, and bioinformatics.

Jiayin Wang is a faculty member of the School of Computer Science and Technology, Xi’an Jiaotong University. His research interests include management issues in bioinformatics, computational biology, and cancer genomics.

Contributor Information

Zhengfa Xue, School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China; Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an 710049, China.

Aifen Zhou, Institute of Maternal and Child Health, Wuhan Children’s Hospital (Wuhan Maternal and Child Health care Hospital), Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430015, China; Department of Obstetrics, Wuhan Children’s Hospital (Wuhan Maternal and Child Health care Hospital), Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430015, China.

Xiaoyan Zhu, School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China; Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an 710049, China.

Linxuan Li, BGI Research, Shenzhen 518083, China; College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.

Huanhuan Zhu, BGI Research, Shenzhen 518083, China.

Xin Jin, BGI Research, Shenzhen 518083, China; School of Medicine, South China University of Technology, Guangzhou 510006, China.

Jiayin Wang, School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China; Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an 710049, China.

Author contributions

Z.X., J.W., and X.J. conceived and designed this research; Z.X., J.W., and X.J. designed the model; Z.X. implemented the program and performed the experiments; A.Z., L.L., and H.Z. provided and analyzed the data. Z.X. and J.W. wrote the manuscript. Z.X., J.W., and X.J. conducted the revision. All authors have read and agreed to the latest version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China, grant numbers 72293581, 72293580, 72274152.

Conflict of interest

AZ is employed by the Wuhan Children's Hospital. Authors LL, HZ, and XJ were employed by Beijing Genomics Institute (BGI) in Shenzhen. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Data availability

The code and data of NIPT-PG are available at https://github.com/Nevermore233/NIPT-PG.

References

  • 1. Brand H, Whelan CW, Duyzend M. et al. High-resolution and noninvasive fetal exome screening. N Engl J Med 2023;389:2014–6. 10.1056/NEJMc2216144.
  • 2. Chen Y, Wu Z, Sutlive J. et al. Noninvasive prenatal diagnosis targeting fetal nucleated red blood cells. J Nanobiotechnology 2022;20:546. 10.1186/s12951-022-01749-3.
  • 3. Rezaei M, Winter M, Zander-Fox D. et al. A reappraisal of circulating fetal cell noninvasive prenatal testing. Trends Biotechnol 2019;37:632–44. 10.1016/j.tibtech.2018.11.001.
  • 4. Lo YD, Corbetta N, Chamberlain PF. et al. Presence of fetal DNA in maternal plasma and serum. Lancet 1997;350:485–7. 10.1016/S0140-6736(97)02174-0.
  • 5. Chiu RW, Chan KA, Gao Y. et al. Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. Proc Natl Acad Sci 2008;105:20458–63. 10.1073/pnas.0810641105.
  • 6. Gil MM, Galeva S, Jani J. et al. Screening for trisomies by cfDNA testing of maternal blood in twin pregnancy: update of The Fetal Medicine Foundation results and meta-analysis. Ultrasound Obstet Gynecol 2019;53:734–42. 10.1002/uog.20284.
  • 7. Gil MM, Accurti V, Santacruz B. et al. Analysis of cell-free DNA in maternal blood in screening for aneuploidies: updated meta-analysis. Ultrasound Obstet Gynecol 2017;50:302–14. 10.1002/uog.17484.
  • 8. Iwarsson E, Jacobsson B, Dagerhamn J. et al. Analysis of cell-free fetal DNA in maternal blood for detection of trisomy 21, 18 and 13 in a general pregnant population and in a high risk population–a systematic review and meta-analysis. Acta Obstet Gynecol Scand 2017;96:7–18. 10.1111/aogs.13047.
  • 9. Chen L, Wang L, Hu Z. et al. Combining Z-score and maternal copy number variation analysis increases the positive rate and accuracy in non-invasive prenatal testing. Front Genet 2022;13:887176. 10.3389/fgene.2022.887176.
  • 10. Jiang F, Ren J, Chen F. et al. Noninvasive Fetal Trisomy (NIFTY) test: an advanced noninvasive prenatal diagnosis methodology for fetal autosomal and sex chromosomal aneuploidies. BMC Med Genomics 2012;5:1–1. 10.1186/1755-8794-5-57.
  • 11. Tian Y, Zhang L, Tian W. et al. Analysis of the accuracy of Z-scores of non-invasive prenatal testing for fetal Trisomies 13, 18, and 21 that employs the ion proton semiconductor sequencing platform. Mol Cytogenet 2018;11:1–7. 10.1186/s13039-018-0397-x.
  • 12. Junhui W, Ru L, Qiuxia Y. et al. Evaluation of the Z-score accuracy of noninvasive prenatal testing for fetal trisomies 13, 18 and 21 at a single center. Prenat Diagn 2021;41:690–6. 10.1002/pd.5908.
  • 13. Fan HC, Quake SR. Sensitivity of noninvasive prenatal detection of fetal aneuploidy from maternal plasma using shotgun sequencing is limited only by counting statistics. PLoS One 2010;5(5):e10439. 10.1371/journal.pone.0010439.
  • 14. Lau TK, Chen F, Pan X. et al. Noninvasive prenatal diagnosis of common fetal chromosomal aneuploidies by maternal plasma DNA sequencing. J Matern Fetal Neonatal Med 2011;25:1370–4. 10.3109/14767058.2011.635730.
  • 15. Peng XL, Jiang P. Bioinformatics approaches for fetal DNA fraction estimation in noninvasive prenatal testing. Int J Mol Sci 2017;18:453. 10.3390/ijms18020453.
  • 16. Hickey G, Monlong J, Ebler J. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat Biotechnol 2023;10:1–1.
  • 17. Garrison E, Guarracino A. Unbiased pangenome graphs. Bioinformatics 2023;39:btac743. 10.1093/bioinformatics/btac743.
  • 18. Laing C, Buchanan C, Taboada EN. et al. Pan-genome sequence analysis using Panseq: an online tool for the rapid analysis of core and accessory genomic regions. BMC Bioinformatics 2010;11:1–4. 10.1186/1471-2105-11-461.
  • 19. Rautiainen M, Mäkinen V, Marschall T. Bit-parallel sequence-to-graph alignment. Bioinformatics 2019;35:3599–607. 10.1093/bioinformatics/btz162.
  • 20. Eizenga JM, Novak AM, Sibbesen JA. et al. Pangenome graphs. Annu Rev Genomics Hum Genet 2020;21:139–62. 10.1146/annurev-genom-120219-080406.
  • 21. Li R, Li Y, Zheng H. et al. Building the sequence map of the human pan-genome. Nat Biotechnol 2010;28:57–63. 10.1038/nbt.1596.
  • 22. Sherman RM, Forman J, Antonescu V. et al. Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nat Genet 2019;51:30–5. 10.1038/s41588-018-0273-y.
  • 23. Hehir-Kwa JY, Marschall T, Kloosterman WP. et al. A high-quality human reference panel reveals the complexity and distribution of genomic structural variants. Nat Commun 2016;7:12989. 10.1038/ncomms12989.
  • 24. Audano PA, Sulovari A, Graves-Lindsay TA. et al. Characterizing the major structural variant alleles of the human genome. Cell 2019;176:663–675.e19. 10.1016/j.cell.2018.12.019.
  • 25. Church DM, Schneider VA, Steinberg KM. et al. Extending reference assembly models. Genome Biol 2015;16:13. 10.1186/s13059-015-0587-3.
  • 26. Computational pan-genomics consortium. Computational pan-genomics: status, promises and challenges. Brief Bioinform 2016;19:118–35.
  • 27. Hein J. A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given. Mol Biol Evol 1989;6:649–68.
  • 28. Paten B, Eizenga JM, Rosen YM. et al. Superbubbles, ultrabubbles, and cacti. J Comput Biol 2018;25:649–63. 10.1089/cmb.2017.0251.
  • 29. Li P, Shrivastava A, Moore J. et al. Hashing algorithms for large-scale learning. Adv Neural Inf Process Syst 2011;24:1–9.
  • 30. Tran TD, Hofrichter J, Jost J. An introduction to the mathematical structure of the Wright–Fisher model of population genetics. Theory Biosci 2013;132:73–82. 10.1007/s12064-012-0170-3.
  • 31. Hudson RR. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 2002;18:337–8. 10.1093/bioinformatics/18.2.337.
  • 32. Rambaut A, Grass NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics 1997;13:235–8. 10.1093/bioinformatics/13.3.235.
  • 33. Huang W, Li L, Myers JR. et al. ART: a next-generation sequencing read simulator. Bioinformatics 2012;28:593–4. 10.1093/bioinformatics/btr708.
  • 34. Valiant LG. A theory of the learnable. Commun ACM 1984;27:1134–42. 10.1145/1968.1972.
  • 35. Brownstein Z, Friedman LM, Shahin H. et al. Targeted genomic capture and massively parallel sequencing to identify genes for hereditary hearing loss in middle eastern families. Genome Biol 2011;12:R89. 10.1186/gb-2011-12-9-r89.
  • 36. Bell CJ, Dinwiddie DL, Miller NA. et al. Carrier testing for severe childhood recessive diseases by next-generation sequencing. Sci Transl Med 2011;3:65ra64. 10.1126/scitranslmed.3001756.
  • 37. Hartwig TS, Ambye L, Sørensen S. et al. Discordant non-invasive prenatal testing (NIPT) - a systematic review. Prenat Diagn 2017;37:527–39. 10.1002/pd.5049.
  • 38. Chen EZ, Chiu RW, Sun H. et al. Noninvasive prenatal diagnosis of fetal trisomy 18 and trisomy 13 by maternal plasma DNA sequencing. PLoS One 2011;6:e21791. 10.1371/journal.pone.0021791.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
