Skip to main content
Scientific Reports logoLink to Scientific Reports
. 2020 Nov 5;10:19142. doi: 10.1038/s41598-020-76245-5

Genetic profiling of Vietnamese population from large-scale genomic analysis of non-invasive prenatal testing data

Ngoc Hieu Tran 1,7, Thanh Binh Vo 1,2, Van Thong Nguyen 3, Nhat-Thang Tran 4, Thu-Huong Nhat Trinh 5, Hong-Anh Thi Pham 1,2, Thi Hong Thuy Dao 1,2, Ngoc Mai Nguyen 1,2, Yen-Linh Thi Van 1,2, Vu Uyen Tran 1,2, Hoang Giang Vu 1,2, Quynh-Tram Nguyen Bui 1,2, Phuong-Anh Ngoc Vo 1,2, Huu Nguyen Nguyen 1,2, Quynh-Tho Thi Nguyen 2, Thanh-Thuy Thi Do 2, Nien Vinh Lam 6, Phuong Cao Thi Ngoc 2,8, Dinh Kiet Truong 2, Hoai-Nghia Nguyen 6,, Hoa Giang 1,2,, Minh-Duy Phan 1,9,
PMCID: PMC7644705  PMID: 33154511

Abstract

The under-representation of several ethnic groups in existing genetic databases and studies have undermined our understanding of the genetic variations and associated traits or diseases in many populations. Cost and technology limitations remain the challenges in performing large-scale genome sequencing projects in many developing countries, including Vietnam. As one of the most rapidly adopted genetic tests, non-invasive prenatal testing (NIPT) data offers an alternative untapped resource for genetic studies. Here we performed a large-scale genomic analysis of 2683 pregnant Vietnamese women using their NIPT data and identified a comprehensive set of 8,054,515 single-nucleotide polymorphisms, among which 8.2% were new to the Vietnamese population. Our study also revealed 24,487 disease-associated genetic variants and their allele frequency distribution, especially 5 pathogenic variants for prevalent genetic disorders in Vietnam. We also observed major discrepancies in the allele frequency distribution of disease-associated genetic variants between the Vietnamese and other populations, thus highlighting a need for genome-wide association studies dedicated to the Vietnamese population. The resulted database of Vietnamese genetic variants, their allele frequency distribution, and their associated diseases presents a valuable resource for future genetic studies.

Subject terms: Computational biology and bioinformatics, Genetics, Molecular biology

Introduction

Following the successful initiative of the 1000 genomes project1, several large-scale genome and exome sequencing projects have been conducted, either as international collaboration efforts such as ExAC2, gnomAD3, or for a specific country or population48. Those projects have provided comprehensive profiles of human genetic variation in some populations, paving the way for unprecedented advance in treatment of common genetic diseases. However, the lack of diversity and the under-representation of several populations in genome sequencing projects and genome-wide association studies (GWAS) have increasingly become a critical problem9,10. For instance, Gurdasani et al. found that the representation of ethnic groups in GWAS was significantly biased, with nearly 78% of the participants having European ancestries, whereas the two major populations, Asian and African, only accounted for 11% and 2.4%, respectively10. Vietnam has a population of 96.5 million, the 15th highest in the world and the 9th highest in Asia. Yet there was merely one dataset of 99 Vietnamese individuals that had been studied as part of the 1000 genomes project (population code KHV, the Kinh ethnic group in Ho Chi Minh City, Vietnam). A recent study has sequenced genomes and exomes of another 305 individuals to further expand the Vietnamese genetic database11. However, costs and technologies to perform large-scale genome sequencing projects still remain a challenge for most developing countries, including Vietnam.

An alternative approach has been proposed recently to re-use the low-coverage genome sequencing data from non-invasive prenatal testing (NIPT) for large-scale population genetics studies12,13. NIPT is a method that sequences cell-free DNA from maternal plasma at an ultra-low depth of 0.1–0.2 × to detect fetal aneuploidy14. By combining a sufficiently large number of NIPT samples, one could obtain a good representation of the population genetic variation. The benefits of re-using NIPT data for population genetics are manifold. First, the data can be re-used at no extra cost given the approval and consent of the participants. As one of the most rapidly adopted genetic tests, NIPT has been successfully established and become a standard screening procedure with thousands to millions of tests performed world-wide, including many developing countries such as Vietnam14. Using NIPT data for population genetic studies may also reduce privacy concerns since the genetic variants can only be analyzed by aggregating a large number of samples and the results can only be interpreted at the population level. A single sample tells little about the genetic information of an individual due to low sequencing depth. Last but not least, previous studies have suggested that sequencing a large number of individuals at a low depth might provide more accurate inferences of the population genetic structure than the traditional approach of sequencing a limited number of individuals at a higher depth, especially when the budget is limited15,16.

In this paper, we presented the first study of Vietnamese genetic variations from non-invasive prenatal testing data, and to the best of our knowledge, the third of such kinds in the world12,13. We analyzed the NIPT data of 2683 pregnant Vietnamese women to identify genetic variants and their allele frequency distribution in the Vietnamese population. We also studied the relationships between the Vietnamese genetic profile and common genetic disorders, and discovered pathogenic variants related to prevalent diseases in Vietnam. Finally, we highlighted the differences in the distribution of disease-associated genetic variants between the Vietnamese and other populations, thus highlighting a need for genome-wide association studies dedicated to the Vietnamese population. The resulted database of Vietnamese genetic variants from NIPT data is made available to facilitate future research studies in population genetics and associated traits or diseases.

Results

Data collection

A total of 2683 pregnant Vietnamese women who performed non-invasive prenatal testing during the period from 2018 to 2019 at the Medical Genetics Institute, Vietnam, were recruited to the study. The participants have approved and given written informed consent to the anonymous re-use of their genomic data for the study. All information of the participants is confidential and not available to the authors, except the records that their NIPT and pregnancy results are normal. The study was approved by the institutional ethics committee of the University of Medicine and Pharmacy, Ho Chi Minh city, Vietnam. The whole genome of each participant was sequenced to an average of 3.6 million paired-end reads of 2 × 75 bp, which corresponds to a sequencing depth of 0.17 × per sample.

Genome coverage and sequencing depth of the NIPT dataset

Data pre-processing was first performed on each of 2683 samples and the results were stored in binary alignment map (BAM) format, one BAM file per sample. The data pre-processing steps include: quality control of raw data using FastQC17, trimmomatic18; alignment of paired-end reads to the human reference genome (build GRCh38) using bwa19, samtools20, MarkDuplicates21; and summary of alignment results using Qualimap22, bedtools23, IGV24. The quality of raw data and alignment results are presented in Supplementary Figs. S1 and S2 for 1,000 samples randomly selected from our dataset. Raw data showed high sequencing quality, no bias was observed. The mapping quality and insert size distributions followed closely what expected across the reference genome. The overall sequencing error rate was estimated to be about 0.3% by Qualimap. More details of the data pre-processing steps can be found in the “Methods” section.

The average genome coverage and depth were 14.59% and 0.17 × per sample, respectively, and aggregated to 95.09% and 462 × across 2683 samples (Fig. 1a). Although the sequencing depth per sample was low, there might be more than one read from the same sample overlapping at a genome position. This problem may affect the estimation of allele frequency because the estimation is based on the assumption that a sample may contribute only 0 or 1 allele (read) at any given genome position12,13. For instance, we found that the average percentage of genome positions with depth 2 × (i.e. covered by two overlapping reads) in a sample was 1.75% (Fig. 1a). These overlapping reads occurred randomly across the reference genome and the samples. At any genome position, there were on average 47 out of 2683 samples that each contributed two reads (Supplementary Fig. S3). To address this problem, we followed a filtering strategy from previous studies12,13 to keep only one read if there were overlapping reads in a sample. Thus, for every genome position, each sample could only contribute up to one read, and when the samples were aggregated, all reads at any position were obtained from different samples. In addition, we also removed alignments with low mapping quality scores (MAPQ < 30).

Figure 1.

Figure 1

Distributions of genome coverage and sequencing depth of the NIPT dataset. (a) Average genome coverage and sequencing depth per sample and from all samples combined. (b) Summary histogram of sequencing depth over all genome positions. (c) Distribution of sequencing depth per chromosome. (d) IGV tracks of sequencing depth, bwa MAPQ score, and Umap k50 mappability across the whole genome (the figure was produced using IGV, Integrative Genomics Viewer, version 2.8.924).

After the filtering step, the sequencing depth was reduced from 0.17 × to 0.12 × per sample. The aggregated sequencing depth of 2683 samples was 364 × and the genome coverage was 91.56% (Fig. 1a). The distributions of genome coverage and sequencing depth are presented in Fig. 1b–d. The sequencing depth was approximately uniform across the reference genome, except for low-mappability regions and chromosome Y. The distributions of sequencing depth and MAPQ score also closely followed the mappability of the human reference genome obtained from Umap25. The average sequencing depth of chromosome Y was about 2.1% that of the whole genome, consistent with the proportion of fetal DNA in NIPT samples (8–10%)14.

Variant calling and validation

We aggregated 2683 samples into one and used Mutect2 from GATK2628 for variant calling and allele frequency estimation. In addition to its main function of somatic calling, Mutect2 can also be used on data that represents a pool of individuals, such as our NIPT dataset, to call multiple variants at a genome site12,13,27. The called variants were further checked against strand bias, weak evidence, or contamination using FilterMutectCalls. The allele frequencies were estimated based on the numbers of reads aligned to the reference and the alternate alleles. For validation, we compared our NIPT call set to the KHV (Kinh in Ho Chi Minh City, Vietnam) and EAS (East Asian) populations from the 1000 genomes project1, as well as the dbSNP database (version 15129).

We identified a total of 8,054,515 SNPs from the NIPT dataset. The transition to transversion ratio was 2.0 over the whole genome and 2.8 over protein coding regions, which was similar to the observed ratios from previous genome or exome sequencing projects. As expected, a majority of these SNPs, 7,390,020 or 91.8%, had been reported earlier in the KHV call set (Fig. 2a). Since the KHV population only had 99 individuals, we further looked into its common SNPs that were shared by at least two individuals. We found that the NIPT call set recovered 90.5% of the KHV common SNPs (Supplementary Fig. S4; 6,889,016/7,609,526 = 90.5%). This sensitivity is in line with the genome coverage reported earlier in Fig. 1a.

Figure 2.

Figure 2

Summary of the NIPT call set. (a) Venn diagram comparison between the NIPT call set, the KHV and EAS call sets from the 1000 genomes project, and the dbSNP database. The percentages were calculated with respect to the NIPT call set. (b) Allele frequency distribution of the NIPT call set. (c) Distribution of locations and effects of variants in the NIPT call set.

Our NIPT call set included 664,495 (8.2%) SNPs that had not been reported in the KHV call set. Among them, 67,153 (0.8%) were found in the EAS call set, another 517,020 (6.4%) were found in the dbSNP database, and the remaining 80,322 (1.0%) were novel SNPs (Fig. 2a). Majority of those SNPs had allele frequencies less than 10%. The overall allele frequency distribution of our NIPT call set is presented in Fig. 2b.

We used VEP (Variant Effect Predictor30) to analyze the effects of 8,054,515 variants in the NIPT call set (Fig. 2c). About 1.5% of the SNPs were located in the coding and UTR regions, 12% in the upstream and downstream regions, and 4% in the TF regulatory regions. More than 80% of the SNPs were located in the intron and intergenic regions. We also noted that the new SNPs and those in KHV had similar proportions of coding, UTR, upstream and downstream, and TF regulatory regions (Fig. 2c).

An important advantage of NIPT data is the ability of sampling a large number of individuals to better represent a population and to accurately estimate the allele frequency. We found a strong Pearson correlation of 98.8% between the allele frequency of the NIPT call set and that of the KHV call set (Fig. 3a). Furthermore, thanks to its larger sample size, the NIPT allele frequency indeed showed better resolution than the KHV one, as evidenced by vertical trails in Fig. 3a or a zoomed-in view in Supplementary Fig. S5. We also performed principal component analysis (PCA) on our NIPT data together with the data of East Asia populations from the 1000 Genome Project by using PLINK (version 1.931). The PCA results in Fig. 3b show that the NIPT group is closely clustered with the KHV as both represent the Vietnamese population. The distribution of the populations in the PCA plot is also consistent with their geographic locations.

Figure 3.

Figure 3

Principal component analysis of the NIPT call set and other East Asia populations. (a) Scatter plot comparison of allele frequency estimated from the NIPT and the KHV call sets. (b) Principal component analysis. (JPT Japanese in Tokyo, Japan, CHB Han Chinese in Beijing, China, CHS Southern Han Chinese, CDX Chinese Dai in Xishuangbanna, China, KHV Kinh in Ho Chi Minh City, Vietnam, NIPT Non-Invasive Prenatal Testing data).

There are a few possible reasons as why the NIPT and KHV groups did not entirely overlap in the PCA plot. Firstly, they were obtained from two different types of data, ultra-low sequencing data of cell-free DNA from maternal plasma and whole-genome sequencing data. Secondly, in our variant calling approach for NIPT data, we aggregated all samples into one BAM file and then performed variant calling on that BAM file. Thus, we did not obtain individual genotypes, and instead, the allele frequencies were estimated based on the numbers of reads. Since PCA tools like PLINK require individual genotypes (e.g., 100 individuals for each population in the 1000 Genome Project), we used the NIPT allele frequency distribution to simulate 100 sets of genotypes. That could be another source of difference between the NIPT and the KHV groups in the PCA plot, as the former consisted of simulated genotypes from allele frequency distribution and the latter consisted of genotypes from real individuals.

Analysis of pathogenic SNPs and their allele frequencies in Vietnamese population

We searched the NIPT call set against the ClinVar database (version 2019110532) to explore the associations between Vietnamese genomic variants and common genetic diseases. We identified 24,487 SNPs with ClinVar annotations that have been reviewed by at least one research group (Table 1). Among them, five SNPs were classified as “Pathogenic” or “Likely pathogenic”, 117 SNPs were found to affect a “drug response”, and a majority of the remaining were classified as “Benign” or “Likely benign”. We also noted that 391 ClinVar-annotated SNPs (1.6%), including 1 pathogenic SNP, had not been reported in the KHV call set.

Table 1.

Summary of ClinVar annotations for the NIPT call set.

ClinVar clinical significance In KHV Not in KHV
Benign or likely benign 23,414 300
Uncertain significance 353 74
Pathogenic or likely pathogenic 4 1
Drug response 114 3
Others 211 13
Total number of annotations 24,096 391

Table 2 and Supplementary Table S1 present the details of five pathogenic SNPs identified in our NIPT call set. Their associated genetic diseases include: erythropoietic protoporphyria, non-syndromic genetic deafness, Joubert syndrome, hemochromatosis type 1, and 5-alpha reductase deficiency. The SNP rs9332964 C > T in SRD5A2, which is associated with 5-alpha reductase deficiency, had not been reported in the KHV call set. 5-alpha reductase deficiency is an autosomal recessive disorder that affects male sexual development. This SNP is rare in the world and East Asia populations, but was found to be more common in the Vietnamese population (allele frequency of 0.05%, 0.67%, and 2.90%, respectively). This SNP had also been reported in a recent study11 at a very low allele frequency of 1.36%.

Table 2.

Pathogenic variants identified from the NIPT call set.

Variant information ClinVar annotations Allele frequency
chr Position dbSNP Ref Alt ID Gene Conditions NIPT (%) gnomAD EAS (%) gnomAD (%)
chr18 57,571,588 rs2272783 A G 562 FECH Erythropoietic protoporphyria 28.10 32.57 11.23
chr13 20,189,473 rs72474224 C T 17023 GJB2 Nonsyndromic hearing loss and deafness 13.40 8.35 0.76
chr13 72,835,359 rs17089782 G A 217689 PIBF1 Joubert syndrome 6.80 5.13 1.36
chr6 26,090,951 rs1799945 C G 10 HFE Hemochromatosis type 1 5.10 3.41 10.82
chr2 31,529,325 rs9332964 C T 3351 SRD5A2 5-alpha reductase deficiency 2.90 0.67 0.05

We noticed that the allele frequencies of the five pathogenic SNPs varied considerably between the Vietnamese, the East Asia, and the world populations (Table 2). For instance, the SNP rs72474224 C > T in GJB2 is commonly liked to non-syndromic hearing loss and deafness, which is also the most prevalent genetic disorder in the Vietnamese population. We found that its allele frequency in the Vietnamese population was 60% higher than in the East Asia population, which in turn was an order of magnitude higher than in the world population (13.40%, 8.35%, and 0.76%, respectively). The allele frequency was consistent with the estimated carrier frequency of 1 in 5 in the Vietnamese population. Similarly, the allele frequency of rs2272783 A > G in FECH, which is associated with erythropoietic protoporphyria, was nearly three times higher in the Vietnamese and East Asia populations than in the world population (28.10%, 32.57%, and 11.23%, respectively). The prevalence of this pair of SNP and disease in East and Southeast Asia has been reported previously in33. On the other hand, the allele frequency of rs1799945 C > G in HFE, which is associated with hemochromatosis type 1, was about two and three times lower in the Vietnamese and East Asia populations than in the world population (5.10%, 3.41%, and 10.82%, respectively). Such discrepancies were also observed for “Benign” variants, e.g., those related to autosomal recessive non-syndromic hearing loss (Supplementary Table S2). The variations strongly suggest that population-specific genome-wide association studies are required to provide a more accurate understanding of the clinical significance of genetic variants and the true disease prevalence in the Vietnamese population.

Discussion

In this study, we analyzed the genomes of 2683 pregnant Vietnamese women from their non-invasive prenatal testing data. The genomes were originally sequenced at a low depth of approximately 0.17 × per sample for the purpose of fetal aneuploidy testing14. Here we combined the 2683 samples to a total sequencing depth of 364 × and performed variant calling and analysis for the Vietnamese population. We identified a comprehensive set of 8,054,515 SNPs at a high level of sensitivity and accuracy: 90.5% of Vietnamese common SNPs were recovered; 99% of identified SNPs were confirmed in existing databases; and a strong correlation of 98.8% to known allele frequencies. The results were exciting given that the total sequencing depth of our dataset, 364x, was merely equivalent to sequencing 20 individuals at a moderate depth of 20 ×. It also suggests that there is still plenty of room for improvement by increasing the number of NIPT samples. For instance, Liu et al. have demonstrated a large-scale population genetic analysis based on hundreds of thousands of NIPT samples for the Chinese population12.

Another benefit of using NIPT data is cost-effective. In our study, the dataset was re-used at no cost with written informed consent from the participants. The whole analysis pipeline was done within a week on a local computer with 32 CPUs, 128 GB memory, and 5 TB hard disk. Thus, the overall cost was negligible compared to that of a typical genome sequencing project. The cost advantage of this approach may play a major role in large-scale genome sequencing projects, especially in developing countries where technologies and resources are still limited.

However, conducting population genetic studies using NIPT data is not without challenges, the most important of which is how to properly design and collect sufficiently large NIPT data across a population. This requires a coordinated, nation-wide effort, and more importantly, the informed consents of thousands of participants for their data to be used for such studies to improve public healthcare in the population. To facilitate this nation-wide data collection, a legal framework, including privacy policy, code of ethics, and standards of practice, is needed to protect confidential information and data of the participants.

Our study revealed 24,487 disease-associated genetic variants, especially five pathogenic variants for prevalent genetic disorders in the Vietnamese population. We also found major discrepancies in the allele frequency distribution of genetic variants between the Vietnamese, the East Asia, and the world populations. Thus, a comprehensive genetic profile and genome-wide association studies dedicated to the Vietnamese population are highly desired. Knowing the distribution of genetic disorders in the population will be useful for public health policy and planning, preventive medicine, early genetic screening strategies, etc.

Some technical and design limitations in our study could be addressed in future research to improve the application of NIPT data in population genetics studies. First, currently there is no variant calling tool that is designed specifically for NIPT data. Here we used Mutect2 and previous studies also used similar somatic calling tools with the purpose of identifying all possible variants at a genome site12,13. Thus, we took a conservative approach to consider only SNPs but not indels to ensure a reliable call set. The novel variants reported in Fig. 2a may include false positives, especially those with high allele frequency (> 5%) as it is not likely that they would have been missed in previous studies1. The transition to transversion ratio is 2.09 for known variants and 1.56 for new variants. We were not able to experimentally validate those novel variants and they should be interpreted with caution. Another limitation was the exclusion of chromosome Y due to its low coverage as a result of limited amount of fetal DNA in NIPT samples. This problem could be addressed by increasing the number of samples to obtain enough sequencing coverage for reliable variant calling. NIPT data is also biased by sex, with only ~ 5% of the data coming from male population (assuming a 10% cell-free fetal DNA fraction with 50% male fetuses).

Conclusions

We showed that non-invasive prenatal testing data could be reliably used to reconstruct the genetic profile of the Vietnamese population. Our study identified pathogenic variants for prevalent genetic diseases in the Vietnamese population and called for a need for population-specific genome-wide association studies. The resulted database presents a valuable resource for future studies of genetic variations and associated traits or diseases, not only for the Vietnamese population but also for other Southeast-Asia and Asia populations. Our results also demonstrated that non-invasive prenatal testing data provides a valuable and cost-effective resource for large-scale population genetic studies.

Methods

Sample preparation

Cell-free DNA (cfDNA) in maternal plasma was extracted using MagMAX Cell-Free DNA Isolation Kit from Thermo Fisher Scientific (Waltham, MA, USA). Library preparation was done using NEBNext Ultra II DNA Library Prep Kit from New England BioLabs (Ipswich, MA, USA). The samples were sequenced on the NextSeq 550 platform using paired-end 2 × 75 bp Reagent Kit from Illumina (San Diego, CA, USA).

Bioinformatics analysis pipeline

Quality check of raw sequencing data was performed using FastQC (version 0.11.817). Paired-end reads were trimmed to 75 bp, adapters (“TruSeq3-PE-2.fa”) and low-quality bases were removed using trimmomatic (version 0.3918). We only kept pairs with both reads surviving the trimming (about 95.7% of the dataset). The reads were then aligned to the human reference genome, build GRCh38 (hg38), using bwa mem (version 0.7.17-r118819). Supplementary hits were marked as secondary for Picard compatibility. Alignment results were sorted and indexed using samtools (version 1.920). Potential PCR duplicates were marked using MarkDuplicates from GATK (version 4.1.1.026).

Alignments with mapping quality scores less than 30 were discarded. In-house Python scripts were developed to mark overlapping alignments and to keep only one of them. Qualimap (version 2.2.122), bedtools (version 2.25.023), and IGV (version 2.4.1924) were used to summarize the alignment results and to calculate genome coverage and sequencing depth.

Mutect2 from GATK (version 4.1.1.02628) was used in tumor-only mode for variant calling. All samples were assigned the same sample name to combine them before variant calling. FilterMutectCalls was used to exclude variants with weak evidence, strand bias, or contamination. bcftools (version 1.934) was used to filter, summarize, and compare VCF (Variant Call Format) files. VEP (version 9830) was used to predict the effects of variants and to annotate them against dbSNP (version 15129) and ClinVar (version 2019110532) databases. Principal component analysis was performed using PLINK (version 1.931).

The whole analysis pipeline and parameter settings can be found in the attached Python scripts.

Ethics approval and consent to participate

The study was approved by the institutional ethics committee of the University of Medicine and Pharmacy, Ho Chi Minh city, Vietnam. The study has followed the guidelines set by the University of Medicine and Pharmacy, Ho Chi Minh city, Vietnam, in handling human genetic data of the participants. The participants who performed NIPT triSure at Medical Genetics Institute, Vietnam, have approved and given written informed consent to the anonymous re-use of their genomic data for this study.

Consent for publication

All authors have read and approved the manuscript for publication.

Supplementary information

Acknowledgements

The authors thank Dr. Kwok Pui Choi for critical reading of the manuscript.

Author contributions

T.B.V., H.A.T.P, T.H.T.D., N.M.N., Y.L.T.V., V.U.T., H.G.V., Q.T.N.B. and P.A.N.V. performed experiments. V.T.N., N.T.T., T.H.N.T., T.T.T.D. and N.V.L. recruited patients and performed clinical analysis. H.N.N., Q.T.T.N., P.C.T.N. and D.K.T. designed experiments and analyzed data. H.N.N. supervised the project. N.H.T., H.G. and M.D.P. analyzed the data and wrote the manuscript, designed experiments and analyzed sequencing data.

Funding

This study was funded by Gene Solutions, Vietnam. The funder did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data availability

The database of genetic variants identified from NIPT data and the Python scripts for bioinformatics analysis pipeline is available on GitHub: https://github.com/nh2tran/NIPT_WGS.

Competing interests

NHT, TBV, HATP, THTD, NMN, YLTV, VUT, HGV, QTNB, PANV, HNN, HG and MDP are current employees of Gene Solutions, Vietnam. The other authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Hoai-Nghia Nguyen, Email: nhnghia81@gmail.com.

Hoa Giang, Email: gianghoa@gmail.com.

Minh-Duy Phan, Email: m.phan1@uq.edu.au.

Supplementary information

is available for this paper at 10.1038/s41598-020-76245-5.

References

  • 1.The 1000 Genomes Project Consortium A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Lek M, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.The UK10K Consortium The UK10K project identifies rare variants in health and disease. Nature. 2015;526:82–90. doi: 10.1038/nature14962. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Gudbjartsson DF, et al. Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 2015;47:435–444. doi: 10.1038/ng.3247. [DOI] [PubMed] [Google Scholar]
  • 6.Maretty L, et al. Sequencing and de novo assembly of 150 genomes from Denmark as a population reference. Nature. 2017;548:87–91. doi: 10.1038/nature23264. [DOI] [PubMed] [Google Scholar]
  • 7.The Genome of the Netherlands Consortium Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 2014;46:814–825. doi: 10.1038/ng.3021. [DOI] [PubMed] [Google Scholar]
  • 8.Wu D, et al. Large-scale whole-genome sequencing of three diverse Asian populations in Singapore. Cell. 2019;179:736–749. doi: 10.1016/j.cell.2019.09.019. [DOI] [PubMed] [Google Scholar]
  • 9.Editorial Diversity matters. Nat. Rev. Genet. 2019;20:495. doi: 10.1038/s41576-019-0162-y. [DOI] [PubMed] [Google Scholar]
  • 10.Gurdasani D, Barroso I, Zeggini E, Sandhu MS. Genomics of disease risk in globally diverse populations. Nat. Rev. Genet. 2019;20:520–535. doi: 10.1038/s41576-019-0144-0. [DOI] [PubMed] [Google Scholar]
  • 11.Le VS, et al. A Vietnamese human genetic variation database. Hum. Mutat. 2019;40:1664–1675. doi: 10.1002/humu.23835. [DOI] [PubMed] [Google Scholar]
  • 12.Liu S, et al. Genomic analyses from non-invasive prenatal testing reveal genetic associations, patterns of viral infections, and Chinese population history. Cell. 2018;175:347–359. doi: 10.1016/j.cell.2018.08.016. [DOI] [PubMed] [Google Scholar]
  • 13.Budis J, et al. Non-invasive prenatal testing as a valuable source of population specific allelic frequencies. J. Biotechnol. 2019;299:72–78. doi: 10.1016/j.jbiotec.2019.04.026. [DOI] [PubMed] [Google Scholar]
  • 14.Phan MD, et al. Establishing and validating noninvasive prenatal testing procedure for fetal aneuploidies in Vietnam. J. Matern. Fetal Neonatal Med. 2019;32:4009–4015. doi: 10.1080/14767058.2018.1481032. [DOI] [PubMed] [Google Scholar]
  • 15.Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: Implications for design of complex trait association studies. Genome Res. 2011;21:940–951. doi: 10.1101/gr.117259.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Fumagalli M. Assessing the effect of sequencing depth and sample size in population genetics inferences. PLoS ONE. 2013;8:e79667. doi: 10.1371/journal.pone.0079667. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.FastQC. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 24 Apr 2020.
  • 18.Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. https://arXiv.org/1303.3997v2 [q-bio.GN].
  • 20.Li H, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Picard. https://broadinstitute.github.io/picard/. Accessed 24 Apr 2020.
  • 22.Okonechnikov K, Conesa A, Garcia-Alcalde F. Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics. 2016;32:292–294. doi: 10.1093/bioinformatics/btv566. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Robinson JT, et al. Integrative genomics viewer. Nat. Biotechnol. 2011;29:24–26. doi: 10.1038/nbt.1754. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Karimzadeh M, Ernst C, Kundaje A, Hoffman MM. Umap and Bismap: Quantifying genome and methylome mappability. Nucleic Acids Res. 2018;46:e120. doi: 10.1093/nar/gkx951. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Van der Auwera GA, et al. From FastQ data to high confidence variant calls: The genome analysis toolkit best practices pipeline. Curr. Protoc. Bioinform. 2013;43:1–33. doi: 10.1002/0471250953.bi1110s43. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Cibulskis K, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 2013;31:213–219. doi: 10.1038/nbt.2514. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.DePristo M, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–498. doi: 10.1038/ng.806. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Sherry ST, et al. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 2001;29:308–311. doi: 10.1093/nar/29.1.308. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.McLaren W. The Ensembl variant effect predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Chang CC, et al. Second-generation PLINK: Rising to the challenge of larger and richer datasets. GigaScience. 2015 doi: 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Landrum MJ, et al. ClinVar: Public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42:D980–D985. doi: 10.1093/nar/gkt1113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Gouya L, et al. Contribution of a common single-nucleotide polymorphism to the genetic predisposition for erythropoietic protoporphyria. Am. J. Hum. Genet. 2006;78:2–14. doi: 10.1086/498620. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27:2987–2993. doi: 10.1093/bioinformatics/btr509. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Data Availability Statement

The database of genetic variants identified from NIPT data and the Python scripts for bioinformatics analysis pipeline is available on GitHub: https://github.com/nh2tran/NIPT_WGS.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group

RESOURCES