A comprehensive whole genome database of ethnic minority populations

Yan He; Changgui Lei; Chanjuan Wan; Shuang Zeng; Ting Zhang; Fei Luo; Ruichao Li; Xiaokun Li; Anshu Zhao; Defu Xiao; Yunyan Luo; Keren Shan; Xiaolan Qi; Xin Jin

doi:10.1038/s41598-024-63892-1

. 2024 Jun 17;14:13954. doi: 10.1038/s41598-024-63892-1

A comprehensive whole genome database of ethnic minority populations

Yan He ^1,^#, Changgui Lei ^2,^3,^4,^#, Chanjuan Wan ¹, Shuang Zeng ^1,^2,^3,⁴, Ting Zhang ¹, Fei Luo ^2,^3,⁴, Ruichao Li ¹, Xiaokun Li ^2,³, Anshu Zhao ¹, Defu Xiao ^2,³, Yunyan Luo ^1,³, Keren Shan ¹, Xiaolan Qi ^1,^✉, Xin Jin ^2,^3,^5,^6,^✉

PMCID: PMC11183174 PMID: 38886537

Abstract

China, is characterized by its remarkable ethnical diversity, which necessitates whole genome variation data from multiple populations as crucial tools for advancing population genetics and precision medical research. However, there has been a scarcity of research concentrating on the whole genome of ethnic minority groups. To fill this gap, we developed the Guizhou Multi-ethnic Genome Database (GMGD). It comprises whole genome sequencing data from 476 healthy unrelated individuals spanning 11 ethnic minorities groups in Guizhou Province, Southwest China, including Bouyei, Dong, Miao, Yi, Bai, Gelo, Zhuang, Tujia, Yao, Hui, and Sui. The GMGD database comprises more than 16.33 million variants in GRCh38 and 16.20 million variants in GRCh37. Among these, approximately 11.9% (1,956,322) of the variants in GRCh38 and 18.5% (3,009,431) of the variants in GRCh37 are entirely new and do not exist in the dbSNP database. These novel variants shed light on the genetic diversity landscape across these populations, providing valuable insights with an average coverage of 5.5 ×. This makes GMGD the largest genome-wide database encompassing the most diverse ethnic groups to date. The GMGD interactive interface facilitates researchers with multi-dimensional mutation search methods and displays population frequency differences among global populations. Furthermore, GMGD is equipped with a genotype-imputation function, enabling enhanced capabilities for low-depth genomic research or targeted region capture studies. GMGD offers unique insights into the genomic variation landscape of different ethnic groups, which are freely accessible at https://db.cngb.org/pop/gmgd/.

Keywords: GMGD database, Human whole genome sequencing, Guizhou Province in southwest China, Ethnic minority populations

Subject terms: Computational biology and bioinformatics, Genetics

Introduction

The advances in DNA sequencing technology and decreasing costs have allowed for unprecedented exploration of human genetic diversity. The analysis of genome sequencing data from diverse human populations can identify novel human variants, establish genotype frequency baselines, and elucidate population genetic structures and evolution. These investigations can provide precise and comprehensive insights into the genetic landscape of various ethnicities, laying a crucial foundation for our understanding of population differences, admixture, and guiding the development of new drugs^1,2. Despite the potential of these advances, the representation of certain populations and ethnic groups, particularly those in Southwest China, remains minimal in large-scale genetic databases.

Southwest China, housing 31 distinct ethnic minorities³, encompassing the provinces of Sichuan, Chongqing, Guizhou, Yunnan, and Tibet, it is characterized by its rich genetic diversity and complex genetic history. Due to the region’s unique geographical location and historical factors, these groups have been able to preserve their distinct and representative genetic characteristics⁴. Previous genotyping studies on Y chromosome short tandem repeat loci (STR)⁵, autosomal insertion/deletion polymorphisms (InDels)⁶, and whole genome single nucleotide polymorphisms (SNPs) in ethnic minority populations⁷ have underscored the significant research value of genetic information of these groups.

However, in spite of this, the representation of Southwest China’s population in large-scale gene databases remains minimal, with even scarcer information on specific ethnic groups^8–10. For instance, the recently published large population cohort, like NyuWa genomes and ChinaMap, includes less than 10% of the Southwest population without ethnic information. Moreover, the Human Genome Diversity Project (HGDP) database, as important reference data in population genetics research, provides a limited representation with only 10 samples from three ethnic groups: Miao, Yi, and Tujia.

These gaps underline a significant lack of representation and lack of ethnic-specific data, which hinders our full understanding of our species' genetic diversity. Therefore, there is an urgent need to establish a comprehensive genetic resource encompassing these underrepresented minority populations. By doing so, we not only facilitate data exchange and resource sharing but also advance frontier exploration in research and foster industrial transformations.

In this study, we performed whole genome sequencing (WGS) of 476 male individuals representing 11 ethnic minorities from Southwest China. Utilizing the genome analysis data, we constructed the Guizhou Multi-ethnic Genome Database (GMGD), aiming to fill the current gap in genetic data representation. This database offers a search method for the genomic region or point mutations and displays the frequency differences of mutations across various populations.

To further enhance the utility of GMGD, we employed Principal Component Analysis (PCA) and admixture analyses to illustrate population structure among ethnic groups. Furthermore, we provided a genotype-imputation function, which could serve as a valuable tool for other genomic studies. Thus, GMGD, with these features, constitutes a significant step toward understanding of the evolutionary and medical implications of different ethnic groups in Southwest China.

Materials and methods

Sample collection

GMGD database was constructed using the results of whole genome data derived from 476 unrelated, healthy male individuals belonging to various ethnic minorities from Guizhou Province, southwest China. From 1999 to 2002, the project team successively visited 10 counties and 24 townships in Qiandongnan Miao and Dong Autonomous Prefecture, Qiannan Buyi and Miao Autonomous Prefecture, Tongren City and Bijie City of Guizhou Province, conducted investigation and research on the local ethnic minorities, and collected thousands of DNA samples based on informed consent. A DNA sample bank of Guizhou ethnic minorities was established. In this study, suitable samples were selected from the sample bank. In our selection criteria, all individuals who lived in Guizhou, were ethnically homogeneous within three generations, and did not marry outside the population, had no blood with each other The relationship. The participants represented 11 ethnic minorities from three language branches: Hmong–Mien languages (Miao, n = 94; Yao, n = 30)), Tai-Kadai languages (Dong, n = 40; Bouyei, n = 53; Gelao, n = 39; Zhuang, n = 30; Sui, n = 40), and Sino-Tibetan languages (Yi, n = 40; Bai, n = 30; Tujia, n = 39); Hui, n = 40) (Table 1).

Table 1.

The information on samples collection.

Language branch	Ethnic	No. of samples	Location
Hmong–Mien languages	Miao	94	Bijie; Qiannan;Qiandongnan
Hmong–Mien languages	Yao	30	Qiannan
Tai–Kadai languages	Dong	40	Qiandongnan
	Bouyei	53	Bijie; Qiannan
	Gelao	39	Bijie
	Zhuang	30	Qiandongnan
	Sui	40	Qiannan
Sino-Tibetan languages	Yi	40	Bijie
	Bai	30	Bijie
	Tujia	39	Tongren
	Hui	40	Bijie

Open in a new tab

This study was undertaken according to the Measures for the Ethical Review of Life Science and Medical Research Involving Humans. This study was supported by the Research Ethics Committee of the Affiliated Hospital of Guizhou Medical University (IRB No. 2021201), and before participating in this study, all individuals were required to complete an informed consent form. An informed permission form was signed by each of the recruited volunteers.

Sequencing library construction and reads alignment

Upon receiving signed informed consent from the donors, we collected blood samples from each participant. Genomic DNA was extracted from these samples and sequenced on the DNBSEQ-T1 platform. Following are the steps taken to generate a whole-genome library: (1) Covaris was used to randomly fragment DNA, and then magnetic beads were used to screen genomic DNA fragments with an average size of 200–400 bp. (2) Terminal repair of DNA fragments was performed, followed by 3′ adenylation. Adaptors were attached to the ends of these 3′ adenylated fragments by T4 DNA ligase. (3) PCR was used to amplify genome products. (4) A qualified genome library was constructed and sample quality checked prior to sequencing with paired-end 100 bp reads on the DNBSEQ T1 platform with an average depth of 5.5 ×. To evaluate the quality of genotype calling, we increased the sequencing depth to 30 × for a subset of 34 individuals. For each sample, the cleaned reads were aligned to the GRCh37 and GRCh38 human genome reference using BWA, with the paired-end read alignment option¹¹. Following the initial data quality control, the samtools rmdup module to remove duplicates, and GATK (v3.8) to recalibrate the base quality score^12,13 We have identified duplicate samples using the—genome function of Plink version 1.9, and we have calculated the sequencing depth, GC content, Q20, and Q30 quality metrics for each sample, with the relevant details provided below. See Supplementary Fig. S1 for further information.

Variant calling and filtering

After removing duplicates and recalibrating the base quality scores (BQSR), variants were called using the Haplotyper algorithm in GVCF mode. Subsequently, the GVCFtyper algorithm was utilized to perform joint-calling and generate the cohort VCF. Variant Quality Score Recalibration (VQSR) was conducted using the Genome Analysis Toolkit (GATK version 4.1.2). The truth-sensitivity-filter-level was set to 99.1 for SNPs and 99.8 for Indels.

To ensure high quality, we applied several filters to exclude low-quality variants. First, variants that did not pass VQSR were removed. Then, we eliminated sites with genotype quality (GQ) < 10 in more than 50% of samples. Additionally, variants showing excessive heterozygosity or with ExcessHet > 54.69 in the INFO column, calculated by GATK, were filtered out. We also removed variants that were closer to an INDEL, using the bcftools filter command with the options—SnpGap 3 and—IndelGap 5.

Assessment of genotype calling accuracy

We utilized sequencing and genotype data from 34 individuals (30 ×) to assess the accuracy of GMGD low-depth WGS genotype calling. The genotype obtained from high coverage WGS was considered as the true genotype, while the called variants served as the test set. Variants with calling rates exceeding 95% per sample and a frequency above 1% were preserved. Ultimately, we estimated genotype concordance by considering only the variants in autosomes that were identified by both low WGS and high WGS, encompassing NR Sensitivity and NR genotype concordance. In addition to this, we also downsampled 50 30X high-coverage samples from the 1000 Genomes Project to 1 ×, 5 ×, and 15 × to conduct corresponding data validation and calculate genotype concordance.

Variant annotation and population structure analysis

The high-confidence variants were used for downstream analyses and statistics. We utilized the Ensembl Variant Effect Predictor (VEP) in conjunction with the relevant VEP-compiled annotation database to annotate the variants¹⁴. To display the genetic substructure between different ethnic branches, PCA and admixture were constructed by using PLINK (v1.9) and ADMIXTURE analyses^15,16.

Reference panel construction and Evaluation for imputation

We employed BEAGLE version 5.2 to conduct haplotype phasing of all 476 samples, utilizing the default settings. The chromosome was partitioned into overlapping chunks of 1 Mb in size, with a 0.1 Mb overlap between adjacent chunks. Then we evaluated the accuracy of genotype imputation for GMGD panels by using 50 Chinese populations in 1 KG.

Database implementation

GMGD is available at https://db.cngb.org/pop/gmgd/. The specific IT strategies are as follows: Considering the open source and popularity, flask (https://flask.palletsprojects.com/) and MongoDB (https://www.mongodb.com/) were chosen as the backend framework and the database engine, separately. The VUE framework (https://vuejs.org/) was used to develop the web frontend interface. To achieve web functions, we utilized JQuery, a versatile and comprehensive JavaScript library, along with Bootstrap, an open-source toolkit for web development, as plugins for our project. Furthermore, our website incorporated D3, a JavaScript library for data-driven document manipulation, and Echart, an open-source toolkit for data visualization, to effectively visualize genomics data.

Ethics approval and consent to participate

This study design and procedures were approved by the Institutional Review Board (IRB) of Guizhou Medical University of China (IRB No. 2021201), and before participating in this study, all individuals were required to complete an informed consent form. An informed consent form was signed by each of the recruited volunteers.

Results

Variants identification

By aligning to different reference genomes, a total of 16,336,982 variants (Ts/Tv ratio = 2.11) were filtered in GRCh38, of which 11.9% (1,956,322) are entirely new and do not exist in the dbSNP database. Similarly, a total of 16,205,829 variants (Ts/Tv ratio = 2.12) were filtered in GRCh37, with 18.5% (3,009,431) of these variants being entirely new and not present in the dbSNP database.

Assuming the high-depth WGS data as the gold standard, we achieved that the genotype concordance and non-reference (NR) genotype concordance rate extended to 0.951 and 0.9606 with increasing sequencing depth (Fig. 1A,B). The NR sensitivity even reaches 0.9907 (Fig. 1C). Additionally, downsampling was performed on 50 30X high-coverage samples from the1000 Genomes Project, and the results showed that genotype concordance is positively correlated with sequencing depth. Moreover, the genotype concordance rate is quite similar when the sequencing depths are comparable (Fig. 1D). These statistical data indicate the high quality of our mutated dataset.

Genotype Evaluation of the GMGD WGS data. (A) Genotype concordance rate versus sequencing depth for 33 GMGD samples in different Ethnic. (B) Non reference Sensitivity rate versus sequencing depth for 33 GMGD samples in different Ethnic. (C) Non reference Concordance rate versus sequencing depth for 33 GMGD samples in different Ethnic. (D) Genotype concordance rates for 34 GMGD samples and downsampling of 50 samples from the 1000 Genomes Project.

Search for variants (genomes)

GMGD provides users with five methods to search for variant information based on two reference genome versions of GRCh37 and GRCh38: Gene, Transcript, Variant, Multi-allelic variant, and genomic Region that spans not more than 100 kb (Fig. 2A).

Screenshots of the BRCA1 gene search and its corresponding search results are shown below. (A) The BRCA1 gene was used as an example for the search, where the corresponding transcript, variant, and region can all be used to obtain a similar results page. The following shows the query results. (B) The gene summary includes the number of variants and a UCSC browser link. (C) A coverage plot for a transcript is displayed. (D) Details of variants queried in the database are shown. (E) It is possible to specify a variant to view the allele frequency in other population studies.

For example, researcher can search for BRCA1, one of the most common breast cancer genes, on the home page and get the gene summary (Fig. 2B) including the coverage plot of the query transcripts or the genomic regions (Fig. 2C) and corresponding variants data comprising of chromosome positions, annotation, allele frequency (Fig. 2D), etc. We also provided relevant information on all variations including population frequency among different ethnic groups to discover ethnic-specific variations quickly (Fig. 2E). Search in any method, all the relevant elements will be displayed for understanding of the corresponding relationships between different elements and achieving interactive retrieval. users can click on the variations they pay close attention to enter a more detailed information page.

Overview page

The overview page displays genome statistics information, which includes variant density, allele frequency spectrum, base change, Ts/Tv ratio, variant quality, and variant type. The distribution of variation on all chromosomes is calculated based on two reference genome versions: GRCh37 and GRCh38. The user-friendly interfaces provide selection functionality for users to select reference genome versions by clicking on the dropdown box or chromosome regions of interest by clicking the color block of references or the horizontal axis of the variant density (Fig. 3A).

Principal Component Analysis (PCA) and Admixture analyses displayed on the overview page show that the clustering model of the Guizhou population is consistent with the linguistic classification of populations from different regions around the world. Interactive features assist users in understanding the structural relationships of various ethnic groups on a global scale. Clicking on a population in the PCA legend allows for hiding or accessing each population. Zooming in on a specific area can be done by using the scroll function on the mouse. Different k-values can be selected from the dropdown menu above the Admixture analysis. Additionally, the proportion of each ancestral component will be displayed when the mouse hovers over a sample (Fig. 3B).

Imputation development

In evaluating the imputation quality of the GMGD panel, we conducted a comparison of the squared correlation coefficient (R²) between the genotypes identified in the high coverage data of the 50 Chinese Southern Han in the 1000 Genomes Project and the imputed genotypes in the array data. We filtered for variants with an info score > 0.4 and a p value from a chi-square test of Hardy–Weinberg equilibrium larger than 10^–6. Ultimately, a total of 15,717,793 variants were effectively imputed. The mean imputation accuracy for these variants was 0.94, indicating the high quality of the GMGD panel imputation (Fig. 4).

The histogram of imputed variants and Pearson correlation coefficients for GMGD panel.

To further facilitate genotype imputation among the Chinese population and ethnic minorities, we developed a user-friendly imputation server (https://db.cngb.org/imputation/) for public use. This platform allows users to freely register, create imputation jobs, and upload their VCF-formatted array data in a bgzipped format, all within a secure data environment.

Discussion

Establishing a comprehensive multi-ethnic genome resource platform will help to understand the diversity of human genetic resources around the world. The large-scale genetic data such as The 1000 Genomes Project ¹⁷, China Metabolic Profiling Project (ChinaMAP)⁸, NyuWa genome¹⁸, Westlake BioBank for Chinese (WBBC)⁹ pilot project and Taiwan Biobank¹⁹ have remodeled fine-scale genetic profiles of the major populations in China as well as a detailed framework of the population evolutionary history, which provided new opportunities for Chinese genetic research in health and disease, there are relatively few samples in the entire southwestern region. The genome reference sequences of ethnic minorities are extremely scarce.

GMGD is currently the genome database for the largest number of ethnic groups. Aiming to fill the current gap in genetic data representation. Compared with commonly used reference genome database such as HGDP, GMGD have more genome samples and abundant population diversity. For example, there are 10 Miao ethnic groups in HGDP and 94 in GMGD²⁰. This is also the first complete genome resource of the Gelao ethnic group²¹. GMGD will greatly enrich the abundance and breadth of population genetic resources. Besides, we employed Principal Component Analysis (PCA) and admixture analysis to illustrate population structure among ethnic groups. and we developed an imputation server with a user-friendly website interface for public use (URL) it can To Efficient, Accurate facilitate genotype imputation for different ethnic populations will contribute to a more comprehensive and accurate understanding of human genetic diversity and population evolution history and research on related diseases. GMGD with these features, constitutes a significant step toward understanding of the evolutionary and medical implications of different ethnic groups in Southwest China. By doing so, we not only facilitate data exchange and resource sharing but also advance frontier explorations in research and foster industrial transformations.

Currently, our database includes 476 males, covers 11 ethnic groups, and provides genome-wide variation information. In the future, we plan to improve the database in the following areas. Firstly, we will expand the sample size to enroll more samples and ethnic groups to form a more comprehensive genetic resource of Chinese. Secondly, we will enrich the database content, such as CNV information, Y chromosome haplotype information, etc., to achieve a broader range of user interaction functions. Furthermore, integration and interaction with other databases will also become the direction of efforts.

In general, as the central repository of ethnic minority genomic data in China, GMGD will become a useful database and effective platform for the evolution of population genetics and medical research practices.

Conclusion

The GMGD interactive interface facilitates researchers with multi-dimensional mutation search methods and displays population frequency differences among global populations. Furthermore, GMGD is equipped with a genotype-imputation function, enabling enhanced capabilities for low-depth genomic research or targeted region capture studies. GMGD offers unique insights into the genomic variation landscape of different ethnic groups, which are freely accessible at https://db.cngb.org/pop/gmgd/.

Data availability.

Project name: GUIZHOU MULTI-ETHNIC GENOME DATABASE. Project home page: https://db.cngb.org/pop/gmgd/. The variation data reported in this paper have been deposited in the Genome Variation Map (GVM)²² in National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation²³, under accession number GVM000478: https://ngdc.cncb.ac.cn/gvm.

Supplementary Information

Supplementary Figure S1.^{(2.4MB, pdf)}

Acknowledgements

We are especially grateful to Xiaofeng Wei, Running Lin, and Ziyuan Ke from BGI for providing IT technical support to set up and upgrade GMGB. This work was supported by China National GeneBank (CNGB).

Author contributions

X.J. and X.Q. both contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Y.L., K.S., Y.H., C.L., C.W., T.Z., R.L., X.L., A.Z., and D.X. The first draft of the manuscript was written by Y.H., C.L., S.Z., F.L. X.J. and X.Q. both reviewed, edited, and approved the final manuscript.

Funding

Funding was provided by the Natural Science Foundation of China (Grant Nos. 32360154, 31560306, 82360813), the Guizhou Science and Technology Support Project (Social Development Field [2019] 2807), the Science and Technology Fund of Guizhou Provincial Health Commission (gzwkj2024-056), the project of Key Laboratory of Endemic and Ethnic Diseases, Ministry of Education, Guizhou Medical University (No.FZSW-2022-001), and Guizhou Provincial Science and Technology Program Project Grant (QKH Platform Talents-GCC[2023]035, QKH Platform Talents-CXTD[2023]003, QKH-ZK-2022-401). Guangzhou Basic and Applied Basic Research Foundation (No.202201010189), Open Project of State Key Laboratory of Respiratory Disease (No.SKLRD-OP-202309), the Innovation Platform for Academicians of Hainan Province (No.YSPTZX202118), Key-Area Research and Development Program of Guangdong Province (No.2023B0303040001), the Shenzhen Key Laboratory of Genomics (No.CXB200903110066A), Guangdong Provincial Key Laboratory of Human Disease Genomics (No.2020B1212070028), Guangdong Provincial Key Laboratory of Genome Read and Write (No.2017B030301011), and the China National GeneBank.

Competing interests

The authors declare no competing interests.

Footnotes

The original online version of this Article was revised: The original version of this Article contained errors in the Funding section. Full information regarding the corrections made can be found in the correction for this Article.

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Yan He and Changgui Lei.

Change history

7/9/2024

A Correction to this paper has been published: 10.1038/s41598-024-66738-y

Contributor Information

Xiaolan Qi, Email: xlq@gmc.edu.cn.

Xin Jin, Email: jinxin@genomics.cn.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-024-63892-1.

References

1.Chatterjee N, Shi J, Garcia-Closas M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet. 2016;17(7):392–406. doi: 10.1038/nrg.2016.27. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, Taliun SAG, Corvelo A, Gogarten SM, Kang HM, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590(7845):290–299. doi: 10.1038/s41586-021-03205-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Liu Y, Xie J, Wang M, Liu C, Zhu J, Zou X, Li W, Wang L, Leng C, Xu Q, et al. Genomic insights into the population history and biological adaptation of southwestern Chinese Hmong-Mien people. Front. Genet. 2021;12:815160. doi: 10.3389/fgene.2021.815160. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Wang Q, Zhao J, Ren Z, Sun J, He G, Guo J, Zhang H, Ji J, Liu Y, Yang M, et al. Male-dominated migration and massive assimilation of indigenous East Asians in the formation of Muslim Hui people in southwest China. Front. Genet. 2020;11:618614. doi: 10.3389/fgene.2020.618614. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Liu C, Han X, Min Y, Liu H, Xu Q, Yang X, Huang S, Chen Z, Liu C. Genetic polymorphism analysis of 40 Y-chromosomal STR loci in seven populations from South China. Forensic Sci. Int. 2018;291:109–114. doi: 10.1016/j.forsciint.2018.08.003. [DOI] [PubMed] [Google Scholar]
6.Liu Y, Zhang H, He G, Ren Z, Zhang H, Wang Q, Ji J, Yang M, Guo J, Yang X, et al. Forensic features and population genetic structure of Dong, Yi, Han, and Chuanqing Human populations in Southwest China inferred from insertion/deletion markers. Front. Genet. 2020;11:360. doi: 10.3389/fgene.2020.00360. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Bin X, Wang R, Huang Y, Wei R, Zhu K, Yang X, Ma H, He G, Guo J, Zhao J, et al. Genomic insight into the population structure and admixture history of Tai-Kadai-speaking Sui people in southwest China. Front. Genet. 2021;12:735084. doi: 10.3389/fgene.2021.735084. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Cao Y, Li L, Xu M, Feng Z, Sun X, Lu J, Xu Y, Du P, Wang T, Hu R, et al. The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Res. 2020;30(9):717–731. doi: 10.1038/s41422-020-0322-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Cong PK, Bai WY, Li JC, Yang MY, Khederzadeh S, Gai SR, Li N, Liu YH, Yu SH, Zhao WW, et al. Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat. Commun. 2022;13(1):2939. doi: 10.1038/s41467-022-30526-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Chiang CWK, Mangul S, Robles C, Sankararaman S. A comprehensive map of genetic variation in the World’s largest ethnic group—Han Chinese. Mol. Biol. Evol. 2018;35(11):2736–2750. doi: 10.1093/molbev/msy170. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. Genome Project Data Processing S: The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43(5):491–498. doi: 10.1038/ng.806. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The ensembl variant effect predictor. Genome Biol. 2016;17(1):122. doi: 10.1186/s13059-016-0974-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81(3):559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D. Ancient admixture in human history. Genetics. 2012;192(3):1065–1093. doi: 10.1534/genetics.112.145037. [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Jeffrey DW, Eric S, Aakrosh R, Hie Lim K, Changhoon K, Ravi G, Kushal S, Elena SG, Rikky WP, Tushar B, et al. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature. 2019;576:106–111. doi: 10.1038/s41586-019-1793-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Zhang P, Luo H, Li Y, Wang Y, Wang J, Zheng Y, Niu Y, Shi Y, Zhou H, Song T, et al. NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep. 2021;37(7):110017. doi: 10.1016/j.celrep.2021.110017. [DOI] [PubMed] [Google Scholar]
19.Feng YA, Chen CY, Chen TT, Kuo PH, Hsu YH, Yang HI, Chen WJ, Su MW, Chu HW, Shen CY, et al. Taiwan Biobank: A rich biomedical research database of the Taiwanese population. Cell Genom. 2022;2(11):100197. doi: 10.1016/j.xgen.2022.100197. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Bergstrom A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, Chen Y, Felkel S, Hallast P, Kamm J, et al. Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020;367(6484):eaay5012. doi: 10.1126/science.aay5012. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.He G, Wang Z, Zou X, Wang M, Liu J, Wang S, Ye Z, Chen P, Hou Y. Tai-Kadai-speaking Gelao population: Forensic features, genetic diversity and population structure. Forensic Sci. Int. Genet. 2019;40:e231–e239. doi: 10.1016/j.fsigen.2019.03.013. [DOI] [PubMed] [Google Scholar]
22.Li C, Tian D, Tang B, Liu X, Teng X, Zhao W, Zhang Z, Song S. Genome Variation Map: A worldwide collection of genome variations across multiple species. Nucleic Acids Res. 2021;49:D1186–D1191. doi: 10.1093/nar/gkaa1005. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Bai X. Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2022. Nucleic Acids Res. 2022;50:D27–D38. doi: 10.1093/nar/gkab951. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Figure S1.^{(2.4MB, pdf)}

Data Availability Statement

[CR1] 1.Chatterjee N, Shi J, Garcia-Closas M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet. 2016;17(7):392–406. doi: 10.1038/nrg.2016.27. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, Taliun SAG, Corvelo A, Gogarten SM, Kang HM, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590(7845):290–299. doi: 10.1038/s41586-021-03205-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] 3.Liu Y, Xie J, Wang M, Liu C, Zhu J, Zou X, Li W, Wang L, Leng C, Xu Q, et al. Genomic insights into the population history and biological adaptation of southwestern Chinese Hmong-Mien people. Front. Genet. 2021;12:815160. doi: 10.3389/fgene.2021.815160. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Wang Q, Zhao J, Ren Z, Sun J, He G, Guo J, Zhang H, Ji J, Liu Y, Yang M, et al. Male-dominated migration and massive assimilation of indigenous East Asians in the formation of Muslim Hui people in southwest China. Front. Genet. 2020;11:618614. doi: 10.3389/fgene.2020.618614. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Liu C, Han X, Min Y, Liu H, Xu Q, Yang X, Huang S, Chen Z, Liu C. Genetic polymorphism analysis of 40 Y-chromosomal STR loci in seven populations from South China. Forensic Sci. Int. 2018;291:109–114. doi: 10.1016/j.forsciint.2018.08.003. [DOI] [PubMed] [Google Scholar]

[CR6] 6.Liu Y, Zhang H, He G, Ren Z, Zhang H, Wang Q, Ji J, Yang M, Guo J, Yang X, et al. Forensic features and population genetic structure of Dong, Yi, Han, and Chuanqing Human populations in Southwest China inferred from insertion/deletion markers. Front. Genet. 2020;11:360. doi: 10.3389/fgene.2020.00360. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Bin X, Wang R, Huang Y, Wei R, Zhu K, Yang X, Ma H, He G, Guo J, Zhao J, et al. Genomic insight into the population structure and admixture history of Tai-Kadai-speaking Sui people in southwest China. Front. Genet. 2021;12:735084. doi: 10.3389/fgene.2021.735084. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Cao Y, Li L, Xu M, Feng Z, Sun X, Lu J, Xu Y, Du P, Wang T, Hu R, et al. The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Res. 2020;30(9):717–731. doi: 10.1038/s41422-020-0322-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Cong PK, Bai WY, Li JC, Yang MY, Khederzadeh S, Gai SR, Li N, Liu YH, Yu SH, Zhao WW, et al. Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat. Commun. 2022;13(1):2939. doi: 10.1038/s41467-022-30526-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Chiang CWK, Mangul S, Robles C, Sankararaman S. A comprehensive map of genetic variation in the World’s largest ethnic group—Han Chinese. Mol. Biol. Evol. 2018;35(11):2736–2750. doi: 10.1093/molbev/msy170. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. Genome Project Data Processing S: The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43(5):491–498. doi: 10.1038/ng.806. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The ensembl variant effect predictor. Genome Biol. 2016;17(1):122. doi: 10.1186/s13059-016-0974-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] 15.Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81(3):559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR16] 16.Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D. Ancient admixture in human history. Genetics. 2012;192(3):1065–1093. doi: 10.1534/genetics.112.145037. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR17] 17.Jeffrey DW, Eric S, Aakrosh R, Hie Lim K, Changhoon K, Ravi G, Kushal S, Elena SG, Rikky WP, Tushar B, et al. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature. 2019;576:106–111. doi: 10.1038/s41586-019-1793-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] 18.Zhang P, Luo H, Li Y, Wang Y, Wang J, Zheng Y, Niu Y, Shi Y, Zhou H, Song T, et al. NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep. 2021;37(7):110017. doi: 10.1016/j.celrep.2021.110017. [DOI] [PubMed] [Google Scholar]

[CR19] 19.Feng YA, Chen CY, Chen TT, Kuo PH, Hsu YH, Yang HI, Chen WJ, Su MW, Chu HW, Shen CY, et al. Taiwan Biobank: A rich biomedical research database of the Taiwanese population. Cell Genom. 2022;2(11):100197. doi: 10.1016/j.xgen.2022.100197. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] 20.Bergstrom A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, Chen Y, Felkel S, Hallast P, Kamm J, et al. Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020;367(6484):eaay5012. doi: 10.1126/science.aay5012. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR21] 21.He G, Wang Z, Zou X, Wang M, Liu J, Wang S, Ye Z, Chen P, Hou Y. Tai-Kadai-speaking Gelao population: Forensic features, genetic diversity and population structure. Forensic Sci. Int. Genet. 2019;40:e231–e239. doi: 10.1016/j.fsigen.2019.03.013. [DOI] [PubMed] [Google Scholar]

[CR22] 22.Li C, Tian D, Tang B, Liu X, Teng X, Zhao W, Zhang Z, Song S. Genome Variation Map: A worldwide collection of genome variations across multiple species. Nucleic Acids Res. 2021;49:D1186–D1191. doi: 10.1093/nar/gkaa1005. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR23] 23.Bai X. Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2022. Nucleic Acids Res. 2022;50:D27–D38. doi: 10.1093/nar/gkab951. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

A comprehensive whole genome database of ethnic minority populations

Yan He

Changgui Lei

Chanjuan Wan

Shuang Zeng

Ting Zhang

Fei Luo

Ruichao Li

Xiaokun Li

Anshu Zhao

Defu Xiao

Yunyan Luo

Keren Shan

Xiaolan Qi

Xin Jin

Abstract

Introduction

Materials and methods

Sample collection

Table 1.

Sequencing library construction and reads alignment

Variant calling and filtering

Assessment of genotype calling accuracy

Variant annotation and population structure analysis

Reference panel construction and Evaluation for imputation

Database implementation

Ethics approval and consent to participate

Results

Variants identification

Figure 1.

Search for variants (genomes)

Figure 2.

Overview page

Figure 3.

Imputation development

Figure 4.

Discussion

Conclusion

Data availability.

Supplementary Information

Acknowledgements

Author contributions

Funding

Competing interests

Footnotes

Contributor Information

Supplementary Information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases