Journal of Computational Biology. 2021 Mar 4;28(3):296–303. doi: 10.1089/cmb.2019.0434

PopInf: An Approach for Reproducibly Visualizing and Assigning Population Affiliation in Genomic Samples of Uncertain Origin

Angela M Taravella Oill, Anagha J Deshpande, Heini M Natri, Melissa A Wilson
PMCID: PMC7994427  PMID: 33074720

Abstract

Germline genetic variation contributes to cancer etiology, but self-reported race is not always consistent with genetic ancestry, and samples may not have identifying ancestry information. In this study, we describe a flexible computational pipeline, PopInf, to visualize principal component analysis output and assign ancestry to samples with unknown genetic ancestry, given a reference population panel of known origins. PopInf is implemented as a reproducible workflow in Snakemake with a tutorial on GitHub. We provide a preprocessed reference population panel that can be quickly and efficiently implemented in cancer genetics studies. We ran PopInf on The Cancer Genome Atlas (TCGA) liver cancer data and identify discrepancies between reported race and inferred genetic ancestry. The PopInf pipeline facilitates visualization and identification of genetic ancestry across samples, so that this ancestry can be accounted for in studies of disease risk.

Keywords: cancer GWAS, computational pipeline, population ancestry, principal component analysis, visualization

1. Introduction

Cancer is a complex disease with genetic and environmental factors contributing to its risk and progression. The underlying genetic architecture of cancer, similar to other complex diseases, is influenced by common population-specific genetic variation (Hindorff et al., 2011; Timpson et al., 2018). Common genetic variation is shared within populations of shared genetic ancestry. Unaccounted-for population structure can confound the results of genetic analyses, such as cancer GWAS, by causing spurious associations with disease phenotypes (Price et al., 2010). Thus, assessing genetic ancestry and population structure in studies on the effects of genetic loci and genetic background on cancer is crucial.

Cancer research has begun to recognize the importance of identifying genetic ancestry across patients in cancer genetic data sets (Yuan et al., 2018) and across cancer cell lines (Dutil et al., 2019). Yuan et al. (2018) characterized genetic ancestry across The Cancer Genome Atlas (TCGA) patient cohort to investigate the effect genetic ancestry has on genomic alterations across different cancers and to provide researchers with detailed ancestry information on each patient. Although this publicly accessible resource is of great research value for those using the TCGA data, researchers utilizing other data sets will have to independently infer the ancestry of their samples.

Methods and software are currently available to characterize population structure (Patterson et al., 2006; Alexander et al., 2009), estimate local and global ancestry proportions (Alexander et al., 2009; Maples et al., 2013), or predict ancestry using genomic data (Pedersen and Quinlan, 2017). Some of these rely on a predefined reference panel and may not report admixed samples. Having an easily reproducible and modifiable workflow to visualize principal component analysis (PCA) and identify ancestry in individuals of unknown ancestral origin would thus be a useful addition to the cancer genetics researcher's tool kit.

In this study, we present PopInf v1.0, a pipeline to visualize PCA output and assign ancestry to individuals with unknown ancestry, given a flexible reference population panel of known origins. PopInf v1.0 takes, as input, variants from a sample with unknown or unverified genetic ancestry in variant call format (VCF), compares the variants in the unknown sample with a user-defined reference panel, and outputs an inferred ancestry origin report with accompanying PCA plots of the unknown samples and the reference panel. We ran PopInf on variants from 148 samples from the Genotype Tissue Expression (GTEx) Project (Lonsdale et al., 2013) and on 403 samples from germline tissue from the TCGA liver cancer data set (Ally et al., 2017; Grossman et al., 2016) and identify discrepancies between reported race and inferred genetic ancestry. Furthermore, we analyze each sample by chromosome and find cases of chromosome-specific admixture that is not reported in genome-wide analyses.

2. Materials and Methods

PopInf v1.0 uses a combination of publicly available software and custom scripts to generate PCA plots and a tab-delimited inferred ancestry report for samples of unknown ancestry or unverified self-reported population ancestry. PopInf v1.0 uses GATK v3.7 (McKenna et al., 2010), VCFtools v.0.1.14 (Danecek et al., 2011), bedtools v.2.27.1 (Quinlan and Hall, 2010), and Plink v.1.9 (Chang et al., 2015) to prepare the unknown ancestry data set and reference panel; smartpca, a program within the EIGENSOFT v6.0.1 package (Patterson et al., 2006), for PCA; and a custom R script (R Development Core Team, 2011) to infer each individual's ancestry and plot the results of PCA of the study samples and reference panel. Our pipeline is incorporated into the reproducible workflow system Snakemake v5.4.0 (Koster and Rahmann, 2012).

2.1. Input

Two sets of variant data are required to use PopInf v1.0: (1) variants from reference populations, and (2) variants from study samples—sample(s) of unknown or self-reported race or ancestry. These files need to be mapped to the same reference genome and in VCF file format. In addition, two sample information text files, one for the reference panel and one for the study samples, are needed for input, each with three tab-delimited columns. For the reference panel sample information text file, column one must contain sample names identical to the naming in the VCF file with one sample per row; column two must specify genetic sex information (“Male,” “Female,” or “N/A” if unknown, case insensitive); column three must contain population assignment. For the study sample information text file, columns one and two are similar to the reference panel file, but column three is a dummy variable with a single arbitrary value that is the same on every row. For example, column three of the sample information text file for the unknown set of samples could be set as “unknown.” Finally, the user must provide the FASTA file (.fa) of the reference genome used for read mapping along with a FASTA index file (.fai) and a sequence dictionary file (.dict).
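The sample information file layout described above can be checked before a run. The following is a minimal Python sketch, not part of PopInf itself; the function name `read_sample_info`, the sample names, and the population labels are hypothetical and chosen only for illustration.

```python
import csv
import io

# Accepted values for column two, compared case-insensitively.
VALID_SEX = {"male", "female", "n/a"}

def read_sample_info(handle, study=False):
    """Parse a three-column, tab-delimited sample information file.

    Returns a list of (sample, sex, population) tuples and raises
    ValueError on the first row that breaks the layout described above.
    """
    rows = []
    for line_no, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(fields)}")
        sample, sex, population = (f.strip() for f in fields)
        if sex.lower() not in VALID_SEX:
            raise ValueError(f"line {line_no}: unrecognized sex value {sex!r}")
        rows.append((sample, sex, population))
    # For the study file, column three is a dummy value repeated on every row.
    if study and len({pop for _, _, pop in rows}) > 1:
        raise ValueError("study file: column three must hold one value on every row")
    return rows

# Hypothetical file contents, held in memory for the example.
reference = read_sample_info(io.StringIO("ref_1\tMale\tEurope\nref_2\tfemale\tEast Asia\n"))
study = read_sample_info(io.StringIO("S1\tN/A\tunknown\nS2\tFemale\tunknown\n"), study=True)
```

The check that sample names match those in the VCF file is left out here because it requires reading the VCF header.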

2.2. Data processing

PopInf v1.0 implements filtering, merging, and file conversion before PCA. Single nucleotide polymorphisms (SNPs) are extracted from both the reference panel and study sample VCF files using GATK v3.7 SelectVariants and merged using GATK v3.7 CombineVariants (McKenna et al., 2010). To ensure PopInf analyzes SNPs that overlap with both the reference and unknown variant sets, sites with missing genotype data are removed using VCFtools v.0.1.14 (the vcftools --max-missing flag) (Danecek et al., 2011). If analyzing the X chromosome, the pseudoautosomal regions and X-transposed region (Skaletsky et al., 2003; Ross et al., 2005) are masked using bedtools v.2.27.1 (Quinlan and Hall, 2010). Before running PCA, the merged VCF file is pruned for linkage disequilibrium and converted to plink format using Plink v1.9 (Chang et al., 2015). PCA on a user-defined set of chromosomes (e.g., whole genome, all autosomes, or a single chromosome) is carried out using smartpca (Patterson et al., 2006).
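To illustrate what the missing-data filter does, the following Python sketch keeps only sites whose call rate meets a threshold. It is a conceptual stand-in for the vcftools --max-missing filter, not PopInf code; the threshold of 1.0 (no missing genotypes allowed) is an assumption, since the text does not state the value used, and the helper names and toy genotypes are hypothetical.

```python
def call_rate(genotypes):
    """Fraction of genotypes at a site that are fully called.

    Genotypes are VCF-style strings such as "0/1"; a "." marks a missing allele.
    """
    called = sum(1 for gt in genotypes if "." not in gt)
    return called / len(genotypes)

def filter_sites(sites, min_call_rate=1.0):
    """Keep (position, genotypes) pairs whose call rate is at least min_call_rate.

    ASSUMPTION: min_call_rate=1.0 mirrors a strict setting in which any
    missing genotype removes the site; the value used by PopInf is not
    given in the text.
    """
    return [(pos, gts) for pos, gts in sites if call_rate(gts) >= min_call_rate]

# Toy merged data: three sites, genotypes for reference and study samples together.
sites = [
    (101, ["0/0", "0/1", "1/1"]),
    (202, ["0/0", "./.", "0/1"]),  # one missing genotype
    (303, ["1/1", "1/1", "0/1"]),
]
kept = filter_sites(sites)
```

With the strict threshold, only sites genotyped in every reference and study sample remain, which is what restricts the PCA to SNPs shared by both variant sets.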

2.3. Output

PopInf v1.0 generates PCA plots for the first 10 PCs for the study samples and the reference panel, and an inferred ancestry report. Genetic ancestry of each study sample is inferred based on the distance between the study sample and the centroid coordinates of PCs 1 and 2 of each reference population. A study sample is inferred to originate from a particular population if it falls within N standard deviations (SDs) of that population's centroid. To provide multiple levels of confidence, ancestry is inferred using 1, 2, and 3 SDs. If the sample does not fall within three SDs of any population, it is assigned either admixed ancestry or uncertain ancestry. To decide between the two, PopInf calculates the midpoint coordinates between each pairwise combination of reference populations and compares the distance from the study sample to each midpoint with its distance to each centroid. For a sample to be assigned to two populations (admixed), it must be closer to the midpoint of two populations than to the centroid of any population. If the study sample is instead closer to the centroid of a population than to any of the midpoints, it is assigned as uncertain, and PopInf additionally specifies in the output report the population closest to the study sample.
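The assignment rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the custom R script that PopInf uses: the text does not say how the SD of a population is computed, so the sketch assumes one scalar SD per population (the root-mean-square distance of its reference samples to the centroid), and it assumes that a sample inside the circles of several populations goes to the nearest of them. All function names and coordinates are hypothetical.

```python
import math
from itertools import combinations

def centroid(points):
    """Mean PC1/PC2 coordinates of a list of (pc1, pc2) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def spread(points, center):
    """ASSUMPTION: scalar SD = root-mean-square distance to the centroid."""
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in points) / len(points))

def infer_ancestry(sample, reference):
    """Return (label, detail) for one study sample.

    reference maps a population name to its list of (pc1, pc2) points.
    """
    centers = {pop: centroid(pts) for pop, pts in reference.items()}
    sds = {pop: spread(pts, centers[pop]) for pop, pts in reference.items()}
    dist = {pop: math.dist(sample, c) for pop, c in centers.items()}

    # Step 1: report the tightest of the 1, 2, and 3 SD circles containing the sample.
    for n_sd in (1, 2, 3):
        inside = [pop for pop in dist if dist[pop] <= n_sd * sds[pop]]
        if inside:
            # ASSUMPTION: ties between populations go to the nearest centroid.
            return min(inside, key=dist.get), f"within {n_sd} SD"

    # Step 2: compare distances to pairwise midpoints with distances to centroids.
    closest = min(dist, key=dist.get)
    mid_dist = {}
    for a, b in combinations(sorted(centers), 2):
        midpoint = ((centers[a][0] + centers[b][0]) / 2,
                    (centers[a][1] + centers[b][1]) / 2)
        mid_dist[(a, b)] = math.dist(sample, midpoint)
    if mid_dist:
        pair = min(mid_dist, key=mid_dist.get)
        if mid_dist[pair] < dist[closest]:
            return "admixed", f"{pair[0]}/{pair[1]}"
    return "uncertain", f"closest to {closest}"

# Two toy reference populations on the PC1/PC2 plane.
reference = {
    "A": [(0, 0), (1, 0), (0, 1), (1, 1)],
    "B": [(10, 0), (11, 0), (10, 1), (11, 1)],
}
near_a = infer_ancestry((0.5, 1.5), reference)
between = infer_ancestry((5.5, 0.5), reference)
far_out = infer_ancestry((-4.0, 0.5), reference)
```

In this toy panel, a point just above population A falls within its 2 SD circle, a point halfway between the two centroids is labeled admixed, and a point beyond A on the side away from B is labeled uncertain with A reported as the closest population.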

3. Results

3.1. Usage examples

We ran PopInf v1.0 using variants from two human genetic data sets: one from healthy individuals and one from cancer patients. The GTEx Project (Lonsdale et al., 2013) data set consisted of 148 individuals and germline data from the TCGA liver cancer data set (Ally et al., 2017) consisted of 403 individuals (Supplementary Tables S1 and S2; Fig. 1). Both data sets included self-reported race for most individuals. We inferred the genetic ancestries of these samples based on a reference panel consisting of variants from 986 unrelated individuals from populations across Africa, Europe, East Asia, and South Asia from 1000 Genomes Release 3 (The 1000 Genomes Project Consortium, 2015) (Supplementary Table S3).

FIG. 1.


PCA output from sample data sets plotted against the reference data set. Principal components 1 and 2 for all individuals for (A) autosomes merged and (B) X chromosome for the GTEx data set, and (C) autosomes merged and (D) X chromosome for the TCGA liver cancer data set. Gray points represent the reference samples from the 1000 Genomes reference panel. Red, blue, green, and purple points represent inferred ancestry as provided by PopInf for the study samples from GTEx and TCGA. Darker shades represent study samples that fall closer to the reference populations whereas lighter shades represent study samples that fall further from the reference populations. Black points represent study samples that were assigned to two populations (admixed). “+” represents the centroid of each reference population and dashed circles represent 1, 2, and 3 standard deviations from each reference population centroid. GTEx, Genotype Tissue Expression; PCA, principal component analysis; TCGA, The Cancer Genome Atlas.

We find that, using genome-wide genotypes, the genetic ancestry of most study samples does match that which is reported, with notable exceptions, and that we are able to infer ancestry of samples of unreported origin. The inferred ancestry matches closely with the self-reported race information in the GTEx data set (Supplementary Table S4). One of the GTEx individuals was missing self-reported race. Based on genetic ancestry, this individual was inferred as admixed East Asian and South Asian (Supplementary Table S4). In the TCGA liver cancer data set, we found 16 individuals with discrepancies between self-reported race and inferred ancestry; for these individuals, their self-reported race was white and inferred ancestry was either admixed, South Asian, or uncertain but falling closest to East Asia (Supplementary Table S5). However, 10 of the individuals had a reported ethnicity of Hispanic or Latino (Supplementary Table S5). We further inferred ancestry for the 10 individuals in the TCGA liver cancer data set with no self-reported race (Supplementary Table S5).

We additionally ran PopInf v1.0 on each autosome and the X chromosome separately, finding that chromosome-specific ancestry does not always match that inferred from the whole genome (Fig. 2A, C). We identify 22 individuals in the GTEx data set and 88 individuals in the TCGA data set (Fig. 2B, D) with variation in chromosome-specific ancestry. All of the admixed individuals had different inferred ancestry results among their chromosomes, as expected. However, there were also individuals (15 from GTEx and 74 from TCGA) inferred as having only one ancestry or uncertain ancestry when analyzing all autosomes together that showed variation in chromosome-specific ancestry (Fig. 2B, D). These ancestry differences across the genome show that assigning ancestry based only on genome-wide genotypes may result in missing clusters of ancestry across any single chromosome, which may lower our ability to identify risk alleles in data sets consisting of samples of diverse and admixed backgrounds.
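The comparison described above, flagging individuals whose inferred ancestry varies among chromosomes, can be sketched as follows. This is a hypothetical Python illustration; the input layout, the key name `autosomes`, and the toy labels are assumptions and do not reflect the format of the PopInf report.

```python
def chromosome_discordant(calls, genome_wide_key="autosomes"):
    """Return {individual: sorted distinct labels} for individuals whose
    per-chromosome labels are not all identical.

    calls maps an individual to {chromosome: inferred ancestry label};
    the genome-wide entry is skipped when collecting per-chromosome labels.
    """
    flagged = {}
    for individual, by_chrom in calls.items():
        labels = {label for chrom, label in by_chrom.items() if chrom != genome_wide_key}
        if len(labels) > 1:
            flagged[individual] = sorted(labels)
    return flagged

# Toy input: S2 has one chromosome with a different call than the rest.
calls = {
    "S1": {"autosomes": "Europe", "chr1": "Europe", "chr2": "Europe", "chrX": "Europe"},
    "S2": {"autosomes": "Europe", "chr1": "Europe", "chr2": "admixed", "chrX": "Europe"},
}
flagged = chromosome_discordant(calls)
```

Here S2 would be flagged even though its call from all autosomes combined is a single population, which is the situation the paragraph above highlights.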

FIG. 2.


Inferred ancestry for all autosomes combined and each chromosome separately. (A) All 148 GTEx individuals, (B) the subset of GTEx individuals with variation in inferred ancestry among their chromosomes, (C) all 403 TCGA individuals, and (D) the subset of TCGA individuals with variation in inferred ancestry among their chromosomes. Males and females were run together, and the autosomes merged, each autosome separately, and the X chromosome were analyzed. The x-axis represents the chromosome analyzed and the y-axis represents the individual from the data set. Colors represent inferred ancestry, whereas the intensity of the shade represents how close the study sample fell to a given reference population.

4. Discussion

The utility of PopInf is that it can assess population structure and explore genetic structure in a flexible user-defined data set. This information can be used for identifying samples with unexpected genetic components, or as covariates in downstream analyses. PopInf, as currently implemented, can assign one or two ancestry components to an individual with unverified ancestry information, and will report back unknown ancestry proportions. PopInf does not yet assign more than two ancestry components and thus may underrepresent the variation in ancestry for a multi-admixed individual. When run on the 1000 Genomes population samples with known, but variable, admixture proportions, including Americans of African ancestry, PopInf identified those with two major ancestry components and flagged additional samples that appeared to have more than two ancestry components. In the usage example using the TCGA liver cancer data set, out of the 19 self-reported Hispanic individuals, PopInf called 10 as being admixed between two populations, whereas the rest were assigned to either Europe or South Asia. In many cases, individuals who self-report a certain racial or ethnic group may have ancestry from one or many global populations (Bryc et al., 2010, 2015).

Given the flexibility of the reference panel input, the user can alter the reference panel according to knowledge about the study samples. For example, the user may want to alter the reference panel if working with known admixed study samples that are not represented in 1000 Genomes. If the user is interested in fine-scale population structure or in quantifying specific admixture proportions, however, a different approach may be preferable, for example, inferring local ancestry or using a global ancestry assignment method such as ADMIXTURE (Alexander et al., 2009).

5. Conclusion

In this study, we provide a workflow that will set up and run PCA, summarize the PCA output, and provide the user with plots and an easily searchable inferred ancestry report for samples with unknown or unverified population information. Inferred ancestry results from the GTEx and TCGA data sets revealed heterogeneity in ancestry across the genome, and by chromosome. PopInf can be modified to work with any reference panel and may be applied to similarly infer chromosomal and genome-wide ancestry in diverse populations.

Supplementary Material

Supplemental data
Supp_Tables.zip (207KB, zip)

Acknowledgment

The authors thank Research Computing at Arizona State University for providing high-performance computing resources that have contributed to the research results reported within this article.

Disclaimer

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Data Accessibility

PopInf v1.0, the processed 1000 Genomes reference file used in this article, and an accompanying tutorial are available on GitHub (SexChrLab/PopInf, 2020).

Authors' Contributions

M.A.W. and A.M.T.O. conceived the ideas and designed the methodology. A.M.T.O. and A.J.D. collected and analyzed the data. H.M.N. contributed to processing the TCGA data. A.M.T.O. and M.A.W. led the writing of the first draft of the article. All authors contributed critically to writing and editing the drafts and gave final approval for publication.

Funding Information

This publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R35GM124827 to M.A.W. H.M.N. was supported by an ASU Center for Evolution and Medicine postdoctoral fellowship and the Marcia and Frank Carlucci Charitable Foundation postdoctoral award from the Prevent Cancer Foundation. A.M.T.O. was supported by The Graduate College at ASU and The Achievement Rewards for College Scientists (ARCS), Phoenix Chapter.


References

  1. SexChrLab/PopInf. 2020. Sex Chromosome Lab. https://github.com/SexChrLab/PopInf
  2. Alexander, D.H., Novembre, J., and Lange, K. 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664.
  3. Ally, A., Balasundaram, M., Carlsen, R., et al. 2017. Comprehensive and integrative genomic characterization of hepatocellular carcinoma. Cell 169, 1327–1341.e23.
  4. Bryc, K., Velez, C., Karafet, T., et al. 2010. Genome-wide patterns of population structure and admixture among Hispanic/Latino populations. Proc. Natl. Acad. Sci. 107, 8954–8961.
  5. Bryc, K., Durand, E.Y., Macpherson, M.J., et al. 2015. The genetic ancestry of African Americans, Latinos, and European Americans across the United States. Am. J. Hum. Genet. 96, 37–53.
  6. Chang, C.C., Chow, C.C., Tellier, L.C., et al. 2015. Second-generation PLINK: Rising to the challenge of larger and richer datasets. Gigascience 4, 7.
  7. Danecek, P., Auton, A., Abecasis, G., et al. 2011. The variant call format and VCFtools. Bioinformatics 27, 2156–2158.
  8. Dutil, J., Chen, Z., Monteiro, A.N., Teer, J.K., et al. 2019. An interactive resource to probe genetic diversity and estimated ancestry in cancer cell lines. Cancer Res. 79, 1263–1273.
  9. Grossman, R.L., Heath, A.P., Ferretti, V., et al. 2016. Toward a shared vision for cancer genomic data. N. Engl. J. Med. 375, 1109–1112.
  10. Hindorff, L.A., Gillanders, E.M., and Manolio, T.A. 2011. Genetic architecture of cancer and other complex diseases: Lessons learned and future directions. Carcinogenesis 32, 945–954.
  11. Koster, J., and Rahmann, S. 2012. Snakemake: A scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522.
  12. Lonsdale, J., Thomas, J., Salvatore, M., et al. 2013. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585.
  13. Maples, B.K., Gravel, S., Kenny, E.E., et al. 2013. RFMix: A discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278–288.
  14. McKenna, A., Hanna, M., Banks, E., et al. 2010. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303.
  15. Patterson, N., Price, A.L., and Reich, D. 2006. Population structure and Eigenanalysis. PLoS Genet. 2, e190.
  16. Pedersen, B.S., and Quinlan, A.R. 2017. Who's who? Detecting and resolving sample anomalies in human DNA sequencing studies with Peddy. Am. J. Hum. Genet. 100, 406–413.
  17. Price, A.L., Zaitlen, N.A., Reich, D., et al. 2010. New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet. 11, 459–463.
  18. Quinlan, A.R., and Hall, I.M. 2010. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842.
  19. R Development Core Team. 2011. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  20. Ross, M.T., Grafham, D.V., Coffey, A.J., et al. 2005. The DNA sequence of the human X chromosome. Nature 434, 325–337.
  21. Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P.J., et al. 2003. The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423, 825–837.
  22. The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526, 68–74.
  23. Timpson, N.J., Greenwood, C.M.T., Soranzo, N., et al. 2018. Genetic architecture: The shape of the genetic contribution to human traits and disease. Nat. Rev. Genet. 19, 110–124.
  24. Yuan, J., Hu, Z., Mahal, B.A., et al. 2018. Integrated analysis of genetic ancestry and genomic alterations across cancers. Cancer Cell 34, 549–560.e9.


Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.
