KinVis: a visualization tool to detect cryptic relatedness in genetic datasets

Ehsan Ullah; Michaël Aupetit; Arun Das; Abhishek Patil; Noora Al Muftah; Reda Rawi; Mohamad Saad; Halima Bensmail

Bioinformatics. 2018 Dec 24;35(15):2683–2685. doi: 10.1093/bioinformatics/bty1028

Editor: Alfonso Valencia

PMCID: PMC6931347 PMID: 30590437

Abstract

Motivation

In genome-wide association studies, it is important to characterize individual relatedness in terms of familial relationships and underlying population structure to ensure correct downstream analysis. This characterization becomes vital if the cohort is to be used as a reference panel in other studies for association tests and for identifying ethnic diversity. In this paper, we propose a kinship visualization tool to detect cryptic relatedness between subjects. We use multi-dimensional scaling, bar charts, heat maps and node-link visualizations to enable analysis of relatedness information.

Availability and implementation

KinVis is available online and can be downloaded at http://shiny-vis.qcri.org/public/kinvis/.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Population structure has been investigated directly from genetic similarity using model-based and non-parametric methods. Model-based approaches estimate individual ancestry by modeling the probability of the observed genotypes given ancestry proportions and population allele frequencies, assuming Hardy–Weinberg equilibrium and linkage equilibrium among loci. One issue with model-based methods is that they consider only individual markers and not their joint variation patterns (Padhukasahasram, 2014). Non-parametric methods, on the other hand, use multivariate techniques such as cluster analysis (Bouaziz et al., 2012; Lee et al., 2009) and principal component analysis (Galinsky et al., 2016; Limpiti et al., 2011; Price et al., 2006; Purcell et al., 2007). Most of these methods do not handle cryptic relatedness in the sample when estimating population structure, which may lead to inaccurate inference; Conomos et al. (2015) addressed this issue.

In this paper, we present KinVis, a kinship (relatedness) visualization tool. KinVis is a non-parametric, model-free alternative to existing methods for the statistical assessment and dissection of genetic background similarities among populations or individuals. KinVis helps users interactively detect and identify groups of populations with similar structure and groups of individuals with cryptic relatedness.

2 Material and methods

KinVis has been developed as an R-Shiny application. Users upload a set of input files to KinVis, each containing the similarities between individuals of a population. KinVis processes these data to compute the pairwise lineage similarity between individuals and the similarities at the population level (technical details are given in the Supplementary Material). KinVis can read standard relatedness data in .genome and .kinf formats produced by PLINK (Purcell et al., 2007) or EMMAX (Kang et al., 2010). Populations and individuals can be analyzed in two independent tabs. We applied KinVis to the 1000 Genomes Project Phase 3 dataset (Auton et al., 2015) to illustrate its features (Fig. 1).
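To make the input concrete, the following minimal Python sketch reads a PLINK .genome table and builds a symmetric pairwise matrix from its PI_HAT column. It is not KinVis code (KinVis is an R-Shiny application); the function names, the toy sample IDs and values, the choice of PI_HAT as the similarity, the diagonal value of 1.0 and the assumption that individual IDs are unique across families are all illustrative assumptions.

```python
import io

# Toy .genome content with the standard PLINK column layout (values invented).
SAMPLE = """\
 FID1  IID1  FID2  IID2 RT EZ     Z0     Z1     Z2 PI_HAT PHE      DST    PPC  RATIO
  F1 NA001    F1 NA002 UN NA 0.0000 1.0000 0.0000 0.5000  -1 0.850000 1.0000     NA
  F1 NA001    F2 NA003 UN NA 0.9376 0.0624 0.0000 0.0312  -1 0.720000 0.6000 2.1000
  F1 NA002    F2 NA003 UN NA 1.0000 0.0000 0.0000 0.0000  -1 0.710000 0.5000 2.0000
"""

def read_genome(handle):
    """Parse a whitespace-delimited .genome table into {(iid1, iid2): pi_hat}."""
    header = handle.readline().split()
    col1 = header.index("IID1")
    col2 = header.index("IID2")
    col_pi = header.index("PI_HAT")
    pairs = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pairs[(fields[col1], fields[col2])] = float(fields[col_pi])
    return pairs

def to_matrix(pairs):
    """Return (sorted ids, symmetric matrix); the diagonal is set to 1.0 by assumption."""
    ids = sorted({iid for pair in pairs for iid in pair})
    index = {iid: k for k, iid in enumerate(ids)}
    n = len(ids)
    matrix = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for (a, b), value in pairs.items():
        matrix[index[a]][index[b]] = value
        matrix[index[b]][index[a]] = value
    return ids, matrix

pairs = read_genome(io.StringIO(SAMPLE))
ids, matrix = to_matrix(pairs)
for iid, row in zip(ids, matrix):
    print(iid, row)
```

Keying pairs on the individual ID alone keeps the sketch short; real cohorts may need the family ID as part of the key.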

Fig. 1. Visualizations generated by KinVis for the 1000 Genomes Project: (a) Distribution of populations based on population sub-structure (BN). (b) Detailed rounded lineage value of each pair of individuals; the left margin encodes the lowest lineage in each row and the top margin encodes the largest lineage in each column. (c) Distribution of lineage level of each population (populations are ordered by decreasing proportion of the first five lineage levels). (d) Node-link diagram representation of relationships between individuals. We use a categorical color scale (bottom right) to code the lineage value, which is better suited to rapidly identifying items or evaluating their number in the bar chart, heatmap and node-link views. (A color version of this figure is available at Bioinformatics online.)

In the Population Overview tab, each population can be selected individually using a checkbox list with ‘un/select all’ and ‘revert’ selection buttons. Pressing the ‘Visualize’ button displays the selected populations as an MDS scatterplot and a bar chart in two sub-tabs. The MDS view shows a multi-dimensional scaling of the population similarities, which makes it possible to discover groups of populations with structural similarities based on lineage distribution if IBD (.genome file) is used, or based on Eigen decomposition if BN (.kinf file) is used (Fig. 1a) (Supplementary Material). Zooming and panning are available to ease exploration of cluttered areas, and populations there can be selected interactively using a lasso. The selected populations can be grouped and named, and the groups can be downloaded as a csv file. The MDS can be applied to a single group to analyze local relatedness that might otherwise be distorted by the global MDS. The percentage of total variance explained indicates the trustworthiness of the MDS. The bar chart view (Fig. 1c) is only available for IBD. It shows, for each population, the proportion of pairs of individuals related by a specific color-coded lineage score. A slider allows the populations to be ordered by increasing values up to the selected lineage. This makes it possible to spot which populations contain pairs of individuals with anomalously low lineage scores.
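The idea of an MDS embedding with a percentage of variance explained can be sketched as follows. This is a minimal illustration, assuming classical (Torgerson) MDS on a symmetric, Euclidean-embeddable distance matrix; the source does not state which MDS variant KinVis uses or how similarities are converted to distances, so the function name `classical_mds`, the power-iteration solver and the toy points are all assumptions.

```python
import math
import random

def classical_mds(dist, dims=2, iters=1000, seed=0):
    """Classical MDS via power iteration with deflation (pure Python).

    Returns (coords, explained), where explained is the share of the trace of
    the centred matrix captured by the first `dims` axes. Assumes the matrix
    is symmetric and its centred form has no large negative eigenvalues.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    total = sum(b[i][i] for i in range(n))  # trace equals the sum of eigenvalues
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    captured = 0.0
    for k in range(dims):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam <= 0.0:
            break  # no further positive axis to extract
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * v[i]
        captured += lam
        # Deflation removes the extracted component before the next axis.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    explained = captured / total if total > 0.0 else 0.0
    return coords, explained

# Toy example: four points on a rectangle, so two axes capture everything.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 1.0), (4.0, 1.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords, explained = classical_mds(dist)
print(f"variance explained by 2 axes: {explained:.1%}")
```

A low explained share would signal that the two-dimensional view distorts the underlying distances, which is the sense in which the percentage indicates trustworthiness. The pure-Python solver is only practical for small matrices such as a few dozen populations.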

In the Individual Overview tab, a population of interest can be chosen using a drop-down menu; the individuals of that population can then be selected independently in a checkbox list and displayed in sub-tabs with three different views. The MDS view shows the genetic distance between individuals based on IBD. The heatmap view (Fig. 1b) shows the full details of the color-coded pairwise lineage scores between individuals. Rows and columns of the matrix are ordered so that clusters of individuals appear as blocks of the same color. Finally, the node-link view (Fig. 1d) displays the relations (links color-coded by lineage value) between individuals (nodes) for which the lineage score lies between the two values set by a double slider. The data behind this node-link visualization can be downloaded as a csv file for further analysis and display in other network visualization software. As with populations, groups of selected individuals can be formed to focus the analysis on that group only, or on the rest only by using ‘revert’ selection. All figures can be downloaded as pdf files for printing and integration in a report.
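The double-slider filter and csv export of the node-link view can be sketched as below. This is an illustration only: the function names, the closed interval, the toy scores and the column names `source`, `target` and `score` are assumptions, and the actual layout of the csv file produced by KinVis is not described in the text.

```python
import csv
import io

def edges_between(pairs, low, high):
    """Keep pairs whose score lies in the closed interval [low, high]."""
    return [(a, b, score)
            for (a, b), score in sorted(pairs.items())
            if low <= score <= high]

def edges_to_csv(edges):
    """Serialise an edge list as csv text with a hypothetical header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", "score"])
    writer.writerows(edges)
    return buffer.getvalue()

# Toy pairwise scores (invented values standing in for lineage scores).
toy_pairs = {
    ("A", "B"): 1,
    ("A", "C"): 3,
    ("B", "C"): 7,
}

selected = edges_between(toy_pairs, 1, 3)
print(edges_to_csv(selected))
```

An edge list of this shape can be loaded directly by most network visualization packages.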

3 Conclusion

In large-scale genome-wide association studies (GWAS), accounting for population sub-structure and admixture is a major challenge in controlling type I errors and avoiding spurious associations, both of which are required for correct downstream analysis. In this work, we proposed KinVis, a visual analytics tool that analyzes GWAS input data to identify relatedness. KinVis visualizes relatedness among individuals and populations, helping users detect and identify cryptic relatedness. Our approach focuses on the relationships between pairs of individuals, unlike the approach proposed by Gazal et al. (2015), which focuses on the inbreeding coefficient of a single individual. Our goal is not to discover genetic markers specific to populations or to model ethnic diversity, ancestry or admixed populations with different proportions of inbreeding. Our approach generates interactive visualizations that can help in the identification of population groups, related individuals and a maximal set of unrelated individuals.
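In KinVis the identification of unrelated individuals is done visually and interactively. For readers who want a programmatic counterpart, the sketch below shows one common greedy heuristic, which is not taken from the paper: repeatedly drop the individual with the most related partners, then add back anyone who no longer conflicts. The function name, the threshold and the toy scores are assumptions; the result is maximal in the sense that no dropped individual can be re-added, but it is not guaranteed to be the largest possible set.

```python
def unrelated_subset(ids, pairs, threshold):
    """Greedy selection of individuals with no pairwise score >= threshold."""
    related = {i: set() for i in ids}
    for (a, b), score in pairs.items():
        if score >= threshold:
            related[a].add(b)
            related[b].add(a)
    keep = set(ids)
    while keep:
        # Pick the individual with the most related partners still kept;
        # ties are broken by ID so the result is deterministic.
        worst = max(keep, key=lambda i: (len(related[i] & keep), i))
        if not related[worst] & keep:
            break  # no related pairs remain
        keep.remove(worst)
    # Re-add dropped individuals that no longer conflict with the kept set.
    for i in sorted(set(ids) - keep):
        if not related[i] & keep:
            keep.add(i)
    return sorted(keep)

# Toy data: A is related to both B and C; D is unrelated to everyone.
toy_ids = ["A", "B", "C", "D"]
toy_pairs = {
    ("A", "B"): 0.50,
    ("A", "C"): 0.25,
    ("B", "C"): 0.00,
    ("A", "D"): 0.00,
    ("B", "D"): 0.01,
    ("C", "D"): 0.00,
}
print(unrelated_subset(toy_ids, toy_pairs, threshold=0.2))
```

Dropping the most connected individual first tends to preserve more of the cohort than dropping one member of each related pair arbitrarily.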

Conflict of Interest: none declared.

References

Auton A. et al. (2015) A global reference for human genetic variation. Nature, 526, 68–74.
Bouaziz M. et al. (2012) SHIPS: spectral hierarchical clustering for the inference of population structure in genetic studies. PLoS One, 7, e45685.
Conomos M.P. et al. (2015) Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. Genet. Epidemiol., 39, 276–293.
Galinsky K.J. et al. (2016) Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. Am. J. Human Genet., 98, 456–472.
Gazal S. et al. (2015) High level of inbreeding in final phase of 1000 genomes project. Sci. Rep., 5, 17453.
Kang H.M. et al. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat. Genet., 42, 348–354.
Lee C. et al. (2009) PCA-based population structure inference with generic clustering algorithms. BMC Bioinformatics, 10, S73.
Limpiti T. et al. (2011) Study of large and highly stratified population datasets by combining iterative pruning principal component analysis and structure. BMC Bioinformatics, 12, 255.
Padhukasahasram B. (2014) Inferring ancestry from population genomic data and its applications. Front. Genet., 5, 204.
Price A.L. et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet., 38, 904–909.
Purcell S. et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Human Genet., 81, 559–575.
