Advanced Genetics. 2022 Apr 5;3(3):2100066. doi: 10.1002/ggn2.202100066

Genotype Value Decomposition: Simple Methods for the Computation of Kernel Statistics

Kazuharu Misawa
PMCID: PMC9744480  PMID: 36620199

Abstract

Recent advances in sequencing technologies enable genome-wide analyses for thousands of individuals. The sequence kernel association test (SKAT) is a widely used method to test for associations between a phenotype and a set of rare variants. As the sample size of human genetics studies increases, the computational time required to calculate a kernel is becoming increasingly problematic. In this study, a new method to obtain kernel statistics without calculating a kernel matrix is proposed. Simple methods are presented for the computation of two kernel statistics: one based on a genetic relationship matrix (GRM) and one based on an identity by state (IBS) matrix. With this method, the kernel statistics can be calculated using vector operations alone, without matrix operations. The proposed method enables one to conduct SKAT for large samples in human genetics.

Keywords: genetic relationship matrix, identity by state, rare variants, sequence kernel association test


A simple method for the computation of two kernel statistics, one based on a genetic relationship matrix (GRM) and one based on an identity by state (IBS) matrix, is proposed. The proposed method can be used to conduct the sequence kernel association test (SKAT) for large human genetics datasets.


1. Introduction

A very large number of human genome sequences are now available for the study of human genetics, because of recent advances in genome sequencers. A recent study has shown that rare variants substantially contribute to phenotype variation.[ 1 ] Because each linkage disequilibrium block can be analyzed independently, increases in the number of sites can be tackled with parallel computation.[ 2 ] However, the statistical power of classical single‐marker association analysis for rare variants is quite limited.

To address this challenge, rare and low-frequency variants are often grouped at the gene or pathway level, and the effects of multiple variants are evaluated using collapsing methods.[ 3 , 4 ] The sequence kernel association test (SKAT)[ 5 , 6 ] is one such popular method. SKAT applies a test statistic S, defined by the quadratic form $S = y^T K y$, where K is the kernel matrix and y is the column vector of phenotypes defined by Equation (1).

$$y = (y(1), y(2), \cdots, y(n))^T \tag{1}$$

where y(i) is the phenotype value of the i‐th individual and n is the sample size. In the following section, we assume the average of the elements of y is 0.

Evaluating the probability density of the null distribution of S is important for conducting SKAT, but it requires computing a matrix related to the genotype covariance between markers, which requires a very long computational time. When the length of y is n, the matrix K has $n^2$ elements. For example, when the number of individuals is 10 000, the kernel matrix has 100 000 000 elements. A genetic relationship matrix (GRM) among individuals is used in genome-wide complex trait analysis[ 7 ] and in principal component analysis.[ 8 ] Identity by state (IBS) defines similarity between individuals as the number of shared alleles. The IBS kernel is used in linear regression[ 9 , 10 ] and SKAT.[ 5 ]

The aim of the present study is to develop simple methods for the computation of these two kernel statistics without calculating a GRM and an IBS matrix explicitly.

2. Theory

2.1. Genotype Value Vectors

In the present study, all sites are assumed to be biallelic, namely, each site has a reference allele and an alternative allele. Let us define genotype value vectors at site k. Let ak(i) be 1 when the individual i is a homozygote of a reference allele at site k, otherwise ak(i)=0. Let bk(i) be 1 when the individual i is the heterozygote at site k, otherwise bk(i)=0. Let ck(i) be 1 when the individual i is a homozygote of an alternative allele at site k, otherwise ck(i)=0. The vectors ak , bk , ck are defined by Equation (2).

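$$
\begin{aligned}
a_k &= (a_k(1), a_k(2), \cdots, a_k(n))^T \\
b_k &= (b_k(1), b_k(2), \cdots, b_k(n))^T \\
c_k &= (c_k(1), c_k(2), \cdots, c_k(n))^T
\end{aligned}
\tag{2}
$$

We refer to $a_k$, $b_k$, and $c_k$ as the genotype value vectors. Because $a_k(i) + b_k(i) + c_k(i) = 1$ for every individual i, it follows that

$$a_k + b_k + c_k = \mathbf{1} \tag{3}$$

where $\mathbf{1}$ is defined by $\mathbf{1} = (1, 1, \cdots, 1)^T$. The alternative allele frequency, p, is obtained by $p = (b_k^T \mathbf{1} + 2 c_k^T \mathbf{1})/(2n)$. In the following section, the Hardy–Weinberg equilibrium is assumed for this site. Namely, the frequencies of heterozygotes and homozygotes of the alternative allele are $2p(1-p)$ and $p^2$, respectively.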
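The definitions above can be illustrated with a minimal Python sketch. It assumes that genotypes are supplied as the number of alternative alleles (0, 1, or 2) per individual; the function names and the toy genotypes are hypothetical and are not taken from the author's scripts.

```python
# Hypothetical toy genotypes at one site, coded as the number of
# alternative alleles: 0 for 0/0, 1 for 0/1, 2 for 1/1.
def genotype_value_vectors(genotypes):
    """Return the indicator vectors (a, b, c) of Equation (2) as lists."""
    a = [1 if g == 0 else 0 for g in genotypes]  # reference homozygotes
    b = [1 if g == 1 else 0 for g in genotypes]  # heterozygotes
    c = [1 if g == 2 else 0 for g in genotypes]  # alternative homozygotes
    return a, b, c


def alt_allele_frequency(b, c):
    """Alternative allele frequency p = (b^T 1 + 2 c^T 1) / (2n)."""
    n = len(b)
    return (sum(b) + 2 * sum(c)) / (2 * n)


genotypes = [0, 1, 0, 2, 1, 0]
a, b, c = genotype_value_vectors(genotypes)
p = alt_allele_frequency(b, c)

# Equation (3): exactly one indicator is set for every individual.
assert all(ai + bi + ci == 1 for ai, bi, ci in zip(a, b, c))
```

In this toy example, two heterozygotes and one alternative homozygote among six individuals give four alternative alleles out of twelve.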

2.2. The GRM Kernel

The allele values are 0 for the reference allele, and 1 for the alternative allele. The separator between the alleles is “/” as used in the variant call format.[ 11 ] Let gk(i) be the genotype value for individual i at site k. In the present study, gk(i) is the number of alternative alleles. The relationship between the genotype of the individual i at site k and gk(i) is shown in Table  1 . The vector gk is defined by Equation (4).

$$g_k = (g_k(1), g_k(2), \cdots, g_k(n))^T \tag{4}$$

Table 1.

Relationship between genotype values and the number of alternative alleles

Genotype  ak(i)  bk(i)  ck(i)  gk(i)
0/0       1      0      0      0
0/1       0      1      0      1
1/1       0      0      1      2

Table 1 displays the relationships among the genotype value indicators $a_k(i)$, $b_k(i)$, and $c_k(i)$ and the genotype value $g_k(i)$. Because $g_k(i) = b_k(i) + 2c_k(i)$, $g_k$ is obtained by $g_k = b_k + 2c_k$.

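Let us denote the GRM at site k as $X_k$. We subtract the mean $\mu_k = \{\sum_{i=1}^{n} g_k(i)\}/n$ to obtain a matrix with row sums equal to 0. The ij-th element of $X_k$ at site k is obtained using Equation (5), and the full matrix is given by Equation (6).

$$X_k(i,j) = (g_k(i) - \mu_k)(g_k(j) - \mu_k) = g_k(i)g_k(j) - \mu_k g_k(i) - \mu_k g_k(j) + \mu_k^2 \tag{5}$$

$$
\begin{aligned}
X_k ={}&
\begin{pmatrix}
g_k(1)g_k(1) & g_k(1)g_k(2) & \cdots & g_k(1)g_k(n) \\
g_k(2)g_k(1) & g_k(2)g_k(2) & \cdots & g_k(2)g_k(n) \\
\vdots & \vdots & \ddots & \vdots \\
g_k(n)g_k(1) & g_k(n)g_k(2) & \cdots & g_k(n)g_k(n)
\end{pmatrix}
- \mu_k
\begin{pmatrix}
g_k(1) & g_k(1) & \cdots & g_k(1) \\
g_k(2) & g_k(2) & \cdots & g_k(2) \\
\vdots & \vdots & \ddots & \vdots \\
g_k(n) & g_k(n) & \cdots & g_k(n)
\end{pmatrix} \\
&- \mu_k
\begin{pmatrix}
g_k(1) & g_k(2) & \cdots & g_k(n) \\
g_k(1) & g_k(2) & \cdots & g_k(n) \\
\vdots & \vdots & \ddots & \vdots \\
g_k(1) & g_k(2) & \cdots & g_k(n)
\end{pmatrix}
+ \mu_k^2
\begin{pmatrix}
1 & 1 & \cdots & 1 \\
1 & 1 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & 1
\end{pmatrix}
\end{aligned}
\tag{6}
$$

Subsequently, the matrix $X_k$ can be obtained using the genotype value vectors.

Let us define a new matrix $G_k$. As shown in Table 1, the ij-th element of $G_k$ at site k can be calculated using Equation (7).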

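$$G_k(i,j) = g_k(i)g_k(j) \tag{7}$$

Subsequently, the matrix $G_k$ and the other matrices appearing in Equation (6) can be written as outer products of vectors.

$$
\begin{aligned}
G_k =
\begin{pmatrix}
g_k(1)g_k(1) & g_k(1)g_k(2) & \cdots & g_k(1)g_k(n) \\
g_k(2)g_k(1) & g_k(2)g_k(2) & \cdots & g_k(2)g_k(n) \\
\vdots & \vdots & \ddots & \vdots \\
g_k(n)g_k(1) & g_k(n)g_k(2) & \cdots & g_k(n)g_k(n)
\end{pmatrix}
&=
\begin{pmatrix} g_k(1) \\ g_k(2) \\ \vdots \\ g_k(n) \end{pmatrix}
\begin{pmatrix} g_k(1) & g_k(2) & \cdots & g_k(n) \end{pmatrix}
= g_k g_k^T \\
\begin{pmatrix}
g_k(1) & g_k(1) & \cdots & g_k(1) \\
g_k(2) & g_k(2) & \cdots & g_k(2) \\
\vdots & \vdots & \ddots & \vdots \\
g_k(n) & g_k(n) & \cdots & g_k(n)
\end{pmatrix}
&= g_k \mathbf{1}^T \\
\begin{pmatrix}
g_k(1) & g_k(2) & \cdots & g_k(n) \\
g_k(1) & g_k(2) & \cdots & g_k(n) \\
\vdots & \vdots & \ddots & \vdots \\
g_k(1) & g_k(2) & \cdots & g_k(n)
\end{pmatrix}
&= \mathbf{1} g_k^T \\
\begin{pmatrix}
1 & 1 & \cdots & 1 \\
1 & 1 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & 1
\end{pmatrix}
&= \mathbf{1}\mathbf{1}^T
\end{aligned}
\tag{8}
$$

Therefore, we obtain Equation (9).

$$X_k = G_k - \mu_k g_k \mathbf{1}^T - \mu_k \mathbf{1} g_k^T + \mu_k^2 \mathbf{1}\mathbf{1}^T \tag{9}$$

It is worth noting that

$$y^T X_k y = y^T G_k y - \mu_k y^T g_k \mathbf{1}^T y - \mu_k y^T \mathbf{1} g_k^T y + \mu_k^2 y^T \mathbf{1}\mathbf{1}^T y \tag{10}$$

Because $y^T \mathbf{1} = \mathbf{1}^T y = 0$, we obtain

$$y^T X_k y = y^T G_k y \tag{11}$$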
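Equation (11) can be checked numerically on a small example. The sketch below builds both matrices explicitly, which is only feasible for a tiny sample; the genotypes and phenotype values are hypothetical, and the phenotypes are centered so that their mean is 0, as assumed in the text.

```python
def quadratic_form(y, K):
    """Compute y^T K y for a square matrix K given as a list of rows."""
    n = len(y)
    return sum(y[i] * K[i][j] * y[j] for i in range(n) for j in range(n))


# Hypothetical toy data: alternative allele counts and raw phenotypes.
g = [0, 1, 0, 2, 1, 0]
raw = [5.1, 4.0, 5.6, 3.2, 4.4, 5.3]

n = len(g)
mean_y = sum(raw) / n
y = [v - mean_y for v in raw]          # centered phenotype, y^T 1 = 0
mu = sum(g) / n                        # mean genotype value mu_k

# Explicit n x n matrices: X_k from Equation (5), G_k from Equation (7).
X = [[(g[i] - mu) * (g[j] - mu) for j in range(n)] for i in range(n)]
G = [[g[i] * g[j] for j in range(n)] for i in range(n)]

s_x = quadratic_form(y, X)
s_g = quadratic_form(y, G)

# The two quadratic forms agree up to floating-point rounding.
assert abs(s_x - s_g) < 1e-9
```

The agreement depends on the centering step: if the raw phenotypes were used instead, the terms involving $\mu_k$ in Equation (10) would not vanish.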

By using the distributivity and associativity of matrix multiplication together with $g_k = b_k + 2c_k$, we obtain

$$G_k = (b_k + 2c_k)(b_k + 2c_k)^T \tag{12}$$

$Q_k$ is a scalar value of site k defined by Equation (13):

$$Q_k = y^T G_k y = y^T (b_k + 2c_k)(b_k + 2c_k)^T y = (y^T b_k + 2 y^T c_k)^2 \tag{13}$$

because the transpose of a product of matrices is the product, in the reverse order, of the transposes of the factors. Note that $y^T b_k + 2 y^T c_k$ is a scalar that can be obtained as

$$y^T b_k + 2 y^T c_k = \sum_{i=1}^{n} y(i)\{b_k(i) + 2c_k(i)\} \tag{14}$$
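As a minimal sketch of Equations (13) and (14), the function below computes $Q_k$ with a single pass over the individuals and compares it with the explicit double sum over $G_k$. The function names and toy values are hypothetical and are not the author's implementation.

```python
def q_statistic(y, b, c):
    """Q_k = (y^T b_k + 2 y^T c_k)^2, computed in one pass (Equation 14)."""
    t = sum(yi * (bi + 2 * ci) for yi, bi, ci in zip(y, b, c))
    return t * t


def q_statistic_explicit(y, b, c):
    """Reference value: y^T G_k y with G_k(i, j) = g_k(i) g_k(j)."""
    g = [bi + 2 * ci for bi, ci in zip(b, c)]
    n = len(y)
    return sum(y[i] * g[i] * g[j] * y[j] for i in range(n) for j in range(n))


# Hypothetical toy data for six individuals; y sums to zero.
y = [0.5, -1.0, 1.0, -1.5, -0.2, 1.2]
b = [0, 1, 0, 0, 1, 0]   # heterozygotes
c = [0, 0, 0, 1, 0, 0]   # alternative homozygotes

q_fast = q_statistic(y, b, c)
q_slow = q_statistic_explicit(y, b, c)
assert abs(q_fast - q_slow) < 1e-9
```

The one-pass version touches each individual once, whereas the explicit version visits every pair of individuals.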

2.3. The IBS Kernel

IBS defines similarity between individuals as the number of shared alleles. The IBS kernel is used in linear regression[ 9 , 10 ] and SKAT.[ 5 ] Let $\mathrm{IBS}_k(i,j)$ be the ij-th element of the IBS matrix, $\mathrm{IBS}_k$, at site k, which denotes the number of alleles shared by individuals i and j at site k.

Table  2 displays the relationships between genotype values and IBS. From the table, we can observe the following relationship among genotype value vectors and the IBS matrix.

$$\mathrm{IBS}_k(i,j) = 2a_k(i)a_k(j) + b_k(i) + b_k(j) + 2c_k(i)c_k(j) \tag{15}$$

Table 2.

Relationship between genotype values and identities by state (IBS)

Individual i                   Individual j
Genotype  ak(i)  bk(i)  ck(i)  Genotype  ak(j)  bk(j)  ck(j)  IBS
0/0       1      0      0      0/0       1      0      0      2
0/0       1      0      0      0/1       0      1      0      1
0/0       1      0      0      1/1       0      0      1      0
0/1       0      1      0      0/0       1      0      0      1
0/1       0      1      0      0/1       0      1      0      2
0/1       0      1      0      1/1       0      0      1      1
1/1       0      0      1      0/0       1      0      0      0
1/1       0      0      1      0/1       0      1      0      1
1/1       0      0      1      1/1       0      0      1      2

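Thus, the IBS matrix at site k is obtained by

$$
\begin{aligned}
\mathrm{IBS}_k ={}& 2
\begin{pmatrix}
a_k(1)a_k(1) & a_k(1)a_k(2) & \cdots & a_k(1)a_k(n) \\
a_k(2)a_k(1) & a_k(2)a_k(2) & \cdots & a_k(2)a_k(n) \\
\vdots & \vdots & \ddots & \vdots \\
a_k(n)a_k(1) & a_k(n)a_k(2) & \cdots & a_k(n)a_k(n)
\end{pmatrix}
+
\begin{pmatrix}
b_k(1) & b_k(1) & \cdots & b_k(1) \\
b_k(2) & b_k(2) & \cdots & b_k(2) \\
\vdots & \vdots & \ddots & \vdots \\
b_k(n) & b_k(n) & \cdots & b_k(n)
\end{pmatrix} \\
&+
\begin{pmatrix}
b_k(1) & b_k(2) & \cdots & b_k(n) \\
b_k(1) & b_k(2) & \cdots & b_k(n) \\
\vdots & \vdots & \ddots & \vdots \\
b_k(1) & b_k(2) & \cdots & b_k(n)
\end{pmatrix}
+ 2
\begin{pmatrix}
c_k(1)c_k(1) & c_k(1)c_k(2) & \cdots & c_k(1)c_k(n) \\
c_k(2)c_k(1) & c_k(2)c_k(2) & \cdots & c_k(2)c_k(n) \\
\vdots & \vdots & \ddots & \vdots \\
c_k(n)c_k(1) & c_k(n)c_k(2) & \cdots & c_k(n)c_k(n)
\end{pmatrix} \\
={}& 2a_k a_k^T + b_k \mathbf{1}^T + \mathbf{1} b_k^T + 2c_k c_k^T
\end{aligned}
\tag{16}
$$

$R_k$ is a scalar value of site k defined by Equation (17).

$$R_k = y^T \mathrm{IBS}_k\, y \tag{17}$$

By using the distributivity and associativity of matrix multiplication, together with $y^T \mathbf{1} = \mathbf{1}^T y = 0$, we obtain

$$
\begin{aligned}
R_k &= y^T (2a_k a_k^T + b_k \mathbf{1}^T + \mathbf{1} b_k^T + 2c_k c_k^T) y \\
&= 2 y^T a_k a_k^T y + y^T b_k \mathbf{1}^T y + y^T \mathbf{1} b_k^T y + 2 y^T c_k c_k^T y \\
&= 2(y^T a_k)^2 + 2(y^T c_k)^2
\end{aligned}
\tag{18}
$$

$y^T a_k$ and $y^T c_k$ are scalars that can be obtained using the inner product of two vectors. By using Equation (3), we can obtain $y^T a_k = y^T \mathbf{1} - y^T b_k - y^T c_k$, which reduces to $-(y^T b_k + y^T c_k)$ because $y^T \mathbf{1} = 0$.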
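The IBS result can be checked in the same spirit. The sketch below computes $R_k$ from two inner products and compares it with the explicit quadratic form, where the IBS matrix is filled in as $2 - |g_k(i) - g_k(j)|$, which reproduces every row of Table 2. The function names and toy values are hypothetical, and the phenotype vector is assumed to be centered.

```python
def r_statistic(y, a, c):
    """R_k = 2 (y^T a_k)^2 + 2 (y^T c_k)^2; valid when y has mean 0."""
    ya = sum(yi * ai for yi, ai in zip(y, a))
    yc = sum(yi * ci for yi, ci in zip(y, c))
    return 2 * ya * ya + 2 * yc * yc


def r_statistic_explicit(y, g):
    """Reference value: y^T IBS_k y with IBS_k(i, j) = 2 - |g(i) - g(j)|."""
    n = len(y)
    return sum(
        y[i] * (2 - abs(g[i] - g[j])) * y[j]
        for i in range(n)
        for j in range(n)
    )


# Hypothetical toy data; y sums to zero.
y = [0.5, -1.0, 1.0, -1.5, -0.2, 1.2]
g = [0, 1, 0, 2, 1, 0]                      # alternative allele counts
a = [1 if gi == 0 else 0 for gi in g]       # reference homozygotes
c = [1 if gi == 2 else 0 for gi in g]       # alternative homozygotes

r_fast = r_statistic(y, a, c)
r_slow = r_statistic_explicit(y, g)
assert abs(r_fast - r_slow) < 1e-9
```

If the phenotypes were not centered, the two values would differ, because the cross terms containing $y^T \mathbf{1}$ in Equation (18) would no longer vanish.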

When multiple single-nucleotide polymorphisms (SNPs) are investigated, the test statistics based on the entire GRM and IBS matrices are obtained using $S = \sum_{k=1}^{l} w_k Q_k$ and $S = \sum_{k=1}^{l} w_k R_k$, respectively, where $w_k$ is the weight of site k and l is the number of sites. SKAT allows the incorporation of flexible weight functions.[ 12 ] Weights can normalize each data column to have the same variance[ 13 ] and can increase the power of tests.[ 13 ]
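The weighted sums over sites might be accumulated as in the following sketch. It assumes genotypes coded as alternative allele counts and a centered phenotype vector; the function names, the toy data, and the weights are hypothetical and chosen only for illustration.

```python
def center(values):
    """Subtract the mean so that the returned list sums to zero."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def site_statistics(y, genotypes):
    """Return (Q_k, R_k) for one site using inner products only."""
    yb = sum(yi for yi, g in zip(y, genotypes) if g == 1)   # y^T b_k
    yc = sum(yi for yi, g in zip(y, genotypes) if g == 2)   # y^T c_k
    ya = -(yb + yc)        # Equation (3) combined with y^T 1 = 0
    q = (yb + 2 * yc) ** 2
    r = 2 * ya ** 2 + 2 * yc ** 2
    return q, r


def kernel_statistics(y, sites, weights):
    """Return the weighted sums of Q_k and R_k over all sites."""
    s_grm = 0.0
    s_ibs = 0.0
    for genotypes, w in zip(sites, weights):
        q, r = site_statistics(y, genotypes)
        s_grm += w * q
        s_ibs += w * r
    return s_grm, s_ibs


# Hypothetical toy data: two sites, six individuals, arbitrary weights.
y = center([5.1, 4.0, 5.6, 3.2, 4.4, 5.3])
sites = [[0, 1, 0, 2, 1, 0], [0, 0, 1, 0, 0, 0]]
weights = [1.0, 2.0]
s_grm, s_ibs = kernel_statistics(y, sites, weights)
```

Each site contributes a cost proportional to the number of individuals, so the total cost grows with the number of individuals times the number of sites rather than with the square of the sample size.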

2.4. Computer Simulations

To evaluate the new method, I performed computer simulations. The Python scripts used in the computer simulations are shown in Material 1, Supporting Information, and their usage is described in Material 2, Supporting Information. The program is ready for data analysis.

2.4.1. Genotype Selection

SNPs in the SLC22A12 gene that are known to affect uric acid levels[ 1 , 14 , 15 , 16 ] were selected. Then, genotype data for these SNPs in 2504 individuals were downloaded from the 1000 Genomes Project. Monomorphic sites were excluded. As a result, the sites in Table  3 were used in the computer simulation.

Table 3.

SNPs used in the computer simulation

Chromosome Position on hg19 rsID
11 64360996 rs552232030
11 64361124 rs201136391
11 64361219 rs121907892
11 64366298 rs150255373
11 64367290 rs563239942
11 64368212 rs200104135
11 64368968 rs528619562

2.4.2. Phenotype Generations

The uric acid levels of individuals who were heterozygous or homozygous for the alternative allele were set to be 1.0 µg dL−1 lower than those of homozygotes of the reference allele. A random variable that follows the normal distribution with mean 0.0 µg dL−1 and standard deviation 1.0 µg dL−1 was added to the uric acid level of each individual in the simulation as an environmental factor affecting the uric acid level. These values are similar to the observed values.[ 1 ]
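The phenotype model described above could be sketched as follows. The text does not specify how carriers of alternative alleles at more than one site are treated, so this sketch assumes that the effect is applied once to any individual carrying an alternative allele at any site. The baseline level is set to 0 because the phenotypes are centered afterwards; the function name, parameters, and seed handling are hypothetical.

```python
import random


def simulate_phenotypes(sites, effect=-1.0, sd=1.0, seed=0):
    """Simulate centered phenotypes for genotypes coded as 0, 1, or 2.

    Assumption: any carrier of an alternative allele at any site has its
    level shifted by `effect` once; noise is normal with mean 0 and
    standard deviation `sd`.
    """
    rng = random.Random(seed)
    n = len(sites[0])
    raw = []
    for i in range(n):
        carrier = any(site[i] > 0 for site in sites)
        genetic = effect if carrier else 0.0
        raw.append(genetic + rng.gauss(0.0, sd))
    mean_raw = sum(raw) / n
    return [v - mean_raw for v in raw]   # centered so that y^T 1 = 0


# Hypothetical toy genotypes: two sites, four individuals.
sites = [[0, 1, 0, 2], [0, 0, 0, 1]]
y = simulate_phenotypes(sites, seed=1)
```

Centering at the end matches the assumption made in the Theory section that the average of the elements of y is 0.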

2.4.3. Calculation of Test Statistics and Permutation Tests

For each of these phenotypes, the GRM and IBS test statistics were calculated. Permutation tests were then performed with 1 000 000 permutations to estimate the probability of exceeding the observed score. The significance level was set to $5 \times 10^{-6}$, that is, 0.05 divided by $10^4$, because the number of tests in a genome-wide SKAT will be $10^4$. Each permutation test was repeated ten times (n = 10).
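A permutation test of this kind could be sketched as below for the GRM statistic. The sketch uses unit weights, counts permutations whose statistic is at least as large as the observed one, and uses far fewer permutations than the 1 000 000 reported in the text; these choices, the function names, and the toy data are assumptions made for illustration.

```python
import random


def grm_statistic(y, sites):
    """Sum of Q_k over sites with unit weights; genotypes coded 0, 1, 2."""
    total = 0.0
    for genotypes in sites:
        # y^T g_k equals y^T b_k + 2 y^T c_k because g_k = b_k + 2 c_k.
        t = sum(yi * g for yi, g in zip(y, genotypes))
        total += t * t
    return total


def permutation_p_value(y, sites, n_perm=1000, seed=0):
    """Fraction of phenotype permutations with a statistic >= observed."""
    rng = random.Random(seed)
    observed = grm_statistic(y, sites)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # resampling without replacement
        if grm_statistic(shuffled, sites) >= observed:
            hits += 1
    return hits / n_perm


# Hypothetical toy data; y sums to zero.
y = [0.5, -1.0, 1.0, -1.5, -0.2, 1.2]
sites = [[0, 1, 0, 2, 1, 0]]
p_value = permutation_p_value(y, sites, n_perm=200, seed=42)
```

Because a permuted phenotype vector still sums to zero, the vector formula for the statistic remains valid for every permutation, and each permutation costs only one pass over the individuals per site.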

3. Results

Table  4 shows that there is no significant difference between the GRM and IBS in the statistical power (the chi‐square test, n = 10, P > 5%). Table 4 also shows that permutation tests can be conducted in a short period of time by using the methods proposed in this study.

Table 4.

Power and Computational time of the GRM and IBS tests

Method The number of tests that reject the null hypothesis Time
GRM 2 out of 10 1 min 38 s
IBS 3 out of 10 1 min 47 s

4. Discussion

In the present study, we demonstrate that the test statistics required for variant/phenotype association tests can be obtained without obtaining eigenvalues and eigenvectors of the GRM and IBS matrices. The method is referred to as genotype value decomposition. The new methods proposed in this study run with a computational time of O(n), where n is the sample size. Notably, these new methods are applicable to common variants as well as rare variants, even though the methods were developed for association tests for rare variants. Sparse matrix computation can be used when all variants are rare.
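One possible reading of the remark on sparse computation is that only the carriers of the alternative allele need to be visited at each site. The sketch below assumes a hypothetical sparse representation, a list of (individual index, alternative allele count) pairs for non-reference genotypes, and a centered phenotype vector; it is not taken from the author's scripts.

```python
def sparse_site_statistics(y, carriers):
    """Return (Q_k, R_k) from a sparse list of non-reference genotypes.

    `carriers` holds (individual_index, alt_allele_count) pairs with
    alt_allele_count equal to 1 or 2; all other individuals are 0/0.
    The formula for y^T a_k relies on y having mean 0.
    """
    yb = sum(y[i] for i, g in carriers if g == 1)   # y^T b_k
    yc = sum(y[i] for i, g in carriers if g == 2)   # y^T c_k
    ya = -(yb + yc)                                 # y^T a_k
    q = (yb + 2 * yc) ** 2
    r = 2 * ya ** 2 + 2 * yc ** 2
    return q, r


# Hypothetical toy data; y sums to zero and three of six individuals
# carry at least one alternative allele.
y = [0.5, -1.0, 1.0, -1.5, -0.2, 1.2]
carriers = [(1, 1), (3, 2), (4, 1)]
q, r = sparse_site_statistics(y, carriers)
```

With this representation, the work per site is proportional to the number of carriers, which is small when the variant is rare.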

When the alternative allele frequency is very small, homozygotes of the alternative allele are very rare, so that $c_k$ is negligible. In other words, $Q_k$ can be approximately obtained by $Q_k \approx (y^T b_k)^2$. Under the same condition, $(y^T a_k)^2 \approx (y^T b_k)^2$ and $(y^T c_k)^2 \approx 0$, so that $R_k$ is approximately equal to $2Q_k$.
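As a worked check of this approximation, consider the limiting case in which no homozygote of the alternative allele is observed at site k, so that $c_k = \mathbf{0}$ (an assumption made here only for illustration). Using Equations (3), (13), and (18) with $y^T \mathbf{1} = 0$, the relationship becomes exact:

$$
\begin{aligned}
Q_k &= (y^T b_k + 2 y^T c_k)^2 = (y^T b_k)^2, \\
y^T a_k &= y^T \mathbf{1} - y^T b_k - y^T c_k = -\,y^T b_k, \\
R_k &= 2(y^T a_k)^2 + 2(y^T c_k)^2 = 2(y^T b_k)^2 = 2Q_k.
\end{aligned}
$$

When a few alternative homozygotes are present, the terms containing $y^T c_k$ no longer vanish, and the relationship holds only approximately.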

On one hand, when all sites are independent, the necessary probability density functions can be calculated using convolution of the probability density functions of all sites. On the other hand, it is difficult to obtain convolution of the probability density functions when the sites are linked and dependent on each other. In such cases, a permutation test is used.[ 6 ]

Because the statistics calculated by the new method are not approximations but exact values, their null distributions are exactly the same as those of the test statistics obtained by explicitly calculating the GRM and IBS matrices. Wu et al.[ 5 ] showed that the test statistics approximately follow the chi-square distribution. However, because this distribution is derived from an asymptotic distribution of the statistics, the p-values for datasets with an insufficient number of samples may be inaccurate, which could cause inflation or power loss.[ ] In a permutation test, the null distribution of the test statistic can be approximated by fully resampling the observed traits without replacement. The proposed method can be useful for reducing the computational time required to obtain p-values using resampling methods.

5. Conclusion

In the present paper, a method for computing kernel statistics without constructing the kernel matrices is proposed. The method is referred to as genotype value decomposition. By using this method, the kernel statistics needed to obtain the null distribution can be calculated with time complexity O(n). The proposed method enables one to conduct SKAT for large samples in human genetics.

Conflict of Interest

The author declares no conflict of interest.

Peer Review

The peer review history for this article is available in the Supporting Information for this article.

Supporting information

Supporting Information

Supplementary Information: Record of Transparent Peer Review

Acknowledgements

The author thanks Dr. Naomichi Matsumoto for his suggestions and encouragement. This work was supported by JSPS KAKENHI Grant Numbers JP17K08682, JP19K22647, JP20K07316. The author also thanks Steven M. Thompson, from Edanz Group for editing a draft of this manuscript.

Misawa K., Genotype Value Decomposition: Simple Methods for the Computation of Kernel Statistics. Advanced Genetics 2022, 3, 2100066. 10.1002/ggn2.202100066

Data Availability Statement

The python code used in the study is available at https://github.com/kazumisawa/paraHaplo5 under the MIT license.

References

  • 1. Misawa K., Hasegawa T., Mishima E., Jutabha P., Ouchi M., Kojima K., Kawai Y., Matsuo M., Anzai N., Nagasaki M., Genetics 2020, 214, 1079.
  • 2. Misawa K., Kamatani N., Source Code Biol. Med. 2009, 4, 7.
  • 3. Lin D. Y., Tang Z. Z., Am. J. Hum. Genet. 2011, 89, 354.
  • 4. Larson N. B., Chen J., Schaid D. J., Genet. Epidemiol. 2019, 43, 122.
  • 5. Wu M. C., Lee S., Cai T., Li Y., Boehnke M., Lin X., Am. J. Hum. Genet. 2011, 89, 82.
  • 6. Hasegawa T., Kojima K., Kawai Y., Misawa K., Mimori T., Nagasaki M., BMC Genomics 2016, 17, 745.
  • 7. Yang J., Lee S. H., Goddard M. E., Visscher P. M., Am. J. Hum. Genet. 2011, 88, 76.
  • 8. Price A. L., Patterson N. J., Plenge R. M., Weinblatt M. E., Shadick N. A., Reich D., Nat. Genet. 2006, 38, 904.
  • 9. Kwee L. C., Liu D., Lin X., Ghosh D., Epstein M. P., Am. J. Hum. Genet. 2008, 82, 386.
  • 10. Wessel J., Schork N. J., Am. J. Hum. Genet. 2006, 79, 792.
  • 11. Danecek P., Auton A., Abecasis G., Albers C. A., Banks E., DePristo M. A., Handsaker R. E., Lunter G., Marth G. T., Sherry S. T., McVean G., Durbin R., 1000 Genomes Project Analysis Group, Bioinformatics 2011, 27, 2156.
  • 12. Patterson N., Price A. L., Reich D., PLoS Genet. 2006, 2, e190.
  • 13. Zhang J., Wu B., Sha Q., Zhang S., Wang X., Genet. Epidemiol. 2019, 43, 966.
  • 14. Enomoto A., Kimura H., Chairoungdua A., Shigeta Y., Jutabha P., Cha S. H., Hosoyamada M., Takeda M., Sekine T., Igarashi T., Matsuo H., Kikuchi Y., Oda T., Ichida K., Hosoya T., Shimokata K., Niwa T., Kanai Y., Endou H., Nature 2002, 417, 447.
  • 15. Tin A., Li Y., Brody J. A., Nutile T., Chu A. Y., Huffman J. E., Yang Q., Chen M. H., Robinson-Cohen C., Mace A., Liu J., Demirkan A., Sorice R., Sedaghat S., Swen M., Yu B., Ghasemi S., Teumer A., Vollenweider P., Ciullo M., Li M., Uitterlinden A. G., Kraaij R., Amin N., van Rooij J., Kutalik Z., Dehghan A., McKnight B., van Duijn C. M., Morrison A., et al., Nat. Commun. 2018, 9, 4228.
  • 16. Claverie-Martin F., Trujillo-Suarez J., Gonzalez-Acosta H., Aparicio C., Justa Roldan M. L., Stiburkova B., Ichida K., Martin-Gomez M. A., Herrero Goni M., Carrasco Hidalgo-Barquero M., Inigo V., Enriquez R., Cordoba-Lanus E., Garcia-Nieto V. M., RenalTube G., Clin. Chim. Acta 2018, 481, 83.
