Abstract
Recent advances in sequencing technologies enable genome‐wide analyses for thousands of individuals. The sequence kernel association test (SKAT) is a widely used method to test for associations between a phenotype and a set of rare variants. As the sample size of human genetics studies increases, the computational time required to calculate a kernel matrix is becoming increasingly problematic. In this study, a new method to obtain kernel statistics without calculating a kernel matrix is proposed. Simple methods are given for the computation of two kernel statistics, namely, one based on a genetic relationship matrix (GRM) and one based on an identity by state (IBS) matrix. With this method, the kernel statistics can be calculated using vector operations alone, without matrix operations. The proposed method enables one to conduct SKAT for large samples in human genetics.
Keywords: genetic relationship matrix, identity by state, rare variants, sequence kernel association tests
A simple method for the computation of two kernel statistics, one based on a genetic relationship matrix (GRM) and one based on an identity by state (IBS) matrix, is proposed. The proposed method can be used to conduct the sequence kernel association test (SKAT) for large human genetics datasets.

1. Introduction
A very large number of human genome sequences are now available for the study of human genetics, because of recent advances in genome sequencers. A recent study has shown that rare variants substantially contribute to phenotype variation.[ 1 ] Because each linkage disequilibrium block can be analyzed independently, increases in the number of sites can be tackled with parallel computation.[ 2 ] However, the statistical power of classical single‐marker association analysis for rare variants is quite limited.
To address this challenge, rare and low‐frequency variants are often grouped at the gene or pathway level, and the effects of multiple variants are evaluated using collapsing methods.[ 3 , 4 ] The sequence kernel association test (SKAT)[ 5 , 6 ] is one such popular method. SKAT applies a test statistic $S$, which is defined by the quadratic form $S = \mathbf{y}^{T} K \mathbf{y}$, where $K$ is a kernel matrix and $\mathbf{y}$ is the column vector of phenotypes defined by Equation (1).
$$\mathbf{y} = \left(y(1), y(2), \ldots, y(n)\right)^{T} \tag{1}$$
where y(i) is the phenotype value of the i‐th individual and n is the sample size. In the following section, we assume the average of the elements of y is 0.
Evaluating the probability density of the null distribution of $S$ is important for conducting SKAT, but it requires computing a matrix related to the genotype covariance between markers, which takes a very long computational time. When the length of $\mathbf{y}$ is $n$, the size of the matrix $K$ is $n^{2}$. For example, when the number of people is 10 000, the kernel matrix has 100 000 000 elements. A genetic relationship matrix (GRM) among individuals is used in genome‐wide complex trait analysis[ 7 ] and in principal component analysis.[ 8 ] Identity by state (IBS) defines similarity between individuals as the number of shared alleles. The IBS kernel is used in linear regression[ 9 , 10 ] and SKAT.[ 5 ]
The aim of the present study is to develop simple methods for the computation of these two kernel statistics without calculating a GRM and an IBS matrix explicitly.
2. Theory
2.1. Genotype Value Vectors
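In the present study, all sites are assumed to be biallelic, namely, each site has a reference allele and an alternative allele. Let us define genotype value vectors at site $k$. Let $a_k(i)$ be 1 when the $i$‐th individual is a homozygote of the reference allele at site $k$; otherwise $a_k(i) = 0$. Let $b_k(i)$ be 1 when the $i$‐th individual is a heterozygote at site $k$; otherwise $b_k(i) = 0$. Let $c_k(i)$ be 1 when the $i$‐th individual is a homozygote of the alternative allele at site $k$; otherwise $c_k(i) = 0$. The vectors $\mathbf{a}_k$, $\mathbf{b}_k$, and $\mathbf{c}_k$ are defined by Equation (2).
$$\mathbf{a}_k = \left(a_k(1), \ldots, a_k(n)\right)^{T}, \quad \mathbf{b}_k = \left(b_k(1), \ldots, b_k(n)\right)^{T}, \quad \mathbf{c}_k = \left(c_k(1), \ldots, c_k(n)\right)^{T} \tag{2}$$
Let us denote $\mathbf{a}_k$, $\mathbf{b}_k$, and $\mathbf{c}_k$ as the genotype value vectors. Because $a_k(i) + b_k(i) + c_k(i) = 1$ for every individual $i$, it is worth noting that
$$\mathbf{a}_k + \mathbf{b}_k + \mathbf{c}_k = \mathbf{1} \tag{3}$$
where $\mathbf{1}$ is defined by $\mathbf{1} = (1, 1, \ldots, 1)^{T}$. The alternative allele frequency, $p_k$, is obtained by $p_k = \mathbf{1}^{T}(\mathbf{b}_k + 2\mathbf{c}_k)/(2n)$. In the following section, the Hardy–Weinberg equilibrium is assumed for this site. Namely, the frequencies of heterozygotes and homozygotes of the alternative allele are $2p_k(1 - p_k)$ and $p_k^{2}$, respectively.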
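The definitions above can be illustrated with a minimal sketch in plain Python. The function names and the toy genotype list are hypothetical and are not taken from the author's scripts; genotypes are assumed to be unphased VCF-style strings.

```python
def genotype_value_vectors(genotypes):
    """Return the indicator vectors (a, b, c) for one biallelic site."""
    a = [1 if g == "0/0" else 0 for g in genotypes]           # homozygous reference
    b = [1 if g in ("0/1", "1/0") else 0 for g in genotypes]  # heterozygous
    c = [1 if g == "1/1" else 0 for g in genotypes]           # homozygous alternative
    return a, b, c


def alt_allele_frequency(b, c):
    """Alternative allele frequency: (#het + 2 * #hom_alt) / (2n)."""
    n = len(b)
    return (sum(b) + 2 * sum(c)) / (2 * n)


genotypes = ["0/0", "0/1", "1/1", "0/0", "0/1"]  # toy data, five individuals
a, b, c = genotype_value_vectors(genotypes)
p = alt_allele_frequency(b, c)

# Each individual has exactly one genotype, so a + b + c is the all-ones vector.
all_ones = all(x + y + z == 1 for x, y, z in zip(a, b, c))
```

Here two heterozygotes and one alternative homozygote among five individuals give four alternative alleles out of ten.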
2.2. The GRM Kernel
The allele values are 0 for the reference allele and 1 for the alternative allele. The separator between the alleles is “/”, as used in the variant call format.[ 11 ] Let $g_k(i)$ be the genotype value for individual $i$ at site $k$. In the present study, $g_k(i)$ is the number of alternative alleles. The relationship between the genotype of individual $i$ at site $k$ and $g_k(i)$ is shown in Table 1. The vector $\mathbf{g}_k$ is defined by Equation (4).
$$\mathbf{g}_k = \left(g_k(1), g_k(2), \ldots, g_k(n)\right)^{T} \tag{4}$$
Table 1.
Relationship between genotype values and the number of alternative alleles
| Genotype | $a_k(i)$ | $b_k(i)$ | $c_k(i)$ | $g_k(i)$ |
|---|---|---|---|---|
| 0/0 | 1 | 0 | 0 | 0 |
| 0/1 | 0 | 1 | 0 | 1 |
| 1/1 | 0 | 0 | 1 | 2 |
Table 1 displays the relationships among $a_k(i)$, $b_k(i)$, $c_k(i)$, and $g_k(i)$. Because $g_k(i) = b_k(i) + 2c_k(i)$, $\mathbf{g}_k$ is obtained by $\mathbf{g}_k = \mathbf{b}_k + 2\mathbf{c}_k$.
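Let us denote the GRM at site $k$ as $X_k$. We subtract the mean genotype value, $2p_k$, where $p_k$ is the alternative allele frequency at site $k$, to obtain a matrix with row sums equal to 0. The $ij$‐th element of $X_k$ at site $k$ is obtained using Equation (5).
$$X_k(i,j) = \left(g_k(i) - 2p_k\right)\left(g_k(j) - 2p_k\right) \tag{5}$$
$$X_k = \left(\mathbf{g}_k - 2p_k\mathbf{1}\right)\left(\mathbf{g}_k - 2p_k\mathbf{1}\right)^{T} \tag{6}$$
Subsequently, the matrix $X_k$ can be obtained using the genotype value vectors.
Let us define a new matrix $G_k$. As shown in Table 1, the $ij$‐th element of $G_k$ at site $k$ can be calculated using Equation (7).
$$G_k(i,j) = g_k(i)\,g_k(j) = \left(b_k(i) + 2c_k(i)\right)\left(b_k(j) + 2c_k(j)\right) \tag{7}$$
Subsequently, the matrix $G_k$ would be obtained using the genotype value vectors.
$$G_k = \mathbf{g}_k\mathbf{g}_k^{T} = \left(\mathbf{b}_k + 2\mathbf{c}_k\right)\left(\mathbf{b}_k + 2\mathbf{c}_k\right)^{T} \tag{8}$$
Therefore, we obtain Equation (9).
$$X_k = G_k - 2p_k\left(\mathbf{g}_k\mathbf{1}^{T} + \mathbf{1}\mathbf{g}_k^{T}\right) + 4p_k^{2}\,\mathbf{1}\mathbf{1}^{T} \tag{9}$$
It is worth noting that
$$\mathbf{y}^{T}X_k\mathbf{y} = \mathbf{y}^{T}G_k\mathbf{y} - 2p_k\left(\mathbf{y}^{T}\mathbf{g}_k\mathbf{1}^{T}\mathbf{y} + \mathbf{y}^{T}\mathbf{1}\mathbf{g}_k^{T}\mathbf{y}\right) + 4p_k^{2}\,\mathbf{y}^{T}\mathbf{1}\mathbf{1}^{T}\mathbf{y} \tag{10}$$
Because $\mathbf{1}^{T}\mathbf{y} = 0$, we obtain
$$\mathbf{y}^{T}X_k\mathbf{y} = \mathbf{y}^{T}G_k\mathbf{y} \tag{11}$$
By using the distributivity and associativity of matrix multiplication, we obtain
$$\mathbf{y}^{T}G_k\mathbf{y} = \left(\mathbf{y}^{T}\mathbf{g}_k\right)\left(\mathbf{g}_k^{T}\mathbf{y}\right) = Q_k^{2} \tag{12}$$
$Q_k$ is a scalar value of site $k$ defined by Equation (13):
$$Q_k = \mathbf{y}^{T}\mathbf{g}_k = \mathbf{g}_k^{T}\mathbf{y} \tag{13}$$
because the transpose of a product of matrices is the product, in the reverse order, of the transposes of the factors. Note that $Q_k$ is a scalar that can be obtained as
$$Q_k = \mathbf{y}^{T}\mathbf{b}_k + 2\,\mathbf{y}^{T}\mathbf{c}_k \tag{14}$$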
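The reduction of the GRM statistic to inner products can be checked numerically. The following is a minimal sketch in plain Python, assuming an unnormalized, mean‐centered kernel (the outer product of the centered genotype vector with itself) and a phenotype vector centered to mean 0. The function names and toy data are hypothetical.

```python
def centered(values):
    """Subtract the mean so that the elements sum to (numerically) zero."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def grm_statistic_matrix(y, g):
    """O(n^2) reference: build X = d d^T with d = g - mean(g), then y^T X y."""
    n = len(g)
    mean_g = sum(g) / n  # equals twice the alternative allele frequency
    d = [gi - mean_g for gi in g]
    x = [[d[i] * d[j] for j in range(n)] for i in range(n)]
    return sum(y[i] * x[i][j] * y[j] for i in range(n) for j in range(n))


def grm_statistic_vector(y, b, c):
    """O(n) shortcut: Q = y.b + 2 * y.c, statistic = Q^2 (needs mean(y) == 0)."""
    q = sum(yi * bi for yi, bi in zip(y, b)) + 2 * sum(yi * ci for yi, ci in zip(y, c))
    return q * q


b = [0, 1, 0, 0, 1, 0]  # heterozygote indicators (toy data)
c = [0, 0, 1, 0, 0, 0]  # alternative-homozygote indicators (toy data)
g = [bi + 2 * ci for bi, ci in zip(b, c)]
y = centered([5.1, 4.0, 3.8, 5.3, 4.2, 5.0])

s_matrix = grm_statistic_matrix(y, g)
s_vector = grm_statistic_vector(y, b, c)
```

Under these assumptions the two values agree up to floating‐point rounding, while the vector form never allocates an n × n matrix.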
2.3. The IBS Kernel
IBS defines similarity between individuals as the number of shared alleles. The IBS kernel is used in linear regression[ 9 , 10 ] and SKAT.[ 5 ] Let $\mathrm{IBS}_k(i,j)$ be the $ij$‐th element of the IBS matrix, $\mathrm{IBS}_k$, at site $k$, which denotes the number of alleles shared by subjects $i$ and $j$ at site $k$.
Table 2 displays the relationships between genotype values and IBS. From the table, we can observe the following relationship among genotype value vectors and the IBS matrix.
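$$\mathrm{IBS}_k(i,j) = 2\left[a_k(i)a_k(j) + b_k(i)b_k(j) + c_k(i)c_k(j)\right] + a_k(i)b_k(j) + b_k(i)a_k(j) + b_k(i)c_k(j) + c_k(i)b_k(j) \tag{15}$$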
Table 2.
Relationship between genotype values and identities by state (IBS)
| Genotype of individual $i$ | $a_k(i)$ | $b_k(i)$ | $c_k(i)$ | Genotype of individual $j$ | $a_k(j)$ | $b_k(j)$ | $c_k(j)$ | IBS |
|---|---|---|---|---|---|---|---|---|
| 0/0 | 1 | 0 | 0 | 0/0 | 1 | 0 | 0 | 2 |
| 0/0 | 1 | 0 | 0 | 0/1 | 0 | 1 | 0 | 1 |
| 0/0 | 1 | 0 | 0 | 1/1 | 0 | 0 | 1 | 0 |
| 0/1 | 0 | 1 | 0 | 0/0 | 1 | 0 | 0 | 1 |
| 0/1 | 0 | 1 | 0 | 0/1 | 0 | 1 | 0 | 2 |
| 0/1 | 0 | 1 | 0 | 1/1 | 0 | 0 | 1 | 1 |
| 1/1 | 0 | 0 | 1 | 0/0 | 1 | 0 | 0 | 0 |
| 1/1 | 0 | 0 | 1 | 0/1 | 0 | 1 | 0 | 1 |
| 1/1 | 0 | 0 | 1 | 1/1 | 0 | 0 | 1 | 2 |
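Thus, the IBS matrix at site $k$ is obtained by
$$\mathrm{IBS}_k = 2\left(\mathbf{a}_k\mathbf{a}_k^{T} + \mathbf{b}_k\mathbf{b}_k^{T} + \mathbf{c}_k\mathbf{c}_k^{T}\right) + \mathbf{a}_k\mathbf{b}_k^{T} + \mathbf{b}_k\mathbf{a}_k^{T} + \mathbf{b}_k\mathbf{c}_k^{T} + \mathbf{c}_k\mathbf{b}_k^{T} \tag{16}$$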
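$\mathbf{y}^{T}\mathrm{IBS}_k\mathbf{y}$ is a scalar value of site $k$ defined by Equation (17).
$$\mathbf{y}^{T}\mathrm{IBS}_k\mathbf{y} = \mathbf{y}^{T}\left[2\left(\mathbf{a}_k\mathbf{a}_k^{T} + \mathbf{b}_k\mathbf{b}_k^{T} + \mathbf{c}_k\mathbf{c}_k^{T}\right) + \mathbf{a}_k\mathbf{b}_k^{T} + \mathbf{b}_k\mathbf{a}_k^{T} + \mathbf{b}_k\mathbf{c}_k^{T} + \mathbf{c}_k\mathbf{b}_k^{T}\right]\mathbf{y} \tag{17}$$
By using the distributivity and associativity of matrix multiplication, we obtain
$$\mathbf{y}^{T}\mathrm{IBS}_k\mathbf{y} = 2\left[\left(\mathbf{y}^{T}\mathbf{a}_k\right)^{2} + \left(\mathbf{y}^{T}\mathbf{b}_k\right)^{2} + \left(\mathbf{y}^{T}\mathbf{c}_k\right)^{2}\right] + 2\left(\mathbf{y}^{T}\mathbf{a}_k\right)\left(\mathbf{y}^{T}\mathbf{b}_k\right) + 2\left(\mathbf{y}^{T}\mathbf{b}_k\right)\left(\mathbf{y}^{T}\mathbf{c}_k\right) \tag{18}$$
$\mathbf{y}^{T}\mathbf{b}_k$ and $\mathbf{y}^{T}\mathbf{c}_k$ are scalars that can be obtained using the inner product of two vectors. By using Equation (3) and $\mathbf{1}^{T}\mathbf{y} = 0$, we can obtain $\mathbf{y}^{T}\mathbf{a}_k = -\left(\mathbf{y}^{T}\mathbf{b}_k + \mathbf{y}^{T}\mathbf{c}_k\right)$.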
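The IBS statistic can be checked in the same way. The sketch below, in plain Python, assumes the IBS values of Table 2, which for genotype counts in {0, 1, 2} equal 2 minus the absolute difference of the two counts, and a phenotype vector centered to mean 0. The function names and toy data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(x * z for x, z in zip(u, v))


def ibs_statistic_matrix(y, g):
    """O(n^2) reference: y^T IBS y with IBS(i, j) = 2 - |g_i - g_j|."""
    n = len(g)
    return sum(
        y[i] * (2 - abs(g[i] - g[j])) * y[j] for i in range(n) for j in range(n)
    )


def ibs_statistic_vector(y, b, c):
    """O(n) shortcut using only the inner products y.b and y.c."""
    yb = dot(y, b)
    yc = dot(y, c)
    ya = -(yb + yc)  # from a = 1 - b - c together with sum(y) == 0
    return 2 * (ya * ya + yb * yb + yc * yc) + 2 * ya * yb + 2 * yb * yc


b = [0, 1, 0, 0, 1, 0, 0]  # heterozygote indicators (toy data)
c = [0, 0, 1, 0, 0, 0, 0]  # alternative-homozygote indicators (toy data)
g = [bi + 2 * ci for bi, ci in zip(b, c)]

raw = [5.1, 4.0, 3.8, 5.3, 4.2, 5.0, 4.9]
mean_raw = sum(raw) / len(raw)
y = [v - mean_raw for v in raw]  # centered phenotype

ibs_matrix_value = ibs_statistic_matrix(y, g)
ibs_vector_value = ibs_statistic_vector(y, b, c)
```

Only two inner products per site are needed, because the reference‐homozygote term follows from the other two when the phenotype is centered.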
When multiple single‐nucleotide polymorphisms (SNPs) are investigated, the entire GRM and IBS matrices are obtained using $\sum_{k=1}^{m} w_k X_k$ and $\sum_{k=1}^{m} w_k \mathrm{IBS}_k$, respectively, where $w_k$ is the weight of site $k$ and $m$ is the number of sites. SKAT allows the incorporation of flexible weight functions.[ 12 ] Weights can normalize each data column to have the same variance[ 13 ] and can increase the power of tests.[ 13 ]
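Because a quadratic form is linear in the kernel, the multi‐site GRM statistic is a weighted sum of per‐site squared inner products. A minimal sketch in plain Python follows; the weights are purely illustrative and not those used in the study, the phenotype is assumed to be centered, and the names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(x * z for x, z in zip(u, v))


def weighted_grm_statistic(y, sites, weights):
    """Sum of w_k * Q_k**2 over sites; each site is a (b_k, c_k) pair."""
    total = 0.0
    for (b, c), w in zip(sites, weights):
        q = dot(y, b) + 2 * dot(y, c)  # per-site scalar Q_k
        total += w * q * q
    return total


y = [1.0, -1.0, 0.5, -0.5]  # already centered toy phenotype
sites = [
    ([1, 0, 0, 0], [0, 0, 0, 0]),  # site 1: one heterozygote
    ([0, 1, 1, 0], [0, 0, 0, 1]),  # site 2: two heterozygotes, one homozygote
]
weights = [1.0, 0.5]  # illustrative weights only

s_total = weighted_grm_statistic(y, sites, weights)
```

The cost is proportional to the number of sites times the sample size, and no n × n matrix is formed at any point.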
2.4. Computer Simulations
To evaluate the new method, I performed computer simulations. The Python scripts used in the computer simulations are shown in Material 1, Supporting Information, and their usage is described in Material 2, Supporting Information. This program is ready for data analysis.
2.4.1. Genotype Selection
SNPs on the SLC22A12 gene that are known to affect uric acid levels[ 1 , 14 , 15 , 16 ] were selected. Then, the genotypes of these SNPs for 2504 individuals were downloaded from the 1000 Genomes Project. Monomorphic sites were excluded. As a result, the sites in Table 3 were used in the computer simulation.
Table 3.
SNPs used in the computer simulation
| Chromosome | Position on hg19 | rsID |
|---|---|---|
| 11 | 64360996 | rs552232030 |
| 11 | 64361124 | rs201136391 |
| 11 | 64361219 | rs121907892 |
| 11 | 64366298 | rs150255373 |
| 11 | 64367290 | rs563239942 |
| 11 | 64368212 | rs200104135 |
| 11 | 64368968 | rs528619562 |
2.4.2. Phenotype Generation
The uric acid levels of heterozygous individuals and of individuals homozygous for the alternative allele were set to be 1.0 µg dL−1 lower than those of homozygotes of the reference allele. A random variable that follows the normal distribution with mean 0.0 µg dL−1 and standard deviation 1.0 µg dL−1 was added to the uric acid level of each individual in the simulation as an environmental factor. These values are similar to the observed values.[ 1 ]
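The phenotype model can be sketched as follows in plain Python. This is an illustration under stated assumptions, not the author's script: the function name, the `baseline` parameter, and the use of a single carrier genotype count per individual are hypothetical, and the effect is applied once to any carrier regardless of the number of alternative alleles.

```python
import random


def simulate_phenotypes(g, effect=-1.0, sd=1.0, baseline=0.0, seed=0):
    """Simulate one phenotype per individual.

    g        : number of alternative alleles carried (0, 1, or 2)
    effect   : shift applied to heterozygotes and alternative homozygotes alike
    sd       : standard deviation of the normal environmental term
    baseline : level of reference homozygotes (hypothetical parameter)
    """
    rng = random.Random(seed)
    return [
        baseline + (effect if gi > 0 else 0.0) + rng.gauss(0.0, sd) for gi in g
    ]


g = [0, 0, 1, 0, 2, 0, 1, 0]  # toy genotype counts
y_sim = simulate_phenotypes(g, seed=42)

# With the environmental term switched off, only the genetic shift remains.
y_noise_free = simulate_phenotypes([0, 1, 2], sd=0.0)
```

Fixing the seed makes a simulated replicate reproducible, which is convenient when comparing the GRM and IBS statistics on identical data.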
2.4.3. Calculation of Test Statistics and Permutation Tests
For each of these phenotypes, the test statistics of GRM and IBS were calculated. Then, permutation tests were performed with 1 000 000 permutations to calculate the probability of exceeding the observed score. The significance level was set to be 5 × 10^−6, because the number of tests in a genome‐wide SKAT will be 10^4. Each permutation test was repeated ten times (n = 10).
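A permutation test built on the O(n) statistic can be sketched as follows in plain Python. The names are hypothetical, the number of permutations is kept small for illustration, and the p‐value uses the common (count + 1)/(permutations + 1) convention, which is an assumption rather than something stated in the text.

```python
import random


def grm_statistic(y, b, c):
    """Single-site GRM statistic Q^2 with Q = y.b + 2 * y.c (y centered)."""
    q = sum(yi * bi for yi, bi in zip(y, b)) + 2 * sum(yi * ci for yi, ci in zip(y, c))
    return q * q


def permutation_p_value(y, b, c, n_perm=1000, seed=0):
    """Estimate P(statistic >= observed) by shuffling phenotypes."""
    rng = random.Random(seed)
    observed = grm_statistic(y, b, c)
    y_perm = list(y)  # shuffling preserves the zero mean
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if grm_statistic(y_perm, b, c) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


raw = [4.1, 5.2, 5.0, 3.9, 5.5, 4.8, 5.1, 4.9, 5.3, 4.7, 5.0, 5.2]
mean_raw = sum(raw) / len(raw)
y = [v - mean_raw for v in raw]               # centered toy phenotype
b = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]      # two heterozygous carriers
c = [0] * len(raw)                            # no alternative homozygotes

p = permutation_p_value(y, b, c, n_perm=200, seed=1)
```

Each permutation costs one pass over the sample, so a large number of permutations remains feasible even when n is large.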
3. Results
Table 4 shows that there is no significant difference between the GRM and IBS in the statistical power (the chi‐square test, n = 10, P > 5%). Table 4 also shows that permutation tests can be conducted in a short period of time by using the methods proposed in this study.
Table 4.
Power and computational time of the GRM and IBS tests
| Method | The number of tests that reject the null hypothesis | Time |
|---|---|---|
| GRM | 2 out of 10 | 1 min 38 s |
| IBS | 3 out of 10 | 1 min 47 s |
4. Discussion
In the present study, we demonstrate that the test statistics needed for variant/phenotype association tests can be obtained without obtaining eigenvalues and eigenvectors of the GRM and IBS matrices. The method is referred to as genotype value decomposition. The new methods proposed in this study run in $O(n)$ computational time, where $n$ is the sample size. Notably, these new methods are applicable to common variants as well as rare variants, even though they were developed for association tests of rare variants. Sparse matrix computation can be used when all variants are rare.
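The remark on sparse computation can be illustrated with a minimal sketch in plain Python, assuming that only the indices of carriers are stored for each site. The function names and toy data are hypothetical.

```python
def q_dense(y, b, c):
    """Q = y.b + 2 * y.c using full-length indicator vectors."""
    return sum(yi * bi for yi, bi in zip(y, b)) + 2 * sum(
        yi * ci for yi, ci in zip(y, c)
    )


def q_sparse(y, het_idx, hom_idx):
    """Same scalar, touching only heterozygous and homozygous carriers."""
    return sum(y[i] for i in het_idx) + 2 * sum(y[i] for i in hom_idx)


y = [0.4, -1.1, 0.2, 0.9, -0.4]  # centered toy phenotype
b = [0, 1, 0, 0, 0]              # one heterozygote
c = [0, 0, 0, 0, 1]              # one alternative homozygote

het_idx = [i for i, v in enumerate(b) if v]
hom_idx = [i for i, v in enumerate(c) if v]

q_full = q_dense(y, b, c)
q_carriers = q_sparse(y, het_idx, hom_idx)
```

For a rare variant, the cost of the sparse form is proportional to the number of carriers rather than to the sample size.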
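When the alternative allele frequency is very small, homozygotes of the alternative allele are very rare, so that $\mathbf{c}_k$ is negligible. In other words, $Q_k$ can be approximately obtained by $\mathbf{y}^{T}\mathbf{b}_k$. Under the same condition, $\mathbf{y}^{T}\mathbf{c}_k \approx 0$ and $\mathbf{y}^{T}\mathbf{a}_k \approx -\mathbf{y}^{T}\mathbf{b}_k$, so that $\mathbf{y}^{T}\mathrm{IBS}_k\mathbf{y}$ is approximately equal to $2Q_k^{2}$.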
On one hand, when all sites are independent, the necessary probability density functions can be calculated using convolution of the probability density functions of all sites. On the other hand, it is difficult to obtain convolution of the probability density functions when the sites are linked and dependent on each other. In such cases, a permutation test is used.[ 6 ]
Because the statistics calculated by the new method are not approximations but exact values, their null distributions are exactly the same as those of the test statistics obtained by explicitly calculating the GRM and IBS matrices. Wu et al.[ 5 ] showed that the test statistics approximately follow the chi‐square distribution. However, because that distribution is derived from the asymptotic distribution of the statistic, the p‐values for datasets with an insufficient number of samples may be inaccurate, which could cause inflation or power loss.[ ] In a permutation test, the null distribution of the test statistic can be approximated by fully resampling the observed traits without replacement. The proposed method can be useful for reducing the computational time needed to obtain p‐values using resampling methods.
5. Conclusion
In the present paper, a method for computing kernel statistics without explicitly forming the kernel matrices is proposed. The method is referred to as genotype value decomposition. By using this method, each kernel statistic needed for the null distribution can be calculated with time complexity $O(n)$. The proposed method enables one to conduct SKAT for large samples in human genetics.
Conflict of Interest
The author declares no conflict of interest.
Peer Review
The peer review history for this article is available in the Supporting Information for this article.
Supporting information
Supporting Information
Supplementary Information: Record of Transparent Peer Review
Acknowledgements
The author thanks Dr. Naomichi Matsumoto for his suggestions and encouragement. This work was supported by JSPS KAKENHI Grant Numbers JP17K08682, JP19K22647, JP20K07316. The author also thanks Steven M. Thompson, from Edanz Group for editing a draft of this manuscript.
Misawa K., Genotype Value Decomposition: Simple Methods for the Computation of Kernel Statistics. Advanced Genetics 2022, 3, 2100066. 10.1002/ggn2.202100066
Data Availability Statement
The python code used in the study is available at https://github.com/kazumisawa/paraHaplo5 under the MIT license.
References
- 1. Misawa K., Hasegawa T., Mishima E., Jutabha P., Ouchi M., Kojima K., Kawai Y., Matsuo M., Anzai N., Nagasaki M., Genetics 2020, 214, 1079.
- 2. Misawa K., Kamatani N., Source Code Biol. Med. 2009, 4, 7.
- 3. Lin D. Y., Tang Z. Z., Am. J. Hum. Genet. 2011, 89, 354.
- 4. Larson N. B., Chen J., Schaid D. J., Genet. Epidemiol. 2019, 43, 122.
- 5. Wu M. C., Lee S., Cai T., Li Y., Boehnke M., Lin X., Am. J. Hum. Genet. 2011, 89, 82.
- 6. Hasegawa T., Kojima K., Kawai Y., Misawa K., Mimori T., Nagasaki M., BMC Genomics 2016, 17, 745.
- 7. Yang J., Lee S. H., Goddard M. E., Visscher P. M., Am. J. Hum. Genet. 2011, 88, 76.
- 8. Price A. L., Patterson N. J., Plenge R. M., Weinblatt M. E., Shadick N. A., Reich D., Nat. Genet. 2006, 38, 904.
- 9. Kwee L. C., Liu D., Lin X., Ghosh D., Epstein M. P., Am. J. Hum. Genet. 2008, 82, 386.
- 10. Wessel J., Schork N. J., Am. J. Hum. Genet. 2006, 79, 792.
- 11. Danecek P., Auton A., Abecasis G., Albers C. A., Banks E., DePristo M. A., Handsaker R. E., Lunter G., Marth G. T., Sherry S. T., McVean G., Durbin R., 1000 Genomes Project Analysis Group, Bioinformatics 2011, 27, 2156.
- 12. Patterson N., Price A. L., Reich D., PLoS Genet. 2006, 2, e190.
- 13. Zhang J., Wu B., Sha Q., Zhang S., Wang X., Genet. Epidemiol. 2019, 43, 966.
- 14. Enomoto A., Kimura H., Chairoungdua A., Shigeta Y., Jutabha P., Cha S. H., Hosoyamada M., Takeda M., Sekine T., Igarashi T., Matsuo H., Kikuchi Y., Oda T., Ichida K., Hosoya T., Shimokata K., Niwa T., Kanai Y., Endou H., Nature 2002, 417, 447.
- 15. Tin A., Li Y., Brody J. A., Nutile T., Chu A. Y., Huffman J. E., Yang Q., Chen M. H., Robinson‐Cohen C., Mace A., Liu J., Demirkan A., Sorice R., Sedaghat S., Swen M., Yu B., Ghasemi S., Teumer A., Vollenweider P., Ciullo M., Li M., Uitterlinden A. G., Kraaij R., Amin N., van Rooij J., Kutalik Z., Dehghan A., McKnight B., van Duijn C. M., Morrison A., et al., Nat. Commun. 2018, 9, 4228.
- 16. Claverie‐Martin F., Trujillo‐Suarez J., Gonzalez‐Acosta H., Aparicio C., Justa Roldan M. L., Stiburkova B., Ichida K., Martin‐Gomez M. A., Herrero Goni M., Carrasco Hidalgo‐Barquero M., Inigo V., Enriquez R., Cordoba‐Lanus E., Garcia‐Nieto V. M., RenalTube G., Clin. Chim. Acta 2018, 481, 83.
