Author manuscript; available in PMC: 2021 Mar 24.

Published in final edited form as: Nat Genet. 2020 Apr 13;52(5):534–540. doi: 10.1038/s41588-020-0612-7

Extended Data Fig. 3. (A) Creating the k-mer presence/absence table: each accession’s genomic DNA sequencing reads are cut into k-mers^45, and k-mers appearing fewer than two or three times in a sequencing library are discarded. The remaining k-mers are filtered to retain only those that are present in at least 5 accessions and that are found in both forward and reverse-complement form in at least 20% of the accessions in which they appear. All k-mer lists are then combined into a k-mer presence/absence table.
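
The filtering in panel (A) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`count_kmers`, `build_table`), the in-memory dictionaries, and the toy reads are assumptions made for illustration, and the thresholds are parameters so that the values from the caption (minimum count of 2 or 3, 5 accessions, 20%) can be plugged in.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k):
    """Count canonical k-mers in one library and record which strand forms were seen."""
    counts = Counter()
    forms = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) - set("ACGT"):
                continue  # skip k-mers containing N or other symbols
            rc = reverse_complement(kmer)
            canon = min(kmer, rc)  # one representative per strand pair
            counts[canon] += 1
            seen = forms.setdefault(canon, set())
            seen.add("canonical" if kmer == canon else "revcomp")
            if kmer == rc:  # palindromic k-mer: both forms are identical
                seen.add("revcomp")
    return counts, forms


def build_table(libraries, k, min_count=2, min_accessions=5, min_both_frac=0.2):
    """Return (accession names, {k-mer: 0/1 presence row}) after the filters."""
    present = {}      # k-mer -> accessions in which it passed the count filter
    both = Counter()  # k-mer -> number of accessions showing both strand forms
    for name, reads in libraries.items():
        counts, forms = count_kmers(reads, k)
        for kmer, n in counts.items():
            if n < min_count:
                continue
            present.setdefault(kmer, set()).add(name)
            if len(forms[kmer]) == 2:
                both[kmer] += 1
    names = sorted(libraries)
    table = {}
    for kmer, accessions in present.items():
        enough_accessions = len(accessions) >= min_accessions
        enough_both = both[kmer] >= min_both_frac * len(accessions)
        if enough_accessions and enough_both:
            table[kmer] = [int(name in accessions) for name in names]
    return names, table


# Toy data (hypothetical): three accessions, k = 3, relaxed accession threshold.
libraries = {
    "a": ["ACGA", "ACGA", "TCGT"],
    "b": ["ACGA", "ACGA"],
    "c": ["ACGA"],
}
names, table = build_table(libraries, k=3, min_accessions=2)
```

In this toy input, accession `c` contributes nothing because each of its k-mers occurs only once, and the strand filter is satisfied by accession `a` alone, where both forms of each k-mer are observed.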

(B) Genome-wide associations on the full k-mers table using SNP-based software: the k-mers table is converted into PLINK binary format, which is used as input for SNP-based association mapping software^14,42.
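
As a rough illustration of the conversion in panel (B), the sketch below packs presence/absence rows into the variant-major PLINK `.bed` byte layout (three magic bytes, then two bits per sample, four samples per byte, first sample in the lowest-order bits). The caption does not say how presence and absence are mapped to genotype codes, so the mapping used here (presence as `0b11`, absence as `0b00`) is an assumption, as is the function name `encode_bed`. A usable PLINK fileset also needs matching `.bim` and `.fam` files, which are not shown.

```python
def encode_bed(rows):
    """Pack 0/1 rows (one per k-mer, one column per accession) into .bed bytes."""
    out = bytearray([0x6C, 0x1B, 0x01])  # magic number + variant-major flag
    for row in rows:
        for start in range(0, len(row), 4):
            byte = 0
            for offset, present in enumerate(row[start:start + 4]):
                # Assumed mapping: presence -> 0b11, absence -> 0b00.
                code = 0b11 if present else 0b00
                byte |= code << (2 * offset)
            out.append(byte)  # unused high bits in a final byte stay zero
    return bytes(out)


# One k-mer across five accessions (hypothetical values).
bed = encode_bed([[1, 0, 1, 1, 0]])
```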

(C) GWA optimized for the k-mers: k-mer presence/absence patterns are first associated with the phenotype and its permutations using a linear mixed model (LMM) to account for population structure^16,17. This first step computes an approximate score rather than fitting the exact model. The best k-mers from this first step (e.g. 100,000 k-mers) are passed to the second step, in which an exact p-value is calculated^14 for both the phenotype and its permutations. A permutation-based threshold is calculated, and all k-mers passing this threshold are checked for their rank in the scoring from the first step. If not all k-mer hits are in the top 50% of the initial scoring, the entire process is rerun from the beginning, passing more k-mers from the first to the second step. This last test confirms that the approximation in the first step does not remove truly associated k-mers.
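
The control flow of panel (C) can be sketched as below. This is a simplified stand-in, not the published method: scores are treated as "higher is more significant", a squared correlation replaces both the approximate LMM score and the exact p-value, the threshold is assumed to be the (1 − alpha) quantile of the best score per permutation, and "top 50%" is assumed to mean the top half of the k-mers passed to the second step. The names `two_step_gwa`, `squared_correlation`, `n_pass`, and `growth` are hypothetical.

```python
import numpy as np


def squared_correlation(table, phenotype):
    """Stand-in score: squared Pearson correlation of each k-mer row with the phenotype."""
    x = table - table.mean(axis=1, keepdims=True)
    y = phenotype - phenotype.mean()
    denom = (x ** 2).sum(axis=1) * (y ** 2).sum()
    safe = np.where(denom > 0, denom, 1.0)  # avoid dividing by zero for constant rows
    return np.where(denom > 0, (x @ y) ** 2 / safe, 0.0)


def two_step_gwa(table, phenotype, approx_score, exact_score,
                 n_pass=1000, n_perm=20, alpha=0.05, growth=10, seed=0):
    rng = np.random.default_rng(seed)
    perms = [rng.permutation(phenotype) for _ in range(n_perm)]
    n_kmers = table.shape[0]

    # Step 1: approximate scores, ranked best first, for the phenotype and each permutation.
    approx_order = np.argsort(-approx_score(table, phenotype))
    perm_orders = [np.argsort(-approx_score(table, p)) for p in perms]

    while True:
        n_pass = min(n_pass, n_kmers)
        passed = approx_order[:n_pass]

        # Step 2: exact scores only for the k-mers that passed step 1.
        exact = exact_score(table[passed], phenotype)
        perm_best = [exact_score(table[order[:n_pass]], p).max()
                     for order, p in zip(perm_orders, perms)]
        threshold = np.quantile(perm_best, 1 - alpha)

        # Positions in `passed` are the step-1 ranks of the hits.
        hit_ranks = np.flatnonzero(exact > threshold)
        all_in_top_half = hit_ranks.size == 0 or hit_ranks.max() < n_pass / 2
        if all_in_top_half or n_pass == n_kmers:
            return passed[hit_ranks], float(threshold)
        n_pass *= growth  # rerun, passing more k-mers to step 2


# Simulated data (hypothetical): k-mer 0 tracks the phenotype, the rest are random.
rng = np.random.default_rng(1)
phenotype = rng.normal(size=40)
table = rng.integers(0, 2, size=(201, 40))
table[0] = phenotype > np.median(phenotype)
hits, threshold = two_step_gwa(table, phenotype,
                               squared_correlation, squared_correlation, n_pass=50)
```

Because the same stand-in score is used for both steps here, the hits are top-ranked by construction and the rerun branch never fires; it only becomes relevant when the first-step score is a genuine approximation whose ranking can disagree with the exact one.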