Skip to main content
PLOS Computational Biology logoLink to PLOS Computational Biology
. 2021 Aug 2;17(8):e1008904. doi: 10.1371/journal.pcbi.1008904

High-throughput Interpretation of Killer-cell Immunoglobulin-like Receptor Short-read Sequencing Data with PING

Wesley M Marin 1, Ravi Dandekar 1, Danillo G Augusto 1, Tasneem Yusufali 1, Bianca Heyn 2, Jan Hofmann 3, Vinzenz Lange 2, Jürgen Sauter 3, Paul J Norman 4, Jill A Hollenbach 1,*
Editor: Ferhat Ay5
PMCID: PMC8360517  PMID: 34339413

Abstract

The killer-cell immunoglobulin-like receptor (KIR) complex on chromosome 19 encodes receptors that modulate the activity of natural killer cells, and variation in these genes has been linked to infectious and autoimmune disease, as well as having bearing on pregnancy and transplant outcomes. The medical relevance and high variability of KIR genes makes short-read sequencing an attractive technology for interrogating the region, providing a high-throughput, high-fidelity sequencing method that is cost-effective. However, because this gene complex is characterized by extensive nucleotide polymorphism, structural variation including gene fusions and deletions, and a high level of homology between genes, its interrogation at high resolution has been thwarted by bioinformatic challenges, with most studies limited to examining presence or absence of specific genes. Here, we present the PING (Pushing Immunogenetics to the Next Generation) pipeline, which incorporates empirical data, novel alignment strategies and a custom alignment processing workflow to enable high-throughput KIR sequence analysis from short-read data. PING provides KIR gene copy number classification functionality for all KIR genes through use of a comprehensive alignment reference. The gene copy number determined per individual enables an innovative genotype determination workflow using genotype-matched references. Together, these methods address the challenges imposed by the structural complexity and overall homology of the KIR complex. To determine copy number and genotype determination accuracy, we applied PING to European and African validation cohorts and a synthetic dataset. PING demonstrated exceptional copy number determination performance across all datasets and robust genotype determination performance. Finally, an investigation into discordant genotypes for the synthetic dataset provides insight into misaligned reads, advancing our understanding in interpretation of short-read sequencing data in complex genomic regions. PING promises to support a new era of studies of KIR polymorphism, delivering high-resolution KIR genotypes that are highly accurate, enabling high-quality, high-throughput KIR genotyping for disease and population studies.

Author summary

Killer cell immunoglobulin-like receptors (KIR) serve a critical role in regulating natural killer cell function. They are encoded by highly polymorphic genes within a complex genomic region that has proven difficult to interrogate owing to structural variation and extensive sequence homology. While methods for sequencing KIR genes have matured, there is a lack of bioinformatic support to accurately interpret KIR short-read sequencing data. The extensive structural variation of KIR, both the small-scale nucleotide insertions and deletions and the large-scale gene duplications and deletions, coupled with the extensive sequence similarity among KIR genes presents considerable challenges to bioinformatic analyses. PING addressed these issues through a highly-dynamic alignment workflow, which constructs individualized references that reflect the determined copy number and genotype makeup of a sample. This alignment workflow is enabled by a custom alignment processing pipeline, which scaffolds reads aligned to all reference sequences from the same gene into an overall gene alignment, enabling processing of these alignments as if a single reference sequence was used regardless of the number of sequences or of any insertions or deletions present in the component sequences. Together, these methods provide a novel and robust workflow for the accurate interpretation of KIR short-read sequencing data.

Introduction

The killer cell immunoglobulin-like receptor (KIR) complex, located in human chromosomal region 19q13.42, encodes receptors expressed on the surface of natural killer (NK) cells [1] and a subtype of T-cells [2]. KIRs interact with their cognate HLA class I ligands to educate NK cells and modulate their cytotoxicity [35]. KIR genes exhibit presence and absence polymorphism and gene content variation that has been implicated in numerous immune-mediated and infectious diseases [611]. In addition, careful consideration of KIR gene content haplotypes for allogeneic transplantation has been shown to improve outcomes for acute myelogenous leukemia patients [1217]. Whereas evidence for the relevance of KIR variation in health and disease is mounting, analysis of the KIR family at allelic resolution has been thwarted by the complexity of the region.

The KIR complex evolved rapidly through recombination and gene duplication events, and in humans this has resulted in a gene-content variable cluster of 13 genes and 2 pseudogenes [1820]. Variation in KIR genes is characterized by extensive nucleotide polymorphisms, with 1110 alleles described to date [21]. The KIR complex is also characterized by large-scale structural variation, including gene fusions, duplications and deletions [22,23]. KIR haplotypes exhibit gene content variation at extraordinary levels, generating hundreds of observed haplotype structures [20,2426].

The high variability of KIR makes short-read sequencing an attractive technology for interrogating the region, providing a high-throughput, high-fidelity and cost-effective sequencing method [27]. Whereas the KIR region is relatively small, between 70-270Kbp [28], the overall sequence similarity among genes, structural variability of the region, and sequence polymorphism present major obstacles to bioinformatics workflows. The high potential for read misalignments significantly confounds interpretation of the region in modern large-scale sequencing studies.

Previously, we introduced a laboratory method for targeted sequencing of the KIR gene complex [27], but the associated prototype bioinformatic pipeline for sequence interpretation presented significant workload barriers for high-throughput studies. For example, the copy number determination workflow was unable to differentiate KIR2DL2 from KIR2DL3, which are sets of highly similar allelic groups of the KIR2DL23 gene [29], and the resolution of KIR2DS1 and KIR2DL1 was less precise than desired due to read misalignments caused by the similarity of these two genes. Additionally, a high frequency of unresolved genotypes (not matching any described allele sequence) necessitated subsequent interpretation by a user with domain expertise. In spite of these challenges, the prototype pipeline has provided insight into KIR genotyping methods development [30,31], the role of KIR sequence variants in immune dysfunction [17,32], and KIR evolutionary analyses [33]. To the best of our knowledge there are two other existing tools for interrogating KIR short-read sequencing data, KIR*IMP [34] and KPI [35,36]. KIR*IMP imputes KIR copy number from carefully selected SNPs while KPI interprets KIR gene content and predicts haplotype-pairs using in silico probes. Neither of these methods support allele level genotyping or direct copy number assessment.

Here, we present a comprehensive KIR sequence interpretation workflow, termed PING (Pushing Immunogenetics to the Next Generation), which builds on our early work by incorporating empirical optimizations derived from sequencing thousands of samples, in addition to novel alignment strategies to address issues with read misalignments, to provide a comprehensive KIR sequence analysis tool, offering allele-level genotypes, copy number, and novel sequence analysis. This work improves on the pipeline described in Norman et al. [27] in the following ways: the copy number determination workflow was adjusted from a single-sequence per gene alignment to a comprehensive multiple-sequence per gene alignment; virtual probes used for gene content determination were refined and expanded; the genotype determination workflow was adjusted from static single-gene filtration alignments to dynamic holistic alignments, which incorporate the so-established gene content and preliminary genotype determinations; a custom alignment processing workflow was developed to handle multiple-sequence per gene alignments; and finally, a KIR sequence imputation workflow was developed to enable alignment to any described KIR allele sequence. These innovations enable for the first time highly-automated, high-throughput KIR sequence analysis from short-read sequencing data, and importantly, largely obviate the need for user expertise in the KIR system.

Materials and methods

The major innovations in PING, detailed below, include the use of multiple-sequence per gene alignment references that incorporate the allelic diversity of KIR, and genotype-matched alignment references (Fig 1). The use of a diverse reference set in the copy number module substantially improves copy number determination for KIR2DL2, KIR2DL3, KIR2DS1 and KIR2DL1 compared to our prototype approach. The improved performance enables an innovative alignment workflow that dynamically constructs genotype-matched alignment references based on the so-established gene content and a preliminary genotype determination. The genotypes determined by this novel genotype-aware alignment workflow are highly accurate, with few unresolved calls.

Fig 1. Overview of the PING pipeline.

Fig 1

The PING pipeline processes KIR targeted sequencing data to determine KIR gene copy number and allele genotypes through a series of modules. First, KIR aligned reads are filtered through an alignment to a set of KIR haplotypes in PING extractor. Second, copy number of KIR genes are determined through an exhaustive alignment to a diverse set of KIR sequences in PING copy. Finally, PING allele performs a series of alignments to determine the most congruent KIR genotype, which informs a final round of alignment and genotype determination. Additionally, PING reports any identified novel SNPs and new alleles (SNP combinations not found in any described KIR allele sequence).

Preprocessing the database

Imputation of uncharacterized regions and extension of untranslated regions to generate comprehensive alignment reference sequence

KIR allele sequences used throughout this workflow are provided by the Immuno Polymorphism Database—KIR (IPD-KIR), release 2.7.1 [37]. However, many KIR allele sequences in IPD-KIR have only been characterized for exons, indeed, 65% of named alleles have less than 20% of their full-length sequence characterized (Fig 2A). Additionally, IPD-KIR allele sequences only include ~250bp of 5’ untranslated region (UTR) sequence and ~500bp of 3’ UTR sequence, reducing alignment depths across the first exon and potential regulatory regions. Thus, to maximize the utility of reference KIR sequences, we designed and implemented a protocol to impute the sequence of the intronic regions, followed by a protocol to extend UTRs to 1000bp each.

Fig 2. KIR sequence characterization before and after imputation.

Fig 2

(A) Histogram of IPD-KIR allele sequence lengths, shown as percentage of longest sequence for each gene and major allele group. (B) Histogram of IPD-KIR allele sequence lengths after imputation, shown as percentage of longest sequence for each gene and major allelic group.

As reference sequences, we used all KIR alleles described in the IPD—KIR [21], release 2.7.1. A subset of these sequences is not completely characterized through all exons and introns. We therefore used gene-specific alignments of known sequences, provided by IPD-KIR as multiple sequence format (MSF) files, and completed each allele sequence to comprise the invariant nucleotides together with each variable position represented by an ‘N’. Using this imputation method, we generated a new set of reference alleles in which ~90% of the 905 alleles were >98% complete (Fig 2B).

To extend UTR sequence, donor sequences, sourced from the full KIR haplotype sequences used in PING extractor, were appended to the ends of each reference sequence to generate 1000bp long UTRs. A single 3’UTR and 5’UTR donor sequence was used for each gene and major allelic group.

Designing a minimized reference allele set

While performing a comprehensive alignment to the full KIR allele set reduces misalignments caused by reference sequence bias, it demands substantial resource utilization, as well as a large alignment and processing time cost which can prove untenable for processing large datasets. For example, copy determination processing for 10 paired-end sequences using 36 threads took 4.95 hours with a maximum Binary Alignment Map (BAM) [38] file size of 338.2MB.

To address this issue, we constructed a minimized set of reference alleles to improve resource utilization and processing time while still reducing misalignments caused by reference sequence bias. The minimized reference set consists of five alleles for each KIR gene and major allelic group (S1 Table). The use of five alleles per gene was empirically determined to be sufficient for reducing reference sequence bias, while still considerably reducing the computational burden of multiple-sequence per gene alignments. Designing this reference set was guided by selecting alleles which had fully-characterized or nearly fully-characterized sequence, the secondary criteria was maximizing SNP diversity between the reference alleles of each gene, and third was selecting reference alleles to sequester reads susceptible to off-gene mapping. For example, reference sequences to represent KIR2DS1*002 as well as KIR2DL1*004 were selected to sequester reads that perfectly align to both. Notable characteristics of the reference set are the separation of KIR2DL2 from KIR2DL3, the separation of KIR3DL1 from KIR3DS1, and the merging of KIR2DL5A and KIR2DL5B.

PING workflow methods

Copy number determination–PING copy

The high sequence similarity between KIR genes coupled with extensive structural variation and nucleotide diversity makes copy number determination a non-trivial task. Our copy determination method is largely identical to that described in Norman et al. [27], in which copy number is determined by comparing the number of reads that align uniquely to each KIR gene across a batch of samples using KIR3DL3 as a normalizer. The improvement made by our method is the use of a comprehensive KIR reference composed of 905 distinct sequences from the imputed and extended allele set, instead of a single-sequence per gene reference. The use of a comprehensive reference provides a more accurate comparison of the number of reads that align uniquely to any KIR gene.

KIR virtual probes–PING allele

To determine the presence of target alleles or allelic groups that are prone to misidentification due to read misalignments, we have developed a set of virtual, or text-based, probes. The probe set includes those described in Norman et al. [27], as well as additional, custom probes (S2 Table). Probes are designed to match sequence that is unique to the target allele or allelic group, and sequence uniqueness is determined by a grep search over the imputed and extended IPD-KIR sequence set. Application of the probe set is performed using grep over the sequencing data, counting the number of unique reads that contain sequence perfectly matching the probe. A probe hit is determined using a threshold of 10 matching reads.

Genotype matched alignment workflow–PING allele

The overall alignment strategy of PING is to reduce reference sequence bias through using multiple-sequence per gene references, and the use of references that reflect the gene content makeup or genotype makeup of a sample (Fig 3). Additionally, PING utilizes multiple rounds of alignment and genotype determination with varied processing parameters to reduce bias introduced by assumptions made during the processing workflow. The intermediate and final alignment and genotyping rounds are referred to as ‘initial’ and ‘final’, respectively, when differentiating processing parameters.

Fig 3. Overview of the genotype aware alignment workflow.

Fig 3

Sequence gene content, determined by PING copy, informs the selection of reference sequence from a predefined set of diverse allele sequences. An exhaustive alignment is performed to the selected allele set, from which an initial genotype determination is made. The determined genotype informs selection of reference alleles for a genotype aware alignment, followed by another round of genotype determination. The genotype aware alignment and subsequent genotype determination is repeated, and the most congruent genotypings across all alignment rounds inform reference selection for a final round of alignment. A non-exhaustive alignment is performed to the selected allele set, from which all aligned reads are processed and used for the final genotype determination.

The first step is an alignment to a multiple-sequence per gene, gene content matched reference. This gene-content aware alignment workflow constructs individualized alignment references based on the presence of certain KIR genes: KIR3DP1, KIR2DS2, KIR2DL23, KIR2DL5A, KIR2DL5B, KIR2DS3, KIR2DS5, KIR2DP1, KIR2DL1, KIR2DL5, KIR3DL1S1, KIR2DS4, KIR3DL2, KIR2DS1 and KIR3DL3 (assumed always present [39]). Reference sequences are selected from the diverse, minimized reference sequence set described above. We have included an option to align to the full comprehensive allele set, but this is not default behavior as these alignments are time and resource intensive.

An exhaustive alignment, an alignment in which all qualified read mappings are recorded, is performed and aligned reads are processed and formatted according to the alignment processing workflow, detailed in S1 Text, selecting for reads that uniquely map to a gene or major allelic group. The formatted uniquely-mapped read set is processed according to the genotype determination workflow, detailed below, to obtain an initial full-resolution genotype.

Second is a series of two alignments to genotype-matched references with varied processing parameters to identify the most congruent KIR genotype. Genotype congruence is determined by the least number of SNP mismatches between the determined allele typing(s) of a gene, and the aligned SNPs. For each genotype-matched alignment in this series, the determined allele typing(s), including any ambiguity, are used as reference sequence for the following alignment. For each alignment, genotypes are first determined at seven-digit (non-coding mutation level), then five-digit resolution (synonymous mutation level). This approach reduces the impact of uncharacterized regions of IPD-KIR allele sequences on genotype determination, as most sequences are fully characterized across exons. Genotype determination can be biased towards or against IPD-KIR alleles with uncharacterized regions depending on whether uncharacterized SNPs count as mismatches or not. To reduce time spent on genotype determination, any unambiguous typing that is perfectly matched to the aligned SNPs is locked in across all subsequent intermediate rounds of genotype determination.

The reference for the final alignment is built from the locked genotypings and the closest matched genotypings for genes without a locked genotyping. A non-exhaustive alignment is performed to the built reference, from which all aligned reads are processed and formatted according to the alignment processing workflow. The formatted read alignments are passed to the genotype determination workflow to obtain a final exonic (five-digit) resolution genotype.

Additional methods utilized by the genotype matched alignment workflow–PING allele

Genotype matched alignments and subsequent genotype determinations can get stuck on a mistyped allele due to persistent reference sequence bias. In other words, a false SNP call originating from misaligned reads can perpetuate itself in the genotype matched alignments due to the same allele determination being made and the same alignment reference being used. To address this issue, we have included a method in the genotype matched alignments that will add the five allele sequences from the diverse, minimized reference set to the genotype-matched reference for any gene with an allele typing that does not perfectly match the aligned SNPs. The rationale behind this method is that mismatched allele typings are likely due to misaligned reads, and the use of the mismatched allele sequence as a reference will cause the read misalignments to be repeated in subsequent alignment and genotyping rounds. The addition of a diverse set of alignment sequences gives an avenue to break from this cycle by increasing the likelihood that a different allele typing will be made.

In building genotype-matched references PING allows any allele to be used as reference sequence, however, some allele sequences are only partially characterized even after imputation. The use of allele sequences containing uncharacterized sequence as alignment references can introduce reference sequence bias and drive read misalignments even if the reference alleles perfectly match the true genotype of the sample. To address this issue, we have included a method to add fully-characterized sequence to the alignment reference for any gene represented by only partially-characterized sequence(s). Fully-characterized alleles are pulled from the diverse, minimized reference set.

In the genotype-aware alignment workflow we found issues with false negative identifications of KIR2DL1*004/*007/*010 due to reads cross-mapping to other gene sequences. This issue was rectified using virtual sequence probes specific to each of these KIR2DL1 allele groups to identify *004/*007/*010 allele presence. If KIR2DL1*010 is present, then the KIR2DL1*010 allele sequence is added to the alignment reference. If KIR2DL1*004 is present, then the KIR2DL1*0040101 allele sequence is added to the alignment reference. If KIR2DL1*007 is present, then the KIR2DL1*007 allele sequence is added to the alignment reference. If multiple of these allele groups are present, then the KIR2DL1*0040101 allele sequence is added to the alignment reference.

We implemented additional probes to identify alleles and structural variants prone to misidentification across KIR2DL1, KIR2DL2, KIR2DL4, KIR2DS1, KIR3DP1 and KIR3DS1. For example, we implemented a probe to identify the KIR2DL4 poly-A stretch at the end of exon 7, as well as a probe to identify KIR3DP1 exon 2 deletion variants. The full list of probes used for reference refinement can be found in S2 Table.

Genotype determination workflow–PING allele

Indexed reads, detailed in S1 Text, are processed to generate a depth table spanning -1000bp 5’UTR to 1000bp 3’UTR for each KIR gene and major allelic group. Depths are marked independently for A, T, C, G, deletions and insertions. Depth tables are processed to generate SNP tables for positions passing a minimum depth threshold (default 8 for initial genotyping and 20 for final genotyping). To identify heterozygous positions the depth of each aligned variant is divided by the highest depth variant for that position, and up to three variants (A, T, C, G, deletions and insertions) passing the ratio threshold (default 0.25 for initial and final genotyping) are recorded.

Genotypes for each gene and major allelic group are determined from the aligned SNPs using a mismatch scoring approach. First, aligned homozygous SNPs are compared to each IPD-KIR allele, with SNP mismatches counting as a score of 1 and matches as 0. The lowest scoring alleles and alleles within a set scoring buffer of the lowest score (default of 4 for the initial genotyping workflow and 1 for the final genotyping workflow), are carried over into heterozygous position scoring.

For aligned heterozygous position scoring, all possible allele combinations are enumerated according to the determined copy of the gene under consideration, up to copy 3. For each aligned position, the variant(s) for each allele combination are compared to the aligned variants, with full matches counted as a score of 0 and mismatches scored according to the number of mismatched variants. For each allele combination, the homozygous score of each component allele is added to the heterozygous score, and the lowest scoring combinations are returned as the determined genotype. For the final genotyping workflow, only perfectly scoring combinations are accepted, with any mismatches resulting in an unresolved genotype.

The same workflow is applied to both initial and final genotyping with some important distinctions. In the initial genotyping workflow, the imputed and extended IPD-KIR allele sequences are used for SNP comparisons, uncharacterized variants within the comparison sequences are marked as full mismatches, and all aligned allele-differentiating SNP positions passing the depth threshold are compared and used for scoring. In the final genotyping workflow, the unimputed IPD-KIR allele sequences are used for SNP comparisons, uncharacterized variants within the comparison sequences are marked as matches, and only aligned exonic SNP positions passing the depth threshold are compared and used for scoring.

The final exonic resolution genotypes are processed to add null alleles to the genotype string for genes with copy 0 or 1 and combine component allele typings for the major allelic groups KIR2DL2 and KIR2DL3, KIR3DL1 and KIR3DS1, and KIR2DS3 and KIR2DS5.

Workflow validation

KIR synthetic sequence dataset

A KIR synthetic dataset consisting of 50 sequences was generated using the ART next-generation sequencing read simulator [40]. ART parameters were set to simulate 150-bp paired-end reads at 50x coverage, with a median DNA fragment length of 200 using quality score profiles from the HiSeq 2500 system. Eleven of the KIR haplotypes described in Jiang et al. [41] were used to simulate structural variation of the KIR region. Two of the eleven haplotypes were randomly selected with replacement to establish the copy number for each sample. Allele sequences were selected randomly without replacement from the imputed and extended set according to the copy number of each gene. Any uncharacterized regions in the selected allele sequences were replaced with sequence from a random fully-characterized sequence from the same gene. Reads were named according to the source allele, enabling tracing of misaligned reads to their source allele and gene. The full synthetic dataset is available at: https://github.com/wesleymarin/KIR_synthetic_data.

Discordant genotype results for the synthetic dataset were investigated by identifying the source gene for each read aligned to the incorrectly genotyped gene. The results were summarized to show the total number of reads from each source gene to each aligned gene, Table A in S3 Table, and a read sharing diagram was generated using the circlize [42] package in R [43].

Characterization of KIR reference cohorts for PING development

A significant barrier to the development of bioinformatic methods for high-resolution KIR sequence interpretation is the lack of a well-characterized reference cohort. Without such a resource it is extremely difficult to recognize and resolve issues with read misalignments, which can result in SNP calls that appear reasonable in many cases. To resolve this issue, we have characterized a KIR reference cohort of 379 healthy individuals of European ancestry that had been previously sequenced using our KIR target capture method [44], with the results meticulously curated by manual alignment and inspection of all sequences to provide a ground truth dataset to aid pipeline development (Table A in S4 Table). Furthermore, the European samples were independently sequenced and genotyped for KIR by our collaborators at the DKMS registry for volunteer bone marrow donors [31]. Any discordant typing or gene content results were resolved through direct examination of sequence alignments, and where necessary, confirmatory sequencing.

In order to validate our method on a second, divergent population, we also examined a previously characterized cohort of African Khoesan individuals [45], for which KIR alignments and genotypes were manually inspected (Table B in S4 Table).

Copy number and genotype concordance calculations

For the European cohort, any genotype containing an unresolved KIR3DL3 genotyping, a genotype for which the aligned SNPs do not perfectly match currently described alleles, in the truth dataset were excluded from copy number and genotype concordance comparisons. There were 16 full genotypes excluded by this criterion. Additionally, any individual gene with an unresolved genotype in the truth dataset were excluded from copy number and genotype concordance comparisons. For the synthetic dataset there were no simulated novel alleles, so the full dataset was used for copy number and genotype concordance comparisons. For the Khoesan dataset any individual gene with an unresolved genotype in the truth dataset were excluded from genotype concordance comparisons but were included for copy number comparisons.

Copy concordance was calculated by directly comparing the determined copy values to the validation copy values (S5 and S6 Tables). Genotype concordance was calculated on a per-gene basis by comparing each component allele of the determined typing to the truth genotype.

Code availability

The PING pipeline is available at https://github.com/wesleymarin/PING [46] with the following open source license: https://github.com/wesleymarin/PING/blob/master/LICENSE. Scripts and datasets used for data analysis are available at https://github.com/wesleymarin/ping_paper_scripts. PING was developed in the R programming language and tested on a Linux system. Additional requirements are: Samtools v1.7 or higher [38], Bcftools v1.7 or higher, and Bowtie2 v2.3.4.1 or higher [47]. The synthetic KIR sequence dataset is available at https://github.com/wesleymarin/KIR_synthetic_data.

Results

Extensive sequence identity among KIR genes is a major barrier for interpreting short-read sequencing data

Read misalignments due to sequence identity across the KIR region are a persistent challenge in KIR bioinformatics and often lead to spurious genotyping results. To quantify the extent of sequence identity and inform our investigation of SNPs suspected to be originating from misaligned reads, we performed a shared k-mer analysis using all 905 described KIR allele sequences in the Immuno Polymorphism Database (IPD)—KIR [21], release 2.7.1. Here, we transformed allele sequences into all distinct subsequences of sizes 50, 150, and 250 to compare sequence identity between genes. Shared k-mer proportions were calculated by dividing the number of shared k-mers by the total number of k-mers of that gene. K-mer sharing diagrams were generated using the circlize [42] package in R [43].

Our analysis showed that many genes share significant sequence identity at sequencing lengths commonly used in next generation sequencing (NGS) technology (Fig 4A). For example, KIR2DL5A shares 12,591 of its 15,359 distinct 150-mers (82%) with KIR2DL5B (Fig 4B and S8 Table), making it extremely difficult to distinguish short reads originating from these genes. Likewise, over 90% of the distinct 50-mers, and over 50% of distinct 250-mers of KIR2DS1 are shared with other genes, the vast majority with KIR2DL1. This analysis allowed us to identify specific “hotspots” for read misalignments, informing post-alignment modifications (detailed previously) to minimize their impact.

Fig 4. K-mer analysis of KIR gene sequence similarity.

Fig 4

(A) Ratio of distinct k-mers of size 50, 150 and 250 that are shared between the indicated KIR gene and others. The inverse of these bars (not shown) would indicate the proportion of k-mers that are distinct to that gene and not found in the alleles of other genes. (B) 150-mer connections between KIR genes, the size of the connecting line roughly indicates the total number of shared 150-mers.

Development of a comprehensive KIR alignment reference enables accurate copy number determination of KIR2DL1, KIR2DS1, KIR2DL2 and KIR2DL3

We compared single-sequence per gene vs. multiple-sequence per gene reference alignments using the synthetic dataset. The reads from this dataset are labeled according to their source gene, providing a straightforward approach to quantify off-target alignments. This comparison showed substantial reductions in the frequency of read misalignments (reads mapping to an off-target gene) across KIR2DL1, KIR2DL23, KIR2DL5A/B, KIR2DS1, KIR2DS35, and KIR3DL1S1, and small reductions for KIR3DL2, KIR3DL3, and KIR3DP1 for the multiple-sequence per gene reference (Fig 5A and S9 Table). Applying the comprehensive reference allele set to gene content and copy number determination, we achieved significant improvement over a single-sequence per gene reference for KIR2DL1, KIR2DS1 and the allelic groups KIR2DL2 and KIR2DL3 (S1S3 Figs). The copy number for KIR2DS1, per example, which is highly prone to read misalignments due to similarity to KIR2DL1 and KIR2DS4 (Fig 4B), was clearly determined (Fig 5B).

Fig 5. Use of comprehensive reference improves copy determinations.

Fig 5

(A) Frequencies of off-target read mappings using a comprehensive reference vs. a single-sequence per gene reference for the synthetic dataset. (B) Single-sequence reference vs. comprehensive reference copy number plot of KIR2DS1 for the European cohort. The copy plot of the single-sequence reference alignment shows no differentiation between copy groupings while the comprehensive reference alignment shows a clear distinction between the copy 0, 1 and 2 groups.

PING delivers accurate copy number and high-resolution allele calls

The overall performance of PING was assessed using our European KIR reference cohort, a synthetic KIR dataset, and a Khoesan KIR reference cohort. Results for PING copy number determination are summarized in Table 1, showing at least 97% concordance for the European cohort for all compared genes, with most genes exhibiting more than 99% concordance. Performance for the synthetic dataset showed 100% copy concordance for all compared genes except for KIR2DL1, at 98%, and KIR2DS3, at 92%. Finally, performance for the Khoesan cohort showed at least 95% concordance for all compared genes except for KIR2DL2, at 61%, KIR2DL5A/B, at 88%, KIR2DS5, at 89%, and KIR2DL1, at 94%. Across all datasets KIR3DL3 was not compared due to its use as a reference gene, and for the European and Khoesan cohorts the pseudogene KIR3DP1 was not compared due to an absence of validation data.

Table 1. Copy number determination performance.

Concordance table comparing copy numbers determined by PING for the European reference cohort, a synthetic KIR dataset, and a Khoesan reference cohort.

Gene European N Synthetic N Khoesan N
KIR3DL3 - - - - - -
KIR2DS2 0.988 343 1.00 50 0.97 100
KIR2DL2 0.994 331 1.00 50 0.61 100
KIR2DL3 0.994 331 1.00 50 1.00 100
KIR2DL5A/B 0.997 343 1.00 50 0.88 100
KIR2DS3 0.988 343 0.92 50 0.97 100
KIR2DS5 0.985 342 1.00 50 0.89 100
KIR2DP1 0.982 338 1.00 50 0.95 100
KIR2DL1 0.970 334 0.98 50 0.94 100
KIR3DP1 - - 1.00 50 - -
KIR2DL4 0.994 341 1.00 50 1.00 100
KIR3DL1 0.997 340 1.00 50 1.00 100
KIR3DS1 0.997 339 1.00 50 0.99 100
KIR2DS1 0.988 342 1.00 50 1.00 100
KIR2DS4 0.991 343 1.00 50 0.99 100
KIR3DL2 0.988 326 1.00 50 1.00 100

Performance of genotype determination was assessed at three-digit resolution (protein level) for the European and Khoesan cohorts, and at five-digit resolution (synonymous mutation level) for the synthetic dataset (Table 2). The results were categorized as genotype matches, mismatches or unresolved genotypes, which were cases where PING could not make a genotype determination.

Table 2. Genotype determination performance.

Genotype determination performance table comparing the genotypes determined by PING to the validation genotypes for each dataset. Possible outcomes are ‘Match’, where the determined component allele matches the validation allele, ‘Mismatch’, where the determined component allele does not match the validation allele, or ‘Unresolved’, where PING was unable to determine a genotype, but the validation allele was not marked as unresolved. The coloring signifies concordance level, where green is 0–10% discordant, yellow is 10–15% discordant, and red is over 15% discordant.

Gene Dataset Match Mismatch Unresolved N
KIR3DL3 European 0.959 0.009 0.032 686
Synthetic 0.960 0.000 0.040 100
Khoesan 0.887 0.062 0.050 80
KIR2DS2 European 0.975 0.012 0.013 686
Synthetic 0.940 0.040 0.020 100
Khoesan 0.835 0.005 0.160 188
KIR2DL23 European 0.965 0.003 0.032 656
Synthetic 0.850 0.020 0.130 100
Khoesan 0.810 0.042 0.149 168
KIR2DL5A/B European 0.927 0.044 0.029 687
Synthetic 0.856 0.106 0.038 104
Khoesan 0.857 0.071 0.071 126
KIR2DS35 European 0.982 0.006 0.012 683
Synthetic 0.923 0.019 0.058 104
Khoesan 0.871 0.052 0.078 116
KIR2DP1 European 0.890 0.007 0.103 672
Synthetic 0.971 0.000 0.029 103
Khoesan 0.721 0.012 0.267 86
KIR2DL1 European 0.961 0.009 0.030 666
Synthetic 0.883 0.019 0.097 103
Khoesan 0.803 0.045 0.152 132
KIR3DP1 European - - - -
Synthetic 0.773 0.055 0.173 110
Khoesan - - - -
KIR2DL4 European 0.980 0.007 0.013 685
Synthetic 1.000 0.000 0.000 110
Khoesan 0.961 0.006 0.032 154
KIR3DL1S1 European 0.962 0.006 0.032 686
Synthetic 0.955 0.000 0.045 110
Khoesan 0.873 0.028 0.099 142
KIR2DS1 European 0.985 0.003 0.012 682
Synthetic 0.900 0.000 0.100 100
Khoesan 0.995 0.000 0.005 198
KIR2DS4 European 0.943 0.044 0.013 685
Synthetic 0.940 0.020 0.040 100
Khoesan 0.793 0.051 0.157 198
KIR3DL2 European 0.965 0.008 0.028 648
Synthetic 0.990 0.010 0.000 100
Khoesan 0.819 0.011 0.170 188

PING genotype determination for the European cohort showed low percentages of unresolved genotypes with few mismatches for all compared genes except for KIR2DP1, with 10.3% unresolved. Notable results for the European cohort were the low frequencies of unresolved genotypes across most genes, except KIR2DP1, and the extremely low frequencies of mismatched genotypes, below 1%, for 9 out of the 12 genes compared.

Determined genotypes for the synthetic dataset showed over 95% concordance for KIR3DL3, KIR2DP1, KIR2DL4, KIR3DL1S1, and KIR3DL2. However, the synthetic dataset showed high percentages of unresolved genotypes for KIR2DL23, KIR3DP1 and KIR2DS1, each over 10% unresolved, and KIR2DS35 and KIR2DL1 showed 5.8% and 9.7% unresolved, respectively. KIR2DL5A/B and KIR3DP1 showed the highest mismatched genotype percentages, at 10.6% and 5.5%, respectively, while KIRDL3, KIR2DP1, KIR2DL4 and KIR2DS1 each showed 0.0% mismatched genotypes.

Determined genotypes for the Khoesan cohort showed highly concordant genotypes for KIR2DL4, at 96.1%, and KIR2DS1, at 99.5%. Additionally, results for this dataset showed low mismatch frequencies for KIR2DS2, KIR2DL23, KIR2DP1, KIR2DL1, KIR3DL1S1 and KIR3DL2, each below 5.0% mismatched. However, the Khoesan cohort showed moderate mismatch frequencies for KIR3DL3, at 6.2%, KIR2DL5A/B, at 7.1%, KIR2DS35, at 5.2%, and KIR2DS4, at 5.1%, and higher unresolved rates for KIR2DS2, KIR2DL23, KIR2DP1, KIR2DL1, KIR2DS4 and KIR3DL2, each over 10.0% unresolved.

For the European and Khoesan cohorts the pseudogene KIR3DP1 was not compared due to an absence of validation data.

Looking specifically at the concordance of resolved genotypes, the European cohort showed greater than 98.0% concordance across all compared genes except for KIR2DL5A/B, at 95.5%, and KIR2DS4, at 95.6% (Table 3). The synthetic dataset showed 100% concordance for KIR3DL3, KIR2DP1, KIR2DL4, KIR3DL1S1 and KIR2DS1, over 95% concordance for KIR2DS2, KIR2DL23, KIR2DS35, KIR2DL1, KIR2DS4 and KIR3DL2. The lowest performing genes in the synthetic dataset were KIR2DL5A/B, at 89%, and KIR3DP1, at 93%. The Khoesan cohort showed 100% concordance for KIR2DS1, over 95% concordance for KIR2DS2, KIR2DL23, KIR2DP1, KIR2DL4, KIR3DL1S1 and KIR3DL2, and over 90% concordance for KIR3DL3, KIR2DL5A/B, KIR2DS35, KIR2DL1 and KIR2DS4.

Table 3. Resolved genotype concordance.

PING genotype determination performance for the European reference cohort, a synthetic KIR dataset, and the Khoesan reference cohort for each considered KIR gene.

Gene European N Synthetic N Khoesan N
KIR3DL3 0.991 664 1.00 96 0.934 76
KIR2DS2 0.988 677 0.96 98 0.994 158
KIR2DL23 0.997 635 0.98 87 0.951 143
KIR2DL5A/B 0.955 667 0.89 100 0.923 117
KIR2DS35 0.994 675 0.98 98 0.944 107
KIR2DP1 0.992 603 1.00 100 0.984 63
KIR2DL1 0.991 646 0.98 93 0.946 112
KIR3DP1 - -k 0.93 91 - -
KIR2DL4 0.993 676 1.00 110 0.993 149
KIR3DL1S1 0.994 664 1.00 105 0.969 128
KIR2DS1 0.997 674 1.00 90 1.000 197
KIR2DS4 0.956 676 0.98 96 0.940 167
KIR3DL2 0.992 630 0.99 100 0.987 156

Together, these results demonstrate that PING accurately provides KIR genotyping across distinct populations.

Analysis of discordant determined copy number and genotype results

The discordant copy results for KIR2DS3 in the synthetic dataset were the result of poor differentiation between copy groups (S3 Fig). The highly discordant KIR2DL2 copy number result for the Khoesan cohort was due to non-differentiable copy number groupings (S2 Fig). Since the KIR2DL3 copy differentiation for this cohort was well defined, these results were used to set the KIR2DL2 copy number prior to genotype determination using the formula KIR2DL2_copy = 2 –KIR2DL3_copy.

An investigation into the discordant genotypes for the synthetic dataset showed discordant genotype determination results for KIR2DS35 were largely due to source reads from KIR2DS3 aligning to KIR2DS5 reference sequence, with a smaller number of reads from KIR2DS5 aligning to KIR2DS3 reference sequence (Fig 6 and Table A in S3 Table). This differential read flow between the two allelic groups is reflected in the component allele typings, with six discordant KIR2DS5 genotypings and two discordant KIR2DS3 genotypings (Table C in S7 Table). Intragenic misalignments are a product of how the PING workflow is structured, as major allelic groups, such as KIR2DS3 and KIR2DS5, are treated as independent genes during alignment and genotyping. Intragenic misalignments were also a large contributor to KIR3DL1S1 discordance, with reads supplied by KIR3DL1 mapping to KIR3DS1 reference sequence.

Fig 6. Misaligned read sources in the synthetic dataset.

Fig 6

Analysis of mismatched or unresolved genotype determination results for the synthetic sequence dataset where all misaligned reads are traced back to their source gene. The connections between genes represent the number of misaligned reads, and the color of the connection represents the source gene.

The analysis showed KIR3DP1 as a major hub for receiving misaligned reads, with reads being contributed by each other KIR gene. In fact, KIR3DP1 was largely the only receiver for misaligned reads originating from KIR3DL3, KIR3DL2, KIR2DP1 and KIR2DL4. While the only genes receiving reads sourced from KIR3DP1 were KIR2DL5A and KIR2DL5B.

The analysis also showed several gene pairings, where two genes largely sent and received reads from one another. Once such pairing was between KIR2DL1 and KIR2DS1, where each gene were the largest contributor and receiver of reads for each other. Another pairing was between KIR2DL2 and KIR2DS2, although both genes sent and received reads from several other genes.

This analysis illustrates the complex and highly interconnected nature of KIR and highlights the difficulty behind accurate interpretation of KIR short-read sequencing data.

Performance

The run time and resource utilization of the PING pipeline was measured on an Intel Xeon 2.20 GHz CPU using 36 threads. For ten sequences from the synthetic dataset, it took 1.92 mins for KIR read extraction, 34.7 mins for copy determination aligning to the minimized reference set, and 2.10 hours for genotype determination aligning to the minimized reference set. The output directory size was 1.4GB.

Discussion

Our shared k-mer analysis of all documented KIR variation shows the high degree of sequence identity between KIR genes and illustrates the challenges imposed by the homology of KIR on short-read interpretation workflows. It demonstrates that some genes are more likely to exhibit read misalignment problems than others. KIR2DP1, KIR3DL3 and KIR2DL4 have relatively unique sequence, while KIR2DS1, KIR2DL5A and KIR2DL5B have considerable shared sequence. This type of analysis provides an informative tool for investigating irregularities in the processing of KIR sequence data, revealing which genes are likely to be erroneously interpreted due to read misalignments for common sequencing read lengths. While paired-end sequencing with longer reads can improve read alignment fidelity, in our own experience 290bp paired-end reads with a median insert length of approximately 600bp still exhibited considerable read misalignment problems. It is important to note that this analysis does not account for unknown variation or intergenic sequence, two other sources of sequence variation that could potentially result in misaligned reads.

An initial determination of KIR gene content and copy number provides an informative scaffold for minimizing misalignments through the exclusion of reference sequence representing absent genes, as well as a system for identifying misalignments by searching for erroneously-called heterozygous SNP alleles in hemizygous genes. Thus, accurate copy number determination is a vital first step in interpreting KIR sequencing data. To achieve this goal, we developed a copy number determination method in PING that uses all described KIR alleles as an alignment resource, increasing the total number of reference sequences from 15 to 905 compared to single-sequence per gene alignments. While many of these alleles were only defined across exonic regions, rendering them ineffective for short-read alignment, we developed and implemented a protocol for intronic region imputation. The imputation method cannot resolve all uncharacterized nucleotide sequence, yet it accounts for the majority of missing sequence, greatly increasing the number of useful reference alleles. The exhaustive alignment provides a comprehensive map of the alleles to which a read may align, facilitating copy number resolution of important KIR allelic groups and genes that share extensive sequence similarity, such as KIR2DL2, KIR2DL3, KIR2DL1 and KIR2DS1, which were inaccessible to previous bioinformatic methods [27,48]. Additionally, the limited range of described UTR sequence, ~250bp 5’UTR and ~500bp 3’UTR, can reduce alignments over the first exon and potential regulatory regions [49,50].

The improved copy determination performance of PING, in addition to the expanded useful reference sequence repertoire, enables a smart, genotype-aware alignment workflow, designed to minimize read misalignments by closely matching reference sequences to the gene sequences present in the sequencing data. This alignment strategy addresses a major weakness of the filtration alignments utilized in the prototype workflow, which apply filters to retain gene-specific reads and eliminate cross-mapping reads regardless of the gene-content or sequence makeup, and thus often suffer from either inadequate or patchy aligned read depths after filtration. There is a valid concern about carrying forward alignment biases in the genotype-aware alignment workflow, and the KIR system in particular is sensitive to reference bias because the combination of highly polymorphic genes and high sequence similarity between genes means that small changes in the reference sequence can have large impacts on read alignments. We have implemented several methods to counteract potential alignment biases that could be carried forward by the genotype-aware alignments. The first is the use of virtual probes to identify alleles and structural variants prone to misidentification. The second is the addition of a curated sequence set to the alignment reference for any gene with a determined genotype that does not perfectly match the aligned SNPs, these sequences were selected to cover a large amount of the allelic diversity of the corresponding gene. Even with these countermeasures we still encounter some improper novel genotype determinations, likely due to reference sequence bias. Since no novel genotypes were simulated for the synthetic dataset the Synthetic data in the Unresolved column of Table 2 represent improper novel genotype determinations. Despite these limitations, the genotype-aware workflow achieves highly accurate genotype determinations for the European dataset (Table 2), and highly accurate resolved genotype determinations across all tested datasets (Table 3).

Both the synthetic dataset and Khoesan cohort showed higher levels of unresolved genotypes compared to the European cohort (Table 2). These datasets represent challenging data to correctly interpret, with the Khoesan being an extremely divergent population with many unresolved genotypes in the validation data, and the synthetic dataset consisting of random alleles, some of which used imputed sequence. An analysis into the discordant results for the synthetic dataset (Fig 6 and S3 Table) showed a complex web of cross-mapped reads. These cross-mapped reads can be extremely difficult to resolve because the high-degree of sequence shared among KIR genes (Fig 4) makes it almost impossible to determine correct mappings. Additionally, measures meant to prevent read misalignments, such as the use of virtual probes to refine reference sequence selection, can serve as a double-edged sword, where the issue at hand is addressed but the changes create new sources for read misalignments.

An analysis into the discordant copy results highlights a major outstanding problem with the PING workflow since accurate copy determination is a central component of effective genotype-aware alignments, and the need for manual thresholding between copy groups introduces the component of user error. Continued development of the pipeline will address methods for automating copy determination for targeted sequencing data that matches or surpasses the accuracy achieved by manual thresholding. To compare PING against an existing method, we benchmarked against KPI [35] for determining KIR gene content (S10 Table), achieving 100% concordance for KIR3DP1, KIR2DL3, KIR2DL4, KIR3DL3 and KIR3DL2, over 97% concordance for KIR2DS5, KIR2DP1, KIR2DS3, KIR2DS2, KIR3DL1, KIR2DL2, KIR2DS4, KIR2DL1, and KIR2DL5A/B, and over 95% concordance for KIR3DS1, and KIR2DS1.

We believe improved interpretation of KIR sequencing data will ultimately be achieved through longer-range sequencing technologies that can extend past the range of the shared sequence motifs, and through better imputation approaches that can more fully characterize currently described KIR alleles to provide a more robust alignment reference. While long-read data from a platform with cost-effective methods might be difficult to interpret due to the high error rates [51], the combination of long-reads and short-reads would cover the weaknesses of the respective technologies and should provide a highly accurate KIR interrogation method, indeed, long-read methods have provided valuable insights into KIR haplotypes [28]. We have not had the opportunity to test long-read technologies, but we anticipate a need for careful consideration of potential read misalignments when aligning the short and long reads together. Higher fidelity methods are currently under development, but currently the cost is prohibitive for the kind of high-throughput studies that PING was designed to address. Meanwhile, for samples for which genotypes are not easily resolvable, we recommend direct visualization of sequence alignments potentially coupled with alternative laboratory methods to more precisely determine genotypes.

While the PING workflow is specific to interpreting sequence originating from the KIR complex, the underlying strategies can be extended to other problematic genomic regions. For example, multiple-sequence per gene alignment strategies provide information for discriminating between reads derived from genes with high sequence identity and extensive nucleotide polymorphisms. Additionally, genotype-aware alignment strategies reduce bias introduced by the reference sequence for reads derived from genomic regions with high structural variation.

In conclusion, PING incorporates these innovations to provide accurate, high-throughput interpretation of the KIR region from short-read sequencing data. Together, these modifications provide a consistent KIR genotyping pipeline, creating a highly automated, robust workflow for interpreting KIR sequencing data. To the best of our knowledge, this is the only bioinformatic workflow currently available for high-resolution KIR genotyping from short-read data. Given the importance of KIR variation in human health and disease, availability of a highly accurate method to assess KIR genotypic variation should promote important discoveries related to this complex genomic region.

Supporting information

S1 Fig. European cohort copy determinations.

(EPS)

S2 Fig. Khoesan cohort copy determinations.

(EPS)

S3 Fig. Synthetic dataset copy determinations.

(EPS)

S1 Table. Diverse and minimized reference allele set.

(XLSX)

S2 Table. Virtual probe table for reference modifications.

(XLSX)

S3 Table. Discordant genotype analysis for synthetic dataset.

(XLSX)

S4 Table. Validation genotype table.

(XLSX)

S5 Table. PING determined copy number table.

(XLSX)

S6 Table. Validation copy number table.

(XLSX)

S7 Table. PING determined genotype table.

(XLSX)

S8 Table. K-mer gene match table.

(XLSX)

S9 Table. Synthetic dataset off-target read mappings.

(XLSX)

S10 Table. Benchmarking PING and KPI gene content performance.

(XLSX)

S1 Text. Genotype determination supporting methods.

(DOC)

Acknowledgments

We would like to thank Michael Wilson and Mark Seielstad for constructive comments.

Data Availability

Scripts used to process genotype data and generate figures are available from https://github.com/wesleymarin/ping_paper_scripts. PING pipeline code is available from https://github.com/wesleymarin/ping. All relevant validation genotype data are within the manuscript and its Supporting Information files. Synthetic sequence data is available from https://github.com/wesleymarin/KIR_synthetic_data.

Funding Statement

This work was supported by the National Institutes of Health (https://www.nih.gov/) (NIH-R01AI128775). Award recipients are JAH, PJN and WMM. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Colonna M, Moretta A, Vély F, Vivier E. A high-resolution view of NK-cell receptors: Structure and function. In: Immunology Today. Elsevier Ltd; 2000. p. 428–31. doi: 10.1016/s0167-5699(00)01697-2 [DOI] [PubMed] [Google Scholar]
  • 2.Björkström NK, Béziat V, Cichocki F, Liu LL, Levine J, Larsson S, et al. CD8 T cells express randomly selected KIRs with distinct specificities compared with NK cells. Blood. 2012Oct25;120(17):3455–65. doi: 10.1182/blood-2012-03-416867 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Fauriat C, Ivarsson MA, Ljunggren HG, Malmberg KJ, Michaëlsson J. Education of human natural killer cells by activating killer cell immunoglobulin-like receptors. Blood. 2010Feb11;115(6):1166–74. doi: 10.1182/blood-2009-09-245746 [DOI] [PubMed] [Google Scholar]
  • 4.Pende D, Falco M, Vitale M, Cantoni C, Vitale C, Munari E, et al. Killer Ig-like receptors (KIRs): Their role in NK cell modulation and developments leading to their clinical exploitation. Vol. 10, Frontiers in Immunology. Frontiers Media S.A.; 2019. p. 1179. doi: 10.3389/fimmu.2019.01179 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Kumar S. Natural killer cell cytotoxicity and its regulation by inhibitory receptors [Internet]. Vol. 154, Immunology. Blackwell Publishing Ltd; 2018. [cited 2020 Mar 5]. p. 383–93. Available from: http://doi.wiley.com/10.1111/imm.12921 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Parham P. MHC class I molecules and KIRs in human history, health and survival. Nat Rev Immunol [Internet]. 2005. Mar [cited 2015 Feb 10];5(3):201–14. Available from: http://www.ncbi.nlm.nih.gov/pubmed/15719024 doi: 10.1038/nri1570 [DOI] [PubMed] [Google Scholar]
  • 7.Nelson GW, Martin MP, Gladman D, Wade J, Trowsdale J, Carrington M. Cutting Edge: Heterozygote Advantage in Autoimmune Disease: Hierarchy of Protection/Susceptibility Conferred by HLA and Killer Ig-Like Receptor Combinations in Psoriatic Arthritis. J Immunol. 2004Oct1;173(7):4273–6. doi: 10.4049/jimmunol.173.7.4273 [DOI] [PubMed] [Google Scholar]
  • 8.Salie M, Daya M, Möller M, Hoal EG. Activating KIRs alter susceptibility to pulmonary tuberculosis in a South African population. Tuberculosis. 2015Dec1;95(6):817–21. doi: 10.1016/j.tube.2015.09.003 [DOI] [PubMed] [Google Scholar]
  • 9.Hirayasu K, Ohashi J, Kashiwase K, Hananantachai H, Naka I, Ogawa A, et al. Significant association of KIR2DL3-HLA-C1 combination with cerebral malaria and implications for co-evolution of KIR and HLA. PLoS Pathog. 2012Mar;8(3). doi: 10.1371/journal.ppat.1002565 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Khakoo SI, Thio CL, Martin MP, Brooks CR, Gao X, Astemborski J, et al. HLA and NK cell inhibitory receptor genes in resolving hepatitis C virus infection. Science (80-). 2004Aug6;305(5685):872–4. doi: 10.1126/science.1097670 [DOI] [PubMed] [Google Scholar]
  • 11.Martin MP, Gao X, Lee JH, Nelson GW, Detels R, Goedert JJ, et al. Epistatic interaction between KIR3DS1 and HLA-B delays the progression to AIDS. Nat Genet. 2002;31(4):429–34. doi: 10.1038/ng934 [DOI] [PubMed] [Google Scholar]
  • 12.Giebel S, Locatelli F, Lamparelli T, Velardi A, Davies S, Frumento G, et al. Survival advantage with KIR ligand incompatibility in hematopoietic stem cell transplantation from unrelated donors. Blood [Internet]. 2003. Apr 3 [cited 2019 Oct 11];102(3):814–9. Available from: http://www.bloodjournal.org/cgi/doi/10.1182/blood-2003-01-0091 [DOI] [PubMed] [Google Scholar]
  • 13.Cooley S, Trachtenberg E, Bergemann TL, Saeteurn K, Klein J, Le CT, et al. Donors with group B KIR haplotypes improve relapse-free survival after unrelated hematopoietic cell transplantation for acute myelogenous leukemia. Blood [Internet]. 2009. Jan 15 [cited 2019 Oct 11];113(3):726–32. Available from: https://ashpublications.org/blood/article/113/3/726/25172/Donors-with-group-B-KIR-haplotypes-improve doi: 10.1182/blood-2008-07-171926 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Venstrom JM, Pittari G, Gooley TA, Chewning JH, Spellman S, Haagenson M, et al. HLA-C-dependent prevention of leukemia relapse by donor activating KIR2DS1. N Engl J Med [Internet]. 2012. Aug 30 [cited 2019 Oct 11];367(9):805–16. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22931314 doi: 10.1056/NEJMoa1200503 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Nakamura R, Gendzekhadze K, Palmer J, Tsai NC, Mokhtari S, Forman SJ, et al. Influence of donor KIR genotypes on reduced relapse risk in acute myelogenous leukemia after hematopoietic stem cell transplantation in patients with CMV reactivation. Leuk Res. 2019Dec1;87:106230. doi: 10.1016/j.leukres.2019.106230 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Cooley S, Weisdorf DJ, Guethlein LA, Klein JP, Wang T, Marsh SGE, et al. Donor Killer Cell Ig-like Receptor B Haplotypes, Recipient HLA-C1, and HLA-C Mismatch Enhance the Clinical Benefit of Unrelated Transplantation for Acute Myelogenous Leukemia. J Immunol. 2014May15;192(10):4592–600. doi: 10.4049/jimmunol.1302517 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Guethlein LA, Beyzaie N, Nemat-Gorgani N, Wang T, Ramesh V, Marin WM, et al. Following transplantation for AML, donor KIR Cen B02 better protects against relapse than KIR Cen B01. J Immunol. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Martin AM, Freitas EM, Witt CS, Christiansen FT. The genomic organization and evolution of the natural killer immunoglobulin-like receptor (KIR) gene cluster. Immunogenetics [Internet]. 2000. Apr 4 [cited 2019 Oct 11];51(4–5):268–80. Available from: http://www.ncbi.nlm.nih.gov/pubmed/10803839 doi: 10.1007/s002510050620 [DOI] [PubMed] [Google Scholar]
  • 19.Uhrberg M, Valiante NM, Shum BP, Shilling HG, Lienert-Weidenbach K, Corliss B, et al. Human diversity in killer cell inhibitory receptor genes. Immunity. 1997Dec1;7(6):753–63. doi: 10.1016/s1074-7613(00)80394-5 [DOI] [PubMed] [Google Scholar]
  • 20.Wilson MJ, Torkar M, Haude A, Milne S, Jones T, Sheer D, et al. Plasticity in the organization and sequences of human KIR/ILT gene families. Proc Natl Acad Sci [Internet]. 2000. Apr 25 [cited 2019 Oct 11];97(9):4778–83. Available from: http://www.ncbi.nlm.nih.gov/pubmed/10781084 doi: 10.1073/pnas.080588597 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Robinson J, Halliwell JA, McWilliam H, Lopez R, Marsh SGE. IPD—the Immuno Polymorphism Database. Nucleic Acids Res [Internet]. 2013. Jan [cited 2015 Feb 13];41(Database issue):D1234–40. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3531162&tool=pmcentrez&rendertype=abstract doi: 10.1093/nar/gks1140 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Norman PJ, Abi-Rached L, Gendzekhadze K, Hammond JA, Moesta AK, Sharma D, et al. Meiotic recombination generates rich diversity in NK cell receptor genes, alleles, and haplotypes. Genome Res. 2009May1;19(5):757–69. doi: 10.1101/gr.085738.108 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Traherne JA, Martin M, Ward R, Ohashi M, Pellett F, Gladman D, et al. Mechanisms of copy number variation and hybrid gene formation in the KIR immune gene complex. [cited 2020 May 6]; Available from: http://www.ebi.ac.uk/ipd/kir/ [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Hollenbach JA, Nocedal I, Ladner MB, Single RM, Trachtenberg EA. Killer cell immunoglobulin-like receptor (KIR) gene content variation in the HGDP-CEPH populations. Immunogenetics [Internet]. 2012. Oct 1 [cited 2019 Oct 11];64(10):719–37. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22752190 doi: 10.1007/s00251-012-0629-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Hollenbach JA, Augusto DG, Alaez C, Bubnova L, Fae I, Fischer G, et al. 16(th) IHIW: population global distribution of killer immunoglobulin-like receptor (KIR) and ligands. Int J Immunogenet [Internet]. 2013. Feb [cited 2019 Oct 11];40(1):39–45. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23280119 doi: 10.1111/iji.12028 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Hsu KC, Chida S, Geraghty DE, Dupont B. The killer cell immunoglobulin-like receptor (KIR) genomic region: Gene-order, haplotypes and allelic polymorphism. Vol. 190, Immunological Reviews. 2002. p. 40–52. doi: 10.1034/j.1600-065x.2002.19004.x [DOI] [PubMed] [Google Scholar]
  • 27.Norman PJ, Hollenbach JA, Nemat-Gorgani N, Marin WM, Norberg SJ, Ashouri E, et al. Defining KIR and HLA Class I Genotypes at Highest Resolution via High-Throughput Sequencing. Am J Hum Genet [Internet]. 2016. Aug 4 [cited 2017 Oct 10];99(2):375–91. Available from: http://www.ncbi.nlm.nih.gov/pubmed/27486779 doi: 10.1016/j.ajhg.2016.06.023 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Roe D, Vierra-Green C, Pyo C-W, Eng K, Hall R, Kuang R, et al. Revealing complete complex KIR haplotypes phased by long-read sequencing technology. Genes Immun [Internet]. 2017. [cited 2019 Oct 11];18(3):127–34. Available from: http://www.ncbi.nlm.nih.gov/pubmed/28569259 doi: 10.1038/gene.2017.10 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Uhrberg M, Parham P, Wernet P. Definition of gene content for nine common group B haplotypes of the Caucasoid population: KIR haplotypes contain between seven and eleven KIR genes. Immunogenetics. 2002;54(4):221–9. doi: 10.1007/s00251-002-0463-7 [DOI] [PubMed] [Google Scholar]
  • 30.Amorim LM, Santos THS, Hollenbach JA, Norman PJ, Marin WM, Dandekar R, et al. Cost-effective and fast KIR gene-content genotyping by multiplex melting curve analysis. HLA [Internet]. 2018. Dec 1 [cited 2021 Mar 11];92(6):384–91. Available from: https://pubmed.ncbi.nlm.nih.gov/30468002/ doi: 10.1111/tan.13430 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Wagner I, Schefzyk D, Pruschke J, Schöfl G, Schöne B, Gruber N, et al. Allele-Level KIR Genotyping of More Than a Million Samples: Workflow, Algorithm, and Observations. Front Immunol [Internet]. 2018. Dec 4 [cited 2019 Oct 11];9:2843. Available from: http://www.ncbi.nlm.nih.gov/pubmed/30564239 doi: 10.3389/fimmu.2018.02843 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Anderson KM, Augusto DG, Dandekar R, Shams H, Zhao C, Yusufali T, et al. Killer Cell Immunoglobulin-like Receptor Variants Are Associated with Protection from Symptoms Associated with More Severe Course in Parkinson Disease. J Immunol [Internet]. 2020. Sep 1 [cited 2021 Mar 11];205(5):1323–30. Available from: https://www.jimmunol.org/content/205/5/1323 doi: 10.4049/jimmunol.2000144 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Vargas L de B, Dourado RM, Amorim LM, Ho B, Calonga-Solís V, Issler HC, et al. Single Nucleotide Polymorphism in KIR2DL1 Is Associated With HLA-C Expression in Global Populations. Front Immunol [Internet]. 2020. Aug 21 [cited 2021 Mar 11];11. Available from: /pmc/articles/PMC7478174/ [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Vukcevic D, Traherne JA, Næss S, Ellinghaus E, Kamatani Y, Dilthey A, et al. Imputation of KIR Types from SNP Variation Data. Am J Hum Genet [Internet]. 2015. [cited 2021 Jun 22];97(4):593–607. Available from: /pmc/articles/PMC4596914/ doi: 10.1016/j.ajhg.2015.09.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Roe D, Kuang R. Accurate and Efficient KIR Gene and Haplotype Inference From Genome Sequencing Reads With Novel K-mer Signatures. Front Immunol [Internet]. 2020. Nov 26 [cited 2021 Jun 22];11:1. Available from: /pmc/articles/PMC7727328/ doi: 10.3389/fimmu.2020.00001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Chen J, Madireddi S, Nagarkar D, Migdal M, Vander Heiden J, Chang D, et al. In silico tools for accurate HLA and KIR inference from clinical sequencing data empower immunogenetics on individual-patient and population scales. Brief Bioinform [Internet]. 2021. May 20 [cited 2021 Jun 22];22(3):1–11. Available from: https://academic.oup.com/bib/article/22/3/bbaa223/5906908 doi: 10.1093/bib/bbaa223 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Robinson J, Waller MJ, Stoehr P, Marsh SGE. IPD—the Immuno Polymorphism Database. Nucleic Acids Res [Internet]. 2004. Dec 17 [cited 2019 Oct 11];33(Database issue):D523–6. Available from: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gki032 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics [Internet]. 2009. Aug [cited 2021 Mar 11];25(16):2078–9. Available from: /pmc/articles/PMC2723002/ doi: 10.1093/bioinformatics/btp352 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Leaton LA, Shortt J, Kichula KM, Tao S, Nemat-Gorgani N, Mentzer AJ, et al. Conservation, extensive heterozygosity, and convergence of signaling potential all indicate a critical role for KIR3DL3 in higher primates. Front Immunol. 2019;10(JAN). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Huang W, Li L, Myers JR, Marth GT. ART: A next-generation sequencing read simulator. Bioinformatics [Internet]. 2012. Feb [cited 2021 Mar 11];28(4):593–4. Available from: /pmc/articles/PMC3278762/ doi: 10.1093/bioinformatics/btr708 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Jiang W, Johnson C, Jayaraman J, Simecek N, Noble J, Moffatt MF, et al. Copy number variation leads to considerable diversity for B but not A haplotypes of the human KIR genes encoding NK cell receptors. Genome Res. 2012Oct;22(10):1845–54. doi: 10.1101/gr.137976.112 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Gu Z, Gu L, Eils R, Schlesner M, Brors B. circlize implements and enhances circular visualization in R. Bioinformatics [Internet]. 2014. Oct 1 [cited 2019 Oct 11];30(19):2811–2. Available from: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btu393 [DOI] [PubMed] [Google Scholar]
  • 43.R Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna, Austria; 2018. Available from: https://www.r-project.org/
  • 44.Hollenbach JA, Norman PJ, Creary LE, Damotte V, Montero-Martin G, Caillier S, et al. A specific amino acid motif of HLA-DRB1 mediates risk and interacts with smoking history in Parkinson’s disease. Proc Natl Acad Sci U S A [Internet]. 2019. Apr 9 [cited 2021 Mar 11];116(15):7419–24. Available from: www.pnas.org/cgi/doi/10.1073/pnas.1821778116 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Nemat-Gorgani N, Hilton HG, Henn BM, Lin M, Gignoux CR, Myrick JW, et al. Different Selected Mechanisms Attenuated the Inhibitory Interaction of KIR2DL1 with C2 + HLA-C in Two Indigenous Human Populations in Southern Africa. J Immunol. 2018Apr15;200(8):2640–55. doi: 10.4049/jimmunol.1701780 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Marin WM, Dandekar R, Augusto DG, Yusufali T, Norman PJ, Hollenbach JA. PING [Internet]. Github. 2020. [cited 2020 Sep 29]. Available from: https://github.com/wesleymarin/PING [Google Scholar]
  • 47.Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods [Internet]. 2012. Apr 4 [cited 2019 Oct 11];9(4):357–9. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22388286 doi: 10.1038/nmeth.1923 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Pyke RM, Genolet R, Harari A, Coukos G, Gfeller D, Carter H. Computational KIR copy number discovery reveals interaction between inhibitory receptor burden and survival. Pac Symp Biocomput [Internet]. 2019. [cited 2019 Oct 11];24:148. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6417817/ [PMC free article] [PubMed] [Google Scholar]
  • 49.Li H, Wright PW, McCullen M, Anderson SK. Characterization of KIR intermediate promoters reveals four promoter types associated with distinct expression patterns of KIR subtypes. Genes Immun [Internet]. 2016. Jan 1 [cited 2021 Mar 11];17(1):66–74. Available from: /pmc/articles/PMC4724278/ doi: 10.1038/gene.2015.56 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Nutalai R, Gaudieri S, Jumnainsong A, Leelayuwat C. Regulation of KIR3DL3 expression via mirna. Genes (Basel) [Internet]. 2019. Aug 1 [cited 2021 Mar 11];10(8). Available from: /pmc/articles/PMC6723774/ doi: 10.3390/genes10080603 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Amarasinghe SL, Su S, Dong X, Zappia L, Ritchie ME, Gouil Q. Opportunities and challenges in long-read sequencing data analysis [Internet]. Vol. 21, Genome Biology. BioMed Central Ltd.; 2020. [cited 2021 Jun 25]. p. 1–16. Available from: 10.1186/s13059-020-1935-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008904.r001

Decision Letter 0

Jian Ma, Ferhat Ay

5 May 2021

Dear Dr. Hollenbach,

Thank you very much for submitting your manuscript "High-throughput Interpretation of Killer-cell Immunoglobulin-like Receptor Short-read Sequencing Data with PING" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. Specifically, we would like the issue of comparison to two recently published tools brought by both reviewers to be fully addressed as well as other major concerns raised. 

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Ferhat Ay, Ph.D

Associate Editor

PLOS Computational Biology

Jian Ma

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Authors describe a new version of the PING tool for genotyping of killer cell immunoglobulin-like receptor (KIR) genes that are critically important in immunogenomics studies and applications. Authors benchmark PING on many reference datasets and demonstrate its accuracy. While I am positive that this manuscript will present an interest for readers of PLOS Computational Biology, I highly recommend addressing the following comments regarding the description of the pipeline and benchmarking against the existing tools prior to publication:

1. I find the placement of the “K-mer similarity analysis” section confusing. Seems like the purpose of this section is to introduce the definition of shared k-mers that is used for demonstration of high repetitiveness of KIR genes, and the k-mer analysis itself is not a part of the PING tool. If so, I recommend combining this section with the section in Results as the definition of shared k-mers is pretty straightforward and probably shouldn’t take the whole section of the Methods.

2. It is difficult to create the correspondence between pipeline blocks in Figure 1 and sections of the Methods. Could you please add names of blocks from Figure 1 to the titles of sections to make understanding the pipeline easier?

3. Sections describing preprocessing of the KIR gene database and the pipeline are currently intermixed - e.g., the “Copy number determination workflow” section corresponding to the pipeline is located between sections describing manipulations with the database. Since it complicates following the Methods, I recommend making a separate section “Preprocessing the database” and make sections “Imputation of uncharacterized regions and extension of untranslated regions to generate comprehensive alignment reference sequence” and “Designing a minimized reference allele set” its subsections.

4. Line 223: initial genotyping and final genotyping are not introduced before and their purposes are unclear. Are they explained later in lines 274-289? In any case, add a short description of these steps before using them.

5. The overall description of the PING pipeline largely overlaps with the description provided in Norman et al., AJHG, 2016. Please provide a more specific description of the differences from the previous work in addition to one provided in lines 88-98.

6. Could you please comment on using long read technologies for genotyping and haplotyping KIR loci? Considering their relatively small sizes, I can imagine that a 70 kbp locus can be completely spanned by a single Pacbio / Nanopore read. Have you tried to use long read data in addition to short reads?

7. Have you benchmarked the new version of PING against recently released tools by Jieming Chen et al., Briefings in Bioinformatics, 2020 (doi.org/10.1093/bib/bbaa223) and Roe and Kuang, Front Immunol, 2020 (doi.org/10.3389/fimmu.2020.583013)? Please also add citations of these papers to the Introduction.

Reviewer #2: The manuscript describes a new tool for determination genotype and copy number for KIR genes. The tool is of huge interest and beautifully tackled the reference alignment part. It is also impressive to see that the tool is further applied to real-life scenario to see KIR genes copy number in population cohorts.

Strengths:

The manuscript stick to the theme in defining the problem and presenting a viable solution to it.

The underlying codes and reference sequences are made available through github repo.

Comments:

1. The workflow is interesting, however there seems to be no source of ground truth. Since, the authors have validated it in a simulated data, how we can be sure about the accuracy values?

2. While determining the genotypes, the authors are detecting the individual alleles and there are a few tools available for determine alleles in KIR data, the authors have not presented any comparison or references to those tools.

3. It is shown in the Figure 4 that even with 250-mer, few of the genes shared a high percentage with related genes. In such a case where the short read size is around 250 bases, how come any reliable selection of gene could be made?

4. One of the claim in the manuscript is that the reference alignment is improved in the workflow because of using an allele diversity, instead of single gene reference. However, no data is shown for improvement, and whether it is significant.

5. Genotype aware alignment might be carrying forward some biases and there would be issues in working with novel genotypes. Authors have not commented anything on this point.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008904.r003

Decision Letter 1

Jian Ma, Ferhat Ay

16 Jul 2021

Dear Dr. Hollenbach,

We are pleased to inform you that your manuscript 'High-throughput Interpretation of Killer-cell Immunoglobulin-like Receptor Short-read Sequencing Data with PING' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Ferhat Ay, Ph.D

Associate Editor

PLOS Computational Biology

Jian Ma

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: The authors have positively address the comments raised to them. They have also revised the manuscript to adapt those suggestions/comments. Furthermore, I am glad to see that the authors have acknowledged in the discussion section about possible biases or any limitations of the study or the field because of lack of ground truth data. I am pleased to recommend the acceptance of this revised article.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008904.r004

Acceptance letter

Jian Ma, Ferhat Ay

28 Jul 2021

PCOMPBIOL-D-21-00506R1

High-throughput Interpretation of Killer-cell Immunoglobulin-like Receptor Short-read Sequencing Data with PING

Dear Dr Hollenbach,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofi Zombor

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. European cohort copy determinations.

    (EPS)

    S2 Fig. Khoesan cohort copy determinations.

    (EPS)

    S3 Fig. Synthetic dataset copy determinations.

    (EPS)

    S1 Table. Diverse and minimized reference allele set.

    (XLSX)

    S2 Table. Virtual probe table for reference modifications.

    (XLSX)

    S3 Table. Discordant genotype analysis for synthetic dataset.

    (XLSX)

    S4 Table. Validation genotype table.

    (XLSX)

    S5 Table. PING determined copy number table.

    (XLSX)

    S6 Table. Validation copy number table.

    (XLSX)

    S7 Table. PING determined genotype table.

    (XLSX)

    S8 Table. K-mer gene match table.

    (XLSX)

    S9 Table. Synthetic dataset off-target read mappings.

    (XLSX)

    S10 Table. Benchmarking PING and KPI gene content performance.

    (XLSX)

    S1 Text. Genotype determination supporting methods.

    (DOC)

    Attachment

    Submitted filename: PING_Reviewer_response.docx

    Data Availability Statement

    Scripts used to process genotype data and generate figures are available from https://github.com/wesleymarin/ping_paper_scripts. PING pipeline code is available from https://github.com/wesleymarin/ping. All relevant validation genotype data are within the manuscript and its Supporting Information files. Synthetic sequence data is available from https://github.com/wesleymarin/KIR_synthetic_data.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS

    RESOURCES