Abstract
The human immunoglobulin heavy chain (IGH) locus on chromosome 14 includes more than 40 functional copies of the variable gene (IGHV), which are critical for the structure of antibodies that identify and neutralize pathogenic invaders as a part of the adaptive immune system. Because of its highly repetitive sequence composition, the IGH locus has been particularly difficult to assemble or genotype when using standard short read sequencing technologies. Here we introduce ImmunoTyper-SR, an algorithmic tool for genotype and CNV analysis of the germline IGHV genes on Illumina whole genome sequencing (WGS) data using a combinatorial optimization formulation that resolves ambiguous read mappings. We have validated ImmunoTyper-SR on 12 individuals whose IGHV allele composition had been independently validated, and have assessed the concordance of its calls between WGS replicates from nine individuals. We then applied ImmunoTyper-SR on 585 COVID patients to investigate associations between IGHV alleles and anti-type I IFN autoantibodies, previously associated with COVID-19 severity.
Keywords: Genomics, Genotyping, Immunoglobulin, Immunogenomics, Computational Biology, Next Generation Sequencing, ILP, Optimization, WGS, Algorithms
eTOC
Disease association studies traditionally exclude immune system genes like IGHV due to a lack of accurate genotyping methods. ImmunoTyper-SR enables IGHV genotyping using short-read WGS. Ford et al. demonstrate an association study by analyzing IGHV genotypes and COVID-related anti-type 1 interferon autoantibody presence in a 585 patient cohort.
Introduction
Recent groundbreaking sequencing technologies producing long reads with low error profiles have allowed researchers to uncover the last hidden secrets of the DNA portion of the human genome. While these approaches represent notable steps in high fidelity haplotyping of complex and repetitive regions of the human genome (Amarasinghe et al., 2020), they are expensive and not yet scalable to the population level; furthermore they are reported to have issues with (subclonal) somatic mutations, limiting their potential clinical utility (Roberts et al., 2021).
The vast majority of available genome sequencing data, especially clinical sequence data is, and, at least in the next 3–5 years, will be generated with short read technology. Leveraging the enormous quantity of short read Whole Genome Sequencing (WGS) data available for genomic locus characterization provides an opportunity to gain biological insights into population-level diversity and disease associations. Together, such insights can help inform the development of low-cost clinical genotyping approaches that can be used to guide healthcare decisions. For example, the NIH’s NIAID COVID Consortium1 is a large collaborative effort that has generated and analyzed Illumina short read WGS data from COVID-19 patients “with the aim of discovering monogenetic and multigenic variants that contribute to disease susceptibility, severity and treatment outcomes”2. Unfortunately the characteristics of short read sequencing technologies make genotyping difficult for repetitive and homogeneous loci across the human genome.
One such challenging but biologically important region is the immunoglobulin heavy chain (IGH) locus on chromosome 14 (Box 1). The IGH locus harbors the variable (IGHV), diversity (IGHD), joining (IGHJ), and constant (IGHC) genes (or gene segments) that encode the building blocks of expressed B cell receptors (BCRs) and antibodies (Abs). The variable domain of BCRs and Abs is composed of IGHV, IGHD, and IGHJ gene segments, and is responsible for engaging directly with antigen, playing a critical role in identifying and neutralizing pathogenic viruses and bacteria. Despite the importance of BCRs and Abs in the adaptive immune system, our understanding of population-level genetic diversity in the IGH locus and its contribution to Ab function in disease remains limited (Watson and Breden, 2012; Watson et al., 2017a; Collins et al., 2020). The severe impact of the COVID-19 global pandemic stresses the need for development of tools and approaches to more fully characterize immune system genes, especially those involved in the development of neutralizing Abs. Unfortunately, due to its intricate sequence composition, IGH has remained stubbornly resistant to large scale, high resolution characterization.
Box 1. Progress and Potential.
As we progress further down the path towards precision medicine, genotype screening has become more commonplace and routine, in both consumer and research spheres. Individuals can now investigate their own germline genetics for risk markers of phenotypes ranging from Alzheimer’s disease to lactose intolerance using direct-to-consumer genotyping services. In the research and healthcare space, whole genome sequencing (WGS) is now a routine part of clinical drug trials and disease treatment such as cancer. However, our ability to examine genetic features is far from uniform, as some gene systems remain stubbornly resistant to accurate genotyping using standard whole genome sequencing approaches.
A set of outstanding examples are the genes responsible for antibodies, called immunoglobulins. These genes have a considerable amount of sequence duplication and complexity, which affords the immune system its ability to respond to, and learn from, the enormous range of pathogens that can be encountered in the environment. As a result, genotyping of germline immunoglobulins using standard short read WGS approaches is at best challenging, and at worst impossible, as is the case with the immunoglobulin heavy chain variable genes (IGHV).
To better explain the challenges involved, suppose one would like to use a standard 30x Illumina WGS sequence to genotype IGHV genes. This results in ~30k reads that originate from IGHV genes that need to be mapped. The first challenge is that many IGHV genes are highly similar, with many varying by as little as a single nucleotide. The second challenge is there are a lot of them – as many as 55 functional genes and at least as many pseudogenes – spread across 3 different chromosomes. These features cause a daunting number of ambiguous mapping targets, which must be resolved while maintaining the 30x expected read coverage. If that were not enough, the IGH region is full of structural variants, many of which are expected to be currently unknown, as new haplotypes continue to be discovered. This means that one does not know in advance the number of IGHV genes present in one’s sample and there will be additional mapping ambiguities caused by deleted and/or inserted sequences relative to the mapping reference.
To date, these challenges have been primarily avoided by using specialized sequencing methods such as AIRR-seq and long read sequencing to genotype immunoglobulins. This paper seeks to tackle these challenges head on, and to open the door to using short-read WGS, the most common and cheapest WGS data type, for immunoglobulin genotyping. Our method, called ImmunoTyper-SR, uses an innovative combinatorial optimization approach to resolve read mapping ambiguities, enabling it to provide allele-level genotyping and CNV calls for IGHV genes using short read WGS.
It is our hope that with this tool in hand, researchers and clinicians will no longer be held back by cost and data availability and can integrate IGHV genotyping into all manner of disease association studies and precision medicine pipelines. As a proof of concept, we genotyped 585 COVID-19 patients to investigate associations between IGHV genotype and anti-interferon type 1 autoantibody presence, which has been shown to be a strong effector of COVID-19 severity. While our cohort is limited in the number of cases and association conclusions, this paper represents one of the first genotype-phenotype association studies using IGHV, and we hope will serve as a guide for future researchers eager to embark down this path.
IGH is one of the most complex and dynamic regions of the human genome (Watson and Breden, 2012); it is known to contain many large structural variants (SVs), including segmental duplications, large insertions and deletions, and other copy number variants (CNVs) (Watson et al., 2013; Gadala-Maria et al., 2019a; Rodriguez et al., 2020). Of the IGH gene segments, IGHV is the most extensive gene group, and the ~800–1000 Kb region of IGH in which this gene family resides has proven particularly challenging to genotype using short read WGS data, due to its repetitive and homogeneous nature (Collins et al., 2021; Watson et al., 2017b). A given individual is likely to have upwards of 40 functional and ORF IGHV gene copies on each haplotype, and at least twice as many non-functional pseudogene copies, collectively residing across the primary IGH locus on chromosome 14 and two orphon loci located on chromosomes 15 and 16 (Rodriguez et al., 2020; Watson et al., 2013). These gene segments are short - between 165 bp and 305 bp (with a mean of 291 bp) - and are highly similar to one another in terms of sequence composition, with 40% of the known functional alleles having a sequence similarity of > 98% with at least one other allele in a different gene3.
While germline IGH genotypes have generally been overlooked in GWAS and other disease association studies, partially due to genotyping challenges, there is increasing interest in investigating the effect of IGH germline genetic variation on the adaptive immune system and disease (Watson and Breden, 2012; Collins et al., 2020). In fact IGHV germline genetic variation has recently been associated with clinical phenotypes including infectious disease and vaccine response (Avnir et al., 2016; Yeung et al., 2016; Yacoob et al., 2016; Lee et al., 2021), autoimmune/inflammatory conditions (Cho et al., 2003; Johnson et al., 2021; Parks et al., 2017), and cancer (Cui et al., 2021).
Of particular interest is COVID-19, which is primarily modulated by the innate immune system (Schultze and Aschenbrenner, 2021). However, there is increasing evidence for an association between the adaptive immune system and COVID-19 severity (Bastard et al., 2020; Wang et al., 2021; van der Wijst et al., 2021), via the presence of anti-type I interferon (IFN) autoantibodies. This presents a potential biological mechanism for an association between IGHV germline genetic variation and COVID-19 disease severity, driven by IGHV germline alleles coding anti-type I IFN antibodies.
To date, there has been only one published method/computational pipeline for germline IGHV genotyping using short read WGS data (Luo et al., 2016, 2019). Due to the genotyping challenges encountered with IGHV, the authors intended their pipeline to be applied on a population level, stating that it “is not intended to be used to accurately genotype individual genomes”. The authors indeed applied their pipeline to a cohort of 109 individuals with the purpose of compiling aggregate measures of variation, and performed gene-level genotyping, with allele-level granularity on 11 (of the 56) IGHV genes.
One other previously published method, ImmunoTyper (Ford et al., 2020), performs IGHV genotype and copy number analysis using standard single molecule real-time (SMRT) Pacific Biosciences (PacBio) long read sequence data. As a result, its application has been limited by the paucity of publicly available long read WGS datasets. ImmunoTyper cannot be easily modified to handle short read WGS data because it relies on extracting full length IGHV segments from long PacBio reads (which are then clustered using a facility location formulation, similar to balanced k-means clustering). Short reads, on the other hand, only partially cover IGHV segments.
More recently Rodriguez et al. (2020) have published a technique called IGenotyper, which can haplotype the entire IGH region at high resolution using targeted ultra-deep long read sequencing and de novo assembly. This represents a critical step towards characterizing IGH haplotype diversity and heterogeneity, however since it requires a custom sequencing approach, it is expensive to scale and as a result is more suitable for high-resolution haplotype characterization rather than high-throughput IGHV genotyping.
There are several published methods for inferring germline IGHV genotypes using adaptive immune receptor repertoire sequencing (AIRR-seq) data as input (Peres et al., 2019; Gadala-Maria et al., 2019b). This approach benefits from not having to contend with pseudogene sequence, however it introduces noise that arises from somatic hypermutation. Furthermore, these methods can only provide calls for germline IGHV alleles that are expressed in the B cell population at the time of sampling, and thus are susceptible to missing IGHV alleles with lower expression.
In this paper we introduce ImmunoTyper-SR, a computational approach for genotype and CNV analysis of functional germline IGHV genes using short read WGS data, using a database of known IGHV sequences as a reference (see Figure 1). ImmunoTyper-SR is based on a combinatorial optimization formulation that aims to minimize the total edit distance between the reads and their assigned alleles while maintaining additional constraints on the number and distribution of reads across each allele identified (see Figure 2). This approach extends our previous work on long read based IGHV genotyping by addressing additional challenges introduced through the use of short reads. As a result, ImmunoTyper-SR is one of the first short read based germline IGHV genotyping tools to offer allele-level resolution.
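To make the idea of this formulation concrete, the following is a minimal toy sketch and not the ImmunoTyper-SR implementation. It brute-forces a tiny instance instead of solving an ILP, and the cost function (total edit distance plus a penalty on deviation from an expected read depth per copy), the parameter values, and all names (`genotype_toy`, `toy_reads`, alleles `A` and `B`) are assumptions made purely for illustration.

```python
from itertools import product


def genotype_toy(dist, max_copies=2, reads_per_copy=2, penalty=3):
    """Brute-force a tiny read-to-allele assignment problem.

    dist is a list with one dict per read, mapping each candidate allele
    to the edit distance between that read and the allele. All parameters
    are illustrative assumptions, not values used by ImmunoTyper-SR.
    """
    alleles = sorted({a for row in dist for a in row})
    best = None
    # Try every combination of copy numbers for the candidate alleles.
    for copies in product(range(max_copies + 1), repeat=len(alleles)):
        cn = dict(zip(alleles, copies))
        # A read may only be assigned to a called allele (copy number > 0).
        options = [[a for a in row if cn[a] > 0] for row in dist]
        if any(not opts for opts in options):
            continue  # some read would be left without a target
        for assignment in product(*options):
            edit_cost = sum(row[a] for row, a in zip(dist, assignment))
            # Penalize deviation from the expected depth for each allele.
            depth_cost = sum(
                abs(assignment.count(a) - reads_per_copy * cn[a])
                for a in alleles
            )
            cost = edit_cost + penalty * depth_cost
            if best is None or cost < best[0]:
                best = (cost, cn, assignment)
    return best


# Hypothetical instance: two reads match allele A exactly, two match B.
toy_reads = [
    {"A": 0, "B": 1},
    {"A": 0, "B": 1},
    {"A": 1, "B": 0},
    {"A": 1, "B": 0},
]
cost, copy_numbers, assignment = genotype_toy(toy_reads)
print(cost, copy_numbers, assignment)
```

In this toy instance, calling a single allele with two copies would explain the read depth but not the mismatches, so the depth term and the edit distance term together favor one copy of each allele. Exhaustive enumeration is only feasible for a handful of reads; a real instance needs an ILP solver.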
Figure 1:

Short read mapping of IGH by conventional approaches versus ImmunoTyper-SR: ImmunoTyper-SR resolves mapping ambiguities commonly found when using standard short WGS analysis pipelines on the IGHV locus using an integer linear programming (ILP) optimization approach
Figure 2:

ImmunoTyper-SR workflow: ImmunoTyper-SR maps IGHV-relevant reads to the allele database and resolves mapping ambiguities using an ILP optimization approach, resulting in IGHV allele presence and CNV genotype calls.
We have validated ImmunoTyper-SR on 12 individuals with diverse genetic backgrounds from the 1000 Genomes Project (1000-Genomes-Project-Consortium, 2015) (1kGP), by comparing ImmunoTyper-SR genotype calls on Illumina WGS data from the NYGC (Byrska-Bishop et al., 2021) against targeted long read-based IGH assemblies generated using IGenotyper (Rodriguez et al., 2020). We then applied ImmunoTyper-SR to WGS data from a cohort of 585 individuals from the NIAID COVID Consortium (from here on the “NIAID cohort”) to investigate associations between IGHV genotypes, type I IFN autoantibodies and COVID-19 disease severity. The cohort includes nine individuals who have been independently sequenced twice; this subcohort provides additional means to assess the robustness of ImmunoTyper-SR on clinical sequencing data. We observed that ImmunoTyper-SR is able to produce accurate IGHV allele and CNV calls on both data sets, demonstrating its feasibility on Illumina WGS data with read lengths of 150 bp and moderate genome coverage. Finally, we employed a permutation test on the alleles identified by ImmunoTyper-SR to determine those strongly associated with anti-type I IFN autoantibody activity, within the limitations of the size and demographic composition of the NIAID cohort.
Results
A primary challenge in germline IGHV genotyping is the validation of computationally identified alleles. To date there has been only one IGHV genotyping protocol published that uses short read data (Luo et al., 2016, 2019). A recently published IGH haplotyping tool (Rodriguez et al., 2020) has opened the door for high quality IGH assemblies which can be used for validation.
IGHV genotyping through inference using AIRR-seq is commonplace and well established (Peres et al., 2019; Gadala-Maria et al., 2019b); however, we were unable to find any publicly available paired AIRR-seq and WGS datasets that would be suitable for comparison.
Validation using 1kGP Samples
We ran ImmunoTyper-SR on 12 publicly available high-coverage WGS of 1kGP samples, sequenced at ~30X on the Illumina NovaSeq 6000 Platform (Byrska-Bishop et al., 2021). These samples have had independent de novo IGH haplotyping performed by Rodriguez et al. using a targeted long read sequencing and assembly protocol called IGenotyper (Rodriguez et al., 2020), tailor-made for the IGH region. While this technique represents the most accurate IGH haplotyping tool published to date, it is likely that these assemblies are not 100% accurate, as the samples are derived from lymphoblastoid cell lines (LCLs). It has been noted that the use of LCLs can impact IGH genotype accuracy due to the presence of somatic V(D)J rearrangement (Rodriguez et al., 2021). This may result in somatic deletions or inconsistent read coverage, which can affect the assembly quality, and ultimately ImmunoTyper-SR’s genotype accuracy.
The distribution of genotype call results is shown in Figure 3 (per-sample results can be found in Supplemental Table 1). The mean precision and recall are 83.7% and 80%, respectively, when each allele sequence and its copy number must be identified exactly. While ImmunoTyper-SR generally provided accurate genotype calls, the ranges of precision and recall values were large, with minimum and maximum F-scores of 63% and 88%, respectively. These figures improve substantially if some limited noise can be tolerated in the sequence composition or copy number of the allele calls, as discussed in the remainder of the section.
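The exact-match scoring described above can be sketched as follows. This is an illustrative assumption about how such metrics may be computed (treating a genotype as a multiset of allele names, one entry per copy), not the evaluation code used in this study; the function name and the allele names in the example are hypothetical.

```python
from collections import Counter


def allele_call_metrics(called, truth, use_copy_number=True):
    """Precision, recall and F-score for a list of allele calls.

    Each list holds one entry per allele copy. With use_copy_number=False
    only presence/absence of each allele is scored.
    """
    if use_copy_number:
        called_c, truth_c = Counter(called), Counter(truth)
    else:
        called_c, truth_c = Counter(set(called)), Counter(set(truth))
    # Multiset intersection: copies that are both called and present.
    tp = sum((called_c & truth_c).values())
    fp = sum(called_c.values()) - tp
    fn = sum(truth_c.values()) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f_score


# Hypothetical sample: the alleles are right but the copy numbers are swapped.
called = ["geneX*02", "geneX*02", "geneY*01"]
truth = ["geneX*02", "geneY*01", "geneY*01"]
with_cnv = allele_call_metrics(called, truth)
without_cnv = allele_call_metrics(called, truth, use_copy_number=False)
```

Under this scoring, a call with the correct alleles but incorrect copy numbers is penalized only when copy number is taken into account.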
Figure 3:

ImmunoTyper-SR’s functional allele call accuracy on the 1kGP dataset (n = 12). “No CNV” measures accuracy for the presence/absence of alleles only, ignoring copy number. “3 bp allowance” counts a false positive as a true positive if there is a false negative allele within 3 bp edit distance.
Accuracy is Impacted by Novel Alleles
Because ImmunoTyper-SR is designed to find the closest allele in the database rather than to call novel alleles and variants, we investigated whether the presence of putative novel variants drives the variation seen in genotype accuracy. It is important to distinguish these putative novel alleles, which are derived from the 1kGP sample IGH assemblies, from those validated novel alleles that were added to the allele database as described in Section Allele Candidate Assignment, which were sourced from an independent, high-quality dataset.
As shown in Supplemental Figure 2, there is significant correlation between the proportional divergence of a given sample’s IGHV alleles from the allele database and ImmunoTyper-SR’s genotype call accuracy.
As previously mentioned, it is likely that some of these novel variants are in fact due to sequencing or assembly errors, or noise from IGH rearrangement due to the LCL-sourced samples. Since it is impossible to determine which of the putative novel alleles are valid, we chose not to add these novel sequences to the allele database for fear of increasing noise and model complexity.
High Genotype Accuracy for Distinct Genes
To investigate the genotype accuracy on a gene-by-gene basis, we grouped allele call results by genes across all samples. Supplemental Figure 1 demonstrates how genes that have high sequence similarity to many alleles in the database (regardless of gene “source” of each such allele) have lower allele call accuracy.
To reduce the impact of this effect, we recalculated allele call accuracy statistics by not penalizing a miscalled (false positive) allele as long as it is within 3 bp edit distance of a ground truth (false negative) allele (see Figure 3). This increases the mean F-score by 6.4 percentage points, from 81.5% to 87.9%. The tolerance threshold is chosen to be 3 bp because it is about 1% of the average length of all IGHV allele sequences.
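The tolerance rule above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in this study: it pairs false positives with false negatives greedily (an optimal matching may be preferable), and the function names, allele names, and the short toy sequences are all hypothetical (real IGHV alleles are roughly 290 bp long).

```python
from collections import Counter


def edit_distance(s, t):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(
                min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct))
            )
        prev = cur
    return prev[-1]


def tolerant_counts(called, truth, sequences, tolerance=3):
    """Return (tp, fp, fn), forgiving near-miss allele calls.

    A false positive is counted as a true positive if an unmatched false
    negative allele lies within `tolerance` edit distance of it.
    """
    called_c, truth_c = Counter(called), Counter(truth)
    exact = called_c & truth_c
    tp = sum(exact.values())
    false_pos = list((called_c - exact).elements())
    false_neg = list((truth_c - exact).elements())
    for fp_allele in false_pos:
        for fn_allele in false_neg:
            d = edit_distance(sequences[fp_allele], sequences[fn_allele])
            if d <= tolerance:
                false_neg.remove(fn_allele)  # each false negative is used once
                tp += 1
                break
    return tp, len(called) - tp, len(truth) - tp


# Hypothetical alleles: geneX*01 and geneX*02 differ by a single base.
toy_sequences = {
    "geneX*01": "ACGTACGTAC",
    "geneX*02": "ACGTACGTAT",
    "geneY*01": "TTTTTTTTTT",
}
called = ["geneX*01", "geneY*01"]
truth = ["geneX*02", "geneY*01"]
relaxed = tolerant_counts(called, truth, toy_sequences)
strict = tolerant_counts(called, truth, toy_sequences, tolerance=0)
```

With the tolerance set to zero, the same function reduces to exact-match counting, which makes the two accuracy measures directly comparable.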
ImmunoTyper-SR Reports Genotypes with Higher Granularity and Accuracy than Comparable Methods
We compared ImmunoTyper-SR’s genotype results on the 1kGP samples against the only other published short read IGHV genotyping method, Luo et al. (2019). With the exception of at most 11 two-copy genes (out of 56 IGHV genes with at least one functional allele), the Luo et al. pipeline reports only gene-level genotypes and CNV calls (see Section Comparison With Luo et al. (2019) Pipeline). We implemented the pipeline in-house and compared gene call results in Supplemental Table 2. We found that ImmunoTyper-SR had much higher precision (mean 93.1% vs 84.7% for the Luo et al. pipeline), and moderately improved recall (mean 90.1% vs 87.4%).
For allele calls, the Luo et al. pipeline calls alleles for only a small proportion of the IGHV genes present, an average of 5.75 out of 44 functional genes. For those genes that do have allele calls, ImmunoTyper-SR is much more accurate, with mean precision and recall values more than double the Luo et al. results (see Table 1). Note that the 11 genes listed in Section Comparison With Luo et al. (2019) Pipeline are those genes that the authors deemed two-copy genes with high confidence. In many cases, however, not all of those genes were called with two copies, and thus they were not selected for allele calling despite having two copies in the ground truth.
Table 1:
Allele call comparison between Luo et al. (2019) pipeline and ImmunoTyper-SR. The final column shows the number of ground truth genes in the sample that met the allele-calling criteria of the Luo et al. pipeline (i.e. that the gene is among the 11 least ambiguous genes and its number of copies is predicted to be ≤ 2); their precision and recall values are calculated only on those genes. In contrast, ImmunoTyper-SR calls alleles for all ground truth genes and its precision and recall values are thus calculated on the entire set of genes.
| Sample | Luo et al. Precision | ImmunoTyper-SR Precision | Luo et al. Recall | ImmunoTyper-SR Recall | Proportion of Functional Genes with Allele Calls by Luo et al. |
|---|---|---|---|---|---|
| HG00512 | 50.0 | 94.7 | 50.0 | 100.0 | 9 / 44 |
| HG00513 | 20.0 | 100.0 | 20.0 | 90.0 | 5 / 45 |
| HG00514 | 37.5 | 87.5 | 37.5 | 87.5 | 4 / 34 |
| HG00731 | 56.2 | 100.0 | 56.2 | 93.8 | 8 / 41 |
| HG00732 | 16.7 | 100.0 | 16.7 | 100.0 | 3 / 49 |
| HG00733 | 37.5 | 88.9 | 37.5 | 100.0 | 4 / 48 |
| HG01109 | 41.7 | 100.0 | 45.5 | 100.0 | 6 / 44 |
| HG01243 | 50.0 | 100.0 | 50.0 | 100.0 | 4 / 44 |
| HG02723 | 35.0 | 77.8 | 38.9 | 77.8 | 10 / 44 |
| HG03098 | 50.0 | 76.9 | 50.0 | 71.4 | 7 / 46 |
| NA12878 | 31.2 | 100.0 | 31.2 | 100.0 | 8 / 40 |
| NA19240 | 62.5 | 100.0 | 62.5 | 100.0 | 4 / 49 |
| Mean | 40.7 | 93.8 | 41.3 | 93.4 | 5.75 / 44 |
| Median | 39.6 | 100.0 | 42.2 | 100.0 | 5.5 / 44 |
Validation and Association with anti-Type I IFN Autoantibodies with a Cohort of 585 COVID-19 Patient WGS Data
As part of the NIAID COVID Consortium we analyzed health records and genomic data from a cohort of 585 COVID patients. The dataset includes various health information for each patient, as well as whole genome sequences performed at The American Genome Center, USUHS, and results from anti-type I IFN autoantibody (aAb) assays.
Genotype Validation Through Doubly-Sequenced Concordance
To further validate ImmunoTyper-SR on non-LCL sourced WGS samples, we compared genotype calls for nine individuals who had been independently sequenced twice as part of the NIAID cohort. The WGS were generated at the same sequencing center, which reduces the effect of systematic sequencing bias. We compared the complete set of allele calls and CNVs between the matched WGS. The mean weighted Jaccard similarity coefficient across all nine doubly-sequenced samples was 0.696. If the CNV calls are ignored, the mean score increases to 0.923, indicating many of the miscalls are due to CNVs, rather than calling the wrong allele (see Figure 4). Jaccard similarity scores for all 9 paired samples can be found in Supplemental Table 3.
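The concordance measure above can be sketched as follows, assuming the common definition of the weighted Jaccard coefficient on multisets (sum of per-allele minimum copy numbers divided by the sum of per-allele maximum copy numbers). The exact definition used for the reported scores is not restated here, so this should be read as an illustration; the function name and the allele labels are hypothetical.

```python
from collections import Counter


def weighted_jaccard(calls_a, calls_b, ignore_copy_number=False):
    """Weighted Jaccard similarity between two genotype calls.

    Each input lists one entry per called allele copy. When
    ignore_copy_number is True, the score reduces to the ordinary Jaccard
    index on the sets of called alleles.
    """
    if ignore_copy_number:
        a, b = Counter(set(calls_a)), Counter(set(calls_b))
    else:
        a, b = Counter(calls_a), Counter(calls_b)
    union = sum((a | b).values())  # per-allele maximum copy number
    shared = sum((a & b).values())  # per-allele minimum copy number
    return shared / union if union else 1.0


# Hypothetical replicate calls that agree on alleles but not on copy numbers.
replicate_1 = ["A", "A", "B", "C"]
replicate_2 = ["A", "B", "B", "C"]
score_with_cnv = weighted_jaccard(replicate_1, replicate_2)
score_without_cnv = weighted_jaccard(
    replicate_1, replicate_2, ignore_copy_number=True
)
```

In this toy example the two replicates call the same alleles, so the score is perfect once copy number is ignored, which mirrors the pattern reported for the doubly-sequenced samples.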
Figure 4:

Distribution of Jaccard similarity coefficients for doubly sequenced samples in the NIAID cohort: (left) no error tolerance in sequence composition or copy number of called alleles; (center) no error tolerance in copy number but an edit distance of ≤ 3 tolerated on called alleles; (right) no error tolerance in sequence composition of the alleles with copy number differences ignored.
Association with anti-Type I IFN Autoantibodies
Removing samples that did not have WGS or aAb assay results produced 542 samples, of which 32 tested positive for the presence of anti-type I IFN aAb. The association, effect size, and p-values are provided in Table 2 for the top most significant alleles. It is important to note that three out of four of the top most significant alleles are rare alleles, present in at most 10 individuals. In addition, this analysis suffers from a very low number of anti-type I IFN aAb cases present in this dataset. While two of the selected alleles have relatively low p-values, they remain higher than 0.01, partially due to their rarity in the cohort.
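For readers unfamiliar with permutation testing, a generic one-sided test for a single allele can be sketched as follows. This is not the analysis pipeline used in this study: the test statistic (the number of allele carriers who are autoantibody-positive), the number of permutations, and all names are assumptions chosen for illustration, and the toy cohort is hypothetical rather than drawn from the NIAID data.

```python
import random


def permutation_p_value(has_allele, has_aab, n_permutations=10000, seed=0):
    """One-sided permutation p-value for allele/autoantibody co-occurrence.

    has_allele and has_aab are equal-length lists of booleans, one entry
    per sample. The phenotype labels are shuffled to build the null
    distribution of the number of carriers who are autoantibody-positive.
    """
    rng = random.Random(seed)
    observed = sum(a and p for a, p in zip(has_allele, has_aab))
    labels = list(has_aab)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)
        stat = sum(a and p for a, p in zip(has_allele, labels))
        if stat >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical cohort of 20 samples in which all 5 carriers are positive.
carriers = [True] * 5 + [False] * 15
positives = [True] * 5 + [False] * 15
p_value = permutation_p_value(carriers, positives, n_permutations=2000)
```

Because rare alleles yield very few carriers, the null distribution of such a statistic is coarse, which is one reason small case counts limit the achievable significance.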
Table 2:
Summary statistics of IGHV alleles that are associated with Type I IFN aAb in the 542 samples from the NIAID cohort that have been tested. Top row: the proportion of the samples with each allele. Second row: the number of samples with the allele as well as the Type I IFN aAb. Third row: Number of samples with the allele but not the IFN aAb. Bottom two rows: Association between the allele and the Type I IFN aAb.
| Allele | IGHV 1–46*Novel-168 | IGHV 5–51*Novel-501 | IGHV 3–49*04 | IGHV 3–9*Novel-205 |
|---|---|---|---|---|
| Samples with genotype | 3 / 542 | 10 / 542 | 237 / 542 | 9 / 542 |
| Samples with IFN aAb | 1 | 2 | 10 | 2 |
| Samples without IFN aAb | 2 | 8 | 227 | 7 |
| Effect probability | 37.5 | 24.6 | 3.7 | 26.6 |
| P-value | 0.037 | 0.018 | 0.11 | 0.013 |
Discussion and future work
ImmunoTyper-SR successfully raises the bar for short read-based IGHV germline genotyping. While the tool has demonstrated the greatest accuracy of published Illumina-based tools to date, it is important to note that there is still room for improvement, particularly when distinguishing highly-similar alleles.
The challenges presented by novel alleles are another opportunity for improving IGHV genotype calls. ImmunoTyper-SR clearly performs best when a sample’s IGHV alleles minimally diverge from the known allele database. As the IGHV allele database is updated and improved, genotype call accuracy will likely increase. Ongoing projects such as the OGRDB (Lees et al., 2020), a curated database of germline IGHV alleles inferred from AIRR-seq data, as well as those being curated from high-quality long read assemblies, may help improve WGS short read-based germline IGHV calling. In addition, our future work includes expanding ImmunoTyper-SR to call novel alleles and variants, and further validating ImmunoTyper-SR calls using trio data.
This paper represents one of the first investigations into germline IGHV genotype-phenotype association using short read WGS data. Despite there being a hypothetical biological mechanism for the association between IGHV genotype and anti-type I IFN autoantibodies, the dataset used is very limited for this application. The low number of autoantibody-positive patients and rarity of the associated alleles severely reduces statistical power resulting in a low-confidence statistical analysis. As a result, we must judge the association results as inconclusive, despite finding two associated alleles with relatively low p-values. However, our future work includes repeating this analysis on a cohort of > 1200 COVID-19 patient WGS data, which we believe will improve our confidence.
While our investigation of association was inconclusive, there are many other potential association targets for future work. It is important to acknowledge a possibility that disease effects of germline IGHV variants can be overcome by somatic hypermutation during antibody repertoire proliferation. However, the (currently limited) body of evidence demonstrating IGHV variant effects on phenotype suggests there are many IGHV alleles that have a considerable effect on a variety of diseases. In the near future, comprehensively investigating these associations is likely only possible using short read WGS data; there are numerous WGS datasets in existence which have remained untouched with respect to IGHV, and alternative data types have additional costs and scaling challenges.
It is possible that using alleles as the explanatory variables in phenotype association is less biologically relevant than single nucleotide variants. If a single nucleotide variant is the root driver of a phenotype effect, perhaps by affecting development of a given protective or detrimental antibody, then using variants as an explanatory variable would be more suitable. However, if the effect is driven by the combination of variants, using alleles may be more suitable. Given a sufficient sized dataset, joint effects, either of variants or alleles, could also be investigated. It is also possible that other variant types, such as structural variants or CNVs, could have an effect on phenotype. Our future work includes improving ImmunoTyper-SR to provide variant calls, which will improve the ability to investigate phenotype and disease associations.
There is great opportunity for future studies that combine WGS with complete BCR repertoire data. The immediate application would be to validate WGS and inference-based germline IGHV genotype calls using orthogonal datasets. Unfortunately we were unable to find any such paired datasets for this study.
There are also open questions regarding the direct effects of germline IGHV genotype on the BCR repertoire. This effect has been investigated with repertoire data alone, however inference-based germline genotype calls are impacted by temporal effects of repertoire sampling, and as a result may be incomplete. By combining germline sequence data with repertoire data, it will be possible to create a complete picture of the interaction between germline IGHV genotype and the BCR repertoire.
By enabling short read WGS-based IGHV genotyping with allele-level granularity, ImmunoTyper-SR opens the door to applying the power of the largest WGS datasets available to uncover the mysteries of one of the least understood loci in the human genome.
STAR Methods
Resource Availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact Cenk Sahinalp (cenk.sahinalp@nih.gov).
Materials availability
This study did not generate new unique reagents.
Data and code availability
COVID patient WGS data reported in this study cannot be deposited in a public repository due to privacy agreement restrictions. To request access, contact the lead contact, Cenk Sahinalp, at cenk.sahinalp@nih.gov.
COVID patient anti-IFN autoantibody assay results are provided as supplemental data.
1000 Genomes IGH assemblies can be accessed through NCBI GenBank under project PRJNA555323. See key resources table for access URL.
ImmunoTyper-SR and all original code have been deposited on Zenodo and are publicly available as of the date of publication. DOIs are listed in the key resources table.
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | ||
| 1000 genomes WGS sample NA12878 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-19/ERR3239334/ERR3239334.1 |
| 1000 genomes WGS sample NA19240 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3989410/ERR3989410.1 |
| 1000 genomes WGS sample HG00512 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3988780/ERR3988780.1 |
| 1000 genomes WGS sample HG00513 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-21/ERR3241684/ERR3241684.1 |
| 1000 genomes WGS sample HG00514 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3988781/ERR3988781.1 |
| 1000 genomes WGS sample HG00731 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-20/ERR3241754/ERR3241754.1 |
| 1000 genomes WGS sample HG00731 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-20/ERR3241755/ERR3241755.1 |
| 1000 genomes WGS sample HG00733 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3988823/ERR3988823.1 |
| 1000 genomes WGS sample HG01109 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3988842/ERR3988842.1 |
| 1000 genomes WGS sample HG02723 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3989060/ERR3989060.1 |
| 1000 genomes WGS sample HG03098 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3989119/ERR3989119.1 |
| 1000 genomes IGH assemblies | Rodriguez et al. [8] | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA555323/ |
| NIAID COVID-19 Consortium Patient WGS Data | NIAID COVID Consortium [8] | Available upon request |
| NIAID COVID-19 Consortium anti-IFN autoantibody assay results | NIAID COVID Consortium [8] | Supplemental Data S1 |
| Software and algorithms | ||
| ImmunoTyper-SR v1.0 | This paper | 10.5281/zenodo.6513012 |
| Implementation of Luo et al. [24] | This paper | 10.5281/zenodo.6558252 |
| Gurobi v9.1.2 | www.gurobi.com | N/A |
| BWA-MEM v0.7.17 | anaconda.org/bioconda/bwa | N/A |
| Pysam v0.16.0.1 | anaconda.org/bioconda/pysam | N/A |
| Samtools v1.12 | ||
| Python | Python Software Foundation | N/A |
| GLM R package | www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm | N/A |
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Method Details
WGS Read Recruitment
A mapped WGS sample is provided to ImmunoTyper-SR in the BAM file format. Reads are extracted if they share any mapping to IGHV, the IGHV orphons on chromosome 15 and 16, or any other loci that share sequences similar to IGHV.
Allele Candidate Assignment
The extracted reads are then mapped to a database of IGHV allele reference sequences using BWA-MEM (Li, 2013) with the -a option. The allele reference database was created by combining the current IMGT allele database (Lefranc, 2003) (including pseudogene and orphon alleles) with additional germline alleles (currently unpublished), resolved using IGenotyper (Rodriguez et al., 2020). Although these novel alleles are as yet unpublished, they were resolved using PacBio sequencing on an independent dataset that is not subject to the noise present in the 1kGP samples (see subsection Annotation of 1kGP Sample Assemblies).
A read is putatively assigned to a single candidate allele or set of candidate alleles if it has a mapping of at least 50 bp to the allele reference sequence. A read can have a truncated mapping to an IGHV allele if and only if the truncation occurs either at the beginning or the end of the allele, such that truncated mappings that do not start on the first base of the allele, or end on the last base, are removed.
The edit distance between a given read and candidate allele is taken from the NM tag of the mapping, which is used below.
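The candidate rules above (at least 50 bp mapped, truncation allowed only at an allele end) can be sketched in Python as follows. This is a minimal illustration, not the ImmunoTyper-SR implementation: the `Mapping` record, its field names, and the example allele length are assumptions made for the sketch, and a real pipeline would derive these values from the BAM records and their NM tags.

```python
from dataclasses import dataclass

MIN_MAPPED_BP = 50  # minimum mapped length stated in the text


@dataclass
class Mapping:
    """Hypothetical summary of one read-to-allele alignment (0-based, end-exclusive)."""
    allele: str
    allele_len: int
    ref_start: int      # first allele base covered by the alignment
    ref_end: int        # one past the last allele base covered
    clip_left: int      # read bases clipped before the alignment
    clip_right: int     # read bases clipped after the alignment
    edit_distance: int  # value of the NM tag


def is_valid_candidate(m: Mapping) -> bool:
    """Apply the length and truncation rules to a single mapping."""
    if m.ref_end - m.ref_start < MIN_MAPPED_BP:
        return False
    # A left-truncated read must start on the first base of the allele.
    if m.clip_left > 0 and m.ref_start != 0:
        return False
    # A right-truncated read must end on the last base of the allele.
    if m.clip_right > 0 and m.ref_end != m.allele_len:
        return False
    return True


def candidate_alleles(mappings):
    """Return {allele: lowest edit distance} over the valid mappings of one read."""
    best = {}
    for m in mappings:
        if is_valid_candidate(m):
            best[m.allele] = min(m.edit_distance, best.get(m.allele, m.edit_distance))
    return best
```

A read that overhangs the start of an allele is kept, whereas a read clipped in the middle of an allele is dropped even if more than 50 bp align.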
Allele Assignment
Reads are assigned to one of their candidate alleles through an Integer Linear Programming (ILP) approach, which aims to minimize the total sequence variation (i.e. edit distance) between assigned reads and alleles, while matching the sequence depth and variance of the given WGS sample. The ILP is solved using the Gurobi package (Gurobi Optimization, LLC, 2021).
As previously noted by Luo et al. (2016, 2019), a correct read assignment for a given allele will have a read depth in the shape of a trapezoid, as the number of reads having the minimum 50 bp of mapped sequence will decrease towards the ends of the allele reference sequence. This problem could be avoided by including the non-coding flanking sequence in section Allele Assignment; however, this technique may be complicated by the well-known lack of characterized intergenic IGH sequences and haplotypes. As a result, we have opted to omit the flanking sequences in order to minimize the chance of genotype biases caused by reference haplotype sequences.
Sampling Sequence Depth and Variance.
To ensure that our read assignment matches the underlying sample’s sequencing characteristics, we empirically calculate the sequencing depth and variance from each sample by examining the WGS mapping at a representative locus. We use exon 327 of the TTN gene, located on chromosome 2. This region was selected because it is one of the longest exons in the genome, and thus provides a ~17 kbp sampling region that is likely to be relatively stable (Bang et al., 2001). We further confirmed that no other locus in the genome has a similar sequence by simulating 150 bp reads at 200X coverage from the exon using ART (Huang et al., 2012) and mapping them back to GRCh38 using BWA-MEM (Li, 2013) with the -a parameter; indeed, all reads mapped back to the exon region.
The sequencing depth and variance are calculated as the mean and variance of the mapped read depth across exon 327. To obtain the expected sequencing depth variance over the ‘sloped’ regions of the trapezoid shape found in a valid IGHV read assignment (see section Allele Assignment), we calculate the read depth variance for each position of a window of size (read length − 50) across all bases of exon 327.
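As a rough illustration of the depth sampling, the sketch below computes the mean and variance of a per-base depth vector, plus a variance for every sliding window. The function names are hypothetical, and treating the window step as a sliding-window variance is only one possible reading of the description above, not a confirmed detail of the method.

```python
from statistics import fmean, pvariance


def depth_stats(depths):
    """Mean and population variance of per-base read depth over a region."""
    return fmean(depths), pvariance(depths)


def windowed_variances(depths, window):
    """Depth variance within every full window of `window` consecutive bases.

    Assumption: `window` would be set to (read length - 50).
    """
    return [pvariance(depths[i:i + window])
            for i in range(len(depths) - window + 1)]
```

In practice the `depths` list would hold one value per base of the sampled exon, for example as produced by a depth-reporting tool run over the mapped BAM file.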
Coverage Constraints.
In order to match the assigned read depth and variance to the TTN-sampled values, we must monitor the read depth for each possible assignment that arises during the optimization process defined below. We only monitor the read depth at equally-spaced ‘landmark’ positions, which are chosen for every candidate allele in advance, to reduce optimization time. We can then constrain the mean assigned read depth across landmark positions to be within one standard deviation of the expected read depth using the sequencing depth and variance values calculated in section Sampling Sequence Depth and Variance. However, this constraint may allow for a read assignment with outlier read depth values at consecutive landmark positions that retain a mean coverage within the bounds. To combat this problem, we group landmarks (in a round-robin process) and set the constraint defined above for each landmark group independently (see Supplemental Figure 3). This reduces the likelihood that two such balancing outlier landmarks are in the same group.
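The landmark idea can be sketched as below: equally spaced positions are dealt into groups in turn, and the depth bound is checked per group rather than across the whole allele. This is an illustration under assumptions, not the authors' code; the function names, the spacing rule, and the example numbers are all hypothetical.

```python
def landmark_positions(allele_len, n_landmarks):
    """Equally spaced 0-based positions strictly inside an allele."""
    step = allele_len / (n_landmarks + 1)
    return [round(step * (i + 1)) for i in range(n_landmarks)]


def round_robin_groups(landmarks, n_groups):
    """Deal landmarks into groups in turn, so neighbours land in different groups."""
    groups = [[] for _ in range(n_groups)]
    for i, pos in enumerate(landmarks):
        groups[i % n_groups].append(pos)
    return groups


def group_mean_within_bounds(depth_at, group, mean, sd, n_sd=1.0):
    """Check that the mean depth over one landmark group is within n_sd SDs of `mean`."""
    observed = sum(depth_at[p] for p in group) / len(group)
    return abs(observed - mean) <= n_sd * sd
```

With an expected depth of 30 and SD of 5, depths of 50 and 10 at two consecutive landmarks (and 30 elsewhere) average out to 30 across all five landmarks, so a single allele-wide check would pass. Because round-robin grouping places the two outliers in different groups, each per-group check fails, which is the behaviour the grouping is meant to provide.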
Discarding reads.
The read assignment optimization strategy allows reads to be discarded. This handles reads that originate from non-IGH loci whose source sequence is not characterized in the IMGT allele database, but which nonetheless map to it because of similar sequence content. A given read can be discarded with a penalty equal to the expected number of errors for the read multiplied by a constant factor, which is set to 2 by default. This implies that, to be discarded, a read needs twice as many errors as expected relative to the closest matching allele in the database, which allows a small number of novel variants to exist without the read being discarded.
ILP Model Definitions.
Let R be the set of all WGS reads in the sample.
Let A be the set of database alleles. Let A(ri) ⊆ A be the set of candidate alleles of read ri.
Let Lj,g be a set of landmark nucleotide positions in aj that are part of landmark group g, where g is a subset of all positions in aj. Lj is the set of all landmark positions in allele aj.
Let covers(ri, aj, l) be the set of reads that have aj as candidate/mapping and cover landmark l.
Let µl,j and σl,j be the expected mean and standard deviation (SD) in read coverage of allele aj for a given landmark for a single copy of the allele. This is computed by sampling read coverage within TTN gene’s exon 327 and dividing by two to account for the diploid nature of the exon to estimate empirical mean and SD for the read coverage of landmark l.
Let λ = user-provided upper bound on the number of standard deviations allowed on the deviation of mean read coverage of any reference allele. Default 1.5.
Let min_cov = user-defined proportion of the estimated mean read coverage of any landmark position that the data needs to satisfy. Default 0.3.
Let e(ri, aj) = edit distance for the best mapping/alignment of ri on aj.
Let expected_errors(ri) = sequencing error rate × alignment length of primary mapping for ri.
Let discard_penalty_multiplier = user-provided penalty for discarding a read. Default 2.
ILP Model Variables.
where c ∈ {0,…,K} and K is the maximum allowed number of copies per allele.
ILP Model Constraints.
(1) One-hot encoding of all possible copy numbers for each allele.
(2) Prevent discarded reads from being assigned to any allele and enforce assignment to at most one copy.
(3) If at least one of the reads is assigned to allele aj, then at least one copy must be called.
(4) If Cj == c (> 0) (allele aj is called and has c copies), there must be at least c reads assigned to aj, to ensure there is no allele copy with zero reads assigned.
(5) Each landmark position in allele aj should have a minimum read coverage proportional to the expected coverage, if allele aj is called.
(6) If c copies of allele aj are called, the deviation of read coverage of a group of landmark positions away from the estimated mean is bounded.
ILP Model Objective Function.
Minimize:
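Going by the Allele Assignment and Discarding reads descriptions, the quantity being minimized combines the edit distance of every assigned read with a penalty for every discarded read. The toy below illustrates that trade-off by exhaustive search over a handful of reads. It is not the ILP itself and not the authors' formulation: it omits copy-number variables and all coverage constraints, and the function names and allele labels are hypothetical.

```python
from itertools import product

DISCARD = None  # sentinel meaning "read is discarded"


def assignment_cost(assignment, edit_dist, expected_errors, multiplier=2.0):
    """Total edit distance of assigned reads plus penalties for discarded reads."""
    cost = 0.0
    for read, allele in assignment.items():
        if allele is DISCARD:
            cost += multiplier * expected_errors[read]
        else:
            cost += edit_dist[read][allele]
    return cost


def best_assignment(edit_dist, expected_errors, multiplier=2.0):
    """Try every assignment of each read to one candidate allele or DISCARD."""
    reads = sorted(edit_dist)
    options = [list(edit_dist[r]) + [DISCARD] for r in reads]
    best, best_cost = None, float("inf")
    for choice in product(*options):
        assignment = dict(zip(reads, choice))
        cost = assignment_cost(assignment, edit_dist, expected_errors, multiplier)
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost
```

Without coverage constraints each read simply takes its cheapest option, so exhaustive search is overkill here. The point is only to show that a read whose best edit distance exceeds its discard penalty is dropped rather than assigned. In the full model the coverage constraints couple the reads together, which is why an ILP solver is needed.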
Annotation of 1kGP Sample Assemblies
We created IGHV gene and pseudogene annotations for each haplotype from the 12 1kGP IGH assemblies to act as ground truth in our evaluation of ImmunoTyper-SR allele calls. The complete allele database was mapped against each assembly using Bowtie2 (Langmead and Salzberg, 2012) (with -af --end-to-end --very-fast parameters). The mapping results are then grouped by target location, and the best mapped allele for each target location is assigned to that gene. Any assignment with an edit distance > 1 is manually reviewed.
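The grouping step can be sketched as follows, assuming the alignment output has already been flattened into (target location, allele, edit distance) tuples. That input layout and the function name are assumptions made for illustration. Ties between equally good alleles are broken alphabetically here, which is a choice of this sketch rather than something the text specifies.

```python
from collections import defaultdict


def annotate_genes(hits, review_threshold=1):
    """Pick the lowest-edit-distance allele for each target location.

    `hits` is an iterable of (target_location, allele, edit_distance) tuples.
    Assignments with edit distance above `review_threshold` are flagged
    for manual review.
    """
    by_location = defaultdict(list)
    for location, allele, edit_distance in hits:
        by_location[location].append((edit_distance, allele))

    annotations = {}
    for location, candidates in by_location.items():
        edit_distance, allele = min(candidates)
        annotations[location] = {
            "allele": allele,
            "edit_distance": edit_distance,
            "needs_review": edit_distance > review_threshold,
        }
    return annotations
```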
It is possible for an assembly to have IGHV gene copies that have considerably divergent sequence from any allele in the database. This can happen either through a true novel allele, as a result of a sequencing or assembly error, or due to large structural alterations present in the sample haplotype. This latter case is of particular concern as the 1kGP samples employed in this study are derived from lymphoblastoid B cell lines (LCL), which have undergone some degree of V(D)J rearrangement. This can introduce noise in the sequencing dataset, and can result in somatic deletions in a haplotype that may affect one or more IGHV genes.
To identify any considerably divergent IGHV genes, we generated 150 bp error-free in-silico reads from all sequences in the allele database and mapped them to the haplotypes using BWA-MEM (Li, 2013) (-a parameter). Any contiguous target region not identified above was extracted and mapped back to the allele database using BWA-MEM (Li, 2013) (-a parameter) to identify the most similar allele and edit distance.
Comparison With Luo et al. (2019) Pipeline
We implemented the pipeline as described in “Worldwide genetic variation of the IGHV and TRBV immune receptor gene families in humans” (Luo et al., 2019). IGHV gene calls and CNVs were compared against IGH assembly annotations as described in section Annotation of 1kGP Sample Assemblies.
Luo et al. limit allele calling to the following 11 genes that they determine are two-copy genes: IGHV1-18, IGHV1-24, IGHV1-45, IGHV1-58, IGHV2-26, IGHV3-20, IGHV3-72, IGHV3-73, IGHV3-74, IGHV5-51, and IGHV6-1. Since not all these genes are present in two copies in the 1kGP samples, we further limited allele calls only to those genes that are present in the above list, and for which the pipeline gave CNV calls of two.
As a result, when comparing the allele calls against a sample’s ground truth and ImmunoTyper-SR, we only used those genes that met the pipeline’s allele-calling criteria in IGHV assembly.
Measuring Concordance between Doubly-sequenced Samples
To calculate genotype call concordance between two WGS sequences from the same subject of the NIAID subcohort that has been doubly-sequenced, we used the weighted Jaccard similarity coefficient (Jw), defined as follows:
Let s be a subject who was sequenced twice and gs,i be the genotype vector corresponding to the i-th WGS sample of the subject, where the k-th element of the vector (gs,i[k]) is the number of copies of allele Ak, as called by ImmunoTyper-SR. Then, the weighted Jaccard similarity coefficient for the subject s is:
$$J_w(s) = \frac{\sum_k \min\left(g_{s,1}[k],\, g_{s,2}[k]\right)}{\sum_k \max\left(g_{s,1}[k],\, g_{s,2}[k]\right)}$$
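A minimal Python sketch of this concordance measure, assuming the standard weighted Jaccard definition (sum of element-wise minima divided by sum of element-wise maxima). Genotypes are represented as dictionaries from allele name to copy number, which is a convenience of the sketch; the allele names in the usage note are illustrative only.

```python
def weighted_jaccard(g1, g2):
    """Weighted Jaccard similarity of two allele copy-number genotypes.

    g1 and g2 map allele name -> called copy number; missing alleles count as 0.
    """
    alleles = set(g1) | set(g2)
    numerator = sum(min(g1.get(a, 0), g2.get(a, 0)) for a in alleles)
    denominator = sum(max(g1.get(a, 0), g2.get(a, 0)) for a in alleles)
    # Two empty genotypes are treated as identical (a convention chosen here).
    return numerator / denominator if denominator else 1.0
```

For example, replicate calls of {IGHV1-18*01: 2, IGHV3-20*01: 1} and {IGHV1-18*01: 1, IGHV3-20*01: 1, IGHV6-1*01: 1} share 2 copies out of a possible 4, giving a coefficient of 0.5.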
Quantification and Statistical Analysis
Statistical Association between IGHV Genotype and anti-Type I IFN Autoantibody Presence
First, the 585 samples from the NIAID cohort were filtered for the presence of α, β, ω IFN autoantibody labels. Any label with a “Yes, partially” value was modified to “Yes”, and we combined the individual α, β, ω labels into a single binary label indicating the presence of any one of the autoantibodies. IGHV alleles that were present in at least one sample’s genotype were chosen as the independent variables, in the form of binary presence/absence labels. A candidate-selection logistic regression model was then fitted between all such IGHV alleles and anti-type I IFN autoantibody presence, and those alleles with a significant p-value (≤ 0.01) were selected as top candidates.
The candidate alleles were applied in a separate logistic regression model to determine the effect size. For significance and multiple test correction, we performed a permutation test, by applying a logistic regression model to the dataset 100,000 times, where the autoantibody labels were randomly shuffled in each iteration. The resulting p-value was calculated for each allele as the proportion of the shuffled models that had a more extreme p-value than that in the original, unshuffled model.
All logistic regression was performed using the glm function in R with the family="binomial" parameter.
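The shuffling scheme of the permutation test can be sketched in Python as below. To stay dependency-free, the sketch swaps in a simpler statistic (the absolute difference in autoantibody rate between allele carriers and non-carriers) for the logistic regression p-value used in the study, and it counts shuffles that are at least as extreme as the observed value. Function names, the default number of permutations, and the seed are assumptions made for illustration.

```python
import random


def carrier_rate_gap(has_allele, has_autoantibody):
    """Absolute difference in autoantibody rate between carriers and non-carriers."""
    carriers = [y for x, y in zip(has_allele, has_autoantibody) if x]
    others = [y for x, y in zip(has_allele, has_autoantibody) if not x]
    if not carriers or not others:
        return 0.0
    return abs(sum(carriers) / len(carriers) - sum(others) / len(others))


def permutation_p_value(has_allele, has_autoantibody, n_permutations=1000, seed=0):
    """Share of label shuffles whose statistic is at least as extreme as observed."""
    rng = random.Random(seed)
    observed = carrier_rate_gap(has_allele, has_autoantibody)
    labels = list(has_autoantibody)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # break any allele/label association
        if carrier_rate_gap(has_allele, labels) >= observed:
            hits += 1
    return hits / n_permutations
```

Because the labels are shuffled while the genotypes stay fixed, the null distribution preserves both the allele frequency and the overall autoantibody rate of the cohort.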
Supplementary Material
Highlights.
Immune system genes are some of the most difficult to genotype in the human genome
ImmunoTyper-SR allows allele-level genotyping of IGH variable genes using short reads
Initial COVID-related genetic risk analysis shows potential associations with genotype
Opens the door to including IGHV in future disease association studies
Acknowledgements
This work was supported by funding from the Intramural Research Programs of the National Cancer Institute and the National Institute of Allergy and Infectious Diseases, National Institutes of Health. Ethics approval was obtained from the University of Milano-Bicocca School of Medicine, San Gerardo Hospital, Monza–Ethics Committee of the National Institute of Infectious Diseases Lazzaro Spallanzani (84/2020) (Italy), and the Comitato Etico Provinciale (NP 4000–Studio CORONAlab). STORM-Health care workers were enrolled in the STudio OsseRvazionale sullo screening dei lavoratori ospedalieri per COVID-19 (STORM-HCW) study, with approval from the local institutional review board (IRB) obtained on 18 June 2020. Information about the WGS data has been deposited in dbGaP (accession: phs0022454).
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Declaration of interests
The authors declare there are no relevant competing interests.
Disclaimer
The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
Supplemental video and/or data files
Data S1: NIAID COVID-19 Consortium anti-IFN autoantibody assay results used for STAR Methods “Statistical Association between IGHV Genotype and anti-Type I IFN Autoantibody Presence”
A complementary international effort, the COVID-19 host genetics initiative (https://www.covid19hg.org) has similar goals.
i.e., the edit distance between such a pair of alleles is no more than 2% of the length of the shorter allele
References
- 1000-Genomes-Project-Consortium, 2015. A global reference for human genetic variation. Nature 526, 68.
- Amarasinghe S, Su S, Dong X, et al., 2020. Opportunities and challenges in long-read sequencing data analysis. Genome Biology 21.
- Avnir Y, Watson CT, Glanville J, Peterson EC, Tallarico AS, Bennett AS, Qin K, Fu Y, Huang CY, Beigel JH, Breden F, Zhu Q, Marasco WA, 2016. IGHV1-69 polymorphism modulates anti-influenza antibody repertoires, correlates with IGHV utilization shifts and varies by ethnicity. Scientific Reports 6, 1–11. doi: 10.1038/srep20842.
- Bang ML, Centner T, Fornoff F, Geach AJ, Gotthardt M, McNabb M, Witt CC, Labeit D, Gregorio CC, Granzier H, Labeit S, 2001. The complete gene sequence of titin, expression of an unusual 700-kDa titin isoform, and its interaction with obscurin identify a novel Z-line to I-band linking system. Circulation Research 89, 1065–1072. doi: 10.1161/hh2301.100981.
- Bastard P, Rosen LB, Zhang Q, Michailidis E, Hoffmann HH, Zhang Y, Dorgham K, Philippot Q, Rosain J, Béziat V, et al., 2020. Autoantibodies against type I IFNs in patients with life-threatening COVID-19. Science 370, eabd4585.
- Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, Corvelo A, Clarke WE, Musunuri R, Nagulapalli K, Fairley S, Runnels A, Winterkorn L, Lowy-Gallego E, Consortium THGSV, Flicek P, Germer S, Brand H, Hall IM, Talkowski ME, Narzisi G, Zody MC, 2021. High coverage whole genome sequencing of the expanded 1000 genomes project cohort including 602 trios. bioRxiv. URL: https://www.biorxiv.org/content/early/2021/02/07/2021.02.06.430068, doi: 10.1101/2021.02.06.430068.
- Cho ML, Chen PP, Seo YI, Hwang SY, Kim WU, Min DJ, Park SH, Cho CS, 2003. Association of homozygous deletion of the Humhv3005 and the VH3-30.3 genes with renal involvement in systemic lupus erythematosus. Lupus 12, 400–405. doi: 10.1191/0961203303lu385oa.
- Collins AM, Peres A, Corcoran MM, Watson CT, Yaari G, Lees WD, Ohlin M, 2021. Commentary on population matched (pm) germline allelic variants of immunoglobulin (ig) loci: relevance in infectious diseases and vaccination studies in human populations. Genes & Immunity, 1–4.
- Collins AM, Yaari G, Shepherd AJ, Lees W, Watson CT, 2020. Germline immunoglobulin genes: Disease susceptibility genes hidden in plain sight? Current Opinion in Systems Biology 24, 100–108. doi: 10.1016/j.coisb.2020.10.011.
- Cui M, Huang J, Zhang S, Liu Q, Liao Q, Qiu X, 2021. Immunoglobulin expression in cancer cells and its critical roles in tumorigenesis. Frontiers in Immunology 12, 893.
- Ford M, Haghshenas E, Watson CT, Sahinalp SC, 2020. Genotyping and Copy Number Analysis of Immunoglobulin Heavy Chain Variable Genes Using Long Reads. iScience 23, 101508. URL: https://pubmed.ncbi.nlm.nih.gov/32896768, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7482014/, doi: 10.1016/j.isci.2020.101508.
- Gadala-Maria D, Gidoni M, Marquez S, Vander Heiden JA, Kos JT, Watson CT, O’Connor KC, Yaari G, Kleinstein SH, 2019a. Identification of subject-specific immunoglobulin alleles from expressed repertoire sequencing data. Frontiers in Immunology 10, 129.
- Gadala-Maria D, Gidoni M, Marquez S, Vanderheiden JA, Kos JT, Watson C, O'Connor KC, Yaari G, Kleinstein SH, 2019b. Identification of subject-specific immunoglobulin alleles from expressed repertoire sequencing data. Frontiers in Immunology 10, 1–12. URL: http://biorxiv.org/content/early/2018/08/31/405704.abstract, doi: 10.1101/405704.
- Gurobi Optimization, LLC, 2021. Gurobi Optimizer Reference Manual. URL: https://www.gurobi.com.
- Huang W, Li L, Myers JR, Marth GT, 2012. ART: A next-generation sequencing read simulator. Bioinformatics 28, 593–594. doi: 10.1093/bioinformatics/btr708.
- Johnson TA, Mashimo Y, Wu JY, Yoon D, Hata A, Kubo M, Takahashi A, Tsunoda T, Ozaki K, Tanaka T, et al., 2021. Association of an IGHV3-66 gene variant with Kawasaki disease. Journal of Human Genetics 66, 475–489.
- Langmead B, Salzberg SL, 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 357–359. doi: 10.1038/nmeth.1923.
- Lee JH, Toy L, Kos JT, Safonova Y, Schief WR, Havenar-Daughton C, Watson CT, Crotty S, 2021. Vaccine genetics of IGHV1-2 VRC01-class broadly neutralizing antibody precursor naïve human B cells. NPJ Vaccines 6, 1–12.
- Lees W, Busse CE, Corcoran M, Ohlin M, Scheepers C, Matsen FA, Yaari G, Watson CT, Collins A, Shepherd AJ, 2020. OGRDB: A reference database of inferred immune receptor genes. Nucleic Acids Research 48, D964–D970. doi: 10.1093/nar/gkz822.
- Lefranc M, 2003. IMGT® databases, web resources and tools for immunoglobulin and T cell receptor sequence analysis, http://www.imgt.org. Leukemia 17, 260–266.
- Li H, 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997.
- Luo S, Yu JA, Li H, Song YS, 2019. Worldwide genetic variation of the IGHV and TRBV immune receptor gene families in humans. Life Science Alliance 2.
- Luo S, Yu JA, Song YS, 2016. Estimating Copy Number and Allelic Variation at the Immunoglobulin Heavy Chain Locus Using Short Reads. PLoS Computational Biology 12, 1–21. doi: 10.1371/journal.pcbi.1005117.
- Parks T, Mirabel MM, Kado J, Auckland K, Nowak J, Rautanen A, Mentzer AJ, Marijon E, Jouven X, Perman ML, et al., 2017. Association between a common immunoglobulin heavy chain allele and rheumatic heart disease risk in Oceania. Nature Communications 8, 1–10.
- Peres A, Gidoni M, Polak P, Yaari G, 2019. RAbHIT: R Antibody Haplotype Inference Tool. Bioinformatics 35, 4840–4842. doi: 10.1093/bioinformatics/btz481.
- Roberts H, Lopopolo M, Pagnamenta A, et al., 2021. Short and long-read genome sequencing methodologies for somatic variant detection; genomic analysis of a patient with diffuse large B-cell lymphoma. Scientific Reports 11.
- Rodriguez OL, Gibson WS, Parks T, Emery M, Powell J, Strahl M, Deikus G, Auckland K, Eichler EE, Marasco WA, Sebra R, Sharp AJ, Smith ML, Bashir A, Watson CT, 2020. A Novel Framework for Characterizing Genomic Haplotype Diversity in the Human Immunoglobulin Heavy Chain Locus. Frontiers in Immunology 11, 1–16. doi: 10.3389/fimmu.2020.02136.
- Rodriguez OL, Sharp AJ, Watson CT, 2021. Limitations of lymphoblastoid cell lines for establishing genetic reference datasets in the immunoglobulin loci. bioRxiv.
- Schultze JL, Aschenbrenner AC, 2021. COVID-19 and the human innate immune system. Cell 184, 1671–1692. doi: 10.1016/j.cell.2021.02.029.
- Wang EY, Mao T, Klein J, Dai Y, Huck JD, Jaycox JR, Liu F, Zhou T, Israelow B, Wong P, et al., 2021. Diverse functional autoantibodies in patients with COVID-19. Nature 595, 283–288.
- Watson CT, Breden F, 2012. The immunoglobulin heavy chain locus: genetic variation, missing data, and implications for human disease. Genes and Immunity 13, 363. doi: 10.1038/gene.2012.12.
- Watson CT, Glanville J, Marasco WA, 2017a. The individual and population genetics of antibody immunity. Trends in Immunology 38, 459–470.
- Watson CT, Matsen FA, Jackson KJL, Bashir A, Smith ML, Glanville J, Breden F, Kleinstein SH, Collins AM, Busse CE, 2017b. Comment on “A Database of Human Immune Receptor Alleles Recovered from Population Sequencing Data”. Journal of Immunology (Baltimore, Md.: 1950) 198, 3371. doi: 10.4049/jimmunol.1700306.
- Watson CT, Steinberg KM, Huddleston J, Warren RL, Malig M, Schein J, Willsey AJ, Joy JB, Scott JK, Graves TA, Wilson RK, Holt RA, Eichler EE, Breden F, 2013. Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation. American Journal of Human Genetics 92, 530–546. doi: 10.1016/j.ajhg.2013.03.004.
- van der Wijst MG, Vazquez SE, Hartoularos GC, Bastard P, Grant T, Bueno R, Lee DS, Greenland JR, Sun Y, Perez R, et al., 2021. Type I interferon autoantibodies are associated with systemic immune alterations in patients with COVID-19. Science Translational Medicine 13, eabh2624.
- Yacoob C, Pancera M, Vigdorovich V, Oliver BG, Glenn JA, Feng J, Sather DN, McGuire AT, Stamatatos L, 2016. Differences in Allelic Frequency and CDRH3 Region Limit the Engagement of HIV Env Immunogens by Putative VRC01 Neutralizing Antibody Precursors. Cell Reports 17, 1560–1570. URL: https://linkinghub.elsevier.com/retrieve/pii/S2211124716314036, doi: 10.1016/j.celrep.2016.10.017.
- Yeung YA, Foletti D, Deng X, Abdiche Y, Strop P, Glanville J, Pitts S, Lindquist K, Sundar PD, Sirota M, et al., 2016. Germline-encoded neutralization of a Staphylococcus aureus virulence factor by the human antibody repertoire. Nature Communications 7, 1–14.
Data Availability Statement
COVID patient WGS data reported in this study cannot be deposited in a public repository due to privacy agreement restrictions. To request access, contact the lead contact, Cenk Sahinalp, at cenk.sahinalp@nih.gov.
COVID patient anti-IFN autoantibody assay results are provided as supplemental data.
1000 Genomes IGH assemblies can be accessed through NCBI GenBank under project PRJNA555323. See the key resources table for the access URL.
ImmunoTyper-SR and all original code have been deposited on Zenodo and are publicly available as of the date of publication. DOIs are listed in the key resources table.
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | ||
| 1000 genomes WGS sample NA12878 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-19/ERR3239334/ERR3239334.1 |
| 1000 genomes WGS sample NA19240 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3989410/ERR3989410.1 |
| 1000 genomes WGS sample HG00512 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3988780/ERR3988780.1 |
| 1000 genomes WGS sample HG00513 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-21/ERR3241684/ERR3241684.1 |
| 1000 genomes WGS sample HG00514 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3988781/ERR3988781.1 |
| 1000 genomes WGS sample HG00731 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-20/ERR3241754/ERR3241754.1 |
| 1000 genomes WGS sample HG00731 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-20/ERR3241755/ERR3241755.1 |
| 1000 genomes WGS sample HG00733 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3988823/ERR3988823.1 |
| 1000 genomes WGS sample HG01109 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3988842/ERR3988842.1 |
| 1000 genomes WGS sample HG02723 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3989060/ERR3989060.1 |
| 1000 genomes WGS sample HG03098 | The International Genome Sample Resource | https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/ERR3989119/ERR3989119.1 |
| 1000 genomes IGH assemblies | Rodriguez et al. [8] | https://www.ncbi.nlm.nih.gov/bioproject/PRJNA555323/ |
| NIAID COVID-19 Consortium Patient WGS Data | NIAID COVID Consortium [8] | Available upon request |
| NIAID COVID-19 Consortium anti-IFN autoantibody assay results | NIAID COVID Consortium [8] | Supplemental Data S1 |
| Software and algorithms | ||
| ImmunoTyper-SR v1.0 | This paper | 10.5281/zenodo.6513012 |
| Implementation of Luo et al. [24] | This paper | 10.5281/zenodo.6558252 |
| Gurobi v9.1.2 | www.gurobi.com | N/A |
| BWA-MEM v0.7.17 | anaconda.org/bioconda/bwa | N/A |
| Pysam v0.16.0.1 | anaconda.org/bioconda/pysam | N/A |
| Samtools v1.12 | ||
| Python | Python Software Foundation | N/A |
| GLM R package | www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm | N/A |
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
