Table I. Summary of the datasets used.
The seven datasets (Khan_R, Khan_C, HEPB, UCB_H, UCB_L, Healthy_H and Healthy_L) were obtained using different sequencing methodologies, organisms and immunization protocols.

The Khan_R and Khan_C datasets are the immunized mouse 1 dataset from the Khan et al. (8) study, before and after barcode correction respectively. Both come from repeated Ig-seq of the same mouse, and the majority of their sequences start at position 8. They consist of antibody amino acid sequences and the corresponding nucleotide sequences. Khan_R has the highest redundancy (3.74) among the non-corrected datasets examined. We removed the synthetic spike-ins (roughly 10% of sequences) from the Khan_R and Khan_C datasets.

The HEPB dataset from Galson et al. (7) comes from 11 participants and was produced by standard Illumina Ig-seq. The reads were gene-aligned and processed using IMGT/HighV-Quest. Because of the choice of PCR primers, most of the sequences start at position 17. This dataset contains amino acid sequences only. Its redundancy (1.93) is almost half that of Khan_R.

The proprietary UCB Ig-seq datasets were obtained from 494 participants. UCB_H and UCB_L comprise 5.6m and 9.3m sequences respectively, and contain both antibody amino acid sequences and the corresponding nucleotide sequences. The UCB datasets were aligned with IgBlast (24), their V and J genes were identified, and the sequences were pre-filtered for stop codons; they contain full-length variable chain sequences, as described in Krawczyk et al. (31). UCB_H and UCB_L are the least redundant of the datasets.

The Healthy_H and Healthy_L datasets come from four healthy human B cell donors in the Vander Heiden et al. (32) study. In that study, sequencing primers for heavy and light chain genes were used at the same time, producing pooled raw nucleotide samples. The raw nucleotide Ig-seq datasets were obtained from the OAS resource (36); the sequences were then translated into amino acids and separated by antibody chain using IgBlastn (24).
Dataset name | Study description | Total dataset size | Antibody chain | Dataset average redundancy | Participants |
---|---|---|---|---|---|
Khan_R | Raw sequences of immunized mouse 1 from Khan et al. (8) | 2.4m | Heavy | 3.74 | 1 (mouse) |
Khan_C | Barcode-corrected sequences of immunized mouse 1 from Khan et al. (8) | 2.4m | Heavy | 45.3 | 1 (mouse) |
HEPB | Human hepatitis B vaccination from Galson et al. (7) | 9.9m | Heavy | 1.93 | 11 |
UCB_H | Proprietary UCB Ig-seq of the VH chain | 5.6m | Heavy | 1.15 | 494 |
UCB_L | Proprietary UCB Ig-seq of the VL chain | 9.3m | Light | 1.12 | 494 |
Healthy_H | VH chains from healthy human B cell donors from Vander Heiden et al. (32) | 1.4m | Heavy | 1.9 | 4 |
Healthy_L | VL chains from healthy human B cell donors from Vander Heiden et al. (32) | 6.3m | Light | 2.96 | 4 |
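
The "Dataset average redundancy" column is not defined in this caption. As an illustrative sketch only, the Python snippet below assumes it means the mean number of copies per distinct sequence, that is, total sequences divided by the number of unique sequences. That definition, the function name `average_redundancy` and the toy input are all assumptions made for illustration; they are not taken from the studies cited above.

```python
from collections import Counter


def average_redundancy(sequences):
    """Mean number of copies per distinct sequence.

    Assumed definition: total sequence count / number of unique sequences.
    """
    counts = Counter(sequences)  # copies observed for each distinct sequence
    if not counts:
        raise ValueError("no sequences given")
    return sum(counts.values()) / len(counts)


# Toy input, not drawn from any dataset in Table I.
toy_reads = ["CARDYW", "CARDYW", "CARDYW", "CTRGFW", "CARNYW", "CARNYW"]
print(average_redundancy(toy_reads))  # 6 reads / 3 distinct sequences = 2.0
```

Under this reading, a value close to 1 (as for UCB_H and UCB_L) would mean that almost every sequence is observed only once.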