Author manuscript; available in PMC: 2019 Jun 15.

Published in final edited form as: J Immunol. 2018 Nov 5;201(12):3694–3704. doi: 10.4049/jimmunol.1800669

Table I. Summary of the datasets used.

The seven datasets (Khan_R, Khan_C, HEPB, UCB_H, UCB_L, Healthy_H and Healthy_L) were obtained using different sequencing methodologies, organisms and immunization protocols. The Khan_R and Khan_C datasets are the immunized mouse 1 dataset of the Khan et al. (8) study before and after the barcode correction approach, respectively. These datasets are from repeated Ig-seq of the same mouse. The majority of sequences in this Ig-seq dataset start at position 8. The Khan_R and Khan_C datasets consist of antibody amino acid sequences and the corresponding nucleotide sequences. The Khan_R dataset has the highest redundancy amongst the interrogated non-corrected datasets. We have removed the roughly 10% synthetic spike-ins from the Khan_R and Khan_C datasets. The HEPB dataset from Galson et al. (7) is from 11 participants. Standard Illumina Ig-seq was performed. The reads were gene-aligned and processed using IMGT/HighV-Quest. Owing to the choice of PCR primers, most of the sequences start at position 17. This dataset contains amino acid sequences only. The dataset’s redundancy is almost two times lower than that of the Khan_R dataset. The UCB proprietary Ig-seq datasets were obtained from 494 participants. The UCB_H and UCB_L datasets comprise 5.6m and 9.3m sequences, respectively, and contain both antibody amino acid sequences and the corresponding nucleotide sequences. The UCB datasets were aligned with IgBlast (24), their V and J genes were identified, and they were pre-filtered for stop codons; they contain full-length variable chain sequences, as described in Krawczyk et al. (31). The UCB_H and UCB_L datasets are the least redundant amongst the datasets. The Healthy_H and Healthy_L datasets come from four healthy human B cell donors from the Vander Heiden et al. (32) study. In this study, sequencing primers for both heavy and light chain genes were used at the same time, forming pooled raw nucleotide samples. The raw nucleotide Ig-seq datasets were obtained from the OAS resource (36); the sequences were then translated into amino acids and separated by antibody chain using IgBlastn (24).

Dataset name	Study description	Total dataset size	Antibody chain	Dataset average redundancy	Participants
Khan_R	Raw sequences of immunized mouse 1 from Khan et al., (8)	2.4m	Heavy	3.74	1 (mouse)
Khan_C	Barcode-corrected sequences of immunized mouse 1 from Khan et al., (8)	2.4m	Heavy	45.3	1 (mouse)
HEPB	Human hepatitis B vaccination from Galson et al., (7)	9.9m	Heavy	1.93	11
UCB_H	Proprietary UCB Ig-seq of the VH chain	5.6m	Heavy	1.15	494
UCB_L	Proprietary UCB Ig-seq of the VL chain	9.3m	Light	1.12	494
Healthy_H	VH chains from healthy human B cell donors from Vander Heiden et al., (32)	1.4m	Heavy	1.9	4
Healthy_L	VL chains from healthy human B cell donors from Vander Heiden et al., (32)	6.3m	Light	2.96	4
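
The redundancy comparisons stated in the caption can be checked directly against the table values. The following is a minimal sketch rather than part of the original analysis: the helper names (`most_redundant`, `least_redundant`, `fold_difference`) are hypothetical, and it assumes that "non-corrected" means every dataset except the barcode-corrected Khan_C.

```python
# Average redundancy values transcribed from Table I (dataset name -> value).
REDUNDANCY = {
    "Khan_R": 3.74,
    "Khan_C": 45.3,
    "HEPB": 1.93,
    "UCB_H": 1.15,
    "UCB_L": 1.12,
    "Healthy_H": 1.9,
    "Healthy_L": 2.96,
}

# Assumption: "non-corrected" covers every dataset except Khan_C,
# the only barcode-corrected dataset in the table.
NON_CORRECTED = {name: r for name, r in REDUNDANCY.items() if name != "Khan_C"}


def most_redundant(table):
    """Return the dataset name with the highest average redundancy."""
    return max(table, key=table.get)


def least_redundant(table, n):
    """Return the n dataset names with the lowest average redundancy."""
    return sorted(table, key=table.get)[:n]


def fold_difference(table, higher, lower):
    """Return how many times larger the redundancy of `higher` is than `lower`."""
    return table[higher] / table[lower]


if __name__ == "__main__":
    print(most_redundant(NON_CORRECTED))                            # Khan_R
    print(least_redundant(REDUNDANCY, 2))                           # ['UCB_L', 'UCB_H']
    print(round(fold_difference(REDUNDANCY, "Khan_R", "HEPB"), 2))  # 1.94
```

Under these assumptions, Khan_R is the most redundant non-corrected dataset, the two UCB datasets are the least redundant, and the Khan_R to HEPB ratio of about 1.94 matches the caption's "almost two times lower".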