eLife. 2019 Sep 5;8:e46935. doi: 10.7554/eLife.46935

Figure 5. Divergences for summary statistics comparing model-generated sequences to held-out repertoire sequences on the De Neuter et al. (2019) data set.

Each colored point represents the divergence of a summary distribution computed on a simulated pool of sequences to the distribution of the same summary on a set of sequences drawn from one of 11 repertoires (Figure 5—figure supplement 1). Each black '+' represents a similar divergence but with a random selection from the training data rather than a simulated pool of sequences. A lower divergence means more similarity with respect to the given summary. The following summary statistics, applied to the CDR3 amino acid sequence, use Jensen-Shannon divergence: acidity, aliphatic index, aromaticity, basicity, bulkiness, length (in amino acids), charge, GRAVY index, nearest neighbor Levenshtein distance, pairwise Levenshtein distance, and polarity. The following summary statistics use ℓ1 divergence: CDR3 amino acid 2mer frequency, CDR3 amino acid frequency, J gene frequency, and V gene frequency.
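As a rough illustration of the two divergences named above, the following Python sketch compares a simulated pool with a test repertoire on one discrete summary (CDR3 length) and one frequency summary (amino acid usage). The function names, the toy sequences, and the choice of base-2 logarithm are assumptions made for this example and are not taken from the article. Continuous summaries such as the GRAVY index would need to be binned before a discrete Jensen-Shannon divergence like this one could be applied.

```python
import math
from collections import Counter


def empirical_distribution(values):
    """Turn a list of discrete observations into a dict of relative frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, an assumption) between two discrete
    distributions given as dicts mapping outcome -> probability."""
    jsd = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution at this outcome
        if pk > 0:
            jsd += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            jsd += 0.5 * qk * math.log2(qk / mk)
    return jsd


def l1_divergence(p, q):
    """Sum of absolute differences between two frequency vectors (dicts)."""
    return sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in set(p) | set(q))


def amino_acid_frequencies(sequences):
    """Relative frequency of each amino acid pooled over all sequences."""
    return empirical_distribution([aa for seq in sequences for aa in seq])


# Toy CDR3 sequences, invented for illustration only.
simulated = ["CASSLGQETQYF", "CASSPGTGELFF", "CASSQDRGYEQYF"]
held_out = ["CASSLGGETQYF", "CASSYSGNTIYF", "CASSPDRGNEQFF"]

length_jsd = js_divergence(
    empirical_distribution([len(s) for s in simulated]),
    empirical_distribution([len(s) for s in held_out]),
)
aa_l1 = l1_divergence(
    amino_acid_frequencies(simulated), amino_acid_frequencies(held_out)
)
print(f"length JSD: {length_jsd:.3f}, amino acid frequency l1: {aa_l1:.3f}")
```

With base-2 logarithms the Jensen-Shannon divergence lies between 0 (identical distributions) and 1 (disjoint support), and the ℓ1 divergence between two probability vectors lies between 0 and 2, so lower values in both cases indicate closer agreement.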

Figure 5—figure supplement 1. Nearest neighbor Levenshtein distributions on the De Neuter et al. (2019) data set.

Nearest neighbor Levenshtein distance distributions for simulated sequences and test repertoire sequences. Each of the nearest neighbor Levenshtein divergences in Figure 5 is calculated between one of the colored lines (a simulated collection of sequences) and one of the gray lines (test repertoire sequences).
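
For readers unfamiliar with the nearest neighbor Levenshtein summary, the sketch below shows one way it could be computed for a collection of CDR3 amino acid sequences: for each sequence, take the smallest edit distance to any other sequence in the same collection. This is a minimal quadratic-time illustration with hypothetical function names and invented sequences, not the implementation used in the article.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (char_a != char_b),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def nearest_neighbor_distances(sequences):
    """For each sequence, the minimum Levenshtein distance to any other
    sequence in the collection (requires at least two sequences)."""
    distances = []
    for i, seq in enumerate(sequences):
        distances.append(
            min(levenshtein(seq, other) for j, other in enumerate(sequences) if j != i)
        )
    return distances


# Toy sequences, invented for illustration only.
toy_repertoire = ["CASS", "CASSL", "CAWT"]
print(nearest_neighbor_distances(toy_repertoire))
```

The resulting list of distances forms the empirical distribution that would then be compared between a simulated collection and a test repertoire.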
Figure 5—figure supplement 2. Summary statistics comparison on a multiple sclerosis data set.

Analysis as in Figure 5, but on the multiple sclerosis samples from Emerson et al. (2013), combining CD4+ and CD8+ sorts. Training was performed on 16 repertoires, with nine repertoires held out as a test set.