Information preserved by different CDR3 compression schemes. (A) Relevancy of bag-of-words (BOW) representations for the CDR3α, CDR3β, and both CDR3 chains. CDR3 bag-of-words representations are vectors of dimension 20, with each entry representing the number of occurrences of a particular amino acid. (B) Information retained when compressing CDR3s using the reduced amino acid alphabets described in SI Appendix, Text 7.B. CDR3 denotes the CDR3 remapped to the reduced alphabet. (C) Information retained by different two-letter alphabets for both CDR3 chains. Polarity, solvation free energy, and normalized van der Waals volume alphabets are obtained by hierarchical clustering of amino acids with respect to these biophysical properties. (For further properties and different alphabet sizes, see SI Appendix, Table S3.) The optimal alphabet is obtained by a greedy search algorithm described in SI Appendix, Text 7.D.
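The two compression schemes named in the caption can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `bag_of_words` and `remap`, the ordering of the 20 amino acids, the example sequence, and in particular the two-letter hydrophobic/other grouping are all hypothetical choices made here. The alphabets actually used in the figure come from the clustering and greedy search described in the SI Appendix.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an arbitrary choice here).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def bag_of_words(cdr3: str) -> list[int]:
    """Return a length-20 vector of amino acid counts, discarding positional order."""
    counts = Counter(cdr3)
    return [counts[aa] for aa in AMINO_ACIDS]


def remap(cdr3: str, alphabet: dict[str, str]) -> str:
    """Rewrite a CDR3 in a reduced alphabet, keeping its length and positional order."""
    return "".join(alphabet[aa] for aa in cdr3)


# Hypothetical two-letter alphabet for illustration only:
# 'h' for a rough hydrophobic group, 'p' for everything else.
HYDROPHOBIC = set("ACFILMVW")
TWO_LETTER = {aa: ("h" if aa in HYDROPHOBIC else "p") for aa in AMINO_ACIDS}

if __name__ == "__main__":
    example = "CASSLGQAYEQYF"  # made-up example sequence
    print(bag_of_words(example))
    print(remap(example, TWO_LETTER))
```

The contrast between the two functions mirrors the panels: the bag-of-words vector keeps the full amino acid composition but drops order, whereas the reduced-alphabet remapping keeps order but merges amino acids into coarser classes.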