2024 Mar 5;20(3):e1011881. doi: 10.1371/journal.pcbi.1011881

Table 2. The mean number of liabilities per sequence for each dataset in our study.

For most datasets, we calculated the mean number of liabilities for unpaired sequences. The NGS and therapeutics subsets offer paired data (heavy and light chains together), so their values are not directly comparable to those of the single-sequence datasets. The abbreviations after the underscore denote: “H”, heavy chain; “L”, light chain; “all”, all sequences; “human”, only human antibody sequences; “nonhuman”, only non-human antibody sequences; “cst”, clinical-stage therapeutics; “market”, therapeutics on the market. In the column headers, “std” denotes the standard deviation.

Dataset	mean	std	median
genbank_H	3.59	1.96	4
genbank_L	3.07	1.95	3
genbank_all	3.43	1.97	3
genbank_human	3.37	2.04	3
genbank_nonhuman	3.57	1.79	4
NGS_all (both H+L)	6.55	2.96	6
literature_H	3.35	1.94	3
literature_L	3.00	2.02	3
literature_all	3.23	1.98	3
literature_human	3.08	2.12	3
literature_nonhuman	3.36	1.83	3
patents_H	3.31	1.94	3
patents_L	2.95	1.92	3
patents_all	3.16	1.94	3
patents_human	3.06	2.01	3
patents_nonhuman	3.38	1.75	3
therapeutics_all (both H+L)	5.85	2.62	6
therapeutics_cst (both H+L)	5.91	2.61	6
therapeutics_market (both H+L)	6.13	2.66	6
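
As a rough illustration of how the columns above could be computed and why paired rows are not comparable to single-chain rows, here is a minimal Python sketch. It is not the authors' pipeline: the function names `summarize` and `paired_counts` are hypothetical, the toy counts are invented for illustration, and the use of the sample standard deviation is an assumption (the source does not say which estimator was used).

```python
from statistics import mean, median, stdev


def summarize(counts):
    """Return (mean, std, median) of per-sequence liability counts.

    Assumption: 'std' is the sample standard deviation; the source
    does not specify sample vs. population.
    """
    return round(mean(counts), 2), round(stdev(counts), 2), median(counts)


def paired_counts(heavy, light):
    """Sum heavy- and light-chain counts per antibody for paired (H+L) data."""
    if len(heavy) != len(light):
        raise ValueError("paired data needs one light chain per heavy chain")
    return [h + l for h, l in zip(heavy, light)]


# Hypothetical toy liability counts, NOT taken from the study.
toy_heavy = [4, 2, 5, 3]
toy_light = [3, 1, 4, 2]

heavy_stats = summarize(toy_heavy)
light_stats = summarize(toy_light)
paired_stats = summarize(paired_counts(toy_heavy, toy_light))
```

In this toy case the paired mean is the sum of the two single-chain means, because each paired entry counts the liabilities of both chains. The same effect would explain why the "(both H+L)" rows in the table are roughly twice as large as the single-chain rows.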