2021 Feb 22;70(5):1046–1060. doi: 10.1093/sysbio/syab010

Table 2.

Summary of the data sets used to estimate new amino-acid replacement matrices

Data set	References	Seqs	Sites	Loci	Training	Testing
Pfam	El-Gebali et al. (2019)	1,150,099	3,433,343	13,308	6654	6654
Plant	Ran et al. (2018)	38	432,014	1308	1000	308
Bird	Jarvis et al. (2015)	52	4,519,041	8295	1000 × 2	6295
Mammal	Wu et al. (2018)	90	3,050,199	5162	1000 × 2	3162
Insect	Misof et al. (2014)	144	595,033	2868	1000	1868
Yeast	Shen et al. (2018)	343	1,162,805	2408	1000 × 100 seqs	1408

For each data set, we randomly subsampled half of the loci (Pfam) or 1000 MSAs (others) as the training set and used the remaining loci as the test set. For the bird and mammal data sets, we used two nonoverlapping training sets of 1000 loci each to examine the effect of random subsampling. For the yeast data set, we additionally subsampled 100 sequences from the training set because of the excessive computational burden.
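
The subsampling scheme described in this note can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `split_loci` and `subsample_sequences`, the locus and sequence identifiers, and the fixed random seed are all assumptions made for illustration. Only the counts are taken from Table 2. The sketch also assumes that the yeast subsampling picks one set of 100 sequence names, which the note does not specify.

```python
import random


def split_loci(loci, n_train, n_sets=1, seed=0):
    """Shuffle loci, then cut n_sets nonoverlapping training sets of
    n_train loci each; everything left over becomes the test set."""
    if n_sets * n_train > len(loci):
        raise ValueError("not enough loci for the requested training sets")
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    shuffled = list(loci)
    rng.shuffle(shuffled)
    train_sets = [
        shuffled[i * n_train:(i + 1) * n_train] for i in range(n_sets)
    ]
    test_set = shuffled[n_sets * n_train:]
    return train_sets, test_set


def subsample_sequences(seq_names, k, seed=0):
    """Pick k sequence names at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(seq_names), k)


# Hypothetical identifiers; only the counts come from Table 2.
pfam_loci = [f"pfam_{i}" for i in range(13308)]
bird_loci = [f"bird_{i}" for i in range(8295)]
yeast_seqs = [f"yeast_seq_{i}" for i in range(343)]

# Pfam: half of the loci for training, the rest for testing.
pfam_train, pfam_test = split_loci(pfam_loci, n_train=len(pfam_loci) // 2)

# Bird: two nonoverlapping training sets of 1000 loci each.
bird_train, bird_test = split_loci(bird_loci, n_train=1000, n_sets=2)

# Yeast: reduce the 343 sequences to 100 for training.
yeast_subset = subsample_sequences(yeast_seqs, k=100)

print(len(pfam_train[0]), len(pfam_test))
print([len(s) for s in bird_train], len(bird_test))
print(len(yeast_subset))
```

Shuffling once and slicing guarantees that the training sets and the test set are mutually disjoint, which matches the arithmetic in the Testing column (for example, 8295 − 2 × 1000 = 6295 for the bird data set).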