2021 Feb 22;70(5):1046–1060. doi: 10.1093/sysbio/syab010

Table 2.

Summary of the data sets used to estimate new amino-acid replacement matrices

| Data set | Reference | Sequences | Sites | Loci | Training loci | Test loci |
|---|---|---|---|---|---|---|
| Pfam | El-Gebali et al. (2019) | 1,150,099 | 3,433,343 | 13,308 | 6654 | 6654 |
| Plant | Ran et al. (2018) | 38 | 432,014 | 1308 | 1000 | 308 |
| Bird | Jarvis et al. (2015) | 52 | 4,519,041 | 8295 | 1000 × 2 | 6295 |
| Mammal | Wu et al. (2018) | 90 | 3,050,199 | 5162 | 1000 × 2 | 3162 |
| Insect | Misof et al. (2014) | 144 | 595,033 | 2868 | 1000 | 1868 |
| Yeast | Shen et al. (2018) | 343 | 1,162,805 | 2408 | 1000 (100 seqs) | 1408 |

For each data set, we randomly subsampled half of the loci (Pfam) or 1000 MSAs (all other data sets) as the training set and used the remaining loci as the test set. For the bird and mammal data sets, we drew two nonoverlapping training sets of 1000 loci each to examine the effect of random subsampling. For the yeast data set, we additionally subsampled 100 sequences from the training set to reduce the excessive computational burden.
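
The subsampling scheme described in this note can be sketched as a random partition of loci. This is a minimal illustration and not the authors' code: the function `split_loci`, its parameters, the seed, and the `locus_i` identifiers are all hypothetical, and only the locus counts are taken from Table 2.

```python
import random


def split_loci(loci, n_train, n_training_sets=1, seed=0):
    """Randomly partition loci into nonoverlapping training sets and a test set.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(loci)
    rng.shuffle(shuffled)

    n_used = n_train * n_training_sets
    if n_used > len(shuffled):
        raise ValueError("not enough loci for the requested training sets")

    # Consecutive slices of one shuffled list cannot share a locus.
    training_sets = [
        shuffled[i * n_train:(i + 1) * n_train]
        for i in range(n_training_sets)
    ]
    test_set = shuffled[n_used:]   # every locus not used for training
    return training_sets, test_set


# Bird data set: 8295 loci, two training sets of 1000 loci each (Table 2).
bird_loci = [f"locus_{i}" for i in range(8295)]  # placeholder identifiers
training_sets, test_set = split_loci(bird_loci, n_train=1000, n_training_sets=2)
print(len(training_sets), len(training_sets[0]), len(test_set))  # 2 1000 6295
```

Calling `split_loci(pfam_loci, n_train=6654)` on a list of 13,308 loci would give the half-and-half Pfam split in the same way. Drawing every training set from a single shuffle is one simple way to guarantee that the sets do not overlap with each other or with the test set.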