Table 2.
Summary of the data sets used to estimate new amino-acid replacement matrices
| Data set | References | Seqs | Sites | Loci | Training | Testing |
|---|---|---|---|---|---|---|
| Pfam | El-Gebali et al. (2019) | 1,150,099 | 3,433,343 | 13,308 | 6654 | 6654 |
| Plant | Ran et al. (2018) | 38 | 432,014 | 1308 | 1000 | 308 |
| Bird | Jarvis et al. (2015) | 52 | 4,519,041 | 8295 | 1000 × 2 | 6295 |
| Mammal | Wu et al. (2018) | 90 | 3,050,199 | 5162 | 1000 × 2 | 3162 |
| Insect | Misof et al. (2014) | 144 | 595,033 | 2868 | 1000 | 1868 |
| Yeast | Shen et al. (2018) | 343 | 1,162,805 | 2408 | 1000 (100 seqs) | 1408 |
For each data set, we randomly subsampled half of the loci (Pfam) or 1000 MSAs (all other data sets) as the training set and used the remaining loci as the test set. For the bird and mammal data sets, we used two nonoverlapping training sets of 1000 loci each to examine the effect of random subsampling. For the yeast data set, we additionally subsampled 100 sequences from the training set because of the excessive computational burden.
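
The subsampling scheme described in the note can be sketched as follows. This is a minimal illustration only, not the procedure actually used to build the data sets: the function name `split_loci`, its parameters, the seed, and the placeholder locus names are all assumptions introduced here. It shuffles the loci, takes one or more nonoverlapping training sets of a fixed size, and leaves the remainder as the test set.

```python
import random


def split_loci(loci, n_train=1000, n_sets=1, seed=0):
    """Split loci into `n_sets` nonoverlapping training sets and a test set.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(loci)
    rng.shuffle(shuffled)

    n_used = n_train * n_sets
    if n_used > len(shuffled):
        raise ValueError("not enough loci for the requested training sets")

    # Consecutive slices of a shuffled list cannot share elements.
    training_sets = [
        shuffled[i * n_train:(i + 1) * n_train] for i in range(n_sets)
    ]
    # Everything not assigned to a training set becomes the test set.
    test_set = shuffled[n_used:]
    return training_sets, test_set


if __name__ == "__main__":
    # Sizes taken from the bird row of Table 2; locus names are placeholders.
    bird_loci = [f"locus_{i}" for i in range(8295)]
    train_sets, test = split_loci(bird_loci, n_train=1000, n_sets=2, seed=42)
    print(len(train_sets), len(train_sets[0]), len(test))  # 2 1000 6295
```

With 8295 loci and two training sets of 1000, the sketch leaves 6295 test loci, which matches the bird row; calling it with 13,308 loci and `n_train=6654` reproduces the half-and-half Pfam split.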

