Table 4.
Distribution of clusters and their characteristics for different values of k (the number of clusters), from 500 to 3,000.
k | 500 | 1,000 | 2,000 | 3,000 |
Single Species cluster | 422 (84.4%) | 904 (90.4%) | 1897 (94.9%) | 2894 (96.5%) |
# of Phenocopy-Pairs (of 25) | 25 (100%) | 13 (52%) | 12 (48%) | 8 (32%) |
Cluster w/PT-Sim ≥ 0.4 | 92 (18.4%) | 293 (29.3%) | 526 (26.3%) | 810 (40.5%) |
# Genes | 3221 | 5886 | 6379 | 6878 |
Cluster w/GO-Sim ≥ 0.4 | 51 (10.2%) | 206 (20.6%) | 522 (26.1%) | 921 (46.1%) |
Correlation GO-Sim vs PT-Sim | 0.53 | 0.41 | 0.37 | 0.28 |
# Genes | 863 | 1800 | 2392 | 3065 |
Cluster w/PPi ≥ 75% | 21 (4.2%) | 60 (6.0%) | 174 (8.7%) | 305 (10.2%) |
# Genes | 1497 | 1858 | 2335 | 2702 |
Cluster w/PPi ≥ 33% | 63 (12.6%) | 138 (13.8%) | 286 (14.3%) | 413 (13.8%) |
# Genes | 3890 | 4322 | 4965 | 4996 |
Cluster for GO-Predictions | 90 (18%) | 196 (19.6%) | 393 (19.7%) | 611 (20.4%) |
# Genes | 2820 | 3213 | 4145 | 4546 |
# Terms | 142 | 345 | 730 | 1226 |
Precision | 72.55% | 67.91% | 63.40% | 60.31% |
Recall | 16.73% | 22.98% | 25.63% | 28.32% |
Avg. Genes/Cluster | 54 | 29 | 16 | 11 |
As an internal measure of cluster quality, we examined how the data structure changes for different values of k, ranging from 500 to 3,000. Filter 1 was applied for GO-predictions. For details, see text.
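As a reading aid, the percentages in parentheses appear to express each cluster count as a share of k. The minimal sketch below checks this reading for one row only, "Cluster w/PPi ≥ 75%", using the counts copied from the table. The names `ppi75_clusters` and `share_of_clusters` are illustrative and not taken from the paper, and the count-over-k interpretation is an assumption rather than a statement of the authors' method.

```python
# Cluster counts for the "Cluster w/PPi >= 75%" row, keyed by k
# (values copied from Table 4; the variable name is hypothetical).
ppi75_clusters = {500: 21, 1000: 60, 2000: 174, 3000: 305}


def share_of_clusters(count: int, k: int) -> float:
    """Return `count` as a percentage of `k` clusters, rounded to one decimal.

    Assumption: the parenthesised percentages are count / k.
    """
    return round(100.0 * count / k, 1)


# Recompute the percentage for every value of k in this row.
shares = {k: share_of_clusters(n, k) for k, n in ppi75_clusters.items()}

for k, pct in shares.items():
    print(f"k={k}: {ppi75_clusters[k]} clusters = {pct}%")
```

For this row the recomputed shares agree with the tabulated 4.2%, 6.0%, 8.7%, and 10.2%. The same check can be repeated for any other row by swapping in its counts.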