Table 6. Ablation study of record features for duplicate classification.
Organism / Classifier | Meta Pre | Meta Rec | Seq Pre | Seq Rec | SQ Pre | SQ Rec | SQC Pre | SQC Rec | SQM Pre | SQM Rec | All Pre | All Rec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Caenorhabditis | ||||||||||||
Naïve Bayes | 0.633 | 0.628 | 0.714 | 0.714 | 0.872 | 0.833 | 0.849 | 0.808 | 0.899 | 0.880 | 0.852 | 0.809 |
Decision tree | 0.815 | 0.730 | 0.816 | 0.814 | 0.971 | 0.971 | 0.979 | 0.979 | 0.980 | 0.980 | 0.981 | 0.981 |
Danio | ||||||||||||
Naïve Bayes | 0.656 | 0.622 | 0.696 | 0.657 | 0.817 | 0.766 | 0.839 | 0.775 | 0.831 | 0.797 | 0.839 | 0.777 |
Decision tree | 0.815 | 0.730 | 0.816 | 0.814 | 0.971 | 0.971 | 0.979 | 0.979 | 0.980 | 0.980 | 0.958 | 0.958 |
Drosophila | ||||||||||||
Naïve Bayes | 0.945 | 0.941 | 0.719 | 0.718 | 0.860 | 0.827 | 0.882 | 0.849 | 0.973 | 0.973 | 0.983 | 0.983 |
Decision tree | 0.951 | 0.950 | 0.950 | 0.950 | 0.996 | 0.996 | 0.998 | 0.998 | 0.999 | 0.999 | 0.999 | 0.999 |
Escherichia | ||||||||||||
Naïve Bayes | 0.778 | 0.654 | 0.842 | 0.820 | 0.979 | 0.979 | 0.937 | 0.930 | 0.972 | 0.972 | 0.927 | 0.918 |
Decision tree | 0.719 | 0.717 | 0.842 | 0.836 | 0.982 | 0.982 | 0.981 | 0.981 | 0.981 | 0.981 | 0.981 | 0.981 |
Zea | ||||||||||||
Naïve Bayes | 0.894 | 0.881 | 0.882 | 0.855 | 0.987 | 0.986 | 0.987 | 0.986 | 0.984 | 0.984 | 0.986 | 0.986 |
Decision tree | 0.961 | 0.960 | 0.965 | 0.965 | 0.997 | 0.997 | 0.998 | 0.998 | 0.998 | 0.998 | 0.998 | 0.998 |
Pre: average precision over the two classes (DU and DI); Rec: average recall; Meta: meta-data features; Seq: sequence identity and length ratio; Q: alignment-quality features, such as Expect_value; SQ: Seq combined with Q; C: coding-region features, such as CDS_identity; SQC: Seq, Q and C combined; SQM: Seq, Q and Meta combined; All: all features.
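
As a minimal sketch of how the Pre and Rec columns could be computed, the Python below averages per-class precision and recall over the DU and DI labels. It assumes an unweighted (macro) average; the legend does not say whether the average is weighted by class size, so a support-weighted average is equally possible. The function names and the toy labels are hypothetical and are not taken from the study.

```python
from typing import Sequence, Tuple


def per_class_precision_recall(
    y_true: Sequence[str], y_pred: Sequence[str], label: str
) -> Tuple[float, float]:
    """Precision and recall for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def macro_precision_recall(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = ("DU", "DI"),
) -> Tuple[float, float]:
    """Unweighted mean of per-class precision and recall (assumed averaging scheme)."""
    scores = [per_class_precision_recall(y_true, y_pred, lab) for lab in labels]
    avg_precision = sum(p for p, _ in scores) / len(scores)
    avg_recall = sum(r for _, r in scores) / len(scores)
    return avg_precision, avg_recall


# Toy example (made-up labels, not data from Table 6).
toy_true = ["DU", "DU", "DI", "DI"]
toy_pred = ["DU", "DI", "DI", "DI"]
pre, rec = macro_precision_recall(toy_true, toy_pred)
print(f"Pre={pre:.3f} Rec={rec:.3f}")
```

With a macro average, both classes count equally regardless of how many record pairs each contains, so the result can differ from a support-weighted average when the classes are imbalanced.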