Pros |
• Longstanding, well-established methods to investigate functional relationships between proteins |
• Intuitive graphical representation of thousands of protein sequences simultaneously |
• Guilt-by-association methods can reveal new functional relationships for proteins independent of primary sequence |
• Variations in active site architecture can have large consequences for biocatalysis → handles for discovery |
• Deep learning, transfer learning, and autoencoding methods useful to learn complex or hidden relationships for functional inference |
• Insights into evolution of protein families, e.g., through ancestral sequence reconstruction |
• Allows users to quickly identify clusters without known representatives in sequence space |
• Unusual co-occurring domains or interacting proteins are new targets for enzyme discovery |
• Structural motifs are useful for searches independent of full-length primary sequence |
• Capable of recognizing patterns in big metagenomic datasets |
Cons |
• Heavily influenced by the quality of the underlying sequence alignment |
• Choice of BLAST e-value threshold for pruning SSNs can be subjective |
• Analysis of gene neighborhoods from metagenomes requires assembly → introduces errors, and flanking genes cannot always be recovered for low-abundance organisms |
• Enzymes with similar structural folds can catalyze a wide range of different reactions |
• Requires a large quantity of ‘labeled’ (e.g., experimentally verified) training data |
• Not all biosynthetic domains have a consistent or strong phylogenetic signal |
• Unclear how to handle or gain functional insights from ‘singletons’ |
• Relatively few structures solved from metagenomic sources |
• Classification systems limited in their ability to predict entirely new enzyme functions |