Figure 2.
Potential applications of natural language models in ecology and evolution. The simplest application is training and applying a document classifier to predict relevant documents (top row). Given a training set of relevant and non-relevant documents (may come from existing databases, a manually curated training set, or documents tagged by a set of rules), the relevance of new documents may be predicted and prioritized for manual screening and curation, or downstream information extraction. Manual screening may be used to validate predictive models or re-train and fine-tune the original classifier. Once a set of relevant documents is identified, the subjects of the documents can be explored through named entity recognition (NER; middle row). Named entities can be identified by comparing text strings to a dictionary. If a complete set of entities is not known or available, a machine learning-based NER tool can be used to predict entities and identify never-before-seen terms. Given a training set, NER can be used to identify terms in a text (for example, species, genes, proteins, locations, morphological structures) and tag their locations in a text. Once components of a document are tagged (parts of speech, named entities, numbers), relationships among them can be identified to create structured datasets for analysis (bottom row). Relationships may be inferred through term co-occurrence frequencies, sentence structures (dependency parsing), or through machine learning-based models that predict the nature of the relationship. Relational data can take a variety of forms including species interactions, biological measurements and their associated units, or networks of different relationship types (ontologies). Figure created with BioRender.com. (Online version in colour.)
