Applications in Plant Sciences
. 2020 Jul 31;8(7):e11379. doi: 10.1002/aps3.11379

Not that kind of tree: Assessing the potential for decision tree–based plant identification using trait databases

Brianna K. Almeida 1, Manish Garg 2, Miroslav Kubat 2, Michelle E. Afkhami 1
PMCID: PMC7394705  PMID: 32765978

Abstract

Premise

Advancements in machine learning and the rise of accessible “big data” provide an important opportunity to improve trait‐based plant identification. Here, we applied decision‐tree induction to a subset of data from the TRY plant trait database to (1) assess the potential of decision trees for plant identification and (2) determine informative traits for distinguishing taxa.

Methods

Decision trees were induced using 16 vegetative and floral traits (689 species, 20 genera). We assessed how well the algorithm classified species from test data and pinpointed those traits that were important for identification across diverse taxa.

Results

The unpruned tree correctly placed 98% of the species in our data set into genera, indicating its promise for distinguishing among the species used to construct it. Furthermore, in the pruned tree, an average of 89% of the species from the test data sets were properly classified into their genera, demonstrating the flexibility of decision trees to also classify new species into genera within the tree. Closer inspection revealed that seven of the 16 traits were sufficient for the classification, and these traits yielded approximately two times more initial information gain than those not included.

Discussion

Our findings demonstrate the potential for tree‐based machine learning and big data in distinguishing among taxa and determining which traits are important for plant identification.

Keywords: decision tree, information gain, machine learning, plant identification, TRY plant trait database


Plant identification is critical for a wide range of biological fields and goals, ranging from understanding ecological processes, such as community assembly, to the conservation of rare and threatened species (Thessen, 2016). Historically, species have been identified using trait‐based approaches in the form of dichotomous and polyclave keys (Tilling, 1984; Edwards et al., 1987). These identification keys remain an important and widely used resource for scientists (Gaylard and Kerley, 1995; Randler, 2008), as they are convenient, inexpensive, and enable identification when tissue samples cannot be collected for molecular barcoding (Will and Rubinoff, 2004). Improving trait‐based plant identification (e.g., reducing the number of traits required for identification) could be especially useful for improving the efficacy of citizen scientists in large‐scale projects where the use of genetic tools is not feasible or cost‐effective (Gallo and Waitt, 2011; Roy et al., 2016).

Advancements in computational methods such as machine learning, in tandem with the recent rise of online, easily accessible “big data,” could provide an unprecedented opportunity to improve trait‐based identification, just as such methods have proved useful in other important ecological areas. For instance, machine learning has been applied to large databases to predict phenomena such as global surface temperatures (Casaioli et al., 2003), and underpins some of the most widely used methods for species distribution modeling (Vayssières et al., 2000; Elith et al., 2006). Machine learning has also been employed for the identification of lung cancer cell types, bird and frog calls, and tree species based on leaf shape outline (Zhou et al., 2002; Acevedo et al., 2009; Kumar et al., 2012). These applications have all yielded encouraging results with high identification accuracy; however, one limitation of these studies is that they focused on rather narrow taxonomic groups and did not consider the growing online availability (even abundance) of morphological data.

Importantly, not all data are created equal. Existing plant databases show promise for improving plant identification but do not always contain information about the same set of traits for all species, as many entries come from disparate studies. We posit that machine learning can be used to improve plant specimen identification and inform which traits are important for achieving this goal and therefore should be collected in future studies. To accomplish this, a team of plant ecologists and machine learning researchers developed decision trees using morphological traits and descriptions of species based on specimen data from the TRY plant trait database (TRYdb; Kattge et al., 2011). This study (1) informs how the plant sciences community can invest in future research and data entry efforts to increase the database value for plant identification and (2) provides a template for future investigations combining decision trees and big data for trait‐based identification.

METHODS

Trait data collection and curation

From TRYdb (Kattge et al., 2011), we selected genera that contained categorical traits that could be used in plant identification, such as flower color, fruit type, and flower sexual syndrome. In total, we compiled data for 35 traits, both vegetative and floral, to create a character matrix (Appendix S1); this initial data set included all species in TRYdb with data on any of these traits. We then filtered our data set to exclude genera that contained fewer than 25 species or species that contained less than 10 of the 35 traits. Finally, we removed traits with greater than 65% missing data, leaving a total of 16 traits available for the construction of our final model (for the final character matrix, see Appendix S2). The final set of plant taxa included 20 genera that spanned diverse species, including trees, grasses, herbs, and sedges.
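Hypothetically, with the character matrix held as a pandas DataFrame (one row per species, a `genus` column, and one column per trait), the three filtering steps could be sketched as follows. The layout, function name, and defaults are assumptions for illustration, not the authors' code; the thresholds mirror the text (at least 25 species per genus, at least 10 of 35 traits per species, at most 65% missing data per trait).

```python
import pandas as pd

def curate_matrix(df, min_species_per_genus=25, min_traits_per_species=10,
                  max_missing_frac=0.65):
    """Filter a species-by-trait character matrix (hypothetical layout:
    one row per species, a 'genus' column, remaining columns are traits)."""
    trait_cols = [c for c in df.columns if c != "genus"]
    # Drop species scored for fewer than the minimum number of traits.
    df = df[df[trait_cols].notna().sum(axis=1) >= min_traits_per_species]
    # Drop genera represented by too few species.
    df = df.groupby("genus").filter(lambda g: len(g) >= min_species_per_genus)
    # Drop traits with too much missing data among the remaining species.
    keep = [c for c in trait_cols if df[c].isna().mean() <= max_missing_frac]
    return df[["genus"] + keep]
```

Applied with the paper's thresholds, this ordering matters: trait sparsity is assessed only after sparse species and small genera have been removed.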

Decision tree induction

Decision trees rank among the most popular paradigms in machine learning and are frequently used for the automated classification of examples into classes. Formally, a decision tree is a partially ordered set of attribute‐value tests, reminiscent of biological keys. In this application of decision trees, our goal was to place species into genera using a series of trait‐based tests (e.g., “what is the fruit type of species x?”). However, before the decision tree can be used for classification, it must be induced from a training set consisting of examples (here, plant species), each described by a vector of attribute values (here, species‐specific plant traits) and labeled with a class (here, the genus to which that species belongs).

This induction is a recursive process in which a trait to be tested is chosen based on the amount of information it contains for distinguishing between classes or groups of classes (i.e., the information gain). Each decision constitutes an internal node of the tree, where the possible outcomes of the decision divide the training set into two or more subsets, each defined by one outcome. For example, for the test “what is the fruit type of species x?”, the training set would be divided into species with each fruit type (e.g., species with nuts, capsules, schizocarps, etc.). In cases where all examples in the given subset belong to the same class, a leaf node is created and assigned the corresponding class label (here, a genus). If, alternatively, the subset contains examples of two or more classes (genera), the process described above is applied recursively to the subset, creating additional internal nodes.
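The information-gain criterion described above can be made concrete in a short sketch: the gain of a trait is the entropy of the class labels minus the weighted entropy of the subsets produced by splitting on that trait. The trait and genus names below are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, trait):
    """Reduction in class-label entropy from splitting `examples`
    (a list of (traits_dict, genus) pairs) on one trait."""
    labels = [genus for _, genus in examples]
    base = entropy(labels)
    # Partition the examples by the value each one takes for this trait.
    by_value = {}
    for traits, genus in examples:
        by_value.setdefault(traits[trait], []).append(genus)
    weighted = sum(len(sub) / len(examples) * entropy(sub)
                   for sub in by_value.values())
    return base - weighted
```

For instance, if fruit type perfectly separates two genera, splitting on it removes all uncertainty, so its gain equals the initial entropy of the labels.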

Depending on the trait tested at a given internal node, diverse decision trees can be created, ranging from large trees with many internal nodes to much smaller trees that reach classes more quickly (with many or few steps to genus classification, respectively). To create a small tree, machine learning employs formulas from Shannon’s information theory at each trait‐selection step, which identify traits offering maximum information about the example’s class (Kubat, 2017). As a result, the classifier model then usually contains only tests on the most relevant and non‐redundant traits, with a high level of information gain (in other words, those traits that are best for distinguishing among plant groups). Here, we induced the decision trees using the open‐access software Weka (version 3.8; Eibe et al., 2016), whose J48 module is a Java implementation of C4.5, a foundational decision‐tree induction program (Quinlan, 1993). A non‐zero rate of error in classification from the unpruned tree would indicate that two or more species are described by the same set of traits but are labeled with different genera.
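The induction step can be sketched with scikit-learn's CART trees standing in for Weka's J48 (an assumption: scikit-learn trees do not split on nominal values directly, so categorical traits are one-hot encoded here, whereas J48 handles them natively). Setting `criterion="entropy"` selects splits by information gain, as C4.5/J48 does. The traits and genera are toy stand-ins.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the character matrix (hypothetical traits and genera).
data = pd.DataFrame({
    "fruit_type": ["nut", "nut", "capsule", "capsule"],
    "leaf_shape": ["linear", "lobed", "linear", "lobed"],
    "genus":      ["Carex", "Carex", "Silene", "Silene"],
})
X = pd.get_dummies(data[["fruit_type", "leaf_shape"]])  # encode nominal traits
y = data["genus"]

# criterion="entropy" chooses each split by information gain.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(tree.predict(X))  # resubstitution predictions
```

Because fruit type alone separates the two genera in this toy matrix, the induced tree needs only that one test, illustrating how the highest-gain trait rises to the root.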

Pruning decision trees

Once the tree has been induced, pruning can be used to improve its classification performance in terms of portability and generalization by reducing model overfitting to the training data. This process is crucial if we want a tree capable of classifying future examples (species) that were not used during tree induction. In an unpruned decision tree, a leaf node will only be created if all the species reaching it belong to the same genus; otherwise, a new test (i.e., a new internal node) is created that further splits the set of remaining species. A tree is pruned by the removal of lower‐level nodes that are unlikely to contribute to this classifier’s performance. Node termination is determined by the confidence factor and minimum number of objects. The confidence factor is a node’s contribution to the tree’s overall error. For example, if the confidence factor is set to 0.25, the node splits if the error rate at that node without splitting is 25% higher than if it does split. The minimum number of objects determines how many examples need to reach a node for splitting to occur. For example, if the minimum number of objects is 2, then at least two species must reach a node before it can be split. Although our results were robust to a wide range of settings (see Appendices S3–S19), here we report the decision tree built with the complete trait data set, a confidence factor of 0.25, and a minimum number of objects of 2, which returned a low error rate.
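scikit-learn exposes no J48-style confidence factor; the closest analogues are `min_samples_split` (Weka's minimum number of objects) and cost-complexity pruning via `ccp_alpha`. The sketch below, on a stand-in data set rather than the trait matrix, shows the qualitative effect: pruning trades a little resubstitution accuracy for a smaller, more general tree.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the species-by-trait data

# Fully grown tree: leaves are split until pure.
unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)

# Analogue of pruning: require 8 examples at a node before splitting and
# remove low-value subtrees via cost-complexity pruning.
pruned = DecisionTreeClassifier(min_samples_split=8, ccp_alpha=0.01,
                                random_state=0).fit(X, y)

print(unpruned.tree_.node_count, pruned.tree_.node_count)
```

This mirrors the paper's observation that the pruned tree is substantially smaller (240 vs. 401 leaves) while remaining accurate.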

10‐fold cross‐validation

We used a routine statistical evaluation technique, 10‐fold cross‐validation, to test how well the classifier model can place species into genera. To perform the 10‐fold cross‐validation, the entire data set was randomly partitioned into 10 equally sized subsets (“folds”). We then constructed a tree based on 90% of the data (nine folds) and tested with the remaining 10% of the data (one fold). We repeated this process holding out each of the 10 unique folds in turn. From each of the 10 trees, we gained an estimate of the classifier model’s ability to place species that were not included in the training set into genera (included in Table 1). By testing with 10 folds, we minimized the possibility that the traits identified as important in the decision tree induction may be due only to the idiosyncrasies of the random divisions in the training and testing sets. To gain an estimate of the future classification behavior of our pruned tree, we determined the average accuracy of these 10 tests (Table 1).
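The same evaluation can be sketched with scikit-learn (the study used Weka's built-in cross-validation; the iris data here is only a stand-in for the trait matrix):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the species-by-trait data
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# One accuracy per held-out fold; the mean estimates performance on species
# not used during induction, and the standard error quantifies its stability.
print(scores.mean(), scores.std() / len(scores) ** 0.5)
```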

Table 1.

Results of the 10‐fold cross‐validation. Each of the 10 trees produced during the 10‐fold cross‐validation analysis was constructed by first leaving out 10% of the data (one of the 10 folds; test data set) and inducing trees using the remaining 90% of the data (the remaining nine folds). The estimate of accuracy reflects how well the resulting tree classified each test data set, providing insight into how well the classifier model places unseen species into their genera.

Tree fold Accuracy (%)
1 84.0
2 95.6
3 81.2
4 89.9
5 91.3
6 92.8
7 85.5
8 88.4
9 86.8
10 86.8
Mean ± SE 89.1 ± 1.4

RESULTS

Curated character matrix and decision tree topology

We identified 20 genera with at least 25 species each (mean ± SE = 35 ± 3.75 species/genus). After filtering, the character matrix contained 16 traits and 689 species across these 20 genera. Despite filtering, data were still sparse within the matrix, with each genus having an average of 60.1% (±3.5%) of data available (range: 29–81%). The pruned decision tree built from this data matrix contained 240 leaves (Appendix S20), where each “leaf” is a genus and multiple paths can be taken to the same genus. Similar to polyclave keys, this flexibility allows the decision tree to properly classify species when within‐genus trait variation occurs. The pruned tree used seven of the 16 traits we collected from TRYdb (i.e., ~40% of the traits) to distinguish among the genera (Table 2). The unpruned decision tree had 401 leaves and used 11 of the 16 traits (Appendix S21), which corresponds to 67% more leaves and 57% more traits than in the pruned tree. To classify a species into a genus, the pruned tree used three traits on average, which was 16% fewer steps compared with the unpruned tree.

Table 2.

Description and information content of the 16 plant traits from the TRY plant trait database (TRYdb).

Trait a,b Information gain Description c
Leaf shape 2.3554 Describes the leaf veins, lobes, or leaflets (always grass‐like, linear, always long‐leaf).
Fruit type 2.3318 Describes the form of the ripened ovary (nut, capsule, schizocarp).
Plant growth form 2.0645 The whole plant growth form with respect to woodiness (herb, forbs, graminoid).
Flower color 1.8353 The color of the flowers (green, brown, yellow).
Flower sexual system 1.7827 Referring to the presence of the stamen and carpel on an individual plant (hermaphroditic, monoecious, andromonoecious).
Leaf distribution along the axis (arrangement) 1.5035 The form in which leaves cluster on the stem (rosette, semi‐rosette, alternate).
Seed morphology 1.2794 The form of the fully mature fertilized ovule and its associated structures (elaiosome, open structures, flat appendages).
Apomixis 0.9106 A feature of the whole plant characterizing the apomixis with respect to its pollination needs (amphimictic, sexual, obligate apomictic).
Species reproduction types 0.8188 Spore, seed, or vegetative structures (by seed and vegetatively, mostly by seed, rarely by seed).
Leaf shape 5: leaf base 0.6729 Describes the curvature at the base of the leaf where attached to the petiole/stem (cuneate, cordate, rounded).
Leaf shape 6: leaf petiole type 0.6571 Describes the presence of the petiole and any appendages associated with it (petiolate, sessile, subsessile).
Dicliny (monoecious, dioecious, hermaphrodite) 0.6374 A feature of the whole plant defining the spatial separation of sexes on one or several flowers and/or individual plants (hermaphrodite, monoecious, dioecious).
Leaf shape 2: outline 0.5969 Describes the curvature of the leaf margins (toothed, lobed, serrulate).
Shoot growth form 0.507 A feature of the whole plant defining the growth form with respect to stem branching mode (stem ascending to prostrate, stem erect, stem prostrate).
Fertilization 0.448 Refers to the genotypic mixing required for fertilization to occur (apomictic, automatic self, obligatory cross).
Leaf shape 3: pointed/round 0.3384 Describes the curvature of the leaf apex (rounded, point, mucronate).
a The traits shaded in gray were included in the final decision tree by the machine learning algorithm.

b TRYdb uses trait definitions from Garnier et al. (2017).

c Three categories of each trait are shown in parentheses after the definition. In cases with more than three possible categories, please see Appendix S22 for additional categories. For the definitions of these categories, consult the TRY website (https://www.try‐db.org/TryWeb/Data.php [accessed July 2019]).

Assignment of species to genera and information gain of traits

A total of 98% of the species in our complete data set were correctly assigned to genera using the unpruned tree. This result indicates that, in the given database, a few species from different genera have the same trait values, which makes it difficult to classify these few species unless additional traits are added that are able to distinguish among these taxa. The unpruned tree is also likely to be overfit to the training data, and thus while this type of tree may be useful for identification of taxa used in its construction, it is unlikely to be able to identify new species.

The pruned tree may be better suited to classify new species (those not used in its induction) into the genera present in the tree. Cross‐validation estimated that our pruned decision trees can correctly assign 89.1% ± 1.35% of the species not used in the tree induction to genera on average (Table 1), suggesting that nine out of 10 new species (i.e., species not included in the induction of the decision tree) are likely to be classified into their genera. The 89% value is likely a conservative estimate of the classification accuracy for this final pruned tree, because the estimate is based on the testing of trees constructed with only nine folds of the data. The frequency of misassignments of species into genera was fairly low (mean across genera ± SE = 14% ± 3%). However, there was substantial variation in the misassignment rates among genera (Fig. 1), with five genera showing misassignment rates of 26–41%. When we compared the proportion of missing information for each genus to the rate of misassignment by the tree using a linear regression, we found that the amount of trait information missing from a genus did not explain a significant amount of the variation in the misassignment frequency among genera (F 1,18 = 1.36, P = 0.26). However, we note that genera with low misassignment rates often had high agreement among species for the value of traits. For example, the genera Alchemilla, Galium, and Hieracium were misassigned 0% of the time and had three to five traits with 100% agreement for their values, while the genera Ranunculus and Silene were misassigned ~40% of the time and lacked any trait with 100% agreement.

Figure 1.


Genera of examined plants ordered by the proportion of taxonomic misassignment when using the decision tree for plant identification. The proportion of misassignment was calculated by dividing the number of species from a given genus that were not assigned to that genus by the total number of species examined for that genus.

Ultimately, the machine learning algorithm selected seven traits to be included in the final pruned tree. These traits were the flower sexual system, seed morphology, fruit type, apomixis, leaf distribution, species reproduction type, and a component of leaf shape. Details on these traits can be found in Table 2 and Appendix S22. Traits that were included in the pruned decision tree had approximately twice as much initial information gain as the traits that were excluded (Wilcoxon signed‐rank test = 10, P = 0.02; information gain of included traits = 1.5 ± 0.23, gain of excluded traits = 0.81 ± 0.02), indicating that initial information content may be a reasonable proxy for assessing the value of traits for distinguishing among groups of taxa. However, it is worth noting that two traits with high initial information gain (i.e., plant growth form and flower color) were excluded from the final pruned tree. This is likely due to their redundancy with other traits of high information gain, meaning they differentiate among taxa in a similar way as other traits that have an even higher information gain and thus come earlier in the tree. In addition, when we regressed trait information gain on the proportion of missing information of that trait, we found that the proportion of missing information explained a significant amount of the variation in the initial information gain (Fig. 2; t 1,14 = −4.867, P = 0.0002), suggesting that more training data could be useful in improving decision tree–based identification.
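The trait-level regression described above (initial information gain against the proportion of missing data) can be sketched with SciPy; the numbers below are illustrative only, not the study's trait values.

```python
from scipy import stats

# Hypothetical proportions of missing data and corresponding initial
# information gains for a handful of traits (illustrative values).
missing = [0.1, 0.2, 0.3, 0.5, 0.6, 0.7]
gain    = [2.2, 1.9, 1.5, 0.9, 0.7, 0.5]

res = stats.linregress(missing, gain)
# A negative slope and high R^2 would mirror the paper's pattern: traits
# with more missing data carry less initial information.
print(res.slope, res.rvalue ** 2, res.pvalue)
```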

Figure 2.


Relationship between the initial trait information gain and the proportion of missing information. The amount of missing information for a trait explains ~60% of the variation in the information gain of that trait. Traits with a high proportion of missing information correspond to a low information gain (R 2 = 0.602, df = 1,14, P = 0.0002).

DISCUSSION

In this study, we assembled a trait data set from readily accessible online data and used machine learning to induce a decision tree for plant identification. The unpruned tree distinguished among the plant genera in our data set (98% of species were correctly placed into their genus), and a pruned version of the tree was capable of placing most species into correct genera (89% of test species were properly classified). Although using a pruned tree resulted in a small amount of additional misclassification compared to the unpruned tree, it has several advantages. For example, it required fewer steps for classification (three for the pruned vs. five for the unpruned) and is better suited for placing new species into the correct genera because the pruning avoids overfitting. We found that seven traits were sufficient to properly distinguish most of the species in the data set, indicating their importance in identification. Our results showed that the traits with the highest initial information gain were those with the least missing data, suggesting that the amount of missing data can be important for determining the value of different traits when distinguishing among genera, and for emphasizing the gaps where additional data collection may improve the model quality in the future. Missing information is a common challenge in approaches that use big data to address questions for which the data were not originally collected (Wu et al., 2013; Zhang, 2016).

Compared with traditional polyclave keys, machine learning–derived identification could benefit the plant science community in several ways. First, generating traditional keys requires more expert knowledge of the groups being characterized in order to distinguish among them (Austen et al., 2016) than is required for machine learning methods. Machine learning takes advantage of existing databases and computer algorithms, which require a generalized methodological knowledge (e.g., programming skills for manipulating big data and running algorithms) that can be applied to many different taxonomic groups rather than specialized, group‐specific knowledge (Thessen, 2016). While we prize taxonomic and natural history knowledge, the level of expertise needed for effectively distinguishing taxa requires a substantial time investment to acquire.

Second, the computer algorithm determines the importance of traits for species identification using machine learning–based methods, which can save researchers time on identification by focusing their efforts on the most influential traits. Note that the human valuation of traits is not completely removed from this form of identification, as the algorithm picks among the traits researchers have decided to collect and enter into the database (Hortal et al., 2007). Because the traits in these databases are collected for many purposes, it is also possible for the algorithms to pull from traits that are not typically used for identification.

Third, the number of steps required for an identification could be decreased to improve efficiency compared with the use of traditional trait‐based keys. For example, the pruned tree in our proof‐of‐principle case study was able to classify 89% of species, spanning a large part of the plant phylogeny, in five or fewer steps for all genera. Although improvements are needed to increase the accuracy of this approach, the relatively few steps required to go from “species x” to a genus identity in this study highlight the potential of decision tree–based identification.

To our knowledge, this study is the first to use decision tree methods in this way; however, recent efforts to improve plant identification have applied other machine learning techniques, such as deep learning and hierarchical learning, to plant images (Lee et al., 2017; Singh et al., 2018). For example, Fan et al. (2015) developed a structural learning algorithm that identified plants from images using a coarse to fine subset of plant features. The convenience of image‐based methods, in which the computer analyzes the picture to “collect” trait data based on pixels, can save researchers time; however, image‐based machine learning methods also face challenges, as shown in a study by Lee et al. (2015), where poor leaf condition (e.g., damaged or wrinkled leaves) caused the misclassification of tree species from leaf pictures. It can also be difficult to capture certain types of information from images, including features that require dissection or direct measurement (Wäldchen et al., 2018). While the trait‐based method we employed in this study requires time to measure the traits, challenging features such as these may be more accurately and/or efficiently collected through their direct measurement. The different strengths of direct‐ and image‐based approaches to using machine learning suggest it may be valuable to use them in tandem for improved identification.

Finally, the availability of online data is rapidly increasing (Stajich et al., 2012; Chaudhary et al., 2016; National Ecological Observatory Network [https://www.neonscience.org/]), and recent work has followed TRYdb’s lead, establishing new trait‐focused databases for other groups of organisms (e.g., FunFun for fungal functional trait data; Zanne et al., 2020). Excitingly, in addition to traditional measurements of trait data, new efforts to automate trait data collection from herbarium specimens (see Ott et al., 2020; Weaver et al., 2020 in this collection) will provide a richer and more complete database of trait information for future research. Overall, as the quantity and quality of this type of big data increase for more plant taxa, we expect that the integration of trait‐based approaches and machine learning will improve the efficiency of plant identification, facilitating a wide range of plant biology research.

AUTHOR CONTRIBUTIONS

B.K.A. curated the biological trait data set, conducted the statistical analyses, and took the lead on writing the paper. M.G. conducted the machine learning analysis using the Weka software under the supervision of M.K. M.E.A. oversaw the organization/statistical analysis of the trait data and substantially contributed to writing the paper. M.E.A. and M.K. conceptualized the project. All authors contributed to the manuscript writing.

Supporting information

APPENDIX S1. Original character matrix and trait set.

APPENDIX S2. Final character matrix.

APPENDIX S3. Pruned tree: confidence factor 0.2, minimum number of objects 2.

APPENDIX S4. Pruned tree: confidence factor 0.2, minimum number of objects 5.

APPENDIX S5. Pruned tree: confidence factor 0.2, minimum number of objects 8.

APPENDIX S6. Pruned tree: confidence factor 0.3, minimum number of objects 2.

APPENDIX S7. Pruned tree: confidence factor 0.3, minimum number of objects 5.

APPENDIX S8. Pruned tree: confidence factor 0.3, minimum number of objects 8.

APPENDIX S9. Pruned tree: confidence factor 0.4, minimum number of objects 2.

APPENDIX S10. Pruned tree: confidence factor 0.4, minimum number of objects 5.

APPENDIX S11. Pruned tree: confidence factor 0.4, minimum number of objects 8.

APPENDIX S12. Pruned tree: confidence factor 0.5, minimum number of objects 2.

APPENDIX S13. Pruned tree: confidence factor 0.5, minimum number of objects 5.

APPENDIX S14. Pruned tree: confidence factor 0.5, minimum number of objects 8.

APPENDIX S15. Pruned tree: confidence factor 0.6, minimum number of objects 2.

APPENDIX S16. Pruned tree: confidence factor 0.6, minimum number of objects 5.

APPENDIX S17. Pruned tree: confidence factor 0.25, minimum number of objects 8.

APPENDIX S18. Pruned tree: confidence factor 0.6, minimum number of objects 8.

APPENDIX S19. Pruned tree: confidence factor 0.25, minimum number of objects 5.

APPENDIX S20. Pruned tree: confidence factor 0.25, minimum number of objects 2.

APPENDIX S21. Unpruned tree.

APPENDIX S22. Trait descriptions and possible categories.

ACKNOWLEDGMENTS

This study was supported by the TRY initiative on plant traits (http://www.try‐db.org). The TRY initiative and database is hosted, developed, and maintained by J. Kattge and G. Bönisch (Max Planck Institute for Biogeochemistry, Jena, Germany). TRY is currently supported by DIVERSITAS/Future Earth and the German Centre for Integrative Biodiversity Research (iDiv), Halle‐Jena‐Leipzig, Germany. The authors thank D. Revillini, D. Hernandez, and K. Kiesewetter for their feedback on this manuscript.

Almeida, B. K. , Garg M., Kubat M., and Afkhami M. E.. 2020. Not that kind of tree: Assessing the potential for decision tree–based plant identification using trait databases. Applications in Plant Sciences 8(7): e11379.

Data Availability Statement

All data used in this paper are available in the supporting information and through the TRY plant trait database (http://www.try‐db.org).

LITERATURE CITED

1. Acevedo, M. A., Corrada‐Bravo C. J., Corrada‐Bravo H., Villanueva‐Rivera L. J., and Aide T. M. 2009. Automated classification of bird and amphibian calls using machine learning: A comparison of methods. Ecological Informatics 4: 206–214.
2. Austen, G. E., Bindemann M., Griffiths R. A., and Roberts D. L. 2016. Species identification by experts and non‐experts: Comparing images from field guides. Scientific Reports 6: 33634.
3. Casaioli, M., Mantovani R., Scorzoni F. P., Puca S., Speranza A., and Tirozzi B. 2003. Linear and nonlinear post‐processing of numerically forecasted surface temperature. Nonlinear Processes in Geophysics 10: 373–383.
4. Chaudhary, V. B., Rúa M. A., Antoninka A., Bever J. D., Cannon J., Craig A., and Ha M. 2016. MycoDB, a global database of plant response to mycorrhizal fungi. Scientific Data 3: 160028.
5. Edwards, M., Morse D. R., and Fielding A. H. 1987. Expert systems: Frames, rules or logic for species identification? Bioinformatics 3: 1–7.
6. Eibe, F., Hall M. A., and Witten I. H. 2016. The WEKA Workbench. In Witten I. H. and Frank E. [eds.], Data mining: Practical machine learning tools and techniques, 4th ed. Morgan Kaufmann Publishers, Burlington, Massachusetts, USA.
7. Elith, J., Graham C. H., Anderson R. P., Dudík M., Ferrier S., Guisan A., Hijmans R. J., et al. 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29(2): 129–151.
8. Fan, J., Zhou N., Peng J., and Gao L. 2015. Hierarchical learning of tree classifiers for large‐scale plant species identification. IEEE Transactions on Image Processing 24: 4172–4184.
9. Gallo, T., and Waitt D. 2011. Creating a successful citizen science model to detect and report invasive species. BioScience 61: 459–465.
10. Garnier, E., Stahl U., Laporte M. A., Kattge J., Mougenot I., Kühn I., Laporte B., et al. 2017. Towards a thesaurus of plant characteristics: An ecological contribution. Journal of Ecology 105: 298–309.
11. Gaylard, A., and Kerley G. I. H. 1995. The use of interactive identification keys in ecological studies. South African Journal of Wildlife Research 25: 35–40.
12. Hortal, J., Lobo J. M., and Jimenez‐Valverde A. 2007. Limitations of biodiversity databases: Case study on seed‐plant diversity in Tenerife, Canary Islands. Conservation Biology 21: 853–863.
13. Kattge, J., Diaz S., Lavorel S., Prentice I. C., Leadley P., Bönisch G., Garnier E., et al. 2011. TRY—a global database of plant traits. Global Change Biology 17: 2905–2935.
14. Kubat, M. 2017. Introduction to machine learning, 2nd ed. Springer Publishing, New York, New York, USA.
15. Kumar, N., Belhumeur P. N., Biswas A., Jacobs D. W., Kress W. J., Lopez I. C., and Soares J. V. 2012. Leafsnap: A computer vision system for automatic plant species identification. Computer Vision – ECCV 2012 7573: 502–516.
16. Lee, S. H., Chan C. S., Wilkin P., and Remagnino P. 2015. Deep‐plant: Plant identification with convolutional neural networks. In IEEE International Conference on Image Processing, 452–456. IEEE, New York, New York, USA.
17. Lee, S. H., Chan C. S., Mayo S. J., and Remagnino P. 2017. How deep learning extracts and learns leaf features for plant classification. Pattern Recognition 71: 1–13.
18. Ott, T., Palm C., Vogt R., and Oberprieler C. 2020. GinJinn: An object‐detection pipeline for automated feature extraction from herbarium specimens. Applications in Plant Sciences 8(6): e11351.
19. Quinlan, J. R. 1993. C4.5: Programs for machine learning. Morgan Kaufmann Publishers, San Francisco, California, USA.
20. Randler, C. 2008. Teaching species identification—A prerequisite for learning biodiversity and understanding ecology. Eurasia Journal of Mathematics, Science and Technology Education 4(3): 223–231.
21. Roy, H. E., Baxter E., Saunders A., and Pocock M. J. 2016. Focal plant observations as a standardised method for pollinator monitoring: Opportunities and limitations for mass participation citizen science. PLoS ONE 11(3): e0150794.
22. Singh, A. K., Ganapathysubramanian B., Sarkar S., and Singh A. 2018. Deep learning for plant stress phenotyping: Trends and future perspectives. Trends in Plant Science 23: 883–898.
23. Stajich, J. E., Harris T., Brunk B. P., Brestelli J., Fischer S., Harb O. S., Kissinger J. C., et al. 2012. FungiDB: An integrated functional genomics database for fungi. Nucleic Acids Research 40: D675–D681.
  24. Thessen, A. 2016. Adoption of machine learning techniques in ecology and earth science. One Ecosystem 1: e8621. [Google Scholar]
  25. Tilling, S. 1984. Keys to biological identification: their role and construction. Journal of Biological Education 4: 293–304. [Google Scholar]
  26. Vayssières, M. P. , Plant R. E., and Allen‐Diaz B. H.. 2000. Classification trees: An alternative non‐parametric approach for predicting species distributions. Journal of Vegetation Science 11: 679–694. [Google Scholar]
  27. Wäldchen, J. , Rzanny M., Seeland M., and Mäder P.. 2018. Automated plant species identification—Trends and future directions. PLoS Computational Biology 14(4): e1005993. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Weaver, W. N. , Ng J., and Laport R. G.. 2020. LeafMachine: Using machine learning to automate leaf trait extraction from digitized herbarium specimens. Applications in Plant Sciences 8(6): e11367. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Will, K. W. , and Rubinoff D.. 2004. Myth of the molecule: DNA barcodes for species cannot replace morphology for identification and classification. Cladistics 20: 47–55. [DOI] [PubMed] [Google Scholar]
  30. Wu, X. , Zhu X., Wu G. Q., and Ding W.. 2013. Data mining with big data. IEEE Transactions on Knowledge and Data Engineering 26: 97–107. [Google Scholar]
  31. Zanne, A. , Abarenkov K., Afkhami M. E., Aguilar‐Trigueros C. A., Bates S., Bhatnagar J. M., Busby P. E., et al. 2020. Fungal functional ecology: Bringing a trait‐based approach to plant‐associated fungi. Biological Reviews 95(2): 409–433. [DOI] [PubMed] [Google Scholar]
  32. Zhang, Z. 2016. Missing data imputation: Focusing on single imputation. Annals of Translational Medicine 4(1): 9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Zhou, Z.‐H. , Jiang Y., Yang Y.‐B., and Chen S.‐F.. 2002. Lung cancer cell identification based on artificial neural network ensembles. Artificial Intelligence in Medicine 24: 25–36. [DOI] [PubMed] [Google Scholar]

Supplementary Materials

APPENDIX S1. Original character matrix and trait set.

APPENDIX S2. Final character matrix.

APPENDIX S3. Pruned tree: confidence factor 0.2, minimum number of objects 2.

APPENDIX S4. Pruned tree: confidence factor 0.2, minimum number of objects 5.

APPENDIX S5. Pruned tree: confidence factor 0.2, minimum number of objects 8.

APPENDIX S6. Pruned tree: confidence factor 0.3, minimum number of objects 2.

APPENDIX S7. Pruned tree: confidence factor 0.3, minimum number of objects 5.

APPENDIX S8. Pruned tree: confidence factor 0.3, minimum number of objects 8.

APPENDIX S9. Pruned tree: confidence factor 0.4, minimum number of objects 2.

APPENDIX S10. Pruned tree: confidence factor 0.4, minimum number of objects 5.

APPENDIX S11. Pruned tree: confidence factor 0.4, minimum number of objects 8.

APPENDIX S12. Pruned tree: confidence factor 0.5, minimum number of objects 2.

APPENDIX S13. Pruned tree: confidence factor 0.5, minimum number of objects 5.

APPENDIX S14. Pruned tree: confidence factor 0.5, minimum number of objects 8.

APPENDIX S15. Pruned tree: confidence factor 0.6, minimum number of objects 2.

APPENDIX S16. Pruned tree: confidence factor 0.6, minimum number of objects 5.

APPENDIX S17. Pruned tree: confidence factor 0.25, minimum number of objects 8.

APPENDIX S18. Pruned tree: confidence factor 0.6, minimum number of objects 8.

APPENDIX S19. Pruned tree: confidence factor 0.25, minimum number of objects 5.

APPENDIX S20. Pruned tree: confidence factor 0.25, minimum number of objects 2.

APPENDIX S21. Unpruned tree.

APPENDIX S22. Trait descriptions and possible categories.

Data Availability Statement

All data used in this paper are available in the supporting information and through the TRY plant trait database (http://www.try‐db.org).
