Data-mining Techniques for Image-based Plant Phenotypic Traits Identification and Classification

Md Matiur Rahaman; Md Asif Ahsan; Ming Chen

doi:10.1038/s41598-019-55609-6

2019 Dec 20;9:19526. doi: 10.1038/s41598-019-55609-6

PMCID: PMC6925301 PMID: 31862925

Abstract

Statistical data-mining (DM) and machine learning (ML) are promising tools for the analysis of complex datasets. In precision agriculture, plant phenomics has become crucial for high-throughput phenotyping of local crop cultivars, so an integrated or new analytical approach is needed to handle phenomics data. We propose a statistical framework for the analysis of phenomics data that integrates DM and ML methods. To validate the approach, four popular supervised ML methods, Linear Discriminant Analysis (LDA), Random Forest (RF), and Support Vector Machine with linear (SVM-l) and radial basis (SVM-r) kernels, were used to classify plant status (stress/non-stress). Several simulated and real plant phenotype datasets were analyzed. The results show that the features selected by the proposed approach contributed significantly throughout the analysis. The proposed approach reduced the complexity of phenotype data analysis, reduced the computational time of the ML algorithms, and increased prediction accuracy.

Subject terms: Plant sciences, Statistics

Introduction

Phenomics technologies have developed rapidly in plant science. They offer great potential to gain more valuable information than traditional destructive methods of plant phenotyping. Large-scale plant phenotyping facilities acquire a large number of images of hundreds of plants simultaneously. With the aid of automated image processing, the phenotype-image data are converted into phenotype-feature matrices¹. Finding suitable techniques or methodologies to analyze phenotype data in the context of high-throughput phenotyping is a great challenge. Extracting data patterns, assimilating data, and identifying features (traits) from this large corpus of data requires data mining (DM) and machine learning (ML) tools^1–3. Supervised and unsupervised DM and ML algorithms are promising tools for the analysis of complex data sets, but novel approaches are needed to apply them to phenotyping data of mature plants⁴.

In agricultural development, there is a demand to control diseases and numerous stresses in order to maintain food quality worldwide and to reduce food-borne illness originating from infected plants. A wide variety of plant stresses and diseases are caused by environmental factors, for example light quantity, light quality, CO₂, nutrients, air humidity, water, temperature, drought, and salinity, or by other organisms such as fungi, bacteria, and viruses. They hinder agricultural development by reducing grain production and quality. Thus, it is important to detect and classify plant infestations⁵.

Supervised ML methods are useful for biological and plant image analysis^1,4,6–9. Linear Discriminant Analysis (LDA) is a popular supervised ML method widely used for biomedical data classification^5,10,11. Random Forest (RF) is a non-parametric supervised ML method that has been applied in several biological fields for gene selection, protein sequence selection, and disease prediction^12–14. RF has also been used for accurate prediction of plant biomass from image-based features⁹. Support Vector Machine (SVM) is another powerful supervised ML method, which can be trained to classify individuals in high-dimensional space¹⁵. SVM has been widely used in biomedical fields, including neuro-image classification, plant image classification, biomass prediction, and identification of stressed plants from image-derived features^9,16–19. In most cases, symptoms of stress and disease in plants appear as changes in plant color^9,10. ML approaches can be used to classify color-related traits, which are obtained from plant phenotype image pixels under biotic and abiotic conditions¹.

In high-throughput plant studies, the most informative phenotypic traits yield better data analysis results. Plant biologists train classification models, but they need to improve the training data by inspecting the significant phenotypic traits. Identifying candidate traits from tens to hundreds or even more image-derived phenotypic traits for QTL (quantitative trait locus) or GWAS (genome-wide association study) analysis is also an important and challenging research topic for bridging the genotype-phenotype gap⁹. Such analysis is essential for improving resistance to environmental stress in agronomically important crops^20–22. Traditional statistical methods are used extensively for genomic data analysis¹. A powerful statistical approach or analytical framework that integrates traditional or novel methods with the complex trait set is essential for describing crop cultivars⁴.

In this study, we propose a statistical framework for pre-processing quantitative image data, and we improve the training dataset used to estimate ML models by inspecting important phenotypic traits with a DM technique. We explore how performance varies with the number of selected traits and investigate the performance of each ML method (classifier) mentioned above. We used a plant phenotype dataset with different types of phenotypic features (geometrical and physiological). We also used cross-validation, which is needed to evaluate the performance of a classifier and must be repeated many times when a classifier is trained iteratively. The next part describes the datasets, the approach, and the supervised ML methods used in this study. The last part presents the results and discussion.

Materials and Methods

Data description

Simulated data

To investigate the performance of the ML methods on features selected through our proposed approach, we generated simulated training and test datasets from m = 2 multivariate normal populations (Π₁ and Π₂) with the following data structure:

$$D:\quad \Pi_{1} \sim n_{1}\, N_{p}(\mu_{1}, V_{1}), \qquad \Pi_{2} \sim n_{2}\, N_{p}(\mu_{2}, V_{2}).$$

where n₁ and n₂ are the numbers of individuals, and $N_{p} (μ_{1}, V_{1})$ and $N_{p} (μ_{2}, V_{2})$ are p-variate normal distributions with mean vectors μ₁ and μ₂ and covariance matrices V₁ and V₂, respectively. We set V₁ = V₂ = V and μ₂ = μ₁ + ϵ with ϵ = 0, 1, …, 10, so that μ₁ = μ₂ for ϵ = 0 and μ₁ ≠ μ₂ otherwise; the scalar ϵ denotes the common difference between corresponding mean components of μ₁ and μ₂. The generated data vectors are arranged in an n × p matrix to obtain the training and test data sets, where n = n₁ + n₂.
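The sampling scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the baseline mean μ₁ = 0 and the identity covariance V = I are placeholders (the text only requires a common V), and the function name `simulate_two_groups` is hypothetical.

```python
import random

def simulate_two_groups(n1, n2, p, eps, seed=0):
    """Draw n1 rows from N_p(mu1, I) and n2 rows from N_p(mu1 + eps, I)."""
    rng = random.Random(seed)
    mu1 = [0.0] * p                  # assumed baseline mean vector
    mu2 = [m + eps for m in mu1]     # mu2 = mu1 + eps, component-wise
    rows, labels = [], []
    for mu, n, label in ((mu1, n1, 0), (mu2, n2, 1)):
        for _ in range(n):
            # Independent unit-variance components, i.e. V = I (assumption).
            rows.append([rng.gauss(m, 1.0) for m in mu])
            labels.append(label)
    return rows, labels

# Separate draws for the training and test sets; each is an n x p matrix.
X_train, y_train = simulate_two_groups(n1=150, n2=150, p=25, eps=1.0, seed=1)
X_test, y_test = simulate_two_groups(n1=150, n2=150, p=25, eps=1.0, seed=2)
```

Setting `eps=0` gives two populations with identical means, which corresponds to the case μ₁ = μ₂ described above.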

Plant phenomics data

The Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany, has generated a high-throughput phenomics dataset. We downloaded the quantitative phenomics dataset from http://iapg2p.sourceforge.net/modeling/#dataset; a detailed description is available in Chen et al.⁹. The dataset, as described by Chen et al.⁹, is summarized as follows:

A mini core set of 16 German two-rowed spring barley cultivars and two parents of a DH-mapping population (cv Morex and cv Barke) were screened. Plants were grown under controlled greenhouse conditions and phenotyped using the automated LemnaTec-Scanalyzer 3D (LemnaTec GmbH, Aachen, Germany) phenotyping and imaging platform, which consists of conveyor belts, a weighing and watering station, and imaging sensors. The experiments were performed under two treatments: well-watered (control treatment) and water-limited (drought stress treatment). Drought stress was imposed by withholding the water supply from 27 days after sowing until day 44. Stressed plants were re-watered at day 45. Control plants remained well watered. After the stress period (days 27–44), all plants were watered to 90% field capacity (FC) and kept well watered until the end of the experiment. The greenhouse temperature was set to 18 °C during the day and 16 °C at night. The daylight period lasted ~13 h, starting at 7 AM. For each treatment, six plants per DH parent and nine plants per core set cultivar were tested. For each plant, top and side cameras were used to capture images daily in three wavelength bands: visible light, FLUO, and NIR.

Chen et al.⁹ performed image analysis with the IAP software to extract quantitative information from the barley plant images²³. Images were exported and analyzed using the barley analysis pipeline with optimized parameters. Image processing included three steps: pre-processing, to prepare the images for segmentation; segmentation, to divide each image into foreground and background; and feature extraction. The analyzed features were exported in .csv file format.

Phenomics data processing and features selection

We propose a statistical framework (Fig. 1) with two phases: (a) Processing and (b) Ranking. The elements of the framework are described below.

Figure 1. Framework for selecting image-based plant phenotypic traits (features).

(a) Data pre-processing and features selection (Processing)

Given a set of phenotype data Ω_n, we first configure the data by color, shape structure, genotype, etc., for plotting and for repeated use in the analysis. The data are then filtered, for example by removing ‘0’ values (which represent empty values in the image data), detecting outliers, and assessing trait reproducibility. For outlier detection, the Grubbs test²⁴ is useful; it assumes that the phenotype data points from repeated measures on replicated plants of a single genotype are normally distributed for each trait⁹. The Bonferroni outlier test is another method for identifying outliers in the image dataset; outliers that could bias the results should be removed²⁵. Feature processing then continues, on the reasoning that the phenotypic information should be robust and informative. Feature reproducibility can be evaluated with the Pearson correlation coefficient. The resulting data sets may still contain redundant features that are correlated with each other. To remove this redundancy and select features, stepwise variable selection using variance inflation factors⁹, principal component analysis²⁵, and RF⁴ are useful methods for obtaining an optimal set of meaningful features.
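To make the redundancy problem concrete, the following Python sketch computes Pearson correlation coefficients and greedily drops traits that are almost perfectly correlated with a trait already kept. Note that this simple correlation filter is a stand-in chosen for brevity, not one of the methods named above (variance inflation factors, principal component analysis, RF); the threshold of 0.95 and the trait names are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return cov / sqrt(va * vb)

def drop_redundant(columns, threshold=0.95):
    """Keep a trait only if |r| with every already-kept trait is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical trait columns (one value per plant).
traits = {
    "area": [1.0, 2.0, 3.0, 4.0, 5.0],
    "area_px": [2.0, 4.0, 6.0, 8.0, 10.1],  # nearly a rescaled copy of "area"
    "hue": [0.3, 0.1, 0.4, 0.1, 0.5],
}
print(drop_redundant(traits))  # ['area', 'hue']
```

The result depends on the order in which traits are visited, so a filter like this is best seen as a quick screening step before a more principled selection method.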

(b) Features ranking by SVM-RFE (Ranking)

In this step, phenotypic features are ranked with an ML method called Support Vector Machine-Recursive Feature Elimination (SVM-RFE). SVM-RFE is an iterative procedure built on the SVM. A cost function β computed on the training samples is used as the objective function. Expanding β in a Taylor series to second order, as in the OBD algorithm²⁶, and neglecting the first-order term at the optimum of β yields:

$$\Delta \beta(i) = \frac{1}{2}\, \frac{\partial^{2} \beta}{\partial w_{i}^{2}}\, (\Delta w_{i})^{2}$$
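As a worked instance, assume the usual linear-SVM objective $\beta = \tfrac{1}{2}\lVert w \rVert^{2}$ (an assumption made here for illustration; the expression above leaves β general). Removing trait i amounts to setting its weight to zero, so $\Delta w_{i} = w_{i}$, and the expansion reduces to:

```latex
\beta = \tfrac{1}{2}\lVert w \rVert^{2}
\;\Rightarrow\;
\frac{\partial^{2}\beta}{\partial w_{i}^{2}} = 1,
\qquad
\Delta w_{i} = w_{i}
\;\Rightarrow\;
\Delta\beta(i) = \tfrac{1}{2}\, w_{i}^{2}.
```

Under this assumption, the change in cost caused by removing trait i is proportional to the squared weight of that trait.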

Here, $w_{i}^{2}$ is used as the ranking criterion^27,28. The SVM-RFE procedure for a phenotype dataset is outlined below:

Features Ranking

1. Procedure: Process (Ω, K)

Where Ω is phenotypic traits space, K is the set of labels (treatment or genotype)

2. Ψ_s ← Trait Selection (Ω, K)

3. Inputs: Training sample (Processed phenotypic image dataset)

$X_{0} = [x_{1 \times Ψ_{s}}, x_{2 \times Ψ_{s}}, \dots, x_{k \times Ψ_{s}}, \dots, x_{n \times Ψ_{s}}]^{T}$

4. Group labels K = {0,1,…m}

5. Initialize: Ψ_s = [1, 2,…, p]; surviving traits

6. Trait ranked list, r = []; repeat steps 7–12 until Ψ_s = []

7. α ← svm-train(X₀, K); train the classifier.

8. w ← $\sum_{t} α_{t} X_{t} K_{t}$; the weight vector over the surviving traits, summed over the training patterns t.

9. R_i ← (w_i)², ∀ i; ranking criterion for the i-th trait.

10. g ← argmin(R); trait with the lowest ranking.

11. r ← [Ψ_s(g), r]; update the trait-ranking list.

12. Ψ_s ← Ψ_s (1:g-1, g + 1:length(Ψ_s)); eliminate the trait with the lowest ranking.

13. return r; the ranked trait list.

14. End procedure.
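The elimination loop above can be sketched in plain Python. This is a minimal illustration, not the authors' R implementation: the linear SVM is trained here with a Pegasos-style stochastic subgradient method (a technique swapped in so the example has no dependencies), it has no bias term, labels are coded as −1/+1 rather than {0, 1, …, m}, and the function names, the regularisation constant `lam`, and the toy data are all assumptions.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient descent for a linear SVM without bias.
    Labels in y must be -1 or +1. Returns the weight vector w."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)                      # decreasing step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [wj * (1.0 - eta * lam) for wj in w]   # regularisation shrinkage
            if margin < 1.0:                           # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def svm_rfe(X, y, train=train_linear_svm):
    """Return trait indices ordered from most to least important."""
    surviving = list(range(len(X[0])))                 # Psi_s
    ranked = []                                        # r
    while surviving:
        Xs = [[row[j] for j in surviving] for row in X]
        w = train(Xs, y)
        scores = [wj * wj for wj in w]                 # R_i = w_i^2
        g = min(range(len(scores)), key=scores.__getitem__)
        ranked.insert(0, surviving[g])                 # r <- [Psi_s(g), r]
        del surviving[g]                               # eliminate weakest trait
    return ranked

# Toy data: trait 0 carries the class signal, traits 1-4 are pure noise.
rng = random.Random(1)
y = [1 if i % 2 == 0 else -1 for i in range(80)]
X = [[3.0 * yi + rng.gauss(0, 0.3)] + [rng.gauss(0, 1) for _ in range(4)]
     for yi in y]
ranking = svm_rfe(X, y)
print(ranking)
```

Because the weakest trait is prepended to the list each time it is removed, the trait eliminated last ends up first; on toy data like this the informative trait (index 0) should survive longest.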

Supervised machine learning methods

In supervised learning, there are input variables (x) and an output variable (y), and an algorithm is used to learn the mapping function from the input to the output:

y = f (x)

The goal is to approximate the mapping function well enough that the output (y) can be predicted for new input data (x). It is called supervised learning because the algorithm learns from a training dataset in which the correct outputs are known. The algorithm iteratively makes predictions on the training data, and learning stops when the algorithm achieves an acceptable level of performance. No single supervised ML (classification) algorithm outperforms the others on all datasets; every classification method has its own strengths and limitations^29,30. Based on the literature, we tested three popular ML algorithms for classification: Linear Discriminant Analysis (LDA), Random Forest (RF), and Support Vector Machine (SVM). For SVM, we distinguished between linear and radial basis kernel functions. All of these are supervised classifiers that require a training stage before classification. The implementation and parameter tuning of the classifiers are as follows:

Linear Discriminant Analysis (LDA): LDA is a useful ML algorithm when features are linearly independent and normally distributed. It tries to maximize the separation between classes by estimating class boundaries as linear combinations of the features. It does not need parameter tuning. We chose this supervised classifier because it is conventionally considered a good benchmark classifier³¹. The R package MASS was used for LDA.
Random Forest (RF): A random forest is a classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. To perform well, RF requires tuning of the parameter mtry, the number of input features tried at each split when building each tree^4,12,32. We used the cforest function in the R package party, and mtry = p was tuned, where p is the number of selected phenotypic features.
Linear support vector machine (SVM-l): A linear support vector machine is suited to large data sets for which classifiers with and without nonlinear mapping perform similarly^31,33. SVM-l requires only one hyperparameter, C, which reduces training and testing times. The search for the optimal C was performed over C ∈ [2⁰, 2¹, …, 2⁴].
Support vector machine with radial basis function (SVM-r): The support vector machine with a radial basis function kernel generally performs better and is tolerant of irrelevant and interdependent features^31,33. SVM-r is useful when the data are not linearly separable, but it is slower because two hyperparameters, C and γ, must be optimized. Parameter tuning was performed over C ∈ [2⁰, 2¹, …, 2⁴] and γ ∈ [2⁻⁸, 2⁻⁷, …, 1].
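The two SVM search grids described above can be enumerated as follows. This sketch assumes that both ranges advance in powers of two (so γ runs from 2⁻⁸ to 2⁰ = 1); the variable names are hypothetical.

```python
from itertools import product

C_grid = [2.0 ** k for k in range(0, 5)]        # 2^0, 2^1, ..., 2^4
gamma_grid = [2.0 ** k for k in range(-8, 1)]   # 2^-8, 2^-7, ..., 2^0 = 1

# SVM-l has a single hyperparameter; SVM-r searches every (C, gamma) pair.
svm_linear_grid = [{"C": C} for C in C_grid]
svm_radial_grid = [{"C": C, "gamma": g} for C, g in product(C_grid, gamma_grid)]

print(len(svm_linear_grid), len(svm_radial_grid))  # 5 45
```

The radial kernel therefore needs nine times as many candidate models as the linear kernel for each cross-validation split, which is consistent with the remark that SVM-r is slower to tune.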

The R package e1071 was used to implement the SVMs. Both the simulated and the real datasets were subjected to 100 repeats of 10-fold cross-validation throughout the analysis.
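A repeated k-fold evaluation of this kind can be sketched as below. The splits here are unstratified and reshuffled on every repeat, which is an assumption (the text does not say how folds were formed), and `nearest_mean` is only a toy stand-in classifier used to exercise the loop; all names are hypothetical.

```python
import random

def repeated_kfold(n, k=10, repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]   # k nearly equal-sized folds
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, train_idx, test_idx

def cv_accuracy(fit_predict, X, y, k=10, repeats=100, seed=0):
    """Average test accuracy of fit_predict over all repeats and folds."""
    scores = []
    for _, _, tr, te in repeated_kfold(len(X), k, repeats, seed):
        preds = fit_predict([X[i] for i in tr], [y[i] for i in tr],
                            [X[i] for i in te])
        scores.append(sum(p == y[i] for p, i in zip(preds, te)) / len(te))
    return sum(scores) / len(scores)

def nearest_mean(X_train, y_train, X_test):
    """Toy classifier: assign each test row to the closest class mean (1-D)."""
    means = {}
    for c in set(y_train):
        vals = [x[0] for x, t in zip(X_train, y_train) if t == c]
        means[c] = sum(vals) / len(vals)
    return [min(means, key=lambda c: abs(x[0] - means[c])) for x in X_test]

# Two well-separated toy groups of 20 individuals each.
X = [[0.01 * i] for i in range(20)] + [[5.0 + 0.01 * i] for i in range(20)]
y = [0] * 20 + [1] * 20
print(cv_accuracy(nearest_mean, X, y, repeats=5))
```

Any classifier with the same `fit_predict(X_train, y_train, X_test)` shape can be passed in, so the same loop serves for comparing several methods on identical splits.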

Results

Simulated data results

We analyzed simulated datasets with n₁ = n₂ = 150 and p = 25, 50, 100 to evaluate the performance of the ranked features during classification. Classification accuracy was evaluated for the top 10% to 50% of ranked features and for all features.
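For reference, the number of traits retained at each percentage can be computed as below. Rounding up is an assumption made for this sketch, since the text does not state how fractional counts such as 10% of 25 were handled; `top_fraction` is a hypothetical helper.

```python
import math

def top_fraction(ranked, percent):
    """Return the first ceil(percent% of p) entries of a ranked trait list."""
    k = max(1, math.ceil(percent * len(ranked) / 100))
    return ranked[:k]

for p in (25, 50, 100):
    ranked = list(range(p))   # placeholder ranking: trait indices 0..p-1
    sizes = [len(top_fraction(ranked, pct)) for pct in (10, 20, 30, 40, 50)]
    print(p, sizes)
# 25 [3, 5, 8, 10, 13]
# 50 [5, 10, 15, 20, 25]
# 100 [10, 20, 30, 40, 50]
```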

For p = 25, Table 1 shows that classification accuracy is around 98% with only 10% of the ranked features. We also calculated classification accuracy for 20%, 30%, 40%, and 50% of the ranked features; all gave almost the same accuracy as the full set of unranked features. Thus, the feature set could be reduced to at most 50% of the ranked features while still giving good results (above 97% for every method). More features mean more complexity when training the model, and a lack of meaningful features in the dataset can sometimes produce misleading results. Figure 2 shows the computational time for each percentage of ranked features. As the percentage of features increases, the computational time also increases. However, models based on 10% to 50% of the ranked features require much less computational time than the model containing all features, with similar performance.

Table 1.

Average classification accuracy (%) on the simulated data (p = 25) over 100 repeats of 10-fold cross-validation, based on ranked features.

Rank Features Accuracy
ML Methods	10%	20%	30%	40%	50%	All features (100%)
LDA	98.21	98.87	99.41	99.63	99.86	100.00
RF	97.30	97.56	97.70	97.78	97.81	97.90
SVM-l	98.08	98.65	99.07	99.22	99.39	99.53
SVM-r	97.88	98.34	98.55	98.61	98.67	98.53

Figure 2. Computational time according to the percentage of ranked features used.

For p = 50, classification accuracy is more than 90% with 10% of the ranked features and around 93% with 20% and 30% of the ranked features. With 40% and 50% of the ranked features, accuracy is almost the same as with all unranked features for every ML method except RF, whose accuracy is nevertheless more than 91% (Table 2).

Table 2.

Average classification accuracy (%) on the simulated data (p = 50) over 100 repeats of 10-fold cross-validation, based on ranked features.

Rank Features Accuracy
ML Methods	10%	20%	30%	40%	50%	All features (100%)
LDA	91.97	93.29	94.32	95.19	95.78	100.00
RF	91.15	91.48	91.66	91.73	91.79	91.91
SVM-l	91.82	92.90	93.62	94.23	94.73	95.07
SVM-r	91.48	92.62	93.33	93.67	93.75	93.13

When p = 100, prediction accuracy exceeded 80% for all ML methods with 10% of the ranked features, and it increased as the percentage of ranked features increased. With up to 50% of the ranked features, the accuracy of LDA and SVM-l exceeded 90%. When all features were used for classification, prediction accuracy was comparable to that with 50% of the ranked features. The classification accuracy of all ML methods for p = 100 is shown in Table 3. In the simulation study, it was also noticeable that, among the ML algorithms, RF prediction accuracy decreased markedly only when the number of variables in the dataset was large (p = 100); otherwise, the methods performed similarly in all cases.

Table 3.

Average classification accuracy (%) on the simulated data (p = 100) over 100 repeats of 10-fold cross-validation, based on ranked features.

Rank Features Accuracy
ML Methods	10%	20%	30%	40%	50%	All features (100%)
LDA	84.97	87.23	89.30	91.10	92.43	94.88
RF	83.55	83.80	83.89	83.92	83.87	83.87
SVM-l	84.49	86.53	88.21	89.50	90.39	91.45
SVM-r	84.42	86.78	88.03	88.62	89.04	87.21

These classification accuracies were obtained for a mean difference between populations (m) ranging from 9 to 10 in steps of 0.01. We generated the simulated data 100 times and averaged the classification accuracy of each ML method. Because the simulation study showed that prediction accuracy with up to 50% of the ranked features is almost the same as with all unranked features, we used up to 50% of the ranked features after processing the real dataset to detect plant status and to evaluate the ranked features by the prediction accuracy of the ML methods used in this study.

Real data results

We analyzed plant phenotype datasets from two growing periods (stress period and recovery period) and divided them into six datasets by phenotypic trait category. The datasets for the last day of the stress period and of the recovery period, with geometrical (Geo) and physiological (Phy) traits, were analyzed (Fig. 3). These datasets were processed to obtain meaningful traits by following the first phase of our proposed framework (Processing), as summarized from Chen et al.⁹. We then ranked (the second phase) the ‘geometrical’, ‘physiological’, and ‘geometrical + physiological’ traits and evaluated the performance of the selected traits (features) by predicting plant status (stress/non-stress).

Figure 3. Plant phenotype dataset. Datasets were prepared by feature category for the two plant growing periods.

Using k-fold cross-validation with k = 10, we split each dataset into k sets; k − 1 sets were used to train the model and the remaining set was used for testing. This procedure was repeated 100 times, and the reported results are the average classification accuracy for each dataset. In the stress period dataset, the first two ranked features alone gave almost 100% classification accuracy for all feature categories. We then sequentially added features, from four up to ten, and observed that the accuracy was unchanged (Fig. 4). In the recovery period data, classification accuracy is 99.99% for the Geo (geometrical) ranked features, whereas it is 80% for the Phy (physiological) ranked features when two ranked features are used. Accuracy improved as ranked Phy features were added sequentially and reached 100% with 10 ranked Phy features. Similar accuracy was found for SVM-l and SVM-r, whereas RF gave lower accuracy (≤85%) for the Phy ranked features. However, in this dataset, the prediction accuracy of the combined Geo and Phy (Geo + Phy) ranked features is 99.98% on average across all ML methods (Fig. 5). The standard error of the accuracy is ≈0.

Figure 4. Performance of ranked features for the stress period data set. ‘Geo’ denotes geometrical, ‘Phy’ physiological, and ‘Geo + Phy’ combined geometrical and physiological features.

Figure 5. Performance of ranked features for the recovery period data set. ‘Geo’ denotes geometrical, ‘Phy’ physiological, and ‘Geo + Phy’ combined geometrical and physiological features.

Figures 6 and 7 compare the ML methods for the stress and recovery period datasets, respectively. For the stress dataset, the prediction accuracy of LDA and SVM-r is 100% for the Geo, Phy, and Geo + Phy ranked features (with 2, 4, 6, 8, and 10 ranked features), except when two Phy ranked features are used. SVM-l outperforms the others, with 100% prediction accuracy for all categories of ranked features. Although RF has slightly lower classification accuracy than LDA, SVM-l, and SVM-r, its prediction accuracy is still more than 97% on average. For the recovery period data, the prediction accuracy of LDA and SVM-l is 99.99% and 99.15% with 10 ranked features of Geo + Phy and Phy, respectively, whereas LDA accuracy is 100% with 10 Geo ranked features. RF performs worse: its prediction accuracy is more than 97% for all feature categories except Phy, even with up to 10 ranked features. There is no noticeable difference between the performance of LDA and SVM-l on the recovery period dataset. Overall, the classification accuracy of all ML methods in the real data analysis reached an acceptable level in all cases.

Figure 6. Comparison of the classification accuracy of ML methods based on ranked features for the stress period data set. The number of ranked features is shown on the left of the panels and the feature categories on the right. Each column of panels shows the results for a different ML method. Every ML method was subjected to 100 repeats of 10-fold cross-validation, and the results shown are the average classification accuracy. The value in each cell is color coded on a scale from 0 to 1, ranging from red to blue.

Figure 7. Comparison of the classification accuracy of ML methods based on ranked features for the recovery period data set. The number of ranked features is shown on the left of the panels and the feature categories on the right. Each column of panels shows the results for a different ML method. Every ML method was subjected to 100 repeats of 10-fold cross-validation, and the results shown are the average classification accuracy. The value in each cell is color coded on a scale from 0 to 1, ranging from red to blue.

Discussion

DM and ML form an inherently multidisciplinary approach to data analysis that draws inspiration, and borrows heavily, from statistics, probability theory, decision theory, optimization, and visualization. DM and ML methods are typically useful where big data problems arise. Several image-based studies have used and evaluated DM and ML methods in biology and on images obtained in high-throughput screening^31,34–37. The enormous volume, variety, velocity, and veracity of the imaging and remote-sensing data generated by such real-time platforms represent a ‘big data’ problem.

High-throughput plant phenomics technologies have resulted in an inundation of high-resolution images and sensor data of plants. Extracting patterns and features from these data requires powerful statistical approaches that can cope with the increasing amount of plant phenotyping information. Combining DM methods and ML methods for plant phenomics data pre-processing, variable selection, and group classification might overcome this big data analysis problem⁴. One of the major benefits of DM and ML approaches for plant breeders, physiologists, pathologists, and biologists is the opportunity to search large data sets for patterns and to guide discovery by looking at a combination of factors simultaneously instead of analyzing each feature (trait) individually. Previously, this was a major bottleneck because the high dimensionality of individual images makes them extremely hard to analyze with conventional techniques. Another key challenge is that the underlying processes linking the inputs to the outputs are too complex for mathematical models to be derived³.

Previous studies have applied ML methods for feature selection, feature ranking, and classification based on root features of phenomics data^4,7,9. Integrated methods or powerful techniques improved the accuracy of the data analysis, confirming earlier results by Löw et al.³⁸ and Zhao et al.⁴. We combined DM and ML methods for feature selection, feature ranking, and classification, and the resulting accuracy is high (≥98% on average across the classifiers).

We used shoot image features in this study. Our results clearly demonstrate the importance of selecting informative features to obtain efficient classification results for the phenomics dataset. The improved accuracy probably results from alleviating the ‘curse of dimensionality’: selecting ranked features removes less informative features from the classification. The ‘Geo’ features are the most important, performing better than the ‘Phy’ features on the recovery period data for all ML methods, although the ‘Phy’ features perform as well as the ‘Geo’ features on the stress data set. The combined ‘Geo’ and ‘Phy’ features perform well on both datasets. The classification performance of the ML methods increased when no more than 50% of the ranked features were used. The overall prediction accuracy of the ML methods was cross-validated.

In summary, our study indicates that, among the ML methods considered, there is no noticeable difference in classification accuracy except for RF when the features are selected through our proposed approach. By adding ranked features sequentially until the algorithms reach acceptable performance, this approach reduces computational time and increases classification accuracy. However, LDA performs well only when the data are normally distributed and there is no curse of dimensionality; otherwise it can give misleading results. RF accuracy is much lower than that of the other ML methods for both the simulated and the real datasets used in this study. SVMs are an appropriate choice for high-throughput phenomics data analysis, especially SVM-l for iterative training of the classifier to classify all phenotype data, including unlabeled plant phenotype datasets.

Conclusions

The accurate classification of stressed plants (accuracy more than 98% on average) indicates that the ranked features selected through our proposed approach performed well. In particular, this study showed that combining DM and ML methods for trait identification and classification, respectively, can overcome problems in applying ML approaches to the analysis of phenotype data. Hence, the proposed approach is generally useful for making plant phenotype data analysis more effective and robust throughout classification. We conclude that the proposed analytical approach can be useful for image-based plant phenotype data processing, for finding complex traits for QTL (quantitative trait locus) or GWAS (genome-wide association study) analysis, for stress identification and disease prediction, and for further statistical investigation of phenomics datasets in plant growth and development research.

Acknowledgements

Ming Chen’s laboratory appreciates the support of the National Key Research and Development Program of China (Grant Nos. 2016YFA0501700, 2018YFC0310602), the National Natural Science Foundation of China (Grant Nos. 31571366, 31771477), the Fundamental Research Funds for the Central Universities, and the Jiangsu Collaborative Innovation Center for Modern Crop Production. We are also grateful to the Chinese Government Scholarship; the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany; and Dijun Chen for providing the real phenotype data set.

Author contributions

M.M.R. contributed to the conception and the development of methods, prepared the manuscript and analyzed the results. M.A.A. prepared and revised the manuscript. M.C. contributed to the design and conception of the project, critically read and approved the final manuscript. All authors read and approved the final manuscript.

Data availability

The phenotype image data were downloaded from http://iapg2p.sourceforge.net/modeling/#dataset; the R code is available upon request.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Rahaman MM, Chen D, Gillani Z, Klukas C, Chen M. Advanced phenotyping and phenotype data analysis for the study of plant growth and development. Front Plant Sci. 2015;6:619. doi: 10.3389/fpls.2015.00619.
2. Granier C, Vile D. Phenotyping and beyond: modelling the relationships between traits. Current Opinion in Plant Biology. 2014;18:96–102. doi: 10.1016/j.pbi.2014.02.009.
3. Singh A, Ganapathysubramanian B, Singh AK, Sarkar S. Machine Learning for High-Throughput Stress Phenotyping in Plants. Trends Plant Sci. 2016;21:110–124. doi: 10.1016/j.tplants.2015.10.015.
4. Zhao J, Bodner G, Rewald B. Phenotyping: using machine learning for improved pairwise genotype classification based on root traits. Frontiers in plant science. 2016;7:1864. doi: 10.3389/fpls.2016.01864.
5. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American statistical association. 2002;97:77–87. doi: 10.1198/016214502753479248.
6. Cocosco CA, Zijdenbos AP, Evans AC. A fully automatic and robust brain MRI tissue classification method. Med Image Anal. 2003;7:513–527. pii: S1361841503000379.
7. Iyer-Pascuzzi AS, et al. Imaging and analysis platform for automatic phenotyping and trait ranking of plant root systems. Plant physiology. 2010;152:1148–1157. doi: 10.1104/pp.109.150748.
8. Ahmed F, Al-Mamun HA, Bari AH, Hossain E, Kwan P. Classification of crops and weeds from digital images: A support vector machine approach. Crop Protection. 2012;40:98–104. doi: 10.1016/j.cropro.2012.04.024.
9. Chen D, et al. Dissecting the phenotypic components of crop plant growth and drought responses based on high-throughput image analysis. Plant Cell. 2014;26:4636–4655. doi: 10.1105/tpc.114.129601.
10. Chan H-P, et al. Computer-aided classification of mammographic masses and normal tissue: linear discriminant analysis in texture feature space. Physics in Medicine & Biology. 1995;40:857. doi: 10.1088/0031-9155/40/5/010.
11. Kim T-K, Kittler J. Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image. IEEE transactions on pattern analysis and machine intelligence. 2005;27:318–327. doi: 10.1109/TPAMI.2005.58.
12. Díaz-Uriarte R, De Andres SA. Gene selection and classification of microarray data using random forest. BMC bioinformatics. 2006;7:3. doi: 10.1186/1471-2105-7-3.
13. Pan X-Y, Shen H-B. Robust prediction of B-factor profile from sequence using two-stage SVR based on random forest feature selection. Protein and peptide letters. 2009;16:1447–1454. doi: 10.2174/092986609789839250.
14. Yang, J., Yao, D., Zhan, X. & Zhan, X. In International Symposium on Bioinformatics Research and Applications. 1–11 (Springer).
15. Cortes C, Vapnik V. Support-vector networks. Machine learning. 1995;20:273–297.
16. Chen D, et al. Predicting plant biomass accumulation from image-derived parameters. GigaScience. 2018;7. doi: 10.1093/gigascience/giy001.
17. Schikora M, et al. An image classification approach to analyze the suppression of plant immunity by the human pathogen Salmonella Typhimurium. BMC Bioinformatics. 2012;13:171. doi: 10.1186/1471-2105-13-171.
18. Gaonkar B, Davatzikos C. Analytic estimation of statistical significance maps for support vector machine based multi-variate image analysis and classification. Neuroimage. 2013;78:270–283. doi: 10.1016/j.neuroimage.2013.03.066.
19. Choi H, Yeo D, Kwon S, Kim Y. Gene selection and prediction for cancer classification using support vector machines with a reject option. Computational Statistics & Data Analysis. 2011;55:1897–1908. doi: 10.1016/j.csda.2010.12.001.
20. Yang W, et al. Combining high-throughput phenotyping and genome-wide association studies to reveal natural genetic variation in rice. Nat Commun. 2014;5:5087. doi: 10.1038/ncomms6087.
21. Campbell MT, et al. Integrating Image-Based Phenomics and Association Analysis to Dissect the Genetic Architecture of Temporal Salinity Responses in Rice. Plant Physiol. 2015;168:1476–1489. doi: 10.1104/pp.15.00450.
22. Al-Tamimi N, et al. Salinity tolerance loci revealed in rice using high-throughput non-invasive phenotyping. Nature communications. 2016;7:13342. doi: 10.1038/ncomms13342.
23. Klukas C, Chen D, Pape JM. Integrated Analysis Platform: An Open-Source Information System for High-Throughput Plant Phenotyping. Plant Physiol. 2014;165:506–518. doi: 10.1104/pp.113.233932.
24. Grubbs FE. Sample Criteria for Testing Outlying Observations. Ann Math Stat. 1950;21:27–58. doi: 10.1214/aoms/1177729885.
25. Camargo A, Papadopoulou D, Spyropoulou Z, Vlachonasios K, Doonan JH, Gay AP. Objective Definition of Rosette Shape Variation Using a Combined Computer Vision and Data Mining Approach. PLoS ONE. 2014;9(5):e96889. doi: 10.1371/journal.pone.0096889.
26. LeCun, Y., Denker, J. S. & Solla, S. A. In Advances in neural information processing systems. 598–605.
27. Liang Y, et al. Prediction of drought-resistant genes in Arabidopsis thaliana using SVM-RFE. PLoS one. 2011;6:e21750. doi: 10.1371/journal.pone.0021750.
28. Wang, J. et al. In BICoB. 30–35.
29. Huang K, Murphy RF. Boosting accuracy of automated classification of fluorescence microscope images for location proteomics. BMC Bioinformatics. 2004;5:78. doi: 10.1186/1471-2105-5-78.
30. Kotsiantis SB, Zaharakis I, Pintelas P. Supervised machine learning: A review of classification techniques. Emerging artificial intelligence applications in computer engineering. 2007;160:3–24.
31. Abbas SS, Dijkstra TM, Heskes T. A comparative study of cell classifiers for image-based high-throughput screening. BMC bioinformatics. 2014;15:342. doi: 10.1186/1471-2105-15-342.
32. Pirooznia M, Yang JY, Yang MQ, Deng Y. A comparative study of different machine learning methods on microarray gene expression data. BMC genomics. 2008;9:S13. doi: 10.1186/1471-2164-9-S1-S13.
33. Gillani Z, Akash MSH, Rahaman MM, Chen M. CompareSVM: supervised, Support Vector Machine (SVM) inference of gene regularity networks. BMC bioinformatics. 2014;15:395. doi: 10.1186/s12859-014-0395-x.
34. Yoon HJ, Sohn I, Cho JH, Lee HY, Kim J-H, Choi Y-L, Kim H, Lee G, Lee KS, Kim J. Decoding Tumor Phenotypes for ALK, ROS1, and RET Fusions in Lung Adenocarcinoma Using a Radiomics Approach. Medicine. 2015;94(41):e1753. doi: 10.1097/MD.0000000000001753.
35. Buggenthin F, et al. An automatic method for robust and fast cell detection in bright field images from high-throughput microscopy. BMC bioinformatics. 2013;14:297. doi: 10.1186/1471-2105-14-297.
36. Aerts HJ, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nature communications. 2014;5:4006. doi: 10.1038/ncomms5006.
37. Du Z, Santella A, He F, Tiongson M, Bao Z. De novo inference of systems-level mechanistic models of development from live-imaging-based phenotype analysis. Cell. 2014;156:359–372. doi: 10.1016/j.cell.2013.11.046.
38. Löw, F., Schorcht, G., Michel, U., Dech, S. & Conrad, C. In Earth Resources and Environmental Remote Sensing/GIS Applications III. 85380R (International Society for Optics and Photonics).

