An optimized method for mulberry silkworm, Bombyx mori (Bombycidae:Lepidoptera) sex classification using TLBPSGA-RFEXGBoost

Sania Thomas; Jyothi Thomas

doi:10.1242/bio.060468

. 2024 Jul 18;13(7):bio060468. doi: 10.1242/bio.060468

An optimized method for mulberry silkworm, Bombyx mori (Bombycidae:Lepidoptera) sex classification using TLBPSGA-RFEXGBoost

Sania Thomas ^1,^*, Jyothi Thomas ¹

PMCID: PMC11273299 PMID: 38885006

ABSTRACT

Silkworm seed production is vital for silk farming, requiring precise breeding techniques to optimize yields. In silkworm seed production, precise sex classification is crucial for optimizing breeding and boosting silk yields. A non-destructive approach for sex classification addresses these challenges, offering an efficient alternative that enhances both yield and environmental responsibility. Southern India is a hub for mulberry silk and cocoon farming, with the high-yielding double-hybrid varieties FC1 (foundation cross 1) and FC2 (foundation cross 2) being popular. Traditional methods of silkworm pupae sex classification involve manual sorting by experts, necessitating the cutting of cocoons – a practice with a high risk of damaging the cocoon and affecting yield. To address this issue, this study introduces an accelerated histogram of oriented gradients (HOG) feature extraction technique that is enhanced by block-level dimensionality reduction. This non-destructive method allows for efficient and accurate silkworm pupae classification. The modified HOG features are then fused with weight features and processed through a machine learning classification model that incorporates recursive feature elimination (RFE). Performance evaluation shows that an RFE-hybridized XGBoost model attained the highest classification accuracy, achieving 97.2% for FC1 and 97.1% for FC2. The model further optimized with a novel teaching learning-based population selection genetic algorithm (TLBPSGA) achieved a remarkable accuracy of 98.5% for FC1 and 98.2% for FC2. These findings have far-reaching implications for improving both the ecological sustainability and economic efficiency of silkworm seed production.

Keywords: Sericulture, Accelerated HOG, RFEXGBoost, TLBPSGA, Seed production, Classification

Summary: This study introduces an optimized non-destructive silkworm pupae sex classification method that enhances both ecological sustainability and economic efficiency in silkworm seed production.

INTRODUCTION

Sericulture is a complex process of cultivating silkworms for the production of silk that requires careful attention to detail and a deep understanding of the life cycle of the silkworm. The different steps involved in sericulture includes mulberry cultivation, egg production, silkworm rearing, cocoon production, harvesting, and weaving (Altman and Farrell, 2022; Gundi et al., 2023; Thomas and Thomas, 2020). The advancement of silkworm breeding is crucial for high quality silk and cocoon production. A significant breakthrough in this field was the introduction of the CSR2×CSR4 hybrid breed in India in 1997, transforming the country's sericulture industry. Alongside this, the CSR hybrids, specifically FC1 (CSR6×CSR26) and FC2 (CSR2×CSR27), have proven to be highly productive and robust, suitable for cultivation by farmers. These breeds, however, require meticulous care to ensure the availability of parental cocoons. To mitigate the challenges associated with these pure breeds, bivoltine double hybrid breeds have been developed. These hybrids are more resistant to adverse weather conditions, resulting in more consistent crop yields compared to their single hybrid counterparts. One such successful double hybrid breed in India is (CSR2×CSR27)×(CSR6×CSR26), capable of yielding around 68.00 kg of silk per 100 disease-free layings (dfls). In our study we used single hybrid FC1 and FC2 varieties. Egg production is one of the key steps in sericulture that involves obtaining healthy silkworm eggs for the next generation of silkworms. Egg production involves the selection of parent stock, mating, egg laying, incubation, harvesting the eggs and storage. Selecting healthy parent stock is a crucial phase in egg production. In the traditional method, the cocoons are cut open, and the pupae are visually inspected to determine their sex. The traditional method is a destructive method that has several disadvantages, such as: damage to the cocoon; risk of bacterial and fungal infections in the pupa, which can lead to lower survival rates and reduced silk quality; cutting open the cocoon can be stressful and traumatic for the pupa, leading to a higher mortality rate among the pupae; the traditional method is a time-consuming and labour intensive process, requiring skilled workers to carefully handle the delicate silkworm pupa; and, the accuracy of sex classification by visual inspection can vary depending on the skill and experience of the worker performing the classification, leading to inconsistencies and errors in the final product. Introduction of non-destructive methods for sex classification of silkworm pupae has the following advantages such as: maintaining cocoon quality, reducing the risk of infection, and saves time and labour. The non-destructive method makes the overall sericulture process more efficient and cost-effective.

There are several available methods; polymerase chain reaction (PCR) is a molecular biology technique that can be used to amplify and detect specific DNA (deoxyribonucleic acid) sequences (Reddy et al., 1999). Sex-specific markers have been identified in silkworms, which can be used to identify the sex of the pupa by analyzing its DNA. This method is highly accurate and reliable but requires specialized equipment and trained personnel. Near infrared spectroscopy (NIR) is a method that uses light to analyze the chemical composition of a substance. It has been used to identify the sex of silkworm pupae based on differences in their chemical composition (Zhu et al., 2018; Fu et al., 2023; Lin et al., 2019; Tao et al., 2018; Qiu et al., 2021). In the NIR method, environmental factors can affect the accuracy, and the instrument requires calibration and expertise, which is expensive.

Sex specific dyes are a non-destructive method (Traut et al., 2008). In this method certain dyes have been developed that are specific to male or female silkworm pupae. These dyes are injected into the pupae, and their coloration indicates their sex. There are some problems that can limit their widespread use, including that some sex-specific dyes may be toxic to silkworm pupae, which can reduce the survival rate and overall health of the pupae. The accuracy of sex-specific dyes can vary depending on factors such as the type and concentration of dye used, as well as environmental conditions. This can lead to inconsistencies in the sex classification of pupae. Sex-specific dyes may have a limited shelf life, meaning they need to be stored and used within a certain time frame. Sex-specific dyes need to be applied carefully and accurately to the pupae, which can be a time-consuming and labor-intensive process. This can increase the cost of production. The use of sex-specific dyes can increase the risk of contamination in the production process.

X-ray imaging techniques were also experimented with in the classification of silkworm pupae without cutting the cocoon (Cai et al., 2014). In previous research, we explored the use of X-ray imaging as a non-destructive technique for the sex classification of silkworm pupae. By employing X-ray imaging, we were able to investigate the internal structures of the pupae without damaging the cocoon. Although this approach holds promise for more precise and non-invasive classification, two major challenges identified were the financial burden associated with the acquisition and operation of specialized X-ray equipment, and concerns about the safety of exposing silkworm pupae to ionizing radiation. Despite these hurdles, our study serves as a pioneering effort in employing advanced imaging technology for the enhancement of sericulture practices (Thomas and Thomas, 2022).

A camera imaging method for pupae sex classification in silkworms is becoming increasingly popular in the sericulture industry. Frontal gonad images are utilized in the classification process (Guo et al., 2022). For this purpose, Guo et al. developed an image exposure correction algorithm to address underexposure and overexposure issues in the images. This approach achieved an accuracy of 90.5%, which was approximately 7% higher than using the original gonad images. He et al. (2023) presents a comprehensive approach for identifying silkworm pupae species and sex using machine learning and deep learning techniques using pupae images. This method obtained an accuracy of 99% for sex classification. The camera images of the cocoon are also used for the sex classification, which eliminates cutting the cocoon. Shape features of the cocoon is studied for the classification process (Joseph Raj et al., 2019; Mahesh et al., 2017). Mahesh et al. (2017) used various combinations of weight, volume, geometric, and Zernike moment-based shape features for discrimination. A mean classification accuracy of 91.3% was attained with the neural network for CSR2 and 100% with the support vector machine (SVM) for the Pure Mysore breed. Joseph Raj et al. (2019) used the combination of weight data with shape features to differentiate male and female pupae and attained an accuracy ranging from 86.48% to 93.54% using SVM. Our previous study explored different feature extraction method such as Zernike moment-based features, Gabor feature, gray level co-occurrence matrix (GLCM), local binary pattern (LBP), and histogram of oriented gradients (HOG). The study identified HOG features best suitable for our problem (Thomas and Thomas, 2024). HOG-based descriptors are a widely used technique in image classification. They capture the distribution of gradient orientations in an image, providing a robust representation that is particularly effective in detecting object shapes and textures (Nam et al., 2011; Klein and Celik, 2017). The HOG feature descriptor is used in many areas, including gender recognition from face images, object and pedestrian localization, attentive semantic alignment, emotion recognition and personality analysis (Son et al., 2015; Yavariabdi et al., 2022; Seo et al., 2018; Yavariabdi et al., 2021; Nam et al., 2014; Moghimi et al., 2022). In our proposed method, accelerated HOG is implemented to extract the features from the camera images of the pupa, which enhanced the speed of the feature extraction compared with conventional HOG feature extraction. To enhance the efficiency of the feature set, the HOG feature set is modified by performing LDA dimensionality reduction on each feature block of the HOG feature descriptor. Along with HOG features, weight of the cocoon with pupa is also fused. To obtain the finer feature set, we performed recursive feature elimination (RFE) with different machine learning models. XGBoost ensemble learning obtained the highest results compared to other models. In pursuit of further optimization for the RFEXGBoost model, a novel approach, teaching learning-based population selection genetic algorithm was introduced, and we compared its performance against classical genetic algorithm and grid search for hyper parameter tuning. The advantages of our method have been shown to have a high accuracy rate in sex classification of silkworm pupae, which can help to improve overall efficiency. Unlike traditional destructive methods, our method is non-invasive and does not harm the pupae, allowing them to continue their development and eventual emergence as moths. It is a cost-effective method for pupa sex classification, particularly when compared to other methods such as NIR or PCR. It can also be scaled up or down depending on the size of the production facility, making it a flexible option for the sericulture industry.

RESULTS

Performance metrics

In this binary classification task, several metrics including accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic curve (AUC-ROC) are utilized to evaluate the model's efficacy. These performance metrics are calculated using Eqns 1 to 4. The accuracy metric reveals the model's overall ability to correctly distinguish between male and female cocoons. Precision quantifies the model's performance with respect to misclassifications, commonly known as false negatives. Recall assesses the model's efficiency in identifying true positives, avoiding false negatives. The F1 score offers a comprehensive efficiency measure by calculating the weighted average of precision and recall. Lastly, the AUC-ROC serves as an indicator of the model's aptitude for differentiating between the two classes.

(1)

(2)

(3)

(4)

Data analysis

In the initial phase of our proposed methodology, we focused on the critical task of feature extraction, which serves as the foundational step in achieving high classification accuracy. To this end, the HOG descriptor was employed due to its well-established performance in object recognition and image classification tasks. Recognizing the necessity for computational efficiency, we integrated the fast fourier transform (FFT) into the traditional HOG feature extraction process. This amalgamation aimed at accelerating the feature extraction step without compromising the quality of the extracted features. Our empirical results convincingly demonstrated the efficacy of this integration. Specifically, the computational speed for feature extraction was enhanced by approximately 50%, thereby achieving a significant reduction in the computational time required for this critical task. The results are shown in Fig. 1A.

Fig. 1. — **Performance analysis.** (A) Illustration of HOG versus accelerated HOG. (B) Confusion matrix of FC1 and FC2. (C) Comparison of GA and TLBPSGA RFEXGBoost for FC1 and FC2. (D) Performance analysis based on performance matrices accuracy score, f1 score, precision, recall and ROC-AUC score.

In the subsequent phase of our research, we addressed the issue of high-dimensional feature spaces. We employed linear discriminant analysis (LDA) to perform dimensionality reduction on each feature block. LDA was selected for its well-regarded capability to maximize class separability, thereby retaining the most discriminative features while compressing the feature space. Our results indicate that the application of LDA effectively reduced the dimensionality of each feature block to a manageable size of 105 features. This not only simplified the computational landscape but also optimized the feature set for subsequent classification tasks without sacrificing discriminatory power. To further refine this optimized feature set, we implemented RFE, a technique that systematically eliminates the least important features based on computed feature importance scores.

In the final stage of our investigation, we augmented our feature selection process with a hybrid approach that combined RFE with various classification algorithms. The goal was to evaluate and identify the best performing model for our classification task. Stratified 10-fold cross-validation is employed for analyzing the performance of classification models. This method ensures a more accurate and reliable assessment by maintaining an even distribution of each class across all folds. Each subset of the data is used in a rotation as both the training and testing set, providing a comprehensive evaluation of the model's performance. Performance analysis of FC1 and FC2 are respectively shown in Table 1.

Table 1.

Performance analysis of FC1 and FC2 sex classification with different classification models

Open in a new tab

The results from this hybrid approach were illuminating. We observed that among the tested algorithms, ensemble methods, specifically the RFE XGBoost model, exhibited superior classification performance. For instance, in classifying FC1 variety, the RFE XGBoost model achieved an accuracy of 97.2%, while for the FC2 variety, it recorded an accuracy of 97.1%.

To further refine the model to improve the performance optimization methods such as grid search, classic genetic algorithm and modified genetic algorithm with teaching learning-based population selection. RFEXGBoost hyperparameters such as learning rate, n estimators, maximum depth, minimum child weight, gamma, subsample, column sample by tree were taken for optimization. The range of values used for optimization is given in Table 2. Grid search systematically explores various combinations of hyper parameters for a model. It involves defining a set of possible values or ranges for each hyper parameter of interest and then exhaustively trying all possible combinations of these values. For each combination, the model is trained and evaluated using a cross-validation procedure to assess its performance. The major limitation was that it cannot use a wide range of search space as it is very time-consuming process. Genetic algorithms (GAs) are renowned for their versatility in tackling a vast search space, making them a valuable asset in optimization tasks. They often outpace grid search in terms of convergence speed, but it is worth noting that GAs may require more iterations to reach an optimal solution. However, a promising solution to expedite convergence while reducing the number of iterations is the introduction of a teaching-learning-based population selection genetic algorithm. This innovative approach integrates teaching and learning mechanisms to harness problem-specific insights, steering the optimization process toward promising solution areas with greater efficiency. As a result, this approach not only accelerates convergence but also minimizes the computational effort needed to attain an optimal solution. It strikes a better balance between exploration and exploitation, adapting dynamically to the evolving landscape of complex optimization problems. Table 3 compares the performance of grid search, classical genetic algorithm and teaching learning-based population selection genetic algorithm.

Table 2.

Hyper parameter details for optimization

graphic file with name biolopen-13-060468-i2.jpg

Open in a new tab

Table 3.

Comparison of grid search, classical GA and TLBPSGA

Open in a new tab

Analyzing Table 3 reveals that the convergence speed of the TLBPSGA surpasses that of the classic genetic algorithm. TLBPSGA achieves optimal solutions with impressive fitness scores of 0.985 for FC1 and 0.982 for FC2. In this experimental study comprising 100 generations, TLBPSGA accomplishes the optimal fitness score remarkably early, reaching its peak at the 31st generation for FC1 and the 36th generation for FC2. In contrast, the classic genetic algorithm requires significantly more iterations to achieve similar fitness levels, hitting its highest fitness at the 91st generation for FC1 and the 96th generation for FC2.The fitness progression of each generation is visually depicted in the accompanying Fig. 1C, illustrating the notable advantages of TLBPSGA in terms of convergence speed and solution quality. Table 4 shows the optimal hyper parameter obtained for both FC1 and FC2 sex classification. The confusion matrix obtained for FC1 and FC2 silkworm cocoon classification is shown in Fig. 1B. The performance analysis is shown in Fig. 1D.

Table 4.

Optimal parameter obtained for the model using TLBPSGA

graphic file with name biolopen-13-060468-i4.jpg

Open in a new tab

DISCUSSION

Our comprehensive study aimed to optimize the feature extraction, feature selection, and classification processes for high-accuracy identification of FC1 and FC2 cocoon varieties. Starting with the integration of FFT into the traditional HOG descriptor, we recorded a dramatic 50% reduction in feature extraction time. As substantiated by the empirical data, the computational speed improved significantly, transforming the feature extraction process from a bottleneck into a much more efficient procedure. This finding underlines the potential of FFT-HOG synergy for a wide range of applications where speed and accuracy are paramount. Moreover, addressing the issue of high-dimensionality, our application of LDA and RFE successfully distilled the feature set down to its most essential elements. This dimensionality reduction not only simplified the computational framework but also enabled us to maintain the discriminatory power of the features, thereby optimizing the feature set for the succeeding classification tasks. It is noteworthy that LDA reduced the feature dimensions to a manageable 105, weight of the cocoon with pupa is integrated to 105 features, while RFE further refined this set, corroborated by the feature importance scores. The final stage involved a hybrid approach that combined RFE with multiple classification algorithms, leading to a series of intriguing observations. Among the ensemble and traditional machine learning methods tested, RFE XGBoost clearly outperformed the others in both FC1 and FC2 categories, achieving accuracies of 97.2% and 97.1%, respectively. These results suggest that ensemble methods are particularly effective in leveraging the quality of the carefully selected and optimized features, especially when the feature selection process is as rigorously executed as in our methodology.

In the pursuit of further optimizing the high-performing RFEXGBoost classifier, introduced a teaching-learning-based population selection genetic algorithm. In contrast to the conventional genetic algorithm, where populations are randomly generated, the proposed approach leveraged a teaching-learning-based optimization process to generate populations. This method led to the acquisition of a more refined population, from which the best individuals were selected for mating in each iteration. This iterative refinement of the population selection process resulted in notable improvements. One significant advantage of this approach is its faster convergence when compared to traditional methods. With fewer generations, the updated genetic algorithm optimization method achieved the convergence of the optimum solution. Following the optimization process, the RFEXGBoost classifier attained an impressive accuracy of 98.5% for FC1 and 98.2% for FC2 sex classification. These accuracy levels represent a substantial improvement over existing methods.

In contextualizing our work within the landscape of existing methods for cocoon classification, a variety of techniques have been explored with varying levels of efficacy, operational suitability, and practical constraints. The exceptional precision of spectral analysis on silkworm pupae is noteworthy, but its practicality is significantly hindered by its destructive nature, as it necessitates the cutting of cocoons. On the other hand, non-invasive techniques like the one proposed by Dai et al. (2021), which employed convolutional neural networks (CNN) on spectral analysis, achieved a commendable 94% accuracy. However, Cai et al. (2014) used X-ray imaging coupled with LDA classification, reporting an accuracy of 93.3%. Despite its effectiveness, this method raises concerns about the potential impact of X-rays on cocoon reproduction and hence, remains less viable for seed production centers. When it comes to cost-effective solutions, camera imaging emerges as a front runner. Mahesh et al. (2017) achieved an accuracy of 91.3% using Neural Networks with features derived from shape and Zernike moments. Joseph Raj et al. (2019) also relied on shape feature extraction, obtaining accuracies ranging from 86.48% to 93.54%. Notably, these studies were conducted with limited sample sizes. Our study significantly extends the existing literature by leveraging a larger dataset, comprising 1579 samples from FC1 variety and 1669 from FC2 variety, totaling 3248 samples. Both FC1 and FC2 are critical for high-yield cocoon production in southern India. The dataset used in our study was validated by the experts in the silkworm seed production center in Palakkad, Kerala, India. These outcomes of the proposed method confirm the efficacy of hybridizing RFE with advanced classification algorithms like XGBoost, particularly in tasks that demand high accuracy. They also validate the effectiveness of our multi-stage optimization strategy, starting from accelerated feature extraction to refined feature selection and finally to hybrid classification and optimization. The proposed non-destructive method for sex classification of silkworm pupae, avoiding cocoon cutting, provides ecological and economic benefits by replacing invasive practices, supporting sustainability, enhancing productivity, and promoting a balanced silkworm life cycle. In the context of seed production, accurate sex classification is pivotal for breeding programs. Being able to classify the sex of pupae without damaging them allows for a more precise and efficient selection process. This in turn enhances the quality of the seed, leading to higher yields and more robust silk production. A more efficient system would reduce the time and labour costs associated with rearing and sexing the pupae, translating into economic benefits for sericulture operations. This insight is also valuable in silk manufacturing centers as the male silkworm cocoons yield a more quality and quantity silk filament than female cocoons (Bu et al., 2022). By differentiating between the filament qualities of male and female cocoons, the silk industry can optimize its processes for higher quality silk output.

MATERIALS AND METHODS

Fig. 2 shows the workflow of the process. The research methodology employed in this study is a sophisticated, multi-stage design to construct a robust and efficient classification model. Initially the FC1 and FC2 cocoon images were acquired to prepare the dataset for the study. Feature extraction is done on the image using fast fourier transform histogram oriented gradient (FFT-HOG). Further dimensionality reduction is performed on each feature block of HOG descriptor to enhance efficiency of the features and then fused weight feature and performed recursive feature elimination with classification models to obtain the highest performing classification model.

Sample collection

In this study, two specific hybrid varieties of silkworm cocoons – FC1 and FC2 – were investigated. The samples were sourced from silkworm rearers authorized by the state sericulture department to rear these particular varieties. The research utilized a dataset comprising 3248 samples, which included 1579 cocoons of the FC1 variety (837 males and 742 females) and 1669 cocoons of the FC2 variety (832 males and 837 females). The validation of the data set was done with the help of experts in the silkworm seed production center in Palakkad, Kerala, India. Sample cocoons are shown in Fig. 3A. The same dataset was used in our previous work (Thomas and Thomas, 2024).

Fig. 3. — **Image processing on cocoon images.** (A) Sample cocoon images. (B) Visualization of HOG (a) original image (b) greyscale image (c) resized image (d) HOG image. (C) Visualization of FFT conversion of spatial domain data to frequency domain in 9 bin gradient orientation.

Feature extraction

Feature extraction is a crucial part in the image classification process. In mulberry silkworm pupa sex classification, HOG features of the images were extracted. The HOG feature descriptor is a technique used in computer vision to represent the distribution of edge directions in an image. It provides a compact representation of local object appearance and shape by quantifying gradient information within localized regions. To enhance the efficiency of the feature extraction process we have combined HOG feature extraction with two dimensional FFT convolution. In this process the images are converted to grayscale and resized to 128×64 for better performance. Gradient magnitude and direction were computed using Eqns 5 and 6:

(5)

(6)

where G_x is the gradient in x direction and G_y is the gradient in the y direction, r and c refer to the rows and columns, respectively. I is the intensity value of a pixel. Magnitude and orientation were calculated using Eqns 7 and 8;

(7)

(8)

A set of gradient filters were created based on the orientations. Performing convolution in the spatial domain is expensive so to speed up the process gradient magnitude image and gradient filters are converted to frequency domain using Fourier transform then perform the convolution in the frequency domain after this process inverse Fourier transform is applied to convert back to spatial domain, which is shown in Eqn 9. This improves the speed of the feature extraction.

(9)

where c(x, y) is the result of the convolution at position (x, y), g(x, y) is the gradient image and f(x, y) is the gradient filter. Now compute the histogram of the filtered image. Normalize the histograms to improve the robustness of the descriptor and then concatenate the histograms to form the final HOG descriptor. Fig. 3B, shows the various stages of HOG feature extraction. Fig. 3C, helps to visualize the frequency domain conversion of the spatial domain data for each of the original 9 bins of gradient orientation.

Feature engineering

Final HOG descriptor consists of 3780 features, which is very high compared with the number of samples. To improve the efficiency of the classification, process the dataset is divided in to patches of 36 features then applied LDA to reduce the dimensionality to a single component. After performing dimensionality reduction 3780 features were reduced to 105 features.

Linear discriminant analysis

In classification tasks, LDA is frequently used for reducing dimensionality (Binson et al., 2021e; Anowar et al., 2021). In our research, we initially employed the HOG for feature extraction. Each extracted block comprised 36 features, upon which LDA was subsequently performed for dimensionality reduction. Utilizing Eqns 10 and 11, LDA calculates the mean and covariance matrix for each class. Subsequently, two key matrices are computed, the between-class scatter matrix, gauging the dispersion of mean feature vectors across distinct classes, and the within-class scatter matrix, assessing the dispersion of feature vectors within individual classes.

(10)

(11)

Weight feature

The weight of each cocoon containing a pupa was meticulously measured using a high-precision single-point load cell. This weight data holds significance as it correlates with the sex of the pupa. Male and female pupae weights are different but there is no significant difference in the weight of male and female cocoons when the pupa is inside the cocoon. We cannot accurately classify them with the weight feature alone. Additionally, this weight-based feature was seamlessly integrated with the extracted HOG feature set with simple concatenation method.

Recursive feature elimination (RFE)

RFE is a technique commonly used in machine learning and feature selection to improve the performance of models by selecting a subset of the most relevant features from the original feature set. The main idea behind RFE is to iteratively train a model and eliminate the least important features at each iteration until a desired number of features is reached or a certain performance metric is optimized (Senan et al., 2021; Jeon and Oh, 2020).

Initially all the 106 features obtained were taken into consideration then feature importance ranking is computed using mutual information gain. The mutual information gain measures the dependency between two variables. The values of the mutual information gain range between 0 and 1. The maximum mutual information gain and minimum entropy state that the relation between the random variables is high else the relationship is insignificant. The general representation of mutual information gain is as follows:

(12)

which measures the information gain of the i^th feature X_i and label Y. Where H(X_i) is the entropy and H(X_i|Y) is the conditional entropy for X_i given Y. Then using different models, recursively the lowest important feature is eliminated, and the performance matrix is analyzed this is repeated until an optimum result is obtained. The details of the classification algorithm.

Classification algorithms

Support vector machine

The SVM serves as a supervised classification method, capable of handling both linearly separable and non-separable data sets. Its primary aim is to identify an optimal separating hyperplane that best distinguishes between classes. To achieve this optimum separation, SVM focuses on maximizing the distance from the hyperplane to the nearest points, known as support vectors, in each class. In essence, the objective is to enlarge the smallest distance between the support vectors and the hyperplane to the maximum extent possible (Chandra and Bedi, 2021; Binson et al., 2021d; Malek et al., 2022). The distance from a data point (x₀ ,y₀) to the hyperplane is obtained by Eqn 13:

(13)

where Φ(x₀) is the point vector, w is the weight vector, b is the bias term and w₂ is Euclidean norm. Now we need to maximize the minimum distance (distance to the support vector to the hyperplane), which is represented in Eqn 14:

(14)

where Inline graphic weight vector for the optimum hyperplane, Φ(x_n) is the minimum distance data point. The decision is made according to the hyperplane.

K-nearest neighbors

The K-nearest neighbors (KNN) algorithm is a form of supervised learning, emphasizing classification based on similarity metrics. In KNN, a data point is assigned to a particular class based on the majority voting mechanism, where votes are sourced from its K closest neighbors in the feature space, as determined by a predefined similarity measure (Zhang et al., 2017; Cover and Hart, 1967). In the context of a dataset, where X represents a feature matrix and Y signifies class labels, KNN aims to estimate the conditional distribution of Y given X. It then categorizes an observation into the class that exhibits the highest conditional probability. Specifically, for a designated positive integer K, the algorithm identifies the K nearest observations to a test data point, denoted as x, and computes the conditional likelihood of x belonging to a certain class, labelled as j using Eqn 15:

(15)

X denotes the matrix of input features, and Y represents the target class labels. The variable j signifies a specific class label, and K indicates the number of closest neighbors considered in the decision-making process. The set A comprises the K closest observations to a point in question. The variable I serves as an indicator for class membership, where a value of 1 means the observation belongs to class j, and a value of 0 means it does not. Choosing a small K results in lower bias but higher variance in the model's predictions, whereas a larger K value leads to reduced variance but increased bias.

Naive bayes

Bayes' theorem is applied under the assumption that the predictor variables are conditionally independent. The Naive Bayes algorithm computes the conditional probability of the input variables x_i for each class variable y, and then selects the class that has the highest probability value as the outcome which is represented as follows:

(16)

where P(y) is the class probability, P(x|y) is the conditional probability. It is assumed that all the input features are normally distributed in Gaussian Naive Bayes. The conditional probability of Gaussian naive Bayes is expressed as follows:

(17)

Naive Bayes best works with small training samples and works well with real-world situations (Shaban et al., 2021; Binson et al., 2021b).

Decision Tree

The Decision Tree serves as a hierarchical, supervised learning algorithm characterized by a tree-like model of decisions. Within this structure, internal nodes signify decision rules that guide the classification process, while leaf nodes correspond to the final outcomes. The algorithm employs measures of attribute impurity to facilitate node splitting, aiming to maximize information gain for each attribute, thereby ensuring an optimal division at each level of the tree (Charbuty et al., 2021; Cañete-Sifuentes et al., 2021). The entropy and information gain calculation formulas are given below:

(18)

where P_i is the probability of class i. c is the total number of classes.

(19)

where IG is the information gain.

Random Forest

Random Forest operates as an ensemble-based classification approach that leverages multiple decision trees for decision making (Parmar et al., 2019; Binson et al., 2021a). The collective outcome, determined by a majority vote among these trees, is what is commonly referred to as bagging. The Random Forest algorithm commences by partitioning the original dataset into a bootstrapped version, wherein samples are randomly drawn to create this new dataset. Decision trees are then constructed using these bootstrapped datasets. During this construction phase, input feature vectors are randomly chosen. Various techniques for this random selection include picking random inputs, generating random combinations, or identifying the most effective feature for splitting. This entire process is iteratively repeated multiple times. The model's performance is subsequently evaluated using samples absent from the original dataset. These evaluation samples are fed into the individual decision trees, and the final class assignment is determined through a majority voting mechanism. The randomization inherent in the algorithm serves to reduce the correlation among the decision trees, thereby enhancing the model's generalization error (G.E.). The equation is given as follows:

(20)

where ρ indicates the mean correlation of decision trees and s is the strength of the tree classifier margin.

XGBoost

XGBoost (Extreme Gradient Boost) generates a model based on gradient-boosted decision trees introduced by Tianqi Chen (Chen et al., 2015). XGBoost provides a powerful, scalable, and accurate framework for supervised learning tasks such as classification, regression, and ranking. It is particularly well-suited for large datasets and high-dimensional feature spaces, offering features like handling missing values, regularization, and built-in cross validation to enhance model performance (Prabha et al., 2021; Mahesh et al., 2022). In XGBoost, the final model is built based on the loss function calculated from the previous model.

(21)

where N is the number of samples, y_i is the outcome of the i^th sample, p_i is the probability of i^th sample. The advantage of using XGBoost is the learning and running time efficiency.

AdaBoost

Adaptive boosting (AdaBoost) is an ensemble learning technique aimed at improving the classification performance of weak classifiers. Works by sequentially fitting a series of weak classifiers to weighted versions of the data. After each round, the algorithm updates the weights of incorrectly classified instances, effectively boosting the importance of those instances for subsequent classifiers. The final model is a weighted sum of these weak classifiers, resulting in a strong classifier that often performs significantly better than any individual weak classifier. AdaBoost is particularly effective for high-dimensional data and complex classification tasks. However, it can be sensitive to noisy data and outliers (Wyner et al., 2017; Binson et al., 2021c). AdaBoost gives weights to each training instance in the training dataset to determine its influence on the dataset. Then calculate α, the influence of each decision stumb to the final classifier.

(22)

Now new weights are given to the feature instance. If the instance is correctly classified, then α value kept is negative otherwise positive.

(23)

Model optimization

To enhance the model's performance, a comprehensive hyper parameter optimization of the chosen XGBoost model was performed. Key hyper parameters identified for fine-tuning included the learning rate, n estimators, max depth, min child weight, gamma, subsample, and colsample bytree. For this optimization task, employed the genetic algorithm (GA), a prominent choice among population-based meta heuristic techniques. Notably, to ensure optimal outcomes, incorporated modifications to the classical GA, tailoring it for enhanced efficacy in the context.

Genetic algorithm

The GA is an optimization technique inspired by natural selection, employing a population-based search strategy guided by the principle of survival of the fittest (Katoch et al., 2021). This algorithm iteratively applies genetic operators to individuals within the population, including chromosome representation, selection, crossover, mutation, and fitness function evaluation (Sabeti et al., 2022). The GA procedure can be summarized as follows: initially, a population of n chromosomes is randomly generated. The fitness of each chromosome in the population is assessed. Two chromosomes are selected from the population based on their fitness values. A single-point crossover operator with a specified crossover probability is then applied to the selected chromosomes, resulting in an offspring. Subsequently, a uniform mutation operator is applied to the produced offspring using a specified mutation probability, generating new offspring which is introduced into the new population. These selection, crossover, and mutation operations are iteratively repeated on the current population until the new population is complete. Mathematically, GA dynamically adapts its search process by adjusting the probabilities of crossover and mutation, ultimately converging towards optimal solutions. GA is capable of modifying encoded genes and evaluating multiple individuals, yielding multiple optimal solutions. Consequently, GA exhibits superior global search capabilities. The offspring produced through the crossover of parent chromosomes have the potential to disrupt the favorable genetic schemas of the parent chromosomes. The crossover formula is expressed as follows:

(24)

here, g represents the number of generations, and G is the total number of evolutionary generations determined by the population. Eqn 24 demonstrates that the value of R dynamically changes and increases as the number of evolutionary generations rises. In the initial stages of GA, individual similarity is low, and R should be kept low to preserve the integrity of excellent genetic schemas within the population. As evolution progresses, individual similarity increases, and R should subsequently increase. According to the Schema theorem, the original genetic schema must adapt to modified schemas. To maintain diversity within the population, these new schemas are retained during the early stages of evolution. Towards the end of evolution, appropriate schemas are established to safeguard the integrity of excellent genetic schemas, preventing distortion.

Algorithm 1: Classical GA

Input:

Population Count, p
Maximum iterations, l

Output:

Best overall solution, B_opt

begin

Initiate initial group of p sequences, B_i (i=1,2, 3,………p)
Set loop counter t=0
Determine the fitness score of each member
While(t<l)

Pick parent sequences from initial group based on their fitness score

Conduct Crossover method on chosen pair using crossover likelihood

Implement mutation on the descendants using mutation likelihood

Substitute the former group with the new generation

Increase the counter t by 1

End While

Return the optimal solution, B_opt

end

Teaching-learning-based population selection GA

In contrast to the conventional GA, which typically begins with a random initial population and applies genetic operators to it over several generations to eventually obtain an optimal solution, the proposed approach incorporates a population-refinement process inspired by teaching-learning dynamics (Rao et al., 2011). In this method, the population is initiated with random individuals and then the fitness of each member is calculated. Subsequently, a teaching-learning-based population refinement procedure is performed, which may reduce the population size while aiming to improve the quality of the individuals. For instance, if the initial population comprises 100 individuals, this refinement step may whittle it down to 50 individuals with enhanced characteristics. The primary objective here is to cultivate a population of superior individuals, rather than relying solely on random generation. The population-refinement process consists of two distinct phases: the teaching phase and the learning phase. The teaching phase commences by identifying the individual within the population exhibiting the highest fitness. This distinguished individual assumes the role of the teacher and serves as a reference point for guiding the learning process. The remaining individuals in the population are designated as learners. The enhancement of the learners is facilitated through an iterative update mechanism. Specifically, the solutions of the learners are adjusted using the following equation:

(25)

In this equation, the variable rand represents a random factor and TF is the teaching factor. Subsequently, the fitness of the new solution is evaluated for each learner. If the fitness of this new solution surpasses that of the teacher's solution, the teacher is substituted by the improved learner, ensuring that the best solution guides the refinement process. In the subsequent learner phase of the algorithm, a learner is randomly selected from the pool, and this chosen learner engages in interactions with other learners. The key criterion for interaction is rooted in the fitness of the randomly selected learner, which must be superior to that of the currently considered learner. During interaction, solution of the current learner is updated using the following equations:

(26)

However, if the fitness of the randomly chosen learner does not exceed that of the current learner, the update equation should be adjusted as follows:

(27)

This learner phase contributes to diversifying the population and exploring alternative solutions. In summary, the proposed algorithm adopts a dual-phase approach. It initiates with a teaching phase, where the finest individual guides the learning process, followed by a learner phase that fosters interactions among individuals. By iteratively applying these phases and refining the population, the algorithm aims to elevate the overall solution quality while preserving diversity. Parent selection is carried out from this refined population using the Tournament selection technique, followed by crossover and mutation operations to update the individuals. The iterative process continues until the optimal solution is achieved. The overview of TLBPSGA is depicted in Fig. 4.

Algorithm 2: Teaching-learning-based population selection genetic algorithm (TLBPSGA)

Input:

Initial population count, p
Maximum iterations, t

Output:

Optimal solution, B_opt

begin

Create an initial set of p members, B_i (i=1,2,3,………p)
Set loop counter l=0
Determine the fitness score of each member
While(l<t)

//Teaching Phase

If fitness(member)==max(fitness)

Teacher=member

Else

Learner=member

Iterate through learners and update their solutions using:

new_solution=old_solution+rand(teacher_solution-Teaching_Factor*mean)

Evaluate the fitness of the new solution for each learner

If fitness(learner_solution)>fitness(Teacher)

learner_solution=Teacher

//Learner Phase

For each learner in the population:

Randomly pick another learner

If fitness(random_learner)>(current learner)

New_solution=old_solution+rand(random_learner-old_solution)

Else

New_solution=old_solution+rand(old_solution-random_learner)

Evaluate the fitness of the new solution and update accordingly

//Genetic Operations

Use Tournament selection technique for parent selection from the refined population

Apply crossover and mutation methods to refine individuals

Increment the iteration counter until the optimal solution is found

End While

Return the best overall solution, B_opt

end

Conclusion

Our proposed methodology demonstrates a multi-stage optimization strategy from feature extraction to classification. The FFT-HOG integration offers an expedited feature extraction process, reducing time consumption by about 50% without sacrificing feature quality. Meanwhile, the application of LDA and RFE for dimensionality reduction resulted in an optimized feature set suitable for high-accuracy classification tasks. Importantly, the hybrid RFE XGBoost model emerged as the most effective classification algorithm, thus validating the robustness of our entire methodological pipeline. TLBPSGA enhanced the efficiency of the RFE XGBoost Model, outperforming traditional genetic algorithms in convergence time and achieved impressive accuracy levels, 98.5% for FC1 and 98.2% for FC2. The study contributes to the literature by providing a well-vetted, efficient, and high-performing method for FC1 and FC2 cocoon variety classification. This research holds promising implications not only for the scientific community engaged in machine learning and computer vision but also for practical, real-world applications in agriculture and silk-production.

Acknowledgements

The authors wish to express their gratitude to M. Shyamala (scientist D) and the employees of the Seed Production Center, Palakkad, Kerala, India, for their invaluable assistance in the collection of samples and validation of the dataset.

Footnotes

Author Contributions

Conceptualization: S.T., J.T.; Methodology: S.T.; Software: S.T.; Formal analysis: S.T.; Investigation: S.T.; Data curation: S.T.; Writing - original draft: S.T.; Writing - review & editing: S.T., J.T.; Supervision: J.T.; Funding acquisition: J.T.

Funding

This research was funded by the Department of Science and Technology (DST) with grant reference number: SEED/WS/2019/135. Deposited in PMC for immediate release.

Data availability

The data supporting the findings of this study are available from the corresponding author and will be made available upon reasonable request.

References

Altman, G. H. and Farrell, B. D. (2022). Sericulture as a sustainable agroindustry. Clean. Circ. Bioecon. 2, 100011. 10.1016/j.clcb.2022.100011 [DOI] [Google Scholar]
Anowar, F., Sadaoui, S. and Selim, B. (2021). Conceptual and empirical comparison of dimensionality reduction algorithms (pca, kpca, lda, mds, svd, lle, isomap, le, ica, t-sne). Comput. Sci. Rev. 40, 100378. 10.1016/j.cosrev.2021.100378 [DOI] [Google Scholar]
Binson, V. A., Subramoniam, M. and Mathew, L. (2021a). Detection of COPD and Lung Cancer with electronic nose using ensemble learning methods. Clin. Chim. Acta. 523, 231-238. 10.1016/j.cca.2021.10.005 [DOI] [PubMed] [Google Scholar]
Binson, V. A., Subramoniam, M. and Mathew, L. (2021b). Discrimination of COPD and lung cancer from controls through breath analysis using a self developed e-nose. J. Breath Res. 15, 046003. 10.1088/1752-7163/ac1326 [DOI] [PubMed] [Google Scholar]
Binson, V. A., Subramoniam, M., Ragesh, G. K. and Kumar, A. (2021c). Early detection of lung cancer through breath analysis using adaboost ensemble learning method. In: 2021 2nd International Conference on Advances in Computing, Communication, Embedded and Secure Systems (ACCESS) (pp. 183-187). IEEE. [Google Scholar]
Binson, V. A., Subramoniam, M., Sunny, Y. and Mathew, L. (2021d). Prediction of pulmonary diseases with electronic nose using SVM and XGBoost. IEEE Sens. J. 21, 20886-20895. 10.1109/JSEN.2021.3100390 [DOI] [Google Scholar]
Binson, V. A., Subramoniam, M. and Mathew, L. (2021e). Noninvasive detection of COPD and Lung Cancer through breath analysis using MOS Sensor array based e-nose. Expert Rev. Mol. Diagn. 21, 1223-1233. 10.1080/14737159.2021.1971079 [DOI] [PubMed] [Google Scholar]
Bu, C., Zheng, R., Huang, G., Wu, J., Liu, G., Donald, M. L., Dong, T. and Xu, X. (2022). The differences in cocoon and silk qualities among sex-related mulberry and silkworm feeding groups. PLoS One 17, e0270021. 10.1371/journal.pone.0270021 [DOI] [PMC free article] [PubMed] [Google Scholar]
Cai, J. R., Yuan, L. M., Liu, B. and Sun, L. (2014). Nondestructive gender identification of silkworm cocoons using X-ray imaging with multivariate data analysis. Anal. Methods 6, 7224-7233. 10.1039/C4AY00940A [DOI] [Google Scholar]
Cañete-Sifuentes, L., Monroy, R. and Medina-Pérez, M. A. (2021). A review and experimental comparison of multivariate decision trees. IEEE Access. 9, 110451-110479. 10.1109/ACCESS.2021.3102239 [DOI] [Google Scholar]
Chandra, M. A. and Bedi, S. S. (2021). Survey on SVM and their application in image classification. Int. J. Inf. Technol. 13, 1-11.33527094 [Google Scholar]
Charbuty, B. and Abdulazeez, A. (2021). Classification based on decision tree algorithm for machine learning. Int. J. Appl. Sci. Technol. 2, 20-28. [Google Scholar]
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y. and Cho, H. (2015). Xgboost: extreme gradient boosting. R package version 0.4-2.1(4), 1-4.
Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13, 21-27. 10.1109/TIT.1967.1053964 [DOI] [Google Scholar]
Dai, F., Wang, X., Zhong, Y., Zhong, S. and Chen, C. (2021). Convolution neural network application in the simultaneous detection of gender and variety of silkworm (Bombyx mori) cocoons. J. Phys., Conf. Ser. 1769, 012017. 10.1088/1742-6596/1769/1/012017 [DOI] [Google Scholar]
Fu, X., Zhao, S., Luo, H., Tao, D., Wu, X. and Li, G. (2023). Sex classification of silkworm pupae from different varieties by near infrared spectroscopy combined with chemometrics. Infrared Phys. Technol. 129, 104553. 10.1016/j.infrared.2023.104553 [DOI] [Google Scholar]
Gundi, R., Vanitha, C., Tulsi, K. N., Velusamy, L., Ramesha, A., Ponnuvel, K. M., Rabha, M., Sivaprasad, V., Pradeep, A. R. and Mishra, R. K. (2023). Molecular marker assisted breeding and development of bidensovirus resistant and thermo tolerant silkworm (Bombyx mori) hybrids suitable for tropical climatic conditions. Agric. Res. 12, 428-438. 10.1007/s40003-023-00662-x [DOI] [Google Scholar]
Guo, F., He, F., Tao, D. and Li, G. (2022). Automatic exposure correction algorithm for online silkworm pupae (Bombyx mori) sex classification. Comput. Electron. Agric. 198, 107108. 10.1016/j.compag.2022.107108 [DOI] [Google Scholar]
He, H., Zhu, S., Shen, L., Chang, X., Wang, Y., Zeng, D., Zhao, T., Dai, F. and Zhao, T. and (2023). Integrated analysis of machine learning and deep learning in silkworm pupae (Bombyx mori) species and sex identification. Animals 13, 3612. 10.3390/ani13233612 [DOI] [PMC free article] [PubMed] [Google Scholar]
Jeon, H. and Oh, S. (2020). Hybrid-recursive feature elimination for efficient feature selection. Appl. Sci. 10, 3211. 10.3390/app10093211 [DOI] [Google Scholar]
Joseph Raj, A. N., Sundaram, R., Mahesh, V. G., Zhuang, Z. and Simeone, A. (2019). A multi-sensor system for silkworm cocoon gender classification via image processing and support vector machine. Sensors 19, 2656. 10.3390/s19122656 [DOI] [PMC free article] [PubMed] [Google Scholar]
Katoch, S., Chauhan, S. S. and Kumar, V. (2021). A review on genetic algorithm: past, present, and future. Multimed. Tools Appl. 80, 8091-8126. 10.1007/s11042-020-10139-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
Klein, R. and Celik, T. (2017). The Wits Intelligent Teaching System: Detecting student engagement during lectures using convolutional neural networks. In: 2017 IEEE international conference on image processing (ICIP) (pp. 2856-2860). IEEE. [Google Scholar]
Lin, X., Zhuang, Y., Dan, T., Guanglin, L., Xiaodong, Y., Jie, S. and Xuwen, L. (2019). The model updating based on near infrared spectroscopy for the sex identification of silkworm pupae from different varieties by a semi-supervised learning with pre-labeling method. Spectrosc. Lett. 52, 642-652. 10.1080/00387010.2019.1681463 [DOI] [Google Scholar]
Mahesh, V. G., Raj, A. N. J. and Celik, T. (2017). Silkworm cocoon classification using fusion of Zernike moments-based shape descriptors and physical parameters for quality egg production. Int. J. Intell. Syst. Technol. Appl. 16, 246-268. [Google Scholar]
Mahesh, T. R., Vinoth Kumar, V., Muthukumaran, V., Shashikala, H. K., Swapna, B. and Guluwadi, S. (2022). Performance analysis of xgboost ensemble methods for survivability with the classification of breast cancer. J. Sens. 2022, 1-8. 10.1155/2022/4649510 [DOI] [Google Scholar]
Malek, N. H. A., Wan Yaacob, W. F., Md Nasir, S. A. and Shaadan, N. (2022). Prediction of water quality classification of the Kelantan River Basin, Malaysia, using machine learning techniques. Water 14, 1067. 10.3390/w14071067 [DOI] [Google Scholar]
Moghimi, A., Mohammadzadeh, A., Celik, T., Brisco, B. and Amani, M. (2022). Automatic relative radiometric normalization of bi-temporal satellite images using a coarse-to-fine pseudo-invariant features selection and fuzzy integral fusion strategies. Remote Sens 14, 1777. 10.3390/rs14081777 [DOI] [Google Scholar]
Nam, W., Han, B. and Han, J. H. (2011). Improving object localization using macrofeature layout selection. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops) (pp. 1801-1808). IEEE. [Google Scholar]
Nam, W., Han, B. and Han, J. H. (2014). Macrofeature layout selection for pedestrian localization and its acceleration using GPU. Comput. Vis. Image Underst. 120, 46-58. 10.1016/j.cviu.2013.10.011 [DOI] [Google Scholar]
Parmar, A., Katariya, R. and Patel, V. (2019). A review on random forest: An ensemble classifier. In: International conference on intelligent data communication technologies and internet of things (ICICI) 2018 (pp. 758-763). Springer International Publishing. [Google Scholar]
Prabha, A., Yadav, J., Rani, A. and Singh, V. (2021). Design of intelligent diabetes mellitus detection system using hybrid feature selection based XGBoost classifier. Comput. Biol. Med. 136, 104664. 10.1016/j.compbiomed.2021.104664 [DOI] [PubMed] [Google Scholar]
Qiu, G., Tao, D., Xiao, Q. and Li, G. (2021). Simultaneous sex and species classification of silkworm pupae by NIR spectroscopy combined with chemometric analysis. J. Sci. Food Agric. 101, 1323-1330. 10.1002/jsfa.10740 [DOI] [PubMed] [Google Scholar]
Rao, R. V., Savsani, V. J. and Vakharia, D. P. (2011). Teaching–learning based optimization: a novel method for constrained mechanical design optimization problems. Comput. Aided Des. 43, 303-315. 10.1016/j.cad.2010.12.015 [DOI] [Google Scholar]
Reddy, K. D., Nagaraju, J. and Abraham, E. G. (1999). Genetic characterization of the silkworm Bombyx mori by simple sequence repeat (SSR)-anchored PCR. Heredity 83, 681-687. 10.1046/j.1365-2540.1999.00607.x [DOI] [PubMed] [Google Scholar]
Sabeti, V., Sobhani, M. and Hasheminejad, S. M. H. (2022). An adaptive image steganography method based on integer wavelet transform using genetic algorithm. Comput. Electr. Eng. 99, 107809. 10.1016/j.compeleceng.2022.107809 [DOI] [Google Scholar]
Senan, E. M., Al-Adhaileh, M. H., Alsaade, F. W., Aldhyani, T. H., Alqarni, A. A., Alsharif, N., Uddin, M. I., Alahmadi, A. H., Jadhav, M. E. and Singh, D. (2021). Diagnosis of chronic kidney disease using effective classification algorithms and recursive feature elimination techniques. J. Healthc. Eng. 2021, 1004767. 10.1155/2021/1004767 [DOI] [PMC free article] [PubMed] [Google Scholar]
Seo, P. H., Lee, J., Jung, D., Han, B. and Cho, M. (2018). Attentive semantic alignment with offset-aware correlation kernels. In: Proceedings of the European Conference on Computer Vision (ECCV) (pp. 349-364). [Google Scholar]
Shaban, W. M., Rabie, A. H., Saleh, A. I. and Abo-Elsoud, M. A. (2021). Accurate detection of COVID-19 patients based on distance biased Naïve Bayes (DBNB) classification strategy. Pattern Recognit. 119, 108110. 10.1016/j.patcog.2021.108110 [DOI] [PMC free article] [PubMed] [Google Scholar]
Son, J., Jung, I., Park, K. and Han, B. (2015). Tracking-by segmentation with online gradient boosting decision tree. In: Proceedings of the IEEE international conference on computer vision (pp. 3056-3064). [Google Scholar]
Tao, D., Wang, Z., Li, G. and Qiu, G. (2018). Accurate identification of the sex and species of silkworm pupae using near infrared spectroscopy. J. Appl. Spectrosc. 85, 949-952. 10.1007/s10812-018-0744-z [DOI] [Google Scholar]
Thomas, S. and Thomas, J. (2020). A review on existing methods and classification algorithms used for sex determination of silkworm in sericulture. In: International Conference on Intelligent Systems Design and Applications (pp. 567-579). Cham: Springer International Publishing. [Google Scholar]
Thomas, S. and Thomas, J. (2022). Non-destructive silkworm pupa gender classification with X-ray images using ensemble learning. Artif Intell Agric. 6, 100-110. [Google Scholar]
Thomas, S. and Thomas, J. (2024). Nondestructive and cost-effective silkworm, Bombyx mori (Lepidoptera: Bombycidae) cocoon sex classification using machine learning. Int. J. Trop. Insect Sci. 1-13. [Google Scholar]
Traut, W., Sahara, K. and Marec, F. (2008). Sex chromosomes and sex determination in Lepidoptera. Sex Dev. 1, 332-346. 10.1159/000111765 [DOI] [PubMed] [Google Scholar]
Wyner, A. J., Olson, M., Bleich, J. and Mease, D. (2017). Explaining the success of adaboost and random forests as interpolating classifiers. J. Mach. Learn. Res. 18, 1558-1590. [Google Scholar]
Yavariabdi, A., Kusetogullari, H., Celik, T. and Cicek, H. (2021). FastUAVnet: a multi-UAV detection algorithm for embedded platforms. Electronics 10, 724. 10.3390/electronics10060724 [DOI] [Google Scholar]
Yavariabdi, A., Kusetogullari, H., Celik, T., Thummanapally, S., Rijwan, S. and Hall, J. (2022). Cardis: a Swedish historical handwritten character and word dataset. IEEE Access. 10, 55338-55349. 10.1109/ACCESS.2022.3175197 [DOI] [Google Scholar]
Zhang, S., Li, X., Zong, M., Zhu, X. and Cheng, D. (2017). Learning k for knn classification. ACM Trans. Intell. Syst. Technol. 8, 1-19. [Google Scholar]
Zhu, Z., Yuan, H., Song, C., Li, X., Fang, D., Guo, Z., Zhu, X., Liu, W. and Yan, G. (2018). High-speed sex identification and sorting of living silkworm pupae using near-infrared spectroscopy combined with chemometrics. Sens. Actuat. B Chem. 268, 299-309. 10.1016/j.snb.2018.04.093 [DOI] [Google Scholar]

[BIO060468C1] Altman, G. H. and Farrell, B. D. (2022). Sericulture as a sustainable agroindustry. Clean. Circ. Bioecon. 2, 100011. 10.1016/j.clcb.2022.100011 [DOI] [Google Scholar]

[BIO060468C2] Anowar, F., Sadaoui, S. and Selim, B. (2021). Conceptual and empirical comparison of dimensionality reduction algorithms (pca, kpca, lda, mds, svd, lle, isomap, le, ica, t-sne). Comput. Sci. Rev. 40, 100378. 10.1016/j.cosrev.2021.100378 [DOI] [Google Scholar]

[BIO060468C3] Binson, V. A., Subramoniam, M. and Mathew, L. (2021a). Detection of COPD and Lung Cancer with electronic nose using ensemble learning methods. Clin. Chim. Acta. 523, 231-238. 10.1016/j.cca.2021.10.005 [DOI] [PubMed] [Google Scholar]

[BIO060468C4] Binson, V. A., Subramoniam, M. and Mathew, L. (2021b). Discrimination of COPD and lung cancer from controls through breath analysis using a self developed e-nose. J. Breath Res. 15, 046003. 10.1088/1752-7163/ac1326 [DOI] [PubMed] [Google Scholar]

[BIO060468C5] Binson, V. A., Subramoniam, M., Ragesh, G. K. and Kumar, A. (2021c). Early detection of lung cancer through breath analysis using adaboost ensemble learning method. In: 2021 2nd International Conference on Advances in Computing, Communication, Embedded and Secure Systems (ACCESS) (pp. 183-187). IEEE. [Google Scholar]

[BIO060468C6] Binson, V. A., Subramoniam, M., Sunny, Y. and Mathew, L. (2021d). Prediction of pulmonary diseases with electronic nose using SVM and XGBoost. IEEE Sens. J. 21, 20886-20895. 10.1109/JSEN.2021.3100390 [DOI] [Google Scholar]

[BIO060468C7] Binson, V. A., Subramoniam, M. and Mathew, L. (2021e). Noninvasive detection of COPD and Lung Cancer through breath analysis using MOS Sensor array based e-nose. Expert Rev. Mol. Diagn. 21, 1223-1233. 10.1080/14737159.2021.1971079 [DOI] [PubMed] [Google Scholar]

[BIO060468C8] Bu, C., Zheng, R., Huang, G., Wu, J., Liu, G., Donald, M. L., Dong, T. and Xu, X. (2022). The differences in cocoon and silk qualities among sex-related mulberry and silkworm feeding groups. PLoS One 17, e0270021. 10.1371/journal.pone.0270021 [DOI] [PMC free article] [PubMed] [Google Scholar]

[BIO060468C9] Cai, J. R., Yuan, L. M., Liu, B. and Sun, L. (2014). Nondestructive gender identification of silkworm cocoons using X-ray imaging with multivariate data analysis. Anal. Methods 6, 7224-7233. 10.1039/C4AY00940A [DOI] [Google Scholar]

[BIO060468C10] Cañete-Sifuentes, L., Monroy, R. and Medina-Pérez, M. A. (2021). A review and experimental comparison of multivariate decision trees. IEEE Access. 9, 110451-110479. 10.1109/ACCESS.2021.3102239 [DOI] [Google Scholar]

[BIO060468C11] Chandra, M. A. and Bedi, S. S. (2021). Survey on SVM and their application in image classification. Int. J. Inf. Technol. 13, 1-11.33527094 [Google Scholar]

[BIO060468C12] Charbuty, B. and Abdulazeez, A. (2021). Classification based on decision tree algorithm for machine learning. Int. J. Appl. Sci. Technol. 2, 20-28. [Google Scholar]

[BIO060468C13] Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y. and Cho, H. (2015). Xgboost: extreme gradient boosting. R package version 0.4-2.1(4), 1-4.

[BIO060468C14] Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13, 21-27. 10.1109/TIT.1967.1053964 [DOI] [Google Scholar]

[BIO060468C15] Dai, F., Wang, X., Zhong, Y., Zhong, S. and Chen, C. (2021). Convolution neural network application in the simultaneous detection of gender and variety of silkworm (Bombyx mori) cocoons. J. Phys., Conf. Ser. 1769, 012017. 10.1088/1742-6596/1769/1/012017 [DOI] [Google Scholar]

[BIO060468C16] Fu, X., Zhao, S., Luo, H., Tao, D., Wu, X. and Li, G. (2023). Sex classification of silkworm pupae from different varieties by near infrared spectroscopy combined with chemometrics. Infrared Phys. Technol. 129, 104553. 10.1016/j.infrared.2023.104553 [DOI] [Google Scholar]

[BIO060468C17] Gundi, R., Vanitha, C., Tulsi, K. N., Velusamy, L., Ramesha, A., Ponnuvel, K. M., Rabha, M., Sivaprasad, V., Pradeep, A. R. and Mishra, R. K. (2023). Molecular marker assisted breeding and development of bidensovirus resistant and thermo tolerant silkworm (Bombyx mori) hybrids suitable for tropical climatic conditions. Agric. Res. 12, 428-438. 10.1007/s40003-023-00662-x [DOI] [Google Scholar]

[BIO060468C18] Guo, F., He, F., Tao, D. and Li, G. (2022). Automatic exposure correction algorithm for online silkworm pupae (Bombyx mori) sex classification. Comput. Electron. Agric. 198, 107108. 10.1016/j.compag.2022.107108 [DOI] [Google Scholar]

[BIO060468C19] He, H., Zhu, S., Shen, L., Chang, X., Wang, Y., Zeng, D., Zhao, T., Dai, F. and Zhao, T. and (2023). Integrated analysis of machine learning and deep learning in silkworm pupae (Bombyx mori) species and sex identification. Animals 13, 3612. 10.3390/ani13233612 [DOI] [PMC free article] [PubMed] [Google Scholar]

[BIO060468C20] Jeon, H. and Oh, S. (2020). Hybrid-recursive feature elimination for efficient feature selection. Appl. Sci. 10, 3211. 10.3390/app10093211 [DOI] [Google Scholar]

[BIO060468C21] Joseph Raj, A. N., Sundaram, R., Mahesh, V. G., Zhuang, Z. and Simeone, A. (2019). A multi-sensor system for silkworm cocoon gender classification via image processing and support vector machine. Sensors 19, 2656. 10.3390/s19122656 [DOI] [PMC free article] [PubMed] [Google Scholar]

[BIO060468C22] Katoch, S., Chauhan, S. S. and Kumar, V. (2021). A review on genetic algorithm: past, present, and future. Multimed. Tools Appl. 80, 8091-8126. 10.1007/s11042-020-10139-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[BIO060468C23] Klein, R. and Celik, T. (2017). The Wits Intelligent Teaching System: Detecting student engagement during lectures using convolutional neural networks. In: 2017 IEEE international conference on image processing (ICIP) (pp. 2856-2860). IEEE. [Google Scholar]

[BIO060468C24] Lin, X., Zhuang, Y., Dan, T., Guanglin, L., Xiaodong, Y., Jie, S. and Xuwen, L. (2019). The model updating based on near infrared spectroscopy for the sex identification of silkworm pupae from different varieties by a semi-supervised learning with pre-labeling method. Spectrosc. Lett. 52, 642-652. 10.1080/00387010.2019.1681463 [DOI] [Google Scholar]

[BIO060468C25] Mahesh, V. G., Raj, A. N. J. and Celik, T. (2017). Silkworm cocoon classification using fusion of Zernike moments-based shape descriptors and physical parameters for quality egg production. Int. J. Intell. Syst. Technol. Appl. 16, 246-268. [Google Scholar]

[BIO060468C26] Mahesh, T. R., Vinoth Kumar, V., Muthukumaran, V., Shashikala, H. K., Swapna, B. and Guluwadi, S. (2022). Performance analysis of xgboost ensemble methods for survivability with the classification of breast cancer. J. Sens. 2022, 1-8. 10.1155/2022/4649510 [DOI] [Google Scholar]

[BIO060468C27] Malek, N. H. A., Wan Yaacob, W. F., Md Nasir, S. A. and Shaadan, N. (2022). Prediction of water quality classification of the Kelantan River Basin, Malaysia, using machine learning techniques. Water 14, 1067. 10.3390/w14071067 [DOI] [Google Scholar]

[BIO060468C28] Moghimi, A., Mohammadzadeh, A., Celik, T., Brisco, B. and Amani, M. (2022). Automatic relative radiometric normalization of bi-temporal satellite images using a coarse-to-fine pseudo-invariant features selection and fuzzy integral fusion strategies. Remote Sens 14, 1777. 10.3390/rs14081777 [DOI] [Google Scholar]

[BIO060468C29] Nam, W., Han, B. and Han, J. H. (2011). Improving object localization using macrofeature layout selection. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops) (pp. 1801-1808). IEEE. [Google Scholar]

[BIO060468C30] Nam, W., Han, B. and Han, J. H. (2014). Macrofeature layout selection for pedestrian localization and its acceleration using GPU. Comput. Vis. Image Underst. 120, 46-58. 10.1016/j.cviu.2013.10.011 [DOI] [Google Scholar]

[BIO060468C31] Parmar, A., Katariya, R. and Patel, V. (2019). A review on random forest: An ensemble classifier. In: International conference on intelligent data communication technologies and internet of things (ICICI) 2018 (pp. 758-763). Springer International Publishing. [Google Scholar]

[BIO060468C32] Prabha, A., Yadav, J., Rani, A. and Singh, V. (2021). Design of intelligent diabetes mellitus detection system using hybrid feature selection based XGBoost classifier. Comput. Biol. Med. 136, 104664. 10.1016/j.compbiomed.2021.104664 [DOI] [PubMed] [Google Scholar]

[BIO060468C33] Qiu, G., Tao, D., Xiao, Q. and Li, G. (2021). Simultaneous sex and species classification of silkworm pupae by NIR spectroscopy combined with chemometric analysis. J. Sci. Food Agric. 101, 1323-1330. 10.1002/jsfa.10740 [DOI] [PubMed] [Google Scholar]

[BIO060468C34] Rao, R. V., Savsani, V. J. and Vakharia, D. P. (2011). Teaching–learning based optimization: a novel method for constrained mechanical design optimization problems. Comput. Aided Des. 43, 303-315. 10.1016/j.cad.2010.12.015 [DOI] [Google Scholar]

[BIO060468C35] Reddy, K. D., Nagaraju, J. and Abraham, E. G. (1999). Genetic characterization of the silkworm Bombyx mori by simple sequence repeat (SSR)-anchored PCR. Heredity 83, 681-687. 10.1046/j.1365-2540.1999.00607.x [DOI] [PubMed] [Google Scholar]

[BIO060468C36] Sabeti, V., Sobhani, M. and Hasheminejad, S. M. H. (2022). An adaptive image steganography method based on integer wavelet transform using genetic algorithm. Comput. Electr. Eng. 99, 107809. 10.1016/j.compeleceng.2022.107809 [DOI] [Google Scholar]

[BIO060468C37] Senan, E. M., Al-Adhaileh, M. H., Alsaade, F. W., Aldhyani, T. H., Alqarni, A. A., Alsharif, N., Uddin, M. I., Alahmadi, A. H., Jadhav, M. E. and Singh, D. (2021). Diagnosis of chronic kidney disease using effective classification algorithms and recursive feature elimination techniques. J. Healthc. Eng. 2021, 1004767. 10.1155/2021/1004767 [DOI] [PMC free article] [PubMed] [Google Scholar]

[BIO060468C38] Seo, P. H., Lee, J., Jung, D., Han, B. and Cho, M. (2018). Attentive semantic alignment with offset-aware correlation kernels. In: Proceedings of the European Conference on Computer Vision (ECCV) (pp. 349-364). [Google Scholar]

[BIO060468C39] Shaban, W. M., Rabie, A. H., Saleh, A. I. and Abo-Elsoud, M. A. (2021). Accurate detection of COVID-19 patients based on distance biased Naïve Bayes (DBNB) classification strategy. Pattern Recognit. 119, 108110. 10.1016/j.patcog.2021.108110 [DOI] [PMC free article] [PubMed] [Google Scholar]

[BIO060468C40] Son, J., Jung, I., Park, K. and Han, B. (2015). Tracking-by segmentation with online gradient boosting decision tree. In: Proceedings of the IEEE international conference on computer vision (pp. 3056-3064). [Google Scholar]

[BIO060468C41] Tao, D., Wang, Z., Li, G. and Qiu, G. (2018). Accurate identification of the sex and species of silkworm pupae using near infrared spectroscopy. J. Appl. Spectrosc. 85, 949-952. 10.1007/s10812-018-0744-z [DOI] [Google Scholar]

[BIO060468C42] Thomas, S. and Thomas, J. (2020). A review on existing methods and classification algorithms used for sex determination of silkworm in sericulture. In: International Conference on Intelligent Systems Design and Applications (pp. 567-579). Cham: Springer International Publishing. [Google Scholar]

[BIO060468C43] Thomas, S. and Thomas, J. (2022). Non-destructive silkworm pupa gender classification with X-ray images using ensemble learning. Artif Intell Agric. 6, 100-110. [Google Scholar]

[BIO060468C44] Thomas, S. and Thomas, J. (2024). Nondestructive and cost-effective silkworm, Bombyx mori (Lepidoptera: Bombycidae) cocoon sex classification using machine learning. Int. J. Trop. Insect Sci. 1-13. [Google Scholar]

[BIO060468C45] Traut, W., Sahara, K. and Marec, F. (2008). Sex chromosomes and sex determination in Lepidoptera. Sex Dev. 1, 332-346. 10.1159/000111765 [DOI] [PubMed] [Google Scholar]

[BIO060468C46] Wyner, A. J., Olson, M., Bleich, J. and Mease, D. (2017). Explaining the success of adaboost and random forests as interpolating classifiers. J. Mach. Learn. Res. 18, 1558-1590. [Google Scholar]

[BIO060468C47] Yavariabdi, A., Kusetogullari, H., Celik, T. and Cicek, H. (2021). FastUAVnet: a multi-UAV detection algorithm for embedded platforms. Electronics 10, 724. 10.3390/electronics10060724 [DOI] [Google Scholar]

[BIO060468C48] Yavariabdi, A., Kusetogullari, H., Celik, T., Thummanapally, S., Rijwan, S. and Hall, J. (2022). Cardis: a Swedish historical handwritten character and word dataset. IEEE Access. 10, 55338-55349. 10.1109/ACCESS.2022.3175197 [DOI] [Google Scholar]

[BIO060468C49] Zhang, S., Li, X., Zong, M., Zhu, X. and Cheng, D. (2017). Learning k for knn classification. ACM Trans. Intell. Syst. Technol. 8, 1-19. [Google Scholar]

[BIO060468C50] Zhu, Z., Yuan, H., Song, C., Li, X., Fang, D., Guo, Z., Zhu, X., Liu, W. and Yan, G. (2018). High-speed sex identification and sorting of living silkworm pupae using near-infrared spectroscopy combined with chemometrics. Sens. Actuat. B Chem. 268, 299-309. 10.1016/j.snb.2018.04.093 [DOI] [Google Scholar]

PERMALINK

An optimized method for mulberry silkworm, Bombyx mori (Bombycidae:Lepidoptera) sex classification using TLBPSGA-RFEXGBoost

Sania Thomas

Jyothi Thomas

ABSTRACT

INTRODUCTION

RESULTS

Performance metrics

Data analysis

Fig. 1.

Table 1.

Table 2.

Table 3.

Table 4.

DISCUSSION

MATERIALS AND METHODS

Fig. 2.

Sample collection

Fig. 3.

Feature extraction

Feature engineering

Linear discriminant analysis

Weight feature

Recursive feature elimination (RFE)

Classification algorithms

Support vector machine

K-nearest neighbors

Naive bayes

Decision Tree

Random Forest

XGBoost

AdaBoost

Model optimization

Genetic algorithm

Teaching-learning-based population selection GA

Fig. 4.

Conclusion

Acknowledgements

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases