Abstract
The DNA barcoding approach has been used extensively in taxonomy and phylogenetics. Differences in certain DNA sequences can differentiate organisms and help classify them into taxa. The approach has been used in taxonomic disputes where morphology by itself is insufficient. This research aimed to use hierarchical clustering, an unsupervised machine learning method, to assess and resolve disputes in plant family taxonomy. We take as a case study Leguminosae, which some authors have historically classified into three families (Fabaceae, Caesalpiniaceae, and Mimosaceae) and others into one family (Leguminosae). The study is divided into four phases: (i) data collection, (ii) data preprocessing, (iii) finding the best distance method, and (iv) determining the disputed family. The data were collected from several sources, including the National Center for Biotechnology Information (NCBI), journals, and websites. The data for validating the methods were collected from NCBI and were used to determine the best distance method for differentiating families or genera. The data for the Leguminosae case study were collected from journals and a website. In our experiments, the Pearson method was the best distance method for clustering plant ITS sequences, in both accuracy and computational cost. We then used the Pearson method to assess the disputed classification within Leguminosae. Based on our results, the Leguminosae case study group should be classified as one family.
Keywords: DNA barcoding, Unsupervised learning, Bioinformatics, Hierarchical clustering, Machine learning, Taxonomy, R programming language
1. Introduction
DNA barcoding is the use of deoxyribonucleic acid (DNA) barcodes, that is, specific portions of the DNA, to identify and classify organisms [1]. Ideally, a single gene would be effective across all groupings of organisms or taxa; in practice, different portions of the DNA have been found to be more effective in different taxa [2]. For animals, the most effective barcode is a fragment of ∼650 base pairs (bp) near the 5′-terminus of the mitochondrial cytochrome c oxidase I (COI) gene [3]. In fungi, the more appropriate barcode is the internal transcribed spacer (ITS) nuclear ribosome sequence [4]. In plant species, barcoding faces several difficulties, one of which is the low nucleotide substitution rate of COI [5]. The Consortium for the Barcode of Life (CBOL) has recommended that the chloroplast ribulose-1,5-bisphosphate carboxylase large subunit (rbcL) gene [6] and Maturase K (matK) [7] be used as plant barcodes [8]. Another difficulty is that the identification success rate is lower in plants than in animals [9].
The method of obtaining DNA barcodes itself has multiple variations depending on the taxa [10]. In general, the process involves collecting a sample, isolation of DNA, matching specific primers, polymerase chain reaction (PCR), analysis of chromatogram, meeting the DNA barcoding standard, The Barcode of Life Data System (BOLD) submission, data analysis and validation, publication and data hosting, and finally the end user [2].
DNA barcoding can be helpful in many real-life applications [11]. It can be used for pest identification for biosecurity purposes to protect from potentially invasive species [12]. Authorities can use DNA barcodes to monitor the illegal trade of animals from protected species [13]. DNA barcoding has been described as being a powerful addition to the identification of wood despite the typical DNA quality of dry tissue being of middling to poor quality [14]. Aside from identification purposes, DNA barcodes can be used for grouping specimens when there is an ambiguity in the morphology, such as due to the lack of descriptions of morphological features [15]. It can also be used as a tool for determining whether unknown species should be grouped with earlier known species or as a new species based on DNA barcodes [16]. It can also be used as a supplement to other taxonomic datasets in the process of delimiting species boundaries [17].
There are several categories of computational approaches for analyzing DNA barcodes: tree-based, similarity-based, and character-based methods [18]. Other approaches include combination and alignment-free [1,19,20]. These approaches each came with their advantages and disadvantages. Similarity and tree-based methods, for example, are dependent on sequence alignment. Diagnostic or character-based methods experience more success than similarity and tree-based approaches, but the accuracies are still less than that of supervised machine learning-based approaches [1]. One such approach mapped barcode sequences into a vector based on k-mer frequencies and used a random forest classifier to identify sequences [21]. Several contemporary computational approaches used in DNA barcoding take the form of machine learning [22]. This is due to the complexity and variability of studies involved with genomics.
Heralded as revolutionary for taxonomic discovery, DNA barcoding was formalized as a broader natural history tool only two decades ago [23]. The formal classification of organisms in Western science dates back to around 1753 with the work of Carl Linnaeus [24]. However, the classification of organisms has appeared in different human cultures throughout history. The classification, or more accurately taxonomy, proposed by Linnaeus placed organisms into ranks, with each rank more specific than the one above it. Since it was first conceived, this design has gone through many revisions and changes [25]. Several sources used by the scientific community define the hierarchy in the following ranks, in order from the least to the most homogeneous: realm, subrealm, kingdom, subkingdom, phylum, subphylum, class, subclass, order, suborder, family, subfamily, genus, and subgenus [26,27,28,29]. However, due to the inconsistencies of the ranking system [25] and other factors [30], discrepancies and disputes in taxonomy also arise [31,32,33,34], such as in the case of Leguminosae [35,36].
Leguminosae is a large group of agriculturally important flowering plants. The group consists of a variety of species including herbaceous plants, shrubs, and trees [36]. Humans use legumes in various ways, including as a staple food source, animal feed, and fertilizer. Legumes are also used to synthesize many products, including flavorings, drugs, poisons, and colorings. This group also benefits other plants by converting atmospheric nitrogen into nitrogen compounds that are useful in biochemical processes. Leguminosae is the third largest group of flowering plants, after Orchidaceae and Asteraceae [37], and includes 650 genera with 18,000 species [38]. Dhakad [39] describes this group as holding an important role in ecosystem biodiversity and as dominating a majority of vegetation types in the world. In addition, Leguminosae plays an important role in the composition of forests and the management of sustainable goals.
The classification of Leguminosae as one family is a disputed taxonomic grouping, with experts taking several different stances on the issue. The first group of experts holds that Leguminosae should be classified as a distinct order subdivided into three distinct families: Fabaceae (Papilionoid), Caesalpiniaceae, and Mimosaceae [40,41,42]. This includes the argument that Fabales [40,43] is an order subdivided into the three families mentioned above. There are several issues with this perspective. The nomenclature for Fabaceae is ambiguous, as the name can be used for the whole family but is also used for the Papilionoids alone [39]. Both uses of Fabaceae are accepted according to articles 18.5 and 18.6 of the International Code of Botanical Nomenclature [26]. Their placement stresses the close relationship between the three aforementioned families under the same order [35]. However, the placement of species and genera of Leguminosae is not systematically consistent [38]. Morphology by itself cannot ascertain phylogenetic relationships. Many taxa, such as Acacia and Mimosa, are hard to differentiate based on their morphological characteristics [44,45]. Species in the Mimosaceae and Caesalpiniaceae families are mostly physically similar and are consistently different from Fabaceae, which is dominated by herbaceous plants [35].
The second group of experts holds that Leguminosae is a single family with three subfamilies: Mimosoideae, Caesalpinioideae, and Papilionoideae [46,47,48]. The naming changes proposed by several experts are also recorded in the International Code of Botanical Nomenclature, one of which is the change of Fabales into Fabaceae [35]. Two recent studies that covered this dispute, by Mondal & Mondal [35] and Patel & Panchal [36], agree that the three groups are distinct. However, Patel & Panchal stress that the distinction is between subfamilies of the same family.
This study aims to help clarify the dispute over the taxonomy of Leguminosae by leveraging machine learning, in the form of hierarchical clustering, together with DNA barcodes. Hierarchical clustering is an unsupervised technique for exploratory data analysis. The main aim of the technique is to build a binary merge tree [49]. This technique was the first answer to the limits of similarity-based methods [18]. A dendrogram, the visual drawing of a hierarchical clustering, gives rich information for both qualitative and quantitative evaluation [49]. Thus, the visualization created from hierarchical clustering can be used for assessment in this study. Many studies with DNA barcoding continue to use hierarchical clustering techniques due to their ubiquity and relative simplicity [50,51,52,53,54,55,56]. In this study, hierarchical clustering with DNA barcodes is used to achieve two objectives. First, we validate the usability of hierarchical clustering and of different distance methods for the problem. Second, we use the validated method to determine the grouping in the taxonomy of Leguminosae, in order to establish which view of that taxonomy the clustering results support. In short, we aim to clarify whether Leguminosae should be classified as three distinct families, three subfamilies, or another arrangement altogether.
Machine learning is naturally an interdisciplinary field. It draws on insights from a variety of disciplines, including artificial intelligence, probability and statistics, computational complexity theory, control theory, information theory, philosophy, psychology, and neurobiology. Machine learning algorithms have proven extremely useful in a broad range of domains; in speech recognition, for example, machine learning-based algorithms outperform any other approach that has been tried [57]. As an approach, machine learning has the advantages of learning from experience and of not requiring the multitude of variations found in genetic data to be accounted for manually. The categories of approaches in the literature do not have a consistent naming scheme. Several articles treat machine learning as a category separate from tree-based, distance-based, and character-based approaches [20,21,58], even though some approaches in those other categories are also machine learning, albeit mostly unsupervised, such as hierarchical clustering, which is a tree-based approach [18,22].
2. Material and methods
2.1. Proposed computational model
This section describes the computational model used in this study. All analyses were run in R version 4.1.2 with the following packages.
1. rentrez [59]: an R package for retrieving data from NCBI. We used version 1.2.3.
2. Biostrings [60]: an R package for manipulating and handling biological sequences. We used version 2.62.0.
3. msa [61]: an R package for multiple sequence alignment of DNA sequences; its default preset is the ClustalW algorithm. We used version 1.26.0.
4. ips [62]: an R package used here for trimming the beginning and the end of the sequences. We used version 0.0.11.
5. factoextra [63]: an R package providing the additional distance methods used in this study. We used version 1.0.7.
The full flow of the computational model in this study is depicted in Fig. 1. A detailed explanation of each stage is as follows.
1. Data collection: First, we obtained a list of the available families and genera from the National Center for Biotechnology Information (NCBI) Taxonomy Database [64] (http://www.ncbi.nlm.nih.gov/taxonomy). We then identified the families and genera represented by more than 25 records of ITS marker sequences. From this pool of eligible families and genera, three families or genera were drawn at random over multiple iterations. The sequences were retrieved with the rentrez package [59] version 1.2.3, in the FASTA format exemplified in Fig. 2.
Fig. 1.
The proposed computational model.
Fig. 2.
Example of FASTA format of the Crataegus bretschneideri species.
For the validation phase, we used a query to gather data on individual organisms for each family and genus. The query also restricted the results by DNA sequence length and by marker, which in this case is ITS. For the case study data, we gathered several references that divide Leguminales into three groups, namely Fabaceae, Mimosaceae, and Caesalpiniaceae. We specifically selected sequences and their species names from websites and from previous research with records in GenBank, to improve the quality of the sequences used in this study. The main data references are [35,36,65] and the Plant Specimen Database Program & Publication (https://plantsp-eflora.bnh.gov.bd/family-list). Most of these sources provide only the species names, so the corresponding DNA sequences were retrieved from NCBI.
Data are retrieved with the rentrez library. As an example, consider fetching data for the Araceae family with ITS as the target marker. Since the ITS region usually ranges from 500 to 850 base pairs, we also filter the sequences by this length range. Detailed information about the query syntax used in entrez_search() is available at https://www.ncbi.nlm.nih.gov/books/NBK49540/. The retmax parameter sets the maximum number of records retrieved. The function entrez_fetch() fetches the sequences from the NCBI Nucleotide database and returns a string in FASTA format. The Nucleotide database itself is a collection of sequences from several sources, including GenBank, RefSeq, TPA, and PDB.
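The retrieval step described above might be sketched as follows. This is an illustrative sketch rather than the study's own code: `build_its_query()` and `fetch_its_fasta()` are hypothetical helper names, the way the filters are joined with AND is an assumption, and `retmax = 100` is an arbitrary value. Only `entrez_search()` and `entrez_fetch()` come from rentrez, and the fetch helper is defined but not called here because it needs network access.

```r
# Hypothetical helper: assemble an NCBI query for ITS records of one taxon.
# The length window (500 to 850 bp) and the UNVERIFIED exclusion follow the text.
build_its_query <- function(taxon, min_len = 500, max_len = 850) {
  paste0(
    '"', taxon, '"[Organism]',
    " AND (internal transcribed spacer 1[Title] OR ITS1[Title])",
    " AND (internal transcribed spacer 2[Title] OR ITS2[Title])",
    " AND ", min_len, ":", max_len, "[SLEN]",
    " NOT UNVERIFIED"
  )
}

# Hypothetical helper: search the Nucleotide database, then fetch FASTA text.
# Requires the rentrez package and an internet connection when called.
fetch_its_fasta <- function(taxon, retmax = 100) {
  search_res <- rentrez::entrez_search(
    db = "nuccore", term = build_its_query(taxon), retmax = retmax
  )
  rentrez::entrez_fetch(db = "nuccore", id = search_res$ids, rettype = "fasta")
}

query <- build_its_query("Araceae")
print(query)
```

Calling `fetchRes <- fetch_its_fasta("Araceae")` would then return a single FASTA-formatted string, under the assumptions stated above.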
Fig. 2 is an example of FASTA formatted data. A FASTA record consists of two parts. The first part is the header, which starts with the character “>” followed by the description of the sequence. The second part is the sequence itself, a string composed of the characters “A”, “C”, “T”, and “G”. Only the second part is used in the computations in this research. The length of the sequences varies depending on the region or gene that is used.
The example shown in Fig. 2 is the Crataegus bretschneideri DNA sequence, which contains the internal transcribed spacer 1, the 5.8S gene, and the internal transcribed spacer 2. At the start of the header, “MZ686456.1” is the accession id used in NCBI; detailed information on the sequence can be found at https://www.ncbi.nlm.nih.gov/nuccore/MZ686456.1.
2. Data preprocessing: The initial step of this stage is to parse the DNA sequences, that is, to convert the data into the desired format, in this case the DNAStringSet format, a collection of DNAString objects. This conversion allows the data to work with the functions of Biostrings, a package for manipulating large biological sequences. The FASTA formatted data that were gathered are converted to DNAStringSet format using Biostrings [60] version 2.62.0.
DNAStringSet is depicted in Fig. 3. It consists of several columns, including width for the length in base pairs, seq for the sequence, and names for the name or label of the sequence. By default, printing a DNAStringSet shows the first and the last five sequences in the set, as indicated by the index at the far left of the data.
Fig. 3.
Example of DNAStringSet format.
To obtain the DNAStringSet data format, we use the readDNAStringSet() function from the Biostrings library [60]. The “fetchRes” variable is a string variable holding the FASTA formatted result of the retrieval step. We first write this string to a “.fasta” file (“output.fasta” in our example). We then read that file with readDNAStringSet() and store the resulting DNAStringSet in a variable named “dna_data”.
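A minimal sketch of this write-then-read step is shown below. The two toy records in `fetchRes` are made up for illustration and stand in for the string returned by the retrieval step, and the file is placed in a temporary directory. The Biostrings call is guarded so that the sketch still runs where the package is not installed.

```r
# Toy FASTA string standing in for the retrieval result (made-up records).
fetchRes <- paste0(
  ">toy1 illustrative record\nACGTACGTAC\n",
  ">toy2 illustrative record\nACGTTCGAAC\n"
)

# Write the FASTA text to a file; cat() writes the string exactly as given.
fasta_path <- file.path(tempdir(), "output.fasta")
cat(fetchRes, file = fasta_path)

# Read the file back as a DNAStringSet when Biostrings is available.
if (requireNamespace("Biostrings", quietly = TRUE)) {
  dna_data <- Biostrings::readDNAStringSet(fasta_path, format = "fasta")
  print(dna_data)
}
```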
After parsing, the sequences must be aligned. DNA sequence alignment arranges multiple sequences so that corresponding positions line up, with the aim of identifying diagnostic patterns that characterize protein families. Such alignment is instrumental in predicting the secondary and tertiary structures of new sequences and serves as an initial step in molecular evolutionary analysis [66]. After alignment, all sequences in the dataset have the same length. The alignment is performed using the msa package [61] version 1.26.0 with its default preset, the ClustalW alignment algorithm [66].
The input to this package is an object of the class XStringSet (which includes the class DNAStringSet). In our case, the input is the variable “dna_data”, which has the DNAStringSet data type, and the output is of the class MsaDNAMultipleAlignment. In addition to the regular characters representing each base, a “-” character is inserted wherever a gap is needed for alignment.
Afterward, we conducted DNA sequence trimming, a process that truncates the initial and final characters of the sequences. The objective is to decrease sequence length, thereby accelerating the clustering process without adversely affecting the model's accuracy. The package used for this step is the ips package [62] version 0.0.11.
The input to the trimming step is the variable “aligned_dna” produced by the alignment process. The parameter used is “min.n.seq”, which is set to 75. In our experiments, this value sped up the clustering algorithm without reducing its accuracy.
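The idea behind end trimming can be illustrated in base R. The sketch below is a simplified stand-in, not the ips implementation: `trim_alignment_ends()` is a hypothetical helper that keeps the span between the first and last alignment columns in which at least `min_n_seq` sequences have a non-gap character. We assume the ips function used in the study is `trimEnds()`, whose `min.n.seq` argument plays the corresponding role; the toy alignment is made up.

```r
# Hypothetical helper: trim both ends of an alignment given as a character
# vector of equal-length sequences, where "-" marks a gap.
trim_alignment_ends <- function(aln, min_n_seq) {
  chars <- do.call(rbind, strsplit(aln, ""))  # rows = sequences, cols = positions
  n_bases <- colSums(chars != "-")            # non-gap characters per column
  keep <- which(n_bases >= min_n_seq)
  if (length(keep) == 0) stop("no column reaches min_n_seq non-gap characters")
  cols <- seq(min(keep), max(keep))           # first to last qualifying column
  apply(chars[, cols, drop = FALSE], 1, paste, collapse = "")
}

aligned <- c("--ACGT-",
             "-TACGTA",
             "--ACG--")
trimmed <- trim_alignment_ends(aligned, min_n_seq = 2)
print(trimmed)
```

With `min_n_seq = 2`, the poorly covered leading and trailing columns are dropped while an interior gap in the third sequence is retained.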
To make the sequences computable, we apply one hot encoding, which converts the characters into a numeric representation that the following stages can process.
Fig. 4 illustrates the operation of one hot encoding. For each character in the aligned sequence, five columns are created to denote the presence of one of the five unique characters. These columns represent the bases “A”, “C”, “G”, and “T”, along with the “-” character originating from the sequence alignment. For example, if the character being represented is “G”, the “G” column is set to “1” while the remaining columns are set to “0”. This process is repeated for each character in the sequence, and the results are then concatenated. Through this procedure, a sequence initially represented by characters is converted into a numeric format suitable for the subsequent stages.
Fig. 4.
One hot encoding.
The one hot encoding used in this study is implemented in a function named oneHotDNA(). The process converts a data frame containing a list of sequences into a data frame holding the results of one hot encoding. It first requires unmasking the MsaDNAMultipleAlignment object back to a DNAStringSet, followed by the use of the as.data.frame() function to transform the DNAStringSet into a data frame. Once the data are in data frame format, the oneHotDNA() function is applied.
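The encoding described for Fig. 4 can be sketched in base R as follows. This is an illustrative version, not the study's oneHotDNA() function: `one_hot_dna()` is a hypothetical helper that takes a character vector of aligned, equal-length sequences and returns an integer matrix with five columns per alignment position. Characters outside the assumed alphabet, such as ambiguity codes, would be encoded as all zeros in this sketch.

```r
# Hypothetical helper: one hot encode aligned sequences.
# Each position expands into five 0/1 columns, one per symbol in `alphabet`.
one_hot_dna <- function(seqs, alphabet = c("A", "C", "G", "T", "-")) {
  stopifnot(length(unique(nchar(seqs))) == 1)   # sequences must be aligned
  chars <- do.call(rbind, strsplit(seqs, ""))   # rows = sequences
  blocks <- lapply(seq_len(ncol(chars)), function(j) {
    block <- outer(chars[, j], alphabet, "==") * 1L  # n x 5 indicator matrix
    colnames(block) <- paste0("pos", j, "_", alphabet)
    block
  })
  do.call(cbind, blocks)                        # concatenate position blocks
}

encoded <- one_hot_dna(c("G-", "AT"))
print(encoded)
```

For the toy input, the first row encodes “G” as 0, 0, 1, 0, 0 and “-” as 0, 0, 0, 0, 1, giving ten columns in total.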
3. Finding the best distance method: To determine the most effective distance method for accurate clustering of biological data, we evaluated several distance methods, namely (i) Euclidean [67], (ii) Manhattan [67], (iii) Canberra [68], (iv) Minkowski [69], (v) Pearson [70], and (vi) Spearman [71] distances. We ran all of the options with the ITS marker. To perform hierarchical clustering after obtaining the distances from each distance method, we used the hclust() function provided by the R programming language. The data used for this clustering consisted of established families and genera; we used genera from the same family to assess the ability of the clustering method to differentiate groups with higher similarity.
We load the factoextra library [63] version 1.0.7, which provides the get_dist() function for computing several distance methods. We loop through the different distance methods and run the clustering with each of them. We also record the running time of each distance method as the difference between the start and end times.
Hierarchical clustering then proceeds agglomeratively: each sequence starts as its own cluster, and the two closest clusters are merged repeatedly until a single tree remains.
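The loop over distance methods can be sketched with base R alone. Several assumptions apply: the correlation-based distances are computed as one minus the row-wise correlation, which is, to our understanding, how get_dist() in factoextra derives them; hclust() is left at its default complete linkage because the text does not state the linkage used; and the small binary matrix is made-up data standing in for a one hot encoded alignment. `get_distance()` is a hypothetical helper.

```r
# Made-up stand-in for a one hot matrix: rows 1-3 and rows 4-6 form two groups.
x <- rbind(
  c(1, 0, 0, 1, 0, 1, 0, 0),
  c(1, 0, 0, 1, 0, 1, 0, 1),
  c(1, 0, 0, 1, 1, 1, 0, 0),
  c(0, 1, 1, 0, 1, 0, 1, 1),
  c(0, 1, 1, 0, 1, 0, 1, 0),
  c(0, 1, 1, 0, 0, 0, 1, 1)
)

# Hypothetical helper: one distance object per method name.
get_distance <- function(x, method) {
  if (method %in% c("pearson", "spearman")) {
    as.dist(1 - cor(t(x), method = method))  # correlation between rows
  } else if (method == "minkowski") {
    dist(x, method = "minkowski", p = 3)     # p = 3 as in Table 4
  } else {
    dist(x, method = method)
  }
}

methods <- c("euclidean", "manhattan", "canberra",
             "minkowski", "pearson", "spearman")

results <- lapply(methods, function(m) {
  start <- Sys.time()
  tree <- hclust(get_distance(x, m))         # default: complete linkage
  seconds <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  list(tree = tree, seconds = seconds, groups = cutree(tree, k = 2))
})
names(results) <- methods

# Cluster membership per method (columns) for each toy row.
print(sapply(results, function(r) r$groups))
```

Calling `plot(results$pearson$tree)` would draw the corresponding dendrogram for visual inspection.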
To validate our approach, multiple experiments were undertaken, after which the performance of each method was assessed. This assessment was centered on the distinctiveness of the groups based on the visualization of the dendrogram that was generated. These procedures were enacted with the intent to identify the most effective method for classifying both the families and the genera of the species under consideration.
We examined three distinct genera after evaluating the outcomes of grouping three distinct family groups. This dual-level validation approach was designed to determine the functioning of the clustering not only in a less homogenous taxonomic rank but also in a more homogeneous one.
Subsequently, we assessed the accuracy of each distance method. Accuracy was gauged by the ability of the method to differentiate between the three groups that were utilized in the validation process. This was performed over a series of experimental runs with varying combinations. Finally, the mean accuracy of each distance method was calculated, and the model with the highest accuracy was studied further.
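One way to turn a dendrogram into an accuracy figure is sketched below. The text does not spell out the scoring rule, so this is an assumption: the tree is cut into as many clusters as there are true groups, each cluster is assigned its majority label, and accuracy is the fraction of sequences that match their cluster's majority label. `cluster_accuracy()` is a hypothetical helper and the one-dimensional toy data are made up.

```r
# Hypothetical helper: majority-label accuracy of a cut dendrogram.
cluster_accuracy <- function(tree, labels, k = length(unique(labels))) {
  clusters <- cutree(tree, k = k)
  tab <- table(clusters, labels)            # clusters x true groups
  sum(apply(tab, 1, max)) / length(labels)  # majority count per cluster
}

# Toy data: three groups of three points; one "B" point sits near group "C".
values <- c(0, 0.1, 0.2, 5, 5.1, 9.8, 10, 10.1, 10.2)
labels <- rep(c("A", "B", "C"), each = 3)

tree <- hclust(dist(values))
accuracy <- cluster_accuracy(tree, labels)
print(accuracy)
```

Here eight of the nine points fall in a cluster dominated by their own group, so the toy accuracy is 8/9.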
Following the evaluation step, we selected the best distance method to use in our case study on the disputed family classification. The weighted average of the accuracies from the two-tiered validation procedure was used to identify the optimal method. Selecting the most accurate method is intended to make the result obtained for the case study more valid and trustworthy.
4. Determining disputed family: After identifying the most suitable distance method, we applied it with hierarchical clustering to assess the taxonomic family under dispute. The procedure parallels the steps described earlier. First, we collected data for the disputed family. The data then underwent the preprocessing stage. The most effective distance method, as determined in the preceding steps, was then used to cluster the disputed family. Finally, an evaluation was carried out, and conclusions were drawn from the results of the clustering process. This methodology allowed us to address the taxonomic dispute objectively.
2.2. Experimental setup
The initial phase of our experiment focused on selecting the optimal model for the cluster analysis used in the case study. Internal Transcribed Spacer (ITS) sequence data from individuals of undisputed taxonomic classification were obtained from the National Center for Biotechnology Information (NCBI). The acquired data were parsed into the DNAStringSet format, which facilitated string manipulation operations on the sequences. A sequence alignment was then conducted to detect distinctive patterns within the data, and One Hot Encoding was applied to convert the string data into a numerical format. Once the conversion was complete, clustering was performed with each of the distance methods selected for this study. The results were then examined to identify the most effective distance method for hierarchical clustering on ITS sequences.
Having determined the most suitable distance method, we proceeded to conduct clustering using data from the Leguminosae family. This data was gleaned from a variety of scholarly journals and a website. The sequences retrieved were parsed into the DNAStringSet data type, aligned, and then subjected to the One Hot Encoding process. Finally, clustering was conducted using the selected optimal distance method in order to scrutinize the familial classification dispute within the Leguminosae family.
2.2.1. Data collection for validating the best distance method
In the initial phase, the data used in this study were retrieved from GenBank [72] (accessed August 2022). The datasets contain the Internal Transcribed Spacer (ITS) from the ribosomal RNA gene of the plant. Detailed information about the data used in this study is given in Tables 1 and 2. All of the information in Tables 1 and 2 can be accessed at https://www.ncbi.nlm.nih.gov/nuccore using the following filters.
● “species_name” [Organism] filter for filtering the family or genera of the organism.
● (internal transcribed spacer 1 [Title] OR ITS1 [Title]) AND (internal transcribed spacer 2 [Title] OR ITS2 [Title]) filter to obtain the ITS gene sequences.
● 500:850 [SLEN] filter to refine the result to the ITS gene, which is generally 500 to 850 bp in length.
● NOT UNVERIFIED filter to exclude the unverified data on NCBI.
Table 1.
Detailed information on validation data at the family level.
Experiment | Family | Total records | Average length (bp) | Experiment | Family | Total records | Average length (bp) |
---|---|---|---|---|---|---|---|
1 | Cannabaceae | 25 | 640.72 | 9 | Alismataceae | 25 | 707.8 |
1 | Cucurbitaceae | 25 | 671.24 | 9 | Arecaceae | 25 | 712.8 |
1 | Zosteraceae | 25 | 597.24 | 9 | Burseraceae | 25 | 680.96 |
2 | Cleomaceae | 25 | 676.88 | 10 | Chrysobalanaceae | 25 | 692.44 |
2 | Dilleniaceae | 25 | 614.72 | 10 | Hypericaceae | 25 | 675.68 |
2 | Typhaceae | 25 | 709.72 | 10 | Iridaceae | 25 | 670.52 |
3 | Brassicaceae | 25 | 621.08 | 11 | Boraginaceae | 25 | 637.88 |
3 | Hydrangeaceae | 25 | 645.68 | 11 | Convolvulaceae | 25 | 685.16 |
3 | Linaceae | 25 | 616.76 | 11 | Haloragaceae | 25 | 688.04 |
4 | Buxaceae | 25 | 650.84 | 12 | Clusiaceae | 25 | 693.44 |
4 | Cactaceae | 25 | 634.24 | 12 | Menispermaceae | 25 | 596.24 |
4 | Haloragaceae | 25 | 693.48 | 12 | Zosteraceae | 25 | 585.52 |
5 | Iridaceae | 25 | 666.32 | 13 | Brassicaceae | 25 | 639.12 |
5 | Linaceae | 25 | 617.36 | 13 | Ceratophyllaceae | 25 | 658.08 |
5 | Malvaceae | 25 | 697.64 | 13 | Haloragaceae | 25 | 661.2 |
6 | Amaryllidaceae | 25 | 645.8 | 14 | Araliaceae | 25 | 623.84 |
6 | Eriocaulaceae | 25 | 743.04 | 14 | Elaeocarpaceae | 25 | 634.12 |
6 | Urticaceae | 25 | 643.72 | 14 | Meliaceae | 25 | 676.92 |
7 | Cymodoceaceae | 25 | 614.64 | 15 | Buxaceae | 25 | 664.16 |
7 | Ericaceae | 25 | 672.2 | 15 | Ceratophyllaceae | 25 | 655.72 |
7 | Urticaceae | 25 | 657 | 15 | Chrysobalanaceae | 25 | 712.88 |
8 | Ceratophyllaceae | 25 | 644.04 | | Average | 25 | 658.6702 |
8 | Haloragaceae | 25 | 700.44 | | | | |
8 | Juncaceae | 25 | 612.84 | | | | |
Table 2.
Detailed information on validation data at the genera level.
Experiment | Family | Genera | Average length (bp) | Total sequences | Experiment | Family | Genera | Average length (bp) | Total sequences |
---|---|---|---|---|---|---|---|---|---|
1 | Orchidaceae | Anoectochilus | 693.52 | 25 | 6 | Poaceae | Brachypodium | 596.56 | 25 |
1 | Orchidaceae | Bulbophyllum | 662.24 | 25 | 6 | Poaceae | Briza | 641.6 | 25 |
1 | Orchidaceae | Coelogyne | 639.96 | 25 | 6 | Poaceae | Chusquea | 657.64 | 25 |
2 | Orchidaceae | Anoectochilus | 696.04 | 25 | 7 | Rosaceae | Acaena | 684.64 | 25 |
2 | Orchidaceae | Calopogon | 675.96 | 25 | 7 | Rosaceae | Alchemilla | 638.84 | 25 |
2 | Orchidaceae | Cypripedium | 709.8 | 25 | 7 | Rosaceae | Crataegus | 633.2 | 25 |
3 | Orchidaceae | Aerides | 648.2 | 25 | 8 | Rosaceae | Acaena | 677.88 | 25 |
3 | Orchidaceae | Anoectochilus | 700.08 | 25 | 8 | Rosaceae | Alchemilla | 640.64 | 25 |
3 | Orchidaceae | Coelogyne | 631.92 | 25 | 8 | Rosaceae | Amelanchier | 603.16 | 25 |
4 | Asteraceae | Ambrosia | 613.44 | 25 | 9 | Sapindaceae | Acer | 699.92 | 25 |
4 | Asteraceae | Chrysanthemum | 699.08 | 25 | 9 | Sapindaceae | Aesculus | 625.56 | 25 |
4 | Asteraceae | Coreopsis | 651 | 25 | 9 | Sapindaceae | Cardiospermum | 603.6 | 25 |
5 | Brassicaceae | Aethionema | 655.48 | 25 | 10 | Ranunculaceae | Adonis | 608.08 | 25 |
5 | Brassicaceae | Alyssum | 668.96 | 25 | 10 | Ranunculaceae | Caltha | 614.32 | 25 |
5 | Brassicaceae | Brassica | 633.84 | 25 | 10 | Ranunculaceae | Coptis | 647.88 | 25 |
Average | | | 651.768 | 25 | | | | | |
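As an example, combining these filters for the family Araceae gives a query of the following form: "Araceae"[Organism] AND (internal transcribed spacer 1 [Title] OR ITS1 [Title]) AND (internal transcribed spacer 2 [Title] OR ITS2 [Title]) AND 500:850 [SLEN] NOT UNVERIFIED.
After obtaining the search result from the rentrez library, we take a random sample of the returned ids to be used.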
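The sampling step might look like the sketch below. The ids are placeholders standing in for the identifiers returned by a search, the sample size of 25 matches the number of records per group in Tables 1 and 2, and the seed is an arbitrary choice made only so the sketch is reproducible.

```r
# Placeholder ids standing in for the ids returned by an NCBI search.
ids <- sprintf("ID%03d", 1:60)

set.seed(2022)                          # arbitrary seed, for reproducibility only
sampled_ids <- sample(ids, size = 25)   # draw 25 ids without replacement
print(sampled_ids)
```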
The validation dataset contains a total of 1875 ITS marker sequences from various families and genera within the kingdom Plantae. It was built from families and genera on NCBI that are not disputed and that have more than 25 records. The families and genera used were drawn at random from the list of eligible ones. We included data on different genera to make sure that the model we developed can also cluster groups with higher levels of similarity, in other words, groups that are more homogeneous.
2.2.2. Case study data
The case study data were collected from GenBank [72]. The dataset used in this study contains Internal Transcribed Spacer (ITS) from the ribosomal RNA. The brief information about the data is explained in Table 3 and the detailed information can be accessed in Appendix 1.
Table 3.
Detailed information on case study data.
Family | Average length (bp) | Total records |
---|---|---|
Caesalpiniaceae | 700.35 | 63 |
Fabaceae | 668.33 | 95 |
Mimosaceae | 637.46 | 41 |
The case study data consist of 63 sequences of Caesalpiniaceae, 95 of Fabaceae, and 41 of Mimosaceae, 199 sequences in total. The sequence lengths range from 510 bp to 785 bp, with an average length of 672.105 bp.
2.2.3. Hardware specification
All of the programs in this study run on an 8-core CPU computer, with a RAM capacity of 52 GB, and storage using Solid State Disk (SSD). The programming language used in this study is R version 4.1.2 which runs in RStudio. The libraries used can be found at CRAN and Bioconductor III.
3. Results and discussion
3.1. Validation phase
The aim of the validation phase is to identify the distance method that clusters the DNA sequence data most clearly; that method is then used in the case study to assess the dispute over the Leguminosae group. The distance methods examined in this phase are the Euclidean [67], Manhattan [67], Canberra [68], Minkowski [69], Pearson [70], and Spearman [71] distances. The test was run 25 times using random data from families and genera that are not in dispute; these are listed in Tables 1 and 2.
The results of the validation phase are summarized in Table 4, which reports the accuracy and computational time of each method. The family validation phase consists of 15 experiments (Table 1), each using 3 different families with 25 sequences each, whereas the genera validation phase consists of 10 experiments (Table 2), each using 3 different genera from the same family with 25 sequences each. The weighted average column combines the family validation and genera validation results, calculated by:
$$\mathrm{waa} = \frac{15 \times \mathrm{fva} + 10 \times \mathrm{gfa}}{25} \tag{1}$$

$$\mathrm{wat} = \frac{15 \times \mathrm{fvt} + 10 \times \mathrm{gft}}{25} \tag{2}$$

where:
waa = Weighted average accuracy (%)
fva = Family validation accuracy (%)
gfa = Genera validation accuracy (%)
wat = Weighted average time (s)
fvt = Family validation time (s)
gft = Genera validation time (s)
The weights 15 and 10 are the numbers of family and genera validation experiments, respectively.
Table 4.
Result of the Validation phase.
Method | Family validation (15 experiments): Accuracy | Family validation (15 experiments): Time (s) | Genera validation (10 experiments): Accuracy | Genera validation (10 experiments): Time (s) | Weighted average: Accuracy | Weighted average: Time (s) |
---|---|---|---|---|---|---|
Canberra | 98.22% | 0.0377 | 99.33% | 0.0328 | 98.67% | 0.0357 |
Euclidean | 98.22% | 0.0165 | 99.47% | 0.0143 | 98.72% | 0.0156 |
Manhattan | 98.76% | 0.0164 | 99.47% | 0.0146 | 99.04% | 0.0157 |
Minkowski (p = 3) | 98.93% | 0.0566 | 98.53% | 0.0453 | 98.77% | 0.0521 |
Pearson | 98.76% | 0.0160 | 99.47% | 0.0136 | 99.04% | 0.0150 |
Spearman | 98.76% | 0.0645 | 99.47% | 0.0528 | 99.04% | 0.0598 |
Average | 98.61% | 0.0346 | 99.29% | 0.0289 | 98.88% | 0.0323 |
The first equation (1) is used to calculate the weighted average accuracy that is used in Table 4, and the second equation (2) is used to calculate the weighted average computational time that is used in Table 4.
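The weighted averages in Table 4 can be reproduced with a few lines of code. The Python sketch below (the study used R) assumes that the weights are the numbers of experiments in each phase, 15 for family validation and 10 for genera validation, which is consistent with the published values; the function and variable names are hypothetical. Because the inputs are the rounded table entries, a recomputed value can differ from the published one in the last digit (for example, Canberra accuracy recomputes to 98.66% against a published 98.67%).

```python
N_FAMILY_EXPERIMENTS = 15  # experiments in the family validation phase
N_GENERA_EXPERIMENTS = 10  # experiments in the genera validation phase


def weighted_average(family_value, genera_value):
    """Average of the two phases, weighted by their number of experiments."""
    total = N_FAMILY_EXPERIMENTS + N_GENERA_EXPERIMENTS
    return (N_FAMILY_EXPERIMENTS * family_value
            + N_GENERA_EXPERIMENTS * genera_value) / total


# (fva %, fvt s, gfa %, gft s) per method, copied from Table 4.
table4 = {
    "Canberra": (98.22, 0.0377, 99.33, 0.0328),
    "Euclidean": (98.22, 0.0165, 99.47, 0.0143),
    "Manhattan": (98.76, 0.0164, 99.47, 0.0146),
    "Minkowski (p = 3)": (98.93, 0.0566, 98.53, 0.0453),
    "Pearson": (98.76, 0.0160, 99.47, 0.0136),
    "Spearman": (98.76, 0.0645, 99.47, 0.0528),
}

# Published weighted averages (waa %, wat s), also from Table 4.
published = {
    "Canberra": (98.67, 0.0357),
    "Euclidean": (98.72, 0.0156),
    "Manhattan": (99.04, 0.0157),
    "Minkowski (p = 3)": (98.77, 0.0521),
    "Pearson": (99.04, 0.0150),
    "Spearman": (99.04, 0.0598),
}

for method, (fva, fvt, gfa, gft) in table4.items():
    waa = weighted_average(fva, gfa)
    wat = weighted_average(fvt, gft)
    print(f"{method:18s} waa={waa:.2f}%  wat={wat:.4f} s")
```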
During the validation phase, the aggregated results indicated that the Pearson correlation emerged as the best distance method, resulting in an overall accuracy of 99.04% and an execution time of 0.0150 s. The highest accuracy was observed in Pearson, Manhattan, and Spearman methods, each achieving 99.04% accuracy, correctly classifying nearly all the cases used in this study. Any instances of misclassification could be due to anomalies or imbalances inherent in the data. However, in terms of execution time, the Pearson method proved to be the fastest, completing classification in 0.0150 s, followed closely by the Euclidean and Manhattan methods, requiring 0.0156 and 0.0157 s, respectively.
In the family validation phase, where we assessed three different families to identify the best distance method, a total of 15 test scenarios were implemented. The specifics of these tests are elaborated in Table 1. In this phase, the Minkowski method (p = 3) proved to be the most effective for classifying the three different families, with an accuracy rate of 98.93%. Nevertheless, this method attained the lowest accuracy in the genera validation phase, scoring 98.53%. The Pearson correlation, alongside the Spearman and Manhattan distances, earned the second-highest accuracy of 98.76%. Regarding speed, the Pearson correlation method was the fastest, averaging 0.0160 s to classify different families, followed by the Manhattan and Euclidean methods, which required 0.0164 and 0.0165 s, respectively.
Genera validation uses 3 different genera from the same family; its purpose is to examine whether the methods can separate sequences with higher similarity. Details of this phase are given in Table 2. The average accuracy in this phase (99.29%) is higher than in the family validation phase (98.61%). In terms of computational speed, Pearson correlation has the fastest run time, averaging 0.0136 s to cluster the different genera, followed by the Euclidean and Manhattan methods at 0.0143 and 0.0146 s, respectively. In contrast, although the Spearman method reaches a high accuracy of 99.47%, it has the longest run time at 0.0528 s.
Based on these results, the ITS marker, despite being recommended for fungi, can distinguish the families and genera used in this experiment. All distance methods give consistently good results, with at least 98.22% accuracy, as long as the data used are not disputed and have no problems with their taxa. The genera validation phase is also more accurate and less time-consuming than the family validation phase. Fig. 5 shows how three different families are well separated using the Euclidean distance method.
Fig. 5.
Example of the well-separated families. The color represents the family of the sequence. Family: Haloragaceae (Turquoise), Cactaceae (Pink), and Buxaceae (Green).
Fig. 6 shows the length distribution of the data used in the validation experiments: most sequences fall within 600–675 bp in the family validation experiments (Fig. 6a, Table 1) and within 575–675 bp in the genera validation experiments (Fig. 6b, Table 2). Fig. 5 gives an example visualization in which hierarchical clustering separates each family clearly; the result comes from an experiment on 3 different families: Haloragaceae (turquoise), Cactaceae (pink), and Buxaceae (green).
Fig. 6.
Base pair distribution that is used in the validation phase of this study. a.) Family validation base pair length distribution from Table 1, b.) Genera validation base pair distribution from Table 2.
3.2. Case study: Leguminosae
This phase aimed to resolve the disputed classification of the Leguminosae, specifically whether the group should be treated as one family or as three distinct families. The data consisted of the ITS sequences of Fabaceae (Papilionoid), Caesalpiniaceae, and Mimosaceae, procured from various journals and a website featuring species from this group. If a source did not provide the NCBI accession ID for the data, we located the corresponding sequence for that species and included it in this study. The Pearson method was used to assess the familial placement of the Leguminosae group because it tied for the highest weighted accuracy and was the fastest method in the validation phase.
Following the validation phase, we applied the most effective distance method to perform clustering on the case study data, which resulted in a dendrogram (Fig. 7) that advocated for the consolidation of the three families into a single family termed Leguminosae. The arrangement of each data sample on the dendrogram was determined by the similarity of the DNA sequences; the more closely the two samples resembled each other, the more closely they were positioned on the dendrogram.
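The clustering step described above can be sketched as follows. This is an illustrative Python implementation (the study used R) of agglomerative hierarchical clustering on a Pearson-based distance. Several choices are assumptions and are not taken from the study: average linkage, the distance defined as one minus the correlation, the function names, and the toy `profiles` standing in for encoded sequences.

```python
import math


def pearson_distance(x, y):
    """1 - Pearson correlation (assumed convention for the distance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def agglomerate(vectors, k):
    """Merge the two closest clusters (average linkage) until k remain.

    Returns the final clusters (lists of sample indices) and the merge
    history, which is the information a dendrogram is drawn from.
    """
    n = len(vectors)
    dist = [[pearson_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    merges = []

    def linkage(a, b):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((sorted(clusters[p]), sorted(clusters[q]), d))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters, merges


# Toy profiles (hypothetical): samples 0-2 rise, samples 3-5 fall.
profiles = [
    [1, 2, 3, 4, 5], [2, 4, 6, 8, 11], [1, 3, 4, 6, 7],
    [5, 4, 3, 2, 1], [9, 7, 6, 3, 2], [6, 5, 5, 2, 1],
]
clusters, merges = agglomerate(profiles, k=2)
print(sorted(sorted(c) for c in clusters))
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 4))
```

In this toy case the two groups of profiles end up in separate clusters, which corresponds to the well-separated pattern of Fig. 5. A mixed outcome like the one reported for the case study would instead show samples with different labels being merged at low heights.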
Fig. 7.
Clustering result from case study.
Fig. 7 shows the dendrogram visualization of the members of Fabaceae indicated with green labels on the dendrogram, Caesalpiniaceae with red labels, and Mimosaceae with black labels. We can see that some of the groups are clustered correctly, like at the top branch of the visualization, the group of Fabaceae (green) gathered in one place. However, overall, most were mixed and were not gathered in the same branch with the other group members. This is different compared to the visualization presented in Fig. 5, where each of the groups gathered on the same branch of the dendrogram. Samples from three families did not converge to produce clusters for their own families. This indicates that the three families are not different enough to be grouped into separate families. Thus, we can conclude that the group of Fabaceae, Caesalpiniaceae, and Mimosaceae should be grouped into one family. The morphological similarities among these three families are further reinforced by the resemblance in the shape of their fruits. The fruit from the Fabaceae family, illustrated in Fig. 8 a, resembles the fruit from the Mimosaceae family, shown in Fig. 8 b, as well as the fruit from the Caesalpiniaceae family, depicted in Fig. 8 c.
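The degree of mixing between families described above was judged visually from the dendrogram. One possible way to put a number on it, shown here only as a hedged Python sketch and not as the procedure used in this study, is cluster purity: cut the dendrogram into a fixed number of clusters and measure the fraction of samples that carry the majority family label of their cluster. The function name and the toy cluster assignments below are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_ids, labels):
    """Fraction of samples whose label is the majority label of their cluster."""
    members = {}
    for cluster_id, label in zip(cluster_ids, labels):
        members.setdefault(cluster_id, []).append(label)
    majority_total = sum(Counter(group).most_common(1)[0][1]
                         for group in members.values())
    return majority_total / len(labels)


# Toy assignments (hypothetical): nine samples cut into three clusters.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Each cluster holds a single family: the pattern seen in Fig. 5.
separated = ["Fabaceae"] * 3 + ["Caesalpiniaceae"] * 3 + ["Mimosaceae"] * 3

# Families spread across clusters: the pattern described for Fig. 7.
mixed = ["Fabaceae", "Caesalpiniaceae", "Mimosaceae",
         "Fabaceae", "Caesalpiniaceae", "Mimosaceae",
         "Fabaceae", "Fabaceae", "Caesalpiniaceae"]

purity_separated = cluster_purity(cluster_ids, separated)
purity_mixed = cluster_purity(cluster_ids, mixed)
print(purity_separated, purity_mixed)
```

A purity close to 1 indicates that the clusters coincide with the family labels, while a value far below 1 indicates that the labels are spread across clusters.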
Fig. 8.
a.) Fabaceae fruit, adapted from Ref. [73], b.) Mimosaceae fruit, adapted from Ref. [74], c.) Caesalpiniaceae fruit, adapted from Ref. [75].
The proposed method in this study uses common mathematical distance measures, namely Euclidean, Manhattan, Canberra, Minkowski (p = 3), Pearson, and Spearman, and does not use pairwise distance methods such as the Kimura 2-parameter (K2P) distance or the Jukes and Cantor distance [76]. We also did not compare the results with a biological approach such as electrophoretic analysis for clustering the species [35]. The machine learning approach may also give inconsistent results if the libraries it relies on are updated or their parameters are adjusted, which is not a significant concern in traditional methods. A related limitation is that the dendrogram must be interpreted and validated by an expert, and this human interpretation is limited to the overall homogeneity of the clusters. A computational method for interpreting the dendrogram may be able to parse out further detail in its finer structures [77], but this approach may be debatable. Finally, this research uses only a hierarchical clustering algorithm; other algorithms such as K-means [78], DBSCAN [79], and Gaussian mixture models [80] can be used for this purpose and may give different results.
The results from our experiment support several previous works that classify legumes as one family. Lewis [47] argues that the case for three separate families is untenable for two reasons. First, Mimosoideae and Papilionoideae appear to be unique and distinct lineages that arose within the Caesalpinioid alliance and are not comparable to it at the same taxonomic level. Second, Caesalpinioideae are under scrutiny, and once further detailed studies are concluded it seems inevitable that they will be divided into more definable groups comparable in rank to the other two subfamilies. Hsuan [46], while not providing any arguments for the one-family classification, addresses the three groups as subfamilies when describing their morphology. Takhtajan [48] and Patel & Panchal [36] both refer to Leguminosae as one family.
On the other hand, a number of works argue against the one-family classification and instead classify legumes as three families. One such work, by Cronquist [40], describes the author's preference for this classification because it is more in harmony with the customary classifications of families within angiosperms. Other works refer to a specific group of legumes as a family, such as Hou [41] for Caesalpiniaceae and Nielsen [42] for Mimosaceae. The argument for three families is also supported by works that treat the whole group as the order Fabales, such as those by Cronquist [40] and Dahlgren [43].
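For comparison with the general-purpose measures used in this study, the two evolutionary distances mentioned above can be computed directly from a pair of aligned sequences. The Python sketch below uses the standard formulas: Jukes and Cantor, d = -(3/4) ln(1 - 4p/3), with p the proportion of differing sites, and K2P, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), with P and Q the proportions of transitions and transversions. The function names, the rule of skipping sites with gaps or ambiguity codes, and the two toy sequences are assumptions made for illustration; none of this was part of the experiments reported here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_proportions(seq1, seq2):
    """Return (P, Q): transition and transversion proportions.

    Both sequences must already be aligned to the same length. Sites
    with gaps or ambiguity codes are skipped (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = sites = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    return transitions / sites, transversions / sites


def jukes_cantor(seq1, seq2):
    P, Q = substitution_proportions(seq1, seq2)
    p = P + Q  # total proportion of differing sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(seq1, seq2):
    P, Q = substitution_proportions(seq1, seq2)
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


# Toy aligned sequences (hypothetical): one transition (A/G), one transversion (A/C).
s1 = "AAAAAAAAAA"
s2 = "GAAAAAAAAC"
print(substitution_proportions(s1, s2))
print(jukes_cantor(s1, s2), kimura_2p(s1, s2))
```

Unlike the measures in Table 4, these distances require aligned sequences and are undefined when the sequences are too divergent for the logarithm to exist.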
4. Conclusion
In this study, we validated a machine learning method, namely hierarchical clustering, for the purpose of clustering a disputed group of Plantae, the Leguminosae. The research has four main steps: (i) data collection, (ii) data preprocessing, (iii) finding the best distance method, and (iv) determining the disputed family. In the third step, the Pearson correlation method proved to be the best distance method for clustering different families and genera. Applying the Pearson correlation method within hierarchical clustering to the Leguminosae case study, we found that Fabaceae, Mimosaceae, and Caesalpiniaceae are appropriately grouped into a single family. This conclusion is supported by the classification used or referred to in a number of previous works [36,46,47,48].
Author contribution statement
Lala Septem Riza: Conceived and designed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. Muhammad Iqbal Zain; Ahmad Izzuddin; Yudi Prasetyo: Performed the experiments; Wrote the paper. Topik Hidayat: Conceived and designed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data. Khyrina Airin Fariza Abu Samah: Analyzed and interpreted the data; Wrote the paper.
Data availability statement
Data associated with this study has been deposited at http://www.ncbi.nlm.nih.gov/taxonomy.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Contributor Information
Lala Septem Riza, Email: lala.s.riza@upi.edu.
Muhammad Iqbal Zain, Email: iqbalzain99@upi.edu.
Ahmad Izzuddin, Email: ahmadizzuddin@upi.edu.
Yudi Prasetyo, Email: yudiprasetyo@upi.edu.
Topik Hidayat, Email: topikhidayat@upi.edu.
Khyrina Airin Fariza Abu Samah, Email: khyrina783@uitm.edu.my.
Appendix 1.
Each sequence can be accessed in NCBI at http://www.ncbi.nlm.nih.gov/nuccore/[Accession Number], where [Accession Number] is replaced by the accession number listed in the table.
No | Full Name | Accession Number | Family Name |
---|---|---|---|
1 | Adenanthera pavonina | KP092694.1 | Mimosaceae |
2 | Mimosa diplotricha | MH768250.1 | Mimosaceae |
3 | Prosopis glandulosa | AF174630.1 | Mimosaceae |
4 | Prosopis juliflora | JX139107.1 | Mimosaceae |
5 | Mimosa pudica | KX057889.1 | Mimosaceae |
6 | Leucaena leucocephala | MH070604.1 | Mimosaceae |
7 | Desmanthus pumilus | AF458845.1 | Mimosaceae |
8 | Desmanthus virgatus | AF458843.1 | Mimosaceae |
9 | Neptunia oleracea | KX057891.1 | Mimosaceae |
10 | Entada abyssinica | KX057869.1 | Mimosaceae |
11 | Albizia julibrissin | FJ572041.1 | Mimosaceae |
12 | Samanea saman | JX870770.1 | Mimosaceae |
13 | Calliandra surinamensis | JX870747.1 | Mimosaceae |
14 | Acacia lycopodiifolia | AF360716.1 | Mimosaceae |
15 | Dichrostachys paucifoliolata | AF458812.1 | Mimosaceae |
16 | Leucaena lanceolata | JF339948.1 | Mimosaceae |
17 | Archidendron utile | KT767599.1 | Mimosaceae |
18 | Archidendron lucidum | KT321363.1 | Mimosaceae |
19 | Acacia victoriae | DQ029281.1 | Mimosaceae |
20 | Gleditsia triacanthos | AF509980.1 | Caesalpiniaceae |
21 | Gleditsia microphylla | AF510029.1 | Caesalpiniaceae |
22 | Caesalpinia pulcherrima | JX856420.1 | Caesalpiniaceae |
23 | Haematoxylum campechianum | KX372832.1 | Caesalpiniaceae |
24 | Haematoxylum brasiletto | KX372834.1 | Caesalpiniaceae |
25 | Haematoxylum dinteri | KX372830.1 | Caesalpiniaceae |
26 | Cassia fistula | JX856430.1 | Caesalpiniaceae |
27 | Senna odorata | HM116996.1 | Caesalpiniaceae |
28 | Senna siamea | KJ638423.1 | Caesalpiniaceae |
29 | Chamaecrista choriophylla | KR134122.1 | Caesalpiniaceae |
30 | Chamaecrista potentilla | KR134123.1 | Caesalpiniaceae |
31 | Maniltoa grandiflora | MG949352.1 | Caesalpiniaceae |
32 | Bauhinia purpurea | JX856406.1 | Caesalpiniaceae |
33 | Bauhinia syringifolia | AY258398.1 | Caesalpiniaceae |
34 | Cynometra letestui | MG949304.1 | Caesalpiniaceae |
35 | Maniltoa gemmipara | KY306626.1 | Caesalpiniaceae |
36 | Crudia papuana | MH535137.1 | Caesalpiniaceae |
37 | Tamarindus indica | MG949357.1 | Caesalpiniaceae |
38 | Flemingia macrophylla | MN165994.1 | Fabaceae |
39 | Flemingia mengpengensis | MN177611.1 | Fabaceae |
40 | Phaseolus sinuatus | AF115194.1 | Fabaceae |
41 | Glycine pindanica | AY433933.1 | Fabaceae |
42 | Pisum sativum | AY143482.1 | Fabaceae |
43 | Phaseolus amblysepalus | AF115218.1 | Fabaceae |
44 | Glycine max | FJ609734.1 | Fabaceae |
45 | Cicer arietinum | DQ312219.1 | Fabaceae |
46 | Medicago sativa | AF053142.1 | Fabaceae |
47 | Cicer microphyllum | KP338131.1 | Fabaceae |
48 | Glycyrrhiza pallidiflora | EU591998.1 | Fabaceae |
49 | Glycyrrhiza astragalina | GQ246134.1 | Fabaceae |
50 | Pueraria montana | AF338215.1 | Fabaceae |
51 | Lupinus albus | AF007481.1 | Fabaceae |
52 | Ulex parviflorus | AF007470.1 | Fabaceae |
53 | Trifolium buckwestiorum | AF053148.1 | Fabaceae |
54 | Lathyrus aphaca | AY839345.1 | Fabaceae |
55 | Vicia sativa | MH808491.1 | Fabaceae |
56 | Pongamia pinnata | AF467493.1 | Fabaceae |
57 | Melilotus indicus | MK918730.1 | Fabaceae |
58 | Acacia nilotica | JX139101.1 | Mimosaceae |
59 | Acacia auriculiformis | KC955519.1 | Mimosaceae |
60 | Acacia farnesiana | AF360728.1 | Mimosaceae |
61 | Albizia lebbeck | MN181375.1 | Mimosaceae |
62 | Leucaena leucocephala | MH050230.1 | Mimosaceae |
63 | Senna alata | MH050234.1 | Caesalpiniaceae |
64 | Celtis occidentalis | DQ499147.1 | Caesalpiniaceae |
65 | Delonix regia | KY321088.1 | Caesalpiniaceae |
66 | Phaseolus vulgaris | MW843824.1 | Fabaceae |
67 | Sesbania grandiflora | AF536354.1 | Fabaceae |
68 | Tephrosia purpurea | MH768297.1 | Fabaceae |
69 | Abrus precatorius | MF440357.1 | Fabaceae |
70 | Butea monosperma | KJ436384.1 | Fabaceae |
71 | Cicer arietinum | MW424513.1 | Fabaceae |
72 | Clitoria ternatea | MH260279.1 | Fabaceae |
73 | Crotalaria pallida | MH050227.1 | Fabaceae |
74 | Crotalaria retusa | KP698625.1 | Fabaceae |
75 | Dalbergia sissoo | JX856444.1 | Fabaceae |
76 | Erythrina variegata | MT023961.1 | Fabaceae |
77 | Glycyrrhiza glabra | MT350378.1 | Fabaceae |
78 | Indigofera tinctoria | MN879515.1 | Fabaceae |
79 | Melilotus albus | MN560612.1 | Fabaceae |
80 | Melilotus indicus | MW241661.1 | Fabaceae |
81 | Pisum sativum | AY839340.1 | Fabaceae |
82 | Phaseolus vulgaris | MW843825.1 | Fabaceae |
83 | Vigna mungo | MF467912.1 | Fabaceae |
84 | Canavalia lineata | KT751442.1 | Fabaceae |
85 | Lathyrus odoratus | AY839377.1 | Fabaceae |
86 | Trifolium repens | MT481887.1 | Fabaceae |
87 | Sesbania sesban | MW560073.1 | Fabaceae |
88 | Sesbania bispinosa | MH768288.1 | Fabaceae |
89 | Tephrosia purpurea | MH768296.1 | Fabaceae |
90 | Desmodium gangeticum | KP092721.1 | Fabaceae |
91 | Guilandina bonduc | MH768079.1 | Caesalpiniaceae |
92 | Senna alata | MH050233.1 | Caesalpiniaceae |
93 | Cassia fistula | MW367522.1 | Caesalpiniaceae |
94 | Senna occidentalis | MH558633.1 | Caesalpiniaceae |
95 | Senna siamea | KJ638421.1 | Caesalpiniaceae |
96 | Senna sophera | HQ833042.1 | Caesalpiniaceae |
97 | Senna tora | KP092708.1 | Caesalpiniaceae |
98 | Tamarindus indica | KF055236.1 | Caesalpiniaceae |
99 | Saraca asoca | MW301610.1 | Caesalpiniaceae |
100 | Delonix regia | KX057862.1 | Caesalpiniaceae |
101 | Acacia mangium | KC955551.1 | Mimosaceae |
102 | Acacia catechu | KC952019.1 | Mimosaceae |
103 | Pithecellobium dulce | JX856483.1 | Mimosaceae |
104 | Adenanthera pavonina | KP092695.1 | Mimosaceae |
105 | Leucaena leucocephala | MG755502.1 | Mimosaceae |
106 | Bauhinia purpurea | MH548397.1 | Caesalpiniaceae |
107 | Bauhinia tomentosa | KX057838.1 | Caesalpiniaceae |
108 | Libidibia coriaria | JX856416.1 | Caesalpiniaceae |
109 | Caesalpinia pulcherrima | KX057841.1 | Caesalpiniaceae |
110 | Chamaecrista absus | KT279729.1 | Caesalpiniaceae |
111 | Cassia fistula | MW367497.1 | Caesalpiniaceae |
112 | Senna italica | KY576676.1 | Caesalpiniaceae |
113 | Cassia javanica | MW386314.1 | Caesalpiniaceae |
114 | Chamaecrista mimosoides | KX057847.1 | Caesalpiniaceae |
115 | Senna obtusifolia | KX057900.1 | Caesalpiniaceae |
116 | Senna occidentalis | MW326931.1 | Caesalpiniaceae |
117 | Cassia roxburghii | MW326753.1 | Caesalpiniaceae |
118 | Senna siamea | KC984644.1 | Caesalpiniaceae |
119 | Senna surattensis | MW367670.1 | Caesalpiniaceae |
120 | Senna tora | MH712712.1 | Caesalpiniaceae |
121 | Delonix elata | KY321105.1 | Caesalpiniaceae |
122 | Delonix regia | KY321089.1 | Caesalpiniaceae |
123 | Parkinsonia aculeata | KF379226.1 | Caesalpiniaceae |
124 | Tamarindus indica | JX856519.1 | Caesalpiniaceae |
125 | Vachellia farnesiana | KF532059.1 | Mimosaceae |
126 | Prosopis juliflora | OK184559.1 | Mimosaceae |
127 | Mimosa pudica | MN081594.1 | Mimosaceae |
128 | Leucaena leucocephala | KF048811.1 | Mimosaceae |
129 | Senegalia senegal | KY688828.1 | Mimosaceae |
130 | Pithecellobium dulce | KX057895.1 | Mimosaceae |
131 | Albizia amara | MW699936.1 | Mimosaceae |
132 | Albizia lebbeck | MW699948.1 | Mimosaceae |
133 | Albizia procera | MW699953.1 | Mimosaceae |
134 | Samanea saman | EF638210.1 | Mimosaceae |
135 | Medicago lupulina | MW241681.1 | Fabaceae |
136 | Medicago polymorpha | OK036671.1 | Fabaceae |
137 | Trifolium repens | MT481888.1 | Fabaceae |
138 | Melilotus albus | MW241669.1 | Fabaceae |
139 | Lathyrus odoratus | JN115031.1 | Fabaceae |
140 | Vicia hirsuta | MH808488.1 | Fabaceae |
141 | Vicia sativa | MW540820.1 | Fabaceae |
142 | Lupinus albus | MK532380.1 | Fabaceae |
143 | Aeschynomene indica | MN718416.1 | Fabaceae |
144 | Arachis hypogaea | MT230611.1 | Fabaceae |
145 | Gliricidia sepium | AF398816.1 | Fabaceae |
146 | Sesbania sesban | KY968926.1 | Fabaceae |
147 | Indigofera tinctoria | MH595834.1 | Fabaceae |
148 | Dalbergia lanceolaria | JX856439.1 | Fabaceae |
149 | Erythrina suberosa | MT023956.1 | Fabaceae |
150 | Clitoria ternatea | KT876054.1 | Fabaceae |
151 | Cajanus cajan | MK253074.1 | Fabaceae |
152 | Rhynchosia minima | MH768286.1 | Fabaceae |
153 | Butea monosperma | MN700631.1 | Fabaceae |
154 | Pueraria montana | JN407470.1 | Fabaceae |
155 | Glycine max | MW391260.1 | Fabaceae |
156 | Lablab purpureus | MH518283.1 | Fabaceae |
157 | Phaseolus vulgaris | MW843826.1 | Fabaceae |
158 | Vigna aconitifolia | JN008333.1 | Fabaceae |
159 | Abrus precatorius | MN091943.1 | Fabaceae |
160 | Tephrosia candida | HE681571.1 | Fabaceae |
161 | Tephrosia villosa | MN173946.1 | Fabaceae |
162 | Smithia sensitiva | MF281645.1 | Fabaceae |
163 | Alysicarpus vaginalis | MH768274.1 | Fabaceae |
164 | Lathyrus aphaca | KJ864924.1 | Fabaceae |
165 | Vigna radiata | MW366905.1 | Fabaceae |
166 | Vigna unguiculata | JN008290.1 | Fabaceae |
167 | Mimosa pigra | KT364060.1 | Mimosaceae |
168 | Albizia procera | JX856397.1 | Mimosaceae |
169 | Cynometra ramiflora | MG949301.1 | Caesalpiniaceae |
170 | Senna hirsuta | KJ638428.1 | Caesalpiniaceae |
171 | Senna occidentalis | MZ505523.1 | Caesalpiniaceae |
172 | Senna siamea | KJ638422.1 | Caesalpiniaceae |
173 | Senna alata | MH915657.1 | Caesalpiniaceae |
174 | Mezoneuron hymenocarpum | KX372820.1 | Caesalpiniaceae |
175 | Moullava digyna | KX372803.1 | Caesalpiniaceae |
176 | Brownea coccinea | MH535219.1 | Caesalpiniaceae |
177 | Senna tora | MH050240.1 | Caesalpiniaceae |
178 | Caesalpinia sp | KP003675.1 | Caesalpiniaceae |
179 | Cassia javanica | MW386313.1 | Caesalpiniaceae |
180 | Bauhinia acuminata | JX856404.1 | Caesalpiniaceae |
181 | Guilandina crista | KX372808.1 | Caesalpiniaceae |
182 | Crotalaria juncea | KP698651.1 | Fabaceae |
183 | Crotalaria calycina | KP698617.1 | Fabaceae |
184 | Crotalaria pallida | MH050226.1 | Fabaceae |
185 | Crotalaria verrucosa | KP698648.1 | Fabaceae |
186 | Crotalaria saltiana | KX371754.1 | Fabaceae |
187 | Desmodium triflorum | LC377412.1 | Fabaceae |
188 | Uraria crinita | JN407474.1 | Fabaceae |
189 | Aeschynomene americana | MT902905.1 | Fabaceae |
190 | Mucuna bracteata | LC494604.1 | Fabaceae |
191 | Millettia pinnata | KF848293.1 | Fabaceae |
192 | Dalbergia volubilis | KM276224.1 | Fabaceae |
193 | Trigonella foenum-graecum | MH645773.1 | Fabaceae |
194 | Derris trifoliata | MT312808.1 | Fabaceae |
195 | Grona heterocarpos | MK933480.1 | Fabaceae |
196 | Vigna marina | MH768299.1 | Fabaceae |
197 | Pongamia pinnata | MN076243.1 | Fabaceae |
198 | Flemingia strobilifera | MW732036.1 | Fabaceae |
199 | Derris scandens | JX506450.1 | Fabaceae |
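The access pattern given at the start of this appendix can be applied to every row of the table programmatically. The Python sketch below simply joins the base URL stated above with an accession number; the function name and the three example rows (copied from the table) are only illustrative, and nothing is downloaded.

```python
NUCCORE_BASE = "http://www.ncbi.nlm.nih.gov/nuccore/"


def nuccore_url(accession):
    """Build the NCBI nuccore link for one accession number."""
    return NUCCORE_BASE + accession.strip()


# Three rows copied from the appendix table: (full name, accession, family).
records = [
    ("Adenanthera pavonina", "KP092694.1", "Mimosaceae"),
    ("Gleditsia triacanthos", "AF509980.1", "Caesalpiniaceae"),
    ("Flemingia macrophylla", "MN165994.1", "Fabaceae"),
]

for name, accession, family in records:
    print(f"{name} ({family}): {nuccore_url(accession)}")
```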
References
1. Weitschek E., Fiscon G., Felici G. Supervised DNA Barcodes species classification: analysis, comparisons and results. BioData Min. 2014;7(1):4. doi: 10.1186/1756-0381-7-4. Dec.
2. Purty R., Chatterjee S. DNA barcoding: an effective technique in molecular taxonomy. Austin J. Biotechnol. Bioeng. 2016;3:1059. 1.
3. Saddhe A.A., Kumar K. DNA barcoding of plants: selection of core markers for taxonomic groups. Plant Sci. Today. 2017;5(1):9–13. doi: 10.14719/pst.2018.5.1.356. Dec.
4. Schoch C.L., et al. Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. Proc. Natl. Acad. Sci. USA. 2012;109(16):6241–6246. doi: 10.1073/pnas.1117018109. Apr.
5. Fazekas A.J., et al. Multiple multilocus DNA barcodes from the plastid genome discriminate plant species equally well. PLoS One. 2008;3(7):e2802. doi: 10.1371/journal.pone.0002802. Jul.
6. Gielly L., Taberlet P. The use of chloroplast DNA to resolve plant phylogenies: noncoding versus rbcL sequences. Mol. Biol. Evol. 1994;11(5):769–777. doi: 10.1093/oxfordjournals.molbev.a040157. Sep.
7. Hilu K.W., Liang gping. The matK gene: sequence variation and application in plant systematics. Am. J. Bot. 1997;84(6):830–839. doi: 10.2307/2445819. Jun.
8. China Plant Bol Group, et al. Comparative analysis of a large dataset indicates that internal transcribed spacer (ITS) should be incorporated into the core barcode for seed plants. Proc. Natl. Acad. Sci. USA. 2011;108(49):19641–19646. doi: 10.1073/pnas.1104551108. Dec.
9. Fazekas A.J., et al. Are plant species inherently harder to discriminate than animal species using DNA barcoding markers? Mol. Ecol. Resour. 2009;9:130–139. doi: 10.1111/j.1755-0998.2009.02652.x. May.
10. Wong W.H., Tay Y.C., Puniamoorthy J., Balke M., Cranston P.S., Meier R. ‘Direct PCR’ optimization yields a rapid, cost-effective, nondestructive and efficient method for obtaining DNA barcodes without DNA extraction. Mol. Ecol. Resour. 2014;14(6):1271–1280. doi: 10.1111/1755-0998.12275. Nov.
11. Fišer Ž., Pečnikar, Buzan E.V. 20 years since the introduction of DNA barcoding: from theory to application. J. Appl. Genet. Feb. 2014;55(1):43–52. doi: 10.1007/s13353-013-0180-y.
12. Madden M.J.L., Young R.G., Brown J.W., Miller S.E., Frewin A.J., Hanner R.H. Using DNA barcoding to improve invasive pest identification at U.S. ports-of-entry. PLoS One. 2019;14(9) doi: 10.1371/journal.pone.0222291. Sep.
13. Gonçalves P.F.M., Oliveira-Marques A.R., Matsumoto T.E., Miyaki C.Y. DNA barcoding identifies illegal parrot trade. J. Hered. 2015;106(S1):560–564. doi: 10.1093/jhered/esv035.
14. Jiao L., Lu Y., He T., Guo J., Yin Y. DNA barcoding for wood identification: global review of the last decade and future perspective. IAWA J. 2020;41(4):620–643. doi: 10.1163/22941932-bja10041. Oct.
15. Tänzler R., Sagata K., Surbakti S., Balke M., Riedel A. DNA barcoding for community ecology - how to tackle a hyperdiverse, mostly undescribed melanesian fauna. PLoS One. 2012;7 doi: 10.1371/journal.pone.0028832. Jan. 1.
16. Rossini B.C., et al. Highlighting Astyanax species diversity through DNA barcoding. PLoS One. 2016;11(12) doi: 10.1371/journal.pone.0167203. Dec.
17. Lukhtanov V.A., Sourakov A., Zakharov E.V. DNA barcodes as a tool in biodiversity research: testing pre-existing taxonomic hypotheses in Delphic Apollo butterflies (Lepidoptera, Papilionidae) Syst. Biodivers. 2016;14(6):599–613. doi: 10.1080/14772000.2016.1203371. Nov.
18. Sandionigi A., et al. Analytical approaches for DNA barcoding data – how to find a way for plants? Plant Biosyst. - Int. J. Deal. Asp. Plant Biol. 2012;146(4):805–813. doi: 10.1080/11263504.2012.740084. Dec.
19. Little D.P., Stevenson D. Wm. A comparison of algorithms for the identification of specimens using DNA barcodes: examples from gymnosperms. Cladistics. 2007;23(1):1–21. doi: 10.1111/j.1096-0031.2006.00126.x. Feb.
20. Yang C.-H., Wu K.-C., Chuang L.-Y., Chang H.-W. DeepBarcoding: deep learning for species classification using DNA barcoding. IEEE ACM Trans. Comput. Biol. Bioinf. 2022;19(4):2158–2165. doi: 10.1109/TCBB.2021.3056570. Jul.
21. Meher P.K., Sahu T.K., Rao A.R. Identification of species based on DNA barcode using k-mer feature vector and Random forest classifier. Gene. 2016;592(2):316–324. doi: 10.1016/j.gene.2016.07.010. Nov.
22. Soueidan H., Nikolski M. Machine learning for metagenomics: methods and tools. 2015 doi: 10.48550/ARXIV.1510.06621.
23. DeSalle R., Goldstein P. Review and interpretation of trends in DNA barcoding. Front. Ecol. Evol. 2019;7:302. doi: 10.3389/fevo.2019.00302. Sep.
24. Stevens P.F. eLS. first ed. John Wiley & Sons, Ltd, Ed. Wiley; 2003. History of taxonomy.
25. Avise J.C., Liu J.-X. On the temporal inconsistencies of Linnean taxonomic ranks. Biol. J. Linn. Soc. 2011;102(4):707–714. doi: 10.1111/j.1095-8312.2011.01624.x. Apr.
- 26.Greuter W., et al., editors. International Code of Botanical Nomenclature (Saint Louis Code): Adopted by the Sixteenth International Botanical Congress, St Louis, Missouri, July-August 1999. Koeltz scientific books; Königstein: 2000. [Google Scholar]
- 27.Parker C.T., Tindall B.J., Garrity G.M. International code of nomenclature of prokaryotes: prokaryotic code (2008 revision) Int. J. Syst. Evol. Microbiol. 2019;69(1A):S1–S111. doi: 10.1099/ijsem.0.000778. Jan. [DOI] [PubMed] [Google Scholar]
- 28.Schoch C.L., et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database. 2020;2020 doi: 10.1093/database/baaa062. baaa062, Jan. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Walker P.J., et al. Changes to virus taxonomy and the international code of virus classification and nomenclature ratified by the international committee on taxonomy of viruses (2019) Arch. Virol. 2019;164(9):2417–2429. doi: 10.1007/s00705-019-04306-w. Sep. [DOI] [PubMed] [Google Scholar]
- 30.Sigwart J.D., Sutton M.D., Bennett K.D. How big is a genus? Towards a nomothetic systematics. Zool. J. Linn. Soc. 2018;183(2):237–252. doi: 10.1093/zoolinnean/zlx059. Jun. [DOI] [Google Scholar]
- 31.Das S. Domestication, phylogeny and taxonomic delimitation in underutilized grain Amaranthus (Amaranthaceae) – a status review. Feddes Repert. 2012;123(4):273–282. doi: 10.1002/fedr.201200017. Feb. [DOI] [Google Scholar]
- 32.Docker M.F., editor. Lampreys: Biology, Conservation and Control. Volume 1, first ed. Dordrecht: Springer Netherlands; Imprint: Springer: 2015. 2015. [DOI] [Google Scholar]
- 33.Ji Y., Liu C., Yang J., Jin L., Yang Z., Yang J.-B. Ultra-barcoding discovers a cryptic species in Paris yunnanensis (melanthiaceae), a medicinally important plant. Front. Plant Sci. 2020;11:411. doi: 10.3389/fpls.2020.00411. Apr. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Ristaino J.B. The importance of mycological and plant herbaria in tracking plant killers. Front. Ecol. Evol. 2020;7 doi: 10.3389/fevo.2019.00521. [DOI] [Google Scholar]
- 35.Mondal A.K., Mondal S. Circumscription of the families within Leguminales as determined by cladistic analysis based on seed protein. Afr. J. Biotechnol. 2011;10(15):2850–2856. doi: 10.5897/AJB10.206. Apr. [DOI] [Google Scholar]
- 36.Patel S., Panchal H. Evolutionary studies of few species belonging to Leguminosae family based on RBCL gene. Discovery. Jan. 2014;9(22):38–50. [Google Scholar]
- 37.Doyle J.J., Luckow M.A. The rest of the iceberg. Legume diversity and evolution in a phylogenetic context. Plant Physiol. 2003;131(3):900–910. doi: 10.1104/pp.102.018150. Mar. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Polhill R.M., Raven P.H. Royal Botanic Gardens; Kew: 1981. Advances in Legume Systematics.
- 39.Dhakad A.K. Forest Research Institute University; 2018. Molecular Phylogeny of Selected Tree Species of Families Fabaceae Caesalpiniaceae and Mimosaceae of Uttarakhand. [Online]. Available: http://hdl.handle.net/10603/203120
- 40.Cronquist A. Columbia University Press; New York: 1981. An Integrated System of Classification of Flowering Plants.
- 41.Hou D., Larsen K., Larsen S.S. Caesalpiniaceae (Leguminosae-Caesalpinioideae). Flora Malesiana. 1996 Jan;12(2):409–730.
- 42.Nielsen I.C. Mimosaceae (Leguminosae-Mimosoideae). Flora Malesiana. 1992;11(1):1–226.
- 43.Dahlgren R. General aspects of angiosperm evolution and macrosystematics. Nord. J. Bot. 1983;3(1):119–149. doi: 10.1111/j.1756-1051.1983.tb01448.x.
- 44.Bentham G. Notes on Mimoseae, with a synopsis of species. Lond. J. Bot. 1842;1:318–528.
- 45.Wardill T.J., Graham G.C., Zalucki M., Palmer W.A., Playford J., Scott K.D. The importance of species identity in the biocontrol process: identifying the subspecies of Acacia nilotica (Leguminosae: Mimosoideae) by genetic distance and the implications for biological control. J. Biogeogr. 2005 Dec;32(12):2145–2159. doi: 10.1111/j.1365-2699.2005.01348.x.
- 46.Hsuan K. Singapore University Press; Singapore: 1983. Orders and Families of Malayan Seed Plants.
- 47.Lewis G.P., editor. Legumes of the World. Royal Botanic Gardens, Kew; Richmond, UK: 2005.
- 48.Takhtajan A.L. Outline of the classification of flowering plants (Magnoliophyta). Bot. Rev. 1980 Jul;46(3):225–359. doi: 10.1007/BF02861558.
- 49.Nielsen F. Springer International Publishing; Cham: 2016. Introduction to HPC with MPI for Data Science.
- 50.An Q., et al. Predicting medicinal resources in Ranunculaceae family by a combined approach using DNA barcodes and chemical metabolites. Phytochem. Lett. 2022 Aug;50:67–76. doi: 10.1016/j.phytol.2022.04.009.
- 51.Lucas C., Thangaradjou T., Papenbrock J. Development of a DNA barcoding system for seagrasses: successful but not simple. PLoS One. 2012 Jan;7(1). doi: 10.1371/journal.pone.0029987.
- 52.Nikitina E.V., Karimov F.I., Savina N.V., Kubrak S.V., Kilchevsky A.V. Inventory of some Tulipa species from Uzbekistan using DNA barcoding. BIO Web Conf. 2021;38:00086. doi: 10.1051/bioconf/20213800086.
- 53.Nikitina E.V., Beshko N.Yu., Omarov S.A. Assessment of plant species diversity (Lamiaceae Lindl.) in Uzbekistan based on DNA barcoding. IOP Conf. Ser. Earth Environ. Sci. 2022 Jul;1068(1). doi: 10.1088/1755-1315/1068/1/012042.
- 54.Papa Y., Le Bail P., Covain R. Genetic landscape clustering of a large DNA barcoding data set reveals shared patterns of genetic divergence among freshwater fishes of the Maroni Basin. Mol. Ecol. Resour. 2021 Aug;21(6):2109–2124. doi: 10.1111/1755-0998.13402.
- 55.Xu H., Li P., Ren G., Wang Y., Jiang D., Liu C. Authentication of three source spices of arnebiae radix using DNA barcoding and HPLC. Front. Pharmacol. 2021 Jul;12. doi: 10.3389/fphar.2021.677014.
- 56.Zhao L., Yu X., Shen J., Xu X. Identification of three kinds of Plumeria flowers by DNA barcoding and HPLC specific chromatogram. J. Pharm. Anal. 2018 Jun;8(3):176–180. doi: 10.1016/j.jpha.2018.02.002.
- 57.Mitchell T.M. McGraw-Hill; New York: 1997. Machine Learning.
- 58.He T., Jiao L., Wiedenhoeft A.C., Yin Y. Machine learning approaches outperform distance- and tree-based methods for DNA barcoding of Pterocarpus wood. Planta. 2019 May;249(5):1617–1625. doi: 10.1007/s00425-019-03116-3.
- 59.Winter D.J. PeerJ Preprints, preprint; Aug. 2017. Rentrez: an R Package for the NCBI eUtils API.
- 60.Pagès P.A.H. Biostrings. Bioconductor. 2017. doi: 10.18129/B9.BIOC.BIOSTRINGS.
- 61.Enrico Bonatesta C.H.-K. msa. Bioconductor. 2017.
- 62.Heibl C. PHYLOCH: R Language Tree Plotting Tools and Interfaces to Diverse Phylogenetic Software Packages. 2008 Jan. [Online]. Available: http://www.christophheibl.de/Rpackages.html
- 63.Kassambara A., Mundt F. Factoextra: Extract and Visualize the Results of Multivariate Data Analyses. 2020. [Online]. Available: https://CRAN.R-project.org/package=factoextra
- 64.Federhen S. The NCBI Taxonomy database. Nucleic Acids Res. 2012 Jan;40(D1):D136–D143. doi: 10.1093/nar/gkr1178.
- 65.Tripathi S., Mondal A.K. Taxonomic diversity in epidermal cells (stomata) of some selected Anthophyta under the order Leguminales (Caeselpiniaceae, Mimosaceae and Fabaceae) based on numerical analysis: a systematic approach. Int. J. Sci. Nat. 2012 Dec;3(4):778–798.
- 66.Thompson J.D., Higgins D.G., Gibson T.J. Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22(22):4673–4680. doi: 10.1093/nar/22.22.4673.
- 67.Malkauthekar M.D. Analysis of Euclidean distance and Manhattan distance measure in face recognition. Third International Conference on Computational Intelligence and Information Technology (CIIT 2013); Mumbai, India: 2013. pp. 503–507.
- 68.Faisal M., Zamzami E.M., Sutarman. Comparative analysis of inter-centroid K-means performance using Euclidean distance, Canberra distance and Manhattan distance. J. Phys. Conf. Ser. 2020 Jun;1566(1). doi: 10.1088/1742-6596/1566/1/012112.
- 69.Merigó J.M., Casanovas M. A new Minkowski distance based on induced aggregation operators. Int. J. Comput. Intell. Syst. 2011 Apr;4(2):123–133. doi: 10.1080/18756891.2011.9727769.
- 70.Weber J.H., Schouhamer Immink K.A., Blackburn S.R. Pearson codes. IEEE Trans. Inf. Theor. 2016 Jan;62(1):131–135. doi: 10.1109/TIT.2015.2490219.
- 71.Xie Y., Wang Y., Nallanathan A., Wang L. An improved K-Nearest-Neighbor indoor localization method based on Spearman distance. IEEE Signal Process. Lett. 2016 Mar;23(3):351–355. doi: 10.1109/LSP.2016.2519607.
- 72.Benson D.A., et al. GenBank. Nucleic Acids Res. 2012 Nov;41(D1):D36–D42. doi: 10.1093/nar/gks1195.
- 73.Gerard J.H. Common honey locust (Gleditsia triacanthos). [Online]. Available: https://www.britannica.com/plant/Fabales/Classification-of-Fabaceae#/media/1/199654/115993
- 74.Vijay. Adenanthera pavonina L. [Online]. Available: https://indiabiodiversity.org/species/show/245164
- 75.Allen J.C. Son. Soybeans (Glycine max). [Online]. Available: https://www.britannica.com/plant/common-bean#/media/1/199651/7490
- 76.Nishimaki T., Sato K. An extension of the Kimura two-parameter model to the natural evolutionary process. J. Mol. Evol. 2019 Jan;87(1):60–67. doi: 10.1007/s00239-018-9885-1.
- 77.Lee P., Yang S.T., West J.D., Howe B. PhyloParser: a hybrid algorithm for extracting phylogenies from dendrograms. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto. IEEE; 2017 Nov. pp. 1087–1094.
- 78.Hartigan J.A., Wong M.A. Algorithm AS 136: a K-means clustering algorithm. Applied Statistics. 1979;28(1):100. doi: 10.2307/2346830.
- 79.Hahsler M., Piekenbrock M., Doran D. dbscan: fast density-based clustering with R. J. Stat. Software. 2019;91(1). doi: 10.18637/jss.v091.i01.
- 80.Ouyang M., Welsh W.J., Georgopoulos P. Gaussian mixture clustering and imputation of microarray data. Bioinformatics. 2004 Apr;20(6):917–923. doi: 10.1093/bioinformatics/bth007.
Data Availability Statement
Data associated with this study has been deposited at http://www.ncbi.nlm.nih.gov/taxonomy.