Begin Iterative Design and Development of Data-Driven Clinical Decision Support Systems
|
6. Random permutations |
This stage allows us to make random permutations into the selected dataset(s). This is made to avoid the same input and output(s) order and could be chosen different classes or attributes at the moment for choosing the two or three subsets for training, validation, and test through random sampling or cross-validation processes. |
Knowledge Database Creation
|
7. Cluster analysis: fuzzification process. |
At this stage, the main idea is to classify the data values of the respective input and output variables, according to the criteria set out in the previous section. For the individual analyses, it is recommended to use at least one of the two types of algorithms recognized in the academic and scientific field. Non-hierarchical (K-means) and hierarchical clusters (nearest neighbor and the Ward method) can be used to evaluate the performance of each one of the algorithms through the next phase of the methodology—Appendix A. Depending on each variable, these results (clusters) could change for each case. The result of this stage is the knowledge representation or knowledge base for the fuzzy inference systems because the clusters represent the number of fuzzy sets (input and output range values or membership functions). See Appendix B. |
8. Sampling: Cross-validation datasets. |
This stage seeks to randomly divide the dataset into two or three sub-sets. The framework proposes two kinds of data partition. The first one consists of random sampling. The user can select two or three subsets (training, validation, and testing). Generally, the percentage for each one, could be: 70:30:0; 70:20:10; 70:15:15; 70:10:20, etc. The user chooses this percentage. See Appendix D. The user also chose the number of repetitions that want to have, for example, if the problem has three input variables. The recommendation is to make at least three repetitions, which increases the possibility of finding an optimal combination of the sets formed by the clusters by avoiding some set(s) outside of the learning process. The results of the two or three selected subsets are saved into a multidimensional array with the divided datasets. The second option for data partition is through the Cross Validation process. In this option, the dataset is randomly partitioned into k equal sized partitions (by default is 10 —the users can change this value). One partition is used for testing the performance of the system, whereas the rest of the partitions is used for the training process. This procedure is repeated k times to use each partition as a test subset exactly one time. In the end, a mean accuracy of the individual results is consolidated. All two processes are automatized. See Appendix H. |
Pivot Tables: Feature Selection and Knowledge Rulebase Creation.
|
9. Pivot tables |
Knowing the number of the optimal clusters to each variable, we can calculate the rules number that the fuzzy system can have through the interaction between the input variables—Rulebase. According to Hernández-Julio, Hernández, Guzmán, Nieto-Bernal, Díaz, and Ferraz [29] and Hernández-Julio, et al. [82], this stage looks for a combination that allows capable results without having to reach the maximum number of input variables—Feature extraction. Doing this, it is guaranteed that the fuzzy system will be optimized in the number of variables (minimum) and the number of fuzzy sets for each variable, mainly in the output variables—knowledge discovery. In this phase, the following sub-phases are done (Table 4)—Appendix A and Appendix C. |