Table 1.
Exclusion criteria | Justifications |
Columns with more than 25,000 “NA” (not available) values | The aim was to remove variables for which many professionals did not record a value, as a large amount of missing data may impair processing. The threshold of 25,000 was chosen arbitrarily, with the aim of not drastically reducing the total amount of data |
Categorical variables with more than 53 input possibilities | R shows an alert when a categorical variable with more than 53 input possibilities is used, because the more input possibilities a variable has, the less informative each one is to the model |
Variables that may induce a result | Some variables imply an operational classification (eg, the “g-MB” therapeutic scheme implies that the patient has multibacillary leprosy), which would bias the model |
Variables with no apparent correlation with the prediction | Boruta [16], an algorithm that finds the most relevant variables for predicting outcomes in a given dataset, was used; variables with little relevance were excluded |
Redundant variables | Redundant variables provide no additional information to the model, so there is no reason to keep both members of a redundant pair. An analysis using Python showed that some variables had almost 100% correspondence with another (eg, the state where the case was notified and the state where the patient lives). The Boruta algorithm is also useful for removing redundant variables |
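
Three of the criteria in Table 1 (missing values, number of input possibilities, and redundancy) are mechanical enough to sketch in code. The following is a minimal illustration in plain Python and is not the authors' actual script. The function name `screen_columns`, the column names, the set of values treated as missing, and the 99% agreement cutoff standing in for "almost 100% correspondence" are all assumptions made for the example. The Boruta step and the manual removal of result-inducing variables are not covered.

```python
from itertools import combinations

# Assumed encodings of a missing value; the real dataset may use others.
MISSING = {None, "", "NA"}


def screen_columns(rows, categorical, max_missing=25_000, max_levels=53,
                   min_agreement=0.99):
    """Return {column: reason} for columns flagged by three Table 1 criteria.

    rows is a list of dicts (one per record); categorical lists the columns
    to be treated as categorical. Both are hypothetical inputs.
    """
    columns = list(rows[0]) if rows else []
    dropped = {}

    # Criterion 1: more than max_missing missing values in a column.
    for col in columns:
        n_missing = sum(1 for row in rows if row.get(col) in MISSING)
        if n_missing > max_missing:
            dropped[col] = f"{n_missing} missing values"

    # Criterion 2: categorical variable with more than max_levels distinct values.
    for col in categorical:
        if col in dropped:
            continue
        levels = {row.get(col) for row in rows} - MISSING
        if len(levels) > max_levels:
            dropped[col] = f"{len(levels)} levels"

    # Criterion 5: pairs whose values agree in nearly every row.
    # The first column of each pair is kept and the second is flagged.
    kept = [c for c in columns if c not in dropped]
    for a, b in combinations(kept, 2):
        if a in dropped or b in dropped:
            continue
        agreement = sum(1 for row in rows if row.get(a) == row.get(b)) / len(rows)
        if agreement >= min_agreement:
            dropped[b] = f"{agreement:.0%} agreement with {a}"

    return dropped


# Toy records with hypothetical column names; max_missing is lowered to 1
# so that the missing-value rule triggers on three rows.
rows = [
    {"state_notified": "SP", "state_residence": "SP", "age": 34, "occupation": "NA"},
    {"state_notified": "RJ", "state_residence": "RJ", "age": 51, "occupation": "NA"},
    {"state_notified": "BA", "state_residence": "BA", "age": 27, "occupation": "farmer"},
]
flagged = screen_columns(
    rows,
    categorical=["state_notified", "state_residence", "occupation"],
    max_missing=1,
)
for column, reason in flagged.items():
    print(column, "->", reason)
```

On the toy records, `occupation` is flagged for missing values and `state_residence` for duplicating `state_notified`. Keeping the first column of a redundant pair is an arbitrary choice in this sketch; in practice the choice of which variable to keep would depend on domain knowledge.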