2021 Apr 7;9(4):e23718. doi: 10.2196/23718

Table 1.

Exclusion criteria and justifications.

| Exclusion criteria | Justifications |
| --- | --- |
| Columns with more than 25,000 “NA” (not available) values | The objective was to remove variables for which many professionals did not report a value, as a large amount of missing data may impair processing. The threshold of 25,000 was chosen arbitrarily, with the aim of not drastically reducing the total amount of data. |
| Categorical variables with more than 53 input possibilities | R shows an alert when a categorical variable with more than 53 input possibilities is used, given that the greater the number of input possibilities, the less each individual input means to the model. |
| Variables that may induce a result | Some variables imply an operational classification (eg, the “g-MB” therapeutic scheme implies that the patient has a case of multibacillary leprosy), which would bias the model. |
| Variables with no apparent correlation with the prediction | Boruta [16], an algorithm that finds the variables most relevant to predicting outcomes in a given dataset, was used; variables with little relevance were excluded. |
| Redundant variables | Redundant variables do not provide additional information to the model, and therefore there is no reason to keep both. An analysis using Python showed that some variables had almost 100% correspondence with one another (eg, the state where the case was notified and the state where the patient lives). The Boruta algorithm is also useful for removing redundant variables. |
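
The first, second, and last criteria in Table 1 are mechanical filters, so they can be illustrated with a short sketch. The code below is not the authors' implementation: it uses only the Python standard library, represents the dataset as a dictionary mapping column names to lists of values (with `None` standing in for “NA”), and every function and column name is hypothetical. The 25,000 and 53 thresholds come from the table; the 0.99 agreement cutoff is an assumption standing in for “almost 100% correspondence”. The Boruta step is not shown.

```python
from itertools import combinations


def count_missing(values):
    """Number of NA entries in a column (None stands in for "NA")."""
    return sum(1 for v in values if v is None)


def drop_sparse_columns(table, max_missing=25_000):
    """Keep only columns with at most `max_missing` NA values."""
    return {name: col for name, col in table.items()
            if count_missing(col) <= max_missing}


def drop_high_cardinality(table, categorical, max_levels=53):
    """Drop categorical columns with more than `max_levels` distinct values."""
    kept = {}
    for name, col in table.items():
        levels = {v for v in col if v is not None}
        if name in categorical and len(levels) > max_levels:
            continue
        kept[name] = col
    return kept


def agreement(a, b):
    """Share of rows where both columns are present and hold the same value."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def redundant_pairs(table, min_agreement=0.99):
    """Column pairs whose values agree on at least `min_agreement` of rows.

    The 0.99 default is an assumed cutoff, not a value from the source.
    """
    return [(left, right) for left, right in combinations(table, 2)
            if agreement(table[left], table[right]) >= min_agreement]


# Toy data with hypothetical column names; the NA threshold is scaled
# down to 2 so that the effect is visible on four rows.
table = {
    "notification_state": ["SP", "RJ", "MG", "SP"],
    "residence_state": ["SP", "RJ", "MG", "SP"],
    "occupation": [None, None, None, "farmer"],
    "lesion_count": [1, 6, 2, None],
}
filtered = drop_sparse_columns(table, max_missing=2)
print(sorted(filtered))
print(redundant_pairs(filtered))
```

In this toy example, `occupation` is dropped for having too many missing values, and the two state columns are flagged as a redundant pair, so one of them could be removed. Deciding which column of a flagged pair to keep is left to the analyst.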