Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting

2021 Apr 13;30(6):1465–1483. doi: 10.1177/09622802211002867

1. Fit a random forest to the EFFECT1 sample using the observed outcomes. The observed outcomes are no longer used after this step.

2. Apply the random forest fitted in Step 1 to both the EFFECT1 and EFFECT2 samples to obtain a predicted probability of the outcome for each subject in each sample.

3. Generate a binary outcome for each subject in the EFFECT1 and EFFECT2 samples using a Bernoulli random variable with subject-specific probability equal to the predicted probability obtained in Step 2. These are the simulated outcomes that will be used in all subsequent steps.

4. Apply a given analysis method (e.g. unpenalized logistic regression) by fitting that model to the EFFECT1 sample with the simulated outcomes generated in Step 3.

5. Apply the fitted model from Step 4 to the EFFECT2 sample.

6. For each subject in the EFFECT2 sample, obtain a predicted probability of the outcome based on the fitted analysis model that was applied to the EFFECT2 sample in Step 5.

7. Use the eight performance metrics to compare, within the EFFECT2 sample, the predicted probability of the outcome obtained in Step 6 with the simulated binary outcome generated in Step 3.

8. Repeat Steps 3 to 7 1000 times. Summarize the performance metrics across the 1000 simulation replicates.

9. Repeat Steps 3 to 8 for a total of six analysis methods (lasso, ridge regression and unpenalized logistic regression; random forest, bagged classification trees, boosted trees).

10. Repeat Steps 1 to 9 with the five other data-generating processes (bagged classification trees, boosted trees, the lasso, ridge regression, and unpenalized logistic regression).
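
The ten steps above can be sketched as a short, model-agnostic simulation loop. The following is a minimal illustration under stated assumptions, not the authors' implementation. All names (`replicate`, `run_grid`, `fit_prevalence`, `draw_outcomes`) are hypothetical. A "fitter" is assumed to be any function that takes covariates and binary outcomes and returns a function producing predicted probabilities. The Brier score stands in for the eight performance metrics, and a toy fitter that predicts the event rate stands in for the six learning methods so that the sketch runs with the standard library alone.

```python
import random
from statistics import mean


def brier_score(y, p):
    """Stand-in performance metric: mean squared error of the probabilities."""
    return mean((yi - pi) ** 2 for yi, pi in zip(y, p))


def fit_prevalence(X, y):
    """Toy fitter (hypothetical): predicts the event rate for every subject."""
    rate = mean(y)
    return lambda X_new: [rate for _ in X_new]


def draw_outcomes(probs, rng):
    """Step 3: one Bernoulli draw per subject with subject-specific probability."""
    return [int(rng.random() < p) for p in probs]


def replicate(p1, p2, fit_method, X1, X2, n_reps, rng, metric=brier_score):
    """Steps 3 to 8 for one analysis method under one data-generating process."""
    scores = []
    for _ in range(n_reps):
        y1_sim = draw_outcomes(p1, rng)          # Step 3, derivation sample
        y2_sim = draw_outcomes(p2, rng)          # Step 3, validation sample
        model = fit_method(X1, y1_sim)           # Step 4
        p2_hat = model(X2)                       # Steps 5 and 6
        scores.append(metric(y2_sim, p2_hat))    # Step 7
    return scores                                # Step 8: one score per replicate


def run_grid(fitters, X1, y1, X2, n_reps=1000, seed=0):
    """Steps 1 to 10: every analysis method under every data-generating process."""
    rng = random.Random(seed)
    summary = {}
    for dgp_name, fit_dgp in fitters.items():            # Step 10
        dgp = fit_dgp(X1, y1)                            # Step 1: observed outcomes
        p1, p2 = dgp(X1), dgp(X2)                        # Step 2
        for method_name, fit_method in fitters.items():  # Step 9
            scores = replicate(p1, p2, fit_method, X1, X2, n_reps, rng)
            summary[(dgp_name, method_name)] = mean(scores)
    return summary
```

In practice, each entry of `fitters` would wrap one of the six methods (for example, a random forest or a penalized logistic regression from a statistical library), with `X1`, `y1` taken from EFFECT1 and `X2` from EFFECT2. The observed outcomes `y1` enter only in Step 1, which mirrors the description above.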