Fig. 3. Kaggle challenges: shift from the public to the private set compared with the improvement across the top 10% of models, on eight medical-imaging challenges with significant incentives.
The blue violin plot shows the evaluation noise, that is, the distribution of differences between the public and private leaderboards. A systematic shift between the public and private sets (positive means that the private leaderboard score is better than the public one) indicates overfitting or dataset bias. The width of this distribution shows how noisy the evaluation is, or equivalently how well the public score represents the private score. The brown bar is the winner gap: the improvement between the top-ranked model (the winner) and the model ranked at the top-10% mark. It is instructive to compare this improvement with the shift and width of the public-to-private difference: if the winner gap is smaller, the top 10% of models have reached diminishing returns and their differences do not reflect an actual improvement on new data.
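The quantities described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to produce the figure: the function name `leaderboard_summary` and its parameters are hypothetical, a higher score is assumed to be better, models are ranked by private score, the difference distribution is restricted to the top fraction of models, and the width is summarized by a standard deviation.

```python
import numpy as np


def leaderboard_summary(public, private, top_fraction=0.10):
    """Summarize evaluation noise and winner gap for one challenge.

    Hypothetical helper. Assumes higher scores are better and that
    public[i] and private[i] are the scores of the same submission.
    """
    public = np.asarray(public, dtype=float)
    private = np.asarray(private, dtype=float)

    # Rank submissions by private score, best first (assumption).
    order = np.argsort(-private)
    # Keep the top fraction of models, at least two so a gap is defined.
    n_top = max(2, int(np.ceil(top_fraction * len(private))))
    top = order[:n_top]

    # Positive difference: the private score is better than the public one.
    diffs = private[top] - public[top]

    return {
        # Systematic shift between the public and private sets.
        "shift": float(np.mean(diffs)),
        # Spread of the differences (population standard deviation),
        # used here as a crude measure of the evaluation noise.
        "width": float(np.std(diffs)),
        # Winner versus the last model inside the top fraction.
        "winner_gap": float(private[top[0]] - private[top[-1]]),
    }
```

Calling `leaderboard_summary(public_scores, private_scores)` on the two leaderboards of a challenge returns the three numbers to compare. A winner gap below the width would suggest, under these assumptions, that the ranking among the top models is within the evaluation noise.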