Intra-dataset and inter-dataset accuracy comparison
A.–C. Intra-dataset accuracy comparison over eight real datasets, shown as heatmaps of three classification metrics: overall accuracy (A), ARI (B), and V-measure (C). For each dataset, 5-fold cross-validation was performed, using four folds as the reference and one fold as the query. D.–F. Inter-dataset accuracy comparison over four pairs of experimental datasets and one pair of simulated datasets, shown as heatmaps of the same three classification metrics: overall accuracy (D), ARI (E), and V-measure (F). PBMC pair: PBMC sorted and PBMC-3K; pancreas pair: pancreas CEL-Seq2 and pancreas Fluidigm C1; TM full pair: TM full Smart-Seq2 and TM full 10X; TM lung pair: TM lung Smart-Seq2 and TM lung 10X; simulation pair: true assay and raw assay. The TM lung datasets were downsampled from the TM full datasets by retaining only cells from lung tissue. Within the simulation pair, the true assay (without dropouts) was used as the reference and the raw assay (with the dropout mask applied) was used as the query. Columns are datasets and rows are annotation methods. The color scale is shown on the figure; brighter yellow indicates better classification performance. To the right of each heatmap, a boxplot summarizes each method's classification metric across all datasets; box colors represent different methods. Methods in both the heatmap and the boxplot are arranged in descending order of their average metric across all datasets. Gray squares indicate that a method failed to produce a prediction for that dataset. ****, significantly higher (P < 0.05, Wilcoxon paired rank test) than nine other methods; *, significantly higher (P < 0.05, Wilcoxon paired rank test) than six other methods. ARI, adjusted Rand index; PBMC, peripheral blood mononuclear cell; TM, Tabula Muris.
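As a rough illustration of the evaluation described in this caption, the sketch below computes the three metrics (overall accuracy, ARI, and V-measure) from true and predicted cell labels, and builds 5-fold reference/query splits. It is a minimal standard-library sketch, not the authors' pipeline: the function names, the unstratified random split, and the fixed seed are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of cells whose predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def adjusted_rand_index(true_labels, predicted_labels):
    """Adjusted Rand index computed from the label contingency table."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no cell pairs to compare
    pair_counts = Counter(zip(true_labels, predicted_labels))
    sum_pairs = sum(math.comb(c, 2) for c in pair_counts.values())
    sum_true = sum(math.comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(math.comb(c, 2) for c in Counter(predicted_labels).values())
    expected = sum_true * sum_pred / math.comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. one label on both sides
    return (sum_pairs - expected) / (maximum - expected)


def _entropy(counts, n):
    """Shannon entropy (natural log) of a label distribution."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def v_measure(true_labels, predicted_labels):
    """Harmonic mean of homogeneity and completeness."""
    n = len(true_labels)
    pair_counts = Counter(zip(true_labels, predicted_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(predicted_labels)
    h_true = _entropy(true_counts.values(), n)
    h_pred = _entropy(pred_counts.values(), n)
    # Conditional entropies H(true | predicted) and H(predicted | true).
    h_true_given_pred = -sum(
        (c / n) * math.log(c / pred_counts[p]) for (t, p), c in pair_counts.items()
    )
    h_pred_given_true = -sum(
        (c / n) * math.log(c / true_counts[t]) for (t, p), c in pair_counts.items()
    )
    homogeneity = 1.0 if h_true == 0 else 1.0 - h_true_given_pred / h_true
    completeness = 1.0 if h_pred == 0 else 1.0 - h_pred_given_true / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def five_fold_splits(n_cells, n_folds=5, seed=0):
    """Yield (reference, query) index lists: four folds as reference, one as query.

    Assumption: a plain random split; the original study may have split differently.
    """
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        reference = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield reference, folds[k]
```

In use, each annotation method would be trained on the reference cells and asked to label the query cells; the three functions are then applied to the query's true and predicted labels, and the per-fold values averaged to give one heatmap cell per method and dataset.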