Figure 4.
Certainty-based annotation refinement in ProtoCloud
(A) The relationship between per-cell-type accuracy and similarity scores across eight experimental datasets. The x axis denotes the similarity score of each cell type, and the y axis denotes its prediction accuracy. Each point represents a distinct cell type from one of the experimental datasets. Point size and color intensity jointly encode the variance of similarity scores for each cell type. The black curve represents an isotonic regression fit, illustrating the positive relationship between similarity score and accuracy.
(B and C) Workflow for annotation refinement in ProtoCloud. (B) Step 1: assign a confidence category by classifying each cell prediction as “certain” (blue) or “ambiguous” (orange) using class-specific thresholds derived from the training data. Cells with a similarity score above the threshold are classified as “certain,” while those below the threshold are classified as “ambiguous.” (C) Step 2: if the original annotations are available, confidently predicted cells can be re-annotated by comparing predictions with the original annotations. Type 1 cells (green) are cells with high prediction confidence whose predictions do not match the original labels; they are therefore re-annotated. Type 2 cells (red) are ambiguous cells whose predictions do not match the original labels; they are not re-annotated.
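The two-step workflow in (B) and (C) can be sketched in a few lines of Python. This is a minimal illustration, not ProtoCloud's implementation: the function name `refine_annotations`, the dictionary of per-class thresholds, and the choice to treat a score exactly equal to the threshold as "certain" are all assumptions made for the example. How the thresholds are derived from the training data is not specified here, so they are taken as given inputs.

```python
from typing import Dict, List, Optional, Tuple


def refine_annotations(
    predicted: List[str],
    similarity: List[float],
    thresholds: Dict[str, float],
    original: List[str],
) -> List[Tuple[str, Optional[str], str]]:
    """Return (confidence, cell_type_category, refined_label) per cell.

    Hypothetical helper: thresholds are assumed to be keyed by the
    predicted class, and ties (score == threshold) count as "certain".
    """
    results = []
    for pred, score, orig in zip(predicted, similarity, original):
        # Step 1: compare the similarity score with the class-specific threshold.
        confidence = "certain" if score >= thresholds[pred] else "ambiguous"

        # Step 2: compare the prediction with the original annotation.
        if pred == orig:
            category = None          # labels agree, nothing to refine
            refined = orig
        elif confidence == "certain":
            category = "type 1"      # confident mismatch: re-annotate
            refined = pred
        else:
            category = "type 2"      # ambiguous mismatch: keep original label
            refined = orig
        results.append((confidence, category, refined))
    return results


# Toy inputs (invented values, for illustration only).
predicted = ["Cytotoxic T", "CD4+ T", "NK", "CD4+ T"]
similarity = [0.90, 0.40, 0.80, 0.95]
thresholds = {"CD4+ T": 0.70, "Cytotoxic T": 0.60, "NK": 0.75}
original = ["CD4+ T", "Cytotoxic T", "NK", "CD4+ T"]

refined = refine_annotations(predicted, similarity, thresholds, original)
for row in refined:
    print(row)
```

In this toy input, the first cell is a confident mismatch and is relabeled with its prediction, while the second is an ambiguous mismatch and keeps its original label.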
(D–F) Expression patterns of major type 1 annotation pairs in PBMC30K, comparing original annotations (first row), predicted types (second row), type 1, and type 2 cells. Type 1 cells should align more closely with the predicted cell type than with the original annotation. Type 2 cells remain ambiguous, showing limited separation between the original and predicted labels. The shared region contains genes that are top-ranked HRGs or known markers in both cell types. Asterisks denote marker genes that are also present in the respective HRG set. (D) CD4+ T cells versus cytotoxic T cells. Type 1 cells were originally labeled as CD4+ T cells and re-annotated as cytotoxic T cells. (E) Cytotoxic T cells versus CD4+ T cells. (F) Cytotoxic T cells versus natural killer cells. HRGs, highly relevant genes; M, marker.
(G) Reliability diagram of similarity scores and calibrated similarities for the benchmarking datasets. The x coordinate of each triangle is the average similarity score of a dataset, while that of each circle is the average calibrated probability. The diagonal dashed line indicates perfect calibration, where predicted confidence matches the observed accuracy.
(H) Reliability diagram of calibrated probabilities for the two confidence groups in the benchmarking datasets. Each point represents one dataset, stratified into “certain” (circles) and “ambiguous” (diamonds) subsets. The dashed diagonal indicates perfect calibration, where predicted certainty matches the empirical accuracy.
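As a rough illustration of how one point in the reliability diagrams of (G) and (H) could be computed, the sketch below pairs the mean confidence of a set of cells with their empirical accuracy, optionally split into "certain" and "ambiguous" subsets. The helper names `reliability_point` and `stratified_points` are hypothetical, and the inputs are invented toy values; the calibration step itself is not shown and the confidences are assumed to be already calibrated.

```python
from typing import Dict, List, Tuple


def reliability_point(confidences: List[float], correct: List[bool]) -> Tuple[float, float]:
    """Return (mean confidence, empirical accuracy) for one group of cells."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(1 for c in correct if c) / len(correct)
    return mean_conf, accuracy


def stratified_points(
    confidences: List[float],
    correct: List[bool],
    is_certain: List[bool],
) -> Dict[str, Tuple[float, float]]:
    """Compute one reliability point per confidence group, as in panel (H)."""
    groups: Dict[str, Tuple[List[float], List[bool]]] = {
        "certain": ([], []),
        "ambiguous": ([], []),
    }
    for conf, ok, certain in zip(confidences, correct, is_certain):
        key = "certain" if certain else "ambiguous"
        groups[key][0].append(conf)
        groups[key][1].append(ok)
    # Skip groups with no cells so that empty subsets do not raise.
    return {k: reliability_point(c, o) for k, (c, o) in groups.items() if c}


# Toy dataset (invented values, for illustration only).
confidences = [0.9, 0.8, 0.4, 0.6]
correct = [True, True, False, True]
is_certain = [True, True, False, False]

points = stratified_points(confidences, correct, is_certain)
for group, (x, y) in points.items():
    # A point on the diagonal (x == y) would indicate perfect calibration.
    print(group, round(x, 3), round(y, 3))
```

A point lying below the diagonal means the group is overconfident (mean confidence exceeds accuracy), and a point above it means the group is underconfident.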
