a, UMAP embeddings of the integrated hidden representation of the data for each method. Each cell is colored according to the dataset from which it was sequenced. b, Box plots of the correlation (left) and the RMSE (right) between each imputed protein’s predicted and true values, for each method. The box plots for Haniffa cover the proteins that were missing from Haniffa and imputed; likewise, the box plots for Sanger cover the proteins that were missing from Sanger and imputed. The lower and upper hinges correspond to the first and third quartiles, and the center line marks the median. The upper (lower) whisker extends from the hinge to the largest (smallest) value no further than 1.5 × the interquartile range from the hinge. Results are based on the analysis of 647,366 cells in the Haniffa data and 240,627 cells in the Sanger data. c, Feature plots for the selected proteins CD7, TCR_Va7.2, CD123, and HLA-DR. The first two proteins were imputed into the Haniffa dataset, and the last two were imputed into the Sanger dataset. Each scatterplot is a UMAP representation of the true expression of the missing proteins: one UMAP representation is computed for the proteins missing from the Haniffa data, and another for the proteins missing from the Sanger data. In each feature plot, cells are colored according to the intensity of their relative value for the specified protein. In the first row, the color mapping uses the true values. In the subsequent rows, cells are colored according to the protein’s expression as predicted by sciPENN and totalVI. The number in the top right of each plot is the correlation between the gold standard (true) protein expression counts and the predicted counts.
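The per-protein summaries in panel b can be illustrated with a minimal sketch. It assumes the true and predicted values are held as cells × proteins arrays and that the correlation is Pearson's; the caption does not state the correlation type or the quantile method, so both are assumptions here. The function names `per_protein_metrics` and `box_stats` are hypothetical and are not taken from sciPENN or totalVI.

```python
import numpy as np


def per_protein_metrics(true, pred):
    """Return (correlation, rmse), each with one entry per protein (column).

    Assumes Pearson correlation and that no column is constant
    (a constant column would give a zero denominator).
    """
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Center each protein across cells, then correlate column by column.
    tc = true - true.mean(axis=0)
    pc = pred - pred.mean(axis=0)
    denom = np.sqrt((tc ** 2).sum(axis=0) * (pc ** 2).sum(axis=0))
    corr = (tc * pc).sum(axis=0) / denom
    # Root mean squared error across cells, per protein.
    rmse = np.sqrt(((true - pred) ** 2).mean(axis=0))
    return corr, rmse


def box_stats(values):
    """Hinges, median, and whiskers following the caption's convention."""
    values = np.asarray(values, dtype=float)
    # Linear-interpolation quartiles (NumPy's default); an assumption.
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers reach the most extreme value within 1.5 x IQR of each hinge.
    upper = values[values <= q3 + 1.5 * iqr].max()
    lower = values[values >= q1 - 1.5 * iqr].min()
    return {
        "lower_whisker": lower,
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_whisker": upper,
    }


# Toy data only (4 cells x 2 proteins); not values from the study.
toy_true = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 2.0], [4.0, 3.0]])
toy_pred = toy_true + 1.0
toy_corr, toy_rmse = per_protein_metrics(toy_true, toy_pred)
toy_box = box_stats(toy_corr)
```

Applying `box_stats` to the per-protein correlations (or RMSEs) of one method and one dataset gives the five numbers that a single box in panel b would display. Values beyond the whiskers would be drawn as individual outlier points.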