Our architecture. For an input highly multiplexed image I, we split channels and feed them into ResNet18 [4]. Next, the embedding encoder extracts an interpretable representation r = [r0, …, rN], where N is the number of channels. Both the backbone and embedding encoders share weights across channels. Finally, we adopt three fully-connected layers to estimate class probabilities p.