Nat Commun. 2022 Apr 8;13:1914. doi: 10.1038/s41467-022-29443-w

Fig. 2. Latent embedding of the β-lactamase protein family, color-coded by taxonomy at the phylum level.

In the upper row, we embed the family using sequential models (LSTM, ResNet, Transformer) trained on the full corpus of protein families. In the lower row, we train the same sequential models on the β-lactamase family (Pfam PF00144) alone. For the models in the first three columns, a simple mean strategy is used to extract a global representation from the local (per-residue) representations, while the fourth column uses the bottleneck aggregation method; both strategies are sketched below. Finally, in the last column, we show the result of preprocessing the sequences into a multiple sequence alignment and applying a dense variational autoencoder (VAE). We see clear differences in how well the different phyla are separated, demonstrating the impact that model choice and data preprocessing can have on the learned representation.
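To make the two aggregation strategies concrete, here is a minimal PyTorch sketch under stated assumptions: the layer sizes and module names (`mean_aggregate`, `BottleneckAggregation`, `d_model`, `d_bottleneck`) are hypothetical illustrations, not the paper's exact architecture. The mean strategy simply averages per-residue representations, while a bottleneck-style aggregation forces the global summary through a narrow learned layer so it cannot retain all local detail.

```python
import torch
import torch.nn as nn

def mean_aggregate(local_reps: torch.Tensor) -> torch.Tensor:
    # local_reps: (seq_len, d_model) per-residue representations
    # from an LSTM / ResNet / Transformer encoder.
    return local_reps.mean(dim=0)  # global representation, shape (d_model,)

class BottleneckAggregation(nn.Module):
    """Illustrative bottleneck aggregation: squeeze the pooled
    representation through a narrow hidden layer so the global
    summary passes through a low-dimensional code.
    (Hypothetical layer sizes; not the paper's exact architecture.)"""
    def __init__(self, d_model: int = 512, d_bottleneck: int = 32):
        super().__init__()
        self.compress = nn.Linear(d_model, d_bottleneck)
        self.expand = nn.Linear(d_bottleneck, d_model)

    def forward(self, local_reps: torch.Tensor) -> torch.Tensor:
        pooled = local_reps.mean(dim=0)           # (d_model,)
        code = torch.relu(self.compress(pooled))  # (d_bottleneck,)
        return self.expand(code)                  # (d_model,)

# Usage: one sequence of 120 residues with d_model = 512.
local_reps = torch.randn(120, 512)
global_mean = mean_aggregate(local_reps)          # shape (512,)
global_bn = BottleneckAggregation()(local_reps)   # shape (512,)
```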