Author manuscript; available in PMC: 2020 Mar 13.

Published in final edited form as: Nat Methods. 2019 Oct 21;16(12):1315–1322. doi: 10.1038/s41592-019-0598-1

Fig. 1 | a, The UniRep model was trained on 24 million UniRef50 primary amino-acid sequences. It was trained to perform next amino-acid prediction (minimizing cross-entropy loss) and, in so doing, was forced to learn how to internally represent proteins. b, During application, the trained model generates a single fixed-length vector representation of the input sequence by globally averaging the intermediate mLSTM numerical summaries (the hidden states). A top model (for example, a sparse linear regression or random forest) is then trained on this representation, which acts as a featurization of the input sequence, enabling supervised learning on diverse protein informatics tasks.
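The three steps named in the caption (next amino-acid cross-entropy, global averaging of hidden states, and a top model fit on the resulting vectors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the toy hidden dimension of 8, and the random arrays standing in for mLSTM hidden states are all assumptions made for illustration, and a closed-form ridge regression is swapped in for the sparse linear regression or random forest mentioned in the caption.

```python
import numpy as np


def next_residue_cross_entropy(logits, targets):
    """Mean cross-entropy for next amino-acid prediction.

    logits:  (L, V) unnormalized scores, one row per sequence position.
    targets: (L,) integer index of the true next residue at each position.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def average_hidden_states(hidden_states):
    """Collapse per-position hidden states (L, d) into one (d,) vector."""
    return np.asarray(hidden_states).mean(axis=0)


def _with_bias(features):
    # Append a constant column so the model can learn an intercept.
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_ridge_top_model(features, labels, alpha=1.0):
    """Closed-form ridge regression, w = (X^T X + alpha I)^{-1} X^T y.

    Stand-in for the caption's top model; the intercept is penalized too,
    which keeps the sketch short.
    """
    X = _with_bias(features)
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ labels)


def predict_top_model(weights, features):
    return _with_bias(features) @ weights


# Toy usage: random arrays stand in for hidden states of proteins
# with different lengths (hypothetical data, not real UniRep output).
rng = np.random.default_rng(0)
hidden_dim = 8  # assumed toy size
lengths = [30, 75, 120, 48]
representations = np.stack(
    [average_hidden_states(rng.normal(size=(n, hidden_dim))) for n in lengths]
)
toy_labels = rng.normal(size=len(lengths))  # placeholder property values
weights = fit_ridge_top_model(representations, toy_labels, alpha=0.1)
predictions = predict_top_model(weights, representations)
```

Because the average is taken over the position axis, sequences of any length map to a vector of the same size, which is what lets a single top model be trained across proteins of different lengths.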