2022 Nov 8;13:6736. doi: 10.1038/s41467-022-34435-x

Fig. 1. Overview of NLP-ML.

The NLP-ML approach contains four steps: (i) Text preprocessing: Unstructured sample metadata is preprocessed to remove text elements extraneous to sample classification and to reduce words to their roots; (ii) Creating text-based sample embeddings: A neural network model trained on large text corpora is used to create numerical embeddings of individual words. The embedding of a sample is created by averaging the embeddings of the words in that sample’s metadata; (iii) Training sample tissue classifiers: Supervised machine learning models (one per tissue/cell type) are trained using sample text embeddings as features and manually curated sample-to-tissue/cell-type annotations as labels; and (iv) Classifying new samples: Descriptions of unlabeled samples are preprocessed and turned into numerical embeddings. Each trained model takes these embeddings as input and outputs the probability that the sample is from that tissue/cell type.
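
The four steps in the caption can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the word vectors in `TOY_VECTORS`, the stopword list, the plural-stripping rule, and the small logistic-regression trainer are all hypothetical stand-ins chosen for illustration. The caption does not say which classifier or which embedding model is used, so logistic regression and two-dimensional toy vectors are assumptions.

```python
import math
import re

# ASSUMPTION: toy 2-D word vectors standing in for embeddings from a
# neural network model trained on large text corpora.
TOY_VECTORS = {
    "liver": [1.0, 0.0],
    "hepatocyte": [0.9, 0.1],
    "brain": [0.0, 1.0],
    "cortex": [0.1, 0.9],
}
# ASSUMPTION: a tiny illustrative stopword list.
STOPWORDS = {"the", "of", "from", "and", "a", "sample"}


def preprocess(text):
    """Step (i): lowercase, keep alphabetic tokens, crudely reduce to roots."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stripping a trailing "s" is a crude stand-in for real stemming.
    roots = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return [t for t in roots if t not in STOPWORDS]


def embed(text, vectors=TOY_VECTORS, dim=2):
    """Step (ii): average the vectors of the known words in the text."""
    known = [vectors[t] for t in preprocess(text) if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w, b):
    """Probability from a linear model followed by a sigmoid."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


def train_binary(X, y, epochs=500, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            grad = score(x, w, b) - target
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def train_per_tissue(texts, labels):
    """Step (iii): one binary classifier per tissue/cell type."""
    X = [embed(t) for t in texts]
    return {
        tissue: train_binary(X, [1 if lab == tissue else 0 for lab in labels])
        for tissue in sorted(set(labels))
    }


def classify(text, models):
    """Step (iv): each model reports the probability for its own tissue."""
    x = embed(text)
    return {tissue: score(x, w, b) for tissue, (w, b) in models.items()}


# Hypothetical, hand-written training descriptions and curated labels.
texts = [
    "Liver tissue from adult donors",
    "primary hepatocytes",
    "brain cortex samples",
    "Cortex of the brain",
]
labels = ["liver", "liver", "brain", "brain"]

models = train_per_tissue(texts, labels)
probs = classify("Hepatocytes isolated from liver", models)
```

Here `probs` maps each tissue name to a probability from that tissue's own model. Because the models are trained independently, the probabilities need not sum to one, which matches the caption's description of one model per tissue/cell type.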