Protocol to assess fatal embolism risks from human stem cells

Fei Ma; Jinlai Zhang; Xin Jin; Pengfei Han; Yuling Liu; Ting Zhang; Kaijing Yan; Y James Kang

doi:10.1016/j.xpro.2023.102268

2023 May 1;4(2):102268. doi: 10.1016/j.xpro.2023.102268

PMCID: PMC10176070 PMID: 37133989

Summary

Here, we present a protocol to identify the pro-embolic sub-population of human adipose-derived multipotent stromal cells (ADSCs) and predict fatal embolism risks from ADSC infusion. We describe steps for the collection, processing, and classification of ADSC single-cell RNA-seq data. We then detail the development of a mathematical model for predicting ADSC embolic risk. This protocol allows for the development of prediction models to enhance the assessment of cell quality and advance the clinical applications of stem cells.

For complete details on the use and execution of this protocol, please refer to Yan et al. (2022).¹

Subject areas: Bioinformatics, Health Sciences, RNAseq, Stem Cells

Highlights

• A scRNA-seq-based embolism risk assessment of stem cells for clinical use
• A step-by-step process for building a supervised machine learning risk assessment model
• A logistic analogy for stem cell heterogeneity quality control in clinical application

Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.

Before you begin

The protocol must be run on the Ubuntu operating system, on a computer with a CPU main frequency ≥ 2.70 GHz and memory ≥ 16 GB. We used APT, the default package manager of Debian-based Linux distributions, to install software. Ensure that your operating system supports the installation and use of Perl (version 5.26.3), Python (version 3.6.9), and R (version 4.1.0), and that your computer has enough disk space (on average, 100 GB per sample) to store the output.

Software installation

Timing: ∼1 h

This section describes the software and packages required for this protocol. We used APT, the default package manager of Debian-based Linux distributions, to install the software.

1.
Install the R language and its packages.
- a.
  Install the R language.
  $sudo apt-get install r-base-core r-base r-base-dev
- b.
  Install the R package dependencies in an R console opened from the Linux terminal.
  $R
  
  >install.packages(c("Seurat", "ggplot2"))
  
  >q()
  Note: The required packages are Seurat² and ggplot2.³

2.
Install the Python language and its packages.
- a.
  Install the Python language.
  $sudo add-apt-repository -y ppa:jblgf0/python
  
  $sudo apt-get update
  
  $sudo apt-get install python3.6
  
  ###add the “export” line to the bashrc file.
  
  $vim ~/.bashrc
  
  export PATH=$PATH:/usr/local/lib/python3.6/dist-packages
  
  ###save the changes and quit the bashrc file
  
  :wq
  
  $source ~/.bashrc
  Note: We used the “jblgf0/python” repository so that a Python version matching different Ubuntu distributions can be installed. Adding the “export” line to the bashrc file lets you call Python packages by name on the command line without providing their full path.
- b.
  Install the Python modules.
  $python3.6 -m pip install pandas numpy scikit-learn matplotlib cwlref-runner

3.
Install the Perl language.

$sudo wget https://www.cpan.org/src/5.0/perl-5.26.3.tar.bz2

$sudo tar -jxvf perl-5.26.3.tar.bz2

$cd perl-5.26.3

$sudo ./Configure -des -Dprefix=$HOME/localperl

$sudo make

$sudo make test

$sudo make install

$perl -v

###verify the installation: if the correct version (v5.26.3) is shown, the installation was successful.

4.
Install the other software.
- a.
  Install the FastQC.
  - i.
    Replace path with the real path where fastqc is downloaded.
  - ii.
    Add the “export” line in the bashrc file.
    Note: In this way, you can directly use the software with the name “fastqc” on the command line without providing its full path.
    
    ###download the fastqc_v0.11.9 installation package to the local directory “path”
    
    $wget -P path/ https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
    
    $unzip path/fastqc_v0.11.9.zip -d path/
    
    $cd path/FastQC
    
    $chmod 754 path/FastQC/fastqc
    
    ###add the “export” line to the bashrc file.
    
    $vim ~/.bashrc
    
    export PATH="path/FastQC:$PATH" ###path is where fastqc was downloaded
    
    ###save the changes and quit the bashrc file
    
    :wq
    
    $source ~/.bashrc
- b.
  Install the cellranger.
  - i.
    Download the cellranger from this link (registration is required and reference files should be downloaded).
  - ii.
    Replace path with the real path where cellranger is downloaded.
  - iii.
    Add the “export” line in the bashrc file.
    Note: In this way, you can directly use the software with the name “cellranger” on the command line without providing its full path.
    
    $tar -xzvf path/cellranger-6.1.2.tar.gz
    
    $tar -xzvf refdata-gex-GRCh38-2020-A.tar.gz
    
    $vim ~/.bashrc
    
    export PATH="path/cellranger-6.1.2:$PATH"
    
    :wq
    
    $source ~/.bashrc

Single cell RNA sequencing

Timing: 4–7 days

Note: ADSCs were obtained from three independent donors (A2105, A2013, and A2106) and cultured in commercially available media. The media used were MF medium (αMEM + FBS), which is supplemented with fetal bovine serum (FBS),¹ and IL medium, which is chemically defined.¹

Note: Our research demonstrated that ADSCs cultured in the MF medium caused pulmonary embolism after intravenous infusion into mice, while ADSCs cultured in the IL medium did not.¹ Therefore, we used the MF and IL culture processes to obtain pro-embolic and non-embolic ADSC samples, respectively (see Table 1).

Table 1.

Sample and data details used in the protocol

Sample id	Mixed data	Donor	Culture	Passage	Used for	Phenotype of cells
A2105C2P5(70%)^a	S1	A2105	MF	P5	Training set	Pro-embolic
A2105C3P5(70%)^a	S1	A2105	IL	P5	Training set	Non-embolic
A2105C2P5(30%) ^b	T1	A2105	MF	P5	Test set 1	Pro-embolic
A2105C3P5(30%) ^b	T1	A2105	IL	P5	Test set 1	Non-embolic
A2105C2P3	T2	A2105	MF	P3	Test set 2	Pro-embolic
A2105C3P3	T2	A2105	IL	P3	Test set 2	Non-embolic
A2013C2P5	T3	A2013	MF	P5	Test set 3	Pro-embolic
A2013C3P5	T3	A2013	IL	P5	Test set 3	Non-embolic
A2106C2P3	T4	A2106	MF	P3	Test set 4	Pro-embolic
A2106C3P3	T4	A2106	IL	P3	Test set 4	Non-embolic

^a Randomly selected 70% of the cells from the sample.

^b The remaining 30% of the cells from the sample.

5.
Harvest cells at passages 3 and 5.
6.
Perform scRNA-seq experiments (on the BD Rhapsody or 10x Genomics platform) on these samples to obtain the raw fastq data of single-cell gene expression.
7.
Mix the scRNA-seq data of the pro-embolic and non-embolic samples from the same donor and passage to form the mixed data.

Note: These data were used as the training set (S1) and test sets (T1, T2, T3, T4) (see Table 1).

Note: For more information about the experimental details of single cell RNA sequencing, please refer to our research article.¹

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Software and algorithms

Linux OS	https://ubuntu.com/download/server	Ubuntu 18.04.5 LTS
R statistical software project	https://www.R-project.org/	v4.1.0
Python	https://www.python.org/	Version >=3.6
Perl	https://www.perl.org/	v5.26.3
Seurat	R packages	v4.0.3
cellranger	https://www.10xgenomics.com/support/single-cell-gene-expression	V6.1.2
Pandas	https://pandas.pydata.org/	V1.5.1
numpy	https://numpy.org	V1.23.0
matplotlib	https://matplotlib.org/	V3.6.0
RandomSelectCellsAndHVG2000Exp.pl	https://doi.org/10.5281/zenodo.7672376	https://doi.org/10.5281/zenodo.7672376
add_typeToinput.pl	https://doi.org/10.5281/zenodo.7672376	https://doi.org/10.5281/Zenodo.7672376

Experimental models: Organisms/strains

Mouse: NCG	GemPharmatech	N/A

Biological samples

Human adipose tissue	Yan et al.¹	STAR★METHODS: Table S2. Donor information related to STAR★METHODS.

Other

Experimental information about the scRNA sequencing of hADSCs	Yan et al.¹	STAR★METHODS: Single cell RNA-seq library generation and sequencing
Illumina NextSeq 2000	Illumina	Cat#20038897
BD Rhapsody™ Single-Cell Analysis System	BD Biosciences	Cat#633701
AMPure XP Beads	Beckman Coulter	Cat#A63881
NextSeq 1000/2000 P2 Reagents V3	Illumina	Cat#20046813
ADSC	Abdominal liposuction	N/A
MF medium	Yan et al.¹	STAR★METHODS
IL medium	Yan et al.¹	STAR★METHODS
BD repository	https://bitbucket.org/CRSwDev/cwl/get/f5ea290bcafb.zip	V1.0
Reference genome file (for BD pipeline)	https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/GRCh38.p13.genome.fa.gz	GENCODE: Human Release 32 (GRCh38.p13)
Transcriptome annotation file (for BD pipeline)	https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.chr_patch_hapl_scaff.annotation.gtf.gz	GENCODE: Human Release 32 (GRCh38.p13)
Reference files (for 10X genomic pipeline)	https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz	GRCh38-2020
Single cell RNA sequencing data	https://bigd.big.ac.cn/gsa-human/browse/HRA003811	GSA-Human: HRA003811

Step-by-step method details

Data preprocessing

Timing: 2∼5 h

This section describes the preprocessing of the scRNA-seq data. Its outputs are used as the input files for establishing a classifier for the pro-embolic cells.

1.
Raw data quality control and filtering.
- a.
  Quality control and filtering of the raw sequencing fastq data files. The analysis of the S1 sample is shown below; the QC filtering of the other samples is the same.
  #R1 reads of the S1 sample
  
  $fastqc -t 10 -o outdir -d ./temp -f fastq S1_R1.fastq.gz
  
  #R2 reads of the S1 sample
  
  $fastqc -t 10 -o outdir -d ./temp -f fastq S1_R2.fastq.gz
  Note: The outputs of fastqc include a “.zip” file, which records detailed information, and a QC report file in html format. The report provides a modular set of analyses that give a quick impression of whether your data have any problems you should be aware of before further analysis. For the interpretation of the report, please refer to https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
- b.
  Quality control pipeline for the BD Rhapsody platform.
  Note: This pipeline is used to process raw sequencing from the BD Rhapsody platform.
  - i.
    Download BD repository file (.zip format, see the key resources table), decompress the .zip file and obtain “template.yml”.
  - ii.
    Download the reference genome file and transcriptome annotation file (see the key resources table).
  - iii.
    Modify the “template.yml” file to indicate the correct locations of the sequencing fastq files, the reference genome file, and the transcriptome annotation file (Figure 1) before running the following Linux command.
    $cwl-runner --parallel --outdir /home/MSC/S1 --tmpdir-prefix /home/MSC/tmp_Dir --tmp-outdir-prefix /home/MSC/tmp_Dir --rm-tmpdir /home/script/rhapsody_wta_1.9.1.cwl /home/script/template_wta_1.9.1.yml
    Note: This step produces the quality report file (/home/MSC/S1/S1_Metrics_Summary.csv) and the original read count file (/home/MSC/S1/S1_RSEC_MolsPerCell.csv), which is the input of step 2.
- c.
  Quality control pipeline for the 10X genomic platform.
  Note: This pipeline is used to process raw sequence data from the 10X genomic platform.
  - i.
    Download compressed reference file: refdata-gex-GRCh38-2020-A.tar.gz (see the key resources table) to the appropriate directory (for example, /home/database/cellranger/).
  - ii.
    Run the following commands.
    #Make a directory for the fastq files
    
    $mkdir -p /home/fastqs/S1
    
    $cp S1_R1.fastq.gz /home/fastqs/S1/
    
    $cp S1_R2.fastq.gz /home/fastqs/S1/
    
    $tar -zxvf /home/database/cellranger/refdata-gex-GRCh38-2020-A.tar.gz -C /home/database/cellranger/
    
    #Run cellranger
    
    $cellranger count --id S1 --transcriptome /home/database/cellranger/refdata-gex-GRCh38-2020-A --fastqs /home/fastqs/S1 --expect-cells 1000 --localcores 10 --localmem 64
    Note: This step produces the quality report file (/home/fastqs/S1/outs/web_summary.html) and the original read count folder (/home/fastqs/S1/outs/filtered_feature_bc_matrix), which is the input of step 2.
- d.
  Judge whether the sample is qualified according to the quality criteria.
  Note: The criteria are as follows: Estimated Number of Cells >3000, Mean Reads per Cell >50k, Median Genes per Cell >3000, Fraction Reads in Cells >70%, Valid Barcodes >75%, Valid UMIs >75%, and Q30 Bases in RNA Read >70%.
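
The quality criteria in step 1d can also be checked programmatically. The following is a minimal Python sketch, not part of the original pipeline: the dictionary keys are hypothetical names that must be mapped by hand to the fields of your own BD Rhapsody or cellranger report, and the example values are illustrative only.

```python
# Thresholds copied from the quality criteria in step 1d.
# The key names are hypothetical; map them to the fields of your own report.
QC_THRESHOLDS = {
    "estimated_number_of_cells": 3000,
    "mean_reads_per_cell": 50_000,
    "median_genes_per_cell": 3000,
    "fraction_reads_in_cells_pct": 70.0,
    "valid_barcodes_pct": 75.0,
    "valid_umis_pct": 75.0,
    "q30_bases_in_rna_read_pct": 70.0,
}


def failed_qc_metrics(metrics):
    """Return the names of metrics that are missing or not strictly above threshold."""
    failed = []
    for name, threshold in QC_THRESHOLDS.items():
        value = metrics.get(name)
        if value is None or value <= threshold:
            failed.append(name)
    return failed


# Illustrative values only, not measurements from the study.
example_metrics = {
    "estimated_number_of_cells": 5200,
    "mean_reads_per_cell": 61_000,
    "median_genes_per_cell": 3400,
    "fraction_reads_in_cells_pct": 82.5,
    "valid_barcodes_pct": 96.1,
    "valid_umis_pct": 99.0,
    "q30_bases_in_rna_read_pct": 91.3,
}

failed = failed_qc_metrics(example_metrics)
print("qualified" if not failed else "failed: " + ", ".join(failed))
```

A sample is treated as qualified only when the returned list is empty, so a missing metric counts as a failure rather than being silently ignored.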

2.
Build the cell-gene expression Seurat object.
In this step, the outputs of step 1 are used as the input (select the commands for the corresponding platform). Variable names in the commands can be changed as you prefer.
- a.
  Enter the R interactive environment and load the Seurat package.
  $R
  
  >library(Seurat)
- b.
  For BD Rhapsody platform.
  >counts.data <- t(read.csv("/home/MSC/S1/S1_RSEC_MolsPerCell.csv", row.names=1, check.names=FALSE, comment.char="#"))
  
  >colnames(counts.data) <- paste("S1", colnames(counts.data), sep="-")
  
  >S1 <- CreateSeuratObject(counts=counts.data, project="S1", min.cells=10)
  
  >S1$Sample <- "S1"
- c.
  For 10X genomic platform.
  >S1_dir <- "/home/fastqs/S1/outs/filtered_feature_bc_matrix"
  
  >S1 <- Read10X(data.dir=S1_dir)
  
  >S1 <- CreateSeuratObject(counts=S1, min.cells=10, min.features=200)
  
  >S1$Sample <- rep("S1", dim(S1)[2])
  Note: Create the Seurat objects of the other samples in the same way.

3.
Normalization, removal of the batch effect and clustering analysis.
- a.
  Normalization.
  Note: This sub-block removes unwanted cells from the dataset and normalizes the data produced in step 2. It employs the global-scaling normalization method “LogNormalize”, which normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.
  
  >ADSC1 <- merge(x=S1, y=list(T1,T2,T3,T4))
  
  >DefaultAssay(ADSC1) <- "RNA"
  
  >ADSC1[["percent.mt"]] <- PercentageFeatureSet(ADSC1, pattern=paste(c("^MT-", "^Mt-", "^mt-", "^MT."), collapse="|"))
  
  >ADSC1 <- subset(ADSC1, subset=nFeature_RNA > 250 & nCount_RNA > 500 & percent.mt < 10)
  
  >ADSC1 <- NormalizeData(ADSC1, verbose=FALSE)
- b.
  Removal of the batch effect.
  Note: This sub-block corrects for technical differences between datasets caused by the batch effect by identifying cross-dataset pairs of cells that are in a matched biological state.
  
  >ADSC1_MG <- FindVariableFeatures(ADSC1, selection.method="vst", nfeatures=2000)
  
  >adsc.list <- SplitObject(ADSC1_MG, split.by="Sample")
  
  >adsc.list <- lapply(X=adsc.list, FUN=function(x) {
  
  x <- NormalizeData(x)
  
  x <- FindVariableFeatures(x, selection.method="vst", nfeatures=2000)
  
  })
  
  >features <- SelectIntegrationFeatures(object.list=adsc.list)
  
  >adsc.anchors <- FindIntegrationAnchors(object.list=adsc.list, anchor.features=features)
  
  >ADSC1_IT <- IntegrateData(anchorset=adsc.anchors)
- c.
  Clustering analysis.
  Note: This sub-block is to cluster the cells without technical bias.
  - i.
    Shift the expression distribution of each gene across cells into the standard normal distribution using the “ScaleData”.
  - ii.
    Perform linear dimensional reduction using the “RunPCA”.
  - iii.
    Determine the ‘dimensionality’ of dataset using the “JackStraw” and the “ScoreJackStraw”.
  - iv.
    Cluster the cells using the “FindNeighbors” and the “FindClusters”.
  - v.
    Use “UMAP” to visualize and explore the datasets.
    Note: For the interpretation of the outputs of this sub-block, refer to https://satijalab.org/seurat/articles/pbmc3k_tutorial.html.
    
    #S.Score and G2M.Score must already be present in the metadata (for example, computed with CellCycleScoring)
    
    >ADSC1_IT <- ScaleData(ADSC1_IT, verbose=FALSE,vars.to.regress=c("S.Score", "G2M.Score","nCount_RNA", "percent.mt"))
    
    >ADSC1_IT <- RunPCA(ADSC1_IT, verbose=FALSE)
    
    >ElbowPlot(ADSC1_IT, ndims=50)
    
    >ADSC1_IT <- JackStraw(ADSC1_IT, num.replicate=100)
    
    >ADSC1_IT <- ScoreJackStraw(ADSC1_IT, dims=1:20)
    
    >JackStrawPlot(ADSC1_IT, dims=1:20)
    
    >ADSC1_IT <- FindNeighbors(ADSC1_IT, reduction="pca", dims=1:30)
    
    >ADSC1_IT <- FindClusters(ADSC1_IT, resolution=0.5)
    
    >ADSC1_IT <- RunUMAP(ADSC1_IT, reduction="pca", dims=1:30)
    
    >write.csv(GetAssayData(ADSC1_IT,slot="data"),file="ADSC1_IT_exp.csv")
    
    >write.csv(ADSC1_IT@meta.data, file="ADSC1_IT_meta.csv")
    
    >write.csv(VariableFeatures(ADSC1_IT),file="ADSC1_IT_HVG.csv")
    Note: The “ADSC1_IT_exp.csv” file is the standardized cell-gene expression matrix; the “ADSC1_IT_meta.csv” file contains the cluster, cell name, and sample information of each cell; and the “ADSC1_IT_HVG.csv” file lists the 2000 genes with the most variable expression across all cells. These files are used in the subsequent analysis.
    
    CRITICAL: Step 3 consumes a lot of memory: 30k cells across 5 datasets consume about 30 GB of memory. Ensure that your computer has enough memory for this step.
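
The “LogNormalize” arithmetic described in step 3a can be illustrated outside Seurat. The sketch below is a plain-Python illustration of the formula stated in the note (divide by the total expression of the cell, multiply by the scale factor, log-transform with log(1 + x)); it is not the Seurat implementation, and the function name and toy counts are assumptions made for the example.

```python
import math


def log_normalize(counts, scale_factor=10_000):
    """Log-normalize raw counts given as a list of cells, each a list of gene counts.

    Each count is divided by the total count of its cell, multiplied by
    scale_factor, and transformed with log(1 + x).
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # A cell without any counts stays at zero instead of dividing by zero.
            normalized.append([0.0 for _ in cell])
            continue
        normalized.append([math.log1p(c / total * scale_factor) for c in cell])
    return normalized


# Toy matrix: two cells, three genes (illustrative values only).
toy_counts = [
    [1, 1, 2],
    [0, 10, 30],
]
for row in log_normalize(toy_counts):
    print([round(value, 3) for value in row])
```

Because every cell is rescaled to the same total before the log transform, cells with different sequencing depths become comparable, which is why this step precedes integration and clustering.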

Figure 1. Modify the YML file before running the preprocessing pipeline of the BD Rhapsody platform

The file locations shown in red must be filled in correctly, especially that of the sequenced fastq file.

Feature selection

Timing: ∼4 h

This section describes the detailed procedure of the feature (genes expressed in the cell) importance ranking and the optimal feature number analysis for the classifier development using machine learning. This allows us to utilize the most effective feature information while reducing noise (such as genes involved in general biological processes of cells) in the process of classifier establishment.

4.
Set up training and test sets.
- a.
  Set up the training set (S1 mixed data): Randomly select 70% of the cells and their HVG gene expression from each subgroup of A2105C3P5 and A2105C2P5 to form the training set (S1) using the Perl script “RandomSelectCellsAndHVG2000Exp.pl”.
  #Make a directory for the training set
  
  $mkdir DataSet
  
  #Extract cell-gene expression matrixes of HVG genes from each subgroup
  
  $nohup perl script/RandomSelectCellsWithHVG2000Exp.pl ADSC1_IT_meta.csv ADSC1_IT_HVG.csv ADSC1_IT_exp.csv A2105C3P5 0.7 DataSet &
  
  $nohup perl script/RandomSelectCellsWithHVG2000Exp.pl ADSC1_IT_meta.csv ADSC1_IT_HVG.csv ADSC1_IT_exp.csv A2105C2P5 0.7 DataSet &
  
  #Use the “jobs” command to check whether the task submitted to the background via nohup is completed
  
  $jobs
  
  $cd DataSet
  
  #Merge cells from each subgroup
  
  $paste A2105C3P5_C0_HVG2000Exp70.txt A2105C3P5_C1_HVG2000Exp70.txt A2105C3P5_C2_HVG2000Exp70.txt A2105C3P5_C3_HVG2000Exp70.txt A2105C3P5_C4_HVG2000Exp70.txt A2105C3P5_C5_HVG2000Exp70.txt > A2105C3P5_70.txt
  
  $paste A2105C2P5_C0_HVG2000Exp70.txt A2105C2P5_C1_HVG2000Exp70.txt A2105C2P5_C2_HVG2000Exp70.txt A2105C2P5_C3_HVG2000Exp70.txt A2105C2P5_C4_HVG2000Exp70.txt A2105C2P5_C5_HVG2000Exp70.txt > A2105C2P5_70.txt
- b.
  Set up test set 1 (T1 mixed data): The remaining 30% of the cells and their HVG gene expression in A2105C2P5 and A2105C3P5 are mixed as T1 and used as test set 1.
  #Merge cells from each subgroup
  
  $cd DataSet
  
  $paste A2105C3P5_C0_HVG2000Exp30.txt A2105C3P5_C1_HVG2000Exp30.txt A2105C3P5_C2_HVG2000Exp30.txt A2105C3P5_C3_HVG2000Exp30.txt A2105C3P5_C4_HVG2000Exp30.txt A2105C3P5_C5_HVG2000Exp30.txt > A2105C3P5_30.txt
  
  $paste A2105C2P5_C0_HVG2000Exp30.txt A2105C2P5_C1_HVG2000Exp30.txt A2105C2P5_C2_HVG2000Exp30.txt A2105C2P5_C3_HVG2000Exp30.txt A2105C2P5_C4_HVG2000Exp30.txt A2105C2P5_C5_HVG2000Exp30.txt > A2105C2P5_30.txt
- c.
  Set up test set 2 (T2 mixed data): Get all cells and their HVG gene expression from each subgroup in A2105C3P3 and A2105C2P3.
  #Extract cell-gene expression matrixes of HVG genes from each subgroup
  
  $nohup perl script/RandomSelectCellsWithHVG2000Exp.pl ADSC1_IT_meta.csv ADSC1_IT_HVG.csv ADSC1_IT_exp.csv A2105C3P3 1 DataSet &
  
  $nohup perl script/RandomSelectCellsWithHVG2000Exp.pl ADSC1_IT_meta.csv ADSC1_IT_HVG.csv ADSC1_IT_exp.csv A2105C2P3 1 DataSet &
  
  #Use the “jobs” command to check whether the task submitted to the background via nohup is completed
  
  $jobs
  
  $cd DataSet
  
  #Merge cells from each subgroup
  
  $paste A2105C3P3_C0_HVG2000Exp100.txt A2105C3P3_C1_HVG2000Exp100.txt A2105C3P3_C2_HVG2000Exp100.txt A2105C3P3_C3_HVG2000Exp100.txt A2105C3P3_C4_HVG2000Exp100.txt A2105C3P3_C5_HVG2000Exp100.txt > A2105C3P3_100.txt
  
  $paste A2105C2P3_C0_HVG2000Exp100.txt A2105C2P3_C1_HVG2000Exp100.txt A2105C2P3_C2_HVG2000Exp100.txt A2105C2P3_C3_HVG2000Exp100.txt A2105C2P3_C4_HVG2000Exp100.txt A2105C2P3_C5_HVG2000Exp100.txt > A2105C2P3_100.txt
- d.
  Set up test set 3 and set 4 (T3 and T4 mixed data) using the method of setting up the test set 2.
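
The per-subgroup 70%/30% split in step 4 can be pictured with the following Python sketch. It is an illustration of stratified random selection under stated assumptions, not a reimplementation of the Perl script: the input format (a mapping from cell name to cluster), the function name, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict


def split_cells_by_cluster(cell_to_cluster, fraction=0.7, seed=1):
    """Randomly split cells into (selected, remaining) within each cluster.

    cell_to_cluster maps a cell name to its cluster label. The same fraction
    is drawn from every cluster so that both parts keep the cluster
    composition of the sample.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for cell, cluster in cell_to_cluster.items():
        by_cluster[cluster].append(cell)

    selected, remaining = [], []
    for cluster in sorted(by_cluster):
        cells = sorted(by_cluster[cluster])  # sort first so the seed gives a stable result
        rng.shuffle(cells)
        n_selected = round(len(cells) * fraction)
        selected.extend(cells[:n_selected])
        remaining.extend(cells[n_selected:])
    return selected, remaining


# Toy example: 10 cells in each of two clusters (names are illustrative).
toy_cells = {"cell%02d" % i: "C0" if i < 10 else "C1" for i in range(20)}
train_cells, test_cells = split_cells_by_cluster(toy_cells, fraction=0.7)
print(len(train_cells), len(test_cells))
```

With a fraction of 1, every cell lands in the first list, which corresponds to the way test sets 2 to 4 take all cells of a sample.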

5.
Label the class of cells.
Note: The ADSCs expanded using the different culture processes were infused intravenously into 6 mice, respectively. Our previous work¹ demonstrated that ADSC samples expanded in MF medium caused embolism in all mice, while samples expanded in IL medium did not. Therefore, we assumed that MF-expanded cells were pro-embolic and IL-expanded cells were non-embolic, and labeled the pro- and non-embolic cells in the training and test sets using the Perl script “add_typeToinput.pl” (Table 1).
- a.
  Label the cells of the S1 mixed data.
  $cd DataSet
  
  #matrix transpose
  
  $awk '{i=1;while(i <= NF){col[i]=col[i] $i " ";i=i+1}} END {i=1;while(i<=NF){print col[i];i=i+1}}' A2105C3P5_70.txt | sed 's/[ ]*$//g' > A2105C3P5_70_T.txt
  
  $awk '{i=1;while(i <= NF){col[i]=col[i] $i " ";i=i+1}} END {i=1;while(i<=NF){print col[i];i=i+1}}' A2105C2P5_70.txt | sed 's/[ ]*$//g' > A2105C2P5_70_T.txt
  
  #add label for each cell
  
  $perl script/add_typeToinput.pl A2105C3P5_70_T.txt nonembolic > A2105C3P5_70_T_L.txt
  
  $perl script/add_typeToinput.pl A2105C2P5_70_T.txt embolic > A2105C2P5_70_T_L.txt
  
  #Combine embolic and non-embolic cell data
  
  $cat A2105C3P5_70_T_L.txt A2105C2P5_70_T_L.txt > S1_exp.txt
  
  # Use text editor such as “Vim” to manually remove redundant headers from the “S1_exp.txt” file
- b.
  Label the cells of the T1 mixed data.
  $cd DataSet
  
  #matrix transpose
  
  $awk '{i=1;while(i <= NF){col[i]=col[i] $i " ";i=i+1}} END {i=1;while(i<=NF){print col[i];i=i+1}}' A2105C3P5_30.txt | sed 's/[ ]*$//g' > A2105C3P5_30_T.txt
  
  $awk '{i=1;while(i <= NF){col[i]=col[i] $i " ";i=i+1}} END {i=1;while(i<=NF){print col[i];i=i+1}}' A2105C2P5_30.txt | sed 's/[ ]*$//g' > A2105C2P5_30_T.txt
  
  #add label for each cell according to Table 1
  
  $perl script/add_typeToinput.pl A2105C3P5_30_T.txt nonembolic > A2105C3P5_30_T_L.txt
  
  $perl script/add_typeToinput.pl A2105C2P5_30_T.txt embolic > A2105C2P5_30_T_L.txt
  
  #Combine embolic and non-embolic cell data
  
  $cat A2105C3P5_30_T_L.txt A2105C2P5_30_T_L.txt > T1_exp.txt
  
  # Use text editor such as “Vim” to manually remove redundant headers from the “T1_exp.txt” file
- c.
  Change the corresponding input file and label the T2, T3 and T4 cells with the same command line as T1, and then get the file T2_exp.txt, T3_exp.txt and T4_exp.txt.
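
The transpose, label, and combine operations of step 5 can be sketched in Python as follows. This is an illustration only, not a replacement for “add_typeToinput.pl”: it assumes a gene-by-cell table whose first row holds the cell names and whose first column holds the gene names, and all function and variable names are hypothetical. Writing the header once also avoids removing redundant headers by hand.

```python
def transpose(rows):
    """Turn a gene-by-cell table into a cell-by-gene table."""
    return [list(column) for column in zip(*rows)]


def label_cells(gene_by_cell_rows, label):
    """Transpose a table and append a 'type' column holding the class label."""
    transposed = transpose(gene_by_cell_rows)
    header = transposed[0] + ["type"]
    body = [row + [label] for row in transposed[1:]]
    return header, body


def combine(labelled_tables):
    """Stack labelled tables, keeping a single header row."""
    header = labelled_tables[0][0]
    rows = [header]
    for table_header, body in labelled_tables:
        if table_header != header:
            raise ValueError("gene columns differ between tables")
        rows.extend(body)
    return rows


# Toy tables (illustrative values only): first row = cells, first column = genes.
nonembolic = [["gene", "cellA", "cellB"], ["G1", "1.0", "0.0"], ["G2", "0.5", "2.0"]]
embolic = [["gene", "cellC"], ["G1", "3.0"], ["G2", "0.1"]]

combined = combine([label_cells(nonembolic, "nonembolic"), label_cells(embolic, "embolic")])
for row in combined:
    print("\t".join(row))
```

The check in `combine` fails early if the two samples do not share the same gene columns, a mismatch that a plain concatenation would hide.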

6.
Rank the importance of features and determine the optimal feature number.
Note: We recommend using the Recursive feature elimination (RFE) to rank the feature importance and calculate cross validation accuracy when using different number of features.
- a.
  Create the output directory, start the Python interpreter, and import the Python packages.
  $mkdir RFECV
  
  $python3.6
  
  >import pandas as pd
  
  >import numpy as np
  
  >from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
  
  >from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, log_loss
  
  >from sklearn import svm
  
  >from sklearn.feature_selection import RFE,RFECV
  
  >import matplotlib.pyplot as plt
- b.
  Build Linear SVM Classifiers.⁴
  >input="DataSet/S1_exp.txt"
  
  >rfecv_dir="RFECV"
  
  >c_use=0.2 # Regularization parameter C; repeat with each of the values listed in the note below.
  
  >method='linear'
  
  >function='ovo'
  
  >raw_data=pd.read_csv(input,index_col=0, header='infer', sep='\t', low_memory=False)
  
  >raw_data=raw_data[raw_data.columns[np.sum(raw_data)!=0]]
  
  >x=raw_data.drop('type', axis=1)
  
  >y=raw_data['type']
  
  >raw_embolic=raw_data[raw_data['type']=='embolic']
  
  >x_embolic=raw_embolic.drop('type', axis=1)
  
  >y_embolic=pd.DataFrame(raw_embolic['type'])
  
  >raw_nonembolic=raw_data[raw_data['type']=='nonembolic']
  
  >x_nonembolic=raw_nonembolic.drop('type',axis=1)
  
  >y_nonembolic=pd.DataFrame(raw_nonembolic['type'])
  
  >x_embolic_train, x_embolic_test, y_embolic_train, y_embolic_test=train_test_split(x_embolic, y_embolic, random_state=1, train_size=0.7, test_size=0.3)
  
  >x_nonembolic_train, x_nonembolic_test, y_nonembolic_train, y_nonembolic_test = train_test_split(x_nonembolic, y_nonembolic, random_state=1, train_size=0.7, test_size=0.3)
  
  >x_test=pd.concat([x_embolic_test,x_nonembolic_test], axis=0)
  
  >y_test=pd.concat([y_embolic_test,y_nonembolic_test], axis=0)['type']
  
  >x_train=pd.concat([x_embolic_train,x_nonembolic_train], axis=0)
  
  >y_train=pd.concat([y_embolic_train,y_nonembolic_train], axis=0)['type']
  
  >clf=svm.SVC(C=c_use, kernel=method, gamma='auto', decision_function_shape=function, probability=True, class_weight='balanced', cache_size=2000)
  Note: Use svm SVC function to establish linear SVM as the estimator of RFECV. Set the regularization parameter C to 0.2, 0.6, 0.8, 1.0, 1.2, 1.6, 2.0, 2.2, 2.6 or 3.0 to balance the model complexity and the loss function, and then calculate the corresponding cross validation accuracy when unimportant features are eliminated in turn through the 10-fold cross validation.
- c.
  Run the RFE method.
  >rfecv=RFECV (estimator=clf, step=1, cv=10, scoring='accuracy')
  
  >rfecv.fit(x, y)
- d.
  Plot relationship between numbers of selected feature and the cross validation accuracy (Figure 2A).
  >plt.figure()
  
  >plt.xlabel("Number of features selected")
  
  >plt.ylabel("Cross validation Accuracy")
  
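  #grid_scores_ is available in scikit-learn < 1.2; later releases provide rfecv.cv_results_['mean_test_score'] instead
  
  >plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)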
  
  >plt.show()
- e.
  Output the result to a file.
  >ref_txt='RFECV_'+str(c_use)+'.txt'
  
  >pd.DataFrame(rfecv.grid_scores_, columns=['score']).to_csv(ref_txt, index=False, header=False)
  Note: By setting different C values, a total of 10 feature selection models are obtained.

7.
Count the cross validation accuracy of different C-value models under different feature numbers to determine the optimal number of important features.

Note: Our previous work¹ demonstrated that in models with different C values, the cross validation accuracy using few important features is more than 95%. When using the first 13 important features, the cross validation accuracy is close to 100%. So we use 13 important features to train classifiers (Figure 2B). Therefore, we recommend choosing only the top few important features to train classifiers.
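
Step 7 can be made explicit with a small helper that scans the cross validation accuracies of all C-value models and reports the smallest number of features at which every model reaches a chosen accuracy. The Python sketch below is an assumption-laden illustration: the threshold, the function name, and the scores are hypothetical and are not results from the study.

```python
def smallest_feature_count(scores_by_c, threshold):
    """Return the smallest number of features at which all models reach threshold.

    scores_by_c maps a C value to a list of cross validation accuracies, where
    position k holds the accuracy obtained with k + 1 features. Returns None
    if no feature count satisfies the threshold for every model.
    """
    n_counts = min(len(scores) for scores in scores_by_c.values())
    for n_features in range(1, n_counts + 1):
        if all(scores[n_features - 1] >= threshold for scores in scores_by_c.values()):
            return n_features
    return None


# Illustrative accuracies for 1 to 4 features, not data from the protocol.
toy_scores = {
    0.2: [0.90, 0.96, 0.990, 0.995],
    1.0: [0.88, 0.95, 0.992, 0.996],
}
print(smallest_feature_count(toy_scores, threshold=0.99))
```

In practice, the lists would be read from the “RFECV_<C>.txt” files written in step 6e, one file per C value.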

Figure 2. Selected optimal number of most important features

(A) Change curve of the cross validation accuracy as important features are added one by one in a single model. (B) Zoom in at the inflection point of the change curve in (A).

Training classifier

Timing: ∼10 h

This section describes how to build and train a classifier for pro-embolic and non-embolic cells.

8.
Develop machine learning strategies.
- a.
  Use a linear SVM framework to train classifiers based on feature selection.
- b.
  Set the hyperparameter C to values of 0.0001, 0.0005, 0.001, 0.002, 0.004, 0.008, 0.02, 0.05, 0.2, 0.6, 1.2, 1.8, 2.4 or 3.0 to optimize the performance of classifiers on both the training and test sets.
  Note: C is essentially a regularization parameter that determines how much the SVM classifier should avoid misclassifying each training cell. For large C values, the classifier tends to correctly classify all training set cells, including abnormal cells. However, this can cause the classifier to pay too much attention to the features of training set cells, leading to reduced performance on test sets, which is called over-fitting. The opposite is true for smaller C values. Therefore, selecting an appropriate C value is a vital step in the best practice of using SVMs to develop classifiers that perform well on both the training set and test sets.
- c.
  Optimize the training parameters using 10-fold cross-validation.
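
To show what 10-fold cross-validation does with the training cells, the following Python sketch generates the index partitions of a plain, unshuffled k-fold split. It is a simplified illustration: the function name is hypothetical, and the split performed by scikit-learn for a classifier may additionally be stratified by class, which this sketch does not do.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for an unshuffled k-fold split.

    The first n_samples % k folds receive one extra sample, so every sample
    is used for testing exactly once.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Toy example: 25 cells split into 10 folds.
for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(25, k=10), start=1):
    print(fold_number, len(train_idx), test_idx)
```

Each candidate C value is scored by training on nine folds and testing on the held-out fold, and the ten accuracies are then averaged.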

9.
Import cell instances of the training set.

>raw_data=pd.read_csv("DataSet/S1_exp.txt", index_col=0, header='infer', sep='\t')

>raw_data=raw_data[raw_data.columns[np.sum(raw_data)!=0]]

>x_train=raw_data.drop('type', axis=1)

>y_train=raw_data['type']

10.
Establish SVM machine learning framework.⁴

>clf=svm.SVC(C=c_use, kernel='linear', gamma='auto', decision_function_shape=function, probability=True, class_weight='balanced',cache_size=2000)

#c_use refers to different C values

11.
Obtain the 13 most important feature genes and their expression levels.

>rfe = RFE(estimator=clf, n_features_to_select=1, step=1)

>rfe.fit(x_train, y_train)

>ranking=sorted(zip(rfe.ranking_, x_train.columns))

>gene_list=[]

>for i in ranking:

    if i[0]<=13:

        gene_list.append(i[1])

>print(ranking)

>x_train_new=x_train[gene_list]

12.
Train classifiers and output their performance.

>svc=clf.fit(x_train_new, y_train.ravel())

>scores=cross_val_score(clf, x_train_new, y_train, cv=10, error_score='raise', scoring='accuracy')

>print("cross validation Accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 1.96))

>print(scores)

>y_pred=clf.predict(x_train_new)

>y_proba=clf.predict_proba(x_train_new)

>print(list(y_train))

>print(list(y_pred))

>for i in y_proba:

    print(i)

13.
Change the value of the hyperparameter C and repeat steps 10, 11 and 12.

Note: A total of 14 candidate classifiers were obtained.

Test and determine the optimal classifier

Timing: ∼1 h

This section describes how to evaluate the performance of the classifiers on the test sets and pick the optimal classifier based on test performance.

14.
Import test set 1 cell instances.

>raw_data_test=pd.read_csv("DataSet/T1_exp.txt", index_col=0, header='infer', sep='\t')

>raw_data_test=raw_data_test[raw_data_test.columns[np.sum(raw_data_test)!=0]]

>x_test=raw_data_test.drop('type', axis=1)

>y_test=raw_data_test['type']

15.
Test the performance of the candidate classifier 1 in test set 1.

>x_test_new=x_test[gene_list]

>y_pred=clf.predict(x_test_new)

>y_proba=clf.predict_proba(x_test_new)

>print(list(y_test))

>print(list(y_pred))

>for i in y_proba:

    print(i)

>accuracy=accuracy_score(y_test, y_pred)

>print("Classification Accuracy: %0.4f" %accuracy)

>classifi=classification_report(y_test, y_pred)

>print(classifi)

>logloss=log_loss(y_test, y_proba)

>print('log_loss:'+str(logloss))

16.
Output classifier 1 parameters.

>print ("Weighted coefficients of selected gene features:")

>print (svc.coef_)

>print ("Bias value of decision function b:")

>print (svc.intercept_)

>print ("Index of supported_vectors sample:")

>print (svc.support_)

>print ("All supported_vectors:")

>print (svc.support_vectors_)

>print ("Number of class-supported_vectors:")

>print (svc.n_support_)

17.
Repeat steps 15 and 16 to test the performance of all 14 candidate classifiers in test set 1.
18.
Repeat steps 14, 15, 16 and 17 to test the performance of all 14 candidate classifiers in test set 2, 3 and 4.
19.
Determine the optimal classifier.
- a.
  A classifier is considered optimal when increasing C further no longer improves its prediction accuracy in any test set. On this basis, the classifier with C = 0.2 is determined to be the optimal classifier. Its prediction accuracy in the training set and the four test sets is 100%, 100%, 100%, 97% and 95%, respectively (Table 2).
- b.
  Details of the performance of the optimal classifier in the four test sets (Table 3)

Table 2.

Accuracy of classifiers in training and test sets

Classifier (C value)	Training set	Test set 1 (similar to training set)	Test set 2 (different generation)	Test set 3 (different donor)	Test set 4 (different donor and generation)
3	1.00	1.00	1.00	0.97	0.95
2.4	1.00	1.00	1.00	0.97	0.95
1.8	1.00	1.00	1.00	0.97	0.95
1.2	1.00	1.00	1.00	0.97	0.95
0.6	1.00	1.00	1.00	0.97	0.95
0.2	1.00	1.00	1.00	0.97	0.95
0.05	1.00	1.00	1.00	0.97	0.89
0.02	1.00	1.00	1.00	0.97	0.92
0.008	1.00	1.00	1.00	0.95	0.80
0.004	1.00	1.00	1.00	0.96	0.78
0.002	1.00	1.00	1.00	0.96	0.77
0.001	1.00	1.00	1.00	0.96	0.71
0.0005	1.00	1.00	1.00	0.97	0.71
0.0001	1.00	1.00	1.00	0.95	0.67
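The selection rule in step 19 can be sketched in a few lines of plain Python. This is only one reading of that rule, not the authors' script: it keeps every classifier whose accuracy equals the best observed value in every data set, then takes the smallest such C. The accuracies are copied from Table 2; the names `TABLE_2` and `pick_optimal_c` are hypothetical.

```python
# Accuracies copied from Table 2.
# Key: C value. Value: (training set, test set 1, test set 2, test set 3, test set 4).
TABLE_2 = {
    3: (1.00, 1.00, 1.00, 0.97, 0.95),
    2.4: (1.00, 1.00, 1.00, 0.97, 0.95),
    1.8: (1.00, 1.00, 1.00, 0.97, 0.95),
    1.2: (1.00, 1.00, 1.00, 0.97, 0.95),
    0.6: (1.00, 1.00, 1.00, 0.97, 0.95),
    0.2: (1.00, 1.00, 1.00, 0.97, 0.95),
    0.05: (1.00, 1.00, 1.00, 0.97, 0.89),
    0.02: (1.00, 1.00, 1.00, 0.97, 0.92),
    0.008: (1.00, 1.00, 1.00, 0.95, 0.80),
    0.004: (1.00, 1.00, 1.00, 0.96, 0.78),
    0.002: (1.00, 1.00, 1.00, 0.96, 0.77),
    0.001: (1.00, 1.00, 1.00, 0.96, 0.71),
    0.0005: (1.00, 1.00, 1.00, 0.97, 0.71),
    0.0001: (1.00, 1.00, 1.00, 0.95, 0.67),
}


def pick_optimal_c(table):
    """Return (smallest C on the accuracy plateau, all C values on the plateau)."""
    n_sets = len(next(iter(table.values())))
    # Best accuracy observed in each data set across all candidate classifiers.
    best = [max(acc[i] for acc in table.values()) for i in range(n_sets)]
    # Classifiers that reach the best accuracy in every data set.
    plateau = sorted(
        c for c, acc in table.items() if all(a >= b for a, b in zip(acc, best))
    )
    return plateau[0], plateau


optimal_c, plateau = pick_optimal_c(TABLE_2)
print("C values on the plateau:", plateau)
print("Selected C:", optimal_c)
```

Choosing the smallest C on the plateau is an assumption of this sketch; any value from 0.2 to 3 gives the same accuracies in Table 2.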


Table 3.

Performance details of the prediction model in test sets

Evaluation index	Test set 1		Test set 2		Test set 3		Test set 4
	Pro-embolic	Non-embolic	Pro-embolic	Non-embolic	Pro-embolic	Non-embolic	Pro-embolic	Non-embolic
Accuracy	1.00		1.00		0.97		0.95
Precision	1.00	1.00	1.00	1.00	0.98	0.95	1.00	0.91
Recall	1.00	1.00	1.00	1.00	0.96	0.98	0.89	1.00
F1 score	1.00	1.00	1.00	1.00	0.97	0.96	0.94	0.95
Log loss	0.002		0.003		0.142		0.137
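For readers who want to see what the evaluation indices in Table 3 measure, the following plain-Python sketch restates them for the two-class case. It is not the protocol's code, which calls library functions in step 15; the function name `evaluate`, the toy labels and the toy probabilities are all hypothetical, and the clipping constant is an assumption.

```python
import math


def evaluate(y_true, y_pred, proba_positive, positive="embolic"):
    """Accuracy, log loss, and precision/recall/F1 for the `positive` class.

    proba_positive[i] is the predicted probability that cell i is `positive`.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    eps = 1e-15  # assumption: clip probabilities so log(0) never occurs
    total = 0.0
    for t, p in zip(y_true, proba_positive):
        p_true = p if t == positive else 1.0 - p  # probability given to the true class
        total += math.log(min(max(p_true, eps), 1.0 - eps))
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "log_loss": -total / len(pairs),
    }


# Toy example with made-up values (not data from the study).
y_true = ["embolic", "embolic", "embolic", "nonembolic", "nonembolic"]
y_pred = ["embolic", "embolic", "nonembolic", "nonembolic", "nonembolic"]
proba = [0.9, 0.8, 0.4, 0.2, 0.1]
metrics = evaluate(y_true, y_pred, proba)
print(metrics)
```

Log loss penalizes confident wrong probabilities, which is why it can separate classifiers that share the same accuracy.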


Development of mathematical model for embolic risk of ADSC cell

Timing: ∼1 h

This section describes how to develop a mathematical model for the embolic risk of ADSCs based on the optimal classifier. We use the 13 key features (genes) obtained from the optimal classifier, their weight coefficients, and the expression levels of these genes in a cell to establish the following mathematical model for calculating the embolic risk score (RS) of a single cell.

20.
After obtaining the gene expression profile of a cell, extract the expression levels of its 13 key genes, and then calculate the embolic risk score of the cell according to Equation (1):

$$\mathrm{RS} = 1 + e^{-\sum_{i=1}^{n} W_{i} \cdot G_{i}}$$

(Equation 1)

Note: W_i is the weight coefficient of the i-th gene determined by the optimal classifier and shown in the output of the “print (svc.coef_)” command in step 16, G_i is the expression of the i-th key gene in this cell, and n is the number of key genes. The value of RS ranges from 0 to ∞, with a small RS indicating a non-embolic cell and a larger RS indicating a potential pro-embolic cell.
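As a minimal sketch of step 20, Equation 1 can be evaluated in plain Python as follows. The gene names, weights and expression values are hypothetical placeholders, not the 13 key genes or the coefficients of the optimal classifier, and the function name `risk_score` is an assumption. The sketch follows Equation 1 as printed, which uses only the weights and no intercept term.

```python
import math


def risk_score(weights, expression):
    """Equation 1: RS = 1 + exp(-sum_i W_i * G_i), summed over the key genes."""
    missing = [gene for gene in weights if gene not in expression]
    if missing:
        raise KeyError(f"expression missing for key genes: {missing}")
    weighted_sum = sum(w * expression[gene] for gene, w in weights.items())
    return 1.0 + math.exp(-weighted_sum)


# Hypothetical weights (W_i) for three placeholder genes.
weights = {"GENE_A": 0.5, "GENE_B": -1.0, "GENE_C": 0.25}
# Hypothetical expression profile (G_i) of one cell; genes without a weight are ignored.
cell = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 0.0, "GENE_D": 7.0}

rs = risk_score(weights, cell)
print(rs)  # weighted sum is 0.5*2.0 - 1.0*1.0 + 0.25*0.0 = 0, so RS = 1 + e^0 = 2.0
```

With Equation 1 as printed, RS is always above 1 and grows as the weighted sum becomes more negative.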

21.
Select an appropriate risk threshold according to the cell production process and determine whether the cell is a pro-embolic cell.
- a.
  Use an ROC curve analysis to determine the RS threshold of pro-embolic and non-embolic cells in test samples.
  Note: From the ROC curve, the thresholds of the four test sets were determined to be 2.131, 2.131, 2.048 and 3.368, respectively. The specificity and sensitivity of using these thresholds to distinguish pro-embolic cells are more than 0.96 in each test dataset.
- b.
  You can use the following commands to determine the RS thresholds of the four test sets.
  Note: The input files are “RSandLabel_test1.txt”, “RSandLabel_test2.txt”, “RSandLabel_test3.txt” and “RSandLabel_test4.txt”. Each row of the input file corresponds to one cell and starts with the cell name, the “RiskScore” column is the RS value of each cell, and the “RealLabel” column is the actual category label of each cell.
  
  library(pROC)
  
  data1 <- read.csv("RSandLabel_test1.txt", header=T, sep='\t')
  
  data2 <- read.csv("RSandLabel_test2.txt", header=T, sep='\t')
  
  data3 <- read.csv("RSandLabel_test3.txt", header=T, sep='\t')
  
  data4 <- read.csv("RSandLabel_test4.txt", header=T, sep='\t')
  
  roc1 <- roc(data1$RealLabel, data1$RiskScore, levels=c("nonembolic", "embolic"))
  
  roc2 <- roc(data2$RealLabel, data2$RiskScore, levels=c("nonembolic", "embolic"))
  
  roc3 <- roc(data3$RealLabel, data3$RiskScore, levels=c("nonembolic", "embolic"))
  
  roc4 <- roc(data4$RealLabel, data4$RiskScore, levels=c("nonembolic", "embolic"))
  
  plot(roc1, print.auc=TRUE, col="purple", print.auc.x=0.45, print.auc.y=0.4, print.thres=TRUE, cex.axis=1.5, cex.lab=2)
  
  plot.roc(roc2, add=T, col="black", print.auc=TRUE, print.auc.x=0.45, print.auc.y=0.35)
  
  plot.roc(roc3, add=T, col="blue", print.auc=TRUE, print.auc.x=0.45, print.auc.y=0.30, print.thres=TRUE)
  
  plot.roc(roc4, add=T, col="red", print.auc=TRUE, print.auc.x=0.45, print.auc.y=0.25, print.thres=TRUE)
  
  legend("bottomright", legend=c("Test 1", "Test 2", "Test 3", "Test 4"), col=c("purple", "black", "blue", "red"), lwd=2, bty="n", cex=1.5)
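  To make the threshold choice in step 21 more concrete, the sketch below picks the cut-off that maximizes Youden's J (sensitivity + specificity - 1). Treating this as the criterion behind the printed thresholds is an assumption about the plotting defaults, not something the protocol states. The function name `best_threshold` and the toy scores are hypothetical, and the sketch follows the Note to Equation 1 in treating a larger RS as pro-embolic.

```python
def best_threshold(scores, labels, positive="embolic"):
    """Return (threshold, sensitivity, specificity) that maximizes Youden's J."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    uniq = sorted(set(scores))
    # Candidate cut-offs: midpoints between consecutive distinct scores.
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    if not candidates:
        raise ValueError("at least two distinct scores are required")
    best = None
    for thr in candidates:
        sens = sum(s > thr for s in pos) / len(pos)   # pro-embolic cells above the cut-off
        spec = sum(s <= thr for s in neg) / len(neg)  # non-embolic cells at or below it
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, thr, sens, spec)
    _, thr, sens, spec = best
    return thr, sens, spec


# Toy risk scores and labels (made-up values, not data from the study).
scores = [1.2, 1.5, 1.9, 2.4, 3.0, 4.1]
labels = ["nonembolic", "nonembolic", "nonembolic", "embolic", "embolic", "embolic"]
threshold, sensitivity, specificity = best_threshold(scores, labels)
print(threshold, sensitivity, specificity)
```

  A cell whose RS lies above the returned threshold would then be called pro-embolic.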

22.
Calculate the proportion of embolic cells in the sample, and predict the embolic possibility of reinfused individuals according to the established regression relationship between the embolic cell proportion and the embolic risk after reinfusion.
- a.
  Infuse seven ADSC samples cultured with different culture techniques into mice (more than 6 mice per ADSC sample).¹
- b.
  Calculate the proportion of mice with pulmonary embolism, which is used as the embolism possibility after in vivo ADSC infusion.
- c.
  For each ADSC sample, conduct the scRNA-seq experiment, identify the pro-embolic cells through steps 20 and 21, and calculate the proportion of pro-embolic cells in the sample.
- d.
  Establish the linear regression relationship between the pro-embolic cell proportion and the embolism possibility after ADSC reinfusion.

Note: NCG mice were purchased from GemPharmatech, and all animal protocols were approved by the Institutional Animal Care and Use Committee, Experimental Animal Center. All mice were housed in a standard SPF facility at a temperature between 18 °C and 23 °C, a humidity of 40%–60%, and a 12 h light-dark cycle. Eight-to-ten-week-old male and female NCG mice were used in this study. The number of mice used is indicated for each experiment. Mice were randomly assigned to groups. For the injection, 1 × 10⁶ hADSCs were resuspended in saline and slowly infused (over about 10 s) into each NCG mouse via the tail vein using a 29-gauge needle. For complete details of the experimental models, please refer to our previous research.¹

Note: The details of establishing this relationship are described under “Mathematical model to predict embolic risk” in the “Results” section of our published article, and the established relationship curve is shown in Figure 6H of that article.¹

Note: For complete details on the use and execution of this protocol, please refer to our previous research.¹
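The calculation in step 22 can be sketched in plain Python: count the cells whose RS exceeds the threshold to get the pro-embolic proportion of a sample, then fit an ordinary least-squares line across samples. All numbers below are made-up placeholders chosen to lie on a straight line, not measurements from the study, and the function names are hypothetical.

```python
def pro_embolic_proportion(risk_scores, threshold):
    """Fraction of cells in one sample whose RS exceeds the threshold."""
    return sum(rs > threshold for rs in risk_scores) / len(risk_scores)


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope, intercept, proportion):
    """Predicted embolism possibility, clipped to the valid range [0, 1]."""
    return min(max(slope * proportion + intercept, 0.0), 1.0)


# Placeholder values: pro-embolic cell proportion per sample (x) and
# proportion of infused mice with pulmonary embolism (y).
proportions = [0.0, 0.1, 0.2, 0.3, 0.4]
embolism_rates = [0.0, 0.2, 0.4, 0.6, 0.8]

slope, intercept = fit_line(proportions, embolism_rates)
new_sample = pro_embolic_proportion([1.5, 2.5, 3.0, 1.0], threshold=2.131)
print(slope, intercept, new_sample, predict(slope, intercept, new_sample))
```

The clipping in `predict` is a design choice of this sketch, because a straight line can otherwise return values outside 0 to 1.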

Expected outcomes

Using this protocol, we successfully established a model for predicting the risk of ADSC embolism. The model predicts the embolic risk of ADSCs expanded via four different culture processes (the four test sets) with an accuracy of over 95%. To compute the embolic risk of an ADSC reinfusion, we randomly extracted a small part of the sample (about 100 million cells), measured the expression levels of the single-cell full transcriptome or of specific genes (such as the 13 feature genes identified in our study), and then obtained the possibility of embolism through our prediction model. Furthermore, the 13 feature genes and their corresponding weight coefficients derived by machine learning provide a direction for studying the molecular mechanism of embolism caused by ADSCs. Currently, single-cell omics detection and phenotype identification technologies can readily provide the data required by the protocol,⁵ various machine learning algorithms can meet diversified modeling requirements,⁶ and powerful, accessible computing resources make the process convenient and fast.

This protocol can be easily extended to establish other quality assessment methods for cell therapy products. For example, it could be applied to the quality evaluation of safety attributes such as tumorigenicity⁷ and immunogenicity,⁸ as well as efficacy attributes such as homing ability⁹ and immune regulation ability.¹⁰ As with the risk of embolism, which this protocol assesses based on transcriptome characteristics, these quality indicators are also largely determined by the transcriptome or other omics characteristics of cells. Therefore, this protocol can be extended to assess them more comprehensively, accurately and predictably. The single-cell transcriptome used in this study can be easily replaced by other quantifiable omics data, and the SVM can be replaced by any other appropriate machine learning method, including deep learning, to adapt to the evaluation of different quality attributes of cells (Figure 3).

Figure 3. The development and operation flow chart of the protocol

Limitations

In this protocol, we established the final classifier using eight ADSC samples at an average cost of about 50k RMB (7.3k USD) per sample. Because of limited research funds, we selected only four different culture processes to test and determine the final classifier, so it is unknown whether the embolic risk of samples from other culture processes can also be accurately identified. We therefore need to collect further ADSC samples from various sources to optimize the model. A typical experiment to establish a reliable classifier requires at least 50 samples; however, the required sample size also depends on the performance of the classifier. For example, the accuracy of our classifier on the test sets is more than 95%, which allows us to achieve a stable and reliable classifier by adding only a few samples that differ from our current ones (for example, <5 samples).

The SVM used in this study is a supervised machine learning algorithm, which has better performance than unsupervised algorithms but requires the class of each training and test set sample (cell) to be labeled in advance.¹¹ The labeling process usually relies on bioinformatics analysis together with in vitro and in vivo experiments. The whole process is manual, and the labeling quality is easily affected by experimental methods, manual operations, and other factors, so class labeling is time-consuming and expensive. Some stem cell quality phenotypes (such as tumorigenicity) are difficult to verify through in vivo experiments, making it impossible to accurately label cells with these phenotypes. This limits the application field of this protocol.

Troubleshooting

Problem 1

The feature selection process cannot achieve high cross validation accuracy (e.g., 90%) with a small number of features (<100) (step 7).

Potential solution

Pre-optimize features by using statistical difference analysis, dimension reduction (linear or nonlinear) and other methods to increase the variability of features between the classes of samples (cells). One common statistical method used in feature pre-optimization is correlation analysis, which helps to select features that are strongly correlated with the target variable and are therefore likely to be important for predicting it. Other statistical methods such as regression analysis, hypothesis testing, and analysis of variance can also be used in feature pre-optimization to identify important features and patterns in the data.
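As a hedged illustration of the correlation analysis mentioned above, the sketch below ranks genes by the absolute Pearson correlation between their expression and a 0/1 encoding of the class label, then keeps the top-ranked genes. The gene names, expression values and the functions `pearson` and `rank_genes` are hypothetical, and this is only one of several ways to pre-filter features.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature carries no class information
    return sxy / math.sqrt(sxx * syy)


def rank_genes(expression, labels, positive="embolic", top_k=100):
    """Return up to top_k genes ordered by |correlation| with the class label."""
    y = [1.0 if label == positive else 0.0 for label in labels]
    scored = [(abs(pearson(values, y)), gene) for gene, values in expression.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))  # strongest first, ties by name
    return [gene for _, gene in scored[:top_k]]


# Made-up expression values for four cells (columns) and three placeholder genes.
expression = {
    "GENE_A": [5.0, 6.0, 1.0, 0.0],
    "GENE_B": [2.0, 2.0, 2.0, 2.0],
    "GENE_C": [1.0, 3.0, 2.0, 4.0],
}
labels = ["embolic", "embolic", "nonembolic", "nonembolic"]
selected = rank_genes(expression, labels, top_k=2)
print(selected)
```

The reduced gene list could then be passed to the feature selection step in place of the full set of genes.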

Problem 2

The classifier performs well in the training set, but poorly (accuracy rate <70%) in the test set (step 12).

Potential solution

This is because the classifier is over-fitted. To address this issue, strengthen the regularization (for a scikit-learn SVM, regularization strength is inversely proportional to C, so this means lowering C) or select a simpler learning algorithm to reduce model complexity. Alternatively, increase the number of training sample instances, or add a certain proportion (such as 10%) of random noise to the training set.

Problem 3

The classifier performs poorly (accuracy rate <70%) not only in the training set, but also in the test set (step 19).

Potential solution

This is because the classifier is under-fitted. Increase the number of training sample instances, or optimize the input features using biological knowledge, such as transforming gene features into pathway features through enrichment analysis. Alternatively, weaken the regularization (for a scikit-learn SVM, this means raising C) or select a more complex learning algorithm, such as a non-linear learning algorithm or a neural network.

Problem 4

Expression levels of key feature genes were not detected in the test set samples (step 20).

Potential solution

Combine the training and test set samples to remove the batch effect and screen the highly variable genes during the data preprocessing analysis, and make sure that the gene detection depth is consistent in the training and test set samples (e.g., 50k reads per sample in scRNA-seq).
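Before re-running the preprocessing, it can help to confirm which key genes are affected. The sketch below is a hypothetical diagnostic, not part of the protocol: it separates key genes that have no column at all in the test set from those that are present but have zero expression in every cell. The gene names and the function `check_key_genes` are placeholders.

```python
def check_key_genes(gene_list, test_expression):
    """Split key genes into absent (no column) and undetected (all-zero) in the test set."""
    absent = [g for g in gene_list if g not in test_expression]
    undetected = [
        g for g in gene_list
        if g in test_expression and not any(test_expression[g])
    ]
    return absent, undetected


# Placeholder key genes and a made-up test-set expression table (gene -> per-cell values).
gene_list = ["GENE_A", "GENE_B", "GENE_C"]
test_expression = {
    "GENE_A": [0.0, 1.2, 3.4],
    "GENE_B": [0.0, 0.0, 0.0],
}
absent, undetected = check_key_genes(gene_list, test_expression)
print("absent:", absent)
print("undetected:", undetected)
```

Genes reported as absent were probably dropped during highly variable gene screening, whereas all-zero genes may point to insufficient sequencing depth.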

Problem 5

Installation error of R and Python packages (step 1).

Potential solution

Check whether the package version matches the R and Python versions as described in the protocol, and then configure the correct system environment according to the error message. Sometimes you need to contact your administrator to solve the environment problem.

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Y. James Kang (ykang7@uthsc.edu).

Materials availability

This study did not generate new unique reagents.

Data and code availability

The accession number for the single-cell RNA-seq data reported in this paper is GSA-Human: HRA003811 at GSA-Human (https://ngdc.cncb.ac.cn/gsa-human/), which is the same as the sequencing data used in our published article.¹ The scripts "RandomSelectCellsAndHVG2000Exp.pl" and "add_typeToinput.pl" have been deposited at Zenodo (https://doi.org/10.5281/zenodo.7672376), and the other code is provided in the 'Step-by-step method details' section. Any additional information required to reanalyze the data reported in this paper can be obtained from the lead contact upon request.

Acknowledgments

The authors thank Tasly Pharmaceutical Co, Tianjin, China, for facility and instrument support.

Author contributions

Y.J.K. conceptualized the study and designed the research. K.Y. organized all the studies and led the experimental performance. F.M. and J.Z. completed sequencing data analysis and developed machine learning models. F.M., J.Z., X.J., P.H., Y.L., and T.Z. contributed to data analysis and interpretation. F.M., X.J., J.Z., and Y.J.K. wrote and revised the article with the input of other co-authors.

Declaration of interests

The authors declare no competing interests.

References

1. Yan K., Zhang J., Yin W., Harding J.N., Ma F., Wu D., Deng H., Han P., Li R., Peng H., et al. Transcriptomic heterogeneity of cultured ADSCs corresponds to embolic risk in the host. iScience. 2022;25:104822. doi: 10.1016/j.isci.2022.104822.
2. Hao Y., Hao S., Andersen-Nissen E., Mauck W.M., 3rd, Zheng S., Butler A., Lee M.J., Wilk A.J., Darby C., Zager M., et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184:3573–3587.e29. doi: 10.1016/j.cell.2021.04.048.
3. Wickham H. Use R! 2nd edn. Springer International Publishing: Imprint; Cham: 2016. ggplot2: elegant graphics for data analysis; p. 1.
4. Ben-Hur A., Weston J. A user's guide to support vector machines. Methods Mol. Biol. 2010;609:223–239. doi: 10.1007/978-1-60327-241-4_13.
5. Stein C.M., Weiskirchen R., Damm F., Strzelecka P.M. Single-cell omics: overview, analysis, and application in biomedical science. J. Cell. Biochem. 2021;122:1571–1578. doi: 10.1002/jcb.30134.
6. Ji Y., Lotfollahi M., Wolf F.A., Theis F.J. Machine learning for perturbational single-cell omics. Cell Syst. 2021;12:522–537. doi: 10.1016/j.cels.2021.05.016.
7. Heslop J.A., Hammond T.G., Santeramo I., Tort Piella A., Hopp I., Zhou J., Baty R., Graziano E.I., Proto Marco B., Caron A., et al. Concise review: workshop review: understanding and assessing the risks of stem cell-based therapies. Stem Cells Transl. Med. 2015;4:389–400. doi: 10.5966/sctm.2014-0110.
8. Gu L.H., Zhang T.T., Li Y., Yan H.J., Qi H., Li F.R. Immunogenicity of allogeneic mesenchymal stem cells transplanted via different routes in diabetic rats. Cell. Mol. Immunol. 2015;12:444–455. doi: 10.1038/cmi.2014.70.
9. Ullah M., Liu D.D., Thakor A.S. Mesenchymal stromal cell homing: mechanisms and strategies for improvement. iScience. 2019;15:421–438. doi: 10.1016/j.isci.2019.05.004.
10. Song N., Scholtemeijer M., Shah K. Mesenchymal stem cell immunomodulation: mechanisms and therapeutic potential. Trends Pharmacol. Sci. 2020;41:653–664. doi: 10.1016/j.tips.2020.06.009.
11. Greener J.G., Kandathil S.M., Moffat L., Jones D.T. A guide to machine learning for biologists. Nat. Rev. Mol. Cell Biol. 2022;23:40–55. doi: 10.1038/s41580-021-00407-0.
