Figure - PMC


2021 Nov 24;12(12):1865. doi: 10.3390/genes12121865


© 2021 by the authors.

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


Overall design of PipeOne. (A) The three modules of PipeOne: data processing and feature identification (one), feature prioritization (two), and disease subtyping (three). (B) Details of module one. Raw sequencing reads were quality controlled with FASTP and then processed by eight tools to extract information from the RNA-seq data, including expression levels of mRNA, lncRNA, circRNA, and retrotransposons, alternative splicing events, alternative polyadenylation, RNA editing, gene fusions, and SNPs. This information was used to construct the feature matrices for machine learning in modules two and three (only the top 1000 most variable features were used for each type of information). (C) Details of module two. First, feature importance was calculated by applying random forest to all features from module one. Then the top K features (20, 20, 50, 100, 200, all) ranked by feature importance were used to test and validate the importance of those top features. (D) Details of module three. First, a robust NMF integration algorithm was applied to obtain latent features and associated weights for all samples. Then K-Means clustering, evaluated by silhouette width, was used to cluster samples based on the latent feature matrix. Differential survival analysis by log-rank test was used to assess the clinical relevance of the stable clusters as potential subtypes. Finally, as in module two, random forest was used to select the features contributing to the subtyping results.
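
The filtering step in panel (B), keeping only the top 1000 most variable features per information type, can be illustrated with a minimal sketch. This is not PipeOne's actual code: the function name `top_variable_features`, the use of plain variance as the variability measure, and the samples-by-features matrix layout are all assumptions made for illustration.

```python
import numpy as np


def top_variable_features(matrix, feature_names, n_top=1000):
    """Keep the n_top columns (features) with the highest variance.

    Hypothetical helper: `matrix` is assumed to be samples x features,
    and variance is an assumed stand-in for "most variable".
    """
    matrix = np.asarray(matrix, dtype=float)
    if matrix.ndim != 2 or matrix.shape[1] != len(feature_names):
        raise ValueError("matrix must be 2-D with one column per feature name")

    # Per-feature variance across samples.
    variances = matrix.var(axis=0)

    # Never ask for more features than exist.
    n_keep = min(n_top, matrix.shape[1])

    # Stable sort on negated variances: descending order, with ties
    # kept in their original column order.
    order = np.argsort(-variances, kind="stable")[:n_keep]

    return matrix[:, order], [feature_names[i] for i in order]


# Toy example: 4 samples, 3 features; the first feature is constant.
X = [
    [1, 10, 5],
    [1, 20, 6],
    [1, 30, 5],
    [1, 40, 6],
]
names = ["gene_a", "gene_b", "gene_c"]
reduced, kept = top_variable_features(X, names, n_top=2)
print(kept)
```

In a setting like the one the caption describes, such a filter would presumably be applied separately to each feature matrix (mRNA, lncRNA, circRNA, and so on) before the matrices are passed to modules two and three.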