Signalling changes between resting (R) and activated (A) human platelets. TiO2-enriched phospho peptides labeled with 10-plex TMT reagents. This is data from the ASLA-515 project.
Some of the resulting peptides were directly labeled with 10-plex TMT reagents {Thompson 2003}. Remaining peptides were enriched for phosphopepides, labeled with 10-plex TMT for a second set of labeled samples. After combining equal portions of labeled peptide samples (either whole protein or enriched phosphopeptides), they were separated by high pH reverse phase/low pH reverse phase liquid chromatography and analyzed on a Thermo Fusion Tribrid Orbitrap mass spectrometer {Senko 2013}. Fragment ions were generated using CID and the ion trap analyzer; the reporter ions were fragmented using high energy collision dissociation of (synchronous precursor selection) SPS selected fragment ions at high resolution in the Orbitrap analyzer.
Peptides and proteins were identified using Proteome Discoverer v1.4 and SEQUEST {Eng 1994}. A wider 1.25 Da parent ion mass tolerance was used, TMT labels and alkylated cysteine were specified as static modifications, oxidation of methionine (+15.9949 Da) was specified as a variable modification, trypsin enzyme specificity was used, and a canonical UniProt Swiss-Prot human protein database was used. Fragment ion tolerance was set at 1.0005 Da. The labeled phosphopeptides were analyzed with the addition of variable phosphorylation (+79.9799) on serine, threonine, or tyrosine residues. Proteome Discoverer was also configured with a phosphRS {Taus 2011} node for site localizations. Confident peptide identifications were obtained using Percolator and the target/decoy method. PSM information (peptide sequences, q-values, masses, and reporter ions) was exported to tab-delimited files for processing with PAW pipeline modules.
The PAW pipeline Python scripts take the files exported from Proteome Discoverer, does some additional accurate peptide mass and minimum reporter ion intensity filtering. The exported data from Proteome Discoverer has already undergone PSM confidence filtering (Percolator q-values) and parsimonious protein inference. Reporter ions are peak heights of the most confident (closest to expected m/z value) centroided peak within 20 ppm of the expected position. Normalizations and statistical testing were performed using the Bioconductor {Gentleman 2004} package edgeR {Robinson 2010} as detailed in the first half of the notebook below. A Jupyter notebook with an R kernel was used to execute R commands and visualize the results.
Phosphopeptide enrichment is very different compared to protein expression. The measurements are peptide-centric rather than protein centric. Protein inference cannot be used as a noise filter to reduce the negative impact of incorrect PSMs. The positive effect of aggregated data to the protein level is also greatly compromised. We did a couple of things to address these issues. We increased the q-value cutoff for PSM filtering from 5% to 1%. We also considered phospho group localization to be less reliable than determination of the base peptide sequence and total number of phospho groups present. To get some degree of data aggregation to reduce the multiple testing impact and improve the data quality, we combined all PSMs from the same base peptide sequence and the same number of phosphogroups. This aggregation step reduces the number of data points in the quantitative analysis by about a factor of 2. The aggregation is not uniform across the phosphopeptides. Many peptides only have a single PSM, others may have many combined PSMs. We carried forward localization information in a limited way. The localization of the most confident PSM (smallest q-value) is reported and annotated as consistent (all PSMs has the same site localization) or variable. Analysis of the phosphopeptide data is in the second half of this notebook.
Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.
Senko, M.W., Remes, P.M., Canterbury, J.D., Mathur, R., Song, Q., Eliuk, S.M., Mullen, C., Earley, L., Hardman, M., Blethrow, J.D. and Bui, H., 2013. Novel parallelized quadrupole/linear ion trap/Orbitrap tribrid mass spectrometer improving proteome coverage and peptide identification rates. Analytical chemistry, 85(24), pp.11710-11714.
Eng, J.K., McCormack, A.L. and Yates, J.R., 1994. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5(11), pp.976-989.
McAlister, G.C., Nusinow, D.P., Jedrychowski, M.P., Wühr, M., Huttlin, E.L., Erickson, B.K., Rad, R., Haas, W. and Gygi, S.P., 2014. MultiNotch MS3 enables accurate, sensitive, and multiplexed detection of differential expression across cancer cell line proteomes. Analytical chemistry, 86(14), pp.7150-7158.
Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.
Taus, T., Köcher, T., Pichler, P., Paschke, C., Schmidt, A., Henrich, C. and Mechtler, K., 2011. Universal and confident phosphorylation site localization using phosphoRS. Journal of proteome research, 10(12), pp.5354-5362.
Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. and Hornik, K., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10), p.R80.
Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.
# load the libraries
library(tidyverse)
library(stringr)
library(edgeR)
library(limma)
library(psych)
library(scales)
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
# computes the tmm normalized data from the DGEList object
# y - DGEList object
# returns a dataframe with normalized intensities
# compute and print "Sample loading" normalization factors
lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
cat("\nLibrary size factors:\n",
sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
# compute and print TMM normalization factors
tmm_facs <- 1/y$samples$norm.factors
cat("\nTrimmed mean of M-values (TMM) factors:\n",
sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
# compute and print the final correction factors
norm_facs <- lib_facs * tmm_facs
cat("\nCombined (lib size and TMM) normalization factors:\n",
sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))
# compute the normalized data as a new data frame
tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
# visualize results and return data frame
if(plot == TRUE) {
boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
}
tmt_tmm
}
# ============== CV function ===================================================
CV <- function(df) {
# Computes CVs of data frame rows
# df - data frame,
# returns vector of CVs (%)
ave <- rowMeans(df) # compute averages
sd <- apply(df, 1, sd) # compute standard deviations
cv <- 100 * sd / ave # compute CVs in percent (last thing gets returned)
}
# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
# Computes new columns and extracts some columns to make results frame
# df - data in data.frame
# tt - top tags table from edgeR test
# x - columns for first condition
# xlab - label for x
# y - columns for second condition
# ylab - label for y
# returns a new dataframe
# condition average vectors
ave_x <- rowMeans(df[x])
ave_y <- rowMeans(df[y])
# FC, direction, candidates
fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
direction <- ifelse(ave_y > ave_x, "up", "down")
candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
labels = c("high", "med", "low", "no"))
# make data frame
temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc,
PValue = tt$PValue, FDR = tt$FDR,
ave_x = ave_x, ave_y = ave_y,
direction = direction, candidate = candidate,
Acc = tt$genes))
# fix column headers for averages
names(temp)[names(temp) %in% c("ave_x", "ave_y")] <- str_c("ave_", c(xlab, ylab))
temp # return the data frame
}
# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
# Make data frame with some transformed columns
# results - results data frame
# x - columns for x condition
# y - columns for y condition
# return new data frame
df <- data.frame(log10((results[x] + results[y])/2),
log2(results[y] / results[x]),
results$candidate,
-log10(results$FDR))
colnames(df) <- c("A", "M", "candidate", "P")
df # return the data frame
}
# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
# makes MA-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# returns a list of plots
# uses transformed data
temp <- transform(results, x, y)
# 2-fold change lines
ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))
# make main MA plot
ma <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
scale_x_continuous("Ave_intensity") +
ggtitle(title) +
ma_lines
# make separate MA plots
ma_facet <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
scale_x_continuous("log10 Ave_intensity") +
ma_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
# make the plots visible
print(ma)
print(ma_facet)
}
# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
# makes scatter-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# returns a list of plots
# 2-fold change lines
scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
scale_y_log10(),
scale_x_log10())
# make main scatter plot
scatter <- ggplot(results, aes_string(x, y)) +
geom_point(aes(color = candidate, shape = candidate)) +
ggtitle(title) +
scatter_lines
# make separate scatter plots
scatter_facet <- ggplot(results, aes_string(x, y)) +
geom_point(aes(color = candidate, shape = candidate)) +
scatter_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
# make the plots visible
print(scatter)
print(scatter_facet)
}
# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
# makes a volcano plot
# results - a data frame with edgeR results
# x - string for the x-axis column
# y - string for y-axis column
# title - plot title string
# uses transformed data
temp <- transform(results, x, y)
# build the plot
ggplot(temp, aes(x = M, y = P)) +
geom_point(aes(color = candidate, shape = candidate)) +
xlab("log2 FC") +
ylab("-log10 FDR") +
ggtitle(str_c(title, " Volcano Plot"))
}
# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
identifier <- str_split(accession, ";", simplify = TRUE)
identifier[,1]
}
set_plot_dimensions <- function(width_choice, height_choice) {
options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}
plot_top_tags <- function(results, nleft, nright, top_tags) {
# results should have data first, then test results (two condition summary table)
# nleft, nright are number of data points in each condition
# top_tags is number of up and number of down top DE candidates to plot
# get top ipregulated
up <- results %>%
filter(logFC >= 0) %>%
arrange(FDR)
up <- up[1:top_tags, ]
# get top down regulated
down <- results %>%
filter(logFC < 0) %>%
arrange(FDR)
down <- down[1:top_tags, ]
# pack them
proteins <- rbind(up, down)
color = c(rep("red", nleft), rep("blue", nright))
for (row_num in 1:nrow(proteins)) {
row <- proteins[row_num, ]
vec <- as.vector(unlist(row[1:(nleft + nright)]))
names(vec) <- colnames(row[1:(nleft + nright)])
title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2),
", FDR: ", scientific(row$FDR, digits = 3),
", FC: ", round(row$FC, digits = 1))
barplot(vec, col = color, main = title,
cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
}
}
In most protein expression studies, the assumption of the majority of the proteins not having any changes in expression levels is central to normalization strategies. This concept is much less tested in phosphopeptide enrichment studies. We need to pay close attention to the sizes of the normalization factors and the alignment of the boxplots to verify that the normalization methods used for the whole protein analysis above are still valid.
# read in the data export file
data_all_pep <- read_tsv("Cond-2_TiO2.txt")
# save accessions and remove from data
accessions <- data_all_pep$Accession
counts_pep <- data_all_pep %>% select(-Accession)
# get the data arranged for the testing (resting first, then activated)
resting <- select(counts_pep, contains("R"))
resting <- resting[, order(colnames(resting))]
activated <- select(counts_pep, contains("A"))
activated <- activated[, order(colnames(activated))]
counts_pep <- cbind(resting, activated)
# define some column indexes
R <- 1:5 # resting
A <- 6:10 # activated
# set color vector to match sample groups
color = c(rep("red", length(R)), rep("blue", length(A)))
head(counts_pep)
length(accessions)
boxplot(log10(counts_pep), col = color,
xlab = 'TMT samples', ylab = 'log10 Intensity',
main = 'Condition 2 TiO2 Starting data', notch = TRUE)
# load into edgeR DGEList object and normalize the data
group = c(rep("R", 5), rep("A", 5))
ye <- DGEList(counts = counts_pep, group = group, gene = accessions)
ye <- calcNormFactors(ye)
counts_pep_tmm <- apply_tmm_factors(ye, color)
# check sample-to-sample similarity
pairs.panels(log10(counts_pep_tmm[R]), lm = TRUE, main = "Condition 2 Phosphopeptides: resting")
pairs.panels(log10(counts_pep_tmm[A]), lm = TRUE, main = "Condition 2 Phosphopeptides: activated")
We do not have as much aggregation in peptide-centric phospho studies as we do in protein expression studies. We do not have as much averaging to reduce sample-to-sample scatter to the same degree. We also have a bit more missing data that we imputed (the points along the axes).
# check the clustering (set colors by condition)
plotMDS(ye, col = color, main = "Condition 2: Resting (red) and Activated (blue)")
Resting are along the left and the activated are on the right. We do still see the samples pairing top-to-bottom. A paired study design might be statistically more powerful.
An exact test will establish a robust (safe) baseline for differential expression. We will then do a paired testing.
We need the trended variance.
# compute dispersions and plot BCV
ye <- estimateDisp(ye)
plotBCV(ye, main = "Phosphopeptide Variance Trend")
The cluster plot indicates that we have some clear separation between conditions. The scatter plots show that sample-to-sample reproducibility is very good.
Perform an exact test with edgeR and look at the significant peptides.
# the exact test object has columns like fold-change, CPM, and p-values
# we have already loaded the data into "y" above, normalized, and est. dispersion
ete <- exactTest(ye, pair = c("R", "A"))
# this counts up, down, and unchanged genes (here it is proteins)
summary(decideTestsDGE(ete, p.value = 0.10))
topTags(ete)$table
# the topTags function adds the BH FDR values to an exactTest data frame
# make sure we do not change the row order!
tte <- topTags(ete, n = Inf, sort.by = "none")$table
results_e <- collect_results(counts_pep_tmm, tte, R, "R", A, "A")
# make an MD plot
plotMD(ete, p.value = 0.10 , main = "Phosphopeptide Exact Test: Resting vs Activated")
abline(h = c(-1, 1), col = "black")
# check the p-value distribution
ggplot(tte, aes(PValue)) +
geom_histogram(bins = 100, fill = "white", color = "black") +
geom_hline(yintercept = mean(hist(ete$table$PValue, breaks = 100,
plot = FALSE)$counts[26:100])) +
ggtitle("Phosphopeptide Exact Test: Resting vs Activated")
We can look at distributions of log2 fold change, fancier MA plots, scatter plots, and the popular volcano plot.
# see how many candidates by category
results_e %>% count(candidate)
# plot log2 fold-changes by category
ggplot(results_e, aes(x = logFC, fill = candidate)) +
geom_histogram(binwidth=0.1, color = "black") +
facet_wrap(~candidate) +
coord_cartesian(xlim = c(-5, 5)) +
ggtitle("Log2 R by category")
### Dotted lines are 2-fold and solid line is 1-to-1
# make the DE plots
MA_plots(results_e, "ave_R", "ave_A", "Resting vs Activated - Exact Test")
scatter_plots(results_e, "ave_R", "ave_A", "Resting vs Activated - Exact Test")
volcano_plot(results_e, "ave_R", "ave_A", "Resting vs Activated - Exact Test")
# plot the top N up down proteins
top_N <- 20
set_plot_dimensions(6, 3.5)
plot_top_tags(results_e, 5, 5, top_N)
set_plot_dimensions(7, 7)
# load into edgeR DGEList object and normalize the data
group = c(rep("R", 5), rep("A", 5))
yp <- DGEList(counts = counts_pep, group = group, gene = accessions)
yp <- calcNormFactors(yp)
# create the experimental design matrix
subject <- factor(rep(c("F1", "F3", "M1", "M2", "M4"), 2))
state <- factor(c(rep("R", 5), rep("A", 5)))
# Example 4.1 in edgeR user's guide
design <- model.matrix(~subject+state)
rownames(design) <- colnames(yp)
design
# extimate the dispersion parameters and check
yp <- estimateDisp(yp, design, robust = TRUE)
yp$common.dispersion
# fit statistical models (design matrix already in y$design)
fit <- glmQLFit(yp, design, robust = TRUE)
plotQLDisp(fit, main = "Condition 2 - Paired testing")
We now run the linear modeling and check the results.
# if we do not specify a contrast, the default is the last column
# of the design matrix - a Resting versus Activated comparison
paired <- glmQLFTest(fit) # default comparison
# check test results
topTags(paired)$table
ttp <- topTags(paired, n = Inf, sort.by = "none")$table
summary(decideTests(paired, p.value = 0.10))
# make basic MA plot
plotMD(paired, p.value = 0.10, main = "Condition 2 TiO2 - Paired test")
print(abline(h = c(-1, 1), col = "black")) # 2-fold lines
# check the p-value distribution
ggplot(ttp, aes(PValue)) +
geom_histogram(bins = 100, fill = "white", color = "black") +
geom_hline(yintercept = mean(hist(ttp$PValue, breaks = 100,
plot = FALSE)$counts[30:100])) +
ggtitle("p-value distribution: paired test")
# collect and reformat the test results
results_p <- collect_results(counts_pep_tmm, ttp, R, "R", A, "A")
# correct the changed reference order for the glm testing
results_p$logFC <- -1 * results_p$logFC
# see how many candidates by category
results_p %>% count(candidate)
# plot log2 fold-changes by category
ggplot(results_p, aes(x = logFC, fill = candidate)) +
geom_histogram(binwidth=0.1, color = "black") +
facet_wrap(~candidate) +
coord_cartesian(xlim = c(-5, 5)) +
ggtitle("Log2 R by paired test category")
# make the DE plots
MA_plots(results_p, "ave_R", "ave_A", "ASLA-515 TiO2 paired: Resting vs Activated")
scatter_plots(results_p, "ave_R", "ave_A", "ASLA-515 TiO2 paired: Resting vs Activated")
volcano_plot(results_p, "ave_R", "ave_A", "ASLA-515 TiO2 paired: Resting vs Activated")
# plot the top N up down proteins
top_N <- 20
set_plot_dimensions(6, 3.5)
plot_top_tags(results_p, 5, 5, top_N)
set_plot_dimensions(7, 7)
We can see the paired patterns in the data for these candidates compared to the exact test (which was more about the differences in means). The paired testing looks like the right thing here.
We have dramatic changes in signalling levels. We have similar numbers of up and down regulated peptides, although activation causes larger increased intensity changes for the over expressed candidates.
# phosphopeptides
cv_pep <- data.frame(R_pep = CV(counts_pep_tmm[R]), A_pep = CV(counts_pep_tmm[A]))
medians <- apply(cv_pep, 2, FUN = median)
print("Phosphopeptide median CVs (%)")
round(medians, 1)
# make the boxplots
boxplot(cv_pep, notch = TRUE, main = "Phosphopeptide CV distributions",
ylim = c(0, 150), ylab = "CV (%)")
The median CVs are reasonable for a peptide-centric experiment. We have a bit larger CVs and more variable data after activation compared to the resting state.
# save the results data frame
results <- cbind(results_e, results_p)
write.table(results, "Cond-2_TiO2_results.txt", sep = "\t", row.names = FALSE)
# log the session
sessionInfo()