Condition 2 TiO2 TMT Phosphopeptide Analysis¶

Joe Aslan Lab, OHSU¶

Notebook prepared by Phil Wilmarth, PSR Core, OHSU¶

January 9, 2020¶

Signalling changes between resting (R) and activated (A) human platelets. TiO2-enriched phosphopeptides labeled with 10-plex TMT reagents. These data are from the ASLA-515 project.

Some of the resulting peptides were directly labeled with 10-plex TMT reagents {Thompson 2003}. The remaining peptides were enriched for phosphopeptides and labeled with 10-plex TMT to give a second set of labeled samples. After combining equal portions of the labeled peptide samples (either whole protein or enriched phosphopeptides), they were separated by high pH reverse phase/low pH reverse phase liquid chromatography and analyzed on a Thermo Fusion Tribrid Orbitrap mass spectrometer {Senko 2013}. Fragment ions were generated using CID and the ion trap analyzer; the reporter ions were generated by higher-energy collisional dissociation (HCD) of synchronous precursor selection (SPS) selected fragment ions and measured at high resolution in the Orbitrap analyzer {McAlister 2014}.

Peptides and proteins were identified using Proteome Discoverer v1.4 and SEQUEST {Eng 1994}. A wider 1.25 Da parent ion mass tolerance was used, TMT labels and alkylated cysteine were specified as static modifications, oxidation of methionine (+15.9949 Da) was specified as a variable modification, trypsin enzyme specificity was used, and a canonical UniProt Swiss-Prot human protein database was used. Fragment ion tolerance was set at 1.0005 Da. The labeled phosphopeptides were analyzed with the addition of variable phosphorylation (+79.9799 Da) on serine, threonine, or tyrosine residues. Proteome Discoverer was also configured with a phosphoRS {Taus 2011} node for site localizations. Confident peptide identifications were obtained using Percolator and the target/decoy method. PSM information (peptide sequences, q-values, masses, and reporter ions) was exported to tab-delimited files for processing with PAW pipeline modules.

The PAW pipeline {Wilmarth 2009} Python scripts take the files exported from Proteome Discoverer and do some additional filtering on accurate peptide mass and minimum reporter ion intensity. The exported data from Proteome Discoverer have already undergone PSM confidence filtering (Percolator q-values) and parsimonious protein inference. Reporter ion intensities are the peak heights of the most confident (closest to expected m/z value) centroided peak within 20 ppm of the expected position. Normalizations and statistical testing were performed using the Bioconductor {Gentleman 2004} package edgeR {Robinson 2010} as detailed in the first half of the notebook below. A Jupyter notebook with an R kernel was used to execute R commands and visualize the results.
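As a rough illustration of the 20 ppm reporter ion window described above, the sketch below converts the tolerance into an m/z width for the TMT-126 channel and picks the centroided peak closest to the expected position. The peak list is made up for illustration, and this is not the PAW pipeline code.

```r
# expected m/z of the TMT-126 reporter ion and the tolerance from the text
reporter_mz <- 126.127726
ppm_tol <- 20

# half-width of the window in m/z units
window <- reporter_mz * ppm_tol / 1e6

# hypothetical centroided peaks near the reporter ion (illustrative values only)
peaks <- data.frame(mz = c(126.1250, 126.1276, 126.1290, 126.1312),
                    height = c(500, 12000, 800, 300))

# keep peaks inside the window, then take the one closest to the expected m/z
in_window <- abs(peaks$mz - reporter_mz) <= window
candidates <- peaks[in_window, ]
best <- candidates[which.min(abs(candidates$mz - reporter_mz)), ]

window       # about 0.0025 m/z
best$height  # peak height used as the reporter ion intensity
```

At this m/z, 20 ppm is only about 0.0025 m/z units, which is why high resolution Orbitrap scans are needed to apply such a narrow window.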

Phosphopeptide enrichment data are very different from protein expression data. The measurements are peptide-centric rather than protein-centric. Protein inference cannot be used as a noise filter to reduce the negative impact of incorrect PSMs. The positive effect of aggregating data to the protein level is also greatly compromised. We did a couple of things to address these issues. We tightened the q-value cutoff for PSM filtering from 5% to 1%. We also considered phospho group localization to be less reliable than determination of the base peptide sequence and the total number of phospho groups present. To get some degree of data aggregation, which reduces the multiple testing impact and improves the data quality, we combined all PSMs with the same base peptide sequence and the same number of phospho groups. This aggregation step reduces the number of data points in the quantitative analysis by about a factor of 2. The aggregation is not uniform across the phosphopeptides: many peptides have only a single PSM, while others have many combined PSMs. We carried forward localization information in a limited way. The localization of the most confident PSM (smallest q-value) is reported and annotated as consistent (all PSMs have the same site localization) or variable. Analysis of the phosphopeptide data is in the second half of this notebook.
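The aggregation described above can be sketched in a few lines of base R. The PSM table, its column names, and the choice to sum reporter ion intensities are all assumptions made for illustration; the actual PAW pipeline scripts are written in Python and may differ in detail.

```r
# hypothetical toy PSM table (column names are illustrative only)
psms <- data.frame(
    base_seq  = c("SPEPTIDER", "SPEPTIDER", "SPEPTIDER", "TLSYK"),
    n_phospho = c(1, 1, 2, 1),
    site      = c("S1", "T5", "S1;T5", "S3"),
    q_value   = c(0.001, 0.004, 0.002, 0.003),
    ch_126    = c(100, 150, 80, 200),
    ch_127    = c(120, 130, 60, 400),
    stringsAsFactors = FALSE
)

# aggregation key: base peptide sequence plus number of phospho groups
key <- paste(psms$base_seq, psms$n_phospho, sep = "_")

# combine reporter ion intensities per key (summing is an assumption here)
summed <- rowsum(psms[c("ch_126", "ch_127")], group = key)

# carry forward the site of the most confident PSM and flag consistency
by_key <- split(psms, key)
best_site <- sapply(by_key, function(g) g$site[which.min(g$q_value)])
status <- sapply(by_key, function(g) {
    if (length(unique(g$site)) == 1) "consistent" else "variable"
})
n_psm <- sapply(by_key, nrow)

combined <- data.frame(summed,
                       n_psm = n_psm[rownames(summed)],
                       best_site = best_site[rownames(summed)],
                       localization = status[rownames(summed)],
                       stringsAsFactors = FALSE)
combined
```

In this toy case the two singly phosphorylated PSMs of the same base sequence collapse into one row flagged as variable, because their reported sites differ.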

Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.

Senko, M.W., Remes, P.M., Canterbury, J.D., Mathur, R., Song, Q., Eliuk, S.M., Mullen, C., Earley, L., Hardman, M., Blethrow, J.D. and Bui, H., 2013. Novel parallelized quadrupole/linear ion trap/Orbitrap tribrid mass spectrometer improving proteome coverage and peptide identification rates. Analytical chemistry, 85(24), pp.11710-11714.

Eng, J.K., McCormack, A.L. and Yates, J.R., 1994. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5(11), pp.976-989.

McAlister, G.C., Nusinow, D.P., Jedrychowski, M.P., Wühr, M., Huttlin, E.L., Erickson, B.K., Rad, R., Haas, W. and Gygi, S.P., 2014. MultiNotch MS3 enables accurate, sensitive, and multiplexed detection of differential expression across cancer cell line proteomes. Analytical chemistry, 86(14), pp.7150-7158.

Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.

Taus, T., Köcher, T., Pichler, P., Paschke, C., Schmidt, A., Henrich, C. and Mechtler, K., 2011. Universal and confident phosphorylation site localization using phosphoRS. Journal of proteome research, 10(12), pp.5354-5362.

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. and Hornik, K., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10), p.R80.

Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.

# load the libraries
library(tidyverse)
library(stringr)
library(edgeR)
library(limma)
library(psych)
library(scales)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: limma

Attaching package: ‘psych’

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha


Attaching package: ‘scales’

The following objects are masked from ‘package:psych’:

    alpha, rescale

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor

Define notebook functions¶

# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}

# ============== CV function ===================================================
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)        # compute averages
    std <- apply(df, 1, sd)    # compute standard deviations (avoid shadowing sd)
    100 * std / ave            # CVs in percent (last value gets returned)
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accession
get_identifier <- function(accession) {
    identifier <- str_split(accession, ";", simplify = TRUE)
    identifier[,1]
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top up regulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1))
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

Phosphopeptide data¶

Read in the data exported from Excel¶

Load data into edgeR and normalize¶

In most protein expression studies, the assumption of the majority of the proteins not having any changes in expression levels is central to normalization strategies. This concept is much less tested in phosphopeptide enrichment studies. We need to pay close attention to the sizes of the normalization factors and the alignment of the boxplots to verify that the normalization methods used for the whole protein analysis above are still valid.
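To make the size of a normalization factor concrete, here is a minimal base R sketch of the library size (sample loading) step that `apply_tmm_factors` performs, using a made-up two-sample matrix rather than the real data. Each column is scaled so that its total matches the average total across samples.

```r
# toy intensity matrix: sample S2 was loaded at twice the amount of S1
counts <- matrix(c(10, 20, 30,
                   20, 40, 60),
                 ncol = 2,
                 dimnames = list(c("pep1", "pep2", "pep3"), c("S1", "S2")))

# column totals play the role of y$samples$lib.size
lib_size <- colSums(counts)

# factors that bring every column total to the average total
lib_facs <- mean(lib_size) / lib_size

# apply the factors column-wise
normalized <- sweep(counts, 2, lib_facs, FUN = "*")

lib_facs
colSums(normalized)
```

Factors far from 1.0, or factors that differ systematically between conditions, would be a hint that the underlying assumption deserves a closer look.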

# read in the data export file
data_all_pep <- read_tsv("Cond-2_TiO2.txt")

Parsed with column specification:
cols(
  Accession = col_character(),
  RM2 = col_double(),
  AM2 = col_double(),
  RM4 = col_double(),
  AM4 = col_double(),
  RF1 = col_double(),
  AF1 = col_double(),
  RF3 = col_double(),
  AF3 = col_double(),
  RM1 = col_double(),
  AM1 = col_double()
)

# save accessions and remove from data
accessions <- data_all_pep$Accession
counts_pep <- data_all_pep %>% select(-Accession)


# get the data arranged for the testing (resting first, then activated)
resting <- select(counts_pep, starts_with("R"))
resting <- resting[, order(colnames(resting))]
activated <- select(counts_pep, starts_with("A"))
activated <- activated[, order(colnames(activated))]
counts_pep <- cbind(resting, activated)

# define some column indexes
R  <- 1:5 # resting
A <- 6:10 # activated

# set color vector to match sample groups
color = c(rep("red", length(R)), rep("blue", length(A)))

head(counts_pep)
length(accessions)

boxplot(log10(counts_pep), col = color,
        xlab = 'TMT samples', ylab = 'log10 Intensity', 
        main = 'Condition 2 TiO2 Starting data', notch = TRUE)

# load into edgeR DGEList object and normalize the data
group = c(rep("R", 5), rep("A", 5))
ye <- DGEList(counts = counts_pep, group = group, genes = accessions)
ye <- calcNormFactors(ye)

Save the TMM normalized data¶

counts_pep_tmm <- apply_tmm_factors(ye, color)

Library size factors:
 RF1   -> 0.965751
 RF3   -> 1.002853
 RM1   -> 1.001322
 RM2   -> 0.968792
 RM4   -> 0.988646
 AF1   -> 0.978702
 AF3   -> 1.057008
 AM1   -> 1.054618
 AM2   -> 0.976143
 AM4   -> 1.015718

Trimmed mean of M-values (TMM) factors:
 RF1   -> 0.935481
 RF3   -> 0.887340
 RM1   -> 0.968828
 RM2   -> 0.972893
 RM4   -> 0.982201
 AF1   -> 1.046131
 AF3   -> 1.047152
 AM1   -> 1.085472
 AM2   -> 1.040020
 AM4   -> 1.052221

Combined (lib size and TMM) normalization factors:
 RF1   -> 0.903442
 RF3   -> 0.889872
 RM1   -> 0.970108
 RM2   -> 0.942531
 RM4   -> 0.971049
 AF1   -> 1.023850
 AF3   -> 1.106848
 AM1   -> 1.144758
 AM2   -> 1.015208
 AM4   -> 1.068759

Normalization factors are close to 1 and boxplots look okay¶

The normalization checks seem fine, so we should be in good shape for statistical testing.
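As a quick arithmetic check, the combined factors printed above should equal the product of the library size and TMM factors for each sample. The sketch below re-enters the printed values by hand (so agreement is only to rounding precision) and also reports the overall range.

```r
samples <- c("RF1", "RF3", "RM1", "RM2", "RM4", "AF1", "AF3", "AM1", "AM2", "AM4")

# values copied from the printed output above
lib_facs <- c(0.965751, 1.002853, 1.001322, 0.968792, 0.988646,
              0.978702, 1.057008, 1.054618, 0.976143, 1.015718)
tmm_facs <- c(0.935481, 0.887340, 0.968828, 0.972893, 0.982201,
              1.046131, 1.047152, 1.085472, 1.040020, 1.052221)
printed  <- c(0.903442, 0.889872, 0.970108, 0.942531, 0.971049,
              1.023850, 1.106848, 1.144758, 1.015208, 1.068759)

# combined factor = library size factor times TMM factor
combined <- setNames(lib_facs * tmm_facs, samples)

max(abs(combined - printed))  # differences are at the rounding level
range(combined)               # smallest and largest overall correction
```

The corrections span roughly 0.89 to 1.14. The resting samples all have combined factors below 1 and the activated samples all have factors above 1, which is worth keeping in mind when judging the normalization assumption.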

Check sample-to-sample similarity¶

We can also do the multi-panel scatter plots for the phosphopeptides to compare samples within each condition.

# check sample-to-sample similarity
pairs.panels(log10(counts_pep_tmm[R]), lm = TRUE, main = "Condition 2 Phosphopeptides: resting")
pairs.panels(log10(counts_pep_tmm[A]), lm = TRUE, main = "Condition 2 Phosphopeptides: activated")

Peptide-centric data has a little more scatter¶

We do not have as much aggregation in peptide-centric phospho studies as we do in protein expression studies, so there is less averaging to reduce sample-to-sample scatter. We also have a bit more missing data that we imputed (the points along the axes).

Check that samples cluster by condition and check variance¶

# check the clustering (set colors by condition)
plotMDS(ye, col = color, main = "Condition 2: Resting (red) and Activated (blue)")

Resting and activated are well separated¶

Resting are along the left and the activated are on the right. We do still see the samples pairing top-to-bottom. A paired study design might be statistically more powerful.
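Why pairing can add power is easy to see with a toy example. The sketch below uses a plain t-test on made-up log-scale values as a stand-in (it is not the edgeR model used in this notebook): subject-to-subject differences are large, but every subject shifts by a similar amount on activation.

```r
# hypothetical log intensities for five subjects (illustrative values only)
resting   <- c(10, 12, 14, 16, 18)
activated <- resting + c(1.0, 1.2, 0.9, 1.1, 0.8)

# unpaired comparison: subject-to-subject spread hides the shift
p_unpaired <- t.test(activated, resting)$p.value

# paired comparison: each subject serves as its own control
p_paired <- t.test(activated, resting, paired = TRUE)$p.value

c(unpaired = p_unpaired, paired = p_paired)
```

The paired p-value is far smaller because the test works on the within-subject differences, which is the same idea behind adding a subject term to the edgeR design matrix.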

Do an exact test first and then do a paired study design.¶

An exact test will establish a robust (safe) baseline for differential expression. We will then do paired testing.

We need the trended variance.

# compute dispersions and plot BCV
ye <- estimateDisp(ye)
plotBCV(ye, main = "Phosphopeptide Variance Trend")

Design matrix not provided. Switch to the classic mode.

Data are ready for statistical testing¶

The cluster plot indicates that we have some clear separation between conditions. The scatter plots show that sample-to-sample reproducibility is very good.

Perform an exact test with edgeR and look at the significant peptides.

# the exact test object has columns like fold-change, CPM, and p-values
# we have already loaded the data into "ye" above, normalized, and est. dispersion
ete <- exactTest(ye, pair = c("R", "A"))

# this counts up, down, and unchanged genes (here they are phosphopeptides)
summary(decideTestsDGE(ete, p.value = 0.10))
topTags(ete)$table

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order!
tte <- topTags(ete, n = Inf, sort.by = "none")$table

results_e <- collect_results(counts_pep_tmm, tte, R, "R", A, "A")

        A-R
Down    664
NotSig 1551
Up      885

Check MA plot and p-value distribution¶

# make an MD plot
plotMD(ete, p.value = 0.10 , main = "Phosphopeptide Exact Test: Resting vs Activated")
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tte, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(ete$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("Phosphopeptide Exact Test: Resting vs Activated")

Can also do more candidate visualizations¶

We can look at distributions of log2 fold change, fancier MA plots, scatter plots, and the popular volcano plot.
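For reference, the candidate categories and signed fold changes used in these visualizations come from `collect_results`. The sketch below applies the same `cut` breaks and `ifelse` logic to a few made-up values, and checks that the exact test category counts reported in this notebook add up to the up plus down totals.

```r
# FDR thresholds: high (<= 0.01), med (<= 0.05), low (<= 0.10), no (> 0.10)
fdr <- c(0.005, 0.03, 0.07, 0.50)
candidate <- cut(fdr, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
                 labels = c("high", "med", "low", "no"))
candidate

# signed fold change: positive when y is larger, negative reciprocal otherwise
ave_x <- c(100, 200)
ave_y <- c(200, 100)
fc <- ifelse(ave_y > ave_x, ave_y / ave_x, -1 * ave_x / ave_y)
fc

# exact test: high + med + low candidates versus Down + Up at FDR 0.10
sum(c(964, 320, 265)) == 664 + 885
```

A consequence of the signed fold change is that values between -1 and 1 never occur, so a two-fold decrease is reported as -2 rather than 0.5.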

# see how many candidates by category
results_e %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results_e, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-5, 5)) +
  ggtitle("Log2 R by category")

Dotted lines are 2-fold and solid line is 1-to-1¶

# make the DE plots
MA_plots(results_e, "ave_R", "ave_A", "Resting vs Activated - Exact Test")
scatter_plots(results_e, "ave_R", "ave_A", "Resting vs Activated - Exact Test")
volcano_plot(results_e, "ave_R", "ave_A", "Resting vs Activated - Exact Test")
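The positions of the reference lines in these plots follow directly from the axis transformations. A short worked check, using a made-up pair of condition averages:

```r
fold <- 2

# MA plot: M is log2 of the ratio, so 2-fold lines sit at +1 and -1
log2(fold)

# scatter plot on log10 axes: 2-fold lines are offset by log10(2)
round(log10(fold), 3)

# one hypothetical phosphopeptide that doubles on activation
ave_R <- 1e6
ave_A <- 2e6
M <- log2(ave_A / ave_R)          # y-axis value in the MA plot
A <- log10((ave_R + ave_A) / 2)   # x-axis value in the MA plot
c(M = M, A = A)
```

This is where the `yintercept = 1.0` values in `MA_plots` and the `intercept = 0.301` values in `scatter_plots` come from.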

Look at some of the top tag proteins¶

# plot the top N up down proteins
top_N <- 20

set_plot_dimensions(6, 3.5)
plot_top_tags(results_e, 5, 5, top_N)
set_plot_dimensions(7, 7)

All of the candidate plots look convincing¶

Paired samples testing¶

We can use the generalized linear model (GLM) extensions in edgeR to do a paired study design.

# load into edgeR DGEList object and normalize the data
group = c(rep("R", 5), rep("A", 5))
yp <- DGEList(counts = counts_pep, group = group, genes = accessions)
yp <- calcNormFactors(yp)

# create the experimental design matrix
subject <- factor(rep(c("F1", "F3", "M1", "M2", "M4"), 2))
state <- factor(c(rep("R", 5), rep("A", 5)))

# Example 4.1 in edgeR user's guide
design <- model.matrix(~subject+state)
rownames(design) <- colnames(yp)
design

# estimate the dispersion parameters and check
yp <- estimateDisp(yp, design, robust = TRUE)
yp$common.dispersion

# fit statistical models (design matrix already in yp$design)
fit <- glmQLFit(yp, design, robust = TRUE)
plotQLDisp(fit, main = "Condition 2 - Paired testing")

Data has been loaded, normalized, and trended dispersion modeled¶

We now run the linear modeling and check the results.

# if we do not specify a contrast, the default is the last column
# of the design matrix - a Resting versus Activated comparison
paired <- glmQLFTest(fit) # default comparison

# check test results
topTags(paired)$table
ttp <- topTags(paired, n = Inf, sort.by = "none")$table
summary(decideTests(paired, p.value = 0.10))

# make basic MA plot
plotMD(paired, p.value = 0.10, main = "Condition 2 TiO2 - Paired test")
abline(h = c(-1, 1), col = "black") # 2-fold lines
    
# check the p-value distribution
ggplot(ttp, aes(PValue)) + 
    geom_histogram(bins = 100, fill = "white", color = "black") + 
    geom_hline(yintercept = mean(hist(ttp$PValue, breaks = 100, 
                                 plot = FALSE)$counts[30:100])) +
    ggtitle("p-value distribution: paired test")

# collect and reformat the test results
results_p <- collect_results(counts_pep_tmm, ttp, R, "R", A, "A")

# correct the changed reference order for the glm testing
results_p$logFC <- -1 * results_p$logFC

       stateR
Down     1038
NotSig    972
Up       1090

We have about 600 more DE candidates with the paired testing¶

Note: the glm testing does not have the same sign for the logFC as the exact test did. The above MA plot will be inverted compared to the one from the exact testing farther above.
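The sign difference arises because R orders factor levels alphabetically, so "A" becomes the reference level and the last design column is `stateR`. One possible alternative to flipping the sign afterwards, sketched here with base R only and not run against the real data, is to set the reference level explicitly before building the design matrix:

```r
subject <- factor(rep(c("F1", "F3", "M1", "M2", "M4"), 2))
state <- factor(c(rep("R", 5), rep("A", 5)))

# default: alphabetical levels make "A" the reference
levels(state)
design_default <- model.matrix(~ subject + state)
tail(colnames(design_default), 1)

# make resting the reference so the last coefficient is activated relative to resting
state <- relevel(state, ref = "R")
design_relevel <- model.matrix(~ subject + state)
tail(colnames(design_relevel), 1)
```

With the releveled design the last coefficient would be expected to have the same sign convention as the exact test (A relative to R), so the manual sign correction would presumably no longer be needed.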

Check the log2 fold change distributions¶

# see how many candidates by category
results_p %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results_p, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-5, 5)) +
  ggtitle("Log2 R by paired test category")

Check the other DE visualizations¶

# make the DE plots
MA_plots(results_p, "ave_R", "ave_A", "ASLA-515 TiO2 paired: Resting vs Activated")
scatter_plots(results_p, "ave_R", "ave_A", "ASLA-515 TiO2 paired: Resting vs Activated")
volcano_plot(results_p, "ave_R", "ave_A", "ASLA-515 TiO2 paired: Resting vs Activated")

Candidates look solid¶

Paired data may not have as large a difference in means between states. It is really the differences within specific sample pairs that drive the statistical testing.

Check some of the most statistically significant proteins (20 up and 20 down)¶

# plot the top N up down proteins
top_N <- 20

set_plot_dimensions(6, 3.5)
plot_top_tags(results_p, 5, 5, top_N)
set_plot_dimensions(7, 7)

There is more correlation in red and blue expression patterns¶

We can see the paired patterns in the data for these candidates compared to the exact test (which was more about the differences in means). The paired testing looks like the right thing here.

Platelet activation causes dramatic changes in phosphorylation¶

We have dramatic changes in signalling levels. We have similar numbers of up and down regulated peptides, although the up regulated candidates show larger intensity changes upon activation.

Check the peptide CVs for each state¶

# phosphopeptides
cv_pep <- data.frame(R_pep = CV(counts_pep_tmm[R]), A_pep = CV(counts_pep_tmm[A]))
medians <- apply(cv_pep, 2, FUN = median)
print("Phosphopeptide median CVs (%)")
round(medians, 1)

# make the boxplots
boxplot(cv_pep, notch = TRUE, main = "Phosphopeptide CV distributions", 
        ylim = c(0, 150), ylab = "CV (%)")

[1] "Phosphopeptide median CVs (%)"

CV distributions are similar for both states¶

The median CVs are reasonable for a peptide-centric experiment. We have a bit larger CVs and more variable data after activation compared to the resting state.
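For readers unfamiliar with the metric, the coefficient of variation is the standard deviation expressed as a percentage of the mean, computed row by row. A tiny worked example with made-up intensities:

```r
# two hypothetical peptides measured in three samples
toy <- data.frame(s1 = c(90, 50), s2 = c(100, 50), s3 = c(110, 50))

ave <- rowMeans(toy)        # 100 and 50
std <- apply(toy, 1, sd)    # 10 and 0
cv <- 100 * std / ave       # percent CV per row
cv
```

The first row varies by 10 around a mean of 100, giving a CV of 10%, while the second row is constant and has a CV of 0%.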

Export the phosphopeptide results and log session¶

# save the results data frame
results <- cbind(results_e, results_p)
write.table(results, "Cond-2_TiO2_results.txt", sep = "\t", row.names = FALSE)

# log the session
sessionInfo()

R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.2

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.0.0    psych_1.8.12    edgeR_3.24.3    limma_3.38.3   
 [5] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
 [9] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.1  
[13] tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] pbdZMQ_0.3-3     statmod_1.4.32   locfit_1.5-9.1   tidyselect_0.2.5
 [5] repr_1.0.1       splines_3.5.3    haven_2.1.1      lattice_0.20-38 
 [9] colorspace_1.4-1 generics_0.0.2   vctrs_0.2.0      htmltools_0.3.6 
[13] base64enc_0.1-3  rlang_0.4.0      pillar_1.4.2     foreign_0.8-72  
[17] glue_1.3.1       withr_2.1.2      modelr_0.1.5     readxl_1.3.1    
[21] uuid_0.1-2       munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0
[25] rvest_0.3.4      evaluate_0.14    labeling_0.3     parallel_3.5.3  
[29] broom_0.5.2      IRdisplay_0.7.0  Rcpp_1.0.2       backports_1.1.4 
[33] IRkernel_1.0.2   jsonlite_1.6     mnormt_1.5-5     hms_0.5.1       
[37] digest_0.6.20    stringi_1.4.3    grid_3.5.3       cli_1.1.0       
[41] tools_3.5.3      magrittr_1.5     lazyeval_0.2.2   crayon_1.3.4    
[45] pkgconfig_2.0.2  zeallot_0.1.0    xml2_1.2.2       lubridate_1.7.4 
[49] assertthat_0.2.1 httr_1.4.1       rstudioapi_0.10  R6_2.4.0        
[53] nlme_3.1-141     compiler_3.5.3

RF1	RF3	RM1	RM2	RM4	AF1	AF3	AM1	AM2	AM4
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
34683947	37873437	32918798	37116452	37614953	22800311	21503758	20756313	21787001	20939295
5497806	6672084	11063516	4939466	13680290	22635667	25881939	29379954	24554781	27052080
9722591	15754640	26461757	18865144	12844696	6388820	5769413	11699631	8265890	5136809
12259290	8350920	7405620	5697690	8015420	13785590	13666420	11447390	13344220	12668360
12657879	15500815	14695680	17417237	16178997	4778883	3903832	4625139	6451124	5174802
10749110	9171110	11103250	9290960	10405960	6806840	5785780	7348210	6914970	8117220

	genes	logFC	logCPM	PValue	FDR
	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
302	P30273	4.119800	9.022406	2.920604e-54	9.053874e-51
2058	Q05655	3.028749	5.259739	4.705835e-46	7.294044e-43
133	Q9UN19	3.363625	9.998154	5.466284e-43	5.648493e-40
2104	Q68D51	-2.521924	5.019133	2.849385e-41	2.208274e-38
1284	P43405	3.851150	6.710581	5.691309e-41	3.528611e-38
2247	P16885	3.209415	4.836029	1.472331e-40	7.607041e-38
2084	Q27J81	2.820821	5.207082	2.471888e-38	1.094693e-35
338	Q13136	-2.085062	8.785572	6.871815e-38	2.662828e-35
29	Q08495	-2.628154	11.952402	4.278155e-36	1.473587e-33
1558	Q5M775	2.936819	6.255757	4.019865e-35	1.246158e-32

	genes	logFC	logCPM	F	PValue	FDR
	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
302	P30273	-4.142076	9.022405	1372.6995	6.833855e-09	2.118495e-05
338	Q13136	2.086658	8.785573	874.0283	2.997357e-08	3.380010e-05
262	Q9NQT8	2.630571	9.059221	850.9914	3.270978e-08	3.380010e-05
169	Q8IUD2	2.132547	9.566796	777.3500	4.397291e-08	3.407901e-05
1284	P43405	-3.871582	6.710573	622.8652	9.065157e-08	4.389685e-05
409	Q96A00	1.984248	8.574624	610.2058	9.693119e-08	4.389685e-05
1529	O43314	-2.452699	6.315523	595.1667	1.051493e-07	4.389685e-05
648	Q9Y613	-2.529036	7.966159	576.4985	1.166621e-07	4.389685e-05
1241	Q86YS7	1.688747	6.664160	549.8863	1.360881e-07	4.389685e-05
1516	O75122	-2.358794	6.333958	520.0259	1.632300e-07	4.389685e-05

	(Intercept)	subjectF3	subjectM1	subjectM2	subjectM4	stateR
RF1	1	0	0	0	0	1
RF3	1	1	0	0	0	1
RM1	1	0	1	0	0	1
RM2	1	0	0	1	0	1
RM4	1	0	0	0	1	1
AF1	1	0	0	0	0	0
AF3	1	1	0	0	0	0
AM1	1	0	1	0	0	0
AM2	1	0	0	1	0	0
AM4	1	0	0	0	1	0

A tibble: 4 × 2
candidate	n
<fct>	<int>
high	964
med	320
low	265
no	1551

A tibble: 4 × 2
candidate	n
<fct>	<int>
high	1358
med	524
low	246
no	972
