Condition 1 TiO2 TMT Phosphopeptide Analysis

Joe Aslan Lab, OHSU

Notebook prepared by Phil Wilmarth, PSR Core, OHSU

January 9, 2020

Signalling changes between resting (R) and activated (A) human platelets. TiO2-enriched phosphopeptides were labeled with 10-plex TMT reagents. This is data from the ASLA-359 project.

Some of the resulting peptides were directly labeled with 10-plex TMT reagents {Thompson 2003}. The remaining peptides were enriched for phosphopeptides and labeled with 10-plex TMT for a second set of labeled samples. After combining equal portions of the labeled peptide samples (either whole protein or enriched phosphopeptides), they were separated by high pH reverse phase/low pH reverse phase liquid chromatography and analyzed on a Thermo Fusion Tribrid Orbitrap mass spectrometer {Senko 2013}. Fragment ions were generated using CID and measured in the ion trap analyzer; reporter ions were generated by high energy collision dissociation of synchronous precursor selection (SPS) selected fragment ions and measured at high resolution in the Orbitrap analyzer {McAlister 2014}.

Peptides and proteins were identified using Proteome Discoverer v1.4 and SEQUEST {Eng 1994}. A wider 1.25 Da parent ion mass tolerance was used, TMT labels and alkylated cysteine were specified as static modifications, oxidation of methionine (+15.9949 Da) was specified as a variable modification, trypsin enzyme specificity was selected, and a canonical UniProt Swiss-Prot human protein database was searched. The fragment ion tolerance was set at 1.0005 Da. The labeled phosphopeptides were analyzed with the addition of variable phosphorylation (+79.9799 Da) on serine, threonine, or tyrosine residues. Proteome Discoverer was also configured with a phosphoRS {Taus 2011} node for site localizations. Confident peptide identifications were obtained using Percolator and the target/decoy method. PSM information (peptide sequences, q-values, masses, and reporter ion intensities) was exported to tab-delimited files for processing with PAW pipeline modules.

The PAW pipeline {Wilmarth 2009} Python scripts take the files exported from Proteome Discoverer and apply additional accurate peptide mass and minimum reporter ion intensity filtering. The exported data from Proteome Discoverer have already undergone PSM confidence filtering (Percolator q-values) and parsimonious protein inference. Reporter ion intensities are the peak heights of the most confident (closest to the expected m/z value) centroided peaks within 20 ppm of the expected positions. Normalizations and statistical testing were performed using the Bioconductor {Gentleman 2004} package edgeR {Robinson 2010}, as detailed in the first half of the notebook below. A Jupyter notebook with an R kernel was used to execute the R commands and visualize the results.
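
As an aside, the reporter ion matching rule is simple to sketch. The short function below is only a minimal R illustration of the idea (the actual extraction is done in the PAW Python scripts); pick_reporter, centroids, heights, and expected_mz are hypothetical names, not pipeline objects.

# sketch: take the height of the centroid closest to the expected reporter m/z,
# but only if it falls within a 20 ppm window (otherwise report zero intensity)
pick_reporter <- function(centroids, heights, expected_mz, ppm = 20) {
    tol <- expected_mz * ppm / 1e6                   # ppm tolerance converted to m/z units
    delta <- abs(centroids - expected_mz)            # distance of each centroid from the target
    in_window <- delta <= tol
    if (!any(in_window)) return(0)                   # no centroid close enough -> zero intensity
    heights[in_window][which.min(delta[in_window])]  # height of the closest matching centroid
}

# hypothetical example near the TMT 126 reporter region
pick_reporter(c(126.1251, 126.1283, 126.1304), c(5.2e4, 1.1e6, 3.0e3), 126.1277)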

Phosphopeptide enrichment data are quite different from protein expression data. The measurements are peptide-centric rather than protein-centric. Protein inference cannot be used as a noise filter to reduce the negative impact of incorrect PSMs, and the benefit of aggregating data to the protein level is also greatly compromised. We did a couple of things to address these issues. We tightened the q-value cutoff for PSM filtering from 5% to 1%. We also considered phospho group localization to be less reliable than determination of the base peptide sequence and the total number of phospho groups present. To get some degree of data aggregation, reducing the multiple-testing burden and improving data quality, we combined all PSMs having the same base peptide sequence and the same number of phospho groups. This aggregation step reduces the number of data points in the quantitative analysis by about a factor of 2. The aggregation is not uniform across the phosphopeptides: many peptides have only a single PSM, while others may have many combined PSMs. We carried localization information forward in a limited way. The localization of the most confident PSM (smallest q-value) is reported and annotated as consistent (all PSMs have the same site localization) or variable. Analysis of the phosphopeptide data is in the second half of this notebook.
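
The aggregation step can be sketched in R/dplyr terms as shown below. This is a conceptual sketch only (the real step happens in the PAW Python scripts before this notebook); psms, sequence, n_phospho, site_string, q_value, and reporter_intensity are hypothetical column names, and a single summed intensity column stands in for the ten reporter channels.

# sketch: pool PSMs that share a base peptide sequence and number of phospho groups,
# keep the site localization of the most confident PSM, and flag whether all PSMs
# in the group agreed on that localization
library(dplyr)

aggregate_psms <- function(psms) {
    psms %>%
        group_by(sequence, n_phospho) %>%
        summarize(intensity = sum(reporter_intensity),           # in practice each reporter channel is summed
                  best_sites = site_string[which.min(q_value)],  # localization from the smallest q-value PSM
                  localization = ifelse(n_distinct(site_string) == 1, "consistent", "variable"),
                  n_psms = n()) %>%
        ungroup()
}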

Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.

Senko, M.W., Remes, P.M., Canterbury, J.D., Mathur, R., Song, Q., Eliuk, S.M., Mullen, C., Earley, L., Hardman, M., Blethrow, J.D. and Bui, H., 2013. Novel parallelized quadrupole/linear ion trap/Orbitrap tribrid mass spectrometer improving proteome coverage and peptide identification rates. Analytical chemistry, 85(24), pp.11710-11714.

Eng, J.K., McCormack, A.L. and Yates, J.R., 1994. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5(11), pp.976-989.

McAlister, G.C., Nusinow, D.P., Jedrychowski, M.P., Wühr, M., Huttlin, E.L., Erickson, B.K., Rad, R., Haas, W. and Gygi, S.P., 2014. MultiNotch MS3 enables accurate, sensitive, and multiplexed detection of differential expression across cancer cell line proteomes. Analytical chemistry, 86(14), pp.7150-7158.

Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.

Taus, T., Köcher, T., Pichler, P., Paschke, C., Schmidt, A., Henrich, C. and Mechtler, K., 2011. Universal and confident phosphorylation site localization using phosphoRS. Journal of proteome research, 10(12), pp.5354-5362.

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. and Hornik, K., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10), p.R80.

Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.

In [1]:
# load the libraries
library(tidyverse)
library(stringr)
library(edgeR)
library(limma)
library(psych)
library(scales)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: limma

Attaching package: ‘psych’

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha


Attaching package: ‘scales’

The following objects are masked from ‘package:psych’:

    alpha, rescale

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor

Define notebook functions

In [2]:
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}

# ============== CV function ===================================================
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    cv <- 100 * sd / ave   # compute CVs in percent (last thing gets returned)
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accession
get_identifier <- function(accession) {
    identifier <- str_split(accession, ";", simplify = TRUE)
    identifier[,1]
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top upregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1))
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

Phosphopeptide data

Read in the data exported from Excel

Load data into edgeR and normalize

In most protein expression studies, the assumption that the majority of proteins do not change in expression level is central to normalization strategies. This assumption is much less tested in phosphopeptide enrichment studies. We need to pay close attention to the sizes of the normalization factors and the alignment of the boxplots to verify that the normalization methods used for the whole protein analysis above are still valid.
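
One quick way to eyeball that assumption here (a sketch, using the counts_pep data frame built in the next few cells) is to look at the per-sample median log2 ratios relative to the row averages; if the bulk of the phosphopeptides are unchanged, those medians should sit near zero.

# sketch: per-sample median log2 ratio to the row average as a sanity check on
# the "majority unchanged" assumption behind library-size and TMM normalization
ref <- rowMeans(counts_pep)                            # average intensity of each peptide
m_values <- log2(sweep(counts_pep, 1, ref, FUN = "/")) # log2 ratio of each sample to that average
round(apply(m_values, 2, median), 2)                   # medians near 0 support the assumption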

In [3]:
# read in the data export file
data_all_pep <- read_tsv("Cond-1_TiO2.txt")
Parsed with column specification:
cols(
  Accession = col_character(),
  `L1-CRP` = col_double(),
  `L1+CRP` = col_double(),
  `L2-CRP` = col_double(),
  `L2+CRP` = col_double(),
  `B1-CRP` = col_double(),
  `B1+CRP` = col_double(),
  `B2-CRP` = col_double(),
  `B2+CRP` = col_double(),
  `B3-CRP` = col_double(),
  `B3+CRP` = col_double()
)
In [4]:
# save accessions and remove from data
accessions <- data_all_pep$Accession
counts_pep <- data_all_pep %>% select(-Accession)

# get the data arranged for the testing (resting first, then activated)
minus <- select(counts_pep, contains("-"))
minus <- minus[, order(colnames(minus))]
plus <- select(counts_pep, contains("+"))
plus <- plus[, order(colnames(plus))]
counts_pep <- cbind(minus, plus)
In [5]:
# define some column indexes
R  <- 1:5 # resting
A <- 6:10 # activated

# set color vector to match sample groups
color = c(rep("red", length(R)), rep("blue", length(A)))

head(counts_pep)
length(accessions)
A data.frame: 6 × 10

    B1-CRP   B2-CRP   B3-CRP   L1-CRP   L2-CRP   B1+CRP   B2+CRP   B3+CRP   L1+CRP   L2+CRP
1 44699890 55212664 59159248 63175915 71323887 40915230 64929740 63861431 68620369 76828383
2 29538733 31449192 31232703 35274394 36966379 18791698 19961754 18230789 21057395 19786619
3 15445490 22148300 19726660 19958250 26794380 13477450 26006550 22767600 27371970 32648500
4 18344352 18896712 17846973 21265201 32230854 15578337 20185362 19703876 22695618 32892582
5  8648280 11096920 10568750 10304870 20364050  9375540  8824140 10491280 11367700 20330260
6  4370668  5319673  4352876  5292394  4994891  9850623 10982599 11111301 12367885 12193458

2755
In [6]:
boxplot(log10(counts_pep), col = color,
        xlab = 'TMT samples', ylab = 'log10 Intensity', 
        main = 'Condition 1 TiO2 Starting data', notch = TRUE)
In [7]:
# load into edgeR DGEList object and normalize the data
group = c(rep("R", 5), rep("A", 5))
ye <- DGEList(counts = counts_pep, group = group, gene = accessions)
ye <- calcNormFactors(ye)

Save the TMM normalized data

In [8]:
counts_pep_tmm <- apply_tmm_factors(ye, color)
Library size factors:
 B1-CRP -> 1.156073
 B2-CRP -> 1.036589
 B3-CRP -> 1.094500
 L1-CRP -> 1.004538
 L2-CRP -> 0.910289
 B1+CRP -> 1.117783
 B2+CRP -> 0.962666
 B3+CRP -> 0.998731
 L1+CRP -> 0.946977
 L2+CRP -> 0.853290

Trimmed mean of M-values (TMM) factors:
 B1-CRP -> 0.914401
 B2-CRP -> 0.951629
 B3-CRP -> 1.002488
 L1-CRP -> 0.969543
 L2-CRP -> 1.033312
 B1+CRP -> 0.950687
 B2+CRP -> 1.011083
 B3+CRP -> 1.046773
 L1+CRP -> 1.026112
 L2+CRP -> 1.108274

Combined (lib size and TMM) normalization factors:
 B1-CRP -> 1.057113
 B2-CRP -> 0.986448
 B3-CRP -> 1.097223
 L1-CRP -> 0.973943
 L2-CRP -> 0.940613
 B1+CRP -> 1.062661
 B2+CRP -> 0.973335
 B3+CRP -> 1.045444
 L1+CRP -> 0.971705
 L2+CRP -> 0.945679

Normalization factors are close to 1 and boxplots look okay

The normalization checks seem fine, so we should be in good shape for statistical testing.

Check sample-to-sample similarity

We can also do the multi-panel scatter plots for the phosphopeptides to compare samples within each condition.

In [9]:
# check sample-to-sample similarity
pairs.panels(log10(counts_pep_tmm[R]), lm = TRUE, main = "Condition 1 Phosphopeptides: resting")
pairs.panels(log10(counts_pep_tmm[A]), lm = TRUE, main = "Condition 1 Phosphopeptides: activated")

Peptide-centric data show a little more scatter

We do not have as much aggregation in peptide-centric phospho studies as we do in protein expression studies, so there is less averaging to reduce the sample-to-sample scatter. We also have a bit more missing data that was imputed (the points along the axes).

Check that samples cluster by condition and check variance

In [10]:
# check the clustering (set colors by condition)
plotMDS(ye, col = color, main = "Condition 1: Resting (red) and Activated (blue)")

Resting and activated are well separated

The resting samples are along the left and the activated samples are on the right. We still see the samples pairing top-to-bottom. A paired study design might be statistically more powerful.

Do an exact test first, then use a paired study design.

An exact test will establish a robust (safe) baseline for differential expression. We will then do paired testing.

We need the trended variance.

In [11]:
# compute dispersions and plot BCV
ye <- estimateDisp(ye)
plotBCV(ye, main = "Phosphopeptide Variance Trend")
Design matrix not provided. Switch to the classic mode.

Data are ready for statistical testing

The cluster plot indicates that we have some clear separation between conditions. The scatter plots show that sample-to-sample reproducibility is very good.

Perform an exact test with edgeR and look at the significant peptides.

In [12]:
# the exact test object has columns like fold-change, CPM, and p-values
# we have already loaded the data into "ye" above, normalized, and estimated dispersion
ete <- exactTest(ye, pair = c("R", "A"))

# this counts up, down, and unchanged genes (here they are phosphopeptides)
summary(decideTestsDGE(ete, p.value = 0.10))
topTags(ete)$table

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order!
tte <- topTags(ete, n = Inf, sort.by = "none")$table

results_e <- collect_results(counts_pep_tmm, tte, R, "R", A, "A")
        A-R
Down    381
NotSig 1727
Up      647
A data.frame: 10 × 5

      genes     logFC    logCPM       PValue          FDR
439  Q92956  4.001242  8.553359 7.774902e-90 2.141986e-86
440  Q0ZGT2  3.147127  8.551902 4.619902e-74 6.363914e-71
556  P16949  3.522024  8.144731 1.408039e-65 1.293049e-62
388  Q8IVT5 -3.162106  8.735127 2.202818e-61 1.517191e-58
17   Q0ZGT2  2.730462 12.628000 8.183778e-59 4.509262e-56
111  Q96SB3  2.194340 10.364713 3.217704e-56 1.477462e-53
370  Q8IY63  2.628336  8.766501 1.349076e-53 5.309579e-51
707  Q04759  3.008434  7.695881 2.169222e-53 7.470260e-51
863  Q9UQL6  2.543998  7.295454 5.860132e-53 1.793851e-50
411  Q68D51  2.485389  8.630386 4.278355e-45 1.178687e-42

Check MA plot and p-value distribution

In [13]:
# make an MD plot
plotMD(ete, p.value = 0.10 , main = "Phosphopeptide Exact Test: Resting vs Activated")
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tte, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(ete$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("Phosphopeptide Exact Test: Resting vs Activated")

Can also do more candidate visualizations

We can look at distributions of log2 fold change, fancier MA plots, scatter plots, and the popular volcano plot.

In [14]:
# see how many candidates by category
results_e %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results_e, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-5, 5)) +
  ggtitle("Log2 R by exact test category")
A tibble: 4 × 2
  candidate    n
  high       715
  med        162
  low        151
  no        1727
In [15]:
### Dotted lines are 2-fold and solid line is 1-to-1
In [16]:
# make the DE plots
MA_plots(results_e, "ave_R", "ave_A", "ASLA-359 Resting vs Activated - Exact Test")
scatter_plots(results_e, "ave_R", "ave_A", "ASLA-359 Resting vs Activated - Exact Test")
volcano_plot(results_e, "ave_R", "ave_A", "ASLA-359 Resting vs Activated - Exact Test")

Look at some of the top tag proteins

In [17]:
# plot the top N up down proteins
top_N <- 20

set_plot_dimensions(6, 3.5)
plot_top_tags(results_e, 5, 5, top_N)
set_plot_dimensions(7, 7)

All of the candidate plots look convincing

Paired samples testing

We can use the general linear modeling (glm) extensions in edgeR to do a paired study design.

In [18]:
# load into edgeR DGEList object and normalize the data
group = c(rep("R", 5), rep("A", 5))
yp <- DGEList(counts = counts_pep, group = group, gene = accessions)
yp <- calcNormFactors(yp)
In [19]:
# create the experimental design matrix
subject <- factor(rep(c("B1", "B2", "B3", "L1", "L2"), 2))
state <- factor(c(rep("R", 5), rep("A", 5)))

# Example 4.1 in edgeR user's guide
design <- model.matrix(~subject+state)
rownames(design) <- colnames(yp)
design
A matrix: 10 × 6 of type dbl

       (Intercept) subjectB2 subjectB3 subjectL1 subjectL2 stateR
B1-CRP           1         0         0         0         0      1
B2-CRP           1         1         0         0         0      1
B3-CRP           1         0         1         0         0      1
L1-CRP           1         0         0         1         0      1
L2-CRP           1         0         0         0         1      1
B1+CRP           1         0         0         0         0      0
B2+CRP           1         1         0         0         0      0
B3+CRP           1         0         1         0         0      0
L1+CRP           1         0         0         1         0      0
L2+CRP           1         0         0         0         1      0
In [20]:
# estimate the dispersion parameters and check
yp <- estimateDisp(yp, design, robust = TRUE)
yp$common.dispersion
0.0489842575982801
In [21]:
# fit statistical models (design matrix already in y$design)
fit <- glmQLFit(yp, design, robust = TRUE)
plotQLDisp(fit, main = "Condition 1 - Paired testing")

Data has been loaded, normalized, and trended dispersion modeled

We now run the linear modeling and check the results.

In [22]:
# if we do not specify a contrast, the default is the last column
# of the design matrix - a Resting versus Activated comparison
paired <- glmQLFTest(fit) # default comparison

# check test results
topTags(paired)$table
ttp <- topTags(paired, n = Inf, sort.by = "none")$table
summary(decideTests(paired, p.value = 0.10))

# make basic MA plot
plotMD(paired, p.value = 0.10, main = "Condition 1 TiO2 - Paired test")
abline(h = c(-1, 1), col = "black") # 2-fold lines
    
# check the p-value distribution
ggplot(ttp, aes(PValue)) + 
    geom_histogram(bins = 100, fill = "white", color = "black") + 
    geom_hline(yintercept = mean(hist(ttp$PValue, breaks = 100, 
                                 plot = FALSE)$counts[30:100])) +
    ggtitle("p-value distribution: paired test")

# collect and reformat the test results
results_p <- collect_results(counts_pep_tmm, ttp, R, "R", A, "A")

# correct the changed reference order for the glm testing
results_p$logFC <- -1 * results_p$logFC
A data.frame: 10 × 6

       genes     logFC    logCPM         F       PValue          FDR
17    Q0ZGT2 -2.723305 12.628000 1459.0333 5.317393e-10 1.464942e-06
111   Q96SB3 -2.197204 10.364713 1055.6473 1.817976e-09 1.893071e-06
12    Q15942  1.797312 13.243466 1021.2594 2.061420e-09 1.893071e-06
85    Q8TF42 -2.307422 10.871584  858.2046 3.986861e-09 2.745950e-06
370   Q8IY63 -2.626067  8.766501  761.5332 6.269270e-09 3.398042e-06
863   Q9UQL6 -2.545984  7.295454  704.7447 8.405961e-09 3.398042e-06
44    Q14644 -1.465140 11.563990  699.7786 8.633863e-09 3.398042e-06
439   Q92956 -4.007752  8.553359  609.9762 1.451041e-08 4.553010e-06
1197  Q5VZK9 -2.730660  6.552094  605.9958 1.487372e-08 4.553010e-06
440   Q0ZGT2 -3.148196  8.551902  568.4728 1.893358e-08 4.805120e-06
       stateR
Down      803
NotSig   1041
Up        911

We have about 700 more DE candidates with the paired testing

Note: the glm testing does not use the same sign convention for the logFC as the exact test did. The MA plot above is inverted compared to the one from the exact test earlier in the notebook.
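
If the sign flip is a nuisance, one alternative (a sketch, not what was run above) is to set "R" as the reference level of the state factor before building the design matrix, so the fitted coefficient is already activated relative to resting and no manual correction of logFC is needed.

# sketch: relevel the state factor so the fitted coefficient is A relative to R
state_rel <- relevel(state, ref = "R")
design_rel <- model.matrix(~ subject + state_rel)
rownames(design_rel) <- colnames(yp)
# yp_rel <- estimateDisp(yp, design_rel, robust = TRUE)
# fit_rel <- glmQLFit(yp_rel, design_rel, robust = TRUE)
# topTags(glmQLFTest(fit_rel))  # logFC signs now match the exact test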

Check the log2 fold change distributions

In [23]:
# see how many candidates by category
results_p %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results_p, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-5, 5)) +
  ggtitle("Log2 R by paired test category")
A tibble: 4 × 2
  candidate    n
  high       979
  med        472
  low        263
  no        1041

Check the other DE visualizations

In [24]:
# make the DE plots
MA_plots(results_p, "ave_R", "ave_A", "ASLA-359 TiO2 paired: Resting vs Activated")
scatter_plots(results_p, "ave_R", "ave_A", "ASLA-359 TiO2 paired: Resting vs Activated")
volcano_plot(results_p, "ave_R", "ave_A", "ASLA-359 TiO2 paired: Resting vs Activated")

Candidates look solid

Paired data may not show as large a difference in means between states. It is really the consistent differences within each sample pair that drive the statistical testing.
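
As a quick illustration (a sketch using objects already defined above), the per-donor log2 ratios for the most significant paired-test peptide show the kind of consistent within-pair shift that the paired model exploits, even though the donors differ in overall intensity.

# per-donor (activated over resting) log2 ratios for the top paired-test candidate
pep <- which.min(results_p$FDR)                   # row index of the most significant peptide
ratios <- log2(unlist(counts_pep_tmm[pep, A]) / unlist(counts_pep_tmm[pep, R]))
names(ratios) <- c("B1", "B2", "B3", "L1", "L2")  # donor labels; columns pair by donor
round(ratios, 2)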

Check some of the most statistically significant proteins (20 up and 20 down)

In [25]:
# plot the top N up down proteins
top_N <- 20

set_plot_dimensions(6, 3.5)
plot_top_tags(results_p, 5, 5, top_N)
set_plot_dimensions(7, 7)

There is more correlation between the red and blue expression patterns

We can see the paired patterns in the data for these candidates, in contrast to the exact test (which was driven more by differences in means). The paired testing looks like the right choice here.

Platelet activation causes dramatic changes in phosphorylation

We have dramatic changes in signalling levels. There are similar numbers of up- and down-regulated peptides, although activation produces larger intensity increases for the up-regulated candidates.

Check the peptide CVs for each state

In [26]:
# phosphopeptides
cv_pep <- data.frame(R_pep = CV(counts_pep_tmm[R]), A_pep = CV(counts_pep_tmm[A]))
medians <- apply(cv_pep, 2, FUN = median)
print("Phosphopeptide median CVs (%)")
round(medians, 1)

# make the boxplots
boxplot(cv_pep, notch = TRUE, main = "Phosphopeptide CV distributions", 
        ylim = c(0, 150), ylab = "CV (%)")
[1] "Phosphopeptide median CVs (%)"
R_pep
19.1
A_pep
19.6

CV distributions are similar for both states

The median CVs are reasonable for a peptide-centric experiment. The CVs are slightly larger and the data a little more variable after activation compared to the resting state.

Export the phosphopeptide results and log the session

In [27]:
# save the results data frame
results <- cbind(results_e, results_p)
write.table(results, "Cond-1_TiO2_results.txt", sep = "\t", row.names = FALSE)
In [28]:
# log the session
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.2

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.0.0    psych_1.8.12    edgeR_3.24.3    limma_3.38.3   
 [5] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
 [9] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.1  
[13] tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] pbdZMQ_0.3-3     statmod_1.4.32   locfit_1.5-9.1   tidyselect_0.2.5
 [5] repr_1.0.1       splines_3.5.3    haven_2.1.1      lattice_0.20-38 
 [9] colorspace_1.4-1 generics_0.0.2   vctrs_0.2.0      htmltools_0.3.6 
[13] base64enc_0.1-3  rlang_0.4.0      pillar_1.4.2     foreign_0.8-72  
[17] glue_1.3.1       withr_2.1.2      modelr_0.1.5     readxl_1.3.1    
[21] uuid_0.1-2       munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0
[25] rvest_0.3.4      evaluate_0.14    labeling_0.3     parallel_3.5.3  
[29] broom_0.5.2      IRdisplay_0.7.0  Rcpp_1.0.2       backports_1.1.4 
[33] IRkernel_1.0.2   jsonlite_1.6     mnormt_1.5-5     hms_0.5.1       
[37] digest_0.6.20    stringi_1.4.3    grid_3.5.3       cli_1.1.0       
[41] tools_3.5.3      magrittr_1.5     lazyeval_0.2.2   crayon_1.3.4    
[45] pkgconfig_2.0.2  zeallot_0.1.0    xml2_1.2.2       lubridate_1.7.4 
[49] assertthat_0.2.1 httr_1.4.1       rstudioapi_0.10  R6_2.4.0        
[53] nlme_3.1-141     compiler_3.5.3  