Condition 1 TiO2 TMT Phosphopeptide Analysis

Joe Aslan Lab, OHSU

Notebook prepared by Phil Wilmarth, PSR Core, OHSU

January 9, 2020

Signalling changes between resting (R) and activated (A) human platelets. TiO2-enriched phosphopeptides were labeled with 10-plex TMT reagents. This is data from the ASLA-359 project.

Some of the resulting peptides were directly labeled with 10-plex TMT reagents {Thompson 2003}. The remaining peptides were enriched for phosphopeptides and labeled with 10-plex TMT for a second set of labeled samples. After combining equal portions of the labeled peptide samples (either whole protein or enriched phosphopeptides), they were separated by high pH reverse phase/low pH reverse phase liquid chromatography and analyzed on a Thermo Fusion Tribrid Orbitrap mass spectrometer {Senko 2013}. Fragment ions were generated using CID and measured in the ion trap analyzer; reporter ions were generated by high energy collision dissociation of synchronous precursor selection (SPS) selected fragment ions and measured at high resolution in the Orbitrap analyzer {McAlister 2014}.

Peptides and proteins were identified using Proteome Discoverer v1.4 and SEQUEST {Eng 1994}. A wider 1.25 Da parent ion mass tolerance was used, TMT labels and alkylated cysteine were specified as static modifications, oxidation of methionine (+15.9949 Da) was specified as a variable modification, trypsin enzyme specificity was selected, and a canonical UniProt Swiss-Prot human protein database was searched. The fragment ion tolerance was set at 1.0005 Da. The labeled phosphopeptides were analyzed with the addition of variable phosphorylation (+79.9799 Da) on serine, threonine, or tyrosine residues. Proteome Discoverer was also configured with a phosphoRS {Taus 2011} node for site localizations. Confident peptide identifications were obtained using Percolator and the target/decoy method. PSM information (peptide sequences, q-values, masses, and reporter ion intensities) was exported to tab-delimited files for processing with PAW pipeline modules.

The PAW pipeline {Wilmarth 2009} Python scripts take the files exported from Proteome Discoverer and apply additional accurate peptide mass and minimum reporter ion intensity filtering. The exported data from Proteome Discoverer have already undergone PSM confidence filtering (Percolator q-values) and parsimonious protein inference. Reporter ion intensities are the peak heights of the most confident (closest to the expected m/z value) centroided peaks within 20 ppm of the expected positions. Normalizations and statistical testing were performed using the Bioconductor {Gentleman 2004} package edgeR {Robinson 2010}, as detailed in the first half of the notebook below. A Jupyter notebook with an R kernel was used to execute the R commands and visualize the results.
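
As an aside, the reporter ion matching rule is simple to sketch. The short function below is only a minimal R illustration of the idea (the actual extraction is done in the PAW Python scripts); pick_reporter, centroids, heights, and expected_mz are hypothetical names, not pipeline objects.

# sketch: take the height of the centroid closest to the expected reporter m/z,
# but only if it falls within a 20 ppm window (otherwise report zero intensity)
pick_reporter <- function(centroids, heights, expected_mz, ppm = 20) {
    tol <- expected_mz * ppm / 1e6                   # ppm tolerance converted to m/z units
    delta <- abs(centroids - expected_mz)            # distance of each centroid from the target
    in_window <- delta <= tol
    if (!any(in_window)) return(0)                   # no centroid close enough -> zero intensity
    heights[in_window][which.min(delta[in_window])]  # height of the closest matching centroid
}

# hypothetical example near the TMT 126 reporter region
pick_reporter(c(126.1251, 126.1283, 126.1304), c(5.2e4, 1.1e6, 3.0e3), 126.1277)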

Phosphopeptide enrichment data are quite different from protein expression data. The measurements are peptide-centric rather than protein-centric. Protein inference cannot be used as a noise filter to reduce the negative impact of incorrect PSMs, and the benefit of aggregating data to the protein level is also greatly compromised. We did a couple of things to address these issues. We tightened the q-value cutoff for PSM filtering from 5% to 1%. We also considered phospho group localization to be less reliable than determination of the base peptide sequence and the total number of phospho groups present. To get some degree of data aggregation, reducing the multiple-testing burden and improving data quality, we combined all PSMs having the same base peptide sequence and the same number of phospho groups. This aggregation step reduces the number of data points in the quantitative analysis by about a factor of 2. The aggregation is not uniform across the phosphopeptides: many peptides have only a single PSM, while others may have many combined PSMs. We carried localization information forward in a limited way. The localization of the most confident PSM (smallest q-value) is reported and annotated as consistent (all PSMs have the same site localization) or variable. Analysis of the phosphopeptide data is in the second half of this notebook.
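
The aggregation step can be sketched in R/dplyr terms as shown below. This is a conceptual sketch only (the real step happens in the PAW Python scripts before this notebook); psms, sequence, n_phospho, site_string, q_value, and reporter_intensity are hypothetical column names, and a single summed intensity column stands in for the ten reporter channels.

# sketch: pool PSMs that share a base peptide sequence and number of phospho groups,
# keep the site localization of the most confident PSM, and flag whether all PSMs
# in the group agreed on that localization
library(dplyr)

aggregate_psms <- function(psms) {
    psms %>%
        group_by(sequence, n_phospho) %>%
        summarize(intensity = sum(reporter_intensity),           # in practice each reporter channel is summed
                  best_sites = site_string[which.min(q_value)],  # localization from the smallest q-value PSM
                  localization = ifelse(n_distinct(site_string) == 1, "consistent", "variable"),
                  n_psms = n()) %>%
        ungroup()
}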

Thompson, A., Schäfer, J., Kuhn, K., Kienle, S., Schwarz, J., Schmidt, G., Neumann, T. and Hamon, C., 2003. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Analytical chemistry, 75(8), pp.1895-1904.

Senko, M.W., Remes, P.M., Canterbury, J.D., Mathur, R., Song, Q., Eliuk, S.M., Mullen, C., Earley, L., Hardman, M., Blethrow, J.D. and Bui, H., 2013. Novel parallelized quadrupole/linear ion trap/Orbitrap tribrid mass spectrometer improving proteome coverage and peptide identification rates. Analytical chemistry, 85(24), pp.11710-11714.

Eng, J.K., McCormack, A.L. and Yates, J.R., 1994. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry, 5(11), pp.976-989.

McAlister, G.C., Nusinow, D.P., Jedrychowski, M.P., Wühr, M., Huttlin, E.L., Erickson, B.K., Rad, R., Haas, W. and Gygi, S.P., 2014. MultiNotch MS3 enables accurate, sensitive, and multiplexed detection of differential expression across cancer cell line proteomes. Analytical chemistry, 86(14), pp.7150-7158.

Wilmarth, P.A., Riviere, M.A. and David, L.L., 2009. Techniques for accurate protein identification in shotgun proteomic studies of human, mouse, bovine, and chicken lenses. Journal of ocular biology, diseases, and informatics, 2(4), pp.223-234.

Taus, T., Köcher, T., Pichler, P., Paschke, C., Schmidt, A., Henrich, C. and Mechtler, K., 2011. Universal and confident phosphorylation site localization using phosphoRS. Journal of proteome research, 10(12), pp.5354-5362.

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. and Hornik, K., 2004. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 5(10), p.R80.

Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.

In [1]:
# load the libraries
library(tidyverse)
library(stringr)
library(edgeR)
library(limma)
library(psych)
library(scales)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   0.8.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: limma

Attaching package: ‘psych’

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha


Attaching package: ‘scales’

The following objects are masked from ‘package:psych’:

    alpha, rescale

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor

Define notebook functions

In [2]:
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}

# ============== CV function ===================================================
CV <- function(df) {
    # Computes CVs of data frame rows
        # df - data frame, 
        # returns vector of CVs (%)
    ave <- rowMeans(df)    # compute averages
    sd <- apply(df, 1, sd) # compute standard deviations
    cv <- 100 * sd / ave   # compute CVs in percent (last thing gets returned)
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accession
get_identifier <- function(accession) {
    identifier <- str_split(accession, ";", simplify = TRUE)
    identifier[,1]
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top upregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1))
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

Phosphopeptide data

Read in the data exported from Excel

Load data into edgeR and normalize

In most protein expression studies, the assumption that the majority of proteins do not change in expression level is central to normalization strategies. This assumption is much less tested in phosphopeptide enrichment studies. We need to pay close attention to the sizes of the normalization factors and the alignment of the boxplots to verify that the normalization methods used for the whole protein analysis above are still valid.
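
One quick way to eyeball that assumption here (a sketch, using the counts_pep data frame built in the next few cells) is to look at the per-sample median log2 ratios relative to the row averages; if the bulk of the phosphopeptides are unchanged, those medians should sit near zero.

# sketch: per-sample median log2 ratio to the row average as a sanity check on
# the "majority unchanged" assumption behind library-size and TMM normalization
ref <- rowMeans(counts_pep)                            # average intensity of each peptide
m_values <- log2(sweep(counts_pep, 1, ref, FUN = "/")) # log2 ratio of each sample to that average
round(apply(m_values, 2, median), 2)                   # medians near 0 support the assumption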

In [3]:
# read in the data export file
data_all_pep <- read_tsv("Cond-1_TiO2.txt")
Parsed with column specification:
cols(
  Accession = col_character(),
  `L1-CRP` = col_double(),
  `L1+CRP` = col_double(),
  `L2-CRP` = col_double(),
  `L2+CRP` = col_double(),
  `B1-CRP` = col_double(),
  `B1+CRP` = col_double(),
  `B2-CRP` = col_double(),
  `B2+CRP` = col_double(),
  `B3-CRP` = col_double(),
  `B3+CRP` = col_double()
)
In [4]:
# save accessions and remove from data
accessions <- data_all_pep$Accession
counts_pep <- data_all_pep %>% select(-Accession)

# get the data arranged for the testing (resting first, then activated)
minus <- select(counts_pep, contains("-"))
minus <- minus[, order(colnames(minus))]
plus <- select(counts_pep, contains("+"))
plus <- plus[, order(colnames(plus))]
counts_pep <- cbind(minus, plus)
In [5]:
# define some column indexes
R  <- 1:5 # resting
A <- 6:10 # activated

# set color vector to match sample groups
color = c(rep("red", length(R)), rep("blue", length(A)))

head(counts_pep)
length(accessions)
A data.frame: 6 × 10

    B1-CRP   B2-CRP   B3-CRP   L1-CRP   L2-CRP   B1+CRP   B2+CRP   B3+CRP   L1+CRP   L2+CRP
1 44699890 55212664 59159248 63175915 71323887 40915230 64929740 63861431 68620369 76828383
2 29538733 31449192 31232703 35274394 36966379 18791698 19961754 18230789 21057395 19786619
3 15445490 22148300 19726660 19958250 26794380 13477450 26006550 22767600 27371970 32648500
4 18344352 18896712 17846973 21265201 32230854 15578337 20185362 19703876 22695618 32892582
5  8648280 11096920 10568750 10304870 20364050  9375540  8824140 10491280 11367700 20330260
6  4370668  5319673  4352876  5292394  4994891  9850623 10982599 11111301 12367885 12193458

2755
In [6]:
boxplot(log10(counts_pep), col = color,
        xlab = 'TMT samples', ylab = 'log10 Intensity', 
        main = 'Condition 1 TiO2 Starting data', notch = TRUE)
In [7]:
# load into edgeR DGEList object and normalize the data
group = c(rep("R", 5), rep("A", 5))
ye <- DGEList(counts = counts_pep, group = group, gene = accessions)
ye <- calcNormFactors(ye)

Save the TMM normalized data

In [8]:
counts_pep_tmm <- apply_tmm_factors(ye, color)
Library size factors:
 B1-CRP -> 1.156073
 B2-CRP -> 1.036589
 B3-CRP -> 1.094500
 L1-CRP -> 1.004538
 L2-CRP -> 0.910289
 B1+CRP -> 1.117783
 B2+CRP -> 0.962666
 B3+CRP -> 0.998731
 L1+CRP -> 0.946977
 L2+CRP -> 0.853290

Trimmed mean of M-values (TMM) factors:
 B1-CRP -> 0.914401
 B2-CRP -> 0.951629
 B3-CRP -> 1.002488
 L1-CRP -> 0.969543
 L2-CRP -> 1.033312
 B1+CRP -> 0.950687
 B2+CRP -> 1.011083
 B3+CRP -> 1.046773
 L1+CRP -> 1.026112
 L2+CRP -> 1.108274

Combined (lib size and TMM) normalization factors:
 B1-CRP -> 1.057113
 B2-CRP -> 0.986448
 B3-CRP -> 1.097223
 L1-CRP -> 0.973943
 L2-CRP -> 0.940613
 B1+CRP -> 1.062661
 B2+CRP -> 0.973335
 B3+CRP -> 1.045444
 L1+CRP -> 0.971705
 L2+CRP -> 0.945679

Normalization factors are close to 1 and boxplots look okay

The normalization checks seem fine, so we should be in good shape for statistical testing.

Check sample-to-sample similarity

We can also do the multi-panel scatter plots for the phosphopeptides to compare samples within each condition.

In [9]:
# check sample-to-sample similarity
pairs.panels(log10(counts_pep_tmm[R]), lm = TRUE, main = "Condition 1 Phosphopeptides: resting")
pairs.panels(log10(counts_pep_tmm[A]), lm = TRUE, main = "Condition 1 Phosphopeptides: activated")

Peptide-centric data show a little more scatter

We do not have as much aggregation in peptide-centric phospho studies as we do in protein expression studies, so there is less averaging to reduce the sample-to-sample scatter. We also have a bit more missing data that was imputed (the points along the axes).

Check that samples cluster by condition and check variance

In [10]:
# check the clustering (set colors by condition)
plotMDS(ye, col = color, main = "Condition 1: Resting (red) and Activated (blue)")

Resting and activated are well separated

The resting samples are along the left and the activated samples are on the right. We still see the samples pairing top-to-bottom. A paired study design might be statistically more powerful.

Do an exact test first, then use a paired study design.

An exact test will establish a robust (safe) baseline for differential expression. We will then do paired testing.

We need the trended variance.

In [11]:
# compute dispersions and plot BCV
ye <- estimateDisp(ye)
plotBCV(ye, main = "Phosphopeptide Variance Trend")
Design matrix not provided. Switch to the classic mode.

Data are ready for statistical testing

The cluster plot indicates that we have some clear separation between conditions. The scatter plots show that sample-to-sample reproducibility is very good.

Perform an exact test with edgeR and look at the significant peptides.

In [12]:
# the exact test object has columns like fold-change, CPM, and p-values
# we have already loaded the data into "ye" above, normalized, and estimated dispersion
ete <- exactTest(ye, pair = c("R", "A"))

# this counts up, down, and unchanged genes (here they are phosphopeptides)
summary(decideTestsDGE(ete, p.value = 0.10))
topTags(ete)$table

# the topTags function adds the BH FDR values to an exactTest data frame 
# make sure we do not change the row order!
tte <- topTags(ete, n = Inf, sort.by = "none")$table

results_e <- collect_results(counts_pep_tmm, tte, R, "R", A, "A")
        A-R
Down    381
NotSig 1727
Up      647
A data.frame: 10 × 5

      genes     logFC    logCPM       PValue          FDR
439  Q92956  4.001242  8.553359 7.774902e-90 2.141986e-86
440  Q0ZGT2  3.147127  8.551902 4.619902e-74 6.363914e-71
556  P16949  3.522024  8.144731 1.408039e-65 1.293049e-62
388  Q8IVT5 -3.162106  8.735127 2.202818e-61 1.517191e-58
17   Q0ZGT2  2.730462 12.628000 8.183778e-59 4.509262e-56
111  Q96SB3  2.194340 10.364713 3.217704e-56 1.477462e-53
370  Q8IY63  2.628336  8.766501 1.349076e-53 5.309579e-51
707  Q04759  3.008434  7.695881 2.169222e-53 7.470260e-51
863  Q9UQL6  2.543998  7.295454 5.860132e-53 1.793851e-50
411  Q68D51  2.485389  8.630386 4.278355e-45 1.178687e-42

Check MA plot and p-value distribution

In [13]:
# make an MD plot
plotMD(ete, p.value = 0.10 , main = "Phosphopeptide Exact Test: Resting vs Activated")
abline(h = c(-1, 1), col = "black")

# check the p-value distribution
ggplot(tte, aes(PValue)) + 
  geom_histogram(bins = 100, fill = "white", color = "black") + 
  geom_hline(yintercept = mean(hist(ete$table$PValue, breaks = 100, 
                                    plot = FALSE)$counts[26:100])) +
  ggtitle("Phosphopeptide Exact Test: Resting vs Activated")

Can also do more candidate visualizations

We can look at distributions of log2 fold change, fancier MA plots, scatter plots, and the popular volcano plot.

In [14]:
# see how many candidates by category
results_e %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results_e, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-5, 5)) +
  ggtitle("Log2 R by exact test category")
A tibble: 4 × 2
  candidate    n
  high       715
  med        162
  low        151
  no        1727
In [15]:
### Dotted lines are 2-fold and solid line is 1-to-1
In [16]:
# make the DE plots
MA_plots(results_e, "ave_R", "ave_A", "ASLA-359 Resting vs Activated - Exact Test")
scatter_plots(results_e, "ave_R", "ave_A", "ASLA-359 Resting vs Activated - Exact Test")
volcano_plot(results_e, "ave_R", "ave_A", "ASLA-359 Resting vs Activated - Exact Test")

Look at some of the top tag proteins

In [17]:
# plot the top N up down proteins
top_N <- 20

set_plot_dimensions(6, 3.5)
plot_top_tags(results_e, 5, 5, top_N)
set_plot_dimensions(7, 7)

All of the candidate plots look convincing

Paired samples testing

We can use the general linear modeling (glm) extensions in edgeR to do a paired study design.

In [18]:
# load into edgeR DGEList object and normalize the data
group = c(rep("R", 5), rep("A", 5))
yp <- DGEList(counts = counts_pep, group = group, gene = accessions)
yp <- calcNormFactors(yp)
In [19]:
# create the experimental design matrix
subject <- factor(rep(c("B1", "B2", "B3", "L1", "L2"), 2))
state <- factor(c(rep("R", 5), rep("A", 5)))

# Example 4.1 in edgeR user's guide
design <- model.matrix(~subject+state)
rownames(design) <- colnames(yp)
design
A matrix: 10 × 6 of type dbl

       (Intercept) subjectB2 subjectB3 subjectL1 subjectL2 stateR
B1-CRP           1         0         0         0         0      1
B2-CRP           1         1         0         0         0      1
B3-CRP           1         0         1         0         0      1
L1-CRP           1         0         0         1         0      1
L2-CRP           1         0         0         0         1      1
B1+CRP           1         0         0         0         0      0
B2+CRP           1         1         0         0         0      0
B3+CRP           1         0         1         0         0      0
L1+CRP           1         0         0         1         0      0
L2+CRP           1         0         0         0         1      0
In [20]:
# estimate the dispersion parameters and check
yp <- estimateDisp(yp, design, robust = TRUE)
yp$common.dispersion
0.0489842575982801
In [21]:
# fit statistical models (design matrix already in y$design)
fit <- glmQLFit(yp, design, robust = TRUE)
plotQLDisp(fit, main = "Condition 1 - Paired testing")

Data has been loaded, normalized, and trended dispersion modeled

We now run the linear modeling and check the results.

In [22]:
# if we do not specify a contrast, the default is the last column
# of the design matrix - a Resting versus Activated comparison
paired <- glmQLFTest(fit) # default comparison

# check test results
topTags(paired)$table
ttp <- topTags(paired, n = Inf, sort.by = "none")$table
summary(decideTests(paired, p.value = 0.10))

# make basic MA plot
plotMD(paired, p.value = 0.10, main = "Condition 1 TiO2 - Paired test")
abline(h = c(-1, 1), col = "black") # 2-fold lines
    
# check the p-value distribution
ggplot(ttp, aes(PValue)) + 
    geom_histogram(bins = 100, fill = "white", color = "black") + 
    geom_hline(yintercept = mean(hist(ttp$PValue, breaks = 100, 
                                 plot = FALSE)$counts[30:100])) +
    ggtitle("p-value distribution: paired test")

# collect and reformat the test results
results_p <- collect_results(counts_pep_tmm, ttp, R, "R", A, "A")

# correct the changed reference order for the glm testing
results_p$logFC <- -1 * results_p$logFC
A data.frame: 10 × 6

       genes     logFC    logCPM         F       PValue          FDR
17    Q0ZGT2 -2.723305 12.628000 1459.0333 5.317393e-10 1.464942e-06
111   Q96SB3 -2.197204 10.364713 1055.6473 1.817976e-09 1.893071e-06
12    Q15942  1.797312 13.243466 1021.2594 2.061420e-09 1.893071e-06
85    Q8TF42 -2.307422 10.871584  858.2046 3.986861e-09 2.745950e-06
370   Q8IY63 -2.626067  8.766501  761.5332 6.269270e-09 3.398042e-06
863   Q9UQL6 -2.545984  7.295454  704.7447 8.405961e-09 3.398042e-06
44    Q14644 -1.465140 11.563990  699.7786 8.633863e-09 3.398042e-06
439   Q92956 -4.007752  8.553359  609.9762 1.451041e-08 4.553010e-06
1197  Q5VZK9 -2.730660  6.552094  605.9958 1.487372e-08 4.553010e-06
440   Q0ZGT2 -3.148196  8.551902  568.4728 1.893358e-08 4.805120e-06
       stateR
Down      803
NotSig   1041
Up        911

We have about 700 more DE candidates with the paired testing

Note: the glm testing does not use the same sign convention for the logFC as the exact test did. The MA plot above is inverted compared to the one from the exact test earlier in the notebook.
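
If the sign flip is a nuisance, one alternative (a sketch, not what was run above) is to set "R" as the reference level of the state factor before building the design matrix, so the fitted coefficient is already activated relative to resting and no manual correction of logFC is needed.

# sketch: relevel the state factor so the fitted coefficient is A relative to R
state_rel <- relevel(state, ref = "R")
design_rel <- model.matrix(~ subject + state_rel)
rownames(design_rel) <- colnames(yp)
# yp_rel <- estimateDisp(yp, design_rel, robust = TRUE)
# fit_rel <- glmQLFit(yp_rel, design_rel, robust = TRUE)
# topTags(glmQLFTest(fit_rel))  # logFC signs now match the exact test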

Check the log2 fold change distributions

In [23]:
# see how many candidates by category
results_p %>% count(candidate)

# plot log2 fold-changes by category
ggplot(results_p, aes(x = logFC, fill = candidate)) +
  geom_histogram(binwidth=0.1, color = "black") +
  facet_wrap(~candidate) +
  coord_cartesian(xlim = c(-5, 5)) +
  ggtitle("Log2 R by paired test category")
A tibble: 4 × 2
  candidate    n
  high       979
  med        472
  low        263
  no        1041

Check the other DE visualizations

In [24]:
# make the DE plots
MA_plots(results_p, "ave_R", "ave_A", "ASLA-359 TiO2 paired: Resting vs Activated")
scatter_plots(results_p, "ave_R", "ave_A", "ASLA-359 TiO2 paired: Resting vs Activated")
volcano_plot(results_p, "ave_R", "ave_A", "ASLA-359 TiO2 paired: Resting vs Activated")

Candidates look solid

Paired data may not show as large a difference in means between states. It is really the consistent differences within each sample pair that drive the statistical testing.
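
As a quick illustration (a sketch using objects already defined above), the per-donor log2 ratios for the most significant paired-test peptide show the kind of consistent within-pair shift that the paired model exploits, even though the donors differ in overall intensity.

# per-donor (activated over resting) log2 ratios for the top paired-test candidate
pep <- which.min(results_p$FDR)                   # row index of the most significant peptide
ratios <- log2(unlist(counts_pep_tmm[pep, A]) / unlist(counts_pep_tmm[pep, R]))
names(ratios) <- c("B1", "B2", "B3", "L1", "L2")  # donor labels; columns pair by donor
round(ratios, 2)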

Check some of the most statistically significant proteins (20 up and 20 down)

In [25]:
# plot the top N up down proteins
top_N <- 20

set_plot_dimensions(6, 3.5)
plot_top_tags(results_p, 5, 5, top_N)
set_plot_dimensions(7, 7)

There is more correlation between the red and blue expression patterns

We can see the paired patterns in the data for these candidates, in contrast to the exact test (which was driven more by differences in means). The paired testing looks like the right choice here.

Platelet activation causes dramatic changes in phosphorylation

We have dramatic changes in signalling levels. There are similar numbers of up- and down-regulated peptides, although activation produces larger intensity increases for the up-regulated candidates.

Check the peptide CVs for each state

In [26]:
# phosphopeptides
cv_pep <- data.frame(R_pep = CV(counts_pep_tmm[R]), A_pep = CV(counts_pep_tmm[A]))
medians <- apply(cv_pep, 2, FUN = median)
print("Phosphopeptide median CVs (%)")
round(medians, 1)

# make the boxplots
boxplot(cv_pep, notch = TRUE, main = "Phosphopeptide CV distributions", 
        ylim = c(0, 150), ylab = "CV (%)")
[1] "Phosphopeptide median CVs (%)"
R_pep
19.1
A_pep
19.6

CV distributions are similar for both states

The median CVs are reasonable for a peptide-centric experiment. The CVs are slightly larger and the data a little more variable after activation compared to the resting state.

Export the phosphopeptide results and log the session

In [27]:
# save the results data frame
results <- cbind(results_e, results_p)
write.table(results, "Cond-1_TiO2_results.txt", sep = "\t", row.names = FALSE)
In [28]:
# log the session
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.2

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.0.0    psych_1.8.12    edgeR_3.24.3    limma_3.38.3   
 [5] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
 [9] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.1  
[13] tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] pbdZMQ_0.3-3     statmod_1.4.32   locfit_1.5-9.1   tidyselect_0.2.5
 [5] repr_1.0.1       splines_3.5.3    haven_2.1.1      lattice_0.20-38 
 [9] colorspace_1.4-1 generics_0.0.2   vctrs_0.2.0      htmltools_0.3.6 
[13] base64enc_0.1-3  rlang_0.4.0      pillar_1.4.2     foreign_0.8-72  
[17] glue_1.3.1       withr_2.1.2      modelr_0.1.5     readxl_1.3.1    
[21] uuid_0.1-2       munsell_0.5.0    gtable_0.3.0     cellranger_1.1.0
[25] rvest_0.3.4      evaluate_0.14    labeling_0.3     parallel_3.5.3  
[29] broom_0.5.2      IRdisplay_0.7.0  Rcpp_1.0.2       backports_1.1.4 
[33] IRkernel_1.0.2   jsonlite_1.6     mnormt_1.5-5     hms_0.5.1       
[37] digest_0.6.20    stringi_1.4.3    grid_3.5.3       cli_1.1.0       
[41] tools_3.5.3      magrittr_1.5     lazyeval_0.2.2   crayon_1.3.4    
[45] pkgconfig_2.0.2  zeallot_0.1.0    xml2_1.2.2       lubridate_1.7.4 
[49] assertthat_0.2.1 httr_1.4.1       rstudioapi_0.10  R6_2.4.0        
[53] nlme_3.1-141     compiler_3.5.3  