This tutorial demonstrates how to analyse data from SILAC MS/MS experiment in Proteus. We strongly suggest the user should read the main tutorial first and familiarize themselves with the package.
vignette("proteus")
Here, we are only going to point out the differences between label-free and SILAC analysis. For this, we use a small subset of data from Ly et al. (2018). These data are distributed in a separate package proteusSILAC, which needs loading first:
library(proteusSILAC)
data(proteusSILAC)
The default measure.cols
object is designed for label-free data. For SILAC data we need to specify isotope ratios. Our data contain M/L ratios:
measCols <- list(
ML = "ratio_ml"
)
The column naming convention in our evidence file is different from the default, included in the evidenceColumns
object. Hence, we need to create a new named list, as follows:
eviCols <- list(
sequence = 'pep_sequence',
modified_sequence = 'modified_sequence',
modifications = 'modifications',
protein_group = 'proteins',
protein = 'leading_razor_protein',
experiment = 'experiment',
charge = 'charge',
reverse = 'reverse',
contaminant = 'potential_contaminant'
)
Due to its large size we do not attach the SILAC evidence file to this package. The command to read the file would be:
evi <- readEvidenceFile(evidenceFile, measure.cols=measCols, data.cols=eviCols, zeroes.are.missing=FALSE)
Please note the argument zeroes.are.missing=FALSE
. By default, readEvidenceFile
assumes that zeroes represent missing data (which usually is true for label-free and TMT experiments). In this case we read ratios which, in theory, can be zeroes. In most cases its doesn’t make any difference.
We do attach an example of processed evidence data with this package, it is called evi
. It contains all reported intensity columns:
head(evi)
## sequence modified_sequence modifications protein_group
## 5 LTLNSPIFDK _LTLNSPIFDK_ Unmodified Q00536;Q00536-3;Q00536-2
## 11 LTLNSPIFDK _LTLNSPIFDK_ Unmodified Q00536;Q00536-3;Q00536-2
## 12 LTLNSPIFDK _LTLNSPIFDK_ Unmodified Q00536;Q00536-3;Q00536-2
## 13 LTLNSPIFDK _LTLNSPIFDK_ Unmodified Q00536;Q00536-3;Q00536-2
## 18 LTLPEEHPCLK _LTLPEEHPCLK_ Unmodified Q9HCU4
## 19 LTLPEEHPCLK _LTLPEEHPCLK_ Unmodified Q9HCU4
## protein experiment charge reverse contaminant ML
## 5 Q00536 LM_VEHvsOHT_T01_R2 2 0.93639
## 11 Q00536 LM_VEHvsOHT_T48_R2 2 0.42157
## 12 Q00536 LM_VEHvsOHT_T01_R2 2 0.71518
## 13 Q00536 LM_VEHvsOHT_T01_R3 2 0.67556
## 18 Q9HCU4 LM_VEHvsOHT_T01_R3 3 0.71851
## 19 Q9HCU4 LM_VEHvsOHT_T01_R3 3 0.71851
Metadata for this experiment needs to specify all the reporter columns, experiments and corresponding condition and replicates:
metadataFile <- system.file("extdata", "metadata.txt", package="proteusSILAC")
meta <- read.delim(metadataFile, header=TRUE, sep="\t")
meta
## experiment measure sample condition replicate
## 1 LM_VEHvsOHT_T01_R1 ratio_ml T01_R1 T01 1
## 2 LM_VEHvsOHT_T01_R2 ratio_ml T01_R2 T01 2
## 3 LM_VEHvsOHT_T01_R3 ratio_ml T01_R3 T01 3
## 4 LM_VEHvsOHT_T48_R1 ratio_ml T48_R1 T48 1
## 5 LM_VEHvsOHT_T48_R2 ratio_ml T48_R2 T48 2
## 6 LM_VEHvsOHT_T48_R3 ratio_ml T48_R3 T48 3
We chose the median to aggregate peptide intensities from multiple entries in the evidence data:
pepdat <- makePeptideTable(evi, meta, measure.cols=measCols, aggregate.fun=aggregateMedian, experiment.type="SILAC")
We can use generic summary
function to see more information about xppepdat
.
summary(pepdat)
##
## *** Basic statistics ***
##
## content = peptide
## experiment type = SILAC
## number of samples = 6
## number of conditions = 2
## number of peptides = 44826
## samples = T01_R1, T01_R2, T01_R3, T48_R1, T48_R2, T48_R3
## conditions = T01, T48
##
## *** Data processing ***
##
## evidence columns used = ratio_ml
## sequence = 'Sequence'
## protein = 'Leading razor protein'
## normalization = identity
This is the number of non-zero peptide intensities per sample.
plotCount(pepdat)
Now we create protein data. The aggregate.fun=aggregateMedian
indicates that protein ratios are medians of constituent peptide ratios. This approach is appropriate for SILAC data, unlike the default hi-flyer method which only works for label-free (or TMT) data.
prodat <- makeProteinTable(pepdat, aggregate.fun=aggregateMedian)
Again, we can use a generic summary
function to see its properties.
summary(prodat)
##
## *** Basic statistics ***
##
## content = protein
## experiment type = SILAC
## number of samples = 6
## number of conditions = 2
## number of proteins = 8295
## samples = T01_R1, T01_R2, T01_R3, T48_R1, T48_R2, T48_R3
## conditions = T01, T48
##
## *** Data processing ***
##
## evidence columns used = ratio_ml
## sequence = 'Sequence'
## protein = 'Leading razor protein'
## normalization = identity
##
## *** Protein data ***
##
## quantification method =
## minimum number of peptides per protein = 1
We normalize protein ratios to the median, that is, after the normalization each sample is going to have the same median. Median is the default normalization, so we do not need to specify the norm.fun
parameter:
prodat.norm <- normalizeData(prodat)
These two figures show reporter intensity distributions before and after normalization.
plotSampleDistributions(prodat, fill="replicate") + labs(title="Before")
plotSampleDistributions(prodat.norm, fill="replicate") + labs(title="After")
For SILAC data we sometimes need to do differential expression on an isotope ratio, e.g. \(M/L\). In our case, \(M\) and \(L\) labelled two different biological conditions, hence the \(M/L\) ratio compares them. The null hypothesis is that \(M/L = 1\) or \(\log H/L = 0\). Because SILAC ratios tend to be symmetric in log space (see figure above), we chose the latter approach. Let’s do the differential expression for the time point T48
.
res <- limmaRatioDE(prodat.norm, condition="T48")
res <- res[order(res$P.Value),]
head(res)
## protein logFC AveExpr t P.Value adj.P.Val
## 1114 P03372-3 1.872584 1.872584 18.80576 4.875611e-05 0.1197013
## 1995 P38571-2 -1.932010 -1.932010 -18.55556 5.140080e-05 0.1197013
## 6897 Q9H9E1 -5.399857 -5.399857 -34.60810 5.573808e-05 0.1197013
## 1792 P29508 2.997129 2.997129 16.66563 7.847507e-05 0.1197013
## 4065 Q68CZ2 -1.382897 -1.382897 -16.63461 7.905253e-05 0.1197013
## 6342 Q9BT25-2 -1.277429 -1.277429 -14.28965 1.435558e-04 0.1401269
## B significant ngood
## 1114 2.1160676 FALSE 3
## 1995 2.0924707 FALSE 3
## 6897 0.5012036 FALSE 2
## 1792 1.8887352 FALSE 3
## 4065 1.8849703 FALSE 3
## 6342 1.5510637 FALSE 3
There are no statistically significant departures from the null hypothesis. Here is the volcano plot:
plotVolcano(res, binhex=FALSE)
## Warning: Removed 724 rows containing missing values (geom_point).
Alternatively, we can compare the two time points with a two-condition differential expression:
res2 <- limmaDE(prodat.norm)
## Warning: Partial NA coefficients for 2509 probe(s)
res2 <- res2[order(res2$P.Value),]
head(res2)
## protein logFC AveExpr t P.Value adj.P.Val
## 1114 P03372-3 2.019154 0.8630066 12.844410 1.159249e-05 0.06707415
## 1935 P35556 1.850429 0.2611482 9.660038 6.182255e-05 0.11272898
## 1841 P30740 1.560127 0.6027596 8.324268 1.454501e-04 0.11272898
## 1627 P21980 1.735755 0.8878502 9.611066 1.790863e-04 0.11272898
## 6673 Q9H0W9-2 -2.308303 -0.7049039 -12.273070 2.094581e-04 0.11272898
## 393 O15553 1.752696 0.3980035 9.283081 2.122366e-04 0.11272898
## B significant mean_T01 mean_T48 ngood_T01 ngood_T48
## 1114 3.6502718 FALSE -0.1465705 1.872584 3 3
## 1935 2.3732229 FALSE -0.6640666 1.186363 3 3
## 1841 1.6381649 FALSE -0.1773040 1.382823 3 3
## 1627 1.3689398 FALSE -0.1536026 1.582152 2 3
## 6673 0.6345338 FALSE -0.1278281 -2.436131 3 1
## 393 1.2394338 FALSE -0.3030749 1.449621 3 2
Though not statistically significant at the assumes 0.05 level, protein P03372-3 stands out. Here is a SILAC ratio plot for this protein:
plotIntensities(prodat.norm, id="P03372-3")