Using proteus R package: SILAC data

Marek Gierlinski

This tutorial demonstrates how to analyse data from SILAC MS/MS experiment in Proteus. We strongly suggest the user should read the main tutorial first and familiarize themselves with the package.

vignette("proteus")

Here, we are only going to point out the differences between label-free and SILAC analysis. For this, we use a small subset of data from Ly et al. (2018). These data are distributed in a separate package proteusSILAC, which needs loading first:

library(proteusSILAC)
data(proteusSILAC)

1 Input data

1.1 Columns in evidence file

The default measure.cols object is designed for label-free data. For SILAC data we need to specify isotope ratios. Our data contain M/L ratios:

measCols <- list(
  ML = "ratio_ml"
)

The column naming convention in our evidence file is different from the default, included in the evidenceColumns object. Hence, we need to create a new named list, as follows:

eviCols <- list(
  sequence = 'pep_sequence',
  modified_sequence = 'modified_sequence',
  modifications = 'modifications',
  protein_group = 'proteins',
  protein = 'leading_razor_protein',
  experiment = 'experiment',
  charge = 'charge',
  reverse = 'reverse',
  contaminant = 'potential_contaminant'
)

1.2 Read evidence file

Due to its large size we do not attach the SILAC evidence file to this package. The command to read the file would be:

evi <- readEvidenceFile(evidenceFile, measure.cols=measCols, data.cols=eviCols, zeroes.are.missing=FALSE)

Please note the argument zeroes.are.missing=FALSE. By default, readEvidenceFile assumes that zeroes represent missing data (which usually is true for label-free and TMT experiments). In this case we read ratios which, in theory, can be zeroes. In most cases its doesn’t make any difference.

We do attach an example of processed evidence data with this package, it is called evi. It contains all reported intensity columns:

head(evi)
##       sequence modified_sequence modifications            protein_group
## 5   LTLNSPIFDK      _LTLNSPIFDK_    Unmodified Q00536;Q00536-3;Q00536-2
## 11  LTLNSPIFDK      _LTLNSPIFDK_    Unmodified Q00536;Q00536-3;Q00536-2
## 12  LTLNSPIFDK      _LTLNSPIFDK_    Unmodified Q00536;Q00536-3;Q00536-2
## 13  LTLNSPIFDK      _LTLNSPIFDK_    Unmodified Q00536;Q00536-3;Q00536-2
## 18 LTLPEEHPCLK     _LTLPEEHPCLK_    Unmodified                   Q9HCU4
## 19 LTLPEEHPCLK     _LTLPEEHPCLK_    Unmodified                   Q9HCU4
##    protein         experiment charge reverse contaminant      ML
## 5   Q00536 LM_VEHvsOHT_T01_R2      2                     0.93639
## 11  Q00536 LM_VEHvsOHT_T48_R2      2                     0.42157
## 12  Q00536 LM_VEHvsOHT_T01_R2      2                     0.71518
## 13  Q00536 LM_VEHvsOHT_T01_R3      2                     0.67556
## 18  Q9HCU4 LM_VEHvsOHT_T01_R3      3                     0.71851
## 19  Q9HCU4 LM_VEHvsOHT_T01_R3      3                     0.71851

1.3 Metadata

Metadata for this experiment needs to specify all the reporter columns, experiments and corresponding condition and replicates:

metadataFile <- system.file("extdata", "metadata.txt", package="proteusSILAC")
meta <- read.delim(metadataFile, header=TRUE, sep="\t")
meta
##           experiment  measure sample condition replicate
## 1 LM_VEHvsOHT_T01_R1 ratio_ml T01_R1       T01         1
## 2 LM_VEHvsOHT_T01_R2 ratio_ml T01_R2       T01         2
## 3 LM_VEHvsOHT_T01_R3 ratio_ml T01_R3       T01         3
## 4 LM_VEHvsOHT_T48_R1 ratio_ml T48_R1       T48         1
## 5 LM_VEHvsOHT_T48_R2 ratio_ml T48_R2       T48         2
## 6 LM_VEHvsOHT_T48_R3 ratio_ml T48_R3       T48         3

2 Peptide data

2.1 Create a peptide dataset

We chose the median to aggregate peptide intensities from multiple entries in the evidence data:

pepdat <- makePeptideTable(evi, meta, measure.cols=measCols, aggregate.fun=aggregateMedian, experiment.type="SILAC")

We can use generic summary function to see more information about xppepdat.

summary(pepdat)
## 
## *** Basic statistics ***
## 
##   content = peptide
##   experiment type = SILAC
##   number of samples = 6
##   number of conditions = 2
##   number of peptides = 44826
##   samples = T01_R1, T01_R2, T01_R3, T48_R1, T48_R2, T48_R3
##   conditions = T01, T48
## 
## *** Data processing ***
## 
##   evidence columns used = ratio_ml
##   sequence = 'Sequence'
##   protein = 'Leading razor protein'
##   normalization = identity

2.2 Number of peptides

This is the number of non-zero peptide intensities per sample.

plotCount(pepdat)

3 Protein data

3.1 Create protein dataset

Now we create protein data. The aggregate.fun=aggregateMedian indicates that protein ratios are medians of constituent peptide ratios. This approach is appropriate for SILAC data, unlike the default hi-flyer method which only works for label-free (or TMT) data.

prodat <- makeProteinTable(pepdat, aggregate.fun=aggregateMedian)

Again, we can use a generic summary function to see its properties.

summary(prodat)
## 
## *** Basic statistics ***
## 
##   content = protein
##   experiment type = SILAC
##   number of samples = 6
##   number of conditions = 2
##   number of proteins = 8295
##   samples = T01_R1, T01_R2, T01_R3, T48_R1, T48_R2, T48_R3
##   conditions = T01, T48
## 
## *** Data processing ***
## 
##   evidence columns used = ratio_ml
##   sequence = 'Sequence'
##   protein = 'Leading razor protein'
##   normalization = identity
## 
## *** Protein data ***
## 
##   quantification method = 
##   minimum number of peptides per protein = 1

3.2 Normalization

We normalize protein ratios to the median, that is, after the normalization each sample is going to have the same median. Median is the default normalization, so we do not need to specify the norm.fun parameter:

prodat.norm <- normalizeData(prodat)

These two figures show reporter intensity distributions before and after normalization.

plotSampleDistributions(prodat, fill="replicate") + labs(title="Before")

plotSampleDistributions(prodat.norm, fill="replicate") + labs(title="After")

3.3 Differential expression

For SILAC data we sometimes need to do differential expression on an isotope ratio, e.g. \(M/L\). In our case, \(M\) and \(L\) labelled two different biological conditions, hence the \(M/L\) ratio compares them. The null hypothesis is that \(M/L = 1\) or \(\log H/L = 0\). Because SILAC ratios tend to be symmetric in log space (see figure above), we chose the latter approach. Let’s do the differential expression for the time point T48.

res <- limmaRatioDE(prodat.norm, condition="T48")
res <- res[order(res$P.Value),]
head(res)
##       protein     logFC   AveExpr         t      P.Value adj.P.Val
## 1114 P03372-3  1.872584  1.872584  18.80576 4.875611e-05 0.1197013
## 1995 P38571-2 -1.932010 -1.932010 -18.55556 5.140080e-05 0.1197013
## 6897   Q9H9E1 -5.399857 -5.399857 -34.60810 5.573808e-05 0.1197013
## 1792   P29508  2.997129  2.997129  16.66563 7.847507e-05 0.1197013
## 4065   Q68CZ2 -1.382897 -1.382897 -16.63461 7.905253e-05 0.1197013
## 6342 Q9BT25-2 -1.277429 -1.277429 -14.28965 1.435558e-04 0.1401269
##              B significant ngood
## 1114 2.1160676       FALSE     3
## 1995 2.0924707       FALSE     3
## 6897 0.5012036       FALSE     2
## 1792 1.8887352       FALSE     3
## 4065 1.8849703       FALSE     3
## 6342 1.5510637       FALSE     3

There are no statistically significant departures from the null hypothesis. Here is the volcano plot:

plotVolcano(res, binhex=FALSE)
## Warning: Removed 724 rows containing missing values (geom_point).

Alternatively, we can compare the two time points with a two-condition differential expression:

res2 <- limmaDE(prodat.norm)
## Warning: Partial NA coefficients for 2509 probe(s)
res2 <- res2[order(res2$P.Value),]
head(res2)
##       protein     logFC    AveExpr          t      P.Value  adj.P.Val
## 1114 P03372-3  2.019154  0.8630066  12.844410 1.159249e-05 0.06707415
## 1935   P35556  1.850429  0.2611482   9.660038 6.182255e-05 0.11272898
## 1841   P30740  1.560127  0.6027596   8.324268 1.454501e-04 0.11272898
## 1627   P21980  1.735755  0.8878502   9.611066 1.790863e-04 0.11272898
## 6673 Q9H0W9-2 -2.308303 -0.7049039 -12.273070 2.094581e-04 0.11272898
## 393    O15553  1.752696  0.3980035   9.283081 2.122366e-04 0.11272898
##              B significant   mean_T01  mean_T48 ngood_T01 ngood_T48
## 1114 3.6502718       FALSE -0.1465705  1.872584         3         3
## 1935 2.3732229       FALSE -0.6640666  1.186363         3         3
## 1841 1.6381649       FALSE -0.1773040  1.382823         3         3
## 1627 1.3689398       FALSE -0.1536026  1.582152         2         3
## 6673 0.6345338       FALSE -0.1278281 -2.436131         3         1
## 393  1.2394338       FALSE -0.3030749  1.449621         3         2

Though not statistically significant at the assumes 0.05 level, protein P03372-3 stands out. Here is a SILAC ratio plot for this protein:

plotIntensities(prodat.norm, id="P03372-3")