# Using proteus R package: label-free data

This tutorial demonstrates how to analyse data from label free MS/MS experiment in Proteus. We use an example data set from an unpublished experiment by Katharina Trunk, Sarah Coulthurst, Julien Peltier and Matthias Trost. The data are distributed in a package proteusLabelFree, which needs installing and loading first:

devtools::install_github("bartongroup/proteusLabelFree")
library(proteusLabelFree)
data(proteusLabelFree)

# 1 Quick start

Here is a minimal example of data analysis in Proteus. You will need the evidence file from MaxQuant and a very simple text file with metadata (see below how to create it). Then, you can get from the evidence file to differential expression in just a few steps.

library(proteus)
evidenceFile <- system.file("extdata", "evidence.txt.gz", package="proteusLabelFree")

pepdat <- makePeptideTable(evi, meta)
prodat <- makeProteinTable(pepdat)
prodat.med <- normalizeData(prodat)
plotVolcano_live(prodat.med, res)

# 2 Input data

## 2.1 Files

Proteus needs one input file: the evidence file from MaxQuant.

In this tutorial we are also going to use a metadata file. It should be created manually as a tab-delimited text file with a header containing several columns of metadata. The mandatory columns are “experiment”, “measure”, “sample” and “condition” (see section Metadata).

Alternatively, a metadata data frame (with the same columns) can be created within the R session.

Function readEvidenceFile() reads MaxQuant evidence file, filters out contaminants and reverse sequences. In this example we use the evidence file from the package proteusLabelFree (you need to install it if you want to follow this tutorial):

evidenceFile <- system.file("extdata", "evidence.txt.gz", package="proteusLabelFree")
evi <- readEvidenceFile(evidenceFile)

evi is a data frame with selected columns from the evidence file:

head(evi)
##       sequence modified_sequence modifications        protein_group
## 1 AAAAEKNVPLYK    _AAAAEKNVPLYK_    Unmodified sp|P00924|ENO1_YEAST
## 2 AAAAEKNVPLYK    _AAAAEKNVPLYK_    Unmodified sp|P00924|ENO1_YEAST
## 3 AAAAEKNVPLYK    _AAAAEKNVPLYK_    Unmodified sp|P00924|ENO1_YEAST
## 4 AAAAEKNVPLYK    _AAAAEKNVPLYK_    Unmodified sp|P00924|ENO1_YEAST
## 5 AAAAEKNVPLYK    _AAAAEKNVPLYK_    Unmodified sp|P00924|ENO1_YEAST
## 6 AAAAEKNVPLYK    _AAAAEKNVPLYK_    Unmodified sp|P00924|ENO1_YEAST
##                protein experiment charge reverse contaminant intensity
## 1 sp|P00924|ENO1_YEAST        A-1      2                      44276000
## 2 sp|P00924|ENO1_YEAST        A-1      3                      62815000
## 3 sp|P00924|ENO1_YEAST        A-2      2                      24414000
## 4 sp|P00924|ENO1_YEAST        A-2      3                      36633000
## 5 sp|P00924|ENO1_YEAST        A-3      2                      17926000
## 6 sp|P00924|ENO1_YEAST        A-3      3                      27610000

It can be quite large:

dim(evi)
## [1] 470318     10
format(object.size(evi), units="Mb")
## [1] "41.6 Mb"

## 2.3 Columns in evidence file

MaxQuant evidence files come with column names that can differ, depending on the MaxQuant version and the parameters used in its run. Many of these column names contain spaces and other characters (e.g. “/”) that make R data processing awkward. Therefore, upon reading the evidence file, Proteus renames its columns to simpler names, conforming with R variable name restrictions. Also, many of the evidence columns are not needed in Proteus processing, so these are not kept to save memory.

The selection and naming of columns is controlled by two parameters in function readEvidenceFile(): measure.cols and data.cols. Proteus comes with two predefined lists called measureColumns and evidenceColumns, which are used as defaults. They can be modified, as necessary.

### 2.3.1 Measurements

The default named list measureColumns describes evidence file columns that contain measurements. In case of label-free experiment there is only one such column, called Intensity.

str(measureColumns)
## List of 1
##  $intensity: chr "Intensity" Different measure columns are needed for TMT and SILAC experiments. See the appropriate vignettes (TMT and SILAC) for more details. ### 2.3.2 Other required columns The default named list evidenceColumns describes the minimal set of columns needed for further processing (in addition to measure columns). str(evidenceColumns) ## List of 9 ##$ sequence         : chr "Sequence"
##  $modified_sequence: chr "Modified sequence" ##$ modifications    : chr "Modifications"
##  $protein_group : chr "Proteins" ##$ protein          : chr "Leading razor protein"
##  $experiment : chr "Experiment" ##$ charge           : chr "Charge"
##  $reverse : chr "Reverse" ##$ contaminant      : chr "Potential contaminant"

The names (sequence, modified_sequence, and so on) are used internally in the package and should not be changed. The values (Sequence, Modified sequence) are the actual column names in the evidence file and can be adjusted if different naming convention is used.

### 2.3.3 Updating column names

We suggest not to change the default list, but create a copy, which can be modified. Let’s say we want to read column m/z as well:

myColumns <- c(evidenceColumns, mz="m/z")

If, for example, your evidence file contains a column named Leading Razor Protein (each word starting with a capital), we will have to change this value in the list:

myColumns$protein <- "Leading Razor Protein" The new column list can be then used with readEvidenceFile: evi_mz <- readEvidenceFile(evidenceFile, data.cols=myColumns) To quickly check what the column names are in a given evidence file, we provide a simple function to do this: evidenceFile <- system.file("extdata", "evidence.txt.gz", package="proteusLabelFree") evidence.columns <- readColumnNames(evidenceFile) evidence.columns ## [1] "Sequence" "Length" ## [3] "Modifications" "Modified sequence" ## [5] "Deamidation (NQ) Probabilities" "Oxidation (M) Probabilities" ## [7] "Deamidation (NQ) Score Diffs" "Oxidation (M) Score Diffs" ## [9] "Acetyl (Protein N-term)" "Deamidation (NQ)" ## [11] "Glu->pyro-Glu" "Oxidation (M)" ## [13] "Missed cleavages" "Proteins" ## [15] "Leading proteins" "Leading razor protein" ## [17] "Type" "Raw file" ## [19] "Experiment" "MS/MS m/z" ## [21] "Charge" "m/z" ## [23] "Mass" "Resolution" ## [25] "Uncalibrated - Calibrated m/z [ppm]" "Uncalibrated - Calibrated m/z [Da]" ## [27] "Mass error [ppm]" "Mass error [Da]" ## [29] "Uncalibrated mass error [ppm]" "Uncalibrated mass error [Da]" ## [31] "Max intensity m/z 0" "Retention time" ## [33] "Retention length" "Calibrated retention time" ## [35] "Calibrated retention time start" "Calibrated retention time finish" ## [37] "Retention time calibration" "Match time difference" ## [39] "Match m/z difference" "Match q-value" ## [41] "Match score" "Number of data points" ## [43] "Number of scans" "Number of isotopic peaks" ## [45] "PIF" "Fraction of total spectrum" ## [47] "Base peak fraction" "PEP" ## [49] "MS/MS count" "MS/MS scan number" ## [51] "Score" "Delta score" ## [53] "Combinatorics" "Intensity" ## [55] "Reverse" "Potential contaminant" ## [57] "id" "Protein group IDs" ## [59] "Peptide ID" "Mod. peptide ID" ## [61] "MS/MS IDs" "Best MS/MS" ## [63] "AIF MS/MS IDs" "Deamidation (NQ) site IDs" ## [65] "Oxidation (M) site IDs" ## 2.4 Metadata We also need metadata. In our example it is stored in the file in the proteusLabelFree package: metadataFile <- system.file("extdata", "metadata.txt", package="proteusLabelFree") meta <- read.delim(metadataFile, header=TRUE, sep="\t") It contains the design of our experiment: meta ## experiment measure sample condition replicate ## 1 A-1 Intensity A-1 A 1 ## 2 A-2 Intensity A-2 A 2 ## 3 A-3 Intensity A-3 A 3 ## 4 A-4 Intensity A-4 A 4 ## 5 A-5 Intensity A-5 A 5 ## 6 A-6 Intensity A-6 A 6 ## 7 A-7 Intensity A-7 A 7 ## 8 B-1 Intensity B-1 B 1 ## 9 B-2 Intensity B-2 B 2 ## 10 B-3 Intensity B-3 B 3 ## 11 B-4 Intensity B-4 B 4 ## 12 B-5 Intensity B-5 B 5 ## 13 B-6 Intensity B-6 B 6 ## 14 B-7 Intensity B-7 B 7 This metadata information will be attached to every peptide and protein object used by Proteus. ### 2.4.1 Metadata columns Metadata object should be a data frame with at least these columns: experiment (not always necessary), measure, sample and condition. • Column experiment should contain the same values as in the Experiment column in the evidence file. In case of multiplexed experiments it is possible to have an evidence file without the experiment column. In such case it should not be included in metadata (see TMT vignette for more details). Here, there are 14 unique experiments. We can see all unique experiment names in the evidence data using unique function: unique(evi$experiment)
##  [1] "A-1" "A-2" "A-3" "A-4" "A-5" "A-6" "A-7" "B-1" "B-2" "B-3" "B-4" "B-5"
## [13] "B-6" "B-7"
• Column measure refers to columns with measurements in the evidence file. In case of the label-free experiment there is only one measure column, “Intensity”. In multiplexed data (e.g. TMT) there can be several measurements columns.

• Column sample should contain (short) names corresponding to a given experiment and measure. For clarity, we recommend a short name consisting of a condition and replicate (in a simple design where such decomposition is possible). Sample identifies a given experiment/measure uniquely, so sample names must be unique.

• Column condition contains condition names. This information will be used in differential expression. In our example there are two conditions, hence two values in this column.

Other columns can be added and used for the downstream analysis. Here, we added a replicate column, but other information, in particular describing batch effects, can be very useful.

# 3 Peptide data

## 3.1 Create a peptide dataset

We can now create a peptide data object from the evidence and metadata. The function makePeptideTable() aggregates peptide data form the evidence into a single table, where rows correspond to peptide sequences and columns correspond to samples (as provided in metadata). Where there are multiple peptides corresponding to the same sequence/sample (e.g., with different charges) their intensities are added in label-free data (we note that this might not be the best approach due to possible high variance in peptide data - more to come soon).

pepdat <- makePeptideTable(evi, meta)

pepdat is an object of class proteusData. It consists of the intensity table (pepdat$tab) and additional information. The first fife rows and columns from the intensity table are pepdat$tab[1:5, 1:5]
##                           A-1      A-2      A-3      A-4       A-5
## AAAAEKNVPLYK        107091000 61047000 45536000 80252000 108521000
## AAAAEKNVPLYKHLADLSK  37000000       NA       NA       NA  20090000
## AAAAEKNVPLYQHLADLSK  26341000 10490000  7551800 16365000  22191000
## AAAAGAGGAGDSGDAVTK    6856400  2794300  2964400  4027100   4193800
## AAAALAGGKK            2561400       NA  7219200  6562417   5005877

We can use generic summary function to see more information about pepdat.

summary(pepdat)
##
## *** Basic statistics ***
##
##   content = peptide
##   experiment type = label-free
##   number of samples = 14
##   number of conditions = 2
##   number of peptides = 34733
##   samples = A-1, A-2, A-3, A-4, A-5, A-6, A-7, B-1, B-2, B-3, B-4, B-5, B-6, B-7
##   conditions = A, B
##
## *** Data processing ***
##
##   evidence columns used = Intensity
##   sequence = 'Sequence'
##   protein = 'Leading razor protein'
##   normalization = identity

## 3.2 Number of peptides

Function plotCount() plots, as the name suggests, peptide count in each sample. This is the number of non-zero peptide intensities per sample.

plotCount(pepdat)

## 3.3 Jaccard similarity

Function plotDetectionSimilarity() calculates Jaccard similarity between each pair of samples and plots its distribution. The similarity is based on detection. For a pair of samples it compares the number of detected peptides in both samples (intersection) divided by the total number of peptides detected in both samples (union). This measure of similarity is a number between 0 and 1.

plotDetectionSimilarity(pepdat, bin.size = 0.02)

## 3.4 Distance matrix

Function plotDistanceMatrix() calculates the Pearson’s correlation coefficient for each pair of samples and plots a heatmap:

plotDistanceMatrix(pepdat)

## 3.5 Clustering

We can use plotClustering() to see a dendrogram of the peptide data set:

plotClustering(pepdat)

## 3.6 PCA

Function plotPCA() creates a principal-component analysis plot.

plotPCA(pepdat)

The last sample (B-7) in the example data looks odd. Lets say we call it “bad” and want to remove from analysis. The easiest way to do this is to modify the metadata.

Warning: this is only an example. You should exercise caution when removing any of your data.

meta.clean <- meta[which(meta$sample != 'B-7'),] Now, we create a new “clean” peptide data set from evidence data: pepdat.clean <- makePeptideTable(evi, meta.clean) The new data pepdat.clean will contain only samples included in the metadata, that is as.character(meta.clean$sample)
##  [1] "A-1" "A-2" "A-3" "A-4" "A-5" "A-6" "A-7" "B-1" "B-2" "B-3" "B-4" "B-5"
## [13] "B-6"

Clustering confirms that the offending sample is gone.

plotClustering(pepdat.clean)

# 4 Protein data

There are many approaches to aggregating peptides into proteins. For simplicity, we assign peptides to proteins based on the Leading Razor Protein. For label-free data, we quantify protein abundances using a simple, but robust method of high-flyers (Silva et al. 2006).

## 4.1 Create protein dataset

makeProteinTable() creates a protein data set from the peptide data. The result is a proteusData object containing protein intensity table and other information.

prodat <- makeProteinTable(pepdat.clean)

Again, we can use a generic summary function to see its properties.

summary(prodat)
##
## *** Basic statistics ***
##
##   content = protein
##   experiment type = label-free
##   number of samples = 13
##   number of conditions = 2
##   number of proteins = 3797
##   samples = A-1, A-2, A-3, A-4, A-5, A-6, A-7, B-1, B-2, B-3, B-4, B-5, B-6
##   conditions = A, B
##
## *** Data processing ***
##
##   evidence columns used = Intensity
##   sequence = 'Sequence'
##   protein = 'Leading razor protein'
##   normalization = identity
##
## *** Protein data ***
##
##   quantification method =
##   minimum number of peptides per protein = 2

## 4.2 Normalization

Finally, we need to normalize data to account for variation of intensity between samples. The function normalizeData() can normalize peptide or protein data. Up to this point, we haven’t applied any normalization. The default normalization is to the median. After this step, median sample intensities will be equal.

prodat.med <- normalizeData(prodat)

The second parameter to normalizeData() is norm.fun and it points to a normalizing function. By default, this is normalizeMedian, but other normalizations can be used. For example, it works with normalizeQuantiles from limma package:

prodat.quant <- normalizeData(prodat, norm.fun=limma::normalizeQuantiles)

The function plotSampleDistributions() can be used to compare intensity distributions for each normalization.

plotSampleDistributions(prodat, title="Not normalized", fill="condition", method="violin")

plotSampleDistributions(prodat.med, title="Median normalization", fill="condition", method="violin")

plotSampleDistributions(prodat.quant, title="Quantile normalization", fill="condition", method="violin")

## 4.3 Mean-variance relationship

Some statistics (like mean and variance across replicate in each condition) are stored directly in the prodat object at the moment of creation. We can apply plotMV() function to plot the mean-variance relationship.

plotMV(prodat.med, with.loess=TRUE)

## 4.4 Protein clustering

We can use the same function plotClustering() to see the dendrogram for the proteins.

plotClustering(prodat.med)

## 4.5 Protein annotations

So far, we have analysed data using protein identifiers as provided in the evidence file. We need to annotate proteins and add more information, for example, UniProt ID, gene name, protein description. To do this, we need to prepare a data frame containing a column called protein with the original protein identifiers, as in the evidence file. Additional columns in this data frame will contain all necessary annotations. The function annotateProteins can merge such data frame into a proteusData object and the annotations can be used later to identify proteins and perform further downstream analysis (e.g. functional enrichment).

In the case of our example data protein identifiers look like sp|P00546|CDK1_YEAST. Hence, they contain a UniProt ID. We can use this information to grab annotations from UniProt servers.

First, we use strsplit to extract UniProt IDs:

luni <- lapply(as.character(prodat$proteins), function(prot) { if(grepl("sp\\|", prot)) { uniprot <- unlist(strsplit(prot, "|", fixed=TRUE))[2] c(prot, uniprot) } }) ids <- as.data.frame(do.call(rbind, luni)) names(ids) <- c("protein", "uniprot") This creates a data frame with two columns: head(ids) ## protein uniprot ## 1 sp|A0A023PYI5|YI036_YEAST A0A023PYI5 ## 2 sp|A5Z2X5|YP010_YEAST A5Z2X5 ## 3 sp|A6ZKT1|VPS10_YEAS7 A6ZKT1 ## 4 sp|A6ZN74|RTC1_YEAS7 A6ZN74 ## 5 sp|A6ZP47|DED1_YEAS7 A6ZP47 ## 6 sp|A6ZQR2|ENOPH_YEAS7 A6ZQR2 So, for each protein we have found its UniProt ID. A helper function fetchFromUniProt gets basic annotations from UniProt servers. Here we use the column uniprot containing 3796 UniProt IDs to fetch basic annotations (gene names and protein names) from UniProt. This might take a while! annotations <- fetchFromUniProt(ids$uniprot, verbose=TRUE)

This is what the content of this data frame looks like:

head(annotations)
##       id                                  genes
## 1 Q04087                  LRS4 YDR439W D9461.25
## 2 P08536                     MET3 YJR010W J1436
## 3 P47069 MPS3 NEP98 YJL019W J1310 J1315 YJL018W
## 4 P11746           MCM1 FUN80 YMR043W YM9532.08
## 5 P12945                NAT1 AAA1 YDL040C D2720
## 6 Q08229                           NBA1 YOL070C
##                                                                                                                        protein names
## 1                                                                  Monopolin complex subunit LRS4 (Loss of rDNA silencing protein 4)
## 2  Sulfate adenylyltransferase (EC 2.7.7.4) (ATP-sulfurylase) (Methionine-requiring protein 3) (Sulfate adenylate transferase) (SAT)
## 3                          Spindle pole body assembly component MPS3 (98 kDa nuclear envelope protein) (Monopolar spindle protein 3)
## 4                                                                         Pheromone receptor transcription factor (GRM/PRTF protein)
## 5 N-terminal acetyltransferase A complex subunit NAT1 (NatA complex subunit NAT1) (Amino-terminal, alpha-amino, acetyltransferase 1)
## 6                                                                              Protein NBA1 (NAP1 and bud neck-associated protein 1)

In order to feed these data into a proteusData object we need a column with the original protein identifiers. We can get this by merging the two data frames:

annotations.id <- merge(ids, annotations, by.x="uniprot", by.y="id")
annotations.id <- unique(annotations.id)

The second command is to remove duplicated entries. This is the data frame we need. The function annotateProteins will add it to the prodat.med object.

prodat.med <- annotateProteins(prodat.med, annotations.id)
## Annotated 3796 proteins.

The annotations are merged into prodat based on the protein column in the annotation data frame. They are stored in prodat$annotation and rows of this data frame correspond to the order of proteins in the intensity table. Obviously, not every data set follows the identifier convention in this example. For any data set the user will need to create a data frame with the appropriate annotations, using available resources. The only requirement is that the annotation data frame contains a column called protein and this column contains protein identifiers as provided in the evidence file and available from the proteusData object in the field proteins. # 5 Alternative processing ## 5.1 Modified sequence Peptide intensities are aggregated based on the Sequence column in the evidence file. That is, the intensity for a peptide (for a given sample) is the sum (or other user-specified function) of the intensities for all entries for this sequence and experiment in the evidence file. Instead, we can use the Modified sequence column and focus on protein modifications. Then, the aggregation will be done for a given modified sequence. Let us recall, that the evidence object in Proteus package is a data frame with column names as defined by the data.cols parameter in readEvidenceFile function. These column names are different from the original names in the evidence file, so they conform with R object naming standards. Here are the column names we have in evi: names(evi) ## [1] "sequence" "modified_sequence" "modifications" ## [4] "protein_group" "protein" "experiment" ## [7] "charge" "reverse" "contaminant" ## [10] "intensity" Therefore, if we want to base peptide aggregation on the modified sequence, we need to specify the appropriate column name when creating peptide table, by the parameter sequence.col. The usual protein aggregation will follow: pepdat.mod <- makePeptideTable(evi, meta, sequence.col = "modified_sequence") prodat.mod <- makeProteinTable(pepdat.mod) The default value of sequence.col is “sequence”, so if no parameter is given, peptides will be aggregated based on their (unmodified) sequence. ## 5.2 Protein groups By default protein intensities are aggregated from peptide intensities based on the Leading razor protein column from the evidence file (this column is renamed to protein by readEvidenceFile for simplicity). This means that for a given protein makeProteinTable would aggregate data from all peptides having this protein ID in the Leading razor protein column. We can change this default approach and, instead of razor proteins, we can use protein groups, as listed in Proteins column of the evidence file (this column is renamed to protein_group by readEvidenceFile to avoid confusion). The choice of column to aggregate protein data is controlled by parameter protein.col in the makePeptideTable function. It might look surprising that protein/group selection is done at the peptide level, but this is where peptide-to-protein conversion table is build and stored in the output object (have a look at pepdat$pep2prot - it’s a data frame with two columns, peptide and protein). We want to have it in the peptide object, in case peptide-level analysis needs to find out which proteins or protein groups these peptides belong to. The peptide-to-protein table is then used by makeProteinTable to aggregate proteins. Hence, if we want to aggregate proteins by protein group rather than razor proteins (which is the default), we need to do the following:

pepdat.group <- makePeptideTable(evi, meta, protein.col="protein_group")
prodat.group <- makeProteinTable(pepdat.group)

As a result, the object prodat.group will contain not individual proteins, but protein groups.

## 5.3 Evidence-to-peptide aggregation

Evidence file often contains multiple entries for the same peptide sequence and experiment. By default (and this is also done by MaxQuant when creating peptides table) these entries are summed. In proteus the aggregation method is controlled by the parameter aggregate.fun and can be easily modified. aggregate.fun points to a function that performs aggregation across all entries. The default function is aggregateSum, but it can be replaced with any user-provided function, if necessary.

### 5.3.1 Aggregator function

The aggregator function should have the following form:

function(wp, ...)

The input for this function is a matrix wp with peptide intensities. The ellipsis (…) indicates additional parameters passed to the function. Columns of the matrix wp correspond to samples and rows correspond to different entries from the evidence file. For example, for a peptide with sequence AASESIKVGDPFDESTFQGAQTSQMQLNK we have the following matrix:

evitab.example
##            A-1      A-2      A-3      A-4       A-5      A-6      A-7       B-1
## [1,]   6420600  1665300  2272300  4329600   3254700  2137400 43234000   1701200
## [2,] 153180000 71662000 56779000 94314000 167480000 53352000  6415100 106160000
## [3,]   2512900  2411700  1811300  3140700   2029800   816490  2208500   2254000
## [4,]   7571400   786130  2148800  1044000    748310   831810   949660   2016300
## [5,]   4021200  1267100  1579000  2821700        NA  1125100       NA   1456900
## [6,]        NA       NA       NA       NA        NA       NA       NA        NA
##            B-2      B-3       B-4      B-5      B-6      B-7
## [1,]   5723300  3439000   3212500  1447400  4291800  2321800
## [2,] 137790000 93275000   2783300 85152000 69391000 51629000
## [3,]   2144700   968470 115940000   899930   983180  1166900
## [4,]    954220   775800   3207100   881190  2136300   742550
## [5,]   2230500       NA   1587600  3393700       NA       NA
## [6,]   1307500       NA        NA       NA       NA       NA

There are up to 6 evidence entries per sample for this sequence. The output of the aggregator function should be a vector of aggregated intensities. In case of the default aggregator, the output is the sum of columns of the above matrix:

aggregateSum(evitab.example)
##  [1] 173706100  77792230  64590400 105650000 173512810  58262800  52807260
##  [8] 113588400 150150220  98458270 126730500  91774220  76802280  55860250

The elements of this vector correspond to the samples - columns of the input matrix. The peptide creator function, makePeptideTable uses the aggregator function to combine data for each peptide.

The other peptide aggregator function provided in the package is aggregateMedian. It calculates median of each column. We recommend using this function with SILAC data. You can do this by specifying aggregate.fun=aggregateMedian in the makePeptideTable call. See SILAC vignette for details.

### 5.3.2 User-defined aggregation function

Let us assume that instead of the default sums we want to find the maximum intensity across all entries for each peptide. We are not saying that this is a better approach, we just want to show an example of how to create a user-define peptide aggregator. The function we need takes a two-dimensional matrix as an input and returns the maximum of each column. It can be encoded as

aggregateMax <- function(wp) {
s <- apply(wp, 2, function(x) max(x, na.rm=TRUE))
return(as.vector(s))
}

Please note that na.rm=TRUE is necessary to ignore data gaps. Otherwise the returned vector would consist mostly of NAs. When applied to our example matrix the function returns

aggregateMax(evitab.example)
##  [1] 153180000  71662000  56779000  94314000 167480000  53352000  43234000
##  [8] 106160000 137790000  93275000 115940000  85152000  69391000  51629000

Now, we can use it to create an alternative peptide table:

pepdat.max <- makePeptideTable(evi, meta, aggregate.fun=aggregateMax)

The result will be somehow different from the default summing approach.

## 5.4 Peptide-to-protein aggregation

By default makeProteinTable aggregates peptides to proteins using the high-flyer method of (Silva et al. 2006). The protein intensity is the mean intensity of 3 top-intensity peptides associated with the given protein (or group of proteins). The peptide-to-protein aggregation method is controlled by the parameter aggregate.fun in makeProteinTable. It points to an aggregator function.

The aggregator function has exactly the same format as in evidence-to-peptide aggregator (see above). That is, it takes a two-dimensional matrix (columns are samples, rows are peptides) and calculates an aggregated vector. There are three aggregators in the package: the default aggregateHifly plus the same functions we used for evidence-to-peptide aggregation, aggregateSum and aggregateMedian.

A user-defined function can be used to aggregate proteins. Please see section on evidence-to-peptide aggregation for details.

## 5.5 Reading MaxQuant’s protein groups file

MaxQuant output contains a file with aggregated protein data, usually called proteinGroups.txt. Protein intensities in this file are summed over protein groups. It is possible to read these data directly into Proteus and skip peptide and protein aggregation steps.

proteinGroupsFile <- system.file("extdata", "proteinGroups.txt.gz", package="proteusLabelFree")
prot.MQ <- readProteinGroups(proteinGroupsFile, meta)

This creates a proteusData object with all necessary information: that is, intensity table, metadata, protein names. This object can be visualized and analysed using Proteus functions.

Just like with evidence data, we might have to specify which columns need to be read from the file. There are two arguments in readProteinGroups that control this: measure.cols and data.cols.

measure.cols specifies columns with measurements. The default value for this argument is a named list or vector:

setNames(paste("Intensity", meta$sample), meta$sample)
##             A-1             A-2             A-3             A-4             A-5
## "Intensity A-1" "Intensity A-2" "Intensity A-3" "Intensity A-4" "Intensity A-5"
##             A-6             A-7             B-1             B-2             B-3
## "Intensity A-6" "Intensity A-7" "Intensity B-1" "Intensity B-2" "Intensity B-3"
##             B-4             B-5             B-6             B-7
## "Intensity B-4" "Intensity B-5" "Intensity B-6" "Intensity B-7"

Values of this list are actual column names in the file (Intensity A-1, Intensity A-2, and so on), names of the list are sample names. If columns with measurements in the protein groups file are named following a different convention, this argument will have to be changed. Make sure this is a named list with names corresponding to sample names.

data.cols specifies additional columns to be read. It defaults to a global list proteinColumn. This is a another named list, containing the following elements:

str(proteinColumns)
## List of 3
##  $protein : chr "Majority protein IDs" ##$ reverse    : chr "Reverse"
##  $contaminant: chr "Potential contaminant" Again, if the majority protein ID column is named differently, this will have to be changed in data.cols argument. The three names of this list (protein, reverse and contaminant) are used internally and should not be changed or removed. # 6 Differential expression We suggest using limma package to do differential expression. Package Proteus contains a simple wrapper to limma, that takes a proteusData object as input. res <- limmaDE(prodat.med, sig.level=0.05) This function creates a data frame with DE results. head(res) ## protein logFC AveExpr t P.Value adj.P.Val ## 1 mut-yEGFP NA 25.46489 NA NA NA ## 2 sp|A0A023PYI5|YI036_YEAST NA NaN NA NA NA ## 3 sp|A5Z2X5|YP010_YEAST -0.7515835 28.11827 -2.3827226 0.03246686 0.3054346 ## 4 sp|A6ZKT1|VPS10_YEAS7 0.3279931 23.29869 2.0675716 0.05836982 0.3826613 ## 5 sp|A6ZN74|RTC1_YEAS7 -0.2551201 21.23215 -0.8738230 0.39746431 0.7529776 ## 6 sp|A6ZP47|DED1_YEAS7 0.1231840 27.69843 0.3502452 0.73154887 0.9052370 ## B significant mean_A mean_B ngood_A ngood_B uniprot ## 1 NA NA 25.46489 NA 7 0 <NA> ## 2 NA NA NA NA 0 0 A0A023PYI5 ## 3 -3.817759 FALSE 28.46515 27.71357 7 6 A5Z2X5 ## 4 -4.341622 FALSE 23.14731 23.47530 7 6 A6ZKT1 ## 5 -5.869494 FALSE 21.34990 21.09478 7 6 A6ZN74 ## 6 -6.190307 FALSE 27.64158 27.76476 7 6 A6ZP47 ## genes ## 1 <NA> ## 2 YIR036W-A ## 3 YPR010C-A ## 4 PEP1 VPS10 VPT1 SCY_0201 ## 5 RTC1 SCY_4942 ## 6 DED1 SCY_5262 ## protein names ## 1 <NA> ## 2 Putative uncharacterized membrane protein YIR036W-A ## 3 UPF0495 protein YPR010C-A ## 4 Vacuolar protein sorting/targeting protein PEP1 (Carboxypeptidase Y receptor) (CPY receptor) (Carboxypeptidase Y-deficient protein 1) (Sortilin VPS10) (Vacuolar carboxypeptidase sorting receptor VPS10) (Vacuolar protein sorting-associated protein 10) (Vacuolar protein-targeting protein 1) ## 5 Restriction of telomere capping protein 1 ## 6 ATP-dependent RNA helicase DED1 (EC 3.6.4.13) Before limma is called, intensity data are transformed using the transform.fun function (a parameter of limmaDE). The default value for this transformation is log10. Therefore, by default, the column logFC in the output data frame contains $$\log_{10}(M_1/M_2)$$, where $$M_k$$ represent the mean of condition $$k$$. If you need log2-based fold change, you can use transform.fun=log2. The significant column indicates significantly differentially expressed proteins, based on the Benjamini-Hochberg corrected p-values (column adj.P.Val) and the significance level defined when calling limmaDE (sig.level=0.05). It allows for simple filtering of the significant results (below we show only the most important columns, for simplicity): res[which(res$significant), c("protein", "logFC", "adj.P.Val")]
##                    protein      logFC   adj.P.Val
## 116   sp|P04037|COX4_YEAST  0.9310020 0.015116978
## 147   sp|P05374|CHO2_YEAST -1.1273068 0.022945434
## 275  sp|P09435|HSP73_YEAST  1.3424530 0.003565436
## 295   sp|P0CI39|STE2_YEASX  1.7701655 0.003780144
## 528   sp|P19145|GAP1_YEAST -2.3367877 0.003206448
## 621  sp|P22202|HSP74_YEAST  0.9655891 0.026144388
## 802   sp|P26263|PDC6_YEAST  3.3750749 0.003512104
## 863   sp|P29029|CHIT_YEAST -0.8842365 0.017687627
## 1702  sp|P38928|AXL2_YEAST -0.9794888 0.049029522
## 1757  sp|P39106|MNN1_YEAST -0.9376527 0.026144388
## 1976  sp|P40472|SIM1_YEAST -1.5889220 0.003206448
## 2041  sp|P40893|REE1_YEAST -0.6895676 0.026144388
## 2203  sp|P46955|NCA3_YEAST  1.8021963 0.040675275
## 2247  sp|P47047|MTR4_YEAST  1.1137049 0.040675275
## 2256  sp|P47069|MPS3_YEAST -1.3084852 0.049029522
## 2283  sp|P47123|MOG1_YEAST  0.8376954 0.040675275
## 2310  sp|P47181|YJ9S_YEAST  0.7425047 0.040675275
## 2356  sp|P49573|CTR1_YEAST -0.9417154 0.049029522
## 2706 sp|P54115|ALDH6_YEAST  0.6402032 0.009110354
## 2811  sp|Q02516|HAP5_YEAST  1.4730973 0.006206540
## 2877  sp|Q03178|PIR1_YEAST  1.5577601 0.003780144
## 2923  sp|Q03532|HAS1_YEAST  0.5633799 0.040675275
## 3251  sp|Q06648|GIC2_YEAST -0.9000611 0.006206540
## 3444  sp|Q08969|GRE1_YEAST  2.9123684 0.040675275
## 3597 sp|Q12298|YD061_YEAST -3.5585924 0.026144388

Note: limmaDE requires exactly two conditions to do a differential expression on. When data contain more conditions, a parameter conditions is required to select a pair:

res <- limmaDE(prodat.med, conditions=c("A", "B"))

For more complicated designs we recommend using limma functions directly.

## 6.1 Proteins present in only one condition

Sometimes data for a given protein is missing entirely from a condition (that is all replicates are NA in the intensity table). In such cases, differential expression with limma returns NA log-fold-change and p-value. However, a protein detected in one condition and not detected in the other condition might be interesting if a non-detection is due to very low abundance. We can easily find such proteins using the look-up table detect in an proteusData object. It contains logical columns for each condition, with TRUE indicating that the protein was detected (in at least one replicate) and FALSE when it was not detected (in any replicate).

head(prodat$detect) ## A B ## mut-yEGFP TRUE FALSE ## sp|A0A023PYI5|YI036_YEAST FALSE FALSE ## sp|A5Z2X5|YP010_YEAST TRUE TRUE ## sp|A6ZKT1|VPS10_YEAS7 TRUE TRUE ## sp|A6ZN74|RTC1_YEAS7 TRUE TRUE ## sp|A6ZP47|DED1_YEAS7 TRUE TRUE The first protein, mut-yEGFP is not detected in B condition. The proteins detected in only one condition can be found using a logical expression: only.A <- which(prodat$detect$A & !prodat$detect$B) only.B <- which(!prodat$detect$A & prodat$detect$B) We can list their identifiers from the proteins field in the protein object, as in this example for condition A: as.character(prodat$proteins[only.A])
## [1] "mut-yEGFP"

## 6.2 Visualization

Proteus provides with several functions to visualize protein data and the results of the differential expression.

Fold-change-intensity plot:

plotFID(prodat.med)

Volcano plot:

plotVolcano(res)

P-value distribution plot:

plotPdist(res)

## 6.3 Individual proteins

We can also look at individual protein. The function plotIntensities plots intensities of individual samples per condition, or other selected quantity (e.g. batch). Here is one of the up-regulated proteins.

plotIntensities(prodat.med, id='sp|P26263|PDC6_YEAST', log=TRUE)

A better understanding of the protein’s behaviour might be gained via function plotProtPeptides, which shows intensities of individual peptides and replicates for the given protein.

plotProtPeptides(pepdat.clean, 'sp|P26263|PDC6_YEAST', prodat.med)

# 7 Interactive plots

Proteus offers interactive versions of volcano and fold-change/intensity plots. They use Shiny framework to build an interactive local web page in your browser. They are called (from command line):

plotVolcano_live(prodat.med, res)

and

plotFID_live(prodat.med, res)

We strongly recommend to build protein annotations before running live functions.