This tutorial demonstrates how to analyse data from label free MS/MS experiment in Proteus. We use an example data set from an unpublished experiment by Katharina Trunk, Sarah Coulthurst, Julien Peltier and Matthias Trost. The data are distributed in a package proteusLabelFree, which needs installing and loading first:
devtools::install_github("bartongroup/proteusLabelFree")
library(proteusLabelFree)
data(proteusLabelFree)
Here is a minimal example of data analysis in Proteus. You will need the evidence file from MaxQuant and a very simple text file with metadata (see below how to create it). Then, you can get from the evidence file to differential expression in just a few steps.
library(proteus)
evidenceFile <- system.file("extdata", "evidence.txt.gz", package="proteusLabelFree")
metadataFile <- system.file("extdata", "metadata.txt", package="proteusLabelFree")
evi <- readEvidenceFile(evidenceFile)
meta <- read.delim(metadataFile, header=TRUE, sep="\t")
pepdat <- makePeptideTable(evi, meta)
prodat <- makeProteinTable(pepdat)
prodat.med <- normalizeData(prodat)
res <- limmaDE(prodat.med)
plotVolcano_live(prodat.med, res)
Proteus needs one input file: the evidence file from MaxQuant.
In this tutorial we are also going to use a metadata file. It should be created manually as a tab-delimited text file with a header containing several columns of metadata. The mandatory columns are “experiment”, “measure”, “sample” and “condition” (see section Metadata).
Alternatively, a metadata data frame (with the same columns) can be created within the R session.
Function readEvidenceFile()
reads MaxQuant evidence file, filters out contaminants and reverse sequences. In this example we use the evidence file from the package proteusLabelFree
(you need to install it if you want to follow this tutorial):
evidenceFile <- system.file("extdata", "evidence.txt.gz", package="proteusLabelFree")
evi <- readEvidenceFile(evidenceFile)
evi
is a data frame with selected columns from the evidence file:
## sequence modified_sequence modifications protein_group
## 1 AAAAEKNVPLYK _AAAAEKNVPLYK_ Unmodified sp|P00924|ENO1_YEAST
## 2 AAAAEKNVPLYK _AAAAEKNVPLYK_ Unmodified sp|P00924|ENO1_YEAST
## 3 AAAAEKNVPLYK _AAAAEKNVPLYK_ Unmodified sp|P00924|ENO1_YEAST
## 4 AAAAEKNVPLYK _AAAAEKNVPLYK_ Unmodified sp|P00924|ENO1_YEAST
## 5 AAAAEKNVPLYK _AAAAEKNVPLYK_ Unmodified sp|P00924|ENO1_YEAST
## 6 AAAAEKNVPLYK _AAAAEKNVPLYK_ Unmodified sp|P00924|ENO1_YEAST
## protein experiment charge reverse contaminant intensity
## 1 sp|P00924|ENO1_YEAST A-1 2 44276000
## 2 sp|P00924|ENO1_YEAST A-1 3 62815000
## 3 sp|P00924|ENO1_YEAST A-2 2 24414000
## 4 sp|P00924|ENO1_YEAST A-2 3 36633000
## 5 sp|P00924|ENO1_YEAST A-3 2 17926000
## 6 sp|P00924|ENO1_YEAST A-3 3 27610000
It can be quite large:
## [1] 470318 10
## [1] "41.6 Mb"
MaxQuant evidence files come with column names that can differ, depending on the MaxQuant version and the parameters used in its run. Many of these column names contain spaces and other characters (e.g. “/”) that make R data processing awkward. Therefore, upon reading the evidence file, Proteus renames its columns to simpler names, conforming with R variable name restrictions. Also, many of the evidence columns are not needed in Proteus processing, so these are not kept to save memory.
The selection and naming of columns is controlled by two parameters in function readEvidenceFile()
: measure.cols
and data.cols
. Proteus comes with two predefined lists called measureColumns
and evidenceColumns
, which are used as defaults. They can be modified, as necessary.
The default named list measureColumns
describes evidence file columns that contain measurements. In case of label-free experiment there is only one such column, called Intensity
.
## List of 1
## $ intensity: chr "Intensity"
Different measure columns are needed for TMT and SILAC experiments. See the appropriate vignettes (TMT
and SILAC
) for more details.
The default named list evidenceColumns
describes the minimal set of columns needed for further processing (in addition to measure columns).
## List of 9
## $ sequence : chr "Sequence"
## $ modified_sequence: chr "Modified sequence"
## $ modifications : chr "Modifications"
## $ protein_group : chr "Proteins"
## $ protein : chr "Leading razor protein"
## $ experiment : chr "Experiment"
## $ charge : chr "Charge"
## $ reverse : chr "Reverse"
## $ contaminant : chr "Potential contaminant"
The names (sequence
, modified_sequence
, and so on) are used internally in the package and should not be changed. The values (Sequence
, Modified sequence
) are the actual column names in the evidence file and can be adjusted if different naming convention is used.
We suggest not to change the default list, but create a copy, which can be modified. Let’s say we want to read column m/z
as well:
If, for example, your evidence file contains a column named Leading Razor Protein
(each word starting with a capital), we will have to change this value in the list:
The new column list can be then used with readEvidenceFile
:
To quickly check what the column names are in a given evidence file, we provide a simple function to do this:
evidenceFile <- system.file("extdata", "evidence.txt.gz", package="proteusLabelFree")
evidence.columns <- readColumnNames(evidenceFile)
evidence.columns
## [1] "Sequence" "Length"
## [3] "Modifications" "Modified sequence"
## [5] "Deamidation (NQ) Probabilities" "Oxidation (M) Probabilities"
## [7] "Deamidation (NQ) Score Diffs" "Oxidation (M) Score Diffs"
## [9] "Acetyl (Protein N-term)" "Deamidation (NQ)"
## [11] "Glu->pyro-Glu" "Oxidation (M)"
## [13] "Missed cleavages" "Proteins"
## [15] "Leading proteins" "Leading razor protein"
## [17] "Type" "Raw file"
## [19] "Experiment" "MS/MS m/z"
## [21] "Charge" "m/z"
## [23] "Mass" "Resolution"
## [25] "Uncalibrated - Calibrated m/z [ppm]" "Uncalibrated - Calibrated m/z [Da]"
## [27] "Mass error [ppm]" "Mass error [Da]"
## [29] "Uncalibrated mass error [ppm]" "Uncalibrated mass error [Da]"
## [31] "Max intensity m/z 0" "Retention time"
## [33] "Retention length" "Calibrated retention time"
## [35] "Calibrated retention time start" "Calibrated retention time finish"
## [37] "Retention time calibration" "Match time difference"
## [39] "Match m/z difference" "Match q-value"
## [41] "Match score" "Number of data points"
## [43] "Number of scans" "Number of isotopic peaks"
## [45] "PIF" "Fraction of total spectrum"
## [47] "Base peak fraction" "PEP"
## [49] "MS/MS count" "MS/MS scan number"
## [51] "Score" "Delta score"
## [53] "Combinatorics" "Intensity"
## [55] "Reverse" "Potential contaminant"
## [57] "id" "Protein group IDs"
## [59] "Peptide ID" "Mod. peptide ID"
## [61] "MS/MS IDs" "Best MS/MS"
## [63] "AIF MS/MS IDs" "Deamidation (NQ) site IDs"
## [65] "Oxidation (M) site IDs"
We also need metadata. In our example it is stored in the file in the proteusLabelFree
package:
metadataFile <- system.file("extdata", "metadata.txt", package="proteusLabelFree")
meta <- read.delim(metadataFile, header=TRUE, sep="\t")
It contains the design of our experiment:
## experiment measure sample condition replicate
## 1 A-1 Intensity A-1 A 1
## 2 A-2 Intensity A-2 A 2
## 3 A-3 Intensity A-3 A 3
## 4 A-4 Intensity A-4 A 4
## 5 A-5 Intensity A-5 A 5
## 6 A-6 Intensity A-6 A 6
## 7 A-7 Intensity A-7 A 7
## 8 B-1 Intensity B-1 B 1
## 9 B-2 Intensity B-2 B 2
## 10 B-3 Intensity B-3 B 3
## 11 B-4 Intensity B-4 B 4
## 12 B-5 Intensity B-5 B 5
## 13 B-6 Intensity B-6 B 6
## 14 B-7 Intensity B-7 B 7
This metadata information will be attached to every peptide and protein object used by Proteus.
Metadata object should be a data frame with at least these columns: experiment
(not always necessary), measure
, sample
and condition
.
experiment
should contain the same values as in the Experiment
column in the evidence file. In case of multiplexed experiments it is possible to have an evidence file without the experiment column. In such case it should not be included in metadata (see TMT
vignette for more details). Here, there are 14 unique experiments. We can see all unique experiment names in the evidence data using unique
function:## [1] "A-1" "A-2" "A-3" "A-4" "A-5" "A-6" "A-7" "B-1" "B-2" "B-3" "B-4" "B-5"
## [13] "B-6" "B-7"
Column measure
refers to columns with measurements in the evidence file. In case of the label-free experiment there is only one measure column, “Intensity”. In multiplexed data (e.g. TMT) there can be several measurements columns.
Column sample
should contain (short) names corresponding to a given experiment and measure. For clarity, we recommend a short name consisting of a condition and replicate (in a simple design where such decomposition is possible). Sample identifies a given experiment/measure uniquely, so sample names must be unique.
Column condition
contains condition names. This information will be used in differential expression. In our example there are two conditions, hence two values in this column.
Other columns can be added and used for the downstream analysis. Here, we added a replicate
column, but other information, in particular describing batch effects, can be very useful.
We can now create a peptide data object from the evidence and metadata. The function makePeptideTable()
aggregates peptide data form the evidence into a single table, where rows correspond to peptide sequences and columns correspond to samples (as provided in metadata). Where there are multiple peptides corresponding to the same sequence/sample (e.g., with different charges) their intensities are added in label-free data (we note that this might not be the best approach due to possible high variance in peptide data - more to come soon).
pepdat
is an object of class proteusData
. It consists of the intensity table (pepdat$tab
) and additional information. The first fife rows and columns from the intensity table are
## A-1 A-2 A-3 A-4 A-5
## AAAAEKNVPLYK 107091000 61047000 45536000 80252000 108521000
## AAAAEKNVPLYKHLADLSK 37000000 NA NA NA 20090000
## AAAAEKNVPLYQHLADLSK 26341000 10490000 7551800 16365000 22191000
## AAAAGAGGAGDSGDAVTK 6856400 2794300 2964400 4027100 4193800
## AAAALAGGKK 2561400 NA 7219200 6562417 5005877
We can use generic summary
function to see more information about pepdat
.
##
## *** Basic statistics ***
##
## content = peptide
## experiment type = label-free
## number of samples = 14
## number of conditions = 2
## number of peptides = 34733
## samples = A-1, A-2, A-3, A-4, A-5, A-6, A-7, B-1, B-2, B-3, B-4, B-5, B-6, B-7
## conditions = A, B
##
## *** Data processing ***
##
## evidence columns used = Intensity
## sequence = 'Sequence'
## protein = 'Leading razor protein'
## normalization = identity
Function plotCount()
plots, as the name suggests, peptide count in each sample. This is the number of non-zero peptide intensities per sample.
Function plotDetectionSimilarity()
calculates Jaccard similarity between each pair of samples and plots its distribution. The similarity is based on detection. For a pair of samples it compares the number of detected peptides in both samples (intersection) divided by the total number of peptides detected in both samples (union). This measure of similarity is a number between 0 and 1.
Function plotDistanceMatrix()
calculates the Pearson’s correlation coefficient for each pair of samples and plots a heatmap:
We can use plotClustering()
to see a dendrogram of the peptide data set:
The last sample (B-7) in the example data looks odd. Lets say we call it “bad” and want to remove from analysis. The easiest way to do this is to modify the metadata.
Warning: this is only an example. You should exercise caution when removing any of your data.
Now, we create a new “clean” peptide data set from evidence data:
The new data pepdat.clean
will contain only samples included in the metadata, that is
## [1] "A-1" "A-2" "A-3" "A-4" "A-5" "A-6" "A-7" "B-1" "B-2" "B-3" "B-4" "B-5"
## [13] "B-6"
Clustering confirms that the offending sample is gone.
There are many approaches to aggregating peptides into proteins. For simplicity, we assign peptides to proteins based on the Leading Razor Protein. For label-free data, we quantify protein abundances using a simple, but robust method of high-flyers (Silva et al. 2006).
makeProteinTable()
creates a protein data set from the peptide data. The result is a proteusData
object containing protein intensity table and other information.
Again, we can use a generic summary
function to see its properties.
##
## *** Basic statistics ***
##
## content = protein
## experiment type = label-free
## number of samples = 13
## number of conditions = 2
## number of proteins = 3797
## samples = A-1, A-2, A-3, A-4, A-5, A-6, A-7, B-1, B-2, B-3, B-4, B-5, B-6
## conditions = A, B
##
## *** Data processing ***
##
## evidence columns used = Intensity
## sequence = 'Sequence'
## protein = 'Leading razor protein'
## normalization = identity
##
## *** Protein data ***
##
## quantification method =
## minimum number of peptides per protein = 2
Finally, we need to normalize data to account for variation of intensity between samples. The function normalizeData()
can normalize peptide or protein data. Up to this point, we haven’t applied any normalization. The default normalization is to the median. After this step, median sample intensities will be equal.
The second parameter to normalizeData()
is norm.fun
and it points to a normalizing function. By default, this is normalizeMedian
, but other normalizations can be used. For example, it works with normalizeQuantiles
from limma
package:
The function plotSampleDistributions()
can be used to compare intensity distributions for each normalization.
plotSampleDistributions(prodat.med, title="Median normalization", fill="condition", method="violin")
Some statistics (like mean and variance across replicate in each condition) are stored directly in the prodat
object at the moment of creation. We can apply plotMV()
function to plot the mean-variance relationship.
We can use the same function plotClustering()
to see the dendrogram for the proteins.
So far, we have analysed data using protein identifiers as provided in the evidence file. We need to annotate proteins and add more information, for example, UniProt ID, gene name, protein description. To do this, we need to prepare a data frame containing a column called protein
with the original protein identifiers, as in the evidence file. Additional columns in this data frame will contain all necessary annotations. The function annotateProteins
can merge such data frame into a proteusData
object and the annotations can be used later to identify proteins and perform further downstream analysis (e.g. functional enrichment).
In the case of our example data protein identifiers look like sp|P00546|CDK1_YEAST
. Hence, they contain a UniProt ID. We can use this information to grab annotations from UniProt servers.
First, we use strsplit
to extract UniProt IDs:
luni <- lapply(as.character(prodat$proteins), function(prot) {
if(grepl("sp\\|", prot)) {
uniprot <- unlist(strsplit(prot, "|", fixed=TRUE))[2]
c(prot, uniprot)
}
})
ids <- as.data.frame(do.call(rbind, luni))
names(ids) <- c("protein", "uniprot")
This creates a data frame with two columns:
## protein uniprot
## 1 sp|A0A023PYI5|YI036_YEAST A0A023PYI5
## 2 sp|A5Z2X5|YP010_YEAST A5Z2X5
## 3 sp|A6ZKT1|VPS10_YEAS7 A6ZKT1
## 4 sp|A6ZN74|RTC1_YEAS7 A6ZN74
## 5 sp|A6ZP47|DED1_YEAS7 A6ZP47
## 6 sp|A6ZQR2|ENOPH_YEAS7 A6ZQR2
So, for each protein we have found its UniProt ID. A helper function fetchFromUniProt
gets basic annotations from UniProt servers. Here we use the column uniprot
containing 3796 UniProt IDs to fetch basic annotations (gene names and protein names) from UniProt. This might take a while!
This is what the content of this data frame looks like:
## id genes
## 1 Q04087 LRS4 YDR439W D9461.25
## 2 P08536 MET3 YJR010W J1436
## 3 P47069 MPS3 NEP98 YJL019W J1310 J1315 YJL018W
## 4 P11746 MCM1 FUN80 YMR043W YM9532.08
## 5 P12945 NAT1 AAA1 YDL040C D2720
## 6 Q08229 NBA1 YOL070C
## protein names
## 1 Monopolin complex subunit LRS4 (Loss of rDNA silencing protein 4)
## 2 Sulfate adenylyltransferase (EC 2.7.7.4) (ATP-sulfurylase) (Methionine-requiring protein 3) (Sulfate adenylate transferase) (SAT)
## 3 Spindle pole body assembly component MPS3 (98 kDa nuclear envelope protein) (Monopolar spindle protein 3)
## 4 Pheromone receptor transcription factor (GRM/PRTF protein)
## 5 N-terminal acetyltransferase A complex subunit NAT1 (NatA complex subunit NAT1) (Amino-terminal, alpha-amino, acetyltransferase 1)
## 6 Protein NBA1 (NAP1 and bud neck-associated protein 1)
In order to feed these data into a proteusData
object we need a column with the original protein identifiers. We can get this by merging the two data frames:
annotations.id <- merge(ids, annotations, by.x="uniprot", by.y="id")
annotations.id <- unique(annotations.id)
The second command is to remove duplicated entries. This is the data frame we need. The function annotateProteins
will add it to the prodat.med
object.
## Annotated 3796 proteins.
The annotations are merged into prodat
based on the protein
column in the annotation data frame. They are stored in prodat$annotation
and rows of this data frame correspond to the order of proteins in the intensity table.
Obviously, not every data set follows the identifier convention in this example. For any data set the user will need to create a data frame with the appropriate annotations, using available resources. The only requirement is that the annotation data frame contains a column called protein
and this column contains protein identifiers as provided in the evidence file and available from the proteusData
object in the field proteins
.
Peptide intensities are aggregated based on the Sequence
column in the evidence file. That is, the intensity for a peptide (for a given sample) is the sum (or other user-specified function) of the intensities for all entries for this sequence and experiment in the evidence file. Instead, we can use the Modified sequence
column and focus on protein modifications. Then, the aggregation will be done for a given modified sequence.
Let us recall, that the evidence object in Proteus package is a data frame with column names as defined by the data.cols
parameter in readEvidenceFile
function. These column names are different from the original names in the evidence file, so they conform with R object naming standards. Here are the column names we have in evi
:
## [1] "sequence" "modified_sequence" "modifications"
## [4] "protein_group" "protein" "experiment"
## [7] "charge" "reverse" "contaminant"
## [10] "intensity"
Therefore, if we want to base peptide aggregation on the modified sequence, we need to specify the appropriate column name when creating peptide table, by the parameter sequence.col
. The usual protein aggregation will follow:
pepdat.mod <- makePeptideTable(evi, meta, sequence.col = "modified_sequence")
prodat.mod <- makeProteinTable(pepdat.mod)
The default value of sequence.col
is “sequence”, so if no parameter is given, peptides will be aggregated based on their (unmodified) sequence.
By default protein intensities are aggregated from peptide intensities based on the Leading razor protein
column from the evidence file (this column is renamed to protein
by readEvidenceFile
for simplicity). This means that for a given protein makeProteinTable
would aggregate data from all peptides having this protein ID in the Leading razor protein
column. We can change this default approach and, instead of razor proteins, we can use protein groups, as listed in Proteins
column of the evidence file (this column is renamed to protein_group
by readEvidenceFile
to avoid confusion). The choice of column to aggregate protein data is controlled by parameter protein.col
in the makePeptideTable
function.
It might look surprising that protein/group selection is done at the peptide level, but this is where peptide-to-protein conversion table is build and stored in the output object (have a look at pepdat$pep2prot
- it’s a data frame with two columns, peptide
and protein
). We want to have it in the peptide object, in case peptide-level analysis needs to find out which proteins or protein groups these peptides belong to. The peptide-to-protein table is then used by makeProteinTable
to aggregate proteins. Hence, if we want to aggregate proteins by protein group rather than razor proteins (which is the default), we need to do the following:
pepdat.group <- makePeptideTable(evi, meta, protein.col="protein_group")
prodat.group <- makeProteinTable(pepdat.group)
As a result, the object prodat.group
will contain not individual proteins, but protein groups.
Evidence file often contains multiple entries for the same peptide sequence and experiment. By default (and this is also done by MaxQuant when creating peptides table) these entries are summed. In proteus the aggregation method is controlled by the parameter aggregate.fun
and can be easily modified. aggregate.fun
points to a function that performs aggregation across all entries. The default function is aggregateSum
, but it can be replaced with any user-provided function, if necessary.
The aggregator function should have the following form:
function(wp, ...)
The input for this function is a matrix wp
with peptide intensities. The ellipsis (…) indicates additional parameters passed to the function. Columns of the matrix wp
correspond to samples and rows correspond to different entries from the evidence file. For example, for a peptide with sequence AASESIKVGDPFDESTFQGAQTSQMQLNK
we have the following matrix:
## A-1 A-2 A-3 A-4 A-5 A-6 A-7 B-1
## [1,] 6420600 1665300 2272300 4329600 3254700 2137400 43234000 1701200
## [2,] 153180000 71662000 56779000 94314000 167480000 53352000 6415100 106160000
## [3,] 2512900 2411700 1811300 3140700 2029800 816490 2208500 2254000
## [4,] 7571400 786130 2148800 1044000 748310 831810 949660 2016300
## [5,] 4021200 1267100 1579000 2821700 NA 1125100 NA 1456900
## [6,] NA NA NA NA NA NA NA NA
## B-2 B-3 B-4 B-5 B-6 B-7
## [1,] 5723300 3439000 3212500 1447400 4291800 2321800
## [2,] 137790000 93275000 2783300 85152000 69391000 51629000
## [3,] 2144700 968470 115940000 899930 983180 1166900
## [4,] 954220 775800 3207100 881190 2136300 742550
## [5,] 2230500 NA 1587600 3393700 NA NA
## [6,] 1307500 NA NA NA NA NA
There are up to 6 evidence entries per sample for this sequence. The output of the aggregator function should be a vector of aggregated intensities. In case of the default aggregator, the output is the sum of columns of the above matrix:
## [1] 173706100 77792230 64590400 105650000 173512810 58262800 52807260
## [8] 113588400 150150220 98458270 126730500 91774220 76802280 55860250
The elements of this vector correspond to the samples - columns of the input matrix. The peptide creator function, makePeptideTable
uses the aggregator function to combine data for each peptide.
The other peptide aggregator function provided in the package is aggregateMedian
. It calculates median of each column. We recommend using this function with SILAC data. You can do this by specifying aggregate.fun=aggregateMedian
in the makePeptideTable
call. See SILAC vignette for details.
Let us assume that instead of the default sums we want to find the maximum intensity across all entries for each peptide. We are not saying that this is a better approach, we just want to show an example of how to create a user-define peptide aggregator. The function we need takes a two-dimensional matrix as an input and returns the maximum of each column. It can be encoded as
aggregateMax <- function(wp) {
s <- apply(wp, 2, function(x) max(x, na.rm=TRUE))
return(as.vector(s))
}
Please note that na.rm=TRUE
is necessary to ignore data gaps. Otherwise the returned vector would consist mostly of NAs. When applied to our example matrix the function returns
## [1] 153180000 71662000 56779000 94314000 167480000 53352000 43234000
## [8] 106160000 137790000 93275000 115940000 85152000 69391000 51629000
Now, we can use it to create an alternative peptide table:
The result will be somehow different from the default summing approach.
By default makeProteinTable
aggregates peptides to proteins using the high-flyer method of (Silva et al. 2006). The protein intensity is the mean intensity of 3 top-intensity peptides associated with the given protein (or group of proteins). The peptide-to-protein aggregation method is controlled by the parameter aggregate.fun
in makeProteinTable
. It points to an aggregator function.
The aggregator function has exactly the same format as in evidence-to-peptide aggregator (see above). That is, it takes a two-dimensional matrix (columns are samples, rows are peptides) and calculates an aggregated vector. There are three aggregators in the package: the default aggregateHifly
plus the same functions we used for evidence-to-peptide aggregation, aggregateSum
and aggregateMedian
.
A user-defined function can be used to aggregate proteins. Please see section on evidence-to-peptide aggregation for details.
MaxQuant output contains a file with aggregated protein data, usually called proteinGroups.txt
. Protein intensities in this file are summed over protein groups. It is possible to read these data directly into Proteus and skip peptide and protein aggregation steps.
proteinGroupsFile <- system.file("extdata", "proteinGroups.txt.gz", package="proteusLabelFree")
prot.MQ <- readProteinGroups(proteinGroupsFile, meta)
This creates a proteusData
object with all necessary information: that is, intensity table, metadata, protein names. This object can be visualized and analysed using Proteus functions.
Just like with evidence data, we might have to specify which columns need to be read from the file. There are two arguments in readProteinGroups
that control this: measure.cols
and data.cols
.
measure.cols
specifies columns with measurements. The default value for this argument is a named list or vector:
## A-1 A-2 A-3 A-4 A-5
## "Intensity A-1" "Intensity A-2" "Intensity A-3" "Intensity A-4" "Intensity A-5"
## A-6 A-7 B-1 B-2 B-3
## "Intensity A-6" "Intensity A-7" "Intensity B-1" "Intensity B-2" "Intensity B-3"
## B-4 B-5 B-6 B-7
## "Intensity B-4" "Intensity B-5" "Intensity B-6" "Intensity B-7"
Values of this list are actual column names in the file (Intensity A-1
, Intensity A-2
, and so on), names of the list are sample names. If columns with measurements in the protein groups file are named following a different convention, this argument will have to be changed. Make sure this is a named list with names corresponding to sample names.
data.cols
specifies additional columns to be read. It defaults to a global list proteinColumn
. This is a another named list, containing the following elements:
## List of 3
## $ protein : chr "Majority protein IDs"
## $ reverse : chr "Reverse"
## $ contaminant: chr "Potential contaminant"
Again, if the majority protein ID column is named differently, this will have to be changed in data.cols
argument. The three names of this list (protein
, reverse
and contaminant
) are used internally and should not be changed or removed.
We suggest using limma package to do differential expression. Package Proteus contains a simple wrapper to limma, that takes a proteusData
object as input.
This function creates a data frame with DE results.
## protein logFC AveExpr t P.Value adj.P.Val
## 1 mut-yEGFP NA 25.46489 NA NA NA
## 2 sp|A0A023PYI5|YI036_YEAST NA NaN NA NA NA
## 3 sp|A5Z2X5|YP010_YEAST -0.7515835 28.11827 -2.3827226 0.03246686 0.3054346
## 4 sp|A6ZKT1|VPS10_YEAS7 0.3279931 23.29869 2.0675716 0.05836982 0.3826613
## 5 sp|A6ZN74|RTC1_YEAS7 -0.2551201 21.23215 -0.8738230 0.39746431 0.7529776
## 6 sp|A6ZP47|DED1_YEAS7 0.1231840 27.69843 0.3502452 0.73154887 0.9052370
## B significant mean_A mean_B ngood_A ngood_B uniprot
## 1 NA NA 25.46489 NA 7 0 <NA>
## 2 NA NA NA NA 0 0 A0A023PYI5
## 3 -3.817759 FALSE 28.46515 27.71357 7 6 A5Z2X5
## 4 -4.341622 FALSE 23.14731 23.47530 7 6 A6ZKT1
## 5 -5.869494 FALSE 21.34990 21.09478 7 6 A6ZN74
## 6 -6.190307 FALSE 27.64158 27.76476 7 6 A6ZP47
## genes
## 1 <NA>
## 2 YIR036W-A
## 3 YPR010C-A
## 4 PEP1 VPS10 VPT1 SCY_0201
## 5 RTC1 SCY_4942
## 6 DED1 SCY_5262
## protein names
## 1 <NA>
## 2 Putative uncharacterized membrane protein YIR036W-A
## 3 UPF0495 protein YPR010C-A
## 4 Vacuolar protein sorting/targeting protein PEP1 (Carboxypeptidase Y receptor) (CPY receptor) (Carboxypeptidase Y-deficient protein 1) (Sortilin VPS10) (Vacuolar carboxypeptidase sorting receptor VPS10) (Vacuolar protein sorting-associated protein 10) (Vacuolar protein-targeting protein 1)
## 5 Restriction of telomere capping protein 1
## 6 ATP-dependent RNA helicase DED1 (EC 3.6.4.13)
Before limma
is called, intensity data are transformed using the transform.fun
function (a parameter of limmaDE
). The default value for this transformation is log10
. Therefore, by default, the column logFC
in the output data frame contains \(\log_{10}(M_1/M_2)\), where \(M_k\) represent the mean of condition \(k\). If you need log2-based fold change, you can use transform.fun=log2
.
The significant
column indicates significantly differentially expressed proteins, based on the Benjamini-Hochberg corrected p-values (column adj.P.Val
) and the significance level defined when calling limmaDE
(sig.level=0.05
). It allows for simple filtering of the significant results (below we show only the most important columns, for simplicity):
## protein logFC adj.P.Val
## 116 sp|P04037|COX4_YEAST 0.9310020 0.015116978
## 147 sp|P05374|CHO2_YEAST -1.1273068 0.022945434
## 275 sp|P09435|HSP73_YEAST 1.3424530 0.003565436
## 295 sp|P0CI39|STE2_YEASX 1.7701655 0.003780144
## 528 sp|P19145|GAP1_YEAST -2.3367877 0.003206448
## 621 sp|P22202|HSP74_YEAST 0.9655891 0.026144388
## 802 sp|P26263|PDC6_YEAST 3.3750749 0.003512104
## 863 sp|P29029|CHIT_YEAST -0.8842365 0.017687627
## 1702 sp|P38928|AXL2_YEAST -0.9794888 0.049029522
## 1757 sp|P39106|MNN1_YEAST -0.9376527 0.026144388
## 1976 sp|P40472|SIM1_YEAST -1.5889220 0.003206448
## 2041 sp|P40893|REE1_YEAST -0.6895676 0.026144388
## 2203 sp|P46955|NCA3_YEAST 1.8021963 0.040675275
## 2247 sp|P47047|MTR4_YEAST 1.1137049 0.040675275
## 2256 sp|P47069|MPS3_YEAST -1.3084852 0.049029522
## 2283 sp|P47123|MOG1_YEAST 0.8376954 0.040675275
## 2310 sp|P47181|YJ9S_YEAST 0.7425047 0.040675275
## 2356 sp|P49573|CTR1_YEAST -0.9417154 0.049029522
## 2706 sp|P54115|ALDH6_YEAST 0.6402032 0.009110354
## 2811 sp|Q02516|HAP5_YEAST 1.4730973 0.006206540
## 2877 sp|Q03178|PIR1_YEAST 1.5577601 0.003780144
## 2923 sp|Q03532|HAS1_YEAST 0.5633799 0.040675275
## 3251 sp|Q06648|GIC2_YEAST -0.9000611 0.006206540
## 3444 sp|Q08969|GRE1_YEAST 2.9123684 0.040675275
## 3597 sp|Q12298|YD061_YEAST -3.5585924 0.026144388
Note: limmaDE
requires exactly two conditions to do a differential expression on. When data contain more conditions, a parameter conditions
is required to select a pair:
For more complicated designs we recommend using limma functions directly.
Sometimes data for a given protein is missing entirely from a condition (that is all replicates are NA
in the intensity table). In such cases, differential expression with limma
returns NA
log-fold-change and p-value. However, a protein detected in one condition and not detected in the other condition might be interesting if a non-detection is due to very low abundance. We can easily find such proteins using the look-up table detect
in an proteusData
object. It contains logical columns for each condition, with TRUE
indicating that the protein was detected (in at least one replicate) and FALSE
when it was not detected (in any replicate).
## A B
## mut-yEGFP TRUE FALSE
## sp|A0A023PYI5|YI036_YEAST FALSE FALSE
## sp|A5Z2X5|YP010_YEAST TRUE TRUE
## sp|A6ZKT1|VPS10_YEAS7 TRUE TRUE
## sp|A6ZN74|RTC1_YEAS7 TRUE TRUE
## sp|A6ZP47|DED1_YEAS7 TRUE TRUE
The first protein, mut-yEGFP
is not detected in B
condition. The proteins detected in only one condition can be found using a logical expression:
only.A <- which(prodat$detect$`A` & !prodat$detect$B)
only.B <- which(!prodat$detect$`A` & prodat$detect$B)
We can list their identifiers from the proteins
field in the protein object, as in this example for condition A
:
## [1] "mut-yEGFP"
Proteus provides with several functions to visualize protein data and the results of the differential expression.
Fold-change-intensity plot:
Volcano plot:
P-value distribution plot:
We can also look at individual protein. The function plotIntensities
plots intensities of individual samples per condition, or other selected quantity (e.g. batch). Here is one of the up-regulated proteins.
A better understanding of the protein’s behaviour might be gained via function plotProtPeptides
, which shows intensities of individual peptides and replicates for the given protein.
Proteus offers interactive versions of volcano and fold-change/intensity plots. They use Shiny framework to build an interactive local web page in your browser. They are called (from command line):
and
We strongly recommend to build protein annotations before running live functions.