---
title: 'RNA-seq Analysis 3:'
author: "Pietà Schofield"
subtitle: Differential Expression
output:
  html_document:
    code_folding: show
    fig_caption: yes
    highlight: textmate
    theme: spacelab
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: yes
      smooth_scroll: no
  pdf_document:
    toc: yes
    toc_depth: '3'
---

```{r setup,echo=FALSE,include=FALSE}
require(RefManageR)
BibOptions(style="html",longnamesfirst=F)
if(!file.exists(".pieta.bib")){
  download.file("http://www.compbio.dundee.ac.uk/user/pschofield/biblio.bib",".pieta.bib")
}
bib <- ReadBib(".pieta.bib",check=FALSE)
require(knitr)
opts_chunk$set(message=F, warning=F, comment=NA)
#
# This little bit of trickery is rather complex knitr-foo thanks to Ramnath Vaidyanathan
#
# http://ramnathv.github.io/posts/verbatim-chunks-knitr/index.html
#
# It enables code chunks to be included with the surrounding brackets and options
# for illustrative purposes
#
knitr::opts_chunk$set(tidy = F)
knit_hooks$set(source = function(x, options){
  if (!is.null(options$verbatim) && options$verbatim){
    opts = gsub(",\\s*verbatim\\s*=\\s*TRUE\\s*", "", options$params.src)
    bef = sprintf('\n\n ```{r %s}\n', opts, "\n")
    stringr::str_c(
      bef,
      knitr:::indent_block(paste(x, collapse = '\n'), " "),
      "\n ```\n"
    )
  } else {
    stringr::str_c("\n\n```", tolower(options$engine), "\n",
      paste(x, collapse = '\n'), "\n```\n\n"
    )
  }
})
#
# a little bit of R from Yihui Xie to put in verbatim inline R
#
# http://stackoverflow.com/questions/20409172/how-to-display-verbatim-inline-r-code-with-backticks-using-rmarkdown
#
rinline <- function(code) {
  sprintf('``` `r %s` ```', code)
}
```

- [Return to the BS32010 main page](http://www.compbio.dundee.ac.uk/user/pschofield/Projects/teaching_BS32010)
- [Full Rmarkdown script for this workshop](http://www.compbio.dundee.ac.uk/user/pschofield/Projects/teaching_BS32010/code/BS32010Workshop3.Rmd)

# Data Exploration

```{r setupdata, echo=FALSE}
if(file.exists("/usr/local/share/BS32010/pschofield/Rdata/SCfcFull.RData")){
  load("/usr/local/share/BS32010/pschofield/Rdata/SCfcFull.RData")
}else{
  load("/Users/pschofield/Projects/teaching_BS32010/SCfcFull.RData")
}
countMatrix <- fc$counts[,4:9]
colnames(countMatrix) <- gsub("[.]*","",colnames(countMatrix))
colnames(countMatrix) <- gsub("sbam$","",colnames(countMatrix))
colnames(countMatrix) <- gsub("Snf2","Mut",colnames(countMatrix))
```

## Sanity Checks

Once we have our featureCounts object we can extract the matrix of counts, and there are a few sanity checks we can perform to see whether the experiment appears to have worked.

### PCA and MDS

[Principal Component Analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) and [Multi-Dimensional Scaling (MDS)](https://en.wikipedia.org/wiki/Multidimensional_scaling) are dimensionality-reduction techniques and data exploration tools for multivariate data, for example gene expression profiles where each gene represents a variable with an expression value. PCA attempts to find a transformation of the data that produces combinations of variables with maximum variation between samples. MDS is a broader term that encompasses PCA but includes other techniques for reducing the dimension of multivariate data. Several of the Bioconductor packages for differential gene expression analysis include functions with names like `plotPCA()` and `plotMDS()` for visualising the relationships between samples.
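To see what such convenience functions are doing under the hood, the same idea can be sketched with base R alone. The chunk below is an illustrative sketch rather than part of the workshop pipeline, and it assumes the `countMatrix` object created in the setup chunk: it log-transforms the counts (adding a small offset so zero counts do not break the log) and runs `prcomp()` on the transposed matrix so that samples, not genes, are the observations.

```{r pcaBaseR}
# log-transform the raw counts; the 0.5 offset avoids taking log2(0)
logCounts <- log2(countMatrix + 0.5)
# prcomp() expects observations in rows, so transpose: one row per sample
pca <- prcomp(t(logCounts))
# proportion of variance explained by the first two components
summary(pca)$importance[2, 1:2]
# plot the samples in the space of the first two principal components
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2", pch = 19)
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)
```

Replicates of the same condition should sit close together on such a plot; if they do not, that is worth investigating before any differential expression analysis.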
For example the package [EDASeq](http://bioconductor.org/packages/release/bioc/html/EDASeq.html) `r Cite(bib,"risso2011")` provides a `plotPCA()` function.

```{r plotPCA}
require(EDASeq)
plotPCA(countMatrix)
```

### Exercise 3.1

Use the EDASeq `plotPCA()` function to explore the counts matrix in the featureCounts object created earlier.

### Clustering

Another common data exploration technique is cluster analysis. In this approach a distance is calculated between samples based on some combination of variables, and a distance matrix is generated. A clustering algorithm is then run over the distance matrix, aggregating (grouping) samples based on their distance. There are several clustering methods and several clustering packages available in R. Of particular use in this context are the hierarchical clustering functions `hclust()`, available as a base R function, and `pvclust()`, available in the [pvclust](https://CRAN.R-project.org/package=pvclust) package `r Cite(bib, "suzuki2006")`. The advantage of the `pvclust()` function is that it uses bootstrapping to provide a level of confidence that the observed clusters are robust to slight changes in the data.

```{r pvclust}
require(pvclust)
pvc <- pvclust(countMatrix)
plot(pvc)
```

### Sample Distributions

An average is a measure of the central tendency of a distribution; there are several familiar averages, for example the mean, the mode and the median. Another way of exploring the samples is to plot box plots of the sample distributions, which show the median together with the spread of the data. Again, several packages provide convenience functions for box plots of sample distributions. The most obvious is the `boxplot()` function, which is available in both the base R package `graphics` and the Bioconductor infrastructure package `BiocGenerics`.

```{r boxplot}
boxplot(log(countMatrix+0.5))
```

Here the distributions of the log counts are plotted rather than the raw counts, since a few very highly expressed genes would otherwise dominate the scale. The EDASeq package also provides the function `plotRLE()`, where RLE stands for Relative Log Expression. The RLE for each gene is calculated by first taking the median count for that gene across all samples and then, for each sample, taking the log of the count relative to this median.

### Exercise 3.2

Use the `pvclust()` function and the EDASeq function `plotRLE()` to explore the counts matrix in the featureCounts object created earlier.

## Normalisation

There are always sources of unwanted variation in data that are not due to the biological question of interest. The purpose of what is often called normalisation is to remove unwanted variation between samples so that the variation due to the biological question of interest is not masked by unwanted variation.

### Library Size

The most obvious source of variation in NGS data is the difference in the total number of sequences produced for a sample by the sequencer. This is generally called the library size, and correcting for library size is an important step. One way of normalising for library size is to divide the counts for each gene by the total number of reads for the sample and multiply by a fixed large number, often chosen as one million, or sometimes by an average (for example the mean or median) of the library sizes across samples. Values normalised for library size in this way are referred to as Counts Per Million (CPM).
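As an illustration of the arithmetic, the sketch below (assuming the `countMatrix` object from the setup chunk) computes CPM values directly; the edgeR package provides a `cpm()` function that performs essentially the same calculation using its stored library sizes.

```{r cpmByHand}
# library size: the total number of assigned reads per sample
libSizes <- colSums(countMatrix)
# divide each column by its library size and rescale to counts per million
cpmMatrix <- sweep(countMatrix, 2, libSizes, "/") * 1e6
# every column of the normalised matrix now sums to one million
colSums(cpmMatrix)
```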
### Gene Length

Another source of variation between genes is the length of the gene. A longer transcript has a greater chance of having fragments detected by sequencing, simply because it yields more fragments. This also means that, when looking for differential expression, a gene with more fragments has a higher chance of reaching a significant result. As with library size, one method is to divide the counts per gene by the gene length in base pairs and multiply by a fixed large number, often one thousand. When counts have been normalised by library size and then by gene length they are referred to as Fragments (or Reads) Per Kilobase per Million fragments (or reads), written as FPKM or RPKM. A slight variation on this approach is to normalise by gene length first and then by the new library size of the length-normalised values; this produces a value called Transcripts Per Million (TPM). [There is an explanation of this you can watch over and over here](http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/).

### GC content

Due to the way sequencing technology works, it has been noted that genes with higher GC content are more likely to be called significantly differentially expressed. The EDASeq package has functions for GC content normalisation if you supply the sequences of the genes.

### RNA composition

Because sequencing capacity is limited, if a gene that is highly expressed in one condition is not expressed in another, more of the sequencing capacity in the second condition is available for reads from the remaining genes. This can give the impression that those other genes are more highly expressed in the second condition, when in fact they simply had more sequencing resources available to them. The default normalisation routine in the `edgeR` package, Trimmed Mean of M-values (TMM) `r Cite(bib, "robinson2010")`, adjusts for RNA composition based on the assumption that most genes are not changing. (See the section on MA plots below for an explanation of the M value.)

### Take Home

The various differential expression packages available in Bioconductor vary in the way they normalise or correct samples and genes for unwanted variation, and the field is an area of active research without, at present, a consensus on the best approach. The appropriate approach may also depend on the question being addressed. Packages often have several normalisation approaches available. One technique is to perform several different normalisations and look for changes that are robust to the choice of normalisation (see the sketch below). Be aware that this is a conservative approach and may lead to false negative results; however, the alternative of cherry-picking the normalisation technique that gives the most positive results should also be avoided. Be aware too that some normalisations may not be appropriate for a particular differential expression tool. Some tools, such as [limma](https://bioconductor.org/packages/release/bioc/html/limma.html) `r Cite(bib, "ritchie2015")`, are designed to apply the adjustment to the count data first and then model the differential expression and calculate significance, while others, such as [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) `r Cite(bib, c("robinson2010","mccarthy2012","robinson2007","robinson2008","zhou2014"))` and [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) `r Cite(bib, "love2014")`, are designed to use the unadjusted counts and apply the adjustment within the model.
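As a sketch of comparing normalisations (illustrative only, and again assuming the `countMatrix` object from the setup chunk), edgeR's `calcNormFactors()` can be run with more than one method and the resulting per-sample scaling factors compared; the DGEList object and the model-fitting steps themselves are covered in the next section.

```{r normCompare}
require(edgeR)
# build a DGEList from the raw counts (the object is described in detail below)
dgeTmp <- DGEList(countMatrix)
# per-sample scaling factors under two different normalisation methods
tmm <- calcNormFactors(dgeTmp, method = "TMM")$samples$norm.factors
uq  <- calcNormFactors(dgeTmp, method = "upperquartile")$samples$norm.factors
# compare the factors side by side
data.frame(sample = colnames(countMatrix), TMM = tmm, UpperQuartile = uq)
```

If downstream results change appreciably depending on which set of factors is used, that sensitivity is itself worth reporting.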
### MA plots

The MA plot is a form of [Bland–Altman plot](https://en.wikipedia.org/wiki/Bland–Altman_plot) in which a measure of the difference between a variable in two conditions is plotted against the average of the variable (on the log scale) in the two conditions. In general

$$ M = \log\left(\frac{V_1}{V_2}\right) = \log(V_1) - \log(V_2) $$

and

$$ A = \frac{1}{2}\left(\log(V_1) + \log(V_2)\right) $$

Because this is such a standard plot, many differential expression packages again provide convenience functions for producing it.

# Differential Expression

Once we have explored our data and, hopefully, removed as much unwanted variation as we can through normalisation, we can calculate the difference between the expression counts in the conditions and attach a level of confidence to the result. This will normally be in the form of a p-value, or a p-value adjusted for multiple comparisons by some method. In this workshop we will use the package [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html); the process is broadly similar between packages, though they vary slightly in the data structures they use as well as in the normalisation steps and models they apply.

```{r edgeR}
require(edgeR)
# convert the count matrix into the format edgeR wants,
# in this case what is called a DGEList object
# create a factor that says which condition each sample belongs to; here the first
# character of the sample name distinguishes the sample type, so we use that
cond <- as.factor(substr(colnames(countMatrix),1,1))
# create the DGEList object
dge <- DGEList(countMatrix, group = cond)
dge
```

edgeR provides several normalisation options through the `calcNormFactors()` function; the default is Trimmed Mean of M-values (TMM) `r Cite(bib, "robinson2010")`.

```{r edgeRNorm}
dge <- calcNormFactors(dge, method="TMM")
dge
```

Once the normalisation has been done you need to create a model that describes the experimental design. Here we can use the `model.matrix()` function and the group factor to define our model.

```{r edgeRDGE}
design <- model.matrix( ~dge$samples$group )
design
```

Then we can call the edgeR commands to fit the model and find the top genes. The first task is to use the `estimateDisp()` function to estimate the dispersions that edgeR uses for its modelling. The common dispersion is estimated over all the data, the trended dispersion is the dispersion for genes with similar expression levels, and the tagwise dispersion is the individual per-gene dispersion. These can be visualised with the `plotBCV()` function.

```{r edgeRDisp}
dge <- estimateDisp(dge, design)
plotBCV(dge, cex = 0.5)
```

We can then use the edgeR generalised linear model fit `glmFit()` and calculate significance with the likelihood ratio test function `glmLRT()`. The `topTags()` function is then used to recover the results of the modelling.

```{r edgeRfit}
fit <- glmFit(dge,design)
lrt <- glmLRT(fit,coef=2)
topTags(lrt)
```

## Volcano plots

A standard plot for graphically examining the results of differential expression is the volcano plot, which plots the significance against the fold change, often coloured by mean expression.

```{r volcano}
require(ggplot2)
tt <- topTags(lrt,n=10000)$table
ggplot(data=tt) +
  geom_point(aes(x=logFC, y=-log(FDR), color=logCPM)) +
  scale_colour_gradientn(colours=c("#000000","#FF0000"))
```
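The MA plot described earlier can also be drawn for these results. The chunk below is a sketch (it assumes the `lrt` object fitted above) using edgeR's `plotSmear()` function, which plots the log fold change (M) against the average log CPM (A) for every gene and highlights a chosen set of genes.

```{r maPlot}
# pull the full results table so significant genes can be flagged
res <- topTags(lrt, n = Inf)$table
# genes passing a 5% false discovery rate threshold
sigGenes <- rownames(res)[res$FDR < 0.05]
# M (logFC) against A (average logCPM), with significant genes highlighted
plotSmear(lrt, de.tags = sigGenes)
# guide lines at a two-fold change in either direction
abline(h = c(-1, 1), col = "blue", lty = 2)
```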
## Exercise 3.3

Using the data in the saved featureCounts object /usr/local/share/BS32010/pschofield/Rdata/SCfcFull.Rdata, perform a differential expression analysis with edgeR.

- Cluster the samples using pvclust.
- Try using a couple of alternative normalisations.
- Cluster the normalised gene counts; does this alter the clustering?
- Does this alter the final list of significant genes?
- What is the most highly differentially expressed gene?
- What is the most highly up-regulated gene?

# Bibliography

```{r bibliography, echo=FALSE, results="asis"}
PrintBibliography(bib)
```

# Session Info

```{r session, echo=FALSE}
sessionInfo()
```