---
title: 'RNA-seq Analysis 4:'
author: "Pietà Schofield"
subtitle: Introduction to Gene Set and Pathway Analysis
output:
  html_document:
    code_folding: show
    fig_caption: yes
    highlight: textmate
    theme: spacelab
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: yes
      smooth_scroll: no
  pdf_document:
    toc: yes
    toc_depth: '3'
---

```{r setup,echo=FALSE,include=FALSE}
require(RefManageR)
BibOptions(style="html",longnamesfirst=F)
if(!file.exists(".pieta.bib")){
 download.file("http://www.compbio.dundee.ac.uk/user/pschofield/biblio.bib",".pieta.bib")
}
bib <- ReadBib(".pieta.bib",check=FALSE)
require(knitr)
opts_chunk$set( message=F, warning=F, comment=NA)
#
# This little bit of trickery is rather complex knitr-foo thanks to Ramnath Vaidyanathan
#
# http://ramnathv.github.io/posts/verbatim-chunks-knitr/index.html
#
# It enables code chunks to be included with the surrounding brackets and options
# for illustrative purposes
#
knitr::opts_chunk$set(tidy = F)
knit_hooks$set(source = function(x, options){
  if (!is.null(options$verbatim) && options$verbatim){
    opts = gsub(",\\s*verbatim\\s*=\\s*TRUE\\s*", "", options$params.src)
    bef = sprintf('\n\n    ```{r %s}\n', opts, "\n")
    stringr::str_c(
      bef, 
      knitr:::indent_block(paste(x, collapse = '\n'), "    "), 
      "\n    ```\n"
    )
  } else {
    stringr::str_c("\n\n```", tolower(options$engine), "\n", 
      paste(x, collapse = '\n'), "\n```\n\n"
    )
  }
})
#
# a little bit of R from Yihui Xieto put in verbatim inline R
#
# http://stackoverflow.com/questions/20409172/how-to-display-verbatim-inline-r-code-with-backticks-using-rmarkdown
#
rinline <- function(code) {
  sprintf('``` `r %s` ```', code)
}
```

- [Return to the BS32010 main page](http://www.compbio.dundee.ac.uk/user/pschofield/Projects/teaching_BS32010)
- [Full Rmarkdown script for this workshop](http://www.compbio.dundee.ac.uk/user/pschofield/Projects/teaching_BS32010/code/BS32010Workshop4.Rmd)

# Functional Annotation

As with most things there are multiple ways and multiple packages available in Bioconductor for doing [functional annotation analysis](http://bioconductor.org/packages/release/BiocViews.html#___FunctionalAnnotation) of set of genes identified by an NGS experiment. Many of these are listed under the various BiocViews for example:

- [Pathways BiocView](https://www.bioconductor.org/packages/release/BiocViews.html#___Pathways), and again the choice of which to choose is a mix of personal preference and specific application.
- [Gene Set Enrichment BiocView](https://www.bioconductor.org/packages/release/BiocViews.html#___GeneSetEnrichment)
- [Network Enrichment BiocView](https://www.bioconductor.org/packages/release/BiocViews.html#___NetworkEnrichment)
- [Systems Biology BiocView](https://www.bioconductor.org/packages/release/BiocViews.html#___SystemsBiology)


This final workshop introduces a couple of Bioconductor packages that can be used for visualisation and exploration of functional annotations of sets of enriched genes. These are

- [STRINGdb](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html) package
`r Cite(bib,"franceschini2013")`
- [pathview](https://www.bioconductor.org/packages/release/bioc/html/pathview.html) package 
`r Cite(bib,"luo2013")`

There are also a growing number of packages that integrate functional annotation analysis  [FGNet](https://www.bioconductor.org/packages/release/bioc/html/FGNet.html) `r Cite(bib,"aibar2015")` ,  [clusterProfiler](https://www.bioconductor.org/packages/release/bioc/html/clusterProfiler.html) `r Cite(bib,"yu2012")`

## Protein Interactions

In particular these packages give access to the [STRING Known and predicted protein interaction database](http://string-db.org), [Gene Ontology Database](http://geneontology.org) and the [KEGG Kyoto Encyclopedia of Genes and Genomes Database](http://www.genome.jp/kegg/),


```{r stringDBex}
require(STRINGdb)

species <- get_STRING_species(version="10", species_name=NULL)
species[which(grepl("^Homo sapiens",species$official_name)),]

# create a new STRING_db object
string_db <- STRINGdb$new(version="10",species=9606)
```

I constant issue with bioinformatics is the issue of different databases and annotations having different identifiers for the same thing. This is no different with STRINGdb. So the first task is to map the gene identifiers in the data set to genes that STRINGdb knows about

```{r stringDBdata, echo=FALSE}
data(diff_exp_example1)
knitr::kable(head(diff_exp_example1))
```

The gene identifier column is called gene. The first task in most annotation processes is ensuring that the identifiers for the set of features you have discovered match the correct identifiers used by the annotation database. This mapping task is often far from trivial as there are many different identifiers used. 

![[XKCD](http://www.xkcd.com) cartoon that is very apt for various aspects of bioinformatics](http://imgs.xkcd.com/comics/standards.png)]

The STRINGdb packages provides the `string_db$map()` function to map the gene ids of several types to their string id.

```{r stringDBmap}
# map to STRING ids
example1_mapped = string_db$map( diff_exp_example1[1:1000,], "gene", removeUnmappedRows = TRUE )

# get the STRING_id for the top 50 genes
hits = example1_mapped$STRING_id[1:50]  

# plot the STRING network png 
string_db$plot_network( hits )
```

It is possible to add halos to the plot that signify the level of differential expression.

```{r stringPoshPlot}
# filter by p-value and add a color column 
# (i.e. green down-regulated gened and red for up-regulated genes)
example1_mapped_pval05 = string_db$add_diff_exp_color( subset(example1_mapped, pvalue<0.05),
                                                       logFcColStr="logFC" )    

# post payload information to the STRING server
payload_id = string_db$post_payload( example1_mapped_pval05$STRING_id,
                                     colors=example1_mapped_pval05$color )

# display a STRING network png with the "halo"
string_db$plot_network( hits, payload_id=payload_id )
```

## Geno Ontology 

STRINGdb also provides a simple interface to the [Gene Ontolgy database]() and can perform GO term enrichment analysis. The `string_db$get_enrichment()` function returns an ordered list of GO terms that are over represented in the gene hits list provided. It has parameters that provide for the selection of specific GO categories, Biological Process, Molecular Function and Celular Compartment. It also has options for specifying multiple test correction methods and the evidence type for the GO annotations.

```{r stringGO}
hits = example1_mapped$STRING_id 
######## compute enrichment in GO annotations for Biological Process ########
enrichmentGO = string_db$get_enrichment( hits, category = "Process", methodMT = "fdr", iea = TRUE )
head(enrichmentGO, n=7)
```

There are several packages available for GO term analysis reflecting the fact that there is currently no consensus on best way to perform GO term inference. The package [topGO](http://bioconductor.org/packages/release/bioc/html/topGO.html) provides a flexible framework for applying several methods.

## KEGG Pathways

The `string_db$get_enrichment()` can also be used to get [KEGG Pathway]() enrichment values

```{r stringKEGG}
enrichmentKEGG = string_db$get_enrichment( hits, category = "KEGG", methodMT = "fdr", iea = TRUE )
head(enrichmentKEGG, n=7)
```

### Visualising Pathways

The package [pathview](http://bioconductor.org/packages/release/bioc/html/pathview.html) can be used to visualise KEGG Pathways and will annotate the pathway diagram with fold change or significance colouring. So for example the enriched pathway for p53 in the above example can be visualise as follows.

```{r pathview, eval=TRUE}
# Load the pathview package
require(pathview)

# create a named vector of fold changes from the expression data using the gene Ids as names
gene.data <- as.numeric(diff_exp_example1$logFC)
names(gene.data) <- diff_exp_example1$gene

# lets look at one of the top path in the significant KEGG pathway list above for example
# the p53 pathway which has KEGG id 04115, homo sapiens has the KEGG id hsa
# the gene ids that have been used in this case are the gene symbols

pathID <- "04115"
pview <- pathview(gene.data=gene.data,
                  gene.idtype="SYMBOL",
                  pathway.id=pathID,
                  species="hsa",
                  out.suffix="kegg",
                  kegg.native=T,
                  same.layer=T)
```

The `pathview()` function creates a PNG graphics file in the local directory. 

![KEGG pathway diagram](hsa04115.kegg.png)

There are a large number of diverse packages for [pathway annotation annotation](http://bioconductor.org/packages/release/BiocViews.html#___Pathways) 

## Exercise 4

Using the expression data in the RData file /usr/local/share/BS32010/pschofield/Rdata/dge_scerevisiae.RData 

- use STRINGdb to visualise the relationshops of the top 25 differentially expressed genes.  
- find the top 6 KEGG  pathways 
- use pathview to visualise the differential expression of the genes in the glycine, serine and threonine metabolism pathway.

# Bibliography

```{r bibliography, echo=FALSE, results="asis"}
PrintBibliography(bib)
```

# Session Info

```{r session, echo=FALSE}
sessionInfo()
```