---
title: 'RNA-seq Analysis 2:'
author: "Pietà Schofield"
subtitle: Annotations and Alignments
output:
  html_document:
    code_folding: show
    fig_caption: yes
    highlight: textmate
    theme: spacelab
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: yes
      smooth_scroll: no
  pdf_document:
    toc: yes
    toc_depth: '3'
---


```{r setup,echo=FALSE,include=FALSE}
.libPaths("/cluster/gjb_lab/pschofield/R/x86_64-redhat-linux-gnu-library/3.2")
require(RefManageR)
BibOptions(style="html",longnamesfirst=F)
if(!file.exists(".pieta.bib")){
 download.file("http://www.compbio.dundee.ac.uk/user/pschofield/biblio.bib",".pieta.bib")
}
bib <- ReadBib(".pieta.bib",check=FALSE)
require(knitr)
opts_chunk$set( message=F, warning=F, comment=NA)
#
# This little bit of trickery is rather complex knitr-foo thanks to Ramnath Vaidyanathan
#
# http://ramnathv.github.io/posts/verbatim-chunks-knitr/index.html
#
# It enables code chunks to be included with the surrounding brackets and options
# for illustrative purposes
#
knitr::opts_chunk$set(tidy = F)
knit_hooks$set(source = function(x, options){
  if (!is.null(options$verbatim) && options$verbatim){
    opts = gsub(",\\s*verbatim\\s*=\\s*TRUE\\s*", "", options$params.src)
    bef = sprintf('\n\n    ```{r %s}\n', opts, "\n")
    stringr::str_c(
      bef, 
      knitr:::indent_block(paste(x, collapse = '\n'), "    "), 
      "\n    ```\n"
    )
  } else {
    stringr::str_c("\n\n```", tolower(options$engine), "\n", 
      paste(x, collapse = '\n'), "\n```\n\n"
    )
  }
})
#
# a little bit of R from Yihui Xieto put in verbatim inline R
#
# http://stackoverflow.com/questions/20409172/how-to-display-verbatim-inline-r-code-with-backticks-using-rmarkdown
#
rinline <- function(code) {
  sprintf('``` `r %s` ```', code)
}
```

- [Return to the BS32010 main page](http://www.compbio.dundee.ac.uk/user/pschofield/Projects/teaching_BS32010)
- [Full Rmarkdown script for this workshop](http://www.compbio.dundee.ac.uk/user/pschofield/Projects/teaching_BS32010/code/BS32010Workshop2.Rmd)

# Annotations

There are many packages in bioconductor that contain curated annotation data for numerous and diverse organisms. See the list on the [BiocViews AnnotationData](https://www.bioconductor.org/packages/release/BiocViews.html#___AnnotationData) page

## Annotation Packages

There are annotation datasets that are available as Bioconductor packages many of them use the `RSQLite` package to interface to `sqlite` database files that hold the information locally. Bioconductor provides some standard packages for a range of annotation databases. Examples of these are:

- [BSGenome](https://bioconductor.org/packages/release/BiocViews.html#___BSgenome) packages `r Cite(bib,"pages2016", .opts=list(max.names=3))` store the full genomic sequence in Biostrings format for various organisms. 
- `AnnotationDbi` packages `r Cite(bib,"pages2016a", .opts=list(max.names=3))` that store several type of information
    - [OrgDb](https://bioconductor.org/packages/release/BiocViews.html#___OrgDb) packages  store genome wide annotation data in a  format for many organisms. 
    - [ChipDb](https://bioconductor.org/packages/release/BiocViews.html#___ChipDb`) and `probe` annotations that hold information on probe-sets on various micro array platforms
    - [InparanoidDb](https://bioconductor.org/packages/release/BiocViews.html#___InparanoidDb) packages that hold homology information on proteins
- [Txdb](https://bioconductor.org/packages/release/BiocViews.html#___TxDb) uses the [GenomicFeatures](https://bioconductor.org/packages/release/bioc/html/GenomicFeatures.html) package `r Cite(bib,"lawrence2013", .opts=list(max.names=3))` to manage database that hold transcript model annotations.
- [MeSHDb](https://bioconductor.org/packages/release/BiocViews.html#___MeSHDb) `r Cite(bib,"tsuyuzaki2015")` provide links between genes and the Medical Subject Headings curated database for many organisms


## Online Packages

Bioconductor also provides some library packages that are designed to ease searching various on-line databases for annotation data. These can be used to programmatically retrieve data that is often also avaliable to be downloaded from web sites such as [ENSEMBL](http://www.ensembl.org/info/data/ftp/index.html) and [UCSC](http://hgdownload.soe.ucsc.edu/downloads.html)

- [BiomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html) 
Interface to BioMart databases  `r Cite(bib,c("durinck2009","durinck2005"))` (e.g. Ensembl, COSMIC ,Wormbase and Gramene)
- [AnnotationHub](https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html) The home page for the AnnotationHub package states
"The AnnotationHub web resource provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard locations (e.g., UCSC, Ensembl) can be discovered"
- [GEOQuery](http://bioconductor.org/packages/release/bioc/html/GEOquery.html) 
`r Cite(bib,"davis2007")` This package provides access to the NCBI Gene Expression Omnibus (GEO) is a public repository of micro array data
- [SRAdb](http://bioconductor.org/packages/release/bioc/html/SRAdb.html) 
`r Cite(bib, "zhu2013")` provides similar access to GEOQuery except to the NCBI Small Reads Archive, the largest public repository of sequencing data from the next generation of sequencing platforms.
- [STRINGdb](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html)
`r Cite(bib, "franceschini2013")` provides an interface to the STRING protein-protein interactions database. It also gives access to KEGG and GO annotations.
- [pathview](http://bioconductor.org/packages/release/bioc/html/pathview.html) 
`r Cite(bib, "luo2013")` is a tool set for pathway based data integration and visualization and will download pathway information from KEGG using the [KEGGgraph](http://bioconductor.org/packages/release/bioc/html/KEGGgraph.html) 
`r Cite(bib, "zhang2015")` package

## Useful CRAN Packages

Not all R packages for handling biological data are in bioconductor, some are in the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/web/packages) and can be installed using the `install.packages()` function.

```{r cran, eval=FALSE}
install.packages("packagename")
```

- [Analyses of Phylogenetics and Evolution (ape) ](https://www.bioconductor.org/packages/release/bioc/html/ape.html) `r Cite(bib,"paradis2004")` is a very useful package that has several functions for importing data formats and even accessing online information. For example 

    - the `read.GenBank()` function can be used to retrieve sequence information by GenBank accession number 
- [genoPlotR](https://cran.r-project.org/web/packages/genoPlotR/index.html) `r Cite(bib, "guy2010")` also has a functions for reading feature and sequence annotation data 
    - the `read_dna_seg_from_genbank()` function will read the features from a GenBank file

###AnnotationHub Example

We know there is some _Saccharomyces cerevisiae_ data coming our way so lets look to see what is available through AnnotationHub and lets look at the data provider ENSEMBL.

```{r ah, eval=F}
require(AnnotationHub)
ah <- AnnotationHub()
sc <- query(ah,c("ENSEMBL","Saccharomyces cerevisiae"))
sc
```

This displays a list of the data items available. A fuller list in the form of a data frame is available with the `mocls()` function, this displays more information including the URL of the on-line data source.

```{r ahmcols, eval=FALSE}
mcols(sc)
```

Item `AH49366` is the full genomic sequence for _Saccharomyces cerevisiae_ release 81 

```{r ahGenome, eval=FALSE}
scFaFile <- sc[["AH49366"]]
scFaFile
```

The sequences in the FaFile object can be retrieved with the `getSeq()` command

```{r getSeq, eval=FALSE}
getSeq(scFaFile)
```

The last item in the list of available _S. cerevisiae_ annotation is the GTF file containing the genetic features and we can retrieve this into a `GenomicRanges` object

```{r ahGTF, eval=FALSE}
genes <- sc[["AH50407"]]
head(genes)
```

It is possible to retrieve the sequences for a gene of interest using these two annotations

```{r getGeneSeq, eval=FALSE}
# get the GenomicRanges item that has the gene name of interest and is of type gene 
ser3 <- genes[which(genes$gene_name=="SER3" & genes$type=="gene"),]
# pass this as a parameter to the getSeq function
seqSer3 <- getSeq(scFaFile,ser3)
names(seqSer3) <- ser3$gene_id
seqSer3
```

## Exercise 2.1

- Which chromosome is the SER3 gene on.
- In the example above what other types genetic features are listed for the SER3 gene?
- How many different bio-types are there in the genetic features data and what are they?
- Find the name, length and sequence for the longest snoRNA.

# NGS Data Processing

While it is possible to complete all steps of an RNA-seq analysis with tools from bioconductor there are many other alternatives. Web interface tools such as Galaxy provide form driven interfaces to many of the command line tools including several bioconductor tools and might be an easier introduction to NGS data analysis. Commercial tools such as CLC Bio NGS Workbench also exist that if you have access to might also be an attractive alternative. However bioconductor offers great flexibility and a huge depth in downstream analysis of results that there are certainly benefits from learning how to do an RNA-seq analysis with bioconductor even if you choose to use other software subsequently.

## QC

I am not necessarily going to recommend doing QC in R since FastQC introduced earlier in the course is a fast and relatively painless why to summaries quality information on FASTQ sequence files. However it can be done and one way the task can be performed with in R using the [ShortRead](https://www.bioconductor.org/packages/release/bioc/html/ShortRead.html) package.

```{r shortReadQA, eval=FALSE}
require(ShortRead)
dirPath <- system.file(package="ShortRead", "extdata", "E-MTAB-1147")
qualAnalysis <- qa(dirPath,"fastq.gz", BPPARAM=SerialParam())
report(qualAnalysis,dest="myQual",type="html")
```

The package [EDASeq](http://bioconductor.org/packages/release/bioc/html/EDASeq.html) `r Cite(bib,"risso2011")` also provides tools for the exploration of unaligned (fastq) data and aligned (bam) data.

```{r EDASeq, eval=FALSE}
require(EDASeq)
# Make a list of the files of interest
files = list.files(dataDir,pattern=".*bam$",full=T)

# prettify the names
names(files) <- gsub("[.]bam$", "", basename(files))

# make a list of the meta data in this case the sample condition 
cond <- gsub("[0-9]$","",names(files))

# create a data frame with the meta data in
meta <- DataFrame(cond,row.names=names(files))

# create a BamFileList object and set the meta data
bfs <- BamFileList(files)
elementMetadata(bfs) <- meta

# Call the EDASeq function to plot the qualities
plotQuality(bfs)

# Plot the perbase quality for a bamfile 
# WARNING this function hangs my version of R
plotQuality(bfs[[1]])

# Plot the total read counts
barplot(bfs)

# Plot the nucleotide frequencies for a sample
plotNtFrequency(bfs[[1]])
```

## Exercise 2.2

Using the EDASeq function `plotQuality()` to examine the per cycle base quality for the file in the directory  '/usr/local/share/BS32010/pschofield/data/fastq`

## Alignment

While you might choose other alignment tools such as TopHat or STAR for speed and flexibility there are aligners available in Bioconductor, chiefly

- [gmapR](https://www.bioconductor.org/packages/release/bioc/html/gmapR.html) 
`r Cite(bib, c("wu2005-1","wu2009","barr2016"))` an interface to the GMAP/GSNAP/GSTRUCT suite
- [Rsubread](https://www.bioconductor.org/packages/release/bioc/html/Rsubread.html) 
`r Cite(bib, "liao2013")` which provides an interface onto the subread/subjunc/featureCounts tool set

For simplicities sake and to show the process all with-in the Bioconductor environment this workshop uses the `Rsubread` package.

###Build Index

As with all aligners there is a two step process to aligning reads with Rsubread and at least two files are necessary. First an index is needs to be built from a FASTA format genomic sequence reference, aligners tend to have their own index structure so indexes for one aligner will not work for an other in fact often indexes for one version release of an aligner will not work for another version. 

There are two options at this point.

- Use the genomic sequence, all the chromosomal DNA sequence to build the index, then at a subsequent step use an annotation file that hold information on the location of genetic features to count the reads that fall in the regions of DNA covered by each feature of interest.
- Use the transcriptome (cDNA) sequence only, so instead of a sequence per chromosome there is a sequence per feature. This way when the alignment is performed the assignment to features is performed in the same step.

##### Thought Exercise

What are some advantages and disadvantages of the two alignment refernce strategies?


```{r buildIndex, eval=FALSE}
require(Rsubread)
ref <- system.file("extdata","reference.fa",package="Rsubread")
# Build an index using the reference fasta file
buildindex(basename="./myorganism_index",reference=ref)
```

###Align

Once the reference is built the FASTQ files can be aligned. In two similar ways you can perform the alignment with either the basic `align()` function that calls the subread aligner, or with the `subjunc()` function that calls the more sophisticated `subjunc` aligner that performs alternative splicing junction detection using donor/receptor site sequence detection. For simple gene-wise differential expression `subread()` should be sufficient.

```{r align, eval=FALSE}
reads <- system.file("extdata","reads.txt.gz",package="Rsubread")
align(index="./myorganism_index",readfile1=reads,
      output_file="./Rsubread_alignment.BAM",phredOffset=64)
```

Both `align()` and `subjunc()` have a lot of parameters for fine tuning the alignment process. For example they can perform read trimming, and set limits on numbers of mismatches and fragmemt size. The output from both aligners are a BAM file, a file of insert-deletion locations

##Feature Assignment

The feature assignment function in the Rsubread package is the `featureCounts()` function. It takes a list of BAM files and a genomic features annotation file and produces a `featureCounts` object that contains a count matrix and some statistics on the numbers of no assigned reads.

##Exercise 2.3

Using the paired end fastq files available in the directory '/usr/local/share/BS32010/pschofield/data/fastq` obtain a featureCounts object containing the reads that align to the genes on chromosomes 5 (V) of the _Saccharomyces cerevisiae_ genome. Report the number of assigned and unassigned reads. 

You will need to

- Obtain the sequence for the chromosomes and build an index
- Align the reads to the reference
- Obtain the genomic features annotation for the features on those chromosomes
- Assign the aligned reads to the featureRegions


# Bibliography

```{r bibliography, echo=FALSE, results="asis"}
PrintBibliography(bib)
```

# Session Info

```{r session, echo=FALSE}
sessionInfo()
```