NGS Data Processing

While it is possible to complete all steps of an RNA-seq analysis with tools from bioconductor there are many other alternatives.

Web interface tools such as Galaxy provide form driven interfaces to many of the command line tools including several bioconductor tools and might be an easier introduction to NGS data analysis.
Commercial tools such as CLC Bio NGS Workbench also exist that if you have access to might also be an attractive alternative.
Standalone command line programs such as aligners like STAR and TopHat and qc software such as FastQC provide highly flexible and configurable tools for many NGS task

However bioconductor also offers great flexibility and a huge depth in downstream analysis of results that there are certainly benefits from learning how to do an RNA-seq analysis with bioconductor even if you choose to use other software subsequently.

QC

I am not necessarily going to recommend doing QC in R since FastQC introduced earlier in the course is a fast and relatively painless way to summaries quality information on FASTQ sequence files. However it can be done and one way the task can be performed with in R using the ShortRead package or EDASeq (Risso, Schwartz, Sherlock, et al., 2011).

Alignment

R/bioconductor provides several wrappers for command line aligners for both ChIP-seq (non-splice aware) reads and RNA-seq (splice aware) reads. Because of its simplicity we use the Rsubread package (Liao, Smyth, and Shi, 2013) which provides an interface onto the subread/subjunc/featureCounts tool set

Build Index

As with all aligners there is a two step process to aligning reads with Rsubread and at least two files are necessary. First an index is needs to be built from a FASTA format genomic sequence reference, aligners tend to have their own index structure so indexes for one aligner will not work for an other in fact often indexes for one version release of an aligner will not work for another version.

There are two options at this point.

genomic sequence reference, use the chromosomal DNA sequence to build the index, and align the reads to the full genome then at a subsequent step use a separate annotation file that holds information on the location of genetic features to count the reads that fall in the regions of DNA covered by each feature of interest.
transcriptome (cDNA) sequence reference , instead of a sequence per chromosome there is a sequence per feature. Generally the reference contains the sequence of the mature mRNA after splicing this way when the alignment is performed the assignment to features is performed in the same step. It is possible to use a non-splice aware aligner for this since the reads and the reference are spliced. Clearly this mean pre-mRNA unspliced reads will not align.

Genomic Reference

If you are performing an genomic reference alignment two reference files are needed the FASTA file of the genome sequence and a gene feature file (GFF/GTF) file with the locations of the features of interest.

While this can be automated through various packages in R such as annotationHub it is also possible and potentially simpler to manually download these file via web browers from sites such as ENSEMBL and NCBI genome. Generally it only makes sense to get genome sequence and genomic features files from the same source as different providers have different name for chromosomes and this can lead to all kinds of problems if using incompatible ids.

Seeing we know the location of the files we want we can just use download.file() in R to get the files

homeDir <- system("echo ${HOME}",intern=T)
## decide where these are going to be put in your file space
rnaLocal <- paste0(homeDir,"/RNAseq/")

## create the directory
dir.create(rnaLocal,recursive=T,showWarnings = F)

refLocal <- paste0(rnaLocal,"ref/")
dir.create(refLocal,recursive=T,showWarnings = F)

## setup the paths to the files on the ENSEMBL ftp server
faFTP <- "ftp://ftp.ensembl.org/pub/release-84/fasta/saccharomyces_cerevisiae/dna/"
gtfFTP <- "ftp://ftp.ensembl.org/pub/release-84/gtf/saccharomyces_cerevisiae/"

## the full yeast genome is in this file 
faFile <- "Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz"

## the gene annotations are in this file
gtfFile <- "Saccharomyces_cerevisiae.R64-1-1.84.gtf.gz"

## run the commands to get the file
download.file(paste0(faFTP,faFile),paste0(refLocal,faFile))
download.file(paste0(gtfFTP,gtfFile),paste0(refLocal,gtfFile))

Build the index

Now we have the references we can build the index

## load the subread package
require(Rsubread)

idxLocal <- paste0(rnaLocal,"index/")
dir.create(idxLocal,recursive=T,showWarnings = F)
# Build an index using the reference fasta file
# unfortunately Rsubread requires uncompressed fasta files
# unzip the file
system(paste0("gunzip ",paste0(refLocal,faFile)))

# get the name of the unzipped file
refFile <- list.files(refLocal,pattern="fa$",full=T)
buildindex(basename=paste0(idxLocal,"scR64"),reference=refFile)

Alignment

Once the reference is built the FASTQ files can be aligned. In two similar ways you can perform the alignment with either the basic align() function that calls the subread aligner, or with the subjunc() function that calls the more sophisticated subjunc aligner that performs alternative splicing junction detection using donor/receptor site sequence detection. For simple gene-wise differential expression subread() should be sufficient.

## the data we are going to align is here
dataHTTP <- "http://www.compbio.dundee.ac.uk/user/pschofield/data/scData.tar.gz"

## download the file to the local directory
download.file(dataHTTP,paste0(rnaLocal,"scData.tar.gz"))

## untar the data
system(paste0("tar -zxvf ",rnaLocal,"scData.tar.gz -C ",rnaLocal))

mcIndex <- paste0(idxLocal,"scR64")
pair1Files <- list.files(paste0(rnaLocal,"scData"),pattern=".*_R1.*",full.names = T)
bamLocal <- paste0(rnaLocal,"bam/")
dir.create(bamLocal,recursive=T,showWarnings = F)

##
lapply(pair1Files,function(fn1){
  fileStub <- gsub("_lib.*","",basename(fn1))
  align(index=mcIndex,
        readfile1=fn1,
        readfile2=gsub("_R1","_R2",fn1),
        output_file=paste0(bamLocal,fileStub,".bam"))
})
  
dir(bamLocal)

Both align() and subjunc() have a lot of parameters for fine tuning the alignment process. For example they can perform read trimming, and set limits on numbers of mismatches and fragmemt size. The output from both aligners are a BAM file, a file of insert-deletion locations

Cluster Jobs

This may not be the way you want to do your alignment even if you are using Rsubread, because you are not making use of the cluster or its parallel nature and you will be slowing down the RStudio Server.

It is possible to write a script in R and submit that script using qsub.

#!/usr/bin/env Rscript
#
require(BiocParallel)
require(Rsubread)

# this should be the number of cores requested in the qsub
bpparam <- MulticoreParam(8)

# set up the paths to your files
homeDir <- system("echo ${HOME}",intern=T)
rnaLocal <- paste0(homeDir,"/RNAseq/")
idxLocal <- paste0(rnaLocal,"index/")
mcIndex <- paste0(idxLocal,"scR64")
pair1Files <- list.files(paste0(rnaLocal,"scData"),pattern=".*_R1.*",full.names = T)

# create and output directory
bamLocal <- paste0(rnaLocal,"bam/")
dir.create(bamLocal,showWarnings = F)

myAlign <- function(fn1){
             fileStub <- gsub("_lib.*","",basename(fn1))
             align(index=mcIndex,
                   readfile1=fn1,
                   readfile2=gsub("_R1","_R2",fn1),
                   output_file=paste0(bamLocal,fileStub,".bam"))
           }
# 
bplapply(pair1Files,myAlign,BPPARAM = bpparam)

Still in RStudio server select the File > New File > R Script menu option and cut and paste the content of the above code chunk into the file. Save the file as alignSC.R, into your RNAseq/code directory, this script can then be submitted with qsub. Log onto the cluster via putty or Xming

module load r
qsub -V -pe smp 8 RNAseq/code/alignSC.R

Feature Counting

The next task is to assign the reads that have been aligned to the reference genome to the approriate genetic features. This can be accomplished using the featureCounts() function in Rsubread. featureCounts take

# Again we have to unzip the annotation file
system(paste0("gunzip ",paste0(refLocal,gtfFile)))
gtf <- list.files(refLocal,pattern=".*gtf$",full.names = T)

# call featureCounts on the list of bam files in the current directory
bamFiles <- list.files(bamLocal,pattern=".*bam$")
curdir <- getwd()
setwd(bamLocal)
# specifying the gtf file containing the annotations and the format of the file
# counting reads that match overlap exons and grouping exons by gene_id
fcLim <- featureCounts(files = bamFiles , GTF.featureType="exon",GTF.attrType="gene_id",
                       annot.ext=gtf, isGTFAnnotationFile=TRUE )

setwd(curdir)
# Print out the stats from the featureCount object
knitr::kable(fcLim$stat)

Lets save this object for later

dir.create(paste0(rnaLocal,"Rdata"),recursive = F,showWarnings = F)
save(fcLim,file=paste0(rnaLocal,"Rdata/limSC.RData"))

Bibliography

[1] Y. Liao, G. K. Smyth and W. Shi. “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote”. In: Nucleic Acids Research 41 (2013), p. 108.

[2] D. Risso, K. Schwartz, G. Sherlock, et al. “GC-content normalization for RNA-Seq data.” In: BMC Bioinformatics 12 (2011), p. 480.

Session Info

R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.12.3       RefManageR_0.10.13 pietalib_0.1      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4          XVector_0.10.0       magrittr_1.5        
 [4] GenomicRanges_1.22.4 BiocGenerics_0.16.1  zlibbioc_1.16.0     
 [7] IRanges_2.4.8        R6_2.1.2             bibtex_0.4.0        
[10] plyr_1.8.3           httr_1.1.0           stringr_1.0.0       
[13] GenomeInfoDb_1.6.3   tools_3.2.4          parallel_3.2.4      
[16] htmltools_0.3.5      yaml_2.1.13          digest_0.6.9        
[19] RJSONIO_1.3-0        formatR_1.3          S4Vectors_0.8.11    
[22] bitops_1.0-6         RCurl_1.95-4.8       evaluate_0.8.3      
[25] rmarkdown_0.9.5      stringi_1.0-1        XML_3.98-1.4        
[28] stats4_3.2.4         lubridate_1.5.6

RNA-seq Analysis 2:

Annotations and Alignments

Pietà Schofield

NGS Data Processing

QC

Alignment

Build Index

Genomic Reference

Build the index

Alignment

Cluster Jobs

Feature Counting

Bibliography

Session Info