While it is possible to complete all steps of an RNA-seq analysis with tools from bioconductor there are many other alternatives.
However bioconductor also offers great flexibility and a huge depth in downstream analysis of results that there are certainly benefits from learning how to do an RNA-seq analysis with bioconductor even if you choose to use other software subsequently.
I am not necessarily going to recommend doing QC in R since FastQC introduced earlier in the course is a fast and relatively painless way to summaries quality information on FASTQ sequence files. However it can be done and one way the task can be performed with in R using the ShortRead package or EDASeq (Risso, Schwartz, Sherlock, et al., 2011).
R/bioconductor provides several wrappers for command line aligners for both ChIP-seq (non-splice aware) reads and RNA-seq (splice aware) reads. Because of its simplicity we use the Rsubread package (Liao, Smyth, and Shi, 2013) which provides an interface onto the subread/subjunc/featureCounts tool set
As with all aligners there is a two step process to aligning reads with Rsubread and at least two files are necessary. First an index is needs to be built from a FASTA format genomic sequence reference, aligners tend to have their own index structure so indexes for one aligner will not work for an other in fact often indexes for one version release of an aligner will not work for another version.
There are two options at this point.
If you are performing an genomic reference alignment two reference files are needed the FASTA file of the genome sequence and a gene feature file (GFF/GTF) file with the locations of the features of interest.
While this can be automated through various packages in R such as annotationHub it is also possible and potentially simpler to manually download these file via web browers from sites such as ENSEMBL and NCBI genome. Generally it only makes sense to get genome sequence and genomic features files from the same source as different providers have different name for chromosomes and this can lead to all kinds of problems if using incompatible ids.
Seeing we know the location of the files we want we can just use download.file()
in R to get the files
homeDir <- system("echo ${HOME}",intern=T)
## decide where these are going to be put in your file space
rnaLocal <- paste0(homeDir,"/RNAseq/")
## create the directory
dir.create(rnaLocal,recursive=T,showWarnings = F)
refLocal <- paste0(rnaLocal,"ref/")
dir.create(refLocal,recursive=T,showWarnings = F)
## setup the paths to the files on the ENSEMBL ftp server
faFTP <- "ftp://ftp.ensembl.org/pub/release-84/fasta/saccharomyces_cerevisiae/dna/"
gtfFTP <- "ftp://ftp.ensembl.org/pub/release-84/gtf/saccharomyces_cerevisiae/"
## the full yeast genome is in this file
faFile <- "Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz"
## the gene annotations are in this file
gtfFile <- "Saccharomyces_cerevisiae.R64-1-1.84.gtf.gz"
## run the commands to get the file
download.file(paste0(faFTP,faFile),paste0(refLocal,faFile))
download.file(paste0(gtfFTP,gtfFile),paste0(refLocal,gtfFile))
Now we have the references we can build the index
## load the subread package
require(Rsubread)
idxLocal <- paste0(rnaLocal,"index/")
dir.create(idxLocal,recursive=T,showWarnings = F)
# Build an index using the reference fasta file
# unfortunately Rsubread requires uncompressed fasta files
# unzip the file
system(paste0("gunzip ",paste0(refLocal,faFile)))
# get the name of the unzipped file
refFile <- list.files(refLocal,pattern="fa$",full=T)
buildindex(basename=paste0(idxLocal,"scR64"),reference=refFile)
Once the reference is built the FASTQ files can be aligned. In two similar ways you can perform the alignment with either the basic align()
function that calls the subread aligner, or with the subjunc()
function that calls the more sophisticated subjunc
aligner that performs alternative splicing junction detection using donor/receptor site sequence detection. For simple gene-wise differential expression subread()
should be sufficient.
## the data we are going to align is here
dataHTTP <- "http://www.compbio.dundee.ac.uk/user/pschofield/data/scData.tar.gz"
## download the file to the local directory
download.file(dataHTTP,paste0(rnaLocal,"scData.tar.gz"))
## untar the data
system(paste0("tar -zxvf ",rnaLocal,"scData.tar.gz -C ",rnaLocal))
mcIndex <- paste0(idxLocal,"scR64")
pair1Files <- list.files(paste0(rnaLocal,"scData"),pattern=".*_R1.*",full.names = T)
bamLocal <- paste0(rnaLocal,"bam/")
dir.create(bamLocal,recursive=T,showWarnings = F)
##
lapply(pair1Files,function(fn1){
fileStub <- gsub("_lib.*","",basename(fn1))
align(index=mcIndex,
readfile1=fn1,
readfile2=gsub("_R1","_R2",fn1),
output_file=paste0(bamLocal,fileStub,".bam"))
})
dir(bamLocal)
Both align()
and subjunc()
have a lot of parameters for fine tuning the alignment process. For example they can perform read trimming, and set limits on numbers of mismatches and fragmemt size. The output from both aligners are a BAM file, a file of insert-deletion locations
This may not be the way you want to do your alignment even if you are using Rsubread, because you are not making use of the cluster or its parallel nature and you will be slowing down the RStudio Server.
It is possible to write a script in R and submit that script using qsub.
#!/usr/bin/env Rscript
#
require(BiocParallel)
require(Rsubread)
# this should be the number of cores requested in the qsub
bpparam <- MulticoreParam(8)
# set up the paths to your files
homeDir <- system("echo ${HOME}",intern=T)
rnaLocal <- paste0(homeDir,"/RNAseq/")
idxLocal <- paste0(rnaLocal,"index/")
mcIndex <- paste0(idxLocal,"scR64")
pair1Files <- list.files(paste0(rnaLocal,"scData"),pattern=".*_R1.*",full.names = T)
# create and output directory
bamLocal <- paste0(rnaLocal,"bam/")
dir.create(bamLocal,showWarnings = F)
myAlign <- function(fn1){
fileStub <- gsub("_lib.*","",basename(fn1))
align(index=mcIndex,
readfile1=fn1,
readfile2=gsub("_R1","_R2",fn1),
output_file=paste0(bamLocal,fileStub,".bam"))
}
#
bplapply(pair1Files,myAlign,BPPARAM = bpparam)
Still in RStudio server select the File > New File > R Script menu option and cut and paste the content of the above code chunk into the file. Save the file as alignSC.R, into your RNAseq/code directory, this script can then be submitted with qsub. Log onto the cluster via putty or Xming
module load r
qsub -V -pe smp 8 RNAseq/code/alignSC.R
The next task is to assign the reads that have been aligned to the reference genome to the approriate genetic features. This can be accomplished using the featureCounts()
function in Rsubread. featureCounts take
# Again we have to unzip the annotation file
system(paste0("gunzip ",paste0(refLocal,gtfFile)))
gtf <- list.files(refLocal,pattern=".*gtf$",full.names = T)
# call featureCounts on the list of bam files in the current directory
bamFiles <- list.files(bamLocal,pattern=".*bam$")
curdir <- getwd()
setwd(bamLocal)
# specifying the gtf file containing the annotations and the format of the file
# counting reads that match overlap exons and grouping exons by gene_id
fcLim <- featureCounts(files = bamFiles , GTF.featureType="exon",GTF.attrType="gene_id",
annot.ext=gtf, isGTFAnnotationFile=TRUE )
setwd(curdir)
# Print out the stats from the featureCount object
knitr::kable(fcLim$stat)
Lets save this object for later
dir.create(paste0(rnaLocal,"Rdata"),recursive = F,showWarnings = F)
save(fcLim,file=paste0(rnaLocal,"Rdata/limSC.RData"))
[1] Y. Liao, G. K. Smyth and W. Shi. “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote”. In: Nucleic Acids Research 41 (2013), p. 108.
[2] D. Risso, K. Schwartz, G. Sherlock, et al. “GC-content normalization for RNA-Seq data.” In: BMC Bioinformatics 12 (2011), p. 480.
R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.12.3 RefManageR_0.10.13 pietalib_0.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 XVector_0.10.0 magrittr_1.5
[4] GenomicRanges_1.22.4 BiocGenerics_0.16.1 zlibbioc_1.16.0
[7] IRanges_2.4.8 R6_2.1.2 bibtex_0.4.0
[10] plyr_1.8.3 httr_1.1.0 stringr_1.0.0
[13] GenomeInfoDb_1.6.3 tools_3.2.4 parallel_3.2.4
[16] htmltools_0.3.5 yaml_2.1.13 digest_0.6.9
[19] RJSONIO_1.3-0 formatR_1.3 S4Vectors_0.8.11
[22] bitops_1.0-6 RCurl_1.95-4.8 evaluate_0.8.3
[25] rmarkdown_0.9.5 stringi_1.0-1 XML_3.98-1.4
[28] stats4_3.2.4 lubridate_1.5.6