--- title: 'RNA-seq Analysis 2:' author: "Pietà Schofield" subtitle: Annotations and Alignments output: html_document: code_folding: show fig_caption: yes highlight: textmate theme: spacelab toc: yes toc_depth: 3 toc_float: collapsed: no smooth_scroll: yes pdf_document: toc: yes toc_depth: '3' --- ```{r setup,echo=FALSE,include=FALSE} require(RefManageR) BibOptions(style="html",longnamesfirst=F) if(!file.exists(".pieta.bib")){ download.file("http://www.compbio.dundee.ac.uk/user/pschofield/biblio.bib",".pieta.bib") } bib <- ReadBib(".pieta.bib",check=FALSE) require(knitr) opts_chunk$set( message=F, warning=F, comment=NA) # # This little bit of trickery is rather complex knitr-foo thanks to Ramnath Vaidyanathan # # http://ramnathv.github.io/posts/verbatim-chunks-knitr/index.html # # It enables code chunks to be included with the surrounding brackets and options # for illustrative purposes # knitr::opts_chunk$set(tidy = F) knit_hooks$set(source = function(x, options){ if (!is.null(options$verbatim) && options$verbatim){ opts = gsub(",\\s*verbatim\\s*=\\s*TRUE\\s*", "", options$params.src) bef = sprintf('\n\n ```{r %s}\n', opts, "\n") stringr::str_c( bef, knitr:::indent_block(paste(x, collapse = '\n'), " "), "\n ```\n" ) } else { stringr::str_c("\n\n```", tolower(options$engine), "\n", paste(x, collapse = '\n'), "\n```\n\n" ) } }) # # a little bit of R from Yihui Xieto put in verbatim inline R # # http://stackoverflow.com/questions/20409172/how-to-display-verbatim-inline-r-code-with-backticks-using-rmarkdown # rinline <- function(code) { sprintf('``` `r %s` ```', code) } ``` - [Return to course overview](http://www.compbio.dundee.ac.uk/user/pschofield/Projects/teaching_pg) - [Full Rmarkdown script for this workshop](http://www.compbio.dundee.ac.uk/user/pschofield/Projects/teaching_pg/code/biocNGS.Rmd) # NGS Data Processing While it is possible to complete all steps of an RNA-seq analysis with tools from bioconductor there are many other alternatives. - Web interface tools such as Galaxy provide form driven interfaces to many of the command line tools including several bioconductor tools and might be an easier introduction to NGS data analysis. - Commercial tools such as CLC Bio NGS Workbench also exist that if you have access to might also be an attractive alternative. - Standalone command line programs such as aligners like STAR and TopHat and qc software such as FastQC provide highly flexible and configurable tools for many NGS task However bioconductor also offers great flexibility and a huge depth in downstream analysis of results that there are certainly benefits from learning how to do an RNA-seq analysis with bioconductor even if you choose to use other software subsequently. ## QC I am not necessarily going to recommend doing QC in R since FastQC introduced earlier in the course is a fast and relatively painless way to summaries quality information on FASTQ sequence files. However it can be done and one way the task can be performed with in R using the [ShortRead](https://www.bioconductor.org/packages/release/bioc/html/ShortRead.html) package or [EDASeq](http://bioconductor.org/packages/release/bioc/html/EDASeq.html) `r Cite(bib,"risso2011")`. ## Alignment R/bioconductor provides several wrappers for command line aligners for both ChIP-seq (non-splice aware) reads and RNA-seq (splice aware) reads. Because of its simplicity we use the [Rsubread](https://www.bioconductor.org/packages/release/bioc/html/Rsubread.html) package `r Cite(bib, "liao2013")` which provides an interface onto the subread/subjunc/featureCounts tool set ###Build Index As with all aligners there is a two step process to aligning reads with Rsubread and at least two files are necessary. First an index is needs to be built from a FASTA format genomic sequence reference, aligners tend to have their own index structure so indexes for one aligner will not work for an other in fact often indexes for one version release of an aligner will not work for another version. There are two options at this point. - **genomic sequence reference**, use the chromosomal DNA sequence to build the index, and align the reads to the full genome then at a subsequent step use a separate annotation file that holds information on the location of genetic features to count the reads that fall in the regions of DNA covered by each feature of interest. - **transcriptome (cDNA) sequence reference** , instead of a sequence per chromosome there is a sequence per feature. Generally the reference contains the sequence of the mature mRNA after splicing this way when the alignment is performed the assignment to features is performed in the same step. It is possible to use a non-splice aware aligner for this since the reads and the reference are spliced. Clearly this mean pre-mRNA unspliced reads will not align. #### Genomic Reference If you are performing an genomic reference alignment two reference files are needed the FASTA file of the genome sequence and a gene feature file (GFF/GTF) file with the locations of the features of interest. While this can be automated through various packages in R such as [annotationHub]() it is also possible and potentially simpler to manually download these file via web browers from sites such as [ENSEMBL](http://www.ensembl.org/info/data/ftp/index.html) and [NCBI genome](http://www.ncbi.nlm.nih.gov/genome). Generally it only makes sense to get genome sequence and genomic features files from the same source as different providers have different name for chromosomes and this can lead to all kinds of problems if using incompatible ids. Seeing we know the location of the files we want we can just use `download.file()` in R to get the files ```{r downloadAnn, eval=FALSE} homeDir <- system("echo ${HOME}",intern=T) ## decide where these are going to be put in your file space rnaLocal <- paste0(homeDir,"/RNAseq/") ## create the directory dir.create(rnaLocal,recursive=T,showWarnings = F) refLocal <- paste0(rnaLocal,"ref/") dir.create(refLocal,recursive=T,showWarnings = F) ## setup the paths to the files on the ENSEMBL ftp server faFTP <- "ftp://ftp.ensembl.org/pub/release-84/fasta/saccharomyces_cerevisiae/dna/" gtfFTP <- "ftp://ftp.ensembl.org/pub/release-84/gtf/saccharomyces_cerevisiae/" ## the full yeast genome is in this file faFile <- "Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz" ## the gene annotations are in this file gtfFile <- "Saccharomyces_cerevisiae.R64-1-1.84.gtf.gz" ## run the commands to get the file download.file(paste0(faFTP,faFile),paste0(refLocal,faFile)) download.file(paste0(gtfFTP,gtfFile),paste0(refLocal,gtfFile)) ``` ### Build the index Now we have the references we can build the index ```{r buildIndex, eval=FALSE} ## load the subread package require(Rsubread) idxLocal <- paste0(rnaLocal,"index/") dir.create(idxLocal,recursive=T,showWarnings = F) # Build an index using the reference fasta file # unfortunately Rsubread requires uncompressed fasta files # unzip the file system(paste0("gunzip ",paste0(refLocal,faFile))) # get the name of the unzipped file refFile <- list.files(refLocal,pattern="fa$",full=T) buildindex(basename=paste0(idxLocal,"scR64"),reference=refFile) ``` ###Alignment Once the reference is built the FASTQ files can be aligned. In two similar ways you can perform the alignment with either the basic `align()` function that calls the subread aligner, or with the `subjunc()` function that calls the more sophisticated `subjunc` aligner that performs alternative splicing junction detection using donor/receptor site sequence detection. For simple gene-wise differential expression `subread()` should be sufficient. ```{r align, eval=FALSE} ## the data we are going to align is here dataHTTP <- "http://www.compbio.dundee.ac.uk/user/pschofield/data/scData.tar.gz" ## download the file to the local directory download.file(dataHTTP,paste0(rnaLocal,"scData.tar.gz")) ## untar the data system(paste0("tar -zxvf ",rnaLocal,"scData.tar.gz -C ",rnaLocal)) mcIndex <- paste0(idxLocal,"scR64") pair1Files <- list.files(paste0(rnaLocal,"scData"),pattern=".*_R1.*",full.names = T) bamLocal <- paste0(rnaLocal,"bam/") dir.create(bamLocal,recursive=T,showWarnings = F) ## lapply(pair1Files,function(fn1){ fileStub <- gsub("_lib.*","",basename(fn1)) align(index=mcIndex, readfile1=fn1, readfile2=gsub("_R1","_R2",fn1), output_file=paste0(bamLocal,fileStub,".bam")) }) dir(bamLocal) ``` Both `align()` and `subjunc()` have a lot of parameters for fine tuning the alignment process. For example they can perform read trimming, and set limits on numbers of mismatches and fragmemt size. The output from both aligners are a BAM file, a file of insert-deletion locations ## Cluster Jobs This may not be the way you want to do your alignment even if you are using Rsubread, because you are not making use of the cluster or its parallel nature and you will be slowing down the RStudio Server. It is possible to write a script in R and submit that script using qsub. ```{r scriptExample, eval=FALSE} #!/usr/bin/env Rscript # require(BiocParallel) require(Rsubread) # this should be the number of cores requested in the qsub bpparam <- MulticoreParam(8) # set up the paths to your files homeDir <- system("echo ${HOME}",intern=T) rnaLocal <- paste0(homeDir,"/RNAseq/") idxLocal <- paste0(rnaLocal,"index/") mcIndex <- paste0(idxLocal,"scR64") pair1Files <- list.files(paste0(rnaLocal,"scData"),pattern=".*_R1.*",full.names = T) # create and output directory bamLocal <- paste0(rnaLocal,"bam/") dir.create(bamLocal,showWarnings = F) myAlign <- function(fn1){ fileStub <- gsub("_lib.*","",basename(fn1)) align(index=mcIndex, readfile1=fn1, readfile2=gsub("_R1","_R2",fn1), output_file=paste0(bamLocal,fileStub,".bam")) } # bplapply(pair1Files,myAlign,BPPARAM = bpparam) ``` Still in RStudio server select the File > New File > R Script menu option and cut and paste the content of the above code chunk into the file. Save the file as alignSC.R, into your RNAseq/code directory, this script can then be submitted with qsub. Log onto the cluster via putty or Xming ``` module load r qsub -V -pe smp 8 RNAseq/code/alignSC.R ``` ## Feature Counting The next task is to assign the reads that have been aligned to the reference genome to the approriate genetic features. This can be accomplished using the `featureCounts()` function in Rsubread. featureCounts take ```{r featureCounts, eval=FALSE} # Again we have to unzip the annotation file system(paste0("gunzip ",paste0(refLocal,gtfFile))) gtf <- list.files(refLocal,pattern=".*gtf$",full.names = T) # call featureCounts on the list of bam files in the current directory bamFiles <- list.files(bamLocal,pattern=".*bam$") curdir <- getwd() setwd(bamLocal) # specifying the gtf file containing the annotations and the format of the file # counting reads that match overlap exons and grouping exons by gene_id fcLim <- featureCounts(files = bamFiles , GTF.featureType="exon",GTF.attrType="gene_id", annot.ext=gtf, isGTFAnnotationFile=TRUE ) setwd(curdir) # Print out the stats from the featureCount object knitr::kable(fcLim$stat) ``` Lets save this object for later ```{r saveFC,eval=FALSE} dir.create(paste0(rnaLocal,"Rdata"),recursive = F,showWarnings = F) save(fcLim,file=paste0(rnaLocal,"Rdata/limSC.RData")) ``` # Bibliography ```{r bibliography, echo=FALSE, results="asis"} PrintBibliography(bib) ``` # Session Info ```{r session, echo=FALSE} sessionInfo() ```