Annotations

There are many packages in bioconductor that contain curated annotation data for numerous and diverse organisms. See the list on the BiocViews AnnotationData page

Annotation Packages

There are annotation datasets that are available as Bioconductor packages many of them use the RSQLite package to interface to sqlite database files that hold the information locally. Bioconductor provides some standard packages for a range of annotation databases. Examples of these are:

Online Packages

Bioconductor also provides some library packages that are designed to ease searching various on-line databases for annotation data. These can be used to programmatically retrieve data that is often also avaliable to be downloaded from web sites such as ENSEMBL and UCSC

Useful CRAN Packages

Not all R packages for handling biological data are in bioconductor, some are in the Comprehensive R Archive Network (CRAN) and can be installed using the install.packages() function.

install.packages("packagename")
  • Analyses of Phylogenetics and Evolution (ape) is a very useful package that has several functions for importing data formats and even accessing online information. For example

    • the read.GenBank() function can be used to retrieve sequence information by GenBank accession number
  • genoPlotR (Guy, Roat Kultima, and Andersson, 2010) also has a functions for reading feature and sequence annotation data
    • the read_dna_seg_from_genbank() function will read the features from a GenBank file

AnnotationHub Example

We know there is some Saccharomyces cerevisiae data coming our way so lets look to see what is available through AnnotationHub and lets look at the data provider ENSEMBL.

require(AnnotationHub)
ah <- AnnotationHub()
sc <- query(ah,c("ENSEMBL","Saccharomyces cerevisiae"))
sc
AnnotationHub with 106 records
# snapshotDate(): 2016-03-09 
# $dataprovider: Ensembl, UCSC, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Saccharomyces cerevisiae
# $rdataclass: FaFile, GRanges, TwoBitFile, OrgDb
# additional mcols(): taxonomyid, genome, description, tags,
#   sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH289"]]' 

            title                                                
  AH289   | Saccharomyces_cerevisiae.EF4.69.cdna.all.fa          
  AH290   | Saccharomyces_cerevisiae.EF4.69.dna.toplevel.fa      
  AH291   | Saccharomyces_cerevisiae.EF4.69.dna_rm.toplevel.fa   
  AH292   | Saccharomyces_cerevisiae.EF4.69.dna_sm.toplevel.fa   
  AH293   | Saccharomyces_cerevisiae.EF4.69.ncrna.fa             
  ...       ...                                                  
  AH50219 | Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.2bit
  AH50220 | Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.2bit   
  AH50221 | Saccharomyces_cerevisiae.R64-1-1.ncrna.2bit          
  AH50338 | Saccharomyces_cerevisiae.R64-1-1.82.gtf              
  AH50407 | Saccharomyces_cerevisiae.R64-1-1.83.gtf              

This displays a list of the data items available. A fuller list in the form of a data frame is available with the mocls() function, this displays more information including the URL of the on-line data source.

mcols(sc)

Item AH49366 is the full genomic sequence for Saccharomyces cerevisiae release 81

scFaFile <- sc[["AH49366"]]
scFaFile
class: FaFile 
path: /Users/pschofield/.AnnotationHub/56011
index: /Users/pschofield/.AnnotationHub/56012
isOpen: FALSE 
yieldSize: NA 

The sequences in the FaFile object can be retrieved with the getSeq() command

getSeq(scFaFile)
  A DNAStringSet instance of length 17
       width seq                                       names               
 [1]  230218 CCACACCACACCCACACAC...TGGGTGTGGTGTGTGTGGG I dna:chromosome ...
 [2]  813184 AAATAGCCCTCATGTACGT...GTGTGGTGTGTGGGTGTGT II dna:chromosome...
 [3]  316620 CCCACACACCACACCCACA...GGTGGGTGTGGTGTGTGTG III dna:chromosom...
 [4] 1531933 ACACCACACCCACACCACA...GGTAGTAAGTAGCTTTTGG IV dna:chromosome...
 [5]  439888 CACACACACCACACCCACA...GTGGGTGTGGTGTGTGTGT IX dna:chromosome...
 ...     ... ...
[13] 1078177 CACACACACACACCACCCA...ACGTACATGAGGGCTATTT XII dna:chromosom...
[14]  924431 CCACACACACACCACACCC...GTGTGGTGTGTGTGTGGGG XIII dna:chromoso...
[15]  784333 CCGGCTTTCTGACCGAAAT...GTGTGGGTGTGGTGTGGGT XIV dna:chromosom...
[16] 1091291 ACACCACACCCACACCACA...GTGTGTGGGTGTGGTGTGT XV dna:chromosome...
[17]  948067 AAATAGCCCTCATGTACGT...TTTAATTTCGGTCAGAAAN XVI dna:chromosom...

The last item in the list of available S. cerevisiae annotation is the GTF file containing the genetic features and we can retrieve this into a GenomicRanges object

genes <- sc[["AH50407"]]
head(genes)
GRanges object with 6 ranges and 19 metadata columns:
      seqnames       ranges strand |   source        type     score
         <Rle>    <IRanges>  <Rle> | <factor>    <factor> <numeric>
  [1]       IV [1802, 2953]      + |  ensembl        gene      <NA>
  [2]       IV [1802, 2953]      + |  ensembl  transcript      <NA>
  [3]       IV [1802, 2953]      + |  ensembl        exon      <NA>
  [4]       IV [1802, 2950]      + |  ensembl         CDS      <NA>
  [5]       IV [1802, 1804]      + |  ensembl start_codon      <NA>
  [6]       IV [2951, 2953]      + |  ensembl  stop_codon      <NA>
          phase     gene_id gene_version   gene_name gene_source
      <integer> <character>  <character> <character> <character>
  [1]      <NA>     YDL248W            1        COS7     ensembl
  [2]      <NA>     YDL248W            1        COS7     ensembl
  [3]      <NA>     YDL248W            1        COS7     ensembl
  [4]         0     YDL248W            1        COS7     ensembl
  [5]         0     YDL248W            1        COS7     ensembl
  [6]         0     YDL248W            1        COS7     ensembl
        gene_biotype transcript_id transcript_version transcript_name
         <character>   <character>        <character>     <character>
  [1] protein_coding          <NA>               <NA>            <NA>
  [2] protein_coding       YDL248W                  1            COS7
  [3] protein_coding       YDL248W                  1            COS7
  [4] protein_coding       YDL248W                  1            COS7
  [5] protein_coding       YDL248W                  1            COS7
  [6] protein_coding       YDL248W                  1            COS7
      transcript_source transcript_biotype exon_number     exon_id
            <character>        <character> <character> <character>
  [1]              <NA>               <NA>        <NA>        <NA>
  [2]           ensembl     protein_coding        <NA>        <NA>
  [3]           ensembl     protein_coding           1   YDL248W.1
  [4]           ensembl     protein_coding           1        <NA>
  [5]           ensembl     protein_coding           1        <NA>
  [6]           ensembl     protein_coding           1        <NA>
      exon_version  protein_id protein_version
       <character> <character>     <character>
  [1]         <NA>        <NA>            <NA>
  [2]         <NA>        <NA>            <NA>
  [3]            1        <NA>            <NA>
  [4]         <NA>     YDL248W               1
  [5]         <NA>        <NA>            <NA>
  [6]         <NA>        <NA>            <NA>
  -------
  seqinfo: 17 sequences (1 circular) from Saccharomyces_cerevisiae.R64-1-1.83.gtf genome; no seqlengths

It is possible to retrieve the sequences for a gene of interest using these two annotations

# get the GenomicRanges item that has the gene name of interest and is of type gene 
ser3 <- genes[which(genes$gene_name=="SER3" & genes$type=="gene"),]
# pass this as a parameter to the getSeq function
seqSer3 <- getSeq(scFaFile,ser3)
names(seqSer3) <- ser3$gene_id
seqSer3
  A DNAStringSet instance of length 1
    width seq                                          names               
[1]  1410 ATGACAAGCATTGACATTAAC...CAATTAGATTGCTATATTAA YER081W

Exercise 2.1

  • Which chromosome is the SER3 gene on.
  • In the example above what other types genetic features are listed for the SER3 gene?
  • How many different bio-types are there in the genetic features data and what are they?
  • Find the name, length and sequence for the longest snoRNA.

NGS Data Processing

While it is possible to complete all steps of an RNA-seq analysis with tools from bioconductor there are many other alternatives. Web interface tools such as Galaxy provide form driven interfaces to many of the command line tools including several bioconductor tools and might be an easier introduction to NGS data analysis. Commercial tools such as CLC Bio NGS Workbench also exist that if you have access to might also be an attractive alternative. However bioconductor offers great flexibility and a huge depth in downstream analysis of results that there are certainly benefits from learning how to do an RNA-seq analysis with bioconductor even if you choose to use other software subsequently.

QC

I am not necessarily going to recommend doing QC in R since FastQC introduced earlier in the course is a fast and relatively painless why to summaries quality information on FASTQ sequence files. However it can be done and one way the task can be performed with in R using the ShortRead package.

require(ShortRead)
dirPath <- system.file(package="ShortRead", "extdata", "E-MTAB-1147")
qualAnalysis <- qa(dirPath,"fastq.gz", BPPARAM=SerialParam())
report(qualAnalysis,dest="myQual",type="html")

The package EDASeq (Risso, Schwartz, Sherlock, et al., 2011) also provides tools for the exploration of unaligned (fastq) data and aligned (bam) data.

require(EDASeq)
# Make a list of the files of interest
files = list.files(dataDir,pattern=".*bam$",full=T)

# prettify the names
names(files) <- gsub("[.]bam$", "", basename(files))

# make a list of the meta data in this case the sample condition 
cond <- gsub("[0-9]$","",names(files))

# create a data frame with the meta data in
meta <- DataFrame(cond,row.names=names(files))

# create a BamFileList object and set the meta data
bfs <- BamFileList(files)
elementMetadata(bfs) <- meta

# Call the EDASeq function to plot the qualities
plotQuality(bfs)

# Plot the perbase quality for a bamfile 
# WARNING this function hangs my version of R
plotQuality(bfs[[1]])

# Plot the total read counts
barplot(bfs)

# Plot the nucleotide frequencies for a sample
plotNtFrequency(bfs[[1]])

Exercise 2.2

Using the EDASeq function plotQuality() to examine the per cycle base quality for the file in the directory ’/usr/local/share/BS32010/pschofield/data/fastq`

Alignment

While you might choose other alignment tools such as TopHat or STAR for speed and flexibility there are aligners available in Bioconductor, chiefly

For simplicities sake and to show the process all with-in the Bioconductor environment this workshop uses the Rsubread package.

Build Index

As with all aligners there is a two step process to aligning reads with Rsubread and at least two files are necessary. First an index is needs to be built from a FASTA format genomic sequence reference, aligners tend to have their own index structure so indexes for one aligner will not work for an other in fact often indexes for one version release of an aligner will not work for another version.

There are two options at this point.

  • Use the genomic sequence, all the chromosomal DNA sequence to build the index, then at a subsequent step use an annotation file that hold information on the location of genetic features to count the reads that fall in the regions of DNA covered by each feature of interest.
  • Use the transcriptome (cDNA) sequence only, so instead of a sequence per chromosome there is a sequence per feature. This way when the alignment is performed the assignment to features is performed in the same step.
Thought Exercise

What are some advantages and disadvantages of the two alignment refernce strategies?

require(Rsubread)
ref <- system.file("extdata","reference.fa",package="Rsubread")
# Build an index using the reference fasta file
buildindex(basename="./myorganism_index",reference=ref)

Align

Once the reference is built the FASTQ files can be aligned. In two similar ways you can perform the alignment with either the basic align() function that calls the subread aligner, or with the subjunc() function that calls the more sophisticated subjunc aligner that performs alternative splicing junction detection using donor/receptor site sequence detection. For simple gene-wise differential expression subread() should be sufficient.

reads <- system.file("extdata","reads.txt.gz",package="Rsubread")
align(index="./myorganism_index",readfile1=reads,
      output_file="./Rsubread_alignment.BAM",phredOffset=64)

Both align() and subjunc() have a lot of parameters for fine tuning the alignment process. For example they can perform read trimming, and set limits on numbers of mismatches and fragmemt size. The output from both aligners are a BAM file, a file of insert-deletion locations

Feature Assignment

The feature assignment function in the Rsubread package is the featureCounts() function. It takes a list of BAM files and a genomic features annotation file and produces a featureCounts object that contains a count matrix and some statistics on the numbers of no assigned reads.

Exercise 2.3

Using the paired end fastq files available in the directory ’/usr/local/share/BS32010/pschofield/data/fastq` obtain a featureCounts object containing the reads that align to the genes on chromosomes 5 (V) of the Saccharomyces cerevisiae genome. Report the number of assigned and unassigned reads.

You will need to

  • Obtain the sequence for the chromosomes and build an index
  • Align the reads to the reference
  • Obtain the genomic features annotation for the features on those chromosomes
  • Assign the aligned reads to the featureRegions

Bibliography

[1] C. Barr, T. Wu and M. Lawrence. An R interface to the GMAP/GSNAP/GSTRUCT suiteCory Barr and Thomas Wu and Michael Lawrence. 2016.

[2] S. Davis and P. Meltzer. “GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor”. In: Bioinformatics 14 (2007), pp. 1846–1847.

[3] S. Durinck, Y. Moreau, A. Kasprzyk, et al. “BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis”. In: Bioinformatics 21 (2005), pp. 3439–3440.

[4] S. Durinck, P. T. Spellman, E. Birney, et al. “Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt”. In: Nature Protocols 4 (2009), pp. 1184–1191.

[5] A. Franceschini, D. Szklarczyk, S. Frankild, et al. “STRING v9.1: protein-protein interaction networks, with increased coverage and integration.” In: Nucleic acids research 41 (Database issue Jan. 2013), pp. D808–815. ISSN: 1362-4962 0305-1048. DOI: 10.1093/nar/gks1094. pmid: pmid.

[6] L. Guy, J. Roat Kultima and S. G. E. Andersson. “genoPlotR: comparative gene and genome visualization in R”. In: Bioinformatics 26.18 (Sep. 15, 2010), pp. 2334–2335. ISSN: 1367-4803. DOI: 10.1093/bioinformatics/btq413. URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2935412/.

[7] M. Lawrence, W. Huber, H. Pagès, et al. “Software for Computing and Annotating Genomic Ranges”. In: PLoS Computational Biology 9 (2013).

[8] Y. Liao, G. K. Smyth and W. Shi. “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote”. In: Nucleic Acids Research 41 (2013), p. 108.

[9] W. Luo and C. Brouwer. “Pathview: an R/Bioconductor package for pathway-based data integration and visualization”. In: Bioinformatics (Jun. 04, 2013). DOI: 10.1093/bioinformatics/btt285. URL: http://bioinformatics.oxfordjournals.org/content/early/2013/06/11/bioinformatics.btt285.abstract.

[10] H. Pages, P. Aboyoun, R. Gentleman, et al. Biostrings: String objects representing biological sequences, and matching. 2016.

[11] H. Pages, M. Carlson, S. Falcon, et al. AnnotationDbi: Annotation Database Interface. 2016.

[12] D. Risso, K. Schwartz, G. Sherlock, et al. “GC-content normalization for RNA-Seq data.” In: BMC Bioinformatics 12 (2011), p. 480.

[13] T. D. Wu and C. K. Watanabe. “GMAP: a genomic mapping and alignment program for mRNA and EST sequences”. In: Bioinformatics 21.9 (May. 01, 2005), pp. 1859–1875. DOI: 10.1093/bioinformatics/bti310. URL: http://bioinformatics.oxfordjournals.org/content/21/9/1859.abstract.

[14] Y. Zhu, R. M. Stephens, P. S. Meltzer, et al. “SRAdb: query and use public next-generation sequencing data from within R”. In: BMC Bioinformatics 14.1 (2013), pp. 1–4. ISSN: 1471-2105. DOI: 10.1186/1471-2105-14-19. URL: http://dx.doi.org/10.1186/1471-2105-14-19.

Session Info

R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] Rsamtools_1.22.0     Biostrings_2.38.4    XVector_0.10.0      
 [4] GenomicRanges_1.22.4 GenomeInfoDb_1.6.3   IRanges_2.4.8       
 [7] S4Vectors_0.8.11     AnnotationHub_2.2.3  BiocGenerics_0.16.1 
[10] knitr_1.12.3         RefManageR_0.10.6    pietalib_0.1        

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3                  futile.logger_1.4.1         
 [3] BiocInstaller_1.20.1         formatR_1.3                 
 [5] plyr_1.8.3                   futile.options_1.0.0        
 [7] bitops_1.0-6                 tools_3.2.3                 
 [9] zlibbioc_1.16.0              digest_0.6.9                
[11] lubridate_1.5.0              evaluate_0.8.3              
[13] RSQLite_1.0.0                bibtex_0.4.0                
[15] shiny_0.13.1                 DBI_0.3.1                   
[17] curl_0.9.6                   yaml_2.1.13                 
[19] stringr_1.0.0                httr_1.1.0                  
[21] Biobase_2.30.0               R6_2.1.2                    
[23] AnnotationDbi_1.32.3         BiocParallel_1.4.3          
[25] XML_3.98-1.4                 rmarkdown_0.9.5             
[27] RJSONIO_1.3-0                lambda.r_1.1.7              
[29] magrittr_1.5                 htmltools_0.3               
[31] mime_0.4                     interactiveDisplayBase_1.8.0
[33] xtable_1.8-2                 httpuv_1.3.3                
[35] stringi_1.0-1                RCurl_1.95-4.8