Bioconductor

Bioconductor (Huber, W., Carey, et al., 2015) is a curated repository of open source packages for the R statistical environment specifically aimed at biological information processing. There are three main strands of packages

  • Software packages that contain libraries of functions and data object classes for the processing and statistical analysis of biological data
  • Annotation packages that contain data files that hold curated biological information databases for example lists of genes in a species, or the DNA sequence of an organism.
  • Experimental Data packages that contain the results files in various states of processing and analysis form biological experiment.

The project has a website which is well worth exploring to discover the range of available packages.

Bioconductor Home Page

Bioconductor Home Page

Installation

If you are installing R RStudio and bioconductor on a computer the initial installation of bioconductor requires the installation of a script from the bioconductor website.

source("http://bioconductor.org/biocLite.R")

Once this script has been installed it can be run to install the basic bioconductor installation package

biocLite("BiocInstaller")

Subsequently you can install packages with the BiocInstaller packages without the sourcing of the script

# with either
BiocInstaller::biocLite("NameOfPackage")

# or load the installer package first

require(BiocInstaller)
biocLite("NameOfPackage")

However if you want a bioconductor package on the RStudio Server which is not installed it is probably better to request SLS Computing install the package centrally for you and everyone else can also then access it. The same is probably true of CRAN packages. This has the added advantage that you are less likely to get incompatibilities between packages which can arise if you install a package that is incompatible with a version of another package installed globally

Core Data Classes

One of the most useful things Bioconductor provides is a collection of common data classes that have storage structures and processing methods that as used extensively throughout the library.

Biostrings

The Biostrings package (Pages, Aboyoun, Gentleman, et al., 2016) provides several classes for the storage of sequences of single characters. It provides a general virtual class XString for storing big string sequences of any character. It provides derived classes [BString], [DNAString], [RNAString] and [AAString] for general strings, DNA sequences, RNA sequences and Amino Acid sequences such as peptides and proteins. I also provides a class for collections or sets of such objects derived from the [XStringSet] class. It also provides classes for holding multiple sequence alignments and masked and quality scaled sequences and sets of aligned sequences.

Also provided are several standard constants such as DNA_ALPHABET with the full set of IUPAC letters for DNA, and DNA_BASES that contains just the unambiguous base alphebet.

require(Biostrings)
DNA_ALPHABET
 [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
[18] "."
DNA_BASES
[1] "A" "C" "G" "T"

So for example we can generate a vector of 10 bases chosen at random from the DNA_BASES alphabet using the sample() function

# sample the alphabet
vcDNA <- sample(DNA_BASES,10,replace=T)
str(vcDNA)
 chr [1:10] "A" "T" "G" "G" "A" "T" "T" "A" "G" ...

The constructor for an object of the DNAString class wants a single string of chararters not a vector of single characters so we can use the paste() function to turn the vector of bases to a single string

stDNA <- paste(vcDNA,collapse="")
str(stDNA)
 chr "ATGGATTAGC"

We can then use the DNAString() constructor function to make a DNAString object.

DNA <- DNAString(stDNA)
str(DNA)
Formal class 'DNAString' [package "Biostrings"] with 5 slots
  ..@ shared         :Formal class 'SharedRaw' [package "XVector"] with 2 slots
  .. .. ..@ xp                    :<externalptr> 
  .. .. ..@ .link_to_cached_object:<environment: 0x7fee7ddf2e08> 
  ..@ offset         : int 0
  ..@ length         : int 10
  ..@ elementMetadata: NULL
  ..@ metadata       : list()

The str() function here reveals the structure of the DNA objects you can see it is of class DNAString and you can see the name of the package that defined this class. It also has 5 slots, there are the objcts that appear in the indented list starting with an @ sign.

You can retrieve the content of these slots with the syntax objectname @ slotname .

DNA@shared
SharedRaw of length 10 (data starting at address 0x7fee81615658)

In the case of XString objects you can also retrieve the sequence as a character string using the casting function as.character()

seq <- as.character(DNA)
seq
[1] "ATGGATTAGC"

As well as the str() function used above we can use the is() function to see what type of thing R knows an object to be.

is(DNA)
[1] "DNAString" "XString"   "XRaw"      "XVector"   "Vector"    "Annotated"

Biostrings provides methods for reading and writing common file formats containing sequences searching such as FASTA and FASTQ, for searching with pattern and position weight matrices, and for pairwise sequence alignments.

bigSeq <- DNAString("AGCTGGTCGATGACGTTAGGATCAGTACGATTGAGGCAGGGTCAGTAATTGGATTGAGGATTGACCCCGATG")
smallSeq <- DNAString("TCAGTA")
pos <- matchPattern(smallSeq,bigSeq)
pos
  Views on a 72-letter DNAString subject
subject: AGCTGGTCGATGACGTTAGGATCAGTACGATT...TCAGTAATTGGATTGAGGATTGACCCCGATG
views:
    start end width
[1]    22  27     6 [TCAGTA]
[2]    42  47     6 [TCAGTA]

Execise 1.2

  • What kind of object is the pos object above (the result of the patternMatch() call)
    • use the str() function to view its structure.
    • recover content of the ranges slot
    • use the is() function discover the full list of object types and classes the ranges slot belongs to

IRanges

The IRanges package (Lawrence, Huber, Pagès, et al., 2013) provides a way of storing, manipulating and calculating with interval ranges. Interval ranges represent regions of a sequence.

You can construct an IRange object with the IRanges() constructor, by specifying either of two lists,

  • start and end positions
  • start positions and widths
library(IRanges)   
ir <- IRanges(start = c(1, 8, 14, 15, 19, 34, 40),
              width = c(12, 6, 6, 15, 6, 2, 7))
plotRanges(ir)

The IRanges package provides functions for manipulating and analysing IRanges objects, for example there is a function overLapsAny() that will tell you which ranges in an IRanges object overlap

overLaps <- overlapsAny(ir)
overLaps
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

and exactly which the overlap with

overLaps <- findOverlaps(ir)
overLaps
Hits object with 15 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]         1           1
   [2]         1           2
   [3]         2           1
   [4]         2           2
   [5]         3           3
   ...       ...         ...
  [11]         5           3
  [12]         5           4
  [13]         5           5
  [14]         6           6
  [15]         7           7
  -------
  queryLength: 7
  subjectLength: 7

We can split the ranges in to groups (or bins) of non-overlapping (or disjoint) ranges

bins <- disjointBins(ir)
bins
[1] 1 2 1 2 3 1 1

GenomicRanges

The GenomicRanges package (Lawrence, Huber, Pagès, et al., 2013) makes use of a compression technique called Run Length Encoding(RLE) to store extra information with interval ranges to hold data such as genomic locations

Run Length Encoding

# generate a repetative sequence
mySeq <- unlist(sapply(rnorm(10,5,1),
                  function(n){
                    rep(sample(DNA_BASES,1),n)
                  }))
# what does it look like
mySeq
 [1] "C" "C" "C" "C" "T" "T" "T" "C" "C" "C" "C" "C" "G" "G" "G" "G" "A"
[18] "A" "A" "G" "G" "G" "G" "A" "A" "A" "A" "A" "A" "G" "G" "G" "G" "G"
[35] "G" "C" "C" "C" "C" "C" "C" "G" "G" "G" "G"
# how big is it
paste0("Size of mySeq ",format(object.size(mySeq),units = "Kb"))
[1] "Size of mySeq 0.6 Kb"
# turn it into an RLE object
myRLE <- Rle(mySeq)
# what does that look like
myRLE
character-Rle of length 45 with 10 runs
  Lengths:   4   3   5   4   3   4   6   6   6   4
  Values : "C" "T" "C" "G" "A" "G" "A" "G" "C" "G"
# how big is that
paste0("Size of myRLE ",format(object.size(myRLE),units = "Kb"))
[1] "Size of myRLE 1.4 Kb"

Hmmmm! not much advantage “I thought you said it was a compression technique”, well if you a looking at longer sequences with more repeats

# generate a much bigger repetative sequence
mySeq <- unlist(sapply(rnorm(100,10,2),
                  function(n){
                    rep(sample(DNA_BASES,1),n)
                  }))
paste0("Size of mySeq ",format(object.size(mySeq),units = "Kb"))
[1] "Size of mySeq 7.7 Kb"
# and now the RLE
myRLE <- Rle(mySeq)
paste0("Size of myRLE ",format(object.size(myRLE),units = "Kb"))
[1] "Size of myRLE 2.1 Kb"

GenomicRanges (also called GRanges) are excelent for genomic annotations. The minimal requirements for a GRanges object are an RLE object with sequences names and a set of ranges associated with the seqences in the form of an IRanges object. The GRanges object also contains a slot for strand (sense on the DNA), however this defaults to * which represents unspecified if not specifically given.

grObj <- GRanges(seqnames=Rle(c(rep("seq1",4),rep("seq2",2),"seq3")),
                 ranges = IRanges(start = c(110, 200 , 250, 350, 100, 300, 40),
                                  width = c(50, 40, 60, 150, 60, 200, 70)))
grObj
GRanges object with 7 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]     seq1 [110, 159]      *
  [2]     seq1 [200, 239]      *
  [3]     seq1 [250, 309]      *
  [4]     seq1 [350, 499]      *
  [5]     seq2 [100, 159]      *
  [6]     seq2 [300, 499]      *
  [7]     seq3 [ 40, 109]      *
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

GRanges objects can store other information along with this basic data, typically they are used for information in GTF and BED annotation files. The package rtracklayer has many functions for importing and exporting data from these file formats

require(rtracklayer)
track <- import(system.file("tests", "v1.gff", package = "rtracklayer"))
export(track, "my.gff3", "gff3")
track
GRanges object with 3 ranges and 5 metadata columns:
      seqnames             ranges strand |   source     type     score
         <Rle>          <IRanges>  <Rle> | <factor> <factor> <numeric>
  [1]    chr21 [1000000, 1011000]      + | TeleGene enhancer       500
  [2]    chr22 [1010000, 1010100]      + | TeleGene promoter       900
  [3]    chr22 [1020000, 1020000]      - | TeleGene promoter       800
          phase    group
      <integer> <factor>
  [1]      <NA>   touch1
  [2]      <NA>   touch1
  [3]      <NA>   touch2
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

R has a few packages that are designed for visualiation of genomic range data. Two of the more flexible are the package Gviz and the package ggbio.

For example we can plot use Gviz to plot some mocked up annotation

require(Gviz)
# if we mock up some data
grObj <- GRanges(seqnames=Rle(rep("7",4)),
                 ranges=IRanges(start=c(2000000, 2070000, 2100000, 2160000),
                                end=c(2050000, 2130000, 2150000, 2170000)),
                 strand=c("-", "+", "-", "-"),
                 group=c("Group1","Group2","Group1", "Group3"))
# make and annotation track
annTrack <- AnnotationTrack(grObj, genome="hg19", feature="test",
                            id=paste("annTrack item", 1:4),
                            name="annotation track",
                            stacking="squish")
# make and axis track
axis <- GenomeAxisTrack()
# and an ideogram (chromosome map track)
ideo <- IdeogramTrack(genome="hg19",chromosome=7)
## Now plot the tracks
res <- plotTracks(list(ideo, axis, annTrack))

GenomicAlignments

There are several packages in bioconductor that are specifically designed for the handling of short read data, in the form of FASTQ raw files and alignment files such as BAM and SAM files. There are differences in the underlying data structures they use and the functions the provide.

The ShortRead package (Morgan, Anders, Lawrence, et al., 2009) is particularly suited for processing reads in FASTQ files and provides functions for sampling and QC tasks, it imports the reads into Biostring objects. It also contains some functions for single ended alignments. However, the packages GenomicAlignments (Lawrence, Huber, Pagès, et al., 2013) provides much more extensive support for paired end and gapped alignments in BAM files. It uses the package Rsamtools to import and export alignment data files.

When alignments are read in the reads are held in GRanges objects with all the functions of the GenomicRanges and IRanges packages available for processing.

alignFile <- system.file(package="Gviz", "extdata", "gapped.bam")
gaData <- readGAlignmentPairs(alignFile)
gaData
GAlignmentPairs object with 1074 pairs, strandMode=1, and 0 metadata columns:
         seqnames strand   :             ranges  --             ranges
            <Rle>  <Rle>   :          <IRanges>  --          <IRanges>
     [1]    chr12      +   : [2966852, 2966901]  -- [2966993, 2967042]
     [2]    chr12      +   : [2966856, 2966905]  -- [2966963, 2967012]
     [3]    chr12      +   : [2966856, 2966905]  -- [2967024, 2967073]
     [4]    chr12      +   : [2966859, 2966908]  -- [2967003, 2967052]
     [5]    chr12      +   : [2966859, 2966908]  -- [2967010, 2967059]
     ...      ...    ... ...                ... ...                ...
  [1070]    chr12      +   : [3154300, 3154349]  -- [3154446, 3154495]
  [1071]    chr12      -   : [3154359, 3154408]  -- [3154244, 3154293]
  [1072]    chr12      -   : [3154599, 3154648]  -- [3154459, 3154508]
  [1073]    chr12      -   : [3154743, 3154792]  -- [3154597, 3154646]
  [1074]    chr12      -   : [3155131, 3155180]  -- [3154990, 3155039]
  -------
  seqinfo: 1 sequence from an unspecified genome

Gvis also has functions to make plotting of alignment data relatively painless.

atrack <- AlignmentsTrack(alignFile,isPaired=T)
plotTracks(atrack,from=2960000,to=3160000,chromosome="chr12")

Data Import Export

Bioinformatics is settling on a fairly steady set of file formats, FASTA and FASTQ for sequence data for example and SAM/BAM for NGS short read alignment data. While many of packages mentioned in this workshop possess functions for importing and exporting data that the package data structures are designed to handle they often don’t possess functions for other files. There are also some standard formats for processed data for example read depths, peak locations and genetic annotations. This type of data often appears as track in a genome browser.

rtracklayer

The package rtracklayer (Lawrence, Gentleman, and Carey, 2009) has two useful generic functions import() and export() that will handle several track formats including FASTA, GTF, BED, WIG.

Annotations

There are many packages in bioconductor that contain curated annotation data for numerous and diverse organisms. See the list on the BiocViews AnnotationData page

Annotation Packages

There are annotation datasets that are available as Bioconductor packages many of them use the RSQLite package to interface to sqlite database files that hold the information locally. Bioconductor provides some standard packages for a range of annotation databases. Examples of these are:

Online Packages

Bioconductor also provides some library packages that are designed to ease searching various on-line databases for annotation data. These can be used to programmatically retrieve data that is often also avaliable to be downloaded from web sites such as ENSEMBL and UCSC

Biomart Example

Provided you are not working in bacteria the [biomaRt] package provides a useful interface onto the ENSEMBL databases. Biomart provides a standardised data access protocols and ENSEMBL have implemented it on their genome servers for

as well as their popular genomes server, but not bacteria

AnnotationHub Example

We know there is some Saccharomyces cerevisiae data coming our way so lets look to see what is available through AnnotationHub and lets look at the data provider ENSEMBL.

require(AnnotationHub)
ah <- AnnotationHub()
sc <- query(ah,c("ENSEMBL","Saccharomyces cerevisiae"))
sc

This displays a list of the data items available. A fuller list in the form of a data frame is available with the mocls() function, this displays more information including the URL of the on-line data source.

mcols(sc)

Item AH49366 is the full genomic sequence for Saccharomyces cerevisiae release 81

scFaFile <- sc[["AH49366"]]
scFaFile

The sequences in the FaFile object can be retrieved with the getSeq() command

getSeq(scFaFile)

The last item in the list of available S. cerevisiae annotation is the GTF file containing the genetic features and we can retrieve this into a GenomicRanges object

genes <- sc[["AH50407"]]
head(genes)

It is possible to retrieve the sequences for a gene of interest using these two annotations

# get the GenomicRanges item that has the gene name of interest and is of type gene 
ser3 <- genes[which(genes$gene_name=="SER3" & genes$type=="gene"),]
# pass this as a parameter to the getSeq function
seqSer3 <- getSeq(scFaFile,ser3)
names(seqSer3) <- ser3$gene_id
seqSer3

Exercise 2.1

  • Which chromosome is the SER3 gene on.
  • In the example above what other types genetic features are listed for the SER3 gene?
  • How many different bio-types are there in the genetic features data and what are they?
  • Find the name, length and sequence for the longest snoRNA.

Learning Bioconductor

Bioconductor provides many resources for assisting in learning how to do various analyses. It has a very active support forum where many bioconductor uses and many of the package authors will answer questions regarding the packages.

Package Vignettes

All Bioconductor packages are supposed to come with a user manual which is a concise list of the functions and data structures available in the package listed alphabetically and with all the function parameters documented. They are also expected to have examples for all the functions and at least one vignette document that is meant to be a more recipe, How-to, document. You can list the names of the available vignettes with the command vignette()

vignette(package="Biostrings")

You can then view the vignette with the same command only this time passing the name of the vignette you want.

vignette("BiostringsQuickOverview")

Workflows

One of the ways this is done is via workflows

For example the RNA-seq Workflow introduces several alternative packages for performing RNA-seq. We have concentrated on the edgeR package whereas the workflow concentrates on the DESeq2

Tutorials

The support forum has a specific section of links to tutorial

CRAN

Not all R packages for handling biological data are in bioconductor, some are in the Comprehensive R Archive Network (CRAN) and can be installed using the install.packages() function. Several of the CRAN Task Views are of interest to biological data analysis most specifically the Phylogenetic Task View that list available packages on CRAN for phylogenetic studies.

These are installed through the base packages install.packages() function.

install.packages("packagename")
  • Analyses of Phylogenetics and Evolution (ape) is a very useful package that has several functions for importing data formats and even accessing online information. For example

    • the read.GenBank() function can be used to retrieve sequence information by GenBank accession number
  • genoPlotR (Guy, Roat Kultima, and Andersson, 2010) also has a functions for reading feature and sequence annotation data
    • the read_dna_seg_from_genbank() function will read the features from a GenBank file

There are also numerous packages available on GitHub and the devtools package has a useful funtion install.git_hub() for installing packages from GitHub

Bibliography

[1] S. Davis and P. Meltzer. “GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor”. In: Bioinformatics 14 (2007), pp. 1846–1847.

[2] S. Durinck, Y. Moreau, A. Kasprzyk, et al. “BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis”. In: Bioinformatics 21 (2005), pp. 3439–3440.

[3] S. Durinck, P. T. Spellman, E. Birney, et al. “Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt”. In: Nature Protocols 4 (2009), pp. 1184–1191.

[4] A. Franceschini, D. Szklarczyk, S. Frankild, et al. “STRING v9.1: protein-protein interaction networks, with increased coverage and integration.” In: Nucleic acids research 41 (Database issue Jan. 2013), pp. D808–815. ISSN: 1362-4962 0305-1048. DOI: 10.1093/nar/gks1094. pmid: pmid.

[5] L. Guy, J. Roat Kultima and S. G. E. Andersson. “genoPlotR: comparative gene and genome visualization in R”. In: Bioinformatics 26.18 (Sep. 15, 2010), pp. 2334–2335. ISSN: 1367-4803. DOI: 10.1093/bioinformatics/btq413. URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2935412/.

[6] Huber, W., Carey, et al. “Orchestrating high-throughput genomic analysis with Bioconductor”. In: Nature Methods 12.2 (2015), pp. 115–121.

[7] M. Lawrence, R. Gentleman and V. Carey. “rtracklayer: an R package for interfacing with genome browsers”. In: Bioinformatics 25 (2009), pp. 1841–1842.

[8] M. Lawrence, W. Huber, H. Pagès, et al. “Software for Computing and Annotating Genomic Ranges”. In: PLoS Computational Biology 9 (2013).

[9] W. Luo and C. Brouwer. “Pathview: an R/Bioconductor package for pathway-based data integration and visualization”. In: Bioinformatics (Jun. 04, 2013). DOI: 10.1093/bioinformatics/btt285. URL: http://bioinformatics.oxfordjournals.org/content/early/2013/06/11/bioinformatics.btt285.abstract.

[10] M. Morgan, S. Anders, M. Lawrence, et al. “ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data.” In: Bioinformatics (Oxford, England) 25.19 (Oct. 01, 2009), pp. 2607–2608. ISSN: 1367-4811 1367-4803. DOI: 10.1093/bioinformatics/btp450. pmid: pmid.

[11] H. Pages, P. Aboyoun, R. Gentleman, et al. Biostrings: String objects representing biological sequences, and matching. 2016.

[12] H. Pages, M. Carlson, S. Falcon, et al. AnnotationDbi: Annotation Database Interface. 2016.

[13] Y. Zhu, R. M. Stephens, P. S. Meltzer, et al. “SRAdb: query and use public next-generation sequencing data from within R”. In: BMC Bioinformatics 14.1 (2013), pp. 1–4. ISSN: 1471-2105. DOI: 10.1186/1471-2105-14-19. URL: http://dx.doi.org/10.1186/1471-2105-14-19.

Session Info

R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] Gviz_1.14.7                rtracklayer_1.30.4        
 [3] GenomicAlignments_1.6.3    Rsamtools_1.22.0          
 [5] SummarizedExperiment_1.0.2 Biobase_2.30.0            
 [7] GenomicRanges_1.22.4       GenomeInfoDb_1.6.3        
 [9] Biostrings_2.38.4          XVector_0.10.0            
[11] IRanges_2.4.8              S4Vectors_0.8.11          
[13] BiocGenerics_0.16.1        BiocInstaller_1.20.1      
[15] knitr_1.12.3               RefManageR_0.10.13        
[17] pietalib_0.1              

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4              lubridate_1.5.6         
 [3] biovizBase_1.18.0        lattice_0.20-33         
 [5] digest_0.6.9             R6_2.1.2                
 [7] plyr_1.8.3               futile.options_1.0.0    
 [9] acepack_1.3-3.3          RSQLite_1.0.0           
[11] evaluate_0.8.3           httr_1.1.0              
[13] ggplot2_2.1.0            zlibbioc_1.16.0         
[15] GenomicFeatures_1.22.13  rpart_4.1-10            
[17] Matrix_1.2-5             rmarkdown_0.9.5         
[19] labeling_0.3             splines_3.2.4           
[21] BiocParallel_1.4.3       stringr_1.0.0           
[23] foreign_0.8-66           RCurl_1.95-4.8          
[25] biomaRt_2.26.1           munsell_0.4.3           
[27] htmltools_0.3.5          nnet_7.3-12             
[29] gridExtra_2.2.1          Hmisc_3.17-3            
[31] matrixStats_0.50.2       XML_3.98-1.4            
[33] bitops_1.0-6             gtable_0.2.0            
[35] DBI_0.3.1                magrittr_1.5            
[37] formatR_1.3              scales_0.4.0            
[39] bibtex_0.4.0             stringi_1.0-1           
[41] latticeExtra_0.6-28      futile.logger_1.4.1     
[43] Formula_1.2-1            lambda.r_1.1.7          
[45] RColorBrewer_1.1-2       tools_3.2.4             
[47] RJSONIO_1.3-0            dichromat_2.0-0         
[49] BSgenome_1.38.0          survival_2.39-2         
[51] yaml_2.1.13              AnnotationDbi_1.32.3    
[53] colorspace_1.2-6         cluster_2.0.4           
[55] VariantAnnotation_1.16.4