Functional Annotation Packages

There is an increasing number of packages for accessing and analysing functional annotation enrichments associated with set of significant genes. For example looking for over represented Gene Ontology (GO) terms or over represented pathways from KEGG. On of the simplest to use is the package STRINGdb

We need the output from the feature analysis for example the results from the differential gene expression analysis. This might exist already as a saved RData object but here we will quickly recreate it using edgeR from the featureCounts

# download the data if it is not already on the system
# 
if(!file.exists(paste0(rnaLocal,"Rdata/fullSC.RData"))){
  dir.create(paste0(rnaLocal,"Rdata/"),recursive=T,showWarnings = F)
  download.file("http://www.compbio.dundee.ac.uk/user/pschofield/data/fullSC.RData",
                paste0(rnaLocal,"Rdata/fullSC.RData"))
}
load( paste0(rnaLocal,"Rdata/fullSC.RData"))
countMatrix <- fc$counts
colnames(countMatrix)<-gsub("[.]*","",colnames(countMatrix))
colnames(countMatrix)<-gsub("sbam$","",colnames(countMatrix))
colnames(countMatrix)<-gsub("Snf2","Mut",colnames(countMatrix))

# if dont already have the results calculate them otherwise load them
if(!file.exists(paste0(rnaLocal,"Rdata/dgeSC.RData"))){
  require(edgeR)
  conditions <- as.factor(substr(colnames(countMatrix),1,1))
  conditions <- relevel(conditions,ref = "W")
  design <- model.matrix( ~ conditions -1)
  dge <- DGEList(countMatrix, group=conditions)
  dge <- calcNormFactors(dge,method="TMM")
  dge <- estimateDisp(dge,design)
  fit <- glmFit(dge,design)
  lrt <- glmLRT(fit,contrast = c(-1,1))
  tt <- topTags(lrt, n=10000)
  dge_scerevisiae <- tt$table[,c("FDR","logFC")]
  dge_scerevisiae$geneId <- rownames(dge_scerevisiae)
  save(dge_scerevisiae,file=paste0(rnaLocal,"Rdata/dgeSC.RData"))
}else{
  load(paste0(rnaLocal,"Rdata/dgeSC.RData"))
}

A reminder of what the results look like

FDR logFC geneId
YOR290C 0 8.758857 YOR290C
YNR034W-A 0 4.990543 YNR034W-A
YHR136C 0 4.958985 YHR136C
YOL052C-A 0 4.410756 YOL052C-A
YDR119W-A 0 4.349755 YDR119W-A
YPR160W 0 4.298277 YPR160W

One of the oldest challenges in bioinformatics is the problem caused by id proliferations and inconsistencies. For example the same gene will have different id in the ENSEMBL, NCBI and Entrezgene databases. It is possible to get some of these from the ENSEMBL database using biomaRt

require(biomaRt)

# get a list of the available species from ENSEMBL
ens <- useMart("ensembl")
marts <- listDatasets(ens)

knitr::kable(marts[which(grepl("^S",marts$description)),])
dataset description version
13 scerevisiae_gene_ensembl Saccharomyces cerevisiae genes (R64-1-1) R64-1-1
26 sscrofa_gene_ensembl Sus scrofa genes (Sscrofa10.2) Sscrofa10.2
64 saraneus_gene_ensembl Sorex araneus genes (sorAra1) sorAra1
66 sharrisii_gene_ensembl Sarcophilus harrisii genes (DEVIL7.0) DEVIL7.0

So there is a dataset called scerevisiae_gene_ensembl that is the one we want

scmart <- useDataset("scerevisiae_gene_ensembl", ens)

We can see a full list of the data available in the scerevisiae_gene_ensembl dataset.

attributePages(scmart)
[1] "feature_page" "structure"    "homologs"     "snp"         
[5] "snp_somatic"  "sequences"   
knitr::kable(head(listAttributes(scmart,"feature_page")))
name description
ensembl_gene_id Ensembl Gene ID
ensembl_transcript_id Ensembl Transcript ID
ensembl_peptide_id Ensembl Protein ID
ensembl_exon_id Ensembl Exon ID
description Description
chromosome_name Chromosome Name
knitr::kable(head(listAttributes(scmart,"sequences")))
name description
1167 transcript_exon_intron Unspliced (Transcript)
1168 gene_exon_intron Unspliced (Gene)
1169 transcript_flank Flank (Transcript)
1170 gene_flank Flank (Gene)
1171 coding_transcript_flank Flank-coding region (Transcript)
1172 coding_gene_flank Flank-coding region (Gene)

We can get some of this information for the signicantly differentially expressed genes

# again if scripting this it may be something we don't want to do everytime we run the script
# but being mindful of the fact we might change parameters in the upstream analysis that would
# mean the set of significant genes changes and this step would need refreshing
# 
if(!file.exists(paste0(homeDir,"Rdata/annSC.RData"))){
  annData <- getBM(
     attributes = c("ensembl_gene_id",
                    "external_gene_name",
                    "description",
                    "entrezgene",
                    "refseq_dna",
                    "refseq_mrna"),
    filter="ensembl_gene_id",
    values=rownames(dge_scerevisiae)[which(dge_scerevisiae$FDR<=0.05)],
    mart=scmart)
  save(annData,file=paste0(rnaLocal,"Rdata/annSC.RData"))
}else{
  load(paste0(rnaLocal,"Rdata/annSC.RData"))
}

head(annData)
  ensembl_gene_id external_gene_name
1         YPR161C               SGV1
2         YOL138C               RTC1
3         YGR129W               SYF2
4         YPR165W               RHO1
5         YPR098C                   
6         YPL015C               HST2
                                                                                                                                                                                                                                                                                                                                                                                   description
1                      Cyclin (Bur2p)-dependent protein kinase; part of the BUR kinase complex which functions in transcriptional regulation; phosphorylates the carboxy-terminal domain (CTD) of Rpo21p and the C-terminal repeat domain of Spt5p; recruits Spt6p to the CTD at the onset of transcription; regulated by Cak1p; similar to metazoan CDK9 proteins [Source:SGD;Acc:S000006365]
2                                                                                                                 Subunit of the SEA (Seh1-associated) complex; SEA is a coatomer-related complex that associates dynamically with the vacuole; null mutation suppresses cdc13-1 temperature sensitivity; has N-terminal WD-40 repeats and a C-terminal RING motif [Source:SGD;Acc:S000005498]
3                                                                               Member of the NineTeen Complex (NTC); NTC contains Prp19p and stabilizes U6 snRNA in catalytic forms of the spliceosome containing U2, U5, and U6 snRNAs; relocalizes to the cytosol in response to hypoxia; isy1 syf2 cells have defective spindles activiating cell cycle arrest [Source:SGD;Acc:S000003361]
4                                                                                                                                   GTP-binding protein of the rho subfamily of Ras-like proteins; involved in establishment of cell polarity; regulates protein kinase C (Pkc1p) and the cell wall synthesizing enzyme 1,3-beta-glucan synthase (Fks1p and Gsc2p) [Source:SGD;Acc:S000006369]
5                                                                                                                                                                                                                                                                                       Protein of unknown function; localized to the mitochondrial outer membrane [Source:SGD;Acc:S000006302]
6 Cytoplasmic NAD(+)-dependent protein deacetylase; member of the silencing information regulator 2 (Sir2) family of NAD(+)-dependent protein deacetylases; modulates nucleolar (rDNA) and telomeric silencing; possesses NAD(+)-dependent histone deacetylase activity in vitro; contains a nuclear export signal (NES); function regulated by its nuclear export [Source:SGD;Acc:S000005936]
  entrezgene   refseq_dna  refseq_mrna
1     856290 NM_001184258 NM_001184258
2     853982 NM_001183392 NM_001183392
3     853030 NM_001181258 NM_001181258
4     856294 NM_001184262 NM_001184262
5     856213 NM_001184195 NM_001184195
6     856092 NM_001183829 NM_001183829

We can combine this data with the gene expression data using merge

annRes <- merge(annData,dge_scerevisiae,by.x="ensembl_gene_id",by.y="geneId")

knitr::kable(head(annRes[order(abs(annRes$logFC),decreasing = T),]))
ensembl_gene_id external_gene_name description entrezgene refseq_dna refseq_mrna FDR logFC
4863 YOR290C SNF2 Catalytic subunit of the SWI/SNF chromatin remodeling complex; involved in transcriptional regulation; contains DNA-stimulated ATPase activity; functions interdependently in transcriptional activation with Snf5p and Snf6p [Source:SGD;Acc:S000005816] 854465 NM_001183709 NM_001183709 0.0e+00 8.758857
2098 YHR095W Dubious open reading frame; unlikely to encode a functional protein, based on available experimental and comparative sequence data [Source:SGD;Acc:S000001137] NA 0.0e+00 -5.485661
4500 YNR034W-A Putative protein of unknown function; expression is regulated by Msn2p/Msn4p; YNR034W-A has a paralog, YCR075W-A, that arose from the whole genome duplication [Source:SGD;Acc:S000007525] 855770 NM_001184446 NM_001184446 0.0e+00 4.990543
140 tN(GUU)P Asparagine tRNA (tRNA-Asn), predicted by tRNAscan-SE analysis [Source:SGD;Acc:S000006677] NA 0.0e+00 4.963624
2129 YHR136C SPL2 Protein with similarity to cyclin-dependent kinase inhibitors; downregulates low-affinity phosphate transport during phosphate limitation by targeting Pho87p to the vacuole; upstream region harbors putative hypoxia response element (HRE) cluster; overproduction suppresses a plc1 null mutation; promoter shows an increase in Snf2p occupancy after heat shock; GFP-fusion protein localizes to the cytoplasm [Source:SGD;Acc:S000001178] 856538 NM_001179266 NM_001179266 0.0e+00 4.958985
174 tY(GUA)J1 SUP7 Tyrosine tRNA (tRNA-Tyr), predicted by tRNAscan-SE analysis; can mutate to suppress ochre nonsense mutations [Source:SGD;Acc:S000006781] NA 3.4e-06 4.538889

And satisfyingly and fortunately the biggest foldchange between the WT and the SNF2 deletion mutant is the SNF2 gene itself.

STRINGdb

The first step after loading the packages is to create an object that links to STRING, this requires knowing the species id of the organism of interest of which there are over 2000.

require(STRINGdb)

# List the names and ids of the organisms
species <- get_STRING_species(version="10", species_name=NULL)

# find the organisms starting with "Sacc"
species[which(grepl("^Sacc",species$official_name)),]
     species_id                         official_name
26         4932              Saccharomyces cerevisiae
298      203122         Saccharophagus degradans 2-40
897      405948 Saccharopolyspora erythraea NRRL 2338
1083     471857   Saccharomonospora viridis DSM 43017
                    compact_name   kingdom type
26      Saccharomyces cerevisiae eukaryota core
298     Saccharophagus degradans  bacteria core
897  Saccharopolyspora erythraea  bacteria core
1083   Saccharomonospora viridis  bacteria core
# create a new STRING_db object for Saccharomyces cerevisiae
string_db <- STRINGdb$new(version="10",species=4932)

Once we have the STRINGdb we need to convert or map the ids we have from ENSEMBL for our genes to the gene ids in STRINGdb. We can use either the entrez gene id or the ensembl gene id but they have differening success rates so better to use the one that maximises the number of mapped genes

# map to STRING ids
example1_mapped = string_db$map(annRes, "entrezgene", removeUnmappedRows = TRUE )
Warning:  we couldn't map to STRING 8% of your identifiers
example1_mapped = string_db$map(annRes, "ensembl_gene_id", removeUnmappedRows = TRUE )
Warning:  we couldn't map to STRING 3% of your identifiers

STRINGdb then lets us draw a gene interaction network for a set of genes. Here we select the STRING_id for the top 25 differentially expressed genes

# get the STRING_id for the top 50 genes
hits = example1_mapped$STRING_id[1:25]  

# plot the STRING network png 
string_db$plot_network( hits )

It is also possible to add to this diagram the foldchange information in the form of coloured halos around the notes that represent the foldchange as an intensity on a Red to Green colour scale.

# filter by p-value and add a color column 
# (i.e. green down-regulated gened and red for up-regulated genes)
example1_mapped_pval05 = string_db$add_diff_exp_color( subset(example1_mapped, FDR<0.05),
                                                       logFcColStr="logFC" )    

# post payload information to the STRING server
payload_id = string_db$post_payload( example1_mapped_pval05$STRING_id, colors=example1_mapped_pval05$color )

# display a STRING network png with the "halo"
string_db$plot_network( hits, payload_id=payload_id )

GO and KEGG Annotations

STRINGdb has the function get_enrichment() that can calculate enriched annotations from GO, KEGG, InterPro and pfam for sets of genes

hits <- example1_mapped$STRING_id
######## compute enrichment in GO annotations for Biological Process ########
goProcess <- string_db$get_enrichment(hits,  category = "Process", 
                                      methodMT = "fdr", iea = TRUE )
write.csv(goProcess[which(goProcess$pvalue_fdr<=0.05),],
          file=paste0(rnaLocal,"RData/goProcessR.csv"))
goFunction <- string_db$get_enrichment(hits,  category = "Function" , 
                                      methodMT = "fdr", iea = TRUE )
write.csv(goFunction[which(goFunction$pvalue_fdr<=0.05),],
          file=paste0(rnaLocal,"RData/goFunctionR.csv"))
goComponent <- string_db$get_enrichment(hits,  category = "Component" , 
                                      methodMT = "fdr", iea = TRUE )
write.csv(goComponent[which(goComponent$pvalue_fdr<=0.05),],
          file=paste0(rnaLocal,"RData/goComponentR.csv"))
keggPath <- string_db$get_enrichment(hits,  category = "KEGG" , 
                                      methodMT = "fdr", iea = TRUE )
write.csv(keggPath[which(keggPath$pvalue_fdr<=0.05),],
          file=paste0(rnaLocal,"RData/keggPathR.csv"))
interPro <- string_db$get_enrichment(hits,  category = "InterPro" , 
                                      methodMT = "fdr", iea = TRUE )
write.csv(interPro[which(interPro$pvalue_fdr<=0.05),],
          file=paste0(rnaLocal,"RData/interProR.csv"))
pfamDom <- string_db$get_enrichment(hits,  category = "Pfam" , 
                                      methodMT = "fdr", iea = TRUE )
write.csv(pfamDom[which(pfamDom$pvalue_fdr<=0.05),],
          file=paste0(rnaLocal,"RData/pfamDomR.csv"))

Visualisation

There are also packages for visualising these kinds of results. One such is the pathview package that is used to visualise expression data on KEGG pathways

require(pathview)
gene.data <- as.numeric(annRes$logFC)
names(gene.data) <- annRes$geneId
pathID <- "00260"
pview <- pathview(gene.data=gene.data,
                  gene.idtype="kegg",
                  pathway.id=pathID,
                  species="sce",
                  out.suffix="kegg",
                  kegg.native=T,
                  same.layer=T)
KEGG pathway diagram

KEGG pathway diagram

sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: x86_64-apple-darwin15.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
 [1] parallel  stats4    tcltk     stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] pathview_1.10.1      org.Hs.eg.db_3.2.3   AnnotationDbi_1.32.3
 [4] IRanges_2.4.8        S4Vectors_0.8.11     Biobase_2.30.0      
 [7] BiocGenerics_0.16.1  STRINGdb_1.10.1      hash_2.2.6          
[10] gplots_3.0.1         RColorBrewer_1.1-2   plotrix_3.6-1       
[13] RCurl_1.95-4.8       bitops_1.0-6         igraph_1.0.1        
[16] plyr_1.8.3           sqldf_0.4-10         RSQLite_1.0.0       
[19] DBI_0.3.1            gsubfn_0.6-6         proto_0.3-10        
[22] png_0.1-7            biomaRt_2.26.1       knitr_1.12.3        
[25] RefManageR_0.10.13   pietalib_0.1        

loaded via a namespace (and not attached):
 [1] KEGGREST_1.10.1      gtools_3.5.0         htmltools_0.3.5     
 [4] yaml_2.1.13          chron_2.3-47         XML_3.98-1.4        
 [7] Rgraphviz_2.14.0     stringr_1.0.0        zlibbioc_1.16.0     
[10] Biostrings_2.38.4    caTools_1.17.1       evaluate_0.8.3      
[13] GenomeInfoDb_1.6.3   highr_0.5.1          Rcpp_0.12.4         
[16] KernSmooth_2.23-15   formatR_1.3          gdata_2.17.0        
[19] graph_1.48.0         XVector_0.10.0       digest_0.6.9        
[22] stringi_1.0-1        RJSONIO_1.3-0        GenomicRanges_1.22.4
[25] grid_3.2.5           bibtex_0.4.0         tools_3.2.5         
[28] magrittr_1.5         KEGGgraph_1.28.0     lubridate_1.5.6     
[31] rmarkdown_0.9.5      httr_1.1.0           R6_2.1.2