There is an increasing number of packages for accessing and analysing functional annotation enrichments associated with set of significant genes. For example looking for over represented Gene Ontology (GO) terms or over represented pathways from KEGG. On of the simplest to use is the package STRINGdb
We need the output from the feature analysis for example the results from the differential gene expression analysis. This might exist already as a saved RData object but here we will quickly recreate it using edgeR from the featureCounts
# download the data if it is not already on the system
#
if(!file.exists(paste0(rnaLocal,"Rdata/fullSC.RData"))){
dir.create(paste0(rnaLocal,"Rdata/"),recursive=T,showWarnings = F)
download.file("http://www.compbio.dundee.ac.uk/user/pschofield/data/fullSC.RData",
paste0(rnaLocal,"Rdata/fullSC.RData"))
}
load( paste0(rnaLocal,"Rdata/fullSC.RData"))
countMatrix <- fc$counts
colnames(countMatrix)<-gsub("[.]*","",colnames(countMatrix))
colnames(countMatrix)<-gsub("sbam$","",colnames(countMatrix))
colnames(countMatrix)<-gsub("Snf2","Mut",colnames(countMatrix))
# if dont already have the results calculate them otherwise load them
if(!file.exists(paste0(rnaLocal,"Rdata/dgeSC.RData"))){
require(edgeR)
conditions <- as.factor(substr(colnames(countMatrix),1,1))
conditions <- relevel(conditions,ref = "W")
design <- model.matrix( ~ conditions -1)
dge <- DGEList(countMatrix, group=conditions)
dge <- calcNormFactors(dge,method="TMM")
dge <- estimateDisp(dge,design)
fit <- glmFit(dge,design)
lrt <- glmLRT(fit,contrast = c(-1,1))
tt <- topTags(lrt, n=10000)
dge_scerevisiae <- tt$table[,c("FDR","logFC")]
dge_scerevisiae$geneId <- rownames(dge_scerevisiae)
save(dge_scerevisiae,file=paste0(rnaLocal,"Rdata/dgeSC.RData"))
}else{
load(paste0(rnaLocal,"Rdata/dgeSC.RData"))
}
A reminder of what the results look like
FDR | logFC | geneId | |
---|---|---|---|
YOR290C | 0 | 8.758857 | YOR290C |
YNR034W-A | 0 | 4.990543 | YNR034W-A |
YHR136C | 0 | 4.958985 | YHR136C |
YOL052C-A | 0 | 4.410756 | YOL052C-A |
YDR119W-A | 0 | 4.349755 | YDR119W-A |
YPR160W | 0 | 4.298277 | YPR160W |
One of the oldest challenges in bioinformatics is the problem caused by id proliferations and inconsistencies. For example the same gene will have different id in the ENSEMBL, NCBI and Entrezgene databases. It is possible to get some of these from the ENSEMBL database using biomaRt
require(biomaRt)
# get a list of the available species from ENSEMBL
ens <- useMart("ensembl")
marts <- listDatasets(ens)
knitr::kable(marts[which(grepl("^S",marts$description)),])
dataset | description | version | |
---|---|---|---|
13 | scerevisiae_gene_ensembl | Saccharomyces cerevisiae genes (R64-1-1) | R64-1-1 |
26 | sscrofa_gene_ensembl | Sus scrofa genes (Sscrofa10.2) | Sscrofa10.2 |
64 | saraneus_gene_ensembl | Sorex araneus genes (sorAra1) | sorAra1 |
66 | sharrisii_gene_ensembl | Sarcophilus harrisii genes (DEVIL7.0) | DEVIL7.0 |
So there is a dataset called scerevisiae_gene_ensembl that is the one we want
scmart <- useDataset("scerevisiae_gene_ensembl", ens)
We can see a full list of the data available in the scerevisiae_gene_ensembl dataset.
attributePages(scmart)
[1] "feature_page" "structure" "homologs" "snp"
[5] "snp_somatic" "sequences"
knitr::kable(head(listAttributes(scmart,"feature_page")))
name | description |
---|---|
ensembl_gene_id | Ensembl Gene ID |
ensembl_transcript_id | Ensembl Transcript ID |
ensembl_peptide_id | Ensembl Protein ID |
ensembl_exon_id | Ensembl Exon ID |
description | Description |
chromosome_name | Chromosome Name |
knitr::kable(head(listAttributes(scmart,"sequences")))
name | description | |
---|---|---|
1167 | transcript_exon_intron | Unspliced (Transcript) |
1168 | gene_exon_intron | Unspliced (Gene) |
1169 | transcript_flank | Flank (Transcript) |
1170 | gene_flank | Flank (Gene) |
1171 | coding_transcript_flank | Flank-coding region (Transcript) |
1172 | coding_gene_flank | Flank-coding region (Gene) |
We can get some of this information for the signicantly differentially expressed genes
# again if scripting this it may be something we don't want to do everytime we run the script
# but being mindful of the fact we might change parameters in the upstream analysis that would
# mean the set of significant genes changes and this step would need refreshing
#
if(!file.exists(paste0(homeDir,"Rdata/annSC.RData"))){
annData <- getBM(
attributes = c("ensembl_gene_id",
"external_gene_name",
"description",
"entrezgene",
"refseq_dna",
"refseq_mrna"),
filter="ensembl_gene_id",
values=rownames(dge_scerevisiae)[which(dge_scerevisiae$FDR<=0.05)],
mart=scmart)
save(annData,file=paste0(rnaLocal,"Rdata/annSC.RData"))
}else{
load(paste0(rnaLocal,"Rdata/annSC.RData"))
}
head(annData)
ensembl_gene_id external_gene_name
1 YPR161C SGV1
2 YOL138C RTC1
3 YGR129W SYF2
4 YPR165W RHO1
5 YPR098C
6 YPL015C HST2
description
1 Cyclin (Bur2p)-dependent protein kinase; part of the BUR kinase complex which functions in transcriptional regulation; phosphorylates the carboxy-terminal domain (CTD) of Rpo21p and the C-terminal repeat domain of Spt5p; recruits Spt6p to the CTD at the onset of transcription; regulated by Cak1p; similar to metazoan CDK9 proteins [Source:SGD;Acc:S000006365]
2 Subunit of the SEA (Seh1-associated) complex; SEA is a coatomer-related complex that associates dynamically with the vacuole; null mutation suppresses cdc13-1 temperature sensitivity; has N-terminal WD-40 repeats and a C-terminal RING motif [Source:SGD;Acc:S000005498]
3 Member of the NineTeen Complex (NTC); NTC contains Prp19p and stabilizes U6 snRNA in catalytic forms of the spliceosome containing U2, U5, and U6 snRNAs; relocalizes to the cytosol in response to hypoxia; isy1 syf2 cells have defective spindles activiating cell cycle arrest [Source:SGD;Acc:S000003361]
4 GTP-binding protein of the rho subfamily of Ras-like proteins; involved in establishment of cell polarity; regulates protein kinase C (Pkc1p) and the cell wall synthesizing enzyme 1,3-beta-glucan synthase (Fks1p and Gsc2p) [Source:SGD;Acc:S000006369]
5 Protein of unknown function; localized to the mitochondrial outer membrane [Source:SGD;Acc:S000006302]
6 Cytoplasmic NAD(+)-dependent protein deacetylase; member of the silencing information regulator 2 (Sir2) family of NAD(+)-dependent protein deacetylases; modulates nucleolar (rDNA) and telomeric silencing; possesses NAD(+)-dependent histone deacetylase activity in vitro; contains a nuclear export signal (NES); function regulated by its nuclear export [Source:SGD;Acc:S000005936]
entrezgene refseq_dna refseq_mrna
1 856290 NM_001184258 NM_001184258
2 853982 NM_001183392 NM_001183392
3 853030 NM_001181258 NM_001181258
4 856294 NM_001184262 NM_001184262
5 856213 NM_001184195 NM_001184195
6 856092 NM_001183829 NM_001183829
We can combine this data with the gene expression data using merge
annRes <- merge(annData,dge_scerevisiae,by.x="ensembl_gene_id",by.y="geneId")
knitr::kable(head(annRes[order(abs(annRes$logFC),decreasing = T),]))
ensembl_gene_id | external_gene_name | description | entrezgene | refseq_dna | refseq_mrna | FDR | logFC | |
---|---|---|---|---|---|---|---|---|
4863 | YOR290C | SNF2 | Catalytic subunit of the SWI/SNF chromatin remodeling complex; involved in transcriptional regulation; contains DNA-stimulated ATPase activity; functions interdependently in transcriptional activation with Snf5p and Snf6p [Source:SGD;Acc:S000005816] | 854465 | NM_001183709 | NM_001183709 | 0.0e+00 | 8.758857 |
2098 | YHR095W | Dubious open reading frame; unlikely to encode a functional protein, based on available experimental and comparative sequence data [Source:SGD;Acc:S000001137] | NA | 0.0e+00 | -5.485661 | |||
4500 | YNR034W-A | Putative protein of unknown function; expression is regulated by Msn2p/Msn4p; YNR034W-A has a paralog, YCR075W-A, that arose from the whole genome duplication [Source:SGD;Acc:S000007525] | 855770 | NM_001184446 | NM_001184446 | 0.0e+00 | 4.990543 | |
140 | tN(GUU)P | Asparagine tRNA (tRNA-Asn), predicted by tRNAscan-SE analysis [Source:SGD;Acc:S000006677] | NA | 0.0e+00 | 4.963624 | |||
2129 | YHR136C | SPL2 | Protein with similarity to cyclin-dependent kinase inhibitors; downregulates low-affinity phosphate transport during phosphate limitation by targeting Pho87p to the vacuole; upstream region harbors putative hypoxia response element (HRE) cluster; overproduction suppresses a plc1 null mutation; promoter shows an increase in Snf2p occupancy after heat shock; GFP-fusion protein localizes to the cytoplasm [Source:SGD;Acc:S000001178] | 856538 | NM_001179266 | NM_001179266 | 0.0e+00 | 4.958985 |
174 | tY(GUA)J1 | SUP7 | Tyrosine tRNA (tRNA-Tyr), predicted by tRNAscan-SE analysis; can mutate to suppress ochre nonsense mutations [Source:SGD;Acc:S000006781] | NA | 3.4e-06 | 4.538889 |
And satisfyingly and fortunately the biggest foldchange between the WT and the SNF2 deletion mutant is the SNF2 gene itself.
The first step after loading the packages is to create an object that links to STRING, this requires knowing the species id of the organism of interest of which there are over 2000.
require(STRINGdb)
# List the names and ids of the organisms
species <- get_STRING_species(version="10", species_name=NULL)
# find the organisms starting with "Sacc"
species[which(grepl("^Sacc",species$official_name)),]
species_id official_name
26 4932 Saccharomyces cerevisiae
298 203122 Saccharophagus degradans 2-40
897 405948 Saccharopolyspora erythraea NRRL 2338
1083 471857 Saccharomonospora viridis DSM 43017
compact_name kingdom type
26 Saccharomyces cerevisiae eukaryota core
298 Saccharophagus degradans bacteria core
897 Saccharopolyspora erythraea bacteria core
1083 Saccharomonospora viridis bacteria core
# create a new STRING_db object for Saccharomyces cerevisiae
string_db <- STRINGdb$new(version="10",species=4932)
Once we have the STRINGdb we need to convert or map the ids we have from ENSEMBL for our genes to the gene ids in STRINGdb. We can use either the entrez gene id or the ensembl gene id but they have differening success rates so better to use the one that maximises the number of mapped genes
# map to STRING ids
example1_mapped = string_db$map(annRes, "entrezgene", removeUnmappedRows = TRUE )
Warning: we couldn't map to STRING 8% of your identifiers
example1_mapped = string_db$map(annRes, "ensembl_gene_id", removeUnmappedRows = TRUE )
Warning: we couldn't map to STRING 3% of your identifiers
STRINGdb then lets us draw a gene interaction network for a set of genes. Here we select the STRING_id for the top 25 differentially expressed genes
# get the STRING_id for the top 50 genes
hits = example1_mapped$STRING_id[1:25]
# plot the STRING network png
string_db$plot_network( hits )
It is also possible to add to this diagram the foldchange information in the form of coloured halos around the notes that represent the foldchange as an intensity on a Red to Green colour scale.
# filter by p-value and add a color column
# (i.e. green down-regulated gened and red for up-regulated genes)
example1_mapped_pval05 = string_db$add_diff_exp_color( subset(example1_mapped, FDR<0.05),
logFcColStr="logFC" )
# post payload information to the STRING server
payload_id = string_db$post_payload( example1_mapped_pval05$STRING_id, colors=example1_mapped_pval05$color )
# display a STRING network png with the "halo"
string_db$plot_network( hits, payload_id=payload_id )
STRINGdb has the function get_enrichment()
that can calculate enriched annotations from GO, KEGG, InterPro and pfam for sets of genes
hits <- example1_mapped$STRING_id
######## compute enrichment in GO annotations for Biological Process ########
goProcess <- string_db$get_enrichment(hits, category = "Process",
methodMT = "fdr", iea = TRUE )
write.csv(goProcess[which(goProcess$pvalue_fdr<=0.05),],
file=paste0(rnaLocal,"RData/goProcessR.csv"))
goFunction <- string_db$get_enrichment(hits, category = "Function" ,
methodMT = "fdr", iea = TRUE )
write.csv(goFunction[which(goFunction$pvalue_fdr<=0.05),],
file=paste0(rnaLocal,"RData/goFunctionR.csv"))
goComponent <- string_db$get_enrichment(hits, category = "Component" ,
methodMT = "fdr", iea = TRUE )
write.csv(goComponent[which(goComponent$pvalue_fdr<=0.05),],
file=paste0(rnaLocal,"RData/goComponentR.csv"))
keggPath <- string_db$get_enrichment(hits, category = "KEGG" ,
methodMT = "fdr", iea = TRUE )
write.csv(keggPath[which(keggPath$pvalue_fdr<=0.05),],
file=paste0(rnaLocal,"RData/keggPathR.csv"))
interPro <- string_db$get_enrichment(hits, category = "InterPro" ,
methodMT = "fdr", iea = TRUE )
write.csv(interPro[which(interPro$pvalue_fdr<=0.05),],
file=paste0(rnaLocal,"RData/interProR.csv"))
pfamDom <- string_db$get_enrichment(hits, category = "Pfam" ,
methodMT = "fdr", iea = TRUE )
write.csv(pfamDom[which(pfamDom$pvalue_fdr<=0.05),],
file=paste0(rnaLocal,"RData/pfamDomR.csv"))
There are also packages for visualising these kinds of results. One such is the pathview package that is used to visualise expression data on KEGG pathways
require(pathview)
gene.data <- as.numeric(annRes$logFC)
names(gene.data) <- annRes$geneId
pathID <- "00260"
pview <- pathview(gene.data=gene.data,
gene.idtype="kegg",
pathway.id=pathID,
species="sce",
out.suffix="kegg",
kegg.native=T,
same.layer=T)
KEGG pathway diagram
sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: x86_64-apple-darwin15.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel stats4 tcltk stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] pathview_1.10.1 org.Hs.eg.db_3.2.3 AnnotationDbi_1.32.3
[4] IRanges_2.4.8 S4Vectors_0.8.11 Biobase_2.30.0
[7] BiocGenerics_0.16.1 STRINGdb_1.10.1 hash_2.2.6
[10] gplots_3.0.1 RColorBrewer_1.1-2 plotrix_3.6-1
[13] RCurl_1.95-4.8 bitops_1.0-6 igraph_1.0.1
[16] plyr_1.8.3 sqldf_0.4-10 RSQLite_1.0.0
[19] DBI_0.3.1 gsubfn_0.6-6 proto_0.3-10
[22] png_0.1-7 biomaRt_2.26.1 knitr_1.12.3
[25] RefManageR_0.10.13 pietalib_0.1
loaded via a namespace (and not attached):
[1] KEGGREST_1.10.1 gtools_3.5.0 htmltools_0.3.5
[4] yaml_2.1.13 chron_2.3-47 XML_3.98-1.4
[7] Rgraphviz_2.14.0 stringr_1.0.0 zlibbioc_1.16.0
[10] Biostrings_2.38.4 caTools_1.17.1 evaluate_0.8.3
[13] GenomeInfoDb_1.6.3 highr_0.5.1 Rcpp_0.12.4
[16] KernSmooth_2.23-15 formatR_1.3 gdata_2.17.0
[19] graph_1.48.0 XVector_0.10.0 digest_0.6.9
[22] stringi_1.0-1 RJSONIO_1.3-0 GenomicRanges_1.22.4
[25] grid_3.2.5 bibtex_0.4.0 tools_3.2.5
[28] magrittr_1.5 KEGGgraph_1.28.0 lubridate_1.5.6
[31] rmarkdown_0.9.5 httr_1.1.0 R6_2.1.2