9 March 2015

“non-reproducible single occurrences are of no significance to science”

Popper (1935) Logik der Forschung

"The distinction between replication and reproducibility is, from what I understand, that

'replicable' means 'other people get exactly the same results when doing exactly the same thing',

while

'reproducible' means 'something similar happens in other people's hands'.

The latter is far stronger, in general, because it indicates that your results are not merely some quirk of your setup and may actually be right."

Brown (2015)

"Statisticians and computer scientists - if there is no code, there is no paper

So I have a new policy when evaluating CV's of candidates for jobs, or when I'm reading a paper as a referee. If the paper is about a new statistical method or machine learning algorithm and there is no software available for that method - I simply mentally cross it off the CV. If I'm reading a data analysis and there isn't code that reproduces their analysis - I mentally cross it off."

Leek (2015)

Myth 3: We need new platforms for reproducible computational science.

Engineers like building stuff. It sure is easier (and hence more fun, at least in the short term) than doing science. But what we need right now is scientists actually using stuff that already exists, not engineers building new stuff that no one will ever use.

…to a first approximation, IPython Notebook and knitr have won.

Brown (2014)

Open Research

Transparent scientific analysis - distributing analysis/code and data

  • Publishing data
    • public databases, repositories
    • ? (the data I work on isn't mine to share)
  • Publishing analysis
    • scripted analysis is simpler to distribute than sequences of mouse clicks
      • literate scripts, ipython-notebooks or knitr / sweave
    • make an R package Wickham (2015); Leek (2014)
    • post it on GitHub
  • Publishing results

"One of these days I'm gonna get organizized."

Bickle (1976) Taxi Driver

  • Literate programming, Knuth (1984): embed the code within the natural language description of the logic behind the code (cweb, noweb, knitr, ipython-notebooks)

versus

  • Documentation generation: structured comments embedded in the code are extracted to produce documentation (perldoc, javadoc, sphinx, doxygen)

"Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."

Knuth (1984)

I R

Package Development (devtools)

devtools Wickham and Chang (2015)

  • what RStudio is doing in the background when you make a package
    • creates a directory structure
    • creates and manages certain key config files
      • DESCRIPTION needs hand editing
      • NAMESPACE automatically managed
  • generate documentation from structured comments
    • roxygen
  • tools for automated testing
    • testthat
  • tools for generating vignettes (how-tos and tutorial documentation)

    packagename
    |
    |- README
    |- DESCRIPTION
    |- NAMESPACE
    |- R
    |  ` - This is where your commented R scripts live
    |
    |- man
    |  ` - This is where the autogenerated help goes
    |
    |- tests
    |  ` - This is where the test structure and code go
    |
    |- vignettes
    :  ` - This is where the how-tos and tutorials go
    :
    :..src
       `.. This is potentially where cpp (Rcpp) code would live
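As a rough illustration of the layout above, the skeleton can be built by hand with base R alone. This is only a sketch: the package name `mypackage`, the temporary location and the DESCRIPTION field values are placeholders, and it is not what devtools or RStudio literally run.

```r
# Build the skeleton under a temporary directory so nothing in the
# working directory is touched; "mypackage" is a placeholder name.
pkg_root <- file.path(tempdir(), "mypackage")

# src is the conventional directory name for compiled (e.g. Rcpp) code.
for (subdir in c("R", "man", "tests", "vignettes", "src")) {
  dir.create(file.path(pkg_root, subdir), recursive = TRUE, showWarnings = FALSE)
}

# A minimal DESCRIPTION; every value here is an illustrative placeholder.
description <- c(
  "Package: mypackage",
  "Type: Package",
  "Title: What the Package Does",
  "Version: 0.1.0",
  "Description: A one paragraph description of the package.",
  "License: GPL-2"
)
writeLines(description, file.path(pkg_root, "DESCRIPTION"))

# Empty README and NAMESPACE files complete the top level.
invisible(file.create(file.path(pkg_root, c("README", "NAMESPACE"))))

print(list.files(pkg_root))
```

In practice NAMESPACE and the contents of man are generated from the roxygen comments, so only DESCRIPTION and the files under R need hand editing.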

Package: SpikeNorm
Type: Package
Title: A package to normalise RNA-seq data using Spike-in information
Version: 1.0
Date: 2014-11-20
Authors@R: c(
    person("Pieta","Schofield",email="pschofield@dundee.ac.uk",role=c("aut","cre")),
    person("Nick","Schurch",email="nschurc@dundee.ac.uk",role="aut"))
Description: This package uses expression values for spike-ins of known
    concentration to normalise RNA-seq data
License: GPL-2
Imports:
    MASS,
    matrixStats,
    robust,
    plyr,
    ggplot2,
    limma,
    edgeR
Suggests:
    testthat,
    BiocStyle,
    knitr
VignetteBuilder: knitr
LazyData: true

#' subScript will submit a script to the cluster
#'
#' Calls subJob to submit a script to the cluster as a temporary file. It relies on the temporary
#' directory, where the temporary file is written, being mounted on the local machine.
#' I could get round this by writing the file locally and then copying it with scp, but at the
#' moment this is not worth the effort.
#'
#' @param scriptstub the stub for the temporary file name
#' @param script the content of the script as a vector of strings
#' @param tmpdir location for the temporary file
#' @param scriptext extension for the temporary file
#' @param logdir location for the batch job logs
#' @param cores number of cores
#'
#' @export
subScript <- function(scriptstub="ssh",script=c("#!/bin/bash","hostname"),
                      tmpdir="/homes/pschofield/tmp/",scriptext=".sh",logdir="",cores=8)
{
  batchJob <- tempfile(pattern=scriptstub,tmpdir=tmpdir,fileext=scriptext)
  filecon <- file(batchJob)
  writeLines(script, filecon)
  close(filecon)
  subJob(scriptfile=batchJob,logdir=logdir,mcCores=cores)
}
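The file-writing half of subScript can be tried in isolation. The sketch below assumes `tempdir()` as a stand-in for the cluster-mounted directory and leaves out the call to subJob, which is not shown here.

```r
# Write a two-line shell script to a temporary file, mirroring the first
# half of subScript(); tempdir() replaces the cluster-mounted directory.
script <- c("#!/bin/bash", "hostname")
batch_job <- tempfile(pattern = "ssh", tmpdir = tempdir(), fileext = ".sh")
writeLines(script, batch_job)

# Read the file back to confirm what would be handed on for submission.
print(basename(batch_job))
print(readLines(batch_job))
```

Passing the path straight to `writeLines` opens and closes the connection for you, which is equivalent to the explicit `file()` and `close()` calls in the function.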

edit your functions then rinse and repeat

# create the documentation from the roxygen comments in the R sources
devtools::document()
# load the package for testing
devtools::load_all()
# run the test scripts stored in the tests subdirectory
devtools::test()
# eventually install the package so it can be used outwith the source directory
devtools::install(reload=TRUE)
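For the testing step, a test file in the tests subdirectory might look roughly like the sketch below. It assumes testthat is installed, and `scale_to_unit` is a hypothetical helper invented for the illustration, not a function from SpikeNorm.

```r
library(testthat)

# Hypothetical helper standing in for a function from the package's R directory.
scale_to_unit <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Each test_that() block names a behaviour and lists the expectations for it.
test_that("scale_to_unit maps values onto the unit interval", {
  scaled <- scale_to_unit(c(2, 4, 6))
  expect_equal(scaled, c(0, 0.5, 1))
  expect_equal(range(scaled), c(0, 1))
})
```

Inside a real package the helper would live under R and only the `test_that` block would sit in the test file.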

Literate Analysis (rmarkdown)

knitr, Xie (2013): embed the code for the analysis within a natural language description of the analysis.

evolution of sweave

  • combined writing natural language in latex
  • with embedded chunks of R code

knitr permits

  • latex, rmarkdown and html as the natural language format
  • output to Word, HTML, PDF
  • output as report, poster, presentation, interactive document

rmarkdown file starts with a header

---
title: "how i R(oll)"
author: "Pietà Schofield"
date: "9 March 2015"
output: 
  ioslides_presentation:
    fig_caption: true
    fig_width: 10
    fig_height: 7
    wide: true
    css: presentation.css
---

rmarkdown: write natural language descriptive text in a Markdown dialect interspersed with chunks of R code

for example, to list the content of the SpikeNorm package's DESCRIPTION file

```{r , eval=FALSE, comment=NA}
#
# Specify the file to open
#
desFile <- "/Users/pschofield/git_hub/SpikeNorm/DESCRIPTION"
#
# read the file and write it to stdout; this will be sent to
# the chunk output
#
writeLines(readLines(desFile))
#
```

or to generate and display a graph

```{r , fig.cap="Some Random Stuff", eval=FALSE}
#
# Plot anything R can plot
#
plot(1:10+rnorm(10),1:10+rnorm(10),pch="x", 
     xlab="expected", ylab="measured",main="Demo Plot")
#
# One option is to just send it to the default device; knitr
# captures it and puts it in a temporary place
#
abline(0,1,col="red")
# 
# alternatively write it to a file and link to the file in the markdown
#
```

Figure: Some Random Stuff
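The alternative mentioned in the last comment of the chunk, writing the plot to a file and linking to it, could look something like this sketch. The file name `demo_plot.png`, the temporary location and the pixel dimensions are assumptions made for the illustration.

```r
set.seed(1)  # make the random jitter repeatable

# Open a PNG device that writes to a file instead of the default device.
plot_file <- file.path(tempdir(), "demo_plot.png")
png(plot_file, width = 800, height = 600)
plot(1:10 + rnorm(10), 1:10 + rnorm(10), pch = "x",
     xlab = "expected", ylab = "measured", main = "Demo Plot")
abline(0, 1, col = "red")
invisible(dev.off())  # close the device so the file is flushed to disk

# Markdown image link that could be pasted into the document text.
cat(sprintf("![Some Random Stuff](%s)\n", plot_file))
```

Writing the file yourself gives control over its name and location, at the cost of keeping the link in the text in step with the file by hand.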

knitr and hence rmarkdown interface with a program called pandoc by MacFarlane

pandoc will convert the markdown or latex generated by knitr into

  • HTML ( html_document, ioslides_presentation, slidy_presentation )
  • PDF ( pdf_document, beamer_presentation )
  • Word docx ( word_document )

(NB: pandoc is written in Haskell, which makes it sort of cool in itself)
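Outside RStudio the same conversion can be driven from the R console. The sketch below writes a minimal document to a temporary file and renders it to HTML; the document content is made up for the illustration, and the render step only runs if the rmarkdown package and pandoc are both available.

```r
# Three backticks, built here so the chunk fence can be written as data.
fence <- strrep("`", 3)

# A minimal R Markdown document: YAML header, some text, one R chunk.
rmd_lines <- c(
  "---",
  'title: "Minimal example"',
  "output: html_document",
  "---",
  "",
  "Some descriptive text, followed by a chunk of R code.",
  "",
  paste0(fence, "{r}"),
  "summary(c(1, 2, 3, 4))",
  fence
)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# knitr runs the chunks, then pandoc converts the resulting markdown.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, output_format = "html_document", quiet = TRUE)
  print(html_file)
} else {
  message("rmarkdown or pandoc not available; skipping the render step")
}
```

Swapping `output_format` for one of the other names listed above, for example `word_document`, should produce the corresponding file type from the same source.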

RStudio

  • RStudio makes this stuff very simple
    • literate analysis knitr (rmarkdown)
    • package creation (devtools)
  • Do as I say not as I do
    • it is all just R functions under the bonnet of RStudio
    • so if you are passionately (pathologically) addicted to vim (emacs), there is always the vim-R-plugin (ESS)
    • I believe eclipse has a plugin StatET too

Bibliography

It is possible to include references from a bibtex library with knitcitations

I prefer RefManageR McLean (2014)

```{r , eval=FALSE}
# load the package
require(RefManageR)
mybibfile <- "/Users/pschofield/git_tree/biblio/bioinf.bib"
# specify the bibliography options
BibOptions(check.entries = FALSE, style = "markdown", 
           cite.style = "authoryear", bib.style = "authoryear")
# load the file
bib <- ReadBib(mybibfile, check=FALSE)
```

Then you can include a citation with `r TextCite(bib,"refkey")` in your text as you type. Finally, you add a code chunk

```{r , eval=FALSE}
PrintBibliography(bib)
```

Normally code chunks appear without the options and ticks

#
# load the RefManageR package so I can have a central bib 
# file rather than it having to be in the same directory as
# the markdown file
#
require(RefManageR)
mybibfile <- "/Users/pschofield/git_tree/biblio/bioinf.bib"
#
# specify the bibliography options
#
BibOptions(check.entries = FALSE, style = "markdown", 
           cite.style = "authoryear", bib.style = "authoryear")
#
bib <- ReadBib(mybibfile, check=FALSE)

but I have been showing the decorations for illustrative purposes

Normally they are also syntax highlighted

Brown, T. (2014). Some myths of reproducible computational research. URL: http://ivory.idyll.org/blog/2014-myths-of-computational-reproducibility.html (visited on 2014).

Brown, T. (2015). Our approach to replication in computational science. URL: http://ivory.idyll.org/blog/replication-i.html (visited on 2015).

Knuth, D. E. (1984). "Literate Programming". In: Comput. J. 27.2, pp. 97-111. ISSN: 0010-4620. DOI: 10.1093/comjnl/27.2.97. URL: http://dx.doi.org/10.1093/comjnl/27.2.97.

Leek, J. (2014). rpackages. URL: https://github.com/jtleek/rpackages (visited on 2014).

Leek, J. (2015). Statisticians and computer scientists - if there is no code, there is no paper. URL: http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper (visited on 2015).

McLean, M. W. (2014). Straightforward Bibliography Management in R Using the RefManageR Package. arXiv: 1403.2036 [cs.DL]. Submitted. URL: http://arxiv.org/abs/1403.2036.

Popper, K. (1935). Logik der Forschung. Verlag von Julius Springer, Vienna, Austria.

Wickham, H. (2015). R Packages. URL: http://r-pkgs.had.co.nz (visited on 2015).

Wickham, H. and W. Chang (2015). devtools: Tools to Make Developing R Packages Easier. R package version 1.7.0. URL: http://CRAN.R-project.org/package=devtools.

Xie, Y. (2013). Dynamic Documents with R and knitr. Boca Raton, Florida: Chapman and Hall/CRC. URL: http://yihui.name/knitr/.

This is a work in progress; I hope I am getting better at it

  • do all my analysis in rmarkdown scripts
    • even script in rmarkdown to submit cluster jobs remotely via ssh
  • generate HTML pages in my public_html directory on the cluster, http://www.compbio.dundee.ac.uk/user/pschofield; currently not all my pages are public as I don't own the data, but you can have access, just ask
  • keep all my code and scripts in git (some on ningal, some on GitHub); again, you can have access
  • put all my useful little functions in an R package
  • write my teaching materials and presentations in rmarkdown
  • produce posters with latex/sweave knitr

The code for this presentation can be found here

Thank You.