TarO cluster job management overview (incomplete)
Script | Script description
TO_masterpipe_onNewClus.pl | Checks the runtime environment and, if it is the cluster, starts TarO_indevel.pm
TarO_indevel.pm |
Accepted command line parameters:
-i [Inputfile (must give full path name)]
-o [output directory (must give full path name)]
Optional input:
-s [integer] specifies how many sequences per file when splitting (input) fasta file to run jobs on the cluster
-t [local cluster node directory] specifies location of local temporary storage
-p [directory] location of perlscripts to run jobs on the cluster
-j [directory] location (top of the tree) for storage and reading of raw output from pipeline steps
-d debug
-h print this message
-v verbose mode
-z overwrite all cluster-generated results
-q minimal upload (only upload data into new display tables - used for incorporating legacy queries (v0.1) into v1.0)
-b [integer] specifies the maximum number of sequences to include in the multiple sequence alignment
-w [local cluster node directory] location for local psiblast database
-a [directory] location of databases and .dat files
-m [stepname(s)] overwrite specific stepname(s)
Available stepnames:
disembl
globplot
targetdb
pfam_rps
cdd_rps
smart_rps
kog_rps
cog_rps
signalp
tmhmm
muscle
ronn
jpred
netnglyc
netoglyc
netphos
pdbpsi
obscore
parcrys
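A typical invocation can be sketched as below. Only the flag names come from the option list above; the paths and values are placeholders, and the sketch is in Python purely for illustration (the pipeline itself is Perl).

```python
# Hypothetical invocation of the TarO pipeline; all paths are placeholders,
# only the flag names are taken from the option list above.
cmd = [
    "perl", "TO_masterpipe_onNewClus.pl",
    "-i", "/full/path/to/input.fasta",  # input file (full path required)
    "-o", "/full/path/to/outdir",       # output directory (full path required)
    "-s", "50",                         # sequences per split FASTA file
    "-b", "200",                        # max sequences in the MSA
    "-v",                               # verbose mode
]
# e.g. subprocess.run(cmd, check=True) on a cluster head node
```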
Below are the comments from the script
- Deal with legacy queries.
- Read input into hash
Explicitly mark sequences as user input (to avoid id duplication problems)
- Initialize the script
- Do some file manipulation
- Check for results of previous runs to avoid rerunning analysis
- Do the database update if necessary
- BLAST sequences vs COG database (update status to Completed)
- Analyse the BLAST results to identify sequences with sufficient identity to be included into a COG (TODO - use Rost Thresholds)
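The TODO refers to the Rost (1999) "twilight zone" curve, under which an alignment of length L is only accepted as evidence of homology if percent identity exceeds a length-dependent threshold. A sketch of that published curve is below; whether TarO would use exactly this form (or which n) is an open question in the source.

```python
import math

def rost_identity_threshold(L, n=0):
    """Percent-identity threshold for an alignment of length L
    (Rost 1999 twilight-zone curve); n shifts the curve upward,
    n=0 being the curve itself."""
    return n + 480.0 * L ** (-0.32 * (1 + math.exp(-L / 1000.0)))

# Short alignments require much higher identity to imply homology
# than long ones; the curve falls steeply with alignment length.
```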
- Read in all members of that COG; parse COG information groupings
- Read in the cog data including phylogenetic information & full organism name
- Read in protein sequences
- OUTPUT COG orthologues and User-input sequences to a single file as input for PSIBLAST SEARCHING
- Write out files for sequences with known Gram+/- and euk origin (archaea currently processed as eukarya by signalP) + files for unknown remainder
- PSIBLAST the whole COG (plus original input sequences) against UNIREF100 HERE! This will be transformed into a lookup once the all-vs-all has finished....? Or perhaps do a profile-profile search? (update status to Completed)
- Parse psiblast output (Finding UNIREF (PSIBLAST) homologues)
- Output homologues (currently just from psiblast of UNIREF100 (UNIREF homologues))
- Concatenate all homologues & orthologues plus original inputs into single file (Strategy to remove redundancy to be implemented.... here!!)
- Analyse the inputs/orthologues/homologues to remove redundancy (remove sequences meeting 100% coverage + 100% identity criteria), plus write out a file with orthologues + X homologues to build the multiple alignment from (X is possibly a user-defined variable giving the maximum number of sequences to include in the alignment, with the caveat that all orthologues will always be included)
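Reading the stated criterion (100% coverage + 100% identity) as "identical sequence strings", the redundancy-removal step can be sketched as below; the real script may well apply a looser, alignment-based test.

```python
def remove_redundant(seqs):
    """Drop exact duplicates (100% identity over 100% coverage means
    identical strings); keep the first id seen for each sequence."""
    seen, kept = set(), {}
    for sid, seq in seqs.items():
        if seq not in seen:
            seen.add(seq)
            kept[sid] = seq
    return kept

# remove_redundant({"a": "MKT", "b": "MKT", "c": "GGG"})
# keeps "a" and "c", drops the duplicate "b"
```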
- Draw together all sequences into a single object & assign UniProt equivalence to each sequence
- Assigning UNIREF length & organism info.
- Assign Phylogenetic information to UniRef100 + input sequences
- Write out files for sequences with known Gram+/- and euk origin (archaea currently processed as eukarya by signalP) + files for unknown remainder (Writing out signalP phyla files)
- Run Muscle to generate multiple alignments for top 200 seqs: (+ add-on POA later??) # Strategy allows the alignment of a maximum of 200 sequences
#- including orthologues and top-hitting homologues
# COG orthologues are given priority,
# sequences are ordered by evalue match to the input sequence(s)
# sequences more than 1000aa longer than the input sequence are excluded
- Order the COG orthologues according to their psiblast e-value match & exclude any seqs more than 125% of the longest input seq (only take sequences up to a quarter longer than the user input sequence)
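The selection rules above (e-value ordering, 125%-of-input length cutoff, cap of 200 sequences) can be sketched as below. The COG-orthologue-priority rule is deliberately omitted, and the `(id, seq, evalue)` tuple layout is an assumption for illustration only.

```python
def select_for_alignment(candidates, input_len, max_seqs=200):
    """Order candidate (id, seq, evalue) hits by e-value, drop any
    sequence longer than 125% of the longest input sequence, and
    cap the result at max_seqs entries."""
    limit = 1.25 * input_len
    kept = [c for c in candidates if len(c[1]) <= limit]
    kept.sort(key=lambda c: c[2])
    return kept[:max_seqs]
```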
UP UNTIL THIS POINT ALL ANALYSES ARE RUN UNCONDITIONALLY.
ANALYSIS OF SEQUENCES PIPELINE (start of conditional block)
- Protein disorder prediction: for now just use Globplot and Disembl. Disembl now uses homologues plus orthologues. No trimming of orthologues is now conducted: just downweight in the ranking scheme
- Running RONN
- Running Globplot
- Running BLAST TargetDB
- Running PSIBLAST PDB (only input sequence)
- Running BLAST PDB - FOR ORTHOLOGUES & HOMOLOGUES
- Search CDD (RPSBLAST - Pfam)
- Running RPSBLAST CDD
- Running RPSBLAST Cog
- Running RPSBLAST SMART
- Running RPSBLAST KOG
- Run ParCrys (& calculate OB-scores as part of this process)
- Running TMHMM2
- Glycosylation site prediction NetNglyc
- NetOglyc
- Phosphorylation site prediction Netphos
- Signal peptide prediction (signalP) - eukaryotic seqs
gram+ve seqs,
gram-ve seqs
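The three SignalP runs correspond to SignalP's organism groups, with archaea routed to the eukaryotic model as noted above. A sketch of that routing (the function and the "unknown" bin are hypothetical, not the script's actual names):

```python
def signalp_group(kingdom, gram=None):
    """Map phylogenetic annotation onto a SignalP organism group.
    Archaea are routed to 'euk' as noted above; anything unresolved
    falls into a hypothetical 'unknown' bin."""
    if kingdom in ("Eukaryota", "Archaea"):
        return "euk"
    if kingdom == "Bacteria" and gram == "+":
        return "gram+"
    if kingdom == "Bacteria" and gram == "-":
        return "gram-"
    return "unknown"
```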
- JPRED - run only on input seq(s) (end of condition)
- Miscellaneous analyses (count methionines, cysteines etc., calculate pI, Mr, length, hydrophobicity)
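The counting and hydrophobicity parts of this step can be sketched as below, using the standard Kyte-Doolittle hydropathy scale for a GRAVY-style mean (pI and Mr are omitted; whether TarO uses this exact scale is an assumption).

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def misc_stats(seq):
    """Residue counts plus mean Kyte-Doolittle hydropathy (GRAVY)."""
    seq = seq.upper()
    return {
        "length": len(seq),
        "met": seq.count("M"),
        "cys": seq.count("C"),
        "gravy": sum(KYTE_DOOLITTLE[a] for a in seq) / len(seq),
    }
```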
- Calculate extinction coefficient at 280nm (molar and 1mg per ml coefficients)
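The extinction-coefficient step presumably follows the widely used Gill & von Hippel / Pace residue values (Trp 5500, Tyr 1490, cystine 125 M^-1 cm^-1), with the 1 mg/ml absorbance obtained by dividing the molar coefficient by Mr; a sketch under that assumption:

```python
def extinction_280(seq, mol_weight, oxidized=False):
    """Molar extinction coefficient at 280 nm from Trp/Tyr (and,
    if oxidized, cystine) counts, plus the A280 of a 1 mg/ml
    solution (molar coefficient / Mr)."""
    seq = seq.upper()
    eps = 5500 * seq.count("W") + 1490 * seq.count("Y")
    if oxidized:
        eps += 125 * (seq.count("C") // 2)  # per disulfide bond
    return eps, eps / mol_weight
```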
- Get nucleotide sequence.... Nucleotide-sequence-dependent annotation: GC content, TEV protease sites and codon usage relative to E. coli/Dictyostelium/insect cell/cell-free expression.
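Of the annotations above, GC content is the simplest; a minimal sketch:

```python
def gc_content(dna):
    """Fraction of G+C in a nucleotide sequence (case-insensitive)."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)

# gc_content("ATGC") == 0.5
```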
- Reading targetDB BLAST and PDB psiblast info to identify psiblast queries with possible similar structure to a db sequence (using combinations of identity, alignment length & e-value)
- Read in protein disorder prediction results (to output to the tab-delimited file for uploading into the database)
- Read in Parcrys info
- Read in SignalP results
- Read in TMHMM2 results
- Read in Jpred results
- Read in netNglyc results
- Read in netPhos results
- Make MSA annotated with Globplot/Disembl disorder predictions (for Jalview)
- Copy the groups file, MSA and jnet file to be accessible in the website Jalview directory
- Write out tab delimited files for the database
- SEQUENCE RANKING MECHANISMS: currently ParCrys score (& BLAST e-value) additions to the sequence table
- Steps to deal with legacy data - to add in the extra data found for the given query
- UPDATE STATUS TABLE
- email user to indicate job is ready
TarO code distribution
- 137285 (bytes)-TarO_indevel.pm (main script library)
- 31727 -Clust_int_taro.pm (methods to send jobs to the cluster)
- 22831 -Blast_pars.pm (BLAST parsers)
- 21180 -ParCrys.pm (methods to calculate ParCrys scores)
- 17795 -BioSession.pm (HTTP session management)
- 13422 -OB.pm (methods to calculate OB score)
- 63997 - TarO_indevel.pl (the modules above are mostly called from this script)
- 3188 - ParCrys.pl
- <3000 - 15 other scripts (mostly data parsers)
ParCrys
The single standalone script for ParCrys is located at /homes/www-ob/bin/ParCrys/ParCrys_generic.pl. ParCrys depends on an a.out C binary for the calculation.
The a.out binary expects 6 input files to be located in its current directory. It accepts multiple FASTA sequences from a single input file and produces separate output for each sequence. The source code for a.out can be found at /homes/www-ob/bin/fastParzen/
TarO configurable cluster submission
The TarO cluster submission engine can be used to execute tasks on the cluster. The TarO.conf file configures a master execution script called Analyses_pipeline.pl.
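The actual keys in TarO.conf are not documented here, but a minimal sketch of reading a simple "key = value" configuration file of that kind is below; every key name shown is purely hypothetical.

```python
def read_conf(text):
    """Parse simple 'key = value' lines, skipping blank lines and
    '#' comments. The key names in the example are hypothetical;
    the real TarO.conf keys are not documented here."""
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, _, value = line.partition("=")
            conf[key.strip()] = value.strip()
    return conf

conf = read_conf("queue = 64bit\nscratch = /local/tmp  # node-local\n")
# conf == {"queue": "64bit", "scratch": "/local/tmp"}
```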