TARO Production Environment
This documentation does not reflect the complete file structure of TARO; it is intended to cover the production system only.
Table of Contents
- TARO files location in the production environment
- The content of /homes/www-refine directory
- www-refine user crontab
- TARO CVS Repository
TARO files location on the fiber channel disk (write accessible from the cluster only)
TARO output directory for each run, including log and error files: /fc_gpfs/gjb_lab/www-refine/pipeline_newclus/TO<jobid>
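As a small illustration of the TO<jobid> naming convention above, the following sketch builds the output directory path for a run and lists whatever files it holds. The root path comes from this document; the helper names `run_directory` and `list_run_files` are hypothetical, and no particular log or error file names are assumed.

```python
from pathlib import Path

# Root taken from the text above; the helper names below are illustrative only.
PIPELINE_ROOT = Path("/fc_gpfs/gjb_lab/www-refine/pipeline_newclus")


def run_directory(job_id: int, root: Path = PIPELINE_ROOT) -> Path:
    """Return the per-run output directory, named TO<jobid>."""
    if job_id < 0:
        raise ValueError("job id must be non-negative")
    return root / f"TO{job_id}"


def list_run_files(job_id: int, root: Path = PIPELINE_ROOT) -> list:
    """List file names in a run directory, or [] if the directory is absent."""
    run_dir = run_directory(job_id, root)
    if not run_dir.is_dir():
        return []
    return sorted(p.name for p in run_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    print(run_directory(1234))
    print(list_run_files(1234))
```

Because the disk is write accessible from the cluster only, a helper like this would normally be used read-only from other hosts.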
Binaries executing from the fiber channel disk
Two binaries are located in and run from /fc_gpfs/gjb_lab/www-refine/bin: netOglyc-3.1b and signalp-3.0.
The content of /homes/www-refine directory - the root TARO directory
Many directories contain an old/ subdirectory with legacy code, which IS NOT USED BY TARO.
- backup (contains the pipeline_newclus directory: a backup (from Sept 2008) of the pipeline_newclus directory that contains the working directories for each TarO query)
- Benchmarking - scripts to measure aspects of TARO performance
- bin (binary executable files run in the TARO pipeline)
- Brunak/ (contains several programs actively used by TarO (all from the Brunak group))
- modules/ (contains perl modules used by TarO)
- RONN (RONN v3 programme, and directory for running it)
- SEG (NOT USED by TarO; directory for running the SEG program)
- httpd (used for the TarO Apache server configuration associated with the httpd program; also the top directory for the Front End (largely perl CGI))
- cgi-bin (directory for the earlier version of the TarO Front End without session-based authentication. This is required for the display of some legacy TarO queries.)
- cgi-taro (production cgi scripts)
- cgi-devel/ - directory for development of TarO Front End
- conf (apache config file)
- conf.d (may not be needed?)
- htdocs (static html pages)
- logs (TARO apache logs)
- NOBACK By convention NOBACK directories are not included in the backups
- pipeline_results (output of different tools with subdirectories for each query named as TO<queryid>)
- BlastASTRAL_out (DEPRECATED)
- BlastPDB_out
- BlastTargetDB_out
- cdd_RPSBLAST_out
- COG_BLAST
- Cog_RPSBLAST_out
- Disembl_out
- Globplot_out
- Homologues
- Jpred_out
- Kog_RPSBLAST_out
- netNglyc_out
- netOglyc_out
- netphos_out
- Pfam_RPSBLAST_out
- Psiblast_uniref
- PsiBlastPDB_out
- raw_pipe_out
- RONN_out
- SignalP_out
- smart_RPSBLAST_out
- test (FOR TESTING)
- TMHMM2_out
- SSPF_T01
- i_pwd (directory for Apache Basic authentication of devel user for the website used in TarO front-end development and testing)
- TarO_newclus (the hub directory for the TarO pipeline; the main TARO scripts are located here)
- upload/ directory also contains scripts called as part of the pipeline (generally to input/update the database)
- cluster_scripts/ directory contains scripts called by the pipeline as part of the mechanism for running array jobs on the cluster
- databases/ directory contains databases for searching and other dat files (eg matrix for calculating the OB score)
- qsub_commands/ directory - IS NOT PART OF THE TARO PIPELINE; the scripts inside this directory are used to start the TARO pipeline manually
- TarO2pt0 (TARO CVS)
- raw_pipe_out_200509.tar.bz2 - zipped tar archive of the raw output for the various algorithms run in TarO queries 1 to 1234
bin/brunak (contains programs obtained from the Brunak group that are run in TarO)
- TMHMM2/ - directory for TMHMM2
- netphos-2.0/ - directory for netphos
- netphos-2.0.Linux.tar.Z - netphos distribution
- netphos-2.0.readme - netphos readme
- netNglyc-1.0a.Linux.tar.Z - netNglyc distribution (NOTE: netNglyc is run from a directory in the NOBACK directory)
- netNglyc-1.0a.readme - netNglyc readme
- NOBACK/ - contains the netNglyc directory (which can generate a lot of files, so it is better kept in NOBACK)
- netOglyc-3.1b/ - (NOT ACTIVELY USED by TarO) directory for netOglyc (netOglyc is currently run from /fc_gpfs/gjb_lab/www-refine/bin/netOglyc-3.1b)
- netOglyc-3.1b.Linux.tar.Z - netOglyc distribution
- netOglyc-3.1b.readme - netOglyc readme
- signalp-3.0/ - (NOT ACTIVELY USED by TarO) directory for signalp (signalp is currently run from /fc_gpfs/gjb_lab/www-refine/bin/signalp-3.0)
- signalp-3.0.Linux.tar.Z - signalp 3.0 distribution
- signalp-3.0.readme - signalp 3.0 readme
- secretomep-1.0/ - (NOT ACTIVELY USED by TarO) secretomep directory (it could be nice to include this program in TarO)
- secretomep-1.0c.Linux.tar.Z - (NOT ACTIVELY USED by TarO) secretomep distribution
- secretomep-1.0c.readme - readme for secretomep
- targetp-1.1/ - (NOT ACTIVELY USED by TarO) directory for targetp (it could be nice to include this program in TarO)
- targetp-1.1.Linux.tar.Z - targetp distribution
- targetp-1.1.readme - targetp readme
TARO v2 bin/Modules folder content (perl libraries used by TARO pipeline)
- bioperl-1.4/ - contains the BioPerl code used to calculate pI and molecular weight
- BioSession.pm - for front end session-based authentication
- Blast_pars.pm - for parsing/analysis of Blast results
- Clust_int_taro.pm - for sending jobs to the cluster
- OB.pm - runs the OB-Score
- ParCrys.pm - runs ParCrys (NOTE: calls the ParCrys executable from www-ob space)
- TarO_indevel.pm - general module for TarO operations, including parsing output of various programs and some analyses
httpd/htdocs folder content (TARO static html pages)
This is the 'home' directory for the TarO Front End. It contains files used by the CGI scripts in ../cgi-taro, including the help documents, as well as standard directories (eg error/, icons/).
- download (contains tab-delimited and html files to download for each query; required by scripts in ../cgi-taro)
- error (standard http error pages)
- html (this dir & subdirs ARE NOT USED BY TARO)
- icons (icons)
- images (contains the bbsrc logo used by the BioSession.pm (ie ~www-refine/bin/modules/BioSession.pm))
- Jalview (directory for Jalview applet and groups files for each query required by ../cgi-taro/display_query_seqs.pl)
- TarO_help_new_files/ (contains files (eg images) used by TarO_help.html)
- TarO_help_files/ (The help documentation accessible from all pages in the Front End)
- TarO1_org.png - image to show organisation of TarO website
- Documentation.pdf - TarO documentation used by the Tutorial account (and at the EBI EMBO course)
- index.html - a copy of the TarO help html in case (for whatever reason) someone browses the index.html file
- sspf.gif - SSPF logo displayed on all core pages
- styles.css - a stylesheet largely for display of TarO tables
- cog_organism.dat - (NOT ACTIVELY USED) gives the conversion of COG organism abbreviations to full organism names. This file is not read by the front end (the information is encoded into ../cgi-taro/targpipe_display.pl)
- TarORegistration.htm - gives the details required when registering for a private TarO account
- TarO_help.html - the help documentation accessible from all pages in the Front End
TARO v2 httpd/cgi-taro folder content (production CGI scripts)
- TOUtils.lib* - library required by the scripts
- example_fasta.pl* - used to pop up an example fasta format sequence
- global.pl* - required for the session-based authentication and contains globally displayed code (eg page headers)
- targpipe_auto.pl - (NOT USED BY TARO) code in DEVELOPMENT to facilitate automatic submission (intended for TarO-PIMS interface)
- targpipe_input.pl* - a form to accept user input sequences (communicates with targpipe_save.pl)
- targpipe_save.pl* - receives input from targpipe_input.pl* and starts the TarO pipeline
- targpipe_popup_annotation.pl* - used to display more detail about annotations (eg O-linked glycosylation sites)
- targpipe_popup_astral.pl* - (DEPRECATED) used to display more detail about astral blast matches
- targpipe_popup_pdb.pl* - used to display more detail about pdb blast matches
- targpipe_popup_seq.pl* - displays a fasta sequence
- targpipe_popup_targetdb.pl* - used to display more detail about targetdb blast matches
- targpipe_rps_blast.pl* - used to display more detail about cdd/pfam etc rpsblast matches
- targpipe_tutorial_answers.pl* - (NOT USED BY TARO) used by the TarO Tutorial account
- targpipe_tutorial_hints.pl* - (NOT USED BY TARO) used by the TarO Tutorial account
- targpipe_tutorial_questions.pl* - (NOT USED BY TARO) used by the TarO Tutorial account
- TarO_usage_stats.pl* - calculates usage statistics for TarO
- targpipe_display.pl* - displays input sequence and putative orthologues
- targpipe_display_query_seqs.pl* - displays input sequence including the link to the Multiple Sequence Alignment, and the Query Status Table
- targpipe_display_homologs.pl* - displays input sequence and putative homologues
- targpipe_home.pl - displays the user's home page
TARO v2 SSPF_T01/TarO_newclus folder content (production perl scripts for job managing)
TO_masterpipe_onNewClus.pl is started by the front end (the targpipe_save.pl cgi script) and in turn starts TarO_indevel.pl. analyse_aacomp_parzen.pl, OB.pl and ParCrys.pl are called by TarO_indevel.pl as part of the pipeline analyses. CRON calls Run_legacy_queries.pl to run through legacy queries (legacy queries arose at a TarO version change) and TarO_resubmission_engine.pl to resubmit jobs that may have failed for some reason. The drop_query.pl script is not used by the pipeline; it removes queries from the TarO system, so treat it with caution! The periodic removal of guest queries is done by a different script, upload/delete_guest_records.pl.
TARO v2 SSPF_T01/TarO_newclus/upload folder content (This directory largely contains scripts for interacting with the TarO database (eg to upload data))
1) Update Scripts:
- update_query_display.pl - allows display of the query on the user home page (via pipe_can_display_query table)
- update_status.pl - changes the status of an analysis step for a given query (writes to pipe_query_step_status table)
- update_targetDB_status.pl - called to update the targetDB status in the database table pipe_targetdb_status
- update_uniref100_statsfile.pl - called to update the UniRef100 dat file in /homes/www-refine/SSPF_TO1/TarO_newclus/databases
2) Upload Scripts:
- upload_Sequence.pl - uploads data to pipe_sequence_homology table
- upload_Cog_cluster.pl - uploads data to pipe_cog_cluster table
- upload_Sequence_Stats.pl - uploads data to pipe_sequence_statistics table
- upload_Cog_tophit.pl - uploads data to pipe_cog_match table
- upload_display_main.pl - uploads data to pipe_display_main table
- upload_TargetDB_Blast_tophit.pl - uploads data to pipe_targetdb_blast_top_hit table
- upload_display_query_seqs.pl - uploads data to pipe_display_query_seqs table
- upload_PDB_PSIBlast_tophit.pl - uploads data to pipe_pdb_psiblast_top_hit table
- upload_Annotation.pl - uploads to pipe_annotation table
- upload_Annotation_taroNew.pl - uploads to pipe_annotation table for the legacy queries
- upload_RPS_Blast_tophit.pl - uploads data to pipe_rps_blast_top_hit table
- upload_Uniref_tophit.pl - uploads data to pipe_uniref_top_hit table
3) Not actively used by the pipeline:
- delete_guest_records.pl - called by CRON to periodically delete guest records (once a month, deletes queries older than 7 days)
- delete_guest_records.log - logfile for delete_guest_records.pl
- upload_targetdb_status.pl - (NOT ACTIVELY USED by TarO) uploads the TargetDB status information to pipe_targetdb_status table when initialising the database. Not used on a per query basis.
- upload_Sequence_origin.pl - (NOT ACTIVELY USED by TarO) used to upload the pipe_Sequence_origin table, which is stable (so this is not written to on a per query basis)
- upload_Annotation_type.pl* - (NOT ACTIVELY USED by TarO) used to upload the pipe_Annotation_type table, which is stable (so this is not written to on a per query basis)
- master_upload.pl - (NOT ACTIVELY USED by TarO) can be used manually to run all upload scripts in case this doesn't happen for any given TarO query
- master_update.pl - (NOT ACTIVELY USED by TarO) can be used manually to update the status of a query (in the pipe_query_step_status table) to indicate that results are available for all analysis steps
- upload_Astral_Blast_tophit.pl - (NOT ACTIVELY USED by TarO) previously used to upload blast hits to ASTRAL. This step has since been removed from the pipeline (though it would be nice to include it)
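The retention rule mentioned above for delete_guest_records.pl (guest queries older than 7 days are deleted) can be sketched as follows. This is only an illustration of the cutoff logic: the real script works against the TarO database, whose schema is not shown here, so the `(query_id, owner, submitted_at)` tuple layout, the `"guest"` owner label, and the function name are all assumptions.

```python
from datetime import datetime, timedelta

# "Older than 7 days" comes from the text; the record layout is hypothetical.
RETENTION = timedelta(days=7)


def expired_guest_queries(queries, now):
    """Return ids of guest queries submitted more than RETENTION before `now`.

    `queries` is an iterable of (query_id, owner, submitted_at) tuples.
    """
    cutoff = now - RETENTION
    return [
        query_id
        for query_id, owner, submitted_at in queries
        if owner == "guest" and submitted_at < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2009, 3, 8)
    sample = [
        (101, "guest", datetime(2009, 2, 20)),  # old guest query: selected
        (102, "guest", datetime(2009, 3, 5)),   # recent guest query: kept
        (103, "alice", datetime(2009, 1, 1)),   # registered user: kept
    ]
    print(expired_guest_queries(sample, now))
```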
TARO v2 SSPF_T01/TarO_newclus/qsub_commands folder content (This directory contains qsub commands used for manually starting the TarO pipeline)
qsub_bigmemQueue.dat - resubmits a job to the bigmem.q (memory can be an issue causing jobs to fail)
TARO v2 SSPF_T01/TarO_newclus/databases folder content (This directory contains databases, dat files and symlinks to databases for the TarO pipeline)
-- Release/update information
- DB.dat -> /db/blastdb/DB.dat - Symlink to last update information for the Blast databases
- uniref100.release_note -> /db/blastdb/uniref100.release_note - symlink to the uniref100 release information
-- COG/KOG data
- COG_KOG.phr - BLAST database for COG/KOG
- COG_KOG.pin - BLAST database for COG/KOG
- COG_KOG.psq - BLAST database for COG/KOG
- Cog_andKog_prot_nonredund.fasta - COG/KOG sequences in fasta format
- cog_kog.txt - dat file for COG/KOG groupings
- Organism_CogKog_abbreviations.txt - COG/KOG Organism abbreviations and taxonomic information
-- Hydrophobicity & OB-Score data
- GES_Hydrophobicity_scores.dat - GES Hydrophobicity score data
- GES_zmat2.dat - OB-score matrix based on GES hydrophobicity
- Hydrophobicity_scores.dat - Kyte-Doolittle Hydrophobicity score data
-- PDB data
- pdb.phr -> /db/blastdb/pdb.phr - Symlink to latest PDB database for BLAST searching
- pdb.pin -> /db/blastdb/pdb.pin - Symlink to latest PDB database for BLAST searching
- pdb.psd -> /db/blastdb/pdb.psd - Symlink to latest PDB database for BLAST searching
- pdb.psi -> /db/blastdb/pdb.psi - Symlink to latest PDB database for BLAST searching
- pdb.psq -> /db/blastdb/pdb.psq - Symlink to latest PDB database for BLAST searching
- pdb_uniref50.phr -> /db/blastdb/pdb_uniref50.phr - Symlink to latest PDB database for PSIBLAST searching
- pdb_uniref50.pin -> /db/blastdb/pdb_uniref50.pin - Symlink to latest PDB database for PSIBLAST searching
- pdb_uniref50.psd -> /db/blastdb/pdb_uniref50.psd - Symlink to latest PDB database for PSIBLAST searching
- pdb_uniref50.psi -> /db/blastdb/pdb_uniref50.psi - Symlink to latest PDB database for PSIBLAST searching
- pdb_uniref50.psq -> /db/blastdb/pdb_uniref50.psq - Symlink to latest PDB database for PSIBLAST searching
-- RPS BLAST databases
- RPS_blast_taro -> /db/CDD/ - Symlink to latest RPSblast databases directory
-- TargetDB data
- targetdb.fasta -> /db/targetdb/targetdb.fasta - symlink to latest targetDB sequences
- targetdb_copy.fasta - provides a record of the last update to TargetDB information
- targetdb.txt -> /db/targetdb/targetdb.txt - symlink to targetdb information (eg status etc); input to update_targetDB_status.pl script
- targetdb_taro.phr -> /db/blastdb/targetdb.phr - Symlink to latest TargetDB database for BLAST searching
- targetdb_taro.pin -> /db/blastdb/targetdb.pin - Symlink to latest TargetDB database for BLAST searching
- targetdb_taro.psd -> /db/blastdb/targetdb.psd - Symlink to latest TargetDB database for BLAST searching
- targetdb_taro.psi -> /db/blastdb/targetdb.psi - Symlink to latest TargetDB database for BLAST searching
- targetdb_taro.psq -> /db/blastdb/targetdb.psq - Symlink to latest TargetDB database for BLAST searching
-- UniProt/UniRef100 data
- U100_organisms_tophyla_120808.dat - mapping UniRef100 organisms to taxonomic information
- uniref100_stats_taro.dat - uniref100 taxonomic and sequence length information
- uniref100_taro.fasta -> /db/uniref/uniref100.fasta - Symlink to latest UniRef100 sequences
Symlinks for blast searching uniref100:
- uniref100.00.phr -> /db/blastdb/uniref100.00.phr
- uniref100.00.pin -> /db/blastdb/uniref100.00.pin
- uniref100.00.psd -> /db/blastdb/uniref100.00.psd
- uniref100.00.psi -> /db/blastdb/uniref100.00.psi
- uniref100.00.psq -> /db/blastdb/uniref100.00.psq
- uniref100.01.phr -> /db/blastdb/uniref100.01.phr
- uniref100.01.pin -> /db/blastdb/uniref100.01.pin
- uniref100.01.psd -> /db/blastdb/uniref100.01.psd
- uniref100.01.psi -> /db/blastdb/uniref100.01.psi
- uniref100.01.psq -> /db/blastdb/uniref100.01.psq
- uniref100.02.phr -> /db/blastdb/uniref100.02.phr
- uniref100.02.pin -> /db/blastdb/uniref100.02.pin
- uniref100.02.psd -> /db/blastdb/uniref100.02.psd
- uniref100.02.psi -> /db/blastdb/uniref100.02.psi
- uniref100.02.psq -> /db/blastdb/uniref100.02.psq
- uniref100.pal -> /db/blastdb/uniref100.pal
Symlinks for psiblast searching uniref100:
- uniref100.filt -> /db/blastdb/uniref100.filt - (NOT ACTIVELY USED by TarO) symlink to fasta file for filtered uniref100 database: kept for completeness
- uniref100.filt.00.phr -> /db/blastdb/uniref100.filt.00.phr
- uniref100.filt.00.pin -> /db/blastdb/uniref100.filt.00.pin
- uniref100.filt.00.psd -> /db/blastdb/uniref100.filt.00.psd
- uniref100.filt.00.psi -> /db/blastdb/uniref100.filt.00.psi
- uniref100.filt.00.psq -> /db/blastdb/uniref100.filt.00.psq
- uniref100.filt.01.phr -> /db/blastdb/uniref100.filt.01.phr
- uniref100.filt.01.pin -> /db/blastdb/uniref100.filt.01.pin
- uniref100.filt.01.psd -> /db/blastdb/uniref100.filt.01.psd
- uniref100.filt.01.psi -> /db/blastdb/uniref100.filt.01.psi
- uniref100.filt.01.psq -> /db/blastdb/uniref100.filt.01.psq
- uniref100.filt.02.phr -> /db/blastdb/uniref100.filt.02.phr
- uniref100.filt.02.pin -> /db/blastdb/uniref100.filt.02.pin
- uniref100.filt.02.psd -> /db/blastdb/uniref100.filt.02.psd
- uniref100.filt.02.psi -> /db/blastdb/uniref100.filt.02.psi
- uniref100.filt.02.psq -> /db/blastdb/uniref100.filt.02.psq
- uniref100.filt.pal -> /db/blastdb/uniref100.filt.pal
-- Not in active use
- Cog_andKog_prot_nonredund_b4clean.fasta - (NOT ACTIVELY USED by TarO) The COG/KOG sequences before cleaning
- astral_1_67.phr - (NOT ACTIVELY USED by TarO) The Astral database for BLAST searching
- astral_1_67.pin - (NOT ACTIVELY USED by TarO) The Astral database for BLAST searching
- astral_1_67.psq - (NOT ACTIVELY USED by TarO) The Astral database for BLAST searching
- protein2interpro.dat - (NOT ACTIVELY USED by TarO) mapping from uniprot to interpro (it would be nice to include this information in TarO)
- CVS/ - (NOT ACTIVELY USED by TarO) directory for CVS repository
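Since most entries in this directory are symlinks into /db, a database update that renames or removes a target would leave a dangling link. A minimal sketch of a check for that, assuming nothing beyond the standard library; the function name `broken_symlinks` is hypothetical and is not part of TarO.

```python
from pathlib import Path


def broken_symlinks(directory):
    """Return sorted names of symlinks in `directory` whose target is missing."""
    broken = []
    for entry in Path(directory).iterdir():
        # is_symlink() inspects the link itself; exists() follows the link,
        # so it is False when the target has gone away.
        if entry.is_symlink() and not entry.exists():
            broken.append(entry.name)
    return sorted(broken)


if __name__ == "__main__":
    # Path as used elsewhere in this document (see the www-refine crontab).
    databases = "/homes/www-refine/SSPF_TO1/TarO_newclus/databases"
    if Path(databases).is_dir():
        for name in broken_symlinks(databases):
            print("dangling symlink:", name)
```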
TARO v2 SSPF_T01/TarO_newclus/cluster_scripts folder content (scripts that are run on the cluster nodes)
This directory contains scripts that are run on the cluster nodes as array jobs under Sun Grid Engine. These are used via the Clust_int_taro.pm module.
- Blast_newclus.pl - runs BLASTP
- PSIBLAST_newclus_localDB.pl - runs PSIBLAST and copies the database to a local cluster node directory (for better bandwidth)
- PSIBLAST_newclus.pl - (NOT ACTIVELY USED by TarO) runs PSIBLAST
- Generic_cluster_interface.pl - this is used to run many programs (eg Disembl, netNglyc)
- RPSBlast_newclus_localDB.pl - runs RPSBLAST and copies the database to a directory local to the cluster node (for better bandwidth)
- RPSBlast_newclus.pl - runs RPSBLAST
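The "localDB" scripts above copy the search database to a directory local to the cluster node before searching. The production scripts are Perl and their details are not reproduced here; the following is only a sketch of the general staging idea, with a hypothetical function name and a simple size-and-mtime check (an assumption) to avoid recopying files that are already in place.

```python
import shutil
from pathlib import Path


def stage_database(db_dir, db_name, local_dir):
    """Copy every file named <db_name>.* from db_dir into local_dir.

    Files already present locally with the same size and modification time
    are skipped. Returns the local database prefix for the search program.
    Note that a prefix such as "uniref100" also matches "uniref100.filt.*".
    """
    src_dir, dst_dir = Path(db_dir), Path(local_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(src_dir.glob(db_name + ".*")):
        dst = dst_dir / src.name
        if dst.exists():
            s, d = src.stat(), dst.stat()
            if s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime):
                continue  # local copy looks current
        shutil.copy2(src, dst)  # copy2 follows symlinks and keeps the mtime
    return str(dst_dir / db_name)
```

The trade-off is one bulk copy per node against many reads over the shared filesystem during the search, which is the bandwidth argument given above.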
www-refine crontab
0 0 8 * * perl /homes/www-refine/SSPF_TO1/TarO_newclus/upload/delete_guest_records.pl >> /homes/www-refine/SSPF_TO1/TarO_newclus/upload/delete_guest_records.log
0 0 * * * perl /homes/www-refine/SSPF_TO1/TarO_newclus/TarO_resubmission_engine.pl >> /homes/www-refine/SSPF_TO1/TarO_newclus/TarO_resubmission_engine.log
0 3 * * * perl /homes/www-refine/SSPF_TO1/TarO_newclus/Run_legacy_queries.pl
0 15 * * * perl /homes/www-refine/SSPF_TO1/TarO_newclus/Run_legacy_queries.pl
0 0 1 1 * rm -rf /fc_gpfs/gjb_lab/www-refine/bin/signalp-3.0/tmp
0 0 1 6 * rm -rf /fc_gpfs/gjb_lab/www-refine/bin/signalp-3.0/tmp
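Each crontab entry above consists of five schedule fields (minute, hour, day of month, month, day of week) followed by the command. A small sketch for splitting an entry into those parts; the function name `parse_cron_line` is hypothetical and only plain five-field entries like the ones above are handled.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron_line(line):
    """Split one crontab entry into its schedule fields and its command."""
    parts = line.split(None, 5)  # five schedule fields, then the rest
    if len(parts) < 6:
        raise ValueError(f"not a complete crontab entry: {line!r}")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    return schedule, parts[5]


if __name__ == "__main__":
    entry = ("0 0 8 * * perl /homes/www-refine/SSPF_TO1/TarO_newclus/upload/"
             "delete_guest_records.pl")
    schedule, command = parse_cron_line(entry)
    print(schedule)
    print(command)
```

Read this way, `0 0 8 * *` runs at midnight on the 8th of every month (the monthly guest-record cleanup), `0 3 * * *` and `0 15 * * *` run daily at 03:00 and 15:00, and `0 0 1 1 *` and `0 0 1 6 *` run at midnight on 1 January and 1 June.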
TARO CVS Repository
General principles:
Given the nature of TARO, it did not look feasible to run its scripts outside of a cluster environment, so CVS contains copies of the production code that was developed and debugged on the production system. Every effort was made to ensure that the current production versions of the scripts are backed up in the repository. A specific feature of this repository is that CVS HEAD contains, alongside the current versions of the scripts, old versions of the scripts and other scripts that are not part of the TARO pipeline and have been used manually for various purposes.
Please note that any other scripts in the www-refine directory are obsolete. (Most of them have been moved to the Attic, but some persist.)
CVS directory structure:
- DB (scripts to create SQL (targpipe.sql) & update TARO database)
- FrontEnd (scripts to do with presentation)
- httpd
- cgi-devel (cgi scripts in development)
- cgi-taro (production cgi scripts)
- conf (apache config file)
- htdocs (static html pages)
- script_repos (main TARO scripts)
- TarO_indevel.pl (the master TarO script)
- TO_masterpipe_onNewClus.pl (the script to call TarO_indevel.pl)
- the scripts for running various software in Grid engine array context (eg Generic_cluster_interface.pl, PSIBLAST_newclus_localDB.pl)
- upload (scripts to upload data into TARO DB)
- update_scripts (utility scripts to update inhouse database which TARO uses)