TARO Developer Documentation/ File and Directory Structure

TARO Production Environment

This documentation does not describe the complete TARO file structure; it is intended to cover the production system only.

Table of Contents

TARO files location in the production environment
The content of /homes/www-refine directory
www-refine user crontab
TARO CVS Repository

TARO files location on the fiber channel disk (write accessible from the cluster only)

TARO output directory for each run including log and error files:
/fc_gpfs/gjb_lab/www-refine/pipeline_newclus/TO<jobid>
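The naming convention above can be sketched as a small helper. This is only an illustration: the function name job_output_dir is hypothetical and not part of TARO, and it assumes job IDs are non-negative integers appended directly to the TO prefix. Only the base path and the prefix come from this page.

```python
from pathlib import PurePosixPath

# Base path of the per-run output directories, as listed above.
PIPELINE_ROOT = PurePosixPath("/fc_gpfs/gjb_lab/www-refine/pipeline_newclus")


def job_output_dir(job_id: int) -> PurePosixPath:
    """Return the output directory for one run (hypothetical helper).

    Assumption: job IDs are non-negative integers written straight
    after the "TO" prefix, with no padding.
    """
    if job_id < 0:
        raise ValueError("job_id must be non-negative")
    return PIPELINE_ROOT / f"TO{job_id}"


print(job_output_dir(1234))
# prints /fc_gpfs/gjb_lab/www-refine/pipeline_newclus/TO1234
```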

Binaries executing from the fiber channel disk

Two binaries are run from and located in
/fc_gpfs/gjb_lab/www-refine/bin
In particular netOglyc-3.1b and signalp-3.0

The content of /homes/www-refine directory - the root TARO directory

Many directories contain an old/ subdirectory with legacy code, which IS NOT USED BY TARO

backup - contains the pipeline_newclus directory, a backup (from Sept 2008) of the pipeline_newclus directory that holds the working directories for each TarO query
Benchmarking - scripts to measure aspects of TARO performance
bin


Brunak/ (contains several programs actively used by TarO, all from the Brunak group)
modules/ (contains perl modules used by TarO)
RONN (RONN v3 programme, and directory for running it)
SEG (NOT USED by TarO; directory for running the SEG program)

httpd

cgi-bin (directory for the earlier version of the TarO Front End without session-based authentication. This is required for the display of some legacy TarO queries.)
cgi-taro (production cgi scripts)
cgi-devel/ - directory for development of TarO Front End
conf (apache config file)
conf.d (may not be needed?)
htdocs (static html pages)
logs (TARO apache logs)

NOBACK

pipeline_results (output of different tools with subdirectories for each query named as TO<queryid>)
- BlastASTRAL_out (DEPRECATED)
- BlastPDB_out
- BlastTargetDB_out
- cdd_RPSBLAST_out
- COG_BLAST
- Cog_RPSBLAST_out
- Disembl_out
- Globplot_out
- Homologues
- Jpred_out
- Kog_RPSBLAST_out
- netNglyc_out
- netOglyc_out
- netphos_out
- Pfam_RPSBLAST_out
- Psiblast_uniref
- PsiBlastPDB_out
- raw_pipe_out
- RONN_out
- SignalP_out
- smart_RPSBLAST_out
- test (FOR TESTING)
- TMHMM2_out

SSPF_T01
- i_pwd (directory for Apache Basic authentication of devel user for the website used in TarO front-end development and testing)
- TarO_newclus (the hub directory for the TarO pipeline; the main TARO scripts are located here)
  - upload/ directory also contains scripts called as part of the pipeline (generally to input/update the database)
  - cluster_scripts/ directory contains scripts called by the pipeline as part of the mechanism for running array jobs on the cluster
  - databases/ directory contains databases for searching and other dat files (eg matrix for calculating the OB score)
  - qsub_commands/ directory - IS NOT PART OF THE TARO PIPELINE; the scripts inside this directory are used to start the TARO pipeline manually
- TarO2pt0 (TARO CVS)
- raw_pipe_out_200509.tar.bz2 - zipped tar archive of the raw output for the various algorithms run in TarO queries 1 to 1234

bin/brunak (contains programs obtained from the Brunak group that are run in TarO)

TMHMM2/ - directory for TMHMM2 

netphos-2.0/ - directory for netphos
netphos-2.0.Linux.tar.Z - netphos distribution
netphos-2.0.readme - netphos readme

netNglyc-1.0a.Linux.tar.Z - netNglyc distribution NOTE netNglyc is run from a directory in the NOBACK directory
netNglyc-1.0a.readme - netNglyc readme

NOBACK/ - contains the netNglyc directory (which can generate a lot of files, so it is better kept in NOBACK)

netOglyc-3.1b/ - (NOT ACTIVELY USED  by TarO) directory for netOglyc (this is not actively used as netOglyc is currently run from /fc_gpfs/gjb_lab/www-refine/bin/netOglyc-3.1b)
netOglyc-3.1b.Linux.tar.Z - netOglyc distribution
netOglyc-3.1b.readme - netOglyc readme

signalp-3.0/ - (NOT ACTIVELY USED by TarO) directory for signalp. (This is not actively used as signalp is currently run from /fc_gpfs/gjb_lab/www-refine/bin/signalp-3.0)
signalp-3.0.Linux.tar.Z - signalp 3.0 distribution
signalp-3.0.readme - signalp 3.0 readme

secretomep-1.0/ - (NOT ACTIVELY USED by TarO) secretomep directory (could be nice to include this program into TarO)
secretomep-1.0c.Linux.tar.Z - (NOT ACTIVELY USED by TarO) secretomep distribution
secretomep-1.0c.readme - readme for secretomep

targetp-1.1/ - (NOT ACTIVELY USED by TarO) directory for targetp  (could be nice to include this program into TarO)
targetp-1.1.Linux.tar.Z - targetp distribution
targetp-1.1.readme - targetp readme

TARO v2 bin/modules folder content (perl libraries used by the TARO pipeline)


bioperl-1.4/  - contains the BioPerl code used to calculate pI and molecular weight 
BioSession.pm   -  for front end session based authentication
Blast_pars.pm -  for parsing/analysis of Blast results  
Clust_int_taro.pm   -  for sending jobs to the cluster 
OB.pm   - runs the OB-Score
ParCrys.pm   - runs ParCrys - NOTE calls the ParCrys executable from www-ob space
TarO_indevel.pm - general module for TarO operations including parsing output of various programs and some analyses

httpd/htdocs folder content (TARO static html pages)

This is the 'home' directory for the TarO Front End. It contains files used by the CGI scripts in ../cgi-taro, including the help documents, as well as standard directories (eg error/, icons/).

download (contains tab-delimited and html files to download for each query. required by scripts in ../cgi-taro)
error (standard http error pages)
html (this dir & subdirs ARE NOT USED BY TARO)
icons (icons)
images (contains the bbsrc logo used by the BioSession.pm (ie ~www-refine/bin/modules/BioSession.pm))
Jalview (directory for Jalview applet and groups files for each query required by ../cgi-taro/display_query_seqs.pl)
TarO_help_new_files/ (contains files (eg images) used by TarO_help.html)
TarO_help_files/ (The help documentation accessible from all pages in the Front End)

TarO1_org.png - image to show organisation of TarO website

Documentation.pdf - TarO documentation used by the Tutorial account (and at the EBI EMBO course)

index.html - a copy of the TarO help html in case (for whatever reason) someone browses the index.html file

sspf.gif - SSPF logo displayed on all core pages

styles.css - a stylesheet largely for display of TarO tables

cog_organism.dat - (NOT ACTIVELY USED) gives the conversion of COG organism abbreviations to full organism names. This file is not read by the front end (the information is encoded into ../cgi-taro/targpipe_display.pl)

TarORegistration.htm - gives the details required when registering for a private TarO account

TarO_help.html - The help documentation accessible from all pages in the Front End

TARO v2 httpd/cgi-taro folder content (production CGI scripts)

 
TOUtils.lib* - library required by the scripts

example_fasta.pl* - used to popup an example fasta format sequence

global.pl* - required for the session-based authentication and contains globally displayed code (eg page headers)

targpipe_auto.pl - (NOT USED BY TARO) code in DEVELOPMENT to facilitate automatic submission (intended for the TarO-PIMS interface)

targpipe_input.pl* - a form to accept user input sequences (communicates to targpipe_save.pl)

targpipe_save.pl* - receives input from targpipe_input.pl* and starts the TarO pipeline

targpipe_popup_annotation.pl* - used to display more detail about annotations (eg  O-linked glycosylation sites)

targpipe_popup_astral.pl* - (DEPRECATED) used to display more detail about astral blast matches

targpipe_popup_pdb.pl* - used to display more detail about pdb blast matches 

targpipe_popup_seq.pl* - displays a fasta sequence

targpipe_popup_targetdb.pl*- used to display more detail about targetdb blast matches

targpipe_rps_blast.pl*- used to display more detail about cdd/pfam etc rpsblast matches

targpipe_tutorial_answers.pl* - (NOT USED BY TARO) used by the TarO Tutorial account

targpipe_tutorial_hints.pl* - (NOT USED BY TARO) used by the TarO Tutorial account

targpipe_tutorial_questions.pl* - (NOT USED BY TARO) used by the TarO Tutorial account

TarO_usage_stats.pl* - calculates usage statistics for TarO

targpipe_display.pl* - displays input sequence and putative orthologues

targpipe_display_query_seqs.pl* - displays input sequence including the link to the Multiple Sequence Alignment, and the Query Status Table

targpipe_display_homologs.pl*- displays input sequence and putative homologues

targpipe_home.pl - displays the user's home page

TARO v2 SSPF_T01/TarO_newclus folder content (production perl scripts for job managing)

TO_masterpipe_onNewClus.pl is started by the front end (the targpipe_save.pl cgi script)
and in turn starts the TarO_indevel.pl script.

analyse_aacomp_parzen.pl, OB.pl and ParCrys.pl are called by TarO_indevel.pl (as part of the pipeline analyses).

CRON calls Run_legacy_queries.pl and TarO_resubmission_engine.pl to respectively run through legacy queries
(legacy queries arose at TarO version change) and to resubmit jobs that may have failed for some reason

drop_query.pl script is not used by the pipeline and will remove queries from the TarO system - treat with caution!
However the periodic removal of guest queries is done by another script - upload/delete_guest_records.pl
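The relationships described in this section can be written down as a small lookup table. The sketch below is illustrative only: the STARTS dictionary and the call_tree function are hypothetical names, and the table simply transcribes who starts whom as stated on this page. It is not derived from reading the scripts themselves.

```python
# Who starts whom, transcribed from the description above.
STARTS = {
    "targpipe_save.pl": ["TO_masterpipe_onNewClus.pl"],
    "TO_masterpipe_onNewClus.pl": ["TarO_indevel.pl"],
    "TarO_indevel.pl": ["analyse_aacomp_parzen.pl", "OB.pl", "ParCrys.pl"],
    "cron": [
        "Run_legacy_queries.pl",
        "TarO_resubmission_engine.pl",
        "upload/delete_guest_records.pl",
    ],
}


def call_tree(root, depth=0):
    """Return the call chain below `root` as indented lines (hypothetical helper)."""
    lines = ["  " * depth + root]
    for child in STARTS.get(root, []):
        lines.extend(call_tree(child, depth + 1))
    return lines


print("\n".join(call_tree("targpipe_save.pl")))
```

Calling call_tree("cron") lists the three scripts that are started from the crontab rather than from the front end.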

TARO v2 SSPF_T01/TarO_newclus/upload folder content (This directory largely contains scripts for interacting with the TarO database (eg to upload data))

 
                            
1) Update Scripts:
update_query_display.pl - allows display of the query on the user home page  (via pipe_can_display_query table)           

update_status.pl - changes the status of an analysis step for a given query (writes to pipe_query_step_status table)
       
update_targetDB_status.pl - called to update the targetDB status in the database table pipe_targetdb_status
 
update_uniref100_statsfile.pl - called to update the UniRef100 dat file in /homes/www-refine/SSPF_TO1/TarO_newclus/databases      


2) Upload Scripts:         
upload_Sequence.pl - uploads data to pipe_sequence_homology table                        

upload_Cog_cluster.pl - uploads data to pipe_cog_cluster table  
              
upload_Sequence_Stats.pl - uploads data to pipe_sequence_statistics table
                             
upload_Cog_tophit.pl - uploads data to pipe_cog_match table
             
upload_display_main.pl - uploads data to pipe_display_main table

upload_TargetDB_Blast_tophit.pl - uploads data to pipe_targetdb_blast_top_hit table
                 
upload_display_query_seqs.pl - uploads data to pipe_display_query_seqs table                          

upload_PDB_PSIBlast_tophit.pl - uploads data to pipe_pdb_psiblast_top_hit table      
 
upload_Annotation.pl - uploads to pipe_annotation table               

upload_Annotation_taroNew.pl - uploads to pipe_annotation table for the legacy queries                  

upload_RPS_Blast_tophit.pl - uploads data to pipe_rps_blast_top_hit table

upload_Uniref_tophit.pl - uploads data to pipe_uniref_top_hit table
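When tracing which script populated a given table, the list above can be kept as a lookup. This is a sketch only: UPLOAD_TARGETS and scripts_writing_to are hypothetical names, and the mapping is copied from the descriptions on this page rather than checked against the scripts or the database schema.

```python
# Upload script -> database table, as listed above.
UPLOAD_TARGETS = {
    "upload_Sequence.pl": "pipe_sequence_homology",
    "upload_Cog_cluster.pl": "pipe_cog_cluster",
    "upload_Sequence_Stats.pl": "pipe_sequence_statistics",
    "upload_Cog_tophit.pl": "pipe_cog_match",
    "upload_display_main.pl": "pipe_display_main",
    "upload_TargetDB_Blast_tophit.pl": "pipe_targetdb_blast_top_hit",
    "upload_display_query_seqs.pl": "pipe_display_query_seqs",
    "upload_PDB_PSIBlast_tophit.pl": "pipe_pdb_psiblast_top_hit",
    "upload_Annotation.pl": "pipe_annotation",
    "upload_Annotation_taroNew.pl": "pipe_annotation",
    "upload_RPS_Blast_tophit.pl": "pipe_rps_blast_top_hit",
    "upload_Uniref_tophit.pl": "pipe_uniref_top_hit",
}


def scripts_writing_to(table):
    """Return the upload scripts documented as writing to `table`, sorted by name."""
    return sorted(s for s, t in UPLOAD_TARGETS.items() if t == table)


print(scripts_writing_to("pipe_annotation"))
# prints ['upload_Annotation.pl', 'upload_Annotation_taroNew.pl']
```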

3) Not Actively used by the pipeline:
delete_guest_records.pl - called by CRON to periodically delete guest records (once a month, deletes queries older than 7 days)                       
delete_guest_records.log - logfile for  delete_guest_records.pl

upload_targetdb_status.pl - (NOT ACTIVELY USED by TarO) uploads the TargetDB status information to pipe_targetdb_status table when initialising the database. Not used on a per query basis.

upload_Sequence_origin.pl - (NOT ACTIVELY USED by TarO) used to upload the pipe_Sequence_origin table, which is stable (so this is not written to on a per query basis)

upload_Annotation_type.pl* - (NOT ACTIVELY USED by TarO) used to upload the pipe_Annotation_type table, which is stable (so this is not written to on a per query basis)

master_upload.pl - (NOT ACTIVELY USED by TarO) can be used manually to run all upload scripts in case this doesn't happen for any given TarO query

master_update.pl - (NOT ACTIVELY USED by TarO) can be used manually to update the status of a query (in the pipe_query_step_status table) to indicate that results are available for all analysis steps

upload_Astral_Blast_tophit.pl - (NOT ACTIVELY USED by TarO) previously used to upload blast hits to ASTRAL. This step has since been removed from the pipeline (though it would be nice to include this)

TARO v2 SSPF_T01/TarO_newclus/qsub_commands folder content (This directory contains the qsub command used for manually starting the TarO pipeline)

 
qsub_bigmemQueue.dat - resubmits a job to the bigmem.q  (memory can be an issue causing jobs to fail)

TARO v2 SSPF_T01/TarO_newclus/databases folder content (This directory contains databases, dat files and symlinks to databases for the TarO pipeline)

 

-- Release/update information
DB.dat -> /db/blastdb/DB.dat - Symlink to last update information for the Blast databases
uniref100.release_note -> /db/blastdb/uniref100.release_note - symlink to the uniref100 release information


-- COG/KOG data
COG_KOG.phr - BLAST database for COG/KOG
COG_KOG.pin - BLAST database for COG/KOG
COG_KOG.psq - BLAST database for COG/KOG

Cog_andKog_prot_nonredund.fasta - COG/KOG sequences in fasta format
cog_kog.txt - dat file for COG/KOG groupings
Organism_CogKog_abbreviations.txt - COG/KOG Organism abbreviations and taxonomic information


-- Hydrophobicity & OB-Score data
GES_Hydrophobicity_scores.dat - GES Hydrophobicity score data
GES_zmat2.dat - OB-score matrix based on GES hydrophobicity
Hydrophobicity_scores.dat - Kyte-Doolittle Hydrophobicity score data


-- PDB data
pdb.phr -> /db/blastdb/pdb.phr - Symlink to latest PDB database for BLAST searching
pdb.pin -> /db/blastdb/pdb.pin - Symlink to latest PDB database for BLAST searching
pdb.psd -> /db/blastdb/pdb.psd - Symlink to latest PDB database for BLAST searching
pdb.psi -> /db/blastdb/pdb.psi - Symlink to latest PDB database for BLAST searching
pdb.psq -> /db/blastdb/pdb.psq - Symlink to latest PDB database for BLAST searching

pdb_uniref50.phr -> /db/blastdb/pdb_uniref50.phr - Symlink to latest PDB database for PSIBLAST searching
pdb_uniref50.pin -> /db/blastdb/pdb_uniref50.pin - Symlink to latest PDB database for PSIBLAST searching
pdb_uniref50.psd -> /db/blastdb/pdb_uniref50.psd - Symlink to latest PDB database for PSIBLAST searching
pdb_uniref50.psi -> /db/blastdb/pdb_uniref50.psi - Symlink to latest PDB database for PSIBLAST searching
pdb_uniref50.psq -> /db/blastdb/pdb_uniref50.psq - Symlink to latest PDB database for PSIBLAST searching


-- RPS BLAST databases
RPS_blast_taro -> /db/CDD/ - Symlink to latest RPSblast databases directory


-- TargetDB data
targetdb.fasta -> /db/targetdb/targetdb.fasta - symlink to latest targetDB sequences
targetdb_copy.fasta - provides a record of the last update to TargetDB information
targetdb.txt -> /db/targetdb/targetdb.txt - symlink to targetdb information  (eg status etc). input to update_targetDB_status.pl script

targetdb_taro.phr -> /db/blastdb/targetdb.phr -  Symlink to latest TargetDB database for BLAST searching
targetdb_taro.pin -> /db/blastdb/targetdb.pin - Symlink to latest TargetDB database for BLAST searching
targetdb_taro.psd -> /db/blastdb/targetdb.psd - Symlink to latest TargetDB database for BLAST searching
targetdb_taro.psi -> /db/blastdb/targetdb.psi - Symlink to latest TargetDB database for BLAST searching
targetdb_taro.psq -> /db/blastdb/targetdb.psq - Symlink to latest TargetDB database for BLAST searching


-- UniProt/UniRef100 data
U100_organisms_tophyla_120808.dat - mapping UniRef100 organisms to taxonomic information
uniref100_stats_taro.dat - uniref100 taxonomic and sequence length information
uniref100_taro.fasta -> /db/uniref/uniref100.fasta - Symlink to latest UniRef100 sequences

- Symlinks for blast searching uniref100:
uniref100.00.phr -> /db/blastdb/uniref100.00.phr
uniref100.00.pin -> /db/blastdb/uniref100.00.pin
uniref100.00.psd -> /db/blastdb/uniref100.00.psd
uniref100.00.psi -> /db/blastdb/uniref100.00.psi
uniref100.00.psq -> /db/blastdb/uniref100.00.psq
uniref100.01.phr -> /db/blastdb/uniref100.01.phr
uniref100.01.pin -> /db/blastdb/uniref100.01.pin
uniref100.01.psd -> /db/blastdb/uniref100.01.psd
uniref100.01.psi -> /db/blastdb/uniref100.01.psi
uniref100.01.psq -> /db/blastdb/uniref100.01.psq
uniref100.02.phr -> /db/blastdb/uniref100.02.phr
uniref100.02.pin -> /db/blastdb/uniref100.02.pin
uniref100.02.psd -> /db/blastdb/uniref100.02.psd
uniref100.02.psi -> /db/blastdb/uniref100.02.psi
uniref100.02.psq -> /db/blastdb/uniref100.02.psq
uniref100.pal -> /db/blastdb/uniref100.pal


- Symlinks for psiblast searching uniref100:
uniref100.filt -> /db/blastdb/uniref100.filt - (NOT ACTIVELY USED by TarO) symlink to fasta file for filtered uniref100 database: kept for completeness
uniref100.filt.00.phr -> /db/blastdb/uniref100.filt.00.phr
uniref100.filt.00.pin -> /db/blastdb/uniref100.filt.00.pin
uniref100.filt.00.psd -> /db/blastdb/uniref100.filt.00.psd
uniref100.filt.00.psi -> /db/blastdb/uniref100.filt.00.psi
uniref100.filt.00.psq -> /db/blastdb/uniref100.filt.00.psq
uniref100.filt.01.phr -> /db/blastdb/uniref100.filt.01.phr
uniref100.filt.01.pin -> /db/blastdb/uniref100.filt.01.pin
uniref100.filt.01.psd -> /db/blastdb/uniref100.filt.01.psd
uniref100.filt.01.psi -> /db/blastdb/uniref100.filt.01.psi
uniref100.filt.01.psq -> /db/blastdb/uniref100.filt.01.psq
uniref100.filt.02.phr -> /db/blastdb/uniref100.filt.02.phr
uniref100.filt.02.pin -> /db/blastdb/uniref100.filt.02.pin
uniref100.filt.02.psd -> /db/blastdb/uniref100.filt.02.psd
uniref100.filt.02.psi -> /db/blastdb/uniref100.filt.02.psi
uniref100.filt.02.psq -> /db/blastdb/uniref100.filt.02.psq
uniref100.filt.pal -> /db/blastdb/uniref100.filt.pal
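Because so many entries in this directory are symlinks into /db, a dangling link is a plausible cause of a failed search after a database update. The following is a minimal sketch of how such links could be spotted. The function broken_symlinks is a hypothetical helper and not part of TARO; the demonstration runs in a throwaway temporary directory with made-up file names rather than on the real databases directory.

```python
import os
import tempfile


def broken_symlinks(directory):
    """Return the names of symlinks in `directory` whose targets are missing."""
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # os.path.exists follows symlinks, so it is False for a dangling link.
        if os.path.islink(path) and not os.path.exists(path):
            broken.append(name)
    return broken


# Demonstration with stand-in files (names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "pdb_real.phr")
    open(target, "w").close()
    os.symlink(target, os.path.join(tmp, "pdb.phr"))  # valid link
    os.symlink(os.path.join(tmp, "missing.psq"), os.path.join(tmp, "pdb.psq"))  # dangling link
    print(broken_symlinks(tmp))
    # prints ['pdb.psq']
```

Pointing the function at the databases directory would list any link whose target under /db has been moved or removed.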


-- Not in Active use
Cog_andKog_prot_nonredund_b4clean.fasta - (NOT ACTIVELY USED by TarO) The COG/KOG sequence before cleaning

astral_1_67.phr - (NOT ACTIVELY USED by TarO) The Astral database for BLAST searching
astral_1_67.pin - (NOT ACTIVELY USED by TarO) The Astral database for BLAST searching
astral_1_67.psq - (NOT ACTIVELY USED by TarO) The Astral database for BLAST searching

protein2interpro.dat - (NOT ACTIVELY USED by TarO) mapping from uniprot to interpro (would be nice to include this information into TarO)

CVS/ - (NOT ACTIVELY USED by TarO) directory for CVS repository

TARO v2 SSPF_T01/TarO_newclus/cluster_scripts folder content (scripts that are run on the cluster nodes)

 
This directory contains scripts that are run on the cluster nodes as array jobs submitted to Sun Grid Engine.
These are used via the Clust_int_taro.pm module.

Blast_newclus.pl - runs BLASTP         

PSIBLAST_newclus_localDB.pl - runs PSIBLAST and copies the database to a local cluster node directory (for better bandwidth)
PSIBLAST_newclus.pl  - (NOT ACTIVELY USED by TarO) runs PSIBLAST

Generic_cluster_interface.pl - this is used to run many programs  (eg Disembl, netNglyc)
                         
RPSBlast_newclus_localDB.pl - runs RPSBLAST and copies the database to a directory local to the cluster node (for better bandwidth)
RPSBlast_newclus.pl - runs RPSBLAST

www-refine crontab

0 0 8 * * perl /homes/www-refine/SSPF_TO1/TarO_newclus/upload/delete_guest_records.pl >> /homes/www-refine/SSPF_TO1/TarO_newclus/upload/delete_guest_records.log
0 0 * * * perl /homes/www-refine/SSPF_TO1/TarO_newclus/TarO_resubmission_engine.pl >> /homes/www-refine/SSPF_TO1/TarO_newclus/TarO_resubmission_engine.log
0 3 * * * perl  /homes/www-refine/SSPF_TO1/TarO_newclus/Run_legacy_queries.pl
0 15 * * * perl  /homes/www-refine/SSPF_TO1/TarO_newclus/Run_legacy_queries.pl
0 0 1 1 * rm -rf /fc_gpfs/gjb_lab/www-refine/bin/signalp-3.0/tmp
0 0 1 6 * rm -rf /fc_gpfs/gjb_lab/www-refine/bin/signalp-3.0/tmp
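The five leading fields of each crontab line are minute, hour, day of month, month and day of week. As a reading aid, the sketch below decodes the schedules used above. The function names are hypothetical, and the code deliberately handles only the simple forms that occur in this crontab (literal numbers and *); it is not a general cron parser.

```python
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron_line(line):
    """Split a crontab line into (schedule dict, command string)."""
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five schedule fields and a command")
    return dict(zip(FIELDS, parts[:5])), parts[5]


def describe(schedule):
    """Describe a schedule in words; supports only the forms used on this page."""
    if not (schedule["minute"].isdigit() and schedule["hour"].isdigit()):
        raise ValueError("only literal minute and hour values are supported")
    if schedule["day_of_week"] != "*":
        raise ValueError("day-of-week restrictions are not supported")
    time = f"{int(schedule['hour']):02d}:{int(schedule['minute']):02d}"
    dom, month = schedule["day_of_month"], schedule["month"]
    if dom == "*" and month == "*":
        return f"{time} every day"
    if dom.isdigit() and month == "*":
        return f"{time} on day {dom} of every month"
    if dom.isdigit() and month.isdigit():
        return f"{time} on day {dom} of month {month}"
    raise ValueError("unsupported day/month combination")


for entry in (
    "0 0 8 * * perl /homes/www-refine/SSPF_TO1/TarO_newclus/upload/delete_guest_records.pl",
    "0 15 * * * perl  /homes/www-refine/SSPF_TO1/TarO_newclus/Run_legacy_queries.pl",
    "0 0 1 6 * rm -rf /fc_gpfs/gjb_lab/www-refine/bin/signalp-3.0/tmp",
):
    schedule, command = split_cron_line(entry)
    print(describe(schedule), "->", command.split()[-1])
```

Read this way, guest records are deleted at midnight on the 8th of each month, failed jobs are resubmitted daily at midnight, legacy queries run daily at 03:00 and 15:00, and the signalp tmp directory is cleared on 1 January and 1 June.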

TARO CVS Repository

General principles:
Because of the nature of TARO, it did not look feasible to run its scripts outside the cluster environment, so CVS contains copies of the production code that was developed and debugged on the production system. Every effort was made to ensure that the current production versions of the scripts are backed up in the repository. A peculiarity of this repository is that, alongside the current versions of the scripts, CVS HEAD also contains old versions and other scripts that are not part of the TARO pipeline and have been used manually for various purposes.

CVS location: :extssh:<USERNAME>@cvs.compbio.dundee.ac.uk:/gpfs/gjb_lab/cvs/barton/www-refine/TarO2_CVS
Please note that any other scripts in the www-refine directory are obsolete. (Most of them have been moved to the Attic, but some persist.)
CVS directory structure:

DB (scripts to create SQL (targpipe.sql) & update TARO database)
FrontEnd (scripts to do with presentation)

httpd

cgi-devel (cgi scripts in development)
cgi-taro (production cgi scripts)
conf (apache config file)
htdocs (static html pages)

script_repos (main TARO scripts)
contains the master TarO script
- TarO_indevel.pl
- TO_masterpipe_onNewClus.pl (the script to call TarO_indevel.pl)
- the scripts for running various software in Grid engine array context (eg Generic_cluster_interface.pl, PSIBLAST_newclus_localDB.pl)
- upload (scripts to upload data into TARO DB)
update_scripts (utility scripts to update inhouse database which TARO uses)