TarO Information

TarO is hosted by the Barton Group, School of Life Sciences Research, University of Dundee
First released: July 7, 2005. Last updated: October 30th 2007


Introduction

 

TarO analyses a protein sequence using a large number of bioinformatics techniques. These include crystallisation propensity prediction, orthologue searching, and many other sequence-based calculations. Results are tabulated and are also available via an annotated multiple sequence alignment that can be edited interactively using Jalview.

TarO is focused on Structural Genomics target selection/optimisation, but provides annotations that can be informative for a range of biological questions.

 

TarO connects to available DAS (distributed annotation system) information via Jalview and links to Dasty2, as well as providing routes to other gateways such as UniProt, COG and the Conserved Domains Database. The result of an example TarO query can be browsed here.

Guest Access

Guest access allows you to try TarO without registering. Although there are no restrictions on guest access, guest results are visible to everyone and will be deleted from the server after a minimum of 8 days.

 

See here for details on how to obtain a free private account for academic use.

 


Use Policy

We ask users to wait until the results of their submissions become available before submitting any further sequence queries.

 

Please DO NOT write scripts against this server. If you are planning large-scale analyses, please contact TarO_admin (taro@compbio.dundee.ac.uk).

 


Authors & Citation

TarO was developed within the SSPF, primarily by Ian Overton and Geoff Barton with contributions from Jo van Niekerk, Lester Carter, Alice Dawson, David Martin, Scott Cameron, Stephen McMahon, Malcolm White, Bill Hunter and Jim Naismith. Funding was provided by BBSRC under the SPoRT initiative. If you use TarO please cite: Overton et al. (2008) "TarO: A Target Optimisation System for Structural Biology", Nucleic Acids Research (web server issue) doi:10.1093/nar/gkn141.

 

Home Page

 

The home page provides a table summarising queries that have been run through the pipeline. Only queries submitted by you and by users in your group are visible, and queries are presented in groups according to user ID. Guest queries are visible to everyone. There is also an acknowledgements section, detailing the references for software incorporated into TarO. Please cite these as appropriate. Click on the image to navigate to an example home page.

 

Input Sequences Page

 

This page presents results for the input sequences and the progress of pipeline queries. There is also a button to start the Jalview applet and so visualise annotations of the input sequences in a Multiple Sequence Alignment. Links from this page lead to the results for putative orthologues and homologues. The results pages generally provide numerous links, for example to allow DAS lookups via Dasty2, and to the UniProt and COG websites. Click on the image to navigate to an example input sequences page.

 

Query Status Table

 

On the Input Sequences page there is a table detailing the query progress. The various TarO pipeline stages are summarised in the left column, and the status of each stage is summarised in the right-hand column. The colour of each row reflects the status of that step, according to a ‘traffic lights’ scheme: orange indicates the step has been started, red indicates the step failed, and green indicates the step completed successfully. The table opposite shows a TarO query at an intermediate stage, with some of the steps completed.

 

Annotated Multiple Sequence Alignment

Jalview is used to visualise annotations that can be mapped to residues in the sequence (e.g. phosphorylation sites); other annotations (e.g. extinction coefficient) are available in the results tables. The Jalview applet can start the full Jalview application (from the menu, click File > View in Full Application). The full Jalview application allows lookup of DAS features and saving of alignment files. The multiple sequence alignment is constructed using the MUSCLE algorithm.

 

Orthologues Page

 

This page presents tabulated results for putative orthologues of the input sequence(s), annotated from BLAST searches of the COG database. Results on this page are ordered by predicted crystallisation propensity (ParCrys) and then by BLASTP Expectation value (for the match to the user input sequence). Methodology is described in more detail below. There are links for each sequence for homologues obtained by a search of UniRef100. Click on the image to navigate to an example orthologues page. 

 

Homologues Page

 

This page presents results for putative homologues of the sequence that was clicked on (which could be an input sequence or a putative orthologue). Results on this page are ordered by estimated crystallisation propensity (ParCrys) and then by PSIBLAST Expectation Value. Homologues are gathered using a PSIBLAST search of UniRef100. Methodology is described in more detail below. Click on the image to navigate to an example homologues page.
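
The ordering used on the Orthologues and Homologues pages can be illustrated with a short sketch. It assumes that "ordered by crystallisation propensity" means grouping by the ParCrys prediction class (most amenable first) and then sorting by ascending Expectation value; the Hit class and its field names are hypothetical and are not part of TarO.

```python
from dataclasses import dataclass

# Assumed ranking of the three ParCrys prediction classes (most amenable first).
PARCRYS_RANK = {"Highly amenable": 0, "Amenable": 1, "Recalcitrant": 2}


@dataclass
class Hit:
    """Hypothetical record for one row of a results table."""
    sequence_id: str
    parcrys_prediction: str
    evalue: float


def order_hits(hits):
    """Sort by ParCrys class first, then by (PSI)BLAST E-value (smaller is better)."""
    return sorted(hits, key=lambda h: (PARCRYS_RANK[h.parcrys_prediction], h.evalue))


hits = [
    Hit("a", "Recalcitrant", 1e-50),
    Hit("b", "Highly amenable", 1e-5),
    Hit("c", "Highly amenable", 1e-30),
    Hit("d", "Amenable", 1e-10),
]
ordered_ids = [h.sequence_id for h in order_hits(hits)]
```

With this ordering, a recalcitrant sequence is listed after all amenable ones even when its E-value is better.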

 

 

Submit New Query Page

 

This page is used to start a new TarO query. The query description box allows users to specify a name for the query, which is displayed in the home page query summary table; a name that helps you identify the query is therefore recommended. The input must be protein sequence in FASTA format. You can either upload an input file or paste a FASTA-format sequence into the large box. This page also has a field to specify the maximum number of sequences to include in the Multiple Sequence Alignment; the default is 100. If too many sequences are included, the alignment may become rather “gappy”. Click on the image to navigate to an example new query page.
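
Before pasting a sequence into the form, it can be useful to check locally that it really is FASTA-format protein sequence. The following is a minimal sketch of such a check, not the validation TarO itself performs; the accepted alphabet (the 20 standard amino acids plus X) and the nucleotide heuristic are assumptions.

```python
# Assumed alphabet: the 20 standard amino acids plus X for unknown residues.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYX")
NUCLEOTIDE_LETTERS = set("ACGTUN")


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-format text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header line")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def looks_like_protein(sequence):
    """Heuristic: non-empty, only protein letters, and not purely nucleotide letters."""
    letters = set(sequence)
    if not sequence or not letters <= PROTEIN_LETTERS:
        return False
    # A sequence made only of A/C/G/T/U/N is far more likely to be DNA or RNA.
    return not letters <= NUCLEOTIDE_LETTERS


records = parse_fasta(">seq1 example\nMKT\nAYI\n")
```

Here records holds one entry with the two sequence lines joined, and looks_like_protein can then be applied to each sequence before submission.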

 

Summary of Methodology

 

1 User input sequence(s) are searched against the COG database using BLASTP (thresholds coupling sequence identity with alignment length, as defined in Rost (1999) Protein Eng. 12:85-94). The top-scoring matched COG sequence is used to assign a COG cluster to the input sequence, and the COG sequences from that cluster (putative orthologues) are thus associated with the user input sequence. Sequences within an assigned COG cluster are displayed if the BLASTP E-value is 1e-3 or better.

2 All user and associated COG sequences are searched against UniRef100 using PSIBLAST (3 iterations; thresholds: alignment length 30 residues and E-value better than 1E-3). The resultant matches are assigned to the relevant query (i.e. user or COG) sequence. The top-scoring match from the first iteration is designated the "Uniref Top hit" for each sequence, because the first PSIBLAST iteration is equivalent to a BLASTP search.

3 The User, COG and UniRef100 sequences are analysed in several steps, as follows:

 

                       

a) PSIBLAST/BLAST search of the PDB database (to identify matches to known molecular structures via Rost thresholds)
b) BLAST search of the TargetDB database (to identify matches to Structural Genomics targets)
c) RPSBLAST search of domain databases (CDD, Pfam, SMART, COG and KOG)
d) Calculation of simple biochemical properties (e.g. pI, Mr, GRAVY, #His, #Met, #Cys, sequence length, extinction coefficient)
e) SignalP prediction of signal peptide (only the first 70 a.a. are examined and results are filtered by an HMM probability threshold of >= 0.7)
f) Multiple alignment (MUSCLE). This includes up to 100 sequences by default, which are selected and ordered in the following sections:
	i) User sequence(s) (displayed at the top of the MSA)
	ii) COG sequences
	iii) UniRef100 sequences
	Within each of sections ii) and iii), sequences are displayed in order of similarity to the user sequence(s) (as estimated by (PSI)BLAST expectation values).
g) Protein disorder/order prediction (using RONN, Globplot and Disembl)
h) Glycosylation site prediction (O and N-linked, using NetOglyc and NetNglyc)
i) Phosphorylation site prediction (NetPhos)
j) Transmembrane region prediction (TMHMM2)
k) Crystallisation propensity prediction (ParCrys and OB-Score)
l) Secondary structure prediction (Jpred)
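
The selection and ordering of sequences for the multiple alignment (step f above) can be sketched as follows. This is an illustration under stated assumptions rather than TarO's code: the function name and the (identifier, E-value) input format are hypothetical, and smaller expectation values are taken to mean greater similarity.

```python
def select_for_alignment(user_ids, cog_hits, uniref_hits, max_sequences=100):
    """Order sequences as user, then COG, then UniRef100, and truncate.

    user_ids: list of user sequence identifiers (kept in input order).
    cog_hits, uniref_hits: lists of (identifier, evalue) tuples.
    """
    ordered = list(user_ids)
    for section in (cog_hits, uniref_hits):
        # Within each section, the most similar sequences (lowest E-value) come first.
        ordered.extend(seq_id for seq_id, _ in sorted(section, key=lambda hit: hit[1]))
    return ordered[:max_sequences]


selected = select_for_alignment(
    ["user1"],
    [("cog_a", 1e-5), ("cog_b", 1e-40)],
    [("uniref_a", 1e-10), ("uniref_b", 1e-90)],
    max_sequences=4,
)
```

In this example the limit of 4 cuts off the least similar UniRef100 sequence, while the user sequence always stays at the top.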

Note that annotation results can be visualised within a multiple alignment, using the Jalview applet. The full Jalview application can be started from within the applet (click 'File'->'View in Full Application'), for additional functionality.
TarO is still evolving and user feedback is most welcome. Please direct any comments to taro@compbio.dundee.ac.uk

 

Site layout


 

Description of column headings

The following section elaborates upon headings displayed in the tables on the TarO website. This section is primarily intended to be accessed via links from the main pages for additional explanation of particular table headings. 

 

Reference Table

Introduction
Description of column headings
QUERY_ID
FUNCTIONAL_DESC
#Sequences
Links
Sequence statistics
Sequence_ID
Seqlen
Mr
GpIclus
pI
GRAVY
SigP
SPconf
#His
#Met
#Cys
COG top hit details
COGclus
Subject
eval
%id
Alen
Qst
Sst
Sen
Seqlen
ORIGIN
ASTRAL Tophit
PDB Tophit
TargetDB Tophit
More
99%qcov
99%qcov+99%id
RPSBLAST
Organism
#TMH
TMH_span
Uniref Tophit
OB score
ParCrys prediction
ParCrys-Sc
RONN
Jpred_H
Jpred_E
NetNglyc
NetOglyc
NetPhos
A280
A280_1mg

 

QUERY_ID

The TarO identifier for the search triggered from the user input page

Query Description

The user-specified description for the given query

#Sequences

The total number of unique sequence identifiers associated with the Query_ID. Currently these may be from user input, COG or UniRef100.

Sequence_ID

The sequence identifier (may be supplied by the user or from external databases). This sequence is referred to as the query sequence in the context of database searching.

Organism

The organism associated with the sequence. UniRef100 sequences are given an organism name from the header of the UniRef100.fasta file or, where this is not informative, from the IPI database; however, meaningful organism information is still not assigned for a small proportion of UniRef100 sequences. For UniRef sequences, "..." following the organism name indicates that additional organism information is available, which appears on mouseover. The COG/KOG sequences are associated with an abbreviated organism name, as given in the COG database. The list of abbreviations with their corresponding full organism names is given below:

Aae Aquifex aeolicus
Afu Archaeoglobus fulgidus
Ape Aeropyrum pernix
Atu Agrobacterium tumefaciens strain C58 (Cereon)
Bbu Borrelia burgdorferi
Bha Bacillus halodurans
Bme Brucella melitensis
Bsu Bacillus subtilis
Buc Buchnera sp. APS
Cac Clostridium acetobutylicum
Ccr Caulobacter vibrioides
Cgl Corynebacterium glutamicum
Cje Campylobacter jejuni
Cpn Chlamydophila pneumoniae CWL029
Ctr Chlamydia trachomatis
Dra Deinococcus radiodurans
EcZ Escherichia coli O157:H7 EDL933
Eco Escherichia coli K12
Ecs Escherichia coli O157:H7
Ecu Encephalitozoon cuniculi
Fnu Fusobacterium nucleatum
Hbs Halobacterium sp. NRC-1
Hin Haemophilus influenzae
Hpy Helicobacter pylori 26695
jHp Helicobacter pylori J99
Lin Listeria innocua
Lla Lactococcus lactis
Mac Methanosarcina acetivorans str.C2A
Mge Mycoplasma genitalium
Mja Methanococcus jannaschii
Mka Methanopyrus kandleri AV19
Mle Mycobacterium leprae
Mlo Mesorhizobium loti
Mpn Mycoplasma pneumoniae
Mpu Mycoplasma pulmonis
MtC Mycobacterium tuberculosis CDC1551
Mth Methanothermobacter thermautotrophicus
Mtu Mycobacterium tuberculosis H37Rv
NmA Neisseria meningitidis Z2491
Nme Neisseria meningitidis MC58
Nos Nostoc sp. PCC 7120
Pab Pyrococcus abyssi
Pae Pseudomonas aeruginosa
Pho Pyrococcus horikoshii
Pmu Pasteurella multocida
Pya Pyrobaculum aerophilum
Rco Rickettsia conorii
Rpr Rickettsia prowazekii
Rso Ralstonia solanacearum
Sau Staphylococcus aureus N315
Sce Saccharomyces cerevisiae
Sme Sinorhizobium meliloti
Spn Streptococcus pneumoniae TIGR4
Spo Schizosaccharomyces pombe
Spy Streptococcus pyogenes M1 GAS
Sso Sulfolobus solfataricus
Sty Salmonella typhimurium LT2
Syn Synechocystis
Tac Thermoplasma acidophilum
Tma Thermotoga maritima
Tpa Treponema pallidum
Tvo Thermoplasma volcanium
Uur Ureaplasma urealyticum
Vch Vibrio cholerae
Xfa Xylella fastidiosa 9a5c
Ype Yersinia pestis

Links

Links to display further details of results (One-letter codes as follows):

O: Links to the results for putative orthologues of the sequence in question (identified by BLASTP of the COG and KOG databases)

S: Displays the Sequence (fasta format)

H: Links to the results for putative homologues of the sequence in question (identified by PSIBLAST of UniRef100)

C: Links to the COG database site (gateway to further information)

U: Links to the UniProt site (allows the nucleotide sequence to be found via the EMBL CoDingSequence link; other links provide a gateway to further information)

T: Popup more details of the TargetDB BLAST statistics

P: Popup more details of the PDB PSIBLAST/BLASTP search results

D: Links to EMBL DAS (Dasty2) client

Sequence statistics

Empirically calculated and predicted properties for the given sequence.

Seqlen

Sequence length

Mr

Molecular weight

GpIclus

Cluster assigned from the GRAVY/pI index (see PNAS 99:11664)

pI

Isoelectric point

GRAVY

GRand AVerage of hydrophobicitY (Kyte-Doolittle scores)
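
As an illustration, GRAVY is the sum of the Kyte-Doolittle hydropathy values of all residues divided by the sequence length. The sketch below uses the standard published scale; the function name is hypothetical, and the handling of non-standard residues (rejecting them) is an assumption, not necessarily what TarO does.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy over all residues of the sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    unknown = set(sequence) - set(KYTE_DOOLITTLE)
    if unknown:
        # Assumption: non-standard residues are rejected rather than skipped.
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)
```

Positive values indicate an overall hydrophobic sequence and negative values an overall hydrophilic one.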

SigP

SignalP (JMB 340:783) predicted signal peptide. This column details the last residue of any predicted signal peptide. More information on SignalP is available here

SPconf

SignalP (JMB 340:783) HMM confidence score. More information on SignalP is available here

#TMH

The number of transmembrane helices predicted by the program TMHMM2 (JMB 305:567). More information on TMHMM2 is available here

TMH_span

The portion of the sequence (start-end) that includes transmembrane helices, taken from the predictions of TMHMM2 (JMB 305:567). All predicted transmembrane helices are included in these sequence co-ordinates. More information on TMHMM2 is available here
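
A minimal sketch of how the #TMH and TMH_span values relate to a set of predicted helices is given below. The input format, a list of (start, end) residue positions, is an assumption for illustration and is not the native TMHMM2 output format.

```python
def summarise_tmh(helices):
    """Return (number of helices, 'start-end' span covering all of them).

    helices: list of (start, end) residue positions, one per predicted helix.
    The span is an empty string when no helix is predicted.
    """
    if not helices:
        return 0, ""
    first = min(start for start, _ in helices)
    last = max(end for _, end in helices)
    return len(helices), f"{first}-{last}"


count, span = summarise_tmh([(12, 34), (50, 72), (90, 112)])
```

The span runs from the start of the first helix to the end of the last one, so loops between helices fall inside it.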

#His

Number of Histidines

#Met

Number of Methionines

#Cys

Number of Cysteines

COG top hit details

These statistics are compiled from a BLASTP search of the COG database.
Clusters of Orthologous Groups of proteins (COGs) are created in this database by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of co-orthologues from at least 3 lineages.
More details about the COG database can be found at http://www.ncbi.nlm.nih.gov/COG/

COGclus

Assigned COG cluster based on the BLASTP search of the COG database. Note that a good BLAST match to a COG sequence does not automatically allow the assignment of a COG cluster because sequences in COG are not necessarily associated with a COG cluster.
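
The assignment logic can be sketched as follows. The hit dictionaries and their keys are hypothetical, the Rost identity/length thresholds are assumed to have been applied to the hits already, and the bit score is assumed to define the "top-scoring" hit.

```python
def assign_cog_cluster(hits, max_evalue=1e-3):
    """Assign a COG cluster from the top-scoring hit and list cluster members.

    hits: list of dicts with keys 'subject', 'cog' (cluster ID or None),
          'bitscore' and 'evalue'.
    Returns (cluster, members); cluster is None when the top hit has no COG.
    """
    if not hits:
        return None, []
    top = max(hits, key=lambda hit: hit["bitscore"])
    cluster = top["cog"]
    if cluster is None:
        # A good match to a COG sequence that belongs to no cluster assigns nothing.
        return None, []
    members = [
        hit["subject"]
        for hit in hits
        if hit["cog"] == cluster and hit["evalue"] <= max_evalue
    ]
    return cluster, members


example_hits = [
    {"subject": "s1", "cog": "COG0001", "bitscore": 250.0, "evalue": 1e-60},
    {"subject": "s2", "cog": "COG0001", "bitscore": 40.0, "evalue": 0.5},
    {"subject": "s3", "cog": "COG0002", "bitscore": 120.0, "evalue": 1e-30},
    {"subject": "s4", "cog": "COG0001", "bitscore": 90.0, "evalue": 1e-20},
]
cluster, members = assign_cog_cluster(example_hits)
```

Only members of the assigned cluster that also pass the E-value threshold are kept, so s2 (same cluster, weak match) and s3 (different cluster) are dropped.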

Subject

Database sequence identifier

eval

BLAST expectation value

%id

Percentage identity

Alen

Alignment length

Qst

Alignment start position on query sequence

Qen

Alignment end position on query sequence

Sst

Alignment start position on subject (database) sequence

Sen

Alignment end position on subject (database) sequence

Seqlen

Sequence length

ORIGIN

The source from which the sequence data has been retrieved

Uniref Tophit

The top-scoring UniRef100 sequence found with a BLASTP search.

PDB Tophit

The top-scoring PDB sequence found with a PSIBLAST search (3 iterations, 1E-03) or BLASTP search (for the orthologue/homologue sequences), and thresholds from Rost (1999) Protein Eng. 12:85-94. If there is no data in these columns, no hit was found above the thresholds.

TargetDB Tophit

The topscoring TargetDB sequence found with a BLASTP search (thresholds 1E-03, as well as matching above Rost thresholds (coupling alignment length and percentage identity)). If there is no data in these columns this is because no hit was found above the thresholds.

TargetDB_groupID

The group associated with the TargetDB identifier

TargetDB_status

The status of efforts towards obtaining the molecular structure of the given protein.

More

Display more information. One-letter codes are specified as follows:

C: Links to the relevant page of the COG database site (gateway to further information)

U: Links to the relevant page of the UniProt site (allows the nucleotide sequence to be found via the EMBL CodingSequence link; other links provide a gateway to further information)

T: Popup more details of the TargetDB BLAST results

P: Popup more details of the PDB PSIBLAST or BLASTP search results

D: Links to EMBL DAS (Dasty2) client

 

99%qcov

 

Whether at least 99% of the query sequence is covered by the alignment with the subject (database) sequence. This information is specified by 1 (True) or 0 (False).

 

99%qcov+99%id

 

Whether at least 99% of the query sequence is covered by the alignment with the subject (database) sequence with at least 99% identity. This information is specified by 1 (True) or 0 (False).
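
The two flags can be derived from the alignment statistics described earlier (Qst, Qen, query sequence length and %id). The sketch below assumes coverage is measured as the span from Qst to Qen on the query; whether TarO counts gapped positions differently is not stated here.

```python
def coverage_flags(qst, qen, query_length, percent_identity):
    """Return (99%qcov, 99%qcov+99%id) as 1/0 flags.

    qst, qen: 1-based start and end of the alignment on the query sequence.
    """
    aligned_span = qen - qst + 1
    query_coverage = 100.0 * aligned_span / query_length
    covered = query_coverage >= 99.0
    covered_and_identical = covered and percent_identity >= 99.0
    return int(covered), int(covered_and_identical)
```

For example, an alignment covering residues 1-200 of a 200-residue query at 98.5% identity gives (1, 0): full coverage, but below the identity cut-off.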

 

RPSBLAST

 

RPS-BLAST (Reverse PSI-BLAST) searches a query sequence against a database of profiles, producing BLAST-like output. More details..
In TarO, RPS-BLAST is run against the NCBI CDD versions of the COG, KOG, SMART, Pfam and CDD profile databases. The "RPSBlast" link in the results tables allows visualisation of the alignment statistics for the specific query sequence. Links within the table displayed provide access to the domain profiles matched.

 

PSIBLAST Statistics

 

These statistics are compiled from a PSIBLAST search of the UniRef100 database

 

BLASTP Statistics

 

These statistics refer to the BLASTP alignment of the user input sequence to sequences from the COG database

 

ParCrys prediction

 

ParCrys is a Parzen Window approach to crystallisation propensity prediction (Overton et al. 2007). The prediction can be "Highly amenable", "Amenable" or "Recalcitrant" to crystallisation. These three predictions are based on an analysis of TargetDB data. The ParCrys score thresholds for defining these boundaries are 6637270 (Highly amenable/Amenable) and 3564600 (Amenable/Recalcitrant). More information on ParCrys is available from here
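
The mapping from a raw ParCrys score to the three prediction classes can be sketched as below, using the two thresholds quoted above. Higher scores are taken to mean greater amenability; treating a score exactly on a boundary as belonging to the more amenable class is an assumption.

```python
HIGHLY_AMENABLE_MIN = 6637270  # Highly amenable / Amenable boundary
AMENABLE_MIN = 3564600         # Amenable / Recalcitrant boundary


def classify_parcrys(score):
    """Map a raw ParCrys score to a prediction class (boundaries assumed inclusive)."""
    if score >= HIGHLY_AMENABLE_MIN:
        return "Highly amenable"
    if score >= AMENABLE_MIN:
        return "Amenable"
    return "Recalcitrant"
```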

 

ParCrys-Sc

 

ParCrys is a Parzen Window approach to crystallisation propensity prediction; ParCrys-Sc is the raw ParCrys score. The higher the score, the more similar the input sequence is to sequences associated with diffraction-quality crystals. More information on ParCrys is available from here

 

OB

 

The OB-Score is a z-score scale based on calculated hydrophobicity and isoelectric point values from PDB sequences against a background distribution generated from UniRef50. The OB-Score can be used to estimate crystallisation propensity. For more details, see Overton & Barton (2006). FEBS Lett. 580, 4005-4009. More information on the OB-Score is also available from here

 

RONN

 

The RONN algorithm is used to predict disordered regions. The column RONN gives the percentage of residues that are predicted to be disordered by RONN. More information on RONN is available from here

 

Jpred_H

 

The Jpred algorithm is used to predict secondary structure. The column Jpred_H gives the percentage of residues that are predicted in helical conformation. More information on Jpred is available from here

 

Jpred_E

 

The Jpred algorithm is used to predict secondary structure. The column Jpred_E gives the percentage of residues that are predicted in extended conformation. More information on Jpred is available from here
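
The Jpred_H and Jpred_E percentages can be illustrated with a per-residue prediction string. The sketch assumes one character per residue, with H for helix, E for extended and any other character (such as -) for everything else; the function name is hypothetical.

```python
def secondary_structure_percentages(prediction):
    """Return (percent helix, percent extended) for a per-residue prediction string."""
    if not prediction:
        raise ValueError("empty prediction")
    length = len(prediction)
    percent_helix = 100.0 * prediction.count("H") / length
    percent_extended = 100.0 * prediction.count("E") / length
    return percent_helix, percent_extended


jpred_h, jpred_e = secondary_structure_percentages("HHHHEE----")
```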

 

NetNglyc

 

NetNglyc was developed to predict N-linked glycosylation in human proteins. The NetNglyc predictions with score of at least 0.7 are displayed in the format: "ResidueNumber:Score_ResidueNumber:Score_etc.". More information on NetNglyc is available from here
NOTE: Ignore predicted glycosylation unless a signal peptide is also predicted. Additionally, the NetNglyc developers recommend ignoring glycosylation predicted on non-extracellular domains.

 

NetOglyc

 

NetOglyc is used to predict mucin type GalNAc O-glycosylation sites in mammalian proteins. The NetOglyc predictions with score of at least 0.7 are displayed in the format: "ResidueNumber:Score_ResidueNumber:Score_etc.". More information on NetOglyc is available from here
NOTE: Ignore predicted glycosylation unless a signal peptide is also predicted. Additionally, the NetOglyc developers recommend ignoring glycosylation predicted on non-extracellular domains.

 

NetPhos

 

NetPhos is used to predict serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins. The NetPhos predictions with score of at least 0.7 are displayed in the format: "ResidueNumber:Score_ResidueNumber:Score_etc.". More information on NetPhos is available from here
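
The "ResidueNumber:Score_ResidueNumber:Score" format shared by the NetNglyc, NetOglyc and NetPhos columns is easy to parse after copying a value out of a results table. The following is a minimal sketch; the function name is hypothetical and the example value is made up for illustration.

```python
def parse_site_predictions(field, min_score=0.7):
    """Parse 'ResidueNumber:Score_ResidueNumber:Score' into (residue, score) tuples.

    Sites scoring below min_score are dropped; an empty field gives an empty list.
    """
    sites = []
    for item in field.strip().split("_"):
        if not item:
            continue
        residue, score = item.split(":")
        sites.append((int(residue), float(score)))
    return [site for site in sites if site[1] >= min_score]


sites = parse_site_predictions("12:0.91_45:0.73")
```

Since the tables already show only predictions scoring at least 0.7, the min_score argument mainly serves to apply a stricter cut-off.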

 

A280

 

The predicted molar extinction coefficient at 280nm.

 

A280_1mg

 

The predicted extinction coefficient for a 1mg per ml solution of the protein at 280nm.
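
This page does not state how TarO predicts these two values. The sketch below shows one common approach, assuming the widely used per-residue contributions at 280nm (5500 per Trp, 1490 per Tyr and 125 per cystine, in M^-1 cm^-1) and a 1 cm path length; the function names are hypothetical and the molecular weight (Mr) is taken as an input.

```python
# Assumed per-residue contributions at 280 nm (M^-1 cm^-1); not confirmed as TarO's values.
EPSILON_TRP = 5500
EPSILON_TYR = 1490
EPSILON_CYSTINE = 125


def molar_extinction_coefficient(sequence, n_cystines=0):
    """Predicted molar extinction coefficient at 280 nm.

    n_cystines: number of disulphide-bonded Cys pairs (0 assumes all Cys are reduced).
    """
    sequence = sequence.upper()
    if n_cystines < 0 or n_cystines > sequence.count("C") // 2:
        raise ValueError("n_cystines is inconsistent with the number of Cys residues")
    return (
        sequence.count("W") * EPSILON_TRP
        + sequence.count("Y") * EPSILON_TYR
        + n_cystines * EPSILON_CYSTINE
    )


def absorbance_1mg_per_ml(epsilon, molecular_weight):
    """Absorbance at 280 nm of a 1 mg/ml solution (1 cm path): epsilon / Mr."""
    return epsilon / molecular_weight
```

Dividing by Mr converts the molar coefficient into the absorbance of a 1 mg/ml solution, because 1 mg/ml corresponds to 1/Mr mol per litre.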

 

Multiple Sequence Alignment (MSA) Information

 

Clicking on the button to "View Multiple Sequence Alignment Annotated with...." starts the Jalview applet, displaying a window with the MSA, and a window entitled "Feature Settings". The full Jalview application can be started from within the applet for additional functionality (click 'File'->'View in Full Application'). The MSA was constructed with the MUSCLE algorithm, including sequences that have a BLAST match to the query sequence of 1E-20 or better. Sequences are excluded if their sequence length is more than 125% of the query sequence length.
The "Feature Settings" window can be used to select which annotations to display and the order of precedence for displaying the annotations. The tick-boxes toggle the display of the groups (in the area near the top of the window) and features (in the scrollable part of the window). The features are presented as coloured bars that indicate the colour displayed on the MSA when representing annotation of the specified feature. Unselecting a group and then reselecting it will move it to the top of the display order (i.e. on top of any other selected groups).
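
The inclusion rule for the MSA stated above (a BLAST match of 1E-20 or better, and a length of no more than 125% of the query length) can be written as a small predicate. This is a sketch with hypothetical names, not TarO's own code.

```python
def eligible_for_msa(evalue, sequence_length, query_length,
                     max_evalue=1e-20, max_length_ratio=1.25):
    """True if a sequence passes the E-value and length criteria for the MSA."""
    good_match = evalue <= max_evalue
    not_too_long = sequence_length <= max_length_ratio * query_length
    return good_match and not_too_long
```

For a 200-residue query, a 250-residue match with an E-value of 1E-30 is included, whereas a 251-residue match is excluded on length.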

 

There are currently 5 groups of annotation on the MSA; however, simultaneous display of all groups can be confusing. We therefore strongly suggest that you customise the display of groups using the "Feature Settings" window (described above). The groups are:
  i) PTMs_+_SignalP (PostTranslational Modifications and Signal Peptide)
    PTMs include:
      N & O glycosylation (predicted by NetNglyc and NetOglyc)
      Phosphorylation (predicted by NetPhos)
      Signal Peptide (predicted by SignalP) is also included here.
      NOTE: if a signal peptide is not predicted, any predicted glycosylation sites are likely to be wrong!
  ii) Domains(Pfam+CDD) & Disorder (RONN)
      Domain annotations are from RPSBLAST searching Pfam and CDD profiles.
      Disorder is predicted by RONN
  iii) TM_regions
      TransMembrane regions are predicted by TMHMM2
  iv) Disorder (RONN+Disembl)
      This group combines protein disorder predicted by RONN and Disembl.
      The Disembl "HotLoops" and "REM465" predictions are displayed; however, the "COILS" predictions are not displayed.
  v) Disorder (Globplot+Disembl)
      This group combines protein disorder predicted by Globplot and Disembl.
      The Disembl "HotLoops" and "REM465" predictions are displayed; however, the "COILS" predictions are not displayed.

We suggest the following combinations of displayed groups (with i (and iii) displayed on top of the other groups):
    i, ii & iii (This is the default, though a group will not be displayed if there is no annotation in that group. The group display order may also need adjusting, e.g. to bring PTMs to the top.)
    i, iii & iv
    i, iii & v