\documentstyle[12pt,html]{article}  
% jmb.sty is identical to apalike.sty 1989 June 19
%
% apalike.sty style, used in conjunction with apalike.bst,
% will produce an apa-like bibliography style:
%
% 1) Bibliography entries formatted alphabetically, last name
%    first, each entry having a hanging indentation and no label.
% 2) References in the following formats:
%		(Author, 1986)
%		(Author and Author, 1986)
%		(Author et al., 1986).
% 3) Multiple references in the form (Author1, 1986; Author2, 1987)
%
% To be used as an optional argument to the \documentstyle command; for example
%	\documentstyle[11pt,apalike]{book}
%
% 16-Sep-86, original version by Susan King and Oren Patashnik.
% 13-Oct-87 changes:
%	Fixed bug in last line by adding the {} that disappeard when
%		the \hbox{} was removed from the pre-APALIKE definition;
%	added club and widow penalties;
%	patched the \newblock LaTeX bug from `-.07em' to simply `.07em';
%	and made this work for document styles that don't define `chapter'.
%
%
% Use parens instead of brackets for \cite, and no label in the bibliography
%
\makeatletter
\def\@cite#1#2{(#1\if@tempswa , #2\fi)}
\def\@biblabel#1{}

% Set length of hanging indentation for bibliography entries
%
\newlength{\bibhang}
\setlength{\bibhang}{2em}

% \thebibliography environment depends on whether or not `chapter's can exist
%
\@ifundefined{chapter}{\def\thebibliography#1{\section*{References\@mkboth
  {REFERENCES}{REFERENCES}}\list
  {\relax}{\setlength{\labelsep}{0em}
	\setlength{\itemindent}{-\bibhang}
	\setlength{\leftmargin}{\bibhang}}
    \def\newblock{\hskip .11em plus .33em minus .07em}
    \sloppy\clubpenalty4000\widowpenalty4000
    \sfcode`\.=1000\relax}}%
{\def\thebibliography#1{\chapter*{Bibliography\@mkboth
%{\def\thebibliography#1{\chapter*{References\@mkboth
  {BIBLIOGRAPHY}{BIBLIOGRAPHY}}\list
  {\relax}{\setlength{\labelsep}{0em}
	\setlength{\itemindent}{-\bibhang}
	\setlength{\leftmargin}{\bibhang}}
    \def\newblock{\hskip .11em plus .33em minus .07em}
    \sloppy\clubpenalty4000\widowpenalty4000
    \sfcode`\.=1000\relax}}

% `; ' goes between cites, and there's no \hbox around individual cites
%
\def\@citex[#1]#2{\if@filesw\immediate\write\@auxout{\string\citation{#2}}\fi
  \def\@citea{}\@cite{\@for\@citeb:=#2\do
    {\@citea\def\@citea{; }\@ifundefined
       {b@\@citeb}{{\bf ?}\@warning
       {Citation `\@citeb' on page \thepage \space undefined}}%
{\csname b@\@citeb\endcsname}}}{#1}}
\makeatother
%
% Very cut-down version of this paper 
%  
% $Id: map_cut.tex,v 1.6 1995/09/29 15:54:07 gjb Exp gjb $
%
% $Log: map_cut.tex,v $
% Revision 1.6  1995/09/29 15:54:07  gjb
% hacked about some more
%
% Revision 1.5  1995/09/28 16:32:53  gjb
% Extensive discussion of results added.
%
% Revision 1.4  1995/09/27 16:37:40  gjb
% notes in intro - start of results section
%
% Revision 1.3  1995/09/25 17:17:55  gjb
% small changes
%
% Revision 1.2  1995/09/25 16:16:25  gjb
% Initial version from Rob
%
%
% Quite a few additions and changes by Rob. 18/10/95
%
%
% I am beginning to think that our best bet is to send this
%  back to JMB.... though this may just reflect a downward mood swing.
% There is just so much to say, and I would only fancy its chances
%  in Science if it was a "report" rather than an article...
% 
% Feedback?
%
% More changes 2/11/95
%
% 12/12/1995
% Geoff's hack at the nearly final version - check tenses, splice out the
% tables, re-write bits here and there.  Rewrite Abstract.
%
% 8/2/96 Accepted at bloody long last, some revisions by Rob
%  in answer to the referees criticisms.  Annotated by comments
%  where done.  I'll put a little "REF_RBR" in the comment so that
%  others can search forward for the changes
%
% Note: 1PAZ just didn't seem to score highly in the MAP runs in answer to Ref II
%
% Note: and I don't know what the hell to say about Rost's... I mean referee I's
%  vague question: "What is fold recognition?"  Any ideas?
%
% Note: I don't think we should merge Tables 6 & 7.  There would be
%  a loss of clarity.  Just how would we do it?  Damn referee.
%
% Note: I also don't know what to do about Figure 1 and the clarity.
%  I don't think the referee made much sense. I think we
%  are dealing with a general problem of not being able to use colour
%  because we are poor.
%
% There are a few things that I think we should clarify or bring out
%
%  a) The fact that one will do better if one uses carefully constructed predictions
%     and that we have had to be systematic in the assessment to avoid biasing
%     our results.  I know we say it, but it doesn't seem to carry somehow.
%     Ignore this comment if you disagree.
%
%  b) Both referees seemed to miss the distinction between folds and maps.  I don't
%     really know where to change this; maybe you (Rich, Geoff) can think of something.
%
%  c) Both referees seem a bit confused about the Sec-Sec versus Res-Res assessment
%      of accuracy.  I have tried to fix this, but you may want to try and clarify things
%      more.  Bon chance.
%
%  d) After where we say "the simplicity of the technique" it might be worth saying
%     that it is surprising that such simple principles are able to out-perform 
%     THREADER, and argue about just what fold recognition methods a la THREADER and
%     PROFIT are actually doing when they work: Secondary structure prediction and
%     accessibility preferences and length.  Nothing more, nothing less.  Also
%   
%  e) I (personally) wouldn't mind suggesting that the problem of FOLD recognition would
%     now seem to be one of distinguishing between folds of the same folding class.  
%     Look at how the methods do with, say, patterns containing 8 strands or 4 helices:
%     they reliably find folds containing this sequential arrangement of secondary structures
%     but are often unable to distinguish between them.  This is not really surprising; 
%     For example, if one considers the difference between a lipocalin (up-and-down 
%     eight stranded beta barrel) and an Ig fold (eight strands in a sandwich), the 
%     amphipathic character and even the length of the strands are comparable (I think), 
%     and yet the folds are really wildly different.   Similarly for the four helix bundle 
%     examples and the plastocyanin example (which finds a nice six stranded barrel as 
%     the top scoring using MAP(PHD)).  This is a bit of a rant, so I will let you stop
%     and think.
%   
%  f) Oh, and I think we should somehow separate the true successes from the homologues in
%     the globin and plastocyanin searches (i.e. shade them different, or something), since
%     they aren't really the same thing as the other "successes".
% 
%  g) Over to you two.
%
\newcommand{\al}{\mbox{$\alpha$~}}  
\newcommand{\be}{\mbox{$\beta$~}}  
\newcommand{\albe}{\mbox{$\alpha/\beta$~}}
  
\newcommand{\tten}{\mbox{${\rm 3_{10}}$~}}  
\newcommand{\ea}{\mbox{\em et al. \/}}  
\newcommand{\Cal}{\mbox{${\rm C}_{\alpha}$~}}  
\newcommand{\Cbe}{\mbox{${\rm C}_{\beta}$~}}
\newcommand{\ii}{\mbox{$i$~}}
\newcommand{\jj}{\mbox{$j$~}}
\newcommand{\ip}{\mbox{$i^{\prime}$~}}
\newcommand{\jp}{\mbox{$j^{\prime}$~}}
\newcommand{\ra}{\mbox{$\rightarrow$}}
\newcommand{\sdp}{\setlength{\baselineskip}{18truept}}
\newcommand{\ssp}{\setlength{\baselineskip}{13.6truept}}
  
\begin{document}   
\begin{titlepage} 
\begin{center} 
\begin{Large}  
{\bf Protein fold recognition by mapping predicted secondary structures}\\
%{\bf Protein fold recognition from secondary structure prediction}\\
\vskip 0.10in
\end{Large}
{\em Robert B. Russell}\ddag{\em, Richard R. Copley and Geoffrey J. Barton}\dag

University of Oxford\\
Laboratory of Molecular Biophysics\\
The Rex Richards Building, South Parks Road\\
Oxford, OX1 3QU, England\\
Tel: 44 1865 275368 FAX: 44 1865 510454\\
E-mail: gjb@bioch.ox.ac.uk

\ddag Present Address:\\
Biomolecular Modelling Laboratory\\
Imperial Cancer Research Fund Laboratories\\
44 Lincoln's Inn Fields, P.O. Box 123\\
London, WC2A 3PX, England\\
E-mail: russell@icrf.icnet.uk

\dag To whom correspondence should be addressed.

Keywords: protein; structure prediction; fold recognition; threading;
nuclear magnetic resonance; secondary structure mapping.

Running title: Fold recognition from secondary structure prediction

Published in {\em J. Mol. Biol.} (1996), {\bf 259}, (3), 349--365.

\end{center}
\end{titlepage}


\section{Abstract}

A strategy is presented for protein fold recognition from secondary
structure assignments ($\alpha$-helix and $\beta$-strand).  The method can detect
similarities between protein folds in the absence of sequence 
similarity.  {\em Secondary structure mapping} first
identifies all possible matches (maps) between a query string of secondary
structures and the secondary structures of protein domains of known
three--dimensional structure.  The maps are then passed through a
series of structural filters to remove those that do not obey
simple rules of protein structure.  The surviving maps are ranked
by scores from the alignment of predicted and experimental
accessibilities.  Searches made with secondary structure assignments
for a test set of eleven fold-families put the correct
sequence-dissimilar fold in the first rank 8/11 times.  With
cross-validated predictions of secondary structure this drops to 4/11,
which compares favourably with the widely used THREADER program
(1/11).  The structural class is correctly predicted 10/11 times by
the method in contrast to 5/11 for THREADER.  The new technique
obtains comparable accuracy in the alignment of amino acid residues
and secondary structure elements.  Searches are also performed with
published secondary structure predictions for the von Willebrand
factor type A domain, the proteasome 20S \al subunit and the
phosphotyrosine interaction domain.  These searches demonstrate how
the method can find the correct fold for a protein from a carefully
constructed secondary structure prediction, multiple sequence
alignment and distance restraints.  Scans with experimentally
determined secondary structures and accessibility recognise
the correct fold with high alignment
accuracy (86\% on secondary structures).  This suggests that the
accuracy of mapping will improve alongside any improvements in the
prediction of secondary structure or accessibility.  Application to
NMR structure determination is also discussed.


\newpage

\section{Introduction}

The flood of new protein sequences demands techniques to infer protein
3D $\ddag$ structure from sequence alone.  For $\approx 30$\%
of protein sequences, conventional alignment techniques
(e.g. \cite{lipman85,altschul90,smith81}) or profile and pattern
methods (e.g. \cite{gribskov87,bs90}) find similarities to a protein
of known 3D structure \cite{chothia92}.  The remaining
70\% of protein sequences may adopt previously unseen protein folds.
Alternatively, they may have topologies (folds) similar to known protein
structures but share no detectable sequence similarity \cite{rb94}.  Such
fold similarities will normally not be found until both protein 3D
structures have been determined experimentally
\cite{orengo94a,holm94a}.  In an attempt to find fold similarities of this type
in advance of 3D structure determination, several fold recognition
techniques have been developed (see \cite{bowie93,wodak93,jones93}
and references therein).  These techniques may locate some fold similarities that
are undetectable by the comparison of sequence.  However, the methods
are often computationally intensive and many similarities still go
undetected \cite{pickett92,lemer96}.\\
\\
In parallel with the development of fold detection methods, the
accuracy of secondary structure prediction has improved from $\approx
65$\% to $\approx 72$\% on average.  Though this is only a small
percentage increase, recent predictions are more useful, since the application 
of multiple sequence alignments improves the 
identification of the number, type and location of core secondary
structure elements.  Prediction from sequence alignments can also
accurately identify the position of loops, and residues likely to be
buried in the protein core
\cite{benner94,barton95,russell95}.  Given a good secondary structure 
prediction, the next question to ask is how the secondary structures might 
be arranged into a tertiary fold.  {\em Ab initio} methods for folding
secondary into tertiary structure
search for possible arrangements of secondary structures
that obey general packing rules 
\cite{cohen80a,cohen80b,cohen82,smith-brown93,sun95}.  
These methods have been applied in numerous blind predictions
\cite{hurle87,cohen86,curtis91,jin94,huang94} with varied results.  
A limitation is the number of  packing combinations that must be
considered.  This can become unmanageable for $>9$ secondary structures
\cite{cohen82}, though approaches to reduce the number of combinations
have been described \cite{taylor91,clark91}.\\
\\

The most successful predictions of protein tertiary structure in the
absence of clear sequence similarity to a protein of known 3D
structure have been those where secondary structure predictions
and/or experimental information were combined to suggest resemblance
to an already known fold.  Correct folds have been predicted in this
way for the \al subunit of tryptophan synthase
\cite{crawford87}, a family of cytokines \cite{bazan90}, and 
recently, for the von Willebrand factor type A domain
\cite{edwards95}, and the synaptotagmin C2 domain \cite{gerloff95}.
Although the details of these studies differed, all used predicted
secondary structures from multiple alignment, combined with the
careful application of protein structural principles (often together
with experimental data) to suggest a protein fold. Two automated
methods for comparing predicted and experimental secondary structures
have been described previously \cite{sheridan85,rost95} with promising
though limited preliminary results.\\
%
% REF_RBR I had an idea about ending this sentence in a better way, though
%  I can't remember it now.  It was something about saying that they
%  were potentially promising, but weren't explored to a full enough
%  extent, or something.  Work on it.
%
\\
In this paper we show how secondary structure and accessibility
prediction together with basic rules of protein structure may be used
to find the correct fold within a database of protein structural
domains.  The method first generates all possible matches (referred to
as `maps') between query and database secondary structure patterns,
allowing for insertions and deletions of whole secondary structure
elements.  Maps are filtered by a series of structural criteria to
arrive at a collection of sensible template structures.  The sequence
of the query protein is then aligned to the template structures by
matching predicted and observed patterns of residue accessibility.
Finally, alignments are ranked by a score that combines accessibility
matching with a penalty for differences in secondary structure length.
The method is designed to cope with incorrect secondary structure
assignments, insertions/deletions of whole secondary structure
elements, and differences in the lengths and orientations of secondary
structures.

\section{Methods}

\subsection{Database of unique protein 3D structural domains}

A database of protein 3D structural domains was derived
from the Brookhaven Protein Databank \cite{brookhaven}. 
$930$ non-identical chains were clustered by sequence comparison
\cite{smith81,barton93b} to leave $275$ sequence families.  One 
representative of each family was chosen to have the highest resolution and lowest R-factor.
The representative structures were then split into $377$ domains by eye.
A sub--database of higher quality domains was created for 
analysis.  This contained only
those structures determined by X--ray crystallography, refined and of
a resolution of $2.5$~\AA{} or better.  Secondary structures for all
domains were defined by the program DSSP (definition of secondary structure in proteins) \cite{dssp},
or by DEFINE \cite{define} when only \Cal atoms were available.  Axial coordinates were
calculated for all secondary structures as described in 
\cite{define}.  Extra axial coordinates were calculated at the N-- and
C-- terminal ends to allow for possible differences in secondary
structure length.  The domain database is available via the WWW 
(http://geoff.biop.ox.ac.uk/).

\subsection{Alignment of secondary structures}

The secondary structure of the protein is represented as a sequence of
H and B characters where each H represents an entire \al helix and
each B a \be strand.  A fast method for generating all exact matching
alignments between two strings that allows up to a maximum number of
deletions from each string \cite{rcb95a} is used to find all {\em maps}
between the query pattern of secondary structures and the domain
database.  The method is recursive, and reminiscent of regular
expression matching.  In this study up to two deletions were permitted
from the query secondary structure string, to allow for errors in the
prediction.  Up to five deletions were permitted from each database
structure, to allow insertions or deletions of secondary structures
typical of proteins having similar 3D structures in the absence of
sequence similarity.  Deletions from the database structure were only
counted if they were contained within matched elements (overhanging
deletions were ignored).  Explicit mismatches were not allowed, but were
treated as deletions from either the query or database structure.
%
% REF_RBR Point to this paragraph in answer to Ref I's comment delta
%  I have added one new sentence to the end of the above paragraph,
%    since it wasn't quite accurate before.
% REF_GJB  I've added another sentence to mention mismatches.
%
%  I have added the sentence below in an attempt to answer part
%    of Ref I's question about optimisation of "free" parameters
%
These values were chosen since they are typical of the expected
accuracy of secondary structure prediction, and typical of insertions
and deletions of secondary structure elements across members of a 
diverse structural family.  In practice, the allowable deletions from
query and database should be chosen on a case by case basis.  
For consistency, we kept the maximum numbers of deletions fixed
during this study.
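
As a purely illustrative toy case (not taken from the searches reported
here), consider a hypothetical query pattern HBH and a hypothetical
database domain with elements H$_1$B$_2$B$_3$H$_4$.  If no deletions are
permitted from the query and at most one from the database structure,
two maps are generated, each deleting one of the two central \be strands:

\begin{center}
\begin{tabular}{ccc}
Map & Matched database elements & Deleted database element\\
1 & H$_1$, B$_2$, H$_4$ & B$_3$\\
2 & H$_1$, B$_3$, H$_4$ & B$_2$\\
\end{tabular}
\end{center}

Raising either deletion limit can only add further maps, so the filters
described below are needed to discard the implausible ones.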

\subsection{Filters}

The alignment method will find all maps between two strings of
secondary structure elements, but due to the allowance for deletions,
many of these will correspond to implausible topologies.  Accordingly,
seven filters are used to remove maps corresponding to nonsensical
protein 3D structures and/or those not satisfying imposed experimental
restraints.

\subsubsection{Removing non-compact structures}

Two filters exploit the radius of gyration, $R_{g}$, to
remove non-compact maps.  Analysis of the $275$  high
quality domains shows that $R_{g} \leq 2.8 L^{0.34} + 4.0$,
where $L$ is the length of the structure in residues.
For each map, a coarse $R_{g}$ is first calculated  by considering the
centroids of secondary structures, and their C-terminal
loops as point masses.  A fine $R_{g}$ is also calculated
by considering all matched residues (plus C-terminal loops) 
as point masses.  Maps are
removed if either  $R_{g}$ value is greater than the
maximum for compact domains of the same length.
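
As a worked illustration, assume a hypothetical domain of $L = 100$
residues and that $R_{g}$ is measured in \AA{} (the unit is an assumption
made for this illustration; it is not stated with the bound above).  The
largest radius of gyration accepted by these filters would then be
approximately
\[
R_{g} \leq 2.8 \times 100^{0.34} + 4.0 \approx 2.8 \times 4.79 + 4.0
\approx 17.4 .
\]
A map of this length whose coarse or fine $R_{g}$ exceeded this value would
be removed.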

\subsubsection{Loop length distance restraints}

Analysis of the $275$ high quality domains shows that the maximum
distance $D_{max}$ between axial coordinates that can be bridged by a
loop of $N_{l}$ residues is $11.621 (N_{l}+0.25)^{0.359} + 0.5$ \AA.  
%
% REF_RBR Ref I added units
%
Maps in which any loop must span a distance larger than $D_{max} + 4$~\AA{}
are removed.  An extra $4$~\AA{} is added to allow for differences in the packing of
database and query secondary structures, since similar structures with
little sequence similarity can have shifts of up to $4$~\AA{} \cite{holm95}.\\
\\
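As a worked illustration, for a hypothetical loop of $N_{l} = 5$ residues
the expression above gives
\[
D_{max} = 11.621 \times (5 + 0.25)^{0.359} + 0.5
\approx 11.621 \times 1.814 + 0.5 \approx 21.6~\mbox{\AA},
\]
so a map would be removed only if this loop had to bridge more than
approximately $21.6 + 4 = 25.6$~\AA.\\
\\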
Care is taken to allow a range of possible positions for the match of
query and database structures.   This allows for errors in secondary 
structure prediction, which may fail to predict the precise start or end of
correctly identified elements, and allows for the observed differences between
the lengths of secondary structure elements within proteins having similar
topologies despite no significant sequence similarity.
For a position $x$ on a
database secondary structure, and a minimum and maximum length 
for a query secondary structure, $L_{min}$ and $L_{max}$, the range
of allowable positions of the query residue on the database structure
(of length $L_{obs}$) is given by:\\

\begin{center}
$x_{min} = \min(L_{obs} - L_{max}, 0) - h + x$\\
$x_{max} = \max(L_{obs} - L_{min}, 0) + h + x$\\
\end{center}

where $h$ is a leniency parameter, allowing for differences in
the length of query and database secondary structures.  $h = 4$ 
allows for differences typical of  those found in proteins having
similar 3D structures despite no sequence similarity.
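
For example, with hypothetical values $L_{obs} = 10$, $L_{min} = L_{max} = 8$,
$h = 4$ and $x = 0$, these expressions give
\[
x_{min} = \min(10 - 8, 0) - 4 + 0 = -4, \qquad
x_{max} = \max(10 - 8, 0) + 4 + 0 = 6 ,
\]
so the query element may be placed anywhere from four positions before to
six positions after $x$.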

\subsubsection{Poor \be sheets}

The deletion of \be strands from a \be sheet can lead to maps
corresponding to nonsensical 3D structures.  Maps containing isolated
\be strands (i.e. those lacking hydrogen bonding partners)
are removed.  Maps are also removed if 
\be strands are deleted from the centre of \be sheets contained within the
map.\\
\\
Analysis of high quality domains shows that
the number of ${\rm C}_{\alpha}$--${\rm C}_{\alpha}$ contacts $\leq 6$~\AA{}
made by a \be strand ($C_{\beta - \alpha\alpha}$) with any of its neighbouring \be strands is
always $\geq  N_{\beta} - 2$, where $N_{\beta}$ is the number of
residues in the $\beta$ strand.  Thus maps are also removed
if one or more \be strands has $C_{\beta - \alpha\alpha} < N_{\beta} - 2$.

\subsubsection{Adjacent parallel structures}

Maps are removed if tandem
secondary structures in the query are made to match parallel structures in 
the database by the deletion of intervening secondary structures.  Genuine
adjacent parallel structures within the database are allowed.  This filter
can be turned off in instances when there are long loops connecting query 
secondary structure elements,  as in the phosphotyrosine interaction domain 
example (see Results).

\subsubsection{Distance restraints}

Distance restraints may be imposed from the results of NMR experiments,
knowledge of the disulphide linkages, or knowledge of residues involved
in the active or binding site of the query.  In this study, distance restraints
are only included in the von Willebrand factor and proteasome examples 
(see Results).  A tolerance value $t = 4$~\AA{} is added to all distance
restraints, as for the loop length filtering.

% REF_RBR - not a change due to comments
%   I just thought the two sections below could be combined, since
%    "Redundancy" previously only contained one sentence.
\subsubsection{Consistency \& Redundancy}

Maps are only kept if there is at least one 
placement of the query onto the database secondary structures where all
distance restraints (loop length and/or experimental) 
are satisfied simultaneously.  

After application of all the other filters, matches contained entirely 
within another match are considered redundant, and removed.

\subsubsection{Maps removed by each filter}

It is illustrative to consider the fraction of maps removed by each of
the filters described above.  For example, for a pattern derived from a DSSP
assignment of secondary structure for thioredoxin, allowing 2 
secondary structure element deletions from the query and 5 from the
database, the initial alignment of secondary structure elements
reduces the number of folds from $377$ to $212$; the other $165$ folds
have no match of secondary structures with the
thioredoxin pattern.  \htmladdnormallink{Table~1}{ntable1.ps} illustrates the fractions of the initial
$204783$ maps within $212$ folds that are removed by each filter when
applied independently.  \htmladdnormallink{Table~2}{ntable2.ps} shows, for the same example, how the
number of maps drops as the filters are applied in succession.  The
filters are independent of one another apart from consistency filtering,
which must be applied {\em after} loop and distance restraint filtering,
and redundancy filtering, which must be applied last.  The order of
filters shown in \htmladdnormallink{Table~1}{ntable1.ps} was chosen so as to optimise speed.\\
%
% RBR_REF Ref I's comment about the order of filters addressed by
%   new sentences (above).
\\
The gradual elimination of maps and folds shows how the simple
principles of protein structure are sufficient to reduce the
number of possible alignments by two orders of magnitude.  Interestingly,
the number of folds drops very little after the generation of
maps, suggesting that the filters are tending mostly to remove 
nonsensical maps associated with each identified fold rather
than ruling out folds.  Note that
consistency filtering tends only to remove maps when tight loop
lengths or distance restraints are included in the pattern.
%
% RBR_REF Ref I commented that the number of folds only dropped
%  by 10 %.  Clearly this reflects a failure to distinguish between
%  maps and folds.  I have added the sentence "Interestingly...
%  to try and clarify this a little.
%  

\subsection{Fitting sequences on to 3D structures}

Accessibilities for residues within each map are calculated quickly by
exploiting the relationship between relative accessibility and the
number of other $C_{\beta}$ atoms within $7$~\AA{} ($N_{C\beta7}$) of a
residue's $C_{\beta}$ atom.  $N_{C\beta7}$ is calculated by considering
secondary structures and the C-terminal coils for the matched
structures.  Analysis of the high quality domains shows that helical
residues are buried (b) when $N_{C\beta7} \geq 3$, exposed (e) when
$N_{C\beta7} = 0$ and intermediate/unknown (u) otherwise.  Similarly,
residues in \be strands are b when $N_{C\beta7} \geq 6$, e when
$N_{C\beta7} \leq 3$ and u otherwise.  In the examples presented here,
predicted accessibilities were taken from the SUB line within PHD
\cite{rost94b} output, which highlights those regions predicted with
confidence.  Remaining positions were assigned as unknown (u) accessibility.\\
\\
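Stated compactly, with the thresholds quoted above, the accessibility
state assigned to a database residue is
\[
\mbox{helix:}\quad
\left\{ \begin{array}{ll}
\mbox{b} & N_{C\beta7} \geq 3 \\
\mbox{e} & N_{C\beta7} = 0 \\
\mbox{u} & \mbox{otherwise}
\end{array} \right.
\qquad
\mbox{strand:}\quad
\left\{ \begin{array}{ll}
\mbox{b} & N_{C\beta7} \geq 6 \\
\mbox{e} & N_{C\beta7} \leq 3 \\
\mbox{u} & \mbox{otherwise}
\end{array} \right.
\]
so that, for instance, a hypothetical residue with $N_{C\beta7} = 4$ would
be classed as b in a helix but as u in a strand.\\
\\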
Given assignments of accessibility, the best alignment for each pair
of secondary structures, not permitting gaps within either secondary
structure, is found by applying the scoring matrix shown in \htmladdnormallink{Table~3}{ntable3.ps}.
%
% REF_RBR more on "free" parameter optimisation comment by Ref I.
These values were chosen to prevent long overhanging gaps in
the alignment of predicted and experimental secondary structures,
and designed not to penalise mismatches too heavily.
The total similarity score for the alignment is then defined as:

\[
\left( \sum_{i=1}^{N} S_{acc} \right) - L_{diff}
\]

where $S_{acc}$ is the best score for a pair of matched secondary
structures calculated by summing values from \htmladdnormallink{Table~3}{ntable3.ps}, $N$ is the
number of matched secondary structures, and $L_{diff}$ is the total
difference in the lengths of the two protein domains being compared.
When calculating $L_{diff}$, those secondary structures that have been
equivalenced are ignored, since overhanging gaps are already penalised
by the gap score in \htmladdnormallink{Table~3}{ntable3.ps}.
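
As an illustration with invented numbers (not results from this study),
suppose a map has $N = 3$ matched secondary structures whose best
accessibility scores are $12$, $7$ and $9$, and that $L_{diff} = 5$.
The total similarity score would then be
\[
(12 + 7 + 9) - 5 = 23 .
\]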

\subsection{Protein Structure Patterns for Evaluation}

Representatives (queries) from each of $11$ structural families containing 
structural similarities despite no sequence similarity
\cite{rb94} were chosen to assess the 
method.  The $11$ queries are shown in \htmladdnormallink{Table~4}{ntable4.ps} and represent a
diversity of folds from all four protein folding classes.  For all
queries, there is at least one clear example of a similar fold in the
database that does not show any detectable sequence similarity to the
query.  For reference, similar folds in the database were
found by the STAMP (structural alignment of multiple proteins)
structure comparison program \cite{rb92b} and with reference to the
structural classification of proteins (SCOP) database
\cite{murzin95}.\\
%
% REF_RBR added an explanation about why the 11 were chosen.  
%  (covers range of folding classes, at least one example, etc.
%
\\
Two patterns were defined for each of the eleven structures: a) one
taken directly from the DSSP secondary structure assignment and
accessibility (i.e. perfect prediction) and b) one from
cross--validated secondary structure and accessibility prediction by
the methods of Rost \& Sander \cite{rost93a,rost94b}.   The PHD program
and jack-knifed neural network architectures were kindly provided by
Dr Burkhard Rost (EMBL).  
%
% REF_RBR added sentence about where we got the PHD program from
%  in answer to Ref II.
%
Experimental secondary structure summaries and accessibilities (a)
were taken from DSSP \cite{dssp}.
Predicted secondary structure summaries (b) were taken from the `PHD sec'
%
% REF_RBR More in answer to Ref I's comments re point epsilon
%  and the minor point about SUB_ACC  several changes in the 
%  paragraph below
entries and accessibilities from the `SUB acc' entries, since these
most closely resembled the assignments from the $N_{C\beta7}$ calculation
of accessibility.   PHD assignments of buried and exposed states 
were kept as buried (b) and exposed (e), and all other positions (`i' or 
no assignment) were classified as unknown (u).  Strands shorter than
two residues and helices shorter than four residues were ignored. The 
length of the secondary structure was given by the number of residues 
in each secondary structure (maximum = minimum), and the number of 
residues between the secondary structures was taken as the minimum 
loop length.\\
\\
Patterns may also contain distance restraints, such as those available from
NMR experiments, disulphide linkages, or site-directed mutagenesis (SDM) studies. Distance restraints
were only added in the von Willebrand factor and proteasome patterns (see
Results).

\subsection{Cross--validation}

Any predictive method that needs large numbers of parameters must be
cross-validated to ensure that the method does not do artificially
well on the examples used to derive the parameters.  For cross
validation of the secondary structure and accessibility predictions,
we used the jack-knifed neural--network architectures described by
Rost \& Sander \cite{rost93a} (kindly provided by Dr. B. Rost).  Secondary
structure and accessibility for each query protein were predicted by an
architecture that did not include the query protein or any
homologue.\\
\\
The filters and matching algorithm described here use only a few
geometric parameters, all of which are independent of the protein
sequence.  Accordingly, removal of query proteins and homologues from
the set used to derive the equations above makes a negligible 
difference to the parameters.

\subsection{Computational details}

Runs for the patterns shown in \htmladdnormallink{Table~4}{ntable4.ps} take between 5 and 60 minutes on
a Silicon Graphics Indigo 2 (150 MHz IP22 Processor MIPS
R4400).  The MAP program is available from the authors. Contact GJB by
e-mail: gjb@bioch.ox.ac.uk or see the WWW address http://geoff.biop.ox.ac.uk/ for details.

\section{Results}

\subsection{Assessing accuracy}

Structural similarity is a continuum and for some fold types opinions
differ as to what constitutes ``similar''.  For example, thioredoxin
has a $\beta$-sheet with helices packing on each side which
superficially resembles a Rossmann fold domain.  However, the topology
of the sheet is different from a Rossmann fold: the connectivity is
different, and it contains a mixture of parallel and antiparallel \be
hairpins rather than all parallel.  To build a detailed model of
thioredoxin based on a Rossmann fold would be incorrect, but
recognising that thioredoxin has a ``single sheet with helix on each
side'' is still useful.  For some folds, e.g. the $\beta$-trefoils,
there is no such ambiguity.  We discuss the accuracy of our method
using two grades of success, `strict' and `loose', which are outlined
in \htmladdnormallink{Table~5}{ntable5.ps}.  Strict similarities are those where the topology of the
structure in the database is nearly an exact match of that found in
the query (e.g. plastocyanin and azurin).  Loose similarities are
those where the topologies are broadly similar, with additional
secondary structures in one fold relative to another, and with some
differences in topological ordering or orientation of equivalent
secondary structure elements (e.g. plastocyanin and an Ig fold).
Strict similarities tend to correspond with those specified by SCOP
\cite{murzin95}, whereas the loose similarities tend to correspond
roughly with those identified by CATH \cite{orengo93a} and by
the assessors of the protein structure prediction challenge \cite{lemer96}.\\
\\
For comparison, we also scanned the same eleven queries against the
database of domains using the  fold recognition 
program THREADER \cite{jones92} with default parameters.\\
\\
In addition to the recognition of the correct fold, it is important to
consider how well the query is aligned onto the database structure.
Two measures of alignment accuracy are given: a) the
fraction of correct residue equivalences found by each method
({\em \%~Res--Res}), and b) the fraction of correctly overlapping secondary
structure elements found ({\em \%~Sec--Sec}).  Secondary structures were
considered correctly matched if at least two residues from 
structurally equivalent secondary structures overlapped in the
alignment generated by each method.  \%~Res--Res is a
%
% REF_RBR Tried to clarify Sec-Sec a little better to
%  help out our poor referees
%
strict definition, and broadly measures how accurate a 3D model would
be if based on  the alignment found.  \%~Sec--Sec is a looser
definition, and allows for slippages of secondary structures and thus
indicates the accuracy of the predicted topology.  The second
measure is arguably a more reliable guide, since for many pairs of similar
protein structures, alignments of sequence based on 3D structure
are ambiguous.  Problems arise when assessing the  symmetrical
\albe barrel structures.  Shifting the alignment of secondary
structure elements by one \be\al unit can lead to zero accuracy by
these measures, though the resulting structure is largely correct.  We
thus report average accuracies with and without the \albe barrels.
To assess the overall alignment accuracies of each method, 
only those strict similarities that were not detectable by a sensitive 
sequence comparison algorithm \cite{barton93b} were considered.  Similarities 
excluded were those with the globins 1ECA, 1HBG and 1MYGA when scanning with
sea hare myoglobin, and that with 1PAZ when scanning with plastocyanin.
For all other examples, accuracies were included in the calculation of an 
average, regardless of whether the similarity was found at or near the 
top of the ranked lists.  A total of 36 strict similarities were used 
in the calculation.
% REF_RBR added the last sentence here in answer to 
%  Referee I's point beta.
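\\
\\
One way to write the two alignment measures explicitly is sketched below.
The symbols are introduced here for illustration only, and the choice of
denominators (the reference structural alignment) is an assumption, since the
text above specifies only what counts as correct.  Let $E_{\rm ref}$ be the
set of residue--residue equivalences in the reference alignment and
$E_{\rm method}$ the set in the alignment produced by a method; let
$S_{\rm ref}$ be the set of structurally equivalent secondary structure
pairs, and $o(s)$ the number of residues by which the pair $s$ overlaps in
the alignment produced by the method.  Then
\[
\%\,\mbox{Res--Res} = 100 \times
  \frac{|E_{\rm method} \cap E_{\rm ref}|}{|E_{\rm ref}|},
\qquad
\%\,\mbox{Sec--Sec} = 100 \times
  \frac{|\{ s \in S_{\rm ref} : o(s) \geq 2 \}|}{|S_{\rm ref}|}.
\]
Under this reading, a shift of one \be\al unit in an \albe barrel can leave
both numerators empty, giving zero on each measure even though the overall
fold is placed correctly.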

\subsection{Searches with eleven test proteins}

The results of comparing the eleven protein structures to the database
of domains using DSSP patterns, PHD patterns, and the THREADER program
are shown in \htmladdnormallink{Table~6}{ntable6.ps}.  The table lists the top 10 ranked domains for
each query by each method.  For each domain, the code, score,
structural class and fold description are shown together with the
alignment score and the percentage accuracies of the alignments at the
residue (\% Res--Res) and secondary structure (\% Sec--Sec) level (see
above).  Within \htmladdnormallink{Table~6}{ntable6.ps}, domains classified as strict similarities
(ignoring those detectable by sequence comparison) are shown in
inverse text; loose similarities are shown as shaded.  \htmladdnormallink{Table~7}{ntable7.ps}
summarises the rankings shown in \htmladdnormallink{Table~6}{ntable6.ps} (see legend).\\
\\
Judging by the strict criteria shown in \htmladdnormallink{Table~5}{ntable5.ps}, 8/11 of the scans
made with experimentally determined secondary structure (MAP(DSSP))
put the correct fold in the first rank.  By the loose definition, the
method located 10/11 folds in the first rank.  Predictably, the scans
based on patterns from secondary structure prediction fare worse:
4/11 folds were correctly ranked at position 1 by the strict criteria.
However, this compares favourably with THREADER which placed 1 fold
correctly in the first rank.  When the loose definitions of fold
similarity are used, our method placed 5/11 correct folds at the top
of the list compared to 2/11 for THREADER.  Expanding the definition
of success to include any search that places a correct fold in the top 10, as
described by Lemer \ea (1996) \nocite{lemer96}, shows a similar trend
\htmladdnormallink{(Table~7)}{ntable7.ps}.  The greater success of the
DSSP-derived patterns suggests that fold recognition by this method
will improve alongside any improvements in secondary structure and
accessibility prediction.  The structural class of proteins (as
identified using SCOP) in the top 10 domains was more consistent by 
our method: MAP(PHD) scans led to 10/11 correct protein class 
predictions for the 1st ranked protein, compared to 5/11 for THREADER.
Although this improvement may be due mostly to the accuracy of the PHD
predictions, the result suggests that other fold recognition methods could 
profit from the consideration of predicted secondary structures.\\
%
%  REF_RBR added comment above about classes from SCOP and
%     that class predictions by MAP(PHD) may be due
%     to the accuracy of PHD.
\\
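For reference, the first-rank counts quoted above convert to the following
approximate percentages of the eleven queries.  This is simple arithmetic on
the figures given in the text, rounded to the nearest whole percent, and adds
no new results:
\begin{center}
\begin{tabular}{lcc}
Method    & Strict (1st)           & Loose (1st)             \\
MAP(DSSP) & 8/11 $\approx$ 73\%    & 10/11 $\approx$ 91\%    \\
MAP(PHD)  & 4/11 $\approx$ 36\%    & 5/11 $\approx$ 45\%     \\
THREADER  & 1/11 $\approx$ 9\%     & 2/11 $\approx$ 18\%     \\
\end{tabular}
\end{center}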
Our method (MAP) shows an improvement over THREADER with respect to
detecting the correct fold.  What of alignments of sequence to structure?
Values for individual accuracies are given in \htmladdnormallink{Table~6}{ntable6.ps}.  Reference alignments
of 3D structures were found by the STAMP algorithm \cite{rb92b} for
all strict similarities with the eleven protein 
families.  The averaged values for \% Res--Res and \% Sec--Sec are shown in
\htmladdnormallink{Table~8}{ntable8.ps}.  MAP(DSSP), MAP(PHD) and THREADER give \% Res--Res of 35, 15
and 11\% respectively, and \% Sec--Sec of 75, 43 and 37\%.  If one
ignores the repetitive \albe barrel alignments, accuracies improve slightly,
with \% Res--Res of 39, 15 and 13\% and \% Sec--Sec of 86, 49 and 50\% for
MAP(DSSP), MAP(PHD) and THREADER respectively.  None of the methods performs well by the
\% Res--Res criterion, though \% Sec--Sec suggests that the correct topology
is achieved about 50\% of the time.  The high \% Sec--Sec for 
MAP(DSSP) scans suggests that alignment accuracy, like fold recognition,
will improve with developments in secondary structure and
accessibility prediction.\\
\\
How useful are the detected loose similarities?  For some examples,
loose similarities imply only a broadly similar architecture, and may
not immediately be used for homology modelling studies.  However, for
others the loose similarity genuinely represents a feasible modelling
template.  For example, the PHD prediction of hepatocyte nuclear factor 3 
(HNF-3) failed to predict
two short \be strands found in the native structure, and thus the MAP
search did not detect BirA domain I (PDB code 1BIA) or GAP domain I
(2GAP) as possible templates.  However, the search with the
predominantly helical prediction did rank another helix-turn-helix
motif first, as shown in \htmladdnormallink{Figure~1}{figure1.ps}.  The core three
helices have been aligned correctly at the secondary structure level
and a prediction of this type could be useful in the absence of
experimental 3D structure information.

\subsection{Fold recognition from published predictions}

In the tests above only the type and length of secondary structures,
the loop length observed in the query structure, and the pattern of
burial and exposure, observed or predicted for each secondary
structure segment were used in the search.  Many published predictions
are augmented by human insight, contain detailed predictions of
loop lengths, and consider experimental distance restraints.  All of
this information can be used with the MAP method described here.  To test
the method under these circumstances, we considered three predictions:
1) the von Willebrand factor (vWf) prediction by Edwards \&
Perkins (1995), 2) the Proteasome prediction by Lupas \ea (1994)
\nocite{lupas94} and 3) a prediction for the Phosphotyrosine
Interaction Domain (PID) by Bork \& Margolis (1995).
\nocite{bork95a,edwards95} All of these predictions were made
from very diverse sequences, which is likely to improve prediction
accuracy \cite{russell95}.  The predictions also comprise carefully
constructed sequence alignments, which can provide tight loop--length
distance restraints.  For the three searches, a larger and more
up-to-date database of 780 protein domains was scanned (A. S. Siddiqui,
personal communication).  Subsequent 3D structure determination has shown all three
of these proteins to resemble previously observed folds
\cite{lee95,brannigan95,zhou95}.

\subsubsection{The vWf domain}

Perkins \& co--workers (Perkins \ea, 1994; Edwards \& Perkins, 1995)  
\nocite{perkins94,edwards95} used an alignment of 92 sequences together
with spectroscopic data, and prediction algorithms to predict that
the vWf domain would comprise a repeating arrangement of \be strands
and \al helices.  Edwards \& Perkins combined a THREADER scan with
analysis of the location  of active site residues, a putative
disulphide bridge,  and the principles of protein 3D structure.
They suggested that the vWf domain would be most likely to resemble ras p21.
The subsequently determined 3D structures
\cite{lee95} showed this prediction of secondary structure and fold to 
be largely correct \cite{russell95}.\\
\\
Our mapping technique allows many of the features
exploited by Perkins \ea to be combined in a prediction.  \htmladdnormallink{Figure~2}{figure2.ps} shows a vWf
pattern based on the prediction of Perkins \& co-workers
\cite{perkins94,edwards95}.  In addition to a pattern of predicted
secondary structures, the pattern also contains detailed information
as to the loop lengths, and details of two distance restraints: one
from a pair of aspartic acids thought to be involved in a metal
binding site (constrained to have their axial coordinates within 15
\AA), and one from a putative disulphide bond (the two residues involved
constrained to have their axial
coordinates within 9.5 \AA).  A tolerance of $t = 4$ \AA\ was added to
each of these restraints to allow for changes in secondary structure
packing across similar protein 3D structures.\\
%
% REF_RBR Re: Ref I's "free" parameters comment
%  I think that the last sentence (above) is a good enough caveat.
%  I am tempted to call it a "slop factor", but this is sloppy.
%  
%
\\
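Read literally, each restraint places an upper bound on the separation of two
axial coordinates.  As a sketch, write $d_{ij}$ for the distance between the
axial coordinates of the two restrained positions and $d_{\max}$ for the
restraint value.  These symbols are introduced here for illustration, and we
assume that the tolerance $t$ simply relaxes the upper bound.  A map is then
retained if
\[
d_{ij} \leq d_{\max} + t .
\]
With a tolerance of 4 \AA\ this gives effective limits of $15 + 4 = 19$ \AA\
for the putative metal binding site and $9.5 + 4 = 13.5$ \AA\ for the
putative disulphide bond.\\
\\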
A comparison of the vWf pattern to the database of 780 domains finds
Elongation factor Tu (PDB code 1ETU), Ras P21 (821P) and Che-Y (3CHY)
as the three top scoring folds, with other double--wound, \albe,
Rossmann-type folds following in the top 20 scoring folds.  The top 3
scoring proteins are highly similar to the recently solved structures
of the vWf, with Ras P21/Elongation factor Tu being the most similar
\cite{lee95}.

\subsubsection{The Proteasome}

Lupas \ea (1994) predicted the secondary structure for the 20S
proteasome \al subunits by a variety of algorithms.  We took their
predicted pattern of secondary structure elements and accessibility and
searched the database of 780 non-redundant protein domains.  Without
imposing any experimental distance restraints, the method finds $7$
folds ($173$ maps).  The top scoring fold, according to the
amphipathicity scoring scheme, is that of glutamine amidotransferase
(PDB code 1GPH), which is structurally and functionally similar to the
proteasome
\cite{lowe95,brannigan95}.\\
\\
A small number of weak distance restraints can make a significant
difference to the results of this search.  If alignment positions
identified as putative active site residues by Lupas \ea, using the
method of Benner and co-workers \cite{benner93a}, are required to have
axial coordinates within $15$ \AA\ (tolerance of $4$ \AA) of each
other, only $4$ folds ($19$ maps) remain, with the correct fold still
at the first rank.  Although distance restraints are not always
available prior to 3D structure determination, our results 
suggest that they should be used to aid fold recognition whenever
possible.

\subsubsection{The phosphotyrosine interaction domain}

Bork \& Margolis (1995) recently identified a new phosphotyrosine
interaction domain (PID) involved in the cytoplasmic signalling
cascade.  They constructed an alignment of several diverse members of
this sequence family, and performed a prediction of secondary
structure.  We ran the PHD program on a slightly more up-to-date
alignment of PID proteins (P. Bork, personal communication), to 
predict the secondary structure and accessibility.  A search pattern 
was made from the prediction, and the loop length
ranges taken from the multiple alignment.  The pattern of 9 secondary
structures was BBHBBBBBH and  these elements are numbered sequentially
from 1--9 below.  Since there were two long loops connecting the
predicted secondary structures, the adjacent parallel filter was not
used during the search.  Structures corresponding to the best
alignment with each of the top six scoring folds are shown in \htmladdnormallink{Figure~3}{figure3.ps}.   Recent structure
determination has shown the PID (PTB domain) to resemble the pleckstrin homology
(PH) domain in structure and function
\cite{zhou95}.  By the accessibility scoring scheme, the top ranked fold
is not a PH domain, although a PH domain (from
dynamin) is ranked at position 2.  The top 6 folds are illustrative in
that they show how the method can suggest alternative plausible folds 
that satisfy
a pattern of predicted secondary structures and accessibilities.\\
\\
The best scoring fold \htmladdnormallink{(Figure 3a)}{figure3.ps} is that of profilin (PDB code
2BFPP), and the best scoring map gives an anti-parallel \be sheet
with the strand order 218754 (predicted strand 6 is deleted) with
one helix packing against each face.  The second best scoring fold is
a correct match with the PH domain from human dynamin (1DYNB), having
deleted the first predicted \al helix from the PID pattern.  The third
best scoring fold (3c) comes from {\em S. aureus} \be lactamase (1BLH,
domain 1), with an anti-parallel \be sheet of order 54876, with both
helices packing against one face.  The fourth and fifth best scoring
folds come from members of the Ig superfamily, and comprise
alternative arrangements of \be strands to form a Greek key \be
sandwich.  Both of the predicted \al helices from the PID pattern
have been deleted in these matches.  Finally, the sixth (3f) match
comes from the tryptic core of {\em E. coli lac} repressor (1TLFD
domain 4), and comprises a parallel \be sheet (42576) with both
helices packing against one face.  This fold is perhaps the least
plausible, since it would require 3 crossover connections between
adjacent and parallel \be strands.  \\
\\
The method has suggested plausible alternative structures
that can be scrutinised, in the absence of 3D structural
data, by way of further experiments, secondary structure predictions,
or even other methods of fold recognition.  The results show how the
predicted secondary structure elements can be accommodated into a
compact, plausible protein fold, and encouragingly, the method has
identified the correct fold high in the list of alternatives.

\section{Discussion \& Conclusions}

In this paper we have presented a new method for protein fold
recognition which exploits recent improvements in protein secondary
structure prediction, and can use other information such as
predictions of accessibility, loop lengths and experimental data to
restrict possible folds. When applied to predicted secondary
structures and accessibilities, the method has been shown to be
slightly better than one widely used fold recognition method
\cite{jones92} at detecting the correct fold for eleven test examples.
The alignments generated by the method are of comparable accuracy at
the residue-residue and secondary structure alignment level.  When the
query is defined by experimental secondary structures and
accessibilities, the method is highly successful
at recognising the correct fold.  This suggests that the mapping
method will improve alongside any future improvement in secondary
structure and accessibility prediction.  The method also has the
advantage of being computationally inexpensive, and so allows for
multiple searches to be performed quickly.\\
\\
The simplicity of the technique suggests several  enhancements that 
could improve accuracy even further.  The method of aligning
sequences onto 3D structures might be developed by the use of
empirically derived pair-potentials or accessibility preferences
(e.g. \cite{sippl90,jones92}), or by the identification of favourable 
interaction sites between secondary structures 
\cite{cohen80a,cohen80b,cohen82}.  A more sophisticated alignment and ranking 
procedure is under development.\\
\\
The initial alignment and filtering procedures are perhaps the most
distinctive feature of this technique.  Other techniques for fold recognition tend
only to provide a single sequence alignment of query and database
structures.  The use of a secondary structure element alignment method has
the advantage that exhaustive comparisons of two proteins can be
performed; most folds identified have an ensemble of alternative
alignments that can be explored further.\\
\\
Since most protein structure
similarities occur at the domain level, it is advantageous, whenever
possible, to split both query and database structures into domains.  
The problem of assigning domains for protein 3D structures has been
the subject of revived interest
\cite{holm94b,siddiqui95,sowdhamini95,islam95} and is likely to lead
to accessible databases of protein structural domains.  Assigning
domains within proteins of unknown 3D structure is more problematic,
though approaches based on sequence homology \cite{pongor94,sonnhammer94}
are undoubtedly the most promising; the vWf and PID proteins above are
both examples of domains that occur in a variety of multi-domain
contexts.\\
% REF_RBR point to this paragraph in answer to Ref I's point gamma
%
\\
The method described here has applications in protein structure
determination by NMR.  During NMR structure determination, a
preliminary secondary structure assignment (equivalent to a very
accurate prediction) and a small number of distance restraints may be
available early in the study.  However, these data are usually
insufficient to determine a unique structure by distance geometry or
molecular dynamics \cite{smith-brown93}.  Our results for the
vWf and proteasome domains suggest that the data may be sufficient to locate a
similar fold in the database if one is present.  Folds predicted from
distance restraints and secondary structure assignment may be used to
guide the assignment of cross-peaks and thus speed up the structure
determination process.  Clearly, the alternative consistent topologies may also
give clues as to possible structural/functional/evolutionary
relationships that are generally not known until after 3D structure
determination (such as that described in Matthews {\em et al.}, 1994).
\nocite{matthews94}\\
\\
We have shown that secondary structure predictions of typical
accuracy, together with simple principles of protein 3D structures
and/or experimental data can be used to recognise correct protein
folds in a library of domains.  These results and others
\cite{edwards95,russell95,gerloff95} suggest that secondary structure
prediction, experimental data, and protein structural principles
should be used to augment protein fold recognition whenever possible.

\section{Acknowledgements}

We thank Professor L.N. Johnson for encouragement and support.  We are
indebted to Dr D.T. Jones (University of Warwick) for giving advice on
the THREADER program and its database, Dr B. Rost (EMBL, Heidelberg)
for providing the PHD program, Drs P. Bork (EMBL, Heidelberg) and
S.J. Perkins (Royal Free Hospital, London) for providing prediction
data for the PID and vWf domains, Dr S.K. Burley (Rockefeller
University, New York) for providing the coordinates of the HNF--3
structure and Mr A. S. Siddiqui (LMB, Oxford) for providing a database
of protein structural domains.  RBR thanks Dr C. P. Ponting
(Fibrinolysis Research Unit, Oxford) for helpful discussions.  RBR and
GJB thank the Royal Commission for the Exhibition of 1851 and the
Royal Society for support.  RRC is funded by an MRC studentship.  This
research was funded in part by a grant from the BBSRC (UK).
%
% We also thank our parents for raising us so well,
%  erm... I'd like to thank my producer and all my friends for
%  support during the long years, and for understanding.  Thanks
%  to Bill at the Betty Ford clinic.
%

\section{$\ddag$Abbreviations}

3D, three-dimensional; NMR, nuclear magnetic resonance; Ig,
immunoglobulin; SDM, site-directed mutagenesis; WWW, World Wide Web.
The standard one-- and three--letter abbreviations for the amino acids 
are also used throughout.


\section{Figure and Table Legends}

\subsection{Figure 1}

\htmladdnormallink{Figure~1}{figure1.ps}.

An example of a useful `loose' similarity between a 3D structure and a secondary structure
prediction, detected using the MAP method.  a) The alignment found by the
method between the predicted pattern for HNF--3 and the helical DNA binding motif
within phage 434 repressor.  Boxed, bold-faced, upper-case regions indicate aligned
predicted and experimental secondary structures.  Sec denotes the PHD prediction for 
HNF--3, and a 3-state DSSP secondary structure assignment for 434 repressor.  Bur
shows predicted and experimental states of burial for HNF--3 and 434 repressor: 
b = buried; e = exposed; u = intermediate/unknown.  b) The equivalent alignment 
found using the STAMP (Russell \& Barton, 1992) structure comparison algorithm.  
Boxed, bold-faced, upper-case regions indicate structural equivalences.  
Sec denotes DSSP 3--state secondary structures for both proteins.
c) and d) show the crystallographic structures of the matched regions of HNF--3 and
434 repressor, with structurally equivalent residues shown in ribbon/coil format, and
non-equivalent regions shown as \Cal trace.  The N- and C- termini of the structures are
labelled.\\


\subsection{Figure 2}

\htmladdnormallink{Figure~2}{figure2.ps}.

Search pattern for the von Willebrand factor type A domain (derived from Edwards \& Perkins, 1995)
as discussed in the text.  \al helices are indicated by cylinders, \be strands by arrows.
The numbers given beside each secondary structure or loop are the range of
predicted lengths.  Bullets ($\bullet$) show those secondary structures that are
required for any possible map (i.e. those involved in distance restraints).  Two
distance restraints, one from a putative disulphide bond ($9.5$ \AA) and the other 
from knowledge of two residues thought to be involved in metal coordination
($15$ \AA), are shown to the left of the figure.\\


\subsection{Figure 3}

\htmladdnormallink{Figure~3}{figure3.ps}.

Maps from the top six scoring folds found during a search with the PID pattern.  
Details are given in the text.\\

\subsection{Table 6}

\htmladdnormallink{Table~6}{ntable6.ps}.

Results of running MAP using secondary structure assignments (I) and
PHD secondary structure predictions (II) shown beside THREADER results
(III) for eleven protein structures having type B and C similarities
(Russell \& Barton 1994) within the domain database.  The first column
for each method shows the top ten scoring domains, which are denoted
by a PDB four letter code (Bernstein {\em et al.}, 1977), a chain
identifier as the fifth character (if any), followed by an underscore
and a Roman numeral denoting the domain (if any).  Bold inverted text
denotes a correct match using the strict classification, grey
backgrounds show loose classifications (see text).  The second column
shows the score for each domain, the third the protein structure
class, and the fourth the name of the fold/structure.  Upper case
denotes fold families under the strict definitions.  Upper case names
in parentheses (if present) denote the name of the loose family
classification.  The globins 1HBG, 1MYGA and 1ECA and the cupredoxin
1PAZ are similar in sequence to the query, so are not shown inverted and
are not included in the evaluation statistics (see text).

Strict fold classifications:
4HB-1= Up-down-up-down four helix bundle (4HB); 4HB-2= up-up-down-down (interleukin-4 
type) 4HB;  GLOBIN= globin-type folds; W-HTH= winged helix-turn-helix (HTH) folds; 
EF-HAND= calcium binding EF hands; CYTOC= cytochromes C; THIO= thioredoxin-like folds; 
FLAVO= flavodoxin-like folds; ROSS= Rossmann folds; PBL= periplasmic binding 
protein-like folds; ACTIN-ATPASE= actin/HSC-70/hexokinase like folds;  G-PROT= 
G-protein (ras) like folds; FAD-BIND= FAD/NAD binding protein-like folds; 
\al\be-BARREL= \al\be (TIM) barrels; \be-GRASP= \be-grasp (ferredoxin) like
folds; IG= Immunoglobulin superfamily; CUP= Cupredoxins (plastocyanin-like); 
\be-TREFOIL= \be-trefoils (interleukin-1-\be-like); OB-FOLD= 
oligonucleotide/oligosaccharide binding folds.

Loose fold classifications:
4HB= 4HB-1, 4HB-2, ferritin; HTH= W-HTH, $\lambda$-rep., trp-rep.; DWAB 
(doubly-wound-\al\be)= ROSS, FLAVO, THIO, PBL, G-PROT, sugar phosphatase, 
pfk, pgk, dhfr; GKBS (Greek key \be sandwich)= IG, CUP, \al-amylase inhibitor, 
sod, macromycin, prealbumin.

Other abbreviations:
sod= superoxide dismutase;  pfk= phosphofructokinase; pgk= phosphoglycerate kinase; 
dhfr= dihydrofolate reductase; ldh= lactate dehydrogenase; ser-prot= serine proteinase; 
asp-prot= aspartic proteinase; inh.= inhibitor; rep.=repressor; glut.=glutathione; 
red.=reductase; thym. phosph.=thymidine phosphorylase; ribo.=ribonuclease; 
glyc.=glycoprotein; P-glucomutase= phosphoglucomutase; glyc. ribo trans.= glycinamide 
ribotransferase.


\subsection{Table 7}

\htmladdnormallink{Table~7}{ntable7.ps}.

Summary of fold recognition success rates.  Strict and Loose refer to
the criteria for structural similarity discussed in the text.  Class
refers to structural class success as discussed in the text.  (1st)
refers to success measured as a correct fold at rank 1, (Top 10) as a
correct fold in the top 10 ranked structures.

\section{Figures}

[\htmladdnormallink{Figure 1}{figure1.ps}][\htmladdnormallink{Figure 2}{figure2.ps}][\htmladdnormallink{Figure 3}{figure3.ps}].


\section{Tables}

[\htmladdnormallink{Table 1}{ntable1.ps}][\htmladdnormallink{Table 2}{ntable2.ps}][\htmladdnormallink{Table 3}{ntable3.ps}][\htmladdnormallink{Table 4}{ntable4.ps}][\htmladdnormallink{Table 5}{ntable5.ps}][\htmladdnormallink{Table 6}{ntable6.ps}][\htmladdnormallink{Table 7}{ntable7.ps}][\htmladdnormallink{Table 8}{ntable8.ps}].


\nocite{TitlesOn}
\bibliographystyle{jmb}
\bibliography{rbr}

\end{document}