User guide for:
SCANPS
A program for rigorous protein sequence database scanning

Geoffrey J. Barton

University of Oxford
Laboratory of Molecular Biophysics
Rex Richards Building
South Parks Road
Oxford OX1 3QU
UK

Tel: (44) 865-275368
Fax: (44) 865-510454
email: gjb@bioch.ox.ac.uk
www: http://geoff.biop.ox.ac.uk/personalities.html

Preliminary manual: 29th July 1994 Revised: 26th August 1994

What is SCANPS?

SCANPS (pronounced Scan-P-S) stands for SCAN Protein Sequence. The main function of SCANPS is to use a rigorous local alignment method to search protein sequence databases with a query sequence or multiple alignment. SCANPS also allows all pairwise comparisons to be made between a set of sequences and can estimate the statistical significance of the alignments. SCANPS has been used in the analysis of many protein families. For example, it was used to discover the similarity between PD-ECGF (Platelet derived endothelial cell growth factor) and TP (Thymidine Phosphorylase) [4]. The program was also used to find the similarity between E. coli diadenosine tetra-phosphatase and the protein Ser/Thr phosphatases [3].

Principal features of SCANPS

Availability

The SCANPS program has been used as a test bed for many studies, many of which are not yet published. When the work is published, I will try to clean up the source code and distribute it. Currently, I cannot be sure that the code will compile on all ANSI-C compilers, so for the time being I am making precompiled binaries available for Sun (SunOS 4.1.3) and Silicon Graphics (IRIX 5.2). I have access to a Silicon Graphics machine running IRIX 4.X, so if you want the programs for the older operating system, let me know.

The programs are available by anonymous ftp from geoff.biop.ox.ac.uk in the subdirectory programs/scanps. You can also reach this directory using a WWW browser such as Mosaic (URL=http://geoff.biop.ox.ac.uk). If you download the programs please send me a short email with your name, affiliation and address. I will add you to my user database and send you an email when the programs are updated and/or sources are made available.

Disclaimer

This package is distributed on an ``as is'' basis. There is no warranty whatsoever as to functioning, performance or effect on hardware or other software, express or implied. The author disclaims any implied warranties of merchantability or fitness for any particular purpose.

Citing SCANPS

The program itself has not been written up for a journal; however, the underlying algorithms are published in Barton (1993) [2]. If you use these programs in your work, please cite the paper and this manual ``SCANPS user guide, G. J. Barton, Oxford University, UK''. Thank you.

How to use SCANPS to search a protein database

Introduction

SCANPS may be used in two ways, simple and advanced. To use the simple method, you only need SCANPS, a query sequence, a SCANPS defaults file and a database file. To use the advanced features, you should ideally have created indexes for the database file using the programs simclean and id_pir3. Please see Section 11 for details.

This manual assumes that the installation has been performed correctly. If you experience problems running the programs, please see Section 11 and, if that fails, send me an email!

Typing ``scanps'' on its own gives a list of the legal command line arguments. Most of these are described with examples in the following sections. They are also listed in Section 14.5.

Introduction to Unix Pipes and redirection

Unix experts can skip this section!

SCANPS and associated programs make use of Unix standard input and output to simplify data processing. This brief introduction should help those who are either novices at Unix, or have never made use of pipes (|) and redirection (< >).

Unix programs read instructions from standard input (stdin) and write output to standard output (stdout). By default, standard input is the keyboard and standard output is the screen. A feature of Unix is that you can redirect the input to come from a file by appending the '<' character, and redirect output by appending the '>' character.

For example: let us suppose we have a program called ``garbage'' which we would like to have read data from a file called ``in.dat'' and write the results to a file called ``out.dat''. We could type:



garbage < in.dat > out.dat

Unix also includes the pipe character '|'. This allows the output of one program to be directed to the input of another. For example, suppose we want to process the output of the ``garbage'' program using another program called ``cleaner'', then one way to do this would be to type:



garbage < in.dat > out.dat

cleaner < out.dat > cleaned.out

This saves the results of the ``garbage'' program in the file out.dat, then takes the out.dat file as input to the ``cleaner'' program, finally saving the results to the file cleaned.out.

A neater solution using a pipe does away with the need for the out.dat file altogether:



garbage < in.dat | cleaner > cleaned.out

Simple Scanning - just returns top score for each sequence

There are four steps to simple database scanning using SCANPS:

  1. Perform the scan, either using a single sequence or a multiple sequence alignment. Save the score and identifier of each sequence in the database.

  2. Sort the results of the scan into descending order.

  3. Inspect the sorted file to decide how many of the high scoring proteins we are interested in. Extract the sequences for these high scoring proteins.

  4. Run scanps again, this time reading sequences from the high scoring sequence file and generate all alignments down to some threshold score.

For example, we can scan with the SH2 domain from src. The sequence data file should look something like this:



>TVHUSC_SH2
src SH2 domain
WYFGKITRRESERLLLNAENPRGTFLVRES
ETTKGAYCLSVSDFDNAKGLNVKHYKIRKL
DSGGFYITSRTQFNSLQQLVAYYSKHADGL
CHRLTTV*

This is standard NBRF-PIR format. SCANPS expects to find a ``>'' symbol followed by an identifier code, then on the NEXT line a title, in this case ``src SH2 domain'', then the one letter amino acid code terminated by a star ``*''. Note that the amino acid sequence MUST be in uppercase, but any number of characters per line is allowed. scanps ONLY reads alphabetic characters and IGNORES spaces, dots, or numbers. Non-legal amino acid codes are read as ``X'' (e.g. the letter O).
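The reading rules just described can be sketched in a few lines of Python. This is only an illustration and not the code used by scanps: the function name parse_query is hypothetical, and the set of legal codes (the 20 standard residues plus B, Z and X) is an assumption. How scanps itself treats lowercase letters is not stated here; the sketch simply maps them to ``X'' as well.

```python
# Assumption: legal codes are the 20 standard residues plus B, Z and X.
LEGAL_CODES = set("ACDEFGHIKLMNPQRSTVWYBZX")


def parse_query(text):
    """Return (identifier, title, sequence) for one '>' entry."""
    lines = text.strip().splitlines()
    if len(lines) < 3 or not lines[0].startswith(">"):
        raise ValueError("expected a '>' line, a title line and a sequence")
    identifier = lines[0][1:].strip()
    title = lines[1].strip()
    residues = []
    for line in lines[2:]:
        for char in line:
            if char == "*":  # the star terminates the sequence
                return identifier, title, "".join(residues)
            if char.isalpha():  # spaces, dots and digits are ignored
                # anything outside the assumed legal set becomes X
                residues.append(char if char in LEGAL_CODES else "X")
    raise ValueError("sequence is not terminated by '*'")


SH2_EXAMPLE = """>TVHUSC_SH2
src SH2 domain
WYFGKITRRESERLLLNAENPRGTFLVRES
ETTKGAYCLSVSDFDNAKGLNVKHYKIRKL
DSGGFYITSRTQFNSLQQLVAYYSKHADGL
CHRLTTV*
"""

identifier, title, sequence = parse_query(SH2_EXAMPLE)
print(identifier, title, len(sequence))
```

Applied to the SH2 example, the sketch returns a sequence of 97 residues, which agrees with the query numbering (1 to 97) shown in the alignments later in this guide.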

Scan with sequence using default parameters and database

To do this, type:



scanps -ssh2.seq > sh2.scan

The -s tells the program to read the query sequence from the file sh2.seq. The results of the scan will be saved in the file sh2.scan. On a Silicon Graphics R4000 ``Indy SC'' using the PIR38 database which contains 61,248 sequences this scan takes about 580 seconds. On a Sun SPARCstation 2 the scan takes about 3-4 times longer.

Sort the result

To do this, you can use the Unix sort utility as follows:



sort +0 -1 -n -r < sh2.scan > sh2.sorted

If you want to understand what the +0 -1 -n -r means, please consult the Unix sort man pages (i.e. type: man sort). On systems whose sort no longer accepts the +0 -1 form, the equivalent key specification is -k1,1. The sort is very quick.

You can avoid saving the sh2.scan file by piping the output of scanps directly into sort. For example:



scanps -ssh2.seq | sort +0 -1 -n -r > sh2.sorted
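If the Unix sort utility is not to hand, the same ordering can be sketched in Python. This is a minimal illustration, assuming each line of the scan output holds a score followed by an identifier (as in the score/ID listings in this guide); the function name sort_scan is hypothetical. The order of entries with equal scores may differ from that produced by sort.

```python
def sort_scan(lines):
    """Return (score, identifier) pairs sorted by descending numeric score."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs.append((float(fields[0]), fields[1]))
    # numeric comparison on the score only, highest first
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


unsorted_scan = ["458 OKFVYR", "497 TVHUSC", "", "492 TVCHS"]
for score, identifier in sort_scan(unsorted_scan):
    print("%g %s" % (score, identifier))
```

Slicing the returned list (for example sort_scan(lines)[:15]) keeps only the highest scoring entries.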

Examine the hit list

Examination of the top of the sorted output file shows the highest scoring hits to the query sequence. You must now inspect this file and decide on how many of the top scoring sequences you would like to examine by alignment. If you use the default parameters (PAM250 and penalty of 8) then scores below 90 are often uninteresting. However, this is not an absolute rule and each scan will require careful scrutiny of the score list. It is usually better to include a lot of sequences at this stage since ``interesting'' matches may emerge even for low scores.

There are no programs supplied with scanps to help you look at the ``.sorted'' file. You must use the Unix tools ``more'' or ``head'' to inspect and extract the interesting parts of the file. Or you could use your favourite text editor (vi, emacs, jot, pico etc).

In order to keep this guide to manageable length, I will illustrate the following sections using only the top 15 scoring sequences. In practice, the top 150 or so in this scan would be worth looking at. To get the top 15 sequence scores into a file you could type:



head -15 < sh2.sorted > sh2.top15

This saves the top 15 score/ID pairs in a file called sh2.top15. Here it is:



497 A43610
497 TVHUSC
492 TVCHS
492 TVFV60
492 TVFVPR
492 TVFVS2
490 TVFVS1
488 TVFVMT
474 B34104
473 A34104
458 OKFVYR
458 S15582
458 S20808
456 TVFVR
443 S20676

Unless you know the identifier codes, this is pretty unhelpful. If you have built an indexed database, then it is easy to get the titles of these sequences back using the program ``sortsco'', see Section 6.3.3. For now, we can extract the sequences that correspond to these protein identifiers using the program ``select''.

Extracting the sequences from the database



select
Program S E L E C T

Extracts sequences from PIR database

Author: G. J. Barton (1990)
Maximum Allowed Sequence Length: 8000
Maximum Allowed Number of Sequences: 2000

Enter name of file containing SCORE ID pairs: sh2.top15

Opening File: sh2.top15


Opening File: /data/pir/pir38.seq

Just Extract Identifiers/titles (no sequences) ?[Y/N]: 

Enter Output Filename: sh2.top15.seq

Opening File: sh2.top15.seq

Searching for: 15 Sequences
1 A34104
2 A43610
3 B34104
4 OKFVYR
5 S15582
6 S20676
7 S20808
8 TVCHS
9 TVFV60
10 TVFVMT
11 TVFVPR
12 TVFVR
13 TVFVS1
14 TVFVS2
15 TVHUSC
Found: S20676     1
Found: S20808     2
Found: S15582     3
Found: A34104     4
Found: B34104     5
Found: A43610     6
Found: TVHUSC     7
Found: TVCHS     8
Found: TVFV60     9
Found: TVFVMT    10
Found: TVFVPR    11
Found: TVFVR    12
Found: OKFVYR    13
Found: TVFVS2    14
Found: TVFVS1    15
Extracted: 15 Sequences

You supply the name of the file containing the score/ID pairs (sh2.top15), then the name of a file in which to save the sequences (sh2.top15.seq). select then lists the identifiers it is searching for and, as they are found in the database, lists them to the screen again. The sequences are saved in the output file in the same order as they are shown in the sh2.top15 file.

If you have access to a more sophisticated database program, then you may prefer to use that to extract the sequences. For example, the program ``sortsco'' works much faster than ``select'' since it makes use of indexing; see Section 6.3.3 for details.

If you have not set the environment variables for the database file, then the program ``select'' will prompt you for the database filename.
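For readers who want to see what the extraction step amounts to, here is a minimal Python sketch. It is not the ``select'' program: the function names are hypothetical, and it assumes the database uses the same layout as the query file (a ``>'' identifier line, a title line, then the sequence). Entries are returned in the order of the score/ID list.

```python
def read_ids(score_id_lines):
    """Collect identifiers from 'score identifier' lines, keeping their order."""
    ids = []
    for line in score_id_lines:
        fields = line.split()
        if len(fields) >= 2:
            ids.append(fields[1])
    return ids


def extract_entries(database_lines, wanted_ids):
    """Return the text of each wanted entry, ordered as in wanted_ids."""
    wanted = set(wanted_ids)
    entries = {}
    current = None
    for raw in database_lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):  # assumed: '>' starts a new entry
            identifier = line[1:].strip()
            current = identifier if identifier in wanted else None
            if current is not None:
                entries[current] = []
        if current is not None:
            entries[current].append(line)
    # identifiers that were not found in the database are silently skipped
    return ["\n".join(entries[name]) for name in wanted_ids if name in entries]


toy_database = [">AAA", "first", "ACD*", ">BBB", "second", "WYF", "GK*",
                ">CCC", "third", "MM*"]
for entry in extract_entries(toy_database, read_ids(["497 CCC", "492 AAA"])):
    print(entry)
```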

Generating the local alignments

We now have the 15 top scoring sequences in a file called sh2.top15.seq. We can re-run scanps on this file to generate the alignments.

Since scanps is able to find many local alignments between the query sequence and the database, it is necessary to set a cutoff score; otherwise you will output thousands of insignificant alignments in addition to those that are useful. A suitable value for the cutoff score will depend on the search you have completed, but values of 80-100 make a good starting point.

In order to illustrate the NALL alignment feature of scanps I have added the sequence ``S01966 GTPase-activating protein - bovine'' in place of the 15th sequence (TVFVS1).

For example we can now type:



scanps -ssh2.seq -a1 -c90 -d < sh2.top15.seq > sh2.top15.alig

We are using the sh2.seq sequence to scan the sh2.top15.seq file. The -a1 means ``generate alignments'', the -c90 sets the cutoff score to 90 and the -d means read the database from standard input - in this example, the file ``sh2.top15.seq''.

The output of this command looks like this:


---------------------------
Comparison with: TVHUSC protein-tyrosine kinase (EC 2.7.1.112) src - human    538 Residues
Raw Score: 497.0 TVHUSC Allen: 97 Score/Allen: 5.123711
      **************************************************
    1 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL    50
  151 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL   200

      ***********************************************
   51 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTV    97
  201 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTV   247
---------------------------


         13 BORING ALIGNMENTS DELETED


---------------------------
Comparison with: S01966 GTPase-activating protein - bovine   1046 Residues
Raw Score: 171.0 S01966 Allen: 90 Score/Allen: 1.900000
      ** **. *  .*  * .* .. *..*.***.   *.. . .* * .   .
    1 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL    50
  178 WYHGKLDRTIAEERLRQAGKS GSYLIRESDRRPGSF V LS FLSQTNV   223

       *.*..*  .  *..** .* .*.** .*..*** *   *
   51 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGL    90
  224  VNHFRIIAM CGDYYIGGR RFSSLSDLIGYYS HVSCL   259

Raw Score: 130.0 S01966 Allen: 86 Score/Allen: 1.511628
      *. ***...*.  **.   ..  .**** *. * * * *    *  .   
    1 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL    50
  348 WFHGKISKQEAYNLLMTVGQA CSFLVRPSDNTPGDYSL Y  F RTSE    391

      *....**    .  * . .*  .**. ...  * *.
   51 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKH    86
  392 NIQRFKICPTPNNQFMMGGRY YNSIGDIIDHYRKE   426

Each local alignment is shown with the Raw Score for the alignment, the length of the alignment and the score/length (this value is not actually very useful and will be removed in future versions of the program).

Stars highlight identities and dots show positions that give positive scores in the pair score matrix that is being used. The match with S01966 illustrates the ability of SCANPS to find multiple hits to the same sequence. Lowering the cutoff score would find more alignments, but they would be unlikely to be significant.
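The ``Raw Score'' header lines are easy to pick apart if you want to post-process an alignment file. The following Python sketch is an illustration only: the regular expression is inferred from the example output above, ``Allen'' is taken to be the alignment length, and the function name parse_score_line is hypothetical.

```python
import re

# Pattern inferred from lines such as:
#   Raw Score: 497.0 TVHUSC Allen: 97 Score/Allen: 5.123711
SCORE_LINE = re.compile(
    r"Raw Score:\s*([\d.]+)\s+(\S+)\s+Allen:\s*(\d+)\s+Score/Allen:\s*([\d.]+)"
)


def parse_score_line(line):
    """Return a dict describing one 'Raw Score' line, or None if no match."""
    match = SCORE_LINE.search(line)
    if match is None:
        return None
    return {
        "score": float(match.group(1)),
        "identifier": match.group(2),
        "length": int(match.group(3)),
        "score_per_length": float(match.group(4)),
    }


record = parse_score_line("Raw Score: 171.0 S01966 Allen: 90 Score/Allen: 1.900000")
# Score/Allen is simply the raw score divided by the alignment length
print("%.6f" % (record["score"] / record["length"]))
```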

Generating the local alignments - estimating significance

You can estimate the statistical significance of the local alignments by adding the -F1 option to the scanps command. The numbers that are produced are only ``true'' probabilities when used with the PAM250 matrix and gap penalty of 8 (this will change in later releases). A paper describing the method by which the probabilities are estimated is in preparation.

For example:



scanps -ssh2.seq -a1 -c90 -d -F1 < sh2.top15.seq > sh2.top15.alig.prob

Inspection of the sh2.top15.alig.prob file shows the alignments now include a ``probability'' value. These are all small numbers for these alignments. The S01966 alignments are shown here:



Comparison with: S01966 GTPase-activating protein - bovine   1046 Residues
Raw Score: 171.0 S01966 Allen: 90 Score/Allen: 1.900000
Probability: 8.6301e-18
      ** **. *  .*  * .* .. *..*.***.   *.. . .* * .   .
    1 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL    50
  178 WYHGKLDRTIAEERLRQAGKS GSYLIRESDRRPGSF V LS FLSQTNV   223

       *.*..*  .  *..** .* .*.** .*..*** *   *
   51 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGL    90
  224  VNHFRIIAM CGDYYIGGR RFSSLSDLIGYYS HVSCL   259

Raw Score: 130.0 S01966 Allen: 86 Score/Allen: 1.511628
Probability: 3.133e-11
      *. ***...*.  **.   ..  .**** *. * * * *    *  .   
    1 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL    50
  348 WFHGKISKQEAYNLLMTVGQA CSFLVRPSDNTPGDYSL Y  F RTSE    391

      *....**    .  * . .*  .**. ...  * *.
   51 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKH    86
  392 NIQRFKICPTPNNQFMMGGRY YNSIGDIIDHYRKE   426

The probability values can be useful when comparing alignments of very different length. Short alignments will normally be expected to have lower scores than long alignments. Simply ranking on the Raw Score takes no account of this fact.

Advanced Scanning - apply the NALL algorithm to the database

Introduction

The simple method of scanning that has been described in the previous section is relatively fast and allows you the flexibility to generate local alignments for those proteins you think will be interesting from an inspection of the hit list. The advanced scanning option allows the NALL local alignment algorithm to be applied during the scan. This means that if the program finds that a protein shows similarity to multiple regions of a database protein, then all these regions will be reported. This approach has the drawback that it can generate very large numbers of ``hits'' and these can be uninteresting because they are due to matches between repetitive sequences or hydrophobic runs. NALL scanning is also slower, though only by about a factor of 3.

The length dependent statistics are used to screen out alignments that we would expect to see by chance. Furthermore, a minimum alignment length threshold can be set to improve the scan speed.

If you are interested in short ungapped alignments, then it is best to use the program BLAST from NCBI. BLAST is much faster than SCANPS and is highly tuned for the ungapped alignment problem. If you are interested in alignments much over 50 residues in length, then scanps may offer some advantages. You can get ungapped alignments from scanps by setting the gap penalty very high (e.g. -p100). Such alignments should agree reasonably well with BLAST output that used the same pairscore matrix.

In order to process the results of a NALL scan properly, you must have built the indexed databases and be able to run the sortsco program (Section 11).

Running the NALL scan

To run the sh2.seq scan but with the NALL algorithm we can type:



scanps -ssh2.seq -a1 -F1 -l40 -n > sh2.all.scan
Minimum score set to: 63.000000 Length: 40 Probability: 0.000100
Grand Total of Paths Considered: 237

The new arguments to scanps are -l40 which specifies a minimum length of 40 for alignments and -n which stops the -a1 option from displaying alignments. The default probability cutoff is set in the SCANPSDEFAULTS file, but may be overridden by the -g command line option. For example:


scanps -ssh2.seq -a1 -g0.000001 -F1 -l40 -n
Minimum score set to: 79.000000 Length: 40 Probability: 0.000001

Note that with the smaller probability threshold, the minimum score that will be considered has increased from 63 to 79. This will reduce the scan time.

The scan with default probability threshold, again on the PIR38 database, takes 877 seconds on the Indy R4000 SC. The scan considered 237 alignments to be within the probability threshold.

Running sortsco to sort the output and get titles

The program sortsco has a number of possible arguments. By default it expects a file that contains the results of a NALL scan. To sort the result and append titles to each hit, type:



sortsco -t < sh2.all.scan > sh2.all.scan.sorted

Processing the results of the NALL scan

The result of the NALL scan once sorted looks like this. I have truncated the output to 80 characters and removed much of the file for brevity. See the file sh2.all.scan.sorted for the full output.



 497   97 1.1e-82  0 1    1   97  151  247 TVHUSC   protein-tyrosine kinase 
 497   97 1.1e-82  0 1    1   97  156  252 A43610   protein-tyrosine kinase 
 492   97 1.4e-81  0 1    1   97  148  244 TVFVS2   protein-tyrosine kinase 
 492   97 1.4e-81  0 1    1   97  148  244 TVFVPR   protein-tyrosine kinase 
 492   97 1.4e-81  0 1    1   97  148  244 TVFV60   protein-tyrosine kinase 
                                .
                                .
                                .
**  171   90 8.6e-18  0 2    1   90  178  259 S01966   GTPase-activating pro
                                .
                                .
                                .

 144   90 2.2e-13  0 2    1   87  110  198 A42031   hematopoietic cell phosp
 144   94 4.2e-13  0 1    1   94  127  213 TVHUA    protein-tyrosine kinase 
 140   93 1.7e-12  1 2    2   91   11   96 A40802   protein-tyrosine kinase 
 138   90 1.9e-12  0 2    1   87  110  198 A38189   tyrosine phosphatase=hSH
 138   90 1.9e-12  0 2    1   87  112  200 S17234   Protein-tyrosine-phospha
 138   90 1.9e-12  0 2    1   87  112  200 S20837   Protein-tyrosine-phospha
 139   91 2.5e-12  0 2    1   87  112  201 S27398   protein-tyrosine phospha
 139   91 2.5e-12  0 2    1   87  112  201 A46209   SH2-containing phosphoty
 139   91 2.5e-12  0 2    1   87  112  201 S31767   protein-tyrosine phospha
 139   91 2.5e-12  0 2    1   87  112  201 A47244   SH-PTP2=SH2-containing p
 139   91 2.5e-12  0 2    1   87  112  201 A46210   phosphotyrosine phosphat
 136   89 3.9e-12  0 1    1   89  271  352 TVFFA    protein-tyrosine kinase 
 139   99 4.5e-12  0 1    1   97  603  693 TVHUVV   transforming protein (va
 136   92 7.1e-12  0 2    1   91  111  195 A43254   protein tyrosine phospha
 135   91 1.0e-11  1 2    1   90    6   88 S27398   protein-tyrosine phospha
 135   91 1.0e-11  1 2    1   90    6   88 A47244   SH-PTP2=SH2-containing p
 135   91 1.0e-11  1 2    1   90    6   88 A46209   SH2-containing phosphoty
 135   91 1.0e-11  1 2    1   90    6   88 A46210   phosphotyrosine phosphat
 135   91 1.0e-11  1 2    1   90    6   88 S31767   protein-tyrosine phospha
 116   43 1.3e-11  0 1    1   43   13   53 B45022   CRK-I - human
 116   43 1.3e-11  0 1    1   43   13   53 A45022   CRK-II - human
 134   96 2.5e-11  0 1    1   91  434  524 C46243   GRB-7=epidermal growth f

**  130   86 3.1e-11  1 2    1   86  348  426 S01966   GTPase-activating pro

 113   43 3.6e-11  0 1    1   43   44   84 A46243   GRB-3=epidermal growth f
 129   86 4.4e-11  1 2    1   86  174  252 B40121   GTPase-activating protei
 129   86 4.4e-11  1 2    1   86  351  429 A40121   GTPase-activating protei
 130   98 9.9e-11  1 2    1   97    4   93 A42031   hematopoietic cell phosp
 128   98 1.9e-10  1 2    1   97    6   95 S20837   Protein-tyrosine-phospha
 128   98 1.9e-10  1 2    1   97    6   95 S17234   Protein-tyrosine-phospha
 124   93 4.2e-10  1 2    2   92   11   97 A44266   ZAP-70=70 kda protein-ty
 122   89 4.7e-10  1 2    1   88    6   86 A43254   protein tyrosine phospha
 124   98 7.3e-10  1 2    1   97    4   93 A38189   tyrosine phosphatase=hSH
                                .
                                .
                                .

Two lines are shown with ``**'' at the start. These stars do not appear in the output file but are here to draw your attention to the lines for discussion below.

There are 11 columns of information in this file.

Column 1:

This is the raw score for the local alignment, i.e. the sum of the pairscore matrix values for the alignment, less the gap penalty times the number of gaps.

Column 2:

This is the length of the local alignment. Simply the length including the gaps.

Column 3:

The probability calculated using the length dependent statistics. The output is sorted into increasing probability order.

Column 4:

The rank of the alignment in the comparison with this database sequence. This number is 0 if this is the highest scoring alignment with the database sequence, 1 if the second highest, 2 if the third and so on.

Column 5:

This shows how many local alignments are found with this database sequence. For example, if Columns 4 and 5 show values of ``0 7'', then this line is giving statistics on the highest ranked alignment out of 7 found. ``2 7'' would be the third ranked alignment with the database sequence.

Columns 6 and 7

These indicate the starting and ending residues from the query sequence of the fragment that is aligned.

Columns 8 and 9

These show the starting and ending residues of the section of database sequence that is aligned to the query.

Column 10

The identifier code for the database sequence.

Column 11

The title line for the database sequence. This is not truncated.

The first line highlighted by ``**'' shows a score between the query and the database sequence S01966 of 171 for a length of 90 residues. The probability is 8.6e-18 and this is the highest scoring alignment of two that are found with the database protein. The alignment is from residue 1 to 90 of the query and 178 to 259 of the database sequence.

If we look further down the file, we can see the second match to S01966. This scores 130 with a length of 86, probability of 3.1e-11. The region matched is 348-426.
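The 11 columns described above can be read into named fields with a short Python sketch. This is an illustration that assumes whitespace-separated columns, as in the listing above; the names Hit and parse_hit are hypothetical. Remember that the ``**'' markers are not present in the real file.

```python
from collections import namedtuple

Hit = namedtuple(
    "Hit",
    "score length probability rank n_alignments "
    "query_start query_end db_start db_end identifier title",
)


def parse_hit(line):
    """Split one line of sorted NALL output into its 11 columns."""
    fields = line.split(None, 10)  # the title (column 11) may contain spaces
    if len(fields) < 10:
        raise ValueError("expected at least 10 columns: %r" % line)
    title = fields[10].rstrip() if len(fields) > 10 else ""
    return Hit(
        float(fields[0]), int(fields[1]), float(fields[2]),
        int(fields[3]), int(fields[4]),
        int(fields[5]), int(fields[6]),
        int(fields[7]), int(fields[8]),
        fields[9], title,
    )


hit = parse_hit(" 171   90 8.6e-18  0 2    1   90  178  259 S01966   "
                "GTPase-activating protein - bovine")
print(hit.identifier, hit.rank, hit.n_alignments, hit.db_start, hit.db_end)
```

Grouping the parsed records by identifier is then a convenient way to collect all the local alignments reported for one database sequence.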

Further analysis of the NALL output

The program ``sortsco'' allows the sequence fragments that are aligned to the query to be output. It also permits the fragments to be extended to the N and C termini by a predefined percentage. The program also allows the top N hits to be output, or all those that score above a cutoff. Further analysis tools are under development. Just type ``sortsco'' to see the program options.

sortsco will also read a file of ID codes to allow these sequences or just the titles to be extracted from the database. The program will read the output of the standard scanps scan (i.e. not using the NALL method) and sort the results if you do not want to use the Unix sort utility. Normally sortsco is a little slower at the sort than the Unix sort utility.

Scanning with an Alignment

If you have a multiple sequence alignment in AMPS blockfile format, then you can scan with this using the command:



scanps -btest.blc -a0 > test.scan

The commands and operations are exactly the same as for using a sequence file. Alignments will only show the FIRST sequence from the block file with pre-existing gaps shown as dashes ``-'' rather than spaces `` ''. Note that the length dependent statistics cannot be used with a block file scan.

Alignments in GCG .MSF format or CLUSTAL PIR format can be converted to block file format using the programs ``msf2blc'' and ``clus2blc'' which are distributed with the ALSCRIPT and ASSP program packages. Alternatively, you could generate your alignment using the AMPS package. All these programs are distributed from our ftp server (geoff.biop.ox.ac.uk - please see the README file).

Comparing all pairs of sequences

This feature is not fully developed, but it is useable (and useful!). For pairwise comparisons, the .seq file MUST NOT contain any non-amino acid characters or spaces in the sequence part of the file.

Having checked this, you must first create a copy of the .seq file (call this .sec). The .sec file could contain secondary structure definitions for the protein, or any other characters that you want to align with the sequences. Check that your SCANPS defaults file has the value of MAX_NSEQ set greater than the number of sequences in your sequence file, then for example, for the file test.seq type:



scanps -stest.seq -ttest.sec -T

This gives the score for each pair comparison to stdout. You could redirect the output to a file.



553 HAJUA HAHOD
543 HAJUA HAHOK
475 HAJUA HAKOAW
481 HAJUA HAJSA
461 HAJUA HAFEDR
261 HAJUA HBOTE
646 HAHOD HAHOK
490 HAHOD HAKOAW
502 HAHOD HAJSA
471 HAHOD HAFEDR
306 HAHOD HBOTE
484 HAHOK HAKOAW
490 HAHOK HAJSA
461 HAHOK HAFEDR
292 HAHOK HBOTE
587 HAKOAW HAJSA
433 HAKOAW HAFEDR
269 HAKOAW HBOTE
439 HAJSA HAFEDR
274 HAJSA HBOTE
307 HAFEDR HBOTE

Each line of the output shows the score and the corresponding pair of ID codes.

Pairwise comparisons may also be performed using the NALL method. Currently, this only works if you also request probability scores. For example:



scanps -stest.seq -ttest.sec -T -a1 -F1

gives ...



7.2765e-88 HAJUA HAHOD
1.0265e-85 HAJUA HAHOK
2.5983e-71 HAJUA HAKOAW
1.4448e-72 HAJUA HAJSA
2.1362e-68 HAJUA HAFEDR
2.0553e-29 HAJUA HBOTE
3.4899e-108 HAHOD HAHOK
1.868e-74 HAHOD HAKOAW
5.5253e-77 HAHOD HAJSA
1.7759e-70 HAHOD HAFEDR
1.2143e-37 HAHOD HBOTE
3.3974e-73 HAHOK HAKOAW
1.868e-74 HAHOK HAJSA
2.1362e-68 HAHOK HAFEDR
4.8684e-35 HAHOK HBOTE
3.1609e-95 HAKOAW HAJSA
1.2637e-62 HAKOAW HAFEDR
7.5967e-31 HAKOAW HBOTE
7.4395e-64 HAJSA HAFEDR
9.5138e-32 HAJSA HBOTE
7.8891e-38 HAFEDR HBOTE

You can also get the alignments corresponding to these pair comparisons by adding the -v command line argument.



scanps -stest.seq -ttest.sec -T -a1 -F1 -v

The output of this comparison will include the characters from the .sec file aligned along with the sequences.

The final option in pairwise mode is to output the scores in a form that can be analysed by the cluster analysis program ``oc''. To produce suitable output, simply add a -X to the command line.

For example:



scanps -stest.seq -ttest.sec -T -X

for raw scores or:

scanps -stest.seq -ttest.sec -T -a1 -E -F1 -X

for probabilities

The -E option is necessary to prevent scanps from writing all local alignment scores. For cluster analysis you only need the top scoring alignment.

Cluster analysis with program ``oc''

Program oc is a general purpose cluster analysis program. It implements three simple methods for hierarchical clustering and for sequence data will show the overall sub-grouping of the sequences. Although one output from ``oc'' is a dendrogram or tree, the program should not be used alone to estimate phylogeny.

Typing ``oc'' shows the options:



Cluster analysis program

Usage: oc <sim/dis> <single/complete/means> <ps> <cut N>

Version 1.0 - Requires a file to be piped to standard input
Format:  Line   1:  Number (N) of entities to cluster (e.g. 10)
Format:  Lines 2 to 2+N-1:  Identifier codes for the entities (e.g. Entity1)
Format:  N*(N-1)/2:  Distances, or similarities - ie the upper diagonal

Options:
sim = similarity /  dis = distances
method = single/complete/means
ps <file> = plot out dendrogram to <file.ps> 
log = take logs before calculation 
cut = only show clusters above/below the cutoff
id = output identifier codes rather than indexes for entities
timeclus = output times to generate each cluster
amps <file> = produce amps <file>.tree and <file>.tord files
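The input layout listed in the usage message can be sketched in Python as follows. The -X option of scanps is the supported way to produce input for oc; this sketch only illustrates the layout. It assumes that the values are written one per line and row by row across the upper diagonal, which is not stated explicitly above, and the function name pairs_to_oc_input is hypothetical.

```python
def pairs_to_oc_input(pair_lines):
    """Build oc-style input text from 'score id1 id2' lines."""
    ids = []
    values = {}
    for line in pair_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        value, first, second = fields
        for name in (first, second):
            if name not in ids:
                ids.append(name)
        values[frozenset((first, second))] = value
    out = [str(len(ids))] + ids  # line 1: N, then the N identifier codes
    for i in range(len(ids)):  # assumed order: upper diagonal, row by row
        for j in range(i + 1, len(ids)):
            out.append(values[frozenset((ids[i], ids[j]))])
    return "\n".join(out) + "\n"


print(pairs_to_oc_input(["553 HAJUA HAHOD",
                         "543 HAJUA HAHOK",
                         "646 HAHOD HAHOK"]), end="")
```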

Usually, complete linkage cluster analysis gives the most interpretable results. To run oc on a data file, perhaps the output of a scanps pairwise comparison run that just includes raw scores, type:



oc sim complete ps test id < test.ocin > test.ocout

sim tells oc to work in similarity mode. This means that as numbers in the input file get bigger, they mean that the objects being compared are more similar. The alternative is distance mode, (dis) where smaller numbers mean greater similarity.

complete refers to the method of cluster analysis. This is a little difficult to explain without a diagram or equations (maybe in the next manual), but in brief: complete linkage joins clusters only if all members of both clusters are similar to each other at or above a given level of similarity; single linkage joins clusters if at least one pair between the clusters is similar; means joins the clusters on the basis of the mean similarity between the clusters.

ps test asks ``oc'' to draw a dendrogram. This will be stored in the file ``test.ps''. This is a PostScript file and can be printed on a PostScript printer, or viewed using GhostScript/GhostView. Currently, the dendrogram does not have a proper axis but just shows max and min values found for joining clusters.

id asks for ID codes rather than numbers to be output to indicate the clusters.

The output of this comparison is shown here:



## 0 646 2
 HAHOD HAHOK
## 1 587 2
 HAKOAW HAJSA
## 2 543 3
 HAJUA HAHOD HAHOK
## 3 475 5
 HAJUA HAHOD HAHOK HAKOAW HAJSA
## 4 433 6
 HAFEDR HAJUA HAHOD HAHOK HAKOAW HAJSA
## 5 261 7
 HBOTE HAFEDR HAJUA HAHOD HAHOK HAKOAW HAJSA

Each line starting with ``##'' shows the cluster number (starting at 0), the score at which all members of the cluster are similar, and the number of members in the cluster. The line following the ``##'' shows the ID codes of the members of each cluster.
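To make the idea of complete linkage concrete, here is a small Python sketch applied to the raw pairwise scores listed earlier. It is a plain textbook version of complete linkage in similarity mode, not the code of oc, and the function names are hypothetical. At each step it joins the two clusters whose least similar pair of members has the highest score.

```python
PAIR_SCORES = """\
553 HAJUA HAHOD
543 HAJUA HAHOK
475 HAJUA HAKOAW
481 HAJUA HAJSA
461 HAJUA HAFEDR
261 HAJUA HBOTE
646 HAHOD HAHOK
490 HAHOD HAKOAW
502 HAHOD HAJSA
471 HAHOD HAFEDR
306 HAHOD HBOTE
484 HAHOK HAKOAW
490 HAHOK HAJSA
461 HAHOK HAFEDR
292 HAHOK HBOTE
587 HAKOAW HAJSA
433 HAKOAW HAFEDR
269 HAKOAW HBOTE
439 HAJSA HAFEDR
274 HAJSA HBOTE
307 HAFEDR HBOTE
"""


def read_similarities(text):
    """Return (ids, sim) where sim maps frozenset({a, b}) to a score."""
    ids, sim = [], {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        sim[frozenset(fields[1:])] = float(fields[0])
        for name in fields[1:]:
            if name not in ids:
                ids.append(name)
    return ids, sim


def complete_linkage(ids, sim):
    """Return a list of (score, members), one item per join, in join order."""
    clusters = [[name] for name in ids]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: two clusters are only as similar as
                # their least similar pair of members
                link = min(sim[frozenset((a, b))]
                           for a in clusters[i] for b in clusters[j])
                if best is None or link > best[0]:
                    best = (link, i, j)
        link, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append((link, merged))
    return merges


ids, sim = read_similarities(PAIR_SCORES)
for number, (score, members) in enumerate(complete_linkage(ids, sim)):
    print("## %d %g %d" % (number, score, len(members)))
    print(" " + " ".join(members))
```

For these scores the sketch joins clusters at 646, 587, 543, 475, 433 and 261, the same levels as in the oc output above, although the order of the ID codes within a line may differ.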

oc will optionally accept a cutoff score. If a cutoff is given, only clusters that score above (or below in distance mode) the score will be output. This can be useful for filtering comparisons of very large numbers of sequences.

The PostScript tree is shown in the file test.ps.

Other functions of program oc

``oc'' also allows .tord and .tree files for the AMPS multiple alignment program to be generated. See the AMPS documentation for an explanation of how to use the .tord and .tree files.

How to build an indexed database for use with scanps

Two programs are used to build the indexed database.

simclean takes the PIR .seq file and removes any blank space from the sequence part of the file. Each sequence entry is reduced to three lines and three return characters.

id_pir3 takes the cleaned up .seq file and generates two index files, .bin and .inx.

To run the two programs on a .seq file called ``pir1.seq'' type:



simclean < pir1.seq > pir1.clean

cp pir1.seq pir1.seq.safe

cp pir1.clean pir1.seq

id_pir3 pir1.seq pir1.bin pir1.inx

If this all works, you should have three files pir1.seq, pir1.bin and pir1.inx.
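
To make the cleaning step concrete, here is a rough Python sketch of the kind of transformation described for simclean: white space is removed from the sequence part and each entry is written as three lines. It is an approximation based only on the description above, not a replacement for simclean; the function name clean_pir and the short test sequences are made up, and the sketch assumes every entry ends with a ``*''.

```python
def clean_pir(text):
    """Collapse each entry to three lines: >ID, title, sequence*."""
    cleaned = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue  # skip anything outside an entry
        title = next(lines, "")
        parts = []
        for seq_line in lines:
            # drop all blanks and tabs from the sequence part
            parts.append("".join(seq_line.split()))
            if "*" in seq_line:
                break  # '*' terminates the sequence
        cleaned.append(line + "\n" + title + "\n" + "".join(parts) + "\n")
    return "".join(cleaned)


# Hypothetical two-entry input with blanks and wrapped sequence lines.
raw = (
    ">HAHU\n"
    "Hemoglobin alpha chain\n"
    "V L S P A D K\n"
    " T N V K *\n"
    ">HBHU\n"
    "Hemoglobin beta chain\n"
    "VHLTPEEK\n"
    "SAVTAL*\n"
)
print(clean_pir(raw), end="")
```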

These database files should be placed in a single directory and the environment variable GJNDBDIR set to the directory name. The environment variable GJNDBROOT should be set to the database name, in the example ``pir1''. In this way, multiple databases can reside in the same directory. If you want to scan using a different database, you just redefine the GJNDBROOT variable.

For example, if we want to use the database called ``brookhaven'', we'd just type:



setenv GJNDBROOT brookhaven

scanps and sortsco would then expect to find the files brookhaven.seq, brookhaven.inx and brookhaven.bin in the directory defined by GJNDBDIR.
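
A quick way to check the setup before scanning is to build the three expected file names from the two environment variables and see whether they exist. The Python sketch below only illustrates the naming convention described above; the function name database_files and the directory /data/db are hypothetical.

```python
import os


def database_files(env=None):
    """Return the .seq, .inx and .bin paths implied by GJNDBDIR/GJNDBROOT."""
    env = os.environ if env is None else env
    directory = env["GJNDBDIR"]   # directory holding all databases
    root = env["GJNDBROOT"]       # name of the database to scan
    return {
        ext: os.path.join(directory, root + "." + ext)
        for ext in ("seq", "inx", "bin")
    }


# Example with an explicit mapping instead of the real environment.
paths = database_files({"GJNDBDIR": "/data/db", "GJNDBROOT": "brookhaven"})
for ext, path in paths.items():
    status = "found" if os.path.exists(path) else "missing"
    print(ext, path, status)
```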

Installation of scanps, sortsco, select, id_pir3 and simclean

The distribution is in the form of a gzip compressed tar file. Executables for Sun (extension .sun) and Silicon Graphics (IRIX 5.X, extension .sgi) are included in the top directory. Documentation and example files are in the doc subdirectory.

  1. Unzip and detar the distribution, then create links from /usr/local/bin (or wherever you put local software executables) to the appropriate executable files for scanps, oc and select. Optionally do the same for id_pir3, simclean and sortsco.

  2. Set the environment variables, including the location of the scanps.def file, by editing the ``scanps.environment'' file.

  3. Edit the scanps.def file to define the correct locations of the matrix file and fit file on your system.

  4. Ensure that the scanps.environment file is sourced before running any of the programs in this package.

You can test that all programs are working simply by typing their names. Only simclean will produce no output.

Example database

I have included a small sequence database for testing purposes. This is in the examples subdirectory and is called ``protein''. If you set GJNDBDIR and GJNDBROOT appropriately, you can do a quick scan against this database to see how the program works before investing time in setting up current sequence databases. (In fact this database is PIR14, which contains 6,858 sequences; 1988 vintage, I think.) If I find some disk space I may make the sequences from the latest PIR database available with indexes on our ftp server.

Hints and tips for successful searching

File formats

Sequence file format

All the programs use the same format for storing sequences. This includes the database, the query and any sequences extracted by scanps or sortsco. The format is as follows:



>IDENTIFIER
TITLE LINE
one letter code in capitals terminated by *
>IDENTIFIER2
Title line
one letter code..... *
etc

This is the format of the NBRF-PIR database distributed for VAX. I use this format for historical reasons. If anyone can suggest which format is the most commonly used for database scanning, then I will support this. I guess that FASTA format as used by BLAST would be a good one to include...
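
For readers who want to handle these files in their own scripts, the format can be read with a few lines of Python. This is a minimal sketch based only on the layout shown above (a ``>'' identifier line, a title line, then residues ending in ``*''); the function name read_sequences and the short example sequences are made up for illustration.

```python
def read_sequences(text):
    """Return a list of (identifier, title, sequence) tuples."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        identifier = line[1:].strip()
        title = next(lines, "").strip()
        residues = []
        for seq_line in lines:
            residues.append(seq_line.strip())
            if seq_line.rstrip().endswith("*"):
                break  # '*' marks the end of this sequence
        sequence = "".join(residues).rstrip("*")
        records.append((identifier, title, sequence))
    return records


# Hypothetical two-entry file.
example = """>HAHU
Hemoglobin alpha chain
VLSPADKTNVKAAWG*
>HBHU
Hemoglobin beta chain
VHLTPEEKSAVTALWG*
"""
for identifier, title, sequence in read_sequences(example):
    print(identifier, len(sequence), title)
```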

Multiple alignment query format

This should be in AMPS block file format.


The minimum requirements for a block file for N aligned sequences are
1.   N  '>comment line(s)'
2.  '* iteration int'
3.  'N or more vertically aligned sequences'
4.  '*'

  1. The comment lines define the sequence identifiers, and the number of '>' characters preceding the first '* iteration int' line defines the number of sequences that are present in the sequence lines.

  2. This line specifies the beginning of the alignment to be read. The '*' character specifies the column in which the alignment begins. The 'iteration int' specifier identifies the particular alignment within this block_file.

    The format allows several alternative alignments to follow each other providing they are identified by a different iteration number (eg. 1,2,3). Currently, SCANPS only reads the first alignment. See the AMPS documentation for further details of alternative multiple alignments.

  3. The alignment is ended by a '*' character which should be in the same column as the '*' character that started the alignment.

Simple example:


This is a block file containing two alternative alignments of three sequences.
The comments that I am writing here may appear in the block file, but are
ignored when the file is read.  The only proviso is that no
'greater than' or 'star' characters may be present.

>first  this is sequence A
>second this is sequence B
>third  This is sequence C
* iteration 1
A  
A P
AVG
LLG
LCR
G
 PG
WWW
S	
*
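
The vertical layout can be confusing at first: each row is one alignment position and each column is one sequence. The Python sketch below reads the first alignment of a block file into one string per sequence, following the rules listed above. It is an illustration only, not the AMPS or SCANPS reader; the function name read_block_file is hypothetical, and treating a blank as a gap (written here as ``-'') is an assumption.

```python
def read_block_file(text):
    """Return {identifier: aligned sequence} for the first alignment."""
    lines = [line.expandtabs() for line in text.splitlines()]
    identifiers = []
    start = None
    for number, line in enumerate(lines):
        if line.startswith(">"):
            # the first word after '>' is taken as the identifier
            words = line[1:].split()
            identifiers.append(words[0] if words else "")
        elif "*" in line:
            start = number  # the '* iteration int' line
            break
    if start is None or not identifiers:
        raise ValueError("not a block file")
    column = lines[start].index("*")  # column where the alignment begins
    n_seq = len(identifiers)
    aligned = [[] for _ in range(n_seq)]
    for line in lines[start + 1:]:
        if line[column:column + 1] == "*":
            break  # closing '*' in the same column ends the alignment
        row = line[column:column + n_seq].ljust(n_seq)
        for k, residue in enumerate(row):
            aligned[k].append("-" if residue == " " else residue)
    return {name: "".join(seq) for name, seq in zip(identifiers, aligned)}


example = "\n".join([
    "Free-text comments may come first.",
    ">first  this is sequence A",
    ">second this is sequence B",
    ">third  This is sequence C",
    "* iteration 1",
    "A  ", "A P", "AVG", "LLG", "LCR", "G", " PG", "WWW", "S",
    "*",
])
for name, sequence in read_block_file(example).items():
    print(name, sequence)
```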

Matrix File format

This follows the conventions of the NBRF (PIR) programs.



line 1:	Title of matrix
line 2:	23 characters representing the one-letter codes and defining the order
	in which the matrix is stored
lines 3 to 25:	integers separated by spaces

SCANPS can also cope with matrices that have more or fewer than 23 characters in the matrix file.

The one letter code line is read and used as an index into the matrix that follows. For example:



Mutation Data Matrix (250 PAMs)
ARNDCQEGHILKMFPSTWYVBZX
  2 -2  0  0 -2  0  0  1 -1 -1 -2 -1 -1 -4  1  1  1 -6 -3  0  0  0  0
 -2  6  0 -1 -4  1 -1 -3  2 -2 -3  3  0 -4  0  0 -1  2 -4 -2 -1  0  0
  0  0  2  2 -4  1  1  0  2 -2 -3  1 -2 -4 -1  1  0 -4 -2 -2  2  1  0
  0 -1  2  4 -5  2  3  1  1 -2 -4  0 -3 -6 -1  0  0 -7 -4 -2  3  3  0
 -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3  0 -2 -8  0 -2 -4 -5  0
  0  1  1  2 -5  4  2 -1  3 -2 -2  1 -1 -5  0 -1 -1 -5 -4 -2  1  3  0
  0 -1  1  3 -5  2  4  0  1 -2 -3  0 -2 -5 -1  0  0 -7 -4 -2  2  3  0
  1 -3  0  1 -3 -1  0  5 -2 -3 -4 -2 -3 -5 -1  1  0 -7 -5 -1  0 -1  0
 -1  2  2  1 -3  3  1 -2  6 -2 -2  0 -2 -2  0 -1 -1 -3  0 -2  1  2  0
 -1 -2 -2 -2 -2 -2 -2 -3 -2  5  2 -2  2  1 -2 -1  0 -5 -1  4 -2 -2  0
 -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6 -3  4  2 -3 -3 -2 -2 -1  2 -3 -3  0
 -1  3  1  0 -5  1  0 -2  0 -2 -3  5  0 -5 -1  0  0 -3 -4 -2  1  0  0
 -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6  0 -2 -2 -1 -4 -2  2 -2 -2  0
 -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9 -5 -3 -3  0  7 -1 -5 -5  0
  1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6  1  0 -6 -5 -1 -1  0  0
  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  2  1 -2 -3 -1  0  0  0
  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -3  0  1  3 -5 -3  0  0 -1  0
 -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17  0 -6 -5 -6  0
 -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10 -2 -3 -4  0
  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4 -2 -2  0
  0 -1  2  3 -4  1  2  0  1 -2 -3  1 -2 -5 -1  0  0 -5 -3 -2  2  2  0
  0  0  1  3 -5  3  3 -1  2 -2 -3  0 -2 -5  0  0 -1 -6 -4 -2  2  3  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
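
The way the one-letter code line acts as an index can be shown with a short Python sketch. It reads a matrix in the layout described above and looks up a score for a pair of residues. The function name read_matrix is hypothetical, and the three-letter example is just the A/R/N corner of the matrix listed above, used to keep the example short.

```python
def read_matrix(text):
    """Return (title, alphabet, score) where score(a, b) looks up a pair."""
    lines = text.splitlines()
    title = lines[0].strip()
    alphabet = lines[1].strip()  # defines the row and column order
    rows = [[int(v) for v in line.split()] for line in lines[2:] if line.strip()]
    size = len(alphabet)
    if len(rows) != size or any(len(row) != size for row in rows):
        raise ValueError("matrix does not match the one-letter code line")
    index = {residue: i for i, residue in enumerate(alphabet)}

    def score(a, b):
        return rows[index[a]][index[b]]

    return title, alphabet, score


example = """Mutation Data Matrix (250 PAMs), A/R/N corner only
ARN
  2 -2  0
 -2  6  0
  0  0  2
"""
title, alphabet, score = read_matrix(example)
print(title)
print("A vs R:", score("A", "R"))
print("R vs R:", score("R", "R"))
```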

Defaults file format

SCANPS does not have hard wired limits for number or length of sequences etc. These are all defined in the defaults file. A defaults file has a series of keyword, value pairs. The defaults file must be defined by the SCANPSDEFAULTS environment variable.

For example:



MAX_NSEQ 500			
MAX_SEQ_LEN 7000
MAX_ID_LEN 30
MAX_TITLE_LEN 500
MAX_BLOC_SEQ 500
PEN 8
MIN_SCORE 0
OUTPUT_LENGTH 50
SCAN 0
PRECISION 100
PCUT 0.0001
MATRIX_FILE /home/geoff/gjb/md/md.mat
FIT_FILE /home/geoff/gjb/c/scanps/metro/new/fits.md.8.dat
RUN_SW_MIN 35

MAX_NSEQ defines the maximum number of sequences that may be read into the program. If you are just doing database scanning, then it is most efficient to set this to a small value - say 2 or 3.

MAX_SEQ_LEN The maximum allowed length for a sequence. Set this to something big. The program reallocates memory down to the actual length of the sequence.

MAX_ID_LEN The maximum length of an identifier for a sequence.

MAX_TITLE_LEN The maximum length for a sequence title.

MAX_BLOC_SEQ The maximum number of sequences allowed in a block file.

PEN The length dependent gap penalty. This can also be set as a command line argument (-p).

MIN_SCORE The minimum scoring alignment that will be output. This can be set from the command line (-c).

OUTPUT_LENGTH The number of characters per line for alignment output.

SCAN Set to 0 for fast method, 1 for NALL method. This can also be set from the command line (-a0, -a1).

PRECISION Set the numeric precision of the program. SCANPS does all calculations as integers. All numbers are multiplied by PRECISION before any operation. 100 is enough for most pairscore matrices. Making this value too big may cause integer overflow problems with long sequences.

PCUT Probability cutoff. Only alignments that give lower values of probability will be output. This can be set at the command line (-g). See the section on advanced scanning.

MATRIX_FILE The name of the file containing the pairscore matrix. This can be defined on the command line (-m).

FIT_FILE The file of length-dependent probability parameters. Currently there is only one. Soon there will be other files for alternative matrix/gap-penalty combinations.

RUN_SW_MIN In NALL scanning mode, scanps first does a fast Smith-Waterman comparison. If the score for the comparison is above this value, then the NALL method is applied to the sequence pair. If probability scoring is enabled, then this value is calculated from the probability and length cutoffs.
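
Because the defaults file is just a list of keyword and value pairs, it is simple to read from a script, for example to check which matrix file a particular setup will use. The Python sketch below assumes one pair per line separated by white space, as in the example above; the function name read_defaults is hypothetical, and nothing is assumed about comment lines or other syntax that SCANPS itself may accept.

```python
def read_defaults(text):
    """Parse 'KEYWORD value' lines into a dictionary."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or incomplete lines
        keyword, value = parts[0], parts[1].strip()
        for convert in (int, float):
            try:
                settings[keyword] = convert(value)
                break
            except ValueError:
                pass
        else:
            settings[keyword] = value  # keep file names as strings
    return settings


# A few of the lines from the example defaults file above.
example = """MAX_NSEQ 500
PEN 8
PCUT 0.0001
MATRIX_FILE /home/geoff/gjb/md/md.mat
"""
defaults = read_defaults(example)
for keyword, value in defaults.items():
    print(keyword, repr(value))
```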

Command line options

Typing ``scanps'' with no options will show you the following screen:



Options:

DATABASE SCANNING:
-s<file.seq> Query sequence file in PIR format [e.g. -shahu.seq]
-b<file.blc> Query multiple alignment in AMPS format [e.g. -bmyo.blc]
-d           Read the database from standard input
-F<file.fit> File of length dependent fit parameters [e.g. -Ffits.dat]
-F1          Turn on length dependent parameters defined in
                                        SCANPSDEFAULTS file
-g<Prob>     Set probability threshold (for use with -F [e.g. -g0.001]
-n           Work silently - do not show alignments
-m<file.mat> Define pair score matrix file (e.g. PAM250)
-p<N>        Define gap penalty e.g. -p8
-a<N>        Define mode: -a0 for top score only
                          -a1 for all local alignments
-c<N>        Define cutoff score. [e.g. -c80]
-l<N>        Define alignment length cutoff (only valid for -a1)
-o<file.seq> Define output file for sequence alignment fragments
             These can then be multiply aligned later using AMPS
-V<file.gap> Define a file of variable gap penalties
-G           Turn on variable gap penalties if no -V
-L<file.lk>  Read the look up table file
-D<file.lk>  Print out the look up table and variable gap penalties

PAIRWISE COMPARISONS:
-t<file.sec> Secondary structure file in PIR format [e.g. -thahu.sec]
-T           File defined with -t is not true sec struc.
-E           Only consider the top scoring alignment in pairwise mode
-Y           Do all pairs output down to threshold defined by -g
-y           Do all pairs output down to threshold defined by -g
                Also output start and end residues of each aligment.
-X           Produce output in a format suitable for program oc

Hopefully, most of this is self-explanatory. The options that are not discussed in the previous sections are:

-o When -a1 is set, this outputs the aligned fragments from the database file to the defined file in .seq file format.

-V, -G, -L and -D allow variable gap penalties and user-defined per-residue scoring schemes to be applied when scanning with a sequence or an alignment. I will document and describe these features in the next release.
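
Note that in the examples given in the option list each value is attached directly to its option letter (for example -p8 or -shahu.seq). If you drive scanps from a script, a small helper that builds the argument list avoids mistakes here. The Python sketch below uses only options from the list above; the function name scanps_args is hypothetical, and the sketch only builds the arguments without running the program.

```python
def scanps_args(query, matrix=None, penalty=None, mode=None, cutoff=None,
                silent=False):
    """Build an argument list for scanps from a few common options."""
    args = ["scanps", "-s" + query]          # query sequence file
    if matrix is not None:
        args.append("-m" + matrix)           # pair score matrix file
    if penalty is not None:
        args.append("-p" + str(penalty))     # gap penalty
    if mode is not None:
        args.append("-a" + str(mode))        # 0: top score, 1: all local
    if cutoff is not None:
        args.append("-c" + str(cutoff))      # cutoff score
    if silent:
        args.append("-n")                    # do not show alignments
    return args


args = scanps_args("hahu.seq", penalty=8, mode=1, cutoff=80)
print(" ".join(args))
```

The resulting list could be passed to subprocess.run once the environment described in the installation section has been set up.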

References

1
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403-410, 1990.

2
G. J. Barton. An efficient algorithm to locate all locally optimal alignments between two sequences allowing for gaps. Comput. Appl. Biosci., 9:729-734, 1993.

3
G. J. Barton, P. T. C. Cohen, and D. Barford. Conservation analysis and structure prediction of the protein Ser/Thr phosphatases: Diadenosine tetra-phosphatase from E. coli is homologous to the protein phosphatases. Eur. J. Biochem., 220:225-237, 1994.

4
G. J. Barton, C. P. Ponting, G. Spraggon, C. Finnis, and D. Sleep. Human platelet derived endothelial cell growth factor is homologous to E. coli thymidine phosphorylase. Protein Science, 1:688-690, 1992.
