User guide for:
SCANPS
A program for rigorous protein sequence database scanning
Geoffrey J. Barton
University of Oxford
Laboratory of Molecular Biophysics
Rex Richards Building
South Parks Road
Oxford OX1 3QU
UK
Tel: (44) 865-275368
Fax: (44) 865-510454
email: gjb@bioch.ox.ac.uk
www: http://geoff.biop.ox.ac.uk/personalities.html
Preliminary manual: 29th July 1994 Revised: 26th August 1994
SCANPS (pronounced Scan-P-S) stands for SCAN Protein Sequence. The main function of SCANPS is to use a rigorous local alignment method to search protein sequence databases with a query sequence or multiple alignment. SCANPS also allows all pairwise comparisons to be made between a set of sequences and can estimate the statistical significance of the alignments. SCANPS has been used in the analysis of many protein families. For example, it was used to discover the similarity between PD-ECGF (Platelet derived endothelial cell growth factor) and TP (Thymidine Phosphorylase) [4]. The program was also used to find the similarity between E. coli diadenosine tetra-phosphatase and the protein Ser/Thr phosphatases [3].
Efficient finding of Nearly-ALL local alignments (the NALL method) [2] that score above a cutoff or probability threshold, between a sequence and a database. This means if two proteins have more than one common region, most regions are reported. Effectively, this is like BLAST [1] but with gapped alignments.
Efficient implementation of the Smith-Waterman Algorithm - this returns the highest scoring local alignment between two sequences including gaps where necessary. The program is approximately a factor of three faster than sssearch.
Estimation of the significance of the local alignments. An empirical method is used which takes into account the alignment score and the alignment length. This has the effect of pushing unusually high scoring, but short alignments higher up the hit list.
Comparison of all pairs of sequences in a set using either the Smith-Waterman, or NALL methods.
The SCANPS program has been used as a test bed for a lot of studies, many of which are not yet published. When the work is published, I will try to clean up the source code and distribute it. Currently, I can not be sure that the code will compile on all ANSI-C compilers, so for the time being, I am making precompiled binaries available for Sun (SunOS 4.1.3) and Silicon Graphics (IRIX 5.2). I have access to a Silicon Graphics running IRIX 4.X, so if you want the programs on the older operating system, then let me know.
The programs are available by anonymous ftp from geoff.biop.ox.ac.uk in the subdirectory programs/scanps. You can also reach this directory using a WWW browser such as Mosaic (URL=http://geoff.biop.ox.ac.uk). If you download the programs please send me a short email with your name, affiliation and address. I will add you to my user database and send you an email when the programs are updated and/or sources are made available.
This package is distributed on an ``as is'' basis. There is no warranty whatsoever as to functioning, performance or effect on hardware or other software, express or implied. The author disclaims any implied warranties of merchantability or fitness for any particular purpose.
The program itself has not been written up for a journal, however the underlying algorithms are published in Barton (1993) [2]. If you use these programs in your work, please cite the paper and this manual ``SCANPS user guide, G. J. Barton, Oxford University, UK''. Thank you.
SCANPS may be used in two ways, simple and advanced. To use the simple method, you only need SCANPS, a query sequence, a SCANPS defaults file and a database file. To use the advanced features you should ideally have created indexes for the database file using the programs simclean and id_pir3. Please see Section 11 for details.
This manual assumes that the installation has been performed correctly. If you experience problems running the programs, please see Section 11 and, if that fails, send me an email!
Typing ``scanps'' on its own gives a list of the legal command line arguments. Most of these are described with examples in the following sections. They are also listed in Section 14.5.
Unix experts can skip this section!
SCANPS and associated programs make use of Unix standard input and output to simplify data processing. This brief introduction should help those who are either novices at Unix, or have never made use of pipes (|) and redirection (< >).
Unix programs read instructions from standard input (stdin) and write output to standard output (stdout). By default, standard input is the keyboard and standard output is the screen. A feature of Unix is that you can redirect the input to come from a file by appending the '<' character, and redirect output by appending the '>' character.
For example: let us suppose we have a program called ``garbage'' which we would like to have read data from a file called ``in.dat'' and write the results to a file called ``out.dat''. We could type:
garbage < in.dat > out.dat
Unix also includes the pipe character '|'. This allows the output of one program to be directed to the input of another. For example, suppose we want to process the output of the ``garbage'' program using another program called ``cleaner'', then one way to do this would be to type:
garbage < in.dat > out.dat
cleaner < out.dat > cleaned.out
This saves the results of the ``garbage'' program in the file out.dat, then takes the out.dat file as input to the ``cleaner'' program, finally saving the results to the file cleaned.out.
A neater solution using a pipe does away with the need for the out.dat file altogether:
garbage < in.dat | cleaner > cleaned.out
There are four steps to simple database scanning using SCANPS:
1. Perform the scan, either using a single sequence or a multiple sequence alignment. Save the score and identifier of each sequence in the database.
2. Sort the results of the scan into descending order.
3. Inspect the sorted file to decide how many of the high scoring proteins we are interested in. Extract the sequences for these high scoring proteins.
4. Run scanps again, this time reading sequences from the high scoring sequence file and generate all alignments down to some threshold score.
For example, we can scan with the SH2 domain from src. The sequence data file should look something like this:
>TVHUSC_SH2
src SH2 domain
WYFGKITRRESERLLLNAENPRGTFLVRES
ETTKGAYCLSVSDFDNAKGLNVKHYKIRKL
DSGGFYITSRTQFNSLQQLVAYYSKHADGL
CHRLTTV*
This is standard NBRF-PIR format. SCANPS expects to find a ``>'' symbol followed by an identifier code, then on the NEXT line a title, in this case ``src SH2 domain'', then the one letter amino acid code terminated by a star ``*''. Note that the amino acid sequence MUST be in uppercase, but any number of characters per line is allowed. scanps ONLY reads alphabetic characters and IGNORES spaces, dots, or numbers. Letters that are not legal amino acid codes (e.g. the letter O) are read as ``X''.
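The reading rules above can be illustrated with a short Python sketch. It is not part of SCANPS and only mirrors the description in this section. The function name read_sequences and the LEGAL set of accepted codes are assumptions made for illustration, as is the choice to reject lowercase letters with an error.

```python
# Assumed set of legal one-letter codes: the 20 standard residues plus B, Z and X.
LEGAL = set("ACDEFGHIKLMNPQRSTVWYBZX")

def read_sequences(text):
    """Parse '>ID', title line, then sequence lines terminated by '*'."""
    entries = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue                          # skip anything outside an entry
        ident = line[1:].strip()
        title = next(lines, "").strip()       # the title is on the NEXT line
        residues = []
        for seq_line in lines:                # continues from the same iterator
            finished = "*" in seq_line
            for ch in seq_line.split("*")[0]:
                if not ch.isalpha():
                    continue                  # spaces, dots and numbers are ignored
                if ch.islower():
                    # Assumption: treat lowercase as an error, since the
                    # manual says the sequence MUST be uppercase.
                    raise ValueError(f"lowercase residue {ch!r} in {ident}")
                residues.append(ch if ch in LEGAL else "X")
            if finished:
                break
        entries.append((ident, title, "".join(residues)))
    return entries

example = """>TVHUSC_SH2
src SH2 domain
WYFGKITRRESERLLLNAENPRGTFLVRES
ETTKGAYCLSVSDFDNAKGLNVKHYKIRKL
DSGGFYITSRTQFNSLQQLVAYYSKHADGL
CHRLTTV*
"""
ident, title, seq = read_sequences(example)[0]
print(ident, title, len(seq))
```

The sequence may be split over any number of lines, so the parser simply accumulates letters until it meets the terminating star.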
To perform the scan, type:
scanps -ssh2.seq > sh2.scan
The -s tells the program to read the query sequence from the file sh2.seq. The results of the scan will be saved in the file sh2.scan. On a Silicon Graphics R4000 ``Indy SC'' using the PIR38 database which contains 61,248 sequences this scan takes about 580 seconds. On a Sun SPARCstation 2 the scan takes about 3-4 times longer.
To sort the results of the scan, you can use the Unix sort utility as follows:
sort +0 -1 -n -r < sh2.scan > sh2.sorted
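If you want to understand what the +0 -1 -n -r means, please consult the Unix sort man pages (i.e. type: man sort). The sort is very quick. Note that versions of sort that follow the current POSIX standard no longer accept the +0 -1 key syntax; on such systems the same sort is written sort -k1,1 -n -r.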
You can avoid saving the sh2.scan file by piping the output of scanps directly into sort. For example:
scanps -ssh2.seq | sort +0 -1 -n -r > sh2.sorted
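For readers who prefer to post-process the scan output in a script, the sort step can be sketched in Python as follows. The sketch assumes only what is stated above: each line of the scan output starts with a score followed by an identifier. The function name top_hits and the four sample lines are illustrative.

```python
def top_hits(scan_text, n):
    """Return the n highest scoring (score, identifier) pairs, best first."""
    pairs = []
    for line in scan_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue                          # skip blank or malformed lines
        pairs.append((float(fields[0]), fields[1]))
    # Descending numeric sort on the score; ties keep their input order.
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:n]

sample = "458 OKFVYR\n497 TVHUSC\n443 S20676\n492 TVCHS\n"
for score, ident in top_hits(sample, 2):
    print(int(score), ident)
```

Taking the first n entries of the sorted list corresponds to running head on the sorted file.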
Examination of the top of the sorted output file shows the highest scoring hits to the query sequence. You must now inspect this file and decide on how many of the top scoring sequences you would like to examine by alignment. If you use the default parameters (PAM250 and penalty of 8) then scores below 90 are often uninteresting. However, this is not an absolute rule and each scan will require careful scrutiny of the score list. It is usually better to include a lot of sequences at this stage since ``interesting'' matches may emerge even for low scores.
There are no programs supplied with scanps to help you look at the ``.sorted'' file. You must use the Unix tools ``more'' or ``head'' to inspect and extract the interesting parts of the file. Or you could use your favourite text editor (vi, emacs, jot, pico etc).
In order to keep this guide to manageable length, I will illustrate the following sections using only the top 15 scoring sequences. In practice, the top 150 or so in this scan would be worth looking at. To get the top 15 sequence scores into a file you could type:
head -15 < sh2.sorted > sh2.top15
This saves the top 15 score/ID pairs in a file called sh2.top15. Here it is:
497 A43610
497 TVHUSC
492 TVCHS
492 TVFV60
492 TVFVPR
492 TVFVS2
490 TVFVS1
488 TVFVMT
474 B34104
473 A34104
458 OKFVYR
458 S15582
458 S20808
456 TVFVR
443 S20676
Unless you know the identifier codes, this is pretty unhelpful. If you have built an indexed database, then it is easy to get the titles of these sequences back using the program ``sortsco'', see Section 6.3.3. For now, we can extract the sequences that correspond to these protein identifiers using the program ``select''.
select

Program S E L E C T
Extracts sequences from PIR database
Author: G. J. Barton (1990)
Maximum Allowed Sequence Length: 8000
Maximum Allowed Number of Sequences: 2000
Enter name of file containing SCORE ID pairs: sh2.top15
Opening File: sh2.top15
Opening File: /data/pir/pir38.seq
Just Extract Identifiers/titles (no sequences) ?[Y/N]:
Enter Output Filename: sh2.top15.seq
Opening File: sh2.top15.seq
Searching for: 15 Sequences
1 A34104
2 A43610
3 B34104
4 OKFVYR
5 S15582
6 S20676
7 S20808
8 TVCHS
9 TVFV60
10 TVFVMT
11 TVFVPR
12 TVFVR
13 TVFVS1
14 TVFVS2
15 TVHUSC
Found: S20676 1
Found: S20808 2
Found: S15582 3
Found: A34104 4
Found: B34104 5
Found: A43610 6
Found: TVHUSC 7
Found: TVCHS 8
Found: TVFV60 9
Found: TVFVMT 10
Found: TVFVPR 11
Found: TVFVR 12
Found: OKFVYR 13
Found: TVFVS2 14
Found: TVFVS1 15
Extracted: 15 Sequences
You supply the name of the file containing the score/ID pairs (sh2.top15) and then a name for the file in which to save the sequences (sh2.top15.seq). select then lists the identifiers it is searching for and, as each one is found in the database, lists it to the screen again. The sequences are saved in the output file in the same order as they are shown in the sh2.top15 file.
If you have access to a more sophisticated database program, then you may prefer to use that to extract the sequences. For example, the program ``sortsco'' works much faster than ``select'' since it makes use of indexing - See Section 6.3.3 for details.
If you have not set the environment variables for the database file, then the program ``select'' will prompt you for the database filename.
We now have the 15 top scoring sequences in a file called sh2.top15.seq. We can re-run scanps on this file to generate the alignments.
Since scanps is able to find many local alignments between the query sequence and the database it is necessary to set a cutoff score otherwise you will output thousands of insignificant alignments in addition to those that are useful. A suitable value for the cutoff score will depend on the search you have completed, but values of 80-100 make a good starting point.
In order to illustrate the NALL alignment feature of scanps I have added the sequence ``S01966 GTPase-activating protein - bovine'' in place of the 15th sequence (TVFVS1).
For example we can now type:
scanps -ssh2.seq -a1 -c90 -d < sh2.top15.seq > sh2.top15.alig
We are using the sh2.seq sequence to scan the sh2.top15.seq file. The -a1 means ``generate alignments'', the -c90 sets the cutoff score to 90 and the -d means read the database from standard input - in this example, the file ``sh2.top15.seq''.
The output of this command looks like this:
---------------------------
Comparison with: TVHUSC protein-tyrosine kinase (EC 2.7.1.112) src - human 538 Residues
Raw Score: 497.0 TVHUSC Allen: 97 Score/Allen: 5.123711
**************************************************
1 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL 50
151 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL 200
***********************************************
51 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTV 97
201 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTV 247
---------------------------
13 BORING ALIGNMENTS DELETED
---------------------------
Comparison with: S01966 GTPase-activating protein - bovine 1046 Residues
Raw Score: 171.0 S01966 Allen: 90 Score/Allen: 1.900000
** **. * .* * .* .. *..*.***. *.. . .* * . .
1 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL 50
178 WYHGKLDRTIAEERLRQAGKS GSYLIRESDRRPGSF V LS FLSQTNV 223
*.*..* . *..** .* .*.** .*..*** * *
51 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGL 90
224 VNHFRIIAM CGDYYIGGR RFSSLSDLIGYYS HVSCL 259
Raw Score: 130.0 S01966 Allen: 86 Score/Allen: 1.511628
*. ***...*. **. .. .**** *. * * * * * .
1 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL 50
348 WFHGKISKQEAYNLLMTVGQA CSFLVRPSDNTPGDYSL Y F RTSE 391
*....** . * . .* .**. ... * *.
51 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKH 86
392 NIQRFKICPTPNNQFMMGGRY YNSIGDIIDHYRKE 426
Each local alignment is shown with the Raw Score for the alignment, the length of the alignment and the score/length (this value is not actually very useful and will be removed in future versions of the program).
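As a small worked example, the score/length value is just the raw score divided by the alignment length (the ``Allen'' field), e.g. 171.0 / 90 = 1.9. The Python sketch below reads a summary line of the form shown in the output above and recomputes the ratio. The regular expression is an assumption based only on the lines printed in this manual, and parse_summary is an illustrative name.

```python
import re

# Pattern assumed from the summary lines shown in this manual.
SUMMARY = re.compile(
    r"Raw Score:\s*(?P<score>[\d.]+)\s+(?P<ident>\S+)\s+"
    r"Allen:\s*(?P<length>\d+)\s+Score/Allen:\s*(?P<ratio>[\d.]+)"
)

def parse_summary(line):
    """Return the fields of one 'Raw Score' line, or None if it does not match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return {
        "ident": match["ident"],
        "score": float(match["score"]),
        "length": int(match["length"]),
        "ratio": float(match["ratio"]),
    }

hit = parse_summary("Raw Score: 171.0 S01966 Allen: 90 Score/Allen: 1.900000")
recomputed = round(hit["score"] / hit["length"], 6)
print(hit["ident"], hit["score"], hit["length"], recomputed)
```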
Stars highlight identities and dots show positions that give positive scores in the pair score matrix that is being used. The match with S01966 illustrates the ability of SCANPS to find multiple hits to the same sequence. Lowering the cutoff score would find more alignments, but they would be unlikely to be significant.
You can estimate the statistical significance of the local alignments by adding the -F1 option to the scanps command. The numbers that are produced are only ``true'' probabilities when used with the PAM250 matrix and gap penalty of 8 (this will change in later releases). A paper describing the method by which the probabilities are estimated is in preparation.
For example:
scanps -ssh2.seq -a1 -c90 -d -F1 < sh2.top15.seq > sh2.top15.alig.prob
Inspection of the sh2.top15.alig.prob file shows the alignments now include a ``probability'' value. These are all small numbers for these alignments. The S01966 alignments are shown here:
Comparison with: S01966 GTPase-activating protein - bovine 1046 Residues
Raw Score: 171.0 S01966 Allen: 90 Score/Allen: 1.900000 Probability: 8.6301e-18
** **. * .* * .* .. *..*.***. *.. . .* * . .
1 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL 50
178 WYHGKLDRTIAEERLRQAGKS GSYLIRESDRRPGSF V LS FLSQTNV 223
*.*..* . *..** .* .*.** .*..*** * *
51 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGL 90
224 VNHFRIIAM CGDYYIGGR RFSSLSDLIGYYS HVSCL 259
Raw Score: 130.0 S01966 Allen: 86 Score/Allen: 1.511628 Probability: 3.133e-11
*. ***...*. **. .. .**** *. * * * * * .
1 WYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGL 50
348 WFHGKISKQEAYNLLMTVGQA CSFLVRPSDNTPGDYSL Y F RTSE 391
*....** . * . .* .**. ... * *.
51 NVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKH 86
392 NIQRFKICPTPNNQFMMGGRY YNSIGDIIDHYRKE 426
The probability values can be useful when comparing alignments of very different length. Short alignments will normally be expected to have lower scores than long alignments. Simply ranking on the Raw Score takes no account of this fact.
The simple method of scanning that has been described in the previous section is relatively fast and allows you the flexibility to generate local alignments for those proteins you think will be interesting from an inspection of the hit list. The advanced scanning option allows the NALL local alignment algorithm to be applied during the scan. This means that if the program finds that the query shows similarity to multiple regions of a database protein, then all these regions will be reported. This approach has the drawback that it can generate very large numbers of ``hits'' and these can be uninteresting because they are due to matches between repetitive sequences or hydrophobic runs. NALL scanning is also slower, though only by about a factor of 3.
The length dependent statistics are used to screen out alignments that we would expect to see by chance. Furthermore, a minimum alignment length threshold can be set to improve the scan speed.
If you are interested in short ungapped alignments, then it is best to use the program BLAST from NCBI. BLAST is much faster than SCANPS and is highly tuned for the ungapped alignment problem. If you are interested in alignments much over 50 residues in length, then scanps may offer some advantages. You can get ungapped alignments from scanps by setting the gap penalty very high (e.g. -p100); such alignments should agree reasonably well with BLAST output that used the same pairscore matrix.
In order to process the results of a NALL scan properly, you must have built the indexed databases and be able to run the sortsco program (Section 11).
To run the sh2.seq scan but with the NALL algorithm we can type:
scanps -ssh2.seq -a1 -F1 -l40 -n > sh2.all.scan
Minimum score set to: 63.000000 Length: 40 Probability: 0.000100
Grand Total of Paths Considered: 237
The new arguments to scanps are -l40 which specifies a minimum length of 40 for alignments and -n which stops the -a1 option from displaying alignments. The default probability cutoff is set in the SCANPSDEFAULTS file, but may be overridden by the -g command line option. For example:
scanps -ssh2.seq -a1 -g0.000001 -F1 -l40 -n
Minimum score set to: 79.000000 Length: 40 Probability: 0.000001
Note that with the smaller probability threshold, the minimum score that will be considered has increased from 63 to 79. This will reduce the scan time.
The scan with default probability threshold, again on the PIR38 database, takes 877 seconds on the Indy R4000 SC. The scan considered 237 alignments to be within the probability threshold.
The program sortsco has a number of possible arguments. By default it expects a file that contains the results of a NALL scan. To sort the result and append titles to each hit, type:
sortsco -t < sh2.all.scan > sh2.all.scan.sorted
The result of the NALL scan once sorted looks like this. I have truncated the output to 80 characters and removed much of the file for brevity. See the file sh2.all.scan.sorted for the full output.
   497 97 1.1e-82 0 1  1 97 151 247 TVHUSC protein-tyrosine kinase
   497 97 1.1e-82 0 1  1 97 156 252 A43610 protein-tyrosine kinase
   492 97 1.4e-81 0 1  1 97 148 244 TVFVS2 protein-tyrosine kinase
   492 97 1.4e-81 0 1  1 97 148 244 TVFVPR protein-tyrosine kinase
   492 97 1.4e-81 0 1  1 97 148 244 TVFV60 protein-tyrosine kinase
   . . .
** 171 90 8.6e-18 0 2  1 90 178 259 S01966 GTPase-activating pro
   . . .
   144 90 2.2e-13 0 2  1 87 110 198 A42031 hematopoietic cell phosp
   144 94 4.2e-13 0 1  1 94 127 213 TVHUA  protein-tyrosine kinase
   140 93 1.7e-12 1 2  2 91  11  96 A40802 protein-tyrosine kinase
   138 90 1.9e-12 0 2  1 87 110 198 A38189 tyrosine phosphatase=hSH
   138 90 1.9e-12 0 2  1 87 112 200 S17234 Protein-tyrosine-phospha
   138 90 1.9e-12 0 2  1 87 112 200 S20837 Protein-tyrosine-phospha
   139 91 2.5e-12 0 2  1 87 112 201 S27398 protein-tyrosine phospha
   139 91 2.5e-12 0 2  1 87 112 201 A46209 SH2-containing phosphoty
   139 91 2.5e-12 0 2  1 87 112 201 S31767 protein-tyrosine phospha
   139 91 2.5e-12 0 2  1 87 112 201 A47244 SH-PTP2=SH2-containing p
   139 91 2.5e-12 0 2  1 87 112 201 A46210 phosphotyrosine phosphat
   136 89 3.9e-12 0 1  1 89 271 352 TVFFA  protein-tyrosine kinase
   139 99 4.5e-12 0 1  1 97 603 693 TVHUVV transforming protein (va
   136 92 7.1e-12 0 2  1 91 111 195 A43254 protein tyrosine phospha
   135 91 1.0e-11 1 2  1 90   6  88 S27398 protein-tyrosine phospha
   135 91 1.0e-11 1 2  1 90   6  88 A47244 SH-PTP2=SH2-containing p
   135 91 1.0e-11 1 2  1 90   6  88 A46209 SH2-containing phosphoty
   135 91 1.0e-11 1 2  1 90   6  88 A46210 phosphotyrosine phosphat
   135 91 1.0e-11 1 2  1 90   6  88 S31767 protein-tyrosine phospha
   116 43 1.3e-11 0 1  1 43  13  53 B45022 CRK-I - human
   116 43 1.3e-11 0 1  1 43  13  53 A45022 CRK-II - human
   134 96 2.5e-11 0 1  1 91 434 524 C46243 GRB-7=epidermal growth f
** 130 86 3.1e-11 1 2  1 86 348 426 S01966 GTPase-activating pro
   113 43 3.6e-11 0 1  1 43  44  84 A46243 GRB-3=epidermal growth f
   129 86 4.4e-11 1 2  1 86 174 252 B40121 GTPase-activating protei
   129 86 4.4e-11 1 2  1 86 351 429 A40121 GTPase-activating protei
   130 98 9.9e-11 1 2  1 97   4  93 A42031 hematopoietic cell phosp
   128 98 1.9e-10 1 2  1 97   6  95 S20837 Protein-tyrosine-phospha
   128 98 1.9e-10 1 2  1 97   6  95 S17234 Protein-tyrosine-phospha
   124 93 4.2e-10 1 2  2 92  11  97 A44266 ZAP-70=70 kda protein-ty
   122 89 4.7e-10 1 2  1 88   6  86 A43254 protein tyrosine phospha
   124 98 7.3e-10 1 2  1 97   4  93 A38189 tyrosine phosphatase=hSH
   . . .
Two lines are shown with ``**'' at the start. These stars do not appear in the output file but are here to draw your attention to the lines for discussion below.
There are 11 columns of information in this file.
Column 1: The raw score for the local alignment, i.e. the sum of the pairscore matrix values for the alignment, less the gap penalty times the number of gaps.
Column 2: The length of the local alignment. This is simply the length including the gaps.
Column 3: The probability calculated using the length dependent statistics. The output is sorted into increasing probability order.
Column 4: The rank of the alignment in the comparison with this database sequence. This number is 0 if this is the highest scoring alignment with the database sequence, 1 if the second highest, 2 if the third and so on.
Column 5: The number of local alignments found with this database sequence. For example, if Columns 4 and 5 show values of ``0 7'', then this line is giving statistics on the highest ranked alignment out of 7 found. ``2 7'' would be the third ranked alignment with the database sequence.
Columns 6 and 7: The starting and ending residues of the fragment of the query sequence that is aligned.
Columns 8 and 9: The starting and ending residues of the section of database sequence that is aligned to the query.
Column 10: The identifier code for the database sequence.
Column 11: The title line for the database sequence. This is not truncated.
The first line highlighted by ``**'' shows a score between the query and the database sequence S01966 of 171 for a length of 90 residues. The probability is 8.6e-18 and this is the highest scoring alignment of two that are found with the database protein. The alignment is from residue 1 to 90 of the query and 178 to 259 of the database sequence.
If we look further down the file, we can see the second match to S01966. This scores 130 with a length of 86, probability of 3.1e-11. The region matched is 348-426.
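The column layout just described lends itself to simple scripted post-processing. The Python sketch below splits a line of sorted NALL output into its 11 columns and groups the hits by database identifier, so that proteins matched in more than one region (such as S01966) are easy to pick out. The names Hit, parse_hit and regions_by_sequence are illustrative and are not part of the SCANPS package.

```python
from collections import defaultdict, namedtuple

Hit = namedtuple(
    "Hit",
    "score length prob rank count qstart qend dstart dend ident title",
)

def parse_hit(line):
    """Split one line into its 11 columns; return None for any other line."""
    fields = line.split(None, 10)             # the title may contain spaces
    if len(fields) < 10:
        return None
    title = fields[10].strip() if len(fields) > 10 else ""
    try:
        return Hit(float(fields[0]), int(fields[1]), float(fields[2]),
                   *map(int, fields[3:9]), fields[9], title)
    except ValueError:
        return None                           # e.g. a non-numeric column

def regions_by_sequence(hits):
    """Group hits by database identifier, best ranked alignment first."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit.ident].append(hit)
    for group in grouped.values():
        group.sort(key=lambda h: h.rank)
    return dict(grouped)

sample = """\
497 97 1.1e-82 0 1 1 97 151 247 TVHUSC protein-tyrosine kinase
171 90 8.6e-18 0 2 1 90 178 259 S01966 GTPase-activating pro
130 86 3.1e-11 1 2 1 86 348 426 S01966 GTPase-activating pro
"""
hits = [h for h in map(parse_hit, sample.splitlines()) if h]
for ident, group in regions_by_sequence(hits).items():
    print(ident, [(h.dstart, h.dend) for h in group])
```

Grouping on the identifier rather than relying on line order matters because the file is sorted by probability, so alignments to the same database sequence are usually not adjacent.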
The program ``sortsco'' allows the sequence fragments that are aligned to the query to be output. It also permits the fragments to be extended towards the N and C termini by a predefined percentage. The program also allows the top N hits to be output, or all those that score above a cutoff. Further analysis tools are under development. Just type ``sortsco'' to see the program options.
sortsco will also read a file of ID codes to allow these sequences or just the titles to be extracted from the database. The program will read the output of the standard scanps scan (ie not using the NALL method) and sort the results if you do not want to use the Unix sort utility. Normally sortsco is a little slower at the sort than Unix.
If you have a multiple sequence alignment in AMPS blockfile format, then you can scan with this using the command:
scanps -btest.blc -a0 > test.scan
The commands and operations are exactly the same as for using a sequence file. Alignments will only show the FIRST sequence from the block file with pre-existing gaps shown as dashes ``-'' rather than spaces `` ''. Note that the length dependent statistics cannot be used with a block file scan.
Alignments in GCG .MSF format or CLUSTAL PIR format can be converted to block file format using the programs ``msf2blc'' and ``clus2blc'' which are distributed with the ALSCRIPT and ASSP program packages. Alternatively, you could generate your alignment using the AMPS package. All these programs are distributed from our ftp server (geoff.biop.ox.ac.uk - please see the README file).
This feature is not fully developed, but it is useable (and useful!). For pairwise comparisons, the .seq file MUST NOT contain any non-amino acid characters or spaces in the sequence part of the file.
Having checked this, you must first create a copy of the .seq file (call this .sec). The .sec file could contain secondary structure definitions for the protein, or any other characters that you want to align with the sequences. Check that your SCANPS defaults file has the value of MAX_NSEQ set greater than the number of sequences in your sequence file, then for example, for the file test.seq type:
scanps -stest.seq -ttest.sec -T
This gives the score for each pair comparison to stdout. You could redirect the output to a file.
553 HAJUA HAHOD
543 HAJUA HAHOK
475 HAJUA HAKOAW
481 HAJUA HAJSA
461 HAJUA HAFEDR
261 HAJUA HBOTE
646 HAHOD HAHOK
490 HAHOD HAKOAW
502 HAHOD HAJSA
471 HAHOD HAFEDR
306 HAHOD HBOTE
484 HAHOK HAKOAW
490 HAHOK HAJSA
461 HAHOK HAFEDR
292 HAHOK HBOTE
587 HAKOAW HAJSA
433 HAKOAW HAFEDR
269 HAKOAW HBOTE
439 HAJSA HAFEDR
274 HAJSA HBOTE
307 HAFEDR HBOTE
Each line of the output shows the score and the corresponding pair of ID codes.
Pairwise comparisons may also be performed using the NALL method. Currently, this only works if you also request probability scores. For example:
scanps -stest.seq -ttest.sec -T -a1 -F1
gives ...
7.2765e-88 HAJUA HAHOD
1.0265e-85 HAJUA HAHOK
2.5983e-71 HAJUA HAKOAW
1.4448e-72 HAJUA HAJSA
2.1362e-68 HAJUA HAFEDR
2.0553e-29 HAJUA HBOTE
3.4899e-108 HAHOD HAHOK
1.868e-74 HAHOD HAKOAW
5.5253e-77 HAHOD HAJSA
1.7759e-70 HAHOD HAFEDR
1.2143e-37 HAHOD HBOTE
3.3974e-73 HAHOK HAKOAW
1.868e-74 HAHOK HAJSA
2.1362e-68 HAHOK HAFEDR
4.8684e-35 HAHOK HBOTE
3.1609e-95 HAKOAW HAJSA
1.2637e-62 HAKOAW HAFEDR
7.5967e-31 HAKOAW HBOTE
7.4395e-64 HAJSA HAFEDR
9.5138e-32 HAJSA HBOTE
7.8891e-38 HAFEDR HBOTE
You can also get the alignments corresponding to these pair comparisons by adding the -v command line argument.
scanps -stest.seq -ttest.sec -T -a1 -F1 -v
The output of this comparison will include the characters from the .sec file aligned along with the sequences.
The final option in pairwise mode is to output the scores in a form that can be analysed by the cluster analysis program ``oc''. To produce suitable output, simply add a -X to the command line.
For example:
scanps -stest.seq -ttest.sec -T -X
for raw scores or:
scanps -stest.seq -ttest.sec -T -a1 -E -F1 -X
for probabilities
The -E option is necessary to prevent scanps from writing all local alignment scores. For cluster analysis you only need the top scoring alignment.
Program oc is a general purpose cluster analysis program. It implements three simple methods for hierarchical clustering and for sequence data will show the overall sub-grouping of the sequences. Although one output from ``oc'' is a dendrogram or tree, the program should not be used alone to estimate phylogeny.
Typing ``oc'' shows the options:
Cluster analysis program
Usage: oc <sim/dis> <single/complete/means> <ps> <cut N>
Version 1.0 - Requires a file to be piped to standard input
Format: Line 1: Number (N) of entities to cluster (e.g. 10)
Format: Lines 2 to 2+N-1: Identifier codes for the entities (e.g. Entity1)
Format: N*(N-1)/2: Distances, or similarities - ie the upper diagonal
Options:
sim = similarity / dis = distances
method = single/complete/means
ps <file> = plot out dendrogram to <file.ps>
log = take logs before calculation
cut = only show clusters above/below the cutoff
id = output identifier codes rather than indexes for entities
timeclus = output times to generate each cluster
amps <file> = produce amps <file>.tree and <file>.tord files
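The input layout listed in the usage message (a count N, then N identifiers, then the N*(N-1)/2 upper diagonal values) can be illustrated with a short Python sketch that builds such a file from ``score ID1 ID2'' lines of the kind printed by the pairwise comparison. In practice scanps -X produces this output directly. The sketch assumes that the values are written one per line and in the same row-by-row order as the pairwise output; neither detail is confirmed by the usage message. The function name oc_input is illustrative.

```python
def oc_input(pair_text):
    """Build oc-style input text from 'value ID1 ID2' lines."""
    ids = []
    values = []
    for line in pair_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue                          # skip anything that is not a pair
        value, first, second = fields
        for name in (first, second):
            if name not in ids:
                ids.append(name)              # keep identifiers in order of appearance
        values.append(value)
    n = len(ids)
    if len(values) != n * (n - 1) // 2:
        raise ValueError("expected one value for every pair of sequences")
    # Assumed layout: N, then N identifiers, then the upper diagonal values.
    return "\n".join([str(n)] + ids + values) + "\n"

pairs = "553 HAJUA HAHOD\n543 HAJUA HAHOK\n646 HAHOD HAHOK\n"
print(oc_input(pairs), end="")
```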
Usually, complete linkage cluster analysis gives the most interpretable results. To run oc on a data file, perhaps the output of a scanps pairwise comparison run that just includes raw scores:
oc sim complete ps test id < test.ocin > test.ocout
sim tells oc to work in similarity mode. This means that as numbers in the input file get bigger, they mean that the objects being compared are more similar. The alternative is distance mode, (dis) where smaller numbers mean greater similarity.
complete refers to the method of cluster analysis. This is a little difficult to explain without a diagram or equations (maybe in the next manual), but complete linkage joins two clusters only if all members of both clusters are similar to each other at or above a given level of similarity. single linkage joins clusters if at least one pair of members, one from each cluster, is similar at that level. means joins the clusters on the basis of the mean similarity between the clusters.
ps test asks ``oc'' to draw a dendrogram. This will be stored in the file ``test.ps''. This is a PostScript file and can be printed on a PostScript printer, or viewed using GhostScript/GhostView. Currently, the dendrogram does not have a proper axis but just shows max and min values found for joining clusters.
id Asks for ID codes rather than numbers to be output to indicate the clusters.
The output of this comparison is shown here:
## 0 646 2
HAHOD HAHOK
## 1 587 2
HAKOAW HAJSA
## 2 543 3
HAJUA HAHOD HAHOK
## 3 475 5
HAJUA HAHOD HAHOK HAKOAW HAJSA
## 4 433 6
HAFEDR HAJUA HAHOD HAHOK HAKOAW HAJSA
## 5 261 7
HBOTE HAFEDR HAJUA HAHOD HAHOK HAKOAW HAJSA
Each line starting with ``##'' shows the cluster number, (starting at 0), the score at which all members of the cluster are similar, and the number of members in the cluster. The line following the ``##'' shows the ID codes of the members of each cluster.
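To make the linkage methods concrete, here is a minimal Python sketch of hierarchical clustering in similarity mode. It is not the oc source code. In particular, treating ``means'' as the average of all pairwise similarities between two clusters is an assumption, and the order in which members are listed differs from oc. Applied with complete linkage to the raw pairwise scores from the globin example, it joins clusters at the same levels as the oc output above (646, 587, 543, 475, 433, 261).

```python
from itertools import combinations

PAIR_SCORES = """\
553 HAJUA HAHOD
543 HAJUA HAHOK
475 HAJUA HAKOAW
481 HAJUA HAJSA
461 HAJUA HAFEDR
261 HAJUA HBOTE
646 HAHOD HAHOK
490 HAHOD HAKOAW
502 HAHOD HAJSA
471 HAHOD HAFEDR
306 HAHOD HBOTE
484 HAHOK HAKOAW
490 HAHOK HAJSA
461 HAHOK HAFEDR
292 HAHOK HBOTE
587 HAKOAW HAJSA
433 HAKOAW HAFEDR
269 HAKOAW HBOTE
439 HAJSA HAFEDR
274 HAJSA HBOTE
307 HAFEDR HBOTE
"""

def cluster(ids, sim, method="complete"):
    """Return the merges as (level, members), most similar first."""
    # How the similarity between two clusters is derived from the pairwise values.
    link = {
        "single": max,                         # the best pair decides
        "complete": min,                       # the worst pair decides
        "means": lambda v: sum(v) / len(v),    # assumed: plain average
    }[method]
    clusters = [(name,) for name in ids]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            level = link([sim[frozenset((x, y))] for x in a for y in b])
            if best is None or level > best[0]:
                best = (level, a, b)
        level, a, b = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((level, a + b))
    return merges

ids, sim = [], {}
for line in PAIR_SCORES.splitlines():
    score, first, second = line.split()
    sim[frozenset((first, second))] = int(score)
    for name in (first, second):
        if name not in ids:
            ids.append(name)

merges = cluster(ids, sim, "complete")
for number, (level, members) in enumerate(merges):
    print("##", number, level, len(members))
    print(" ".join(members))
```

With complete linkage the level of a join is the lowest score between any two members, which is why the final cluster forms at 261, the HAJUA/HBOTE score.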
oc will optionally accept a cutoff score. If a cutoff is given, only clusters that score above (or below in distance mode) the score will be output. This can be useful for filtering comparisons of very large numbers of sequences.
The PostScript tree is shown in the file test.ps.
``oc'' also allows .tord and .tree files for the AMPS multiple alignment program to be generated. See the AMPS documentation for an explanation of how to use the .tord and .tree files.
Two programs are used to build the indexed database.
simclean takes the PIR .seq file and removes any blank space from the sequence part of the file. Each sequence entry is reduced to three lines and three return characters.
id_pir3 takes the cleaned up .seq file and generates two index files, .bin and .inx.
To run the two programs on a .seq file called ``pir1.seq'' type:
simclean < pir1.seq > pir1.clean
cp pir1.seq pir1.seq.safe
cp pir1.clean pir1.seq
id_pir3 pir1.seq pir1.bin pir1.inx
If this all works, you should have three files pir1.seq, pir1.bin and pir1.inx.
These database files should be placed in a single directory and the environment variable GJNDBDIR set to the directory name. The environment variable GJNDBROOT should be set to the database name, in the example ``pir1''. In this way, multiple databases can reside in the same directory. If you want to scan using a different database, you just redefine the GJNDBROOT variable.
For example, if we want to use the database called ``brookhaven'', we'd just type:
setenv GJNDBROOT brookhaven
scanps and sortsco would then expect to find the files brookhaven.seq, brookhaven.inx and brookhaven.bin in the directory defined by GJNDBDIR.
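The naming convention can be summarised in a few lines of Python. This sketch only restates the rule given above (directory from GJNDBDIR, file root from GJNDBROOT, extensions .seq, .inx and .bin). The function name database_files is illustrative, and the example directory /data/pir is borrowed from the select session shown earlier.

```python
import os

def database_files(env=os.environ):
    """Return the three database file names implied by the two variables."""
    directory = env["GJNDBDIR"]               # directory holding all databases
    root = env["GJNDBROOT"]                   # database name, e.g. "pir1"
    return {
        ext: os.path.join(directory, f"{root}.{ext}")
        for ext in ("seq", "inx", "bin")
    }

settings = {"GJNDBDIR": "/data/pir", "GJNDBROOT": "brookhaven"}
for ext, path in database_files(settings).items():
    print(ext, path)
```

Checking that all three paths exist before starting a long scan is a cheap way to catch a mistyped GJNDBROOT.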
The distribution is in the form of a gzip compressed tar file. Executables for Sun (extension .sun) and Silicon Graphics (IRIX 5.X) (extension .sgi) are included in the top directory. Documentation and example files are in the doc subdirectory.
You can test that all programs are working simply by typing their name. Only simclean will produce no output.
I have included a small sequence database for testing purposes. It is in the examples subdirectory and is called "protein". If you set GJNDBDIR and GJNDBROOT appropriately, you can do a quick scan against this database to see how the program works before investing time in setting up current sequence databases. (In fact this database is PIR14, which contains 6,858 sequences; 1988 vintage, I think.) If I find some disk space I may make the sequences from the latest PIR database available, with indexes, on our ftp server.
It can be useful to increase the default gap penalty slightly. Try values of 9 or 10. The length dependent statistics are reasonably similar for these values.
You can use the simple method to scan, then find out what the score threshold would have been for a NALL scan (by using the -F1 option to scanps). You can then extract all sequences that give scores above this threshold and run the NALL method on them. This two step strategy is usually quicker than running the NALL method on the whole database.
All the programs use the same format for storing sequences. This includes the database, the query and any sequences extracted by scanps or sortsco. The format is as follows:
>IDENTIFIER
TITLE LINE
one letter code in capitals terminated by *
>IDENTIFIER2
Title line
one letter code.....
*
etc
This is the format of the NBRF-PIR database distributed for VAX. I use this format for historical reasons. If anyone can suggest which format is the most commonly used for database scanning, then I will support this. I guess that FASTA format as used by BLAST would be a good one to include...
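To make the layout concrete, here is a minimal Python sketch of a reader for this format. It is not taken from the SCANPS source: the function name read_sequences and the identifiers SEQ1 and SEQ2 are invented for the example, and the reader assumes only what is described above (a '>' line, a title line, then residues ending in '*').

```python
def read_sequences(lines):
    """Return a list of (identifier, title, sequence) tuples.

    Hypothetical reader for the format described above; the sequence may
    span several lines as long as the last one ends with '*'.
    """
    entries = []
    ident = title = None
    residues = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # Start of a new entry: the identifier follows the '>'.
            ident, title, residues = line[1:].strip(), None, []
        elif ident is not None and title is None:
            # The line after the identifier is the title.
            title = line.strip()
        elif ident is not None:
            # Sequence lines: drop blanks, stop at the terminating '*'.
            residues.append(line.replace(" ", ""))
            if line.rstrip().endswith("*"):
                entries.append((ident, title, "".join(residues).rstrip("*")))
                ident = None
    return entries

example = [
    ">SEQ1", "First title", "ACDEFGHIK*",
    ">SEQ2", "Second title", "ACDE", "FGH*",
]
for ident, title, seq in read_sequences(example):
    print(ident, len(seq), title)
```

The second entry shows a sequence split over two lines, which is the kind of input simclean would reduce to a single sequence line.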
This should be in AMPS block file format.
The minimum requirements for a block file for N aligned sequences are:
1. N '>comment line(s)'
2. '* iteration int'
3. 'N or more vertically aligned sequences'
4. '*'
The format allows several alternative alignments to follow each other providing they are identified by a different iteration number (eg. 1,2,3). Currently, SCANPS only reads the first alignment. See the AMPS documentation for further details of alternative multiple alignments.
Simple example:
This is a block file containing two alternative alignments of three sequences. The comments that I am writing here may appear in the block file, but are ignored when the file is read. The only proviso is that they must not contain any 'greater than' or 'star' characters.
>first this is sequence A
>second this is sequence B
>third This is sequence C
* iteration 1
A A P AVG LLG LCR G PG WWW S
*
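The following Python sketch shows one way the first alignment of a block file could be read, based only on the minimum requirements listed above. It is an assumption-laden illustration, not the AMPS or SCANPS reader: the name read_first_block is invented, each row is assumed to hold one character per sequence in column order, and a blank or missing column is assumed to mean a gap. The alignment rows in the example are made up.

```python
def read_first_block(lines):
    """Return (comments, sequences) for the first alignment in a block file.

    Hypothetical reader: column i of every row is taken as sequence i,
    and blank or missing columns are treated as gaps (an assumption).
    """
    comments, rows = [], []
    in_block = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not in_block:
            if line.startswith(">"):
                comments.append(line[1:].strip())
            elif line.startswith("*"):
                in_block = True      # the '* iteration int' line
            # any other line before the block is free text and is ignored
        elif line.startswith("*"):
            break                    # closing '*' ends the first alignment
        else:
            rows.append(line)
    n = len(comments)
    # Pad short rows with blanks so every row has at least n columns.
    seqs = ["".join(row.ljust(n)[i] for row in rows) for i in range(n)]
    return comments, seqs

block = [
    "free text that is ignored",
    ">first this is sequence A",
    ">second this is sequence B",
    ">third This is sequence C",
    "* iteration 1",
    "AAP",
    "AVG",
    "L G",
    "W",
    "*",
]
names, seqs = read_first_block(block)
for name, seq in zip(names, seqs):
    print(name, repr(seq))
```

Stopping at the first closing '*' mirrors the statement that only the first alignment is currently read.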
This follows the conventions of the NBRF (PIR) programs.
line 1: Title of matrix
line 2: 23 characters representing the one-letter codes and defining the order in which the matrix is stored
lines 3 to 25: integers separated by spaces
SCANPS can also cope with matrix files that have more or fewer characters.
The one letter code line is read and used as an index into the matrix that follows. For example:
Mutation Data Matrix (250 PAMs)
ARNDCQEGHILKMFPSTWYVBZX
  2 -2  0  0 -2  0  0  1 -1 -1 -2 -1 -1 -4  1  1  1 -6 -3  0  0  0  0
 -2  6  0 -1 -4  1 -1 -3  2 -2 -3  3  0 -4  0  0 -1  2 -4 -2 -1  0  0
  0  0  2  2 -4  1  1  0  2 -2 -3  1 -2 -4 -1  1  0 -4 -2 -2  2  1  0
  0 -1  2  4 -5  2  3  1  1 -2 -4  0 -3 -6 -1  0  0 -7 -4 -2  3  3  0
 -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3  0 -2 -8  0 -2 -4 -5  0
  0  1  1  2 -5  4  2 -1  3 -2 -2  1 -1 -5  0 -1 -1 -5 -4 -2  1  3  0
  0 -1  1  3 -5  2  4  0  1 -2 -3  0 -2 -5 -1  0  0 -7 -4 -2  2  3  0
  1 -3  0  1 -3 -1  0  5 -2 -3 -4 -2 -3 -5 -1  1  0 -7 -5 -1  0 -1  0
 -1  2  2  1 -3  3  1 -2  6 -2 -2  0 -2 -2  0 -1 -1 -3  0 -2  1  2  0
 -1 -2 -2 -2 -2 -2 -2 -3 -2  5  2 -2  2  1 -2 -1  0 -5 -1  4 -2 -2  0
 -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6 -3  4  2 -3 -3 -2 -2 -1  2 -3 -3  0
 -1  3  1  0 -5  1  0 -2  0 -2 -3  5  0 -5 -1  0  0 -3 -4 -2  1  0  0
 -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6  0 -2 -2 -1 -4 -2  2 -2 -2  0
 -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9 -5 -3 -3  0  7 -1 -5 -5  0
  1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6  1  0 -6 -5 -1 -1  0  0
  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  2  1 -2 -3 -1  0  0  0
  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -3  0  1  3 -5 -3  0  0 -1  0
 -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17  0 -6 -5 -6  0
 -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10 -2 -3 -4  0
  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4 -2 -2  0
  0 -1  2  3 -4  1  2  0  1 -2 -3  1 -2 -5 -1  0  0 -5 -3 -2  2  2  0
  0  0  1  3 -5  3  3 -1  2 -2 -3  0 -2 -5  0  0 -1 -6 -4 -2  2  3  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
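The way the one-letter code line indexes the matrix can be sketched in Python as follows. This is an illustration under stated assumptions rather than the SCANPS reader: the function name read_matrix is invented, and the three-letter example matrix is simply the top-left corner (A, R, N) of the matrix shown above.

```python
def read_matrix(lines):
    """Return (title, codes, scores), where scores[(a, b)] is the pair score.

    Hypothetical reader: line 1 is the title, line 2 the one-letter codes,
    and each following line holds one row of integers in code order.
    """
    lines = [ln for ln in lines if ln.strip()]   # skip blank lines
    title = lines[0].strip()
    codes = lines[1].strip()
    rows = lines[2:2 + len(codes)]
    if len(rows) != len(codes):
        raise ValueError("expected %d matrix rows, found %d" % (len(codes), len(rows)))
    scores = {}
    for row_code, line in zip(codes, rows):
        values = [int(v) for v in line.split()]
        if len(values) != len(codes):
            raise ValueError("row %s has %d values" % (row_code, len(values)))
        for col_code, value in zip(codes, values):
            scores[(row_code, col_code)] = value
    return title, codes, scores

small = [
    "Top-left corner of the 250 PAM matrix",
    "ARN",
    " 2 -2  0",
    "-2  6  0",
    " 0  0  2",
]
title, codes, scores = read_matrix(small)
print(scores[("A", "R")], scores[("R", "R")])
```

Because the row and column order both come from the code line, the same reader handles matrices with more or fewer characters than the usual 23.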
SCANPS does not have hard wired limits for number or length of sequences etc. These are all defined in the defaults file. A defaults file has a series of keyword, value pairs. The defaults file must be defined by the SCANPSDEFAULTS environment variable.
For example:
MAX_NSEQ 500
MAX_SEQ_LEN 7000
MAX_ID_LEN 30
MAX_TITLE_LEN 500
MAX_BLOC_SEQ 500
PEN 8
MIN_SCORE 0
OUTPUT_LENGTH 50
SCAN 0
PRECISION 100
PCUT 0.0001
MATRIX_FILE /home/geoff/gjb/md/md.mat
FIT_FILE /home/geoff/gjb/c/scanps/metro/new/fits.md.8.dat
RUN_SW_MIN 35
MAX_NSEQ defines the maximum number of sequences that may be read into the program. If you are just doing database scanning, then it is most efficient to set this to a small value - say 2 or 3.
MAX_SEQ_LEN The maximum allowed length for a sequence. Set this to something big. The program reallocates memory down to the actual length of the sequence.
MAX_ID_LEN The maximum length of an identifier for a sequence.
MAX_TITLE_LEN The maximum length for a sequence title.
MAX_BLOC_SEQ The maximum number of sequences allowed in a block file.
PEN The length dependent gap penalty. This can also be set as a command line argument (-p).
MIN_SCORE The minimum scoring alignment that will be output. This can be set from the command line (-c).
OUTPUT_LENGTH The number of characters per line for alignment output.
SCAN Set to 0 for fast method, 1 for NALL method. This can also be set from the command line (-a0, -a1).
PRECISION Set the numeric precision of the program. SCANPS does all calculations as integers. All numbers are multiplied by PRECISION before any operation. 100 is enough for most pairscore matrices. Making this value too big may cause integer overflow problems with long sequences.
PCUT Probability cutoff. Only alignments that give lower values of probability will be output. This can be set at the command line (-g). See the section on advanced scanning.
MATRIX_FILE The name of the file containing the pairscore matrix. This can be defined on the command line (-m).
FIT_FILE The file of length-dependent probability parameters. Currently there is only one. Soon there will be other files for alternative matrix/gap-penalty combinations.
RUN_SW_MIN In NALL scanning mode, scanps first does a fast Smith-Waterman comparison. If the score for the comparison is above this value, then the NALL method is applied to the sequence pair. If probability scoring is enabled, then this value is calculated from the probability and length cutoffs.
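For readers who want to inspect or check a defaults file with a script, the keyword, value layout can be read with a few lines of Python. This is only a sketch: the function name read_defaults is invented, and it assumes one keyword and one value per line separated by white space. The example values are taken from the defaults file shown above.

```python
def read_defaults(lines):
    """Return a dict mapping each keyword to its value.

    Hypothetical reader: assumes one 'KEYWORD value' pair per line.
    Values that look like numbers are converted; anything else
    (for example a file name) is kept as a string.
    """
    defaults = {}
    for raw in lines:
        parts = raw.split(None, 1)
        if len(parts) != 2:
            continue                     # skip blank or incomplete lines
        key, value = parts[0], parts[1].strip()
        try:
            defaults[key] = int(value)
        except ValueError:
            try:
                defaults[key] = float(value)
            except ValueError:
                defaults[key] = value
    return defaults

sample = [
    "PEN 8",
    "PRECISION 100",
    "PCUT 0.0001",
    "MATRIX_FILE /home/geoff/gjb/md/md.mat",
    "",
]
settings = read_defaults(sample)
for key in sorted(settings):
    print(key, settings[key])
```

A check like this makes it easy to confirm, for instance, that PEN matches the gap penalty the FIT_FILE parameters were derived for.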
Typing ``scanps'' with no options will show you the following screen:
Options:
DATABASE SCANNING:
-s<file.seq>   Query sequence file in PIR format [e.g. -shahu.seq]
-b<file.blc>   Query multiple alignment in AMPS format [e.g. -bmyo.blc]
-d             Read the database from standard input
-F<file.fit>   File of length dependent fit parameters [e.g. -Ffits.dat]
-F1            Turn on length dependent parameters defined in SCANPSDEFAULTS file
-g<Prob>       Set probability threshold (for use with -F [e.g. -g0.001]
-n             Work silently - do not show alignments
-m<file.mat>   Define pair score matrix file (e.g. PAM250)
-p<N>          Define gap penalty e.g. -p8
-a<N>          Define mode: -a0 for top score only
                            -a1 for all local alignments
-c<N>          Define cutoff score. [e.g. -c80]
-l<N>          Define alignment length cutoff (only valid for -a1)
-o<file.seq>   Define output file for sequence alignment fragments
               These can then be multiply aligned later using AMPS
-V<file.gap>   Define a file of variable gap penalties
-G             Turn on variable gap penalties if no -V
-L<file.lk>    Read the look up table file
-D<file.lk>    Print out the look up table and variable gap penalties
PAIRWISE COMPARISONS:
-t<file.sec>   Secondary structure file in PIR format [e.g. -thahu.sec]
-T             File defined with -t is not true sec struc.
-E             Only consider the top scoring alignment in pairwise mode
-Y             Do all pairs output down to threshold defined by -g
-y             Do all pairs output down to threshold defined by -g
               Also output start and end residues of each aligment.
-X             Produce output in a format suitable for program oc
Hopefully, most of this is self explanatory. The options that are not discussed in the previous sections are:
-o When -a1 is set, this outputs the aligned fragments from the database file to the defined file in .seq file format.
-V, -G, -L and -D Allow variable gap-penalties and user-defined per-residue scoring schemes to be applied when scanning with a sequence or an alignment. I will document and describe these features in the next release.