SCANPS version 2.3.1	G. J. Barton,  University of Oxford, UK
---------------------------------------------------------------

See APPENDIX I  for revision list.
See APPENDIX II for a full list of commands.

Preliminary User Guide (18th August 1996)
-----------------------------------------------------------------------------
Latest revision: 2.3.1  11 August 1997:  Minor revisions to manual: 30 Dec 1997
-----------------------------------------------------------------------------

Pronounciation
--------------
SCANPS is pronounced scan-P-S.

What is scanps?
---------------

Scanps is a program for comparing a protein or nucleic acid sequence
to a database of sequences.  It implements various flavours of dynamic
programming algorithm such as the Smith-Waterman local alignment method.

Scanps 2.3.1 supports:

1. Scanning of a protein query sequence
against a protein sequence database.  The algorithm is full
Smith-Waterman either with simple length-dependent penalty, or with
affine gaps (u +gl).  

2. Version 2.2.3 and later also allows a DNA sequence to be
compared to a protein sequence database using a comparison method that
allows for frameshift errors.

For all methods, the alignment output reports can output not only the
top scoring alignment between each pair but sub-alignments down to a
user-defined threshold.

In version 2.3.1 and later, the statistical significance of alignments
is estimated by a method similar to that used by  W. Pearson's
SSEARCH and FASTA programs.

The program will read query and database sequence files either in
FASTA or PIR format.  To speed start-up times, particularly on
parallel processing computers, the program may also use its own
compact database format.

Unlike SCANPS version 1.x, SCANPS version 2.X is aimed at
workstations and servers with sufficient memory to hold the protein
database in RAM.  For the OWL protein sequence database this is around
70MBytes for the data.  The database is held in memory to improve the
efficiency of searches on shared memory parallel processing machines
such as the Silicon Graphics Challenge.  The program supports parallel
processing on the SGI Challenge and Origin and gives good scaling up
to 32 processors, the most we have tested.


Citing the program
------------------

A paper is in preparation that describes the program.  In the
meantime, please cite, Barton, G. J. (1997) "SCANPS Version 2.3.1 User
guide", University of Oxford, UK.

The underlying algorithm is described in:

Barton, G. J. (1993) CABIOS, 9, 729-734.


Disclaimer 
---------- 

SCANPS is provided "as-is" and without warranty of any kind, express,
implied or otherwise, including without limitation any warranty of
merchantability or fitness for a particular purpose. In no event will
the author be liable for any special, incidental, indirect or
consequential damages of any kind, or any damages whatsoever resulting
from loss of data or profits, whether or not advised of the
possibility of damage, and on any theory of liability, arising out of
or in connection with the use or performance of this software.


Installing SCANPS
-----------------

Scanps 2.3.1 is currently available for Silicon Graphics computers in
parallel and uniprocessor forms.  The uniprocessor version is also
available for Sun computers running Solaris 2.5 or later.  To obtain
scanps please see the WWW page: http://barton.ebi.ac.uk/.  Once you
have obtained the software distribution you will need to request a
license file for the computers on which you want to run the program.
Instructions on how to obtain the license file are on the web page.
Instructions for installing scanps are given below.


Running SCANPS
--------------

The interface to SCANPS follows a fairly standard Unix command style.
If you are used to using Unix, then this should be easy.  By default,
output goes to stdout and so can be piped to other programs for
processing.  If you are not happy with the Unix command line, then it
should be easy for someone to hide this interface behind a WWW form,
or other windowing interface.

SCANPS is controlled by a set of keyword, value commands.  The
commands may either be specified in a defaults file
(scanps_defaults.dat), or on the command line.  Values specified on
the command line override the defaults set in the scanps_defaults.dat
file.  The scanps_defaults.dat, scanps_alias.dat and
scanps_gapdefs.dat files must all reside in the directory pointed to
by the environment variable SCANPSDIR.  If SCANPSDIR is not defined,
then the program assumes that these files are in the user's current
directory.  Thus, an installation can have a central set of defaults
which may be overridden by individual users who may copy and modify
their own copies of the scanps_defaults.dat, scanps_alias.dat and
scanps_gapdefs.dat files.


The scanps_defaults.dat file 
----------------------------

The scanps_defaults.dat contains settings for valid commands.  When
you first install scanps, you will have to modify this file to define
the correct locations for your directories.  For example:

DB_DIR      /gjb/delly/databases/scanps/
MATRIX_DIR  /gjb/delly/gjb/c/scanps/gjscanps/mat/ 
LICENSE_DIR /gjb/delly/gjb/c/scanps/gjscanps/dat/
GENERAL_DIR /gjb/delly/gjb/c/scanps/gjscanps/dat/
MATRIX_FILE PAM250
MAX_NSEQ 5000000
CODON_FILE codon.dat
MATRIX_FILE nmd.mat
MAX_NSEQ 500000
MAX_SEQ_LEN 400000
LD_PEN 1 
LI_PEN 8
DB_FORMAT 0

This defines the directories for various files.  See the specific
command summary below for further details.

Commands in the scanps_defaults.dat file may either be the full
command name or an alias as defined in the scanps_alias.dat file (see
below).  Usually, it is best to use the full command name in the
scanps_defaults.dat file.  Aliases are there to ease typing commands
on the command line.  Note that ALL commands and their aliases are
case sensitive.

The LICENSE_DIR contains the file scanps.lic that enables the program
on your registered computers.

File names
----------

File names in the scanps_defaults.dat file MUST NOT be fully 
qualified.  Thus, 

MATRIX_FILE nmd.mat

will look in the users' current directory for the file.  If the matrix
file was in /data/local/scanps/matrices you would have to define:

MATRIX_DIR /data/local/scanps/matrices/

as well.

This is also true of the  binary database files.  The directory
for these may be defined by DB_DIR.  e.g.

DB_DIR /data/local/databases/scanps/
BDB_ROOT pir66

would look for the files pir66.bix and pir66.bsq in the directory
/data/local/databases/scanps.

The only other directory that can be specified in this way is
GENERAL_DIR, this holds "other" files needed by the program.  At the
moment, the only file that is put in GENERAL_DIR is the CODON_FILE for
use with DNA vs protein comparisons.  More of that later...


The scanps_alias.dat file
-------------------------

You will not normally modify this file.  For the sake of completeness,
the format of the file is described here.

The scanps_alias.dat file allows aliases to be defined for any of the
valid commands.  For example, here is an excerpt from scanps_alias.dat.

MAX_NSEQ        max_nseq        #Maximum number of sequences allowed
TIME            time            #Set to 1 to record CPU times
MODE            mode            #Type [scanps HELP modes] to see available modes
QSEQ_FORMAT     QSEQ_F          qseq_format     qseq_f   #0 = PIR format, 1 = FASTA format
DB_FORMAT       DB_F            db_format       db_f     #As for QSEQ_FORMAT
MAX_SEQ_LEN     max_seq_len     #Max allowed length for an amino acid sequence

Each command name is followed by a list of aliases, then a # followed by 
some optional descriptive text. 


Command line switches
---------------------

All the commands in scanps_alias.dat are available for use on the
command line.  You can either specify the command or its alias, with
or without a preceding - symbol.  For example, 

QSEQ_FILE hahu.seq
-qseq_file hahu.seq
-s hahu.seq

all do the same thing on the command line assuming the standard scanps_alias.dat file has not been modified.

A typical scan might be started by:

scanps -s hahu.seq -bdb owl -ld_pen 8 -mode 0 > hahu.out

This would scan the sequence in file "hahu.seq" against the binary
database called "owl" using a length dependent gap penalty of 8 with
whatever default matrix file was specified in the scanps_defaults.dat
file.

More examples are given below.


Controlling the scan
--------------------

SCANPS has a series of different MODEs.  Each mode does a different
job.  

MODE 0	Scan protein sequence against protein sequence database
	with simple gap penalty.  Default and fastest method.
MODE 2	As for MODE 1, but with Affine gaps.
MODE 20 DNA vs protein database with frameshifts (simple gaps).
MODE 22 DNA vs protein database scan with frameshifts (affine gaps).
MODE 99 Build scanps binary database files from FASTA or PIR format files.
MODE 100 Extract sequences from database (reads output of earlier scan).

MODE 0 is the simplest and fastest.  MODE 2 takes about 3 times longer
than MODE 0 on the same query sequence.  MODE 20 takes about 1.8 times
longer to run on a DNA sequence of the same length and MODE 22 takes 5
times longer.  The C code for MODES 2 and 20 and 22 have not been
heavily optimised for speed, so better times may be seen in the future
as this is done.  A 141 residue protein query running on 1 R10,000
processor (195MHz) against OWL (154,540 sequences, 48472285 residues)
in MODE 0 takes 239 seconds CPU time to scan, + 1.6 seconds to load
the database.  A query 8 times longer, takes 87 seconds to do the same
scan using 23 processors - this corresponds to 627 million cell
updates/second (see below).  These figures are approximate and will
vary with the load on the computer.

There are a variety of controlling commands that affect the scan and
the output.  By default, all results are printed to stdout.  You can 
override this by using the STDOUT command to specify a different file for 
output.  STDERR can also be redefined.

Commands affecting the scan - look in the file scanps_alias.dat to see
alternative names for these commands and see APPENDIX II for a full list 
of commands:

QSEQ_FILE  	is the query sequence file
qseq_format 	defines the format (0 for pir, 1 for fasta)
bdb_root	defines the root name of the binary database to search
ld_pen		length dependent gap-penalty
li_pen		length independent gap-penalty
matrix_file	pairscore matrix file (e.g. Dayhoff, BLOSUM etc )
use_gapdefs	set to 1 to enable the gap defaults in scanps_gapdefs.dat

Commands affecting the output:

log_scores	Calculate ln() scores for each comparison (Pearson).
do_sort		Sort results list before output.
max_nout	Maximum number of scores to output.
max_aout	Maximum number of alignments to output.
min_score	Minimum score for output (overrides max_nout).
min_ascore	Minimum score for an alignment to output.
show_titles	=1 for titles to print with scores, 0 for no titles.
show_blurb	=1 for full information in output (recommended).
do_stats        =1 to calculate statistics of score distribution and sort by probability.
probcut         probability cutoff for showing results.
top_only        =1 to show only the top scoring alignment for each pair of sequences.

See the full list in APPENDIX II for other valid commands.

Notes:  
------

Version 2.3.1 has probability scoring and this is normally the most
useful way of sorting the results of a scan.  You can adjust the
PROBCUT to allow more or fewer high scoring alignments to be output.

Setting PROBCUT to a value > 1.0 will cause scanps to output all 
hits down to the value of MAX_NOUT.

If you are not using probability scoring (i.e. do_stats=0) then:

Setting max_nout to -1 forces scanps to calculate from the value of
min_score, the number of identifiers to output.

Setting min_score to -1 forces scanps to calculate the min_score from the
value of max_nout.

Setting max_aout to -1 sets max_aout to max_nout.
Setting min_ascore to -1 sets min_ascore to min_score.

Normally, it is best to set min_score and min_ascore to -1, then
adjust the output by setting values for probcut and/or max_nout.


Commands to help find problems:
------------------------------

VERBOSE Setting this to something > 0 will give messages to stderr as
the program executes.  Setting it to a big number will give more
messages.  VERBOSE 100 should output all messages.  NOTE: If you set
VERBOSE and you still don't get useful messages, try setting it first
to 1, then to be absolutely sure of getting all messages, set VERBOSE
in the scanps_defaults.dat file rather than on the command line.

TIME Setting this to a number > 0 will output various timings to
std_err.  Again, bigger numbers should give more messages.


The scanps_gapdefs.dat file
---------------------------

This file sets default gap penalty combinations for each matrix type
and mode.  If USE_GAPDEFS is set to 1, then scanps will take the
default gap penalty combinations from the scanps_defaults.dat file for
the given matrix and MODE.  These values will OVERRIDE any penalties
specified in the scanps_defaults.dat file, or on the command
line.  If you wish to specify non-default gap penalty combinations,
then set USE_GAPDEFS to 0.

For example:

If the scanps_gapdefs.dat file has the following entries:

#Format is:
# mode:matrix_name:ld_pen:li_pen:fs_pen:fse_pen
#
0:nmd.mat:8:0:0:0
2:nmd.mat:4:12:0:0
0:md.mat:8:0:0:0
2:md.mat:0.2:12:0:0
#

and USE_GAPDEFS = 1, then 

scanps -s hahu.seq -mode 0 -m nmd.mat -bdb swissprot

will do a scan of the swissprot database with the penalty of 8

scanps -s hahu.seq -mode 2 -m md.mat -bdb swissprot

would do an affine scan with the penalties of ld_pen 0.2, li_pen 12.

scanps -s hahu.seq -mode 2 -m md.mat -bdb swissprot -ugd 0 -li_pen 12 -ld_pen 4

would do the affine scan with penalties of 12, 4 rather than 12, 0.2. 


WARNING: The scanps_gapdefs.dat file contains numbers that I guessed
might be appropriate for the given matrix and mode.  They are almost
certainly not the optimum choices.  Work is in progress to decide what
the optimum values should be...


Parallel processing
-------------------

Parallel processing is currently only supported on Silicon Graphics.
The code has been used on Challenge and Origin machines, but should
also work on Octane.  You must first set the number of processors to
use for the scan.  Do this with:

setenv MPC_NUM_THREADS N

where N = the number of processors to use with scanps.  More
processors mean that the scan will finish more quickly, but if the
query is short, then the scan will be so quick anyway (on an R10,000)
that there will be little advantage in running in parallel.  For
larger problems, e.g. a 1000 residue protein with affine gaps (mode 2)
scan I have seen over 99% efficiency in speedup to 20 processors.  If
MPC_NUM_THREADS is not set, the system will default to a sensible
number depending on the number of processors in your system.

If you are running scanps on a heavily loaded server, you may find
that scanps takes too much of the CPU time and so upsets other users.
If this is a problem, then try setting the following environment
variable:

setenv MPC_GANG OFF

or allocate fewer processors to your scanps job.

Although the search phase of a scanps scan is run in parallel, loading
of the database, sorting the results and generating the alignments are
all done on a single processor.  So, if you want to generate alignments for
every sequence in the database (pretty pointless!!) there is no point in 
running in parallel.

NOTE: The version of scanps compiled for parallel processing has a
fixed limit on the length of query sequence.  This cannot be
overridden by the MAX_SEQ_LEN command.  The serial (non-parallel)
version does not have this limitation.  The database sequence length
is not affected by this limit.  The program will exit if given a sequence
B
that is too long...


Some examples
-------------

1.  Scan hahu.seq (a FASTA format file) against the binary database OWL using
simple penalty and save all hits to the database that give better
than the default probability cutoff of 0.1.

scanps -mode 0  -s hahu.seq -bdb owl  -qseq_format 1  > hahu.out

The output file contains the following header information:

# SCANPS Scan: Mon Aug 11 15:21:37 1997
# SCANPS Version: 2.3 (Fri Jul  4 16:16:41 BST 1997)
# Author:  G. J. Barton,  University of Oxford, UK
# Query file: hahu.fasta
# Query ID: HAHU
# Query Title: Hemoglobin alpha chain - Human and chimpanzees                  
# Query Length: 141
# SCANPS MODE: 0
# Matrix file: /gjb/delly/gjb/c/scanps/gjscanps/mat//PAM250
# Matrix title: NCBI Matrix - see file for type
# LD penalty:   8.00
# MAX number of scores output: 10000
# MAX number of alignments output: 10000
# MIN score to output:  54.17
# MIN score of alignments to  output:  54.17
# Database name: /gjb/ramsden/databases/blast/owl.fa
# Created on: Thu Jul 24 14:15:30 1997
# Number of sequences: 191509
# Number of residues/bases in database: 60552547
# Score type: ln() scores
# FIT: m: 0.683790 c: 34.727547 var: 5.720673 nval: 172359
# PROBCUT: 0.100000
# -------------------------------------------------------------------
B

then the ranked score list - the end of which is shown here:

Rank score  Z      E      title

 783  103  11.41 0.0427   owl|S64703|S64703 myoglobin - slug sea hare (fragment)              
 784  103  11.38 0.0444   owl|P02210|GLB_APLLI GLOBIN (MYOGLOBIN). - APLYSIA LIMACINA (SLUG SEA H
 785  101  11.24 0.0531   owl|V00151|CHBGL2 CHBGL2 NID: g959 - goat.                          
 786  102  11.17 0.0581   owl|U15293|KSU15293 KSU15293 NID: g563924 - midge.                    
 787  101  11.01 0.0713   owl|P15447|GLB4_GLYDI GLOBIN, MONOMERIC COMPONENT M-IV


then the alignments - the last is shown here:


# Start_Alignments_Rank: 787
# Query: HAHU Target: owl|P15447|GLB4_GLYDI Number_of_Alignments: 1
# Target_Title: GLOBIN, MONOMERIC COMPONENT M-IV. - GLYCERA DIBRANCHIATA (BLOODWORM).                                                                                                                                   
# N: 1 Raw_Score: 102 Query_Length: 141 Target_Length: 147 ln()_Score: 101
        **.*..  * ..*  . *.. *   * * . . ***      . *  *  
      2 LSPADKTNVKAAWGKV GAHAG EYGAEALERMFLS FPTTKTYFPHFDL      48
      2 LSAAQRQVVASTWKDIAGSDNGAGVGKECFTK FLSAHHDIAAVFG FS       48

        . . . * . * **   .  **.*. *.  *   . *..  * .   . .
     49 SHGSAQVKGHGKKVADALTNAVAHV DD  MPNALSALSDLH AHKLR V      93
     49 GASDPGVADLGAKVLAQIGVAVSHLGDEGKMVAEMKAVGVRHKGYGYKHI      98

         .  *  *.  ** ..  .. .  *.*   .     * .*  * *
     94 DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTS     138
     99 KAEYFEPLGASLLSAMEHRIGGKMTAAAKDAWAAAYADISGALIS     143

# End_N: 1
# End_Alignments_Rank: 787

Stars above the alignment show identities (obviously), dots show pairs
of amino acids that have positive scores in whatever pairscore matrix
you are using.

finally, some timing information is printed:

# Scan Completed at: Mon Aug 11 15:21:43 1997
# 
# Times:                              (Seconds)
# Load matrix, query and index:        0.0     0.0     0.0     0.0
# Load database:                       1.8     1.8     1.8     1.8
# Scan database:                      14.8    16.5    21.6    23.4
# Sort results :                       0.8    17.4     1.7    25.1
# Output scores:                       1.9    19.2     2.0    27.1
# Generate and write alignments:       6.1    25.3     6.1    33.2
# Millions of cell updates/sec in scan (CPU): 576.89 (Elapsed): 395.27

This scan was performed in parallel on a 24 processor Challenge with
1MByte cache R10,000 processors.  The four columns of timing
information the first two give actual and cumulative CPU times for the
master thread, the second two give actual and cumulative elapsed
times.  Figures for millions of array operations per second are also
shown.  These are the product of the length of the query sequence and
the number of amino acids in the database, divided by the database
scan (or elapsed) times.


Various aspects of the scan can be adjusted.  See APPENDIX II for
details of commands.

Scanning Protein with DNA sequence
----------------------------------

Version 2.3 allows a DNA sequence to be compared to a protein
database (MODE 22).  Scanps first translates the DNA in three forward reading
frames to create a pseudo sequence of amino acids where every base is
an amino acid residue or STOP.  This sequence is then compared to each
protein sequence using a dynamic programming algorithm.  The result is
an alignment of the DNA with a protein that can include frameshifts.

Additional adjustable parameters are:

FS_PEN and STOP_WEIGHT.  These are defined in the command
summary APPENDIX II.

Work is in progress to establish suitable values for these parameters.
I have found that a large and negative value for STOP_WEIGHT is
appropriate.  FS_PEN  should be set larger than LI_PEN and LD_PEN.

Here is an example of the comparison of the primary transcript of
human alpha haemoglobin versus the corresponding protein sequence.
The DNA has been modified to introduce a single frameshift error, an
additional G at position 356.

Here is the command line:

scanps -s hahunucerror.fasta -d hahu.fasta -db_format 1 -mode 22 \
-ugd 0 -m PAM250 -ld_pen 0.2 -li_pen 12 -fs_pen 24 -qseq_format 1

we are scanning a database of only one sequence (-d hahu.fasta),
overriding the gap-penalty defaults for this MODE and matrix (-ugd 0)
and supplying our own gap-penalties (-ld_pen, -li_pen, -fs_pen).  The
query sequence is in FASTA format, but we have set the default
sequence format to 0 (PIR) in the scanps_defaults.dat file.
Accordingly, we need to override the default on the command line
(-qseq_format).

Here is the output:

# SCANPS Scan: Mon Aug 11 15:36:04 1997
# SCANPS Version: 2.3 (Fri Jul  4 16:16:41 BST 1997)
# Author:  G. J. Barton,  University of Oxford, UK
# Query file: hahunucerror.fasta
# Query ID: HSHBA4
# Query Title: Dummy title inserted by gseq_fasta                              
# Query Length: 835
# SCANPS MODE: 22
# Matrix file: /gjb/delly/gjb/c/scanps/gjscanps/mat//PAM250
# Matrix title: NCBI Matrix - see file for type
# LD penalty:   0.20
# LI penalty:  12.00
# FS  penalty:  24.00
# MAX number of scores output: 1
# MAX number of alignments output: 10000
# MIN score to output: 769.70
# MIN score of alignments to  output: 769.70
# Number of residues/bases in database: 141
# Score type: ln() scores
# Start_Scores: ----------------------------------------------------
   1 769 HAHU Hemoglobin alpha chain - Human and chimpanzees    
# End_Scores: ------------------------------------------------------
# -------------------------------------------------------------------
# Start_Alignments_Rank: 1
# Query: HSHBA4 Target: HAHU Number_of_Alignments: 1
# Target_Title: Hemoglobin alpha chain - Human and chimpanzees                                                                                                                                                          
# N: 1 Raw_Score: 566 Query_Length: 835 Target_Length: 141 ln()_Score: 769
     41 gctcggaaagaggtgagggcgggtggggcgatgtccttgcgttcgcacac
        ttcccaacataccggatgcacgaagcactagggctcgcacgccccgcagc
        ggttccgcccgccgtgccgctcgttggcgggacccccccgccgccgccga     190
     41 VLSPADKTNVKAAWGKVGAHAGEYGAEALER!GSLPCSDPGSSPARTHRP      90
        *******************************                   
      1 VLSPADKTNVKAAWGKVGAHAGEYGAEALER                         31

    191 ctagcgcgcacacctgtccaatcttcaaaattcctgcacgtgcgagcgaa
        cccttccacacccaccccggtttctcccacatcatatgagccatagagaa
        caccgcgcaccctctttccggcgcccccgcccgcccgccctcgtgcccgg     340
     91 PSTVLAPDPNPTPHSASPRRMFLSFPTTKTYFPHFDLSHGSAQVKGHGKK     140
                            ******************************
     32                     MFLSFPTTKTYFPHFDLSHGSAQVKGHGKK      61

    341 ggggcGaagggcgggacagctgcagccgcaccggcgataGtgggcgctgt
        tcact cactcataatcactcctgatacaatgtactata gccggagcgc
        gcccg cccggcgccgccggccgccgcgcgtggcgcccg aggcggatgg     486
    141 VADAL^TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK^!AAGRERSGS     188
        ***** *********************************           
     62 VADAL^TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK^                99

    487 aggagcttcgagtcgtcggtcaccccgtggcacttgcccactccgacggc
        ggatccccagggcggtgataggggtgcgcctcttccattgagtttctcca
        gcgggtctgcaaacggggggcgggggcgcagccctagcacccgggcgccc     636
    189 RGEMAPSSQGRGSRGLREV!RRRRLRAWAALTLFSAQLLSHCLLVTLAAH     238
                                             *************
    100                                      LLSHCLLVTLAAH     112

    637 ccggtacggcgtcgatcgtgaagcatatc
        tccatccctacctaattcctgcttccaag
        cccgcctggcccgcgcgttgccggccact     723
    239 LPAEFTPAVHASLDKFLASVSTVLTSKYR     267
        *****************************
    113 LPAEFTPAVHASLDKFLASVSTVLTSKYR     141

# End_N: 1
# End_Alignments_Rank: 1
# ------------------------------------------------------
# Scan Completed at: Mon Aug 11 15:36:04 1997
# 
# Times:                              (Seconds)
# Load matrix, query and index:        0.0     0.0     0.0     0.0
# Load database:                       0.0     0.0     0.0     0.0
# Scan database:                       0.1     0.1     0.9     0.9
# Sort results :                       0.0     0.1     0.0     0.9
# Output scores:                       0.0     0.1     0.0     0.9
# Generate and write alignments:       0.1     0.2     0.1     1.0
# Millions of cell updates/sec in scan (CPU):   1.18 (Elapsed):   0.13


The output shows the base triplets arranged VERTICALLY above the
corresponding amino acid residue.  The protein sequence is shown
conventionally below this.  Frameshifts  are indicated by a caret
symbol "^" and the bases involved in the frameshift are shown in
uppercase.  You may see odd effects around frameshifted gaps in
alignments.  My advice is to look carefully at any frameshifted gaps
to see if there might be alternative alignments by taking a different
reading frame in the region, or shifting the frameshift gap a base or
so to either side.  With any dynamic programming method, there may be
more than one, equally valid alignment in a region, but the program
only reports one solution.  There are more possibilities for such
alternatives in DNA vs protein comparisons than for DNA v DNA or
protein v protein.

As well as MODE 22, there is MODE 20.  MODE 20 does not have affine
gap penalties and as a consequence it is about a factor fo 3 faster
than MODE 22.  Penalties for frameshifts are length dependent in MODES
20 and 22, thus if FS_PEN is 8, then a single frameshift costs 8 and a
double costs 16.


Creating SCANPS binary databases 
--------------------------------

Why use a SCANPS' binary database?  Well, it takes around 25 seconds
to parse and load the 192,000 sequence OWL database from a FASTA
format file.  In contrast, the SCANPS binary database loads in around
a second.  If you are going to do a lot of scans against a particular
database and you have the disk space, then it is well worth building
the binary database.  Scanps has been optimised to read data fast from
a local disk on a high speed bus (e.g. Fast Wide Differential SCSI).
You may get much slower database loading speeds if you are reading
across a network using NFS.

Scanps is used to create its own databases for searching.  By default
the databases are pre-indexed with the index string from the default
pairscore matrix.  If you expect to use pairscore matrices that have
varied index strings, then switch this option off with the APPLY_INDEX
0 command.  This will increase the load time by about 0.5 second so
scans will take a little longer and indexing will be done on the fly.

For example: To build a binary database from the file swissprot.fa
which is a fasta format sequence file and create a database called swiss.

scanps -d swissprot.fa -bdb swiss -db_format 1 -mode 99 -apply_index 1 -verbose 1 -time 1

This will create the files swiss.bix and swiss.bsq.  The .bix file
contains information about the database + pointers to the ID, Title
and SEQ portions of each entry in the .bsq file.  The .bsq file is
just the sequence database, but written in a consistent format.

-verbose and -time are not essential, but they tell you what is going
on as the database is created.  On a Challenge R10,000 (195MHz) it
takes about 25 seconds to create the OWL binary database.

MODE 100
--------

Version 2.2.5 introduced MODE 100.  This allows an output file from 
scanps to be fed back into scanps to extract sequences from the database.

The advantage of this mode is apparent if a scan takes a very long
time.  It is often better to specify max_aout 0 to suppress alignment
output, then re-run scanps on a selected set of sequences from the top
hits.  MODE 100 lets you do this.

The EXTRACT_SEQS N command specifies you want the top N sequences to be 
saved to a file (default is scanps_extract.seq).  In MODE 100 you give the
previous output file as the standard input and use EXTRACT_SEQS to specifiy how
many sequences you want.

e.g.  assuming we have run scanps once and produced an output file called 
"scanps.scan" as a result of a scan of owl.  Then 

scanps -MODE 100 -EXTRACT_SEQS 20 -bdb owl < scanps.scan

would save the top 20 sequences from the results in  scanps.scan to the
scanps_extract.seq file.  You can specify a different file to save to
by using the EXTRACT_FILE command (see appendix).

Installing scanps
-----------------

1. Obtain the gzipped tar file scanps.tar.gz.

2. uncompress and un tar the distribution:

	gunzip scanps.tar.gz
	tar xvf scanps.tar

3. You should now have a directory tree that looks like this - there may be 
extra binary directories in your distribution.  e.g. sgi_5.3.


scanps/bin/sgi_6.2/README 		binaries - see this README FILE.
scanps/bin/sgi_6.2/mscanps_16K          multiprocessor versions of scanps.
scanps/bin/sgi_6.2/mscanps_2M 
scanps/bin/sgi_6.2/mscanps_64K
scanps/bin/sgi_6.2/uscanps              uniprocessor version of scanps.

scanps/dat/codon.dat			data files.
scanps/dat/scanps_alias.dat
scanps/dat/scanps_defaults.dat
scanps/dat/scanps_gapdefs.dat 

scanps/doc/scanps.doc			this document.

scanps/examples/hahu.fasta		example query.
scanps/examples/hahu.log 		example output.
scanps/examples/mode22.log		example frame shift output.
scanps/examples/protein.seq 		small database to try out scans.

scanps/mat/BLOSUM62			pairscore matrices.
scanps/mat/PAM250


4.  Choose a directory for the scanps_defaults.dat, scanps_alias.dat
and scanps_gapdefs.dat master files.  Modify scanps_defaults.dat to
include the DB_DIR, MATRIX_DIR, GENERAL_DIR and appropriate file names
for your system.  Since you will almost certainly have NCBI's BLAST
installed on your system, I recommend pointing MATRIX_DIR at their
matrix directory.  scanps can read NCBI format matrix files and they
provide a good range of alternatives.

4.  Obtain your scanps license from http://barton.ebi.ac.uk/.  Install this 
in the file scanps.lic in the directory LICENSE_DIR defined in scanps_defaults.dat.

5.  setenv SCANPSDIR to the directory containing the defaults and alias files.

6.  Optionally, in the chosen database directory build the binary databases for
the sequence libraries you are interested in.  The fasta format files,
if any, need not be in the same directory, but DB_DIR must point to
the right place, or be set to ./

7.  Ensure each user has the SCANPSDIR and MPC_NUM_THREADS environment
variables set as appropriate.

5.  Test run the program, perhaps with the little protein.seq
database.  This database is actually the entire PIR 14 database from
1988.

In the bin directory there may be additional subdirectories with binaries
for different computer types.  Please see the readme files in those directories
for further details.


Comment
-------

Though Unix hackers will like it, I don't expect the scanps interface
to be seen by many.  This program is meant to hide behind a WWW form
or somesuch.

GJB 11/8/1997

APPENDIX I - Revision notes
---------------------------

2.1

First full parallel version with searching for protein vs protein database.
Includes code to build and use binary database for fast loading.
Includes code for sorting the results.
ln() scores.
alignment output options.
All within the same program.

2.2

Fix get_fasta.c so that it will read a FASTA file that is missing a title.
Fix DB_DIR to permit missing / at end of directory name.
Add MATRIX_DIR command.

2.2.1

Add option to read NCBI format matrix files.  MATRIX_TYPE command.

2.2.2

Various odds and ends.  Added EXTRACT options to enable the sequence
of high scoring database hits to be fished out of the database.

2.2.3

Add routines to compare DNA to protein, with frameshifts.  
Fix mysterious looking  bug when generating alignments with affine gaps.  
Tidy up timing routines.
Add option to print full length titles.

First released version?  Nope.

2.2.4

Small changes not worth mentioning

2.2.5

Add MODE 20 for fast frameshifting DNA vs Protein comparisons.  Add
HQUERY option at complile time to allow VERY big query sequences
(e.g. 2 Megabases).  Add COMPLEMENT_QUERY option to allow scanning with 
complement of the query.  Modify MODE 22 code to eliminate FSE_PEN and replace
with simple length-dependent penalty for frameshift gaps.
Add MODE 100 to allow sequences to be extracted from the database following
a scan that produced no alignments.

2.2.6

Add statistical estimates based on extreme value distribution.  This is based on the 
statistics used in the programs FASTA and SSEARCH3 though the implementation is
different.  No statistics in alignment output as yet,
just the score list.

2.3

Small bug fixes. Add the licensing routines.  Tidy up the distribution.


2.3.1

Small changes to the way in which the probcut and max_nout options interact.
The program now allows max_nout to control the number of sequences output
when probcut >1.  Bug removed for probcut ==1 case.


Plans for additions - in no special order
----------------------------------------

Option to read multiple alignment files as query.
Option to read "profiles" as query.
Include alternative significance statistics in output.
Extend range of methods to include variable gap-penalties.
Add parallel pairwise comparison option.
Add option to read and apply HMMs.
Option to scan a database of alignments.
Allow comparison of protein sequence to DNA database.


APPENDIX II - Alphabetical list of scanps commands
-----------------------------------------------

Many of the commands listed here are left over from scanps version 1.0
(which did absolutely everything).  Some of the features of version
1.0 that are not implemented in version 2.x will be, but not just yet!

Note:  These are the defining names of the commands.  You should check
the scanps_alias.dat file for alternative names for the same command.

APPLY_INDEX

Used when creating a binary database file (mode 99) APPLY_INDEX 1 will
convert all amino acid characters to a number corresponding to the
position that character is in the MATRIX_FILE index string.
Typically, A=0, R=1 and so on.  Setting APPLY_INDEX 0 prevents this
indexing happening.  This can be useful if you use a lot of different
pair score matrices with different index strings.  The sacrifice is a
small reduction in scan speed.

AUTO_CORNER	Not implemented.

BDB_ROOT

The root name of the binary database files you are using.  Binary
databases have two files, something.bix and something.bsq.  BDB_ROOT
specifies the "something" part of the name.  This is used in MODE 99 to 
name the database and other modes when scanning.

CODON_FILE

Defines the name of the file that contains the codon table to use when
translating DNA to protein.  The file has the format:

AAA K
AAC N
AAG K
AAT N
ACA T
etc

The file MUST be sorted into alphabetical order.

COMPLEMENT_QUERY	

If set to 1 then the DNA query sequence is complemented and that
sequence is scanned against the database.

CUT_CONSTANT	Not implemented.
CUT_CORNERS	Not implemented.

DB_DIR

Defines the directory in which the BINARY database files will reside.
Note that this has no effect on files defined with the DB_FILE
command.

DB_FILE	

Specify name of a FASTA or PIR format sequence file to scan against,
or (in mode 99) to build into a scanps binary database file.

DB_FORMAT

Specify the format of the database file to scan.  Either 0 for PIR format
or 1 for FASTA format.

DO_SORT

Set to 1 to sort the results list from highest score to lowest. Set to
0 for no sorting.

EPB_PEN		Not implemented
EPL_PEN 	Not implemented
EPR_PEN 	Not implemented
EPT_PEN 	Not implemented

DO_STATS

Tell the program to do some statistics on the score distribution of a scan.  This is 
normally desirable.

DUMP_STATS

Dump out a lot of information about the statistics done on the score distribution.
This is just for debugging.

EXTRACT_FILE

Name of file to save full length sequences to that are at the top of
the results list.  Default name is scanps_extract.seq.

EXTRACT_SEQ

Number of sequences to save from results list.

EXTRACT_SEQ_FORMAT

Format of extracted sequence file.  0 = PIR, 1 = FASTA.

FIND_REPEATS	Not implemented.
FIT_FILE	Not implemented.

FSE_PEN

Frame shift extension penalty.  The penalty to be applied to extending
an exising gap that is a frameshift.  Only applicable in MODE 22.  DNA
vs Protein.  From Version 2.2.5 NO LONGER IMPLEMENTED.

FS_PEN

Penalty for creating a frame shift gap.

GAP_CHARACTER

The character to be used to represent a gap in alignment output.

GENERAL_DIR

The name of a directory for various scanps files.  e.g. the codon.dat file.

LD_PEN

Length dependent gap-penalty. In MODE 0 this is the only penalty.  In MODE 2 this is the penalty
for extending an existing gap.  In MODE 22 it is the penalty for extending an existing in-frame gap.

LICENSE_DIR

Directory that contains the scanps.lic file.

LI_PEN

Length independent gap-penalty.  As for LD_PEN, but this is the gap-creation penalty.

LOG_SCORES

Calculate ln() scores for each sequence comparison and use these to
rank the results.  If S is the raw score for an alignment of a query
and database sequence, of lengths M and N then the ln() score is given
by ln() = S*(log(M)/log(N)).  This transformation, suggested by
W. Pearson, is usually beneficial in database scans.  It tries to
correct for the tendency for long database sequences to give higher
scores with a typical query.

MATRIX_DIR

Directory that contains the pair score matrix files.  e.g. BLOSUM62,
PAM250 etc.

MATRIX_FILE

The name of the pairscore matrix file.

MATRIX_FORMAT

The format of the pairscore file.  0 for PIR 1 for NCBI.

MAX_AOUT

Max number of alignments to report.

MAX_BLOC_SEQ	Not implemented

MAX_ID_LEN	

Max length for a sequence identifier.

MAX_NOUT	

Max number of scores to output.

MAX_NSEQ	

Max number of sequences allowed.

MAX_SEQ_LEN	

Max length allowed for a sequence (in bases or residues).  Note: for 
multiprocessor versions of scanps, this value only affects the database
sequence lengths.  The query sequence limit is fixed.

MAX_TITLE_LEN	

Max lenth of a sequence title.

MIN_ASCORE	

Min score necessary before an alignment will be output.

MIN_LEN		Not implemented.

MIN_NSTATS	Minimum database size required before statistics are
estimated.  Setting this too small will produce garbage probabilities.

MIN_SCORE	

Minimum score required for a result to be reported.

MODE

Defines what sort of analysis will be performed.

MODE 0	protein vs protein with simple length-dependent gap penalty.
This is the fastest and oldest code in the program.  For most database scans,
this is the best method.  The code that implements this option has been 
heavily used since 1989.

MODE 2 protein vs protein with affine penalties.  In the current
program, this is about a factor of 3 slower than MODE 0.  This is
newer code and I think it could probably be persuaded to go faster.

MODE 20 DNA vs protein with simple gap penalty for in-frame and 
frameshift gaps.   

MODE 22 DNA vs protein with affine penalties for in-frame and simple
penalties for frameshift gaps.  It appears to work OK, but has not
been extensively tested.


NRANS		Not implemented.
OSEQ_FILE	Not implemented.

OUTPUT_LENGTH

Number of residues to output per line in alignments.

PCUT		Not implemented.

PRECISION

Since, most of the arithmetic in scanps is addition and subtraction,
it is most efficient to do all calculations with integers.  The
PRECISION value defines a factor by which all numbers read into the
program (e.g. pairscore matrix, gap-penalties) are multiplied.
Typically this is set to 100.

PRED_FILE	Not implemented.
PRINT_ALIGN	Not implemented.

PROBCUT		Probability cutoff.  Active when DO_STATS is on.

QBLC_FILE	Not implemented.
QBLC_FORMAT	Not implemented.

QSEQ_FILE

Name of the file that contains the query sequence.

QSEQ_FORMAT

Format of the query sequence file.  0 for PIR, 1 for FASTA.

READ_PRED	Not implemented.
ROBSIM		Not implemented.
RUN_SW_MIN	Not implemented.
SAVE_PROFILES	Not implemented.
SCAN		Not implemented.
SEC_FILE	Not implemented.

SHOW_BLURB

Set to 1 to print various useful information in the results file.  You
might want to set this to 0 to simplify parsing of a list of score, id
pairs in the output file.  Probably only useful for compatibility with
sortsco, the program that came with scanps version 1.0.

SHOW_PMATRIX

Print the pairscore matrix in the information in the output file.
Requires SHOW_BLURB to be set to 1.

SHOW_TITLES

Set to a number > 0 to print titles of database sequences alongside
the scores and ID's in the sorted results list.   For example, 
SHOW_TITLES 50 will print the first 50 characters of each title line.

STDERR

Define a different file for error messages.

STDIN

Define a different file for standard input (e.g. the database).

STDOUT

Define a different file for output.  Usually I use pipes.

STOP_WEIGHT

Specify the score for aligning a DNA stop codon with an amino acid.
Large and negative is a good idea!

TIME

Set to 1 to print CPU times at various stages of the program.  Set to > 1 to 
print even more times!  Pretty much obsolete now, as the program prints times
summary at the end anyway.

TOP_CUTFRAC 

Fraction of top scores removed before statistics are calcualted.

TOP_ONLY

Scanps can produce the top and alternative local alignments between
the query and the database.  This can produce verbose output, so setting
TOP_ONLY to 1 tells the program only to report the top scoring alignment for each
pair of sequences.

USE_GAPDEFS

If =0 then scanps takes the default gap penalties from
scanps_defaults.dat, or the command line.  If = 1, then the
scanps_gapdefs.dat file is consulted for a match in MODE and matrix.
The penalties are then defined by whatever the scanps_gapdefs.dat file
shows.  If the MATRIX is not present in the scanps_gapdefs file, then
the default penalties are used.


VERBOSE	

Set to 1 to get lots of useful messages about the progress of the
program.  Set to > 1 to get even more messages.  This can be helpful
to check (a) that the query sequence you think you are scanning is the
one the program has swallowed.  Or to find out at what stage the
program has crashed.  Of course, that should never happen....

VINGRON_FILE    Not implemented

End of scanps.doc---------------------------------------------