SCANPS version 2.3.1 G. J. Barton, University of Oxford, UK --------------------------------------------------------------- See APPENDIX I for revision list. See APPENDIX II for a full list of commands. Preliminary User Guide (18th August 1996) ----------------------------------------------------------------------------- Latest revision: 2.3.1 11 August 1997: Minor revisions to manual: 30 Dec 1997 ----------------------------------------------------------------------------- Pronounciation -------------- SCANPS is pronounced scan-P-S. What is scanps? --------------- Scanps is a program for comparing a protein or nucleic acid sequence to a database of sequences. It implements various flavours of dynamic programming algorithm such as the Smith-Waterman local alignment method. Scanps 2.3.1 supports: 1. Scanning of a protein query sequence against a protein sequence database. The algorithm is full Smith-Waterman either with simple length-dependent penalty, or with affine gaps (u +gl). 2. Version 2.2.3 and later also allows a DNA sequence to be compared to a protein sequence database using a comparison method that allows for frameshift errors. For all methods, the alignment output reports can output not only the top scoring alignment between each pair but sub-alignments down to a user-defined threshold. In version 2.3.1 and later, the statistical significance of alignments is estimated by a method similar to that used by W. Pearson's SSEARCH and FASTA programs. The program will read query and database sequence files either in FASTA or PIR format. To speed start-up times, particularly on parallel processing computers, the program may also use its own compact database format. Unlike SCANPS version 1.x, SCANPS version 2.X is aimed at workstations and servers with sufficient memory to hold the protein database in RAM. For the OWL protein sequence database this is around 70MBytes for the data. The database is held in memory to improve the efficiency of searches on shared memory parallel processing machines such as the Silicon Graphics Challenge. The program supports parallel processing on the SGI Challenge and Origin and gives good scaling up to 32 processors, the most we have tested. Citing the program ------------------ A paper is in preparation that describes the program. In the meantime, please cite, Barton, G. J. (1997) "SCANPS Version 2.3.1 User guide", University of Oxford, UK. The underlying algorithm is described in: Barton, G. J. (1993) CABIOS, 9, 729-734. Disclaimer ---------- SCANPS is provided "as-is" and without warranty of any kind, express, implied or otherwise, including without limitation any warranty of merchantability or fitness for a particular purpose. In no event will the author be liable for any special, incidental, indirect or consequential damages of any kind, or any damages whatsoever resulting from loss of data or profits, whether or not advised of the possibility of damage, and on any theory of liability, arising out of or in connection with the use or performance of this software. Installing SCANPS ----------------- Scanps 2.3.1 is currently available for Silicon Graphics computers in parallel and uniprocessor forms. The uniprocessor version is also available for Sun computers running Solaris 2.5 or later. To obtain scanps please see the WWW page: http://barton.ebi.ac.uk/. Once you have obtained the software distribution you will need to request a license file for the computers on which you want to run the program. Instructions on how to obtain the license file are on the web page. Instructions for installing scanps are given below. Running SCANPS -------------- The interface to SCANPS follows a fairly standard Unix command style. If you are used to using Unix, then this should be easy. By default, output goes to stdout and so can be piped to other programs for processing. If you are not happy with the Unix command line, then it should be easy for someone to hide this interface behind a WWW form, or other windowing interface. SCANPS is controlled by a set of keyword, value commands. The commands may either be specified in a defaults file (scanps_defaults.dat), or on the command line. Values specified on the command line override the defaults set in the scanps_defaults.dat file. The scanps_defaults.dat, scanps_alias.dat and scanps_gapdefs.dat files must all reside in the directory pointed to by the environment variable SCANPSDIR. If SCANPSDIR is not defined, then the program assumes that these files are in the user's current directory. Thus, an installation can have a central set of defaults which may be overridden by individual users who may copy and modify their own copies of the scanps_defaults.dat, scanps_alias.dat and scanps_gapdefs.dat files. The scanps_defaults.dat file ---------------------------- The scanps_defaults.dat contains settings for valid commands. When you first install scanps, you will have to modify this file to define the correct locations for your directories. For example: DB_DIR /gjb/delly/databases/scanps/ MATRIX_DIR /gjb/delly/gjb/c/scanps/gjscanps/mat/ LICENSE_DIR /gjb/delly/gjb/c/scanps/gjscanps/dat/ GENERAL_DIR /gjb/delly/gjb/c/scanps/gjscanps/dat/ MATRIX_FILE PAM250 MAX_NSEQ 5000000 CODON_FILE codon.dat MATRIX_FILE nmd.mat MAX_NSEQ 500000 MAX_SEQ_LEN 400000 LD_PEN 1 LI_PEN 8 DB_FORMAT 0 This defines the directories for various files. See the specific command summary below for further details. Commands in the scanps_defaults.dat file may either be the full command name or an alias as defined in the scanps_alias.dat file (see below). Usually, it is best to use the full command name in the scanps_defaults.dat file. Aliases are there to ease typing commands on the command line. Note that ALL commands and their aliases are case sensitive. The LICENSE_DIR contains the file scanps.lic that enables the program on your registered computers. File names ---------- File names in the scanps_defaults.dat file MUST NOT be fully qualified. Thus, MATRIX_FILE nmd.mat will look in the users' current directory for the file. If the matrix file was in /data/local/scanps/matrices you would have to define: MATRIX_DIR /data/local/scanps/matrices/ as well. This is also true of the binary database files. The directory for these may be defined by DB_DIR. e.g. DB_DIR /data/local/databases/scanps/ BDB_ROOT pir66 would look for the files pir66.bix and pir66.bsq in the directory /data/local/databases/scanps. The only other directory that can be specified in this way is GENERAL_DIR, this holds "other" files needed by the program. At the moment, the only file that is put in GENERAL_DIR is the CODON_FILE for use with DNA vs protein comparisons. More of that later... The scanps_alias.dat file ------------------------- You will not normally modify this file. For the sake of completeness, the format of the file is described here. The scanps_alias.dat file allows aliases to be defined for any of the valid commands. For example, here is an excerpt from scanps_alias.dat. MAX_NSEQ max_nseq #Maximum number of sequences allowed TIME time #Set to 1 to record CPU times MODE mode #Type [scanps HELP modes] to see available modes QSEQ_FORMAT QSEQ_F qseq_format qseq_f #0 = PIR format, 1 = FASTA format DB_FORMAT DB_F db_format db_f #As for QSEQ_FORMAT MAX_SEQ_LEN max_seq_len #Max allowed length for an amino acid sequence Each command name is followed by a list of aliases, then a # followed by some optional descriptive text. Command line switches --------------------- All the commands in scanps_alias.dat are available for use on the command line. You can either specify the command or its alias, with or without a preceding - symbol. For example, QSEQ_FILE hahu.seq -qseq_file hahu.seq -s hahu.seq all do the same thing on the command line assuming the standard scanps_alias.dat file has not been modified. A typical scan might be started by: scanps -s hahu.seq -bdb owl -ld_pen 8 -mode 0 > hahu.out This would scan the sequence in file "hahu.seq" against the binary database called "owl" using a length dependent gap penalty of 8 with whatever default matrix file was specified in the scanps_defaults.dat file. More examples are given below. Controlling the scan -------------------- SCANPS has a series of different MODEs. Each mode does a different job. MODE 0 Scan protein sequence against protein sequence database with simple gap penalty. Default and fastest method. MODE 2 As for MODE 1, but with Affine gaps. MODE 20 DNA vs protein database with frameshifts (simple gaps). MODE 22 DNA vs protein database scan with frameshifts (affine gaps). MODE 99 Build scanps binary database files from FASTA or PIR format files. MODE 100 Extract sequences from database (reads output of earlier scan). MODE 0 is the simplest and fastest. MODE 2 takes about 3 times longer than MODE 0 on the same query sequence. MODE 20 takes about 1.8 times longer to run on a DNA sequence of the same length and MODE 22 takes 5 times longer. The C code for MODES 2 and 20 and 22 have not been heavily optimised for speed, so better times may be seen in the future as this is done. A 141 residue protein query running on 1 R10,000 processor (195MHz) against OWL (154,540 sequences, 48472285 residues) in MODE 0 takes 239 seconds CPU time to scan, + 1.6 seconds to load the database. A query 8 times longer, takes 87 seconds to do the same scan using 23 processors - this corresponds to 627 million cell updates/second (see below). These figures are approximate and will vary with the load on the computer. There are a variety of controlling commands that affect the scan and the output. By default, all results are printed to stdout. You can override this by using the STDOUT command to specify a different file for output. STDERR can also be redefined. Commands affecting the scan - look in the file scanps_alias.dat to see alternative names for these commands and see APPENDIX II for a full list of commands: QSEQ_FILE is the query sequence file qseq_format defines the format (0 for pir, 1 for fasta) bdb_root defines the root name of the binary database to search ld_pen length dependent gap-penalty li_pen length independent gap-penalty matrix_file pairscore matrix file (e.g. Dayhoff, BLOSUM etc ) use_gapdefs set to 1 to enable the gap defaults in scanps_gapdefs.dat Commands affecting the output: log_scores Calculate ln() scores for each comparison (Pearson). do_sort Sort results list before output. max_nout Maximum number of scores to output. max_aout Maximum number of alignments to output. min_score Minimum score for output (overrides max_nout). min_ascore Minimum score for an alignment to output. show_titles =1 for titles to print with scores, 0 for no titles. show_blurb =1 for full information in output (recommended). do_stats =1 to calculate statistics of score distribution and sort by probability. probcut probability cutoff for showing results. top_only =1 to show only the top scoring alignment for each pair of sequences. See the full list in APPENDIX II for other valid commands. Notes: ------ Version 2.3.1 has probability scoring and this is normally the most useful way of sorting the results of a scan. You can adjust the PROBCUT to allow more or fewer high scoring alignments to be output. Setting PROBCUT to a value > 1.0 will cause scanps to output all hits down to the value of MAX_NOUT. If you are not using probability scoring (i.e. do_stats=0) then: Setting max_nout to -1 forces scanps to calculate from the value of min_score, the number of identifiers to output. Setting min_score to -1 forces scanps to calculate the min_score from the value of max_nout. Setting max_aout to -1 sets max_aout to max_nout. Setting min_ascore to -1 sets min_ascore to min_score. Normally, it is best to set min_score and min_ascore to -1, then adjust the output by setting values for probcut and/or max_nout. Commands to help find problems: ------------------------------ VERBOSE Setting this to something > 0 will give messages to stderr as the program executes. Setting it to a big number will give more messages. VERBOSE 100 should output all messages. NOTE: If you set VERBOSE and you still don't get useful messages, try setting it first to 1, then to be absolutely sure of getting all messages, set VERBOSE in the scanps_defaults.dat file rather than on the command line. TIME Setting this to a number > 0 will output various timings to std_err. Again, bigger numbers should give more messages. The scanps_gapdefs.dat file --------------------------- This file sets default gap penalty combinations for each matrix type and mode. If USE_GAPDEFS is set to 1, then scanps will take the default gap penalty combinations from the scanps_defaults.dat file for the given matrix and MODE. These values will OVERRIDE any penalties specified in the scanps_defaults.dat file, or on the command line. If you wish to specify non-default gap penalty combinations, then set USE_GAPDEFS to 0. For example: If the scanps_gapdefs.dat file has the following entries: #Format is: # mode:matrix_name:ld_pen:li_pen:fs_pen:fse_pen # 0:nmd.mat:8:0:0:0 2:nmd.mat:4:12:0:0 0:md.mat:8:0:0:0 2:md.mat:0.2:12:0:0 # and USE_GAPDEFS = 1, then scanps -s hahu.seq -mode 0 -m nmd.mat -bdb swissprot will do a scan of the swissprot database with the penalty of 8 scanps -s hahu.seq -mode 2 -m md.mat -bdb swissprot would do an affine scan with the penalties of ld_pen 0.2, li_pen 12. scanps -s hahu.seq -mode 2 -m md.mat -bdb swissprot -ugd 0 -li_pen 12 -ld_pen 4 would do the affine scan with penalties of 12, 4 rather than 12, 0.2. WARNING: The scanps_gapdefs.dat file contains numbers that I guessed might be appropriate for the given matrix and mode. They are almost certainly not the optimum choices. Work is in progress to decide what the optimum values should be... Parallel processing ------------------- Parallel processing is currently only supported on Silicon Graphics. The code has been used on Challenge and Origin machines, but should also work on Octane. You must first set the number of processors to use for the scan. Do this with: setenv MPC_NUM_THREADS N where N = the number of processors to use with scanps. More processors mean that the scan will finish more quickly, but if the query is short, then the scan will be so quick anyway (on an R10,000) that there will be little advantage in running in parallel. For larger problems, e.g. a 1000 residue protein with affine gaps (mode 2) scan I have seen over 99% efficiency in speedup to 20 processors. If MPC_NUM_THREADS is not set, the system will default to a sensible number depending on the number of processors in your system. If you are running scanps on a heavily loaded server, you may find that scanps takes too much of the CPU time and so upsets other users. If this is a problem, then try setting the following environment variable: setenv MPC_GANG OFF or allocate fewer processors to your scanps job. Although the search phase of a scanps scan is run in parallel, loading of the database, sorting the results and generating the alignments are all done on a single processor. So, if you want to generate alignments for every sequence in the database (pretty pointless!!) there is no point in running in parallel. NOTE: The version of scanps compiled for parallel processing has a fixed limit on the length of query sequence. This cannot be overridden by the MAX_SEQ_LEN command. The serial (non-parallel) version does not have this limitation. The database sequence length is not affected by this limit. The program will exit if given a sequence B that is too long... Some examples ------------- 1. Scan hahu.seq (a FASTA format file) against the binary database OWL using simple penalty and save all hits to the database that give better than the default probability cutoff of 0.1. scanps -mode 0 -s hahu.seq -bdb owl -qseq_format 1 > hahu.out The output file contains the following header information: # SCANPS Scan: Mon Aug 11 15:21:37 1997 # SCANPS Version: 2.3 (Fri Jul 4 16:16:41 BST 1997) # Author: G. J. Barton, University of Oxford, UK # Query file: hahu.fasta # Query ID: HAHU # Query Title: Hemoglobin alpha chain - Human and chimpanzees # Query Length: 141 # SCANPS MODE: 0 # Matrix file: /gjb/delly/gjb/c/scanps/gjscanps/mat//PAM250 # Matrix title: NCBI Matrix - see file for type # LD penalty: 8.00 # MAX number of scores output: 10000 # MAX number of alignments output: 10000 # MIN score to output: 54.17 # MIN score of alignments to output: 54.17 # Database name: /gjb/ramsden/databases/blast/owl.fa # Created on: Thu Jul 24 14:15:30 1997 # Number of sequences: 191509 # Number of residues/bases in database: 60552547 # Score type: ln() scores # FIT: m: 0.683790 c: 34.727547 var: 5.720673 nval: 172359 # PROBCUT: 0.100000 # ------------------------------------------------------------------- B then the ranked score list - the end of which is shown here: Rank score Z E title 783 103 11.41 0.0427 owl|S64703|S64703 myoglobin - slug sea hare (fragment) 784 103 11.38 0.0444 owl|P02210|GLB_APLLI GLOBIN (MYOGLOBIN). - APLYSIA LIMACINA (SLUG SEA H 785 101 11.24 0.0531 owl|V00151|CHBGL2 CHBGL2 NID: g959 - goat. 786 102 11.17 0.0581 owl|U15293|KSU15293 KSU15293 NID: g563924 - midge. 787 101 11.01 0.0713 owl|P15447|GLB4_GLYDI GLOBIN, MONOMERIC COMPONENT M-IV then the alignments - the last is shown here: # Start_Alignments_Rank: 787 # Query: HAHU Target: owl|P15447|GLB4_GLYDI Number_of_Alignments: 1 # Target_Title: GLOBIN, MONOMERIC COMPONENT M-IV. - GLYCERA DIBRANCHIATA (BLOODWORM). # N: 1 Raw_Score: 102 Query_Length: 141 Target_Length: 147 ln()_Score: 101 **.*.. * ..* . *.. * * * . . *** . * * 2 LSPADKTNVKAAWGKV GAHAG EYGAEALERMFLS FPTTKTYFPHFDL 48 2 LSAAQRQVVASTWKDIAGSDNGAGVGKECFTK FLSAHHDIAAVFG FS 48 . . . * . * ** . **.*. *. * . *.. * . . . 49 SHGSAQVKGHGKKVADALTNAVAHV DD MPNALSALSDLH AHKLR V 93 49 GASDPGVADLGAKVLAQIGVAVSHLGDEGKMVAEMKAVGVRHKGYGYKHI 98 . * *. ** .. .. . *.* . * .* * * 94 DPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTS 138 99 KAEYFEPLGASLLSAMEHRIGGKMTAAAKDAWAAAYADISGALIS 143 # End_N: 1 # End_Alignments_Rank: 787 Stars above the alignment show identities (obviously), dots show pairs of amino acids that have positive scores in whatever pairscore matrix you are using. finally, some timing information is printed: # Scan Completed at: Mon Aug 11 15:21:43 1997 # # Times: (Seconds) # Load matrix, query and index: 0.0 0.0 0.0 0.0 # Load database: 1.8 1.8 1.8 1.8 # Scan database: 14.8 16.5 21.6 23.4 # Sort results : 0.8 17.4 1.7 25.1 # Output scores: 1.9 19.2 2.0 27.1 # Generate and write alignments: 6.1 25.3 6.1 33.2 # Millions of cell updates/sec in scan (CPU): 576.89 (Elapsed): 395.27 This scan was performed in parallel on a 24 processor Challenge with 1MByte cache R10,000 processors. The four columns of timing information the first two give actual and cumulative CPU times for the master thread, the second two give actual and cumulative elapsed times. Figures for millions of array operations per second are also shown. These are the product of the length of the query sequence and the number of amino acids in the database, divided by the database scan (or elapsed) times. Various aspects of the scan can be adjusted. See APPENDIX II for details of commands. Scanning Protein with DNA sequence ---------------------------------- Version 2.3 allows a DNA sequence to be compared to a protein database (MODE 22). Scanps first translates the DNA in three forward reading frames to create a pseudo sequence of amino acids where every base is an amino acid residue or STOP. This sequence is then compared to each protein sequence using a dynamic programming algorithm. The result is an alignment of the DNA with a protein that can include frameshifts. Additional adjustable parameters are: FS_PEN and STOP_WEIGHT. These are defined in the command summary APPENDIX II. Work is in progress to establish suitable values for these parameters. I have found that a large and negative value for STOP_WEIGHT is appropriate. FS_PEN should be set larger than LI_PEN and LD_PEN. Here is an example of the comparison of the primary transcript of human alpha haemoglobin versus the corresponding protein sequence. The DNA has been modified to introduce a single frameshift error, an additional G at position 356. Here is the command line: scanps -s hahunucerror.fasta -d hahu.fasta -db_format 1 -mode 22 \ -ugd 0 -m PAM250 -ld_pen 0.2 -li_pen 12 -fs_pen 24 -qseq_format 1 we are scanning a database of only one sequence (-d hahu.fasta), overriding the gap-penalty defaults for this MODE and matrix (-ugd 0) and supplying our own gap-penalties (-ld_pen, -li_pen, -fs_pen). The query sequence is in FASTA format, but we have set the default sequence format to 0 (PIR) in the scanps_defaults.dat file. Accordingly, we need to override the default on the command line (-qseq_format). Here is the output: # SCANPS Scan: Mon Aug 11 15:36:04 1997 # SCANPS Version: 2.3 (Fri Jul 4 16:16:41 BST 1997) # Author: G. J. Barton, University of Oxford, UK # Query file: hahunucerror.fasta # Query ID: HSHBA4 # Query Title: Dummy title inserted by gseq_fasta # Query Length: 835 # SCANPS MODE: 22 # Matrix file: /gjb/delly/gjb/c/scanps/gjscanps/mat//PAM250 # Matrix title: NCBI Matrix - see file for type # LD penalty: 0.20 # LI penalty: 12.00 # FS penalty: 24.00 # MAX number of scores output: 1 # MAX number of alignments output: 10000 # MIN score to output: 769.70 # MIN score of alignments to output: 769.70 # Number of residues/bases in database: 141 # Score type: ln() scores # Start_Scores: ---------------------------------------------------- 1 769 HAHU Hemoglobin alpha chain - Human and chimpanzees # End_Scores: ------------------------------------------------------ # ------------------------------------------------------------------- # Start_Alignments_Rank: 1 # Query: HSHBA4 Target: HAHU Number_of_Alignments: 1 # Target_Title: Hemoglobin alpha chain - Human and chimpanzees # N: 1 Raw_Score: 566 Query_Length: 835 Target_Length: 141 ln()_Score: 769 41 gctcggaaagaggtgagggcgggtggggcgatgtccttgcgttcgcacac ttcccaacataccggatgcacgaagcactagggctcgcacgccccgcagc ggttccgcccgccgtgccgctcgttggcgggacccccccgccgccgccga 190 41 VLSPADKTNVKAAWGKVGAHAGEYGAEALER!GSLPCSDPGSSPARTHRP 90 ******************************* 1 VLSPADKTNVKAAWGKVGAHAGEYGAEALER 31 191 ctagcgcgcacacctgtccaatcttcaaaattcctgcacgtgcgagcgaa cccttccacacccaccccggtttctcccacatcatatgagccatagagaa caccgcgcaccctctttccggcgcccccgcccgcccgccctcgtgcccgg 340 91 PSTVLAPDPNPTPHSASPRRMFLSFPTTKTYFPHFDLSHGSAQVKGHGKK 140 ****************************** 32 MFLSFPTTKTYFPHFDLSHGSAQVKGHGKK 61 341 ggggcGaagggcgggacagctgcagccgcaccggcgataGtgggcgctgt tcact cactcataatcactcctgatacaatgtactata gccggagcgc gcccg cccggcgccgccggccgccgcgcgtggcgcccg aggcggatgg 486 141 VADAL^TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK^!AAGRERSGS 188 ***** ********************************* 62 VADAL^TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFK^ 99 487 aggagcttcgagtcgtcggtcaccccgtggcacttgcccactccgacggc ggatccccagggcggtgataggggtgcgcctcttccattgagtttctcca gcgggtctgcaaacggggggcgggggcgcagccctagcacccgggcgccc 636 189 RGEMAPSSQGRGSRGLREV!RRRRLRAWAALTLFSAQLLSHCLLVTLAAH 238 ************* 100 LLSHCLLVTLAAH 112 637 ccggtacggcgtcgatcgtgaagcatatc tccatccctacctaattcctgcttccaag cccgcctggcccgcgcgttgccggccact 723 239 LPAEFTPAVHASLDKFLASVSTVLTSKYR 267 ***************************** 113 LPAEFTPAVHASLDKFLASVSTVLTSKYR 141 # End_N: 1 # End_Alignments_Rank: 1 # ------------------------------------------------------ # Scan Completed at: Mon Aug 11 15:36:04 1997 # # Times: (Seconds) # Load matrix, query and index: 0.0 0.0 0.0 0.0 # Load database: 0.0 0.0 0.0 0.0 # Scan database: 0.1 0.1 0.9 0.9 # Sort results : 0.0 0.1 0.0 0.9 # Output scores: 0.0 0.1 0.0 0.9 # Generate and write alignments: 0.1 0.2 0.1 1.0 # Millions of cell updates/sec in scan (CPU): 1.18 (Elapsed): 0.13 The output shows the base triplets arranged VERTICALLY above the corresponding amino acid residue. The protein sequence is shown conventionally below this. Frameshifts are indicated by a caret symbol "^" and the bases involved in the frameshift are shown in uppercase. You may see odd effects around frameshifted gaps in alignments. My advice is to look carefully at any frameshifted gaps to see if there might be alternative alignments by taking a different reading frame in the region, or shifting the frameshift gap a base or so to either side. With any dynamic programming method, there may be more than one, equally valid alignment in a region, but the program only reports one solution. There are more possibilities for such alternatives in DNA vs protein comparisons than for DNA v DNA or protein v protein. As well as MODE 22, there is MODE 20. MODE 20 does not have affine gap penalties and as a consequence it is about a factor fo 3 faster than MODE 22. Penalties for frameshifts are length dependent in MODES 20 and 22, thus if FS_PEN is 8, then a single frameshift costs 8 and a double costs 16. Creating SCANPS binary databases -------------------------------- Why use a SCANPS' binary database? Well, it takes around 25 seconds to parse and load the 192,000 sequence OWL database from a FASTA format file. In contrast, the SCANPS binary database loads in around a second. If you are going to do a lot of scans against a particular database and you have the disk space, then it is well worth building the binary database. Scanps has been optimised to read data fast from a local disk on a high speed bus (e.g. Fast Wide Differential SCSI). You may get much slower database loading speeds if you are reading across a network using NFS. Scanps is used to create its own databases for searching. By default the databases are pre-indexed with the index string from the default pairscore matrix. If you expect to use pairscore matrices that have varied index strings, then switch this option off with the APPLY_INDEX 0 command. This will increase the load time by about 0.5 second so scans will take a little longer and indexing will be done on the fly. For example: To build a binary database from the file swissprot.fa which is a fasta format sequence file and create a database called swiss. scanps -d swissprot.fa -bdb swiss -db_format 1 -mode 99 -apply_index 1 -verbose 1 -time 1 This will create the files swiss.bix and swiss.bsq. The .bix file contains information about the database + pointers to the ID, Title and SEQ portions of each entry in the .bsq file. The .bsq file is just the sequence database, but written in a consistent format. -verbose and -time are not essential, but they tell you what is going on as the database is created. On a Challenge R10,000 (195MHz) it takes about 25 seconds to create the OWL binary database. MODE 100 -------- Version 2.2.5 introduced MODE 100. This allows an output file from scanps to be fed back into scanps to extract sequences from the database. The advantage of this mode is apparent if a scan takes a very long time. It is often better to specify max_aout 0 to suppress alignment output, then re-run scanps on a selected set of sequences from the top hits. MODE 100 lets you do this. The EXTRACT_SEQS N command specifies you want the top N sequences to be saved to a file (default is scanps_extract.seq). In MODE 100 you give the previous output file as the standard input and use EXTRACT_SEQS to specifiy how many sequences you want. e.g. assuming we have run scanps once and produced an output file called "scanps.scan" as a result of a scan of owl. Then scanps -MODE 100 -EXTRACT_SEQS 20 -bdb owl < scanps.scan would save the top 20 sequences from the results in scanps.scan to the scanps_extract.seq file. You can specify a different file to save to by using the EXTRACT_FILE command (see appendix). Installing scanps ----------------- 1. Obtain the gzipped tar file scanps.tar.gz. 2. uncompress and un tar the distribution: gunzip scanps.tar.gz tar xvf scanps.tar 3. You should now have a directory tree that looks like this - there may be extra binary directories in your distribution. e.g. sgi_5.3. scanps/bin/sgi_6.2/README binaries - see this README FILE. scanps/bin/sgi_6.2/mscanps_16K multiprocessor versions of scanps. scanps/bin/sgi_6.2/mscanps_2M scanps/bin/sgi_6.2/mscanps_64K scanps/bin/sgi_6.2/uscanps uniprocessor version of scanps. scanps/dat/codon.dat data files. scanps/dat/scanps_alias.dat scanps/dat/scanps_defaults.dat scanps/dat/scanps_gapdefs.dat scanps/doc/scanps.doc this document. scanps/examples/hahu.fasta example query. scanps/examples/hahu.log example output. scanps/examples/mode22.log example frame shift output. scanps/examples/protein.seq small database to try out scans. scanps/mat/BLOSUM62 pairscore matrices. scanps/mat/PAM250 4. Choose a directory for the scanps_defaults.dat, scanps_alias.dat and scanps_gapdefs.dat master files. Modify scanps_defaults.dat to include the DB_DIR, MATRIX_DIR, GENERAL_DIR and appropriate file names for your system. Since you will almost certainly have NCBI's BLAST installed on your system, I recommend pointing MATRIX_DIR at their matrix directory. scanps can read NCBI format matrix files and they provide a good range of alternatives. 4. Obtain your scanps license from http://barton.ebi.ac.uk/. Install this in the file scanps.lic in the directory LICENSE_DIR defined in scanps_defaults.dat. 5. setenv SCANPSDIR to the directory containing the defaults and alias files. 6. Optionally, in the chosen database directory build the binary databases for the sequence libraries you are interested in. The fasta format files, if any, need not be in the same directory, but DB_DIR must point to the right place, or be set to ./ 7. Ensure each user has the SCANPSDIR and MPC_NUM_THREADS environment variables set as appropriate. 5. Test run the program, perhaps with the little protein.seq database. This database is actually the entire PIR 14 database from 1988. In the bin directory there may be additional subdirectories with binaries for different computer types. Please see the readme files in those directories for further details. Comment ------- Though Unix hackers will like it, I don't expect the scanps interface to be seen by many. This program is meant to hide behind a WWW form or somesuch. GJB 11/8/1997 APPENDIX I - Revision notes --------------------------- 2.1 First full parallel version with searching for protein vs protein database. Includes code to build and use binary database for fast loading. Includes code for sorting the results. ln() scores. alignment output options. All within the same program. 2.2 Fix get_fasta.c so that it will read a FASTA file that is missing a title. Fix DB_DIR to permit missing / at end of directory name. Add MATRIX_DIR command. 2.2.1 Add option to read NCBI format matrix files. MATRIX_TYPE command. 2.2.2 Various odds and ends. Added EXTRACT options to enable the sequence of high scoring database hits to be fished out of the database. 2.2.3 Add routines to compare DNA to protein, with frameshifts. Fix mysterious looking bug when generating alignments with affine gaps. Tidy up timing routines. Add option to print full length titles. First released version? Nope. 2.2.4 Small changes not worth mentioning 2.2.5 Add MODE 20 for fast frameshifting DNA vs Protein comparisons. Add HQUERY option at complile time to allow VERY big query sequences (e.g. 2 Megabases). Add COMPLEMENT_QUERY option to allow scanning with complement of the query. Modify MODE 22 code to eliminate FSE_PEN and replace with simple length-dependent penalty for frameshift gaps. Add MODE 100 to allow sequences to be extracted from the database following a scan that produced no alignments. 2.2.6 Add statistical estimates based on extreme value distribution. This is based on the statistics used in the programs FASTA and SSEARCH3 though the implementation is different. No statistics in alignment output as yet, just the score list. 2.3 Small bug fixes. Add the licensing routines. Tidy up the distribution. 2.3.1 Small changes to the way in which the probcut and max_nout options interact. The program now allows max_nout to control the number of sequences output when probcut >1. Bug removed for probcut ==1 case. Plans for additions - in no special order ---------------------------------------- Option to read multiple alignment files as query. Option to read "profiles" as query. Include alternative significance statistics in output. Extend range of methods to include variable gap-penalties. Add parallel pairwise comparison option. Add option to read and apply HMMs. Option to scan a database of alignments. Allow comparison of protein sequence to DNA database. APPENDIX II - Alphabetical list of scanps commands ----------------------------------------------- Many of the commands listed here are left over from scanps version 1.0 (which did absolutely everything). Some of the features of version 1.0 that are not implemented in version 2.x will be, but not just yet! Note: These are the defining names of the commands. You should check the scanps_alias.dat file for alternative names for the same command. APPLY_INDEX Used when creating a binary database file (mode 99) APPLY_INDEX 1 will convert all amino acid characters to a number corresponding to the position that character is in the MATRIX_FILE index string. Typically, A=0, R=1 and so on. Setting APPLY_INDEX 0 prevents this indexing happening. This can be useful if you use a lot of different pair score matrices with different index strings. The sacrifice is a small reduction in scan speed. AUTO_CORNER Not implemented. BDB_ROOT The root name of the binary database files you are using. Binary databases have two files, something.bix and something.bsq. BDB_ROOT specifies the "something" part of the name. This is used in MODE 99 to name the database and other modes when scanning. CODON_FILE Defines the name of the file that contains the codon table to use when translating DNA to protein. The file has the format: AAA K AAC N AAG K AAT N ACA T etc The file MUST be sorted into alphabetical order. COMPLEMENT_QUERY If set to 1 then the DNA query sequence is complemented and that sequence is scanned against the database. CUT_CONSTANT Not implemented. CUT_CORNERS Not implemented. DB_DIR Defines the directory in which the BINARY database files will reside. Note that this has no effect on files defined with the DB_FILE command. DB_FILE Specify name of a FASTA or PIR format sequence file to scan against, or (in mode 99) to build into a scanps binary database file. DB_FORMAT Specify the format of the database file to scan. Either 0 for PIR format or 1 for FASTA format. DO_SORT Set to 1 to sort the results list from highest score to lowest. Set to 0 for no sorting. EPB_PEN Not implemented EPL_PEN Not implemented EPR_PEN Not implemented EPT_PEN Not implemented DO_STATS Tell the program to do some statistics on the score distribution of a scan. This is normally desirable. DUMP_STATS Dump out a lot of information about the statistics done on the score distribution. This is just for debugging. EXTRACT_FILE Name of file to save full length sequences to that are at the top of the results list. Default name is scanps_extract.seq. EXTRACT_SEQ Number of sequences to save from results list. EXTRACT_SEQ_FORMAT Format of extracted sequence file. 0 = PIR, 1 = FASTA. FIND_REPEATS Not implemented. FIT_FILE Not implemented. FSE_PEN Frame shift extension penalty. The penalty to be applied to extending an exising gap that is a frameshift. Only applicable in MODE 22. DNA vs Protein. From Version 2.2.5 NO LONGER IMPLEMENTED. FS_PEN Penalty for creating a frame shift gap. GAP_CHARACTER The character to be used to represent a gap in alignment output. GENERAL_DIR The name of a directory for various scanps files. e.g. the codon.dat file. LD_PEN Length dependent gap-penalty. In MODE 0 this is the only penalty. In MODE 2 this is the penalty for extending an existing gap. In MODE 22 it is the penalty for extending an existing in-frame gap. LICENSE_DIR Directory that contains the scanps.lic file. LI_PEN Length independent gap-penalty. As for LD_PEN, but this is the gap-creation penalty. LOG_SCORES Calculate ln() scores for each sequence comparison and use these to rank the results. If S is the raw score for an alignment of a query and database sequence, of lengths M and N then the ln() score is given by ln() = S*(log(M)/log(N)). This transformation, suggested by W. Pearson, is usually beneficial in database scans. It tries to correct for the tendency for long database sequences to give higher scores with a typical query. MATRIX_DIR Directory that contains the pair score matrix files. e.g. BLOSUM62, PAM250 etc. MATRIX_FILE The name of the pairscore matrix file. MATRIX_FORMAT The format of the pairscore file. 0 for PIR 1 for NCBI. MAX_AOUT Max number of alignments to report. MAX_BLOC_SEQ Not implemented MAX_ID_LEN Max length for a sequence identifier. MAX_NOUT Max number of scores to output. MAX_NSEQ Max number of sequences allowed. MAX_SEQ_LEN Max length allowed for a sequence (in bases or residues). Note: for multiprocessor versions of scanps, this value only affects the database sequence lengths. The query sequence limit is fixed. MAX_TITLE_LEN Max lenth of a sequence title. MIN_ASCORE Min score necessary before an alignment will be output. MIN_LEN Not implemented. MIN_NSTATS Minimum database size required before statistics are estimated. Setting this too small will produce garbage probabilities. MIN_SCORE Minimum score required for a result to be reported. MODE Defines what sort of analysis will be performed. MODE 0 protein vs protein with simple length-dependent gap penalty. This is the fastest and oldest code in the program. For most database scans, this is the best method. The code that implements this option has been heavily used since 1989. MODE 2 protein vs protein with affine penalties. In the current program, this is about a factor of 3 slower than MODE 0. This is newer code and I think it could probably be persuaded to go faster. MODE 20 DNA vs protein with simple gap penalty for in-frame and frameshift gaps. MODE 22 DNA vs protein with affine penalties for in-frame and simple penalties for frameshift gaps. It appears to work OK, but has not been extensively tested. NRANS Not implemented. OSEQ_FILE Not implemented. OUTPUT_LENGTH Number of residues to output per line in alignments. PCUT Not implemented. PRECISION Since, most of the arithmetic in scanps is addition and subtraction, it is most efficient to do all calculations with integers. The PRECISION value defines a factor by which all numbers read into the program (e.g. pairscore matrix, gap-penalties) are multiplied. Typically this is set to 100. PRED_FILE Not implemented. PRINT_ALIGN Not implemented. PROBCUT Probability cutoff. Active when DO_STATS is on. QBLC_FILE Not implemented. QBLC_FORMAT Not implemented. QSEQ_FILE Name of the file that contains the query sequence. QSEQ_FORMAT Format of the query sequence file. 0 for PIR, 1 for FASTA. READ_PRED Not implemented. ROBSIM Not implemented. RUN_SW_MIN Not implemented. SAVE_PROFILES Not implemented. SCAN Not implemented. SEC_FILE Not implemented. SHOW_BLURB Set to 1 to print various useful information in the results file. You might want to set this to 0 to simplify parsing of a list of score, id pairs in the output file. Probably only useful for compatibility with sortsco, the program that came with scanps version 1.0. SHOW_PMATRIX Print the pairscore matrix in the information in the output file. Requires SHOW_BLURB to be set to 1. SHOW_TITLES Set to a number > 0 to print titles of database sequences alongside the scores and ID's in the sorted results list. For example, SHOW_TITLES 50 will print the first 50 characters of each title line. STDERR Define a different file for error messages. STDIN Define a different file for standard input (e.g. the database). STDOUT Define a different file for output. Usually I use pipes. STOP_WEIGHT Specify the score for aligning a DNA stop codon with an amino acid. Large and negative is a good idea! TIME Set to 1 to print CPU times at various stages of the program. Set to > 1 to print even more times! Pretty much obsolete now, as the program prints times summary at the end anyway. TOP_CUTFRAC Fraction of top scores removed before statistics are calcualted. TOP_ONLY Scanps can produce the top and alternative local alignments between the query and the database. This can produce verbose output, so setting TOP_ONLY to 1 tells the program only to report the top scoring alignment for each pair of sequences. USE_GAPDEFS If =0 then scanps takes the default gap penalties from scanps_defaults.dat, or the command line. If = 1, then the scanps_gapdefs.dat file is consulted for a match in MODE and matrix. The penalties are then defined by whatever the scanps_gapdefs.dat file shows. If the MATRIX is not present in the scanps_gapdefs file, then the default penalties are used. VERBOSE Set to 1 to get lots of useful messages about the progress of the program. Set to > 1 to get even more messages. This can be helpful to check (a) that the query sequence you think you are scanning is the one the program has swallowed. Or to find out at what stage the program has crashed. Of course, that should never happen.... VINGRON_FILE Not implemented End of scanps.doc---------------------------------------------