(For the sake of brevity, the following example stores only the top 20 scores.)

    ENTER FILE TO PROCESS: bash_ge_4_scan1.out/S=20
    I WILL REPORT THE TOP 20 SCORES
    ENTER FILE FOR SORTED OUTPUT [.sorted]: bash_ge_4_scan1.top20
    1=MATCH, 2=NAS, 3=RMEAN, 4=RSD, 5=SCORE
    ENTER CHOICE :1                              (Always answer 1)
    READING DATA FILE
    6721 DATASETS READ IN
    CALCULATING SCORE DISTRIBUTION USING ITEM 1
    STATISTICS COMPLETE
    NUMBER OF POINTS 6721
    MEAN 40.8484
    SD 26.2812
    SKEW 2.19682
    KURTOSIS 6.44562
    CALCULATING SIGSCORES
    PERFORMING SORT
    DATA SORTED
    WRITING SORTED FILE
    SORTED FILE WRITTEN
    Generate prolog clauses? [N]: <return>

The /S= option calculates various statistics on the scores obtained and expresses each reported score in S.D. units from the mean. For example, part of the file bash_ge_4_scan1.top20 looks like this:

    >P1;HZPG : Hemoglobin zeta chain - Pig
          46     141  153.14    3.33    0.00    0.00    4.27
           A       B       C       D       E       F       G

The numbers shown after each protein identifier and title line refer to the following:

    A  The number of elements in the pattern (46).
    B  The length of the sequence being scanned (141).
    C  The score for the best alignment of the pattern and the sequence (153.14).
    D  The score divided by the number of pattern elements (3.33).
    E  The mean (if randomisations are performed).
    F  The standard deviation (if randomisations are performed).
    G  The distance of the score (C) from the mean of the distribution, in
       standard deviation units (i.e. (153.14 - 40.84)/26.28).
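As a worked check of columns D and G, the arithmetic can be reproduced with a short Python sketch. The function names here are illustrative only and are not part of the program; the mean and S.D. are the values reported in the session above.

```python
def score_per_element(score, n_elements):
    """Column D: alignment score divided by the number of pattern elements."""
    return score / n_elements


def sd_units(score, mean, sd):
    """Column G: distance of the score from the mean, in S.D. units."""
    return (score - mean) / sd


# Values taken from the HZPG example line and the reported statistics.
d = score_per_element(153.14, 46)
g = sd_units(153.14, 40.8484, 26.2812)

print(f"D = {d:.2f}")  # D = 3.33
print(f"G = {g:.2f}")  # G = 4.27
```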

If randomisations are performed using the DATABASE,N option, where N is the number of randomisations to be performed, then E, F and G refer to the mean and S.D. of the comparisons to randomised sequences.
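The idea behind a randomisation-based mean and S.D. can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes that randomising means shuffling the residues of the sequence, and it uses a hypothetical toy_score function as a stand-in for the real pattern-to-sequence alignment score. Neither the function names nor the shuffling method are taken from the program itself.

```python
import random
import statistics


def randomisation_z(score_fn, sequence, n_random, seed=0):
    """Return (observed, mean, sd, z) using n_random shuffled copies.

    Assumption: each randomised sequence is a shuffle of the original,
    so it keeps the same length and composition.
    """
    rng = random.Random(seed)
    observed = score_fn(sequence)
    random_scores = []
    for _ in range(n_random):
        residues = list(sequence)
        rng.shuffle(residues)
        random_scores.append(score_fn("".join(residues)))
    mean = statistics.mean(random_scores)    # analogous to column E
    sd = statistics.pstdev(random_scores)    # analogous to column F
    # Analogous to column G; undefined if all random scores are equal.
    z = (observed - mean) / sd if sd > 0 else float("nan")
    return observed, mean, sd, z


def toy_score(seq):
    """Hypothetical stand-in score: count of adjacent identical residues."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a == b)


observed, mean, sd, z = randomisation_z(toy_score, "AAAACCCCGGGG", 100)
print(observed, round(mean, 2), round(sd, 2), round(z, 2))
```

With a real scoring function in place of toy_score, a larger N gives more stable estimates of the mean and S.D. at the cost of proportionally more comparisons.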

Although S.D. values are shown, these are not normally very useful. It is generally better to look at the distribution of scores visually using the /hist option described below.

gjb@bioch.ox.ac.uk