To sort the output and store only the top N scoring results

(For brevity, the following example stores only the top 20 scores, selected with /S=20.)


ENTER FILE TO PROCESS: bash_ge_4_scan1.out/S=20
I WILL REPORT THE TOP  20 SCORES

ENTER FILE FOR SORTED OUTPUT [.sorted]: bash_ge_4_scan1.top20
1=MATCH, 2=NAS, 3=RMEAN, 4=RSD, 5=SCORE


ENTER CHOICE  :1       (Always answer 1)

READING DATA FILE
  6721 DATASETS READ IN
CALCULATING SCORE DISTRIBUTION
USING ITEM 1
 STATISTICS COMPLETE
NUMBER OF POINTS  6721
MEAN                40.8484
SD                  26.2812
SKEW                2.19682
KURTOSIS            6.44562
CALCULATING SIGSCORES
PERFORMING SORT
DATA SORTED
WRITING SORTED FILE
SORTED FILE WRITTEN

Generate prolog clauses? [N]: <return>

The /S= option calculates various statistics on the scores obtained, and expresses each reported score in S.D. units from the mean. For example, part of the file bash_ge_4_scan1.top20 looks like this:



 >P1;HZPG            : Hemoglobin zeta chain - Pig
    46  141     153.14      3.33      0.00      0.00      4.27
    A    B        C          D         E         F          G

The numbers shown after each protein identifier and title line refer to the following:



A	Number of elements in the pattern (46).
B	Length of the sequence being scanned (141).
C	Score for the best alignment of the pattern and the sequence (153.14).
D	Score divided by the number of pattern elements (153.14/46 = 3.33).
E	Mean (if randomisations are performed).
F	Standard deviation (if randomisations are performed).
G	Distance of the score (C) from the mean of the distribution, in
	    standard deviation units (i.e. (153.14 - 40.85)/26.28 = 4.27).
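
The arithmetic behind columns D and G can be checked with a short script. The following is only a minimal sketch in Python, not part of the program itself: the names ScanRecord, parse_record and sd_units are hypothetical, and the code assumes that the numeric line always holds seven whitespace-separated values in the order A to G. The mean and S.D. are copied from the statistics printed in the run above.

```python
from typing import NamedTuple


class ScanRecord(NamedTuple):
    """Hypothetical container for the seven columns A-G of a numeric line."""
    n_elements: int           # A: number of elements in the pattern
    seq_length: int           # B: length of the sequence scanned
    score: float              # C: score for the best alignment
    score_per_element: float  # D: C divided by A
    rand_mean: float          # E: mean (only if randomisations were run)
    rand_sd: float            # F: S.D. (only if randomisations were run)
    sd_units: float           # G: distance of C from the mean, in S.D. units


def parse_record(line: str) -> ScanRecord:
    """Split one numeric line into its seven fields (assumed layout)."""
    a, b, c, d, e, f, g = line.split()
    return ScanRecord(int(a), int(b), float(c), float(d),
                      float(e), float(f), float(g))


def sd_units(score: float, mean: float, sd: float) -> float:
    """Distance of a score from the mean, expressed in S.D. units."""
    return (score - mean) / sd


# Numeric line for P1;HZPG, copied from the example above.
rec = parse_record("    46  141     153.14      3.33      0.00      0.00      4.27")

# MEAN and SD as reported in the program output for this run.
MEAN, SD = 40.8484, 26.2812

per_element = rec.score / rec.n_elements  # should reproduce column D
z = sd_units(rec.score, MEAN, SD)         # should reproduce column G

print(f"D = {per_element:.2f}, G = {z:.2f}")
```

Both recomputed values agree with the reported columns D and G to two decimal places.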

If randomisations are performed using the DATABASE,N option, where N is the number of randomisations to be performed, then E, F and G refer to the mean and S.D. of the comparisons to randomised sequences.

Although S.D. values are shown, these are not normally very useful. It is generally better to look at the distribution of scores visually using the /hist option described below.

gjb@bioch.ox.ac.uk