
Examine the hit list

The top of the sorted output file shows the highest scoring hits to the query sequence. You must now inspect this file and decide how many of the top scoring sequences you would like to examine by alignment. If you use the default parameters (PAM250 and penalty of 8) then scores below 90 are often uninteresting. However, this is not an absolute rule, and each scan requires careful scrutiny of the score list. It is usually better to include a lot of sequences at this stage, since ``interesting'' matches may emerge even at low scores.

There are no programs supplied with scanps to help you look at the ``.sorted'' file. Use the Unix tools ``more'' or ``head'' to inspect and extract the interesting parts of the file, or use your favourite text editor (vi, emacs, jot, pico etc).
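
Beyond paging through the file, a standard tool such as awk can apply a score cutoff for you. The following is only a sketch: ``score_at_least'' is a hypothetical helper, not part of scanps, and it assumes that each line of the sorted file holds a score followed by an identifier, as in the listing shown later in this section. The value 90 is the rule of thumb for the default parameters, not a fixed threshold.

```shell
# Hypothetical helper (not part of scanps): print the lines of a score
# file whose score (first column) is at least the given cutoff.
score_at_least() {
    # $1 = cutoff score
    # $2 = score file, assumed to hold "score identifier" on each line
    awk -v cutoff="$1" '($1 + 0) >= (cutoff + 0)' "$2"
}

# Example use:
#   score_at_least 90 sh2.sorted | wc -l        # count hits scoring 90 or more
#   score_at_least 90 sh2.sorted > sh2.above90  # save them (file name is arbitrary)
```

Counting the hits above a few different cutoffs is a quick way to judge how many sequences each choice would commit you to aligning.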

To keep this guide to a manageable length, I will illustrate the following sections using only the top 15 scoring sequences. In practice, the top 150 or so in this scan would be worth looking at. To get the top 15 sequence scores into a file you could type:



head -n 15 < sh2.sorted > sh2.top15

This saves the top 15 score/ID pairs in a file called sh2.top15. Here it is:



497 A43610
497 TVHUSC
492 TVCHS
492 TVFV60
492 TVFVPR
492 TVFVS2
490 TVFVS1
488 TVFVMT
474 B34104
473 A34104
458 OKFVYR
458 S15582
458 S20808
456 TVFVR
443 S20676

Unless you know the identifier codes, this list is not very informative. If you have built an indexed database, it is easy to get the titles of these sequences back using the program ``sortsco'' (see Section 6.3.3). For now, we can extract the sequences that correspond to these protein identifiers using the program ``select''.
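
A note on the fixed cut at 15 lines: the listing contains several groups of equal scores (for example four sequences at 492), so a cut at a fixed line count can fall in the middle of such a group. The sketch below keeps the first N lines plus any following lines that share the score of line N. ``top_with_ties'' is a hypothetical helper, not part of scanps, and it assumes the file is already sorted with the highest score first and the score in the first column.

```shell
# Hypothetical helper (not part of scanps): print the first N lines of a
# sorted score file, then any further lines tied with the score of line N.
top_with_ties() {
    # $1 = number of lines wanted
    # $2 = score file, assumed sorted by descending score
    awk -v n="$1" '
        NR <= n    { print; last = $1; next }  # always keep the first n lines
        $1 == last { print; next }             # keep lines tied with line n
                   { exit }                    # first lower score: stop reading
    ' "$2"
}

# Example use:
#   top_with_ties 15 sh2.sorted > sh2.top15
```

If there are no ties at the cut, the result is the same as that of ``head''.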


gjb@bioch.ox.ac.uk