Guidelines for Database Scanning



Which is the best method for database scanning? Sadly, there is no straightforward answer to this question. Attempts have been made to compare methods, but the process is complicated by the difficulty of designing suitable test cases and by the number of adjustable parameters. The most effective way to assess a scanning technique is to test its ability to find all the members of a known protein family in the database of all known sequences (e.g. see [68][67]). The principle is simple:

Record the identifier codes of all proteins known to be in the family.
Select a member to scan with (the query).
Perform the scan using the method of choice.
Count how many of the known members are found with higher scores than known non-members.
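The four steps above can be sketched in Python. This is a minimal illustration, not the procedure used in the cited studies: the function name, the `(identifier, score)` hit format, and the identifiers in the example are all hypothetical, and the scan itself is assumed to have been run already by whatever method is being assessed.

```python
def members_before_first_nonmember(hits, family):
    """Count family members that score strictly higher than the best non-member.

    hits:   list of (identifier, score) pairs returned by one scan
            (hypothetical format; higher score means better match).
    family: set of identifiers known to belong to the family.
    """
    nonmember_scores = [score for ident, score in hits if ident not in family]
    if not nonmember_scores:
        # No non-member was reported, so every reported member counts.
        return sum(1 for ident, _ in hits if ident in family)
    best_nonmember = max(nonmember_scores)
    return sum(
        1 for ident, score in hits
        if ident in family and score > best_nonmember
    )


# Made-up example: four known members, one of which (P4) is never reported.
family = {"P1", "P2", "P3", "P4"}
hits = [("P1", 310.0), ("P2", 120.5), ("X9", 95.0), ("P3", 90.0), ("X7", 40.0)]
found = members_before_first_nonmember(hits, family)
```

Here P1 and P2 are counted, P3 falls below the non-member X9, and P4 is a false negative. Whether the query itself should be excluded from the count is left to the caller.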

A less strict criterion is to count the number of members that score as high as the top 0.5% of the non-members in the databank [68]. The best scanning method will rank the most members ahead of non-members, i.e. it will have the fewest false positives. Of course, evaluation is not as simple as this appears. First, one must choose well-characterised protein families with which to test. Do we really know all the members? A high-scoring non-member may in fact be a previously undiscovered family member. Further difficulties arise for scans where there are many false negatives. If two methods both miss 30 known members, are they missing the same 30?

Ideally, evaluation should also explore alternative parameter combinations, but this greatly increases the number of tests that need to be done and complicates the data analysis. For example, scanning with dynamic programming involves a choice of pair-score matrix, gap penalty, and local or global alignment. The best gap penalty depends on the matrix in use. If both length-dependent and length-independent penalties are used, then the number of alternative combinations increases dramatically. The best combination of matrix and penalty may not be appropriate for other algorithms. BLAST does not consider gaps, so the situation is a little easier, and this feature was exploited by Henikoff and Henikoff to evaluate different substitution matrices [16]. However, we still have the choice of other parameters special to the BLAST algorithm.
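The less strict criterion, and the question of whether two methods miss the same members, can be sketched as follows. The reading of "top 0.5% of the non-members" used here is an assumption: the cutoff is taken as the score of the lowest-ranked non-member inside the top 0.5% of non-member scores. Function names, the hit format, and all identifiers and scores are hypothetical.

```python
import math


def members_within_top_nonmembers(hits, family, fraction=0.005):
    """Count members scoring at least as high as the top `fraction` of non-members.

    Assumed interpretation: sort non-member scores from high to low and use
    the score at rank ceil(fraction * n) as the cutoff.
    """
    nonmember_scores = sorted(
        (score for ident, score in hits if ident not in family), reverse=True
    )
    if not nonmember_scores:
        return sum(1 for ident, _ in hits if ident in family)
    k = max(1, math.ceil(fraction * len(nonmember_scores)))
    cutoff = nonmember_scores[k - 1]
    return sum(1 for ident, score in hits if ident in family and score >= cutoff)


def compare_misses(family, found_a, found_b):
    """Split the false negatives of two methods into shared and method-specific sets."""
    missed_a = family - found_a
    missed_b = family - found_b
    return {
        "both": missed_a & missed_b,
        "only_a": missed_a - missed_b,
        "only_b": missed_b - missed_a,
    }


# Made-up data: 400 non-members scoring 1.0 .. 400.0, plus four family members.
family = {"P1", "P2", "P3", "P4"}
hits = [(f"X{i}", float(i)) for i in range(1, 401)]
hits += [("P1", 500.0), ("P2", 399.5), ("P3", 399.0), ("P4", 100.0)]
relaxed = members_within_top_nonmembers(hits, family)

misses = compare_misses(family, found_a={"P1", "P2"}, found_b={"P1", "P3"})
```

With 400 non-members, the top 0.5% is the two highest-scoring ones, so the cutoff is 399.0 and three members pass, whereas the strict criterion would count only P1. In the second example both methods miss two members, but only P4 is missed by both.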

When given a newly determined sequence, a search with BLAST or FASTA will quickly tell you if a close homologue exists. Although a scan with full dynamic programming takes longer on a local workstation, the turn-round time from email servers such as BLITZ is similar to that of BLAST searches at NCBI. Accordingly, it is worth scanning using one of these services as well. If no similar sequences are found, then alternative PAM matrices should be tried. Start with PAM120, then try PAM250, and in each case vary the gap penalty around the minimum value of the matrix. For PAM250 this is 8, so values of 7-10 are worth trying. Care should always be taken to consider the likely significance of an apparent match; the methods for predicting the accuracy of alignment that are discussed in Section 4.1 are relevant here.
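The matrix and gap-penalty sweep suggested above can be written down as a short list of scans to run. This is only a sketch: the function names are hypothetical, and the window of one below to two above the centre value is an assumption chosen to reproduce the 7-10 range quoted for PAM250. The centre value for PAM120 is not given in the text, so it is left for the caller to supply.

```python
def gap_penalties_around(centre, below=1, above=2):
    """Return integer gap penalties in a small window around `centre`.

    The default window (assumed) turns a centre of 8 into 7, 8, 9, 10.
    """
    return list(range(centre - below, centre + above + 1))


def scan_plan(matrix_centres):
    """List (matrix, gap_penalty) pairs to try, in the order the matrices are given.

    matrix_centres: dict mapping a matrix name to the gap penalty to centre on.
    """
    return [
        (matrix, gap)
        for matrix, centre in matrix_centres.items()
        for gap in gap_penalties_around(centre)
    ]


# Only PAM250 is listed because its centre value (8) is stated in the text;
# add an entry such as "PAM120": <centre> once that value has been looked up.
plan = scan_plan({"PAM250": 8})
```

Each pair in `plan` corresponds to one full scan, so the list also makes the cost of exploring parameters explicit before any job is submitted.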


geoff.barton@ox.ac.uk