The completed H matrix with the paths for the top 15 alignments highlighted.
time scanps -shahu.seq -a1 -n -c35 < /data/pir/protein.seq > ! hahu.a1.n
6858 Sequences
Grand Total of Paths Considered: 30690
1004.420u 4.570s 17:22.27 96.8
... for alignments without gaps, the Karlin-Altschul statistics provide an estimate of the probability of finding an alignment of a given score. These statistics underpin the BLAST algorithm and permit its speed. However, no such statistic currently exists for gapped alignments; accordingly, an estimate of the significance of an alignment of given score and length has been obtained.
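For reference, the Karlin-Altschul result for ungapped local alignments is usually written as below. The symbols are not used elsewhere in this text and are introduced only for illustration: m and n are the lengths of the two sequences, S is the alignment score, E is the expected number of chance alignments scoring at least S, and K and \lambda are constants that depend on the scoring scheme and the residue frequencies.

```latex
E = K\,m\,n\,e^{-\lambda S}, \qquad P(\text{score} \ge S) \approx 1 - e^{-E}
```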
When two protein sequences are aligned, residues in the two sequences are brought into equivalence. This equivalencing implies that the residues perform similar roles in the native folded protein. Accordingly, the estimate of significance is based on the likelihood that the two aligned segments will fold in the same way. A study of XX protein chains for which the X-ray structure has been determined was performed. All pairs of chains were compared using the local-similarity algorithm, and statistics were gathered on the score, length and accuracy of the alignments. The accuracy was assessed simply as the number of secondary structure elements that are correctly equivalenced. Sander and Z performed a similar study (XXX) and showed that the value of percentage identity is strongly length dependent. In my study the alignment score was considered instead, and the distribution of scores with respect to length was used to estimate the probability of finding an alignment of any given score and length in which the aligned segments show no structural similarity.
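The text does not say how the score distribution was converted into a probability, so the following is only a minimal sketch of one possible approach: group the structurally unrelated alignments by length and take the fraction whose score reaches a given threshold. The function names, the bin width of 10 residues and the tuple layout (length, score, structurally_similar) are all assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def build_null_table(observations, bin_width=10):
    """Group scores of structurally UNRELATED alignments by length bin.

    observations: iterable of (length, score, structurally_similar) tuples.
    This layout is hypothetical, not the format used in the study.
    """
    table = defaultdict(list)
    for length, score, similar in observations:
        if not similar:
            table[length // bin_width].append(score)
    for scores in table.values():
        scores.sort()  # sorted so that the tail can be counted by bisection
    return dict(table)


def chance_probability(table, length, score, bin_width=10):
    """Fraction of unrelated alignments in the same length bin scoring >= score.

    Returns None when no unrelated alignment of that length was observed.
    """
    scores = table.get(length // bin_width)
    if not scores:
        return None
    n_at_least = len(scores) - bisect_left(scores, score)
    return n_at_least / len(scores)
```

A table built this way is only as reliable as the number of observations in each length bin, so sparsely populated bins would need wider bins or smoothing.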
The estimate is based upon the comparison of protein sequences for
An implementation of the Smith-Waterman algorithm based upon these ideas may be used to screen the protein sequence databank with a newly sequenced protein. The result of such a scan will highlight which proteins have the greatest similarity in some local region to the query, but it will not show which regions of the proteins are similar, nor whether multiple regions could be equivalenced.
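A score-only scan of this kind can be sketched as follows. This is not the scanps implementation: it assumes a simple match/mismatch score and a linear gap penalty in place of a residue substitution matrix, and the function names and parameter values are chosen only for illustration. Because only the best score is kept for each database entry, the output ranks the proteins but says nothing about where the similar regions lie.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score between strings a and b.

    Only two rows of the H matrix are held in memory, so the aligned
    region itself cannot be recovered from the result.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                # start a new local alignment
                prev[j - 1] + s,  # extend along the diagonal
                prev[j] + gap,    # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def scan(query, database, top=3):
    """Rank database entries (name -> sequence) by best local score."""
    scored = [(smith_waterman_score(query, seq), name)
              for name, seq in database.items()]
    scored.sort(key=lambda t: -t[0])  # highest score first; ties keep input order
    return scored[:top]
```

For example, `scan("XXHELLOYY", {"p1": "ZZHELLOWW", "p2": "CCCC"})` ranks `p1` above `p2`, since only `p1` shares a local region with the query.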