Given two sequences $A$ and $B$ of length $m$ and $n$, all possible
overlapping segments having a particular length (sometimes called a
`window length') from $A$ are compared to all segments of $B$. This
requires of the order of $m \times n$ comparisons to be made. For
each pair of segments the amino acid pair scores are accumulated over
the length of the segment. For example, consider the comparison of two
7-residue segments, ALGAWDE and ALATWDE, using identity scoring. The
total score for this pair would be 1 + 1 + 0 + 0 + 1 + 1 + 1 = 5.
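The segment comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `segment_score` and `compare_all_segments` are hypothetical, and only identity scoring (1 for a match, 0 for a mismatch) is shown, whereas in practice any amino acid pair scoring scheme could be substituted.

```python
def segment_score(seg_a: str, seg_b: str) -> int:
    """Identity score: 1 for each matching position, 0 otherwise."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same length")
    return sum(1 for x, y in zip(seg_a, seg_b) if x == y)


def compare_all_segments(seq_a: str, seq_b: str, window: int) -> list[list[int]]:
    """Score every window-length segment of seq_a against every one of seq_b.

    Entry [i][j] is the score for the segments starting at seq_a[i] and
    seq_b[j], so the table holds roughly len(seq_a) * len(seq_b) scores.
    """
    starts_a = range(len(seq_a) - window + 1)
    starts_b = range(len(seq_b) - window + 1)
    return [
        [segment_score(seq_a[i:i + window], seq_b[j:j + window]) for j in starts_b]
        for i in starts_a
    ]


# The worked example from the text: two 7-residue segments.
print(segment_score("ALGAWDE", "ALATWDE"))  # 5
```

The nested loop makes the cost explicit: the number of segment pairs grows as the product of the two sequence lengths.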
In early studies of protein sequences, statistical analysis of segment
comparison scores was used to infer homology between sequences. For
example, Fitch [4] applied the genetic code scoring scheme to the
comparison of $\alpha$- and $\beta$-haemoglobin and showed the score
distribution to be non-random. Today, segment comparison methods are
most commonly used in association with a ``dot plot'' or ``diagram''
[19] and can be a more effective method of
finding repeats than using dynamic programming.
The scores obtained by comparing all pairs of segments from $A$ and
$B$ may be represented as a comparison matrix $R$, where each element
$R_{i,j}$ represents the score for matching an odd length segment
centred on residue $A_i$ with one centred on residue $B_j$. This
matrix can provide a graphic representation of the segment comparison
data, particularly if the scores are contoured at a series of
probability levels to illustrate the most significantly similar
regions. Collins and Coulson [20] have summarised the features of
the ``dot-plot''. The runs of similarity can be enhanced visually by
placing a dot at all the contributing match points in a window rather
than just at the centre.
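A comparison matrix of this kind, and the two ways of placing dots, might be sketched as follows. The names `comparison_matrix`, `dot_plot`, `render` and the `fill_window` flag are hypothetical, identity scoring is assumed, and a single raw score `threshold` stands in for contouring at probability levels, which is a deliberate simplification.

```python
def comparison_matrix(seq_a, seq_b, window):
    """Score odd-length windows centred on seq_a[i] and seq_b[j].

    Cells too close to either end for a full window are left as None.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a centre residue")
    half = window // 2
    matrix = [[None] * len(seq_b) for _ in seq_a]
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            matrix[i][j] = sum(
                seq_a[i + k] == seq_b[j + k] for k in range(-half, half + 1)
            )
    return matrix


def dot_plot(seq_a, seq_b, window, threshold, fill_window=False):
    """Return the set of (i, j) cells that receive a dot.

    With fill_window=False only the window centre is marked. With
    fill_window=True every matching position inside a window scoring at
    least `threshold` is marked.
    """
    half = window // 2
    matrix = comparison_matrix(seq_a, seq_b, window)
    dots = set()
    for i, row in enumerate(matrix):
        for j, score in enumerate(row):
            if score is None or score < threshold:
                continue
            if not fill_window:
                dots.add((i, j))
                continue
            for k in range(-half, half + 1):
                if seq_a[i + k] == seq_b[j + k]:
                    dots.add((i + k, j + k))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text, one row per residue of seq_a."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        for i in range(len(seq_a))
    )


dots = dot_plot("ALGAWDE", "ALATWDE", window=3, threshold=2, fill_window=True)
print(render("ALGAWDE", "ALATWDE", dots))
```

Marking all contributing match points extends each diagonal run out to the ends of the window, which is why similar regions stand out more clearly than with centre-only dots.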
McLachlan [6] introduced two further refinements into segment comparison methods. The first was the inclusion of weights in the comparison of two segments in order to improve the definition of the ends of regions of similarity. For example, the scores obtained at each position in a 5-residue segment comparison might be multiplied by 1, 2, 3, 2, 1 respectively before being summed. The second refinement was the development of probability distributions that agreed well with experimental comparisons on random and unrelated sequences, and that could be used to estimate the significance of an observed comparison.
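The weighting idea can be illustrated with a short sketch. The names `triangular_weights` and `weighted_segment_score` are hypothetical, identity scoring is assumed for simplicity, and the triangular profile is simply a generalisation of the 1, 2, 3, 2, 1 example to other odd window lengths. The probability distributions are not covered here.

```python
def triangular_weights(window: int) -> list[int]:
    """Weights rising to a peak at the window centre, e.g. 1, 2, 3, 2, 1."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a centre position")
    half = window // 2
    return [half + 1 - abs(k) for k in range(-half, half + 1)]


def weighted_segment_score(seg_a: str, seg_b: str, weights: list[int]) -> int:
    """Sum of per-position identity scores, each multiplied by its weight."""
    if not (len(seg_a) == len(seg_b) == len(weights)):
        raise ValueError("segments and weights must have the same length")
    return sum(w * (x == y) for w, x, y in zip(weights, seg_a, seg_b))


weights = triangular_weights(5)
print(weights)                                            # [1, 2, 3, 2, 1]
print(weighted_segment_score("ALGAW", "ALATW", weights))  # 4
```

In this example the matches fall at positions 1, 2 and 5, so the weighted score is 1 + 2 + 1 = 4. Because mismatches near the centre cost more than those at the edges, a window that only partly overlaps a similar region scores lower, which sharpens the ends of the region.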