Sequence comparison without gaps - fixed length segments

Given two sequences A and B of lengths m and n respectively, all possible overlapping segments of a particular length (sometimes called a "window length") from A are compared to all segments of B. This requires of the order of m × n comparisons. For each pair of segments, the amino acid pair scores are accumulated over the length of the segment. For example, consider the comparison of two 7-residue segments, ALGAWDE and ALATWDE, using identity scoring. The total score for this pair would be 1 + 1 + 0 + 0 + 1 + 1 + 1 = 5.
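The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch using identity scoring; the function names `segment_score` and `compare_all_segments` are chosen here for the example and do not come from any particular program.

```python
def segment_score(seg_a, seg_b):
    """Identity score: 1 for each position where the residues match, else 0."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have equal length")
    return sum(1 for x, y in zip(seg_a, seg_b) if x == y)


def compare_all_segments(seq_a, seq_b, window):
    """Score every window-length segment of seq_a against every one of seq_b.

    scores[i][j] is the score for the segment starting at position i of
    seq_a and the segment starting at position j of seq_b (0-based).
    The number of comparisons is roughly len(seq_a) * len(seq_b).
    """
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [segment_score(seq_a[i:i + window], seq_b[j:j + window])
         for j in range(cols)]
        for i in range(rows)
    ]


# The worked example from the text: two 7-residue segments.
print(segment_score("ALGAWDE", "ALATWDE"))  # 5
```

A different scoring scheme could be substituted by replacing the equality test in `segment_score` with a lookup in a table of amino acid pair scores.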

In early studies of protein sequences, statistical analysis of segment comparison scores was used to infer homology between sequences. For example, Fitch [4] applied the genetic code scoring scheme to the comparison of α- and β-haemoglobin and showed the score distribution to be non-random. Today, segment comparison methods are most commonly used in association with a "dot plot" or "diagram" [19] and can be a more effective method of finding repeats than dynamic programming.

The scores obtained by comparing all pairs of segments from the two sequences, A and B, may be represented as a comparison matrix R, where each element R(i,j) represents the score for matching an odd-length segment centred on residue i of A with one centred on residue j of B. This matrix can provide a graphic representation of the segment comparison data, particularly if the scores are contoured at a series of probability levels to illustrate the most significantly similar regions. Collins and Coulson [20] have summarised the features of the "dot plot". The runs of similarity can be enhanced visually by placing a dot at all the contributing match points in a window rather than just at the centre.
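A minimal sketch of such a comparison matrix and a text-mode dot plot is given below. It makes several simplifying assumptions that are not in the text: identity scoring is used, a single raw score threshold stands in for contouring at probability levels, and the names `centred_matrix`, `dot_plot`, `render` and the `fill_window` flag are hypothetical. Setting `fill_window=True` marks every contributing match point in a qualifying window rather than only its centre.

```python
def centred_matrix(seq_a, seq_b, window):
    """Matrix of identity scores for odd-length windows centred on each
    residue pair. Positions too close to an end for a full window are None."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    matrix = [[None] * len(seq_b) for _ in seq_a]
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            matrix[i][j] = sum(
                seq_a[i + k] == seq_b[j + k] for k in range(-half, half + 1)
            )
    return matrix


def dot_plot(seq_a, seq_b, window, threshold, fill_window=False):
    """Boolean grid with a dot wherever a window scores >= threshold."""
    half = window // 2
    matrix = centred_matrix(seq_a, seq_b, window)
    dots = [[False] * len(seq_b) for _ in seq_a]
    for i, row in enumerate(matrix):
        for j, score in enumerate(row):
            if score is None or score < threshold:
                continue
            if fill_window:
                # Mark every matching position inside the window.
                for k in range(-half, half + 1):
                    if seq_a[i + k] == seq_b[j + k]:
                        dots[i + k][j + k] = True
            else:
                # Mark only the centre of the window.
                dots[i][j] = True
    return dots


def render(dots):
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if d else "." for d in row) for row in dots)


print(render(dot_plot("ALGAWDE", "ALATWDE", window=3, threshold=3,
                      fill_window=True)))
```

With `fill_window=False` the same call marks only the centre of each high-scoring window, so diagonal runs of similarity appear shorter and fainter.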

McLachlan [6] introduced two further refinements into segment comparison methods. The first was the inclusion of weights in the comparison of two segments in order to improve the definition of the ends of regions of similarity. For example, the scores obtained at each position in a 5-residue segment comparison might be multiplied by 1, 2, 3, 2, 1 respectively before being summed. The second refinement was the development of probability distributions that agreed well with experimental comparisons on random and unrelated sequences and that could be used to estimate the significance of an observed comparison.
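The weighting idea can be illustrated with a short sketch. Identity scoring is assumed here for simplicity, and `weighted_segment_score` is a name chosen for this example; it is not a reproduction of McLachlan's implementation, only of the 1, 2, 3, 2, 1 weighting described above.

```python
def weighted_segment_score(seg_a, seg_b, weights):
    """Sum of per-position identity scores, each multiplied by its weight."""
    if not (len(seg_a) == len(seg_b) == len(weights)):
        raise ValueError("segments and weights must have equal length")
    return sum(w * (x == y) for x, y, w in zip(seg_a, seg_b, weights))


weights = [1, 2, 3, 2, 1]  # the 5-residue example weights from the text

# A mismatch at the end of the window costs less than one at the centre.
print(weighted_segment_score("GAWDE", "GAWDE", weights))  # 9: all positions match
print(weighted_segment_score("GAWDE", "TAWDE", weights))  # 8: mismatch at one end
print(weighted_segment_score("GAWDE", "GAFDE", weights))  # 6: mismatch at the centre
```

Without weights the last two comparisons would both score 4. Because a window that only partly overlaps a similar region scores less than one centred within it, the weighted scores fall off more sharply at the boundaries of the region.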


geoff.barton@ox.ac.uk