[Jalview-discuss] Calculating the percent identity between two sequences

Jim Procter jprocter at compbio.dundee.ac.uk
Sun Feb 27 14:32:14 GMT 2011

Hello Joel

On 25/02/2011 21:08, Joel Guenther wrote:
> I'd like to be able to calculate the percent identity for two
> sequences in an alignment. The attached alignment (with several empty
> columns) contains two sequences that were pulled from a larger
> structure-based alignment generated by Dali. In Jalview, when I select
> the two sequences and perform a pairwise alignment calculation
> (Calculate —> Pairwise Alignments...) the output (attached) only
> includes an alignment that contains only 7 columns, but the two
> sequences are 204 and 224 aa in length and the structures are highly
> conserved throughout.  
> Why isn't Jalview comparing the sequences along their full length, and
> can I force it to do so?
I suspect you may not realise that the 'Pairwise alignment' option
actually computes a Needleman and Wunsch pairwise alignment for each
pair of sequences in the selected set, using a BLOSUM 62 matrix and
nominal gap parameters (120 for opening, 20 for extension). Whilst these
parameters give a reasonable alignment for sequences with high sequence
homology, they can fail for less homologous pairs.  In your case,
you're trying to align a pair of structurally homologous protein
sequences which have quite a low sequence identity - and the algorithm
just returns a stretch of 7 aa that align well, without any of the other
regions of the two sequences, because the gaps needed to bring those
regions into the alignment would lower its overall score.
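
To give a feel for how quickly gap costs add up, here is a rough Python
sketch. It assumes the common affine convention in which a gap of length
n costs open + (n - 1) * extend; that convention, and the function name
affine_gap_cost, are illustrative assumptions and not necessarily what
Jalview does internally. Only the two penalty values come from the
explanation above.

```python
def affine_gap_cost(length, gap_open=120, gap_extend=20):
    """Cost of one gap of `length` columns under an assumed affine scheme.

    Assumption: the first gapped column costs `gap_open` and every
    further column costs `gap_extend`. Other conventions exist.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# Penalty paid for single gaps of increasing length.
for n in (1, 5, 20):
    print(n, affine_gap_cost(n))
```

Every gap has to be paid for by enough well-scoring aligned residues on
either side of it. When sequence identity is low there are few of those,
so an alignment that avoids the gaps can end up with the better score.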

> If Jalview won't compare full length sequences, is there another
> program that will?
There are plenty out there (check out EMBOSS, for instance:
http://emboss.sourceforge.net/servers/#pise), but I get the impression
that what you actually want is the percentage identity of the pair of
sequences as aligned by DALI. Apart from looking in the DALI report
(where, if I remember correctly, you will always find a percent identity
score in addition to Dali's own Z-score), the quickest way to do this
in the current version of Jalview is to copy one or both of the sequences
into the same alignment, and then calculate a percent identity tree.
The branches will be labelled with the %age difference between the
sequences, *under current alignment length*. The reason I stress this is
because if I do this with your DALI alignment as you sent it, I get a
value of 9.3 - ie the sequences are 90.7% identical - however, if I
exclude the gapped columns in the alignment (using Edit->Remove empty
columns), I get 37.5 - ie 62.5% identical. This number is probably still
not reliable, because there are a fair few 'X' symbols in both sequences
that do not align to other Xes, and Jalview will count these as a
mismatch, rather than a match (also now reported as a bug).
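
The effect of the empty columns can be reproduced with a small Python
sketch. It assumes that a column in which both sequences have a gap is
counted as a match when it is left in the alignment, which would explain
why identity drops once the empty columns are removed; that is an
assumption made for illustration, not a description of Jalview's code.
The function name percent_identity and the two toy rows are made up.

```python
GAP_CHARS = set("-.")


def percent_identity(seq_a, seq_b, skip_shared_gaps=True):
    """Percent identity of two rows taken from the same alignment.

    With skip_shared_gaps=False, columns where both rows have a gap are
    kept and counted as matches (assumed behaviour, see the lead-in).
    With skip_shared_gaps=True, those columns are left out entirely.
    A gap opposite a residue always counts as a mismatch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        both_gaps = a in GAP_CHARS and b in GAP_CHARS
        if both_gaps and skip_shared_gaps:
            continue
        compared += 1
        if both_gaps or a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy rows: 5 identical residues, 3 mismatches, 4 columns gapped in both.
row_a = "MK--TAY--IAK"
row_b = "MR--TGY--LAK"

print(percent_identity(row_a, row_b, skip_shared_gaps=False))  # 9 of 12 columns
print(percent_identity(row_a, row_b, skip_shared_gaps=True))   # 5 of 8 columns
```

The same pair of rows gives two different figures depending only on
whether the shared gap columns are part of the denominator, which is why
it matters to say which alignment length a percent identity refers to.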

I will schedule for implementation a new function allowing a pairwise
%age identity matrix (or flat report) to be generated, enabling you to
do these calculations more easily.

Hope this clears things up - thanks for the email!

ps. if you find the last comment about gaps/non gaps confusing, you
might want to check out Geoff Barton's paper about percentage identity,
and this wiki page :

J. B. Procter  (JALVIEW/ENFIN)  Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764  http://www.compbio.dundee.ac.uk
