[Jalview-discuss] Calculating the percent identity between two sequences
guenthej at gmail.com
Wed May 25 07:52:07 BST 2011
I noticed an odd behavior when calculating the percent identity between two
sequences using the Calculate Tree features. The values that are returned
appear to be too high because, when at least one of the two sequences
contains a gap, then the column is scored as an identity. My guess is that
the gap character is being scored as a wildcard that matches all other
characters. I could be wrong, though, because I didn't investigate very
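
To make the effect concrete, here is a minimal sketch of how the treatment of gaps changes a pairwise percent identity. This is not Jalview's code: the function name, the three policy names, and the toy sequences are all made up for illustration, and the "wildcard" policy only mimics the behavior I suspect.

```python
# Illustrative only: NOT Jalview's implementation. The function name,
# policy names and example sequences are assumptions made for this sketch.
GAP_CHARS = frozenset("-.")  # assumed gap symbols


def percent_identity(seq_a, seq_b, gap_policy="skip"):
    """Percent identity of two pre-aligned sequences of equal length.

    gap_policy:
      "wildcard" - every column is counted; a gap in either sequence
                   scores as a match (the suspected behavior).
      "mismatch" - columns where both sequences have a gap are ignored;
                   a gap opposite a residue scores as a mismatch.
      "skip"     - any column containing a gap is ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if gap_policy not in ("wildcard", "mismatch", "skip"):
        raise ValueError("unknown gap_policy: %r" % (gap_policy,))

    matches = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        a_gap = a in GAP_CHARS
        b_gap = b in GAP_CHARS
        if gap_policy == "wildcard":
            compared += 1
            if a_gap or b_gap or a == b:
                matches += 1
        elif gap_policy == "mismatch":
            if a_gap and b_gap:
                continue  # empty column: nothing to compare
            compared += 1
            if not a_gap and not b_gap and a == b:
                matches += 1
        else:  # "skip"
            if a_gap or b_gap:
                continue
            compared += 1
            if a == b:
                matches += 1
    # Avoid division by zero when no column was comparable.
    return 100.0 * matches / compared if compared else 0.0


seq_a = "ACD-EFG-HK"
seq_b = "ACDWE-GAHR"
for policy in ("wildcard", "mismatch", "skip"):
    print(policy, round(percent_identity(seq_a, seq_b, policy), 1))
# wildcard 90.0
# mismatch 60.0
# skip 85.7
```

The same pair of aligned sequences gives three quite different numbers, so any reported percent identity needs to say which columns went into the denominator and how gapped columns were scored.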
On Sun, Feb 27, 2011 at 11:01 PM, Engin Özkan <eozkan at stanford.edu> wrote:
> Dear Jim,
> Let me second a request for adding a percent id matrix output, especially
> straight from the multiple sequence alignment. MSAs can be better than
> pairwise alignments especially in the case of distant pairs, and reporting
> pairwise identities based on a new N-W alignment, while the better alignment
> (hopefully) exists within the MSA, seems unwanted. Come to think of it, the
> option to calculate "new" pairwise alignments should still exist, in case
> the MSA is of lower quality.
> Thanks for the great work,
> On 2/27/11 2:16 PM, Joel Guenther wrote:
> Hi, Jim.
> Thanks for the reply. Following your advice, I was able to calculate a
> percent identity between two sequences (with empty columns removed) using:
> Calculate -> Calculate Tree -> Neighbor joining using % Identity
> If you have time, adding a percentage identity matrix output to Jalview
> would be nice, but not essential.
> Thanks again!
> On Sun, Feb 27, 2011 at 6:32 AM, Jim Procter <
> jprocter at compbio.dundee.ac.uk> wrote:
>> Hello Joel
>> On 25/02/2011 21:08, Joel Guenther wrote:
>> > I'd like to be able to calculate the percent identity for two
>> > sequences in an alignment. The attached alignment (with several empty
>> > columns) contains two sequences that were pulled from a larger
>> > structure-based alignment generated by Dali. In Jalview, when I select
>> > the two sequences and perform a pairwise alignment calculation
>> > (Calculate -> Pairwise Alignments...) the output (attached)
>> > includes an alignment that contains only 7 columns, but the two
>> > sequences are 204 and 224 aa in length and the structures are highly
>> > conserved throughout.
>> > Why isn't Jalview comparing the sequences along their full length, and
>> > can I force it to do so?
>> I suspect you may not realise that the 'Pairwise alignment' option
>> actually computes a Needleman and Wunsch pairwise alignment for each
>> pair of sequences in the selected set, using a BLOSUM 62 matrix and
>> nominal gap parameters (120 for opening, 20 for widening). Whilst these
>> parameters give a reasonable alignment for sequences with high sequence
>> homology, they can fail for less homologous pairs. In your case,
>> you're trying to align a pair of structurally homologous protein
>> sequences which have quite a low sequence identity - and the algorithm
>> just returns a stretch of 7 aa that align well, without any of the other
>> regions of the two sequences, because the gaps introduced into the
>> alignment make them far less optimal.
>> > If Jalview won't compare full length sequences, is there another
>> > program that will?
>> There are plenty out there (check out EMBOSS, for instance:
>> http://emboss.sourceforge.net/servers/#pise), but I get the impression
>> that what you actually want is the percentage identity of the pair of
>> sequences as aligned by DALI. Apart from looking in the DALI report
>> (where, if I remember correctly, you will always find a percent identity
>> score in addition to Dali's own Z-score), the quickest way to do this
>> in the current version of Jalview is to copy one or both of the sequences
>> into the same alignment, and then calculate a percent identity tree.
>> The branches will be labelled with the %age difference between the
>> sequences, *under current alignment length*. The reason I stress this is
>> because if I do this with your DALI alignment as you sent it, I get a
>> value of 9.3 - ie the sequences are 90.7% identical - however, if I
>> exclude the gapped columns in the alignment (using Edit->Remove empty
>> columns), I get 37.5 - ie 63.5% identical. This number is probably still
>> not reliable, because there are a fair few 'X' symbols in both sequences
>> that do not align to other Xes, and Jalview will count these as a
>> mismatch, rather than a match (also now reported as a bug).
>> I will schedule for implementation a new function allowing a pairwise
>> %age identity matrix (or flat report) to be generated, enabling you to
>> do these calculations more easily.
>> Hope this clears things up - thanks for the email!
>> ps. if you find the last comment about gaps/non gaps confusing, you
>> might want to check out Geoff Barton's paper about percentage identity,
>> and this wiki page :
>> J. B. Procter (JALVIEW/ENFIN) Barton Bioinformatics Research Group
>> Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk
>> The University of Dundee is a Scottish Registered Charity, No. SC015096.
>> Jalview-discuss mailing list
>> Jalview-discuss at jalview.org
> Jalview-discuss mailing list
> Jalview-discuss at jalview.org
> http://www.compbio.dundee.ac.uk/mailman/listinfo/jalview-discuss
> Engin Özkan
> Post-doctoral Scholar
> Laboratory of K. Christopher Garcia
> Howard Hughes Medical Institute
> Dept of Molecular and Cellular Physiology
> 279 Campus Drive, Beckman Center B173
> Stanford School of Medicine
> Stanford, CA 94305
> ph: (650)-498-7111