Sequence families in 3Dee are groups of PDB chains related by
sequence similarity. When the first release of the database was
created, sequence families were derived by complete linkage
hierarchical clustering with OC [Barton, 1997], according to
probability scores calculated with the SCANPS sequence comparison
program [Barton, 1993]. At that time there were 4,205 chains in
the database, which now (May 2000) contains 12,458 chains.
Hierarchical cluster analysis of a large number of objects is very
CPU-intensive. Therefore, a method was derived that does not
require recalculation of all probability scores and hierarchical
clustering of all chains.
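
To give a rough sense of scale, the number of pairwise scores behind a
full recalculation can be counted directly. The sketch below is only
illustrative: the function names are hypothetical, the chain counts
are the two quoted above, and the batch of 100 new chains is an
assumed figure, not one taken from 3Dee.

```python
def pairwise_count(n: int) -> int:
    """Number of unordered pairs among n chains (all-against-all)."""
    return n * (n - 1) // 2


def incremental_count(existing: int, new: int) -> int:
    """Pairs needed when only new chains are scored:
    every new chain against every existing chain, plus new against new."""
    return existing * new + pairwise_count(new)


# All-against-all comparison for the first release and for May 2000.
print(pairwise_count(4205))    # 8838910
print(pairwise_count(12458))   # 77594653

# Hypothetical update: 100 new chains added to 12,458 existing ones.
print(incremental_count(12458, 100))  # 1250750
```

Beyond the scores themselves, an incremental update also avoids
repeating the hierarchical clustering over every chain.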
Figure 5 illustrates how new chains are either added to existing
sequence families or used to create new ones. First, chains
belonging to files that are no longer present in the new version of
the PDB are deleted from the sequence families. Then SCANPS
probabilities are calculated between each new chain and the existing
chains, and between the new chains themselves. If the SCANPS
probabilities of a new chain with all members of an existing
sequence family are lower, i.e. better, than the original
probability threshold used for complete linkage clustering, and the
chain matches no other family, the chain is added to that family. A
chain that matches more than one sequence family is added to the
family with which it has the lowest probability score, while a chain
that matches none of the existing sequence families founds a new
family.
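
The update procedure can be sketched in a few lines of Python. This is
a minimal illustration under stated assumptions, not the 3Dee
implementation: `update_families`, the `score` callback (standing in
for a SCANPS probability lookup), and the `new0`, `new1`, ... family
names are all hypothetical. It also assumes that new chains are
processed one at a time, so that later chains can join families
founded by earlier ones, that a family left empty after deletion is
dropped, and that "lowest probability score" means the lowest score
against any single member of the family.

```python
from typing import Callable, Dict, Iterable, List


def update_families(
    families: Dict[str, List[str]],
    new_chains: Iterable[str],
    removed_chains: Iterable[str],
    score: Callable[[str, str], float],
    threshold: float,
) -> Dict[str, List[str]]:
    """Return updated sequence families without re-clustering everything."""
    # Step 1: delete chains whose files are gone from the new PDB version.
    # Assumption: a family that loses all its members is dropped.
    removed = set(removed_chains)
    updated: Dict[str, List[str]] = {}
    for fam_id, members in families.items():
        kept = [m for m in members if m not in removed]
        if kept:
            updated[fam_id] = kept

    # Step 2: place each new chain.
    next_id = 0
    for chain in new_chains:
        best_family = None
        best_score = None
        for fam_id, members in updated.items():
            scores = [score(chain, m) for m in members]
            # Complete linkage: the chain must beat the threshold
            # against ALL members, so test the worst (highest) score.
            if max(scores) < threshold:
                # Assumption: rank matching families by lowest score.
                lowest = min(scores)
                if best_score is None or lowest < best_score:
                    best_family, best_score = fam_id, lowest

        if best_family is None:
            # No family matched: the chain founds a new family.
            while f"new{next_id}" in updated:
                next_id += 1
            best_family = f"new{next_id}"
            updated[best_family] = []
        updated[best_family].append(chain)

    return updated
```

Only scores involving a new chain are ever requested from `score`, so
the scores between existing chains are never recalculated.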
Figure 5: Depending on their probability score, new protein chains
are either added to existing sequence families or are the first
member of a new sequence family. Proceeding this way, recalculating
all pairwise probability scores of the chains in the database is not
necessary.
Uwe Dengler,
2000-10-16