
Adding new chains

Sequence families in 3Dee are groups of PDB chains related by sequence similarity. When the first release of the database was created, sequence families were derived by complete linkage hierarchical clustering with OC [Barton, 1997], using probability scores calculated with the SCANPS sequence comparison program [Barton, 1993]. At that time the database held 4,205 chains; it now (May 2000) contains 12,458. Hierarchical cluster analysis of such a large number of objects is very CPU-intensive, because it requires a score for every pair of chains. Therefore, a method was derived that does not require recalculation of all probability scores and hierarchical clustering of all chains.

Figure 5 illustrates how new chains are either added to existing sequence families or used to create new ones. First, chains belonging to files that are no longer present in the new version of the PDB are deleted from the sequence families. Then SCANPS probabilities are calculated between the new chains and the existing chains, and among the new chains themselves. If the SCANPS probabilities of a new chain with all members of an existing sequence family are lower (i.e. better) than the original probability threshold used for complete linkage clustering ($10^{-7}$), and the chain matches no other family, it is added to that family. A chain that matches more than one sequence family is added to the family with which it has the lowest probability score, while a chain that matches none of the existing sequence families founds a new family.
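The update procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the 3Dee implementation: `update_families` and its parameters are hypothetical names, and `prob` stands in for a SCANPS probability lookup rather than any real SCANPS interface. The text does not say how "lowest probability score" is measured against a whole family, so the sketch ranks families by their worst (largest) pairwise probability, in keeping with complete linkage. It also assumes that new chains are processed one at a time, so that a later chain can join a family founded by an earlier one, and that families left empty after deletion are dropped.

```python
from typing import Callable, Iterable, List, Set

THRESHOLD = 1e-7  # probability cutoff quoted in the text


def update_families(
    families: Iterable[Iterable[str]],
    new_chains: Iterable[str],
    current_chains: Set[str],
    prob: Callable[[str, str], float],
    threshold: float = THRESHOLD,
) -> List[List[str]]:
    """Return the families after removing obsolete chains and placing new ones.

    `prob(a, b)` is a hypothetical stand-in for the SCANPS probability
    of chains a and b (lower means more similar).
    """
    # Step 1: delete chains whose files are absent from the new PDB version.
    updated: List[List[str]] = []
    for family in families:
        kept = [chain for chain in family if chain in current_chains]
        if kept:  # assumption: a family left empty is dropped
            updated.append(kept)

    # Step 2: place each new chain, one at a time.
    for chain in new_chains:
        best_index = None
        best_score = None
        for index, family in enumerate(updated):
            scores = [prob(chain, member) for member in family]
            # Complete linkage: every member must beat the threshold.
            if all(score < threshold for score in scores):
                # Assumption: a family is ranked by its worst pairwise score.
                family_score = max(scores)
                if best_score is None or family_score < best_score:
                    best_index, best_score = index, family_score
        if best_index is None:
            updated.append([chain])  # no match: the chain founds a new family
        else:
            updated[best_index].append(chain)
    return updated
```

Only scores that involve at least one new chain are requested from `prob`, which is the saving the method is designed to achieve; pairs of existing chains are never rescored.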

Figure 5: Depending on their probability scores, new protein chains are either added to existing sequence families or become the first member of a new sequence family. Proceeding this way makes it unnecessary to recalculate all pairwise probability scores of the chains in the database.
\includegraphics[scale=0.65]{figures/new-chains.ps}


Uwe Dengler, 2000-10-16