Chain, Residue and Protein Naming Conventions

A protein coordinate entry (file) as listed in the Brookhaven Protein Data Bank (PDB) is uniquely identified by a 4 or 5 character code starting with a digit; for example, 1FB4 is the code for an immunoglobulin Fab fragment. However, any given entry may contain multiple protein chains, and these are usually the basic unit of protein structure. If more than one chain exists in an entry, the PDB assigns a single-character code to each chain. For example, the entry 1FB4 contains two chains labelled L and H.

One often wishes to reference an individual chain rather than the whole protein entry. Accordingly, when converting the PDB to Prolog clauses, a two-level naming convention was adopted. Each entry is identified by the PDB code (e.g. 1fb4), and each chain belonging to an entry is identified by the PDB code with the chain identifier code appended (e.g. 1fb4l). If an entry has a single chain without a chain identifier, the chain is identified by the PDB code alone. The PDB code and chain codes are linked by the Prolog clause chain/2 as described below.
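
The two-level convention can be illustrated with a minimal sketch. The argument order chain(Entry, Chain), the helper chain_name/3 and the single-chain code 9xyz are assumptions made for illustration only and are not taken from the database itself. The atoms are quoted because an unquoted Prolog atom cannot begin with a digit; the database may represent the codes differently.

```prolog
% Hypothetical chain/2 facts for entry 1FB4 (chains L and H).
% Assumed argument order: chain(EntryCode, ChainCode).
chain('1fb4', '1fb4l').
chain('1fb4', '1fb4h').

% A single-chain entry with no chain identifier: the chain code is
% the entry code itself. '9xyz' is a made-up placeholder code.
chain('9xyz', '9xyz').

% chain_name(+Entry, +ChainId, -Chain)
% Hypothetical helper: builds a chain code by appending the chain
% identifier to the entry code. Use '' when there is no identifier.
chain_name(Entry, ChainId, Chain) :-
    atom_concat(Entry, ChainId, Chain).
```

Under these assumptions, the query `chain('1fb4', C)` enumerates the chains of an entry, `chain(E, '1fb4h')` recovers the entry that a chain belongs to, and `chain_name('1fb4', l, C)` binds C to '1fb4l'.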

Residue numbers are represented by two-element lists (e.g. [37,a], [37,b], [38,-]) in order to accommodate the insertion characters that allow homologous structures to be numbered in a manner consistent with the 'parent' structure. For example, all serine proteinase enzymes (trypsin, elastase, etc.) are numbered according to the structure of chymotrypsinogen, which was the first member of the family to be solved by X-ray crystallography. Although this simplifies cross-referencing between different proteins in the same family, the flexibility creates difficulties when searching for residues a given number of amino acids before or after the current position, because sequence offsets can no longer be computed by arithmetic on the residue number.
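
One way to see the difficulty, and a possible way around it, is to step along an ordered list of residue numbers instead of adding to the number itself. The sketch below is illustrative only: the residue list and the predicates residues/1, residue_after/4 and step/3 are hypothetical and are not part of the database.

```prolog
% Hypothetical ordered residue numbers for a stretch of one chain.
% The atom '-' marks a residue with no insertion character.
residues([[36,-], [37,-], [37,a], [37,b], [38,-]]).

% residue_after(+Residues, +Res, +N, -Target)
% Target is the residue N positions (N >= 0) after Res, counted
% by position in the list, not by arithmetic on the number.
residue_after([Res|Rest], Res, N, Target) :-
    !,
    step([Res|Rest], N, Target).
residue_after([_|Rest], Res, N, Target) :-
    residue_after(Rest, Res, N, Target).

% step(+List, +N, -Elem)
% Elem is the element at zero-based position N of List.
step([Elem|_], 0, Elem) :- !.
step([_|Rest], N, Elem) :-
    N > 0,
    N1 is N - 1,
    step(Rest, N1, Elem).
```

With this list, the query `residues(Rs), residue_after(Rs, [37,-], 2, T)` binds T to [37,b], whereas adding 2 to the residue number 37 would point at residue 39, which is not the residue two positions along the sequence.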

gjb@bioch.ox.ac.uk