Simple index for identical matching

Indexing has long been used to identify identical ungapped regions in sequences. For example, the SCAN facility in the PSQ and ATLAS programs distributed with the NBRF-PIR databank allows the rapid identification of short identical strings [61]. This is achieved by pre-processing the entire databank once to record the locations of every distinct tripeptide. These data are stored in a direct access file together with pointers to the sequence identifier codes. The query peptide is likewise divided into a series of tripeptides, and identifying the matching sequences in the databank then becomes a simple matter of looking up the starting positions of each query tripeptide in the list held on file. Indexing methods involve a tradeoff between the time and space needed to build and store the index and the number of queries expected, since the cost of building the index is repaid only when it is consulted many times. Searches are usually very fast, involving only a few disk accesses. The drawback of simple indexes is that they are restricted to exact matching without gaps.
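The lookup scheme described above can be illustrated with a minimal in-memory sketch. It is not the SCAN implementation: the function names build_index and find_exact, the toy databank, and the use of a Python dictionary in place of a direct access file are all assumptions made for illustration. The sketch also assumes that the tripeptides are overlapping, which the description above does not state.

```python
from collections import defaultdict

K = 3  # word length: tripeptides, as in the text


def build_index(databank):
    """Pre-process the databank once.

    Maps every tripeptide to the list of (sequence_id, start) pairs
    at which it occurs. A dict stands in for the direct access file
    (an assumption made for this sketch).
    """
    index = defaultdict(list)
    for seq_id, seq in databank.items():
        for start in range(len(seq) - K + 1):
            index[seq[start:start + K]].append((seq_id, start))
    return index


def find_exact(index, query):
    """Return sorted (sequence_id, start) pairs where query occurs exactly.

    Each overlapping tripeptide of the query is looked up in the index.
    A hit at position pos for the tripeptide at query offset `offset`
    implies a candidate match starting at pos - offset. Only candidates
    supported by every tripeptide of the query survive.
    """
    if len(query) < K:
        raise ValueError("query is shorter than the index word length")
    candidates = set(index.get(query[:K], ()))
    for offset in range(1, len(query) - K + 1):
        hits = index.get(query[offset:offset + K], ())
        candidates &= {(seq_id, pos - offset) for seq_id, pos in hits}
        if not candidates:
            break  # no sequence contains all tripeptides in register
    return sorted(candidates)


# Hypothetical toy databank, for illustration only.
databank = {"P1": "MKVLAAGIKVL", "P2": "AAGIK"}
index = build_index(databank)
print(find_exact(index, "AAGIK"))
print(find_exact(index, "AAGLK"))
```

In this sketch a single substitution in the query (AAGLK instead of AAGIK) removes every candidate, which shows the restriction to exact, ungapped matching noted above.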