
Cluster analysis with program ``oc''

Program oc is a general-purpose cluster analysis program. It implements three simple methods for hierarchical clustering; for sequence data, it shows the overall sub-grouping of the sequences. Although one output from ``oc'' is a dendrogram (tree), the program should not be used on its own to estimate phylogeny.

Typing ``oc'' shows the options:



Cluster analysis program

Usage: oc <sim/dis> <single/complete/means> <ps> <cut N>

Version 1.0 - Requires a file to be piped to standard input
Format:  Line   1:  Number (N) of entities to cluster (e.g. 10)
Format:  Lines 2 to 2+N-1:  Identifier codes for the entities (e.g. Entity1)
Format:  N*(N-1)/2:  Distances, or similarities - ie the upper diagonal

Options:
sim = similarity /  dis = distances
method = single/complete/means
ps <file> = plot out dendrogram to <file.ps> 
log = take logs before calculation 
cut = only show clusters above/below the cutoff
id = output identifier codes rather than indexes for entities
timeclus = output times to generate each cluster
amps <file> = produce amps <file>.tree and <file>.tord files
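
As an illustration of the input format described in the usage message, the following Python sketch builds an input file for three entities. The function name format_oc_input and the scores are hypothetical. The sketch also assumes that values are written one per line and that the upper diagonal is listed row by row (1-2, 1-3, ..., 2-3, ...); neither detail is stated in the usage message, so check both against your own data.

```python
from itertools import combinations


def format_oc_input(ids, score):
    """Return text in the layout given by the oc usage message.

    ids   -- list of identifier codes
    score -- function giving the similarity or distance for a pair of ids
    """
    lines = [str(len(ids))]          # Line 1: number of entities
    lines.extend(ids)                # Next N lines: identifier codes
    # N*(N-1)/2 values, one per unordered pair.
    # ASSUMPTION: row-by-row order of the upper diagonal, one value per line.
    for a, b in combinations(ids, 2):
        lines.append(str(score(a, b)))
    return "\n".join(lines) + "\n"


# Hypothetical scores, for illustration only.
demo_scores = {("A", "B"): 10, ("A", "C"): 4, ("B", "C"): 7}
text = format_oc_input(["A", "B", "C"], lambda a, b: demo_scores[(a, b)])
print(text, end="")
```

The resulting text could be saved to a file and piped to oc on standard input.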

Complete linkage cluster analysis usually gives the most interpretable results. To run oc on a data file, for example the output of a scanps pairwise comparison run that includes only raw scores:



oc sim complete ps test id < test.ocin > test.ocout

sim tells oc to work in similarity mode: larger numbers in the input file mean that the objects being compared are more similar. The alternative is distance mode (dis), where smaller numbers mean greater similarity.

complete refers to the method of cluster analysis. This is a little difficult to explain without a diagram or equations (maybe in the next manual), but in brief: complete linkage joins clusters only if all members of both clusters are similar to each other at or above a given level of similarity. single linkage joins clusters if at least one pair of members, one from each cluster, is similar at that level. means joins clusters on the basis of the mean similarity between them.
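
The three criteria can be sketched in Python as follows. This is an illustration of the standard definitions in similarity mode, not a description of how oc is coded. The function name cluster_similarity and the scores are hypothetical, and treating ``means'' as the arithmetic mean over all between-cluster pairs is an assumption.

```python
def cluster_similarity(cluster_a, cluster_b, sim, method):
    """Similarity between two clusters under the named linkage method.

    sim maps frozenset({x, y}) to the similarity of entities x and y.
    """
    values = [sim[frozenset((a, b))] for a in cluster_a for b in cluster_b]
    if method == "single":
        return max(values)   # the most similar pair decides
    if method == "complete":
        return min(values)   # the least similar pair decides
    if method == "means":
        # ASSUMPTION: plain arithmetic mean over all between-cluster pairs.
        return sum(values) / len(values)
    raise ValueError("unknown method: " + method)


# Hypothetical similarities between members of clusters {a, b} and {c, d}.
sim = {
    frozenset(("a", "c")): 8,
    frozenset(("a", "d")): 2,
    frozenset(("b", "c")): 6,
    frozenset(("b", "d")): 4,
}
for method in ("single", "complete", "means"):
    print(method, cluster_similarity(["a", "b"], ["c", "d"], sim, method))
```

In distance mode the roles of max and min are swapped, since smaller numbers then mean greater similarity.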

ps test asks ``oc'' to draw a dendrogram. This will be stored in the file ``test.ps''. This is a PostScript file and can be printed on a PostScript printer, or viewed using GhostScript/GhostView. Currently, the dendrogram does not have a proper axis but just shows max and min values found for joining clusters.

id asks for ID codes, rather than index numbers, to be output to identify the members of each cluster.

The output of this comparison is shown here:



## 0 646 2
 HAHOD HAHOK
## 1 587 2
 HAKOAW HAJSA
## 2 543 3
 HAJUA HAHOD HAHOK
## 3 475 5
 HAJUA HAHOD HAHOK HAKOAW HAJSA
## 4 433 6
 HAFEDR HAJUA HAHOD HAHOK HAKOAW HAJSA
## 5 261 7
 HBOTE HAFEDR HAJUA HAHOD HAHOK HAKOAW HAJSA

Each line starting with ``##'' shows the cluster number (starting at 0), the score at which all members of the cluster are similar, and the number of members in the cluster. The line following each ``##'' line lists the ID codes of the members of that cluster.
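
For further processing, output in this layout can be read with a few lines of Python. The sketch below is a minimal reader, assuming the output contains only ``##'' lines each followed by one line of members, as in the example above; the function name parse_oc_clusters is hypothetical, and any other lines are skipped.

```python
def parse_oc_clusters(text):
    """Parse '## number score size' lines and the member line after each."""
    clusters = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith("##"):
            continue                      # skip anything that is not a cluster header
        fields = line.split()
        number, score, size = fields[1:4]
        members = next(lines, "").split()  # ID codes are on the following line
        clusters.append({
            "number": int(number),
            "score": float(score),
            "size": int(size),
            "members": members,
        })
    return clusters


# The example output from this section.
sample = """## 0 646 2
 HAHOD HAHOK
## 1 587 2
 HAKOAW HAJSA
## 2 543 3
 HAJUA HAHOD HAHOK
## 3 475 5
 HAJUA HAHOD HAHOK HAKOAW HAJSA
## 4 433 6
 HAFEDR HAJUA HAHOD HAHOK HAKOAW HAJSA
## 5 261 7
 HBOTE HAFEDR HAJUA HAHOD HAHOK HAKOAW HAJSA
"""
clusters = parse_oc_clusters(sample)
for c in clusters:
    print(c["number"], c["score"], " ".join(c["members"]))
```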

oc will optionally accept a cutoff score (the cut option). If a cutoff is given, only clusters that score above the cutoff (or below it in distance mode) are output. This can be useful for filtering comparisons of very large numbers of sequences.
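
The effect of a cutoff can be illustrated with the cluster scores from the example output. This sketch only mimics the rule stated above; the function name passes_cutoff is hypothetical, and whether oc treats a score exactly equal to the cutoff as passing is not stated here, so the strict comparison is an assumption.

```python
def passes_cutoff(score, cutoff, mode="sim"):
    """True if a cluster joined at this score would be shown for the cutoff."""
    if mode == "sim":
        return score > cutoff   # similarity mode: keep clusters scoring above
    if mode == "dis":
        return score < cutoff   # distance mode: keep clusters scoring below
    raise ValueError("mode must be 'sim' or 'dis'")


# Cluster scores from the example output in this section.
scores = [646, 587, 543, 475, 433, 261]
kept = [s for s in scores if passes_cutoff(s, 500, "sim")]
print(kept)
```

With a cutoff of 500 in similarity mode, only the first three clusters of the example would remain.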

The PostScript tree is shown in the file test.ps.






gjb@bioch.ox.ac.uk