Program oc is a general purpose cluster analysis program. It implements three simple methods for hierarchical clustering and, for sequence data, shows the overall sub-grouping of the sequences. Although one output from ``oc'' is a dendrogram or tree, the program should not be used alone to estimate phylogeny.
Typing ``oc'' shows the options:
    Cluster analysis program
    Usage: oc <sim/dis> <single/complete/means> <ps> <cut N>
    Version 1.0 - Requires a file to be piped to standard input
    Format: Line 1: Number (N) of entities to cluster (e.g. 10)
    Format: Lines 2 to 2+N-1: Identifier codes for the entities (e.g. Entity1)
    Format: N*(N-1)/2: Distances, or similarities - ie the upper diagonal
    Options:
    sim = similarity / dis = distances
    method = single/complete/means
    ps <file> = plot out dendrogram to <file.ps>
    log = take logs before calculation
    cut = only show clusters above/below the cutoff
    id = output identifier codes rather than indexes for entities
    timeclus = output times to generate each cluster
    amps <file> = produce amps <file>.tree and <file>.tord files
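The input layout described in the usage message can be produced from a symmetric score matrix with a few lines of Python. This is only a sketch: the function name write_oc_input and the example data are hypothetical, and it assumes that the upper-diagonal values are listed row by row (1-2, 1-3, ..., 2-3, ...) with one value per line, which the usage message does not state explicitly.

```python
import io

def write_oc_input(ids, matrix, out):
    """Write N, the N identifier codes, then the N*(N-1)/2 upper-diagonal values.

    Assumption: values are written row by row, one per line.
    """
    n = len(ids)
    out.write(f"{n}\n")                    # Line 1: number of entities
    for name in ids:                       # Lines 2 to 2+N-1: identifier codes
        out.write(f"{name}\n")
    for i in range(n):                     # Remaining lines: upper diagonal
        for j in range(i + 1, n):
            out.write(f"{matrix[i][j]}\n")

# Hypothetical example data: three entities with pairwise similarity scores.
ids = ["A", "B", "C"]
scores = [[0, 5, 2],
          [5, 0, 7],
          [2, 7, 0]]

buf = io.StringIO()
write_oc_input(ids, scores, buf)
oc_text = buf.getvalue()
print(oc_text, end="")
```

Writing to a real file instead of the in-memory buffer gives something that can be piped to oc on standard input.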
Usually, complete linkage cluster analysis gives the most interpretable results. To run oc on a data file, for example the output of a scanps pairwise comparison run that contains only raw scores, type:
oc sim complete ps test id < test.ocin > test.ocout
sim tells oc to work in similarity mode. In this mode, larger numbers in the input file mean that the objects being compared are more similar. The alternative is distance mode (dis), where smaller numbers mean greater similarity.
complete refers to the method of cluster analysis. This is a little difficult to explain without a diagram or equations (maybe in the next manual), but complete linkage joins two clusters only if all members of both clusters are similar to each other at or above a given level of similarity. single linkage joins two clusters if at least one pair of members, one from each cluster, is similar at that level. means joins clusters on the basis of the mean similarity between them.
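The difference between the three methods can be illustrated by the score each one assigns to a pair of clusters in similarity mode. The following is a sketch of the textbook definitions, not the code inside oc: the function linkage_score and the example matrix are hypothetical, and it assumes that means is the plain average over all between-cluster pairs.

```python
def linkage_score(cluster_a, cluster_b, sim, method):
    """Similarity at which two clusters would join under a given method.

    cluster_a, cluster_b: lists of entity indexes; sim: symmetric matrix.
    """
    pair_scores = [sim[a][b] for a in cluster_a for b in cluster_b]
    if method == "single":      # the best single pair decides
        return max(pair_scores)
    if method == "complete":    # every pair must reach the level, so the worst pair decides
        return min(pair_scores)
    if method == "means":       # assumed: average over all between-cluster pairs
        return sum(pair_scores) / len(pair_scores)
    raise ValueError(f"unknown method: {method}")

# Hypothetical similarity matrix for four entities.
sim = [[0, 9, 4, 1],
       [9, 0, 6, 3],
       [4, 6, 0, 8],
       [1, 3, 8, 0]]

a, b = [0, 1], [2, 3]
single = linkage_score(a, b, sim, "single")
complete = linkage_score(a, b, sim, "complete")
means = linkage_score(a, b, sim, "means")
print(single, complete, means)
```

In distance mode the roles of max and min would be swapped, since smaller numbers then mean greater similarity.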
ps test asks ``oc'' to draw a dendrogram. This will be stored in the file ``test.ps''. This is a PostScript file and can be printed on a PostScript printer, or viewed using GhostScript/GhostView. Currently, the dendrogram does not have a proper axis but just shows max and min values found for joining clusters.
id asks for ID codes, rather than index numbers, to be output to identify the members of each cluster.
The output of this comparison is shown here:
    ## 0 646 2
    HAHOD HAHOK
    ## 1 587 2
    HAKOAW HAJSA
    ## 2 543 3
    HAJUA HAHOD HAHOK
    ## 3 475 5
    HAJUA HAHOD HAHOK HAKOAW HAJSA
    ## 4 433 6
    HAFEDR HAJUA HAHOD HAHOK HAKOAW HAJSA
    ## 5 261 7
    HBOTE HAFEDR HAJUA HAHOD HAHOK HAKOAW HAJSA
Each line starting with ``##'' shows the cluster number (starting at 0), the score at which all members of the cluster are similar, and the number of members in the cluster. The line following the ``##'' line shows the ID codes of the members of that cluster.
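For further processing, the cluster listing can be read back into a simple data structure. The sketch below assumes exactly the layout described above: a ``##'' line with three whitespace-separated fields, followed by one line of member ID codes. The function name parse_oc_clusters is hypothetical, and any other lines oc may write are simply skipped.

```python
def parse_oc_clusters(text):
    """Parse '## number score size' lines, each followed by a line of member IDs."""
    clusters = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith("##"):
            continue                          # ignore anything that is not a cluster header
        _, number, score, size = line.split()
        members = next(lines, "").split()     # the line after '##' holds the ID codes
        clusters.append({
            "number": int(number),
            "score": float(score),
            "size": int(size),
            "members": members,
        })
    return clusters

# First two clusters from the example output in this section.
sample = """## 0 646 2
HAHOD HAHOK
## 1 587 2
HAKOAW HAJSA
"""

clusters = parse_oc_clusters(sample)
for c in clusters:
    print(c["number"], c["score"], c["members"])
```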
oc will optionally accept a cutoff score. If a cutoff is given, only clusters that score above the cutoff (or below it in distance mode) are output. This can be useful for filtering comparisons of very large numbers of sequences.
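The effect of a cutoff can be pictured as a filter on the cluster scores. This is a sketch of the rule as described, not of oc's internals: the function passes_cutoff and the cutoff value 450 are hypothetical, and whether a score exactly equal to the cutoff is kept is an assumption (here it is not).

```python
def passes_cutoff(score, cutoff, mode="sim"):
    """Return True if a cluster with this score would be shown for the given mode."""
    if mode == "sim":
        return score > cutoff    # similarity mode: keep scores above the cutoff
    if mode == "dis":
        return score < cutoff    # distance mode: keep scores below the cutoff
    raise ValueError(f"unknown mode: {mode}")

# Cluster scores taken from the example output in this section.
scores = [646, 587, 543, 475, 433, 261]
kept = [s for s in scores if passes_cutoff(s, 450, mode="sim")]
print(kept)
```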
The PostScript tree is shown in the file test.ps.