Cluster analysis

HAMLET joint frequencies offers an optional hierarchical cluster analysis which can help to determine whether further analysis of the similarities matrix, for example by the multidimensional scaling techniques offered, will be useful.

When matrices have been separately saved, the main menu item Cluster Analysis offers independent access to both hierarchical and non-hierarchical clustering procedures.

For Hierarchical Cluster Analyses, toggle the button "Diameter"/"Connectedness" in the above control panel to select the clustering method to be applied to the matrix of similarities generated by HAMLET joint frequencies. Click Display ... to view the results of the selected clustering method as a dendrogram.

Enter a minimum similarity value as a decimal fraction ( 0 <= s <= 1 ) in the appropriate box and click Show clusters ... to list the word clusters with this minimum similarity. The cluster contents can be saved in a file for later reference if required.
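As a rough illustration of what listing clusters at a minimum similarity involves, the Python sketch below assumes the connectedness method, where two words fall in the same cluster whenever a chain of pairs with similarity of at least s links them. The function name clusters_at, the sample words and the similarity values are invented for this example and are not taken from HAMLET; whether the program also lists single-word clusters is not stated here.

```python
def clusters_at(words, sim, min_similarity):
    """Group words joined by chains of pairs with similarity >= min_similarity."""
    n = len(words)
    parent = list(range(n))  # union-find forest: each word starts as its own cluster

    def find(i):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= min_similarity:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(words[i])
    return sorted(groups.values())


# Hypothetical vocabulary and similarity matrix (symmetric, values in 0..1).
words = ["king", "queen", "crown", "ghost", "night"]
sim = [
    [1.0, 0.8, 0.6, 0.1, 0.0],
    [0.8, 1.0, 0.5, 0.2, 0.1],
    [0.6, 0.5, 1.0, 0.0, 0.1],
    [0.1, 0.2, 0.0, 1.0, 0.7],
    [0.0, 0.1, 0.1, 0.7, 1.0],
]

print(clusters_at(words, sim, 0.6))
# [['ghost', 'night'], ['king', 'queen', 'crown']]
```

Raising the threshold to 0.75 in this example leaves only "king" and "queen" together, with every other word in a cluster of its own.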

The results of cluster analysis are frequently used as an interpretative aid in examining configurations of points resulting from dimensional analysis. The means of clusters selected can be plotted when the results of a MINISSA analysis of the same matrix are displayed, by selecting analyse clusters from the Data menu in the corresponding display. This will also provide an analysis of variance of the squared distances in the specified clusters. For the purposes of text analysis, although hierarchical clustering and dimensional analysis are not equivalent procedures, the absence of any substantial clustering normally indicates that there would be little point in continuing with a full dimensional scaling procedure. In particular, further analysis should be avoided where any items are joined to the dendrogram only at the highest level, indicating that they have no connection with any other item in the analysis.

The alternative methods of hierarchical clustering offered here use different criteria in assigning the individual vocabulary entries to clusters: the "connectedness" or "single linkage" method looks for the greatest similarity between an unassigned item and those contained in existing clusters; the "diameter" or "complete linkage" method defines the similarity between groups as the similarity between their least similar pair of individual items.
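The difference between the two criteria can be seen in a small Python sketch of agglomerative clustering. This is a textbook formulation written for illustration, not HAMLET's own code; the function name agglomerate and the four-item similarity matrix are assumptions. Under "connectedness" the similarity between two groups is taken as that of their most similar pair, under "diameter" as that of their least similar pair.

```python
def agglomerate(sim, method):
    """Return the merge sequence as (similarity level, merged item indices).

    method is "connectedness" (single linkage) or "diameter" (complete linkage).
    """
    pick = max if method == "connectedness" else min
    clusters = [[i] for i in range(len(sim))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Group similarity: most similar pair (single) or least similar pair (complete).
                s = pick(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        merges.append((s, merged))
    return merges


# Hypothetical similarities for items 0..3, forming a chain 0-1-2-3.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.3],
    [0.2, 0.8, 1.0, 0.7],
    [0.1, 0.3, 0.7, 1.0],
]

print(agglomerate(sim, "connectedness"))
# [(0.9, [0, 1]), (0.8, [0, 1, 2]), (0.7, [0, 1, 2, 3])]
print(agglomerate(sim, "diameter"))
# [(0.9, [0, 1]), (0.7, [2, 3]), (0.1, [0, 1, 2, 3])]
```

In this example the connectedness method chains the items together one at a time, while the diameter method forms two compact pairs that are joined only at a very low similarity level.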

The Non-Hierarchical Cluster Analysis option uses an efficient new algorithm (M. Brusco, 2003) to partition the input matrix into successive numbers of clusters, each containing at least one item. Clusters are mutually exclusive, and exhaustive, in that all items are assigned to a cluster. Unfortunately, given the number of ties that can occur in minimum diameter partitioning, it is likely that there are many alternative optima in large matrices. It is therefore advisable to compare the results obtained by this method with those from hierarchical clustering of the same data, as well as with those of multidimensional scaling of the same matrix.

For matrices representing up to 15 categories, the program lists all sets of clusters from 2 to the total number of categories minus 1. For larger matrices, enumeration is restricted to half the matrix size up to a maximum of 20 clusters, and may take some time due to the number of alternative partitions possible. 
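To give a sense of why the number of alternative partitions grows so quickly, the following Python sketch counts the ways of dividing n items into k non-empty clusters, using the standard recurrence for Stirling numbers of the second kind. It only illustrates the size of the search space; it is not a description of how the program searches, and the function name stirling2 is chosen for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n items into k non-empty clusters."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Item n either joins one of the k existing clusters or starts a new one.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


# Partition counts for a 15-category matrix, from 2 to 14 clusters.
for k in range(2, 15):
    print(k, stirling2(15, k))
```

Even the simplest case here, 15 categories in 2 clusters, already allows 16383 distinct partitions.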

The algorithm seeks to minimize the partition diameter, which is related to Johnson's diameter method used in the hierarchical clustering option. The diameter of a cluster is the maximum pairwise dissimilarity index among objects in that cluster. The partition diameter is defined as the maximum of the cluster diameters. Minimizing the diameter of the partition is equivalent to minimizing the maximum dissimilarity index across all subsets for the number of partitions to be calculated.
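The definitions above can be sketched in Python as follows. The search shown is a plain exhaustive enumeration, usable only on very small matrices, and is not the algorithm the program uses; the function names and the dissimilarity values are assumptions made for this example.

```python
from itertools import product


def cluster_diameter(cluster, dis):
    """Maximum pairwise dissimilarity within one cluster (0.0 for a single item)."""
    return max((dis[i][j] for i in cluster for j in cluster if i < j), default=0.0)


def partition_diameter(partition, dis):
    """Maximum of the cluster diameters in a partition."""
    return max(cluster_diameter(c, dis) for c in partition)


def min_diameter_partition(dis, k):
    """Try every assignment of items to k non-empty clusters; keep the first best."""
    n = len(dis)
    best = None
    for labels in product(range(k), repeat=n):
        if len(set(labels)) != k:
            continue  # skip assignments that leave a cluster empty
        partition = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        d = partition_diameter(partition, dis)
        if best is None or d < best[0]:
            best = (d, partition)
    return best


# Hypothetical dissimilarities for items 0..3 (symmetric, zero diagonal).
dis = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.2, 0.7],
    [0.8, 0.2, 0.0, 0.3],
    [0.9, 0.7, 0.3, 0.0],
]

print(min_diameter_partition(dis, 2))
# (0.3, [[0, 1], [2, 3]])
```

Because only strictly smaller diameters replace the current best, any tied partitions are silently ignored in this sketch, which is one way the alternative optima mentioned earlier can go unnoticed.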