HAMLET II Joint Frequencies

This procedure generates statistics for individual and joint word frequencies, together with the corresponding standardised joint frequencies, all expressed in a chosen unit of context.

Individual word frequencies (fi) are counted together with joint frequencies (fij) for all possible pairs of words, and the corresponding standardised joint frequencies are calculated:
sij = fij / (fi + fj - fij),
where fij and fi, fj refer respectively to joint and individual frequencies of words i and j in a given vocabulary list, expressed in units of context in each case.

This simple (Jaccard) coefficient treats joint non-occurrences as irrelevant, which seems to be a suitable procedure in textual analysis. It is, however, also indifferent to the order in which the words of each pair occur, and its values depend on a sensible choice of context unit being made when the text is read.
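The counting described above can be sketched in Python. This is only an illustration and not HAMLET's own code: the function names, the toy vocabulary and the representation of each context unit as a list of words are all assumptions made for the example.

```python
from itertools import combinations


def joint_frequencies(units, vocabulary):
    """Count fi (units containing word i) and fij (units containing both i and j)."""
    vocab = list(vocabulary)
    f = {word: 0 for word in vocab}
    f_joint = {pair: 0 for pair in combinations(vocab, 2)}
    for unit in units:
        tokens = set(unit)
        # Words are kept in vocabulary order so that pair keys match f_joint.
        present = [word for word in vocab if word in tokens]
        for word in present:
            f[word] += 1
        for pair in combinations(present, 2):
            f_joint[pair] += 1
    return f, f_joint


def jaccard(f_i, f_j, f_ij):
    """Standardised joint frequency sij = fij / (fi + fj - fij)."""
    denominator = f_i + f_j - f_ij
    return f_ij / denominator if denominator else 0.0


# Hypothetical text, already split into four context units.
units = [
    ["king", "queen", "ghost"],
    ["king", "ghost"],
    ["queen"],
    ["king", "queen"],
]
f, f_joint = joint_frequencies(units, ["king", "queen", "ghost"])
s_king_queen = jaccard(f["king"], f["queen"], f_joint[("king", "queen")])
print(f, f_joint, s_king_queen)
```

Note that each word is counted at most once per context unit, so a larger unit raises fij for every pair and pushes all coefficients towards 1.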

The above coefficient has an expected value of

          E(sij) = fi . fj / [t(fi + fj) - fi . fj] ,

since the expected value of fij is fi . fj / t if words i and j occur independently of one another, where t is the total number of context-units counted in the text.
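As a check on the algebra, the following Python sketch substitutes the expected joint frequency fi . fj / t into the Jaccard formula and compares the result with the closed form given above. The function names and the sample counts are assumptions chosen for illustration only.

```python
def expected_jaccard_by_substitution(f_i, f_j, t):
    """Insert the expected joint frequency fi.fj/t into sij = fij / (fi + fj - fij)."""
    expected_f_ij = f_i * f_j / t
    return expected_f_ij / (f_i + f_j - expected_f_ij)


def expected_jaccard_closed_form(f_i, f_j, t):
    """Closed form: fi.fj / [t(fi + fj) - fi.fj]."""
    return f_i * f_j / (t * (f_i + f_j) - f_i * f_j)


# Hypothetical counts: each word occurs in 3 of 4 context units.
by_substitution = expected_jaccard_by_substitution(3, 3, 4)
closed_form = expected_jaccard_closed_form(3, 3, 4)
print(by_substitution, closed_form)
```

Both routes give the same value, which can then be compared with an observed sij to judge whether two words occur together more often than their individual frequencies alone would suggest.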

As an alternative, you can employ Sokal's matching coefficient, in which the number of joint non-occurrences is also included in the numerator and denominator. In the terms already outlined, this coefficient is:
            cij = (fij + t - (fi + fj - fij)) / t
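A minimal Python sketch of this coefficient follows. The function name and the sample counts are hypothetical; the calculation itself is just the formula above, with the joint non-occurrences written out as a separate step.

```python
def simple_matching(f_i, f_j, f_ij, t):
    """Matching coefficient cij = (fij + t - (fi + fj - fij)) / t."""
    either = f_i + f_j - f_ij   # units containing at least one of the two words
    neither = t - either        # joint non-occurrences
    return (f_ij + neither) / t


# Hypothetical counts: fi = 3, fj = 3, fij = 2 in a text of t = 10 context units.
c = simple_matching(3, 3, 2, 10)
print(c)
```

With these counts the Jaccard coefficient is 2 / 4 = 0.5, while the matching coefficient is higher because the six units containing neither word also count as agreement.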

An analysis of some well-known similarity measures by van Eck and Waltman (2009) suggests that an appropriate measure for normalizing co-occurrence data is the association strength, also referred to as the proximity index or the probabilistic affinity index:
            saij = fij / (fi . fj)
This is proportional to the ratio between the observed number of co-occurrences of objects i and j and the expected number of co-occurrences of objects i and j under the assumption that occurrences of i and j are statistically independent. In the interpretation of similarities, it is probably advisable to compare the results of applying this measure with those derived from co-occurrence data using a set-theoretic similarity measure such as the Jaccard coefficient.
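The proportionality can be illustrated with a short Python sketch. It assumes that the numerator of the association strength is the raw joint frequency fij, following van Eck and Waltman's definition; the function names and the sample counts are hypothetical.

```python
def association_strength(f_i, f_j, f_ij):
    """Association strength: joint frequency divided by the product fi.fj."""
    return f_ij / (f_i * f_j)


def observed_over_expected(f_i, f_j, f_ij, t):
    """Ratio of observed co-occurrences to the number expected under independence."""
    expected_f_ij = f_i * f_j / t
    return f_ij / expected_f_ij


# Hypothetical counts: fi = 3, fj = 3, fij = 2 in t = 4 context units.
strength = association_strength(3, 3, 2)
ratio = observed_over_expected(3, 3, 2, 4)
print(strength, ratio, ratio / strength)
```

The ratio equals t times the association strength, so within one text the two quantities rank word pairs identically.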

 Optional settings:

 Several possibilities are offered for the definition of context-units when reading a text. These must be used with some care, to ensure that the context-units chosen are indeed capable of meaningful interpretation, and that they are not so large that almost all of the target words occur together in each unit, losing any discrimination in the analysis:

 If the text file contains special characters which should be ignored when making comparisons in HAMLET, enter these in the edit box provided in the options window. Characters used in this way must be chosen to serve this purpose alone, since they must not be confused with normal text and punctuation.

 Check the box in the options window if searching for words is to be case-sensitive. If this option is chosen, words in the vocabulary list must also be entered with regard to upper- and lower-case letters if they are not to be missed. Take care: inadvertent choice of case-sensitive searching when the search list has been specified without regard to case can lead to unexpected results.
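 The effect of this option can be illustrated with a small Python sketch. It is not HAMLET's own matching code; the function name and the two sample context units are assumptions made for the example.

```python
def count_units_containing(units, word, case_sensitive=False):
    """Count the context units in which the given word occurs."""
    if not case_sensitive:
        # Fold both the search word and the text to lower case before comparing.
        word = word.lower()
        units = [[token.lower() for token in unit] for unit in units]
    return sum(1 for unit in units if word in unit)


# Hypothetical text of two context units, with "King" capitalised once.
units = [
    ["The", "King", "is", "dead"],
    ["Long", "live", "the", "king"],
]
insensitive = count_units_containing(units, "king")
sensitive = count_units_containing(units, "king", case_sensitive=True)
print(insensitive, sensitive)
```

 A vocabulary entry typed as "king" finds both units when case is ignored, but only one of them in a case-sensitive search.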

Raw and standardised joint frequencies are displayed in lower-triangular matrix format, suitably labelled with the corresponding vocabulary list entries. Either matrix can be regarded as a set of similarity measures between pairs of words, and can be submitted to further analysis using Cluster Analysis, Multidimensional Scaling methods or Correspondence Analysis to identify characteristic word clusters or associations of symbols in the original text.
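The layout of such a matrix can be sketched in Python as follows. The exact format HAMLET uses is not reproduced here: the function name, the column width and the choice to omit the diagonal are assumptions, and the scores are Jaccard values for a hypothetical three-word vocabulary.

```python
def format_lower_triangle(labels, scores, width=8):
    """Lay out pairwise scores as a labelled lower-triangular matrix.

    scores maps (earlier_label, later_label) pairs, in vocabulary order,
    to a number. The diagonal is omitted.
    """
    header = " " * width + "".join(f"{name:>{width}}" for name in labels[:-1])
    lines = [header]
    for row in range(1, len(labels)):
        cells = "".join(
            f"{scores[(labels[col], labels[row])]:{width}.3f}" for col in range(row)
        )
        lines.append(f"{labels[row]:<{width}}{cells}")
    return "\n".join(lines)


labels = ["king", "queen", "ghost"]
scores = {
    ("king", "queen"): 0.5,
    ("king", "ghost"): 2 / 3,
    ("queen", "ghost"): 0.25,
}
table = format_lower_triangle(labels, scores)
print(table)
```

Because both coefficients are symmetric in i and j, the lower triangle holds every distinct pair once, which is all that a clustering or scaling procedure needs as input.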

  Using HAMLET Joint Frequencies

 [Figure: main window]

  The most important points to check are:

You will then be asked if you want to carry out further analyses of the matrix of joint occurrences using Cluster Analysis, Multidimensional Scaling or Correspondence Analysis, and prompted to save any of the files which have been generated for later use, to avoid having to repeat the current joint frequencies analysis.