HAMLET II Joint Frequencies

Joint Frequencies generates statistics for individual and joint word frequencies, and the corresponding standardised frequencies, expressed in a chosen unit of context.

Individual word frequencies (f_i) are counted together with joint frequencies (f_ij) for all possible pairs of words, and the corresponding standardised joint frequencies are calculated:
s_ij = f_ij / (f_i + f_j - f_ij),
where f_ij and f_i, f_j refer respectively to joint and individual frequencies of words i and j in a given vocabulary list, expressed in units of context in each case.
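
As a minimal sketch in Python (not HAMLET's own implementation), the counting and standardisation described above might look as follows. The function names, and the representation of each context unit as a set of words, are assumptions made purely for illustration.

```python
from itertools import combinations

def joint_frequencies(units, vocabulary):
    """Count, in units of context, how often each word and each pair of words occurs."""
    vocab = sorted(set(vocabulary))
    f = {w: 0 for w in vocab}
    f_joint = {pair: 0 for pair in combinations(vocab, 2)}
    for unit in units:
        # A word is counted at most once per context unit.
        present = [w for w in vocab if w in unit]
        for w in present:
            f[w] += 1
        for pair in combinations(present, 2):
            f_joint[pair] += 1
    return f, f_joint

def jaccard(f_ij, f_i, f_j):
    """s_ij = f_ij / (f_i + f_j - f_ij); returns 0.0 if neither word occurs."""
    denom = f_i + f_j - f_ij
    return f_ij / denom if denom else 0.0

# Four hypothetical context units, each already reduced to a set of words.
units = [
    {"king", "queen", "ghost"},
    {"king", "ghost"},
    {"queen"},
    {"king", "queen"},
]
f, f_joint = joint_frequencies(units, ["king", "queen", "ghost"])
s = {pair: jaccard(n, f[pair[0]], f[pair[1]]) for pair, n in f_joint.items()}
print(f)        # individual frequencies f_i
print(f_joint)  # joint frequencies f_ij
print(s)        # standardised joint frequencies s_ij
```

Here "king" and "queen" each occur in three units and together in two, so s_ij = 2 / (3 + 3 - 2) = 0.5.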

This simple (Jaccard) coefficient treats joint non-occurrences as irrelevant, which seems to be a suitable procedure in textual analysis. It is, however, also indifferent to the order in which the words in each pair occur, and its values depend on a sensible choice of context unit being made when reading the text.

If words i and j occur independently of one another, the above coefficient has an approximate expected value of

E(s_ij) = f_i . f_j / [t(f_i + f_j) - f_i . f_j] , since the expected value of f_ij is then f_i . f_j / t ,

where t is the total number of context-units counted in the text.
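
The following Python sketch, offered only as an illustration, checks that substituting the expected joint frequency f_i . f_j / t into the Jaccard formula gives the same value as the closed form above. The function names are hypothetical, and t is assumed to be greater than zero.

```python
def expected_jaccard_by_substitution(f_i, f_j, t):
    """Insert the expected joint frequency f_i * f_j / t into the Jaccard formula."""
    expected_f_ij = f_i * f_j / t
    return expected_f_ij / (f_i + f_j - expected_f_ij)

def expected_jaccard_closed_form(f_i, f_j, t):
    """E(s_ij) = f_i * f_j / [t * (f_i + f_j) - f_i * f_j]."""
    return f_i * f_j / (t * (f_i + f_j) - f_i * f_j)

# Two words that each occur in 3 of 4 context units.
print(expected_jaccard_by_substitution(3, 3, 4))
print(expected_jaccard_closed_form(3, 3, 4))
```

Both calls give 0.6 (up to floating-point rounding), which can be compared with an observed s_ij to judge whether two words co-occur more or less often than chance would suggest.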

As an alternative, you can employ Sokal's matching coefficient, in which the number of joint non-occurrences is also included in the numerator and denominator. In the terms already outlined, this coefficient is:
c_ij = (f_ij + t - (f_i + f_j - f_ij)) / t
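
A small Python sketch of this coefficient is given below, assuming t is greater than zero; the function name is hypothetical. It also illustrates how the value depends on the number of context units in which neither word occurs.

```python
def simple_matching(f_ij, f_i, f_j, t):
    """c_ij = (joint occurrences + joint non-occurrences) / t."""
    joint_non_occurrences = t - (f_i + f_j - f_ij)
    return (f_ij + joint_non_occurrences) / t

# Same word counts, read in 4 and then in 10 context units.
print(simple_matching(2, 3, 3, 4))
print(simple_matching(2, 3, 3, 10))
```

With t = 4 there are no joint non-occurrences and the coefficient is 0.5. With t = 10 there are six, and it rises to 0.8, whereas the Jaccard coefficient for the same counts would remain at 0.5.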

An analysis of some well-known similarity measures by van Eck and Waltman (2009) suggests that an appropriate measure for normalizing co-occurrence data is the association strength, also referred to as the proximity index or the probabilistic affinity index:
sa_ij = f_ij / (f_i . f_j)
This is proportional to the ratio between the observed number of co-occurrences of objects i and j and the expected number of co-occurrences of objects i and j under the assumption that occurrences of i and j are statistically independent. In the interpretation of similarities, it is probably advisable to compare the results of applying this measure with those derived from co-occurrence data using a set-theoretic similarity measure such as the Jaccard coefficient.
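
The proportionality can be illustrated with a short Python sketch, in which the observed number of co-occurrences is taken to be the joint frequency f_ij and the expected number under independence is f_i . f_j / t. The function names are hypothetical, and all frequencies are assumed to be greater than zero.

```python
def association_strength(f_ij, f_i, f_j):
    """Observed co-occurrences divided by the product of the individual frequencies."""
    return f_ij / (f_i * f_j)

def observed_to_expected(f_ij, f_i, f_j, t):
    """Ratio of observed co-occurrences to those expected under independence."""
    expected = f_i * f_j / t
    return f_ij / expected

t = 4
sa = association_strength(2, 3, 3)
ratio = observed_to_expected(2, 3, 3, t)
print(sa, ratio, t * sa)  # ratio equals t * sa, up to floating-point rounding
```

The constant of proportionality is simply t, so for a given text the two quantities rank word pairs identically.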

Optional settings:

Several possibilities are offered for the definition of context-units when reading a text. These must be used with some care, to ensure that the context-units chosen are indeed capable of meaningful interpretation, and that they are not so large that almost all of the target words occur together in each unit, losing any discrimination in the analysis:

Fixed-length contexts - the traditional method of social-scientific quantitative "content analysis" (Iker, 1974) - are specified as a kind of "sampling unit" consisting of a fixed number of words. The text is treated as a series of "blocks" of this fixed length, within each of which joint occurrences of words in the specified vocabulary list are counted. When adopting this approach, it is advisable to test the effects of varying the size of these fixed-length units between, say, 60 and 120 words, to see which produces the clearest result for a given text and vocabulary list.
Variable-length contexts are defined by the inclusion of a special character in the text to denote the end of each unit of context to be read. This may be a character (not normally used for punctuation, etc.) inserted in pre-editing to mark the end of each context-unit as appropriate to the sense of the particular text. Alternatively, it is possible to use the invisible ANSI end-of-line marker (0D 0A hex), if it is consistently employed in the text, by entering 'Eoln' or '#13' in the options window.
Sentences, as normally punctuated, may be chosen as the context unit.
The Collocation option counts joint occurrences within a given number, or span, of words. This is generally slower in operation than the other context options, but may be more suitable for smaller bodies of text or when specific word usage is of particular interest.
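
The four kinds of context unit can be sketched roughly in Python as follows. This is an illustration under stated assumptions rather than a description of how HAMLET itself reads text: the tokenisation is deliberately crude, sentences are split on full stops, exclamation marks and question marks only, and the collocation count is assumed to mean occurrences of one word that have the other within the given span on either side. All function names are hypothetical.

```python
import re

def tokenize(text):
    """Lower-case word tokens; a deliberately simple stand-in for real tokenisation."""
    return re.findall(r"[a-z']+", text.lower())

def fixed_length_units(words, size):
    """Cut the word sequence into consecutive blocks of `size` words."""
    return [words[k:k + size] for k in range(0, len(words), size)]

def delimited_units(text, marker):
    """Split the text wherever the chosen end-of-unit character occurs."""
    return [tokenize(chunk) for chunk in text.split(marker) if chunk.strip()]

def sentence_units(text):
    """Split the text at sentence-ending punctuation."""
    return [tokenize(s) for s in re.split(r"[.!?]+", text) if s.strip()]

def collocation_count(words, word_i, word_j, span):
    """Count occurrences of word_i with word_j within `span` words on either side."""
    count = 0
    for k, w in enumerate(words):
        if w == word_i:
            window = words[max(0, k - span):k] + words[k + 1:k + 1 + span]
            if word_j in window:
                count += 1
    return count

text = "The king is dead. Long live the king! The queen mourns | the ghost walks."
words = tokenize(text)

print(fixed_length_units(words, 5))          # blocks of 5, 5 and 4 words
print(delimited_units(text, "|"))            # two units, split at the marker
print(sentence_units(text))                  # three sentences
print(collocation_count(words, "king", "queen", 2))
```

Whichever function is used, each resulting unit can then be reduced to the set of vocabulary words it contains before joint occurrences are counted.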

If the text file contains special characters which should be ignored when making comparisons in Hamlet, enter these in the edit box provided in the options window. Characters used in this way must be chosen to serve this purpose alone, since they must not be confused with normal text and punctuation.

Check the box in the options window if searching for words is to be case-sensitive. If this option is chosen, words in the vocabulary list must also be entered with regard to upper- and lower-case letters if they are not to be missed. Take care: inadvertent choice of case-sensitive searching when the search list has been specified without regard to case can lead to unexpected results.
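
The effect of these two settings on word comparison can be sketched in Python as below. It is assumed here, for illustration only, that ignored characters are simply removed before comparison and that case-insensitive matching folds both words to lower case; the function names are hypothetical.

```python
def normalise(word, ignore_chars="", case_sensitive=False):
    """Strip characters to be ignored and, unless case-sensitive, fold to lower case."""
    cleaned = "".join(ch for ch in word if ch not in ignore_chars)
    return cleaned if case_sensitive else cleaned.lower()

def matches(text_word, vocab_word, ignore_chars="", case_sensitive=False):
    """Compare a word from the text with a vocabulary list entry."""
    return (normalise(text_word, ignore_chars, case_sensitive)
            == normalise(vocab_word, ignore_chars, case_sensitive))

print(matches("Gh^ost", "ghost", ignore_chars="^"))   # True: "^" is ignored
print(matches("Ghost", "ghost"))                      # True: case is disregarded
print(matches("Ghost", "ghost", case_sensitive=True)) # False: the entry is missed
```

The last line shows the kind of unexpected result mentioned above: a vocabulary entry typed in lower case fails to match a capitalised word once case-sensitive searching is switched on.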

Raw and standardised joint frequencies are displayed in lower-triangular matrix format, suitably labelled with the corresponding vocabulary list entries. Either matrix can be regarded as a set of similarity measures between pairs of words, and can be submitted to further analysis using Cluster Analysis, Multidimensional Scaling methods or Correspondence Analysis to identify characteristic word clusters or associations of symbols in the original text.
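
A lower-triangular layout of this kind can be sketched in Python as follows. The exact column widths, the omission of the diagonal and the function name are assumptions made for illustration, and the similarity values are invented sample data rather than HAMLET output.

```python
def lower_triangle_rows(labels, similarity):
    """Return one text line per word, holding its similarities to the words listed before it."""
    rows = []
    for r, label in enumerate(labels):
        cells = [f"{similarity[(labels[c], label)]:6.3f}" for c in range(r)]
        rows.append((f"{label:<8}" + " ".join(cells)).rstrip())
    return rows

labels = ["ghost", "king", "queen"]
# Hypothetical standardised joint frequencies, keyed by (earlier word, later word).
similarity = {
    ("ghost", "king"): 2 / 3,
    ("ghost", "queen"): 0.25,
    ("king", "queen"): 0.5,
}
table = lower_triangle_rows(labels, similarity)
print("\n".join(table))
```

Because the coefficients are symmetric, only the entries below the diagonal need to be stored or displayed; a full square matrix for a clustering or scaling routine can be rebuilt by mirroring them.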

Using HAMLET Joint Frequencies

[Figure: main window]

The most important points to check are:

that you have made a sensible choice of vocabulary items for which to search, and
that you have specified appropriate options for the current analysis.
If the text to be read is in a language other than English, use the pull-down menu to apply the correct lexicographic conventions.
Click on the start button to begin the search process. You can click on the stop button to stop searching at any time.
When the process is finished, results are temporarily displayed in an edit window.

You will then be asked if you want to carry out further analyses of the matrix of joint occurrences using Cluster Analysis, Multidimensional Scaling or Correspondence Analysis, and prompted to save any of the files which have been generated for later use, to avoid having to repeat the current joint frequencies analysis.