Context units

Three ways of defining context units are offered when reading a text for analysis by Latent Dirichlet Allocation. They must be used with some care, to ensure that the context units chosen can be interpreted meaningfully, and that they are not so large that almost all of the target words occur together in every unit, which would leave the analysis with nothing to discriminate between.

Variable-length contexts may be defined by including a special character in the text to mark the end of each context unit to be read (except the last: the end of the file automatically ends the final context unit). The marker, which may be inserted during pre-editing, must be a single character that is not among those normally used for punctuation, etc.;

sentences, or a number of sentences, as delimited by the punctuation characters entered in the corresponding box, may also be chosen as the context unit; and

fixed-length contexts are specified as a kind of "sampling unit" consisting of a fixed number of words, and the text is treated as a series of "blocks" of this fixed length, within each of which joint occurrences of words are considered.
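
The three options above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program's own implementation: the function names, the "|" marker, the default terminator set ".!?" and the per_unit and n_words parameters are all hypothetical, and words are taken to be whitespace-separated tokens.

```python
import re


def contexts_by_marker(text, marker="|"):
    """Variable-length contexts: split on a single marker character.

    The end of the text closes the final unit, so no trailing marker is needed.
    The "|" default is an assumption made for this sketch.
    """
    if len(marker) != 1:
        raise ValueError("marker must be a single character")
    units = [unit.strip() for unit in text.split(marker)]
    return [unit for unit in units if unit]


def contexts_by_sentences(text, terminators=".!?", per_unit=1):
    """Sentence contexts: group per_unit consecutive sentences into one unit.

    A sentence is any run of text closed by one of the terminator characters;
    abbreviations such as "etc." are not handled in this sketch.
    """
    chars = re.escape(terminators)
    pattern = "[^" + chars + "]+[" + chars + "]*"
    sentences = [s.strip() for s in re.findall(pattern, text) if s.strip()]
    return [
        " ".join(sentences[i:i + per_unit])
        for i in range(0, len(sentences), per_unit)
    ]


def contexts_by_fixed_length(text, n_words=50):
    """Fixed-length contexts: consecutive blocks of n_words words.

    The last block may be shorter than n_words.
    """
    words = text.split()
    return [
        " ".join(words[i:i + n_words])
        for i in range(0, len(words), n_words)
    ]


if __name__ == "__main__":
    sample = "The cat sat. The dog ran! Who knows? Nobody"
    print(contexts_by_marker("first unit|second unit|third"))
    print(contexts_by_sentences(sample, per_unit=2))
    print(contexts_by_fixed_length(sample, n_words=4))
```

Comparing the number and size of units each function returns for the same text gives a quick check on the caution above: if nearly every target word appears in every unit, the units are too large to be useful.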