Latent Dirichlet Allocation to latent topics

Latent Dirichlet Allocation (LDA) assigns words occurring in one or more texts to a specified number of latent 'topics'. If applied to a single text, it allocates words to topics according to the unit of context selected by the user. When applied to a number of texts, the results can be used to compare them by singular value decomposition (SVD) and/or correspondence analysis of the profiles of words allocated.

An appropriate stoplist should be used to disregard all words which may be regarded as irrelevant to the task of identifying latent topics in the text corpus nominated. You may also choose to disregard all numerals and all words occurring only once. This increases the speed of convergence of the allocation process, but risks missing significant content, depending on the nature of the text(s).
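The filtering described above can be sketched outside HAMLET II as follows. This is only an illustration under stated assumptions: the function name `filter_tokens`, the case-folding, and the rule that a numeral is a token made up entirely of digits are all hypothetical choices, not a description of how HAMLET II reads its texts.

```python
from collections import Counter

def filter_tokens(docs, stoplist, drop_numerals=True, drop_singletons=True):
    """Filter a list of token lists (one list per text or context unit).

    Assumptions for this sketch: tokens are lower-cased before comparison,
    a 'numeral' is a token consisting only of digits, and 'occurring only
    once' is counted over the whole corpus after the stoplist is applied.
    """
    stop = {w.lower() for w in stoplist}
    kept = [[t.lower() for t in doc if t.lower() not in stop] for doc in docs]
    if drop_numerals:
        kept = [[t for t in doc if not t.isdigit()] for doc in kept]
    if drop_singletons:
        counts = Counter(t for doc in kept for t in doc)
        kept = [[t for t in doc if counts[t] > 1] for doc in kept]
    return kept

docs = [
    "the cat sat on the mat in 1999".split(),
    "the dog sat on the log".split(),
]
filtered = filter_tokens(docs, stoplist=["the", "on", "in"])
print(filtered)
```

In this toy corpus only "sat" survives, which illustrates the risk mentioned above: with short texts, dropping single-occurrence words can remove most of the content.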

A Bayesian estimation process applies a convenient generative model (Blei et al., 2003) to the text corpus, or to the assembly of sentences or other context units specified within a single text. The model treats each text or context unit as the product of repeatedly sampling a topic from a topic distribution and then sampling a word from the distribution specific to that topic. A maximum of 25 topics may be allocated by this routine in HAMLET II 3.0 for Windows. On a normal PC, this routine can be very demanding of computing time and resources. To interrupt its operation at any time, press Escape or click on the display when the progress indicator is shown.
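The generative model referred to above can be illustrated with a short sketch using only the Python standard library. It shows the forward (generating) direction only, not the estimation that HAMLET II performs. The symmetric Dirichlet parameter `alpha`, the toy vocabulary, the topic-word probabilities, and the function names are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet by normalising k gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topic_word, alpha, rng):
    """Generate one document as (topic, word index) pairs.

    topic_word[z][w] is the probability of word w under topic z.
    """
    k = len(topic_word)
    # 1. Draw this document's own mixture over topics.
    theta = sample_dirichlet(alpha, k, rng)
    doc = []
    for _ in range(n_words):
        # 2. Draw a topic for this word position from the mixture.
        z = rng.choices(range(k), weights=theta)[0]
        # 3. Draw a word from that topic's word distribution.
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z])[0]
        doc.append((z, w))
    return theta, doc

rng = random.Random(0)
vocab = ["king", "queen", "ghost", "sword", "poison"]  # hypothetical vocabulary
topic_word = [
    [0.45, 0.45, 0.10, 0.00, 0.00],  # hypothetical 'court' topic
    [0.00, 0.00, 0.20, 0.40, 0.40],  # hypothetical 'violence' topic
]
theta, doc = generate_document(8, topic_word, alpha=0.5, rng=rng)
print([(z, vocab[w]) for z, w in doc])
```

Estimation runs this process in reverse: given only the observed words, it infers topic mixtures and topic-word distributions that could plausibly have produced them.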

Click on the text file names in the directory selected to copy the names to the right-hand panel.

Click Stoplist to select and activate or to edit a currently selected stoplist of words to be disregarded in reading the selected text(s).

Click the menu item Clear selection to remove selected text file names and clear the current stoplist entries.

Click on to activate the procedure, and enter the number of topics to be assigned when prompted to do so. After selecting the number of topics to apply, click on to continue. Separate prompts offer options to disregard all numerals encountered in reading the text(s) and all word tokens occurring only once.

To discover the largest possible number of topics assignable for particular texts, it is necessary to repeat the process, increasing the number requested until the advice appears that the maximum has probably been exceeded. Since the allocation process depends upon pseudo-random numbers, it is advisable to verify this advice by re-starting the program, in case it results from an unfortunate starting point of your computer's random number generator.
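The dependence on pseudo-random numbers can be seen in a minimal sketch of a collapsed Gibbs sampler, one common way of estimating LDA. This is a stand-in technique chosen for illustration: the estimator HAMLET II actually uses is not documented here and may differ, and the function name, the hyperparameters `alpha` and `beta`, and the `seed` argument are all assumptions.

```python
import random

def gibbs_lda(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of token lists.

    Returns the vocabulary and a topic-by-word count table.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_vocab for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial allocation: this is where the starting point matters.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][index[w]] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, z = index[w], assign[d][i]
                # Remove the token's current allocation from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other allocations.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return vocab, topic_word

docs = [
    "king queen crown king throne".split(),
    "sword poison duel sword".split(),
    "queen throne poison king".split(),
]
vocab, run_a = gibbs_lda(docs, n_topics=2, seed=1)
_, run_a_again = gibbs_lda(docs, n_topics=2, seed=1)
_, run_b = gibbs_lda(docs, n_topics=2, seed=2)
print(run_a == run_a_again)  # the same seed reproduces the same allocation
print(run_a, run_b)          # different seeds may allocate words differently
```

Repeating a run from a different starting point, as recommended above, is the practical check that a result is not an accident of one particular random sequence.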

Word tokens appearing in the topics initially allocated which have no obvious relevance to the sense of the text, in relation to the purpose of the investigation, may be added to the stoplist currently in use, and the allocation process repeated. In this way, it is possible to refine the process by successively excluding superfluous tokens from consideration. It is not essential that the topics in themselves are open to interpretation in any useful way. They serve only as a means of identifying word tokens for use in comparing the texts or context units under consideration. If, however, a set of topics is identified which can plausibly be interpreted in the sense of the current investigation, these may also be saved as a vocabulary list for use in other procedures in HAMLET II 3.0.