Text conventions

General appearance of text

The Microsoft Windows release of HAMLET II requires the text in question to exist already in a file stored in the Windows ANSI character set. In the event of difficulty, use CONVERT first on the file to be read.
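
As an illustration only, the kind of re-encoding step that may be needed before reading a file can be sketched in Python. This is not the CONVERT utility. It assumes that "Windows ANSI" means the Western European code page (cp1252) and that the source file is UTF-8; the function name `to_windows_ansi` is hypothetical.

```python
def to_windows_ansi(data, source_encoding="utf-8"):
    """Re-encode raw bytes into cp1252 (assumed here to be 'Windows ANSI').

    Characters with no cp1252 equivalent are replaced by '?', so a sample
    of the result should still be checked by eye.
    """
    text = data.decode(source_encoding)
    return text.encode("cp1252", errors="replace")


# Usage: 'é' is two bytes in UTF-8 but a single byte (0xE9) in cp1252.
utf8_bytes = "café".encode("utf-8")
print(to_windows_ansi(utf8_bytes))
```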

To avoid any misreading of the text, it is advisable to store it in plain text format, although the program may also be applied successfully to files in the formats of widely used word-processing programs. It is nevertheless always advisable to check a sample of the text, to be sure that special characters used by word-processing packages do not produce unexpected effects later.

HAMLET distinguishes separate words in the input as continuous strings of 'letters', separated by 'punctuation', spaces, or the end of a line. The program contains default definitions of the recognised sets of letters and punctuation, suitable for most purposes and for all European languages. On encountering unrecognised characters, HAMLET and WORDLIST allow direct amendment of the letters to be recognised, and of their appropriate sorting order, according to the conventions of the language selected.
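
The word-splitting rule described above can be sketched roughly as follows. This is a minimal illustration, not HAMLET's own code: the letter set `LETTERS` and the function `split_words` are hypothetical, and the program's real default definitions are not reproduced here.

```python
# Hypothetical letter set: ASCII letters plus some accented characters.
LETTERS = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "àáâäçèéêëìíîïñòóôöùúûüßÀÁÂÄÇÈÉÊËÌÍÎÏÑÒÓÔÖÙÚÛÜ"
)


def split_words(line, letters=LETTERS):
    """Return the maximal runs of 'letters' found in one line of input."""
    words, current = [], []
    for ch in line:
        if ch in letters:
            current.append(ch)
        elif current:
            # Any non-letter (punctuation, space) closes the current word.
            words.append("".join(current))
            current = []
    if current:
        # The end of the line also closes the current word.
        words.append("".join(current))
    return words


print(split_words("To be, or not to be: that is the question."))
```

Passing a different `letters` set corresponds, loosely, to amending the letters to be recognised for a particular language.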

It should, however, be noted that numbers containing decimal points or commas may inflate the word count, because each group of numerals is counted separately: e.g. 60,000 may be read as two separate words, '60' and '000', separated by a comma. Normally this is of no great consequence, but it is mentioned here to emphasize again the importance of prior knowledge of the properties of a text file, and the need for pre-editing in cases where such features are of real significance to the meaning of the text.
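
A small sketch may make the effect concrete, together with one possible pre-editing step. Both functions are hypothetical illustrations: `split_tokens` uses a simplified rule (runs of ASCII letters and digits) rather than HAMLET's definitions, and `join_numerals` is only one way of pre-editing. Note that it keeps each number as a single token but discards the decimal point, so it is suitable only where the numeric value itself is not being analysed.

```python
import re


def split_tokens(text):
    """Split on anything that is not an ASCII letter or digit (illustrative rule)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def join_numerals(text):
    """Pre-edit: drop a comma or full stop that sits between two digits."""
    return re.sub(r"(?<=\d)[.,](?=\d)", "", text)


sample = "A crowd of 60,000 paid 12.50 each."
print(split_tokens(sample))                 # '60' and '000' appear as separate tokens
print(split_tokens(join_numerals(sample)))  # '60000' appears as one token
```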

If any word in the original text requires continuation from one line of input to the next, a hyphen ('-') should always be used as the last character of the line to be continued, to indicate that the characters at the beginning of the next line form part of the current word. Otherwise, the end of a line (or a #13 carriage-return character) will automatically be regarded as marking the end of the current word on input, with the possibility that some words may be inadvertently divided.

Upper- and lower-case letters may either be regarded separately or be treated as equivalent. If they are regarded separately, words beginning sentences will, of course, normally be treated as different from the same words occurring later. Such words will have to be explicitly and separately specified in the vocabulary list if they are not to be missed. Hence the importance of knowing the basic vocabulary of the text before considering the use of HAMLET. Care is needed here, as an inadvertent choice, e.g. of case-sensitive searching when the search list has been prepared for case-insensitive use, can lead to unexpected results.
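
The two conventions above, line continuation by hyphen and the choice between case-sensitive and case-insensitive comparison, can be sketched as follows. The function names are hypothetical and the code is a simplified illustration under those assumptions, not a description of HAMLET's internals.

```python
def join_continued_lines(lines):
    """Merge each line ending in '-' with the line that follows it."""
    merged, carry = [], ""
    for raw in lines:
        line = carry + raw.rstrip("\r\n")  # strip #13 / #10 line endings
        if line.endswith("-"):
            carry = line[:-1]  # drop the hyphen and wait for the rest of the word
        else:
            merged.append(line)
            carry = ""
    if carry:
        merged.append(carry)  # a trailing hyphen on the last line
    return merged


def count_word(words, target, case_sensitive):
    """Count occurrences of target, with or without regard to letter case."""
    if case_sensitive:
        return sum(1 for w in words if w == target)
    return sum(1 for w in words if w.lower() == target.lower())


print(join_continued_lines(["the quick brown fo-", "x jumps", "over"]))

words = ["The", "cat", "saw", "the", "dog"]
print(count_word(words, "the", case_sensitive=True))   # misses sentence-initial 'The'
print(count_word(words, "the", case_sensitive=False))  # counts both forms
```

One consequence of this simple joining rule is that a genuinely hyphenated compound broken at its hyphen would lose the hyphen, which is a further reason to check the text before analysis.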

Special characters

The characters '<' and '>' are, by default, regarded as comment delimiters. This is intended to permit the addition of comments to the text without prejudicing the normal word search functions. All text appearing between these characters will be disregarded, unless '<' and '>' are themselves explicitly declared as characters to ignore, as described in the following paragraph.
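
The default effect of the comment delimiters can be sketched as below. The function `strip_comments` is hypothetical, and its handling of edge cases (comments are not nested, and an unclosed '<' is left untouched) is an assumption made for the illustration, not documented behaviour.

```python
import re


def strip_comments(text, open_ch="<", close_ch=">"):
    """Remove everything from open_ch up to the next close_ch, inclusive."""
    pattern = (
        re.escape(open_ch)
        + "[^" + re.escape(close_ch) + "]*"
        + re.escape(close_ch)
    )
    return re.sub(pattern, "", text)


line = "To be <aside> or not <stage direction> to be"
print(strip_comments(line))
```

The words 'aside', 'stage' and 'direction' no longer appear in the result, so they would not contribute to any word count or search.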

The text may contain characters which should not be considered part of the text itself but which do not normally occur as punctuation. Such characters (e.g. '#', '~', '¿') sometimes occur systematically in source files but should be ignored when making comparisons in HAMLET. Several different characters may also optionally be used in preparatory editing of the text, for example to delimit major text components, but they must be chosen to serve this purpose alone, so that they cannot be confused with ordinary text and punctuation. Characters used in this way can be declared as characters to ignore in the appropriate box in the options window. They will then be skipped when looking for matching words.
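
The effect of declaring characters to ignore can be sketched as follows. The function `drop_ignored` and its default set are hypothetical, taken from the examples in the paragraph above. The sketch assumes that an ignored character is simply skipped, so that it neither forms part of a word nor separates two words.

```python
def drop_ignored(text, ignored="#~¿"):
    """Delete every occurrence of the declared characters from the text."""
    table = str.maketrans("", "", ignored)
    return text.translate(table)


print(drop_ignored("#Act~ I# ¿scene"))
print(drop_ignored("ham#let") == "hamlet")
```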