

The Big Squeeze #2

 
Published in EUG #53

In the last article, we discussed how text could be compressed by converting it into four-bit nibbles, each either representing a commonly-used character or giving information about the following nibble:

Nibble value   Meaning
------------   -------
0              Next nibble is a less commonly used character.
1              Next nibble is a token.
2              End of text.
3-15           The 13 most commonly used characters.
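
As a quick refresher, two nibbles are packed into each output byte, high nibble first. A minimal BBC BASIC sketch of that packing step might look like this (the variable names are illustrative, not those used in the actual program):

      REM pack nibble nib% (0-15) into the output buffer at ptr%
      REM half% starts as FALSE; ptr% points at the output buffer
      IF half% THEN ?ptr% = ?ptr% OR nib% : ptr% = ptr% + 1 ELSE ?ptr% = nib% * 16
      half% = NOT half%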

Here we'll be discussing 'tokens'. A 'token' is a representation of a commonly used word, a bit like BASIC keywords, which are stored in memory as one-byte tokens.

In our case, tokens occupy one nibble (four bits), meaning we can include 16 commonly-used words like "this", "and", "that", and so on. If "this" was token number zero, "and" token number one, and "that" token number two, then a sentence like "this and that" would be compacted (ignoring the spaces, which are encoded as ordinary characters) like so:

      byte     1    2    3
      value  &10  &11  &12

Program COMPRE2 is identical to COMPRES, but with the addition of the machine code to 'tokenise' any commonly-used words found in the input text, and expand them when the text is decompressed.

The tokens are contained in the table 'toktab'. Each token starts with a number specifying the length of the token (e.g. 3 for "and"), followed by the token word itself. The words are READ in from line 1990, and their lengths are worked out automatically with LEN(word$) in line 1210.
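
In assembler terms, the assembled table might look something like this (the words shown are only examples, and the layout is a sketch rather than a copy of the real listing):

      .toktab
      EQUB 4 : EQUS "this"   \ length byte, then the word itself
      EQUB 3 : EQUS "and"
      EQUB 4 : EQUS "that"
      EQUB &FF               \ end-of-table marker (see below)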

Matching words in the input string to the token words is handled by 'chktok'. It starts at the beginning of 'toktab' and compares the current input character with the first character of the current word in 'toktab'. If they are the same, the next character from the input string is checked against the next character in the table, and so on. The token length is put into the X register and decremented each time a match is found, so when X reaches zero, all the characters have been matched and the word in the input string has been successfully matched with the word in 'toktab'.

If, on the other hand, the first letter matched but a later one didn't, the routine branches to 'notfnd'. The position of the next word in 'toktab' is calculated by adding the length of the current token, 'toklen', to the address in 'from' (which started out pointing to the start of the token table). The token number, 'tokno', is incremented by one, so that if a match is eventually found, 'tokno' holds the right value to write out to the output string, preceded by a one nibble which signifies that the next nibble is a token.

&FF marks the end of the token table, so if the end of 'toktab' is reached with no match having been made, 'chktok' exits with the negative flag set. The 'compress' routine then gets on with individually matching common or less-common characters as before.
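
Putting those three paragraphs together, the heart of 'chktok' might be sketched as follows. This is an illustration rather than Dewhurst's actual code: 'from' and 'text' are assumed to be zero-page pointers, 'text' is assumed to point one byte before the current input character so that a single Y index can serve both pointers, and checks such as running off the end of the input are omitted:

      .chktok
      LDA #toktab MOD 256 : STA from
      LDA #toktab DIV 256 : STA from+1   \ 'from' -> start of token table
      LDA #0 : STA tokno
      .nextwd
      LDY #0
      LDA (from),Y           \ length byte of the current entry
      BMI nomatch            \ &FF reached: no token matched
      STA toklen
      TAX                    \ X counts characters still to match
      .cmplp
      INY                    \ Y=1 first time round, skipping the length byte
      LDA (from),Y           \ character from the table...
      CMP (text),Y           \ ...compared with the input character
      BNE notfnd
      DEX
      BNE cmplp
      RTS                    \ X=0: matched - write a 1 nibble, then tokno
      .notfnd
      LDA from : SEC
      ADC toklen             \ carry set adds the extra 1 for the length byte
      STA from
      BCC nohi
      INC from+1
      .nohi
      INC tokno              \ next word in the table, next token number
      JMP nextwd
      .nomatch
      RTS                    \ negative flag still set from loading the &FF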

If you want, you can devise your own set of token words and put them in line 1990. Note that there's no point in including two-letter words comprising common letters - "of", "is", and so forth - because a token flag followed by a token number takes up the same amount of memory as two commonly-used letters, i.e. two nibbles.
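
Assuming the words are simply listed as DATA items (the exact layout of the real listing may differ), your replacement line would look something like:

      1990 DATA this,and,that,with,have,from,they,will,been,said,what,were,when,your,more,some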

The token expansion routine, 'outtok', works out the correct position in 'toktab' from which to begin printing: the lengths of any tokens preceding the token in question are added to the address in 'from'. Again, the token length is loaded into X and used as a counter for how many characters from the table should be sent to OSWRCH.
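
A corresponding sketch of 'outtok', under the same assumptions as before (the token number is taken to arrive in the accumulator, and &FFEE is the OSWRCH entry point):

      .outtok
      STA tokno              \ token number to expand
      LDA #toktab MOD 256 : STA from
      LDA #toktab DIV 256 : STA from+1
      .skiplp
      LDA tokno
      BEQ found              \ counted down to the wanted entry
      LDY #0
      LDA (from),Y           \ length of an entry we're skipping over
      SEC
      ADC from : STA from    \ step past the word plus its length byte
      BCC nohi2
      INC from+1
      .nohi2
      DEC tokno
      JMP skiplp
      .found
      LDY #0
      LDA (from),Y : TAX     \ X = number of characters to print
      .prtlp
      INY
      LDA (from),Y
      JSR &FFEE              \ OSWRCH: send one character to the screen
      DEX
      BNE prtlp
      RTS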

To see what effect all this has on the compression factor, type in "she sells sea shells on the sea shore" and compare the result with last time: 54% is quite impressive. Having said that, the compression program suffers from a limited token table, and only lower-case letters are accepted. You could improve on this by assuming that the character after a full stop and space is a capital letter. A few more spaces could be squeezed out by assuming that all punctuation marks are followed by a space. You could also assign a nibble value of 3 to mean that the following two nibbles index into a 256-word token table.
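
To illustrate that last suggestion: with PROCnib standing for a hypothetical nibble-packing routine like the one sketched earlier, an extended token number t% between 0 and 255 would come out as three nibbles, still far shorter than the word it replaces:

      REM emit extended token t% (0-255): flag nibble 3, then two index nibbles
      PROCnib(3) : PROCnib(t% DIV 16) : PROCnib(t% MOD 16)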

Christopher Dewhurst, EUG #53
