Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Text is preprocessed to split it into words and sentences. Sentence splitting requires handling abbreviations, since a period after an abbreviation does not necessarily end a sentence. The linguistic side of the issue is covered in this document, and there is more specific documentation on the linguistic reasoning. See also the Preprocessor Specification for the pmatch fst behind the hfst method.

Here we look at how to compile and use the preprocessor that deals with the abbreviations.

Abbreviation handling with hfst

This is the recommended approach. Configure, compile and test as follows (using sme as the example):

./configure --with-hfst --enable-tokenisers
make
echo "dr. Watson." | hfst-tokenise $GTHOME/langs/sme/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst

The result should treat the first period as part of the abbreviation “dr.”, but the second as a period separated from the word it was attached to.
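
The expected result described above can also be checked mechanically. The following is a minimal sketch, assuming that hfst-tokenise prints one token per line in its default output mode. The function name check_abbr_tokens is hypothetical and not part of the GiellaLT infrastructure.

```shell
# Hypothetical helper: reads tokeniser output on stdin
# (ASSUMPTION: one token per line) and succeeds only if "dr." is kept
# as a single token and a separate "." token is also present.
check_abbr_tokens() {
    tokens=$(cat)
    # -x matches whole lines only, -F treats the pattern as a fixed string
    printf '%s\n' "$tokens" | grep -qxF 'dr.' || return 1
    printf '%s\n' "$tokens" | grep -qxF '.' || return 1
    return 0
}

# Possible usage, with the sme tokeniser compiled as described above:
#   echo "dr. Watson." | hfst-tokenise $GTHOME/langs/sme/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | check_abbr_tokens && echo OK
```

If the tokeniser is run with other output options, the line-based check would need to be adapted to that format.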

Abbreviation handling with xfst

This method is not actively maintained, but it is documented here in case you have not installed hfst.

From the directory $GTHOME/langs/$LANG, check whether there is a file abbr.txt in the folder tools/tokenisers. If there is, you should be fine, and can write

echo "dr. Watson." | preprocess --abbr=tools/tokenisers/abbr.txt

The result should be as above.

If you don’t have this file, you can compile it. In the $LANG directory (the directory of your language), configure and compile as follows:

./configure --enable-abbr
cd tools/tokenisers
make abbr

The result should be a file abbr.txt in tools/tokenisers. You can test it with the preprocess command given above, run from the $LANG directory, since the --abbr path is relative to it.
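
The decision between using an existing abbr.txt and compiling one can be scripted. The following is a minimal sketch. The function name abbr_status is hypothetical, and the language directory is passed as an argument because in a real shell the variable LANG normally holds the locale rather than a language code.

```shell
# Hypothetical helper: report whether tools/tokenisers/abbr.txt exists
# in the given language directory (defaults to the current directory).
abbr_status() {
    lang_dir=${1:-.}
    if [ -f "$lang_dir/tools/tokenisers/abbr.txt" ]; then
        echo "found"
    else
        echo "missing"
    fi
}

# Possible usage from the language directory:
#   if [ "$(abbr_status .)" = "missing" ]; then
#       ./configure --enable-abbr
#       (cd tools/tokenisers && make abbr)
#   fi
```

Running the make step in a subshell, as in the usage comment, leaves the working directory unchanged, so the relative --abbr path in the preprocess command still resolves afterwards.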