Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

View GiellaLT on GitHub divvungiellatekno/giellalt.uit.no

The basic outline

Details regarding moving the source files

The directory structure contains more levels than in the old infra, and some source files are now split into several source files. Also, many filenames have changed, hopefully such that it is easier to get an idea of what the content is just based on the filename.

All of this means that you need to be a bit careful when moving files from the old infra. Here are some points to remember:

The .lexc files

The main morphology lexc file, typically named LANG-lex.txt in the old infra, is split into several files:

Other lexc files are renamed as follows:

Remember to update the source file definitions in src/morphology/Makefile.am to contain all and only the actual source files.

The disambiguation and dependency files

The language-specific disambiguation file is placed within src/syntax/, and now has the suffix .cg3. The full filename is disambiguation.cg3.

The language-independent dependency file is currently placed in the same directory, but should probably be moved to a shared directory.

Twolc files / phonological xfst script files

These are placed in src/phonology/ and end in .twolc or .xfscript respectively. The filename is LANG-phon.twolc or LANG-phon.xfscript.

The clock, numbers and date files

These go into src/transcriptions/ and are named the same as earlier.

Orthographical conventions

These go into src/orthography/ and are named as follows:

allcaps.xfscript
inituppercase.regex

Filters

Into src/filters go various language specific filters.

These keep the same name as before, after being moved:

Yaml files

If there are any yaml test files defined for a language, they should go into test/src/morphology/ where they will be picked up automatically, with the following caveat: the filename must end not only in .yaml but also including the selective part of the transducer name for the transducer you want to use for the test.

Presently there are two shell scripts for running yaml tests: run-gt-norm-yaml-testcases.sh.in and run-gt-desc-yaml-testcases.sh.in, for running tests agains the normative gt-norm and descriptive gt-desc transducers respectively.

You indicate the target transducer (and shell script) by specifying the identifying part of the transducer name as the last part of the yaml test file name (before the .yaml suffix), and separated from the rest of the filename with an underscore, like this:

adjective-tests_gt-norm.yaml

where the essential part is _gt-norm - that is the part that tells the shell scripts whether to include the yaml tests or not (and thus which transducer to run the test data against). In this case the transducer names are analyser-gt-norm.xfst (or .hfst) and generator-gt-norm.xfst.

If you want to run a test set against, say, a normative Oahpa transducer, the analyser will be named analyser-oahpa-gt-norm.xfst, the shell script should be named run-oahpa-gt-norm-yaml-testcases.sh.in, and the yaml test files should have names ending in _oahpa-gt-norm.yaml.

Other test files

There are several other test files used in the old infra. Most of the tests they are used for are not yet integrated in the new infra. These files should be moved to their natural location in the new infra with the filename they have. If you are unsure of what the natural location is, please ask Sjur.