Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

View GiellaLT on GitHub divvungiellatekno/giellalt.uit.no

Plan and status

  1. start small - only one language. First language is fao, which is reasonably big but still not too complex. Create the basic dir tree, and use svn copy to copy over the fao sources, so that the old fao dir remains intact and usable all the time (only when everything is working ok, the old dir will be removed).
    1. DONE
  2. get all the basic infrastructure and build functionality to work for the needs of fao
    1. DONE
  3. add another language, probably a small one, to test the multilingual behaviour as well as templating system (the small language will most likely not have all features that fao has)
    1. DONE
  4. add a third language - a big one this time, e.g. kal (which is using xfst instead of twolc and thus provides a slightly new use case). Make sure all build targets are working as they should, and extend the build system, template files, etc as needed. kal has probably more requirements than fao.
    1. DONE
  5. add support for the following transducer features:
    1. dialects - DONE
    2. norm vs sub - DONE
    3. PLX languages
    4. hyphenating transducers
    5. morphological border transducers
    6. HFST spellers - DONE
  6. then add one language at a time, all the time ensuring that everything is working for all langauges, and that the small languages automatically pick up new functionality from the big ones as the template dir is expanded to follow the big languages being added
    1. IN PROGRESS
  7. gradually remove the old language dirs as the new location and build infrastructure becomes stable, also forcing the whole group to start using the new infrastructure. This is important to get feedback and correct bugs.
    1. IN PROGRESS

Progress summary

A working new infrastructure using autotools are now in place. Tommi has experimented a bit with CMake, but nothing has been committed.

Faroese is working, but no other languages. This is too little to really test it all, and the next important goal is to add at least one other language. Tommi’s plan was to add Greenlandic.

Details

Configuration and dependency checks for external tools, build framework

There are many tools out there for solving these tasks. Our requirement is that it should be open source, available for as many platforms as possible, preferably including Windows, and have a large user base (=ease of getting support and find solutions). Our two candidates at the moment:

We probably do not have a full overview of all our dependencies, but here are at least some:

Not all tools are required for all parts of our source base. We probably need to split up some parts into separate subprojects, with their own requirements.

Central & local makefiles

It is essential that we can support the following:

This can be supported through a combination of central and local makefiles. The idea is that local makefiles include central makefiles. The central makefiles carries the burden of the build machinery, whereas the local ones can be enhanced with additional machinery for language-specific processing and experiments.

The central makefiles should be part of the core module, and thus always available.

Implemented? Doesn’t look like it based on the faroese build files.

Templating

Another crucial feature of the new infrastructure is that it should be easy to add new languages, and easy to add new functionality to existing languages. This can be done through a templating system combined with the central build system.

There should exist - within the core - a fully functional template for all languages, functional in the sense that all files compile and constitute a working environment for linguistic development. The linguistric content is of course only placeholder text, to be replaced with real content for any new language.

The template structure should also be used to propagate new shared resources to all languages. The idea here is that when a new feature is developed enough for a single language, the files making up the new feature should be copied to the template tree (with boilerplate content), and the corresponding build instructions should be moved to the central makefiles. Then, the next time another language is built, the build system should note the new files (or the non-existence of the template files in the language dir tree), and copy the template files into the dir tree for the language, and build them. From there on the “only” thing needed is to populate the new files with real content for the language in question.

Populating files with real content is of course the main job, but with a template structure as described above we ensure two things:

Known issues