Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Presentation of the Divvun and Giellatekno infrastructure

University of Alberta, Edmonton, June 19th

Sjur Moshagen & Trond Trosterud, UiT The Arctic University of Norway

Content

Background

The problem

The plan

To create an infrastructure that:

  1. scales well both regarding languages and tools
  2. has full parity between HFST and Xerox
  3. treats all languages the same
  4. is consistent from language to language, supporting cross-language cooperation
  5. … while still being flexible enough to handle variation between the languages

The solution

[../images/S_curve.png]

Details in the rest of the presentation.

Introduction

Developed by Tommi Pirinen and Sjur Moshagen.

A schematic overview of the main components of the infrastructure:

[../images/newinfra.png]

General principles

  1. Be explicit (use non-cryptic directory and file names)
  2. Be clear (files should be found in non-surprising locations)
  3. Keep conventions identical from language to language whenever possible
  4. Divide language-dependent and language-independent code
  5. Modularise the source code and the builds
  6. Reuse resources
  7. Know the basic setup of one language – know the setup of them all
  8. Make it possible to build all tools for all languages
  9. Parametrise the build process

What is the infrastructure?

For this to work for many languages in parallel, we need:

Conventions

We need conventions for:

E.g., your source files are located in src/:

Directory structure

In detail:

.
├── am-shared
├── doc
├── misc
├── src
│   ├── filters
│   ├── hyphenation
│   ├── morphology
│   │   ├── affixes
│   │   └── stems
│   ├── orthography
│   ├── phonetics
│   ├── phonology
│   ├── syntax
│   ├── tagsets
│   └── transcriptions
├── test
│   ├── data
│   ├── src
│   └── tools
└── tools
    ├── grammarcheckers
    ├── mt
    │   └── apertium
    ├── preprocess
    ├── shellscripts
    └── spellcheckers

Explaining the directory structure

.
├── src                  = source files
│   ├── filters          = adjust FSTs for special purposes
│   ├── hyphenation      = nikîpakwâtik > ni-kî-pa-kwâ-tik
│   ├── morphology       =
│   │   ├── affixes      = prefixes, suffixes
│   │   └── stems        =
│   ├── orthography      = latin <-> syllabics, spellrelax
│   ├── phonetics        = conversion to IPA
│   ├── phonology        = morphophonological rules
│   ├── syntax           = disambiguation, synt. functions, dependency
│   ├── tagsets          = get your tags as you want them
│   └── transcriptions   = convert number expressions to text or vice versa
├── test                 =
│   ├── data             = test data
│   └── src              = tests for the FSTs in the src/ dir
└── tools                =
    ├── grammarcheckers  =
    ├── mt               = machine translation
    │   └── apertium     = ... for certain MT platforms
    ├── preprocess       = split text in sentences and words
    ├── shellscripts     = shell scripts to use the modules we create
    └── spellcheckers    = spell checkers are built here
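
Because every language follows the same layout, a checkout can be checked against it mechanically. The following is a minimal sketch, not part of the infrastructure itself: the directory list is copied from the tree above, while the function name `missing_dirs` and the idea of such a check are assumptions made for illustration.

```python
from pathlib import Path

# Subdirectories copied from the directory tree above.
# The check itself is illustrative and not part of the infrastructure.
EXPECTED_DIRS = [
    "src/filters",
    "src/hyphenation",
    "src/morphology/affixes",
    "src/morphology/stems",
    "src/orthography",
    "src/phonetics",
    "src/phonology",
    "src/syntax",
    "src/tagsets",
    "src/transcriptions",
    "test/data",
    "test/src",
    "tools/grammarcheckers",
    "tools/mt/apertium",
    "tools/preprocess",
    "tools/shellscripts",
    "tools/spellcheckers",
]


def missing_dirs(lang_root):
    """Return the expected subdirectories that are absent under lang_root."""
    root = Path(lang_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Calling `missing_dirs` on the root of a language directory returns an empty list when the layout is complete, and otherwise the relative paths that are missing.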

The core

The core is a separate folder outside the language-specific ones. It contains:

Shared resources

The shared resources come in two flavours:

Shared linguistic data is typically shared only within a subgroup of languages, such as smi and urj-Cyrl, and potentially also alg and ath.

The FST manipulations remove tags, or tagged strings, belonging to classes typically found in all languages:

Languages

We have split the languages into four groups, depending on the type of work done on them and on their license:

Available at:

svn co https://gtsvn.uit.no/langtech/trunk/langs/ISO639-3-CODE/

(replace ISO639-3-CODE with the actual ISO code)
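
When several languages are needed, the substitution can be scripted. The sketch below only builds the `svn co` command for a given code, using the URL pattern shown above; the function name `checkout_command` and the simple validation of the code are assumptions, and nothing here checks whether a given language actually exists in the repository.

```python
BASE_URL = "https://gtsvn.uit.no/langtech/trunk/langs"


def checkout_command(iso_code, base_url=BASE_URL):
    """Build the svn checkout command for one ISO 639-3 language code.

    The validation is a rough sanity check (three lowercase ASCII letters),
    not a lookup in the ISO 639-3 registry.
    """
    code = iso_code.strip()
    if not (len(code) == 3 and code.isascii() and code.isalpha() and code.islower()):
        raise ValueError(f"not a plausible ISO 639-3 code: {iso_code!r}")
    return ["svn", "co", f"{base_url}/{code}/"]
```

The returned list can be passed to `subprocess.run` to perform the checkout, for example in a loop over the codes of the languages you work with.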

Build Structure

Support for:

Testing

Testing is done with the command make check. There is built-in support for two types of tests:

In addition, Autotools (more specifically, automake) provides general support for testing, so it is possible to add test scripts for whatever you like.
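
As an illustration of such a custom test script, the sketch below follows the automake convention for exit codes: 0 means pass, 77 means skip, and other non-zero values mean failure. Everything else is hypothetical: the tab-separated test data format, the function name `run_test`, and the `analyse` callable, which stands in for whatever actually runs the analyser.

```python
from pathlib import Path

# Exit codes understood by the automake test harness.
EXIT_PASS, EXIT_FAIL, EXIT_SKIP = 0, 1, 77


def run_test(pairs_file, analyse):
    """Check that each word in pairs_file receives its expected analysis.

    pairs_file: hypothetical format, one "word<TAB>expected analysis" per line.
    analyse:    hypothetical callable returning a collection of analyses
                for a word; in real use it would wrap the compiled FST.
    """
    path = Path(pairs_file)
    if not path.is_file():
        # No test data available: report "skipped" rather than "failed".
        return EXIT_SKIP
    failures = 0
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # ignore blank lines and comment lines
        word, expected = line.split("\t", 1)
        if expected not in analyse(word):
            failures += 1
    return EXIT_FAIL if failures else EXIT_PASS
```

A script built around this would end with `sys.exit(run_test(...))` and be listed in the `TESTS` variable of the relevant `Makefile.am`, so that `make check` picks it up.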

Documentation

The infrastructure supports extraction of in-source documentation, written as comments in a specific format, and ultimately produces HTML pages from it.

Documentation written in the actual source code is more likely to be kept up-to-date than external documentation.

The format supports the use of a couple of variables to extract such things as lexicon names, a line of code, etc.
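
To make the idea concrete, here is a minimal sketch of such an extraction. It is not the infrastructure's actual extractor: it assumes, for illustration only, that documentation lines start with `!! `, that `@LEXNAME@` is a variable standing for the most recently declared lexicon, and that each documentation line becomes one HTML paragraph.

```python
import html
import re

DOC_PREFIX = "!! "            # assumed marker for documentation comments
LEXNAME_VAR = "@LEXNAME@"     # assumed variable for the current lexicon name
LEXICON_RE = re.compile(r"^LEXICON\s+(\S+)")


def extract_doc_html(source):
    """Turn documentation comments in lexc-like source into HTML paragraphs."""
    current_lexicon = ""
    paragraphs = []
    for raw in source.splitlines():
        line = raw.strip()
        match = LEXICON_RE.match(line)
        if match:
            # Remember the lexicon name for later variable substitution.
            current_lexicon = match.group(1)
        elif line.startswith(DOC_PREFIX):
            text = line[len(DOC_PREFIX):].replace(LEXNAME_VAR, current_lexicon)
            paragraphs.append(f"<p>{html.escape(text)}</p>")
    return "\n".join(paragraphs)
```

Ordinary comments and code lines are ignored, so only text deliberately marked as documentation ends up in the generated page.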

The tools

The pipeline for analysis

The pipeline for grammar checking

Two startup scenarios

In the latter case it may be possible, and even preferable, to script the conversion from the original format to the lexc format, so that the data can be reimported or updated later.
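
A scripted conversion of this kind could look roughly as follows. This is a sketch under stated assumptions: the input is taken to be a list of (lemma, continuation lexicon) pairs, the function names are invented for the example, and the set of escaped characters covers only some of the characters that are special in lexc.

```python
# Some characters with special meaning in lexc; they are escaped with "%".
# This set is an assumption for the sketch and may not be complete.
LEXC_SPECIAL = set("%:;!#<>0 ")


def lexc_escape(text):
    """Prefix lexc-special characters with the % escape character."""
    return "".join("%" + ch if ch in LEXC_SPECIAL else ch for ch in text)


def wordlist_to_lexc(rows, lexicon="Root"):
    """Render (lemma, continuation lexicon) pairs as one lexc lexicon.

    rows:    hypothetical input, e.g. parsed from an existing word list.
    lexicon: name of the lexicon to generate.
    """
    lines = [f"LEXICON {lexicon}"]
    for lemma, contlex in rows:
        lines.append(f"{lexc_escape(lemma)} {contlex} ;")
    return "\n".join(lines) + "\n"
```

Because the lexc file is generated, the original word list remains the editable source, and rerunning the script after an update reimports the data.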

Summary

  1. This infrastructure makes it possible to
    1. work with several languages
    2. get several tools and programs out of one and the same source code
  2. It is under continuous development
    1. … all new features automatically become available to all languages
  3. It is documented
  4. … and it is available as open source code