Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Edmonton presentation

University of Alberta, Edmonton, June 8th & 13th 2015

Sjur Moshagen, UiT The Arctic University of Norway

Overview of the presentation

Background and goals

Background

Goals

General principles

  1. Be explicit (use non-cryptic catalogue and file names)
  2. Be clear (files should be found in non-surprising locations)
  3. Be consistent (identical conventions in all languages as far as possible)
  4. Be modular
  5. Divide language-dependent and language-independent code
  6. Reuse resources
  7. Build all tools for all languages
  8. … but only as much as you want (parametrised build process)

Bird’s Eye View and Down

The House

[../images/hus_eng_2015.png]

The House and the Infra

[../images/hus_eng_2015_with_infra.png]

$GTHOME - directory structure

Some less relevant dirs removed for clarity:

$GTHOME/                     # root directory, can be named whatever
├── experiment-langs         # language dirs used for experimentation
├── giella-core              # $GTCORE - core utilities
├── giella-shared            # shared linguistic resources
├── giella-templates         # templates for maintaining the infrastructure
├── keyboards                # keyboard apps organised roughly as the language dirs
├── langs                    # The languages being actively developed, such as:
│   ├─[...]                  #
│   ├── crk                  # Plains Cree
│   ├── est                  # Estonian
│   ├── evn                  # Evenki
│   ├── fao                  # Faroese
│   ├── fin                  # Finnish
│   ├── fkv                  # Kven
│   ├── hdn                  # Northern Haida
│   └─[...]                  #
├── ped                      # Oahpa etc.
├── prooftools               # Libraries and installers for spellers and the like
├── startup-langs            # Directory for languages in their start-up phase
├── techdoc                  # technical documentation
├── words                    # dictionary sources
└── xtdoc                    # external (user) documentation & web pages

Organisation - Dir Structure

.
├── src                  = source files
│   ├── filters          = adjust fst's for special purposes
│   ├── hyphenation      = nikîpakwâtik >  ni-kî-pa-kwâ-tik
│   ├── morphology       =
│   │   ├── affixes      = prefixes, suffixes
│   │   └── stems        = lexical entries
│   ├── orthography      = latin -> syllabics, spellrelax
│   ├── phonetics        = conversion to IPA
│   ├── phonology        = morphophonological rules
│   ├── syntax           = disambiguation, synt. functions, dependency
│   ├── tagsets          = get your tags as you want them
│   └── transcriptions   = convert number expressions to text or v.v.
├── test                 =
│   ├── data             = test data
│   └── src              = tests for the fst's in the src/ dir
└── tools                =
    ├── grammarcheckers  = prototype work, only SME for now
    ├── mt               = machine translation
    │   └── apertium     = ... for certain MT platforms
    ├── preprocess       = split text in sentences and words
    └── spellcheckers    = spell checkers are built here
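Most of the src/ subdirectories contain finite-state source code. As an illustration, the morphophonological rules in src/phonology are written in twolc; a minimal, purely illustrative rule file (not taken from any actual language) could look like this:

```
Alphabet
  a e i o u  p t k  b d g  s n m l r ;

Sets
  Vow = a e i o u ;

Rules

"t weakens to d between vowels"
  t:d <=> Vow _ Vow ;
```

The rule states that underlying t is realised as d exactly when it stands between two vowels; the real rule files are of course much larger, but follow the same Alphabet/Sets/Rules layout.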

Technologies

Technology for morphological analysis

We presently use three different technologies to build the same source code: the Xerox tools, HFST and Foma (cf. the configure summary below).

Technology for syntactic parsing

1. We like finite verbs:
SELECT:Vfin VFIN ;
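The rule above is written in Constraint Grammar (CG-3) syntax. A minimal sketch of the context such a rule might live in, with purely hypothetical delimiter and set definitions:

```
DELIMITERS = "<.>" ;                      # sentence delimiter (hypothetical)

LIST VFIN = (V Ind) (V Imprt) (V Cond) ;  # finite-verb readings (hypothetical tags)

# In an ambiguous cohort, keep only the finite-verb readings:
SELECT:Vfin VFIN ;
```

The rule name after the colon (Vfin) is only used for tracing and debugging; the set name VFIN determines which readings are selected.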

Templated Build Structure And Source Files

[../images/newinfra.png]

Configurable builds

We support many different tools and targets, but in most cases one only wants a handful of them. At the end of a ./configure run you get a summary of what has been turned on and off:

$ ./configure --with-hfst
[...]
-- Building giella-crk 20110617:

  -- Fst build tools: Xerox, Hfst or Foma - at least one must be installed
  -- Xerox is default on, the others off unless they are the only one present --
  * build Xerox fst's: yes
  * build HFST fst's: yes
  * build Foma fst's: no

  -- basic packages (on by default): --
  * analysers enabled: yes
  * generators enabled: yes
  * transcriptors enabled: yes
  * syntactic tools enabled: yes
  * yaml tests enabled: yes
  * generated documentation enabled: yes

  -- proofing tools (off by default): --
  * spellers enabled: no
    * hfst speller fst's enabled: no
    * foma speller enabled: no
    * hunspell generation enabled: no
  * fst hyphenator enabled: no
  * grammar checker enabled: no

  -- specialised fst's (off by default): --
  * phonetic/IPA conversion enabled: no
  * dictionary fst's enabled: no
  * Oahpa transducers enabled: no
    * L2 analyser: no
    * downcase error analyser: no
  * Apertium transducers enabled: no
  * Generate abbr.txt: no

For more ./configure options, run ./configure --help
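To turn on more than the defaults, one passes the corresponding --enable flags. For example, to also build spell checkers on top of the HFST tools (the flag name here is an assumption based on the summary above; verify with ./configure --help):

```
$ ./configure --with-hfst --enable-spellers
```

The summary is then updated accordingly, and make will build the extra targets.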

The build - schematic

[../images/new_infra_build_overview.png]

Closer View Of Selected Parts:

* Documentation
* Testing
* From Source To Final Tool:
  * Relation Between Lexicon, Build And Speller

Closer View: Documentation

Background

Implementation

Example cases:

Documentation:

Closer View: Testing

Testing Framework

All automated testing done within the infrastructure is based on the testing facilities provided by Autotools.

All tests are run with a single command:

make check

Autotools gives a PASS or FAIL on each test as it finishes:

[../images/make_check_output.png]

Yaml Tests

These are the most used tests, and are named after the syntax of the test files. The core syntax is:

Config:
  hfst:
    Gen: ../../../src/generator-gt-norm.hfst
    Morph: ../../../src/analyser-gt-norm.hfst
  xerox:
    Gen: ../../../src/generator-gt-norm.xfst
    Morph: ../../../src/analyser-gt-norm.xfst
    App: lookup

Tests:
  Noun - mihkw - ok : # -m inanimate noun, blood, Wolvengrey
     mihko+N+IN+Sg: mihko
     mihko+N+IN+Sg+Px1Sg: nimihkom
     mihko+N+IN+Sg+Px2Sg: kimihkom
     mihko+N+IN+Sg+Px1Pl: nimihkominân
     mihko+N+IN+Sg+Px12Pl: kimihkominaw
     mihko+N+IN+Sg+Px2Pl: kimihkomiwâw
     mihko+N+IN+Sg+Px3Sg: omihkom
     mihko+N+IN+Sg+Px3Pl: omihkomiwâw
     mihko+N+IN+Sg+Px4Pl: omihkomiyiw

Yaml test output

[../images/make_check_output.png]

In-Source Tests

LexC tests

As an alternative to the yaml tests, one can specify similar test data within the source files:

LEXICON MUORRA !!= @CODE@ Standard even stems with cg (note Q1). OBS: Nouns with invisible 3>2 cg (as bus'sa) go to this lexicon.
 +N:   MUORRAInfl ;
 +N:%> MUORRACmp  ;

## €gt-norm: kárta # Even-syllable test
## € kártta:         kártta+N+Sg+Nom
## € kártajn:        kártta+N+Sg+Com

Such tests serve as useful checks that an inflectional lexicon behaves as it should.

The syntax differs slightly from that of the yaml files: as the example above shows, the tests are written as structured comments directly in the lexc source code.

Twolc tests

The twolc tests look like the following:

## € iemed9#
## € iemet#

## € gål'leX7tj#
## € gål0lå0sj#

The point is to ensure that the rules behave as they should: in each pair, the first line gives the underlying (morphophonological) form, and the second the surface form the rules are expected to produce.

Other Tests

You can write any test you want, using your favourite programming language. There are a number of shell scripts to test speller functionality, and more tests will be added as the infrastructure develops.

Closer View: From Source To Final Tool

Relation Between Lexicon, Build And Speller

Tag Conventions

We use certain tag conventions in the infrastructure:

Automatically Generated Filters

Dealing with descriptive vs normative grammars

Summary

Giitu (Thank you)
