Presentation of the Divvun and Giellatekno infrastructure

University of Alberta, Edmonton, June 19th

Sjur Moshagen & Trond Trosterud, UiT The Arctic University of Norway

Content

Background
Introduction
The core
The languages
Build structure
Testing
Documentation
The tools we produce
Summary

Background

The problem

Our original (“old”) infrastructure
- was based upon copy and paste from language to language
- treated different languages differently, for historical reasons
new languages was added all the time
also new features and new tools were added for some languages,
- … but they would not become available for other languages without error-prone copy and paste
Hfst was added as a hack on top of the support for the Xerox tools
- (Xerox = the first fst compiler; Hfst = an open-source implementation)
it was way too time-consuming and boring to maintain (mainly by Sjur)

The plan

To create an infrastructure that:

scales well both regarding languages and tools
has full parity between Hfst and Xerox
treats all languages the same
is consistent from language to language, supporting cross-language cooperation
… while still being flexible enough to handle variation between the languages

The solution

[../images/S_curve.png]

Details in the rest of the presentation.

Introduction

Developed by Tommi Pirinen and Sjur Moshagen.

A schematic overview of the main components of the infrastructure:

[../images/newinfra.png]

General principles

Be explicit (use non-cryptic catalogue and file names)
Be clear (files should be found in non-surprising locations)
Keep conventions identical from language to language whenever possible
Divide language-dependent and language-independent code
Modularise the source code and the builds
Reuse resources
Know the basic setup of one language – know the setup of them all
Possibility for all tools to be built for all languages
Parametrise the build process

What is the infrastructure?

a systematic way to go from source code to compiled modules
a framework for testing the modules
a way of chaining the modules together into larger functional units

For this to work for many languages in parallel and at the same time, we need:

conventions
a fixed directory structure
a shared build system

Conventions

We need conventions for:

filenames
tags
file locations

E.g., your source files are located in src/:

in the folders morphology/stems, morphology/affixes, phonology, …
stem files: nouns.lexc, verbs.lexc, particles.lexc, …
affix files: nouns.lexc, verbs.lexc

Directory structure

In detail:

.
├── am-shared
├── doc
├── misc
├── src
│   ├── filters
│   ├── hyphenation
│   ├── morphology
│   │   ├── affixes
│   │   └── stems
│   ├── orthography
│   ├── phonetics
│   ├── phonology
│   ├── syntax
│   ├── tagsets
│   └── transcriptions
├── test
│   ├── data
│   ├── src
│   └── tools
└── tools
    ├── grammarcheckers
    ├── mt
    │   └── apertium
    ├── preprocess
    ├── shellscripts
    └── spellcheckers

Explaining the directory structure

.
├── src                  = source files
│   ├── filters          = adjust fst's for special purposes
│   ├── hyphenation      = nikîpakwâtik >  ni-kî-pa-kwâ-tik
│   ├── morphology       =
│   │   ├── affixes      = prefixes, suffixes
│   │   └── stems        =
│   ├── orthography      = latin <-> syllabics, spellrelax
│   ├── phonetics        = conversion to IPA
│   ├── phonology        = morphophonological rules
│   ├── syntax           = disambiguation, synt. functions, dependency
│   ├── tagsets          = get your tags as you want them
│   └── transcriptions   = convert number expressions to text or v.v.
├── test                 =
│   ├── data             = test data
│   ├── src              = tests for the fst's in the src/ dir
└── tools                =
    ├── grammarcheckers  =
    ├── mt               = machine translation
    │   └── apertium     = ... for certain MT platforms
    ├── preprocess       = split text in sentences and words
    ├── shellscripts     = shell scripts to use the modules we create
    └── spellcheckers    = spell checkers are built here

The core

The core is a separate folder outside the language-specific ones. It contains:

templates for the languages
scripts used for maintenance and testing
shared resources
- linguistic resources shared among several languages
- language independent fst manipulation

Shared resources

The shared resources come in two flavours:

shared linguistic data
language independent fst manipulation

Shared linguistic data typically is shared only for a subgroup of languages, like smi and urj-Cyrl, potentially also alg and ath.

The fst manipulations remove tags or tagged strings of classes typically found in all languages:

remove non-standard strings (to make a purely normative fst)
remove semantic tags from fst’s where they are not used
remove morphological boundary symbols from the lower/surface side
etc.

Languages

We have split the languages in four groups, depending on the type of work done on them, and their license:

langs: These are the languages being actively developed - 43 languages
startup-langs: These are languages that someone has an interest in, but are not actually being developed, and where the linguistic content is thin - 11 languages
experiment-langs: The name says it all - this is the playground, and these languages are a.o. used for teaching - 3 languages
closed-langs: These are languages with a closed license, only ISL and DAN at the moment

Available at:

svn co https://gtsvn.uit.no/langtech/trunk/langs/ISO639-3-CODE/

(replace ISO639-3-CODE with the actual ISO code)

Build Structure

Support for:

in-source documentation (converted to html)
in-source test data
automated tests
all tools built for all languages - but not everything built by default
basically technology neutral, but focused on rule based systems (fst’s, cg)
all languages structured the same way
separation of language independent and language specific features
all builds are language independent, but most (eventually all) build steps allow a language specific post-build step

Testing

Testing is done with the command make check. There is built-in support for two types of tests:

in-source test data in lexc and twolc
specific test files for testing morphological analysis and generation against a specific fst

In addition, there is the general support for testing in Autotools (or more specifically in automake), meaning that it is possible to add test scripts for whatever you like.

Documentation

The infrastructure supports extraction of in-source documentation written as comments in a specific format, and will in the end produce html pages.

Documentation written in the actual source code is more likely to be kept up-to-date than external documentation.

The format supports the use of a couple of variables to extract such things as lexicon names, a line of code, etc.

The tools

Analysers
Generators
Number transcriptors
Specialised analysers and generators
Spellrelax
Disambiguators and parsers
Spellers
Grammar checkers

The pipeline for analysis

take text
preprocess it (sentences, words)
give all morphological analyses
pick the correct ones
add grammatical functions
add dependency relations

The pipeline for grammar checking

take text
preprocess it (sentences, words)
give all morphological analyses
make a sloppy disambiguation (“do not trust the input”)
find error patterns
mark them
give message to the writer

Two startup scenarios

Add a new language that does not have machine-readable resources (“Blackfoot”)
Add an existing morphological analyser in an incompatible format, in order to generate the full range of tools offered here (“Innu”)

In the latter case it could be possible and even preferable to script the conversion from the original format to the lexc format, to make it possible to reimport or update the data.

Summary

This infrastructure makes it possible to
1. work with several languages
2. get several tools and programs out of one and the same source code
It is continuously under development
1. … all new features automatically become available to all languages
It is documented
… and it is available as open source code

Language Technology at UiT

Page Content