Presentation of the Divvun and Giellatekno infrastructure
University of Alberta, Edmonton, June 19th
Sjur Moshagen & Trond Trosterud, UiT The Arctic University of Norway
Content
- Background
- Introduction
- The core
- The languages
- Build structure
- Testing
- Documentation
- The tools we produce
- Summary
Background
The problem
- Our original (“old”) infrastructure
- was based upon copy and paste from language to language
- treated different languages differently, for historical reasons
- new languages were added all the time
- also new features and new tools were added for some languages,
- … but they would not become available for other languages without
error-prone copy and paste
- Hfst was added as a hack on top of the support for the Xerox tools
- (Xerox = the first fst compiler; Hfst = an open-source implementation)
- it was way too time-consuming and boring to maintain (mainly by Sjur)
The plan
To create an infrastructure that:
- scales well both regarding languages and tools
- has full parity between Hfst and Xerox
- treats all languages the same
- is consistent from language to language, supporting cross-language cooperation
- … while still being flexible enough to handle variation between the
languages
The solution
[../images/S_curve.png]
Details in the rest of the presentation.
Introduction
Developed by Tommi Pirinen and Sjur Moshagen.
A schematic overview of the main components of the infrastructure:
[../images/newinfra.png]
General principles
- Be explicit (use non-cryptic directory and file names)
- Be clear (files should be found in non-surprising locations)
- Keep conventions identical from language to language whenever possible
- Divide language-dependent and language-independent code
- Modularise the source code and the builds
- Reuse resources
- Know the basic setup of one language – know the setup of them all
- Possibility for all tools to be built for all languages
- Parametrise the build process
What is the infrastructure?
- a systematic way to go from source code to compiled modules
- a framework for testing the modules
- a way of chaining the modules together into larger functional units
For this to work for many languages in parallel and at the same time, we need:
- conventions
- a fixed directory structure
- a shared build system
Conventions
We need conventions for:
- filenames
- tags
- file locations
E.g., your source files are located in src/:
- in the folders morphology/stems, morphology/affixes, phonology, …
- stem files: nouns.lexc, verbs.lexc, particles.lexc, …
- affix files: nouns.lexc, verbs.lexc
Directory structure
In detail:
.
├── am-shared
├── doc
├── misc
├── src
│ ├── filters
│ ├── hyphenation
│ ├── morphology
│ │ ├── affixes
│ │ └── stems
│ ├── orthography
│ ├── phonetics
│ ├── phonology
│ ├── syntax
│ ├── tagsets
│ └── transcriptions
├── test
│ ├── data
│ ├── src
│ └── tools
└── tools
  ├── grammarcheckers
  ├── mt
  │ └── apertium
  ├── preprocess
  ├── shellscripts
  └── spellcheckers
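The layout above can be reproduced with a handful of mkdir calls. The following is only a minimal sketch of the fixed directory structure, not the infrastructure's own setup mechanism; the function name make_lang_skeleton is hypothetical.

```shell
# Hypothetical helper: create the directory skeleton shown above
# under the directory given as the first argument.
make_lang_skeleton() {
    root=$1
    for dir in \
        am-shared doc misc \
        src/filters src/hyphenation \
        src/morphology/affixes src/morphology/stems \
        src/orthography src/phonetics src/phonology \
        src/syntax src/tagsets src/transcriptions \
        test/data test/src test/tools \
        tools/grammarcheckers tools/mt/apertium tools/preprocess \
        tools/shellscripts tools/spellcheckers
    do
        # -p also creates the intermediate directories (src, test, tools, ...)
        mkdir -p "$root/$dir" || return 1
    done
}
```

Because every language gets the same skeleton, a script written against one language directory should work unchanged against any other.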
Explaining the directory structure
.
├── src = source files
│ ├── filters = adjust fst's for special purposes
│ ├── hyphenation = nikîpakwâtik > ni-kî-pa-kwâ-tik
│ ├── morphology =
│ │ ├── affixes = prefixes, suffixes
│ │ └── stems =
│ ├── orthography = latin <-> syllabics, spellrelax
│ ├── phonetics = conversion to IPA
│ ├── phonology = morphophonological rules
│ ├── syntax = disambiguation, synt. functions, dependency
│ ├── tagsets = get your tags as you want them
│ └── transcriptions = convert number expressions to text or v.v.
├── test =
│ ├── data = test data
│ ├── src = tests for the fst's in the src/ dir
└── tools =
  ├── grammarcheckers =
  ├── mt = machine translation
  │ └── apertium = ... for certain MT platforms
  ├── preprocess = split text in sentences and words
  ├── shellscripts = shell scripts to use the modules we create
  └── spellcheckers = spell checkers are built here
The core
The core is a separate folder outside the language-specific ones.
It contains:
- templates for the languages
- scripts used for maintenance and testing
- shared resources:
  - linguistic resources shared among several languages
  - language-independent fst manipulation
Shared resources
The shared resources come in two flavours:
- shared linguistic data
- language independent fst manipulation
Shared linguistic data typically is shared only for a subgroup of languages,
like smi and urj-Cyrl, potentially also alg and ath.
The fst manipulations remove tags or tagged strings of classes typically found
in all languages:
- remove non-standard strings (to make a purely normative fst)
- remove semantic tags from fst’s where they are not used
- remove morphological boundary symbols from the lower/surface side
- etc.
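To make the idea of tag removal concrete, here is a minimal sketch that works on analysis strings rather than on the fst itself (in the infrastructure the removal is an fst operation). The tag shape +Sem/…, the sample analysis, and the function name strip_sem_tags are all assumptions made for illustration.

```shell
# Hypothetical helper: strip semantic tags from analysis strings on stdin.
# Assumes semantic tags look like "+Sem/Name"; real tag sets may differ.
strip_sem_tags() {
    sed 's/+Sem\/[A-Za-z_-]*//g'
}

# Made-up analysis string, only to show the effect.
printf '%s\n' 'word+N+Sem/Hum+Sg+Nom' | strip_sem_tags
# prints: word+N+Sg+Nom
```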
Languages
We have split the languages into four groups, depending on the type of work done
on them, and their license:
- langs: These are the languages being actively developed - 43
languages
- startup-langs: These are languages that someone has an interest in, but
are not actually being developed, and where the linguistic
content is thin - 11 languages
- experiment-langs: The name says it all - this is the playground, and these
languages are, among other things, used for teaching - 3 languages
- closed-langs: These are languages with a closed license, only ISL and
DAN at the moment
Available at:
svn co https://gtsvn.uit.no/langtech/trunk/langs/ISO639-3-CODE/
(replace ISO639-3-CODE with the actual ISO code)
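The substitution can be scripted. The sketch below only prints the checkout command instead of running it; checkout_command is a hypothetical name, the check covers only the three-letter shape of the code (not whether the language exists in the repository), and sme (the ISO 639-3 code for North Sámi) is used purely as an example.

```shell
# Hypothetical helper: print the checkout command for one language.
# The URL pattern is the one given above. Only the shape of the code
# (three lowercase letters) is checked, nothing else.
checkout_command() {
    code=$1
    case $code in
        [a-z][a-z][a-z]) ;;
        *) echo "not a three-letter code: $code" >&2; return 1 ;;
    esac
    echo "svn co https://gtsvn.uit.no/langtech/trunk/langs/$code/"
}

checkout_command sme
# prints: svn co https://gtsvn.uit.no/langtech/trunk/langs/sme/
```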
Build Structure
Support for:
- in-source documentation (converted to html)
- in-source test data
- automated tests
- all tools built for all languages - but not everything built by default
- basically technology neutral, but focused on rule based systems (fst’s, cg)
- all languages structured the same way
- separation of language independent and language specific features
- all builds are language independent, but most (eventually all) build steps
allow a language specific post-build step
Testing
Testing is done with the command make check. There is built-in support for
two types of tests:
- in-source test data in lexc and twolc
- specific test files for testing morphological analysis and generation against
a specific fst
In addition, there is general support for testing in Autotools (or more
specifically in automake), meaning that it is possible to add test scripts
for whatever you like.
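As a rough sketch of what such an extra test script could contain: Automake runs each script listed in its TESTS variable and interprets the exit status (0 = pass, 77 = skipped, 99 = hard error, anything else = fail). The function analyses_ok and the two-file comparison are assumptions for illustration, not the infrastructure's own test code.

```shell
# Hypothetical check for an Automake test script.
# Every line of the expected file must also occur, as a whole line,
# in the actual file (one analysis per line).
analyses_ok() {
    expected=$1
    actual=$2
    # Nothing to compare: report "skipped" in Automake's convention.
    [ -r "$expected" ] && [ -r "$actual" ] || return 77
    # Lines of $expected that match no whole line of $actual.
    missing=$(grep -vxFf "$actual" "$expected" || true)
    if [ -n "$missing" ]; then
        echo "missing analyses:" >&2
        echo "$missing" >&2
        return 1
    fi
    return 0
}

# A real test script would end with something like:
#   analyses_ok expected.txt actual.txt; exit $?
```

The comparison is deliberately one-sided: extra analyses in the actual output do not fail the test, only missing ones do.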
Documentation
The infrastructure supports extraction of in-source documentation written as
comments in a specific format, and will in the end produce html pages.
Documentation written in the actual source code is
more likely to be kept up-to-date than external documentation.
The format supports the use of a couple of variables to extract such things as
lexicon names, a line of code, etc.
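The extraction step can be pictured as a simple filter over the source files. The sketch below assumes that documentation lines start with "!! " (ordinary lexc comments start with "!"); that marker, like the function name extract_doc_comments, is an assumption here, and the real format has more to it, such as the variables mentioned above.

```shell
# Hypothetical extractor: print the text of documentation comments
# found in the files given as arguments (or on stdin).
# Assumes documentation lines start with "!! "; the actual marker
# used by the infrastructure may differ.
extract_doc_comments() {
    sed -n 's/^!! //p' "$@"
}
```

Plain comments and code lines are dropped, so the output contains only the documentation text, ready to be converted to html by a later step.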
The tools we produce
- Analysers
- Generators
- Number transcriptors
- Specialised analysers and generators
- Spellrelax
- Disambiguators and parsers
- Spellers
- Grammar checkers
The pipeline for analysis
- take text
- preprocess it (sentences, words)
- give all morphological analyses
- pick the correct ones
- add grammatical functions
- add dependency relations
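The first steps of this pipeline can be sketched as a chain of filters, each reading stdin and writing stdout. The three functions below are stand-ins that only mimic the data flow; none of them is a real analyser or disambiguator, and the tags +A1 and +A2 are dummies.

```shell
# Stand-in stages (all hypothetical): each one is a stdin-to-stdout filter,
# so the stages can be chained with pipes.

# One token per line (splits on spaces only).
preprocess() { tr -s ' ' '\n'; }

# Pretend every token has two analyses: token<TAB>analysis1<TAB>analysis2.
analyse() { awk '{ print $0 "\t" $0 "+A1\t" $0 "+A2" }'; }

# Pretend disambiguation: keep the token and its first analysis only.
disambiguate() { awk -F '\t' '{ print $1 "\t" $2 }'; }

echo 'take text' | preprocess | analyse | disambiguate
# prints two lines: "take<TAB>take+A1" and "text<TAB>text+A1"
```

Later stages (grammatical functions, dependency relations) would be added to the chain in the same way, as further filters on the right.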
The pipeline for grammar checking
- take text
- preprocess it (sentences, words)
- give all morphological analyses
- make a sloppy disambiguation (“do not trust the input”)
- find error patterns
- mark them
- give message to the writer
Two startup scenarios
- Add a new language that does not have machine-readable resources (“Blackfoot”)
- Add an existing morphological analyser in an incompatible format,
in order to generate the full range of tools offered here (“Innu”)
In the latter case it could be possible and even preferable to script the
conversion from the original format to the lexc format, to make it possible to
reimport or update the data.
Summary
- This infrastructure makes it possible to
- work with several languages
- get several tools and programs out of one and the same source code
- It is continuously under development
- … all new features automatically become available to all languages
- It is documented
- … and it is available as open source code