The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
University of Alberta, Edmonton, June 8th & 13th 2015
Sjur Moshagen, UiT The Arctic University of Norway
[../images/hus_eng_2015.png]
[../images/hus_eng_2015_with_infra.png]
* Machine translation: the fst's are built by the infra, the rest is handled by Apertium
* Speech synthesis is not (yet) built by the infra, but conversion to IPA is part of the infrastructure
Supported: the fst's and syntactic parsers used are built by the infrastructure
Some less relevant dirs removed for clarity:
$GTHOME/ # root directory, can be named whatever
├── experiment-langs # language dirs used for experimentation
├── giella-core # $GTCORE - core utilities
├── giella-shared # shared linguistic resources
├── giella-templates # templates for maintaining the infrastructure
├── keyboards # keyboard apps organised roughly as the language dirs
├── langs # The languages being actively developed, such as:
│ ├─[...] #
│ ├── crk # Plains Cree
│ ├── est # Estonian
│ ├── evn # Evenki
│ ├── fao # Faroese
│ ├── fin # Finnish
│ ├── fkv # Kven
│ ├── hdn # Northern Haida
│ └─[...] #
├── ped # Oahpa etc.
├── prooftools # Libraries and installers for spellers and the like
├── startup-langs # Directory for languages in their start-up phase
├── techdoc # technical documentation
├── words # dictionary sources
└── xtdoc # external (user) documentation & web pages
.
├── src = source files
│ ├── filters = adjust fst's for special purposes
│ ├── hyphenation = nikîpakwâtik > ni-kî-pa-kwâ-tik
│ ├── morphology =
│ │ ├── affixes = prefixes, suffixes
│ │ └── stems = lexical entries
│ ├── orthography = latin -> syllabics, spellrelax
│ ├── phonetics = conversion to IPA
│ ├── phonology = morphophonological rules
│ ├── syntax = disambiguation, synt. functions, dependency
│ ├── tagsets = get your tags as you want them
│ └── transcriptions = convert number expressions to text or v.v.
├── test =
│ ├── data = test data
│ └── src = tests for the fst's in the src/ dir
└── tools =
  ├── grammarcheckers = prototype work, only SME for now
  ├── mt = machine translation
  │ └── apertium = ... for certain MT platforms
  ├── preprocess = split text in sentences and words
  └── spellcheckers = spell checkers are built here
We presently use three different technologies:
1. We like finite verbs:
SELECT:Vfin VFIN ;
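To make the effect of a rule like the one above concrete, here is a toy Python sketch. It is not the Constraint Grammar engine itself. It assumes that VFIN is a set of tags marking finite verb readings; the members of the set and the readings in the example cohort are invented for illustration.

```python
# Toy illustration of what a SELECT rule does to one cohort of readings.
# Assumption: VFIN is a tag set; its members here are invented for the example.
VFIN = {"Ind", "Imp", "Cnd"}

def select(readings, tagset):
    """Keep only readings carrying a tag from tagset; if none match, keep all."""
    matching = [r for r in readings if tagset & set(r.split("+")[1:])]
    return matching if matching else list(readings)

cohort = ["lemma+V+Ind+Prs+Sg3", "lemma+N+Sg+Nom"]  # hypothetical readings
print(select(cohort, VFIN))
```

When no reading matches, the cohort is left untouched, so the rule can never remove the last reading of a word.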
$GTHOME/core/
[../images/newinfra.png]
We support a lot of different tools and targets, but in most cases one only
wants a handful of them. When running ./configure, you get a summary of the
things that are turned on and off at the end:
$ ./configure --with-hfst
[...]
-- Building giella-crk 20110617:
-- Fst build tools: Xerox, Hfst or Foma - at least one must be installed
-- Xerox is default on, the others off unless they are the only one present --
* build Xerox fst's: yes
* build HFST fst's: yes
* build Foma fst's: no
-- basic packages (on by default): --
* analysers enabled: yes
* generators enabled: yes
* transcriptors enabled: yes
* syntactic tools enabled: yes
* yaml tests enabled: yes
* generated documentation enabled: yes
-- proofing tools (off by default): --
* spellers enabled: no
* hfst speller fst's enabled: no
* foma speller enabled: no
* hunspell generation enabled: no
* fst hyphenator enabled: no
* grammar checker enabled: no
-- specialised fst's (off by default): --
* phonetic/IPA conversion enabled: no
* dictionary fst's enabled: no
* Oahpa transducers enabled: no
* L2 analyser: no
* downcase error analyser: no
* Apertium transducers enabled: no
* Generate abbr.txt: no
For more ./configure options, run ./configure --help
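If you want to check the summary from a script, a small parser is enough. The following is a minimal Python sketch, assuming that every option line has the form "* name: yes" or "* name: no" as in the output above; the function name parse_summary is made up for this example and is not part of the infrastructure.

```python
import re

# Assumption: summary lines look like "* name: yes" / "* name: no".
SUMMARY_LINE = re.compile(r"^\s*\*\s*(?P<name>.+?):\s*(?P<state>yes|no)\s*$")

def parse_summary(text):
    """Map each option name in a configure summary to True (yes) or False (no)."""
    options = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            options[match.group("name")] = match.group("state") == "yes"
    return options

sample = """\
-- basic packages (on by default): --
* analysers enabled: yes
* generators enabled: yes
-- proofing tools (off by default): --
* spellers enabled: no
"""
enabled = [name for name, on in parse_summary(sample).items() if on]
print(enabled)
```

Section headers such as "-- proofing tools (off by default): --" do not start with an asterisk and are skipped.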
[../images/new_infra_build_overview.png]
* Documentation
* Testing
* From Source To Final Tool:
** Relation Between Lexicon, Build And Speller
Example cases:
Documentation:
All automated testing done within the infrastructure is based on the testing facilities provided by Autotools.
All tests are run with a single command:
make check
Autotools gives a PASS or FAIL on each test as it finishes:
[../images/make_check_output.png]
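With many tests it can be handy to pull only the failures out of a saved log. A minimal Python sketch, assuming that each per-test result line has the form "PASS: name" or "FAIL: name"; the test names in the sample log are hypothetical.

```python
import re

# Assumption: per-test result lines have the form "PASS: name" or "FAIL: name".
RESULT = re.compile(r"^(PASS|FAIL): (.+)$")

def failed_tests(log_text):
    """Return the names of all tests reported as FAIL in a make check log."""
    failed = []
    for line in log_text.splitlines():
        match = RESULT.match(line.strip())
        if match and match.group(1) == "FAIL":
            failed.append(match.group(2))
    return failed

log = "PASS: first-test.sh\nFAIL: second-test.sh\nPASS: third-test.sh\n"  # hypothetical names
print(failed_tests(log))
```

Lines that do not start with PASS or FAIL, such as summary counts, are ignored.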
The yaml tests are the most used tests, and are named after the syntax of the test files. The core syntax is:
Config:
  hfst:
    Gen: ../../../src/generator-gt-norm.hfst
    Morph: ../../../src/analyser-gt-norm.hfst
  xerox:
    Gen: ../../../src/generator-gt-norm.xfst
    Morph: ../../../src/analyser-gt-norm.xfst
    App: lookup
Tests:
  Noun - mihkw - ok : # -m inanimate noun, blood, Wolvengrey
    mihko+N+IN+Sg: mihko
    mihko+N+IN+Sg+Px1Sg: nimihkom
    mihko+N+IN+Sg+Px2Sg: kimihkom
    mihko+N+IN+Sg+Px1Pl: nimihkominân
    mihko+N+IN+Sg+Px12Pl: kimihkominaw
    mihko+N+IN+Sg+Px2Pl: kimihkomiwâw
    mihko+N+IN+Sg+Px3Sg: omihkom
    mihko+N+IN+Sg+Px3Pl: omihkomiwâw
    mihko+N+IN+Sg+Px4Pl: omihkomiyiw
[../images/make_check_output.png]
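The idea behind the analysis/form pairs in the yaml tests can be sketched in a few lines of Python. This is not the actual test runner: the function run_generation_tests is invented for illustration, and toy_generate is a hypothetical stand-in for a lookup in the generator fst.

```python
# Sketch of how a generation test could be checked; not the actual test runner.
# Assumption: `generate` stands in for a lookup in the generator fst.

def run_generation_tests(tests, generate):
    """Compare expected surface forms against generator output, pair by pair."""
    results = []
    for analysis, expected in tests.items():
        expected_forms = expected if isinstance(expected, list) else [expected]
        produced = generate(analysis)
        status = "PASS" if set(expected_forms) <= set(produced) else "FAIL"
        results.append((status, analysis))
    return results

tests = {
    "mihko+N+IN+Sg": "mihko",
    "mihko+N+IN+Sg+Px1Sg": "nimihkom",
}

# Hypothetical stand-in for the fst: knows only the bare singular.
def toy_generate(analysis):
    return {"mihko+N+IN+Sg": ["mihko"]}.get(analysis, [])

for status, analysis in run_generation_tests(tests, toy_generate):
    print(f"{status}: {analysis}")
```

In this sketch a test passes when every expected form is among the generated forms; a stricter check would also fail on unexpected extra forms.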
As an alternative to the yaml tests, one can specify similar test data within the source files:
LEXICON MUORRA !!= @CODE@ Standard even stems with cg (note Q1). OBS: Nouns with invisible 3>2 cg (as bus'sa) go to this lexicon.
+N: MUORRAInfl ;
+N:%> MUORRACmp ;
## €gt-norm: kárta # Even-syllable test
## € kártta: kártta+N+Sg+Nom
## € kártajn: kártta+N+Sg+Com
Such tests are useful as checks that an inflectional lexicon behaves as it should.
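To show how such in-source test lines can be collected, here is a minimal Python sketch. It assumes, based only on the example above, that a header line has the form "## €fst: lemma" (no space after the € sign) and that each test line has the form "## € form: analysis"; the function name extract_tests is made up for this example.

```python
import re

# Assumption: a header line is "## €<fst>: <lemma>" (no space after €),
# and each test line is "## € <surface form>: <analysis>".
HEADER = re.compile(r"^## €(\S+):\s*(\S+)")
PAIR = re.compile(r"^## € (\S+):\s*(\S+)")

def extract_tests(source):
    """Collect (fst, lemma, [(form, analysis), ...]) groups from lexc-style text."""
    groups = []
    for line in source.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        pair = PAIR.match(line)
        if header:
            groups.append((header.group(1), header.group(2), []))
        elif pair and groups:
            groups[-1][2].append((pair.group(1), pair.group(2)))
    return groups

source = """\
+N: MUORRAInfl ;
## €gt-norm: kárta # Even-syllable test
## € kártta: kártta+N+Sg+Nom
## € kártajn: kártta+N+Sg+Com
"""
print(extract_tests(source))
```

Ordinary lexc lines are ignored, so the whole source file can be passed in unchanged.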
The syntax is slightly different from the yaml files:
The twolc tests look like the following:
## € iemed9#
## € iemet#
## € gål'leX7tj#
## € gål0lå0sj#
The point is to ensure that the rules behave as they should.
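The test lines above come in twos. A minimal Python sketch for reading them, assuming that the first line of each pair is the string given to the rules and the second is the string they should produce; the helper name read_pairs is hypothetical.

```python
# Assumption: test lines alternate between the rule input and the expected output.
PREFIX = "## € "

def read_pairs(text):
    """Group consecutive test lines into (first, second) pairs."""
    strings = [
        line[len(PREFIX):].strip()
        for line in text.splitlines()
        if line.startswith(PREFIX)
    ]
    if len(strings) % 2:
        raise ValueError("test lines must come in pairs")
    return list(zip(strings[0::2], strings[1::2]))

sample = """\
## € iemed9#
## € iemet#
## € gål'leX7tj#
## € gål0lå0sj#
"""
for first, second in read_pairs(sample):
    print(first, "->", second)
```

An odd number of test lines raises an error, since a pair with a missing half cannot be checked.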
You can write any test you want, using your favourite programming language. There are a number of shell scripts to test speller functionality, and more tests will be added as the infrastructure develops.
We use certain tag conventions in the infrastructure:
* +Err/... (+Err/Orth, +Err/Cmp)
* +Sem/...
* a tag starting with +Err/ or +Sem/ gets filters for different purposes automatically:
** +Err/... tags
** +Err/... strings
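Filtering on a tag prefix can work in two ways, which the following Python sketch illustrates on plain strings. It assumes that one kind of filter strips the tags from an analysis and another drops whole strings carrying such a tag; that reading is an assumption, and in the infrastructure the filters are fst's rather than Python functions. The function names and sample analyses are invented for the example.

```python
import re

# Assumption: tags look like "+Err/Orth" and end at the next "+" or at the end.
def remove_tags(analysis, prefix):
    """Strip every tag starting with prefix from one analysis string."""
    return re.sub(re.escape(prefix) + r"[^+]*", "", analysis)

def remove_strings(analyses, prefix):
    """Drop every analysis string that contains a tag starting with prefix."""
    return [a for a in analyses if prefix not in a]

analyses = ["word+N+Err/Orth+Sg+Nom", "word+N+Sg+Nom"]  # hypothetical analyses
print(remove_tags(analyses[0], "+Err/"))
print(remove_strings(analyses, "+Err/"))
```

Because both functions take only the prefix, a new tag such as +Err/Cmp is covered without any extra code, which is the point of the naming convention.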