Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Present building

  1. spellernonrec
  2. plxnonrecder
    1. ` %» %» %» ` - derivations that are lexicalised?
  3. plxnonrec = ( spellernonrec - plxnonrecder ) .o. remove-hyphen
  4. POS-specific fsts (spellerPOS > spellerPOS-plx)
    1. spellermwe - text file with multi-word PLX entries
      1. spellerverbs.fst - used by both PLX and Hunspell
      2. spellerverbs.txt - PLX variant is made with PLX tags, Hunspell variant is just a wordlist without hyphens
    2. spellernouns. … (see verbs)
    3. spelleradjs. … (see verbs)
    4. spellerabbrs. … (see verbs) - rename fst to others
    5. spellerproper. … (see verbs)
    6. spellernums. This is unioned with spellernouns.fst
  5. concatenate the POS-specific txt files (spellerverbs.txt and its counterparts from step 4 above)
  6. build final speller files:
    1. Hunspell - two variants, with and without compounding
    2. PLX

In the POS build targets, abbr stands for all other parts of speech.
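
Steps 3 to 5 above can be illustrated with a small sketch. It is not the actual build, which operates on fst files: here each fst is modelled as a plain finite set of surface strings, subtraction is set difference, and the `remove-hyphen` composition is approximated by deleting hyphens from every string. The function names and the toy English entries are hypothetical and chosen only for illustration.

```python
from __future__ import annotations


def remove_hyphen(word: str) -> str:
    """Stand-in for the remove-hyphen filter: delete every hyphen."""
    return word.replace("-", "")


def build_plxnonrec(spellernonrec: set[str], plxnonrecder: set[str]) -> set[str]:
    """Step 3: ( spellernonrec - plxnonrecder ) .o. remove-hyphen, on string sets."""
    return {remove_hyphen(w) for w in spellernonrec - plxnonrecder}


def hunspell_wordlist(words: list[str]) -> list[str]:
    """Step 4: the Hunspell variant is just a wordlist without hyphens."""
    return [remove_hyphen(w) for w in words]


def concatenate_wordlists(wordlists: list[list[str]]) -> list[str]:
    """Step 5: concatenate the per-POS txt lists in the order given."""
    merged: list[str] = []
    for wordlist in wordlists:
        merged.extend(wordlist)
    return merged


if __name__ == "__main__":
    # Toy data (assumption): not taken from any real lexicon.
    spellernonrec = {"boat", "fish-boat", "boating"}
    plxnonrecder = {"boating"}
    print(sorted(build_plxnonrec(spellernonrec, plxnonrecder)))
    # prints ['boat', 'fishboat']

    verbs = hunspell_wordlist(["run", "re-run"])
    nouns = hunspell_wordlist(["boat"])
    print(concatenate_wordlists([verbs, nouns]))
    # prints ['run', 'rerun', 'boat']
```

In the real build the subtraction and composition are carried out by the fst toolchain, so the sketch only shows the intended order of operations, not the implementation.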

New dir layout

tools/spellcheckers/listbased/          <= build common hunspell/plx files here
                              hunspell/ <= build final hunspell here
                              plx/      <= build final plx here
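
As a quick way to try out the layout above, the following sketch creates the three directories under an arbitrary root. The helper name `create_layout` and the use of a temporary directory are assumptions made for the example; the real directories would be created inside the language repository by the build infrastructure.

```python
from pathlib import Path
import tempfile

# Directories from the proposed layout, relative to the language root.
LAYOUT = [
    "tools/spellcheckers/listbased",           # common hunspell/plx files
    "tools/spellcheckers/listbased/hunspell",  # final hunspell
    "tools/spellcheckers/listbased/plx",       # final plx
]


def create_layout(root: Path) -> list[Path]:
    """Create the proposed directories under root and return their paths."""
    created = []
    for rel in LAYOUT:
        directory = root / rel
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created


if __name__ == "__main__":
    # A throwaway root (assumption), so nothing in a real checkout is touched.
    with tempfile.TemporaryDirectory() as tmp:
        for directory in create_layout(Path(tmp)):
            print(directory.relative_to(tmp))
```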

Targets for each dir above:

Work plan

  1. make targets for listbased/
  2. make targets for hunspell/
  3. add tests
  4. decide whether to allow PLX for all languages, or only SMA, SME, SMJ
  5. depending on the previous choice, integrate PLX building in a separate template or in the UND template