Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Plan for common conversion from LexC to speller engines of Aspell type

Three different speller engines

Common features and properties:

Because of the similarities:

Output format

The output format varies according to engine, but is basically a full-form word list that can be processed for compression. In Aspell, this processing is called munching. Polderland has a similar processing step with the same goal, but we have no name for it yet.
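To make the idea of compressing a full-form word list concrete, here is a toy sketch in Python. It is not Aspell's actual munch algorithm: the flag table, the function name `munch` and the English sample words are assumptions made for illustration, and only plain suffixing is handled. The `base/FLAGS` notation follows the convention used in Aspell-style word lists.

```python
def munch(words, suffix_flags):
    """Fold 'base + suffix' forms into 'base/FLAGS' entries (toy version).

    words        -- iterable of full word forms
    suffix_flags -- hypothetical affix table: flag letter -> suffix string
    """
    wordset = set(words)
    flags_for = {}   # base word -> flags whose suffixed form is in the list
    covered = set()  # forms that some base word plus a flag regenerates
    for base in sorted(wordset):
        for flag, suffix in sorted(suffix_flags.items()):
            derived = base + suffix
            if suffix and derived in wordset:
                flags_for.setdefault(base, []).append(flag)
                covered.add(derived)
    entries = []
    for word in sorted(wordset):
        if word in flags_for:
            entries.append(word + "/" + "".join(flags_for[word]))
        elif word not in covered:
            entries.append(word)  # no rule regenerates it, keep as is
    return entries


if __name__ == "__main__":
    sample = ["walk", "walks", "walked", "talk", "talks"]
    print(munch(sample, {"S": "s", "D": "ed"}))
```

Expanding every entry with its flags gives back the original list, which is the property any real compression step has to preserve.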

Information to be added:

Pseudocode:

  1. closed POS: create a transducer containing all and only the closed parts of speech (everything other than NAVAdv), and xfst:print; convert to desired format
  2. NAVAdv: For each word:
    1. read one line from the lexicon files, including Comp and Style comments
    2. generate full paradigm, and all compounding forms
    3. filter the resulting word form list against any Comp and Style restrictions
    4. add the Comp and Style restrictions to the relevant wordforms (all for Style, 5 for Comp)
    5. output in the desired format
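The NAVAdv loop above can be sketched in Python as follows. This is only an outline under several assumptions: the `Comp=`/`Style=` comment syntax, the continuation lexicon name `N_EXAMPLE`, the restriction value `NoGen`, and the tab-separated output are all hypothetical, and paradigm generation is passed in as a function because in practice it would be done by the transducer. The sketch also assumes that Comp restrictions attach only to compounding forms (tags containing `Cmp`), while Style attaches to every form.

```python
import re
from typing import Callable, Iterable, List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    lemma: str
    comp: Optional[str]   # hypothetical Comp restriction from the comment
    style: Optional[str]  # hypothetical Style label from the comment


def parse_line(line: str) -> Optional[Entry]:
    """Step 1: read one lexicon line, including Comp and Style comments."""
    code, _, comment = line.partition("!")  # '!' starts a lexc comment
    fields = code.split()
    if len(fields) < 2 or not code.rstrip().endswith(";"):
        return None  # not an entry line (e.g. a LEXICON header)
    lemma = re.split(r"[:+]", fields[0])[0]
    comp = re.search(r"Comp=(\S+)", comment)
    style = re.search(r"Style=(\S+)", comment)
    return Entry(lemma,
                 comp.group(1) if comp else None,
                 style.group(1) if style else None)


def convert(lines: Iterable[str],
            generate: Callable[[str], List[Tuple[str, str]]],
            blocked: Callable[[Entry, str], bool]) -> List[str]:
    """Steps 2-5 for every entry; generate() yields (tags, form) pairs."""
    out = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        for tags, form in generate(entry.lemma):       # step 2
            if blocked(entry, tags):                   # step 3
                continue
            notes = []
            if entry.style:                            # step 4: all forms
                notes.append("Style=" + entry.style)
            if entry.comp and "Cmp" in tags:           # step 4: Cmp forms only
                notes.append("Comp=" + entry.comp)
            out.append("\t".join([form] + notes))      # step 5
    return out


def demo_generate(lemma: str) -> List[Tuple[str, str]]:
    # Stand-in for the transducer; stems read off the sealgi lookup output.
    return [("N+Sg+Nom", "sealgi"), ("N+SgNomCmp", "sealge"),
            ("N+SgGenCmp", "sealgge"), ("N+PlGenCmp", "selggiid")]


def demo_blocked(entry: Entry, tags: str) -> bool:
    # Hypothetical rule: "NoGen" forbids genitive compounding stems.
    return entry.comp == "NoGen" and "GenCmp" in tags


if __name__ == "__main__":
    lexicon = ["LEXICON Nouns", "sealgi N_EXAMPLE ; ! Comp=NoGen Style=Sub"]
    for row in convert(lexicon, demo_generate, demo_blocked):
        print(row)
```

Passing `generate` and `blocked` in as parameters keeps the conversion logic independent of the engine-specific output format, which would replace the final `join`.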

Implementation points to consider:

WHO??? Candidates: Saara, Tomi

The following output was generated to try out different strategies for generating the compounding stem.

hum-tf4-ans157:~ trond$ lookup -flags mbTT -utf8 gt/sme/bin/isme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
eadni+N+SgNomCmp#giella+N+Sg+Nom
eadni+N+SgNomCmp#giella+N+Sg+Nom        eadnegiella
eadni+N+SgNomCmp#giella+N+Sg+Nom        eadnegiella
eadni+N+SgNomCmp#giella+N+Sg+Nom        eadnegiella
eadni+N+SgNomCmp#giella+N+Sg+Nom        eadnegiella
eadni+N+SgNomCmp#giella+N+Sg+Nom        eadnegiella
eadni+N+SgNomCmp#giella+N+Sg+Nom        eadnegiella
eadni+N+SgNomCmp#giella+N+Sg+Nom        eadne-giella
eadni+N+SgNomCmp#giella+N+Sg+Nom        eadne-giella

eadni+N+SgNomCmp
eadni+N+SgNomCmp        eadni+N+SgNomCmp        +?

Trond's version:

sealgi+N+SgCmp#eadni+N+Sg+Nom
sealgi+N+SgCmp#eadni+N+Sg+Nom   sealeadni
sealgi+N+SgCmp#eadni+N+Sg+Nom   seal-eadni
sealgi+N+SgCmp#eadni+N+Sg+Nom   sealgeadni
sealgi+N+SgCmp#eadni+N+Sg+Nom   sealg-eadni
sealgi+N+SgCmp#eadni+N+Sg+Nom   sealggeadni
sealgi+N+SgCmp#eadni+N+Sg+Nom   sealgg-eadni

sealgi+N+SgNomCmp#eadni+N+Sg+Nom
sealgi+N+SgNomCmp#eadni+N+Sg+Nom        sealgeeadni
sealgi+N+SgNomCmp#eadni+N+Sg+Nom        sealge-eadni
sealgi+N+SgNomCmp#eadni+N+Sg+Nom        sealgieadni    <==== ?
sealgi+N+SgNomCmp#eadni+N+Sg+Nom        sealgi-eadni   <==== ?

sealgi+N+SgGenCmp#eadni+N+Sg+Nom
sealgi+N+SgGenCmp#eadni+N+Sg+Nom        sealggeeadni
sealgi+N+SgGenCmp#eadni+N+Sg+Nom        sealgge-eadni
sealgi+N+SgGenCmp#eadni+N+Sg+Nom        sealggieadni
sealgi+N+SgGenCmp#eadni+N+Sg+Nom        sealggi-eadni

sealgi+N+PlGenCmp#eadni+N+Sg+Nom
sealgi+N+PlGenCmp#eadni+N+Sg+Nom        selggiideadni
sealgi+N+PlGenCmp#eadni+N+Sg+Nom        selggiid-eadni

dušši+N+SgNomCmp#eadni+N+Sg+Nom
dušši+N+SgNomCmp#eadni+N+Sg+Nom duššeadni
dušši+N+SgNomCmp#eadni+N+Sg+Nom dušši-eadni
dušši+N+SgNomCmp#eadni+N+Sg+Nom duššieadni
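
To compare strategies, lookup output like the above can be collapsed into one list of distinct surface forms per analysis string. The Python sketch below assumes the layout seen in this output: whitespace-separated fields, the analysis string (containing `+`) first, the surface form second, and `+?` as the last field when generation fails. The function name `parse_lookup` is made up for this example.

```python
def parse_lookup(text):
    """Map each analysis string to its distinct surface forms, in order."""
    forms = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blanks, echoed input lines, the shell prompt and progress bar.
        if len(fields) < 2 or "+" not in fields[0]:
            continue
        if fields[-1] == "+?":  # lookup found no surface form
            continue
        analysis, surface = fields[0], fields[1]
        bucket = forms.setdefault(analysis, [])
        if surface not in bucket:  # drop repeated identical results
            bucket.append(surface)
    return forms


if __name__ == "__main__":
    sample = """\
eadni+N+SgNomCmp#giella+N+Sg+Nom
eadni+N+SgNomCmp#giella+N+Sg+Nom        eadnegiella
eadni+N+SgNomCmp#giella+N+Sg+Nom        eadnegiella
eadni+N+SgNomCmp#giella+N+Sg+Nom        eadne-giella

eadni+N+SgNomCmp
eadni+N+SgNomCmp        eadni+N+SgNomCmp        +?
"""
    for analysis, surfaces in parse_lookup(sample).items():
        print(analysis, surfaces)
```

Trailing annotations such as `<==== ?` are ignored because only the first two fields are read.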