Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

View GiellaLT on GitHub divvungiellatekno/giellalt.uit.no

Page Content

PLX Meeting

Participants: Tomi, Thomas, Sjur

New infra and PLX

Should we move smj and sma to new infra?

Benefits:

Risks and drawbacks:

Risk mitigation: we don’t move sme, so if we can’t get PLX conversion in the new infra to work properly, we just won’t update smj and sma now. They will have to wait a bit, after Inga Lill and Maja has worked more on the material.

Issues in PLX - overview

Status: Thomas has marked some derivations +Use/-Spell, and instead lexicalized:

+Der/t
+Der/ár
+Der/huhtti
+Der/huvva
+Der/stuvva
+Der/viđá
+Der/supmi

+Der/veara is marked +Err/Sub and a closed class of words now instead makes two word phrases with Po/Attr “veara”

Next aim is to have a look at LEXICON NAMAT as well to see if the similar trimming can be done here.

Tomi has started a new sma speller compilation today - filter compilation takes a lot of time the first time.

Main issue: compounding does not work as intended for a number of constructions.

Additional issue: some word forms are not accepted in the regular speller, even though they are in the test speller. We try to correct this by making the full speller not so huge by restricting some processes (mainly derivations, see above), and by modelling some derivations as PLX compounds instead of derivations (= generated full forms in the PLX format, e.g. -vuohta).

Vuohta

-vuohta can be added to all adjectives, but not all nouns. Further compounding of vuohta- needs to use the genitive form -vuođa-.

But when vuohta is modelled as a compound, the PLX flags of the adjective will influence the total compound, and vuohta won’t behave as specified.

We already have 1039 words on -vuohta in our lexicon, ie lexicalised derivations, 353 compounds with -vuođa-. Is this enough to cover the most frequent cases of -vuohta, such that we can turn off -vuohta altogether in the speller (and lexicalise the most frequent missing ones)?

We remove -vuohta as a derivation completely from the speller, and instead lexicalise the most frequent ones.

Plan forward to a new release

  1. check if we can get a full speller that behaves as the test speller
  2. if they behave the same, use the PLX conversion test bench to work out the remaining issues
  3. when done, release