The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
Participants: Tomi, Thomas, Sjur
Should we move smj and sma to new infra?
Benefits:
Risks and drawbacks:
Risk mitigation: we don’t move sme, so if we can’t get PLX conversion in the new infra to work properly, we just won’t update smj and sma now. They will have to wait a bit, until Inga Lill and Maja have worked more on the material.
Status: Thomas has marked some derivations +Use/-Spell and lexicalised them instead:
+Der/t
+Der/ár
+Der/huhtti
+Der/huvva
+Der/stuvva
+Der/viđá
+Der/supmi
+Der/veara is marked +Err/Sub; a closed class of words now forms two-word phrases with the Po/Attr “veara” instead.
The next aim is to look at LEXICON NAMAT as well, to see whether similar trimming can be done there.
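The effect of the +Use/-Spell marking can be illustrated on plain analysis strings. This is only a sketch: in the real infrastructure the filtering happens at the FST level when the speller is compiled, and the function names and sample strings below are hypothetical, not taken from the lexicon.

```python
# Illustration only: the real filtering is done on the FST, not on strings.
SPELLER_EXCLUDE_TAG = "+Use/-Spell"


def keep_for_speller(analysis: str) -> bool:
    """Return True if the analysis string does not carry the exclusion tag."""
    return SPELLER_EXCLUDE_TAG not in analysis


def filter_for_speller(analyses):
    """Keep only the analyses that would remain in the speller."""
    return [a for a in analyses if keep_for_speller(a)]


# Hypothetical analysis strings; "stem" is a placeholder, not a real lemma.
sample = [
    "stem+N+Sg+Nom",
    "stem+V+Der/huhtti+Use/-Spell+V+Inf",
    "stem+A+Der/supmi+Use/-Spell+N+Sg+Nom",
]

kept = filter_for_speller(sample)
print(kept)
```

Only the first sample string survives, since the other two carry the exclusion tag. Lexicalised forms would enter the speller as ordinary lexicon entries without that tag.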
Tomi has started a new sma speller compilation today; filter compilation takes a long time on the first run.
Main issue: compounding does not work as intended for a number of constructions.
Additional issue: some word forms are not accepted in the regular speller, even though they are accepted in the test speller. We are trying to fix this by reducing the size of the full speller: restricting some processes (mainly derivations, see above), and modelling some derivations as PLX compounds instead of derivations (= generated full forms in the PLX format, e.g. -vuohta).
-vuohta can be added to all adjectives, but not to all nouns. Further compounding of -vuohta words needs to use the genitive form -vuođa-.
But when vuohta is modelled as a compound, the PLX flags of the adjective will influence the total compound, and vuohta won’t behave as specified.
We already have 1039 words ending in -vuohta in our lexicon, i.e. lexicalised derivations, and 353 compounds with -vuođa-. Is this enough to cover the most frequent cases of -vuohta, such that we can turn off -vuohta altogether in the speller (and lexicalise the most frequent missing ones)?
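One way to approach the coverage question is to compare corpus frequencies with the lexicalised entries. The sketch below assumes a lemma frequency list from some corpus and a set of lexicalised -vuohta lemmas are available; the function name, the input format and the sample data are all hypothetical.

```python
from collections import Counter


def vuohta_coverage(freqs, lexicalised, suffix="vuohta"):
    """Return (covered share of tokens, missing lemmas sorted by frequency).

    freqs: mapping lemma -> corpus frequency (assumed input format).
    lexicalised: set of lemmas already in the lexicon.
    """
    total = 0
    covered = 0
    missing = Counter()
    for lemma, count in freqs.items():
        if not lemma.endswith(suffix):
            continue  # only words ending in the suffix are relevant
        total += count
        if lemma in lexicalised:
            covered += count
        else:
            missing[lemma] += count
    share = covered / total if total else 0.0
    return share, missing.most_common()


# Placeholder data, not real corpus counts or real lemmas.
freqs = {"avuohta": 90, "bvuohta": 10, "other": 500}
lexicalised = {"avuohta"}

share, missing = vuohta_coverage(freqs, lexicalised)
print(share, missing)
```

The missing list, sorted by frequency, gives the candidates to lexicalise first if the derivation is turned off in the speller.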
We remove -vuohta as a derivation completely from the speller, and instead lexicalise the most frequent ones.