Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

Cf. one of the following, depending on context:

Opening, agenda review, participants

Sma issues

Need to correct dictionary against speller

Are there wrong dictionary forms in the speller? The smanob dictionary contains misspellings, and many of them are also found in the speller. Check that misspellings in the dictionary are not replicated in the speller.

The fst and the dictionary show different Sg3 forms (the Sg3 forms in the dictionary, marked p3p, are taken from the Hemnes dictionary). The p3p forms missing in the fst have now been manually marked with XXX in the dictionary.
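The comparison described above could be automated roughly as follows. This is a minimal sketch, not the procedure actually used: it assumes the p3p forms have already been extracted from the dictionary into a lemma-to-form mapping and that the fst output is available as a lemma-to-forms mapping. The function name `compare_sg3`, both mappings, and the placeholder strings are hypothetical.

```python
def compare_sg3(dict_forms, fst_forms):
    """Compare dictionary Sg3 (p3p) forms with forms generated by the fst.

    dict_forms: lemma -> Sg3 form given in the dictionary (assumed input)
    fst_forms:  lemma -> set of Sg3 forms the fst generates (assumed input)
    Returns lemma -> "missing in fst" or "differs" for every problem case.
    """
    report = {}
    for lemma, form in dict_forms.items():
        generated = fst_forms.get(lemma)
        if not generated:
            # The fst has no Sg3 form for this lemma at all.
            report[lemma] = "missing in fst"
        elif form not in generated:
            # Both sources have a form, but they disagree.
            report[lemma] = "differs"
    return report


if __name__ == "__main__":
    # Placeholder strings, not real sma data.
    dictionary = {"lemma1": "form1", "lemma2": "form2", "lemma3": "form3"}
    analyser = {"lemma1": {"form1"}, "lemma2": {"other"}}
    for lemma, status in sorted(compare_sg3(dictionary, analyser).items()):
        print(lemma, status)
```

The "missing in fst" cases correspond to the entries that were marked manually with XXX. The "differs" cases still need a human decision on which source is right.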

The XXX marks can mean:

Status quo for normativity

The letter on loan words was sent to SGM on Friday. We also sent a letter in June-July last year. Are there any unresolved normativity issues now? Not that we are aware of.

How shall Divvun and Giellatekno cooperate on the sma work for the rest of this year?

Lene has a list of dictionary verbs not in the analyser. During this work some issues have turned up:

Way of work:

  1. Read through smanob and look for errors (adj, noun)
  2. Check whether these errors are in the normative analyser
  3. Mark problematic words as non-includes
  4. Reverse-sort the fst lexicon and look at the continuation lexicon
  5. Unify lemma forms for all fst entries (i.e. multiple entries for the same word should have the same base form/lemma)
  6. Sort and unique the fst entries: duplicate forms in the sma lexicon shall be unified
  7. Learn typical error patterns from the dictionary correction work (-e vs. -ie as stems, -a- vs. -aa- or -ie- vs. -ïe- as root vowel), grep similar patterns out of the fst code, and proofread them
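Steps 4 and 6 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: entries are taken to be simplified one-line lexc entries of the form `lemma CONTLEX ;` or `lemma:stem CONTLEX ;`, and the helper names and the continuation lexicon names in the usage part are hypothetical.

```python
def parse_entry(line):
    """Return (lemma, continuation lexicon) for a simplified lexc entry.

    Assumes one entry per line, 'lemma CONTLEX ;' or 'lemma:stem CONTLEX ;'.
    Anything else (LEXICON headers, blank lines, comments) yields None.
    """
    body = line.split("!", 1)[0].strip()  # drop trailing lexc comments
    if not body.endswith(";"):
        return None
    parts = body[:-1].split()
    if len(parts) != 2:
        return None
    lemma = parts[0].split(":", 1)[0]
    return lemma, parts[1]


def reverse_sorted_entries(lines):
    """Remove duplicate entries, then sort by the reversed lemma.

    Sorting on the reversed spelling places words with the same ending
    next to each other, which makes it easier to see whether they also
    share a continuation lexicon.
    """
    entries = {e for e in map(parse_entry, lines) if e is not None}
    return sorted(entries, key=lambda e: (e[0][::-1], e[1]))


if __name__ == "__main__":
    sample = [
        "LEXICON Nouns",
        "gåetie:gået N_A ;",
        "aehtjie N_B ;",
        "gåetie:gået N_A ; ! duplicate entry",
    ]
    for lemma, contlex in reverse_sorted_entries(sample):
        print(lemma, contlex)
```

The same word appearing with two different continuation lexicons survives the deduplication on purpose, since such pairs are candidates for manual unification.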

The three i's of sma: is this true?

  1. i as i
  2. ï as i and ï (klinïgke / klinigke)
  3. ï shall not be i (but is made so via spellrelax)

Before we implement it we shall find out whether the second category exists.
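One way to find out whether the second category exists is to look for word pairs in a word list that differ only in i versus ï. The sketch below assumes a plain list of word forms as input (for instance lemmas pulled from the dictionary or the lexicon). The function name `i_variants` is hypothetical.

```python
import unicodedata
from collections import defaultdict


def i_variants(words):
    """Group words whose spellings differ only in 'i' versus 'ï'.

    Returns a dict mapping the i-only spelling to the sorted list of
    attested spellings, restricted to groups with more than one spelling.
    """
    groups = defaultdict(set)
    for word in words:
        # Normalise so that 'i' + combining diaeresis equals precomposed 'ï'.
        word = unicodedata.normalize("NFC", word)
        groups[word.replace("ï", "i")].add(word)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


if __name__ == "__main__":
    wordlist = ["klinïgke", "klinigke", "gïele"]
    for key, forms in i_variants(wordlist).items():
        print(key, forms)
```

An empty result on a large enough word list would suggest that the second category is not attested. Any hits still need manual checking, because one member of a pair may simply be a misspelling.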

TODO:

Forthcoming finsme dictionary

We do not know exactly when it will arrive, but what shall we do with it when it does? The dictionary is made by KOTUS (fin-sme-fin).

If there is money: could Johanna Ijäs do something with this?

Info topics

A new programmer in the sma-oahpa project

Ryan for 2 months.

Cip's trip to Iceland

Here are my very minimal notes on tools:

The amount of work needed to install GLOSSA here is considerable, yes, but it will make updating the data easier. We might also find it easier to tailor the interface to our needs.

The OBT uses the Oslo full-form list for morphological analysis, with a separate component for compound analysis. For syntax, they use CG.

In Bergen they are going to implement a separate corpus interface, Corpuscle.

Summing up the sma-oahpa week

Whitespace and empty element diffs:

-              <book name="s4" />
+               <book name="s4"/>
-      <stem class="trisyllabic"></stem>
+      <stem class="trisyllabic"/>
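Diffs like these carry no content change. One possible way to keep them out of comparisons is to canonicalise both versions before diffing. The sketch below uses `xml.etree.ElementTree.canonicalize` from the Python standard library (Python 3.8 or later). The wrapper name `same_xml_content` and the sample root element `r` are hypothetical.

```python
from xml.etree.ElementTree import canonicalize


def same_xml_content(old_xml, new_xml):
    """Return True if two XML strings differ only in serialisation details.

    Canonical XML writes every empty element as a start/end tag pair, so
    <stem/> and <stem></stem> come out identical. strip_text=True also
    removes leading and trailing whitespace in text content, which hides
    pure indentation changes.
    """
    return (canonicalize(old_xml, strip_text=True)
            == canonicalize(new_xml, strip_text=True))


if __name__ == "__main__":
    old = '<r>\n  <book name="s4" />\n  <stem class="trisyllabic"></stem>\n</r>'
    new = '<r>\n   <book name="s4"/>\n  <stem class="trisyllabic"/>\n</r>'
    print(same_xml_content(old, new))
```

Note that `strip_text=True` would also hide whitespace changes inside real text content, so it is only suitable where such whitespace is not significant.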

TODO: