Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Korp is a corpus tool and Karp is a lexicon tool, both from the Swedish Språkbanken. We want to install both of them locally.

Links

The Korp code:

Links to the Karp code are forthcoming.

Work plan

Corpora available

Corpus mixes

Interface

Menu:

  1. search for sme wordforms (kwic-snt in corpus ccat) – corpus: smesme
  2. search for sme lemmas (kwic-snt? in analysed corpus syn) – corpus choices: smesme, nob2sme
  3. search for sme and nob in translations (lemma search in sentence aligned sentences) – corpus: nob2sme
  4. deepdict sme (lemma search -> dependency daughters in corpus dep) – corpus: smedep

Lemgram

Definitions

Generation

Generation of lemgrams from lexc:

Use dict-isme-norm.fst, generator-dict-gt-norm.xfst, or generator-dict-gt-norm.hfst. We remove the variant tags (v1, v2, ...) from the fst, because it is better for the user that all variants of the same paradigm end up in the same lemgram. Many fst lemmas have more than one entry in lexc, so the list should be deduplicated before generating forms. I suggest that we start with these files:

For nouns, we pick three different lists: the ordinary nouns, the actor nouns (NomAg), and the G3-marked nouns. For the other parts of speech, one command is enough. Commands to separate relevant from irrelevant forms:

noun-sme-lex.txt:

* Ordinary words:

grep -Ev '(G3|ACTOR|CmpN/Only|ShCmp|RCmpnd|\+V\+|^!)' noun-sme-lex.txt
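
The filtering and deduplication steps described above can be sketched as a small shell pipeline. This is only an illustration under stated assumptions: the function names `ordinary_nouns` and `unique_lemmas`, the sample entries, and the `+v1`/`+v2` spelling of the variant tags are hypothetical, and real lexc entries may be laid out differently.

```shell
#!/bin/sh
# Sketch only: function names and sample entries are hypothetical.

# Keep ordinary nouns: drop lines carrying the G3 or ACTOR markers,
# the compound-related markers, a +V+ tag, and lexc comment lines
# (lines starting with "!").
ordinary_nouns() {
  grep -Ev '(G3|ACTOR|CmpN/Only|ShCmp|RCmpnd|\+V\+|^!)'
}

# Reduce lexc entries to unique "lemma+tags" keys, so that a lemma with
# several lexc entries is listed only once before forms are generated.
unique_lemmas() {
  grep ':' |                    # keep only entries with an upper:lower pair
    cut -d: -f1 |               # keep the upper side (lemma and tags)
    sed -E 's/\+v[0-9]+//g' |   # assumption: variant tags look like +v1, +v2
    LC_ALL=C sort -u            # one line per lemma, in a stable order
}

# Hypothetical sample standing in for noun-sme-lex.txt.
printf '%s\n' \
  'guolli+N:guolli GUOLLI ;' \
  'beana+v1+N:beana BEANA ;' \
  'beana+v2+N:beana BEATNAGA ;' \
  'bargi+N:bargi ACTOR ;' \
  '! a comment line' |
  ordinary_nouns | unique_lemmas
```

With this sample the two `beana` variants collapse into a single key and the ACTOR and comment lines are dropped, so the resulting list could serve as input for generating one lemgram per lemma. On the real file, the same functions would be used as `ordinary_nouns < noun-sme-lex.txt | unique_lemmas`.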

Meetings
