Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:48.

Present: Børre, Sjur, Thomas, Tomi, Trond

Absent: Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Nothing new during last week.

TODO:

5. Corpus infrastructure

More texts to the graphical corpus interface

Trond has talked with Lars, who is writing documentation for the users.

TODO:

Aligner

No more texts yet. Saara has included the aligner in the relevant Perl script.

TODO:

6. Infrastructure

Xerox tools wrapped as servers

The server had not been working for Tomi, but it is working again now. The paradigm generator now generates only 16-17 word forms, which is far too few. It seems all possessives have disappeared:

sajin   NIR # !SUB
sa^jis  NIR
sa^jiin NIR
sa^ji   NIR
sad^jái NIR
sad^ji  NIR
sad^je  NIRL - this is done by client
sa^je   NIRL
sa^ji   NIRL
sa^je   NIRL
sajiis  NIR # !SUB
sa^jiin NIR
sa^jii^guin     NIR
sa^jiid NIR
sa^jii^de       NIR
sa^jit  NIR
sa^jiid NIRL

sajin   NIR #N+Sg+Loc
sa^jis  NIR #N+Sg+Loc
sa^jiin NIR #N+Sg+Com
sa^ji   NIR #N+Sg+Acc
sad^jái NIR #N+Sg+Ill
sad^ji  NIR
sad^je  NIRL #N+Sg+Nom
sa^je   NIRL #N+Sg+Gen
sa^ji   NIRL #N+Sg+Gen
sa^je   NIRL #N+Sg+Gen
sajiis  NIR #N+Pl+Loc
sa^jiin NIR #N+Pl+Loc
sa^jii^guin     NIR #N+Pl+Com
sa^jiid NIR #N+Pl+Acc
sa^jii^de       NIR #N+Pl+Ill
sa^jit  NIR #N+Pl+Nom
sa^jiid NIRL #N+Pl+Gen

Using fsts:
/opt/smi/sme/bin/isme.fst
/opt/smi/sme/bin/hyph-sme.fst

It should have been using: ifst-norm: inverse-norm.fst. The file is available to the server, cf /opt/smi/sme/bin/:

-rwxrwxr-x  1 root  cvs    2257 jun 21 10:13 allcaps.fst
-rwxrwxr-x  1 root  cvs      92 jun 21 10:16 cap-sme
-rwxrwxr-x  1 root  cvs 6995574 des  4 00:38 hyph-sme.fst
-rwxrwxr-x  1 root  cvs 1206092 des  4 00:38 isme.fst
-rwxrwxr-x  1 root  cvs 3106957 des  4 00:38 isme-norm.fst
-rwxrwxr-x  1 root  cvs  674609 des  4 00:38 sme-dis.rle
-rwxrwxr-x  1 root  cvs 1251450 des  4 00:38 sme.fst
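
A quick way to check whether possessives are missing is to parse the generator output and summarise the tags. The following is a minimal Python sketch, not the project's own tooling. It assumes the line layout seen in the listings above (word form, class code, optional tag string after #), that ^ marks hyphenation points, and that possessive forms would carry a tag containing "+Px". The function names and the "+Px" marker are assumptions made for illustration.

```python
from collections import Counter


def parse_paradigm_line(line):
    """Split one generator line into (hyphenated form, plain form, class code, tags).

    Returns None for lines that do not have at least a form and a class code.
    """
    body, _, comment = line.partition("#")
    fields = body.split()
    if len(fields) < 2:
        return None
    hyphenated, code = fields[0], fields[1]
    plain = hyphenated.replace("^", "")  # assumption: ^ marks hyphenation points
    return hyphenated, plain, code, comment.strip()


def summarise(lines, possessive_marker="+Px"):
    """Count parsed entries per tag string and collect assumed possessive forms."""
    entries = [e for e in map(parse_paradigm_line, lines) if e]
    by_tag = Counter(e[3] for e in entries)
    possessives = [e for e in entries if possessive_marker in e[3]]
    return len(entries), by_tag, possessives


if __name__ == "__main__":
    sample = [
        "sajin   NIR #N+Sg+Loc",
        "sa^jis  NIR #N+Sg+Loc",
        "sa^jii^guin     NIR #N+Pl+Com",
        "sad^ji  NIR",
    ]
    total, by_tag, possessives = summarise(sample)
    print(total, dict(by_tag), possessives)
```

An empty possessives list for a noun paradigm would be consistent with the problem described above, provided the tag assumption holds.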

TODO:

Hyphenator

TODO:

7. Linguistics

Names and multilinguality

TODO:

  1. finish first version of the editing (Sjur)
  2. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  3. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  4. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  5. start to use the xml file as source file
  6. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  7. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  8. publish the name lexicon on risten.no (Sjur)
  9. add missing parallel names for placenames (linguists)
  10. add informative links between first names like Niillas and Nils (linguists)

Derivation and spellers like Aspell

TODO:

North Sámi

The following words are included in the normative list despite being marked with !SUB:

accompagnerejun -JUVVON
accompagnerejun accompagneret+V+TV+Der1+Der/j+Der/Pass+PrfPrc
ábuhuvvože -ETNE
ábuhuvvože      ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne -ETNE
áccohallagođežedne      áccohallat+V+TV+Der3+Der/goahti+Pot+Prs+Du1

In sme-lex.txt:
 +Der2+Der/Pass:uvvo DOHPPEINCH ;
 +Der/Pass+PrfPrc:un K ;  !SUB
 +Du1:e K ;  !SUB
 +Du1:edne K ;   !SUB
 +Du1:etne K ;

These are generated by make wordlist TARGET=sme, which uses nonrec-sme.fst (print lower).

The latest version of the wordlist no longer includes the erroneous words. They seem to have disappeared as part of other changes.
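
To guard against a regression, the !SUB-marked entries can be listed and compared against the generated wordlist. The sketch below is only an illustration in Python: it assumes that "!" starts a comment in the lexicon file (as in lexc) and that a comment beginning with SUB marks a sub-standard entry. It ignores escaped "%!" characters, and the function names are hypothetical. How the build itself filters !SUB entries is not stated in these minutes.

```python
def split_lexc_line(line):
    """Return (entry, comment) for one lexicon line; '!' is taken to start a comment."""
    entry, _, comment = line.partition("!")
    return entry.strip(), comment.strip()


def sub_marked_entries(lines):
    """Collect the entries whose trailing comment starts with SUB."""
    marked = []
    for line in lines:
        entry, comment = split_lexc_line(line)
        if entry and comment.split()[:1] == ["SUB"]:
            marked.append(entry)
    return marked


if __name__ == "__main__":
    lexicon_lines = [
        " +Der2+Der/Pass:uvvo DOHPPEINCH ;",
        " +Der/Pass+PrfPrc:un K ;  !SUB",
        " +Du1:e K ;  !SUB",
        " +Du1:edne K ;   !SUB",
        " +Du1:etne K ;",
    ]
    for entry in sub_marked_entries(lexicon_lines):
        print(entry)
```

Run on the sme-lex.txt excerpt above, this lists the three entries marked !SUB and skips the two unmarked ones.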

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in the meeting memo (/admin/physical_meetings/tromso-2006-08-propnoun.html).

TODO:

Postponed:

9. Spellers

Polderland data generation

All other open POSes are now included in the paradigm generator. Below is a verb example:

galggaže        VIR #V+Pot+Prs+Du1
galggažedne     VIR #V+Pot+Prs+Du1
galg^ga^žet^ne  VIR #V+Pot+Prs+Du1
galg^ga^žeh^pet VIR #V+Pot+Prs+Pl2
galg^gaš        VIR #V+Pot+Prs+ConNeg
galg^ga^žat     VIR #V+Pot+Prs+Pl1
galg^ga^žit     VIR #V+Pot+Prs+Pl1
galg^ga^žan     VIR #V+Pot+Prs+Sg1
galg^ga^žea^ba  VIR #V+Pot+Prs+Du3
galg^ga^žat     VIR #V+Pot+Prs+Sg2
galg^ga^žeahp^pi        VIR #V+Pot+Prs+Du2
galg^ga^žit     VIR #V+Pot+Prs+Pl3
galg^ga^ža      VIR #V+Pot+Prs+Sg3
galg^ga^leim^me VIR #V+Cond+Prs+Du1
galg^ga^šeim^me VIR #V+Cond+Prs+Du1
galg^ga^leid^det        VIR #V+Cond+Prs+Pl2
galg^ga^šeid^det        VIR #V+Cond+Prs+Pl2
galg^ga^le      VIR #V+Cond+Prs+ConNeg
galg^ga^še      VIR #V+Cond+Prs+ConNeg
galg^ga^leim^met        VIR #V+Cond+Prs+Pl1
galg^ga^šeim^met        VIR #V+Cond+Prs+Pl1
...

TODO:

Aspell

TODO when the major part of the PLX conversion is done:

Testing

When the PLX-based speller is ready, use the generated word list as test input: every word should be accepted (coverage self-testing). Then pick a random 1% of the words, change each one at random by edit distance 1, and run the results through the speller to test for false positives.
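
The test procedure could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not an agreed test plan: speller_accepts stands in for whatever interface the PLX-based speller ends up exposing, the alphabet is the North Sámi letter set used here only to draw random letters, and discarding mutations that happen to be valid word forms is an addition not mentioned in the minutes. All function names are hypothetical.

```python
import random

# Assumed letter inventory for random insertions and substitutions.
ALPHABET = "aábcčdđefghijklmnŋoprsštŧuvzž"


def mutate_once(word, rng, alphabet=ALPHABET):
    """Return a string at edit distance exactly 1 from word."""
    ops = ["insert"]
    if word:
        ops.append("substitute")
    if len(word) > 1:
        ops.append("delete")
    op = rng.choice(ops)
    if op == "insert":
        pos = rng.randrange(len(word) + 1)
        return word[:pos] + rng.choice(alphabet) + word[pos:]
    pos = rng.randrange(len(word))
    if op == "delete":
        return word[:pos] + word[pos + 1:]
    # Substitute with a letter that differs from the current one.
    choices = [c for c in alphabet if c != word[pos]]
    return word[:pos] + rng.choice(choices) + word[pos + 1:]


def make_corrupted_sample(wordlist, fraction=0.01, seed=0):
    """Mutate a random fraction of the word list; drop mutations that are real forms."""
    rng = random.Random(seed)
    known = set(wordlist)
    size = min(len(wordlist), max(1, round(len(wordlist) * fraction)))
    corrupted = []
    for word in rng.sample(wordlist, size):
        candidate = mutate_once(word, rng)
        if candidate not in known:
            corrupted.append((word, candidate))
    return corrupted


def evaluate(speller_accepts, wordlist, corrupted):
    """Return (generated words the speller rejects, corrupted words it accepts)."""
    missed = [w for w in wordlist if not speller_accepts(w)]
    false_positives = [(w, c) for w, c in corrupted if speller_accepts(c)]
    return missed, false_positives
```

With a real speller, missed should be empty if coverage is complete, and false_positives lists the corrupted forms that were wrongly accepted. Fixing the seed keeps the sample reproducible between runs.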

We need a meeting to plan testing. We’ll hold a short one this week, and perhaps a longer meeting in Alta.

TODO:

10. Other

Corpus contracts

TODO:

Bug fixing

56 open Divvun/Disamb bugs, and 23 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Task lists as iCal entries

TODO:

11. Next meeting, closing

The next meeting is 11.12.2006, 09:30 Norwegian time.

The meeting was closed at 11:22.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond