Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 11:14.

Present: Børre, Ilona (last part), Maaren, Per-Eric, Sjur, Thomas, Tomi, Trond

Absent: none

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Per-Eric

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Corpus web interface: Oslo will take care of everything. Unix access: we’ll bring it up if needed. For the time being web access should suffice for most users.

4. Corpus gathering

Per-Eric did not get to meet Sigga Tuolja-Sandström, but talked with her instead. The text licensing contracts should be in order now. Per-Eric will get the text from another person.

Børre talked with Michal Aase, and they (Davvi Girji) will send us all texts for which we have a contract with the authors.

Per-Eric talked with Samuel Gælok, who said that Iđut has several of his texts. There’s also another person/publisher in Karasjok that has some of his texts.

Kulturminnelova in smj is only available in paper form, unless Anders Kintel has an electronic copy.

TODO:

5. Corpus infrastructure

Nothing this week either.

6. Infrastructure

Børre got the G5 working again. The problem was caused by a misconfiguration in the firewall.

We are ordering a new server for faster processing.

7. Linguistics

North Sámi

Remaining twol issues:

###  Changed because: we get olmmož-, not olmmoš-

olmmošmuorra
olmmošmuorra    olmmošmuorra    +?

olmmožmuorra
olmmožmuorra    olmmoš+N+SgNomCmp#muorra+N+Sg+Nom

###  Changed because: we get almmáj- and not almmái-
 but this works now:
 almmájmuorra
almmájmuorra    almmájmuorra    +?

almmáimuorra
almmáimuorra    almmái+N+SgNomCmp#muorra+N+Sg+Nom

almmáj-
almmáj- almmáj- +?

almmái-
almmái- almmái+N+SgNomCmp+Cmpnd

We are now back to two errors in the twol-test file:

### € olmmožX4X7-
### € olmmoš00-
  #:0
###
  €
  o
  l
  m
  m
  o
  ž:š
  X4:0
  X7:0
  REJECTED: "Word Final Consonant Neutralization 1" fails in state 17.

### € hálijd#
### € háliit0
  #:0
###
  €
  h
  á
  l
  i
  j:i
  d:t
  REJECTED: "Word Final Consonant Neutralization 1" fails in state 27.
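
To make the traces above easier to read, here is a minimal Python sketch of how a lexical/surface pair can be aligned into the symbol pairs that the trace lists one per line (identical pairs shown as a single symbol, differing pairs as lexical:surface). It is not the twolc implementation; the helper names and the MULTICHAR inventory are assumptions made for this illustration, and the leading € marker is left out of the input strings.

```python
# Illustration only, not the twolc pair-test code.
# MULTICHAR is an assumed inventory of multi-character symbols.
MULTICHAR = ("X4", "X7")


def tokenize(text, multichar=MULTICHAR):
    """Split a string into symbols, matching multi-character symbols first."""
    ordered = sorted(multichar, key=len, reverse=True)
    symbols = []
    i = 0
    while i < len(text):
        for symbol in ordered:
            if text.startswith(symbol, i):
                symbols.append(symbol)
                i += len(symbol)
                break
        else:
            # No multi-character symbol starts here: take one character.
            symbols.append(text[i])
            i += 1
    return symbols


def align(lexical, surface):
    """Pair lexical and surface symbols position by position.

    Both strings must already have the same number of symbols,
    with 0 standing for the empty symbol, as in the test file.
    """
    lex = tokenize(lexical)
    surf = tokenize(surface)
    if len(lex) != len(surf):
        raise ValueError("lexical and surface strings differ in length")
    return [l if l == s else f"{l}:{s}" for l, s in zip(lex, surf)]


if __name__ == "__main__":
    for lexical, surface in [("olmmožX4X7-", "olmmoš00-"), ("hálijd#", "háliit0")]:
        print(lexical, surface)
        for pair in align(lexical, surface):
            print("  " + pair)
```

The sketch only shows the pairing; which rule accepts or rejects a given pair is decided by the compiled twol rules themselves.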

The sme names from Finland are still not added to the lexicon (Børre has them). This will be done by Ilona.

TODO:

Lule Sámi

The æ-ä alternation issue has taken an interesting turn. With the latest speller (29.6.), it behaves as follows:

dæbbaga -  ok
däbbaga -  däbboga
           dæbbaga
           dæbbaga--
           dibága
           dibága--
           dubága
           dubága--
           tubága
vællahit - ok
vällahit - vællahit
           vellahit
gæhtjáj  - ok
gähtjáj  - ok

That is, ä works in some cases, but not in others. æ seems to work everywhere it should.

Per-Eric talked to Nils-Olof Sortelius at Sámediggi in Sweden about smj place names. We will have to contact Lennart Dehlin at Lantmäteriverket, Lennart.Dehlin(at)lm.se. Per-Eric also talked to Kåre Tjikkom, who gave him the name of the SD/N contact person for place name issues: Lisa Monika Aslaksen.

Kåre Tjikkom has lost his speller-correct document. Sjur will have to find it, or make a new one.

The æ-ä alternation problem is probably a speller issue, not a twolc one. The transducers work fine:


däbbaga
däbbaga deppa+N+Sg+Gen
däbbaga deppa+N+Pl+Nom
däbbaga dæppa+N+Sg+Gen
däbbaga dæppa+N+Pl+Nom

vällahit
vällahit        vællahit+V+IV+Inf
vällahit        vællahit+V+IV+Imprt+Pl2

gähtjáj
gähtjáj giehtje+N+Sg+Ill

From the inverted transducers (regular and normative):

Tomi-si-maskin:~/Documents/eclipse/workspace/gt tomi$ lookup -flags mbTT -utf8 smj/bin/ismj.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
dæppa+N+Sg+Gen
dæppa+N+Sg+Gen  dæbbaga
dæppa+N+Sg+Gen  däbbaga

Tomi-si-maskin:~/Documents/eclipse/workspace/gt tomi$ lookup -flags mbTT -utf8 smj/bin/ismj-norm.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
dæppa+N+Sg+Gen
dæppa+N+Sg+Gen  dæbbaga

The normative transducer produces the word only with æ.
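
The comparison between the two transducers can be automated. The following is a minimal Python sketch, assuming the lookup output has been captured as text with one tab-separated analysis/form pair per line; the function names are made up for this illustration, and the sample strings simply repeat the output shown above.

```python
# Sketch: find forms that the regular transducer generates but the
# normative one does not. Assumes "analysis<TAB>form" lines; lines that
# do not match (prompts, progress bars, echoed input) are skipped.

def parse_lookup(output):
    """Map each analysis string to the set of forms generated for it."""
    forms = {}
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue
        analysis, form = parts[0].strip(), parts[1].strip()
        if analysis and form:
            forms.setdefault(analysis, set()).add(form)
    return forms


def regular_only(regular, normative):
    """Return forms present in the regular output but absent from the normative one."""
    result = {}
    for analysis, forms in regular.items():
        extra = forms - normative.get(analysis, set())
        if extra:
            result[analysis] = sorted(extra)
    return result


REGULAR = "dæppa+N+Sg+Gen\tdæbbaga\ndæppa+N+Sg+Gen\tdäbbaga\n"
NORMATIVE = "dæppa+N+Sg+Gen\tdæbbaga\n"

if __name__ == "__main__":
    print(regular_only(parse_lookup(REGULAR), parse_lookup(NORMATIVE)))
```

Forms reported this way are the ones the speller would have to accept as input even though the normative transducer does not generate them.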

TODO:

8. Name lexicon infrastructure

This sub-project needs to get up and running soon. Mainly Sjur’s task.

Decisions made in Tromsø can be found in this meeting memo: /admin/physical_meetings/tromso-2006-08-propnoun.html

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no; it might be faster and better suited than the official one, and local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)
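
As a rough illustration of step 5 (deriving the per-language lexicon text file from a common xml file), here is a minimal Python sketch. The xml layout (entry and name elements with a lang attribute), the language codes in the sample and the continuation lexicon name are all assumptions made for this example; they do not describe the actual terms-sme.xml or propernoun-sme-lex.txt formats.

```python
# Sketch only: the schema and the continuation lexicon name are hypothetical.
import xml.etree.ElementTree as ET

SAMPLE = """<terms>
  <entry id="Helsinki">
    <name lang="fin">Helsinki</name>
    <name lang="swe">Helsingfors</name>
    <name lang="sme">Helsset</name>
  </entry>
</terms>"""


def names_for_language(xml_text, lang):
    """Collect all names tagged with the given language, sorted and de-duplicated."""
    root = ET.fromstring(xml_text)
    names = set()
    for entry in root.iter("entry"):
        for name in entry.findall("name"):
            if name.get("lang") == lang and name.text:
                names.add(name.text.strip())
    return sorted(names)


def to_lex_lines(names, continuation="HYPOTHETICAL-CLASS"):
    """Format names as lexc-style entries: lemma, continuation lexicon, semicolon."""
    return [f"{name} {continuation} ;" for name in names]


if __name__ == "__main__":
    for line in to_lex_lines(names_for_language(SAMPLE, "sme")):
        print(line)
```

Keeping all parallel names in one entry, as in the sample, is also what step 9 asks for; the derived file then only selects the names for one language.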

9. Spellers

OOo spellers

Tomi is working on the lexicon conversion to the Hunspell format. It is moving forward.

TODO:

Testing

Spelling Error Markup

TODO:

Automated testing

TODO:

Lexicon conversion to the PLX format

It seems to work quite ok now. We might still consider asking Xerox for a license.

New public beta

Delayed till the majority of the present bugs are fixed.

10. Other

Corpus contracts

TODO:

Bug fixing

When fixing bugs, record the version number containing the fix in the Bugzilla bug report, so that for each bug we know exactly when it was fixed, in which file(s) and in which version.

56 open Divvun/Disamb bugs (21 of these 56 are speller bugs, 35 are general bugs), and 23 risten.no bugs

New team member

Ilona is joining the Divvun project from today - welcome! Hi! :)

Board meeting

Sjur went to Oslo on Thursday Aug. 16 to meet the board. Some highlights:

Additions to our project accepted by the board:

The planned release party will be on December 11, in Oslo. All project members are invited.

Project meeting

We’ll meet September 24-28 to work on the hardest remaining issues. The default location is Kautokeino, but we’ll also consider Tromsø until the next project meeting.

11. Next meeting, closing

The next meeting is 27.8.2007, 09:30 Norwegian time.

The meeting was closed at 10:57.

Appendix - task lists for the next week

Børre

Ilona

Maaren

Per-Eric

Saara

Sjur

Thomas

Tomi

Trond