Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 10:00.

Present: Børre, Maaren, Per-Eric, Sjur, Thomas, Tomi

Absent: Saara, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Per-Eric

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Per-Eric will meet Sigga Tuolja-Sandström on his way home from Östersund.

Torkel Rasmussen will give us texts. He will be in Tromsø in weeks 39-40 and will transfer the texts from his computer to ours then.

TODO:

5. Corpus infrastructure

Nothing this week either.

6. Infrastructure

The G5 is down/off the net after a system update; Børre is working on it.

TODO:

7. Linguistics

North Sámi

The twol problems in sme were “solved” by reverting the changes that caused them. Thus, it works now, but with the flaws that started the whole process.

Thomas has finished compounding markup - very good!

The sme names from Finland have still not been added to the lexicon (Børre has them). This could be done by Ilona. Sjur will ask Trond.

TODO:

Lule Sámi

The æ-ä alternation issue has taken an interesting turn. With the latest speller (29.6.), it behaves as follows:

dæbbaga -  ok
däbbaga -  däbboga
           dæbbaga
           dæbbaga--
           dibága
           dibága--
           dubága
           dubága--
           tubága
vællahit - ok
vällahit - vællahit
           vellahit
gæhtjáj  - ok
gähtjáj  - ok

That is, ä works in some cases, but not in others. æ seems to work everywhere it should.
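
The pattern above could be checked systematically with a small script. The following is only a minimal sketch: `accepts` is a hypothetical stand-in for a call into the speller, and the stub word set simply mirrors the results listed above, not the real lexicon.

```python
# Hypothetical sketch: tabulate which ä spellings are accepted alongside
# their æ counterparts. ACCEPTED is a stub mirroring the results reported
# in this memo; it is not the real speller lexicon.
ACCEPTED = {"dæbbaga", "vællahit", "gæhtjáj", "gähtjáj"}


def accepts(word):
    """Stand-in for a real speller lookup (assumption, not a real API)."""
    return word in ACCEPTED


def check_alternation(ae_forms):
    """Return (æ form, ä form, æ accepted?, ä accepted?) for each word."""
    rows = []
    for ae_form in ae_forms:
        a_form = ae_form.replace("æ", "ä")
        rows.append((ae_form, a_form, accepts(ae_form), accepts(a_form)))
    return rows


def inconsistent(rows):
    """List ä forms rejected although the matching æ form is accepted."""
    return [a_form for (_, a_form, ae_ok, a_ok) in rows if ae_ok and not a_ok]


rows = check_alternation(["dæbbaga", "vællahit", "gæhtjáj"])
problem_forms = inconsistent(rows)
```

Replacing the stub with a real speller call would turn this into a simple regression check for the alternation.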

Nils-Olof Sortelius at Sámediggi in Sweden (smj place names) is in hospital - Per-Eric will talk to Kåre Tjikkom instead.

TODO:

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in [this meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. make terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)
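
As an illustration of item 9, merging entries could look roughly like the sketch below. The actual schema of terms-sme.xml is not described in this memo, so the element names `<entry>` and `<name>`, the attributes `id` and `lang`, and the function `merge_entries` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure: <entry>, <name>, @id and @lang are illustrative
# names only; the real terms-sme.xml schema may differ.
SAMPLE = """
<terms>
  <entry id="a"><name lang="fin">Helsinki</name></entry>
  <entry id="b"><name lang="swe">Helsingfors</name></entry>
  <entry id="c"><name lang="sme">Helsset</name></entry>
  <entry id="d"><name lang="sme">Romsa</name></entry>
</terms>
"""


def merge_entries(root, ids):
    """Move all children of the listed entries into the first listed entry."""
    entries = {entry.get("id"): entry for entry in root.findall("entry")}
    target = entries[ids[0]]
    for other_id in ids[1:]:
        other = entries[other_id]
        for name in list(other):
            target.append(name)   # re-attach the name under the target entry
        root.remove(other)        # drop the now redundant entry
    return target


root = ET.fromstring(SAMPLE)
merged = merge_entries(root, ["a", "b", "c"])
merged_names = [name.text for name in merged.findall("name")]
```

Which entries belong together remains a decision for the linguists; the sketch only shows the mechanical part.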

9. Spellers

OOo spellers

Tomi is working on the lexicon conversion to the Hunspell format. Some issues remain to be solved. The software problems on the G5 will be added to Bugzilla.

TODO:

Testing

Spelling Error Markup

Steinar finished his job last week, and copied it all over to victorio. There is still some clean-up to do, mainly because the markup is added to the xml files, and not the originals. The present conversion process doesn’t handle error markup in this format.

TODO:

Testing tools

TODO:

Regression tests

Tomi used one of the Java classes in the PLX conversion tool to create a separate tool/command to extract the baseforms from the lexc lexicons.
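
For illustration only, the idea of baseform extraction can be sketched in a few lines of Python. This is not Tomi's Java tool, and it handles lexc only in a simplified way (one entry per line, no escaped characters, no multichar symbol declarations). The sample lexicon fragment and the function name `extract_baseforms` are assumptions made for the example.

```python
# Sample lexc fragment for illustration only (not from the project lexicons).
SAMPLE_LEXC = """\
! comment line
LEXICON NounRoot
guolli GUOLLI ;
beana:beana BEANA ;
viessu+N:viessu VIESSU "house" ;
"""


def extract_baseforms(lexc_text):
    """Collect the upper-side lemma of each simple lexc entry."""
    baseforms = []
    for line in lexc_text.splitlines():
        line = line.split("!", 1)[0].strip()   # drop lexc comments
        if not line or line.startswith("LEXICON") or not line.endswith(";"):
            continue                           # skip headers and non-entries
        entry = line.split()[0]                # "upper:lower" or plain lemma
        upper = entry.split(":", 1)[0]         # keep the upper (lexical) side
        lemma = upper.split("+", 1)[0]         # strip tags such as +N
        if lemma:
            baseforms.append(lemma)
    return baseforms


baseforms = extract_baseforms(SAMPLE_LEXC)
```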

TODO:

Lexicon conversion to the PLX format

It seems that we don’t need to get new licenses from either Polderland or Xerox.

TODO:

Publicity follow-up

Per-Eric is being interviewed next week by Anna Sunna (SR Sámi Radio), and will be demonstrating installation and usage of the proofing tools.

The interview was cancelled by SR.

New public beta

Delayed till the majority of the present bugs are fixed.

10. Other

Corpus contracts

TODO:

Bug fixing

When fixing bugs, record the version number containing the fix in the Bugzilla bug report, so that for each bug we know exactly when it should have been fixed, in which file(s), and in which version.

56 open Divvun/Disamb bugs (21 of these 56 are speller bugs, 35 are general bugs), and 23 risten.no bugs

TODO:

Board meeting

Sjur is going to Oslo on Thursday Aug. 16 to meet the board.

Project meeting

It would be good to have another project meeting sometime in September, to work on the hardest remaining issues. We’ll decide the date next week.

11. Next meeting, closing

The next meeting is 20.8.2007, 09:30 Norwegian time.

The meeting was closed at 10:57.

Appendix - task lists for the next week

Børre

Maaren

Per-Eric

Saara

Sjur

Thomas

Tomi

Trond