Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:32.

Present: Børre, Sjur, Thomas, Tomi, Trond

Absent: Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Børre contacted several authors:

Børre will meet Stig Gælok today; Gælok has a lot of texts in Lule Sámi.

Bård Eriksen was concerned that it would be too much work for them to deliver texts to us. Børre has asked for their book catalog so that he can contact the authors directly.

TODO:

5. Corpus infrastructure

General

Our way of dealing with the conversion of input documents has now reached an advanced level. At some point we might consider publishing our results, to the benefit of the rest of the research community.

JPedal work: Tomi went through the source code and added an option that defines where the result goes. This did not solve the other issue, with tagging.

TODO:

User accounts and access

For details, see a previous meeting memo, as well as the memo from a [dedicated meeting|/infra/corpus_policy.html].

Shell access

TODO:

Web browser access

Has been discussed with Oslo. They will release a new version of the web interface in a couple of weeks. Further discussions delayed till then.

More texts to the graphical corpus interface:

TODO:

Aligner

There has been a bug in the Bergen aligner. We will get a new (graphical) version shortly and are waiting for that. When it arrives, we will do some conversion, although we are still waiting for the command-line version. The second obstacle is the paucity of nob and fin texts parallel to the sme ones.

TODO:

Language recognition

New .wm files have been made, with better performance. Saara, Ilona and Trond have been testing and refining the software. There is still some room for improvement. We now have a limit of 0 characters for paragraphs.
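To illustrate how a per-paragraph character limit can interact with language guessing, here is a minimal Python sketch. It is not the software discussed above: the word lists, the names `TOY_MODELS`, `MIN_CHARS` and `guess_language`, and the reading of the limit as a minimum paragraph length are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy word lists (assumption); real models would be
# trained on corpus text and be far larger.
TOY_MODELS = {
    "sme": Counter({"ja": 5, "lea": 5, "mii": 3, "dat": 3, "go": 2}),
    "nob": Counter({"og": 5, "er": 5, "det": 3, "som": 3, "ikke": 2}),
    "fin": Counter({"ja": 4, "on": 5, "ei": 3, "se": 3, "että": 2}),
}

# Assumed meaning of the limit: paragraphs shorter than this many
# characters are left unclassified. 0 means every paragraph is tried.
MIN_CHARS = 0


def guess_language(paragraph, models=TOY_MODELS, min_chars=MIN_CHARS):
    """Return the best-scoring language code, or None if undecidable."""
    text = paragraph.strip()
    if len(text) < min_chars:
        return None
    words = text.lower().split()
    # Score each language by summing the weights of the words it knows.
    scores = {lang: sum(model[w] for w in words)
              for lang, model in models.items()}
    best = max(scores, key=scores.get)
    # If no word was recognised by any model, do not guess.
    return best if scores[best] > 0 else None
```

With a limit of 0, even a one-word paragraph is classified; raising `min_chars` trades coverage for fewer unreliable guesses on very short paragraphs.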

TODO:

6. Infrastructure

Xerox tools wrapped as servers

Feature request:

TODO:

Hyphenator

Sjur got help from Saara to sketch a Perl solution to the overgeneration problem, and a clean-up script is in the works. It will be ready this week.

TODO:

Automatic Bugzilla reminder for untouched bugs

TODO:

M4

TODO:

7. Linguistics

Derivation and spellers like Aspell

Semantic double-tagging of names

Waiting for the name conversion to take place before the disamb rules can be written. Further discussion delayed till then.

North Sámi

Nothing this week?

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in [the meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].

TODO:

9. Tromsø meeting follow-up

TODO:

Speller data generation

We need to convert our Xerox lexicons to the format required by Polderland, Aspell, etc. The basic architecture for the conversion was decided upon in Tromsø, but it now needs to be implemented.

TODO:

10. Other

Bug fixing

64 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Meetings and Marratech

TODO:

Task lists as iCal entries

TODO:

11. Next meeting, closing

Next meeting 2.10.2006 at 9:30.

Closed at 10:10.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond