Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:39.

Present: Børre, Saara, Sjur, Thomas, Tomi

Absent: Maaren, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Børre got text from Lene and Stig Gælok, so we now have 24 000 words in smj/facta, up from 800! He has sent letters to Jovnna-Ánde Vest and Aage Solbakk, and will contact more authors this week.

TODO:

5. Corpus infrastructure

General

Our handling of input document conversion has now reached an advanced level. At some point we might consider publishing our results, for the benefit of the rest of the research community.

TODO:

User accounts and access

For details, see a previous meeting memo, as well as the memo from a [dedicated meeting|/infra/corpus_policy.html].

Shell access

TODO:

More texts to the graphical corpus interface:

TODO:

Aligner

Next week: discuss NT parallel corpus

Trond and Saara have tried out manual alignment for testing purposes.

TODO:

Language recognition

TODO:

6. Infrastructure

Xerox tools wrapped as servers

Work will continue this week; any issues will be taken up at the next meeting.

Feature request:

TODO:

Hyphenator

Problem: we overgenerate because we are using a circular transducer:

A-fi^ná^laid#ea^me
A-fi^ná^lai^dea^me
A-fi^ná^lai^dea^me
A-#fi^ná^laid#ea^me
A-#fi^ná^lai^dea^me
A-#fi^ná^lai^dea^me
A-fi^ná^laid#ea^me
A-fi^ná^lai^dea^me
A-fi^ná^lai^dea^me

Pseudocode for hyph cleanup:

- read cohort
- remove all readings except those with the fewest word boundaries
- compare the remaining readings with the input string, disregarding ^ and #:
-- delete forms that do not correspond to the input string
- remove duplicates from the final set
- print what is left (it should normally be only one form)
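
The cleanup steps above can be sketched in Python as follows. This is only an illustration, not the script used in the project: the function name clean_cohort is hypothetical, and the sketch assumes that # marks a word boundary, that ^ marks a hyphenation point, and that a cohort is one input word together with all the readings the hyphenator returned for it. The sample readings are taken from the hyphenator output listed in these minutes.

```python
def clean_cohort(input_word, readings):
    """Reduce one cohort of hyphenator readings (hypothetical helper).

    Assumes '#' marks a word boundary and '^' a hyphenation point.
    """
    if not readings:
        return []
    # Keep only the readings with the fewest word boundaries.
    fewest = min(r.count("#") for r in readings)
    candidates = [r for r in readings if r.count("#") == fewest]
    # Keep the readings that equal the input once '^' and '#' are removed.
    matching = [
        r for r in candidates
        if r.replace("^", "").replace("#", "") == input_word
    ]
    # Remove duplicates while preserving the original order.
    return list(dict.fromkeys(matching))


# A few readings for one input word, copied from the hyphenator output.
sample = [
    "A--fi^ná^laid-ea^mi",
    "A-fi^ná^laid#ea^me",
    "A-fi^ná^lai^dea^me",
    "A-fi^ná^lai^dea^me",
    "A-#fi^ná^laid#ea^me",
    "a-fi^ná^lai^dea^me",
]

for form in clean_cohort("A-finálaideame", sample):
    print(form)
```

Note that the sketch follows the order of the pseudocode: if none of the readings with the fewest word boundaries match the input string, nothing is printed. Swapping the two filtering steps would avoid that, at the cost of deviating from the steps as written.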

The input that corresponds to the partially cleaned data above:

A-finálaideame  A--fi^ná^laid-ea^mi
A-finálaideame  A--fi^ná^laid-ea^me
A-finálaideame  A--fi^ná^laid#ea^mi
A-finálaideame  A--fi^ná^laid#ea^me
A-finálaideame  A--fi^ná^lai^dea^mi
A-finálaideame  A--fi^ná^lai^dea^me
A-finálaideame  A--fi^ná^lai^dea^me
A-finálaideame  A-fi^ná^laid-ea^mi
A-finálaideame  A-fi^ná^laid-ea^me
A-finálaideame  A-fi^ná^laid#ea^mi
A-finálaideame  A-fi^ná^laid#ea^me
A-finálaideame  A-fi^ná^lai^dea^mi
A-finálaideame  A-fi^ná^lai^dea^me
A-finálaideame  A-fi^ná^lai^dea^me
A-finálaideame  A-#fi^ná^laid-ea^mi
A-finálaideame  A-#fi^ná^laid-ea^me
A-finálaideame  A-#fi^ná^laid#ea^mi
A-finálaideame  A-#fi^ná^laid#ea^me
A-finálaideame  A-#fi^ná^lai^dea^mi
A-finálaideame  A-#fi^ná^lai^dea^me
A-finálaideame  A-#fi^ná^lai^dea^me
A-finálaideame  A-fi^ná^laid-ea^mi
A-finálaideame  A-fi^ná^laid-ea^me
A-finálaideame  A-fi^ná^laid#ea^mi
A-finálaideame  A-fi^ná^laid#ea^me
A-finálaideame  A-fi^ná^lai^dea^mi
A-finálaideame  A-fi^ná^lai^dea^me
A-finálaideame  A-fi^ná^lai^dea^me
A-finálaideame  a--fi^ná^laid-ea^mi
A-finálaideame  a--fi^ná^laid-ea^me
A-finálaideame  a--fi^ná^laid#ea^mi
A-finálaideame  a--fi^ná^laid#ea^me
A-finálaideame  a--fi^ná^lai^dea^mi
A-finálaideame  a--fi^ná^lai^dea^me
A-finálaideame  a--fi^ná^lai^dea^me
A-finálaideame  a-fi^ná^laid-ea^mi
A-finálaideame  a-fi^ná^laid-ea^me
A-finálaideame  a-fi^ná^laid#ea^mi
A-finálaideame  a-fi^ná^laid#ea^me
A-finálaideame  a-fi^ná^lai^dea^mi
A-finálaideame  a-fi^ná^lai^dea^me
A-finálaideame  a-fi^ná^lai^dea^me
A-finálaideame  a-#fi^ná^laid-ea^mi
A-finálaideame  a-#fi^ná^laid-ea^me
A-finálaideame  a-#fi^ná^laid#ea^mi
A-finálaideame  a-#fi^ná^laid#ea^me
A-finálaideame  a-#fi^ná^lai^dea^mi
A-finálaideame  a-#fi^ná^lai^dea^me
A-finálaideame  a-#fi^ná^lai^dea^me

The fst used is:

sme/bin/hyph-sme.fst

To make:
make TARGET=sme hyph-sme.fst

TODO:

Automatic Bugzilla reminder for untouched bugs

Some Perl libraries needed by Bugzilla weren't in the path, which kept the reminder from working. Adding them should fix the issue.

TODO:

M4

TODO:

7. Linguistics

Derivation and spellers like Aspell

North Sámi

Nothing to report this week.

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in the meeting memo.

TODO:

9. Tromsø meeting follow-up

TODO:

Speller data generation

TODO:

10. Other

Bug fixing

64 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Meetings and Marratech

TODO:

Task lists as iCal entries

TODO:

11. Next meeting, closing

Next meeting 9.10.2006 at 9:30.

Closed at 10:38.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond