Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 9:44.

Present: Børre, Maaren, Saara, Sjur, Steinar, Thomas, Tomi, Trond

Absent: none

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond

3. Documentation

Nothing this week.

4. Corpus gathering

Trond finally received the sma texts from Snåsa: quite a lot of text, but not everything. Børre will add them to the corpus repository.

The relevant persons have worked on the tasks below.

TODO:

5. Corpus infrastructure

Lars Nygård has left UiO. Anders Nøklestad is back in his old position. For us, this means that Anders will be the person to contact for technical matters, and Kristin Hagen the one for parsing of the nob parallel texts.

Alignment

TODO:

Conversion issues

TODO:

6. Infrastructure

Nothing this week.

7. Linguistics

North Sámi

TODO:

Numbers:

One problem is correctly identifying the base forms of numerals. In the example below, the base form of 16 (guhttanuppelohkái) is given as 6 (guhtta):

guhttanuppelohkái
guhttanuppelohkái       guhtta+Num+Sg+Nom
guhttanuppelohkái       guhtta+Num+Sg+Acc
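
Readings like these can be picked out mechanically for manual review. The following is a minimal sketch in Python, assuming the analyser output consists of whitespace-separated lines of wordform and reading, as in the example above. The names `Analysis`, `parse_analysis` and `suspicious_numerals`, the `min_extra` threshold, and the heuristic itself (a numeral lemma that is a much shorter prefix of the wordform) are illustrative assumptions, not part of the existing tools.

```python
from typing import Iterable, List, NamedTuple, Optional


class Analysis(NamedTuple):
    wordform: str
    lemma: str
    tags: List[str]


def parse_analysis(line: str) -> Optional[Analysis]:
    """Split one 'wordform  lemma+Tag+Tag' line; return None if there is no reading."""
    parts = line.split(None, 1)
    if len(parts) < 2:
        return None  # blank line, or a bare wordform without an analysis
    wordform, reading = parts
    lemma, *tags = reading.strip().split("+")
    return Analysis(wordform, lemma, tags)


def suspicious_numerals(lines: Iterable[str], min_extra: int = 4) -> List[Analysis]:
    """Collect numeral readings whose lemma is a much shorter prefix of the wordform.

    min_extra is a hypothetical threshold: how many characters the wordform
    must have beyond the lemma before the reading is flagged.
    """
    flagged = []
    for line in lines:
        analysis = parse_analysis(line)
        if analysis is None or "Num" not in analysis.tags:
            continue
        extra = len(analysis.wordform) - len(analysis.lemma)
        if analysis.wordform.startswith(analysis.lemma) and extra >= min_extra:
            flagged.append(analysis)
    return flagged


sample = [
    "guhttanuppelohkái",
    "guhttanuppelohkái       guhtta+Num+Sg+Nom",
    "guhttanuppelohkái       guhtta+Num+Sg+Acc",
]
for hit in suspicious_numerals(sample):
    print(hit.wordform, "->", hit.lemma, "+".join(hit.tags))
```

The heuristic will also flag some ordinary inflected forms, so the result is only a list of candidates to look at, not a list of errors.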

TODO:

Hyphenation problem

TODO:

Lule Sámi

It could actually be that the smj numerals are not recursive. They were made differently from the sme ones, since Spiik reported them as written separately.

TODO:

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in the meeting memo (/admin/physical_meetings/tromso-2006-08-propnoun.html).

Postponed:

TODO:

  1. try to make a first version of xml2lexc in Perl for testing and preparation for the big jump (Saara)
    1. done
  2. restructure interface code for easier maintenance, coding and use
    1. well under way, still some work
  3. finish first version of the editing (Sjur)
  4. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  5. make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
  6. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)

9. Spellers

Polderland data generation

There is now a decision on compound parts, and compounding can now be included in the PLX generation. Compounding is a sine qua non (a must) for the beta version. The specification is found in this document.

We have a UTF-8 problem with the paradigm server: in some cases, characters are returned as Latin1. When the server runs on G5, everything works fine, but when it runs on victorio, some conversion errors turn up. According to some net sources the problem may be Java-related; it may also be related to the perl settings on victorio, following the change in perl setup.

Suggestion: Just use the G5, and not victorio, since there is no time to fix the setup in victorio (the real error).
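
To see which responses are affected, the bytes returned by the server can be checked on the client side. The following is a minimal diagnostic sketch in Python, assuming the client has access to the raw response bytes. The function name `decode_response` is hypothetical. It only detects the symptom and does not address the setup on victorio.

```python
from typing import Tuple


def decode_response(raw: bytes) -> Tuple[str, str]:
    """Decode bytes as strict UTF-8, falling back to Latin-1 if that fails.

    Returns the decoded text together with the name of the encoding used,
    so that Latin-1 responses can be logged and counted.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never raises.
        return raw.decode("latin-1"), "latin-1"


# Simulated responses: the same word encoded correctly and as Latin-1.
good = "guhttanuppelohkái".encode("utf-8")
bad = "guhttanuppelohkái".encode("latin-1")

for raw in (good, bad):
    text, encoding = decode_response(raw)
    print(encoding, text)
```

The fallback only recovers characters that exist in Latin-1. Pure ASCII responses decode as UTF-8 either way and are reported as such.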

TODO:

  1. decide how to specify compounding behaviour info for the lexicon (Thomas, Trond, Sjur)
    1. Done!
  2. add closed POS and clitics to PLX generation (Børre, Tomi)
    1. Progressing.
  3. add compound stems to the PLX generation (Børre, Tomi)
  4. add derivations to the PLX generation (Børre, Tomi)
  5. Include numerals in the speller (Børre, Tomi)

Aspell

TODO when the major part of the PLX conversion is done:

Testing

TODO:

Localisation

TODO:

10. Other

Corpus contracts

TODO:

Bug fixing

There are 56 open Divvun/Disamb bugs and 23 risten.no bugs.

11. Next meeting, closing

The next meeting is 22.1.2007, 09:30 Norwegian time.

The meeting was closed at 10:44.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond