Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 10:11.

Present: Børre, Maaren, Sjur, Thomas, Tomi, Trond

Absent: Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Børre talked to Richard Valkeapää. They have several computers and have the data available, but they have problems with their backup system and with how file names interact between Windows and Mac OS X. Børre will help them, and we will then receive our data.

TODO:

5. Corpus infrastructure

User accounts and access

The whole issue should be moved into Bugzilla as several tasks.

TODO:

More texts to the graphical corpus interface:

Trond, Børre and Saara aligned the first parallel corpus files and sent them to Oslo. One more milestone reached :-)

TODO:

Aligner

Børre has made some important improvements to the Bergen Aligner. The Aligner now compiles more cleanly and can save intermediate result files in order to control memory usage (so far 4 out of 5 files, so there is still some work to do). This way we will hopefully be able to contribute to the Bergen development.

TODO:

Language recognition

TODO:

6. Infrastructure

Xerox tools wrapped as servers

The Divvun project needs generation pretty soon, and Tomi can add generation to the server if he needs it. Possible conflicts in the code will have to be resolved when Saara is back.

TODO:

Hyphenator

Sjur has hyphenated the pl-wordlist.txt file (the full-form word list generated for Polderland), and now has about 10 GB of hyphenated data on his computer. He forgot to time the run, but it took somewhere between 36 and 48 hours. The data contains a lot of noise and problematic cases, but a cleaned version will be sent off to PL for their perusal. The cleaned version is 651 MB, just slightly larger than the original input file, which was 627 MB.

Hyphenation for smj still does not work, as the hyphenation rules file does not compile. We have received valuable feedback from Lauri Karttunen, though, as we reported a UTF-8 bug in the latest versions of the Xerox tools.

Overgeneration problem:

Abakalikilaččabuččaideamet      A^ba^ka^li^kilač^ča^buč^čai^dea^met
Abakalikilaččabuččaideamet      A^ba^ka^li^kilač^ča^buč^čai^dea^met
Abakalikilaččabuččaideamet      A^ba^ka^li^kilep^po^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^kilep^po^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^kilab^bo^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^kilab^bo^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^ki#lač^ča^buč^čai^dea^met
Abakalikilaččabuččaideamet      A^ba^ka^li^ki#lep^po^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^ki#lep^po^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^ki#lab^bo^žiid^dá^met
Abakalikilaččabuččaideamet      A^ba^ka^li^ki#lab^bo^žiid^dá^met
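
To get an overview of the overgeneration, the hyphenator output can be grouped per word form, so that repeated lines collapse and only the distinct hyphenations remain. The following is a minimal sketch, assuming the two-column layout shown above (word form, whitespace, hyphenated form); the function names are made up for illustration and are not part of our tools.

```python
def group_hyphenations(lines):
    """Map each word form to its distinct hyphenations, in first-seen order."""
    variants = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        word, hyphenated = fields
        seen = variants.setdefault(word, [])
        if hyphenated not in seen:
            seen.append(hyphenated)
    return variants


def ambiguous_words(variants):
    """Keep only the word forms that received more than one hyphenation."""
    return {word: forms for word, forms in variants.items() if len(forms) > 1}


# A few lines copied from the listing above.
SAMPLE = [
    "Abakalikilaččabuččaideamet      A^ba^ka^li^kilač^ča^buč^čai^dea^met",
    "Abakalikilaččabuččaideamet      A^ba^ka^li^kilač^ča^buč^čai^dea^met",
    "Abakalikilaččabuččaideamet      A^ba^ka^li^kilep^po^žiid^dá^met",
    "Abakalikilaččabuččaideamet      A^ba^ka^li^ki#lač^ča^buč^čai^dea^met",
]

if __name__ == "__main__":
    for word, forms in ambiguous_words(group_hyphenations(SAMPLE)).items():
        print(word, len(forms))
        for form in forms:
            print("   ", form)
```

Run over the full output, this would give the number of word forms with more than one hyphenation, which is one possible measure of how widespread the problem is.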

Unrecognised input:

abbalaččaboiid          abbalaččaboiid          +?
Abbalaččaboiidda        Abbalaččaboiidda        +?
abbalaččaboiidda        abbalaččaboiidda        +?
Abbalaččaboiiddáde      Abbalaččaboiiddáde      +?
abbalaččaboiiddáde      abbalaččaboiiddáde      +?
Abbalaččaboiiddádeguin  Abbalaččaboiiddádeguin  +?

What is the percentage of unrecognised input (i.e. of word forms generated by the generator but not accepted by the analyser)? Answer:

This is way too high, and must be investigated. The input is generated from our normative, non-circular version of the lexicon, and should thus only contain recognised word forms already in the lexicon.
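
The percentage can be computed directly from the analyser output. Below is a minimal sketch, assuming the layout shown above, where an unrecognised form is echoed back and the last field is `+?`. A form with several analyses occupies several lines, so the forms are counted once each. The function name and the two recognised sample lines are illustrative placeholders, not real output.

```python
def unrecognised_percentage(lines):
    """Return the share (in percent) of distinct input forms marked with +?."""
    recognised = {}  # input form -> True if at least one analysis was found
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines between analyses
        form = fields[0]
        ok = fields[-1] != "+?"
        recognised[form] = recognised.get(form, False) or ok
    if not recognised:
        return 0.0
    unknown = sum(1 for ok in recognised.values() if not ok)
    return 100.0 * unknown / len(recognised)


SAMPLE = [
    "abbalaččaboiid          abbalaččaboiid          +?",
    "Abbalaččaboiidda        Abbalaččaboiidda        +?",
    "viessu  viessu+N+Sg+Nom",   # placeholder analysis, for illustration only
    "viesut  viessu+N+Pl+Nom",   # placeholder analysis, for illustration only
]

if __name__ == "__main__":
    print("%.1f %% unrecognised" % unrecognised_percentage(SAMPLE))
```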

TODO:

Automatic Bugzilla reminder for untouched bugs

Some Perl libraries needed by Bugzilla weren’t in the path, which stopped it from working. Adding them should fix the issue.

TODO:

M4

TODO:

7. Linguistics

Names and multilinguality

We need a more principled approach to this.

Background: the name lexicon is getting attention from the SD name/terminology sections, and they would also like to use our name lexicon for public searching.

Observations:

1) Multilinguality is always optional.

2) We can observe that “foreign” names in texts follow a domination pattern: majority language forms can be found in minority language texts as real names (“Kautokeino produkter”), whereas minority language names almost always occur in majority language texts as citations. And citations should not be considered a natural part of the text.

3) When looking at our name classification, multilinguality varies according to:

Ani - weak/none? (pet and mythical animal names)
Fem - weak (informative)
Mal - weak (informative)
Obj - strong
Org - strong
Plc - strong for the national and country names, weak (informative) for foreign names
Sur - none
Tit - strong (titles)
Suggestion:

We need to reconsider the all-names-in-all-languages policy. That policy is valid only for Fem, Mal, and Sur (and Ani and Tit?). For Obj, Org and Plc the rule should be that if they have multilingual names, each name should only be used in its own language. We then need a modification saying that majority language names can be included in minority language lexicons if attested in our corpus. Also, the majority language obviously varies by country, which means that in a speller context we might consider tailoring spellers for each country, leaving out noise from the majority language names of another country.
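
The suggested rule can be written down as a small decision function. This is only a sketch of the suggestion as worded above, not an agreed implementation: the function name, the parameters and the language codes (sme, nob, fin) are assumptions made for illustration, and Ani and Tit are rejected because their status is still open.

```python
ALL_LANGUAGES = {"Fem", "Mal", "Sur"}   # names shared by all languages
OWN_LANGUAGE = {"Obj", "Org", "Plc"}    # names tied to their own language


def include_name(name_class, name_lang, lexicon_lang,
                 majority_langs=frozenset(), attested=False):
    """Decide whether a name belongs in the lexicon for lexicon_lang.

    majority_langs: the majority language(s) of the country the lexicon
    (or speller) is tailored for.
    attested: True if the name is found in our corpus for lexicon_lang.
    """
    if name_class in ALL_LANGUAGES:
        return True
    if name_class in OWN_LANGUAGE:
        if name_lang == lexicon_lang:
            return True
        # Exception: attested majority language names in minority lexicons.
        return name_lang in majority_langs and attested
    raise ValueError("policy not decided for class %r" % name_class)


if __name__ == "__main__":
    # A Norwegian organisation name attested in North Sámi texts (Norway).
    print(include_name("Org", "nob", "sme", {"nob"}, attested=True))
    # The same name in a speller tailored for Finland.
    print(include_name("Org", "nob", "sme", {"fin"}, attested=True))
```

Passing a different `majority_langs` set per country is one way to express the country-tailored spellers mentioned above.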

A further issue is whether we should reconsider our cohort policy. Today, Sur and Plc are different readings. An alternative would be to have them as secondary tags, not in conflict with each other:

"<Trosterud>"
        "Trosterud" N Prop Sur Sg Nom <<< @HNOUN
        "Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN
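
The step from the first cohort to the two alternatives can be sketched as a merge of readings that differ only in the name class tag. This is a minimal illustration, assuming each reading is given as a list of tags and leaving out the syntactic part (`<<< @HNOUN`); the function name and the `mark` parameter are made up for the example.

```python
NAME_CLASSES = ("Ani", "Fem", "Mal", "Obj", "Org", "Plc", "Sur", "Tit")


def merge_name_readings(readings, mark="<{}>"):
    """Merge readings that differ only in name class into one reading.

    The name classes are moved to the end as secondary tags, formatted
    with `mark` (for example "<{}>" or "&{}").
    """
    merged = {}  # remaining tags -> name classes seen for them
    for tags in readings:
        rest = tuple(t for t in tags if t not in NAME_CLASSES)
        classes = merged.setdefault(rest, [])
        for tag in tags:
            if tag in NAME_CLASSES and tag not in classes:
                classes.append(tag)
    return [list(rest) + [mark.format(c) for c in classes]
            for rest, classes in merged.items()]


READINGS = [
    ['"Trosterud"', "N", "Prop", "Sur", "Sg", "Nom"],
    ['"Trosterud"', "N", "Prop", "Plc", "Sg", "Nom"],
]

if __name__ == "__main__":
    for reading in merge_name_readings(READINGS):
        print("\t" + " ".join(reading))
    for reading in merge_name_readings(READINGS, mark="&{}"):
        print("\t" + " ".join(reading))
```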

Derivation and spellers like Aspell

North Sámi

We have a lot of overgeneration, mainly possible derivations that are spelled out. We need to investigate what is going on, and evaluate whether it is a problem.

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in the meeting memo: /admin/physical_meetings/tromso-2006-08-propnoun.html

TODO:

Postponed:

9. Spellers

News

The Greenlandic speller was released today. It is available for free from the website of the Greenlandic language council, and is based on the same technology as our own.

Speller data generation

The speller data generation code reads the lexc files, picks out each word, and requests (or will request) the generator server to expand each base form to its full paradigm.
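
The flow described above can be sketched as follows. This is not the Java code itself, only an illustration in Python under simplifying assumptions: entries are taken to be of the form `lemma CONTLEX ;` or `lemma:stem CONTLEX ;`, lexc escapes are ignored, and the generator server is replaced by a plain callable, here a stand-in with placeholder paradigms. All names are hypothetical.

```python
def lexc_base_forms(lines):
    """Yield the lemma of each entry in a lexc file (simplified parsing)."""
    in_lexicon = False
    for raw in lines:
        line = raw.split("!", 1)[0].strip()  # drop lexc comments
        if not line:
            continue
        if line.startswith("LEXICON"):
            in_lexicon = True
            continue
        if not in_lexicon or not line.endswith(";"):
            continue
        fields = line[:-1].split()
        if len(fields) < 2:
            continue  # only a continuation lexicon, no word
        yield fields[0].split(":", 1)[0]


def build_word_list(lexc_lines, generate):
    """Expand every base form with `generate` and collect unique forms."""
    forms, seen = [], set()
    for lemma in lexc_base_forms(lexc_lines):
        for form in generate(lemma):
            if form not in seen:
                seen.add(form)
                forms.append(form)
    return forms


SAMPLE_LEXC = [
    "Multichar_Symbols",
    "+N +Sg",
    "LEXICON Root",
    "guolli GOAHTI ;   ! a comment",
    "viessu:vies'su VIESSU ;",
    "N ;",
]

# Stand-in for the generator server; the paradigms are placeholders.
FAKE_PARADIGMS = {"guolli": ["guolli", "guoli"], "viessu": ["viessu", "viesu"]}


def fake_generate(lemma):
    return FAKE_PARADIGMS.get(lemma, [])


if __name__ == "__main__":
    print(build_word_list(SAMPLE_LEXC, fake_generate))
```

Keeping the generator behind a callable means the same loop can be tested now and pointed at the real server once it is available.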

Still missing: the generator isn’t yet available. Tomi should implement a first version while Saara is away. It should allow requests for:

The Java code needs to be checked into CVS. Suggested location:

gt/src/lexc2xspell

TODO:

10. Other

Bug fixing

66 open Divvun/Disamb bugs, and 24 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Meetings and the SD Firewall

TODO:

Task lists as iCal entries

TODO:

Words section

Everyone needs to check out CVS/words and link it to the relevant place, cf. how gt/doc/ is linked.

11. Next meeting, closing

Next meeting 23.10.2006 at 9:30.

Sjur will be away on Thursday and Friday this week. Trond will be away next week.

Closed at 11:16.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond