Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:58.

Present: Børre, Maaren, Saara, Sjur

Absent: Thomas, Tomi, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Børre wrote the renaming script for NSI, but it didn’t work on their computer.

Børre has also talked with Ája. They agreed that Ája will send us, every month, all their documents from the preceding month, including parallel language versions.

He has also been digging further into the SD document hierarchy to find more texts, and has added a few more to our corpus repository.

One author, Synnøve Persen, signed the corpus contract last week and sent us a book. It is our first novel - excellent!

TODO:

5. Corpus infrastructure

User accounts and access

TODO:

More texts to the graphical corpus interface:

TODO:

Aligner

Børre has been working hard on the Bergen aligner: fixed compiling errors and cleaned the source - compilation now only produces 15 warnings, as opposed to more than 100 warnings and 74 errors earlier.

The aligner now works somewhat automatically: it handles memory issues gracefully without manual intervention. A command line version is coming; so far it loads the input documents, but it does not start aligning by itself.

We’ll stop working on it for this week, as the fixes already done are very useful in themselves.

Some parallel texts in the corpus are now aligned (but many of the texts believed to be in Norwegian were actually in Sámi).

TODO:

Language recognition

TODO:

6. Infrastructure

Xerox tools wrapped as servers

Tomi tried to make the generator, but without success. Saara will cooperate with Tomi to make something useful this week.

TODO:

Hyphenator

TODO:

Automatic Bugzilla reminder for untouched bugs

We now receive reminders for untouched bug reports, once a week.

TODO:

M4

Still anything?

M4 is problematic for the CG rules, as the rule numbering gets mixed up. The hope is that the same effect can be achieved in CG3.

7. Linguistics

Names and multilinguality

We need a more principled approach to this.

Background: the name lexicon is getting attention from the SD name/terminology sections, and they would like to use our name lexicon also for public searching.

Observations:

1) Multilinguality is always optional.

2) We can observe that “foreign” names in texts follow a domination pattern: majority language forms can be found in minority language texts as real names (“Kautokeino produkter”), whereas minority language names almost always occur in majority language texts as citations. Citations should not be considered a natural part of the text.

3) When looking at our name classification, multilinguality varies according to:

Ani - weak/none? (pet and mythical animal names)
Fem - weak (informative)
Mal - weak (informative)
Obj - strong
Org - strong
Plc - strong for the national and country names, weak (informative) for foreign
       names
Sur - none
Tit - strong (titles)

Suggestion:

We need to reconsider the "all names in all languages" policy. That policy is valid only for Fem, Mal, and Sur (and Ani and Tit?). For Obj, Org and Plc the rule should be that if they have multilingual names, each name should only be used in its own language. We then need a modification saying that majority language names can be included in minority language lexicons if attested in our corpus. Also, the majority language varies according to country (obviously), which means that in a speller context we might consider tailoring spellers for each country, leaving out noise relating to majority language names from another country.
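The suggested rule can be written down as a small decision function. This is only a sketch of the suggestion as minuted, not of anything in the name lexicon code: the function name, the parameters and the language codes (sme, nob, fin) are assumptions for illustration, and Ani and Tit are deliberately left undecided, as in the suggestion.

```python
# Classes for which the "all names in all languages" policy is kept.
# Ani and Tit are an open question in the minutes, so they are not listed.
SHARED_CLASSES = {"Fem", "Mal", "Sur"}
# Classes for which each name should only be used in its own language.
OWN_LANGUAGE_CLASSES = {"Obj", "Org", "Plc"}


def include_name(name_class, name_lang, lexicon_lang,
                 majority_langs=frozenset(), attested=False):
    """Return True if a name should go into the lexicon for lexicon_lang.

    majority_langs: majority languages of the country the lexicon targets
                    (hypothetical parameter, empty for a majority lexicon).
    attested:       whether the name form is attested in our corpus.
    """
    if name_class in SHARED_CLASSES:
        return True
    if name_class in OWN_LANGUAGE_CLASSES:
        if name_lang == lexicon_lang:
            return True
        # Modification: majority language names are allowed in a
        # minority language lexicon, but only when attested.
        return name_lang in majority_langs and attested
    raise ValueError(f"policy not decided for class {name_class!r}")


# A Norwegian place name in a North Sámi lexicon built for Norway:
print(include_name("Plc", "nob", "sme", {"nob"}, attested=True))
# A Finnish place name in the same lexicon is left out as noise:
print(include_name("Plc", "fin", "sme", {"nob"}, attested=True))
```

Passing a different majority_langs set per country gives the country-specific spellers mentioned above without changing the rule itself.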

A further issue is whether we should reconsider our cohort policy. Today, Sur and Plc are different readings. An alternative would be to have them as secondary tags, not in conflict with each other:

"<Trosterud>"
        "Trosterud" N Prop Sur Sg Nom <<< @HNOUN
        "Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN
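To make the difference between the first cohort and the two alternatives concrete, here is a minimal sketch that collapses readings differing only in Sur/Plc into one reading with secondary tags. The data layout (lemma plus tag list) and the function name are assumptions made for the illustration, and the trailing "<<< @HNOUN" part is left out.

```python
# Name class tags that would become secondary tags instead of
# giving rise to separate readings.
NAME_CLASSES = {"Sur", "Plc"}


def merge_name_readings(readings, secondary="<{}>"):
    """Merge readings that differ only in their name class tag.

    readings:  list of (lemma, [tags]) pairs from one cohort.
    secondary: format for the secondary tag, "<{}>" or "&{}".
    """
    merged = {}
    for lemma, tags in readings:
        # The reading without its name class tag identifies the group.
        core = tuple(t for t in tags if t not in NAME_CLASSES)
        classes = merged.setdefault((lemma, core), [])
        for t in tags:
            if t in NAME_CLASSES and t not in classes:
                classes.append(t)
    return [(lemma, list(core) + [secondary.format(c) for c in classes])
            for (lemma, core), classes in merged.items()]


cohort = [
    ("Trosterud", ["N", "Prop", "Sur", "Sg", "Nom"]),
    ("Trosterud", ["N", "Prop", "Plc", "Sg", "Nom"]),
]
print(merge_name_readings(cohort))
print(merge_name_readings(cohort, secondary="&{}"))
```

With secondary tags the cohort has a single reading, so Sur and Plc no longer have to be disambiguated against each other.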

Derivation and spellers like Aspell

North Sámi

Unwanted word forms:

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Sjur has cleaned up a lot of the risten.no code during the last week.

Decided in Tromsø:

Details can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

TODO:

Postponed:

9. Spellers

Speller data generation

Derivations during generation of word forms: how do we generate derivations of input words, and their inflections?

Problem example: how do we get from Oslo to oslolaš?

Input:
Oslo

Output:
Oslo
Oslos
...
oslolaš
oslolaččat
...
(+ other derivations and their inflections)

Oslo ->

Oslo+N+Sg+Nom Oslo+der/laš -> oslolaš

oslolaš+N+Sg+Nom Oslo+N+Prop+Der/laš+A+Sg+Nom

Make a list of all possible derivations (see lexc lexicon files), and try them out one at a time:
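A minimal sketch of that loop, assuming the generator can be called with an analysis string and returns the word forms it produces (possibly none). The tag string for laš is taken from the example above; the second derivation entry, the function names and the toy generator table are hypothetical stand-ins for the real lexc-based generator.

```python
def generate_derivations(lemma, base_tags, derivations, generate):
    """Try each derivation in turn and keep those the generator accepts.

    derivations: list of (derivation_tag, tail_tags) pairs, e.g. collected
                 by hand from the lexc lexicon files.
    generate:    callable taking an analysis string and returning a list
                 of word forms (empty if the analysis is not possible).
    """
    results = {}
    for der_tag, tail_tags in derivations:
        analysis = f"{lemma}{base_tags}+{der_tag}{tail_tags}"
        forms = generate(analysis)
        if forms:
            results[analysis] = forms
    return results


# Toy stand-in for the real generator, for illustration only.
TOY_GENERATOR = {"Oslo+N+Prop+Der/laš+A+Sg+Nom": ["oslolaš"]}


def toy_generate(analysis):
    return TOY_GENERATOR.get(analysis, [])


DERIVATIONS = [
    ("Der/laš", "+A+Sg+Nom"),           # from the Oslo example
    ("Der/Hypothetical", "+N+Sg+Nom"),  # placeholder, not a real tag
]

found = generate_derivations("Oslo", "+N+Prop", DERIVATIONS, toy_generate)
print(found)
```

Each successful derivation would then be inflected in the same way as the input word, by extending tail_tags over the relevant paradigm.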

What about the compounding stem? How do we generate it?

TODO:

Automatic testing of the Word spellchecker

Ask MS Word to spell check the open documents, and store all unrecognised words in the user dictionary. This could possibly be done by using AppleScript to interact with Word.

We should also ask Polderland whether they have tools for this.

This will only test unrecognised words. We also need to test the suggestions, and misspellings not recognised as such.
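The three things to test (unrecognised correct words, suggestions, and misspellings that slip through) can be counted from one test list. The sketch below assumes a hypothetical check function returning whether a word was accepted and which suggestions were given; the canned responses are invented for the example and say nothing about how Word or the Polderland speller actually behaves.

```python
def evaluate_speller(test_cases, check):
    """Count speller outcomes for a list of test words.

    test_cases: list of (word, expected) pairs, where expected is None
                for a correct word, or the intended correction for a
                misspelling.
    check:      callable returning (accepted, suggestions) for a word.
    """
    counts = {
        "correct_accepted": 0,
        "correct_flagged": 0,      # unrecognised correct words
        "misspelling_missed": 0,   # misspellings not recognised as such
        "misspelling_flagged": 0,
        "suggestion_hit": 0,       # intended word among the suggestions
    }
    for word, expected in test_cases:
        accepted, suggestions = check(word)
        if expected is None:
            counts["correct_accepted" if accepted else "correct_flagged"] += 1
        elif accepted:
            counts["misspelling_missed"] += 1
        else:
            counts["misspelling_flagged"] += 1
            if expected in suggestions:
                counts["suggestion_hit"] += 1
    return counts


# Canned responses standing in for a real speller (illustration only).
TOY_RESPONSES = {
    "Oslo": (True, []),
    "oslolaččat": (False, []),        # correct word, wrongly flagged
    "oslolas": (False, ["oslolaš"]),  # misspelling, good suggestion
    "Osloo": (True, []),              # misspelling, wrongly accepted
}

cases = [("Oslo", None), ("oslolaččat", None),
         ("oslolas", "oslolaš"), ("Osloo", "Oslo")]
counts = evaluate_speller(cases, TOY_RESPONSES.get)
print(counts)
```

The user dictionary approach described above would only fill in the correct_flagged count; the other counts need a test list with known misspellings and their intended corrections.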

TODO:

10. Other

Bug fixing

66 open Divvun/Disamb bugs, and 24 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Meetings and the SD Firewall

TODO:

How do we set environment variables effective for all users

Look into /etc/environment on victorio - NOT FOUND on the Mac! (only relevant on the G5)

Task lists as iCal entries

TODO:

Employee seminar in Alta

SD has an employee seminar in Alta in December - should we go there? We’ll discuss it in the next meeting.

11. Next meeting, closing

Next meeting 30.10.2006 at 9:30.

Closed at 11:24.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond