Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:56.

Present: Børre, Saara, Sjur, Thomas, Tomi

Absent: Maaren, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Børre added the new, fully i18n-ed documentation to our public site.

TODO:

4. Corpus gathering

Sjur has received a heap of Bible files from Pia. Børre will add them to the corpus.

TODO:

5. Corpus infrastructure

Aligner

The aligner produces empty output, which is not so useful :-) Børre has been working on fixing this bug.

TODO:

6. Infrastructure

Xerox tools wrapped as servers

TODO:

7. Linguistics

Names and multilinguality

TODO:

  1. finish first version of the editing (Sjur)
  2. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  3. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  4. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  5. start to use the xml file as source file
  6. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  7. merge placenames which are erroneously in different entries, e.g. Helsinki, Helsingfors, Helsset (linguists)
  8. publish the name lexicon on risten.no (Sjur)
  9. add missing parallel names for placenames (linguists)
  10. add informative links between first names like Niillas and Nils (linguists)

North Sámi

The latest change in sme-lex.txt:

 +N+SgNomCmp:   R ; ! gahpirgánda, čoahkkinordnet
 +N+SgNomCmp:X7 R ; ! gahpergánda, čoahkkenordnet

How much will this overgenerate? Would it be better to have two different lexicons, or to lexicalise exceptional compounding? (GAHPIR has 2329 members…)

Command to extract the relevant parts of GAHPIR words:

grep GAHPIR noun-sme-lex.txt | cut -d":" -f1 | cut -d" " -f1 |
cut -d"#" -f3 | cut -d"#" -f2 | rev | sort | uniq | rev | less
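For those who prefer a script, the pipeline above can be approximated in Python. This is only a sketch under stated assumptions: it assumes that `#` marks compound boundaries in the lemma and that the aim is to list the last compound part of each GAHPIR entry, sorted by its reversed spelling so that similar word endings end up next to each other. The function name and the sample entries are made up for illustration and are not taken from noun-sme-lex.txt. The ordering is by Unicode code point, which may differ from what `sort` gives under a given locale.

```python
def reverse_sorted_heads(lines, lexicon="GAHPIR"):
    """Unique last compound parts of entries mentioning `lexicon`,
    sorted by reversed spelling (similar endings become adjacent)."""
    parts = set()
    for line in lines:
        if lexicon not in line:            # grep GAHPIR
            continue
        lemma = line.split(":", 1)[0]      # cut -d":" -f1
        lemma = lemma.split(" ", 1)[0]     # cut -d" " -f1
        # Assumption: keep only what follows the last '#'.
        last = lemma.rsplit("#", 1)[-1]
        if last:
            parts.add(last)                # set membership replaces uniq
    # rev | sort | rev
    return sorted(parts, key=lambda word: word[::-1])


# Made-up sample entries, only to show the expected input shape.
sample = [
    "gahpir GAHPIR ;",
    "čoahkkin GAHPIR ;",
    "nisson#gahpir:nisson#gahpir GAHPIR ;",
    "beana BEANA ;",
]

print(reverse_sorted_heads(sample))  # -> ['čoahkkin', 'gahpir']
```

On the real file one would pass the open file object instead of `sample`.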

One possibility is to split GAHPIR into three lexica:

  1. vowel lowering (X7)
  2. no vowel lowering
  3. both for the same lexeme

Another possibility could be to write two-level rules, if lowering/non-lowering follows a certain pattern.
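To get a first impression of how the members would distribute over the three proposed lexica, one could check which compound forms are actually attested, e.g. in a corpus word list. The sketch below is hypothetical throughout: `lowered()` only models the i > e change seen in the two examples above (gahpir/gahper, čoahkkin/čoahkken), prefix matching is a crude stand-in for real compound analysis, and the list of attested compounds is invented for the example and says nothing about actual usage.

```python
def lowered(stem):
    """Hypothetical simplification of X7: lower the last 'i' to 'e'."""
    head, sep, tail = stem.rpartition("i")
    return head + "e" + tail if sep else stem


def classify(stems, compounds):
    """Group stems by which first-part forms occur in `compounds`."""
    groups = {"lowering": [], "no_lowering": [], "both": [], "unattested": []}
    for stem in stems:
        low_form = lowered(stem)
        # Unlowered form used as the first part of a longer word.
        has_plain = any(c != stem and c.startswith(stem) for c in compounds)
        # Lowered form used as first part (only if lowering changes the stem).
        has_low = low_form != stem and any(
            c.startswith(low_form) for c in compounds
        )
        if has_plain and has_low:
            key = "both"
        elif has_low:
            key = "lowering"
        elif has_plain:
            key = "no_lowering"
        else:
            key = "unattested"
        groups[key].append(stem)
    return groups


# Invented attestation list, for illustration only.
attested = ["gahpirgánda", "gahpergánda", "čoahkkenordnet"]
groups = classify(["gahpir", "čoahkkin"], attested)
print(groups)
```

If most stems fall cleanly into one of the first two groups and the distribution follows a phonological pattern, that would speak for two-level rules; a large "both" group would speak for keeping the two continuation lines as they are.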

TODO:

Lule Sámi

A lot of work has been done on the sme name lexicon, so the smj copy should be updated. Nothing new on the smj proper noun lexicon itself.

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in [the meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].

TODO:

Postponed:

9. Spellers

Alpha version

The alpha version was received one and a half weeks ago and contains spellers for sme and smj, as well as a hyphenator for sme. The proofing files are available from Sjur.

Polderland data generation

TODO:

Aspell

TODO when the major part of the PLX conversion is done:

Testing

TODO:

10. Other

Corpus contracts

TODO:

Bug fixing

57 open Divvun/Disamb bugs, and 23 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

New Perl modules

TODO:

11. Next meeting, closing

The next meeting is 3.1.2007, 09:30 Norwegian time.

The meeting was closed at 11:09.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond