Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 10:00.

Present: Børre, Saara, Sjur, Thomas, Trond, Tomi

Absent: Maaren

Main secretary: Sjur

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Trond added sme bureaucratic texts, roughly 0.4 million words; the total size is now approaching 1.5 million words.

Trip to Sámi municipalities

Børre back from his trip.

The Isak Saba guovddaš will not sign unless the contract is in Sámi, and they are sceptical about giving texts to the project for free. Maaren (and Trond) translated the A contract into Sámi yesterday.

Note to be placed somewhere:

[http://www.oqaasileriffik.gl/dk/ Greenland’s language secretariat] has a paradigm generator based on Xerox tools; we should ask for their source code. (The site is in Greenlandic, English and Danish.)

The Min Áigi format needs attention: \@ingress etc. should be handled for the .txt files, but it is business as usual for the .doc files. We should have \@code as paragraphs, and look at the rest. Trond to look at the format.

Collecting

See a previous meeting memo for what’s to be done.

TODO: Send out the rest of the letters (Børre)

Signed contracts since last meeting:

Odin

Sæth replied by e-mail; he hasn’t had time to follow up, but will try to include us in their plans.

Olavi Korhonen’s Lule Sámi dictionary.

TODO: Børre to contact Olavi Korhonen and Kuhmunen

KIO Grafisk and the Iđut books

TODO:

Bible texts

We will get text from Finland, but still haven’t received any. We have received the Swedish text from Sweden. As for the last html versions from Norway, Trond did not contact them last week.

Swedish html has arrived, no paratext. Norsk bibelselskap has not sent corrected New Testament versions for sme, and not paratext for nno/nob.

TODO:

Min Áigi

Børre has received texts and forwarded them to Trond. There are problems with Unicode in the filenames: the non-ASCII characters appear as unparsed strings containing the octal code of the character(s) in question.

The files (approx. 2000) have been added here: /usr/local/share/corp/orig/sme/news/MinAigi/

We have problems with Unicode characters in filenames. All characters with diacritics are stored decomposed on MacOS X, and when transferring the files to Linux (cochise) via a tar file, the characters are not recomposed, making the files accessible only by typing the combining diacritic - not nice. We also now have the same problem on Mac, making it in practice impossible to access a set of files like:

a84-231-8-254:~ sjur$ l a+TAB
áda  áde  ádo  åde
a84-231-8-254:~ sjur$ l a

This was solved once before, and we need to look at it again. The old Bugzilla issue has been reopened: http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=76
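
One possible way to recompose the decomposed file names on the Linux side is sketched below. It is only an illustration, not the fix that was used earlier: the function names `nfc_name` and `recompose_filenames` are made up for this memo, and the sketch assumes that the file names decode as UTF-8 and that the target file system does not normalise names itself.

```python
import os
import unicodedata


def nfc_name(name):
    """Return the precomposed (NFC) form of a file name."""
    return unicodedata.normalize("NFC", name)


def recompose_filenames(root, dry_run=True):
    """Find entries under root whose names are not in NFC and rename them.

    Returns a list of (old_path, new_path) pairs. With dry_run=True
    nothing is renamed, so the list can be inspected first.
    """
    renames = []
    # Walk bottom-up so that entries are renamed before their parent directory.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            composed = nfc_name(name)
            if composed == name:
                continue  # already precomposed (or plain ASCII)
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, composed)
            if os.path.exists(new):
                continue  # never overwrite an existing precomposed name
            renames.append((old, new))
            if not dry_run:
                os.rename(old, new)
    return renames
```

Running it first with `dry_run=True` on a copy of the Min Áigi directory would show which names would change before anything is touched.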

TODO:

Min Áigi seems to have changed from text files to MS Word around issue 015-05.

Kåfjord

Promised to send us texts, but nothing has arrived yet.

TODO: Børre to contact them.

Sámi Instituhtta

Audhild Schanche has signed the contract. We will have to contact them about transferring the texts.

TODO: Børre to contact them.

5. Corpus infrastructure

https://giellalt.uit.no/lang/corp/corpus-summary.html

TODO:

Changes and updates because of the Divvun public tender

User account admin and infra: see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO: see above under Documentation.

Automatic build of the content of our corpus repo: also see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO:

Name change again?

gt -> gtbound/
gtbound -> some nifty new letter... ?
gtfree -> some nifty new letter... ?

Trond to come up with some new suggestion.

Free and non-free texts

More info in a [previous meeting memo|/admin/weekly/2006/Meeting_2006-03-13.html].

TODO:

More texts to the graphical corpus interface:

TODO:

Top priorities:

  1. Trond and Saara to discuss with Lars.
  2. Lars to add text to the server.
  3. Tomi to prepare for the parallel corpus.

Language recognition

TODO:

6. Infrastructure

Aligner

Today, we have two anchor files in addition to the original one.

TODO:

Hyphenator

Trond and Thomas have been updating the propernoun file with ^ tags. We need the tag in front of compound parts beginning in a vowel or in two or more consonants. Compound parts beginning with one consonant are handled correctly.
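
The rule above can be sketched as a small check. This is only an illustration under stated assumptions: the vowel inventory, the function names and the example parts are hypothetical and are not taken from the propernoun file, and the sketch assumes the tag is never placed before the first part of a compound.

```python
# Assumed vowel inventory; the real hyphenation rules may use a different set.
VOWELS = set("aáeiouyæøåäö")


def needs_boundary_tag(part):
    """True if a compound part begins with a vowel or with two or more consonants."""
    letters = part.lower()
    if not letters:
        return False
    if letters[0] in VOWELS:
        return True
    # Initial consonant: a tag is needed only if the next letter is a consonant too.
    return len(letters) > 1 and letters[1].isalpha() and letters[1] not in VOWELS


def tag_compound(parts):
    """Join compound parts, putting '^' in front of non-initial parts that need it."""
    out = [parts[0]]
    for part in parts[1:]:
        out.append(("^" if needs_boundary_tag(part) else "") + part)
    return "".join(out)


print(tag_compound(["stor", "elva"]))      # second part begins with a vowel
print(tag_compound(["nord", "troms"]))     # second part begins with two consonants
print(tag_compound(["sør", "varanger"]))   # single initial consonant, no tag
```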

TODO:

7. Linguistics

General - hyphenation

See the discussion, open questions and decisions in the [previous meeting memo|/admin/weekly/2006/Meeting_2006-04-03.html].

TODO:

North Sámi

We should have some linguistic workshops while Maaren is here.

There are some serious bugs:

diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkin+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+Actio#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+N+Actio+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+N+Actio+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehtu+N+SgNomCmp#juohkit+V+TV+N+Actio+SgNomCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehto#juohkin+N+SgGenCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehto#juohkin+N+SgGenCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehto#juohkin+N+SgGenCmp#doaibma+N+Sg+Gen+PxSg2
diehtojuohkindoaimmat   diehto#juohkin+N+SgNomCmp#doaibma+N+Pl+Nom
diehtojuohkindoaimmat   diehto#juohkin+N+SgNomCmp#doaibma+N+Sg+Acc+PxSg2
diehtojuohkindoaimmat   diehto#juohkin+N+SgNomCmp#doaibma+N+Sg+Gen+PxSg2

After postprocessing, which obeys Karlsson’s law (choose the analysis with the fewest compound boundaries), it is reduced to:

"<diehtojuohkindoaimmat>"
         "diehtojuohkin#doaibma" N Sg Acc PxSg2
         "diehtojuohkin#doaibma" N Pl Nom
         "diehtojuohkin#doaibma" N Sg Gen PxSg2
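
The core of the fewest-boundaries selection can be sketched as follows. This is a minimal illustration, not the project's postprocessing script: the function name is made up, and it only counts `#` marks and drops verbatim duplicates. The real postprocessing evidently does more, since the output above has a single boundary while the shortest raw analyses have two.

```python
def fewest_boundaries(analyses):
    """Keep the distinct analyses that have the fewest '#' compound boundaries."""
    # dict.fromkeys drops verbatim duplicates while keeping the original order.
    unique = list(dict.fromkeys(analyses))
    if not unique:
        return []
    fewest = min(a.count("#") for a in unique)
    return [a for a in unique if a.count("#") == fewest]


sample = [
    "diehtu+N+SgNomCmp#juohku+N+SgNomCmp#itna+N+SgNomCmp#doaibma+N+Pl+Nom",
    "diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Pl+Nom",
    "diehtu+N+SgNomCmp#juohkin+N+SgGenCmp#doaibma+N+Pl+Nom",
    "diehto#juohkin+N+SgNomCmp#doaibma+N+Pl+Nom",
]

for analysis in fewest_boundaries(sample):
    print(analysis)
```

On this sample the three-boundary juohku#itna reading and the repeated line are dropped, leaving the two distinct analyses with two boundaries.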

Lule Sámi

TODO:

8. Name lexicon infrastructure

TODO:

  1. refactor and prepare risten.no for multiple collections:
    1. refactor the code into more and more specific components according to our folder hierarchy (Tomi, Sjur)
      1. things are moving forward
  2. write down the most common editing scenarios (to be used as guides for making the editing interface) (adding / changing ) (Trond, Tomi)
  3. develop the needed XQueries and interface (Sjur, Tomi)
    1. developing
  4. data synchronisation between risten.no and the cvs repo (Tomi)
    1. nothing this week
  5. test and review when ready
  6. Rethink the doubletagging procedure for names, consider grammatically motivated semtag conversion routines (“Helsinki” from Plc to Obj to Org) (Trond)

9. Spellers

Nothing until the new proper noun lexicon is in place. We don’t have enough people to do both. Here are our most important targets for spellers in the near future:

10. Public tender

TODO:

11. Other

Bug fixing

50 open Divvun/Disamb bugs, and 25 risten.no bugs

Please help Saara with bug 279. Not much help… Saara will contact Roy on this issue.

After the corpus issues have been somewhat settled, we should do a bug barnraising, and then another one after the name lexicon is fixed.

Move to victorio

Xerox tools: update PATH to include

/opt/sami/xerox/c-fsm/ix86-linux2.6-gcc3.4/bin/
/opt/sami/xerox/c-fsm/ix86-linux2.6-gcc3.4/lib/

Compilation on victorio still fails despite a path fix, cf. bug #282.

*** Building sme.save ***

printf "compile-source sme/src/sme-lex.txt sme/src/adv-sme-lex.txt ... \n\
read-rules sme/bin/twol-sme.bin \n\
compose-result \n\
save-result sme/bin/sme.save \n\
quit \n" > tmp/save-script
lexc -utf8 < tmp/save-script
/bin/sh: lexc: command not found
make: *** [sme/bin/sme.save] Error 127
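
Error 127 with "command not found" means the shell started by make could not find lexc on its PATH. A quick way to check what a given environment actually sees is sketched below. The function names are hypothetical; the only thing taken from this memo is the bin directory listed above. One thing worth checking is whether the environment in which make runs really has the updated PATH, which is an assumption about the cause, not a confirmed diagnosis.

```python
import os
import shutil

# Directory from the PATH note above.
XEROX_BIN = "/opt/sami/xerox/c-fsm/ix86-linux2.6-gcc3.4/bin/"


def path_entries(path_value):
    """Split a PATH string into normalised directory entries."""
    return [os.path.normpath(p) for p in path_value.split(os.pathsep) if p]


def diagnose(tool="lexc", bin_dir=XEROX_BIN, path_value=None):
    """Report whether bin_dir is on PATH and where (if anywhere) tool is found."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return {
        "dir_on_path": os.path.normpath(bin_dir) in path_entries(path_value),
        "tool_found": shutil.which(tool, path=path_value),
    }


if __name__ == "__main__":
    print(diagnose())
```

If `dir_on_path` is true but `tool_found` is `None`, the directory is listed but contains no executable called lexc; if `dir_on_path` is false, the PATH change has not reached this environment.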

12. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

13. Next meeting, closing

15.05.2006 09:30

Closed at 11:18