Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup


  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 11:11.

Present: Saara, Sjur, Thomas, Trond

Absent: Børre, Maaren (sick leave), Tomi

Main secretary: Trond

Agenda accepted as is.

2. Reviewing the task list from the last meeting








3. Documentation


4. Corpus gathering

Trond added sme bureaucratic texts, roughly 0.4 million words; the total size is now approaching 1.5 million words.

Trip to Sámi municipalities

Børre is on his trip.


See a previous meeting memo for what’s to be done.

TODO: Send out the rest of the letters (Børre)


Sæth replied by e-mail. He hasn’t had time to follow up, but will try to include us in their plans.

Olavi Korhonen’s Lule Sámi dictionary.

TODO: Børre to contact Olavi Korhonen and Kuhmunen

KIO Grafisk and the Iđut books


Bible texts

We will get text from Finland, but have not received any yet. We have received the Swedish text from Sweden. As for the latest HTML versions from Norway, Trond did not contact them last week.

The Swedish HTML has arrived, without paratext. Norsk bibelselskap has not sent corrected New Testament versions for sme, nor paratext for nno/nob.


Min Áigi

Børre has received texts and forwarded them to Trond. There are problems with Unicode in the filenames: the non-ASCII characters appear as unparsed strings containing the octal code of the character(s) in question.

The files (approx. 2000 files) have been added here: /usr/local/share/corp/orig/sme/news/MinAigi/

We have problems with Unicode characters in filenames. All characters with diacritics are stored decomposed on Mac OS X, and when the files are transferred to Linux (cochise) via a tar file, the characters are not recomposed, so the files can only be accessed by typing the combining diacritic. We now have the same problem on the Mac, which in practice makes it impossible to access a set of files like:

a84-231-8-254:~ sjur$ l a+TAB
áda  áde  ádo  åde
a84-231-8-254:~ sjur$ l a

This was solved once before, and we need to look at this again. The old Bugzilla issue should be reopened.


Unicode table, in case we need to recompose manually:

á    0061 0301
č    0063 030C
đ    0064 0335
ŋ    006E
š    0073 030C
ŧ    0074 0335
ž    007A 030C
æ    00
ø    006F 0337
å    0061 030A
ö    006F 0308
ä    0061 0308
Á    0041 0301
Č    0043 030C
Đ    0044 0335
Ŋ    004E
Š    0053 030C
Ŧ    0054 0335
Ž    005A 030C
Æ    00
Ø    004F 0337
Å    0041 030A
Ö    004F 0308
Ä    0041 0308
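
Recomposition does not have to be done by hand. The following is a minimal sketch, assuming Python is available on the machine and that recomposing to Unicode NFC is what we want; the function names `recompose` and `plan_renames` are made up for illustration. As far as we know, đ, ŧ, ŋ, æ and ø have no canonical decomposition in Unicode, so normalization leaves them (and sequences such as d + 0335) unchanged.

```python
import unicodedata


def recompose(name: str) -> str:
    """Return the filename in NFC form (base letter + combining mark -> one character)."""
    return unicodedata.normalize("NFC", name)


def plan_renames(names):
    """Return (old, new) pairs for the names that change under NFC.

    Hypothetical helper: it only computes the pairs, it does not touch the disk.
    """
    return [(n, recompose(n)) for n in names if recompose(n) != n]


# 'a' + COMBINING ACUTE ACCENT (0061 0301), as in the table above
decomposed = "a\u0301de"
for old, new in plan_renames([decomposed, "plain.txt"]):
    print(repr(old), "->", repr(new))
```

The pairs returned by `plan_renames` could then be passed to `os.rename`, after checking that no two files end up with the same name.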

Min Áigi seems to have changed from text files to MS Word around issue 015-05.


Promised to send us texts, but nothing has arrived yet.

TODO: Børre to contact them.

Sámi Instituhtta

Audhild Schanche has signed the contract. We will have to contact them about transferring the texts.

TODO: Børre to contact them.

5. Corpus infrastructure


Changes and updates because of the Divvun public tender

User account admin and infra: see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO: see above under Documentation.

Automatic build of the content of our corpus repo: also see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].


Name change again?

gt -> gtbound/
gtbound -> some nifty new letter... ?
gtfree -> some nifty new letter... ?

Trond to come up with some new suggestion.

Free and non-free texts

More info in a [previous meeting memo.|/admin/weekly/2006/Meeting_2006-03-13.html]


Linking parallel files

DECISION: We’ll keep the original filename, and store linking info in the header (has to be added manually).


lg-a, lg-b
<p id>
<s id>

key file
eq-link# lga-1 = lgb-1,2
eq-link# lga-2 = lgb-3
eq-link# lga-2 = lgb-4
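
To make the key file idea concrete, here is a minimal sketch of how such lines could be read. It assumes the format is exactly as sketched above (`eq-link# source = target` with a comma-separated list of numbers on the target side), and that repeated source ids should be merged; the function names are hypothetical, and nothing here is a decided format.

```python
import re
from collections import defaultdict

# Assumed line format: "eq-link# lga-1 = lgb-1,2"
LINK_RE = re.compile(r"^eq-link#\s*(\S+)\s*=\s*(\S+)$")


def expand_targets(spec: str):
    """Expand 'lgb-1,2' into ['lgb-1', 'lgb-2']."""
    prefix, _, numbers = spec.rpartition("-")
    return [f"{prefix}-{n}" for n in numbers.split(",")]


def parse_key_file(lines):
    """Map each source sentence id to the list of target sentence ids."""
    links = defaultdict(list)
    for line in lines:
        match = LINK_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not an eq-link line
        source, targets = match.groups()
        links[source].extend(expand_targets(targets))
    return dict(links)


sample = [
    "eq-link# lga-1 = lgb-1,2",
    "eq-link# lga-2 = lgb-3",
    "eq-link# lga-2 = lgb-4",
]
for source, targets in parse_key_file(sample).items():
    print(source, "=", ", ".join(targets))
```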

<.>...<?> <!>

sme-dis.rle: DELIMITERS The preprocess and the aligner should agree on what is a sentence. (.!?)

... this is what sme-dis.rle will analyse <.>

<!> <?>

(sme-tdis.rle: DELIMITERS <¶>)

More texts to the graphical corpus interface:


Top priorities:

  1. Trond and Saara to discuss with Lars.
  2. Lars to add text to the server.
  3. Tomi to prepare for the parallel corpus.

Language recognition


6. Infrastructure


Today, we have two anchor files in addition to the original one.



Trond and Thomas have been updating the propernoun file with ^ tags. The tag is needed in front of compound parts that begin with a vowel or with two or more consonants. Compound parts beginning with a single consonant are handled correctly.
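
The rule for where the tag is needed can be sketched as a small check. This is only an illustration: the vowel inventory below is an assumption, the function name `needs_caret_tag` is made up, and the sketch says nothing about how the propernoun file itself is laid out.

```python
# Assumed vowel inventory; adjust for the language in question.
VOWELS = set("aáeiouyæøåäö")


def needs_caret_tag(part: str) -> bool:
    """True if a compound part begins with a vowel or with two or more consonants."""
    letters = part.lower()
    if not letters:
        return False
    if letters[0] in VOWELS:
        return True
    # First letter is a consonant: the tag is needed only if the second is one too.
    return len(letters) >= 2 and letters[1].isalpha() and letters[1] not in VOWELS


for part in ["álgu", "stáhta", "guovlu"]:
    print(part, needs_caret_tag(part))
```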


7. Linguistics

General - hyphenation

See the discussion, open questions and decision in the [previous meeting memo.|/admin/weekly/2006/Meeting_2006-04-03.html]


North Sámi

Nothing specific this week.

Lule Sámi


8. Name lexicon infrastructure


  1. refactor and prepare for multiple collections:
    1. refactor the code into more and more specific components according to our folder hierarchy (Tomi, Sjur)
      1. things are moving forward
  2. develop the needed XQueries and interface (Sjur, Tomi)
    1. developing
  3. data synchronisation between and the cvs repo (Tomi)
    1. nothing this week
  4. test and review when ready

Discussion postponed until Sjur is back.

9. Spellers

Nothing until the new proper noun lexicon is in place. We don’t have enough people to do both.

10. Public tender

2 offers received, from Polderland and Lingsoft.


11. Other

Bug fixing

50 open Divvun/Disamb bugs, and 25 bugs

Please help Saara with bug 279. Not much help…

Saara will contact Roy on this issue.

After the corpus issues have been somewhat settled, we should do a bug barnraising. … and then a new one after the name lexicon is fixed.

12. Summary, task list








13. Next meeting, closing

08.05.2006 09:30

Closed at 13:28