Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup


  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:56.

Present: Børre, Sjur, Thomas, Trond, Tomi

Absent: Maaren, Saara

Main secretary: Thomas (with help from others)

Agenda accepted as is.

2. Reviewing the task list from the last meeting








3. Documentation


4. Corpus gathering


See a previous meeting memo for what’s to be done.


New contracts:

Olavi Korhonen’s Lule Sámi dictionary.

Phoned Korhonen. He was willing to sign the contracts, and wanted some kind of access to our corpus. He wants to collect words for his Northern Sámi dictionary.


KIO Grafisk and the Iđut books


Bible texts

We will get text from Finland, but still haven’t received any. Swedish html has arrived, no paratext. Norsk bibelselskap has not sent corrected New Testament versions for sme, and not paratext for nno/nob.


Davvi Girji

Called her last week. She said Davvi Girji OS would give us permission to use texts by their authors. She wasn’t sure whether we could get the texts directly from Davvi Girji, because of the copyright on pictures and other artwork in the books. She said they (the Davvi Girji staff) were going to have a meeting right after our phone conversation and make some kind of decision on this case. We haven’t heard anything from Kåven, though.


Min Áigi

The Min Áigi format should be dealt with: \@ingress etc. should be handled for the .txt files, but it is business as usual for the .doc files. Saara has written the XSL conversion routine for the typographic tags. It still needs some fine-tuning, as there are some @ tags that were not included in our initial list.



Promised to send us texts. Some texts have arrived, but nothing from Ája.


Sámi Instituhtta

Børre contacted Richard Valkeapää, the IT consultant at NSI. He put it on his to-do list, as he would have to contact the person who has worked with the newspaper texts anyway. He said this would be done in the near future (within a month).


5. Corpus infrastructure

User accounts and access

Talked to Roy Dragseth last week about this. It turns out that we should do the administration ourselves. We will have to discuss what level of access we would like to give users of our corpus.

This is from our contracts:

Contract 1

3.7 The recipient may grant a personal right of use of the text collection to
persons who have signed the right-of-use contract in Appendix 2. The recipient
shall not grant a right of use of the text collection to persons whom there is
reason to believe will breach the terms of the contract. The recipient
undertakes to inform the provider immediately upon becoming aware of a possible
breach of these terms.

(Appendix 2 = Contract 3)

Contract 3

4.1 The user only has the right to use the texts for research, or for such
commercial language technology or other similar purposes as do not conflict
with the Norwegian Copyright Act (Lov om opphavsrett til åndsverk). The user
may use the texts to make use of the linguistic features (e.g. statistical
information, grammatical rules and semantic descriptions) he/she has found
through research, and may extract shorter quotations from the texts.

5.3 The user may not take larger portions of running text in the text
collection than short quotations away from the server on which the text
collection is installed. Temporary copies may be stored on the server itself,
on the condition that the user takes data security into account. This
restriction does not apply to public documents (e.g. NOU, reports to the
Storting, etc.). Each document states clearly which license applies to it.


Name change again?


Free and non-free texts

More info in a [previous meeting memo.|/admin/weekly/2006/Meeting_2006-03-13.html]


More texts to the graphical corpus interface:


Top-three priorities:

  1. discuss more with Lars on tag unification, and unify them (Trond)
  2. change ccat to be able to create the right input for the corpus analysis (XML-tagged output) (Tomi). Work estimate: a few (2-3?) days. Nothing this week; we will re-evaluate (and schedule) next Monday
  3. add text to the server (Lars)

Language recognition


6. Infrastructure

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.



Trond and Saara will continue this issue.


Thomas is finished with adding ^ tags to the smj noun file, and has continued working on the sme noun file.

Trond and Tomi have been working on the smj rule component, and have improved the treatment of both weak grade consonant clusters (preconsonantal geminates) and some loan word patterns.


7. Linguistics

General - hyphenation

See discussion, open questions and decision in a [previous meeting memo.|/admin/weekly/2006/Meeting_2006-04-03.html]


North Sámi


Lule Sámi

There are some open issues in the marginal area of the smj transducer:


8. Name lexicon infrastructure


  1. finish refactoring for multiple collections in the search interface (Sjur)
    1. improving, not finished
  2. develop the needed XQueries and interface (Sjur, Tomi)
    1. progressing, done some, haven’t committed (adding new term, create-termc-entry.xq)
  3. data synchronisation between and the cvs repo (Tomi)
    1. discussion started on eXist-list, we’ll wait a couple of days to see what’s coming out of it, and if nothing useful to us, we’ll add our use case with questions
  4. test and review when ready
  5. Rethink the doubletagging procedure for names, consider grammatically motivated semtag conversion routines (“Helsinki” from Plc to Obj to Org, or the Lyndi England issue) (Trond)

9. Spellers

Nothing until the new proper noun lexicon is in place. We don’t have enough people to do both. Here are our most important targets for spellers in the near future:

10. Public tender

Finnut called, and here’s their evaluation: if we think that the offers are incomplete or otherwise not fully acceptable, we can enter negotiations with the companies, effectively cancelling the current public tender. This can only be done as long as we don’t change the public tender document (that is, the foundation for the public tender) - if it is changed, we have to announce the whole competition again, with the usual 53 days minimum deadline for applications.

Sjur will send an e-mail to the project board, outlining the different aspects of the two offers, and ask about their opinion on the following questions:


11. Other

Summer vacation

Who     When
Børre   ?
Linda   ?
Maaren  ?
Saara   July
Sjur    ?
Thomas  3.7 - 7.8
Trond   July
Tomi    8.7 - 16.7, more?

Bug fixing

45 open Divvun/Disamb bugs, and 25 bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with [bug 279|] (Perl locale). Not much help… Saara will contact Roy on this issue.

After the corpus issues have been somewhat settled, we should do a bug barnraising. … and then a new one after the name lexicon is fixed.


0.3 is working fine on Mac, Linux and Windows. It should be installed on all computers, cf. [] (our preinstalled Xcode version is 2.0; it must be 2.1):

Easy way out when the standard Darwin Ports installation isn’t working: just get a copy of /opt/local/ from Børre.

Trond should ask Lars Nygård and Tero Avellan to install Gobby as well; has asked Per Langgård.

SEE autosave AppleScript

Copy the following into a Script Editor window:

tell application "SubEthaEdit"
	-- Loop forever; stop the script in Script Editor to end autosaving.
	repeat
		save documents
		-- Wait 60 seconds before the next save.
		delay 60
	end repeat
end tell

and click “Run”. All your SubEthaEdit documents will then be saved automatically every minute. The interval can be changed by specifying another value (in seconds) for delay.

12. Summary, task list








13. Next meeting, closing

29.05.2006 09:30

Closed at 11:35