Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup


  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Speller infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10:35.

Present: Børre, Maaren, Sjur, Thomas, Tomi, Trond

Absent: Saara

Main secretary: Tomi

Agenda accepted as is.

2. Reviewing the task list from the last meeting








3. Documentation

Documentation tasks:

  1. Add documentation on our corpus infrastructure and our corpus work in general (“To be done by the ones making the corpora”: Børre, Tomi, Trond, Saara).
  2. Now we have 4 documents:
    1. Correct corpus (disamb usage)
    2. Corpus plan (for the disamb corpus cwb)
    3. Corpus conversion, two versions, in infra and in ling. Tomi and Børre have done parallel work ;-(
    4. catxml

For the basic corpora, we need 3 types of documentation, or doc for 3 target groups:

  1. For the users/linguists:
    1. Which corpora exist, and how do I use them? (this info is now scattered)
  2. For the collectors:
    1. How do I add texts, where do I add them, how do I convert them (this is the Corpus conversion doc)
  3. For the programmer:
    1. What did I actually do? (this is partly the catxml doc)

For the work on the graphical user interface, we need documentation as well, in principle along the same lines, except that the user is not the same linguist as above.

  1. add/update Aspell documentation (Tomi)
  2. finish divvun2web script (Børre)
    1. the cronjob is up and working. It needs a better error reporting mechanism, though.
  3. as always: document what you’re doing:-) (all)


Do we need validation upon CVS check-in? What about forrestbot? An XML validation check (according to our DTD) is a good thing, and we would like it to be implemented.

We need better error reporting, and errors should preferably be caught before cvs commit.
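A minimal sketch of what such a check before commit could look like, in Python. It is an illustration only, not something that exists in our setup: the function names are hypothetical, the DTD step assumes that the `xmllint` tool is installed, and how the check would be hooked into CVS is left open.

```python
import subprocess
import xml.etree.ElementTree as ET


def is_well_formed(xml_data):
    """Return (True, "") if xml_data parses as XML, else (False, message)."""
    try:
        ET.fromstring(xml_data)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


def xmllint_command(xml_path, dtd_path):
    """Build the xmllint call that validates xml_path against dtd_path."""
    return ["xmllint", "--noout", "--dtdvalid", dtd_path, xml_path]


def validate_file(xml_path, dtd_path):
    """Return a list of error strings; an empty list means the file passed."""
    # Cheap check first: is the file well-formed at all?
    with open(xml_path, "rb") as handle:
        ok, message = is_well_formed(handle.read())
    if not ok:
        return [f"{xml_path}: not well-formed: {message}"]
    # Then validate against the DTD (assumes xmllint is on the PATH).
    result = subprocess.run(
        xmllint_command(xml_path, dtd_path),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return [f"{xml_path}: {result.stderr.strip()}"]
    return []
```

Returning the errors as a list, rather than stopping at the first failure, would let a wrapper script report all problem files in one go before anything is committed.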


4. Corpus gathering

See notes from the 12.9. meeting for details about the steps forward.



The most problematic issue:

Who holds the copyright to extracted material, such as single words, collections of words, or syntactic structure (potentially with some words filled in)? We need this to be controlled by us, not by the authors. The exact borderline is hard to define.

We will send the contracts as is to the lawyers, in parallel with waiting for comments from Kimmo. The SD lawyers will be glad to not get this task, as they’re overloaded as is. Trond will follow up on the issue.

North Sámi New Testament

Trond has been in contact with Bibelselskapet, and sent the version he received from a colleague, gathered from the web. Now we await a reaction from Bibelselskapet.

Lule Sámi New Testament

Børre has converted the translation (which was only available in PDF) to RTF, and sent the files back to Bibelsällskapet for corrections.

Update (week 39): Olavi Korhonen had some problems with fonts in the document, and Børre helped him with that. We seem to have the final version now.

5. Corpus infrastructure

Naming conventions and directory structure

See [notes from the 12.9. meeting|Meeting_2005-09-12.html#Naming+conventions+and+directory+structure] for details about the decision and implementation, as well as a list of tasks.

Børre has done some work, but it exists only locally on his machine. Further discussion with Trond is needed before the changes are copied to cochise.

Corpus conversion

PDF to XML

Extraction priority list

  1. retain correct Sámi characters: ok
  2. retain word and sentence order: ok
  3. retain paragraph order: ok
  4. retain structure
    1. paragraphs: ok, by Perl
    2. titles, headers: ok, by Perl
    3. metadata (author, year, etc.): ok, when it is present in the document
    4. lists: no
    5. tables: no
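The first item on the list, retaining correct Sámi characters, could be checked automatically after each conversion. The following Python sketch is only an illustration, under stated assumptions: the helper names are hypothetical, the letter set covers only the North Sámi letters outside basic Latin, and a reference text with correct characters is assumed to be available for comparison. Lule Sámi would need its own letter set.

```python
import unicodedata
from collections import Counter

# North Sámi letters outside basic Latin, lower and upper case.
SAMI_LETTERS = "áčđŋšŧžÁČĐŊŠŦŽ"


def sami_letter_counts(text):
    """Count each Sámi-specific letter in text (after NFC normalisation)."""
    normalised = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in normalised if ch in SAMI_LETTERS)


def lost_letters(source_text, converted_text):
    """Return {letter: missing_count} for letters that are rarer after conversion."""
    before = sami_letter_counts(source_text)
    after = sami_letter_counts(converted_text)
    return {ch: n - after[ch] for ch, n in before.items() if after[ch] < n}
```

An empty result means that no Sámi letter occurs less often in the converted text than in the reference. The check compares counts only, so it would not notice letters that have moved or been swapped for each other.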

A Perl module for character conversions

Problems found so far using open-source tools:



6. Linguistics

Name lexicon

See notes from the 12.9. meeting

Place name summary

Sjur needs a short resumé of the present status with respect to the parallel place names. The resumé will be given to the project board for information. This is the situation:

Conclusion: it has been much easier to get place names from Finland and Sweden than anticipated, and so far at no cost to the project. So far, the place names have been the single most important contribution from Finland and Sweden for this project.

Twol SETS definition issue

Trond and Thomas tried to define Lule Sámi G1, G2, G3 sequences in the SETS section of the twol file. This did not succeed; it turned out we had done it in the xfst spirit. We would like to have input on this issue.

North Sámi

Lule Sámi


  1. An empirical overview
    1. Numeral generation
    2. Numeral inflection
    3. Numerals as parts of compounds
  2. A clear concept of how we want to treat them
    1. Tagging
  3. A treatment

7. Speller infrastructure


Write documentation here as well.

The munch-list is working, and the affix file is improving. See [15.8. meeting memo|/admin/weekly/2005/Meeting_2005-08-15.html] for more.

Roy Dragseth e-mailed that he had to terminate the aspell processes Tomi had left running on cochise after work, because they consumed all the memory.

See 12.9. meeting memo for a list of open issues.

8. Other

Technical issues

Bug fixing


9. Summary, task list








10. Next meeting, closing

10.10.2005 10:00

Closed at 11:20