Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup


  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Speller infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10:05.

Present: Børre, Maaren, Sjur, Thomas, Tomi, Trond

Absent: Saara

Main secretary: Tomi

Agenda accepted as is.

2. Reviewing the task list from the last meeting








3. Documentation

Documentation tasks:

  1. Add documentation on our corpus infrastructure and our corpus work in general (“To be done by the ones making the corpora”: Børre, Tomi, Trond, Saara).
  2. Now we have 4 documents:
    1. Correct corpus (disamb usage)
    2. Corpus plan (for the disamb corpus cwb)
    3. catxml

For the basic corpora, we need three types of documentation, one for each of three target groups:

  1. For the users/linguists: Which corpora exist, and how do I use them? (This information is currently scattered.)
  2. For the collectors: How do I add texts, where do I add them, and how do I convert them? (This is the Corpus conversion doc.)
  3. For the programmer: What did I actually do? (This is partly the catxml doc.)

For the work on the graphical user interface, we need documentation as well, in principle along the same lines, except that the user is not the same linguist as above.

4. Corpus gathering

Government documents (previously in PDF, now in HTML)




The most problematic issue:

Who holds the copyright to extracted material, such as single words, collections of words, or syntactic structures (potentially with some words filled in)? We need this to be controlled by us, not by the authors. The exact borderline is hard to define.

North Sámi New Testament

Our in-house North Sámi (sme) New Testament is as recent as the one at Bibelselskapet, and we were told we could simply use the version we already have.

Lule Sámi New Testament

Svenska Bibelsällskapet is putting the finishing touches on the Lule Sámi translation; we will have it soon.

Lule Sámi Dictionary

Nothing new about the meeting with Anders Kintel.

5. Corpus infrastructure

Updated task list:

  1. Set up a file and directory permission system (today we all belong to the cvs group) so that only people with root user privileges have write access to the corpus repository, at least for the original files
  2. Include the xsl files under version control (cvs? rcs?)
  3. Incorporate language detection as part of the corpus processing.
  4. We need a way to deal with hyphenated documents in catxml/preprocess:
    1. in normal cases hyphenation points should be removed
    2. when testing the robustness of our parsers, as well as when testing the hyphenator, the hyphenation points should be retained
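
The two hyphenation modes in item 4 could be sketched roughly as follows. This is only an illustration, not the catxml/preprocess implementation: it assumes that hyphenation points are encoded as the Unicode soft hyphen (U+00AD), and the function name and the `keep_hyphenation` flag are hypothetical.

```python
# Assumption: hyphenation points are stored as soft hyphens (U+00AD).
# The real corpus files may mark them differently.
SOFT_HYPHEN = "\u00ad"


def process_hyphenation(text: str, keep_hyphenation: bool = False) -> str:
    """Remove hyphenation points unless the caller asks to keep them.

    keep_hyphenation=False: the normal case, hyphenation points are removed.
    keep_hyphenation=True:  for parser robustness tests and hyphenator tests,
                            the text is returned unchanged.
    """
    if keep_hyphenation:
        return text
    return text.replace(SOFT_HYPHEN, "")
```

A single switch keeps the default corpus output clean while letting the test runs request the untouched text.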

6. Linguistics

Name lexicon

Summary: see the newsgroup

6090 entries

1788 BERN
 468 NYSTØ
 330 ACCRA
  80 MARJA
  43 ANAR
   2 NYOBL
   1 PIERA


Needed: A plan for this project:

  1. do the main markup in the present propernoun file
  2. make a script for converting it to xml (to be done one time)
  3. make a script for xml2lexc (to be done by the makefile)
    1. There is a sample file for the xml file format in gt/common/src/proper-nouns.xml
    2. There is a working xml2lexc for Komi, written by Saara
  4. make the tags etc. in the parser
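
As a rough idea of what step 3 (xml2lexc) might look like, here is a minimal sketch. The XML layout below (`lexicon`, `entry`, `lemma`, `contlex`) is invented for illustration and is not the format in gt/common/src/proper-nouns.xml; using ACCRA and BERN as continuation lexicon names is likewise an assumption. The script is not Saara's Komi xml2lexc.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real proper-nouns.xml format may differ.
SAMPLE_XML = """<lexicon name="ProperNoun">
  <entry><lemma>Accra</lemma><contlex>ACCRA</contlex></entry>
  <entry><lemma>Bern</lemma><contlex>BERN</contlex></entry>
  <entry><lemma>New York</lemma><contlex>ACCRA</contlex></entry>
</lexicon>"""


def xml_to_lexc(xml_text: str) -> str:
    """Convert the (assumed) XML name lexicon into lexc source text."""
    root = ET.fromstring(xml_text)
    lines = [f"LEXICON {root.get('name')}"]
    for entry in root.iter("entry"):
        lemma = entry.findtext("lemma", default="")
        contlex = entry.findtext("contlex", default="")
        # In lexc, a literal space inside an entry is escaped with '%'.
        lemma = lemma.replace(" ", "% ")
        lines.append(f" {lemma} {contlex} ;")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(xml_to_lexc(SAMPLE_XML), end="")
```

Because the conversion is a pure function from XML text to lexc text, it could be called from the makefile on every build, as the plan suggests.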


  1. Mark up the remaining 6090 entries before conversion starts (Maaren to do the Sámi names, Ilona to look at C-FI-NEN and other Finnish names; Trond, Thomas and Børre to look at the rest)
  2. Entries still to be done: see above
  3. This means we would need a seventh option, the unspecified name.
  4. Then split propernoun-sme-lex.txt into two files: one containing the Sámi names, generated by the xml2lexc script, and one manually written file containing the name sublexica (called propernoun-sme-morph.txt or similar)
  5. Look into efficient editing of the XML lexicon (Tomi, Saara)
  6. Then convert to xml (Tomi, Saara)
  7. Look into efficient editing of the XML lexicon again (Tomi, Saara)
  8. Look into synchronisation issues with - we want the names there as well (Tomi)
    1. Consider automatic sorting on commit
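
Automatic sorting on commit could, for instance, run a small normalisation step like the sketch below before the file is checked in. The element names (`lexicon`, `entry`, `lemma`) are hypothetical, and the plain case-insensitive sort used here is an assumption; it is not a proper Sámi collation order.

```python
import xml.etree.ElementTree as ET


def sort_entries(xml_text: str) -> str:
    """Return the XML lexicon with its <entry> elements sorted by lemma."""
    root = ET.fromstring(xml_text)
    entries = root.findall("entry")
    # Assumption: a simple case-insensitive sort is good enough here.
    entries.sort(key=lambda e: (e.findtext("lemma") or "").casefold())
    # Detach all entries, then re-attach them in sorted order.
    for entry in root.findall("entry"):
        root.remove(entry)
    root.extend(entries)
    return ET.tostring(root, encoding="unicode")
```

Keeping the file in a fixed order would make diffs between versions smaller and easier to review.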

Twol SETS definition issue

The definition of G1, G2 and G3 in Lule Sámi is still open, and we would like to have input on this issue. We also need a G3 definition for North Sámi.

Update: it is still not working, see bug 193.

SUGGESTION (Trond): Thomas, Trond and Sjur didn’t meet last week either and should try again this Wednesday instead.

North Sámi

Lule Sámi

Sjur, Thomas and Trond will continue working on the Lule Sámi issues.


  1. An empirical overview
    1. Numeral generation
    2. Numeral inflection
    3. Numerals as parts of compounds
  2. A clear concept of how we want to treat them
    1. Tagging
  3. A treatment

We will return to this issue after the name conversion.

7. Speller infrastructure

Nothing this week either.

8. Other

Technical issues

Video conferencing across firewalls

The problem we’ve had with the SD firewall persists, and there don’t seem to be any resources available to help us. Geir Kaaby suggested that we try out the Marratech package instead. So please download the Mac OS X client (or get it from me), and I’ll send you the URL to the meeting room as soon as I get it.

Bug fixing

19 open bugs (and 24 bugs)

Bugzilla update

When Bugzilla is being moved, it should also be updated to the newest version, and the UTF-8 bug should be resolved.


Project planning and development processes

Trond is using his project as a test case for an IT guy, Geir Tore Voktor, who is taking a course in project management. Be prepared to answer questions.

Conference report from Trond

9. Summary, task list








10. Next meeting, closing

7.11.2005 10:00

Closed at 11:06