Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup


  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Name lexicon infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 09:57.

Present: Børre, Saara, Sjur, Trond

Absent: Maaren, Thomas, Tomi

Main secretary: Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting






On sick-leave.



3. Documentation


The XSLT processing part of the corpus infrastructure review is finished. The code is ready to use, but language identification still needs improvement.

4. Corpus gathering


See a previous meeting memo for what’s to be done.

TODO: Send out the rest of the letters (Børre)

Since last meeting:



Waiting for Sæth to discuss with colleagues how to implement the cooperation and to get back to us.


Bible texts


5. Corpus infrastructure

We need more “version control” in the corpus work: we don’t know which version of the XSL script was used, only roughly whether it was used at all. We need to document within the XSL file which versions of all tools were used, including the version of the XSL file (common section/template) itself.
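One way to keep such a record is to stamp each converted file with the versions of the tools that produced it. The following is only a minimal Python sketch under stated assumptions: the element names `header` and `version`, the attribute names `tool` and `rev`, and the version strings are all hypothetical and do not describe the project's actual file format.

```python
import xml.etree.ElementTree as ET


def stamp_versions(doc_root, tool_versions):
    """Add one <version> element per tool to the document header.

    The <header>/<version> layout is a hypothetical illustration,
    not the format used in the corpus repository.
    """
    header = doc_root.find("header")
    if header is None:
        header = ET.SubElement(doc_root, "header")
    # Sort by tool name so the stamp is reproducible between runs.
    for name, version in sorted(tool_versions.items()):
        ET.SubElement(header, "version", {"tool": name, "rev": version})
    return doc_root


# Hypothetical tool names and revision numbers, for illustration only.
root = ET.Element("document")
stamp_versions(root, {"common.xsl": "1.14", "xsltproc": "1.1.15"})
stamped = ET.tostring(root, encoding="unicode")
```

With a stamp like this in every file, it becomes possible to tell afterwards which files were converted with an outdated script and need reprocessing.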

Transferring the old gt/sme/corp files to the new corpus repo:

Task list:

  1. Include the xsl files under version control
    1. RCS version control is almost finished, but an issue with access control is still open.
      1. Access control resolved through Unix groups: one group for corpus maintainers with write access, and another for corpus users, with read-only access.
  2. Improve Finnish language detection as part of the corpus processing
    1. Move to Bugzilla (Saara)
  3. Review automatic hyphen tagging:
    1. Acceptable results: 90% of all real hyphens correctly tagged.
      1. Move to Bugzilla (Saara)
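The 90% criterion above amounts to a recall measure: of all real hyphens, how many did the automatic step tag? A minimal Python sketch of that check follows. The function name and the position sets are made up for illustration, and the actual review may measure this differently.

```python
def hyphen_recall(gold, predicted):
    """Share of real (gold) hyphens that were also tagged automatically."""
    gold = set(gold)
    if not gold:
        # No real hyphens in the sample: nothing could have been missed.
        return 1.0
    return len(gold & set(predicted)) / len(gold)


# Hypothetical character offsets of hyphens in a manually checked sample.
gold_positions = {3, 17, 42, 58, 71, 90, 104, 133, 150, 162}
# Offsets tagged by the automatic step: one real hyphen missed (162),
# one false hit (200) that does not affect recall.
auto_positions = {3, 17, 42, 58, 71, 90, 104, 133, 150, 200}

recall = hyphen_recall(gold_positions, auto_positions)
acceptable = recall >= 0.90
```

Note that recall ignores false hits. If wrongly tagged hyphens also matter, precision would have to be tracked separately.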

Further discussion about corpus analysis and computer use:

6. Linguistics

Anything? Nothing.

7. Name lexicon infrastructure

Complex names


Move these issues to Bugzilla (Børre)

Preprocessor optimization

To optimize, one could build a targeted transducer containing only the lexicons relevant for preprocessing. At present, however, that leads to the following lexicons being reported as referenced but undefined:

Perhaps picking the

hum-tf4-ans142:~/gt/sme/src trond$ grep '% ' adv-sme-lex.txt
earret% eará adv ;
dan% dihte adv ;


  1. make a lexc Root lexicon (first 40 lines of sme-lex.txt)
  2. extract the relevant parts of the relevant lexica from the main transducer
  3. build the targeted transducer from the union of 1 and 2.
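Before compiling such a reduced lexicon set, it may help to list the continuation lexicons that are referenced but never defined. The following Python sketch is a rough check, not part of the existing build. It assumes simple lexc entries that end in `<continuation> ;` and it ignores multi-line entries and regular-expression entries. The lexicon names `Root`, `Adverb` and `Noun` in the sample are hypothetical; the two adverb entries are the ones shown above.

```python
import re


def lexicon_report(lexc_text):
    """Return (defined, referenced) lexicon names found in lexc source."""
    defined, referenced = set(), set()
    for raw in lexc_text.splitlines():
        # An unescaped "!" starts a comment; "%!" is a literal "!".
        line = re.sub(r"(?<!%)!.*", "", raw)
        # Drop quoted info strings so they are not taken for lexicon names.
        line = re.sub(r'"[^"]*"', "", line).strip()
        if not line:
            continue
        header = re.match(r"LEXICON\s+(\S+)", line)
        if header:
            defined.add(header.group(1))
            continue
        # The last token before the final ";" is the continuation lexicon.
        entry = re.search(r"(\S+)\s*;\s*$", line)
        if entry:
            referenced.add(entry.group(1))
    return defined, referenced


def undefined_lexicons(lexc_text):
    defined, referenced = lexicon_report(lexc_text)
    # "#" is the built-in end-of-word continuation and needs no definition.
    return sorted(referenced - defined - {"#"})


sample = """\
LEXICON Root
Adverb ;
Noun ;          ! Noun is deliberately left out of this subset

LEXICON Adverb
earret% eará adv ;
dan% dihte adv ;

LEXICON adv
# ;
"""

missing = undefined_lexicons(sample)
```

Here `missing` contains `Noun`, the one lexicon that the sample references without defining. Run on the concatenated subset of lexicon files, such a list shows which lexicons still have to be extracted in step 2.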

Discussion will continue on the newsgroup.

XML format


  1. testing of conversion
  2. eXist as editor:
    1. develop the needed XQueries and interface
    2. data synchronisation between and
    3. test whether eXist as editor is actually working well

8. Other

SGL Seminar

SGL has now been elected, with the following members:

SGL/normativity seminar:

Infra for new projects and ideas:

Bug fixing

30 open bugs (and 24 bugs)

9. Summary, task list








10. Next meeting, closing

20.02.2006 09:30

Closed at 11:33