Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup


Cf. one of the following, depending on context:

Opening, agenda review, participants

Updated task status since last meeting









Ciprian prepared demo data for sma-oahpa. It is in a format similar to the dictionary data, so that we can reuse infrastructure and share data between the dictionaries and oahpa. It needs to be refined. The demo data was in a parallel folder, but Lene moved it to a new folder that we had not agreed on.

sje-oahpa: We (i.e. a group at Samisk Senter) have got 6 million NOK for sje work. One aspect of it is sje-oahpa (the Giellatekno part of it). The kick-off date is not yet known, but we will invite the sje people to the sma-oahpa kick-off.


Corpus gathering

National Library: there is a meeting on September 23, which Trond and Sjur will attend. Trond will request a couple of OCR-ed Sámi texts (in addition to the nob texts we already have), and use them as test cases for spell-checking and (automatically) correcting OCR-ed texts.

This work should be incorporated in the test bench project, and serve as a basis for developing better nob (and potentially nno) morphologies, to be used in our different projects. The main project, developing an OCR-adapted Norwegian spell-checker (as open source) should be financed by the Language bank.

There are starting points both in our own svn and in Apertium, both based on the lexicons from UiO/Norsk ordbank.

Our main goal with this would be to help the Language bank and the National Library provide us with usable Sámi (and possibly Norwegian) corpus material, but there are a number of secondary goals and positive side effects as well.

From Trondheim:


Promoting Divvun


Future plans, directions and ideas

See a separate document in plan/strat/5year.jspwiki.

Northern areas project

First major obstacle: make working keyboards and fonts. In principle, there are two approaches to keyboards:

  1. Make the Russian-based keyboards work
  2. Design new Kildin-optimized keyboards

The first goal has priority. We should have text analysis in order to consider whether the second goal is feasible. Trond to follow up test machines.

Write trustworthy and detailed documentation (in Russian)

What we know:



Out of the box experiences:

Updated corpus online

See Ciprian's document about the corpus content in $GTPRIV/plan/corpus/oslo_corpus_update_todo.txt.


facta$ --nolog --corpdir=/usr/local/share/corp L1allOrt.correct.txt

Error message:

sh: /home/sjur/gtmain/gt/script/text_cat: No such file or directory
L1allOrt.correct.txt: ERROR errors in /home/sjur/gtmain/gt/script/text_cat -q \
   -x -d /home/sjur/gtmain/gt/script/LM "/usr/local/share/corp/tmp/L1allOrt.correct.txt.tmp0":

text_cat isn't part of our repository, so we need to add it. We are using a modified version of the original file. It should be added with a short README and some license info, pointing to the original.


Corpus infra remake




Corpus interface

This depends on the infrastructure cleanup.


Makefile + tag simplification

Done. One smallish task left: convert the M4 twolc processing to transducer manipulation.


HFST compilation

Our HFST transducers still contain all the extra tags resulting from the tag conversion (cf. above). The Apertium versions do not, however. The main problem here is the forking of smi transducer building and maintenance: one branch in Giellatekno, and another in Apertium. This is not good for future maintenance, and should be corrected ASAP. The tag-removing make steps should be incorporated in the Giellatekno svn.

Discussion forum server/software

We need a discussion forum (on our server) for several purposes: to follow up the work done in Trondheim, to provide user support, to give Sámi language workers using our LT tools a place for discussion, and to let the SGM/SGL (language council) publish terminology lists in preparation.

We also need a platform for documentation. If the answer is a wiki, then it can be both documentation and discussion forum. If the answer is a Google group, then the issue is different.


General list

To accommodate future enhancements in different directions (in rough order of importance):

  1. test bench for all parts of our language technology efforts
    1. test bench enhanced, but not yet complete
  2. improve Forrest i18n support with static sites
  3. reorganise the documentation:
    1. differ between target groups
    2. get better grouping
    3. decide what to write in Forrest and what in wiki (cf. Apertium and [] for a similar split)
    4. update/add missing parts
  4. migrate lexc lexicons to XML, splitting the task
    1. Name lexica (the Name project)
    2. Dictionaries (already in XML, task is to integrate them)
    3. At least migrate the lexc open POSes (Komi as a pilot case)
  5. change the look of the documentation web
  6. corpus content moved to Max Planck repositories? Norsk språkbank?
  7. update infrastructure to allow content-restricted spellers for special target groups



North Sámi

There are a lot of hyphens found in noun paradigms. They should not be there. Trond and Thomas will look into them.


Lule Sámi


South Sámi


Name lexicon/ infrastructure



Ciprian has made a new online/offline dictionary interface based on Apertium tools. See [].

StarDict is useless for Cyrillic languages, mainly because of the scanning function (i.e. point and look-up). We need to find an alternative to StarDict. Also, the Kildin Sámi users have different needs from those we have had in mind, and they don't have access to a Kildin Sámi keyboard in Windows.


Other things dictionary-related:


Proofing tools

Spelling feedback from Malta:

South Sámi

Beta release: done.


HFST- and Voikko-based proofing tools


Speller bugs

List of bugs returned from Polderland:

Tag reordering for abbreviations has caused a lot of problems:

hr.	hr+ABBR+Acc
cand.philol.	cand.philol+ABBR+N+Acc
Per	Per+N+Prop+Mal+Sg+Attr

hr.	hr+N+ABBR+Acc
Per	Per+N+Prop+Mal+Sg+Attr
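To spot changes like these systematically, the two listings can be compared form by form. The following is a minimal sketch, assuming tab-separated lookup output with a single analysis per form; the function names `parse_analyses` and `diff_tags` are hypothetical helpers, not part of our tools.

```python
def parse_analyses(text):
    """Map each surface form to (lemma, tags) from tab-separated lines."""
    result = {}
    for line in text.strip().splitlines():
        form, analysis = line.split("\t")
        lemma, *tags = analysis.split("+")
        result[form] = (lemma, tags)
    return result


def diff_tags(first, second):
    """Return the forms present in both listings whose tag sequences differ."""
    changed = {}
    for form in first.keys() & second.keys():
        if first[form][1] != second[form][1]:
            changed[form] = (first[form][1], second[form][1])
    return changed


# The two listings from the minutes, copied as tab-separated text.
FIRST = (
    "hr.\thr+ABBR+Acc\n"
    "cand.philol.\tcand.philol+ABBR+N+Acc\n"
    "Per\tPer+N+Prop+Mal+Sg+Attr"
)
SECOND = (
    "hr.\thr+N+ABBR+Acc\n"
    "Per\tPer+N+Prop+Mal+Sg+Attr"
)

changed = diff_tags(parse_analyses(FIRST), parse_analyses(SECOND))
for form, (old, new) in changed.items():
    print(form, "+".join(old), "->", "+".join(new))
```

Forms that occur in only one listing (here cand.philol.) are skipped by this sketch; they would need a separate check.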

Open issues based on test results:


Version: Davvisámi, version 1.2, 2009-09-18


Version: Julevsáme, version 1.2, 2009-09-20


Hyphenator bugs

Open issues based on test results:


Lexicon version: Davvisámi, version 1.2, 2009-09-18

No known issues!


Lexicon version: Julevsáme, version 1.2, 2009-09-20


Command to test the hyphenator:

preprocess dev/corp/pressemelding.txt | lookup bin/hyph-sma.fst | cut -f2 | \
lookup bin/hyphrules-sma.fst | grep -v '^$' | cut -f2 | uniq | see


Installer changes


User documentation


1.2 release


2.0 release


Nordplus Språk proofing test bench project

Arno Teigseth is hired for the summer. He is based in Ecuador, 7 hours behind Norway. He is working on a Quechua speller (using hunspell) and dictionary.

A goal is to use texts from parallel domains, in order to be able to compare nob and sme. We start out with 100000 words, 5000 for each genre:

  1. Discussion fora on the net
  2. Blogs

Discussion fora can be problematic, as they contain a lot of very oral language with deliberate oral or dialectal forms used in the text. As such, the texts won’t be typical for the kind of texts people would like to spell-check, and thus the genre is problematic for the purpose at hand.

Possible other genres:

  1. Minutes from local associations
  2. Pupil texts - look for the home pages of the schools
  3. exam texts by 10th grade pupils, where the pupils have published the texts themselves
  4. skrivebua - Nordland fylkeskommune: all three Sámi languages
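The per-genre word budget mentioned above could be applied when collecting texts roughly as follows. This is only a sketch under stated assumptions: words are counted by whitespace splitting, texts are taken in the order given, and the name `fill_quota` and the toy input are hypothetical.

```python
def fill_quota(texts, quota=5000):
    """Pick texts in order until the running word count reaches the quota."""
    picked, total = [], 0
    for text in texts:
        if total >= quota:
            break
        picked.append(text)
        total += len(text.split())  # naive whitespace word count
    return picked, total


# Hypothetical toy input: genre name -> list of raw texts.
by_genre = {
    "blogs": ["one two three", "four five", "six"],
    "fora": ["a b", "c d e f"],
}

# A tiny quota of 4 words is used here only to keep the example readable.
sample = {genre: fill_quota(texts, quota=4) for genre, texts in by_genre.items()}
for genre, (picked, total) in sample.items():
    print(genre, len(picked), "texts,", total, "words")
```

Because whole texts are kept, the total for a genre can overshoot the quota by up to one text; whether texts should instead be truncated is a design choice left open here.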

Text to speech

There will be regular meetings in this project from now on, every second week.


Machine Translation


Kevin and Ryan have now finished their work on smenob and finsme. They will continue, but will mainly do other things. As for smesmj, the possible student input is rescheduled for October.

We hope to have a student working one month on smesmj this summer. The world now knows (vaguely) that we work on MT, so we should actually do so. We also need to work together with the external contributors.

All languages



Technical problems with compounds in Finnish: the parts of a compound are separated by a space character (U+0020), which creates problems in the transfer.



Needed: lexicon completion and grammatical transfer rules.


New worker underway.


Infrastructure in place, to demo in Trondheim to encourage more work and projects.


Ciprian has made a translation memory based on our parallel corpus. The test looks promising, and he has identified certain issues. We need to be clear about the license of the original parallel texts, and we need to know the original language (i.e. the translation direction).


SMA seminar in August/September

Done - a great success! Now we need to follow up.



Start to make yearly reports


Thursday inhouse seminar

Next time suggestion list:

  1. introduction to xslt - Ciprian to start out
    1. relevant xslt issues:
    2. basic principles of xslt …
    3. sorting in xslt … (have a look at the dictionary sort xslt script)
    4. converting from one xml format to another with xslt (sugg: convert from DivvunGT dictionary dtd to MacDict xml)

Future seminars:

  1. XQuery
  2. More XML (needs concretisation)
  3. UML
  4. other suggestions?

Fall planning



Next meeting, closing

The next meeting is 13.9.2010, 09:30 Norwegian time.

The meeting was closed at 12:11.

Appendix - task lists for the next week