Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup


  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Speller infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10:07.

Present: Børre, Maaren, Saara (online), Sjur, Thomas, Tomi, Trond

Absent: none

Main secretary: Trond

Agenda accepted as is.

2. Reviewing the task list from the last meeting








3. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general (Børre, Tomi, Trond, Saara):

For the basic corpora, we need two additional types of documentation, one for each target group:

  1. For the users/linguists: Which corpora exist, and how do I use them? (This information is now scattered. Part of the how-to-use is documented in the catxml documentation. Which documents are found where, plus an overall documentation, has not been written, since the corpus is so sparsely populated.)
  2. For the collectors: How do I add texts, where do I add them, and how do I convert them? (This is (partly?) done in the Corpus Conversion document.)

Site down again

Tomcat runs out of memory from time to time. Børre will look into having Forrest generate static HTML pages (forrest site) and serving those from the standard Apache server. He will also look at using Forrestbot to update the site, instead of our homegrown script.

4. Corpus gathering

Governmental documents (earlier in pdf, now in html)




The most problematic issue:

Who has the copyright of extracted material, like single words, collections of words, syntactic structure (potentially with some words filled in)? We need this to be controlled by us, not by the authors. The exact borderline is hard to define.

5. Corpus infrastructure

Updated task list:

  1. Make a system for file and directory permissions (today we all belong to the cvs group) that gives write access to the corpus repository, at least for the original files, only to people with root user privileges
  2. Include the xsl files under version control (cvs? rcs?)
  3. Incorporate language detection as part of the corpus processing.
  4. We need a way to deal with hyphenated documents in catxml/preprocess:
    1. in normal cases hyphenation points should be removed
    2. when testing the robustness of our parsers, as well as when testing the hyphenator, the hyphenation points should be retained
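The two hyphenation modes in item 4 could be sketched roughly as below. This is only an illustration in Python, not the catxml/preprocess code: the function name `handle_hyphenation`, the `keep` switch, and the assumption that hyphenation points are encoded as soft hyphens (U+00AD) are all hypothetical.

```python
# Assumption: hyphenation points are stored as soft hyphens (U+00AD).
# The real corpus files may mark them differently.
SOFT_HYPHEN = "\u00ad"


def handle_hyphenation(text: str, keep: bool = False,
                       marker: str = SOFT_HYPHEN) -> str:
    """Remove hyphenation points, or retain them when keep=True.

    keep=False is the normal case (clean running text).
    keep=True is meant for parser robustness tests and hyphenator tests.
    Ordinary hyphens ("-") are never touched.
    """
    if keep:
        return text
    return text.replace(marker, "")


if __name__ == "__main__":
    sample = "giella\u00adtek\u00adno"
    print(handle_hyphenation(sample))              # hyphenation points removed
    print(repr(handle_hyphenation(sample, True)))  # hyphenation points retained
```

A single switch keeps both use cases on the same code path, so the test runs see exactly the text the normal runs see, apart from the retained hyphenation points.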

Catxml performance testing

Testing was done with the xml files we now have in the corpus base: seven (7) documents (9495 words). Results are below. The difference is about x22 in real time with output to the terminal (about x93 with output discarded), and about x100 in user (processor) time:

Perl catxml:

sme$time catxml --all --lang=no -i=.
real    0m18.509s
user    0m17.753s
sys     0m0.426s

sme$time catxml --all --lang=no -i=. > /dev/null
real    0m17.033s
user    0m16.934s
sys     0m0.095s

C++ catxml:

sme$time /home/tomi/tagparser2/catxml -r *
real    0m0.827s
user    0m0.176s
sys     0m0.083s

sme$time /home/tomi/tagparser2/catxml -r * > /dev/null
real    0m0.183s
user    0m0.174s
sys     0m0.009s
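
The ratios can be recomputed from the figures above with a few lines of Python. This is just a convenience sketch: the helper names `to_seconds` and `speedup` and the dictionary layout are made up for the illustration, and only the timing values come from the runs above.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time` stamp such as '0m18.509s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Timings copied from the four runs above.
timings = {
    "terminal": {
        "perl": {"real": "0m18.509s", "user": "0m17.753s"},
        "cpp": {"real": "0m0.827s", "user": "0m0.176s"},
    },
    "devnull": {
        "perl": {"real": "0m17.033s", "user": "0m16.934s"},
        "cpp": {"real": "0m0.183s", "user": "0m0.174s"},
    },
}


def speedup(run: dict, field: str) -> float:
    """How many times faster the C++ run is than the Perl run."""
    return to_seconds(run["perl"][field]) / to_seconds(run["cpp"][field])


if __name__ == "__main__":
    for name, run in timings.items():
        for field in ("real", "user"):
            print(f"{name:8s} {field}: x{speedup(run, field):.1f}")
```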

Decision: C++ it is!

6. Linguistics

Name lexicon

Summary: see the newsgroup

The plan for this project was as follows: Two lines of work run in parallel:

  1. name markup
  2. testing of conversion

When these two tasks are done (at some point in the future), the conversion will be done.

Status quo on the two lines of work:

The markup of the remaining 3900 entries continues until conversion starts (people allocated to look at the rest: Maaren, Ilona, Trond, Børre). This week’s status is as follows (some 3900 names not assigned):

1275 BERN
 314 NYSTØ
  45 ACCRA
  19 MARJA
   3 ANAR

The technical issues (specified in earlier memos) are handled by Tomi, Saara and Sjur. Sjur and Tomi will report back tomorrow (Tuesday) on a plan for the editor to be used for our name lexicon.

North Sámi

Lule Sámi

Sjur, Thomas and Trond will continue with the Lule Sámi issues.



Numerals

The issue is postponed to next week.

  1. An empirical overview
    1. Numeral generation
    2. Numeral inflection
    3. Numerals as parts of compounds
  2. A clear concept of how we want to treat them
    1. Tagging
  3. A treatment
8       8+Num
8       8+Num+Acc
8       8+Num+Gen
8       8+Num+Nom
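
For the empirical overview, analyses like the four above could be split into their parts as sketched below. This is a hypothetical helper, not part of our tools: the names `Analysis` and `parse_analysis` are invented, and the sketch assumes each line holds a surface form and one analysis separated by whitespace, with tags joined by `+`.

```python
from typing import NamedTuple, Tuple


class Analysis(NamedTuple):
    surface: str
    lemma: str
    tags: Tuple[str, ...]


def parse_analysis(line: str) -> Analysis:
    """Split a line such as '8       8+Num+Acc' into surface, lemma, tags."""
    surface, analysis = line.split()
    lemma, *tags = analysis.split("+")
    return Analysis(surface, lemma, tuple(tags))


if __name__ == "__main__":
    lines = [
        "8       8+Num",
        "8       8+Num+Acc",
        "8       8+Num+Gen",
        "8       8+Num+Nom",
    ]
    readings = [parse_analysis(line) for line in lines]
    # Readings that carry only the Num tag, with no further tag.
    bare = [r for r in readings if r.tags == ("Num",)]
    print(len(readings), "readings,", len(bare), "with Num only")
```

Listing the readings this way makes it visible that the same digit gets both a bare numeral reading and several case readings, which is the kind of ambiguity the tagging concept has to settle.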

We will return to this issue after the name conversion.

7. Speller infrastructure

Nothing this week either.

8. Other

Technical issues

Video conferencing across firewalls

The problem we’ve had with the SD firewall persists, and there don’t seem to be any resources available to help us. Geir Kaaby instead suggested we look at the Marratech package, and try it out. So please download the MacOS X client (or get it from me), and I’ll send you the URL to the meeting room as soon as I get it.

Bug fixing

24 open bugs (and 23 bugs)

Bugzilla update

When Bugzilla is being moved, it should also be updated to the newest version, and the UTF-8 bug should be resolved.

Binary files for downloading

As we move forward in the Divvun project, we need a place to store downloadable binaries, such as installation packages, speller updates, etc. This is also true for tools and configurations we make for our own internal development.

As I wrote in my task status in the previous meeting, I have fixed bugs and updated the XXE config, and I would really like to see people start using it (it makes setting up and maintaining XXE configs much easier than before, and separates our config from the XXE installation). The update is available from the Forrest svn repository, but that is hardly a user-friendly place to get it.

Thus, I need a place to store a zip file of the updated config, to make it downloadable directly from our browsers. And as I mentioned initially, the framework for this download area should be the same as for the public proofing tools later on.

Børre: have a look.

Meeting memo improvements?

Look at this, especially the section “Simple Codes & the 30-Second Deliverable”.

It gave me the idea that we could skip the task lists at the end altogether, and instead specify tasks at each point in the meeting, using codes similar to what is suggested at the link above, and use automated tools to extract, collect and generate task lists. As it is now, we are really specifying tasks twice, and not in a consistent way.

But such automatic postprocessing does require some coding. Can we afford it, or are we satisfied with the present system?
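
To give a rough idea of the amount of coding involved, here is a minimal sketch in Python. The marker syntax `TODO(Name): task text` is a hypothetical stand-in for whatever codes we would agree on, and the function name `collect_tasks` is likewise made up for the illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical marker: "TODO(Name1, Name2): task text" on a line of its own.
TASK_RE = re.compile(r"^\s*TODO\(([^)]+)\):\s*(.+)$")


def collect_tasks(memo: str) -> Dict[str, List[str]]:
    """Collect the tasks written inline in a memo, grouped by person."""
    tasks: Dict[str, List[str]] = defaultdict(list)
    for line in memo.splitlines():
        match = TASK_RE.match(line)
        if match is None:
            continue
        text = match.group(2).strip()
        for person in match.group(1).split(","):
            tasks[person.strip()].append(text)
    return dict(tasks)


if __name__ == "__main__":
    memo = """Binary files for downloading
TODO(Børre): have a look at a download area
Name lexicon
TODO(Sjur, Tomi): report back on a plan for the editor
"""
    for person, items in collect_tasks(memo).items():
        print(person)
        for item in items:
            print("  -", item)
```

With markers like these, each task is written once, at the point in the meeting where it comes up, and the per-person lists are generated from the memo.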

Decision: nothing now, we can go back later if updating the task lists gets too complicated or time-consuming.

Reimbursement of expenses

Expenses should be filed before mid-November. Whatever needs to be bought should be bought before the end of November.

9. Summary, task list








10. Next meeting, closing

14.11.2005 10:00

Closed at 11:37