Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from a week ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Term db
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10.10. Agenda accepted as is.

Present: Maaren, Sjur, Thomas, Tomi, Trond, Børre

Main secretary: Trond (Sjur)

2. Reviewing the task list from the last meeting

3. Documentation - divvun.no

The humfak server will be set up for divvun.no this week, so the pentalingual welcome pages should be ready by next week.

Todo: Translate the current welcome page into five languages, and add some more text as well.

We will add more information on the plans and the progress of the project. The setup includes synchronisation with CVS.

Official opening by Språkstyret in mid-June. Todo: Ask Anne Britt to put it on the agenda (21.-22.6. in Tysfjord).

North Sámi will be the default language. There should be a note explaining that English is the internal language for documentation, so that people won’t complain about the technical documentation not existing in other languages.

The Giellatekno pages will be changed from parallel bilingual text on the same page into two parallel sets of pages.

4. Corpus gathering

We have still received no answer from Oslo and Helsinki. We need the documents soon, so Sjur and Trond will talk to the relevant people.

Northern Sámi texts: Further issues for corpus gathering: we should talk with the Bible translators and the writers' association. We have written letters to the Departments in Oslo, the Sámi parliament, the municipalities, etc. Maja Hætta at the Guovdageaidnu suohkan is going to collect text for us. Sámi Allaskuvla has not responded. Sámi Instituhta are awaiting a license suggestion for their taped texts, but we will get the newspaper texts right away.

Lule Sámi texts: We should talk to the Tysfjord branch of the Parliament about gathering Lule Sámi text. Some text has been translated into Lule Sámi within the Sámi parliament system; the challenge is just to find it. There should be Lule Sámi text in Sweden as well, especially the NT; for this, Susanna Kuoljok-Angeus should be contacted.

We need a web crawler for Sámi text. TODO: Contact Knut Hofland in Bergen on this issue. The Skolelinux project has set up a web crawler for bokmål and nynorsk; we could probably use the same one. The two crawlers should be compared.

5. Corpus infrastructure

We have arrived, or rather are arriving, at the following:

sme/
   /orig/<donor>/<year>/file.doc         <- dump, **must** be write protected
   /int/as-in-orig/file.db.xml           <- docbook format
   /int/as-in-orig/file.xsl              <- file-specific scripts, under version control
   /gt/publisher OR author/year/file.xml <- xmlpreprocess
    ... | lookup ...

We should put some effort into the donor directory. Donor could be ordered according to person, or according to institution.
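
The layout above could be expressed as a small path-mapping helper. This is only a sketch under stated assumptions: it reads "as-in-orig" as "the same <donor>/<year> subdirectories as under orig/", and the function name derived_paths and the example donor name are hypothetical, not part of the agreed setup.

```python
from pathlib import PurePosixPath


def derived_paths(orig_path: str) -> dict:
    """Map <lang>/orig/<donor>/<year>/<file> to its intermediate files.

    Assumption: int/ mirrors the <donor>/<year> structure of orig/.
    """
    parts = PurePosixPath(orig_path).parts
    if len(parts) != 5 or parts[1] != "orig":
        raise ValueError(f"not an orig/ path: {orig_path}")
    lang, _, donor, year, filename = parts
    stem = PurePosixPath(filename).stem  # "report.doc" -> "report"
    int_dir = PurePosixPath(lang, "int", donor, year)
    return {
        # DocBook version of the original document
        "docbook": str(int_dir / f"{stem}.db.xml"),
        # file-specific conversion script, kept under version control
        "xsl": str(int_dir / f"{stem}.xsl"),
    }
```

For a hypothetical original sme/orig/donor_a/2004/report.doc, this gives sme/int/donor_a/2004/report.db.xml and sme/int/donor_a/2004/report.xsl. The gt/ path is left out on purpose, since it is ordered by publisher or author rather than by donor and cannot be derived from the orig/ path alone.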

We need:

  1. The texts
  2. A script for going from file.xml to .. -> preprocess, eventually modifying preprocess as well (while still keeping preprocess able to take raw text as input; alternatively, we may make an xmlpreprocess alongside the (txt)preprocess).
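
One possible shape for the script from file.xml to raw text is sketched below, so that preprocess can keep taking plain text as input. The XML format of the gt/ files is not settled in these notes, so the element name "para" (borrowed from DocBook) and the function name xml_to_text are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string: str, para_tag: str = "para") -> str:
    """Return the text of every <para_tag> element, one paragraph per line.

    Assumption: running text lives in <para> elements, as in DocBook.
    Inline markup inside a paragraph is flattened into its text.
    """
    root = ET.fromstring(xml_string)
    paragraphs = []
    for elem in root.iter(para_tag):
        # Join all nested text and normalise runs of whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

The output could then be piped into the existing preprocess and on to lookup. Titles and other non-paragraph elements are skipped in this sketch; whether they belong in the corpus text is a separate decision.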

6. Linguistics

We have gone through 5200+1500 verbs, more than 13 000 all in all.

Maaren is trying to work through the missing list (again).

Issues:

A priority policy is to first look at the closed classes. For the other classes, we have so far done the inflectional morphology of the n-v-a cluster. Derivational morphology and compounding have not been systematically checked.

Linguistic priority list:

  1. finish the verb transitivity
  2. closed POS
  3. compounds
  4. derivation
  5. completing the lexicon
  6. names

7. Term db

Issues left:

New deadline for internal opening: May 13th
New server: June 1st
Official opening: June 17th

8. Other issues

University project:

We will need 1-2 new linguists, for at least one and a half years.

Friday May 6:

9. Summary, task list

TODO:

10. Next meeting, closing

09.05.2005 10.00

Closed at 12.23.