Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Speller infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10:03.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi

Absent: Trond

Main secretary: Sjur

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Documentation tasks:

  1. Add documentation on our corpus infrastructure and our corpus work in general (“To be done by the ones making the corpora”: Børre, Tomi, Trond, Saara).
  2. Add or update the Aspell documentation (Tomi)
  3. Finish the divvun2web script (Børre)
  4. As always: document what you’re doing :-) (all)
  5. divvun.no occasionally serves a blank (white) page. Needs to be checked (Børre)
    1. The likely cause was too little memory allocated to Java (the default is 64 MB). Børre has now changed forrest.properties so that 256 MB is allocated, and restarted the server.

4. Corpus gathering

See notes from the 12.9. meeting for details about the steps forward.

Tasks:

5. Corpus infrastructure

Naming conventions and directory structure

See [notes from the 12.9. meeting|Meeting_2005-09-12.html#Naming+conventions+and+directory+structure] for details about the decision and implementation, as well as a list of tasks.

Corpus conversion

Pdf to XML

Extraction priority list

  1. retain correct Sámi characters
  2. retain word and sentence order
  3. retain paragraph order
  4. retain structure
    1. paragraphs
    2. titles, headers
    3. metadata (author, year, etc.)
    4. lists
    5. tables
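
The first priority above (retaining correct Sámi characters) can be checked automatically when a hand-corrected reference text exists for a converted document. The following is only a minimal sketch under stated assumptions, not the conversion tooling used in the project: the function names are hypothetical, the letter list covers North Sámi only, and the check compares letter counts rather than positions.

```python
from collections import Counter
import unicodedata

# Non-ASCII letters of the North Sámi alphabet (assumption: other Sámi
# languages would need their own list).
SAMI_LETTERS = "áčđŋšŧžÁČĐŊŠŦŽ"


def sami_letter_counts(text: str) -> Counter:
    """Count the Sámi letters in text after NFC normalization."""
    # NFC joins a base letter and a combining accent into one code point,
    # so "c" + combining caron is counted as "č".
    normalized = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in normalized if ch in SAMI_LETTERS)


def lost_sami_letters(reference: str, extracted: str) -> dict:
    """Return, per letter, how many occurrences are missing from extracted."""
    # Counter subtraction keeps only positive differences, i.e. letters
    # that occur more often in the reference than in the extracted text.
    missing = sami_letter_counts(reference) - sami_letter_counts(extracted)
    return dict(missing)


if __name__ == "__main__":
    reference = "Sámi giella čáppa"
    extracted = "S?mi giella c?ppa"  # hypothetical output of a failed conversion
    print(lost_sami_letters(reference, extracted))
```

An empty result means that no Sámi letter was lost by count; it says nothing about word, sentence, or paragraph order, which need separate checks.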

Problems found so far using open-source tools:

HTML to XML

6. Linguistics

Test

Name lexicon

See notes from the 12.9. meeting

North Sámi

Lule Sámi

Numerals

  1. An empirical overview
    1. Numeral generation
    2. Numeral inflection
    3. Numerals as parts of compounds
  2. A clear concept of how we want to treat them
    1. Tagging
  3. A treatment

7. Speller infrastructure

Aspell

Write documentation here as well.

The munch-list is working, and the affix file is improving. See [15.8. meeting memo|/admin/weekly/2005/Meeting_2005-08-15.html] for more.

See 12.9. meeting memo for a list of open issues.

8. Other

Technical issues

Bug fixing

Memo and meeting practice update

From now on, the memo frame for the next meeting, with task lists etc., will be made available on the same day as the finished memo from the previous meeting. This makes it easy to use the task list in that memo as a reminder and to update it as you go: as soon as a task has been started, it can be commented on and any problems described, ready to be included in the next meeting. The final status of the task can be recorded in the same way, well ahead of the meeting.

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

26.09.2005 10:00

Closed at 11:10