Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Summing up last week (Meeting + conference)
  4. Documentation - divvun.no
  5. Corpus gathering
  6. Corpus infrastructure
  7. Linguistics
  8. Speller infrastructure
  9. Other issues
  10. Summary, task lists
  11. Closing

1. Opening, agenda review, participants

Opened at 10:12.

Present: Børre, Saara (half an hour), Sjur, Thomas, Tomi, Trond

Absent: Maaren

Main secretary: Børre

Agenda accepted with new point 3.

2. Reviewing the task list from the last meeting

Børre

Maaren

Sjur

Thomas

Tomi

Trond

3. Summing up last week (Meeting + conference)

4. Documentation

The giellatekno.uit.no site is deployed. Børre has read and fixed some of the technical documents.

The standing invitation remains: Document what you do, when you do it.

Especially document Aspell: the Makefile, how to build, what changes have been made to the lexicon, and so on.

5. Corpus gathering

Børre, Trond and Sjur had their meeting, and the Helsinki contract is quite good as it is. There are some points still to be discussed. We’ll continue this week.

Paths forward: We have a contract suggestion. Sjur and Børre should start negotiating with the authors and/or publishers to obtain texts. We should make a priority list of authors, and we should invite ourselves to their meeting to discuss this. We should carry on the discussions both with Iđut and with Davvi Girji.

How to proceed:

  1. Get the contract suggestion ready
    1. Part 1 is translated and OK; parts 2 and 3 are missing (to be done this week)
    2. Get part 4 from Kimmo, and translate it
    3. Contact our lawyers, at SD and UiT (today, tomorrow).
    4. When the Norwegian versions of the contracts are ready, make sure they correspond to the Finnish ones in their legal interpretation (as far as possible); then publish both versions for others to reuse, with background documentation (in cooperation with Kimmo Koskenniemi)
  2. Approach the text owners (see ordered list below)

Independent of the contract work

  1. Bible: The new testament (Trond)
  2. Bureaucratic text:
    1. Sámi Parliament (Børre)
    2. Sámi Oahpahusráđđi (Børre)
    3. KRD (Børre, check whether we miss texts (discuss with Trond))
    4. the Sámi municipalities (Børre)
  3. Textbooks
    1. To the extent that text can be obtained directly from SO.

After the contracts are ready

Sjur and Børre should probably take a Tour-de-Sápmi and meet the most important persons and institutions: Børre as the person responsible for collecting texts, Sjur as the person responsible for the project and as representative of SD.

The tour should be planned, not in this meeting, but before the contracts are ready, i.e., it should be planned next week.

  1. Commercially published texts
    1. Author organisations’ meetings
    2. Key authors one by one
      1. (list of author names) Kerttu Vuolab, Kirsi Paltto,
    3. Iđut and key authors there (Børre)
    4. Davvi Girji and key authors there
  2. Newspaper text:
    1. Sámi Instituhtta (for the old archive of Min Áigi and Áššu)
    2. Áššu has been making a CD since the end of May, so there should be a pile there. Børre suggests that they send us the CDs they have, so that we can look at them, ensure that the routines work, and check that we are able to use their format.
    3. Min Áigi

List of texts with lower priority (to be gathered when the above list is more or less fixed)

6. Corpus infrastructure

Do documentation.

Naming conventions and directory structure

We have a decision from Helsinki:

7. Linguistics

North Sámi

Lule Sámi

We do not know when we will get the lexicon from Anders Kintel. We need a meeting with him and with Árran in order to coordinate the work.

Status quo on our parser:

8. Speller infrastructure

Aspell

Write documentation here as well.

Munch-list is working, and the affix file is improving. See the previous meeting memo (/admin/weekly/2005/Meeting_2005-08-15.html).

Issues:

  1. The phonetic file should be systematically looked into:
    1. Check that it works
    2. Add more correspondences on an impressionistic basis
  2. Start work on collecting systematic spelling errors:
    1. Our in-house file typos.txt
    2. The soon-to-arrive error texts from newspapers
  3. The holes in the affix list should be mended
  4. We should, at some point, evaluate whether this is The Correct Approach to aspell-type speller building
  5. The affix file UTF-8 problem should be checked and reported.
  6. Then there is the UTF-8 root (or whatever) problem: the work-around using Latin 4 is no solution because of the following Latin 4 bug: it treats đ as ð and thus gives an error on “ođđa” (it wants “oðða”), since đ and ð are unified in Latin 4. Latin 6 (= ISO/IEC 8859-10) or ISO-IR 197 (the old Dihtorlávdegoddi code page) could be a fix for this.
  7. The clitics issue: Today we have a manually created affix file in order to meet the muncher. Strategy: do the munching without the clitics, but then enrich the manually created affix file via a genfisuffix program in order to get an automatically created suffix + clitic file to add the compiled lexicon to. We will have genfisuff take affix + clitic and turn it into affix’. Then we use affix to munch but affix’ to spellcheck.
  8. We must create subcomponents under the Speller
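
A note on the Latin 4 issue in the list above: the following is a minimal sketch, not a diagnosis of Aspell itself. It uses Python's built-in codecs (an assumption; Aspell has its own character tables) to show where đ and ð sit in Latin 4, Latin 1 and Latin 6. The helper names `byte_for`, `survives_round_trip` and `misread_as` are hypothetical and exist only for this illustration. In these codec tables, byte 0xF0 is đ in Latin 4 and ð in Latin 1, so text stored as Latin 4 but interpreted with Latin 1 tables would show exactly the “ođđa”/“oðða” symptom. Whether that is what happens inside Aspell has not been verified. ISO-IR 197 is not covered, since Python ships no codec for it.

```python
# Illustration only: Python's standard codecs, not Aspell's character data.
D_STROKE = "\u0111"  # đ, LATIN SMALL LETTER D WITH STROKE
ETH = "\u00f0"       # ð, LATIN SMALL LETTER ETH


def byte_for(char, encoding):
    """Return the encoded form of `char` in `encoding`, or None if it is absent."""
    try:
        return char.encode(encoding)
    except UnicodeEncodeError:
        return None


def survives_round_trip(word, encoding):
    """True if `word` can be encoded in `encoding` and decoded back unchanged."""
    try:
        return word.encode(encoding).decode(encoding) == word
    except UnicodeEncodeError:
        return False


def misread_as(word, stored_as, read_as):
    """Encode `word` in one code page and decode the bytes with another."""
    return word.encode(stored_as).decode(read_as)


if __name__ == "__main__":
    for enc in ("iso8859_4", "iso8859_1", "iso8859_10"):
        print(enc, byte_for(D_STROKE, enc), byte_for(ETH, enc),
              survives_round_trip("ođđa", enc))
    # Latin 4 bytes interpreted with the Latin 1 table
    print(misread_as("ođđa", "iso8859_4", "iso8859_1"))
```

In Latin 6 the two letters have separate code points, which fits the suggestion that Latin 6 could be a way out, provided the speller's own tables for that code page are correct.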

MS Office spellers

Nothing new, see previous meeting memo.

OpenOffice.org

From last meeting:

The conversion from aspell to myspell will work trivially as soon as the myspell list becomes smaller.

Issue left open.

Hunspell

Hunspell already works with OOo and is, linguistically speaking, a much better speller engine: it handles compounds much better than Aspell, as well as complex inflection and derivation. For pointers, see the previous meeting memo.

Issue left open.

Other engines

Børre and Sjur had a long discussion with the author of the SFST library/tool set. Next on his priority list is a feature for handling spelling errors in running-text analysis. This is in principle the same thing we want, so there should be a good opportunity to make SFST into the spelling engine. Sjur has a suggestion for how to implement this feature, which may be mailed to the author of SFST.

9. Other

Technical issues

10. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

11. Next meeting, closing

12.09.2005 10:00

Closed at 12:56