Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Speller infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10:12.

Present: Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: Børre

Main secretary: Trond

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Documentation tasks:

  1. Add documentation on our corpus infrastructure and our corpus work in general (“To be done by the ones making the corpora”: Børre, Tomi, Trond, Saara).
  2. Add/update Aspell documentation (Tomi)
  3. Finish divvun2web script (Børre)
  4. As always: document what you’re doing :-) (all)

4. Corpus gathering

Since last meeting:

From last meeting:

Børre, Trond and Sjur had their meeting, and the Helsinki contract is quite good as it is. There are some points still to be discussed. We’ll continue this week.

Paths forward: We have a contract suggestion. Sjur and Børre should start the negotiation with the authors and/or publishers to get text. We should make a priority list for authors, and we should invite ourselves to their meeting to discuss. We should carry on the discussions both with Iđut and with Davvi Girji.

How to proceed:

  1. Get the contract suggestion ready
    1. Part 1 translated; parts 2 and 3 still missing. To be done this week
    2. Get part 4 from Kimmo, and translate it
    3. Contact our lawyers, at SD and UIT (today, tomorrow).
    4. When the Norwegian versions of the contracts are ready, make sure they correspond to the Finnish ones in their legal interpretation (as far as possible); then publish both versions for others to reuse, with background documentation (in cooperation with Kimmo Koskenniemi)
  2. Approach the text owners (see ordered list below)

Independent of the contract work

  1. Bible: The New Testament (Trond)
  2. Bureaucratic text:
    1. Sámi Parliament (Børre)
    2. Sámi Oahpahusráđđi (Børre)
    3. KRD (Børre, check whether we are missing texts (discuss with Trond))
    4. the Sámi municipalities (Børre)
  3. Textbooks
    1. To the extent that text can be got directly from SO.

After the contracts are ready

Sjur and Børre should probably take a Tour-de-Sápmi, and meet with the most important persons and institutions: Børre as the person responsible for collecting, Sjur as the person responsible for the project and as representative of SD.

The tour should be planned, not in this meeting, but before the contracts are ready, i.e., it should be planned next week.

  1. Commercially published texts
    1. Author organisations’ meetings
    2. Key authors one by one
      1. (list of author names) Kerttu Vuolab, Kirsi Paltto,
    3. Iđut and key authors there (Børre)
    4. Davvi Girji and key authors there
  2. Newspaper text:
    1. Sámi Instituhtta’s (for the old archive of Min Áigi and Áššu)
    2. Áššu has been making a CD since the end of May, so there should be a pile there. Børre suggests that they send us the CDs they have, so that we may look at them, ensure that the routines work, and check that we are able to utilize their format.
    3. Min Áigi

List of texts with lower priority (to be gathered when the above list is more or less fixed)

5. Corpus infrastructure

Do documentation.

Naming conventions and directory structure

We have a decision from Helsinki:

We do not have any notes from our Helsinki meeting (they were left on the whiteboard). The following is our reconstruction of our decision.

There are three directories, with the same substructure (a 6-way partition according to genre). The 3 directories contain different versions of the files as they are processed from original format, via intermediate, to final xml version.

orig
 (substructure)
    filename.doc, filename.html, filename.pdf
int
 (substructure)
    filename.int.xml
    filename.xsl
gt (and we want a new name for gt)
 (substructure)
    filename.xml

There is a substructure division according to genre:

bible (NT, OT, perhaps other liturgical txts)
newspaper
    Min Áigi
    Áššu
    Other
fiction
administrative
    central   (Oslo, Stockholm, Helsinki)
    samediggi (Kárášjohka, Giron, Anár)
    municipalities
factual (educational)
legal

For the linguistic search interface, all texts will probably be published uncorrected, with some portions manually corrected and published in parallel to the uncorrected variants.

Things to do next, and persons to do it:

  1. Rewrite the corpus directory (Børre)
  2. Document the corpus directory (Børre)
  3. Continue the work on converting texts from orig/ via int/ to gt/. (Børre, Saara, with the help of Tomi)
  4. Make a sister catalogue for smj, but with a completely flat structure within each orig/int/gt.
    1. corp/sme/(orig/int/gt)
    2. corp/smj/(orig/int/gt)
  5. Document the xsl conversion and scripts (Tomi)
  6. Make conversion for html documents (Tomi)
  7. Start looking at conversion of pdf documents (Saara)
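
The layout above can be sketched as a small path-mapping helper. This is an illustration only: the function name `stage_paths` is hypothetical, and it assumes the `corp/<lang>/(orig|int|gt)/<substructure>/` layout from the task list, with the file extensions from the directory listing. An empty substructure covers the flat smj case.

```python
from pathlib import PurePosixPath


def stage_paths(orig_path):
    """Map a file under .../orig/... to its counterparts in int/ and gt/.

    Hypothetical helper: it only rewrites path names and does not
    touch the file system or perform any conversion.
    """
    path = PurePosixPath(orig_path)
    parts = path.parts
    if "orig" not in parts:
        raise ValueError("not under an orig/ directory: " + orig_path)
    i = parts.index("orig")
    head = PurePosixPath(*parts[:i])   # e.g. corp/sme
    sub = parts[i + 1:-1]              # genre substructure, empty for smj
    stem = path.stem                   # file name without its extension
    return {
        "orig": str(path),
        "int": str(head.joinpath("int", *sub, stem + ".int.xml")),
        "xsl": str(head.joinpath("int", *sub, stem + ".xsl")),
        "gt": str(head.joinpath("gt", *sub, stem + ".xml")),
    }


for name, target in stage_paths("corp/sme/orig/newspaper/filename.doc").items():
    print(name, target)
```

Two originals that share a stem (say filename.doc and filename.html) would map to the same targets under this sketch, so a real script would need a rule for such cases.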

6. Linguistics

Note: The Bugzilla bug categories for lexica and morphophonology are now split into sme and smj categories.

North Sámi

Names are inherently multilingual as well as cross-lingual. Cf. Appendices B through Č in Sammallahti 1989.

Examples of place names:

Karasjok Produkter
       deatnulačča Nils Porsangera (82) go máhtii eanet
       deatnulačča Nils Porsangera (82) go máhtii eanet
            juoigi Nils Porsangera go máhtii eanet Deanu
drosjeeaiggát (NAF avd. Hammerfest-Karasjok 1984: 21).

Examples of person names:

Báđár  - Paadar
Guhtur - Guttorm
Dámmot - Blind
Bieská - Pieska
Bieskán/Bieski - Pieski
Dommá  - Tommi
Duomis - Thomas

Niilas - Nils
Duommá - Thomas

A first step towards an xml infrastructure for a language-independent name database is the following.

named_entity Porsáŋgu
    semantic class information
        place name
    sme: Porsáŋgu
        continuation lexica
            norw stem and norw gr info
            sme stem and sme gr info
            ...
    nob: Porsanger
        continuation lexica
            norw stem and norw gr info
            sme stem and sme gr info
            ...
    fin: Porsanki
        continuation lexica
            norw stem and norw gr info
            sme stem and sme gr info
            ...


named_entity Porsanger
    semantic class information
        person name
    name_lg1 / all
        continuation lexica
            norw stem and norw gr info
            sme stem and sme gr info
            ...
    name_lg2 (-)
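
The outline above could be serialized along the following lines. This is a minimal sketch, not a format decision: all element and attribute names (`named_entity`, `semantic_class`, `form`, `contlex`, `key`, `lang`, `lemma`, `grammar`) are assumptions made for illustration, and the continuation lexicon content is left empty, as in the outline.

```python
import xml.etree.ElementTree as ET


def build_entity(key, semantic_class, forms, grammars=("nob", "sme")):
    """Build one <named_entity> element.

    forms maps a language code to the name form in that language.
    Every form gets one empty <contlex> slot per analysing grammar,
    mirroring "norw stem and norw gr info / sme stem and sme gr info".
    All element and attribute names are hypothetical.
    """
    entity = ET.Element("named_entity", {"key": key})
    ET.SubElement(entity, "semantic_class").text = semantic_class
    for lang, lemma in forms.items():
        form = ET.SubElement(entity, "form", {"lang": lang, "lemma": lemma})
        for grammar in grammars:
            ET.SubElement(form, "contlex", {"grammar": grammar})
    return entity


porsangu = build_entity(
    "Porsáŋgu",
    "place name",
    {"sme": "Porsáŋgu", "nob": "Porsanger", "fin": "Porsanki"},
)
ET.indent(porsangu)  # pretty-printing only; needs Python 3.9 or later
print(ET.tostring(porsangu, encoding="unicode"))
```

Keeping the language-specific forms as siblings under one entity is what lets a place name such as Porsáŋgu/Porsanger/Porsanki be looked up from any of its forms.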

Conclusion: We need a name project.

Issues:

  1. What format do we want for our common base?
  2. What semantic information do we want to add to the names?

Planning:

  1. Who shall work on this?
    1. Name lexicon work group: Sjur, Trond, Maaren?
  2. What time plan shall it have?
    1. Kickoff at Oct 05, in Kautokeino
    2. Plans ready at some not too late point
    3. Do what we have to do during the winter
    4. Implement the name base as input for our parsers at some later point.

Classification:

  1. Preparatory work
    1. Talk to Kari Pitkänen in Tampere, who did a semantic classification for the Finnish proper noun lexicon (Trond)
    2. Look into other projects (Maaren)
    3. Make a draft (each) of What We Ideally Want (Maaren, Sjur, Trond)
  2. Substantial work
    1. Make a semantic theory
    2. Make a proposal with DTD and examples

Making the new base

  1. Identify status quo and a goal
  2. Write tools for semiautomatic transition
  3. Do a pilot test
  4. Move (parts of?) the name lexicon to the new format
    1. Part of the manual work could perhaps be given to part-time-workers

Incorporating the new base in the parser

  1. decide on a proper location for the name base
  2. conversion tool xml -> lexc
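
The conversion step could look roughly like this. It is a sketch under stated assumptions: the XML element and attribute names, the continuation lexicon names (`HypotheticalPlaceSme`, `HypotheticalPlaceNob`) and the lexicon name `ProperNouns` are invented for illustration. Only the general lexc entry shape (`form ContinuationLexicon ;` under a `LEXICON` header, with `%` as escape character) is taken as given.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; the real name base format is still to be decided.
SAMPLE = """\
<names>
  <named_entity key="Porsáŋgu">
    <form lang="sme" lemma="Porsáŋgu">
      <contlex grammar="sme">HypotheticalPlaceSme</contlex>
    </form>
    <form lang="nob" lemma="Porsanger">
      <contlex grammar="sme">HypotheticalPlaceNob</contlex>
    </form>
  </named_entity>
</names>
"""

# Characters escaped with % in lexc entries (not necessarily exhaustive).
LEXC_SPECIAL = set("% :;!<>#")


def lexc_escape(text):
    """Prefix each special character with the lexc escape character %."""
    return "".join("%" + ch if ch in LEXC_SPECIAL else ch for ch in text)


def xml_to_lexc(xml_text, grammar, lexicon_name="ProperNouns"):
    """Emit one lexc entry per name form that has a continuation
    lexicon for the given analysing grammar."""
    root = ET.fromstring(xml_text)
    lines = ["LEXICON " + lexicon_name]
    for form in root.iter("form"):
        for contlex in form.findall("contlex"):
            if contlex.get("grammar") == grammar and contlex.text:
                lemma = lexc_escape(form.get("lemma"))
                lines.append(f"{lemma} {contlex.text.strip()} ;")
    return "\n".join(lines) + "\n"


print(xml_to_lexc(SAMPLE, "sme"))
```

Selecting by `grammar` means the same name base can feed several parsers: each one picks up every form, including foreign-language forms, with the continuation lexicon that suits its own morphology.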

TODO:

  1. A Kickoff meeting in Kautokeino.
  2. Before the kickoff meeting, Sjur, Maaren and Trond to do some preparatory work as sketched above.

Lule Sámi

Lexicon work

We do not know when we get the lexicon from Anders Kintel. We need a meeting with him and with Árran in order to coordinate the work. Sjur will call Bitte and get updated wrt the license issues with the lexicon material.

The goal is to establish a mode of work where people in Árran and in the language technology projects all work on the same source code, for their respective projects. In order to get that far, we will need to meet with Anders and the others at Árran.

Work on Lule Sámi in general

We also need input from the other persons working with Lule Sámi when it comes to corpus gathering, terminology, linguistic issues, etc. After we have integrated the lexicon into the parser we will need a meeting with the persons working in Árran to find a mode of concrete cooperation. The Lule Sámi spellchecker needs as broad a basis as possible.

Status quo on the parser:

TODO:

  1. Continue the work on the lexicon (Børre, Thomas, Sjur)
  2. Plan a meeting between our Lule Sámi team and the people at Árran working on linguistic issues (Sjur, Trond, Thomas)
  3. Carry on the linguistic work (Thomas, Trond)

Numerals

We need

  1. An empirical overview
    1. Numeral generation
    2. Numeral inflection
    3. Numerals as parts of compounds
  2. A clear concept of how we want to treat them
    1. Tagging
  3. A treatment

TODO:

  1. Make a documentation chapter on numerals, identifying the open linguistic issues
  2. Look at implementation

Action plan: Trond and Maaren look into it.

7. Speller infrastructure

Aspell

Write documentation here as well.

The munch-list is working, and the affix file is improving. See the previous meeting memo (/admin/weekly/2005/Meeting_2005-08-15.html).

The problem with the affix file was that it did not accept UTF-8. It accepted Latin 4, but đ and ð got mixed up: in Latin 6, ð = F0 and đ = B9, whereas in Latin 4, đ = F0 and there is no ð.

There were problems with the Latin 6 encoding of the suffix file. After updating to CVS, the Latin 6 encoding is corrupted.
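
The đ/ð mix-up can be reproduced with Python's standard codecs. This is a minimal sketch: the helper names `reinterpret` and `can_encode` are hypothetical, and `iso8859_4` and `iso8859_10` are the Python codec names for Latin 4 and Latin 6.

```python
def reinterpret(text, written_as, read_as):
    """Encode text in one 8-bit encoding and decode the same bytes in
    another, which is what happens when a file's encoding is misjudged."""
    return text.encode(written_as).decode(read_as)


def can_encode(text, codec):
    """Return True if every character of text exists in the codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# Byte values of the two letters in each encoding.
print("đ".encode("iso8859_4"))    # b'\xf0'
print("đ".encode("iso8859_10"))   # b'\xb9'
print("ð".encode("iso8859_10"))   # b'\xf0'

# A Latin 4 file read as Latin 6 turns đ into ð.
print(reinterpret("đ", "iso8859_4", "iso8859_10"))

# Latin 4 has no ð at all.
print(can_encode("ð", "iso8859_4"))   # False
```

A check of this kind could be run on the affix and suffix files after each update to see whether any byte has changed meaning.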

Issues:

  1. The phonetic file should be systematically looked into.
    1. Check that it works
    2. Add more correspondences on an impressionistic basis
  2. Start work on collecting systematic spelling errors:
    1. Our in-house file typos.txt
    2. The soon-to-arrive error texts from newspapers
  3. The holes in the affix list should be mended
    1. Adjectives still to be done
  4. The munching process gets killed at cochise today
    1. Persons to talk to are Roy Dragseth and Steinar Trædal-Henden. Tomi contacts them.
  5. We should, at some point, evaluate whether this is The Correct Approach to aspell-type speller building
  6. Affix file UTF-8 problem should be checked and reported.
    1. Contact the Aspell author and ask for updates/fixes
  7. The clitics issue: Today we have a manually created affix file in order to suit the muncher. Strategy: Do the munching without the clitics, then enrich the manually created affix file via a genfisuffix program in order to get an automatically created suffix + clitic file to which the compiled lexicon can be added. We will have genfisuff take affix + clitic and make it into affix’. Then we use affix to munch but affix’ to spellcheck:
    1. stems + affixlist + cliticlist
    2. where all 11 clitics (found in the K lexicon) are mapped onto each and every affix.
  8. Today, substandard forms are marked as “!SUB”. The speller should not accept them. The solution is to grep out the !SUB tags during speller compilation. What is useful is to use the !SUB forms for a correction list (!SUB -> Correct). The proper place for doing this under aspell is as an addition to the sme_phonetic file. This requires the new lexicon format, to enable us to tie variant spellings of the same word together. Presently we can’t tell whether two lines are variants of the same word or not.
  9. Documentation
  10. We must create subcomponents under the Speller
    1. one for Aspell, another for MySpell, Hunspell, etc.
    2. TODO:
      1. Investigate
      2. Write procedures for doing so, for the .
    3. Directory structure:
      (spell)
       (src)
       (bin)
       (dist)
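
Items 7 and 8 in the list above can be sketched as follows. This is an illustration under assumptions, not the genfisuffix program itself: affixes and clitics are modelled as plain strings rather than Aspell affix-file rules, the sample suffix, clitic and lexicon strings are placeholders (not the 11 clitics from the K lexicon), and "grep out" is read as dropping every line that carries the !SUB marker.

```python
def expand_affixes(affixes, clitics):
    """Build affix' from affix + clitic: keep every plain affix and
    add each affix combined with each clitic."""
    expanded = list(affixes)
    for affix in affixes:
        for clitic in clitics:
            expanded.append(affix + clitic)
    return expanded


def drop_substandard(lines, marker="!SUB"):
    """Remove lexicon lines marked as substandard before the speller
    is compiled (the equivalent of grep -v on the marker)."""
    return [line for line in lines if marker not in line]


# Placeholder data, for illustration only.
affixes = ["t", "s"]
clitics = ["ge", "go"]
print(expand_affixes(affixes, clitics))

lexicon = ["stemA CONT ;", "stemB CONT ; !SUB"]
print(drop_substandard(lexicon))
```

With a affixes and c clitics, affix’ has a + a * c entries, so the 11 clitics multiply the affix list roughly twelvefold. That growth is one reason to munch with the small list and spellcheck with the expanded one.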
      

The conversion from aspell to myspell will work trivially as soon as the myspell list becomes smaller.

Issue left open.

Hunspell

Hunspell is presently already working with OOo, and is a much better speller engine, linguistically speaking (can handle compounds much better than Aspell, as well as complex inflection and derivation). For pointers, see the previous meeting memo.

Issue left open.

Other engines

Børre and Sjur had a long discussion with the author of the SFST library/tool set. Next on his priority list is a feature for handling spelling errors in running text analysis. This is in principle the same as what we want, so there should be a good opportunity for making SFST into the spelling engine. Sjur has a suggestion on how to implement this feature that may be mailed to the author of SFST.

Sjur repeated the suggestion from our SFST man that we could do this ourselves. The whole SFST is written in C++.

The question is:

  1. Do we want to do that?
  2. Would Tomi be able to do it alone, or do we need more resources for it?
  3. The issue has wide-reaching implications - basically that of replacing the Xerox tools with SFST counterparts.

TODO:

The task is deferred a month or so. Tomi/Sjur will look into the issue then and report back.

8. Other

Technical issues

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

19.09.2005 10:00

Closed at 13:20