Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Name lexicon infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 09:55.

Present: Børre, Maaren, Saara, Sjur, Thomas, Trond

Absent: Tomi (sick leave until Jan. 17th)

Main secretary: Sjur

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

On sick leave.

Trond

3. Documentation

Reviews

Add documentation on our corpus infrastructure and our corpus work in general (Børre, Tomi, Trond, Saara). For the basic corpora, we need two additional types of documentation, that is, documentation for two target groups:

  1. For the users/linguists: which corpora exist, and how do I use them? (This information is now scattered. Part of the how-to is covered in the catxml documentation; what documents are found where, plus an overall documentation, has not been written, since the corpus is so sparsely populated.)
    1. The catxml documentation is done, which is mostly what is needed. Do we need more?
    2. Review: Thomas, Maaren, Linda, Ilona, Trond

Review setup and reporting to be posted to the newsgroup, possibly a summary in our documentation.

Deadline for reviews: by next meeting.

Catxml review: this should be based on the new tool made by Tomi (C++ version), called ccat. Tomi will install it and announce it in the newsgroup when it is ready for review.

Update: Tomi is on sick leave, and Saara will make the tool available in Tomi’s absence.

4. Corpus gathering

Contracts

Next step:

  1. wait for comments from the lawyers - remind them of the task? (Sjur, Trond)
  2. possibly update contracts with remarks from lawyers
  3. start using them!

Collecting

Nothing new; collecting is currently held up while the lawyers check the final version of the contracts.

We want both parallel and erroneous (unproofed) text files. Børre to contact the Odin guy.

5. Corpus infrastructure

Updated task list:

  1. Include the xsl files under version control
    1. RCS version control is almost finished, but an issue with access control is still open. Discussed a bit in the meeting, but nothing conclusive. We’ll continue the discussion in the newsgroup.
  2. Incorporate language detection as part of the corpus processing (Saara)
    1. Almost finished. Some heuristics regarding other Sámi languages in the same document to be added.
  3. We need a way to deal with hyphenated documents (documents with manually inserted hyphenation marks) in catxml/preprocess.
    1. Done, needs review (Saara)
  4. We need to review whether automatic hyphen detection alone is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on.
    1. Acceptable results: 90% of all real hyphens correctly tagged.
  5. CGI-admin script to add an xsl file to a corpus file that doesn’t have one (Tomi)
    1. Saara will review the existing code, consult Tomi, and try to make a script to help Børre utilize the template and the infra in place.
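
The acceptance criterion under item 4 (90% of all real hyphens correctly tagged) amounts to a recall threshold. A minimal sketch of how such a check could be computed, assuming the manually verified and the automatically detected hyphens are both available as sets of character offsets. The function names and the data layout are hypothetical and not part of the existing catxml/preprocess tooling:

```python
def hyphen_recall(gold, predicted):
    """Fraction of real (manually verified) hyphens that the detector also tagged."""
    gold = set(gold)
    if not gold:
        raise ValueError("no real hyphens in the gold data")
    return len(gold & set(predicted)) / len(gold)


def is_acceptable(gold, predicted, threshold=0.90):
    """True if at least `threshold` of all real hyphens were correctly tagged."""
    return hyphen_recall(gold, predicted) >= threshold


# Hypothetical data: 10 verified hyphens, 9 of them found by the detector,
# plus one false positive (offset 99), which does not affect recall.
gold_offsets = {3, 17, 25, 40, 58, 61, 77, 80, 92, 104}
predicted_offsets = {3, 17, 25, 40, 58, 61, 77, 80, 92, 99}

print(hyphen_recall(gold_offsets, predicted_offsets))
print(is_acceptable(gold_offsets, predicted_offsets))
```

Note that a criterion of this kind only measures how many real hyphens are found; wrongly tagged hyphens would have to be counted separately if they turn out to matter for the review.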

6. Linguistics

North Sámi

Lule Sámi

Open tasks:

Numerals

The following North Sámi linguistic issues should be settled before going into the numeral project:

  1. Three-part compounds
  2. Diphthong simplification
  3. Derivation

These issues were recently completed for Lule Sámi, and it is more efficient to complete them for North Sámi directly afterwards than to begin a new topic.

Numeral treatment is at different levels in the existing sme and smj parsers, but the issue itself is common to the two languages, and should therefore be treated in parallel.

Numerals in North Sámi: Inventory is listed elsewhere.

Numerals in Lule Sámi: There are 70 lines of code setting up the structure for case inflection of basic numerals.

7. Name lexicon infrastructure

Summary: see the newsgroup

Complex names

Task list for this issue:

XML format

We had our meeting, and the resulting format is quite close to the structure of the existing risten.no term base. The conversion script to xml now needs to be updated. The new format is documented in the newsgroup.

Tasks:

  1. update conversion from lexc to xml to reflect new xml format
  2. testing of conversion
  3. continue the discussion of the name lexicon format (Saara, Tomi, Sjur, Trond)
  4. implement a prototype in eXist
  5. eXist as editor:
    1. develop the needed XQueries and interface
    2. synchronisation between risten.no and
    3. test whether eXist as editor is actually working well

8. Other

Seminars

Technical issues

Bug fixing

28 open bugs (and 25 risten.no bugs)

Move Bugzilla

Bugzilla now works at the old URL [http://giellatekno.uit.no/bugzilla/].

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

09.01.2006 09:30

Closed at 11:43