Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Name lexicon infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 09:48.

Present: Børre (after 15 min), Saara, Sjur, Thomas (only a few minutes), Tomi

Absent: Maaren, Trond

Main secretary: Sjur

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general (Børre, Tomi, Trond, Saara). For the basic corpora, we need two additional types of documentation, that is, documentation for two target groups:

  1. For the users/linguists: Which corpora exist, and how do I use them? (This information is now scattered. Part of the how-to-use is documented in the catxml documentation. The overview of which documents are found where, and an overall documentation, have not been written, since the corpus is so sparsely populated.)
    1. The catxml documentation is done, which is what is mostly needed. Do we need more?
    2. We need a review: Thomas, Maaren, Linda, Ilona, Trond
  2. For the collectors: How do I add texts, where do I add them, and how do I convert them? (This is (partly?) done in the Corpus Conversion document.)
    1. We need a review of the web interface for corpus uploading - what is still missing?
      1. Review: Sjur, Saara, Trond, Thomas

Review setup and reporting to be posted to the newsgroup, possibly a summary in our documentation. Saara to make a review template, to be posted in the newsgroup and commented before the actual review is started.

Deadline for comments and final template: by next meeting.


Tomcat->static HTML progress

Currently, all pages are generated directly from XML by Forrest within Tomcat. We will change this so that Forrest pre-generates the HTML (and PDF), and the ready-made files are served directly.

Deadline: Finished by this week.

4. Corpus gathering

The Lule Sámi New Testament is ready for inclusion in our repository and will be added today. We then have our first Lule Sámi corpus text!

Contracts

Next step:

  1. Make our versions of the updated Helsinki contracts, and make sure they match our intentions. (Sjur and Trond)
  2. Send them to the SD lawyer and to the University lawyer through formal channels. (Sjur and Trond)

Contract 1 should have the main priority (contract 2 for Trond).

5. Corpus infrastructure

Updated task list:

  1. Include the xsl files under version control (Børre, Tomi, Saara)
    1. Saara has started a discussion in the newsgroup - please follow up!
      1. we can start using RCS right away, and we do so. The main users should be comfortable with two version control systems, and it is (relatively) easy to upgrade to CVS later, if we want that.
  2. Incorporate language detection as part of the corpus processing (Saara)
  3. We need a way to deal with hyphenated documents (documents with (manually) inserted hyphenation marks) in catxml/preprocess.
    1. What needs to be identified now are the conditions for the difference between "ealahus- ja ..." and "ea-la-hus-os-so-dat". The result should be: "ealahus- ja ..." and "ealahusossodat". Only hyphens that are part of the text layout should be tagged (that is, hyphens used to hyphenate a word at the end of a line, or to indicate syllabification for such hyphenation); hyphens in dates and other numeric expressions should be left as is.
      1. Identify all cases of the first type, and replace all hyphens NOT in this set with the tag (Saara).
      2. We need to review whether automatic hyphen detection alone is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on.
      3. Acceptable results: 90% of all real hyphens correctly tagged.
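
As a starting point for the discussion, the hyphen rule above could be sketched roughly as follows in Python. This is only an illustration under assumptions, not the catxml/preprocess implementation: the tag name `<hyph/>` and both function names are made up here (the minutes only say "the tag"), and the rule is deliberately simple: a hyphen between two letters is treated as a hyphenation mark, while a hyphen followed by whitespace ("ealahus- ja") or standing next to a digit (dates, numeric expressions) is left alone.

```python
import re

# Hypothetical marker; the minutes only say "the tag" without naming it.
HYPH_TAG = "<hyph/>"

# Assumed rule: a hyphen with a letter directly before AND directly after it
# is a hyphenation mark. [^\W\d_] matches any Unicode letter, so Sámi
# characters such as á and ŋ are covered.
# Hyphens followed by whitespace ("ealahus- ja") or adjacent to digits
# ("12-12-2005") do not match and are therefore left untouched.
_HYPHENATION = re.compile(r"(?<=[^\W\d_])-(?=[^\W\d_])")


def tag_hyphenation(text: str) -> str:
    """Replace every assumed hyphenation mark with HYPH_TAG."""
    return _HYPHENATION.sub(HYPH_TAG, text)


def strip_hyphenation(text: str) -> str:
    """Remove every assumed hyphenation mark, joining the word parts."""
    return _HYPHENATION.sub("", text)


if __name__ == "__main__":
    for sample in ["ealahus- ja ...", "ea-la-hus-os-so-dat", "12-12-2005"]:
        print(sample, "->", strip_hyphenation(sample))
```

Note that this simple rule would also tag real word-internal hyphens between letters, which is exactly the kind of case the review of automatic detection versus manual post-processing would have to look at.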

6. Linguistics

Name lexicon

Summary: see the newsgroup

The plan for this project was as follows: Two lines of work run in parallel:

  1. name markup
    1. Done! There are errors in the markup, people are urged to correct them as they pop up.

Complex names

Task list for this issue:

North Sámi

Lule Sámi

Great progress has been made on the G3 issue; only some minor points remain. oa:å has been carried over to ä:e.

Open tasks:

Today’s compilation time:

real 5m17.157s user 3m26.827s sys 0m5.070s

Numerals

The following North Sámi linguistic issues should be settled before going into the numeral project:

  1. Three-part compounds
  2. Diphthong simplification
  3. Derivation

These issues were recently completed for Lule Sámi, and it is more efficient to complete them for North Sámi directly afterwards instead of beginning a new topic.

Numeral treatment is at different levels in the existing sme and smj parsers, but the issue itself is common to the two languages, and should therefore be treated in parallel.

Numerals in North Sámi: Inventory is listed elsewhere.

Numerals in Lule Sámi: There are 70 lines of code setting up the structure for case inflection of basic numerals.

7. Name lexicon infrastructure

Present proposal:

Present risten.no:

Possible new proposal 1: as risten.no

Possible new proposal 2: separate documents:

Porsanger is both a person name and a place name; Porsáŋgu is only a place name, not a person name.

Five languages give 10 Trosterud entries and 5 Timbuktu entries. It would be better to have 2 Trosterud entries and 1 Timbuktu entry, but 15 continuation lexica for these three concepts.
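
The arithmetic behind this can be sketched in Python as below. This is a sketch under assumptions, not a proposal for the actual format. The class and function names are hypothetical. Only sme and smj are language codes taken from these minutes, and the other three are placeholders. Reading the two Trosterud entries as one person name and one place name is an assumption based on the Porsanger example.

```python
from dataclasses import dataclass, field

# sme and smj are named in these minutes; the other three codes are
# placeholders, since the five languages are not listed here.
LANGS = ["sme", "smj", "lang3", "lang4", "lang5"]


@dataclass
class NameConcept:
    """One language-independent name concept (hypothetical structure)."""
    lemma: str
    name_type: str  # assumed values: "person" or "place"
    # language code -> continuation lexicon name (placeholder values)
    contlex: dict = field(default_factory=dict)


def make_concept(lemma: str, name_type: str) -> NameConcept:
    # One continuation lexicon per language; the value is only a placeholder.
    return NameConcept(lemma, name_type, {lang: "CONTLEX-TBD" for lang in LANGS})


def count_entries(concepts) -> int:
    """Number of shared entries when each concept is stored once."""
    return len(concepts)


def count_contlexica(concepts) -> int:
    """Number of per-language continuation lexica across all concepts."""
    return sum(len(c.contlex) for c in concepts)


concepts = [
    make_concept("Trosterud", "person"),  # assumed reading
    make_concept("Trosterud", "place"),   # assumed reading
    make_concept("Timbuktu", "place"),
]

if __name__ == "__main__":
    print(count_entries(concepts), "entries,",
          count_contlexica(concepts), "continuation lexica")
```

With one entry per concept per language, the same data would need 15 full entries (10 Trosterud, 5 Timbuktu) instead of 3 shared entries carrying 15 continuation lexica.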

Discussion to continue in the newsgroup.

Tasks:

  1. testing of conversion
  2. continue the discussion of the name lexicon format (Saara, Tomi, Sjur, Trond)
  3. implement a prototype in eXist
  4. eXist as editor:
    1. develop the needed XQueries and interface
    2. synchronisation between risten.no and
    3. test whether eXist as editor is actually working well

8. Other

Technical issues

Video conferencing across firewalls

We’re still waiting for a working URL (working from outside SD, that is).

Bug fixing

28 open bugs (and 2 risten.no bugs)

Move Bugzilla

Move Bugzilla to the same server as the other ones (or make it work at the expected URL: http://giellatekno.uit.no/bugzilla/).

TODO, TODO. Thor Øivind.

risten.no

The risten.no data has been rescued, and a new version of eXist is ready for installation. The installation has not been done due to the network problems at SD. Will be done this week, probably today. Also, an even newer version of eXist has been released (snapshot from last Saturday).

Tomi will continue the proper name work.

Rucksacks

They were delivered on Nov 25. They have disappeared at UiTø, and this is now being investigated. Two were also properly delivered at SD/Guovdageaidnu.

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

12.12.2005 09:30

Closed at 11:12