Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Name lexicon infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 09:48.

Present: Børre, Sjur, Tomi, Trond

Absent: Maaren, Saara, Thomas

Main secretary: Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

On sick leave

Tomi

Trond

3. Documentation

Reviews

ccat review

Saara, Linda, Ilona, Trond

Conducted the review as part of the seminar, although Thomas and Maaren weren’t there. Outcome:

It works mostly as documented; a few glitches were found and corrected. The documentation is terse but to the point (simple is beautiful, that is, in full accordance with KISS). ccat -h gives:

        -a              Print all text elements.
        -p              Print plain paragraphs. (default)
        -T              Print paragraphs with title type.
        -L              Print paragraphs with list type.
        -t              Print paragraphs with table type.
        -r <dir>        Recursively process directory dir and subdirs encountered.
        -h              Print this help message.

The XML-based documentation was not completely up to date with the latest changes; this was fixed in the meeting.
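To make the paragraph-type options in the help text concrete, here is a minimal Python sketch of the same kind of filtering. It is not ccat itself: the element name `p`, the `type` attribute, its values, and the sample document are all assumptions made for illustration, not the actual corpus schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Element and attribute names are
# assumptions for illustration only, not the real corpus format.
SAMPLE = """<document>
  <body>
    <p type="title">Heading</p>
    <p>First plain paragraph.</p>
    <p type="list">List item</p>
    <p type="table">Table cell</p>
    <p>Second plain paragraph.</p>
  </body>
</document>"""


def extract_paragraphs(xml_text, wanted=None):
    """Return the text of <p> elements whose type is in `wanted`.

    wanted=None selects every paragraph (comparable to -a). A paragraph
    without a type attribute is treated as "plain" (comparable to -p).
    """
    root = ET.fromstring(xml_text)
    result = []
    for p in root.iter("p"):
        ptype = p.get("type", "plain")
        if wanted is None or ptype in wanted:
            result.append("".join(p.itertext()).strip())
    return result


if __name__ == "__main__":
    # Plain paragraphs only, the default behaviour described above.
    for line in extract_paragraphs(SAMPLE, {"plain"}):
        print(line)
```

Passing `{"title"}`, `{"list"}` or `{"table"}` would correspond roughly to -T, -L and -t under the same assumptions.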

Other documentation

4. Corpus gathering

Discussed briefly how the formalities should be implemented: signature, Websak, posting etc.

Collecting

See the previous meeting memo for what’s to be done.

TODO: Still a lot for Børre!

Odin

DONE: Trond, and then Børre to call Ove Sæth to re-establish contact.

Sæth will discuss with colleagues how to implement the cooperation.

Bible texts

ccat -t zcorp/gt/sme/bible/ot/1Mos_09-01.doc.xml | less

This gives everything. What we want is to make a file-specific version of testament.xml, with these properties:

TODO:

We already have an embryonic converter: gt/script/testament.xsl. Usage:

   xsltproc /path/to/testament.xsl bible-text.xml > converted-text.xml

5. Corpus infrastructure

Task list:

  1. Include the xsl files under version control
    1. RCS version control is almost finished, but an issue with access control is still open. Discussed a bit in the meeting, but nothing conclusive. We’ll continue the discussion in the newsgroup.
  2. Incorporate language detection as part of the corpus processing (Saara)
    1. Almost finished. Needs an improved Finnish language model: at present it isn’t able to distinguish Finnish from Sámi (proving the family bonds :-)
  3. We need to review whether automatic hyphen detection alone is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on.
    1. Acceptable results: 90% of all real hyphens correctly tagged.
  4. CGI-admin script to add xsl-file to a corpus file that doesn’t have one (Saara)
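The language detection task in the list above can be illustrated with a character-trigram sketch. This is not the detector used in the corpus processing, whose method the memo does not describe. It is a generic technique under stated assumptions, and the function names and the toy training strings are hypothetical and far too small for real use.

```python
from collections import Counter


def trigrams(text):
    """Return a Counter of character trigrams, padded with spaces."""
    padded = " " + text.lower() + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def train(samples):
    """Build one trigram profile per language from {lang: text}."""
    return {lang: trigrams(text) for lang, text in samples.items()}


def detect(text, profiles):
    """Return (best_language, scores).

    The score is the trigram mass the text shares with each profile.
    """
    grams = trigrams(text)
    scores = {
        lang: sum(min(count, profile[g]) for g, count in grams.items())
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get), scores


# Toy training data (an assumption for illustration only).
PROFILES = train({
    "fin": "hyvää päivää ja tervetuloa",
    "sme": "buorre beaivi ja bures boahtin",
})

if __name__ == "__main__":
    print(detect("päivää", PROFILES))
    print(detect("beaivi", PROFILES))
```

Trigrams that occur in both profiles (here those of the shared word "ja") add equally to both scores, which gives a rough intuition for why a weak model for one language can blur the distinction.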

Things are moving forward, but still more work to do. The list is left as is.
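The acceptance criterion for hyphen detection in the task list (90% of all real hyphens correctly tagged) is a recall measure. A minimal sketch of how it could be checked, assuming hyphen positions are available as plain integer offsets from a manually checked sample and from the automatic tagger (both representations are hypothetical):

```python
ACCEPTANCE_THRESHOLD = 0.90  # from the task list: 90% of real hyphens


def hyphen_recall(gold, predicted):
    """Fraction of real (gold) hyphen positions that were also tagged."""
    gold = set(gold)
    if not gold:
        raise ValueError("no real hyphens to evaluate against")
    return len(gold & set(predicted)) / len(gold)


def is_acceptable(gold, predicted, threshold=ACCEPTANCE_THRESHOLD):
    """True if recall meets the threshold; false positives are ignored."""
    return hyphen_recall(gold, predicted) >= threshold


if __name__ == "__main__":
    gold = [3, 17, 42, 58, 61, 77, 90, 104, 120, 133]    # hypothetical offsets
    tagged = [3, 17, 42, 58, 61, 77, 90, 104, 120, 200]  # one miss, one false hit
    print(hyphen_recall(gold, tagged), is_acceptable(gold, tagged))
```

Note that the criterion as stated says nothing about wrongly tagged hyphens, so this sketch does not penalise them; whether precision should also be measured is left open.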

6. Linguistics

Nothing today, our linguists are on sick leave or not participating. For the tasks and their status, see the previous meeting memo.

7. Name lexicon infrastructure

Complex names

Task list for this issue:

XML format

Tasks:

  1. make a test lexicon for evaluating the format, set up the editing, and test it (Saara)
    1. Done
  2. update conversion from lexc to xml to reflect new xml format (Saara)
    1. mostly done, some open questions left
  3. testing of conversion
  4. eXist as editor:
    1. develop the needed XQueries and interface
    2. synchronisation between risten.no and
    3. test whether eXist as editor is actually working well

8. Other

SGL Seminar

Technical issues

In ccat’s defence, I must say that cat would have given far more error messages in a similar situation (hold on, testing still under way).

   preprocess file_name.txt - OK
   cat file_name.txt | preprocess - bug!!
   catxml file_name.xml | preprocess - ??
   ccat filename | preprocess - bug !!

This bug is no longer a high priority, because ccat behaves differently from cat, and because cat can be avoided when working locally.

BUG: close as Won’t fix.

Bug fixing

30 open bugs (and 25 risten.no bugs)

Norwegian ispell press release

The i18n section of Skolelinux plans a press release including a paragraph about our project. We will ask them to reformulate a couple of things, and remove the links they’ve included. Text submitted to them:

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

06.02.2006 09:30

Closed at 12:03