Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Name lexicon infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 09:57.

Present: Børre, Saara, Sjur, Trond

Absent: Maaren, Thomas, Tomi

Main secretary: Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

On sick-leave.

Tomi

Trond

3. Documentation

Reviews

The XSLT processing part of the corpus infrastructure review is finished. The code is finished and ready to use, but language identification still needs improvement.

4. Corpus gathering

Collecting

See a previous meeting memo for what’s to be done.

TODO: Send out the rest of the letters (Børre)

Since last meeting:

Next:

Odin

Waiting for Sæth to discuss with colleagues how to implement the cooperation and to get back to us.

TODO:

Bible texts

TODO:

5. Corpus infrastructure

We need more “version control” in the corpus work: we don’t know which version of the XSL script was used, only roughly whether it was used at all. We need to document within the XSL file which versions of all tools were used, including the version of the XSL file (common section/template) itself.
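One way to record tool versions alongside the converted text can be sketched as follows. This is only an illustration under stated assumptions: the `<versions>` and `<tool>` element names, the function `stamp_versions`, and the version numbers are all hypothetical and are not taken from the actual corpus format or scripts.

```python
import xml.etree.ElementTree as ET


def stamp_versions(xml_text, versions):
    """Return xml_text with a <versions> element listing each tool and version.

    The <versions>/<tool> element names are hypothetical, chosen for this
    sketch only; they are not the project's corpus format.
    """
    root = ET.fromstring(xml_text)
    # Drop any earlier stamp so the record reflects the latest conversion only.
    for old in root.findall("versions"):
        root.remove(old)
    stamp = ET.Element("versions")
    for name in sorted(versions):
        ET.SubElement(stamp, "tool", {"name": name, "version": versions[name]})
    # Put the stamp first so it is easy to find when inspecting a file.
    root.insert(0, stamp)
    return ET.tostring(root, encoding="unicode")


# Illustrative input; the tool names and version numbers are made up.
stamped = stamp_versions(
    "<document><body><p>Text</p></body></document>",
    {"common.xsl": "1.12", "xsltproc": "1.1.15"},
)
print(stamped)
```

With a stamp like this in every converted file, it becomes possible to tell afterwards which files need reconversion when a script changes.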

Transferring the old gt/sme/corp files to the new corpus repo:

Task list:

  1. Include the xsl files under version control
    1. RCS version control is almost finished, but an issue with access control is still open.
      1. Access control resolved through Unix groups: one group for corpus maintainers with write access, and another for corpus users, with read-only access.
  2. Improve Finnish language detection as part of the corpus processing
    1. Move to Bugzilla (Saara)
  3. Review automatic hyphen tagging:
    1. Acceptable results: 90% of all real hyphens correctly tagged.
      1. Move to Bugzilla (Saara)
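The acceptance criterion for hyphen tagging can be read as a recall threshold: of all real hyphens in a hand-checked sample, at least 90% must also be tagged by the tool. The sketch below assumes that reading; the function names and the sample positions are hypothetical, and false alarms are deliberately not counted because the criterion as minuted does not mention them.

```python
def hyphen_recall(gold, predicted):
    """Share of real hyphen positions (gold) that the tagger also marked."""
    gold = set(gold)
    if not gold:
        raise ValueError("no real hyphens in the sample")
    return len(gold & set(predicted)) / len(gold)


def acceptable(gold, predicted, threshold=0.90):
    """True if recall meets the threshold (90% by default)."""
    return hyphen_recall(gold, predicted) >= threshold


# Positions are (line, column) pairs; the values are made up for illustration.
gold = [(1, 14), (3, 60), (7, 22), (9, 5), (12, 41),
        (15, 8), (18, 33), (21, 70), (25, 2), (30, 19)]
# The tagger finds 9 of the 10 real hyphens and raises one false alarm.
predicted = gold[:9] + [(40, 1)]

print(hyphen_recall(gold, predicted), acceptable(gold, predicted))
```

If false alarms should also count against the tool, precision would have to be measured in addition to recall.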

Further discussion about corpus analysis and computer use:

6. Linguistics

Nothing to report.

7. Name lexicon infrastructure

Complex names

TODO:

Move these issues to Bugzilla (Børre)

Preprocessor optimization

To optimize preprocessing, one could build a targeted transducer containing only the lexicons relevant for preprocessing. At present, however, that leads to the following lexicons being reported as referenced but undefined:

Perhaps picking the

hum-tf4-ans142:~/gt/sme/src trond$ grep '% ' adv-sme-lex.txt
earret% eará adv ;
dan% dihte adv ;
...

TODO:

  1. make a lexc Root lexicon (first 40 lines of sme-lex.txt)
  2. extract the relevant parts of the relevant lexica from the main transducer
  3. build the transducer from the union of 1 and 2.
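Before building from a subset of lexica, it may help to check which continuation lexicons the subset references without defining them, since that is what gets reported as "referenced but undefined". The following is a minimal sketch, not the project's tooling: it assumes a simplified lexc layout (one entry per line ending in `Continuation ;`, comments starting with `!`), and the lexicon names `Adverb`, `Noun` and `N1` in the sample are made up. Only the two adverb entries come from the grep output above.

```python
import re

LEXICON_RE = re.compile(r"^LEXICON\s+(\S+)")
INFO_STRING_RE = re.compile(r'"[^"]*"')


def parse_lexc(text):
    """Map each lexicon name to the set of continuation lexicons it references.

    Simplified: one entry per line, no escaped '!' characters, and anything
    before the first LEXICON line (e.g. Multichar_Symbols) is ignored.
    """
    lexicons = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0]            # strip lexc comments
        line = INFO_STRING_RE.sub("", line).strip()  # drop quoted info strings
        if not line:
            continue
        match = LEXICON_RE.match(line)
        if match:
            current = match.group(1)
            lexicons.setdefault(current, set())
        elif current is not None and line.endswith(";"):
            fields = line[:-1].split()
            if fields:
                # The continuation lexicon is the last field before ';'.
                lexicons[current].add(fields[-1])
    return lexicons


def undefined_lexicons(lexicons, wanted):
    """Lexicons referenced from the `wanted` subset but not included in it."""
    referenced = set()
    for name in wanted:
        referenced |= lexicons.get(name, set())
    # '#' is the built-in end-of-word lexicon and never needs a definition.
    return referenced - set(wanted) - {"#"}


sample = """\
LEXICON Root
Adverb ;
Noun ;

LEXICON Adverb
earret% eará adv ;
dan% dihte adv ;

LEXICON adv
+Adv: # ;

LEXICON Noun
guolli N1 ;
"""

lex = parse_lexc(sample)
missing = undefined_lexicons(lex, ["Root", "Adverb", "adv"])
print(sorted(missing))
```

Each name reported this way either has to be added to the subset or have its references removed from the Root lexicon before the targeted transducer will compile cleanly.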

Discussion will continue on the newsgroup.

XML format

TODO:

  1. testing of conversion
  2. eXist as editor:
    1. develop the needed XQueries and interface
    2. data synchronisation between risten.no and
    3. test whether eXist as editor is actually working well

8. Other

SGL Seminar

SGL has now been elected, with the following members:

SGL/normativity seminar:

Infra for new projects and ideas:

Bug fixing

30 open bugs (and 24 risten.no bugs)

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

20.02.2006 09:30

Closed at 11:33