Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:59.

Present: Sjur, Thomas, Trond, Børre, Tomi

Absent: Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Collecting

See a previous meeting memo for what’s to be done.

TODO:

New contracts:

Olavi Korhonen’s Lule Sámi dictionary.

TODO:

KIO Grafisk and the Iđut books

TODO:

Bible texts

TODO:

Davvi Girji

TODO:

Min Áigi

TODO:

Kåfjord

TODO:

Sámi Instituhtta

TODO:

Čálliid Lágádus

[http://www.calliidlagadus.org/]

TODO:

Árran

TODO:

5. Corpus infrastructure

General

Errors in the Antiword conversions were found when parsing the XML corpus.

There are also problems with the PDF conversion. Børre has found a tool that produces XML output, and he has sent the link to Saara.

TODO:

User accounts and access

TODO:

Conclusion: we have the following classes of users:

Shell access

External users will get their own user account, belonging to the groups myself and bound, and will be able to install their own tools and programs for corpus processing, analysis, etc.

To let members of the bound group run analyses, we need to make some minor adjustments. Apart from that, they automatically have full access to the Xerox tools, and the compiled FSTs are available in /opt/smi/sme/bin/sme-num.fst etc. The Xerox tools and vislcg are available in /opt/Xerox/bin. A couple of tools are missing right now, and need to be added to /opt/ by a cron job.

TODO:

Web browser access

Users of only the free corpus won’t need anything but a browser.

Users of the bound corpus will need a username and password to the Oslo computer (until the base is moved to Tromsø). These usernames and passwords will be created and administered by the Oslo people (modulo discussion), later by ourselves.

TODO:

More texts to the graphical corpus interface:

TODO:

  1. refine xml-tagged output (Saara and Tomi)
    1. done, but it is still open whether it is finished
  2. add text to the server (Lars)

Aligner

Trond and Saara will continue this issue.

We need markup of parallelism in the corpus DTD, at least an indication of which documents belong together. Discussion to continue in the newsgroup (Saara has started it - please respond!).

TODO:

Language recognition

Still waiting for more smj text to improve it.

Free and non-free texts

Anything? Final check with Børre and Saara - waiting for them to return. Nothing more according to Børre. Only texts which are explicitly marked as free are now included in the free/ directory.

Corpus summary

Forrest goes into an endless loop when processing these files. It happens when converting the XML to Forrest format. More info in our Bugzilla.

TODO:

6. Infrastructure

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

TODO:

Hyphenator

TODO:

Automatic Bugzilla reminder for untouched bugs

TODO:

JSPWiki update

Sjur has corrected and improved the JSPWiki parsing in Forrest, and found that mixed lists should not be supported. We should check whether we have any such lists anywhere and correct them; otherwise we risk that they are not rendered in HTML, which would mean losing information. Sjur has sent in a patch to Forrest that corrects nested list behaviour, but at the same time it makes mixed lists invisible except for the first level. It seems that mixed lists aren’t part of the wiki format.

Mixed list example:

1. something numbered
    - some sub-thing with a bullet

* something bulleted
    1. some sub-thing with a number

Sjur tried to grep, but multiline pattern matching is beyond him. Tomi: Here’s the pattern to use:

egrep -C 3 -R "^\*.*[$]*{1,16}\#" *
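Since the check spans more than one line, a small script may be easier to reason about than a grep pattern. The following is only a minimal sketch, assuming that list items in the wiki source start with a run of `*` (bullet) or `#` (numbered) characters followed by whitespace, and that the length of the run gives the nesting depth. The function name `find_mixed_lists` is hypothetical and not part of any existing tool.

```python
import re

# Assumption: list items start with a run of "*" (bullet) or "#" (numbered)
# characters followed by whitespace; the run length is the nesting depth.
LIST_PREFIX = re.compile(r"^([*#]+)\s")


def find_mixed_lists(lines):
    """Return (line_number, text) pairs for list items that mix list types."""
    hits = []
    previous = None  # (depth, marker) of the preceding list item, if any
    for number, line in enumerate(lines, start=1):
        match = LIST_PREFIX.match(line)
        if match is None:
            previous = None  # any non-list line ends the current list
            continue
        prefix = match.group(1)
        depth, marker = len(prefix), prefix[-1]
        # Case 1: the prefix itself combines both markers, e.g. "*#".
        mixed_prefix = len(set(prefix)) > 1
        # Case 2: a deeper item uses a different marker than its predecessor.
        mixed_nesting = (
            previous is not None
            and depth > previous[0]
            and marker != previous[1]
        )
        if mixed_prefix or mixed_nesting:
            hits.append((number, line.rstrip("\n")))
        previous = (depth, marker)
    return hits
```

To use it, open each wiki source file and pass the file object (or a list of its lines) to `find_mixed_lists`, then print the reported line numbers. The sketch treats every non-list line as the end of a list, so continuation lines inside a list item would need extra handling.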

TODO:

7. Linguistics

Name double-tagging

Conclusion, in a principled fashion:

  1. hardcoded sem-tags win
  2. There is a sem-tag conversion procedure: according to a hierarchy of sem-tags: Any Plc can be interpreted as Sur, etc. (to be spelled out)

TODO:

North Sámi

TODO:

Actio compounding

It is definitely productive. Whether this is a problem for our speller(s), we don’t know yet, but if there’s a lot of overlap or parallel forms with the same verbal stem producing compounds both with and without Actio, it is most likely a problem, unless we correct to both forms (but then we risk correcting to an impossible form, which is also bad).

Whether Actio is used or not in compounding a verbal stem follows from the semantic properties of the Actio. We need to try to identify this property, and formalise it in one way or another, otherwise we will overgenerate. And overgeneration is a speller’s worst enemy :-)

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

TODO:

  1. finish refactoring for multiple collections in the search interface (Sjur)
    1. investigation done, still not implemented
  2. develop the needed XQueries and interface (Sjur, Tomi)
    1. nothing this week
  3. data synchronisation between risten.no and the cvs repo (Tomi)
    1. discussion started on eXist-list, nothing useful came up. We need to reformulate the question from our perspective, and bring it up again (Sjur)
      1. not yet done

9. Public tender

Nothing received from PL yet. They have an extended deadline today. After that we will decide upon the information we have.

TODO:

10. Other

Summer vacation

From Bitte: “According to last year’s list, Børre at least took out 10 days of this year’s vacation and Tomi took 8 days, i.e. Børre can take 3 weeks with pay and Tomi 3 weeks and 2 days. They can of course borrow from next year’s vacation if they wish.”

She would like to receive a final vacation plan soon.

Who     When
Børre   24.7 - 20.8
Linda   ?
Maaren  on sick leave
Saara   July
Sjur    at least 2 weeks in July, but still open
Thomas  3.7 - 7.8
Trond   3.7 - 14.8 (last two weeks off at summer school)
Tomi    8.7 - 16.7, 2 more weeks in July and/or August

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with [bug 279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279] (Perl locale). Not much help… Saara will contact Roy on this issue.

Gobby

TODO:

SEE 2.5 extensions

Future extensions and wish modes:

TODO:

Task lists as iCal entries

With the latest corrections to the Wiki parsing, and with the tasks at the end of the document, we should be able to use the intermediate XML format to extract the info needed to create iCal entries for all tasks.

iCal entries look like the following:

BEGIN:VTODO
DTSTAMP:20060619T090920Z
ORGANIZER;CN=Børre Gaup:MAILTO:boerre@skolelinux.no
CREATED:20050621T171425Z
UID:libkcal-939001838.216
SEQUENCE:1
LAST-MODIFIED:20050622T050540Z
SUMMARY:Ordne lenker
CLASS:PUBLIC
PRIORITY:5
DUE:20050824T073000Z
COMPLETED:20050622T050540Z
PERCENT-COMPLETE:100
END:VTODO

A reference can be found at Wikipedia

References should be of the type:

/doc/admin/weekly/2006/Tasks_2006-06-19_Sjur.ics
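Entries of the kind shown above could be generated along the following lines once the task data has been extracted. This is a minimal sketch, not the planned implementation: the function names, the `PRODID` value, and the example UID are hypothetical, and the extraction from the intermediate XML format is left out because its structure is not described here. Only a subset of the properties from the sample entry is written.

```python
from datetime import datetime, timezone


def ical_escape(text):
    """Escape characters that have special meaning in iCalendar text values."""
    return (
        text.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def format_utc(moment):
    """Format a timezone-aware datetime as an iCalendar UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def make_vtodo(uid, summary, stamp, due=None, priority=5):
    """Build the content lines of one VTODO component."""
    lines = [
        "BEGIN:VTODO",
        f"DTSTAMP:{format_utc(stamp)}",
        f"UID:{uid}",
        f"SUMMARY:{ical_escape(summary)}",
        "CLASS:PUBLIC",
        f"PRIORITY:{priority}",
    ]
    if due is not None:
        lines.append(f"DUE:{format_utc(due)}")
    lines.append("END:VTODO")
    return lines


def make_calendar(todos):
    """Wrap VTODO components in a VCALENDAR object with CRLF line endings."""
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//task-export//EN",  # hypothetical product id
    ]
    for todo in todos:
        lines.extend(todo)
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"
```

A task would then be written with something like `make_calendar([make_vtodo("task-1@example.invalid", "Ordne lenker", stamp)])`, where `stamp` is a timezone-aware `datetime`, and the result saved to an `.ics` file per person and week.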

TODO:

11. Next meeting, closing

26.06.2006 09:30

Sjur might be away; he will let us know later.

Closed at 11:20.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond