Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:37.

Present: Børre, Saara, Sjur, Thomas, Trond, Tomi

Absent: Maaren. Saara leaves at 10:30.

Agenda accepted as is, some additions under Other.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Collecting

See a previous meeting memo for what’s to be done.

TODO:

New contracts:

Olavi Korhonen’s Lule Sámi dictionary.

TODO:

KIO Grafisk and the Iđut books

TODO:

Bible texts

We will get text from Finland, but still haven’t received any. Swedish html has arrived, no paratext. Norsk bibelselskap has not sent corrected New Testament versions for sme, and not paratext for nno/nob.

TODO:

Davvi Girji

Talked to Harald Gaski; he will send the contracts to the writers’ organisation to make sure the contract is OK.

TODO:

Min Áigi

TODO:

Kåfjord

Promised to send us texts. Some texts have arrived, but nothing from Ája.

TODO:

Sámi Instituhtta

TODO:

Čálliid Lágádus

Børre talked to them; they are positive and will give us print-ready PDFs. [http://www.calliidlagadus.org/]

Árran

Negotiations are underway. They will discuss the matter in a meeting today, and we have an appointment for later today or tomorrow (covering both their internal texts and the discussions on Báhko). All further negotiations will go through Bård Eriksen / Báhko.

TODO:

5. Corpus infrastructure

User accounts and access

TODO:

More texts to the graphical corpus interface:

We need to get the infrastructure complete to be able to do this, then it should be a piece of cake.

Main obstacle for further progress: a tool to analyse text while keeping the xml structure. Here’s some very simple pseudo-code:

  1. parse xml
  2. send each <p> node to lookup etc
  3. analyse it
  4. add </s> after each <.>, <!>, <?> mark
  5. wrap up the analysis output in xml
  6. put it in a <p> node again

The first implementation will be in Perl (Saara); a C implementation (Tomi) may follow later for speed reasons.
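The steps above can be sketched roughly as follows. This is only an illustration in Python, not the planned Perl or C implementation. The analyse function is a hypothetical stand-in for the real analyser (lookup etc.), and the <s> and <w> element names and the analysis attribute are assumptions, not part of the agreed corpus DTD.

```python
import re
import xml.etree.ElementTree as ET

# Split after a sentence-final ".", "!" or "?" that is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def analyse(sentence):
    """Hypothetical stand-in for the real analyser (lookup etc.).

    Returns (token, analysis) pairs with a dummy analysis string.
    """
    return [(token, token + "+?") for token in sentence.split()]


def analyse_document(xml_text, analyser=analyse):
    """Analyse the text of every <p> while keeping the XML around it."""
    root = ET.fromstring(xml_text)                        # 1. parse xml
    for p in list(root.iter("p")):                        # 2. visit each <p>
        text = "".join(p.itertext()).strip()
        attrs, tail = dict(p.attrib), p.tail
        p.clear()                  # drops children, text, tail and attributes
        p.attrib.update(attrs)     # restore what clear() removed
        p.tail = tail
        for sentence in SENTENCE_END.split(text):
            if not sentence:
                continue
            s = ET.SubElement(p, "s")                     # 4. sentence boundary
            for token, analysis in analyser(sentence):    # 3. analyse
                w = ET.SubElement(s, "w", {"analysis": analysis})  # 5. wrap
                w.text = token                            # 6. back inside <p>
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = '<document><body><p id="p1">Okta. Guokte!</p></body></document>'
    print(analyse_document(sample))
```

Everything outside the <p> nodes is left untouched, so document metadata and other markup survive the analysis step. Inline markup inside a <p> is flattened to plain text in this sketch; a real tool would have to decide how to treat it.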

TODO:

  1. make an analyser that retains the xml structure (Saara)
  2. Finish the tag unification (korpustags.txt) (Trond)
    1. progressing, but open questions remain (awaiting text to see how things crash)
  3. add text to the server (Lars)

Aligner

Trond and Saara will continue working on this issue.

We need markup of parallelism in the corpus DTD, at least an indication of which documents belong together. Discussion to continue in the newsgroup (Saara has started it - please respond!).

Language recognition

TODO:

Free and non-free texts

TODO:

Corpus summary

TODO:

Proofed vs unproofed corpus files

TODO:

6. Infrastructure

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

TODO:

Hyphenator

Thomas has finished adding ^ tags to the sme noun file.

Trond and Thomas have been working on the smj rule component, and have improved the treatment of both weak grade consonant clusters (preconsonantal geminates) and some loan word patterns.

TODO:

Automatic Bugzilla reminder for untouched bugs

We need an e-mail summary of all bugs that have not been touched for more than 5(?) weeks. Do we want it? Yes. Is it possible? Børre will look around; if he finds nothing, he’ll ask Thor-Øivind.
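The selection logic for such a reminder could look roughly like the sketch below. It assumes the bug data has already been exported into plain records; the field names (id, summary, last_changed) and the sample bugs are invented for illustration and do not reflect how Bugzilla stores or exposes its data.

```python
from datetime import datetime, timedelta


def stale_bugs(bugs, now, max_age_weeks=5):
    """Return the bugs whose last change is more than max_age_weeks old."""
    limit = now - timedelta(weeks=max_age_weeks)
    return [bug for bug in bugs if bug["last_changed"] < limit]


def format_summary(bugs):
    """Build the plain-text body of a reminder e-mail, oldest bug first."""
    lines = ["Bugs not touched recently:"]
    for bug in sorted(bugs, key=lambda b: b["last_changed"]):
        lines.append("  bug %d, last changed %s: %s"
                     % (bug["id"], bug["last_changed"].date(), bug["summary"]))
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented sample records, not real bugs from our Bugzilla.
    bugs = [
        {"id": 1, "summary": "example bug A",
         "last_changed": datetime(2006, 4, 1)},
        {"id": 2, "summary": "example bug B",
         "last_changed": datetime(2006, 5, 20)},
    ]
    print(format_summary(stale_bugs(bugs, now=datetime(2006, 6, 1))))
```

The age limit is a parameter because the number of weeks was left open at the meeting.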

TODO:

7. Linguistics

North Sámi

Topic: Actio+compound - how productive is it? How much does it hurt speller performance when such compounds are close to real spelling errors that have a high edit distance to the correct form?

Spelling error from typos.txt:
4       vuolggahansádji             vuolggasadji

vuolggahan
vuolggahan      vuolgga+N+Sg+Nom+Foc
vuolggahan      vuolggahit+V+TV+PrfPrc
vuolggahan      vuolggahit+V+TV+Ind+Prs+Sg1
vuolggahan      vuolggahit+V+TV+Actio+Acc
vuolggahan      vuolggahit+V+TV+Actio+Gen
vuolggahan      vuolggahit+V+TV+Actio+Nom

vuolggahansadji
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
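To make the distance argument concrete, here is a small sketch that computes the plain Levenshtein distance for the forms above. It assumes unit costs for insertion, deletion and substitution; the metric actually used by a speller may be weighted differently.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


if __name__ == "__main__":
    typo = "vuolggahansádji"
    # Distance from the typo to the correct form.
    print(edit_distance(typo, "vuolggasadji"))      # prints 4
    # Distance from the typo to the accepted Actio compound.
    print(edit_distance(typo, "vuolggahansadji"))   # prints 1
```

With these costs the accepted Actio compound is only one edit away from the typo, while the correct form is four edits away.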

TODO:

Lule Sámi

There are some open issues in the marginal area of the smj transducer:

TODO:

8. Name lexicon infrastructure

TODO:

  1. finish refactoring for multiple collections in the search interface (Sjur)
    1. progressing, still major work to be done
  2. develop the needed XQueries and interface (Sjur, Tomi)
    1. progressing; some are done and committed (adding new term, create-termc-entry.xq)
  3. data synchronisation between risten.no and the cvs repo (Tomi)
    1. discussion started on eXist-list, nothing useful came up. We need to reformulate the question from our perspective, and bring it up again (Sjur)

Timeline:

9. Public tender

TODO:

10. Other

Summer vacation

Who     When
Børre   August
Linda   ?
Maaren  ?
Saara   July
Sjur    at least some in July, but still open
Thomas  3.7 - 7.8
Trond   3.7 - 14.8 (last two weeks off at summer school)
Tomi    8.7 - 16.7, 2 more weeks in July and/or August

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with bug 279 (http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279), a Perl locale problem. Not much help… Saara will contact Roy on this issue.

After the corpus issues have been somewhat settled, we should do a bug barn raising, and then another one after the name lexicon is fixed.

Forrest: Unicode & JSPWiki

Sjur has found a way to make forrest run read the UTF-8 encoded JSPWiki files correctly:

forrest run -Dforrest.jvmargs="-Dfile.encoding=utf-8"

It is added to our Forrest documentation.

RAM on our G5

We received some defective RAM. Our business contact has received a whole batch of defective RAM and is investigating. Once that is resolved we will receive “good” RAM.

Gobby

Version 0.3 works fine on Mac, Linux and Windows. It should be installed on all computers, cf. [http://darcs.0x539.de/trac/obby/cgi-bin/trac.cgi/wiki/InstallationGuide] (our preinstalled Xcode version is 2.0; it must be 2.1).

Trond has asked Lars Nygård and Tero Avellan to install Gobby as well. Per Langgård is already using it.

SEE 2.5 extensions

TODO:

11. Next meeting, closing

12.06.2006 09:30

Closed at 11:08

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond