Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:37.

Present: Børre, Saara, Sjur, Thomas, Trond, Tomi

Absent: Maaren

Main secretary: the whole concept is dropped for now, since working collaboratively makes it obsolete.

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Collecting

See a previous meeting memo for what’s to be done.

TODO:

New contracts:

Olavi Korhonen’s Lule Sámi dictionary.

Phoned Korhonen. He was willing to sign the contracts, and wanted some kind of access to our corpus. He wants to collect words for his Northern Sámi dictionary.

TODO:

KIO Grafisk and the Iđut books

TODO:

Bible texts

We will get text from Finland, but still haven’t received any. The Swedish html has arrived, but no paratext. Norsk bibelselskap has sent neither the corrected New Testament versions for sme nor the paratext for nno/nob.

TODO:

Davvi Girji

Called her last week. She said Davvi Girji OS would give us permission to use texts by their authors. She wasn’t sure if we could get the texts directly from Davvi Girji, because of the copyrights on pictures and other artwork in the books. She said they (the Davvi Girji staff) were going to have a meeting right after our phone conversation, and would take some kind of decision on this case. Haven’t heard anything from Kåven, though.

TODO:

Min Áigi

The Min Áigi format needs handling: typographic tags such as @ingress must be processed for the .txt files, but it is business as usual for the .doc files. Saara has written the xsl conversion routine for the typographic tags. It still needs some fine tuning, as there are some @ tags that were not included in our initial list.
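
A minimal sketch of the tag handling described above, written in Python rather than the xsl routine Saara has made. It assumes (this is not confirmed by the memo) that a tagged paragraph in the .txt files starts with "@tagname:" followed by the text. The element names, the "tittel" tag and the function names are hypothetical. Tags missing from the mapping are collected, so that the initial list can be extended.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mapping from typographic tags to output element names.
# Only "ingress" is mentioned in the minutes; "tittel" is an invented example.
TAG_TO_ELEMENT = {
    "ingress": "ingress",
    "tittel": "title",
}

# Assumed line format: "@tagname:paragraph text"
TAG_LINE = re.compile(r"^@(?P<tag>[^:\s]+):(?P<text>.*)$")


def convert_line(line, unknown_tags):
    """Convert one paragraph line to an XML element string."""
    match = TAG_LINE.match(line)
    if match is None:
        # Untagged text becomes a plain paragraph.
        return "<p>{}</p>".format(escape(line))
    tag = match.group("tag")
    text = match.group("text").strip()
    element = TAG_TO_ELEMENT.get(tag)
    if element is None:
        # Remember tags that are not in the list yet, and fall back to <p>.
        unknown_tags.add(tag)
        element = "p"
    return "<{0}>{1}</{0}>".format(element, escape(text))


def convert(lines):
    """Convert all non-empty lines; return (elements, unknown tag names)."""
    unknown = set()
    elements = [
        convert_line(line.rstrip("\n"), unknown)
        for line in lines
        if line.strip()
    ]
    return elements, unknown
```

Printing the second return value after a run over the whole Min Áigi directory would give the list of @ tags that still need a decision.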

TODO:

Kåfjord

Promised to send us texts. Some texts have arrived, but nothing from Ája.

TODO:

Sámi Instituhtta

Børre contacted Richard Valkeapää, the IT consultant at NSI. He put it on his todo list, as he would have to contact the person who has worked with the newspaper texts anyway. He said this would be done in the near future (within a month).

TODO:

5. Corpus infrastructure

User accounts and access

We will probably have different kinds of users: some will only need access to the corpus through the web interface, while others might want and need command line access to utilize the corpus efficiently. This calls for an access policy for these users.

TODO:

Name change again?

TODO:

Free and non-free texts

More info in a [previous meeting memo|/admin/weekly/2006/Meeting_2006-03-13.html].

TODO:

More texts to the graphical corpus interface:

We need to get the infrastructure complete to be able to do this, then it should be a piece of cake.

TODO:

Top-three priorities:

  1. Finish the tag unification (korpustags.txt) (Trond)
  2. change ccat to be able to create the right input for the corpus analysis (xml-tagged output) (Tomi). Work estimate: a few (2-3?) days - nothing this week, we will re-evaluate (and schedule) next Monday
  3. add text to the server (Lars)

Language recognition

TODO:

Some Lule Sámi text can be found on Infonuorra’s site.

Corpus summary

TODO:

Proofed vs unproofed corpus files

The Min Áigi material contains partially parallel unproofed vs proofed documents. We need to find a decent way of handling this parallelism, and preferably a way to make use of the information contained in the diff between the two versions. At best, we should be able to automatically extract the corrections made, generating an XML document that contains correction markup as discussed in the news thread “Corpus DTD: corrections” (from 17.03.2006). I don’t know how much work this is, or whether it is at all possible, but let’s discuss it.

For now there are so few such files that it hardly pays off, but Saara will make a list of these files to evaluate the potential benefit.
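
To help judge feasibility, here is a rough Python sketch of the diff-based extraction, using the standard difflib module on a word-by-word comparison. The element name "correction" and the attribute "orig" are placeholders only; the actual markup is the one discussed in the news thread and is not reproduced here. The function name is hypothetical as well.

```python
import difflib
from xml.sax.saxutils import escape, quoteattr


def mark_corrections(unproofed, proofed):
    """Return the proofed text with word-level corrections wrapped in markup.

    The <correction orig="..."> element is a placeholder, not the DTD's markup.
    """
    old_words = unproofed.split()
    new_words = proofed.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pieces = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        old = " ".join(old_words[i1:i2])
        new = " ".join(new_words[j1:j2])
        if op == "equal":
            # Unchanged stretch: copy it through.
            pieces.append(escape(new))
        else:
            # "replace", "insert" or "delete": keep the old form as an attribute.
            pieces.append(
                "<correction orig={}>{}</correction>".format(
                    quoteattr(old), escape(new)
                )
            )
    return " ".join(pieces)
```

A word-level diff like this ignores line breaks and cannot tell a proofreader's correction from an editorial rewrite, so the output would still need manual checking on the files Saara lists.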

TODO:

Aligner

Trond and Saara will continue this issue.

6. Infrastructure

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

TODO:

Hyphenator

Thomas has finished adding ^ tags to the smj noun file, and has continued working on the sme noun file.

Trond and Thomas have been working on the smj rule component, and have improved the treatment of both weak grade consonant clusters (preconsonantal geminates) and some loan word patterns.

TODO:

7. Linguistics

General issues

Rethink the double-tagging procedure for names, and consider grammatically motivated semtag conversion routines (“Helsinki” from Plc to Obj to Org, or the Lynndie England issue, Aftenposten Obj Org WoA Pub).

Possible rule:

If Plc then Obj
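
One way such a conversion rule could be expressed is as a small table of implications that is applied until nothing new is added. This is only a sketch of the idea, not an agreed design: the Plc to Obj rule is the one from the minutes, while the function name and the table layout are assumptions.

```python
# Each rule reads: "if a name carries the key tag, it may also carry the value tags".
# Only the Plc -> Obj rule comes from the minutes; further rules are up for discussion.
IMPLIED_TAGS = {
    "Plc": ["Obj"],
}


def expand_semtags(tags, rules=IMPLIED_TAGS):
    """Return the given tags plus everything the rules imply, in a stable order."""
    result = list(tags)
    queue = list(tags)
    while queue:
        tag = queue.pop(0)
        for implied in rules.get(tag, []):
            if implied not in result:
                result.append(implied)
                # Newly added tags may trigger further rules.
                queue.append(implied)
    return result
```

With a second, hypothetical rule from Obj to Org in the table, a name tagged only Plc would come out as Plc, Obj, Org, which matches the “Helsinki” chain mentioned above. The lexicon would then store just the basic tag.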

North Sámi

TODO:

Lule Sámi

There are some open issues in the marginal area of the smj transducer:

TODO:

8. Name lexicon infrastructure

TODO:

  1. finish refactoring for multiple collections in the search interface (Sjur)
    1. nothing done last week
  2. develop the needed XQueries and interface (Sjur, Tomi)
    1. progressing, done some, haven’t committed (adding new term, create-termc-entry.xq)
  3. data synchronisation between risten.no and the cvs repo (Tomi)
    1. discussion started on eXist-list, we’ll wait a couple of days to see what’s coming out of it, and if nothing useful to us, we’ll add our use case with questions
  4. test and review when ready

Timeline:

9. Spellers

We will remove this speller section till we have something to report.

10. Public tender

Finnut called, and here’s their evaluation: if we think that the offers are incomplete or otherwise not fully acceptable, we can enter negotiations with the companies, effectively cancelling the current public tender. This can only be done as long as we don’t change the public tender document (that is, the foundation for the public tender) - if it is changed, we have to announce the whole competition again, with the usual 53 days minimum deadline for applications.

Sjur will send an e-mail to the project board, outlining the different aspects of the two offers, and ask about their opinion on the following questions:

TODO:

11. Other

Summer vacation

Who     When
Børre   August
Linda   ?
Maaren  ?
Saara   July
Sjur    at least some in July, but still open
Thomas  3.7 - 7.8
Trond   3.7 - 14.8 (last two weeks off at summer school)
Tomi    8.7 - 16.7, more?

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with [bug 279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279] (Perl locale). Not much help… Saara will contact Roy on this issue.

After the corpus issues have been somewhat settled, we should do a bug barn raising, and then a new one after the name lexicon is fixed.

Gobby

0.3 is working fine on Mac, Linux and Windows. It should be installed on all computers, cf. [http://darcs.0x539.de/trac/obby/cgi-bin/trac.cgi/wiki/InstallationGuide] (our preinstalled Xcode version is 2.0, but it must be 2.1).

Easy way out when the standard Darwin Ports installation isn’t working: just get a copy of /opt/local/ from Børre.

Trond should ask Lars Nygård and Tero Avellan to install Gobby as well; has asked Per Langgård.

SEE 2.5 extensions

TODO:

12. Next meeting, closing

06.06.2006 09:30

Closed at 11:20

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond