Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:59.

Present: Sjur, Thomas, Trond, Tomi

Absent: Børre, Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Collecting

See a previous meeting memo for what’s to be done.

TODO:

New contracts:

Olavi Korhonen’s Lule Sámi dictionary.

TODO:

KIO Grafisk and the Iđut books

TODO:

Bible texts

The Swedish Bible Society is reluctant to give us the paratext version, as it is linguistically inferior. Instead they suggested an MS Word version of the text.

Børre and Trond suggest that we give the MS Word version a second try, at least to get hard facts about the problems with that format when discussing with the Swedes.

We have asked for any language version in paratext format from the Norwegian Bible Society, to use as a reference when making the smj paratext version.

We are still waiting for the Norwegian versions, but Finnish and Swedish are within reach. Finnish is available in a newsgroup (text-only) format, and we have received an MS Word file of the Swedish version. Børre has had some problems converting the Word file, though.

TODO:

Davvi Girji

Talked to Harald Gaski; he will send the contracts to the writers’ organisation to make sure the contract is OK.

TODO:

Min Áigi

TODO:

Kåfjord

Promised to send us texts. Some texts have arrived, but nothing from Ája.

TODO:

Sámi Instituhtta

TODO:

Čálliid Lágádus

Børre talked to them. They are positive and will give us print-ready PDFs (http://www.calliidlagadus.org/).

Árran

The negotiations are under way. They will discuss it in a meeting today, and we have an appointment for later today or tomorrow (covering both their internal texts and discussions on Báhko). All further negotiations will go through Bård Eriksen / Báhko.

TODO:

5. Corpus infrastructure

General

Errors in the Antiword conversions were found when parsing the XML corpus. The main problem is skipping headers and footers: at present they are included as part of the text and are intermingled with the regular text, e.g. in the middle of a sentence.

TODO (all of these in priority order, the third option is really a last resort):

  1. fine-tune the initial conversion in antiword (Børre or Saara)
  2. make file-specific fixes in the file-specific xsl file, by having our local xsl expert read our corpus and spot obvious omissions in the conversion (Saara)
  3. manually fix the resulting files before sending them off to analysis (??; this option should be postponed until just before the final, official release, and evaluated at that point)

User accounts and access

TODO:

More texts to the graphical corpus interface:

We need to get the infrastructure complete to be able to do this, then it should be a piece of cake.

Saara has made corpus-analyze.pl, a script to analyse text while keeping the xml structure. It is working, but the output format needs some tweaking.

TODO:

  1. refine xml-tagged output (Saara and Tomi)
  2. add text to the server (Lars)

Aligner

Trond and Saara will continue this issue.

We need markup of parallelism in the corpus DTD, at least an indication of which documents belong together. Discussion to continue in the newsgroup (Saara has started it - please respond!).

Language recognition

Still waiting for more smj text to improve it.

Free and non-free texts

Anything? Final check with Børre and Saara - waiting for them to return.

Corpus summary

TODO:

Proofed vs unproofed corpus files

TODO:

6. Infrastructure

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

We have now received the original PHP code from Tero/Per. It seems quite easy to adapt, although it will of course require work on our part.

TODO:

Hyphenator

Thomas has finished adding ^ tags to the sme noun file.

Trond and Thomas have been working on the smj rule component, and have improved the treatment of both weak grade consonant clusters (preconsonantal geminates) and some loan word patterns.

TODO:

Automatic Bugzilla reminder for untouched bugs

We need an e-mail summary of all bugs not touched in more than 5(?) weeks. Do we want it? Yes. Is it possible? Børre will look around; if he finds nothing, he’ll ask Thor-Øivind.
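
As a rough illustration of the filtering such a reminder needs, here is a minimal Python sketch. It does not talk to Bugzilla at all: the `Bug` record, its field names and the two helper functions are hypothetical, and the 5-week threshold is just the tentative figure from the discussion above. How the bug list would be fetched, and how the summary would be mailed, is left open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Bug:
    """Hypothetical bug record; not Bugzilla's own data model."""
    bug_id: int
    summary: str
    last_changed: datetime


def stale_bugs(bugs, now, max_age_weeks=5):
    """Return bugs not touched for more than max_age_weeks, oldest first."""
    cutoff = now - timedelta(weeks=max_age_weeks)
    return sorted(
        (b for b in bugs if b.last_changed < cutoff),
        key=lambda b: b.last_changed,
    )


def format_reminder(bugs, now, max_age_weeks=5):
    """Build the plain-text body of a reminder e-mail."""
    stale = stale_bugs(bugs, now, max_age_weeks)
    lines = [f"{len(stale)} bug(s) untouched for more than {max_age_weeks} weeks:"]
    for b in stale:
        lines.append(f"  #{b.bug_id} (last changed {b.last_changed:%Y-%m-%d}): {b.summary}")
    return "\n".join(lines)
```

The threshold is a parameter so that the "5(?)" can be settled later without touching the logic.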

TODO:

7. Linguistics

Name double-tagging

Conclusion, in a principled fashion:

  1. hardcoded sem-tags win
  2. there is a sem-tag conversion procedure that follows a hierarchy of sem-tags: any Plc can be interpreted as Sur, etc. (to be spelled out)
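
The two principles can be sketched in Python as follows. This is only an illustration under stated assumptions: the `HIERARCHY` table holds just the one conversion mentioned above (Plc to Sur), since the rest is still to be spelled out, and the function names and the shape of the data are hypothetical rather than taken from the actual lexicon code.

```python
# Assumed conversion table: each tag maps to the tags it may be reinterpreted as.
# Only Plc -> Sur comes from the meeting; the full hierarchy is not yet defined.
HIERARCHY = {"Plc": ["Sur"]}


def reachable(tag, hierarchy):
    """Return tag plus every tag it can be converted to, directly or indirectly."""
    seen = [tag]
    stack = [tag]
    while stack:
        current = stack.pop()
        for target in hierarchy.get(current, ()):
            if target not in seen:
                seen.append(target)
                stack.append(target)
    return seen


def sem_readings(lexicon_tag, hardcoded=None, hierarchy=HIERARCHY):
    """Principle 1: hardcoded sem-tags win. Principle 2: otherwise convert via the hierarchy."""
    if hardcoded:
        return list(hardcoded)
    return reachable(lexicon_tag, hierarchy)
```

With this table, `sem_readings("Plc")` gives both Plc and Sur, while `sem_readings("Plc", hardcoded=["Plc"])` stays with Plc only.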

TODO:

North Sámi

Topic: Actio+compound - how productive is it? How much does it harm speller performance by being close to real spelling errors that have a high editing distance to the correct form?

Spelling error from typos.txt:
4       vuolggahansádji             vuolggasadji

vuolggahan
vuolggahan      vuolgga+N+Sg+Nom+Foc
vuolggahan      vuolggahit+V+TV+PrfPrc
vuolggahan      vuolggahit+V+TV+Ind+Prs+Sg1
vuolggahan      vuolggahit+V+TV+Actio+Acc
vuolggahan      vuolggahit+V+TV+Actio+Gen
vuolggahan      vuolggahit+V+TV+Actio+Nom

vuolggahansadji
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolgga+N+SgNomCmp#ho+N+SgNomCmp#atnu+N+SgNomCmp#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
vuolggahansadji vuolggahit+V+TV+Actio#sadji+N+Sg+Nom
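
To make the distances in this example concrete, here is a small Python sketch of plain Levenshtein distance (insertions, deletions and substitutions, each costing 1). That the speller uses exactly this measure is an assumption, and the function name is ours. Under this measure the misspelling vuolggahansádji is 1 edit away from the Actio compound vuolggahansadji but 4 edits away from the intended vuolggasadji; the leading 4 in the typos.txt line is consistent with that, although the memo does not say what the column means.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance between two strings, compared in NFC form."""
    # Normalise so that "á" counts as one character however it was typed.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


typo = "vuolggahansádji"
print(edit_distance(typo, "vuolggahansadji"))  # distance to the Actio compound
print(edit_distance(typo, "vuolggasadji"))     # distance to the intended form
```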

TODO:

Lule Sámi

There are some open issues in the marginal area of the smj transducer:

TODO:

8. Name lexicon infrastructure

TODO:

  1. finish refactoring for multiple collections in the search interface (Sjur)
    1. progressing, investigating options
  2. develop the needed XQueries and interface (Sjur, Tomi)
    1. progressing, some done and committed (adding new term, create-termc-entry.xq)
  3. data synchronisation between risten.no and the cvs repo (Tomi)
    1. discussion started on eXist-list, nothing useful came up. We need to reformulate the question from our perspective, and bring it up again (Sjur)

9. Public tender

We have received answers from LS, and are now waiting for the PL answers. Their deadline is the coming Thursday.

TODO:

10. Other

Summer vacation

Who     When
Børre   August
Linda   ?
Maaren  ?
Saara   July
Sjur    at least 2 weeks in July, but still open
Thomas  3.7 - 7.8
Trond   3.7 - 14.8 (last two weeks off at summer school)
Tomi    8.7 - 16.7, 2 more weeks in July and/or August

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with bug 279 (http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279) (Perl locale). Not much help… Saara will contact Roy on this issue.

After the corpus issues have been somewhat settled, we should do a bug barnraising. … and then a new one after the name lexicon is fixed.

Gobby

Installed on most computers (only Saara’s is missing); now we need to test it in a multiuser scenario.

Trond has asked Lars Nygård and Tero Avellan to install Gobby as well. Per Langgård is already using it.

TODO:

SEE 2.5 extensions

TODO:

11. Next meeting, closing

19.06.2006 09:30

Sjur is away on Friday June 16.

Closed at 11:21.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond