Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:43.

Present: Børre, Saara, Sjur, Thomas, Tomi, Trond

Absent: Maaren (sick leave)

Main secretary: Børre

Agenda accepted with additions under “Other”.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Collecting

See a previous meeting memo for what’s to be done.

TODO: Send out the rest of the letters (Børre)

Odin

Sæth replied by e-mail: there has been no time to follow up yet, but they will try to include us in their plans.

Olavi Korhonen’s Lule Sámi dictionary.

KIO Grafisk and the Iđut books

TODO:

Bible texts

We will get text from Finland, but still haven’t received any. We have received the Swedish text from Sweden. As for the last HTML versions from Norway, Trond did not contact them last week.

TODO:

Min Áigi

A meeting was held last week; nothing heard from it yet.

Kåfjord

Contacted us last week, they would like to give us texts. Excellent initiative (we didn’t contact them)! They were told to use the web interface, and contact us again if there are problems.

5. Corpus infrastructure

TODO:

Changes and updates because of the Divvun public tender

User account admin and infra: see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO: see above under Documentation.

Automatic build of the content of our corpus repo: also see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO:

Free and non-free texts

More info in a [previous meeting memo|/admin/weekly/2006/Meeting_2006-03-13.html].

TODO:

Linking parallel files

How do we know that two (or more) files are parallel language versions of each other? Suggestions:

One option:

samefilename.sme.doc.xml
samefilename.nob.doc.xml

nno/facta/samefilename.nno.html.xml
sme/facta/samefilename.sme.html.xml <== parallel file

sme/facta/somefilename.html.xml <== file in one lg only

The other option: store the links to the parallel files in the meta info/header. Then we can keep the original filename.

Should we allow for more than one file at a time when uploading? Use cases: parallel texts, chapters in a book, many documents from the same author. Saara will think about it, and discuss/propose something in the newsgroup.

DECISION: We’ll keep the original filename, and store linking info in the header (has to be added manually). Saara will develop the web interface for uploading to make it easier to add several documents in one go (as a serial process).
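
A minimal sketch, in Python, of how linking info stored in a header could be read back. The `<parallel>` element and its `lang` and `href` attributes are hypothetical names chosen for illustration; the actual header format is not specified in this memo.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with parallel links stored in the header.
SAMPLE = """<document lang="sme">
  <header>
    <title>Example</title>
    <parallel lang="nob" href="nob/facta/originalname.html.xml"/>
    <parallel lang="nno" href="nno/facta/annet-navn.html.xml"/>
  </header>
  <body/>
</document>"""


def parallel_links(xml_text):
    """Map language code -> path of the parallel version, as listed in the header."""
    root = ET.fromstring(xml_text)
    return {p.get("lang"): p.get("href") for p in root.findall("./header/parallel")}


if __name__ == "__main__":
    for lang, path in parallel_links(SAMPLE).items():
        print(lang, path)
```

Because the link lives in the header, each language version can keep its own original filename, at the cost of adding the links manually.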

More texts to the graphical corpus interface:

TODO:

Top-two priorities:

  1. Linda and Trond to go through the taglist
  2. Saara and Trond to contact Anders in Oslo

Text upload

TODO:

Language recognition

As a work-around until Finnish recognition is reliable, treat all “Finnish” sections as North Sámi (we don’t yet have any Finnish texts(?)). We need to be able to recognise the other languages, to remove noise, to identify parallel texts in the same document, etc.
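
The work-around could be sketched as a small remapping step applied to whatever the recogniser returns. The function name and the table are illustrative assumptions; `fin` and `sme` are the ISO 639 codes for Finnish and North Sámi.

```python
# Temporary remapping of recogniser output (assumed to be ISO 639 codes).
# Remove the "fin" entry once Finnish recognition is reliable.
FALLBACKS = {"fin": "sme"}


def effective_language(detected):
    """Return the language code to use for a section, applying the work-around."""
    return FALLBACKS.get(detected, detected)
```

Keeping the remapping in one table makes the work-around easy to remove later.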

TODO:

6. Infrastructure

Aligner

Today, we have two anchor files in addition to the original one.

TODO:

Hyphenator

TODO:

7. Linguistics

General - hyphenation

We need to add word boundaries in our lexicons. All compounds need explicit word boundary markers, which will go to ` 0 ` in the regular transducer. It will go to ` - ` or # in a specially made hyphenation transducer, which will include the rules made by Trond (see: gt/sme/src/hyph-sme.txt).

It is not clear how this will be done, but Sjur has ideas.

Problematic word boundaries:

CVCV#CVCV  OK CV-CV-CV-CV  need no fix
CVCVC#CVCV OK CV-CVC-CV-CV need no fix
CVCVC#VCV  !! *CV-CV-C#V-CV -> CV-CVC#V-CV <= manually fix only these
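
A rough Python sketch of how the problematic pattern could be found automatically: a boundary is flagged when the first part ends in a consonant and the second part starts with a vowel. The vowel inventory below is an assumption, not taken from our rule files.

```python
VOWELS = set("aáeiouyæøåäö")  # assumed vowel inventory, for illustration only


def cv_skeleton(word):
    """Rewrite a word as a C/V pattern, keeping the '#' boundary marker."""
    return "".join(
        "#" if ch == "#" else ("V" if ch.lower() in VOWELS else "C") for ch in word
    )


def needs_manual_fix(word):
    """True if any '#' boundary has a consonant to its left and a vowel to its right."""
    parts = word.split("#")
    return any(
        left and right
        and left[-1].lower() not in VOWELS
        and right[0].lower() in VOWELS
        for left, right in zip(parts, parts[1:])
    )
```

Run over the compound lexicon, a check like this would list only the C#V compounds that need attention.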

Exceptions:

geografiija     ge-og-ra-fii-ja
Voionmaa        Voion-maa  => Voi-on-maa (oi no diphth)
 tak-si-eaig-gi :-)

These need to be marked in the lexicons in each case, probably something like:

geo^grafiija
Voi^on#maa      Voi^on#maa
geo^grafiija    geo^gra-fii-ja
torne^träsk     tor-ne^träsk
  -- or --
torne#träsk     tor-ne#träsk

We need to introduce one new symbol: ^:0
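
As an illustration of what the `^:0` and `#:0` pairs mean for the regular transducer, the following Python sketch deletes both markers from a lexicon-side form. It only mimics the effect on a plain string and is not a transducer.

```python
# Both markers are realised as zero (deleted) on the surface side.
REGULAR = str.maketrans({"#": None, "^": None})


def surface_form(lexical):
    """Strip word boundary (#) and hyphenation (^) markers from a lexicon-side form."""
    return lexical.translate(REGULAR)
```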

Goal: Analyse divvunáhkus as div-vun#áh-kus and not as div-vu-náh-kus. That is, preserve the word boundary from the lexicon in the output, but keep the output as a word form, not as stem/baseform+analysis. The following is a suggestion for how to process the input (bottom) to arrive at the wanted output. Read it from the bottom upwards.

output level
twh upper divvun#áhkus <= first analysis  how do you get this transducer?
twh lower divvunáhkus  <= text input

hyp upper div-vun#áh-kus => hyph-sme.fst    ok, we know this one
twh lower divvun#áhkus

twl lower divvun#áh0ku0s  => twolhash-sme.bin      #:# ^:^, not #:0 ^:0
twl upper divvun#áhkkuX4s                       \
                                                   smehash.fst
lex lower divvun#áhkkuX4s                       /
lex upper divvun#áhkku+N+Sg+Loc => sme.save

----- mirror: below, regular order, above mirrored order

lex upper divvun#áhkku+N+Sg+Loc => sme.save
lex lower divvun#áhkkuX4s                      \
                                                   sme.fst
twl upper divvun#áhkkuX4s  =>   twol-sme.bin    /
twl lower divvun0áh0ku0s
input level
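
The goal can be illustrated with a toy Python sketch. The dictionary stands in for the first analysis step (surface form to form with `#`), and the hyphenation rule is a deliberately naive stand-in (break before the consonant that precedes a vowel), not the rules in hyph-sme.txt. All names and the vowel inventory are assumptions made for the example.

```python
VOWELS = set("aáeiouyæøåäö")  # assumed vowel inventory

# Stand-in for the first analysis: surface form -> form with word boundary.
BOUNDARIES = {"divvunáhkus": "divvun#áhkus"}


def hyphenate_part(part):
    """Naive rule: put '-' before a consonant that directly precedes a vowel,
    provided an earlier vowel has already been seen in this part."""
    breaks = set()
    seen_vowel = False
    for i, ch in enumerate(part):
        is_vowel = ch.lower() in VOWELS
        if is_vowel and seen_vowel and part[i - 1].lower() not in VOWELS:
            breaks.add(i - 1)
        seen_vowel = seen_vowel or is_vowel
    return "".join("-" + ch if i in breaks else ch for i, ch in enumerate(part))


def hyphenate(word):
    """Hyphenate each part separately so that '#' boundaries are preserved."""
    return "#".join(hyphenate_part(p) for p in word.split("#"))


def hyphenate_with_lexicon(surface):
    """Recover the word boundary first, then hyphenate."""
    return hyphenate(BOUNDARIES.get(surface, surface))


if __name__ == "__main__":
    print(hyphenate("divvunáhkus"))               # without boundary information
    print(hyphenate_with_lexicon("divvunáhkus"))  # with boundary from the lexicon
```

Without the boundary the toy rule produces the unwanted div-vu-náh-kus; with it, the parts are hyphenated separately and the result is div-vun#áh-kus.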

TODO:

OPEN questions:

DECISION:

North Sámi

Semantic feature system

Further discussion and details in the [previous meeting memos|/admin/weekly/2006/Meeting_2006-03-20.html].

Lule Sámi

TODO:

8. Name lexicon infrastructure

TODO:

  1. refactor and prepare risten.no for multiple collections:
    1. develop the Cocoon sitemap to delegate requests to the proper folder level, such that the most specific code is always used (Sjur)
      1. Done, now also for CSS, thus complete
    2. refactor the code into more and more specific components according to our folder hierarchy (Tomi, Sjur)
  2. develop the needed XQueries and interface (Sjur, Tomi)
  3. data synchronisation between risten.no and the cvs repo (Tomi)
    1. committing is moving forward
  4. test and review when ready
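
The "most specific code is always used" idea from item 1 can be sketched outside Cocoon as a lookup that starts in the most specific folder and walks up towards the root. This Python sketch only illustrates the resolution order; it is not the sitemap itself, and the function name and folder layout are hypothetical.

```python
from pathlib import Path


def resolve_most_specific(root, folder, name):
    """Return the first existing file called `name`, searching from root/folder
    upwards to root. Returns None if no level provides the file."""
    root = Path(root)
    current = root / folder
    while True:
        candidate = current / name
        if candidate.is_file():
            return candidate
        # Stop at the collection root, or at the filesystem root as a safety net.
        if current == root or current.parent == current:
            return None
        current = current.parent
```

A request for a stylesheet in a specific collection would then fall back to the shared version only when the collection does not provide its own.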

9. Spellers

Nothing until the new proper noun lexicon is in place. We don’t have enough people to do both.

10. Other

Easter vacation/absences

Who? When?
Børre from the 10th to the 12th of April
Saara at work normally
Sjur no vacation, possibly paternity leave
Thomas from the 10th to the 12th of April, 3 days
Tomi from the 10th to the 12th of April, might be at work offline
Trond doesn’t know yet

No meeting during Easter.

Gobby

TODO:

SubEthaEdit update

TODO:

Bug fixing

35 open Divvun/Disamb bugs, and 25 risten.no bugs

Min Áigi letters

There are four texts on language correction, two interesting to us:

Key to the G5 room

All Tromsø people need access to Børre’s office to be able to initiate group video conferences, but only Børre and Trond have it.

TODO:

11. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

12. Next meeting, closing

18.04.2006 09:30

Sjur is on paternity leave.

Closed at 12:47