Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:57.

Present: Børre, Ilona, Sjur, Thomas

Absent: Risten, Per-Eric, Trond, Tomi

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Ilona

Maaren

Per-Eric

Risten

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Needs to be fixed by Wednesday.

TODO:

4. Corpus infrastructure

Nothing.

5. Infrastructure

TODO:

6. Linguistics

North Sámi

Hyphenation test results:

Correct:
--------
konseartaprográmma      kon-sear-ta-pro-grám-ma
konseartaeahkedis       kon-sear-ta-eah-ke-dis
Márkomeanu      Már-ko-mea-nu
lávvardateahkeda        láv-var-dat-eah-ke-da
konseartaeahkedis       kon-sear-ta-eah-ke-dis
lávvardateahkeda        láv-var-dat-eah-ke-da
servodatberošteaddji    ser-vo-dat-be-roš-tead-dji
sámegillii      sá-me-gil-lii
sátnegovat      sát-ne-go-vat
morašluohti     mo-raš-luoh-ti
Justislávdegoddi        Jus-tis-láv-de-god-di
Sámedikkiin     Sá-me-dik-kiin
olggosaddán     olg-gos-ad-dán
luitet  lui-tet
kristtalaš      krist-ta-laš
orrun   or-run
Lotnolasealáhusas       Lot-no-las-ea-lá-hu-sas
juoiganjuristan juoi-gan-ju-ris-tan
issoras is-so-ras
suinna  suin-na
dáinna  dáin-na
Háliidivččen    Há-lii-divč-čen
bisánivčče      bi-sá-nivč-če
duostan duos-tan
Gárdegobba      Gár-de-gob-ba
Čeakčačahca     Čeak-ča-čah-ca
Gobba   Gob-ba
Sáivagobba      Sái-va-gob-ba

Missing hyph points (/):
------------------------
olgobáikkis     ol-go/báik-kis
máilmmi má/ilm-mi
gaskaijabeaivváš        gas-ka/i-ja-beaiv-váš
dehálaš de-há#laš

Incorrect (#):
--------------
lotnolasealáhussan      lot-no-la#s/e#a/lá/hus-san
bearrašiin      be#ar-ra/ši#in

PLX XFST code:
--------------
ol^go^báik^kis  NIE

gas^kai^ja^beaiv^váš-   NBOX
gas^kai^ja^beaiv^váš    NIOE
gas^kai^ja^beaiv^váš-   NIE
gas^kai^ja^beaiv^váš-   GaIE
gas^kai^ja^beaiv^váš    NBO
gas^kai^ja^beaiv^váš    GaBO

bear^ra^šiin    NIE

lot^no^la^sea^lá^hus^san        NIE
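
The annotated results above can be read back into two plain hyphenations: what the hyphenator produced and what was expected. A minimal sketch in Python, assuming the marker meanings implied by the list headings ('-' is a produced and correct point, '/' a missing point, '#' a produced but incorrect point). The function name decode_annotation is hypothetical and not part of any Divvun tool.

```python
def decode_annotation(annotated: str) -> tuple[str, str]:
    """Return (produced, expected) hyphenations for one annotated test result.

    Assumed marker meanings (inferred from the headings in these minutes):
      '-'  hyphenation point that was produced and is correct
      '/'  point that should have been produced but is missing
      '#'  point that was produced but is incorrect
    """
    # What the hyphenator actually output: drop missing points, keep wrong ones.
    produced = annotated.replace("/", "").replace("#", "-")
    # What it should have output: drop wrong points, add missing ones.
    expected = annotated.replace("#", "").replace("/", "-")
    return produced, expected


if __name__ == "__main__":
    for word, annotated in [
        ("olgobáikkis", "ol-go/báik-kis"),
        ("bearrašiin", "be#ar-ra/ši#in"),
    ]:
        produced, expected = decode_annotation(annotated)
        print(f"{word}: produced {produced}, expected {expected}")
```

For bearrašiin this gives be-ar-raši-in as produced and bear-ra-šiin as expected, which agrees with the PLX entry bear^ra^šiin listed above. The sketch does not handle a '#' that comes from the test input itself rather than from the annotation.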

TODO:

Lule Sámi

Hyphenation test results:

Correct:
--------
viessomguhkes   vies-som-guh-kes
viessomvuohkáj  vies-som-vuoh-káj
árvvovuodo      árv-vo-vuo-do
väráltárbbe     vä-rált-árb-be
åhpadusorganisásjåvnån  åh-pa-dus-or-ga-ni-sá-sjåv-nån
häjmmadáfo      häjm-ma-dá-fo
árbbedáhpe      árb-be-dáh-pe
barggovuogijt   barg-go-vuo-gijt
rijkadajva      rij-ka-daj-va
ássjedåbdde     ás-sje-dåbd-de
láhkaásadimesa  láh-ka-á-sa-di-me-sa
giellalágajn    giel-la-lá-gajn
sámegiellaj     sá-me-giel-laj
buorrelágásj    buor-re-lá-gásj
árbbedábálattjat        árb-be-dá-bá-lat-tjat
láhkatæksta     láh-ka-tæks-ta
guosski guoss-ki
rijkalattjat    rij-ka-lat-tjat
organisásjåvnån or-ga-ni-sá-sjåv-nån
hábbmidime      hább-mi-di-me
rábmakonvensjåvnå       ráb-ma-kon-ven-sjåv-nå
árvvalasstet    árv-va-lass-tet
ássjedåbddejuogos       ás-sje-dåbd-de-juo-gos
unneplågogielajt        un-nep-lå-go-gie-lajt
láhkaásadimesa  láh-ka-á-sa-di-me-sa
biejvveavijsav  biejv-ve-a-vij-sav
sámegiellaj     sá-me-giel-laj
unneplågogielajn        un-nep-lå-go-gie-lajn
ministarjuohkusis       mi-nis-tar-juoh-ku-sis
guosski guoss-ki
Dussnagiehtje   Duss-na-gieh-tje
Divtasvuodna    Div-tas-vuod-na
Gåhpejávrre     Gåh-pe-jávr-re
Gásluokta       Gás-luok-ta
Helmukvuodna    Hel-muk-vuod-na
Ibboluokta      Ib-bo-luok-ta
Julevædno       Ju-lev-æd-no
Jåhkmåhkke      Jåhk-måhk-ke
Jåkmåhkke       Jåk-måhk-ke
Jåhkåmåhkke     Jåh-kå-måhk-ke

Missing hyph points (/):
------------------------
Jienastim#njuolgadusá   Jie/nas/tim#njuol/ga/du/sá (# not removed from input)
sierralágásj    sier-ra/lá/gásj
orgánajs        or-gá/najs
Orgánajs        Or-gá/najs

Incorrect (#):
--------------
javllamáno      ja#vl-la/má/no
biejvveávijsav  bie#jv/ve/á-vij/s#av
suomagiella     su#o-ma-giel-la
sierraláhkáj    sier-ra#láh-káj

PLX XFST code:
--------------
javl^la#má^no   NIE
javl^la#má^no   GaIOE

or^gá^najs      NIE

suo^ma#giel^la- NIE
suo^ma#giel^la- NBOX
suo^ma#giel^la  NIOE
suo^ma#giel^la  NBO
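
To compare the PLX entries above with the test output, the entries can be turned into plain hyphenated forms. A minimal sketch in Python, assuming that '^' marks a hyphenation point and '#' a compound boundary that is also a hyphenation point. The second column (NIE, NBOX, ...) is kept as an opaque string, and a trailing '-' on an entry is stripped without being interpreted. Both function names are hypothetical.

```python
def parse_plx_line(line: str) -> tuple[str, str]:
    """Split one listing line into (entry, class_codes).

    The class codes are returned unchanged; this sketch does not interpret them.
    """
    entry, codes = line.split()
    return entry, codes


def plx_to_hyphenation(entry: str) -> str:
    """Render a PLX entry as a hyphenated word.

    Assumption: both '^' and '#' are places where a hyphen may be inserted.
    A trailing '-' on the entry is dropped, since its meaning is not
    stated in these minutes.
    """
    stem = entry.rstrip("-")
    return stem.replace("^", "-").replace("#", "-")


if __name__ == "__main__":
    for line in ["javl^la#má^no   NIE", "suo^ma#giel^la- NBOX"]:
        entry, codes = parse_plx_line(line)
        print(f"{codes}: {plx_to_hyphenation(entry)}")
```

The resulting forms (javl-la-má-no, suo-ma-giel-la) are the hyphenations the lexicon entries ask for, so they can be checked against what the hyphenator produced for javllamáno and suomagiella.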

TODO:

7. Name lexicon infrastructure

Delayed till Divvun2 (or after release of Divvun1).

Decisions made in Tromsø can be found in [this meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative to the public risten.no server, since it might be faster and better suited than the official one; local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)

8. Proofing tools

Hunspell

Continuously improving.

TODO:

Testing

Spelling Error Markup

This will wait till after the release.

TODO:

Automated testing

TODO:

MS Office

Windows is done and OK. Ilona has tested PowerPoint and Word on Mac, and Trond has tested Excel (you have to turn on correction, otherwise you get English spell checking; this is the same for all languages). Entourage is still untested.

TODO:

Lexicon conversion to the PLX format

Open issues based on test results:

sámi-dáru is not accepted: it is a Gen+hyph compound, which is not allowed with a hyphen. We can allow such compounds without too much overgeneration by adding the hyphen to the last part, i.e. -dáru, in the PLX entry. => Bugzilla, as a feature request.

smj

sme

TODO:

InDesign tools

Almost finished, found a bug (reported to Polderland).

TODO:

Hyphenators

Testing! - Done, see above.

Final release

Schedule and tasks for the remaining weeks:

This week:

Next week:

The hotel is booked; you have to arrange your own travel to Oslo (but make sure the travel agency sends the bill to SD).

TODO:

9. Other

Corpus contracts

Delayed till after final release.

TODO:

Bug fixing

When fixing bugs, record the version number containing the fix in the Bugzilla bug report, so that for each bug we know exactly when it should have been fixed, in which file(s) and in which version.

83 open Divvun/Disamb bugs (45 speller-related, 38 other), and 23 risten.no bugs.

10. Next meeting, closing

The next meeting is 10.12.2007, 09:30 Norwegian time.

The meeting was closed at 10:09.

Appendix - task lists for the next week

Børre

Ilona

Maaren

Per-Eric

Risten

Saara

Sjur

Thomas

Tomi

Trond