Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:38.

Present: Børre, Maaren, Per-Eric, Sjur, Steinar, Thomas, Tomi, Trond

Absent: Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Per-Eric

Saara

Sjur

Steinar

Thomas

Tomi

Trond

3. Documentation

The open documentation issues fall into these three categories:

TODO:

4. Corpus gathering

TODO:

5. Corpus infrastructure

Nothing this week.

6. Infrastructure

TODO:

7. Linguistics

North Sámi

Actio compounds: most of them are already accepted as freely formed compounds. The boundary can be illustrated with the following:

  BOAHTALADDAN  *boahtalladanmuorra - not the base form, but which form? = derived
  BOAHTIN boahtinmuorra - base form
  oahpahahttin - causatives, for example, are allowed
  oahpaladdanvuohki is OK, as are all these -laddat + subst. forms

Actual speller behaviour on the examples above:

boahtalladanmuorra
boahtaladdanmuorra - not accepted, suggestions:
    bohtaladdanmuorra - OK with -o- (Maaren disagrees? No, she does not)
    bohtaladdanmuorra
bohtaladdanmuorra       bohtat+V+IV+Der1+Der/l+V+Der2+Der/adda+V+Der3+Der/n+N+Sg+Nom#muorra+N+Sg+Nom
    boahtáladdanmuorra - OK with á

boahtinmuorra - ok
oahpahahttin - ok
oahpaladdanvuohki - not accepted (should it be?), suggestions:
    oahppaladdanvuohki - this is correct according to Thomas
    roahpaladdanvuohki
    soahpaladdanvuohki
    ...

Maaren and Duomma disagree about what is correct and what is not; this needs to be resolved. (Slight pause…) It is resolved :)

TODO:

Lule Sámi

We received a PDF file containing the print-ready dictionary. It is set in two columns, and it is difficult to extract any formatting information from it. This makes it hard to distinguish the keywords from the other information in each entry. Børre will try to get the dictionary in a more suitable format from Sami åhpadusguovdásj in Jokkmokk.

TODO:

norm-look-up:

tjiehkusit
tjiehkusit      tjiegos+A+Adv

basstelit
basstelit       basstelit       +?

muttágit
muttágit        mutták+A+Adv

gåbddågit
gåbddågit       gåbddåk+A+Adv

allagit
allagit         allak+A+Adv

gåbddågit
gåbddågit       gåbddåk+A+Adv

galmasit
galmasit        galmasit        +?

galmmasit
galmmasit       galmas+A+Adv

suohkadit
suohkadit       suohkat+A+Adv

låssådit
låssådit        låssåt+A+Adv

dibmásit
dibmásit        dimes+A+Adv

bihtjasit
bihtjasit       bitjas+A+Adv
bihtjasit       bitjes+A+Adv

oabmásit
oabmásit        oames+A+Adv
oabmásit        oabme+N+Sg+Ill+PxSg2

tjalmmisit
tjalmmisit      tjalmmis+A+Adv

stuoragit
stuoragit       stuorak+A+Adv
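
The forms that still lack an analysis can be picked out of such output mechanically. The following is only a minimal sketch, assuming the usual lookup layout seen above (input form, then lemma plus tags, with `+?` as the last field when no analysis was found); the function name `unanalysed_forms` and the sample string are illustrative and not part of the project tools.

```python
# Hypothetical helper: list the word forms that the lookup could not analyse.
# Assumption: each analysis line is whitespace-separated and ends in "+?"
# when the form is unknown, as in the output above.

SAMPLE = """\
tjiehkusit
tjiehkusit      tjiegos+A+Adv

basstelit
basstelit       basstelit       +?

galmasit
galmasit        galmasit        +?

galmmasit
galmmasit       galmas+A+Adv
"""


def unanalysed_forms(lookup_output):
    """Return the input forms marked with '+?', in order of first appearance."""
    missing = []
    for line in lookup_output.splitlines():
        fields = line.split()
        # Lines with a single field only echo the input; skip them.
        if len(fields) >= 2 and fields[-1] == "+?" and fields[0] not in missing:
            missing.append(fields[0])
    return missing


if __name__ == "__main__":
    for form in unanalysed_forms(SAMPLE):
        print(form)
```

Run on the sample, the sketch reports basstelit and galmasit, the two forms without an analysis in the list above.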

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in this meeting memo: /admin/physical_meetings/tromso-2006-08-propnoun.html

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the CVS repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no; it might be faster and better suited than the official one. Local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)

9. Spellers

OOo speller(s)

The MS Office Beta has now been delivered, so the following items move up the priority list.

TODO:

Testing

Spelling Error Markup

TODO:

Testing tools

TODO:

Regression tests

TODO:

Localisation

TODO:

Lexicon conversion to the PLX format

TODO:

Compounding restrictions

How to include compounding restriction comment tags in the transducers:

giv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp
=> (using a perl script or similar)
+SgNomCmp+SgGenCmp+PlGenCmpgiv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp
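
The rewrite shown above could be scripted roughly as follows. This is a sketch in Python rather than Perl, under the assumption that the compounding restrictions are always written as `+...Cmp` tags in the trailing `!` comment of a single-line lexc entry, and that the entry itself contains no `!`. The function name `prefix_compound_tags` is made up for illustration.

```python
import re

# A lexc entry ending in ";" followed by a "!" comment, e.g.
#   giv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp
# Assumption: no "!" occurs inside the entry itself.
ENTRY = re.compile(r"^(?P<indent>\s*)(?P<entry>[^!]*;)\s*!(?P<comment>.*)$")

# Assumption: compounding restriction tags look like +SgNomCmp, +PlGenCmp, ...
CMP_TAG = re.compile(r"\+\w+Cmp\b")


def prefix_compound_tags(line):
    """Copy the compounding tags from the comment to the front of the entry."""
    match = ENTRY.match(line)
    if match is None:
        return line  # not an entry with a comment: leave untouched
    tags = CMP_TAG.findall(match.group("comment"))
    if not tags:
        return line  # comment carries no compounding tags
    indent = match.group("indent")
    rest = line[len(indent):]
    prefix = "".join(tags)
    if rest.startswith(prefix):
        return line  # already converted; do not add the tags twice
    return indent + prefix + rest


if __name__ == "__main__":
    print(prefix_compound_tags(
        "giv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp"))
```

The comment is kept in place, so the converted line stays readable and the conversion can be rerun safely on a file that has already been processed.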

TODO:

  1. improve prefix conversion to PLX (Tomi)
  2. improve middle noun conversion to PLX (Tomi)
  3. improve noun + adjective PLX conversion: (Tomi)
    1. compounding stems - how do we generate them? Using the java client? +SgNomCmp+Cmpnd = sáme–, should give the correct compounding stem, shouldn’t it? We want to optionally go from: sáme- NLI to sáme NL: - NLI (->) NL, which means we should be able to extract correct compounding stems using xfst methods only.
    2. compounding tags - we need to obey them when making the transducers. Suggestion - see above.
  4. make conversion test sample; add conversion testing to the make file (Tomi)
    1. to regression test / QA the PLX conversion.
  5. improve number conversion (Børre, Tomi)
    1. done
  6. ask for larger disk for the web server (Trond, Børre)
    1. done and installed

Errors in latest speller build

This list should be added to Bugzilla, and the list of fixes and known issues in future releases should be generated from it.

TODO after the final public beta:

Public Beta release

RELEASED on Tuesday May 29.

Press coverage within the Sámi community has been very good; the coverage in Sweden and Finland, and in Norwegian-language media, is not known.

TEST before public beta release:

TODO:

10. Other

Summer vacation

When are we taking it? Please fill in the table below:

Name      Starting  Ending
Børre     x         x
Maaren    9.7.      10.8.
Per-Eric  9.7.      20.7.
Saara     2.7.      3.8.
Sjur      x         x
Steinar   x         x
Thomas    9.7.      12.8.
Tomi      9.7.      5.8.
Trond     25.6.     x

Divvun people also need to send the dates to Julie Eira or Ellen Mienna Guttorm.

Corpus contracts

TODO:

Bug fixing

41 open Divvun/Disamb bugs, and 23 risten.no bugs

TODO:

The meeting in Drag

The Sámi Parliament board has its meeting June 19-21. We should use Monday, June 18, as our travel day, and return on Friday, June 22. Fly to Bodø, and continue by rental car from there. It is also possible to drive all the way from Tromsø, which is even faster. Those going to Bodø are (at least):

11. Next meeting, closing

The next meeting is 11.6.2007, 10:30 Norwegian time.

The meeting was closed at 11:56.

Appendix - task lists for the next week

Børre

Maaren

Per-Eric

Saara

Sjur

Steinar

Thomas

Tomi

Trond