Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 10:15.

Present: Børre, Ilona, Per-Eric, Sjur, Thomas, Tomi

Absent: Risten, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Ilona

Maaren

Per-Eric

Risten

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Nothing new.

TODO:

4. Corpus infrastructure

Nothing.

5. Infrastructure

TODO:

6. Linguistics

North Sámi

Hyphenation has improved, but the output still contains a lot of errors. Sjur will run the latest hyphenator on our test material and discuss the test results with the rest of the team.

TODO:

Lule Sámi

Trond and his team have found words to be added to the smj lexicon.

cat smesmj.txt | grep -v 'prop$' | cut -f2 | lookup -flags mbTT -utf8 ~/gt/smj/bin/smj.fst | grep '\?' | l

The smesmj.txt lexicon contains 6581 words. Disregarding the proper nouns, 1824 of them are recognised neither by smj-norm.fst nor by smj.fst. Many of these are loan words or derivations. Some examples:

čála    tjála   n
giehtačála      giehtatjála     n
vuolláičála     vuollájtjála    n <= :-)
vinjučála       vinjotjála      n
johtučála       jåhtotjála      n
čuokkisčála     tjuokkestjála   n
bajildusčála    bajeldustjála   n
mála    mála    n
tjála   čála    n
giehtatjála     giehtačála      n <=== :-((
vuollájtjála    vuolláičála     n
tjuokkestjála   čuokkisčála     n
vinjotjála      vinjučála       n
jåhtotjála      johtučála       n
leapma  liebma  n

ja/dahje        ja/dahje        +?
gobba   gobba   +?
gaiba   gaiba   +?
struhcca        struhcca        +?
fáhcca  fáhcca  +?
suorbmafáhcca   suorbmafáhcca   +?
vahca   vahca   +?
ohca    ohca    +?
juhca   juhca   +?
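
The unrecognised entries above all carry the `+?` analysis in the third column. A minimal sketch of how such entries could be counted from saved lookup output, assuming the output is tab-separated as shown and has been written to a file (the function name and the file name `lookup-output.txt` are hypothetical, not part of our setup):

```sh
# Count distinct input words that lookup marked as unrecognised.
# Assumption: every output line is "input<TAB>form<TAB>analysis",
# and unrecognised words carry the analysis "+?".
count_unrecognised() {
    # $1: file holding the saved lookup output (hypothetical file name)
    awk -F '\t' '$3 == "+?" { print $1 }' "$1" | sort -u | wc -l | tr -d ' '
}
```

Usage would be along the lines of `count_unrecognised lookup-output.txt`. Duplicates are collapsed with `sort -u`, so a word that appears several times in the input is counted once.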

There seems to be a mix-up of smj and sme in the material. That has to be cleaned up.
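
In the examples above, the same word pair occurs in both directions (e.g. čála/tjála and tjála/čála). As a rough sketch of how such entries might be located, the following lists every pair that is also present in reversed order. It assumes a tab-separated file with the two word forms in columns 1 and 2; the function name is hypothetical, and whether reversed pairs are the only symptom of the problem is an assumption.

```sh
# List word pairs that occur in both directions, i.e. the file
# contains both "a<TAB>b" and "b<TAB>a".
# Assumption: tab-separated columns, word forms in columns 1 and 2.
find_reversed_pairs() {
    # $1: word list file, e.g. smesmj.txt
    awk -F '\t' '
        # Identical forms (e.g. "mála<TAB>mála") cannot be told apart; skip them.
        $1 == $2 { next }
        # The reversed pair was seen earlier and this direction is new:
        # report the pair once, in the order it first appeared.
        (($2 FS $1) in seen) && !(($1 FS $2) in seen) { print $2 "\t" $1 }
        # Remember every pair read so far.
        { seen[$1 FS $2] = 1 }
    ' "$1"
}
```

Running `find_reversed_pairs smesmj.txt` would give a list of candidate entries for manual inspection; deciding which direction is the correct one still has to be done by a linguist.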

We have to test hyphenation for Lule Sámi as well.

TODO:

7. Name lexicon infrastructure

Sjur got risten.no up and running on the G5. It worked only for him, though.

Decisions made in Tromsø can be found in this meeting memo: /admin/physical_meetings/tromso-2006-08-propnoun.html

TODO:

  1. set up Tomcat and risten.no on the G5 again (Sjur, Børre)
    1. install risten.no
      1. did it
  2. fix bugs in lexc2xml; add comments to the log element (Saara)
  3. finish first version of the editing (Sjur)
  4. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  5. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  6. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  7. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no, since it might be faster and better suited than the official one; local installations could be treated the same way)
  8. start to use the xml file as source file
  9. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  10. merge placenames which are erroneously in different entries, e.g. Helsinki, Helsingfors, Helsset (linguists)
  11. publish the name lexicon on risten.no (Sjur)
  12. add missing parallel names for placenames (linguists)
  13. add informative links between first names like Niillas and Nils (linguists)

8. Proofing tools

Hunspell

Continuously improving.

TODO:

Testing

Spelling Error Markup

This will wait till after the release.

TODO:

Automated testing

TODO:

MS Office

An important aspect of this testing is to document in the user guide anything that could be a problem for users.

TODO:

Lexicon conversion to the PLX format

Open issues based on test results:

smj

482 - still problematic (prefix)
484 - double hyphens suggested
575 - name+name = double hyphens in suggestions
Svierigadárogielan - still rejected (prefix)

sme

397 - double hyphens (name+name)
419 - fixed
425 - Roman numeral
431 - does not accept the correct string, but does suggest it; also, hyphen-final forms are accepted, but not the same form when it is part of a compound
452 - fixed
461 - ovda accepted, almost 50 % (17) get the correct suggestion
489, 522 - fixed
524 - fixed
Guovdageainnu-láđđi - not accepted

TODO:

InDesign tools

TODO:

Hyphenators

Testing!!!

Release version

Schedule and tasks for the remaining weeks:

TODO:

Actual release

December 12 is the most likely date, before 12:00. Still to be confirmed.

There will be a release party in the afternoon.

9. Other

Corpus contracts

Delayed till after final release.

TODO:

Bug fixing

When fixing a bug, record the version number containing the fix in the Bugzilla bug report, so that for each bug we know exactly from when it should have been fixed, in which file(s), and in which version.

There are 83 open Divvun/Disamb bugs (45 of them speller-related, 38 other bugs) and 23 open risten.no bugs.

Software updates

10. Next meeting, closing

The next meeting is 03.12.2007, 09:30 Norwegian time.

The meeting was closed at 11:42.

Appendix - task lists for the next week

Børre

Ilona

Maaren

Per-Eric

Risten

Saara

Sjur

Thomas

Tomi

Trond