Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

Cf. one of the following, depending on context:

1. Opening, agenda review, participants

Opened at 10:19.

Present: Børre, Per-Eric, Sjur, Tomi, Trond

Absent: Maaren, Risten, Thomas

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Ilona

Maaren

Per-Eric

Risten

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

The InDesign hyphenator was released just before Christmas (21.12).

Our documentation needs a thorough clean-up and reorganisation to make it clearer and easier to use.

TODO:

4. Corpus gathering

Per-Eric has been talking with the wife of Kurt Tore, and she has promised that their son Johnny will go through all the smj texts he has written. Hopefully we will get them soon. We need to know who signed the corpus contract with SD; Børre will check that.

We need to start gathering sma texts right away. Some sources of sma texts:

TODO:

5. Infrastructure

Forrest needs better handling of i18n, to help us get a more stable site. We should probably also look at a visual make-over of our site at the same time.

We now also have the time to start to explore the new collaboration features of the Leopard server: shared project calendars, possibly a wiki, permanent chat rooms for dedicated topics.

TODO:

6. Linguistics

North Sámi

Hyphenation bugs still there, needs improved test bench.

Lule Sámi

Hyphenation: same as for sme.

TODO:

South Sámi

Trond and Sjur need to take a thorough look at the sma sources, and evaluate the present approach to its morphophonology.

TODO:

7. Name lexicon infrastructure

Decisions made in Tromsø can be found in this meeting memo: /admin/physical_meetings/tromso-2006-08-propnoun.html

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative to the public risten.no server, since it might be faster and better suited than the official one; local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)
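
Item 9 above (merging placenames that ended up in separate entries) could be sketched roughly as follows. This is only an illustration under stated assumptions: the element and attribute names (terms, entry, name, @lang) and the sample data are hypothetical and are not taken from the actual terms-sme.xml schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real terms-sme.xml layout may differ.
SAMPLE = """<terms>
  <entry id="1"><name lang="fin">Helsinki</name></entry>
  <entry id="2"><name lang="swe">Helsingfors</name></entry>
  <entry id="3"><name lang="sme">Helsset</name></entry>
  <entry id="4"><name lang="sme">Romsa</name></entry>
</terms>"""


def merge_entries(root, names):
    """Merge every entry containing one of `names` into the first such entry."""
    wanted = set(names)
    matches = [
        entry for entry in root.findall("entry")
        if any(name.text in wanted for name in entry.findall("name"))
    ]
    if len(matches) < 2:
        return root  # nothing to merge
    target = matches[0]
    # Track (language, name) pairs already present to avoid duplicates.
    seen = {(n.get("lang"), n.text) for n in target.findall("name")}
    for entry in matches[1:]:
        for name in entry.findall("name"):
            key = (name.get("lang"), name.text)
            if key not in seen:
                target.append(name)
                seen.add(key)
        root.remove(entry)  # drop the now redundant entry
    return root


root = ET.fromstring(SAMPLE)
merge_entries(root, ["Helsinki", "Helsingfors", "Helsset"])
merged = [
    [(n.get("lang"), n.text) for n in entry.findall("name")]
    for entry in root.findall("entry")
]
```

After the call, `merged` holds one entry with the three parallel names and one untouched entry. Deciding which names belong together remains a job for the linguists; the sketch only covers the mechanical merge once that list is given.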

8. Proofing tools

Hunspell

Proper nouns are not yet working, and their entries contain nothing that clearly identifies the end of the stem. This makes it harder to generate proper Hunspell entries.

TODO:

Testing

Spelling Error Markup

This will wait till after the release.

TODO:

Automated testing

Paradigm testing is now fixed, and is working.

BUT: paradigms are not generated for smj verbs in gt/smj/testing. Nouns, adjectives, proper nouns and numerals all work fine.

TODO:

Lexicon conversion to the PLX format

Open issues based on test results:

sámi-dáru - not accepted => it is a Gen+hyph compound, which is not allowed with a hyphen. We can allow such compounds without too much overgeneration by adding the hyphen to the last part, i.e. -dáru, in the PLX entry. => File in Bugzilla as a feature request.

smj

sme

TODO:

InDesign tools

Hyphenator released! The speller is coming in a first beta today or tomorrow.

TODO:

Hyphenators

Still issues to investigate.

Update

Released Dec. 21.

Windows installer

New Windows installer: NSIS

Benefits of using this:

Drawbacks:

TODO:

9. Other

South Sámi project startup meeting

We extend the meeting on our part, to have this project’s first gathering.

TODO:

Corpus contracts + open source

TODO:

Bug fixing

When fixing bugs, record the version number containing the fix in the Bugzilla bug report, so that for each bug we know exactly when it should have been fixed, in which file(s), and in which version.

83 open Divvun/Disamb bugs (45 of these 83 are speller-related bugs, 38 are other bugs), and 23 risten.no bugs

10. Summary, priority list going forward

11. Next meeting, closing

The next meeting is 14.1.2008, 09:30 Norwegian time.

The meeting was closed at 12:30.

Appendix - task lists for the next week

Børre

Maaren

Per-Eric

Saara

Sjur

Thomas

Tomi

Trond