Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 10:04.

Present: Børre, Sjur, Steinar, Thomas, Tomi, Trond

Absent: Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond

3. Documentation

The open documentation issues fall into these three categories:

TODO:

4. Corpus gathering

TODO:

5. Corpus infrastructure

Alignment

TODO:

6. Infrastructure

TODO:

7. Linguistics

North Sámi

TODO:

Lule Sámi

Trond fixed a bug where a word-initial capital vowel prevented the CG rule from applying. The solution was to define the capital vowel symbols as members of Vow.
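
The effect of the fix can be illustrated with a minimal Python sketch. This is not the actual CG or set definition: the vowel inventory, the helper name `starts_with_vowel`, and the test word are assumptions chosen only to show why a vowel set without capital letters fails on capitalised words.

```python
# Illustrative vowel inventory; an assumption, not the set from the real rule file.
LOWER_VOWELS = set("aáeiouåä")

# The fix described above: capital vowel symbols are also members of the vowel set.
VOW = LOWER_VOWELS | {v.upper() for v in LOWER_VOWELS}


def starts_with_vowel(word, vowels):
    """Return True if the first character of word is in the given vowel set."""
    return bool(word) and word[0] in vowels


# A capitalised word is missed by the lowercase-only set but matched by VOW.
print(starts_with_vowel("Áhkká", LOWER_VOWELS))
print(starts_with_vowel("Áhkká", VOW))
```

With the lowercase-only set, any condition that depends on a word-initial vowel silently fails for sentence-initial words and names, which matches the symptom described above.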

TODO:

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in this meeting memo: /admin/physical_meetings/tromso-2006-08-propnoun.html

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative to the public risten.no server; it might be faster and better suited than the official one, and local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries, e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)

9. Spellers

OOo speller(s)

TODO after the MS Office Beta is delivered:

Testing

Selecting test texts

In principle, we need the same text types as the ones we already aim at in our corpus. In practice, we must use new, unseen texts. We would like to have a balanced input of texts, but right now:

Storing test texts

TODO:

Testing tools

TODO:

Regression tests

TODO:

Localisation

We need to translate the info added to our front page (and a separate page) regarding the beta release. Also the press release needs to be translated.

TODO:

Lexicon conversion to the PLX format

TODO:

  1. Look at bottlenecks in existing code (Tomi, Børre)
    1. done - solved
  2. Look at xfst ways of doing it (Sjur, Trond, …)
    1. done for verbs
  3. add derivations to the PLX generation (Tomi)
    1. done
  4. add gt/cwb/paradigm.smj.txt file into gt/script/server_anl.pl (Saara)
  5. add prefixes to the PLX (Børre)
  6. middle nouns (Børre)
  7. add Makefile target for PLX conversion of lexc files (Tomi):
    1. adjectives
    2. nouns
    3. propernouns
    4. verbs derived into other POSes
    5. verbs - must be done on gtsvn.uit.no
      1. produce them with the paradigm server on victorio, or regenerate them every night and make them available for regular download? (Tomi, Saara)
  8. make conversion test sample; add conversion testing to the make file (Tomi)
  9. improve number conversion (Børre, Tomi)

Public Beta release

Due to the problems with generating the PLX files discussed above, we need to postpone the release date. End of March?

Linguistic issues still open:

DONE:

TODO:

Version identification of speller lexicons

See the Norwegian spellers for an example, with the trigger string tfosgniL.

Suggestion:

nuvviD -> Divvun
nuvviD -> Dávvisámegiella
nuvviD -> Veršuvdna_1.0b1 (based on cvs tag?)
nuvviD -> 12.2.2007  (automatically generated/added)
nuvviD -> Sjur_Nørstebø_Moshagen
nuvviD -> Børre_Gaup
nuvviD -> Thomas_Omma
nuvviD -> Maaren_Palismaa
nuvviD -> Tomi_Pieski
nuvviD -> Trond_Trosterud
nuvviD -> Saara_Huhmarniemi
nuvviD -> Steinar_Nilsen
nuvviD -> Lene_Antonsen
nuvviD -> Linda_Wiechetek

These correction rules (and their corresponding PLX entries) should be added automatically to the PLX file and the phonetic file as part of the compilation process, to include build date and version number.
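
As a rough illustration of how such rules could be generated at build time, here is a minimal Python sketch. The function name `version_rules` and its parameters are hypothetical, and the `trigger -> target` line format is simply copied from the suggestion above; the real syntax of the PLX and phonetic files is not shown here and would need to be adapted.

```python
from datetime import date

# "Divvun" reversed, the trigger string used in the suggestion above.
TRIGGER = "nuvviD"


def version_rules(product, language, version, build_date, contributors):
    """Return version-identification rule lines in the 'trigger -> target' form.

    All names here are hypothetical; the output format mirrors the suggestion
    in these minutes, not a verified PLX or phonetic-file syntax.
    """
    # Day.month.year without zero padding, as in the example "12.2.2007".
    day = f"{build_date.day}.{build_date.month}.{build_date.year}"
    targets = [product, language, f"Veršuvdna_{version}", day]
    # Spaces are replaced by underscores, as in the names listed above.
    targets += [name.replace(" ", "_") for name in contributors]
    return [f"{TRIGGER} -> {target}" for target in targets]


rules = version_rules(
    "Divvun",
    "Dávvisámegiella",
    "1.0b1",
    date(2007, 2, 12),
    ["Sjur Nørstebø Moshagen", "Børre Gaup"],
)
print("\n".join(rules))
```

In a real build, the date would presumably come from the build itself (for example `date.today()`) and the version from the cvs tag, so that neither has to be edited by hand.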

TODO:

10. Other

Project meeting IRL

Reserve the whole week after Easter for a project gathering, probably in Kåfjord. That is, the days 10-13.4.

Corpus contracts

TODO:

Updates:

Bug fixing

48 open Divvun/Disamb bugs, and 23 risten.no bugs

11. Next meeting, closing

The next meeting is 26.3.2007, 09:30 Norwegian time.

The meeting was closed at 11:36.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond