Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 00:48.

Present: Børre, Maaren, Per-Eric, Sjur, Steinar, Thomas, Tomi, Trond

Absent: Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Per-Eric

Saara

Sjur

Steinar

Thomas

Tomi

Trond

3. Documentation

Nothing new.

The open documentation issues fall into these three categories:

TODO:

4. Corpus gathering

Nothing new.

TODO:

5. Corpus infrastructure

Alignment

All parallel texts are now aligned. Lars has done some work in Oslo, but there are no visible changes to the pages yet.

TODO

6. Infrastructure

The [http://giellatekno.uit.no] main page has been restructured, and Saara’s nice paradigm generator has been given a more visible place. The North Sámi generator is in place, but the Lule Sámi one has still not been implemented. On the downside, the online wordform generators do not work, and all the links now accidentally go to the English rather than the Sámi pages.

TODO:

7. Linguistics

North Sámi

Not allowing actio forms to compound creates problems for constructions like: boahtaladdan- ja vuolggasadji. We should probably look into whether we need to open up for actio compounds again; the disambiguator people should then find another solution to their problem.

TODO:

Lule Sámi

Buorrek issue: is -k a clitic (or an abessive?). The clitic issue seems to be phonological in Lule Sámi: tjuvdek, but not *tjuvdevk, *tjuvdijk, *tjuvdijnk. Should it be tjuvdijn ge instead of tjuvdijnk?

What is a clitic? A member of LEXICON K. A clitic:

  1. is totally uncritical when it comes to hosts (accepts all hosts) <===
  2. does not (usually) interfere with the phonology of the host

If -k can be added to all wordforms that point to the lexicon K today, then -k should stay in K. If not, it must go.
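The criterion above can be sketched as a small check. This is only an illustration: the judgements are copied from the forms quoted in this memo (starred forms counted as unacceptable), and the names JUDGEMENTS, clitic_stays_in_k and rejected_forms are hypothetical, not part of any existing tool.

```python
# Acceptability judgements as quoted in the memo above.
# True = attested form, False = starred (rejected) form.
# Assumption: the starred forms are treated as firm judgements here,
# although the memo itself still raises them as an open question.
JUDGEMENTS = {
    "tjuvdek": True,
    "tjuvdevk": False,
    "tjuvdijk": False,
    "tjuvdijnk": False,
}


def clitic_stays_in_k(judgements: dict[str, bool]) -> bool:
    """Return True only if the clitic is acceptable on every host form."""
    return all(judgements.values())


def rejected_forms(judgements: dict[str, bool]) -> list[str]:
    """List the cliticised forms that speak against keeping the clitic in K."""
    return sorted(form for form, ok in judgements.items() if not ok)


if __name__ == "__main__":
    print("keep -k in K:", clitic_stays_in_k(JUDGEMENTS))
    print("counterexamples:", rejected_forms(JUDGEMENTS))
```

A single rejected form is enough to fail the test, which mirrors the "accepts all hosts" requirement.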

North Sámi: we use K instead of #, since basically all words (and word forms) may take -go, -ba, -ge, etc.

lijge, lage, muorrage - buorrek, *lak

Say, for the sake of argument, that:

LEXICON K
 ENDLEX ;
###  +Clt:clitic ENDLEX ;
 +Clt+ge:#ge ENDLEX ;
 +Clt+gen:#gen ENDLEX ;
 +Clt+ga:#ga ENDLEX ;
 +Clt+k:#k ENDLEX ;

TODO:

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in [this meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; local installations could also be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)

9. Spellers

OOo speller(s)

TODO after the MS Office Beta is delivered:

Testing

Spelling Error Markup

TODO:

Testing tools

TODO:

Regression tests

TODO:

Localisation

We need to translate the info added to our front page (and a separate page) regarding the beta release. Also the press release needs to be translated.

TODO:

Lexicon conversion to the PLX format

Børre got the conversion working all the way to the final speller for sme. A new speller lexicon will soon be available for download. smj is next.

TODO:

Compounding restrictions

How to include compounding restriction comment tags in the transducers:

giv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp
=> (using a perl script or similar)
+SgNomCmp+SgGenCmp+PlGenCmpgiv0ri:giv'ri ALBMI ; !
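The rewrite shown above could be done along the following lines. This is a minimal sketch in Python rather than Perl, assuming that the compounding tags always sit alone in the trailing comment and always end in Cmp; the function name move_cmp_tags and the regular expression are illustrative, not an existing script.

```python
import re

# Matches a lexc entry whose trailing "!" comment contains only
# compounding tags such as +SgNomCmp (assumption: tags end in "Cmp").
ENTRY_WITH_CMP_TAGS = re.compile(
    r"^(?P<indent>\s*)(?P<entry>\S.*?;\s*)!(?P<tags>(?:\s*\+[A-Za-z]+Cmp)+)\s*$"
)


def move_cmp_tags(line: str) -> str:
    """Move compounding tags from the comment to the front of the entry.

    Lines without such a comment are returned unchanged.
    """
    match = ENTRY_WITH_CMP_TAGS.match(line)
    if match is None:
        return line
    # Join the tags without the spaces that separated them in the comment.
    tags = "".join(match.group("tags").split())
    return f"{match.group('indent')}{tags}{match.group('entry')}!"


if __name__ == "__main__":
    print(move_cmp_tags("giv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp"))
```

Entries with an ordinary comment, or with no comment at all, pass through untouched, so the function can be applied to every line of a lexicon file.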

TODO:

  1. improve prefix conversion to PLX (Tomi)
  2. improve middle noun conversion to PLX (Tomi)
  3. improve noun + adjective PLX conversion: (Tomi)
    1. compounding stems - how do we generate them? Using the java client? +SgNomCmp+Cmpnd = sáme–, should give the correct compounding stem, shouldn’t it? We want to optionally go from: sáme- NLI to sáme NL: - NLI (->) NL, which means we should be able to extract correct compounding stems using xfst methods only.
    2. compounding tags - we need to obey them when making the transducers. Suggestion - see above.
  4. make conversion test sample; add conversion testing to the make file (Tomi)
    1. to regression test / QA the PLX conversion.
  5. improve number conversion (Børre, Tomi)
  6. ask for larger disk for the web server (Trond, Børre)

Public Beta release

TODO:

10. Other

Corpus contracts

TODO:

Bug fixing

35 open Divvun/Disamb bugs, and 23 risten.no bugs

TODO:

The next gathering

The Sámediggiráđđi meeting is moved one week. Thus, instead of around 13. June, it will be around 20. That’s midsummer here in Finland:(

11. Next meeting, closing

The next meeting is 7.5.2007, 09:30 Norwegian time.

The meeting was closed at 11:30.

Appendix - task lists for the next week

Børre

Maaren

Per-Eric

Saara

Sjur

Steinar

Thomas

Tomi

Trond