Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 9:39.

Present: Børre, Maaren, Sjur, Steinar, Thomas, Tomi, Trond

Absent: Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond

3. Documentation

Missing documentation on corpus access: no link to unrestricted part of corpus for download, and no page describing how to apply for access to the corpus.

TODO:

4. Corpus gathering

Børre had a short look at the new sma files. Nothing else happened.

TODO:

5. Corpus infrastructure

Alignment

TODO:

Conversion issues

Character encoding errors are vanishing, thanks to Saara. The remaining annoying problem is that the pdf conversion cuts word forms in two:

đi iešguđet prográmmasurggiin, mat laktásit dear vvašvuhtii, eallindillái ja sám
n/depts/Mangf027.pdf.xml:    <p>Kultuvra ja dear vvašvuohta Kultuvra ja dearvvaš
jat johtui dutkama sámiid eallineavttuid ja dear vvašvuođa birra. Galgá vuoruhit
idfuolahusa doaibmaplánaruđaiguin (ja mielladear vvašvuođa buoridanplánaruđaigui
                          vuhtiiváldima dárbu ár vvoštallojuvvo dan oktavuođas,

Another error type found/still not corrected:
 ánáidgárddiide, giellaoahpahussii ja diehtojuo­ hkin-, ovddidan- ja oaivadanbar

We have two suggestions for fixing some or most of these errors:

  1. String replacement in the xsl files, like / vva/vva/ etc.
  2. Testing the string output of the pdf conversion in the morphological analyser:
- test each string returned for morphological recognisability
- if a string is not recognised, and the following string is not recognised either:
  - concatenate the two unrecognised strings, and try again
  - if the concatenation is recognised, use the concatenated string; if
    not, leave the strings as they are
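
The second suggestion can be sketched roughly as follows. This is only an illustration of the steps listed above, not existing project code: the function name `rejoin_split_words` and the `is_recognised` callback are hypothetical, and the callback stands in for a real call to the morphological analyser. The toy word list exists only to make the example runnable.

```python
from typing import Callable, List


def rejoin_split_words(tokens: List[str],
                       is_recognised: Callable[[str], bool]) -> List[str]:
    """Merge two adjacent unrecognised tokens if their concatenation is recognised."""
    result = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        # Only consider merging when both this token and the next one fail.
        if (i + 1 < len(tokens)
                and not is_recognised(current)
                and not is_recognised(tokens[i + 1])):
            joined = current + tokens[i + 1]
            if is_recognised(joined):
                # The concatenation is a known word form: use it.
                result.append(joined)
                i += 2
                continue
        # Otherwise leave the token as it is.
        result.append(current)
        i += 1
    return result


# Toy stand-in for the analyser (assumption): a fixed set of known word forms.
KNOWN = {"ja", "dearvvašvuohta", "kultuvra"}


def toy_is_recognised(word: str) -> bool:
    return word.lower() in KNOWN


fixed = rejoin_split_words(["Kultuvra", "ja", "dear", "vvašvuohta"],
                           toy_is_recognised)
print(fixed)
```

Note that this sketch only joins pairs of strings, as described in the list; a word form cut into three pieces would be left untouched.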

Pro/con:

TODO:

6. Infrastructure

We need to test the infrastructure with new corpus users, and with external users in general, cf. the e-mail announcement to the NoDaLi mailing list, which is about to be made ready.

Test tasks:

Instructions:

TODO:

7. Linguistics

North Sámi

TODO:

Numbers:

We had a meeting last Friday, memo to be found here

TODO:

Classification of compounding with numerals:

Num N   ok golbmafanas
N   Num R not fanasgolbma

Num Num is a restricted type of compounding, very different from free compounding. It is already implemented in the transducer, but is currently left out because of circularity.

Long numerals written out in letters are marginal, but we may consider generating some. Syntax:

* 1-9 + thousand + 1-9 + hundred +
    { 1-9 + ten +  1-9 / 1-9 + teen / 1-9 twenty + 1-9 }

From our corpus:

      1 golmma#beaivásaš
     10 golmma#geardán
      2 golmma#geardásaš
     20 golmma#gielat
      6 golmma#jagáš
      3 golmma#jahkásaš
      1 golmma#jienat
      1 golmma#juvllat
      1 golmma#lanjat
      7 golmma#lágan
      3 golmma#oasat
      2 golmma#oktasaš
      6 golmma#riika
      5 golmma#čiegat
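
A frequency list of this kind could be produced along the following lines. The minutes do not say how the list above was actually made, so this is a sketch under assumptions: the input is taken to be a sequence of analysed word forms in which `#` marks the compound boundary (as in the list), the function name `count_compounds` is hypothetical, and the sample input is made up for illustration.

```python
from collections import Counter
from typing import Iterable


def count_compounds(analysed_forms: Iterable[str], first_part: str) -> Counter:
    """Count compound forms whose first part equals `first_part`."""
    # Assumption: '#' separates the parts of a compound in the input.
    prefix = first_part + "#"
    return Counter(form for form in analysed_forms if form.startswith(prefix))


# Made-up sample input, for illustration only.
sample = ["golmma#gielat", "golmma#riika", "golmma#gielat", "guovtte#gielat"]

counts = count_compounds(sample, "golmma")
for form, n in sorted(counts.items()):
    # Same layout as the list above: right-aligned count, then the form.
    print(f"{n:7d} {form}")
```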

Hyphenation problem

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in the meeting memo (/admin/physical_meetings/tromso-2006-08-propnoun.html).

Postponed:

TODO:

  1. restructure interface code for easier maintenance, coding and use
    1. well under way, still some work
      1. continued
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. start to use the xml file as source file
  7. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  8. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  9. publish the name lexicon on risten.no (Sjur)
  10. add missing parallel names for placenames (linguists)
  11. add informative links between first names like Niillas and Nils (linguists)

9. Spellers

Polderland data generation

The PLX conversion made substantial progress, and we now cover most POSes, with the exception of numerals. Derivations are also not yet covered. We now need to send what we have to Polderland, so that they can test it and deliver a first version of the spellers based on our PLX code.

Prefixes should be added as separate entries.

TODO:

  1. add closed POS and clitics to PLX generation (Børre, Tomi)
    1. done
  2. add compound stems to the PLX generation (Børre, Tomi)
    1. done
  3. Include numerals in the speller (Børre, Tomi)
  4. add prefixes to the PLX (Børre, Tomi)
  5. add smj to PLX conversion (Børre, Tomi)
  6. add derivations to the PLX generation (Børre, Tomi)

Aspell

TODO when the major part of the PLX conversion is done:

Testing

TODO:

Localisation

smj is now translated, and should be sent to Polderland.

TODO:

10. Other

Corpus contracts

TODO:

Bug fixing

55 open Divvun/Disamb bugs, and 23 risten.no bugs

11. Next meeting, closing

The next meeting is 29.1.2007, 09:30 Norwegian time.

The meeting was closed at 10:53.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond