Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 9:24.

Present: Børre, Saara, Sjur, Steinar, Thomas, Trond

Absent: Maaren, Tomi

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond

3. Documentation

Børre has resolved all(!) open issues - thanks!

TODO:

4. Corpus gathering

TODO:

5. Corpus infrastructure

Aligner

TODO:

Conversion issues

Initial nbsp? Cf these:

ccat -l sme -r zcorp/bound/sme/admin/depts/ | preprocess --abbr=bin/abbr.txt --corr=src/typos.txt | lookup -flags mbTT -utf8 -f bin/missing | grep '\?' | cut -f1 | sort | uniq -c | sort -nr | l
 19  –Danne (EM SPACE, U+2003)
 84  –Lea
 62 Bueng

Use UnicodeChecker to check for spurious characters that look innocent (nbsp, em space, etc.).
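The same kind of check can also be scripted. The following is a minimal sketch, not part of the project tools: the function name `suspicious_chars` is hypothetical, and the rule it applies (flag every space separator other than the plain space, plus every invisible format character) is an assumption about what counts as "spurious" here.

```python
import unicodedata


def suspicious_chars(token):
    """Return (index, code point, Unicode name) for each character in
    token that looks like an ordinary space or is invisible.

    Assumption: "spurious" means Unicode category Zs (space separators)
    other than U+0020, or category Cf (format characters such as the
    soft hyphen or zero-width space).
    """
    hits = []
    for i, ch in enumerate(token):
        category = unicodedata.category(ch)
        if (category == "Zs" and ch != " ") or category == "Cf":
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits


if __name__ == "__main__":
    # An en dash followed by an em space, as in the frequency list above.
    for hit in suspicious_chars("\u2013\u2003Danne"):
        print(hit)
```

Run over the first column of the frequency list, a check like this would point at the em space in the Danne entry, while leaving the en dash alone since it is ordinary punctuation.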

admin/depts still have many errors stemming from the problematic PDF conversion:

     34 kapih
     33 ttalat

TODO:

6. Infrastructure

Nothing

7. Linguistics

North Sámi

Compounds with Actio as the first part. Earlier, we accepted such compounds, but they gave rise to so many disambiguation errors that we excluded them and opted for lexicalisation. Now they turn up again, but the solution is probably still to lexicalise them.

bajásšaddaneavttuid
olgunastinlága
ávkkástallanvugiide

Another problem:

The fix here is the same as for smj below: Hyphen must be included in the right context triggering lowering.

Third issue: stuorra-oslolaš does not work (that is, oslolaš with an initial lower-case letter after a hyphenated compound; there is always a hyphen in front of name-based words).

TODO:

Numbers:

Numbers need to be included in the non-rec transducers, and their inflection needs to be tested. Test specification:

TODO:

Hyphenation problem

Vowel combinations that are not defined as diphthongs behave like diphthongs anyway, i.e. they don't get hyphenated. In North Sámi we have this problem when there is an a or u in second position:

laam
laam    laam (expected: la-am)
liam
liam    liam
luam
luam    luam
lyam
lyam    lyam
lium
lium    lium
luum
luum    luum
lyum
lyum    lyum

For Lule Sámi, ALL vowel combinations behave like diphthongs.

South Sámi needs to be looked into as well: there are reports indicating that it has onset maximising rather than coda maximising.
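The North Sámi cases listed above could be turned into a small regression check. The sketch below is a toy model of the expected behaviour only, not the hyphenator's actual rules: the function name `split_vowel_pairs` is hypothetical, and both the vowel set and the diphthong set (ea, ie, oa, uo) are assumptions. Only la-am is given as an expected form in these minutes; the other splits merely follow from the same toy rule.

```python
# Assumed vowel inventory and diphthong set; adjust to the real definitions.
VOWELS = set("aáeiouyæøåäö")
DIPHTHONGS = {"ea", "ie", "oa", "uo"}


def split_vowel_pairs(word):
    """Insert a hyphen between two adjacent vowels unless the pair is
    one of the (assumed) diphthongs. Consonants are left untouched."""
    out = []
    for i, ch in enumerate(word):
        if i > 0:
            pair = (word[i - 1] + ch).lower()
            both_vowels = pair[0] in VOWELS and pair[1] in VOWELS
            if both_vowels and pair not in DIPHTHONGS:
                out.append("-")
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for test_word in ["laam", "liam", "luam", "lyam", "lium", "luum", "lyum"]:
        print(test_word, split_vowel_pairs(test_word), sep="\t")
```

Comparing this output with the hyphenator's output for the same word list would show which vowel pairs are still being treated as diphthongs.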

TODO:

Lule Sámi

We had a twol problem, now fixed by Thomas. We still have a problem with actor nouns of type juhttsat:juhttse: when hyphenated we get juhttsa-muorra instead of juhttse-muorra. There is a similar problem in sme with adjectives (see above).

The fix is to include the hyphen in the right context triggering vowel fronting.

TODO:

8. Name lexicon infrastructure

Sjur continued refactoring and redesigning the risten.no code: risten.no now has full i18n support, uses the latest Forrest to overcome several shortcomings of the previous version, and will soon be ready for a more flexible editor.

Decided in Tromsø:

Details can be found in the meeting memo (/admin/physical_meetings/tromso-2006-08-propnoun.html).

Postponed:

TODO:

  1. try to make a first version of xml2lexc in Perl for testing and preparation for the big jump (Saara)
    1. soon ready:-)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. start to use the xml file as source file
  7. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  8. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  9. publish the name lexicon on risten.no (Sjur)
  10. add missing parallel names for placenames (linguists)
  11. add informative links between first names like Niillas and Nils (linguists)

9. Spellers

Polderland data generation

TODO:

  1. decide how to specify compounding behaviour info for the lexicon (Thomas, Trond, Sjur)
  2. add closed POS and clitics to PLX generation (Børre, Tomi)
  3. add compound stems to the PLX generation (Børre, Tomi)
  4. add derivations to the PLX generation (Børre, Tomi)
  5. Include numerals in the speller (Børre, Tomi)

Aspell

TODO when the major part of the PLX conversion is done:

Testing

TODO:

Localisation

TODO:

10. Other

Corpus contracts

TODO:

Bug fixing

57 open Divvun/Disamb bugs, and 23 risten.no bugs

11. Next meeting, closing

The next meeting is 15.1.2007, 09:00 Norwegian time.

The meeting was closed at 10:50.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond