Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:49.

Present: Børre, Sjur, Thomas, Tomi, Trond

Absent: Maaren, Saara, Steinar

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

TODO:

5. Corpus infrastructure

Alignment

Main news: We have a working parallel corpus online.

Notes about the interface (or lack of documentation): the first search field in the form must be filled in; to get the parallel texts in the search result, click "add phrase" and set its language to the other language.

TODO:

Conversion issues

TODO:

6. Infrastructure

Børre and Steinar have both started on the task of testing and correcting the documentation.

TODO:

7. Linguistics

Numbers:

Thomas has almost finished correcting the number part of the sme analyzer.

TODO:

North Sámi

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

Postponed:

TODO:

  1. finish first version of the editing (Sjur)
  2. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  3. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  4. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  5. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no, since it might be faster and better suited than the official one; local installations could be treated the same way)
  6. start to use the xml file as source file
  7. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  8. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  9. publish the name lexicon on risten.no (Sjur)
  10. add missing parallel names for placenames (linguists)
  11. add informative links between first names like Niillas and Nils (linguists)

9. Spellers

Polderland data generation

TODO:

  1. improve number conversion (Børre, Tomi)
  2. add prefixes to the PLX (Børre, Tomi)
  3. add derivations to the PLX generation (Børre, Tomi)
    1. next after numbers are fixed

OOo speller(s)

TODO after the MS Office Beta is delivered:

Testing

TODO:

Localisation

We need to translate the info added to our front page (and a separate page) regarding the beta release. Also the press release needs to be translated.

TODO:

Beta release

Tentative beta release: Thursday 15.2. - but it might be delayed till later in February, since we still have no beta from Polderland.

In the beta, sme is now packaged as Catalan, whereas smj is packaged as Basque.

All beta packages (mklex tools, Win and Mac tools) can be copied from Sjur’s home dir on the G5:

/Users/sjur/mklex.zip
/Users/sjur/SamiProofingtools_beta-Mac.dmg
/Users/sjur/SamiProofingtools_beta-Win.zip

mklex -M256 -p sami_north_phon* revInputSamiNort.plx mssp3samiNorthern.lex

SamiNortAsCatalan2007-02sp

The PLX compilers (one each for sme and smj), which compile the specified source files into ready speller files (or almost ready in the Mac case), are now installed on the G5. Follow the instructions in the Word document found inside the mklex.zip package above, plus one addition when compiling the Mac version:

As a last step after the lexicon is compiled, use the tool /Developer/Tools/Rez to add a resource fork with the content of gt/sm(e|j)/polderland/*Lex.rsrc.hex to the lexicon file, and the tool /Developer/Tools/SetFile to add creator and type and to set the custom icon. The command sequence should be something like the following:

cd gt/
export RIncludes=/System/Library/Frameworks/Carbon.framework/Headers/
/Developer/Tools/Rez sme/polderland/CatalanLex.rsrc.hex -a -o $SpellerLexiconFile
/Developer/Tools/SetFile -a CI -c MSOF -t HMSD $SpellerLexiconFile

This step is necessary to make MS Office recognise the file as a genuine speller lexicon. It adds an icon and a language ID to the file.

PLX files:

All ok, except numerals.

DONE:

TODO:

Compilation

Adjectives compile at 60 sec/adjective, i.e. (5000 * 60) / 3600 ≈ 83 hrs.

Nouns compile at 3 sec/noun, i.e. (23600 * 3) / 3600 ≈ 19.7 hrs.
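
The estimates above are plain arithmetic, so they are easy to recompute when the per-word timings change. A minimal Python sketch, reading the figures in the minutes as 5000 adjectives at 60 seconds each and 23600 nouns at 3 seconds each (the helper name `compile_hours` is hypothetical, introduced only for illustration):

```python
def compile_hours(word_count: int, seconds_per_word: float) -> float:
    """Return the estimated compilation time in hours."""
    return word_count * seconds_per_word / 3600


# Word counts and timings as read from the minutes (assumed interpretation).
estimates = {
    "adjectives": compile_hours(5000, 60),
    "nouns": compile_hours(23600, 3),
}

for word_class, hours in estimates.items():
    print(f"{word_class}: {hours:.1f} hrs")
```

With these inputs the adjectives dominate the total, which is why reducing the per-adjective time matters most for the timetable.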

Testing

Different ways of testing:

  1. Impressionistic, functionality: try the program, try all the functions
  2. Impressionistic, coverage: try the program on different texts, look for false positives
  3. Systematic (in order of importance):
    1. Make a corpus of texts, from different genres (can be done before 0.2 release)
      1. For each text, detect precision
      2. For each text, detect recall
      3. For each text, detect accuracy

Before beta release: precision is important, but have a look at recall as well.

Recall and precision

Definitions:
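
The minutes do not record the definitions themselves. The sketch below assumes the standard definitions applied to error detection by a speller, which may differ from what the group settled on: precision is the share of flagged words that really are misspelled, recall is the share of misspelled words that get flagged, and accuracy is the share of all words judged correctly. The names `SpellerCounts` and `evaluate` and the toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class SpellerCounts:
    true_pos: int = 0   # misspelled words the speller flagged
    false_pos: int = 0  # correct words the speller flagged (false alarms)
    false_neg: int = 0  # misspelled words the speller accepted
    true_neg: int = 0   # correct words the speller accepted

    def precision(self) -> float:
        flagged = self.true_pos + self.false_pos
        return self.true_pos / flagged if flagged else 0.0

    def recall(self) -> float:
        errors = self.true_pos + self.false_neg
        return self.true_pos / errors if errors else 0.0

    def accuracy(self) -> float:
        total = self.true_pos + self.false_pos + self.false_neg + self.true_neg
        return (self.true_pos + self.true_neg) / total if total else 0.0


def evaluate(judgements: Iterable[Tuple[bool, bool]]) -> SpellerCounts:
    """Count outcomes from (is_really_misspelled, speller_flagged) pairs."""
    counts = SpellerCounts()
    for is_error, flagged in judgements:
        if is_error and flagged:
            counts.true_pos += 1
        elif is_error:
            counts.false_neg += 1
        elif flagged:
            counts.false_pos += 1
        else:
            counts.true_neg += 1
    return counts


# Toy data for illustration only, not measurements from the beta.
counts = evaluate([(True, True), (True, False), (False, False),
                   (False, True), (False, False)])
print(counts.precision(), counts.recall(), counts.accuracy())
```

Under these definitions, false alarms on correct words lower precision, which is the figure the notes single out as most important before the beta release.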

Timetable

  1. The next beta version (beta 0.2) is ready Tuesday at xx h?
  2. Testing 0.2: Thomas, Steinar, Trond, Ilona, …
  3. 0.3 compilation starts on Thursday
    • what to compile? only the improved *-sm{ej}-lex.txt files
  4. The next beta version (beta 0.3) is ready Sunday
  5. Monday: Testing beta 0.3 for unpleasant surprises
  6. We release beta 0.3 on Tuesday, unless there are surprises
  7. If there are surprises, we must compile again, this time 0.4
  8. Deadline for documentation as already(?) stated

compile a: *sm(e|j)-lex.txt to *-plx.txt = 83 hrs?

sort: *-plx.txt

compile b: -plx.txt to .sp = ? hrs

two-phase sort:

now:

tomorrow:

one-phase sort:

10. Other

Corpus contracts

TODO:

Bug fixing

57 open Divvun/Disamb bugs, and 23 risten.no bugs

Moving G5

TODO:

11. Next meeting, closing

The next meeting is 26.2.2007, 09:30 Norwegian time.

The meeting was closed at 11:12.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond