Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:53.

Present: Sjur, Steinar, Thomas

Absent: Børre, Maaren, Saara, Tomi, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

On winter holiday.

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond

3. Documentation

The open documentation issues fall into these three categories:

TODO:

4. Corpus gathering

TODO:

5. Corpus infrastructure

Alignment

TODO:

6. Infrastructure

TODO:

7. Linguistics

North Sámi

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in this meeting memo: /admin/physical_meetings/tromso-2006-08-propnoun.html

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no, since it might be faster and better suited than the official one; local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)

9. Spellers

OOo speller(s)

TODO after the MS Office Beta is delivered:

Testing

Different ways of testing

  1. Impressionistic, functionality: try the program, try all the functions
  2. Impressionistic, coverage: try the program on different texts, look for false positives
  3. Systematic (in order of importance):
    1. Make a corpus of texts, from different genres (can be done before 0.2 release)
      1. For each text, detect precision
      2. For each text, detect recall
      3. For each text, detect accuracy

Before beta release: precision is important, but have a look at recall as well.

Definitions

Recall and precision

Precision and recall testing

A testbed has been set up (Trond), and some texts are marked for errors and corrections (Steinar). Versions alpha, beta 0.1 and beta 0.2 have been tested.

Types of tests:

  1. Technical testing
  2. Testing for linguistic functionality
  3. Testing for lexical coverage
  4. Testing for normativity
  5. Testing the suggestions

The tester should identify these 4 values:

The spreadsheet will then calculate precision, recall and accuracy. Steinar has marked some texts like this: errors are makred§marked, i.e. the misspelled form is followed by the paragraph sign (§) and then the correct form. Way of finding precision: take out the § entries and evaluate them for tp and fp. Way of finding recall: remove the § entries and count the fn among the remaining words. Then fill in the values and collect the results.
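As a rough illustration of what the spreadsheet computes, here is a minimal Python sketch. It assumes that the four values are tp, fp, fn and tn (the minutes only name tp, fp and fn explicitly) and uses the standard definitions of precision, recall and accuracy. The function names and the regular expression for the error§correction markup are hypothetical, not part of the actual testbed.

```python
import re

# Assumed markup: a misspelling directly followed by "§" and the correct form,
# e.g. "makred§marked". The pattern is an illustration only.
MARKUP = re.compile(r"(\w+)§(\w+)")


def marked_errors(text):
    """Return (error, correction) pairs found in a marked-up text."""
    return MARKUP.findall(text)


def scores(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from the four counts.

    Assumed meaning of the counts for a speller:
      tp = misspellings that were flagged
      fp = correct words that were flagged
      fn = misspellings that were not flagged
      tn = correct words that were accepted
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


if __name__ == "__main__":
    print(marked_errors("Errors are makred§marked like this."))
    print(scores(tp=8, fp=2, fn=2, tn=88))
```

Counts of zero are mapped to a score of 0.0 here to avoid division by zero; a real test sheet might prefer to leave such cells empty.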

Testing of suggestion should follow the same lines:

Ordering of suggestions:

“Perceived Quality”, i.e. for all recognised errors/tp:

Testing on unseen texts

We need to use unseen texts in order to measure the performance of the speller.

Regression tests

We need to ensure that we do not take steps backwards, i.e. all known spelling errors in the corpus should be correctly identified, with a proper suggestion among the top five. For this purpose we can use the regular corpus with correction markup.
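The regression criterion above could be checked along the following lines. This is only a sketch: the speller is represented by a hypothetical callable `suggest` that returns None for an accepted word and a ranked list of suggestions for a rejected one. How the real speller is invoked is not covered here.

```python
def regression_failures(pairs, suggest, top_n=5):
    """Check known (error, correction) pairs against a speller.

    pairs   -- iterable of (error, correction) taken from the corpus markup
    suggest -- hypothetical callable: word -> None if the word is accepted,
               otherwise a ranked list of suggestions
    top_n   -- the correction must be among the first top_n suggestions
    """
    failures = []
    for error, correction in pairs:
        suggestions = suggest(error)
        if suggestions is None:
            # The speller accepted a known misspelling.
            failures.append((error, correction, "not flagged"))
        elif correction not in suggestions[:top_n]:
            failures.append((error, correction,
                             "correction not in top %d" % top_n))
    return failures


if __name__ == "__main__":
    # Toy stand-in for the speller, for demonstration only.
    fake_speller = {"makred": ["marked", "mared"]}
    known = [("makred", "marked"), ("wrod", "word")]
    for failure in regression_failures(known, fake_speller.get):
        print(failure)
```

An empty result means no regression was found for the given pairs.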

We also need to regression test the PLX conversion. In principle this is easy - just send the full word-form list (as generated by make wordlist TARGET=sme) through the speller. None should be rejected - any word form rejected is in principle a regression. In practice, this is not that easy, since the word list is so huge. We have to investigate alternatives for this testing.
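One alternative worth investigating is to stream the word list instead of loading it, and optionally to test only every n-th form. The sketch below assumes a hypothetical callable `accepts` that wraps the speller, and a word list with one form per line. Sampling is a suggestion for discussion, not a decision from the meeting.

```python
import itertools


def rejected_forms(word_forms, accepts, step=1):
    """Yield the word forms that the speller rejects.

    word_forms -- any iterable of lines, e.g. an open file, read lazily
    accepts    -- hypothetical callable: word -> True if the speller accepts it
    step       -- check only every step-th form (1 = check everything)
    """
    for line in itertools.islice(word_forms, 0, None, step):
        form = line.strip()
        if form and not accepts(form):
            yield form


if __name__ == "__main__":
    # Toy data; in practice word_forms would be the open word-list file.
    forms = ["viessu\n", "viesut\n", "xyzzy\n", "viesus\n"]
    print(list(rejected_forms(forms, lambda word: word != "xyzzy")))
```

Because the function is a generator, memory use stays constant regardless of the size of the word list. With step greater than 1 the run is faster, but a rejected form can be missed.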

TODO:

Storing test texts

Test texts should be stored in the corpus catalogue, separated from the ordinary corpus files. They should be marked as to whether their unknown words have been added to the lexicon or not (in the former case, they cannot be used for testing of performance any more, only for regression testing). When the words have been added, the whole text can be transferred to the regular corpus repository.

TODO:

The b0.3 / 2007.02.26 version

Known errors:

Localisation

We need to translate the info added to our front page (and a separate page) regarding the beta release. Also the press release needs to be translated.

TODO:

Lexicon conversion to the PLX format

We need to test that the conversion is correct and gives expected results in all cases, especially regarding compounding and derivation. For that we need a small set of test entries in lexc format, and the corresponding expected output in PLX format. By comparing the actual output with the expected output we get a measure of the quality of the conversion.
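The comparison of actual and expected output could be sketched as follows. The PLX entries are treated as opaque lines, since the entry syntax is not described in these minutes; the file labels in the diff header are placeholders.

```python
import difflib


def compare_conversion(expected_lines, actual_lines):
    """Compare expected and actual PLX output line by line.

    Returns (diff, missing, unexpected):
      diff       -- unified diff as a list of lines, empty if identical
      missing    -- expected entries that the conversion did not produce
      unexpected -- produced entries that were not expected
    """
    expected = [line.rstrip("\n") for line in expected_lines]
    actual = [line.rstrip("\n") for line in actual_lines]
    diff = list(difflib.unified_diff(expected, actual,
                                     "expected", "actual", lineterm=""))
    missing = sorted(set(expected) - set(actual))
    unexpected = sorted(set(actual) - set(expected))
    return diff, missing, unexpected


if __name__ == "__main__":
    # Placeholder entries; real input would be read from the two PLX files.
    diff, missing, unexpected = compare_conversion(
        ["entry-1", "entry-2"], ["entry-1", "entry-3"])
    print("\n".join(diff))
    print("missing:", missing, "unexpected:", unexpected)
```

The number of missing and unexpected entries relative to the size of the expected file gives a simple quality measure, and an empty diff can serve as the pass condition in the make file.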

TODO:

  1. add derivations to the PLX generation (Tomi)
    1. working on it
  2. add prefixes to the PLX (Børre)
  3. middle nouns (Børre)
  4. make conversion test sample; add conversion testing to the make file (Tomi)
    1. Sjur added ms-speller to the Makefile
  5. improve number conversion (Børre, Tomi)

Public Beta release

Tentative public beta release: after the initial linguistic bugs and poor coverage, it has now been moved to Thursday 15.3., this time with derivations and numbers included :-)

Internal deadlines:

Linguistic issues still open:

The middle nouns are: beai, beal, geaš, oahpaheai, oai, miel, vuol. They are also marginally used initially (not found in the corpus):

 beai+ShCmp:beai            Rreal ; ! not used init in our corpus
 beal+ShCmp:beal            Rreal ; ! init with Num -goalmmat, -guđát, -nuppi, lexicalized
 geaš+ShCmp:geaš            Rreal ; ! not used init in our corpus
 oahpaheai+ShCmp:oahpaheai  Rreal ; ! init, but then actually 2-part
 oai+ShCmp:oai              Rreal ; ! not used in corpus init oaivuolli (SUB? yes!)
 vuol+ShCmp:vuol            Rreal ; ! not used in our corpus

The PLX format does not allow encoding a stem as middle-only. For the public beta we will encode them as Left-only (which is really non-right), and evaluate their effect on the quality of the speller as we progress.

DONE:

TODO:

Version identification of speller lexicons

See the Norwegian spellers for an example, with the trigger string tfosgniL.

Suggestion:

nuvviD -> Divvun
nuvviD -> Dávvisámegiella
nuvviD -> Veršuvdna_1.0b1 (based on cvs tag?)
nuvviD -> 12.2.2007  (automatically generated/added)
nuvviD -> Sjur_Nørstebø_Moshagen
nuvviD -> Børre_Gaup
nuvviD -> Thomas_Omma
nuvviD -> Maaren_Palismaa
nuvviD -> Tomi_Pieski
nuvviD -> Trond_Trosterud
nuvviD -> Saara_Huhmarniemi
nuvviD -> Steinar_Nilsen
nuvviD -> Lene_Antonsen
nuvviD -> Linda_Wiechetek

These correction rules (and their corresponding PLX entries) should be added automatically to the PLX file and the phonetic file as part of the compilation process, to include build date and version number.
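A build step that generates these rules could look roughly like the sketch below. It only produces the "nuvviD -> ..." notation used in this memo; the real PLX and phonetic-file syntax is not shown here. The function name, its parameters and the idea of passing the version in from the cvs tag are assumptions.

```python
import datetime

TRIGGER = "nuvviD"  # trigger string from the suggestion above


def version_rules(version, build_date, names, language="Dávvisámegiella"):
    """Return the version-identification rules in the memo's notation.

    version    -- version string, e.g. taken from a cvs tag (assumption)
    build_date -- datetime.date of the build
    names      -- contributor names; spaces are replaced by underscores
    """
    date_text = "%d.%d.%d" % (build_date.day, build_date.month, build_date.year)
    values = ["Divvun", language, "Veršuvdna_" + version, date_text]
    values += [name.replace(" ", "_") for name in names]
    return ["%s -> %s" % (TRIGGER, value) for value in values]


if __name__ == "__main__":
    rules = version_rules("1.0b1", datetime.date.today(),
                          ["Sjur Nørstebø Moshagen", "Børre Gaup"])
    print("\n".join(rules))
```

Generating the list at compile time keeps the build date and version number in step with the lexicon that is actually shipped.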

TODO:

Conversion from LexC to PLX

Adjectives compile at 60 sec/adjective, i.e. (5000*60) / 3600 ≈ 83 hrs
Nouns compile at 3 sec/noun,            i.e. (23600*3) / 3600 ≈ 20 hrs

This is so far acceptable for nouns, but on the edge of being unacceptable for adjectives. These times will multiply several times over when we add derivation, meaning that we will then need more than a week to convert the major parts of speech from LexC to PLX.

We need to investigate why adjectives are so slow, and try to improve on the conversion speed.
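The estimate can be redone quickly whenever the entry counts or the per-entry times change. The sketch below uses the figures from these minutes. The derivation factor is a made-up placeholder, since the actual slowdown has not been measured.

```python
def conversion_hours(entries, seconds_per_entry):
    """Estimated conversion time in hours."""
    return entries * seconds_per_entry / 3600


adjectives = conversion_hours(5000, 60)  # about 83.3 hours
nouns = conversion_hours(23600, 3)       # about 19.7 hours

# Hypothetical slowdown once derivation is added; not a measured value.
DERIVATION_FACTOR = 2

if __name__ == "__main__":
    total = adjectives + nouns
    print("adjectives: %.1f h, nouns: %.1f h" % (adjectives, nouns))
    print("total now: %.1f days" % (total / 24))
    print("total with derivation (assumed factor %d): %.1f days"
          % (DERIVATION_FACTOR, total * DERIVATION_FACTOR / 24))
```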

TODO:

10. Other

Corpus contracts

TODO:

Bug fixing

57 open Divvun/Disamb bugs, and 23 risten.no bugs

11. Next meeting, closing

The next meeting is 12.3.2007, 09:30 Norwegian time.

The meeting was closed at 10:42.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond