Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:39.

Present: Børre, Ilona, Sjur, Thomas, Tomi, Trond

Absent: Maaren, Per-Eric

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Ilona

Maaren

Per-Eric

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Nothing new.

4. Corpus gathering

Børre has contacted several people, see task status above.

TODO:

5. Corpus infrastructure

Saara has removed the *.html lists from the xdoc folder; our multilingual analyser interface is generated from XML instead.

6. Infrastructure

We are ordering a new server for faster processing. The order has not yet been placed: we are waiting for the final, updated and corrected offer, which should arrive today.

7. Linguistics

North Sámi

Remaining twol issues: see bug #460. The current error is connected to the forms olmmoš– and háliit, as before.

Ilona is working on the list of sme names from Finland. That list includes many Inari Sámi names. Børre has sent her a new list that contains all North Sámi place names, and Ilona has started with that list instead of the Finland name list. The new list should include all the sme names, including the names that are in the first list.

There is an empty file, gt/smn/src/propernoun-smn-lex.txt, in CVS. Inari Sámi names go there, and may eventually be exported to other languages.

TODO:

Lule Sámi

The æ-ä issue: see [bug 411|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=411].

TODO:

8. Name lexicon infrastructure

This sub-project needs to get up and running soon. Mainly Sjur’s task.

Decisions made in Tromsø can be found in [this meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the CVS repo, and possibly other servers (i.e. the G5 as an alternative to the public risten.no server, since it might be faster and better suited than the official one; local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge place names which are erroneously in different entries, e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)
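
As an illustration of items 5 and 9 above, here is a minimal sketch of deriving a per-language lexc name list from one common XML file in which parallel names share an entry. The XML layout (`place`, `name`, `lang`, `contlex`) and the continuation lexicon names are hypothetical, chosen only for this example; they are not the actual schema of propernoun-sme-lex.xml or terms-sme.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: one <place> per referent, one <name> per language.
SAMPLE = """
<names>
  <place id="helsinki">
    <name lang="fin" contlex="PLC-FIN">Helsinki</name>
    <name lang="swe" contlex="PLC-SWE">Helsingfors</name>
    <name lang="sme" contlex="PLC-SME">Helsset</name>
  </place>
</names>
"""

def lexc_lines(xml_text, lang):
    """Return sorted lexc entries ('lemma CONTLEX ;') for one language."""
    root = ET.fromstring(xml_text)
    lines = []
    for name in root.iter("name"):
        if name.get("lang") == lang:
            lines.append(f"{name.text.strip()} {name.get('contlex')} ;")
    return sorted(lines)

if __name__ == "__main__":
    for line in lexc_lines(SAMPLE, "sme"):
        print(line)
```

Keeping the parallel names inside one entry means each propernoun-($lang)-lex.txt can be regenerated at any time, and a name that is missing for one language shows up as a missing child element rather than as a separate, unlinked entry.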

9. Spellers

OOo spellers

Tomi is working on the lexicon conversion to the Hunspell format. It is moving forward.

plx    source ->    transducer -> wordlist in plxformat -> speller binary
       src/*        *-plx.fst         > 60 GB                2 MB
       polderland/*

hun    source -> transducer    -> java/perl-server program -> huncode
       src/*     *-hunspell.fst

My question: is the set of noun-verb-adj continuation lexica taken from a wordlist generated by xfst? That is:

  1. generate full paradigm per word with xfst (as for polderland today)
  2. extract stems automatically from the generated paradigm (60 GB), and not from the *-lex.txt files
  3. turn the result into hunspell stem / cont

The hunspell generation process thus mirrors the plx generation process. Yes, but with output in a different format (and a much smaller file: MB instead of GB). That’s fine. The important point is that it has the same input, so that Tomi does not need to rewrite 8000 lines of smX-lex.txt + twol code. Of course :)
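
The three steps above can be sketched roughly as follows. This is not Tomi's conversion, only an illustration under simplifying assumptions: the paradigms are already available as a mapping from lemma to generated word forms, the stem is taken to be the longest common prefix of the forms, and each distinct set of endings becomes one numeric suffix class. The function name and the flag value 9999 are made up for the example, and the toy data is Finnish rather than smX.

```python
import os

NEEDAFFIX = 9999  # arbitrary flag for stems that are not word forms themselves

def build_hunspell(paradigms):
    """Turn {lemma: [word forms]} into (dic_lines, aff_lines)."""
    classes = {}  # tuple of endings -> numeric suffix flag
    dic = []
    for lemma in sorted(paradigms):
        forms = sorted(set(paradigms[lemma]))
        stem = os.path.commonprefix(forms)
        if not stem or len(forms) == 1:
            # Nothing to factor out: list the forms as plain entries.
            dic.extend(forms)
            continue
        suffixes = tuple(f[len(stem):] for f in forms)
        flag = classes.setdefault(suffixes, len(classes) + 1)
        entry = f"{stem}/{flag}"
        if "" not in suffixes:
            entry += f",{NEEDAFFIX}"  # bare stem must not be accepted
        dic.append(entry)
    aff = ["SET UTF-8", "FLAG num", f"NEEDAFFIX {NEEDAFFIX}"]
    for suffixes, flag in classes.items():
        endings = [s for s in suffixes if s]
        aff.append(f"SFX {flag} Y {len(endings)}")
        aff.extend(f"SFX {flag} 0 {s} ." for s in endings)
    # The first line of a .dic file is the (approximate) entry count.
    return [str(len(dic))] + dic, aff

if __name__ == "__main__":
    toy = {"talo": ["talo", "talon", "talossa"],
           "kala": ["kala", "kalan", "kalassa"]}
    dic, aff = build_hunspell(toy)
    print("\n".join(dic))
    print("\n".join(aff))
```

With consonant gradation and other stem alternations the common prefix becomes short and the number of suffix classes grows, so a real conversion would need a better stem/continuation split than this prefix heuristic.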

We have, in parallel, been looking at sfst. The results are good; sfst seems to be a good compiler for FSTs. How it behaves when scaled up to real size we do not know. A major point regarding spell checkers will be speed: how fast is the sfst engine?

The sfst version of smX will be put on hold until after New Year.

TODO:

Testing

Spelling Error Markup

TODO:

Automated testing

We need a separate speller pre-processor to turn ccat output into suitable speller input. We need this to be able to run whole documents through the speller, to test lexical coverage as well as precision/recall etc.
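
A minimal sketch of what such a pre-processor and test harness could look like. It assumes that ccat output is plain running text and that the speller can be wrapped as a function returning True for accepted words; both are assumptions for this example, and all function names are hypothetical.

```python
import re

# A word is a run of letters, optionally joined by internal hyphens or
# apostrophes. Digits, punctuation and trailing hyphens are dropped.
TOKEN = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")

def tokens(text):
    """Split running text into speller input, one word per list item."""
    return TOKEN.findall(text)

def coverage(words, accepts):
    """Share of the words that the speller accepts (lexical coverage)."""
    words = list(words)
    if not words:
        return 0.0
    return sum(1 for w in words if accepts(w)) / len(words)

def precision_recall(flagged, true_errors):
    """Precision and recall of the words flagged by the speller,
    measured against the manually marked-up spelling errors."""
    flagged, true_errors = set(flagged), set(true_errors)
    hits = len(flagged & true_errors)
    precision = hits / len(flagged) if flagged else 1.0
    recall = hits / len(true_errors) if true_errors else 1.0
    return precision, recall

if __name__ == "__main__":
    words = tokens("Olmmoš vuodjá biillain, 2007.")
    lexicon = {"Olmmoš", "biillain"}  # stand-in for the real speller
    print(words, coverage(words, lexicon.__contains__))
```

The `true_errors` set would come from the spelling error markup in the corpus, so that the same documents give both coverage and precision/recall figures.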

TODO:

Lexicon conversion to the PLX format

We have found a bug in the conversion. We also have the [double-hyphen issue|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=458], which needs to be fixed in the LexC code (the R lexicon and surrounding matters).

We need a compounding form without a hyphen in the speller. In the xfst processing (adding PLX tags) you can add hyphens, but we don’t know how to remove a hyphen.

But is this a lexc problem? Probably not, see below (sme.fst, sme-norm.fst, spellernonrec-sme.save all give the same result):

-bash-3.00$ lookup -flags mbTT -utf8 sme/bin/spellernonrec-sme.save
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
biila-
biila-  biila+N+SgNomCmp+Cmpnd

biila--
biila-- biila-- +?

-bash-3.00$ lookup -flags mbTT -utf8 sme/bin/spellernonrec-sme.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
bii^la
bii^la  biila+N+Sg+Nom

bii^la-
bii^la- biila+N+SgNomCmp+Cmpnd

bii^la--
bii^la--        bii^la--        +?
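
When checking longer word lists this way, the unrecognised inputs can be picked out of the lookup output automatically. The sketch below assumes only what the transcripts above show: an analysis line has the input as its first field, and a failed lookup ends in `+?`. The function name is hypothetical.

```python
def unrecognised(lookup_output):
    """Return the inputs that lookup could not analyse (marked '+?')."""
    missing = []
    for line in lookup_output.splitlines():
        fields = line.split()
        # Prompt lines, the progress bar and echoed input never end in '+?'.
        if len(fields) >= 2 and fields[-1] == "+?":
            missing.append(fields[0])
    return missing

if __name__ == "__main__":
    sample = (
        "biila-\n"
        "biila-\tbiila+N+SgNomCmp+Cmpnd\n"
        "\n"
        "biila--\n"
        "biila--\tbiila--\t+?\n"
    )
    print(unrecognised(sample))
```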

TODO:

New public beta

Delayed till the majority of the present bugs are fixed. The twolc bug is the major stopper.

10. Other

Corpus contracts

TODO:

Bug fixing

When fixing bugs, record the version number containing the fix in the Bugzilla bug report, such that for each bug, we know exactly when it should have been fixed, in what file(s) and what version.

57 open Divvun/Disamb bugs (29 of these are speller-related bugs, 28 are general bugs), and 23 risten.no bugs.

Project meeting

We’ll meet 24-28 September in Tromsø to work on the hardest remaining issues.

11. Next meeting, closing

The next meeting is 3.9.2007, 09:30 Norwegian time.

The meeting was closed at 10:54.

Appendix - task lists for the next week

Børre

Ilona

Maaren

Per-Eric

Saara

Sjur

Thomas

Tomi

Trond