Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

Cf. one of the following, depending on context:

Opening, agenda review, participants

Opened at 09:55.

Present: Børre, Ciprian, Maja, Sjur, Thomas, Tomi, Trond

Absent: none

Agenda accepted as is.

Updated task status since last meeting

Børre

Ciprian

Maja

Sjur

Thomas

Tomi

Trond

Oahpa!

Nothing new this week.

Meeting memos can be found at [http://giellatekno.uit.no/ped/index.html#Meeting+memos]

TODO

Corpus gathering

Børre has downloaded Sámi skuvvlahistorija. Maja has written some letters to authors about sma corpus gathering.

TODO:

Promoting Divvun

TODO:

Future plans, directions and ideas

See a separate document in plan/strat/5year.jspwiki.

Northern areas project

TODO:

Infrastructure

Corpus infra remake

Børre’s latest work.

There are two repositories, bound and free, each containing an orig directory. What about the gold standard directory, which also comes in bound and free variants?

bound/orig/
     /goldst-orig/
free/orig/
    /goldst-orig/

The upload dir is full of files (roughly 90 orig files). These should be converted. We also want to get the corpus summaries updated regularly, as they used to be.

TODO:

License

TODO:

Corpus interface

TODO:

Makefile + tag simplification

TODO:

  1. test that the output from the new transducers is identical to the old one (Tomi)
  2. working session to go through the remaining issues: 16.3. (Tomi, Sjur)
    1. done, but we should have one more
  3. write new build commands (Sjur, Tomi)
  4. when the new build infrastructure works as it should, delete the old ones (Sjur, Tomi)
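
Item 1 above (checking that the new transducers give the same output as the old ones) could be done roughly as follows. This is a minimal sketch, not the project's actual test setup: it assumes the old and the new transducer have each been run over the same word list and the results saved to two files. The function name compare_analyses and all file names are hypothetical.

```shell
#!/bin/sh
# Compare two files of transducer output, ignoring line order.
# Prints "identical" and returns 0 when the sorted contents match;
# otherwise prints the differing lines in diff format and returns 1.
compare_analyses() {
    old_out=$1
    new_out=$2
    tmp_old=$(mktemp) || return 2
    tmp_new=$(mktemp) || { rm -f "$tmp_old"; return 2; }
    # Sort both sides so that a different ordering of analyses
    # is not reported as a difference.
    sort "$old_out" > "$tmp_old"
    sort "$new_out" > "$tmp_new"
    if diff "$tmp_old" "$tmp_new"; then
        echo "identical"
        rc=0
    else
        rc=1
    fi
    rm -f "$tmp_old" "$tmp_new"
    return "$rc"
}
```

The two input files could be produced with something like lookup bin/old.fst < words.txt > old.out and the same for the new transducer (transducer and word list names are placeholders). Sorting before the comparison is a design choice: it treats a mere reordering of analyses as harmless. Drop the sort calls if the order of the output matters.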

General list

Trond will write an e-mail to the fit group explaining our situation.

To accommodate future enhancements in different directions (in rough order of importance):

  1. test bench for all parts of our language technology efforts
    1. test bench enhanced, but not yet complete
  2. improve Forrest i18n support with static sites
  3. reorganise the documentation:
    1. differentiate between target groups
    2. get better grouping
    3. decide what to write in Forrest and what in wiki (cf. Apertium and [http://xixona.dlsi.ua.es/apertium/] for a similar split)
    4. update/add missing parts
  4. migrate lexc lexicons to XML, splitting the task
    1. Name lexica (the Name project)
    2. Dictionaries (already in XML, task is to integrate them)
    3. At least migrate the lexc open POSes (Komi as a pilot case)
  5. change the look of the documentation web
  6. corpus content moved to Max Planck repositories? Norsk språkbank?
  7. update infrastructure to allow content-restricted spellers for special target groups

TODO:

Linguistics

North Sámi

(nothing new, see proofing bugs below)

Lule Sámi

(nothing new, see proofing bugs below)

South Sámi

TODO:

Name lexicon/risten.no infrastructure

TODO:

Dictionaries

Sjur made a few updates and changes to the XXE config. Please run:

cd "$GTHOME/tools/xxe/"
make

Trond changed the DTD reference and updated the files in fkv:nob to use our common dtd and stylesheets.

TODO:

Proofing tools

South Sámi

TODO:

HFST-based proofing tools

The work with Voikko+HFST is moving forward.

Testing

Spelling Error Markup

TODO:

Speller testing

TODO:

Testing open-source Norwegian spellers

Sjur has invited the open-source group to test their spell-checker using our test bench. The response has been positive; we'll see what happens.

We should attend their developer meetings to present our work and how we work with language technology.

Speller bugs

List of bugs returned from Polderland:

Tag reordering for abbreviations has caused a lot of problems:

smj:
hr.
hr.	hr+ABBR+Acc
cand.philol.
cand.philol.	cand.philol+ABBR+N+Acc
Per
Per	Per+N+Prop+Mal+Sg+Attr

sme:
hr.
hr.	hr+N+ABBR+Acc
Per
Per	Per+N+Prop+Mal+Sg+Attr
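
The differing position of +N relative to +ABBR in the analyses above can be listed mechanically. The following is a minimal sketch, assuming lookup output in the tab-separated form shown above (word form, tab, analysis) arrives on standard input. The function name abbr_tag_order and the three labels it prints are hypothetical, and the sketch does not decide which order is the correct one.

```shell
#!/bin/sh
# For every analysis containing +ABBR, report whether +N is absent,
# comes before +ABBR, or comes after it. Lines without a tab-separated
# analysis, and analyses without +ABBR, are skipped.
abbr_tag_order() {
    awk -F'\t' '
        NF >= 2 && $2 ~ /\+ABBR(\+|$)/ {
            n = split($2, tags, "+")
            n_pos = 0; abbr_pos = 0
            # tags[1] is the lemma, so tag positions start at 2
            for (i = 2; i <= n; i++) {
                if (tags[i] == "N" && !n_pos) n_pos = i
                if (tags[i] == "ABBR" && !abbr_pos) abbr_pos = i
            }
            if (!n_pos) order = "no-N"
            else if (n_pos < abbr_pos) order = "N-before-ABBR"
            else order = "ABBR-before-N"
            print order "\t" $0
        }'
}
```

Run on the examples above, this labels the smj analysis hr+ABBR+Acc as no-N, cand.philol+ABBR+N+Acc as ABBR-before-N, and the sme analysis hr+N+ABBR+Acc as N-before-ABBR. The Per lines are skipped because they contain no +ABBR tag.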

Open issues based on test results:

sme

Version: Davvisámi, version 1.2, 2009-09-18

smj

Version: Julevsáme, version 1.2, 2009-09-20

TODO:

Hyphenator bugs

Open issues based on test results:

sme

Lexicon version: Davvisámi, version 1.2, 2009-09-18

No known issues!

smj

Lexicon version: Julevsáme, version 1.2, 2009-09-20

sma

Command to test the hyphenator:

preprocess dev/corp/pressemelding.txt | lookup bin/hyph-sma.fst | cut -f2 | \
lookup bin/hyphrules-sma.fst | grep -v '^$' | cut -f2 | uniq | see
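
The same pipeline could be wrapped in a small function so that it can be reused for other languages. This is only a sketch: it assumes that the other languages follow the same file naming as sma (bin/hyph-LANG.fst and bin/hyphrules-LANG.fst), which the memo does not confirm, and the function name test_hyphenator is hypothetical. The final viewer step is left out so that the result goes to standard output.

```shell
#!/bin/sh
# Run a corpus file through the two hyphenation transducers for one
# language and print the hyphenated forms, one per line.
# $1 = language code (e.g. sma), $2 = path to a plain-text corpus file.
test_hyphenator() {
    lang=$1
    corpus=$2
    preprocess "$corpus" \
        | lookup "bin/hyph-$lang.fst" | cut -f2 \
        | lookup "bin/hyphrules-$lang.fst" | grep -v '^$' | cut -f2 \
        | uniq
}
```

Called as test_hyphenator sma dev/corp/pressemelding.txt, this runs the same steps as the command above, and the output can then be piped to any pager or viewer.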

TODO:

Installer changes

TODO:

User documentation

TODO:

1.2 release

Content:

Other

Easter holidays

Holidays for all of us: Thursday 1.4, Friday 2.4 and Monday 5.4.

Name     Vacation
Børre    29-31 March
Ciprian  26-??? South of Tromsø (Suden)
Maja     29-31 March
Sjur     29-31 March; late at work on 6 April
Thomas   29-31 March
Tomi     1-2 days
Trond    Not clear yet (but travelling the week after Easter)

Thursday inhouse seminar

A short (less than 1h) seminar every Thursday at 10 AM. Possible topics:

Spring planning

Topics:

Dates:

Text to speech

TODO:

CAT

A-ITE seems to have been released as 1.0. We will test it.

TODO:

Next meeting, closing

The next meeting is 07.04.2010, 13:00 Norwegian time.

The meeting was closed at 11:51.

Appendix - task lists for the next week

Børre

Ciprian

Maja

Sjur

Thomas

Tomi

Trond