Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

Cf. one of the following, depending on context:

Opening, agenda review, participants

Opened at 10:17.

Present: Børre, Ciprian, Maja, Sjur, Thomas, Tomi, Trond

Absent: none

Agenda accepted as is.

Updated task status since last meeting

Børre

Ciprian

Maja

Sjur

Thomas

Tomi

Trond

Oahpa!

There are now 7 different Oahpa! versions in svn, the outcome of the Bodø seminar. The seminar was a great exercise for our Oahpa! infrastructure and gave us a lot of input and feedback.

Finnish version of Oahpa ready for testing and deployment on the production site.

Meeting memos can be found at [http://giellatekno.uit.no/ped/index.html#Meeting+memos]

TODO:

Corpus gathering

Trond: we have been asked to present our tools to the Giellagiella people in the first or second week of May. Topics: terminology and machine translation. We need to update our corpora and our interface (see further down).

TODO:

Promoting Divvun

TODO:

Future plans, directions and ideas

See a separate document in plan/strat/5year.jspwiki.

Northern areas project

TODO:

Infrastructure

License

Discussion. All need to be able to read the newsgroup. New Unison licenses:

Please read and comment in the newsgroup. Trond and Sjur will get the required licenses.

TODO:

Corpus interface

Made by Lars Nygård, UiT/Tekstlabben. Based on CWB from Stuttgart. Plus: made by linguists; minus: too many bells and whistles. We need a scaled-down version made for terminologists: baseforms, translations of baseforms, …

Our corpus (that is, Oslo) is using [http://cwb.sf.net/].

We want a more user-friendly interface in place before the Giellagiella presentations. Deadline: end of April. We could base a very simple search form on the code for this VISL page.

TODO:

Makefile + tag simplification

TODO:

  1. test that the output from the new transducers is identical to the old one (Tomi)
  2. working session to go through the remaining issues: 16.3. (Tomi, Sjur)
  3. write new build commands (Sjur, Tomi)
  4. when the new build infrastructure works as it should, delete the old ones (Sjur, Tomi)
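
A minimal sketch of how item 1 could be checked, assuming the old and the new transducers have each been run over the same word list and the lookup output saved to two files. The function name and the file names are hypothetical and not part of our build infrastructure:

```shell
# Hypothetical helper, not part of the existing build infrastructure.
# Usage: compare_analyses OLD_OUTPUT NEW_OUTPUT
# Returns 0 if both files contain the same set of non-empty lines,
# 1 (after printing a diff) if they differ, 2 if no temp file can be made.
compare_analyses() {
    old_sorted=$(mktemp) || return 2
    new_sorted=$(mktemp) || { rm -f "$old_sorted"; return 2; }
    # lookup separates the analyses of each word with an empty line;
    # drop those, then sort so that ordering differences are ignored.
    grep -v '^$' "$1" | LC_ALL=C sort -u > "$old_sorted"
    grep -v '^$' "$2" | LC_ALL=C sort -u > "$new_sorted"
    diff "$old_sorted" "$new_sorted"
    status=$?
    rm -f "$old_sorted" "$new_sorted"
    return "$status"
}
```

It could be called as `compare_analyses old-analyses.txt new-analyses.txt && echo identical`, where both file names are placeholders. Sorting first means that only differences in content are reported, not differences in the order of the analyses.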

General list

Trond will write an e-mail to the fit group explaining our situation.

To accommodate future enhancements in different directions (in rough order of importance):

  1. test bench for all parts of our language technology efforts
    1. test bench enhanced, but not yet complete
  2. improve Forrest i18n support with static sites
  3. reorganise the documentation:
    1. differentiate between target groups
    2. get better grouping
    3. decide what to write in Forrest and what in wiki (cf. Apertium and [http://xixona.dlsi.ua.es/apertium/] for a similar split)
    4. update/add missing parts
  4. migrate lexc lexicons to XML, splitting the task
    1. Name lexica (the Name project)
    2. Dictionaries (already in XML, task is to integrate them)
    3. At least migrate the lexc open POSes (Komi as a pilot case)
  5. change the look of the documentation web
  6. corpus content moved to Max Planck repositories? Norsk språkbank?
  7. update infrastructure to allow content-restricted spellers for special target groups

TODO:

Linguistics

North Sámi

(nothing new, see proofing bugs below)

Lule Sámi

(nothing new, see proofing bugs below)

South Sámi

TODO:

Name lexicon/risten.no infrastructure

TODO:

Dictionaries

Lecture on Thursday, 10.00 Norwegian time.

TODO:

Proofing tools

South Sámi

TODO:

HFST-based proofing tools

The work with Voikko+HFST is moving forward.

Testing

Spelling Error Markup

TODO:

Speller testing

TODO:

Testing open-source Norwegian spellers

Sjur has invited the open-source group to test their spell-checker using our test bench. The response has been positive; we’ll see what happens.

We should go to their developer meetings, and present our work and how to work with language technology.

Speller bugs

List of bugs returned from Polderland:

Tag reordering for abbreviations has caused a lot of problems:

smj:
hr.
hr.	hr+ABBR+Acc
cand.philol.
cand.philol.	cand.philol+ABBR+N+Acc
Per
Per	Per+N+Prop+Mal+Sg+Attr

sme:
hr.
hr.	hr+N+ABBR+Acc
Per
Per	Per+N+Prop+Mal+Sg+Attr
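
To see at a glance which tag orders the abbreviations actually get, lookup output like the above could be summarised per tag sequence. The following is only a sketch: the function name is hypothetical, and it assumes tab-separated lookup output in which the lemma is everything before the first `+`:

```shell
# Hypothetical helper: list the distinct tag sequences of all analyses
# that contain +ABBR, reading lookup output from files or from stdin.
abbr_tag_patterns() {
    awk -F '\t' '
        # Only lines with an analysis field that mentions +ABBR.
        NF >= 2 && $2 ~ /\+ABBR/ {
            tags = $2
            sub(/^[^+]*/, "", tags)   # strip the lemma, keep "+Tag+Tag..."
            print tags
        }
    ' "$@" | LC_ALL=C sort -u
}
```

Each distinct ordering is printed once, so analyses such as `+ABBR+N+Acc` and `+N+ABBR+Acc` end up on separate lines and are easy to spot.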

Open issues based on test results:

sme

Version: Davvisámi, version 1.2, 2009-09-18

smj

Version: Julevsáme, version 1.2, 2009-09-20

TODO:

Hyphenator bugs

Open issues based on test results:

sme

Lexicon version: Davvisámi, version 1.2, 2009-09-18

No known issues!

smj

Lexicon version: Julevsáme, version 1.2, 2009-09-20

sma

Command to test the hyphenator:

preprocess dev/corp/pressemelding.txt | lookup bin/hyph-sma.fst | cut -f2 | \
lookup bin/hyphrules-sma.fst | grep -v '^$' | cut -f2 | uniq | see

TODO:

Installer changes

TODO:

User documentation

TODO:

1.2 release

Content:

Other

Easter holidays

Holidays for all of us: Thursday 1.4, Friday 2.4 and Monday 5.4.

Name      Vacation
Børre     29-31 of March
Ciprian   orthodox or non-orthodox (I celebrate them all)
Maja      29-31 of March ?
Sjur      29-31 of March
Thomas    29-31 of March
Tomi      Not sure yet
Trond     Not clear yet (but traveling the week after Easter)

Thursday inhouse seminar

A short (less than 1h) seminar every Thursday at 10 AM. Possible topics:

Spring planning

Topics:

Dates:

Text to speech

Tomi has familiarised himself with the code.

TODO:

CAT

Autshumato ITE won’t work on the Mac because of Apple software security policies. It still works fine on Windows and Linux, and could possibly be used at the Sámi Parliament.

A-ITE seems to have been released as version 1.0; we will test it.

TODO:

Next meeting, closing

The next meeting is 22.03.2010, 9.30 Norwegian time.

The meeting was closed at 12:31.

Appendix - task lists for the next week

Børre

Ciprian

Maja

Sjur

Thomas

Tomi

Trond