Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

Cf. one of the following, depending on context:

Opening, agenda review, participants

Updated task status since last meeting

Børre

on vacation

Ciprian

Maja

Sjur

Thomas

Tomi

Trond

Oahpa!

See notes under SMA seminar in Trondheim further down.

TODO

Corpus gathering

National Library: there is a meeting in September, to which Trond is going. Trond will request a couple of Sámi OCR-ed texts (together with the nob texts we already have), and use them as test cases for spell-checking and (automatically) correcting the OCR-ed texts.

This work should be incorporated in the test bench project, and serve as a basis for developing better nob (and potentially nno) morphologies, to be used in our different projects. The main project, developing an OCR-adapted Norwegian spell-checker (as open source), should be financed by the Language bank.

There are starting points both in our own svn and in Apertium, both based on the lexicons from UiO/Norsk ordbank.

Our main goal with this would be to help the Language bank and the National Library provide us with usable Sámi (and possibly Norwegian) corpus material, but there are a number of other secondary goals and positive side effects of this as well.

TODO:

Promoting Divvun

Rune Fjelheim suggested last week that we should start a separate project to inform about, teach about and install our language technology tools (not only Divvun). This would also make a very good feedback channel.

TODO:

Future plans, directions and ideas

See a separate document in plan/strat/5year.jspwiki.

Northern areas project

First major obstacle: make working keyboards and fonts. In principle, there are two approaches to keyboards:

  1. Make the Russian-based keyboards work
  2. Design new Kildin-optimized keyboards

The first goal has priority. We should have text analysis in order to consider whether the second goal is feasible. Trond to follow up test machines.

Write trustworthy and detailed documentation (in Russian)

What we know:

TODO:

Infrastructure

Out of the box experiences:

Updated corpus online

See Ciprian's document about the corpus content in $GTPRIV/plan/corpus/oslo_corpus_update_todo.txt.

Issues:

facta$ convert2xml.pl --nolog --corpdir=/usr/local/share/corp L1allOrt.correct.txt

Error message:

sh: /home/sjur/gtmain/gt/script/text_cat: No such file or directory
L1allOrt.correct.txt: ERROR errors in /home/sjur/gtmain/gt/script/text_cat -q \
   -x -d /home/sjur/gtmain/gt/script/LM "/usr/local/share/corp/tmp/L1allOrt.correct.txt.tmp0":

text_cat isn’t part of our repository, so we need to add it. We are using a modified version of the original file. It should be added with a short README and some license info, pointing to the original.
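The failure above only surfaces once the shell tries to run text_cat in the middle of a conversion. A minimal sketch of a pre-flight check that could run before any conversion starts is given here; the function name `missing_helpers` is hypothetical and not part of convert2xml.pl, and the path in the usage example is simply the one from the error message.

```python
import os


def missing_helpers(paths):
    """Return the helper paths that are absent or not executable.

    Hypothetical helper: the idea is to fail early with a clear message
    instead of hitting "No such file or directory" during conversion.
    """
    missing = []
    for path in paths:
        # A usable helper must be a regular file with the execute bit set.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(path)
    return missing


if __name__ == "__main__":
    # Path taken from the error message above; adjust to the local checkout.
    required = ["/home/sjur/gtmain/gt/script/text_cat"]
    for path in missing_helpers(required):
        print("Missing helper: " + path)
```

Running the check once at start-up would report every missing helper in one go, rather than one failure per converted file.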

TODO:

Corpus infra remake

Børre has worked on improving the convert2xml.pl script.

TODO:

License

TODO:

Corpus interface

This depends on the infrastructure cleanup.

TODO:

Makefile + tag simplification

Problems with the proofing tools compilation, now solved.

TODO:

  1. test latest proofing tools, compare results with previous version (Tomi)
    1. done long ago
  2. write new build commands (Sjur, Tomi)
    1. make new targets in parallel to the old ones, not by remaking them
    2. use a prefix or suffix to make the new targets easily identifiable during the rewrite phase. We’ll remove the prefix/suffix as soon as everything is working fine.
      1. a first version committed, prefix is NEW-*
  3. test the new build commands (Sjur, Trond, Lene, Thomas)
    1. command: make TARGET=sme NEW-fst etc.
    2. test by comparing the output of the old and new analysers - it should be identical. Test both analysis and generation (i.e. all your favourite use cases)
  4. when the new build infrastructure works as it should, delete the old ones (Sjur, Tomi)
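The comparison in step 3 can be sketched as follows, assuming the output of the old and new analysers has been saved as text in the tab-separated lookup format shown elsewhere in these notes (word form, tab, analysis). The function names are hypothetical; this is one possible way to do the check, not an existing script in the repository.

```python
from collections import defaultdict


def parse_lookup(text):
    """Map each word form to the set of analyses found in lookup-style output."""
    analyses = defaultdict(set)
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            # Blank separator lines and lines without an analysis are skipped.
            continue
        analyses[parts[0]].add(parts[1])
    return dict(analyses)


def diff_analyses(old_text, new_text):
    """Return {form: (only_in_old, only_in_new)} for forms whose analyses differ."""
    old, new = parse_lookup(old_text), parse_lookup(new_text)
    diffs = {}
    for form in sorted(set(old) | set(new)):
        only_old = old.get(form, set()) - new.get(form, set())
        only_new = new.get(form, set()) - old.get(form, set())
        if only_old or only_new:
            diffs[form] = (sorted(only_old), sorted(only_new))
    return diffs
```

An empty result means the two analysers agree on every word form in the test input. Comparing sets rather than raw lines keeps the check insensitive to the order in which analyses are printed.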

HFST compilation

Our HFST transducers still contain all the extra tags resulting from the tag conversion (cf. above). The Apertium versions do not, however. The main problem here is the forking of smi transducer building and maintenance: one branch in Giellatekno, and another in Apertium. This is not good for future maintenance, and should be corrected ASAP. The tag-removing make steps should be incorporated in the giellatekno svn.

General list

Meänkieli adaptations in our infrastructure.

Requirements:

Tentative task list

To accommodate future enhancements in different directions (in rough order of importance):

  1. test bench for all parts of our language technology efforts
    1. test bench enhanced, but not yet complete
  2. improve Forrest i18n support with static sites
  3. reorganise the documentation:
    1. differentiate between target groups
    2. get better grouping
    3. decide what to write in Forrest and what in wiki (cf. Apertium and http://xixona.dlsi.ua.es/apertium/ for a similar split)
    4. update/add missing parts
  4. migrate lexc lexicons to XML, splitting the task
    1. Name lexica (the Name project)
    2. Dictionaries (already in XML, task is to integrate them)
    3. At least migrate the lexc open POSes (Komi as a pilot case)
  5. change the look of the documentation web
  6. corpus content moved to Max Planck repositories? Norsk språkbank?
  7. update infrastructure to allow content-restricted spellers for special target groups

TODO:

Linguistics

North Sámi

There are a lot of hyphens found in noun paradigms. They should not be there. Trond and Thomas will look into them.

TODO:

Lule Sámi

TODO:

South Sámi

TODO:

Name lexicon/risten.no infrastructure

TODO:

Dictionaries

Ciprian has made an intelligent version without XML, to be used with A-ITE and its StarDict component. Seems to work ok, but needs more testing.

StarDict is useless for Cyrillic languages, mainly because of the scanning function (i.e. point and look-up). We need to find an alternative to StarDict. Also, the Kildin Sámi users have different needs from those we have had in mind. And they don’t have access to a Kildin Sámi keyboard in Windows.

Released:

Other things dictionary-related:

TODO:

Proofing tools

Spelling feedback from Malta:

South Sámi

Beta release: Aug 25. Contract now ok by Knowledge Concepts. Suggested timetable sent this morning.

There are no gold standard files in the correct location in the corpus repository. This needs to be fixed before we start testing the sma speller.

TODO:

HFST- and Voikko-based proofing tools

Sjur met with the HFST people yesterday. Two things are happening in parallel:

  1. hfst3 - being made ready for public release, with proper inclusion of autotools, and updated documentation
  2. speller/lookup library:
    1. speller/lookup library
    2. voikko integration of this library

TODO:

Speller bugs

List of bugs returned from Polderland:

Tag reordering for abbreviations has caused a lot of problems:

smj:
hr.
hr.	hr+ABBR+Acc
cand.philol.
cand.philol.	cand.philol+ABBR+N+Acc
Per
Per	Per+N+Prop+Mal+Sg+Attr

sme:
hr.
hr.	hr+N+ABBR+Acc
Per
Per	Per+N+Prop+Mal+Sg+Attr
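The listings above show the same abbreviation analysed with different tag sequences (for instance ABBR+Acc in smj against N+ABBR+Acc in sme). A minimal sketch for surfacing such differences groups abbreviation lemmas by their tag sequence. The function name is hypothetical, and the sketch deliberately does not assume which order is the intended one.

```python
def abbr_tag_patterns(lookup_text):
    """Group lemmas tagged +ABBR by their full tag sequence.

    Expects lookup-style output: the word form, a tab, then lemma+Tag+Tag...
    Lines without a tab (the bare input forms) are ignored.
    """
    patterns = {}
    for line in lookup_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue
        lemma, *tags = parts[1].split("+")
        if "ABBR" in tags:
            patterns.setdefault("+".join(tags), []).append(lemma)
    return patterns
```

More than one key in the result for a single language means that abbreviations are not tagged uniformly there, which is the kind of inconsistency a speller conversion step may trip over.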

Open issues based on test results:

sme

Version: Davvisámi, version 1.2, 2009-09-18

smj

Version: Julevsáme, version 1.2, 2009-09-20

TODO:

Hyphenator bugs

Open issues based on test results:

sme

Lexicon version: Davvisámi, version 1.2, 2009-09-18

No known issues!

smj

Lexicon version: Julevsáme, version 1.2, 2009-09-20

sma

Command to test the hyphenator:

preprocess dev/corp/pressemelding.txt | lookup bin/hyph-sma.fst | cut -f2 | \
lookup bin/hyphrules-sma.fst | grep -v '^$' | cut -f2 | uniq | see
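When the lookup output is captured to a file instead of being piped further, the trailing filters of the pipeline above (`grep -v '^$'`, `cut -f2`, `uniq`) can be reproduced in a few lines. This is only a sketch of those three steps with a hypothetical function name; it does not call `lookup` or `preprocess` itself.

```python
from itertools import groupby


def second_field_uniq(lookup_output):
    """Mimic: grep -v '^$' | cut -f2 | uniq on tab-separated text."""
    fields = []
    for line in lookup_output.splitlines():
        if line == "":
            # grep -v '^$' drops empty lines.
            continue
        parts = line.split("\t")
        # cut -f2 prints the whole line when it contains no tab.
        fields.append(parts[1] if len(parts) > 1 else line)
    # uniq only collapses adjacent duplicates, which groupby reproduces.
    return [key for key, _ in groupby(fields)]
```

Like `uniq`, the function keeps repeated values that are not adjacent, so the result still follows the order of the running text.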

TODO:

Installer changes

TODO:

User documentation

TODO:

1.2 release

Content:

2.0 release

Content:

Nordplus Språk proofing test bench project

Arno Teigseth is hired for the summer. He is based in Ecuador, 7 hours behind Norway. He is working on a Quechua speller (using hunspell) and dictionary.

A goal is to use texts from parallel domains, in order to be able to compare nob and sme. We start out with 100000 words, 5000 for each genre:

  1. Discussion fora on the net
  2. Blogs

Discussion fora can be problematic, as they contain a lot of very oral language with deliberate oral or dialectal forms used in the text. As such, the texts won’t be typical for the kind of texts people would like to spell-check, and thus the genre is problematic for the purpose at hand.

Possible other genres:

  1. Minutes from local associations
  2. Pupil texts - look for the home pages of the schools
  3. exam texts by 10th grade pupils, where the pupils have published the texts themselves
  4. skrivebua - Nordland fylkeskommune: all three Sámi languages

Text to speech

There will be regular meetings in this project from now on, every second week.

TODO:

Machine Translation

Background:

Kevin will work on smenob until mid August, Ryan on finsme this summer. We hope to have a student working one month on smesmj this summer. The world now knows (vaguely) that we work on MT, so we should actually do so as well. We also need to work together with the external contributors.

All languages

TODO

finsme

Needed: Tag unification and lexicon completion. Grammatical transfer rules

There are two different issues:

TODO

smenob

Needed: Lexicon completion. Grammatical transfer rules

smesmj

New worker underway.

smesma

Infrastructure in place, to demo in Trondheim to encourage more work and projects.

CAT

Ciprian has made an intelligent version of smenob in text format, for use with A-ITE. Lene would like to have an intelligent version of risten.no in the nobsme direction. Lene would like to have a script for making translation memory xml-files from bilingual texts.
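A minimal sketch of the requested translation memory script is given below, assuming TMX is the wanted XML format and that the bilingual text has already been aligned into sentence pairs (alignment itself is not covered). The function name, the `creationtool` value and the sample sentence pair are placeholders.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="sme", tgt_lang="nob"):
    """Return a TMX 1.4 document (as a string) for aligned sentence pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "tm-sketch",      # placeholder tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        # One translation unit per aligned pair, one variant per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    # Illustrative pair only, not taken from any corpus.
    print(build_tmx([("Buorre beaivi", "God dag")]))
```

Swapping the `src_lang` and `tgt_lang` arguments gives the nobsme direction without any other change.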

TODO:

SMA seminar in August/September

Theme:

TODO:

Other

Start to make yearly reports

TODO

Thursday inhouse seminar

Next time suggestion list:

  1. introduction to xslt - Ciprian to start out
    1. relevant xslt issues:
    2. basic principles of xslt …
    3. sorting in xslt … (have a look at the dictionary sort xslt script)
    4. converting from one xml format to another with xslt (sugg: convert from DivvunGT dictionary dtd to MacDict xml)

Future seminars:

  1. XQuery
  2. More XML (needs concretisation)
  3. UML
  4. other suggestions?

Summer planning

Topics:

Dates:

Summer vacations

Name     Dates
Børre    28/6-11/7, 2/8-15/8
Ciprian  4 Weeks ??? to Finish and Kárášjohka Sápmi in the week 26.7-1.8, and to Gáivuotna between 12-25.7
Maja     09.07.-09.08 (4 weeks)
Sjur     12/7-8/8
Thom     21/6-23/7
Tomi     5 (non-contiguous) weeks between 21/6-13/8
Trond    4 weeks between 5.7.-9.8. (?)
Lene     21-25.6 + 4 weeks between 12.7-15.8

Next meeting, closing

The next meeting is 16.8.2010, 09:30 Norwegian time.

The meeting was closed at 12:00.

Appendix - task lists for the next week

Børre

Ciprian

Maja

Sjur

Thomas

Tomi

Trond