Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

Cf. one of the following, depending on context:

Opening, agenda review, participants

Updated task status since last meeting

Børre

Ciprian

Maja

Sjur

Thomas

Tomi

Trond

Oahpa!

Nothing new, apart from more input from Sweden. The German version is closer to completion than the Swedish one, but is still not complete.

TODO:

National Library

There was a meeting on September 23, which Trond and Sjur were to attend. Trond will request a couple of OCR-ed Sámi texts (together with the nob texts we already have), and use them as test cases for spell-checking and automatically correcting OCR-ed text.

What we need now is free North Sami - Norwegian parallel text, because of Autshumato.

Suggest a project to the National Library (NB) that concentrates on Sámi text. There is a document about this somewhere in plan.

TODO:

Corpus gathering

TODO:

Promoting Divvun

TODO:

Future plans, directions and ideas

See a separate document in plan/strat/5year.jspwiki.

Northern areas project

First major obstacle: make working keyboards and fonts. In principle, there are two approaches to keyboards:

  1. Make the Russian-based keyboards work
  2. Design new Kildin-optimized keyboards

The first approach has priority. We need text analysis before we can judge whether the second approach is feasible. Trond will follow up on the test machines.

Write trustworthy and detailed documentation (in Russian)

What we know:

TODO:

Infrastructure

Out of the box experiences:

Updated corpus online

See Ciprian’s document about the corpus content in $GTPRIV/plan/corpus/oslo_corpus_update_todo.txt.

Issues:

facta$ convert2xml.pl --nolog --corpdir=/usr/local/share/corp L1allOrt.correct.txt

Error message:

sh: /home/sjur/gtmain/gt/script/text_cat: No such file or directory
L1allOrt.correct.txt: ERROR errors in /home/sjur/gtmain/gt/script/text_cat -q \
   -x -d /home/sjur/gtmain/gt/script/LM "/usr/local/share/corp/tmp/L1allOrt.correct.txt.tmp0":

text_cat isn’t part of our repository, so we need to add it. We are using a modified version of the original file; it should be added with a short README and some license info pointing to the original.
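Until text_cat is in the repository, a conversion run can check for it up front instead of failing midway. The following is only a sketch: the function name `missing_helpers` is made up for illustration and is not part of convert2xml.pl, and the demonstration uses a temporary directory rather than a real checkout.

```python
import os
import tempfile


def missing_helpers(script_dir, helpers=("text_cat",)):
    """Return the helper names that are absent or not executable in script_dir."""
    missing = []
    for name in helpers:
        path = os.path.join(script_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing


# Demonstration in a throwaway directory (assumption: stands in for gt/script).
with tempfile.TemporaryDirectory() as demo_dir:
    before_fix = missing_helpers(demo_dir)  # text_cat has not been added yet

    helper = os.path.join(demo_dir, "text_cat")
    with open(helper, "w") as handle:
        handle.write("#!/bin/sh\n")  # placeholder content, not the real script
    os.chmod(helper, 0o755)

    after_fix = missing_helpers(demo_dir)

print("missing before:", before_fix)
print("missing after:", after_fix)
```

A check like this could run before the conversion starts, so that a missing helper is reported once with a clear message.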

TODO:

Corpus infra remake

TODO:

Corpus interface

This depends on the infrastructure cleanup.

TODO:

Makefile + tag simplification

First part done. All linguists need to test the new transducer to see that we get what we want. Most important: the descriptive/normative border must be as it used to be. Tomi has been testing the norm/desc FSTs with the xfst minus operator. We now also need to compare actual output from the two FSTs between the before and after versions.
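The before/after comparison of actual output could be sketched roughly as follows. This assumes lookup-style output with one tab-separated "wordform, analysis" pair per line; the function names and the sample strings are illustrative only and are not taken from a real test run.

```python
from collections import defaultdict


def parse_lookup(text):
    """Parse lookup-style output into {wordform: set of analyses}."""
    analyses = defaultdict(set)
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # blank separator lines or lines without an analysis
        analyses[parts[0]].add(parts[1])
    return dict(analyses)


def diff_analyses(before, after):
    """Return {wordform: (lost, gained)} for every word whose analyses changed."""
    changes = {}
    for word in sorted(set(before) | set(after)):
        old = before.get(word, set())
        new = after.get(word, set())
        if old != new:
            changes[word] = (old - new, new - old)
    return changes


# Illustrative sample data (assumption: not real transducer output).
BEFORE = "hr.\thr+ABBR+Acc\n\nPer\tPer+N+Prop+Mal+Sg+Attr\n"
AFTER = "hr.\thr+N+ABBR+Acc\n\nPer\tPer+N+Prop+Mal+Sg+Attr\n"

changes = diff_analyses(parse_lookup(BEFORE), parse_lookup(AFTER))
for word, (lost, gained) in changes.items():
    print(word, "lost:", sorted(lost), "gained:", sorted(gained))
```

Running the same word list through the old and new transducers and diffing the results this way shows only the words whose analyses changed, which keeps the manual review short.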

TODO:

HFST compilation

Our HFST transducers still contain all the extra tags resulting from the tag conversion (cf. above). The Apertium versions, however, do not. The main problem here is the forking of smi transducer building and maintenance: one branch in Giellatekno, and another in Apertium. This is not good for future maintenance and should be corrected ASAP. The tag-removing make steps should be incorporated in the Giellatekno svn.

Status:

TODO:

General list

To accommodate future enhancements in different directions (in rough order of importance):

  1. test bench for all parts of our language technology efforts
    1. test bench enhanced, but not yet complete
  2. improve Forrest i18n support with static sites
  3. reorganise the documentation:
    1. differentiate between target groups
    2. get better grouping
    3. decide what to write in Forrest and what in wiki (cf. Apertium and [http://xixona.dlsi.ua.es/apertium/] for a similar split)
    4. update/add missing parts
  4. migrate lexc lexicons to XML, splitting the task
    1. Name lexica (the Name project)
    2. Dictionaries (already in XML, task is to integrate them)
    3. At least migrate the lexc open POSes (Komi as a pilot case)
  5. change the look of the documentation web
  6. corpus content moved to Max Planck repositories? Norsk språkbank?
  7. update infrastructure to allow content-restricted spellers for special target groups

TODO:

Linguistics

North Sámi

There are a lot of hyphens found in noun paradigms. They should not be there. Trond and Thomas will look into them.

TODO:

Lule Sámi

TODO:

South Sámi

TODO:

Name lexicon/risten.no infrastructure

Most of the items that used to be listed here have now been moved to the new risten2 project, scheduled for next year.

TODO:

Dictionaries

Net/browser-based interface

Ciprian has made a new online/offline dictionary interface based on Apertium tools. See [http://gtsvn.uit.no/webdict/index.html].

TODO:

Mobile

WeDict dictionary client for iOS (uses stardict dictionary files): [http://app.weiphone.com/wedict/]

iPod/iPhone has (almost) no problem with Unicode text input (but users HAVE to know how to enter Unicode characters). Two letters are missing, ŧ and ŋ; the others are there. You may use the Serbo-Croatian or QWERTY keyboard when writing North Sámi on an iPod.

There is one possible issue: [http://code.google.com/p/wedictpro/issues/detail?id=2]

TODO:

Desktop

StarDict is useless for Cyrillic languages, mainly because of the scanning function (i.e. point and look up). We need to find an alternative to StarDict. Also, the Kildin Sámi users have different needs from the ones we have had in mind, and they don’t have access to a Kildin Sámi keyboard in Windows.

Released:

Content

TODO:

General

Other things dictionary-related:

TODO:

Proofing tools

Spelling feedback from Malta:

South Sámi

The sma speller is not compiling properly, and hasn’t been for the last couple of weeks. The diagnosis is still open, but points to the regex filters.

TODO:

HFST- and Voikko-based proofing tools

TODO:

Speller bugs

List of bugs returned from Polderland:

Tag reordering for abbreviations has caused a lot of problems:

smj:
hr.
hr.	hr+ABBR+Acc
cand.philol.
cand.philol.	cand.philol+ABBR+N+Acc
Per
Per	Per+N+Prop+Mal+Sg+Attr

sme:
hr.
hr.	hr+N+ABBR+Acc
Per
Per	Per+N+Prop+Mal+Sg+Attr
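One way to make such inconsistencies visible is to group the abbreviation analyses by the relative order of the tags involved. The sketch below uses the analyses from the listing above as sample input; the function name `abbr_patterns` and the choice to watch only the ABBR and N tags are assumptions for illustration, and the code does not decide which order is the intended one.

```python
from collections import defaultdict

# Analyses copied from the listing above (smj and sme merged).
SAMPLE = (
    "hr.\thr+ABBR+Acc\n"
    "cand.philol.\tcand.philol+ABBR+N+Acc\n"
    "hr.\thr+N+ABBR+Acc\n"
    "Per\tPer+N+Prop+Mal+Sg+Attr\n"
)


def abbr_patterns(text, watched=("ABBR", "N")):
    """Group analyses containing +ABBR by the order of the watched tags."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if "\t" not in line:
            continue  # echo of the input word, no analysis on this line
        _wordform, analysis = line.split("\t", 1)
        tags = analysis.split("+")[1:]  # drop the lemma
        if "ABBR" not in tags:
            continue  # not an abbreviation analysis
        pattern = tuple(tag for tag in tags if tag in watched)
        groups[pattern].append(analysis)
    return dict(groups)


patterns = abbr_patterns(SAMPLE)
for pattern, found in patterns.items():
    print("+".join(pattern), "->", found)
```

More than one pattern in the result means that the abbreviation tags are not ordered uniformly across the analysed forms.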

Open issues based on test results:

sme

Version: Davvisámi, version 1.2, 2009-09-18

smj

Version: Julevsáme, version 1.2, 2009-09-20

TODO:

Hyphenator bugs

Open issues based on test results:

sme

Lexicon version: Davvisámi, version 1.2, 2009-09-18

No known issues!

smj

Lexicon version: Julevsáme, version 1.2, 2009-09-20

sma

Command to test the hyphenator:

preprocess dev/corp/pressemelding.txt | lookup bin/hyph-sma.fst | cut -f2 | \
lookup bin/hyphrules-sma.fst | grep -v '^$' | cut -f2 | uniq | see

TODO:

Installer changes

TODO:

User documentation

TODO:

1.2 release

Content:

2.0 release

Content:

Nordplus Språk proofing test bench project

Arno Teigseth has been hired for the summer. He is based in Ecuador, 7 hours behind Norway. He is working on a Quechua speller (using hunspell) and dictionary.

A goal is to use texts from parallel domains, in order to be able to compare nob and sme. We start out with 100000 words, 5000 for each genre:

  1. Discussion fora on the net
  2. Blogs

Discussion fora can be problematic, as they contain a lot of very oral language with deliberate oral or dialectal forms used in the text. As such, the texts won’t be typical for the kind of texts people would like to spell-check, and thus the genre is problematic for the purpose at hand.

Possible other genres:

  1. Minutes from local associations
  2. Pupil texts - look for the home pages of the schools
  3. exam texts by 10th grade pupils, where the pupils have published the texts themselves
  4. skrivebua - Nordland fylkeskommune: all three Sámi languages

Text to speech

There will be regular meetings in this project from now on, every second week.

TODO:

Machine Translation

Francis wants to release the sme-nob MT, but the release requires testing first. There is an established testing procedure.

TODO:

All languages

TODO:

finsme

Technical problems with compounds in Finnish: the parts of a compound are separated by a space character (U+0020), which creates problems in the transfer.

TODO:

smenob

Needed: lexicon completion and grammatical transfer rules.

smesmj

New worker underway.

smesma

Infrastructure is in place, and will be demoed in Trondheim to encourage more work and projects.

CAT

Ciprian has made a translation memory based on our parallel corpus. The test looks promising, and he has identified certain issues. We need to be clear about the license of the original parallel texts, and we need to know the original language (i.e. the translation direction).

TODO:

SMA seminar in August/September followup

TODO:

Other

Start to make yearly reports

Directory in svn in place, now we only need to fill it with content…

Thursday inhouse seminar

10 AM Norwegian time.

Topic for this week: hfst + spelling + ocr/typos (Sjur, perhaps Lene)

Next time suggestion list:

  1. introduction to xslt - Ciprian to start out
    1. relevant xslt issues:
    2. basic principles of xslt …
    3. sorting in xslt … (have a look at the dictionary sort xslt script)
    4. converting from one xml format to another with xslt (sugg: convert from DivvunGT dictionary dtd to MacDict xml)

Future seminars:

  1. XQuery
  2. More XML (needs concretisation)
  3. UML
  4. other suggestions?

Fall planning

Topics:

Dates:

Next meeting, closing

The next meeting is 4.10.2010, 09:30 Norwegian time.

The meeting was closed at 11:49.

Appendix - task lists for the next week

Børre

Ciprian

Maja

Sjur

Thomas

Tomi

Trond