Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

Cf. one of the following, depending on context:

Opening, agenda review, participants

Updated task status since last meeting

Børre

Ciprian

Maja

Sjur

Thomas

Tomi

Trond

Oahpa!

License and copyright discussions for Oahpa. GPL for the non-linguistic parts? What about the linguistic part? CC-BY license?

TODO

National Library

Letter is sent, we are waiting for an answer.

TODO

Corpus gathering

A large number of new corpus files from SD and the government have been added by Børre using his atom-reader script. All of them are bilingual.

Trond has added two files to orig/nob/facta/depts/: NOU200120010035000DDDPDFA.pdf and NOU199419940021000DDDPDFA.pdf. They are linked from the relevant files in sme/facta/depts/ (NAC_2001_35.pdf.xsl and NAC_1994_21.pdf.xsl), but the nob files have no .xsl files themselves.

TODO:

Promoting Divvun

TODO:

Future plans, directions and ideas

See a separate document in plan/strat/5year.jspwiki.

Northern areas project

We got financing for the kick-off / intro seminar: 218 000 NOK. We will probably arrange the seminar in February. The keyboard project was not funded and needs to be part of the real project instead.

TODO:

Infrastructure

Out of the box experiences:

Updated corpus online

See Ciprian's document about the corpus content in $GTPRIV/plan/corpus/oslo_corpus_update_todo.txt.

Issues:

facta$convert2xml.pl --nolog --corpdir=/usr/local/share/corp L1allOrt.correct.txt

Error message:

sh: /home/sjur/gtmain/gt/script/text_cat: No such file or directory
L1allOrt.correct.txt: ERROR errors in /home/sjur/gtmain/gt/script/text_cat -q \
   -x -d /home/sjur/gtmain/gt/script/LM "/usr/local/share/corp/tmp/L1allOrt.correct.txt.tmp0":

text_cat isn't part of our repository, so we need to add it; we are using a modified version of the original file. It should be added with a short README and some license info pointing to the original.
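A pre-flight check along the following lines would make this kind of failure easier to diagnose. This is only a sketch: the function names `find_tool` and `check_text_cat` are hypothetical, and the path is the one shown in the error message above, not a confirmed install location.

```python
import shutil

# Path taken from the error message above; adjust to the local checkout.
TEXT_CAT = "/home/sjur/gtmain/gt/script/text_cat"


def find_tool(path_or_name):
    """Return the path of an executable, or None if it cannot be run.

    shutil.which accepts either a bare command name (searched on PATH)
    or a path containing a directory (checked directly).
    """
    return shutil.which(path_or_name)


def check_text_cat(path=TEXT_CAT):
    """Return a human-readable status line for the text_cat helper."""
    if find_tool(path) is None:
        return f"text_cat not found or not executable: {path}"
    return f"text_cat found: {path}"


print(check_text_cat())
```

Running such a check before the conversion step reports the missing helper directly instead of surfacing it as a shell error.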

TODO:

Corpus infra remake

TODO:

Corpus interface

This depends on the infrastructure cleanup.

TODO:

Makefile + tag simplification

In the lexc files, use +Err/Sub:W5; in twolc, use W5:0 as the anchor for the sloppy rules (Tomi rules). Please rename the rules to something more descriptive.

TODO:

HFST compilation

Our HFST transducers still contain all the extra tags resulting from the tag conversion (cf. above). The Apertium versions do not, however. The main problem here is the forking of smi transducer building and maintenance: one branch in Giellatekno, and another in Apertium. This is not good for future maintenance and should be corrected ASAP. The tag-removing make steps should be incorporated in the Giellatekno svn.

Status:

TODO:

General list

To accommodate future enhancements in different directions (in rough order of importance):

  1. test bench for all parts of our language technology efforts
    1. test bench enhanced, but not yet complete
  2. improve Forrest i18n support with static sites
  3. reorganise the documentation:
    1. differentiate between target groups
    2. get better grouping
    3. decide what to write in Forrest and what in wiki (cf. Apertium and [http://xixona.dlsi.ua.es/apertium/] for a similar split)
    4. update/add missing parts
  4. migrate lexc lexicons to XML, splitting the task
    1. Name lexica (the Name project)
    2. Dictionaries (already in XML, task is to integrate them)
    3. At least migrate the lexc open POSes (Komi as a pilot case)
  5. change the look of the documentation web
  6. corpus content moved to Max Planck repositories? Norsk språkbank?
  7. update infrastructure to allow content-restricted spellers for special target groups

TODO:

Linguistics

North Sámi

There are a lot of hyphens found in noun paradigms. They should not be there. Trond and Thomas will look into them.

TODO:

Lule Sámi

TODO:

South Sámi

TODO:

Name lexicon/risten.no infrastructure

Most of the items that used to be listed here have now been moved to the new risten2 project, scheduled for next year.

TODO:

Dictionaries

Net-/browser-based interface

Ciprian is compiling a new set of dictionaries.

TODO:

Mobile

WeDict dictionary client for iOS (uses stardict dictionary files): [http://app.weiphone.com/wedict/]

iPod/iPhone has (almost) no problem with Unicode text input (but users HAVE to know how to use Unicode as input). Two letters are missing, ŧ and ŋ; the other ones are there. You may use the Serbo-Croat or QWERTY keyboard when writing North Sámi on an iPod.

There is one possible issue: [http://code.google.com/p/wedictpro/issues/detail?id=2]

TODO:

Desktop

StarDict is useless for Cyrillic languages, mainly because of the scanning function (i.e. point and look up). We need to find an alternative to StarDict. Also, the Kildin Sámi users have different needs from those we had in mind, and they don't have access to a Kildin Sámi keyboard in Windows.

Released:

Content

TODO:

General

Other things dictionary-related:

TODO:

Proofing tools

Spelling feedback from Malta:

South Sámi

The sma speller is not compiling correctly, and hasn't been for the last couple of weeks. The diagnosis is still open, but it points to the regex filters.

TODO:

HFST- and Voikko-based proofing tools

TODO:

Speller bugs

List of bugs returned from Polderland:

Tag reordering for abbreviations has caused a lot of problems:

smj:
hr.
hr.	hr+ABBR+Acc
cand.philol.
cand.philol.	cand.philol+ABBR+N+Acc
Per
Per	Per+N+Prop+Mal+Sg+Attr

sme:
hr.
hr.	hr+N+ABBR+Acc
Per
Per	Per+N+Prop+Mal+Sg+Attr
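The discrepancy in the listings above (smj gives `hr+ABBR+Acc`, sme gives `hr+N+ABBR+Acc`) can be detected mechanically. The following is a minimal sketch, assuming lookup-style output where analysis lines have the form `surface<TAB>lemma+Tag+Tag`; the function names are hypothetical and the data is copied from the listings above.

```python
def parse_analyses(text):
    """Map each surface form to the list of tag tuples found for it."""
    result = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue  # bare input lines carry no analysis
        surface, analysis = line.split("\t", 1)
        _lemma, *tags = analysis.split("+")
        result.setdefault(surface, []).append(tuple(tags))
    return result


def tag_differences(a, b):
    """Return the forms present in both mappings whose tag sequences differ."""
    shared = sorted(a.keys() & b.keys())
    return {form: (a[form], b[form]) for form in shared if a[form] != b[form]}


SMJ_OUTPUT = "\n".join([
    "hr.",
    "hr.\thr+ABBR+Acc",
    "cand.philol.",
    "cand.philol.\tcand.philol+ABBR+N+Acc",
    "Per",
    "Per\tPer+N+Prop+Mal+Sg+Attr",
])

SME_OUTPUT = "\n".join([
    "hr.",
    "hr.\thr+N+ABBR+Acc",
    "Per",
    "Per\tPer+N+Prop+Mal+Sg+Attr",
])

diff = tag_differences(parse_analyses(SMJ_OUTPUT), parse_analyses(SME_OUTPUT))
for form, (smj_tags, sme_tags) in diff.items():
    print(form, "smj:", smj_tags, "sme:", sme_tags)
```

Only forms analysed in both languages are compared, so `cand.philol.` is ignored here because it appears only in the smj listing.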

Open issues based on test results:

sme

Version: Davvisámi, version 1.2, 2009-09-18

smj

Version: Julevsáme, version 1.2, 2009-09-20

TODO:

Hyphenator bugs

Open issues based on test results:

sme

Lexicon version: Davvisámi, version 1.2, 2009-09-18

No known issues!

smj

Lexicon version: Julevsáme, version 1.2, 2009-09-20

sma

Command to test the hyphenator:

preprocess dev/corp/pressemelding.txt | lookup bin/hyph-sma.fst | cut -f2 | \
lookup bin/hyphrules-sma.fst | grep -v '^$' | cut -f2 | uniq | see
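The final filtering stages of this pipeline (`grep -v '^$' | cut -f2 | uniq`) can be mirrored in Python if the output is to be checked from a test script. This is a sketch under stated assumptions: the function name is hypothetical, the sample lines are artificial placeholders rather than real hyphenator output, and the `lookup` stages themselves are not reproduced.

```python
from itertools import groupby


def filter_lookup_output(lines):
    """Mimic `grep -v '^$' | cut -f2 | uniq` on lookup-style lines."""
    # grep -v '^$': drop empty lines
    non_empty = (line for line in lines if line != "")
    # cut -f2: keep the second tab-separated field; like cut,
    # pass through lines that contain no tab unchanged
    fields = (
        line.split("\t")[1] if "\t" in line else line
        for line in non_empty
    )
    # uniq: collapse runs of adjacent identical lines
    return [key for key, _group in groupby(fields)]


# Artificial placeholder data, not real South Sámi hyphenations.
sample = ["abc\ta-bc", "", "abc\ta-bc", "def\tde-f"]
print(filter_lookup_output(sample))
```

Like `uniq`, this only collapses adjacent duplicates; a form that recurs later in the text is reported again.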

TODO:

Installer changes

TODO:

User documentation

TODO:

1.2 release

Content:

2.0 release

Content:

Nordplus Språk proofing test bench project

Arno Teigseth is hired for the summer. He is based in Ecuador, 7 hours behind Norway. He is working on a Quechua speller (using Hunspell) and dictionary.

A goal is to use texts from parallel domains, in order to be able to compare nob and sme. We start out with 100000 words, 5000 for each genre:

  1. Discussion fora on the net
  2. Blogs

Discussion fora can be problematic: they contain a lot of very oral language, with deliberately oral or dialectal forms. Such texts are not typical of what people would want to spell-check, so the genre is of limited use for the purpose at hand.

Possible other genres:

  1. Minutes from local associations
  2. Pupil texts - look for the home pages of the schools
  3. exam texts by 10th grade pupils, where the pupils have published the texts themselves
  4. skrivebua - Nordland fylkeskommune: all three Sámi languages

Text to speech

There will be regular meetings in this project from now on, every second week.

TODO:

Machine Translation

Francis wants to release the sme-nob MT, but the release requires testing first. There is an established testing procedure.

TODO

All languages

TODO:

finsme

Technical problems with compounds in Finnish: the parts of a compound are separated by a space character (U+0020), which creates problems in the transfer.

TODO

smenob

Needed: lexicon completion and grammatical transfer rules.

smesmj

New worker underway.

smesma

Infrastructure in place, to demo in Trondheim to encourage more work and projects.

CAT

Ciprian has updated the translation memory based on our parallel corpus. It is based on 923 parallel files. It needs to be tested, but it looks promising.

TODO:

SMA seminar in August/September followup

TODO:

Other

Start to make yearly reports

Directory in svn in place, now we only need to fill it with content…

Thursday inhouse seminar

10 AM Norwegian time.

Topic for this week: hfst + spelling + ocr/typos (Sjur, perhaps Lene)

Next time suggestion list:

  1. introduction to xslt - Ciprian to start out
    1. relevant xslt issues:
    2. basic principles of xslt …
    3. sorting in xslt … (have a look at the dictionary sort xslt script)
    4. converting from one xml format to another with xslt (sugg: convert from DivvunGT dictionary dtd to MacDict xml)

Future seminars:

  1. XQuery
  2. More XML (needs concretisation)
  3. UML
  4. other suggestions?

Fall planning

Topics:

Dates:

Next meeting, closing

The next meeting is 11.10.2010, 09:30 Norwegian time.

The meeting was closed at 10:53.

Appendix - task lists for the next week

Børre

Ciprian

Maja

Sjur

Thomas

Tomi

Trond