Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:54.

Present: Børre, Saara, Sjur, Thomas, Trond, Tomi

Absent: Maaren

Main secretary: Tomi/Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Collecting

See a previous meeting memo for what’s to be done.

TODO: Send out the rest of the letters (Børre)

New contacts:

Odin

Olavi Korhonen’s Lule Sámi dictionary.

TODO: Børre to contact Olavi Korhonen and Kuhmunen

KIO Grafisk and the Iđut books

TODO:

Bible texts

We will get text from Finland, but have not received any yet. We have received the Swedish text from Sweden. As for the last html versions from Norway, Trond did not contact them last week.

The Swedish html has arrived, but no paratext. Norsk bibelselskap has sent neither the corrected New Testament versions for sme nor the paratext for nno/nob.

TODO:

Davvi Girji

A talk with Brita Kåven revealed that they will have a look at the contracts after Easter. She has been away; Børre will call her again today.

Min Áigi

Agreement: they will send us updates each month. Standard license.

We have problems with Unicode characters in filenames. This was solved once before, and we need to look at this again. The old Bugzilla issue should be reopened. The bug was reopened: http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=76

The Min Áigi format should be dealt with: \@ingress etc. should be handled for the .txt files, but it is business as usual for the .doc files.

Done so far: Trond and Saara? Trond made a list of tags, and Saara will make the xsl conversion routine for the typographic tags.

Saara is implementing the tag format.

TODO:

Kåfjord

Promised to send us texts. Some texts have arrived, but nothing from Ája.

TODO: Børre will contact them.

Sámi Instituhtta

Børre contacted Richard Valkeapää, the IT consultant at NSI. He put it on his todo list, as he has to contact the person who has worked with the newspaper texts anyway. He said this would be done in the near future (within a month).

5. Corpus infrastructure

https://giellalt.uit.no/lang/corp/corpus-summary.html

TODO:

Changes and updates because of the Divvun public tender

User account admin and infra: see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO: see above under Documentation.

Automatic build of the content of our corpus repo: also see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO:

Name change again? Trond has come up with a new suggestion:

gt -> gtbound/
gtbound -> bound/
gtfree -> free/

NB! keep symlinks from the old names to the new names for now.
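
The renaming above has an ordering constraint: gtbound must be moved to bound/ before gt can take over the name gtbound. A minimal sketch of how this could be scripted follows. The function names and the collision handling are assumptions for illustration, not an agreed procedure. Note that the old name gtbound is reused as a new name, so it cannot also be kept as a symlink; the sketch only links old names that are free after the renames.

```python
import os
import tempfile

# Mapping taken from the memo: old name -> new name.
RENAMES = {"gt": "gtbound", "gtbound": "bound", "gtfree": "free"}


def ordered_renames(renames):
    """Order renames so a target name is vacated before it is reused."""
    pending = dict(renames)
    ordered = []
    while pending:
        # A rename is safe once its target is not the source of a pending rename.
        ready = [old for old, new in pending.items() if new not in pending]
        if not ready:
            raise ValueError("cyclic renames need a temporary name")
        for old in sorted(ready):
            ordered.append((old, pending.pop(old)))
    return ordered


def apply_renames(root, renames):
    """Rename directories under root, then symlink the old names that are free."""
    for old, new in ordered_renames(renames):
        os.rename(os.path.join(root, old), os.path.join(root, new))
    linked = []
    for old, new in renames.items():
        old_path = os.path.join(root, old)
        # An old name reused as a new name (gtbound) cannot also be a symlink.
        if not os.path.lexists(old_path):
            os.symlink(new, old_path)  # relative link within the same parent
            linked.append(old)
    return linked


if __name__ == "__main__":
    # Try it on a scratch directory, never directly on the real repository.
    with tempfile.TemporaryDirectory() as root:
        for name in RENAMES:
            os.mkdir(os.path.join(root, name))
        print(apply_renames(root, RENAMES))
        print(sorted(os.listdir(root)))
```

The sketch uses relative symlinks so that the tree can be moved as a whole; whether that suits the actual corpus layout is an open question.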

Free and non-free texts

More info in a [previous meeting memo.|/admin/weekly/2006/Meeting_2006-03-13.html]

TODO:

More texts to the graphical corpus interface:

TODO:

Top priorities:

  1. Trond to discuss more with Lars on tag unification, and unify them.
  2. Tomi to change ccat to be able to create the right input for the corpus analysis.
  3. Lars to add text to the server.

Language recognition

TODO:

6. Infrastructure

Paradigm generation

Greenland’s language secretariat has a paradigm generator based upon Xerox tools. We have asked for their source code, and will get it (site in Greenlandic, English and Danish).

Try e.g. illu “house”, aput “snow” (or any word in st/kal/src/noun-kal-lex.txt)

TODO:

Aligner

TODO:

Hyphenator

Trond and Thomas have been updating the propernoun file with ^ tags. We need the tag in front of compound parts beginning in a vowel or in two or more consonants. Compound parts beginning with one consonant are handled correctly.
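
The tagging rule described above can be sketched as follows. This is only an illustration: the vowel inventory, the function names and the example compounds are assumptions, and the real propernoun file format is not modelled here.

```python
# Assumed vowel inventory; the hyphenator's real alphabet may differ.
VOWELS = set("aáeiouyæøåäö")


def leading_consonants(part):
    """Count the consonant letters before the first vowel of a compound part."""
    count = 0
    for ch in part.lower():
        if ch in VOWELS:
            break
        count += 1
    return count


def needs_boundary_tag(part):
    """True when the part starts with a vowel or with two or more consonants."""
    return leading_consonants(part) != 1


def tag_compound(parts):
    """Join compound parts, putting ^ before non-initial parts that need it."""
    out = [parts[0]]
    for part in parts[1:]:
        out.append(("^" if needs_boundary_tag(part) else "") + part)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical compounds, chosen only to exercise the three cases.
    for parts in (["Stuorra", "johka"], ["Stuorra", "eatnu"], ["Stuorra", "skuvla"]):
        print(tag_compound(parts))
```

Parts beginning with exactly one consonant are left untagged, since the memo states that the hyphenator already handles them correctly.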

TODO:

7. Linguistics

General - hyphenation

See discussion, open questions and decisions in the [previous meeting memo.|/admin/weekly/2006/Meeting_2006-04-03.html]

We overtagged sme: V^CV was tagged even where the rule-based hyphenator already gave the right result. For sme: find the default, and tag only the exceptions (issue: VCsCV, VsCV).

TODO:

North Sámi

There are some heavy bugs (11 sme bugs all in all):

We should have some linguistic workshops while Maaren is here.

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

TODO:

  1. refactor and prepare risten.no for multiple collections:
    1. refactor the code into more and more specific components according to our folder hierarchy (Tomi, Sjur)
      1. done for editors, some open issues for regular searches
  2. write down the most common editing scenarios (adding / changing), to be used as guides for making the editing interface (Trond, Tomi)
    1. Done.
  3. develop the needed XQueries and interface (Sjur, Tomi)
    1. developing
  4. data synchronisation between risten.no and the cvs repo (Tomi)
    1. nothing this week
  5. test and review when ready
  6. Rethink the double-tagging procedure for names, consider grammatically motivated semtag conversion routines (“Helsinki” from Plc to Obj to Org, or the Lynndie England issue) (Trond)

9. Spellers

Nothing until the new proper noun lexicon is in place. We don’t have enough people to do both. Here are our most important targets for spellers in the near future:

10. Public tender

TODO:

11. Other

Summer vacation

Who      When
Børre    ?
Linda    ?
Maaren   ?
Saara    July
Sjur     ?
Thomas   3.7 - 7.8
Trond    July
Tomi     8.7 - 16.7, more?

Bug fixing

45 open Divvun/Disamb bugs, and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with bug 279 (Perl locale). Not much help… Saara will contact Roy on this issue.

After the corpus issues have been somewhat settled, we should do a bug barnraising. … and then a new one after the name lexicon is fixed.

Move to victorio

Trond’s victorio problem:

sme:
FirstTag...1, ProperNoun...10000...
        - Warning:  Ignoring info strings.
20000...30000...flex scanner jammed
make: *** [sme/bin/sme.save] Error 2

smj:
Reading from 'smj/src/adv-smj-lex.txt'
adv...1, Adverb...1064

Reading from 'smj/src/noun-smj-lex.txt'
NounRoot...flex scanner jammed
make: *** [smj/bin/smj.save] Error 2

sma and kal compile without jamming.

Saara compiles sme without problems, but has problems with smj. Conclusion: it is a source code problem. Tomi and Børre are compiling smj just fine.

Gobby

Gobby 0.3 is working fine on Mac, Linux and Windows. It should be installed on all computers, cf. [http://darcs.0x539.de/trac/obby/cgi-bin/trac.cgi/wiki/InstallationGuide] (our preinstalled Xcode version is 2.0; it must be 2.1).

Trond should ask Lars Nygård, Per Langgård and Tero Avellan to install Gobby as well.

12. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

13. Next meeting, closing

22.05.2006 09:30

Closed at 11:30