Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

The meeting was delayed due to the project board having a telephone meeting at our regular meeting time.

Opened at 13:17.

Present: Sjur, Thomas, Børre, Tomi

Absent: Maaren, Saara, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Collecting

See a previous meeting memo for what’s to be done.

TODO:

New contracts:

Olavi Korhonen’s Lule Sámi dictionary.

Talked with him, sent him the contract.

TODO:

KIO Grafisk and the Iđut books

TODO:

Bible texts

TODO:

Davvi Girji

Called Brita Kåven. She can’t say anything about letting us have the texts until she has talked with the employees to get a work estimate on their part. Nothing was decided in their previous meeting. Davvi Girji seems generally very positive, but has problems finding the best way of actually delivering the texts to us.

TODO:

Min Áigi

TODO:

Kåfjord

TODO:

Sámi Instituhtta

When will we get the corpus? We don’t know; Børre will contact him again.

TODO:

Čálliid Lágádus

[http://www.calliidlagadus.org/]

TODO:

Árran

Talked to Bård Eriksen; he needs to discuss this further with his coworkers.

TODO:

5. Corpus infrastructure

General

TODO:

User accounts and access

For details, see the previous meeting memo, as well as the memo from a [dedicated meeting|/infra/corpus_policy.html].

Shell access

TODO:

Web browser access

TODO:

More texts to the graphical corpus interface:

TODO:

  1. refine xml-tagged output (Saara and Tomi)
    1. done, but it is still an open question whether it is finished
  2. add text to the server (Lars)

Aligner

More to be said about this? (certainly, but right now?)

Language recognition

Still waiting for more smj text to improve it.

Corpus summary

Forrest goes into an endless loop when processing these files. It happens when converting the XML to Forrest format. More info in our Bugzilla.

It is now implemented, but a test to skip this summary option for smaller collections did not work. Thus all genres in all languages are treated the same.

Improvement suggestion: instead of summarizing files under a certain limit, one could also split Min Áigi into years and display it year-wise. That will require some more info in the generated source file corpus-content.xml.
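The year-wise display could be prepared roughly as sketched below. This is only an illustration: the record fields (file, year, words) and the sample values are hypothetical placeholders, since corpus-content.xml does not carry year information yet and its eventual layout is not decided in this memo.

```python
from collections import defaultdict


def words_per_year(documents):
    """Sum word counts per year for a list of document records.

    Each record is assumed (hypothetically) to be a dict with the keys
    "year" and "words"; the real corpus-content.xml may look different.
    """
    totals = defaultdict(int)
    for doc in documents:
        totals[doc["year"]] += doc["words"]
    # Return the totals sorted by year so they can be displayed in order.
    return dict(sorted(totals.items()))


# Placeholder records, not real corpus figures.
docs = [
    {"file": "a.xml", "year": 1997, "words": 1200},
    {"file": "b.xml", "year": 1998, "words": 800},
    {"file": "c.xml", "year": 1997, "words": 300},
]

year_totals = words_per_year(docs)
for year, words in year_totals.items():
    print(year, words)
```

One summary row per year would keep the Min Áigi listing short without needing a size threshold for the other collections.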

TODO:

6. Infrastructure

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

Saara has given a report on the PHP code in News. Please read.

TODO:

Hyphenator

TODO:

Automatic Bugzilla reminder for untouched bugs

TODO:

JSPWiki update

Here’s the pattern to use:

egrep -C 3 -R "^\*.*[$]*{1,16}\#" *

TODO:

7. Linguistics

Derivation and spellers like Aspell

It is impossible to create a model that can dynamically generate new derivations of verbs and then let them nominalise for compounding. What we need is to extract all derived verbs not in the lexicon, and lexicalise most of them. We could potentially let the 5(?) most productive verbal derivations be unlexicalised, and generate them from our lexicon directly when creating entries. A tentative first guess at which derivational suffixes these could be (for sme):

+st, +l, +h, -goahtit, (o)juvvot (passive)

Later we might consider lexicalising all other derivations found in our corpus.

To make it easier to extract all derived stems, we should enhance the tags used for derivations in sme to make them easier to grep. The most straightforward solution is to make the tags follow the same pattern as for smj, +Der/NNN. Presently sme is only using the NNN part as a tag, where NNN represents the derivational suffix. That is, there is no single pattern to match against for sme. Example:

"<laktigohtet>"
        "laktit" V TV goahti Ind Prs Pl3 @+FMAINV

Only the tag goahti identifies the word form as a derivation. It should be extended to Der/goahti as in smj. Then it is easy to grep for the pattern Der/ to get all derived verbs.
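Once the tags follow the Der/NNN pattern, the extraction could look something like the sketch below. It assumes the analyser output has the layout shown in the example above (a quoted word form in angle brackets, followed by indented reading lines); the function name derived_forms and the regular expressions are illustrative, not existing project code. The sample contains the reading from the example twice, once with the proposed Der/goahti tag and once with the present bare goahti tag, to show that only the former is found.

```python
import re

# A cohort line holds the word form: "<laktigohtet>"
COHORT = re.compile(r'^"<(?P<form>.+)>"\s*$')
# A reading line is indented: "lemma" followed by whitespace-separated tags
READING = re.compile(r'^\s+"(?P<lemma>[^"]+)"\s+(?P<tags>.*)$')


def derived_forms(lines):
    """Yield (word form, lemma, suffix) for each reading with a Der/ tag."""
    form = None
    for line in lines:
        cohort = COHORT.match(line)
        if cohort:
            form = cohort.group("form")
            continue
        reading = READING.match(line)
        if not reading:
            continue
        for tag in reading.group("tags").split():
            if tag.startswith("Der/"):
                yield form, reading.group("lemma"), tag[len("Der/"):]


sample = '''"<laktigohtet>"
        "laktit" V TV Der/goahti Ind Prs Pl3 @+FMAINV
"<laktigohtet>"
        "laktit" V TV goahti Ind Prs Pl3 @+FMAINV
'''.splitlines()

result = list(derived_forms(sample))
print(result)
```

Collecting the lemma and suffix together, rather than just grepping for matching lines, gives a list that can be compared directly against the lexicon when deciding what to lexicalise.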

TODO:

Name double-tagging

TODO:

North Sámi

TODO:

The following already-derived verbs do not readily accept further derivation. Most of them do not seem to appear as Actio forms in the first part of compounds either. The following holds for both sme and smj:

LEXICON MUITTASJ !Words ending -šit, -skit, -smit, -idit, -ldit, -git and 5-syllables, formerly directed to MUITAL
 +V+TV: MUITALStem ;
### SHOULD be directed here as well:
### Reflexives on -dit
### Reciprocals on -dit, -(a)lit
### Momentatives on -dit, -(a)lit, -ádit, -ihit
### Frequentatives on -(a)lit, -(u)hit, -dit
### Continuatives on -dit, -(u)hit, -nit
### Inchoatives in -nit
### Translatives on -dit
### Essives on -dit and -stit
### Causatives on -dit, -stit

Lule Sámi

TODO:

8. Name lexicon infrastructure

TODO:

9. Public tender

We finally received their answer as well.

TODO:

10. Other

Summer vacation

Who     When
Børre   24.7 - 20.8
Linda   ?
Maaren  on sick leave
Saara   July
Sjur    3.7 - 23.7 + single days at other times
Thomas  3.7 - 7.8
Trond   3.7 - 14.8 (last two weeks off at summer school)
Tomi    8.7 - 16.7, 2 more weeks in July and/or August

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with [bug 279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279] (Perl locale). Not much help… Saara will contact Roy on this issue.

Gobby

TODO:

SEE 2.5 extensions

Future extensions and wish modes:

TODO:

Task lists as iCal entries

This feature requires that the patch Sjur sent in to Forrest regarding parsing of nested lists within wiki documents is applied. It was applied to Forrest last Friday, so everyone interested in using this feature from their local Forrest should update their installation. This is most easily done by using Subversion (svn) to update your local copy.

TODO:

Project meeting in Tromsø in August?

The project board has decided upon a meeting in Tromsø in August. We’ll discuss the date later this week.

11. Next meeting, closing

The next meeting is not yet scheduled because of summer vacation.

Closed at 14:36.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond