Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:47.

Present: Saara, Sjur, Thomas, Børre, Tomi, Trond

Absent: Maaren

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Nothing has happened during the summer.

Olavi Korhonen’s Lule Sámi dictionary.

Waiting for an answer.

Bible texts

Will have a second round with the Word versions.

TODO:

Kåfjord

TODO:

Sámi Instituhtta

When will we get the corpus? We still don’t know; Børre will contact him again.

TODO:

Čálliid Lágádus

[http://www.calliidlagadus.org/]

TODO:

Árran

TODO:

5. Corpus infrastructure

General

What we would like: a make-type system that keeps track of the file.xml and file.xsl pairs and reconverts the former whenever the latter has a newer date. Cf. Trond’s letter “Makefile for xsl corpus conversion?” in the news.

At first sight, this sounds like a good idea.
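
The dependency check described above can be sketched in a few lines of Python. This is only an illustration, not the system discussed in Trond’s letter: it assumes that each file.xsl sits next to its file.xml and shares its stem, and the function names is_stale and stale_pairs are hypothetical.

```python
from pathlib import Path


def is_stale(xml_path: Path, xsl_path: Path) -> bool:
    """Return True when xml_path is missing or older than xsl_path."""
    if not xml_path.exists():
        return True
    return xsl_path.stat().st_mtime > xml_path.stat().st_mtime


def stale_pairs(corpus_root: Path) -> list[tuple[Path, Path]]:
    """List the (xml, xsl) pairs under corpus_root that need reconversion.

    Assumption: file.xsl and file.xml live in the same directory and
    differ only in their suffix.
    """
    pairs = []
    for xsl_path in sorted(corpus_root.rglob("*.xsl")):
        xml_path = xsl_path.with_suffix(".xml")
        if is_stale(xml_path, xsl_path):
            pairs.append((xml_path, xsl_path))
    return pairs
```

A conversion script could loop over stale_pairs(root) and rerun the converter for each pair, which is the same rule a Makefile would express with file.xml depending on file.xsl.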

TODO:

User accounts and access

For details, see a previous meeting memo, as well as the memo from a [dedicated meeting|/infra/corpus_policy.html].

Shell access

TODO:

Web browser access

TODO:

More texts to the graphical corpus interface:

TODO:

Aligner

The aligner aligns fine, better than its competitors. Unfortunately it is slow, and dependent upon manual input.

TODO:

Language recognition

Still waiting for more smj and sma text to improve it. We need South Sámi, since Reindriftsnytt/Boazodoallo-ođđasat is trilingual (nob, sme, sma). We are presently unable to identify sma correctly.

Corpus summary

The time-based statistics are still missing.

TODO:

6. Infrastructure

Xerox tools wrapped as servers

To improve throughput and response time under heavy load, we would like to have the Xerox tools wrapped up as servers.
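
The core of such a wrapper is a process that is started once and then answers many requests, so that start-up cost is paid only once. The Python sketch below shows that idea only. It does not call any Xerox tool: STAND_IN is a hypothetical stand-in command that upper-cases each line, and the one-line-in, one-line-out protocol is an assumption that a real analyser may not follow.

```python
import subprocess
import sys

# Stand-in for a real analyser (assumption): upper-cases each input line.
STAND_IN = [
    sys.executable, "-u", "-c",
    "import sys\nfor line in sys.stdin:\n    print(line.strip().upper(), flush=True)",
]


class PersistentTool:
    """Keep one child process alive and send it one request at a time."""

    def __init__(self, argv):
        # Line-buffered text pipes, so each request is delivered immediately.
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def ask(self, word: str) -> str:
        """Send one line and read back one line (assumed protocol)."""
        self.proc.stdin.write(word + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self) -> None:
        """Close the input pipe and wait for the child to exit."""
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()
```

A network front end (for example a small socket server) could hold one PersistentTool per language and forward each incoming word to ask().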

TODO:

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

Saara has given a report on the PHP code in News. Please read.

Conclusion: Only to be used as a source of inspiration. We’ll postpone further work until the server wrapper (see above) is in place.

Hyphenator

TODO:

Automatic Bugzilla reminder for untouched bugs

TODO:

M4

Tomi and Saara did a lot in Tromsø. How far is it now? Probably finished today!

TODO:

7. Linguistics

Derivation and spellers like Aspell

To make it easier to extract all derived stems, we should enhance the tags used for derivations in sme to make them easier to grep. The most straightforward solution is to make the tags follow the same pattern as for smj, +Der/NNN, where NNN represents the derivational suffix. At present sme uses only the NNN part as the tag, so there is no single pattern to match against for sme.
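
With a uniform +Der/NNN pattern, a single regular expression is enough to find every derivation tag. The Python sketch below illustrates this; the analysis strings in the usage note are hypothetical and use the placeholder NNN rather than real suffix tags, and the function names are made up for the example.

```python
import re

# Matches "+Der/" followed by the suffix name, up to the next "+" or space.
DER_TAG = re.compile(r"\+Der/([^+\s]+)")


def derivation_suffixes(analysis: str) -> list[str]:
    """Return the suffix names of all +Der/NNN tags in one analysis string."""
    return DER_TAG.findall(analysis)


def is_derived(analysis: str) -> bool:
    """Return True if the analysis contains at least one +Der/NNN tag."""
    return DER_TAG.search(analysis) is not None
```

For a hypothetical analysis such as "stem+V+Der/NNN+V+Inf", derivation_suffixes returns ["NNN"], while an analysis without a +Der/ tag gives an empty list. With the present sme tags, where only the NNN part is used, no such single pattern exists.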

Problematic issue: the disamb output presently gives information only about non-lexicalised derivations. This can give false data on the frequency of derivational affixes. To get (more) precise data, there should be an option in lookup2cg that favours derived analyses over lexicalised ones, everything else being equal. A similar option for compounds can also be useful in certain contexts. That is, the present behaviour (“select the analysis or analyses with the fewest compounds and derivations available”) should be partly turned upside-down. The best alternative would probably be to select the next to least complex analysis, that is, to allow only one compound border or one derivation.
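
The two selection strategies can be compared in a small Python sketch. It is one possible reading of the proposal, not the lookup2cg implementation: it assumes that derivations are tagged +Der/NNN, that a compound border is marked with "#", and that "next to least complex" means exactly one border or derivation. All function names are hypothetical.

```python
import re

DER_TAG = re.compile(r"\+Der/")
COMPOUND_BORDER = "#"  # assumption: marks a compound border in an analysis


def complexity(analysis: str) -> int:
    """Count derivation tags plus compound borders in one analysis."""
    return len(DER_TAG.findall(analysis)) + analysis.count(COMPOUND_BORDER)


def select_simplest(analyses: list[str]) -> list[str]:
    """Present behaviour: keep the analyses with the lowest complexity."""
    if not analyses:
        return []
    lowest = min(complexity(a) for a in analyses)
    return [a for a in analyses if complexity(a) == lowest]


def select_next_to_simplest(analyses: list[str]) -> list[str]:
    """Proposed option: prefer analyses with exactly one compound border
    or derivation; fall back to the simplest ones if none exist."""
    one_step = [a for a in analyses if complexity(a) == 1]
    return one_step or select_simplest(analyses)
```

Given a lexicalised reading and a derived reading of the same word form, select_simplest keeps the lexicalised one, while select_next_to_simplest keeps the derived one, which is what frequency counts of derivational affixes would need.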

TODO:

Semantic double-tagging of names

The policy needs documentation. Thus:

TODO:

North Sámi

The following already derived verbs (verbs ending in -šit, -skit, etc.) are not happy with further derivation. It seems that most of them do not appear as Actio forms in the first part of compounding either. The following holds for both sme and smj:

LEXICON MUITTASJ ! Words ending -šit, -skit, -smit, -idit, -ldit, -git and
! 5-syllables, formerly directed to MUITAL
 +V+TV: MUITALStem ;
### These derived verbs have now been redirected to MUITTASJ and similar lexica.
### Reflexives on -dit
### Reciprocals on -dit, -(a)lit
### Momentatives on -dit, -(a)lit, -ádit, -ihit
### Frequentatives on -(a)lit, -(u)hit, -dit
### Continuatives on -dit, -(u)hit, -nit
### Inchoatives in -nit
### Translatives on -dit
### Essives on -dit and -stit
### Causatives on -dit, -stit

Examples:

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

Language entry example illustrating both the sem-tag on the sense elements, and the removal of the occurrence indicator:

<entry id="Agalin">
 <infl lexc="BERN" />
 <senses>
  <sense sem="plc" ref="Agalin"/>
  <sense sem="sur" ref="Agalin_1"/>
 </senses>
</entry>
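
To show how the sem attribute on the sense elements can be read by a script, here is a minimal Python sketch using the standard library XML parser. The function name read_entry and the shape of the returned dictionary are assumptions made for the illustration; only the entry itself comes from the example above.

```python
import xml.etree.ElementTree as ET

# The entry from the example above, copied verbatim.
ENTRY = """<entry id="Agalin">
 <infl lexc="BERN" />
 <senses>
  <sense sem="plc" ref="Agalin"/>
  <sense sem="sur" ref="Agalin_1"/>
 </senses>
</entry>"""


def read_entry(xml_text: str) -> dict:
    """Return the id, the lexc class and the (sem, ref) pairs of one entry."""
    entry = ET.fromstring(xml_text)
    return {
        "id": entry.get("id"),
        "lexc": entry.find("infl").get("lexc"),
        "senses": [(s.get("sem"), s.get("ref")) for s in entry.iter("sense")],
    }
```

Calling read_entry(ENTRY) gives the id "Agalin", the lexc class "BERN" and the two senses ("plc", "Agalin") and ("sur", "Agalin_1").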

TODO:

9. Public tender

TODO:

10. Tromsø meeting round-up

TODO:

11. Other

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with [bug 279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279] (Perl locale). Not much help… Saara will contact Roy on this issue.

Gobby

TODO:

Task lists as iCal entries

TODO:

cd $FORREST_HOME
svn up -r430284

12. Next meeting, closing

Next meeting 4.9.2006 at 9:30.

Closed at 11:25.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond