Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:37.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Main secretary: Sjur

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Changes and updates because of the Divvun public tender

http://giellatekno.uit.no/dtd/corpus.dtd

This corresponds to the dir XXX on our public web server - details from Børre.

4. Corpus gathering

Børre in Umeå

Met with Mikael Svonni, Nils Henrik Sikku, John Erling Utsi (very positive). They received a copy of the contract. Will hopefully return with a signed contract and a bunch of books/documents.

Collecting

See a previous meeting memo for what’s to be done.

TODO: Send out the rest of the letters (Børre)

Odin

Waiting for Sæth to discuss with colleagues how to implement the cooperation and then get back to us.

TODO:

Olavi Korhonen’s Lule Sámi dictionary.

Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.

KIO Grafisk and the Iđut books

We need a test file in order to find out whether file conversion from Quark to InDesign works.

TODO:

Bible texts

TODO:

5. Corpus infrastructure

Transferring the old gt/sme/corp files to the new corpus repo:

TODO:

Further discussion about corpus analysis and computer use:

The new G5 is tremendously faster than cochise, so we want to use it. Cochise will nevertheless remain our main corpus repo.

New tasks:

Changes and updates because of the Divvun public tender

User account admin and infra: see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO: see above under Documentation.

Automatic build of the content of our corpus repo: also see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO:

Free and non-free texts

It is useful in many cases to have access to all and only the free texts. After some discussion we decided upon the following:

Top level corpus dirs:

orig/
gtfree/
gtbound/ (renamed from gt/)

The share of texts from each genre that ends up in the new free corpus dir tree is likely to be roughly as follows:

gtfree (containing free texts only)
    sme
        admin   all
        bible   empty
        fact    some
        fict    some
        law     all
        news    some
    smj

The old gt/ dir is renamed to gtbound/. It contains all texts, and is functionally and content-wise identical to our old friend gt/.

The new gtfree/ dir tree is copied automatically from gtbound/, and should not increase the maintenance burden. It contains all and only the texts with a free usage license.
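The automatic copy described above could be sketched roughly as follows. This is only an illustration, not the script we will actually use: the memo does not say how the licence status of a text is recorded, so the sketch assumes a hypothetical marker file named `<text>.free` next to each free text. The function names are also made up for the example.

```python
import shutil
from pathlib import Path


def is_free(text_file: Path) -> bool:
    """Hypothetical licence check: a sibling '<name>.free' marker file exists.

    ASSUMPTION: the real corpus may record the licence differently
    (for instance in the document metadata); replace this check accordingly.
    """
    return text_file.with_name(text_file.name + ".free").exists()


def sync_free_texts(gtbound: Path, gtfree: Path) -> list[Path]:
    """Rebuild gtfree/ from scratch with copies of the free texts in gtbound/."""
    if gtfree.exists():
        shutil.rmtree(gtfree)  # start clean so texts that are no longer free disappear
    gtfree.mkdir(parents=True)
    copied = []
    for src in sorted(gtbound.rglob("*")):
        # Skip directories, the marker files themselves, and non-free texts.
        if not src.is_file() or src.suffix == ".free" or not is_free(src):
            continue
        dest = gtfree / src.relative_to(gtbound)  # keep the sme/law/... layout
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

Rebuilding gtfree/ from scratch on every run keeps it a pure derivative of gtbound/, which is what keeps the maintenance burden from growing.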

The final corpus directory will look like this:

drwxr-x--x dis/ <= manually analysed not duplicated
drwxr-x--x ga/  <= autom analysed (perhaps duplicated, if a need arises)
drwxrwx--x broken/
drwxrwx--x orig/
drwxrwx--x bin/
drwxrwx--x gt/  -> gtbound/
drwxrwxr-x gtfree/  -> new! (copies of free texts)
drwxrwxrwx archived_2005_10_25_16.56/
drwxrwxrwx upload/
drwxrwxrwx tmp/

More texts to the graphical corpus interface:

6. Infrastructure

We need to set up anonymous, read-only access to our cvs repo as outlined by our friend in Skolelinux.

Howto/who:

Aligner

We now have the go-ahead from Knut Hofland in Bergen, including a download link, username/password, and some documentation:

“The documentation is a bit thin at the moment, but I hope you will be able to prepare text and run the program from the information that is there.

We are also working on some small improvements and are interested in feedback.

As for the separate sentence-splitting program, it needs to be improved to handle text in utf-8 (the sentence splitting currently goes wrong if the character after a full stop is an uppercase letter in utf-8). But you probably have other programs for splitting text into sentences (or other units). The program will, however, also split wherever it finds ## (which does not appear in the output file), so that is a somewhat inelegant way around the problem. (I should perhaps also have had a corresponding exception.)

The program that numbers elements works on <p>, <head> and [...]. [...] and <head> are numbered from 1 onward. A guest has asked for numbering within <p> (1.1 1.2 .. 2.1 2.2 ..), and that will be added eventually (that is how we had it in ENPC, after all). But here, too, there may be other available programs/scripts.

For now there are two output formats. In the one named cor, the xml elements that are aligned get an extra attribute corresp with a list of the corresponding sentences in the other file. The other format, named new, produces two files with the same number of line endings, one for each group of sentences that is aligned. This format can be used further in programs such as Paraconc or MultiConcord (but I do not know whether either of these is ready for utf-8). We also use a slightly processed version of this format for loading into Corpus WorkBench (on Linux). We are also planning a linkGrp/link format in an external file (it can easily be produced from the -new files with a separate program).

There is also a small program in the directory that takes two -new files and produces an HTML table from them.”
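To make the line-aligned output format concrete: since the two files have the same number of lines, line n of one file corresponds to line n of the other. The sketch below shows that idea by turning two such line sequences into an HTML table. It is not Hofland's program and makes no claim about its actual output; the function name and the table layout are assumptions made for the illustration.

```python
from html import escape


def aligned_html_table(left_lines, right_lines):
    """Return an HTML table with one row per aligned pair of lines.

    Both inputs are expected to have the same number of lines, which is
    the property the line-aligned format is described as guaranteeing.
    """
    left = [line.rstrip("\r\n") for line in left_lines]
    right = [line.rstrip("\r\n") for line in right_lines]
    if len(left) != len(right):
        raise ValueError(
            "files are not line-aligned: %d vs %d lines" % (len(left), len(right))
        )
    rows = [
        "<tr><td>{}</td><td>{}</td></tr>".format(escape(a), escape(b))
        for a, b in zip(left, right)
    ]
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

In practice the two arguments would be open file objects read as UTF-8, for example `open(path, encoding="utf-8")`.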

TODO:

Language recogniser

We don’t have enough Finnish text. We will look at the Helsinki corpus.

7. Linguistics

North Sámi

TODO:

General notes on SGL

Lule Sámi

TODO:

21-22 March: Research seminar at Árran on how to go forward with Lule Sámi research.

Trond is working on a beta version of the smj disamb. Smj name morphology still open.

We need to implement the oslolaš issue for smj as well (Tomi)

8. Name lexicon infrastructure

Complex names

TODO:

XML format

TODO on eXist as editor:

  1. refactor and prepare risten.no for multiple collections:
    1. develop the Cocoon sitemap to delegate requests to the proper folder level, such that the most specific code is always used
    2. refactor the code into more and more specific components according to our folder hierarchy
  2. develop the needed XQueries and interface (Sjur, Tomi)
  3. data synchronisation between risten.no and the cvs repo (Tomi)
  4. test whether eXist as editor is actually working well (linguists)

Data synchronisation task list/specification:

9. Spellers

Nothing until the new proper noun lexicon is in place.

10. Other

SGL Seminar

SGL wants to have a meeting with us in May in Karasjok (but no seminar). We need a meeting with Laila before that.

CVS UTF-8 bug in filename

gt/sme/corp/examples/ex-uhccán.txt

cvs update: cannot write ./ex-uhccán.txt: Invalid argument

Problem solved by renaming the file. Please run cvs up.
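To catch other files of the same kind before they hit the same error, a working copy can be scanned for names containing non-ASCII characters. The sketch below is only a suggestion. It assumes that the non-ASCII character in the name is what triggers the error, which the case above suggests but which has not been verified for other platforms or CVS versions. The function name is made up.

```python
import os


def non_ascii_paths(root):
    """Yield paths under root whose file or directory name contains non-ASCII characters."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not name.isascii():  # str.isascii() needs Python 3.7 or later
                yield os.path.join(dirpath, name)
```

Running `for path in non_ascii_paths("gt"): print(path)` from the top of a checkout would list the candidates for renaming.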

Bug fixing

34 open Divvun/Disamb bugs, and 25 risten.no bugs

11. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

12. Next meeting, closing

20.03.2006 09:30

Closed at 11:41