Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:40.

Present: Børre, Linda (from topic 7 onwards), Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Main secretary: Tomi

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Changes and updates because of the Divvun public tender

TODO:

Permalink location for all our DTDs (the filename will vary, of course):

http://giellatekno.uit.no/dtd/corpus.dtd

This corresponds to the directory ~/gt/public_html/dtd/ on our public web server. The DTDs have to be copied to this location manually, but since they don’t change very often, that shouldn’t be a big problem.
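If the manual copying ever becomes tedious, it could be scripted. The following is only a minimal sketch: the function name `publish_dtds` and its two directory arguments are hypothetical, and the real source directory for the DTDs is not stated in this memo.

```python
import shutil
from pathlib import Path


def publish_dtds(source_dir, target_dir):
    """Copy every *.dtd file from source_dir to target_dir.

    Both arguments are assumptions for illustration; target_dir would be
    something like ~/gt/public_html/dtd/ on the public web server.
    Returns the names of the copied files, sorted alphabetically.
    """
    target = Path(target_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for dtd in sorted(Path(source_dir).expanduser().glob("*.dtd")):
        # copy2 keeps the modification time, which makes stale copies easy to spot
        shutil.copy2(dtd, target / dtd.name)
        copied.append(dtd.name)
    return copied
```

Only files ending in .dtd are copied; anything else in the source directory is left alone.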

TODO:

4. Corpus gathering

Collecting

See a previous meeting memo for what’s to be done.

TODO: Send out the rest of the letters (Børre)

Odin

Waiting for Sæth to discuss with colleagues how to implement the cooperation, and to get back to us.

TODO:

Olavi Korhonen’s Lule Sámi dictionary.

Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary. They are both very positive.

KIO Grafisk and the Iđut books

Iđut and KIO Grafisk won’t give access to their Quark files, due to copyright issues with fonts and pictures. It is a principle for them.

Citations from one of the discussions we have had with Quark experts:

TODO:

Bible texts

TODO:

5. Corpus infrastructure

TODO in transferring the old gt/sme/corp files to the new corpus repo:

TODO for access control:

Further discussion about corpus analysis and computer use:

TODO dtd usage and documentation:

HTML conversion problem

We need to extract only the table from input like the example below, since our DTD does not allow tables to be nested inside paragraphs. Cf. the newsgroup message from Saara.

<p>
  <table>
  ...
  </table>
</p>

The solution is a simple XSL template that matches only the relevant structures, and then “eats” the paragraph before continuing to process the table:

  <xsl:template match="p[table]">
    <xsl:apply-templates select="table"/>
  </xsl:template>
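For testing the same idea outside the XSL pipeline, the unwrapping can be sketched in Python with the standard library. This is an illustration only, not part of our conversion scripts; the function name `unwrap_tables` is hypothetical, and the sketch assumes unnamespaced `p` and `table` elements and a root element that is not itself such a paragraph.

```python
import xml.etree.ElementTree as ET


def unwrap_tables(root):
    """Replace each <p> that has a <table> child with its tables.

    Like the XSL template, everything else inside such a paragraph is dropped.
    """
    # Snapshot the elements first so the tree is not mutated while iterating.
    for parent in list(root.iter()):
        # Walk the children backwards so insertions do not shift pending indices.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag == "p" and child.find("table") is not None:
                tables = child.findall("table")
                # Text following the paragraph now follows the last table.
                tables[-1].tail = child.tail
                parent.remove(child)
                for offset, table in enumerate(tables):
                    parent.insert(index + offset, table)
    return root


if __name__ == "__main__":
    doc = ET.fromstring(
        "<body><p>intro</p><p><table><tr><td>x</td></tr></table></p></body>"
    )
    print(ET.tostring(unwrap_tables(doc), encoding="unicode"))
```

Paragraphs without a table child are left untouched, which corresponds to the `p[table]` match condition.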

Correction tags?

There are many scenarios where information about spelling and other errors is useful, especially if combined with the correct string. As a simple way of marking up such info, Sjur proposed the following in the newsgroup:

... this is <error correct="text">tekst</error> with an error...

No problem in adding it to the DTD, together with corresponding info in the header/meta part. That info should be a single element stating whether the document was manually edited for corrections. Absence of this element means the same as the document NOT having been edited.
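To show how such markup could be used, here is a minimal Python sketch that pulls out the error/correction pairs and produces the corrected text. It assumes the `error` element and `correct` attribute exactly as proposed above, with plain text content inside `error`; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_corrections(xml_text):
    """Return (erroneous string, corrected string) pairs from the markup."""
    root = ET.fromstring(xml_text)
    return [(e.text or "", e.get("correct", "")) for e in root.iter("error")]


def corrected_text(xml_text):
    """Return the running text with every marked error replaced by its correction."""
    root = ET.fromstring(xml_text)
    for e in root.iter("error"):
        # Fall back to the original string if no correction is given.
        e.text = e.get("correct", e.text)
    return "".join(root.itertext())


if __name__ == "__main__":
    sample = '<p>this is <error correct="text">tekst</error> with an error</p>'
    print(extract_corrections(sample))
    print(corrected_text(sample))
```

The pairs could serve as test data for the speller, while the corrected text is what an analyser would normally want to read.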

TODO:

OPEN ISSUES:

Changes and updates because of the Divvun public tender

User account admin and infra: see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO: see above under Documentation.

Automatic build of the content of our corpus repo: also see [previous memo|/admin/weekly/2006/Meeting_2006-03-06.html].

TODO:

Free and non-free texts

More info in the [previous meeting memo|/admin/weekly/2006/Meeting_2006-03-13.html].

TODO:

More texts to the graphical corpus interface:

6. Infrastructure

We need to set up anonymous, read-only access to our cvs repo as outlined by our friend in Skolelinux.

Howto/who:

Aligner

TODO:

Language recogniser

We don’t have enough Finnish text. We will look at the Helsinki corpus (Saara).

7. Linguistics

North Sámi

TODO:

                                 Concrete
                   +/                             \-
            Animate                                 Verbal Content
        +/           \-                            +/            \-
      Human          Moving                  Control               Mass
     +/    \-      +/      \-               +/     \-            +/    \-
#humans# Moving #vehicles# Movable    Perfective Perfective #features# Count
.......

TODO:

Semantic tags we already have: Actio Plc (for proper nouns such as “London London+N+Prop+Plc+Sg+Nom”)

Place names:

Now: Tags Plc Sur and combinations (London, Trosterud).

Problem: 19000 Plc 900 SurPlc (Trosterud BERN-surplc) Oslo sur. Berlin Bonn? 10500 Sur
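Counts like the ones above could be reproduced from analyser output with a small script. The sketch below is an assumption-laden illustration: it presumes one analysis per line in the form "London London+N+Prop+Plc+Sg+Nom" (surface form, whitespace, analysis), treats only Plc and Sur as semantic tags, and uses hypothetical function names.

```python
from collections import Counter, defaultdict

# Only the two tags discussed above; extend as needed.
SEMANTIC_TAGS = ("Plc", "Sur")


def tags_by_lemma(lines):
    """Collect the set of semantic tags seen for each lemma."""
    tags = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        lemma, *analysis_tags = parts[1].split("+")
        tags[lemma].update(t for t in analysis_tags if t in SEMANTIC_TAGS)
    return tags


def class_counts(lines):
    """Count lemmas per tag combination, e.g. 'Plc', 'Sur' or 'Plc+Sur'."""
    counts = Counter()
    for found in tags_by_lemma(lines).values():
        counts["+".join(sorted(found)) or "untagged"] += 1
    return counts
```

Lemmas that end up in the combined class are the candidates to look at first when retagging.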

TODO (Trond):

  1. Discussion testing
  2. infrastructure
  3. semiautomatic retagging

Lule Sámi

TODO:

Trond will go to Drag tomorrow. Issues for the trip? No unobvious ones.

8. Name lexicon infrastructure

Complex names

TODO:

XML format

TODO on eXist as editor:

  1. refactor and prepare risten.no for multiple collections:
    1. develop the Cocoon sitemap to delegate requests to the proper folder level, such that the most specific code is always used (Sjur)
      1. Progressing well
    2. refactor the code into more and more specific components according to our folder hierarchy (Tomi)
  2. develop the needed XQueries and interface (Sjur, Tomi)
  3. data synchronisation between risten.no and the cvs repo (Tomi)
    1. done some, but it didn’t work out, will need to start on a different trail
  4. test whether eXist as editor is actually working well (linguists)

Data synchronisation task list/specification:

9. Spellers

Nothing until the new proper noun lexicon is in place.

10. Other

Gobby

A cross-platform alternative to SubEthaEdit, Gobby, is now available for OS X through DarwinPorts. I haven’t tested it in collaborative use, but it is worth looking at for collaboration involving Windows and Linux users.

Requirements for easy install:

Install and run the above as an admin user. Then find and install Gobby (hint: use the search field), and wait. It took something like 8 hours on my computer to download, compile and install all dependencies, but there were no problems, hiccups or other complications. When finished, open X11, and just type ‘gobby’ in any terminal window.

Bug fixing

35 open Divvun/Disamb bugs, and 25 risten.no bugs

SPR language policy decision

Last week’s SPR meeting decided upon a language policy. Their decision was the following (cf. their meeting minutes for the final version):

“A functioning linguistic infrastructure is of decisive importance if the Sámi languages are to function as working languages in a modern society. Such an infrastructure includes a well-developed terminology, monolingual dictionaries, dictionaries and textbooks between the Sámi languages, exhaustive grammars that give information on all aspects of the language, and language software that makes it possible to search for information, find terms, correct spelling and grammar errors, and translate automatically from one language to another. To achieve this, representative text collections are needed, preferably of several hundred million words, both monolingual and parallel, as well as grammatical and language technology research. SPR’s role is to facilitate this work.”

11. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

12. Next meeting, closing

27.03.2006 09:30

Maaren will be away for the next four weeks, starting next week. After that she will work in Tromsø all of May, and will share an office with one of the Tromsø gang. She will need her own key for that period (Trond).

Closed at 11:48