Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:58.

Present: Børre, Saara, Sjur, Thomas, Tomi, Trond

Absent: Maaren

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Børre will focus on corpus gathering this week; we need more material…

TODO:

5. Corpus infrastructure

General

Our handling of input document conversion has now reached an advanced level. At some point we might consider publishing our results, to the benefit of the rest of the research community.

TODO:

User accounts and access

For details, see a previous meeting memo, as well as the memo from a [dedicated meeting|/infra/corpus_policy.html].

Shell access

TODO:

Web browser access

TODO:

More texts to the graphical corpus interface:

TODO:

Aligner

TODO:

Language recognition

TODO:

6. Infrastructure

Xerox tools wrapped as servers

Saara has made a prototype, available as server_anl.pl (the server) and client_anl.pl (for a client session). It still needs more development, but can be tested.

The server communicates purely over TCP/IP, which means that in principle any client can talk to it.

Very brief user instructions:

Run server_anl.pl in one window and client_anl.pl -p in another. Type “quit” to exit both the client and the server.
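Because the server speaks plain TCP/IP, a client does not have to be written in Perl. The sketch below illustrates the idea in Python with a stand-in server and a client. It is not the protocol of server_anl.pl, which is not described in this memo: the one-word-per-line exchange, the analyse() placeholder and the function names are all assumptions made for illustration. Only the “quit” command is taken from the instructions above.

```python
import socket
import socketserver
import threading


def analyse(word: str) -> str:
    """Hypothetical placeholder for the real analyser call."""
    return f"{word}\t+?"


class AnalysisHandler(socketserver.StreamRequestHandler):
    def handle(self) -> None:
        # Assumed protocol: one word per line, one reply line per word.
        for raw in self.rfile:
            word = raw.decode("utf-8").strip()
            if word == "quit":
                # shutdown() must run in another thread; calling it from
                # the serving thread itself would deadlock.
                threading.Thread(target=self.server.shutdown).start()
                break
            self.wfile.write((analyse(word) + "\n").encode("utf-8"))


def start_server(host: str = "127.0.0.1", port: int = 0):
    """Start the stand-in server in a background thread (port 0 = any free port)."""
    server = socketserver.TCPServer((host, port), AnalysisHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


def query(address, words):
    """Send each word, collect one reply line per word, then send "quit"."""
    replies = []
    with socket.create_connection(address) as conn, conn.makefile("rwb") as stream:
        for word in words:
            stream.write((word + "\n").encode("utf-8"))
            stream.flush()
            replies.append(stream.readline().decode("utf-8").rstrip("\n"))
        stream.write(b"quit\n")
        stream.flush()
    return replies
```

A session would call start_server(), pass server.server_address to query() together with a list of words, and finish with server.server_close() once the server thread has stopped.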

TODO:

Hyphenator

The first hyphenating transducer was made last week, but it produces wrong output because of overgeneration on the generator side.

Example: gahpira => gah-pi-ra and ga-hpir; only the first should be produced.

                         ´  hyphenated output
                        |     |
                        |   hyphenation rules
                        /     |
filter.fst & hyph.fst <-    generator <-------- overgeneration:
                        \     | ----------baseform/analysis
                        |   analyser
                        |     |
                         `  input

We need a “filter” fst with these symbol pairs: a-z... -:0 -:- ^:0 #:0

Sketch:

[%-, ^, %# ] (<-) 0 ;

and the rest by default: a = a:a.
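One way to read the filter: a hyphenated candidate is acceptable only if deleting the hyphens and the boundary symbols ^ and # gives back the input word. The Python sketch below mimics that check on plain strings. It is an illustration of the idea under that assumption, not the fst itself, and the function and constant names are hypothetical.

```python
# Symbols the filter may delete: hyphens plus the boundary marks ^ and #.
DROP_ALL = str.maketrans("", "", "-^#")
# In the final output the hyphens stay; only ^ and # are removed.
DROP_BOUNDARIES = str.maketrans("", "", "^#")


def filter_candidates(word, candidates):
    """Keep the candidates that reduce to `word` once all marks are deleted."""
    kept = []
    for candidate in candidates:
        if candidate.translate(DROP_ALL) == word:
            kept.append(candidate.translate(DROP_BOUNDARIES))
    return kept


if __name__ == "__main__":
    # The overgenerated candidate from the example above is rejected
    # because it no longer spells the input word.
    print(filter_candidates("gahpira", ["gah-pi-ra", "ga-hpir"]))
```

With the example from this section, ga-hpir is dropped because it reduces to gahpir rather than gahpira.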

TODO:

Automatic Bugzilla reminder for untouched bugs

TODO:

M4

TODO:

7. Linguistics

Derivation and spellers like Aspell

Semantic double-tagging of names

TODO:

North Sámi

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

TODO:

9. Tromsø meeting follow-up

TODO:

10. Other

Bug fixing

64 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Meetings and Marratech

TODO:

Task lists as iCal entries

TODO:

11. Next meeting, closing

Next meeting 25.9.2006 at 9:30.

Closed at 11:03.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond