Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:45.

Present: Sjur, Thomas, Tomi, Trond

Absent: Børre, Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

Trond talked with Bierna Bientie and got permission to use the texts, for now as bound texts. We will contact SD to get the texts they hold; he will send whatever they don't have.

The texts at SD are available via Sig-Britt Persson, Sjur will ask her about them.

TODO:

5. Corpus infrastructure

Nothing new on any of the sub-issues.

User accounts and access

TODO:

More texts to the graphical corpus interface:

No new texts available in the web interface yet. Saara and Trond have got access to the Oslo computer, and could in principle add texts themselves. Some instructions will be necessary, though.

TODO:

Aligner

Øystein Reigem has commented on Børre's work, and may look at it more closely.

TODO:

Language recognition

TODO:

6. Infrastructure

Xerox tools wrapped as servers

Tomi has modified the server slightly, so it currently does NOT work with the Perl client. Fixing it should be a small job (three lines?). Saara will fix it.

The paradigm grammar needs to be written for Saara to be able to finish the server. The format should follow the Noun sample below:

N+Number+Case+Possessive?+Clitic?
A
V
Adv
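To illustrate how a paradigm line like the noun sample might be read, here is a minimal Python sketch. It assumes that `?` marks an optional slot and that each slot name stands for a list of tag values. The tag inventories in `TAGS` are illustrative placeholders rather than the project's actual tag sets, and `expand_paradigm` is a hypothetical helper, not part of the server.

```python
from itertools import product

# Illustrative tag inventories only (assumption); the real sets would
# come from the morphological description of each language.
TAGS = {
    "Number": ["Sg", "Pl"],
    "Case": ["Nom", "Gen"],
    "Possessive": ["PxSg1"],
    "Clitic": ["Qst"],
}


def expand_paradigm(pattern, tags=TAGS):
    """Expand a pattern such as 'N+Number+Case+Possessive?' into tag strings.

    A trailing '?' on a slot is taken to mean that the slot may be left out.
    """
    pos, *slots = pattern.split("+")
    choices = []
    for slot in slots:
        optional = slot.endswith("?")
        values = list(tags[slot.rstrip("?")])
        if optional:
            values = [None] + values  # None = slot not realised
        choices.append(values)
    return [
        "+".join([pos] + [value for value in combo if value is not None])
        for combo in product(*choices)
    ]


if __name__ == "__main__":
    for line in expand_paradigm("N+Number+Case+Possessive?+Clitic?"):
        print(line)
```

A bare part-of-speech line such as `A` expands to itself, so the same function could be used once the other paradigm lines are filled in.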

Can we do without clitics? We would like to just list them in a PLX format that allows correct behaviour:

Is
InC lex
MeC lex
FiC clit

That is, we want mánás + go to be accepted, without letting mánás make compounds freely. We don’t know how to specify this in the PLX format. Sjur will discuss clitics with Polderland, to solve this issue.

TODO:

Hyphenator

TODO:

7. Linguistics

Names and multilinguality

In the terms-sme.xml file:
<entry id="Helsset">
  <appl excl="speller,hyph"/> <!-- just an example! -->
  <infl lexc="BERN"/>
  <senses>
    <sense ref="Helsingfors"/>
  </senses>
</entry>

<entry id="Helsingfors" type="secondary">
  <appl incl="disamb, IR" />
  <infl lexc="BERN"/>
  <senses>
    <sense ref="Helsingfors"/>
  </senses>
</entry>
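As a sketch of how such entries could be consumed, the following Python code selects the entries that apply to a given application. It rests on two assumptions not stated in the minutes: that `@excl` lists applications where the entry must not be used, and that `@incl` (when present) lists the only applications where it may be used. The wrapping `<terms>` root element and the function `entries_for` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two sample entries, wrapped in an assumed <terms> root element.
SAMPLE = """<terms>
<entry id="Helsset">
  <appl excl="speller,hyph"/>
  <infl lexc="BERN"/>
  <senses><sense ref="Helsingfors"/></senses>
</entry>
<entry id="Helsingfors" type="secondary">
  <appl incl="disamb, IR"/>
  <infl lexc="BERN"/>
  <senses><sense ref="Helsingfors"/></senses>
</entry>
</terms>"""


def _names(value):
    """Split a comma-separated attribute value into a set of names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def entries_for(root, application):
    """Return the ids of all entries usable in the given application."""
    selected = []
    for entry in root.iter("entry"):
        appl = entry.find("appl")
        incl = _names(appl.get("incl", "")) if appl is not None else set()
        excl = _names(appl.get("excl", "")) if appl is not None else set()
        if application in excl:
            continue  # explicitly excluded
        if incl and application not in incl:
            continue  # an include list exists and does not mention us
        selected.append(entry.get("id"))
    return selected


root = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    for app in ("speller", "hyph", "disamb", "IR"):
        print(app, entries_for(root, app))
```

Under these assumptions neither sample entry reaches the speller, while both are available to disambiguation.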

TODO:

  1. finish first version of the editing (Sjur, Tomi)
  2. add @type=secondary and @excl=speller,hyph to all names marked with !SUB (Saara)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. start to use the xml file as source file
  7. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  8. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  9. publish the name lexicon on risten.no (Sjur)
  10. add missing parallel names for placenames (linguists)
  11. add informative links between first names like Niillas and Nils (linguists)

Cohort policy

A further issue is whether we should reconsider our cohort policy. Today, Sur and Plc are different readings. An alternative would be to have them as secondary tags, not in conflict with each other:

"<Trosterud>"
        "Trosterud" N Prop Sur Sg Nom <<< @HNOUN
        "Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN

This issue depends on CG3, and is postponed until we know more about what CG3 makes possible.
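The difference between the first cohort and the two alternatives can be sketched in Python. This is only an illustration of the idea, not a CG3 feature: readings are modelled as plain `(lemma, tags)` pairs, `merge_readings` is a hypothetical helper, and the syntactic tag `@HNOUN` is left out for brevity.

```python
SEMANTIC_TAGS = ("Sur", "Plc")  # the two name types discussed above


def merge_readings(readings, wrap="<{}>"):
    """Merge readings that differ only in a semantic tag into one reading.

    The semantic tags are moved to the end of the merged reading and
    rewritten with `wrap`, e.g. '<{}>' gives <Sur>, '&{}' gives &Sur.
    """
    merged = {}  # (lemma, non-semantic tags) -> semantic tags seen so far
    for lemma, tags in readings:
        core = tuple(tag for tag in tags if tag not in SEMANTIC_TAGS)
        seen = merged.setdefault((lemma, core), [])
        for tag in tags:
            if tag in SEMANTIC_TAGS and tag not in seen:
                seen.append(tag)
    return [
        (lemma, list(core) + [wrap.format(tag) for tag in seen])
        for (lemma, core), seen in merged.items()
    ]


if __name__ == "__main__":
    cohort = [
        ("Trosterud", ["N", "Prop", "Sur", "Sg", "Nom"]),
        ("Trosterud", ["N", "Prop", "Plc", "Sg", "Nom"]),
    ]
    for lemma, tags in merge_readings(cohort):
        print(f'"{lemma}"', " ".join(tags))
    for lemma, tags in merge_readings(cohort, wrap="&{}"):
        print(f'"{lemma}"', " ".join(tags))
```

With secondary tags the two readings collapse into one, so disambiguation rules no longer have to choose between Sur and Plc.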

Derivation and spellers like Aspell

TODO:

North Sámi

The following words are included in the normative list despite being marked with !SUB:

accompagnerejun
ábuhuvvože
ábuhuvvože      ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne
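A check for this kind of leak could be sketched as follows in Python. The sketch assumes that `!SUB` appears as a lexc comment at the end of an entry line and that the analyser output has the form `lemma+tags`, as in the third line above. The lexc lines in the example, including the continuation class names, are invented for illustration, and both function names are hypothetical.

```python
def sub_marked_lemmas(lexc_lines):
    """Collect the lemmas of lexc entries whose comment starts with SUB."""
    lemmas = set()
    for line in lexc_lines:
        code, _, comment = line.partition("!")  # '!' starts a lexc comment
        if not comment.strip().startswith("SUB"):
            continue
        fields = code.split()
        if fields:
            # An entry is 'lemma:stem CONTLEX ;' or 'lemma CONTLEX ;'
            lemmas.add(fields[0].split(":")[0])
    return lemmas


def flag_normative(analyses, sub_lemmas):
    """Return the word forms whose analysis lemma is marked as !SUB."""
    return [
        form
        for form, analysis in analyses
        if analysis.split("+")[0] in sub_lemmas
    ]


if __name__ == "__main__":
    # Invented lexc lines (assumption), for illustration only.
    lexc = [
        "ábuhit:ábuh EXAMPLE_V ; !SUB",
        "njuvdin:njuvdim BOAHTIN ;",
    ]
    analyses = [("ábuhuvvože", "ábuhit+V+TV+Pass+Pot+Prs+Du1")]
    print(flag_normative(analyses, sub_marked_lemmas(lexc)))
```

Running such a check over the generated normative list would show which forms still need to be filtered out.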

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in [the meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].

TODO:

Postponed:

9. Spellers

Polderland data generation

Tomi has made the first version of the PLX data generation. It ran for 6.5 hours before it crashed, having generated all entries up to line 13639 (njuvdin) of the noun-sme-lex.txt file:

13639 njuvdin:njuvdim BOAHTIN ;

The generated PLX file contains 1 439 273 lines, roughly 32 MB.
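For a rough sense of scale, the figures above can be turned into averages with a few lines of Python. This is back-of-the-envelope arithmetic only: it assumes that the size means 32 * 10^6 bytes and it treats every source line up to 13639 as one entry, which overcounts if the file contains comment or blank lines.

```python
# Figures quoted in the minutes.
plx_lines = 1_439_273
source_lines_done = 13_639
run_hours = 6.5
plx_bytes = 32 * 10**6  # assumption: 32 * 10^6 bytes

# Derived averages.
lines_per_source_line = plx_lines / source_lines_done
bytes_per_plx_line = plx_bytes / plx_lines
plx_lines_per_second = plx_lines / (run_hours * 3600)

print(f"{lines_per_source_line:.1f} PLX lines per source line")
print(f"{bytes_per_plx_line:.1f} bytes per PLX line")
print(f"{plx_lines_per_second:.1f} PLX lines generated per second")
```

On these assumptions each lexicon line yields on the order of a hundred PLX lines, which gives an idea of how the full noun lexicon would scale.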

Tomi: The hyphenator produces multiple suggestions. Sjur: it needs to be used with the hyphenation filter.

TODO:

Automatic testing of the Word spellchecker

It should be possible to write a script that runs texts through Word from the command line, using a combination of shell script and AppleScript. MS Word has the needed AppleScript commands to run the spell checker.

TODO:

Aspell

Børre has worked on the Aspell code, mainly to be able to explain how it works and to help Petter Reinholdtsen, who would like to include the Aspell sme speller in Linux distributions. Petter has already contributed some improvements to our code.

TODO:

TODO when the major part of the PLX conversion is done:

10. Other

Corpus contracts

TODO:

Bug fixing

62 open Divvun/Disamb bugs, and 24 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Task lists as iCal entries

TODO:

Employee seminar in Alta

SD has an employee seminar in Alta on 7-8 December. Should we go there? Yes.

TODO:

11. Next meeting, closing

The next meeting is 27.11.2006, 09:30 Norwegian time.

The meeting was closed at 11:28.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond