Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:51.

Present: Børre, Saara, Sjur, Thomas, Tomi, Trond

Absent: Maaren

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

NSI: the Mac crashed during the summer. It needs a new motherboard, but NSI doesn’t have the money to buy one. This is a major setback for us: most of Min Áigi and Áššu can’t be delivered. One possible way out: Børre is going to Kautokeino for a course in the near future, and could probably get the hard disk mounted on his own computer. Discuss and plan this with the NSI guy.

Bård Eriksen has sent the book list. It contains only about 10 titles.

Lars Kintel has contacted us and offered all his translations to smj free of charge. We still need to contact the authors to get their approval as well.

TODO:

5. Corpus infrastructure

General

TODO:

User accounts and access

The whole issue should be moved into Bugzilla as several tasks.

TODO:

More texts to the graphical corpus interface:

We are waiting for parallel texts, and for the alignment of parallel texts. The SD Parliament meeting memos are excellent for parallel testing, as they are accurately translated in both directions (but some of the Parliament plenary meeting memos do not have a nob counterpart, or we haven’t got one yet).

Børre has found a lot of old Parliament meeting protocols, and will probably be able to add missing translations/parallel documents.

TODO:

Aligner

Discuss NT parallel corpus

originals: Word, paratext, txt our format: xml

It boils down to two issues:

  1. Which original to choose
  2. Whether to align according to verses or according to our own preprocessor (verses) (send both options to the parsers and see which one we like)

Goal: Have the Bible texts in the same format as .

TODO:

Language recognition

TODO:

6. Infrastructure

Xerox tools wrapped as servers

No restrictions on IP numbers or domains from which the server is accessed.

Modules to be included / types of services:

Supported input formats:

Output formats:

Access clients:

XML output format:

Analyser and disambiguator:
<w form="Tromssan">
<reading lemma="Tromsa" analysis="N+Sg+Loc+.."/>
..
</w>

Generator and hyphenator should follow the same pattern. Saara will make something:-)
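
A minimal sketch of how the analyser output above could be produced, assuming the analyses arrive as (lemma, tag string) pairs. The function names and the shortened tag string are illustrative only and are not part of any existing tool.

```python
import xml.etree.ElementTree as ET


def word_to_xml(form, readings):
    """Build a <w> element with one <reading> child per (lemma, analysis) pair."""
    w = ET.Element("w", {"form": form})
    for lemma, analysis in readings:
        ET.SubElement(w, "reading", {"lemma": lemma, "analysis": analysis})
    return w


def serialize(element):
    """Return the element as a Unicode string (no XML declaration)."""
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    # "N+Sg+Loc" is a placeholder tag string, not the full analysis.
    print(serialize(word_to_xml("Tromssan", [("Tromsa", "N+Sg+Loc")])))
```

A generator or hyphenator wrapper could reuse the same `<w>` element and swap the `<reading>` children for its own result element.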

TODO:

Hyphenator

Tros^te^rud TROSTERUD

Čále dahje kopiere sániid lásii, ja deatti “Analysere”. Čilgehus giellaoahpalaš

čá^le da^hje kopiere sá^niid lá^sii , ja deat^ti a^na^ly^se^re .

Čilgehus čil^ge^heh^kos ≠ čil^ge^heh^kos ≠ čil^ge^hus = čil^ge^hus = Čil^ge^hus <==
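
The samples above do not say exactly what was tested. One reading is that hyphenation points found for a lowercased form must be carried back to the original casing. Under that assumption, a sketch of the transfer step could look like this (the function name is hypothetical):

```python
def transfer_hyphenation(original, hyphenated, mark="^"):
    """Copy hyphenation marks from `hyphenated` onto `original`.

    `hyphenated` must spell the same word as `original` apart from
    letter case and the inserted marks.
    """
    stripped = hyphenated.replace(mark, "")
    if len(stripped) != len(original) or stripped.lower() != original.lower():
        raise ValueError("forms do not match: %r vs %r" % (original, hyphenated))
    letters = iter(original)
    out = []
    for ch in hyphenated:
        # Keep the mark itself, but take every letter from the original form.
        out.append(mark if ch == mark else next(letters))
    return "".join(out)


if __name__ == "__main__":
    print(transfer_hyphenation("TROSTERUD", "tros^te^rud"))
    print(transfer_hyphenation("Čilgehus", "čil^ge^hus"))
```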

TODO:

Automatic Bugzilla reminder for untouched bugs

Some Perl libraries needed by Bugzilla weren’t in the path, so it did not work. Adding them should fix the issue.

TODO:

M4

TODO:

7. Linguistics

Names and multilinguality

We need a more principled approach to this.

Background: the name lexicon is getting attention from the SD name/terminology sections, and they would like to use our name lexicon also for public searching.

Observations:

1) Multilinguality is always optional.

2) We can observe that “foreign” names in texts follow a domination pattern: majority language forms can be found in minority language texts as real names (“Kautokeino produkter”), whereas minority language names almost always occur in majority language texts as citations. And citations should not be considered a natural part of the text.

3) When looking at our name classification, multilinguality varies according to:

Ani - weak/none? (pet and mythical animal names)
Fem - weak (informative)
Mal - weak (informative)
Obj - strong
Org - strong
Plc - strong for the national and country names, weak (informative) for foreign
       names
Sur - none
Tit - strong (titles)

Suggestion:

We need to reconsider the “all names in all languages” policy. That policy is valid only for Fem, Mal, and Sur (and Ani and Tit?). For Obj, Org, Plc the rule should be that if they have multilingual names, each name should only be used in its own language. Then we need a modification saying that majority language names can be included in minority language lexicons if attested in our corpus. Also, the majority language varies according to country (obviously), which means that in a speller context, we might consider tailoring spellers for each country, leaving out noise relating to majority language names from another country.
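
The suggested rule can be sketched as a small decision function. This is only an illustration of the proposal: the function and parameter names are made up here, and Ani and Tit are left undecided because the suggestion itself leaves them open.

```python
ALL_LANGUAGES_TYPES = {"Fem", "Mal", "Sur"}   # keep "all names in all languages"
OWN_LANGUAGE_TYPES = {"Obj", "Org", "Plc"}    # each name only in its own language


def include_name(name_type, name_lang, lexicon_lang,
                 majority_langs=frozenset(), attested=False):
    """Decide whether a name belongs in the lexicon for `lexicon_lang`.

    `majority_langs` holds the majority language(s) of the country the
    lexicon or speller is tailored for; `attested` says whether the name
    was found in the corpus for `lexicon_lang`.
    """
    if name_type in ALL_LANGUAGES_TYPES:
        return True
    if name_type in OWN_LANGUAGE_TYPES:
        if name_lang == lexicon_lang:
            return True
        # Exception: attested majority language names are allowed.
        return name_lang in majority_langs and attested
    # Ani and Tit: policy not settled, so refuse to guess.
    raise ValueError("no policy decided for name type %r" % name_type)


if __name__ == "__main__":
    # A Norwegian place name in a Lule Sámi lexicon for Norway.
    print(include_name("Plc", "nob", "smj", majority_langs={"nob"}, attested=True))
```

Tailoring a speller per country then amounts to passing a different `majority_langs` set for each country.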

A further issue is whether we should reconsider our cohort policy. Today, Sur and Plc are different readings. An alternative would be to have them as secondary tags, not in conflict with each other:

"<Trosterud>"
        "Trosterud" N Prop Sur Sg Nom <<< @HNOUN
        "Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN

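The second alternative above (name types as secondary tags) can be sketched as a merge over readings. This is a hypothetical illustration: readings are represented as (lemma, tag list) pairs, and the syntactic tags are left out for brevity.

```python
NAME_TYPES = ("Ani", "Fem", "Mal", "Obj", "Org", "Plc", "Sur", "Tit")


def merge_name_readings(readings):
    """Collapse readings that differ only in name type into one reading.

    The name types are kept as secondary tags of the form <Sur>, <Plc>,
    appended after the remaining tags.
    """
    merged = {}  # (lemma, other tags) -> name types, in order of appearance
    for lemma, tags in readings:
        rest = tuple(t for t in tags if t not in NAME_TYPES)
        types = merged.setdefault((lemma, rest), [])
        for t in tags:
            if t in NAME_TYPES and t not in types:
                types.append(t)
    return [
        (lemma, list(rest) + ["<%s>" % t for t in types])
        for (lemma, rest), types in merged.items()
    ]


if __name__ == "__main__":
    cohort = [
        ("Trosterud", ["N", "Prop", "Sur", "Sg", "Nom"]),
        ("Trosterud", ["N", "Prop", "Plc", "Sg", "Nom"]),
    ]
    for lemma, tags in merge_name_readings(cohort):
        print('"%s" %s' % (lemma, " ".join(tags)))
```

Switching to the `&Sur` notation would only change the format string for the secondary tags.
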
Derivation and spellers like Aspell

North Sámi

Nothing this week?

Lule Sámi

TODO:

Komi

It appears that the conversion from LexC to XML went wrong with respect to upper:lower pairs. Consider the following entry from the Attic (noun file):

вос:воск N ;

The corresponding XML is:

<entry>
  <lemma>воск</lemma>
  <stem/>
  <contlex>Noun1</contlex>
  <POS>N</POS>
  <article>
    <ENG></ENG>
    <FIN></FIN>
    <CF></CF>
  </article>
</entry>

It should have been:

  <lemma>вос</lemma>
  <stem>воск</stem>

There seems to be an issue. Trond will look into it.
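
For reference, a minimal sketch of the intended mapping, where the upper side becomes the lemma and the lower side the stem. It is not the actual conversion script. It assumes simple entries of the form `upper:lower CONTLEX ;`, does not handle escaped colons (`%:`), and leaves the stem empty when the entry has no colon.

```python
import xml.etree.ElementTree as ET


def parse_lexc_entry(line):
    """Split a simple LexC entry into lemma (upper), stem (lower) and contlex."""
    body = line.strip()
    if not body.endswith(";"):
        raise ValueError("entry must end with ';': %r" % line)
    pair, contlex = body[:-1].rsplit(None, 1)
    upper, sep, lower = pair.partition(":")
    # Assumption: no colon means there is no separate stem.
    return {"lemma": upper, "stem": lower if sep else "", "contlex": contlex}


def entry_to_xml(fields):
    """Build an <entry> element with <lemma>, <stem> and <contlex> children."""
    entry = ET.Element("entry")
    for name in ("lemma", "stem", "contlex"):
        ET.SubElement(entry, name).text = fields[name]
    return entry


if __name__ == "__main__":
    fields = parse_lexc_entry("вос:воск N ;")
    print(ET.tostring(entry_to_xml(fields), encoding="unicode"))
```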

On Victorio:

/usr/local/cvs/repository/kt/kom/src/Attic/
~$ll /usr/local/cvs/repository/kt/kom/script/kom-utf.xml,v

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in the meeting memo (/admin/physical_meetings/tromso-2006-08-propnoun.html).

TODO:

Postponed:

9. Spellers

Speller data generation

TODO:

10. Other

Meeting with the map authorities

Sjur is going to Oslo on Wednesday for a meeting with the ministries and Statens kartverk.

Issues:

Bug fixing

64 open Divvun/Disamb bugs, and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Meetings and Marratech

Børre talked to Leif Åge about the Marratech server. All users need to be inside the SD firewall. Thus, Marratech is not an option.

Quoting Marratech:

Marratech on proxies and firewalls: http://www.marratech.com/userman/manager/app_firewalls.html

What about using proxies for Maaren? iChat has a setup page for it in Preferences->Accounts. Børre will investigate.

TODO:

Task lists as iCal entries

TODO:

Words section

You all need to check out CVS/words, and link to the relevant place, cf. how gt/doc/ is linked.

11. Next meeting, closing

Next meeting 16.10.2006 at 9:30.

Saara will work offline next week, and won’t be present at the meeting.

Closed at 12:26.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond