Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

The meeting was delayed one day.

Opened at 10:56.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi

Absent: Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

The new documentation is up and running on the giellatekno server: http://giellatekno.uit.no:8888

The main difference from our existing version is i18n and l10n: all menus, tabs etc. are now translated (including the generated PDF files), and documents available in more than one language appear with a language selection list at the top of the page.

TODO:

4. Corpus gathering

Børre got in contact with the author Erling Persen and received a book from him. He also got some texts from Lars Kintel. Some of these texts still need clarification from their authors, which Børre is working on. Most of the unproblematic texts from Lars Kintel have been added to the repository.

Number of words in our corpus is now:

TODO:

5. Corpus infrastructure

User accounts and access

TODO:

More texts to the graphical corpus interface:

TODO:

Aligner

TODO:

Language recognition

TODO:

6. Infrastructure

Xerox tools wrapped as servers

TODO:

Hyphenator

TODO:

7. Linguistics

Names and multilinguality

We need a more principled approach to this.

Background: the name lexicon is getting attention from the SD name/terminology sections, and they would like to use our name lexicon also for public searching.

Observations:

1) Multilinguality is always optional.

2) "Foreign" names in texts follow a domination pattern: majority language forms can be found in minority language texts as real names ("Kautokeino produkter"), whereas minority language names almost always occur in majority language texts as citations. Citations should not be considered a natural part of the text.

3) When looking at our name classification, multilinguality varies according to:

Ani - weak/none? (pet names, mythological animal names)
Fem - weak (informative)
Mal - weak (informative)
Obj - strong
Org - strong
Plc - strong for the national and country names, weak (informative) for foreign names
Sur - none
Tit - strong (titles)

Suggestion:

We need to reconsider the "all names in all languages" policy. That policy is valid only for Fem, Mal, and Sur (and Ani and Tit?). For Obj, Org, and Plc the rule should be that if they have multilingual names, each name should only be used in its own language. We then need a modification saying that majority language names can be included in minority language lexicons if they are attested in our corpus. Also, the majority language obviously varies according to country, which means that in a speller context we might consider tailoring spellers for each country, leaving out the noise from majority language names of another country.
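The suggested rule can be written down as a small decision function. The following is only a sketch of the suggestion as worded above, not an agreed policy: the function name, its parameters, and the language codes in the usage note are hypothetical, and Ani and Tit are left undecided because the minutes mark them with a question mark.

```python
# Classes that keep the "all names in all languages" policy.
SHARED_CLASSES = {"Fem", "Mal", "Sur"}
# Classes where each name form belongs to its own language only.
OWN_LANGUAGE_CLASSES = {"Obj", "Org", "Plc"}


def include_name(name_class, name_lang, lexicon_lang,
                 majority_langs=(), attested=False):
    """Return True if a name form should go into the lexicon for lexicon_lang.

    majority_langs: majority language(s) of the country the lexicon or
    speller is tailored for (hypothetical parameter).
    attested: whether the form is attested in the corpus for lexicon_lang.
    """
    if name_class in SHARED_CLASSES:
        return True
    if name_class in OWN_LANGUAGE_CLASSES:
        if name_lang == lexicon_lang:
            return True
        # Majority language forms are admitted only when attested.
        return name_lang in majority_langs and attested
    # Ani and Tit are open questions in the minutes, so no answer is guessed.
    raise ValueError(f"no policy decided for class {name_class!r}")
```

For a North Sámi speller tailored for Norway, `include_name("Org", "nob", "sme", majority_langs={"nob"}, attested=True)` admits an attested Norwegian organisation name, while the same call with `"fin"` as the name language rejects it.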

A further issue is whether we should reconsider our cohort policy. Today, Sur and Plc are different readings. An alternative would be to have them as secondary tags, not in conflict with each other:

"<Trosterud>"
        "Trosterud" N Prop Sur Sg Nom <<< @HNOUN
        "Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN
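To make the alternative concrete, here is a minimal sketch of how two readings that differ only in their name class could be merged into one reading with secondary tags, as in the second cohort above. The function name and the tag placement rule (before `<<<` or the first `@` tag) are assumptions read off the example, not an existing tool.

```python
NAME_CLASSES = {"Ani", "Fem", "Mal", "Obj", "Org", "Plc", "Sur", "Tit"}


def merge_name_readings(readings):
    """Merge readings that are identical apart from their name class tag."""
    merged = {}  # reading without class tags -> class tags seen, in order
    for reading in readings:
        tokens = reading.split()
        classes = [t for t in tokens if t in NAME_CLASSES]
        rest = tuple(t for t in tokens if t not in NAME_CLASSES)
        seen = merged.setdefault(rest, [])
        for name_class in classes:
            if name_class not in seen:
                seen.append(name_class)

    result = []
    for rest, classes in merged.items():
        tokens = list(rest)
        # Put the secondary tags before the boundary or syntactic tags.
        pos = next((i for i, t in enumerate(tokens)
                    if t == "<<<" or t.startswith("@")), len(tokens))
        tokens[pos:pos] = [f"<{c}>" for c in classes]
        result.append(" ".join(tokens))
    return result


readings = [
    '"Trosterud" N Prop Sur Sg Nom <<< @HNOUN',
    '"Trosterud" N Prop Plc Sg Nom <<< @HNOUN',
]
print(merge_name_readings(readings))
# ['"Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN']
```

Readings that differ in anything other than the name class stay separate, so ordinary morphological ambiguity is not affected.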

Derivation and spellers like Aspell

TODO:

North Sámi

The sme generator (make wordlist TARGET=sme) (possibly also smj) generates nonsense strings, at least in the sense that they are not recognised by the analysing transducers (e.g. by sme.fst). This should not happen!

Example of generated string that’s not recognised by the analyser:

šuorpmoskáidilašruoksatčeavžžatvuođaineattet
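A check like this can be automated by running the generated word list back through the analyser and listing every form that gets no analysis. The sketch below assumes that the `lookup` tool prints tab-separated lines and marks failures with `+?` in the last field, and that it reads word forms from standard input when given the transducer file as its only argument. Both are assumptions to verify locally, and the function names are hypothetical.

```python
import subprocess


def unrecognised_forms(lookup_output):
    """Return the input forms whose analysis is the failure marker '+?'."""
    failed = []
    for line in lookup_output.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[-1].strip() == "+?":
            if fields[0] not in failed:
                failed.append(fields[0])
    return failed


def run_analyser(wordlist_text, fst_path="sme.fst"):
    """Send a word list through the analyser (invocation is an assumption)."""
    result = subprocess.run(["lookup", fst_path], input=wordlist_text,
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `unrecognised_forms(run_analyser(wordlist))` on the generated list should return an empty list if generator and analyser are built from the same sources.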

Two lessons learned:

  1. make sure to always use identical versions of source files and compiled files in all testing!
  2. there was a hidden circularity in the propernoun file
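A circularity of the second kind can be looked for mechanically by treating the continuation lexicons as a graph and searching it for a cycle. The sketch below uses a deliberately simplified reading of lexc files: one entry per `;`, the continuation lexicon is the last token of the entry, `!` starts a comment, and info strings and escaped characters are ignored. The function names are hypothetical. Some cycles may be intentional, so the result is a list of lexicons to inspect rather than an error.

```python
def continuation_graph(lexc_text):
    """Map each lexicon name to the set of lexicons its entries continue to."""
    graph = {}
    current = None
    for raw in lexc_text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop comments (simplified)
        if not line:
            continue
        if line.startswith("LEXICON "):
            current = line.split()[1]
            graph.setdefault(current, set())
            continue
        if current is None:
            continue  # header material before the first LEXICON
        for entry in line.split(";"):
            tokens = entry.split()
            if tokens and tokens[-1] != "#":  # '#' ends the word
                graph[current].add(tokens[-1])
    return graph


def find_cycle(graph):
    """Return one cycle as a list of lexicon names, or None if there is none."""
    state = {}  # lexicon -> "active" (on the current path) or "done"
    path = []

    def visit(node):
        state[node] = "active"
        path.append(node)
        for nxt in sorted(graph.get(node, ())):
            if state.get(nxt) == "active":
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = "done"
        return None

    for start in sorted(graph):
        if start not in state:
            cycle = visit(start)
            if cycle:
                return cycle
    return None
```

Running `find_cycle(continuation_graph(text))` on the contents of a lexc file returns the first loop found, with the same lexicon name at both ends.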

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in the meeting memo: /admin/physical_meetings/tromso-2006-08-propnoun.html

TODO:

Postponed:

9. Spellers

Speller data generation

Tomi has done more work. We are now awaiting the full specification of the PLX format to be able to generate correct output.

There were discussions in the newsgroup last week, but no conclusion yet. We need to finish and conclude the discussion.

TODO:

Automatic testing of the Word spellchecker

It should be possible to write a script that runs texts through Word from the command line, using a combination of shell script and AppleScript. MS Word has the needed AppleScript commands to run the spell checker.
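As a starting point, the scoring side of such a test can be separated from the part that drives Word. The sketch below is an assumption about how a test run could be organised, not an existing script: `run_applescript` only wraps the macOS `osascript` command, the AppleScript text that asks Word to check a word is left to the caller because the exact commands are not given here, and `evaluate_speller` and its result keys are hypothetical names.

```python
import subprocess


def run_applescript(script):
    """Run an AppleScript snippet via osascript and return its text output."""
    result = subprocess.run(["osascript", "-e", script],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def evaluate_speller(test_words, flagged):
    """Compare the speller's verdicts with a gold standard.

    test_words: dict mapping each word to True if it is correctly spelled.
    flagged: set of words the speller marked as misspelled.
    """
    caught = sum(1 for w, ok in test_words.items() if not ok and w in flagged)
    missed = sum(1 for w, ok in test_words.items() if not ok and w not in flagged)
    false_alarms = sum(1 for w, ok in test_words.items() if ok and w in flagged)
    return {"caught": caught, "missed": missed, "false_alarms": false_alarms}
```

The set passed as `flagged` would be collected by a shell or Python loop that calls `run_applescript` once per text or word, so the same scoring code can be reused if the way Word is driven changes.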

TODO:

10. Other

Report from Gothenburg

The seminar focused on actions to establish common infrastructure for language technology in the Nordic countries, promoting cooperation and sharing of resources.

Sjur would like our projects to take some first public steps by announcing on the NoDaLi e-mail list that our corpus contracts and our infrastructure are available to everyone interested.

Prerequisites:

TODO:

Bug fixing

64 open Divvun/Disamb bugs, and 24 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Task lists as iCal entries

TODO:

Employee seminar in Alta

SD has an employee seminar in Alta on 7-8 December. Should we go there? Sjur will ask Julie Eira whether we have to attend.

TODO:

11. Next meeting, closing

Closed at 12:27.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond