Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:45.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

One small problem: Forrest seems to crash on raw HTML. Børre will check it.

TODO:

4. Corpus gathering

Two more authors have been contacted, and both were positive. Åge Solbakk is coming to Tromsø next week, and Børre will meet with him then.

Børre has been digging more in the SD archives, and found some more texts.

sma

Inger Johansen found Trond’s memory stick, so we now have some sma texts. We will also get 200 000 words from Ove Lorentz (there is probably some overlap). Bierna Bientie has been travelling this autumn and should be back this week; Trond will contact him for sma Bible texts.

smj

The Disamb project plans to have an smj week next week, and would very much like to have some more texts by then.

TODO:

5. Corpus infrastructure

User accounts and access

TODO:

More texts to the graphical corpus interface:

We have sent approximately 10 texts to Oslo, aligned and with sme analysis. The nob texts must now be analysed and the texts added to the interface. We will not send more texts for now, but will use these for testing.

TODO:

Aligner

TODO:

Language recognition

Trond and Saara have done some work on paragraphs with mixed content.

Types of mixed paragraphs in the newspaper texts:

  1. Norwegian quotations (titles, lines of dialogue, etc.)
  2. Bilingual text, separated by some separator: (/)
  3. Systematic omissions in the original translations
  4. Technical text for the typographers
  5. Names
  6. Unsystematic Norwegian parts of sentences

Examples of the types:

  1. Muhtomin láve friddjavuođadovdu ja eará háve fas dakkár dovdu ahte “Dere tråkker faen ikke på meg lenger”, logai Mari ovdalgo lávllestii maŋemus lávlaga ja konsearta lei ollislaš. ¶ Riikkaviidosaš aviisa Dagbladet lea jođihan dán iskkadeami. Logi eanemus liikojuvvon divttat leat earret eará Arnulf Øverland dikta «Du må ikke sove», ¶
  2. Vi har spurt eldre samer om hvordan de hadde det før i tiden./ Mii leat jearahallan vuorraset sápmelaččain mo sis lei bajásšattadettiin? ¶
  3. Du lihkkologut: 1, 14, 27 og 31 ¶
  4. BILDE:Kjell Kemi og Mai Britt Utsi ¶ HOVEDSAK: Bilde av Sponheim og rein ¶
  5. Eambbo dieđuid daid ortnegiid birra ja ohcanskoviid gávnnat min ruovttusiidduin, www.slf.dep.no - Økologisk. Sáhtát maid ságastit SLF:ain, Šaddobuvttadeami- ja ekologalaš eanadoalu ossodat, Seksjon planteproduksjon og økologisk landbruk. ¶
  6. 1992:s lei NSR sámi delegašuvnnas mii soabadii Justisministariin om opplegget for sedvanerettsundersoekelsen og folkerettsutredningen.

The first is the most common one. In the MÁ corpus, there are 4000 strings with quotations, 2500 of them with directed (left/right) quotations.

Suggestion for handling the types:

  1. Quoted strings: Pick out the quoted strings and check them separately
    1. When: in the preprocessor or during conversion to XML? Decision: during conversion to XML, see below
  2. Do nothing or look for known separators when the recognition returns alternatives? Safest approach: pessimistic - do nothing.
  3. Do nothing, and add “og” as a loan word in the lexicon (with !SUB!!!)
  4. Identify the technical words BILDE, HOVEDSAK, then mark them as unwanted(?)
  5. Do nothing. (we go for CC “og”)
  6. Do nothing for the time being (bilingual analysis in the future?)
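
Suggestion 4 could be sketched roughly as below. This is only an illustration: it assumes the typographer markers always stand in upper case at the start of a paragraph and are followed by a colon, and both the marker list and the function name are made up for the example.

```python
import re

# Typographer markers seen in the examples above. The list is an assumption
# and would have to be extended from the corpus.
TECHNICAL_MARKERS = ("BILDE", "HOVEDSAK")

_MARKER_RE = re.compile(r"^\s*(%s)\s*:" % "|".join(TECHNICAL_MARKERS))


def is_technical_paragraph(paragraph: str) -> bool:
    """Return True if the paragraph starts with a known typographer marker."""
    return _MARKER_RE.match(paragraph) is not None
```

Paragraphs for which the function returns True could then be marked as unwanted during conversion, while a paragraph such as "Du lihkkologut: 1, 14, 27 og 31" is left alone.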

Conversion of quotations:

<p lang=a>...dovdu ahte «Dere... » ...</p>
  -- converted to: --
<p lang=a>...dovdu ahte <span type="quote" lang="nb">«Dere... »</span> ...</p>
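
A minimal sketch of this conversion step, assuming that only directed quotation marks («» and “”) are handled and that the language of each quoted string comes from the language recogniser. The guess_language callback is a hypothetical stand-in for that recogniser, not an existing function.

```python
import re
from xml.sax.saxutils import escape

# Directed quotations only; straight quotes are ambiguous and are not handled.
QUOTE_RE = re.compile(r"«[^«»]*»|“[^“”]*”")


def mark_quotations(text: str, guess_language) -> str:
    """Wrap each directed quotation in a <span type="quote" lang="..."> element.

    `guess_language` is a hypothetical callback standing in for the language
    recogniser: it takes the quoted string and returns a language code.
    """
    parts = []
    last = 0
    for match in QUOTE_RE.finditer(text):
        quote = match.group(0)
        # Text before the quotation is copied through, XML-escaped.
        parts.append(escape(text[last:match.start()]))
        parts.append('<span type="quote" lang="%s">%s</span>'
                     % (guess_language(quote), escape(quote)))
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)
```

The function returns the content of the paragraph only; the surrounding p element is left to the rest of the conversion.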

Types of quotations:

Norwegian sequences could be strung together and treated as an unanalysable part of the sentence.

Language distribution of paragraphs, as identified by the language recogniser:

LANG  # hits - reality:
sme   68431  - true
nob   10595  - true
smj    8468  - mostly sme, some smj
nno    1220  - mostly nob
eng     994  - true
fin     956  - some true, most sme?
dan     482  - false
ger     252  - false
sma      81  - mostly true, some short paragraphs may be false
isl       9  - false
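
To put the counts in proportion, the table can be summarised with a few lines of Python. The numbers are copied from the table; grouping dan, ger and isl as "plainly false" follows the reality column, and the variable names are only illustrative.

```python
# Paragraph counts per language label, copied from the table above.
HITS = {
    "sme": 68431, "nob": 10595, "smj": 8468, "nno": 1220, "eng": 994,
    "fin": 956, "dan": 482, "ger": 252, "sma": 81, "isl": 9,
}

# Labels judged plainly "false" in the table.
FALSE_LABELS = {"dan", "ger", "isl"}

total = sum(HITS.values())
false_hits = sum(HITS[lang] for lang in FALSE_LABELS)

# Print each label with its share of all paragraphs, largest first.
for lang, hits in sorted(HITS.items(), key=lambda item: -item[1]):
    print("%-4s %6d %6.2f%%" % (lang, hits, 100.0 * hits / total))
print("total %d, plainly false labels: %d" % (total, false_hits))
```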

TODO:

6. Infrastructure

Xerox tools wrapped as servers

The paradigm generator is now finished (there are still some problems with the XML). The server interface has changed, so Tomi’s script needs to be updated.

Saara needs paradigm grammars for all POSes, see the example for N:

N+Subclass?+Number+Case+Possessive?+Clitic?
V+
A+
Adv+
Pron+
...
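
One way such a paradigm grammar could be expanded into tag strings is sketched below. It assumes that each element after the POS names a slot with a fixed set of values and that a trailing "?" marks an optional slot. The inventory in the example is a toy one, and its tag values are assumptions, not the real sme tag set.

```python
from itertools import product


def expand_pattern(pattern: str, inventory: dict) -> list:
    """Expand a pattern like 'N+Number+Case+Clitic?' into tag strings.

    The first element is the POS tag itself. Every other element names a
    slot in `inventory`; a trailing '?' marks the slot as optional.
    """
    pos, *slots = pattern.split("+")
    choices = []
    for slot in slots:
        if not slot:                   # tolerate a trailing '+', as in 'V+'
            continue
        values = list(inventory[slot.rstrip("?")])
        if slot.endswith("?"):
            values = [None] + values   # None means the slot is left out
        choices.append(values)
    results = []
    for combo in product(*choices):
        tags = [pos] + [tag for tag in combo if tag is not None]
        results.append("+".join(tags))
    return results


# Toy inventory; the tag values are assumptions for illustration only.
INVENTORY = {
    "Number": ["Sg", "Pl"],
    "Case": ["Nom", "Gen"],
    "Clitic": ["Foc/ge"],
}

forms = expand_pattern("N+Number+Case+Clitic?", INVENTORY)
```

Each resulting string, prefixed with a lemma, could be sent to the generator as one lookup.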

The inflector (generator as server) has four output options:

  1. short paradigm (nom, gen, gen pl)
  2. standard paradigm (full w/o poss and clitics)
  3. complete (incl. poss. and clitics)
  4. take any single string including tags, return inflected form

Input is one of:

Next: add the hyphenation filter to the hyphenator server

TODO:

Hyphenator

sma

Trond had some discussions with Ove Lorentz. We have done “maximize coda”, he wants “maximize onset”.

Unrecognised word forms

The unrecognised forms are forms generated by the nonrec transducer, but come out as question marks after going through the hyphenating transducer.

The command sequence is:

  1. log in to victorio, move to gt/
  2. make wordlist TARGET=sme (the result is: sme/wordlist-sme.txt.gz)
  3. move wordlist-sme.txt.gz to local computer (or G5?)
  4. make TARGET=sme (gives sme.fst)
  5. make hyph TARGET=sme (gives hyph-sme.fst)
  6. gunzip wordlist-sme.txt.gz
  7. cat wordlist-sme.txt | lookup -flags mbTT -utf8 bin/hyph-sme.fst > output.txt
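
The unrecognised forms can then be picked out of output.txt with a small filter like the one below. It is a sketch that assumes tab-separated lookup output in which a failed lookup ends in the "+?" marker; the function name is made up for the example.

```python
def unrecognised_forms(lines):
    """Yield the input forms that lookup could not analyse.

    Assumes tab-separated lookup output in which a failed lookup has
    '+?' as its last field.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[-1].strip() == "+?":
            yield fields[0]


# Usage, with output.txt produced by the last step above:
# with open("output.txt", encoding="utf-8") as handle:
#     for form in unrecognised_forms(handle):
#         print(form)
```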

TODO:

M4

M4 is problematic for the CG rules, as the rule numbering gets mixed up. The hope is that the same effect can be achieved in CG3. We have made no progress with CG3 so far, but we will continue the discussion with Odense.

7. Linguistics

Names and multilinguality

We need a more principled approach to this.

Background: the name lexicon is getting attention from the SD name/terminology sections, and they would like to use our name lexicon also for public searching.

Observations:

1) Multilinguality is always optional.

2) We can observe that “foreign” names in texts follow a domination pattern: majority language forms can be found in minority language texts as real names (“Kautokeino produkter”), whereas minority language names almost always occur in majority language texts as citations. And citations should not be considered a natural part of the text.

3) When looking at our name classification, multilinguality varies according to:

Ani - weak/none? (pet, myth anim.  names)
Fem - weak (informative)
Mal - weak (informative)
Obj - strong
Org - strong
Plc - strong for the national and country names, weak (informative) for foreign
       names
Sur - none
Tit - strong (titles)

Suggestion:

We need to reconsider the “all names in all languages” policy. That policy is valid only for Fem, Mal, and Sur (and Ani and Tit?). For Obj, Org, and Plc the rule should be that if they have multilingual names, each name should only be used in its own language. We then need a modification saying that majority language names can be included in minority language lexicons if attested in our corpus. Also, the majority language varies by country, which means that in a speller context we might consider tailoring spellers for each country, leaving out noise from majority language names of another country.
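
The suggested rule could be written down as a small decision function. This is only a sketch of the suggestion as worded here: Ani and Tit are still open questions and are treated like Fem, Mal and Sur in the example, and the function and parameter names are hypothetical.

```python
# Classes for which a multilingual name is only used in its own language.
OWN_LANGUAGE_ONLY = {"Obj", "Org", "Plc"}


def include_name(name_class, name_lang, lexicon_lang,
                 multilingual=True, majority_langs=(), attested=False):
    """Decide whether a name form goes into the lexicon for `lexicon_lang`.

    name_class     -- Ani, Fem, Mal, Obj, Org, Plc, Sur or Tit
    name_lang      -- the language of this particular name form
    multilingual   -- whether the name has forms in several languages
    majority_langs -- majority languages relevant for this lexicon
    attested       -- whether the form is attested in our corpus
    """
    if name_class not in OWN_LANGUAGE_ONLY:
        return True            # all names in all languages
    if not multilingual:
        return True            # only one form exists, so it is used everywhere
    if name_lang == lexicon_lang:
        return True            # each name in its own language
    # Exception: majority language forms attested in the corpus.
    return attested and name_lang in majority_langs
```

A country-specific speller would call the function with only that country's majority language in majority_langs.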

A further issue is whether we should reconsider our cohort policy. Today, Sur and Plc are different readings. An alternative would be to have them as secondary tags, not in conflict with each other:

"<Trosterud>"
        "Trosterud" N Prop Sur Sg Nom <<< @HNOUN
        "Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN

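The step from the first cohort to the second one (name classes as secondary tags in angle brackets) could be sketched as below. The reading format, a plain list of tags per reading, is an assumption made for the example, and the function name is hypothetical.

```python
NAME_CLASSES = {"Ani", "Fem", "Mal", "Obj", "Org", "Plc", "Sur", "Tit"}


def merge_name_readings(readings):
    """Merge readings that differ only in their name-class tag.

    Each reading is a list of tags, e.g. ["N", "Prop", "Sur", "Sg", "Nom"].
    The name classes are moved to the end as secondary tags in angle
    brackets.
    """
    merged = {}   # tags without name class -> list of name classes
    for tags in readings:
        core = tuple(tag for tag in tags if tag not in NAME_CLASSES)
        classes = merged.setdefault(core, [])
        for tag in tags:
            if tag in NAME_CLASSES and tag not in classes:
                classes.append(tag)
    return [list(core) + ["<%s>" % name_class for name_class in classes]
            for core, classes in merged.items()]
```

With the two Trosterud readings as input, the result is a single reading carrying both secondary tags, so Sur and Plc no longer compete in disambiguation.
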
Derivation and spellers like Aspell

TODO:

North Sámi

Unwanted word forms:

Questionable forms:

a-a     a-a+Interj
á-a     a-a+Interj  !SUB
ASDF:a  ASDF+N+ACR+Sg+Gen
A:a     A/S+N+ACR+Sg+Acc
f:a     f:a     +? (wanted: f Gen, f:s Loc)

from SGL meeting:
003/05: North Sámi words for standardisation
(The questions have come from the developers of the Divvun program)
1)
How should the inflection of abbreviations be marked, e.g. NRK:as, NRKas, NRK-as?
The former Sámi Language Council (in Norway) proposed the following spelling:
	Nom.	NSR
	Akk.	NSR
	Gen	NSR`
The question is whether we should still abbreviate in this way.
Decision:
Abbreviations are inflected in the following way:
nom. 	NSR
 	Akk. 	NSR  (not NSR:a) <== a and NOT a:a
	Gen. 	NSR  <== a and NOT a:a
	Ill. 	NSR:i
	Lok.	NSR:s
	Kom.	NSR:in
	Ess.	NSR:n
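
The decision table can be expressed as a small lookup, sketched below. It reproduces the table for NSR only (with English case labels: Acc for Akk., Loc for Lok., Com for Kom.); whether the endings vary with the final letter of the abbreviation is not covered by the decision as quoted here, and the function name is made up for the example.

```python
# Case endings for abbreviations, copied from the decision table above.
# An empty string means the form is identical to the nominative.
ABBREVIATION_ENDINGS = {
    "Nom": "", "Acc": "", "Gen": "",
    "Ill": ":i", "Loc": ":s", "Com": ":in", "Ess": ":n",
}


def inflect_abbreviation(abbreviation: str) -> dict:
    """Return the case forms of an abbreviation according to the table."""
    return {case: abbreviation + ending
            for case, ending in ABBREVIATION_ENDINGS.items()}
```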

Correct:
abstrávttabuinnán
abstrávttabuinnán       abstrákta+A+Comp+Sg+Com+PxSg1
abstrávttabuinnán       abstrákta+A+Comp+Pl+Loc+PxSg1

Error?
abstrávttaboiinnán
abstrávttaboiinnát
abstrávttaboiinnis
abstrávttaboiinniset
abstrávttaboiinnán
abstrávttaboiinnán      abstrávttaboiinnán      +?

**These are included even though they are !SUB
***must be removed!

accompagnerejun V+TV+Der/j+Pass+PrfPrc
ábuhuvvože      ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne V+IV+Der/alla+Der/goahti+Pot+Prs+Du1

*Some were marked !sub (in lowercase, by me? Does it make any difference?). I
have changed them. Note, however, that the forms above did NOT have lowercase letters.

NB! SUB marking has to be with uppercase SUB to be removed.
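
The case sensitivity can be illustrated with a minimal filter. How the removal is actually done in the build is not described in these minutes, so both the function and the entry format in the example are assumptions.

```python
def strip_sub_entries(lines):
    """Drop lexicon lines marked with an uppercase !SUB comment.

    The match is case-sensitive on purpose, so a lowercase '!sub'
    marking is NOT removed.
    """
    return [line for line in lines if "!SUB" not in line]
```

An entry marked "!sub" passes through such a filter untouched, which matches the observation that lowercase markings had no effect.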

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in the meeting memo: /admin/physical_meetings/tromso-2006-08-propnoun.html

TODO:

Postponed:

9. Spellers

Speller data generation

The speller data generator reads the lexc files and communicates with the server, but the XML communication needs to be redone. There is no speller output yet; this needs to be done this week. For now it outputs the entries built by the conversion engine.

TODO:

Automatic testing of the Word spellchecker

TODO:

10. Other

Bug fixing

64 open Divvun/Disamb bugs, and 24 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Task lists as iCal entries

Børre should have a look at Maaren’s computer when he is in Kautokeino.

TODO:

Employee seminar in Alta

SD has an employee seminar in Alta 7.-8. December - should we go there? Sjur will ask Julie Eira if we have to go there.

TODO:

11. Next meeting, closing

Next meeting 6.11.2006 at 9:30 (on the Swedish day in Finland - Swedish as the language, not the country:-) ).

Closed at 12:14.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond