Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:47.

Present: Børre, Sjur, Thomas, Tomi, Trond

Absent: Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

The xtdoc/sd part in CVS has a branch, i18n-reform, that has been internationalised. What is still missing is a mechanism for reaching a document in another language when one is available, ideally by listing all other language versions.
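The missing mechanism could be sketched as follows. This is only an illustration: the file naming convention `<base>.<lang>.<ext>` and the function name are assumptions made for the example, and the i18n-reform branch may well organise its files differently.

```python
import re

# Hypothetical naming convention: <base>.<lang>.<ext>, e.g. "index.sme.xml".
# The i18n-reform branch may name its files differently.
LOCALISED = re.compile(r"^(?P<base>.+)\.(?P<lang>[a-z]{2,3})\.(?P<ext>[^.]+)$")


def other_language_versions(filename, available):
    """Return {lang: filename} for the other language versions of filename
    that occur in the iterable available."""
    match = LOCALISED.match(filename)
    if match is None:
        return {}
    versions = {}
    for candidate in available:
        other = LOCALISED.match(candidate)
        if (other and candidate != filename
                and other.group("base") == match.group("base")
                and other.group("ext") == match.group("ext")):
            versions[other.group("lang")] = candidate
    return versions
```

Given the files `index.en.xml`, `index.sme.xml` and `about.en.xml`, a call for `index.en.xml` would return only the `sme` version, which is the list a page could show as links to its other language versions.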

The work on getting the Sámi characters into the PDF output has also been done here. It works, but only with absolute paths, which are not portable across computers and therefore not acceptable. This needs more work.

TODO:

4. Corpus gathering

Nothing happened last week.

5. Corpus infrastructure

General

Saara has made the makefile and written [documentation for the process | /ling/corpus_conversion_tech.html]:

cd /usr/local/share/corp
make LANGUAGE=sme GENRE=facta

or

make bound/sme/admin/sd/dc_1_04.doc.xml

The default language is “sme” if none is given on the command line. In theory GENRE can be omitted. For now, however, some of the MinAigi filenames contain characters that make cannot handle (%), so GENRE has to be specified explicitly (and not as “news”). We are working on the troublesome filenames.
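Until the makefile can cope with them, the troublesome filenames can be listed with a small script. A minimal sketch, assuming that `%` is the only problem character (it is the only one named here); the function name is hypothetical and the script is not part of Saara's makefile.

```python
import os

# Characters assumed to be troublesome for make; the memo only names "%".
TROUBLESOME = "%"


def troublesome_files(root, chars=TROUBLESOME):
    """Return sorted paths under root whose file name contains any of chars."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(c in name for c in chars):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Calling `troublesome_files()` on the directory that holds the original files gives the list of names that would have to be renamed or escaped before GENRE can be left out.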

TODO:

User accounts and access

For details, see a previous meeting memo, as well as the memo from a [dedicated meeting|/infra/corpus_policy.html].

Shell access

TODO:

Web browser access

TODO:

More texts to the graphical corpus interface:

TODO:

Aligner

The aligner aligns well, better than its competitors. Unfortunately it is slow and depends on manual input. It is a Java application that works on only a single file at a time. It can be downloaded - ask Trond for the link. Example text needs to be in a certain XML format (TEI, it seems) - again, ask Trond for a sample.

TODO:

Language recognition

Saara has worked on the issue. Short paragraphs (e.g. phone numbers) are problematic, so paragraphs under 50 words are now given the document’s main language. We will also need more sma text. There is still room for improvement here.
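The fallback rule for short paragraphs can be sketched like this. It is an illustration of the rule only, not Saara's implementation: `guess_language` is a hypothetical stand-in for the real recogniser, and word counting by whitespace is an assumption.

```python
MIN_WORDS = 50  # threshold from the memo: shorter paragraphs inherit the main language


def assign_languages(paragraphs, main_language, guess_language, min_words=MIN_WORDS):
    """Pair each paragraph with a language code.

    guess_language is a stand-in for the real recogniser: a callable that
    takes a string and returns a language code.
    """
    result = []
    for text in paragraphs:
        if len(text.split()) < min_words:
            # Too short to classify reliably (e.g. a phone number).
            result.append((text, main_language))
        else:
            result.append((text, guess_language(text)))
    return result
```

The threshold is a parameter, so it is easy to try other values while looking for improvements.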

TODO:

Corpus summary

The time-based statistics are still missing.

TODO:

6. Infrastructure

Xerox tools wrapped as servers

To improve throughput and response time under heavy load, it would be very useful to have the Xerox tools wrapped as servers.
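The core of such a wrapper is to start the tool once and reuse the running process, so that the start-up cost is paid only once. A minimal sketch under loud assumptions: the Xerox tools are not invoked here, a small Python child process stands in for them, and the one-line-in, one-line-out protocol is a simplification. The class name is hypothetical.

```python
import subprocess
import sys

# Stand-in for a slow-starting analyser: a child process that reads one line
# and answers with one line. The real Xerox tools are NOT invoked here.
STAND_IN = [sys.executable, "-u", "-c",
            "import sys\n"
            "for line in sys.stdin:\n"
            "    print(line.strip().upper(), flush=True)\n"]


class PersistentTool:
    """Start the tool once and reuse it for many requests."""

    def __init__(self, command=STAND_IN):
        self.proc = subprocess.Popen(
            command, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True, encoding="utf-8", bufsize=1)

    def query(self, word):
        """Send one line to the child and return its one-line answer."""
        self.proc.stdin.write(word + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        """Close the child's input and wait for it to exit."""
        self.proc.stdin.close()
        self.proc.wait()
```

A real server would put a socket front end on top of `query()` and would have to read multi-line answers up to whatever end marker the wrapped tool emits.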

TODO:

Hyphenator

TODO:

Automatic Bugzilla reminder for untouched bugs

TODO:

M4

Setup and infra finished. Now we are ready to start using M4.

TODO:

7. Linguistics

Derivation and spellers like Aspell

To make it easier to extract all derived stems, we should enhance the tags used for derivations in sme to make them easier to grep. The most straightforward solution is to make the tags follow the same pattern as for smj, +Der/NNN. Presently sme is only using the NNN part as a tag, where NNN represents the derivational suffix. That is, there is no single pattern to match against for sme.
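With the smj pattern in place, a single regular expression is enough to pick out the derivations. A minimal sketch, assuming that the NNN part runs up to the next `+` or to the end of the analysis; the function name is hypothetical.

```python
import re

# Matches tags of the smj form +Der/NNN, where NNN is the derivational suffix.
# Assumes NNN runs to the next "+" or to the end of the analysis.
DER_TAG = re.compile(r"\+Der/([^+\s]+)")


def derivation_suffixes(analysis):
    """Return the derivational suffixes found in one analysis string, in order."""
    return DER_TAG.findall(analysis)
```

For an analysis such as `addit+V+TV+Der/st+Der/alla+Inf` this returns `['st', 'alla']`, and for an analysis without derivation tags it returns an empty list.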

Problematic issue: the disamb output will presently give information only about non-lexicalised derivations. This can potentially give false data on the frequency of derivational affixes. To get (more) precise data, there should be an option in lookup2cg that favours derived analyses over lexicalised ones, everything else being equal. A similar option for compounds can also be useful in certain contexts. That is, the present behaviour should be partly turned upside-down (“select the analysis/-es with the fewest compounds and derivations available”). The best alternative would probably be to select the next to least complex analysis, that is, to allow only one compound border or one derivation.
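The two selection policies can be compared in a small sketch. It does not reproduce lookup2cg: it is an illustration in Python that assumes analyses are plain strings with `+Der/` tags, and it counts derivations only, not compound borders. The function names are hypothetical.

```python
import re

DER = re.compile(r"\+Der/")


def complexity(analysis):
    """Number of derivation tags in an analysis (compounds are ignored here)."""
    return len(DER.findall(analysis))


def fewest_derivations(analyses):
    """Present behaviour as described above: keep the least complex analyses."""
    if not analyses:
        return []
    lowest = min(complexity(a) for a in analyses)
    return [a for a in analyses if complexity(a) == lowest]


def prefer_single_derivation(analyses):
    """Proposed option: keep the analyses with exactly one derivation if any
    exist, otherwise fall back to the least complex ones."""
    single = [a for a in analyses if complexity(a) == 1]
    return single or fewest_derivations(analyses)
```

Given `attástallat+V+TV+Inf`, `attistit+V+TV+Der/alla+Inf` and `addit+V+TV+Der/st+Der/alla+Inf`, the first function keeps only the lexicalised analysis, while the second keeps only the one with a single derivation.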

Only regard derivations for now. Two ways:

  1. Remove complex verbs and nouns from the lexicon files.
    1. search stalla
  2. Turn the lookup2cg evaluation upside down.

The output from our transducer as it is now is shown below. It shows that strategy 1 above might not be that easy. In fact, it can only be used if such words are explicitly marked, and we know that those words will also be analysed without the lexicalised variants.

albmástallat
albmástallat    albmástallat+V+IV+Inf
albmástallat    albmástallat+V+IV+Ind+Prs+Pl1

Why not:
almmái+N+Der/stalla+V...
attástallat
No derivations N->V, only v->V
jeagoheapmi+A+Der/huvva+V jeagohuvvat
jeagoheapmi+A+Der/huhtti+V jeagohuhttit
jeagoheapmi+A+Der/hudda+V jeagohuddat
     ^^^^^^
-> disappears. Does this happen to all A's on -heapmi? With all derivations?
yes to derivations. When A's compound, only with attr. form.

muorahisvuohta
goikkis > goikebiergu vs. gievra > gievrras olmmái

muorra+N+Der/huvvat+V muorahuvvat
muorra+N+Der/heapmi+A muoraheapmi

jeagohuvvat
jeagohuvvat     jeagohuvvat+V+IV+Inf
jeagohuvvat     jeagohuvvat+V+IV+Ind+Prs+Pl1
jeagohuvvat     jeagoheapme+A+Der/huvva+V+IV+Inf
jeagohuvvat     jeagoheapme+A+Der/huvva+V+IV+Ind+Prs+Pl1
jeagohuvvat     jeagoheapmi+A+Der/huvva+V+IV+Inf
jeagohuvvat     jeagoheapmi+A+Der/huvva+V+IV+Ind+Prs+Pl1

attástallat
attástallat     attistit+V+TV+Der/alla+Inf
attástallat     attistit+V+TV+Der/alla+Ind+Prs+Pl1
attástallat     attestit+V+TV+Der/alla+Inf
attástallat     attestit+V+TV+Der/alla+Ind+Prs+Pl1
attástallat     attástallat+V+TV+Inf
attástallat     attástallat+V+TV+Ind+Prs+Pl1
attástallat     addit+V+TV+Der/st+Der/alla+Inf
attástallat     addit+V+TV+Der/st+Der/alla+Ind+Prs+Pl1

"<attástallat>"
         "attistit" V TV Der/alla Ind Prs Pl1
         "addit" V TV Der/st Der/alla Inf
         "attástallat" V TV Inf
         "attestit" V TV Der/alla Ind Prs Pl1
         "attestit" V TV Der/alla Inf
         "addit" V TV Der/st Der/alla Ind Prs Pl1
         "attástallat" V TV Ind Prs Pl1
         "attistit" V TV Der/alla Inf

bisuhit IV>TV
      we have N+Dim+N+Dim+N  goađázaš

TODO:

Semantic double-tagging of names

The policy needs documentation. Thus:

TODO:

North Sámi

Nothing this week, but see above re: derivations.

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

Names found containing double inflection definitions:

Genova  adding multiple infl classes.
Guttorm adding multiple infl classes.
Heddy   adding multiple infl classes.
Heimo   adding multiple infl classes.
J?vreg?ddi      adding multiple infl classes.
Klaipeda        adding multiple infl classes.
Territory       adding multiple infl classes.

These are all wrong, and should be corrected. There should be no names with two different inflection lexicons for the same name.
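A check like this can be automated once the name entries are available as pairs of name and inflection lexicon. A minimal sketch: the input format and the function name are assumptions made for the example, not the actual format of the name lexicon.

```python
from collections import defaultdict


def names_with_multiple_classes(entries):
    """entries: iterable of (name, inflection_class) pairs.

    Returns a dict mapping each name that has more than one distinct
    inflection class to the sorted list of those classes.
    """
    classes = defaultdict(set)
    for name, infl_class in entries:
        classes[name].add(infl_class)
    return {name: sorted(found) for name, found in classes.items() if len(found) > 1}
```

A name listed twice with the same class is not reported; only names with two or more different classes are.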

TODO:

9. Tromsø meeting round-up

TODO:

10. Other

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with [bug 279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279] (Perl locale). Not much help… Saara will contact Roy on this issue.

Gobby

TODO:

Compilation on victorio

...
DeverbalVerbsVUORDIL...4, DeverbalVerbsALIST...5,
DeverbalVerbsSUOTNJAL...4, DeverbalVerbsBOTNJAS...2,
DeverbalVerbsLASSAN...1, DeverbalVerbsCOASKKIT...4,
DeverbalVerbsARVIL...3, K...14, K-son...2, ENDLEX...1

Reading from 'sme/src/noun-sme-lex.txt'
AspellAffix...21, GuessNoun...1, NounSecond...9, NounRoot...
        - Warning:  Ignoring info strings.
10000...20000...
22682

Reading from 'sme/src/verb-sme-lex.txt'
Negativeverb...1, negmood...3, negind...9, negimp...9, negsup...9,
Copula...2, Finitecop...15, Prscop...10, Prtcop...11, Impcop...11,
Infinitecop...11, STRAYFORMS...1, VerbRoot...10000...14271

Reading from 'sme/src/adv-sme-lex.txt'
Adverb...3005, gadv...2, adv...1, adv-comp...1, adv-sup...1,
IHTTAS...10, DABBELAS...2, DABBELACCA-...11, COMPDIRADV...3

Reading from 'sme/src/closed-sme-lex.txt'
input in flex scanner failed
make: *** [sme/bin/sme.save] Error 2
gt$em sme/src/closed-sme-lex.txt.

Earlier fix: Specify right Xerox tools in makefile, and do make clean. The earlier fix doesn’t work now.

Hypothesis: closed-sme-lex.txt is broken. Problem: the file compiles on the other computers, so it is not easy to see what to look for in the closed file.
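One cheap way to test the hypothesis is to check whether the copy on victorio is byte-identical to a copy that compiles elsewhere. A minimal sketch, assuming nothing beyond read access to both copies; the function name is hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large lexicon files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digests of sme/src/closed-sme-lex.txt match on both machines, the file itself is unlikely to be the cause, and the tools or the environment on victorio are the next thing to compare.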

TODO:

Meetings and Marratech

Now that Tomi has moved to Helsinki, Maaren is back from her sick leave, and we are trying to get more people to the project, we are growing out of iChat’s 4-way video conferencing. We can still use audio-only conferencing, but unless SD has upgraded their firewall, we still won’t be able to include Maaren.

There are two choices: going back to the old phone conference calls (how stone age-ish!), or trying the Marratech solution provided by SD. I suggest we try the latter first. A new version of [Marratech | http://www.marratech.com/download/] is available, with improved performance.

TODO:

Task lists as iCal entries

TODO:

cd $FORREST_HOME
svn up -r430284

11. Next meeting, closing

Next meeting 11.9.2006 at 9:30.

Closed at 11:52.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond