Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 10:07.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

The Jetty (forrest run) version of our site is not reliable, so we are still using the static version on our main/public site.

TODO:

4. Corpus gathering

We finally received the Áššu and Min Áigi files from 1995 to 2001. Many of the Áššu files are in QuarkXPress and WriteNow formats. QuarkXPress files should be readable by InDesign (which we have), and WriteNow files should be readable by many Mac applications. All files can be expected to be encoded in MacSámi.

No newer files have been received from Áššu. Børre should contact them again.

To convert WriteNow files (WriteNow is a long-dead MacOS Classic application), we need the commercial software package [MacLinkPlus | http://www.dataviz.com/products/maclinkplus/index.html] from [DataViz | http://www.dataviz.com/]. It should be available from MacOffice in Tromsø, or it can be bought online from their home page (USD 79.99).

More South Sámi files are expected soon.

TODO:

5. Corpus infrastructure

More texts to the graphical corpus interface:

Lars suggested changes (<p> only policy) to the corpus DTD (in news). Good suggestions, but we will consider waiting till next year (and new project money) to see whether we’ll follow through.

TODO:

Aligner

The latest cvs version of our copy of the Aligner can be run from the command line, thanks to work by Børre.

The exact instructions on how to compile and run the program are in the README.txt file. Basic usage is as follows:

tca2 -a anchorfile samifile otherlangfile
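
For running the aligner over several file pairs, a minimal Python sketch follows. It assumes only the usage line above; the function names and the example file names are hypothetical, and README.txt remains the authority on the actual options.

```python
def tca2_command(anchor_file, sami_file, other_file, program="tca2"):
    """Build the argument list for one aligner run, mirroring the usage line."""
    return [program, "-a", str(anchor_file), str(sami_file), str(other_file)]


def batch_commands(anchor_file, pairs):
    """Build one command per (samifile, otherlangfile) pair, sharing one anchor file."""
    return [tca2_command(anchor_file, sami, other) for sami, other in pairs]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    pairs = [("doc1.sme.txt", "doc1.nob.txt"), ("doc2.sme.txt", "doc2.nob.txt")]
    for cmd in batch_commands("anchor.txt", pairs):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run once tca2 is compiled and on the path.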

TODO:

Language recognition

TODO:

6. Infrastructure

Xerox tools wrapped as servers

The PLX format for clitics has been checked with Polderland, and we should be able to do fine without them. Thus, we generate all paradigms without clitics.

TODO:

Hyphenator

The list of unrecognised words is on victorio in /tmp/hyph-errors.txt; the words to check are in the same directory, in the file maaren-liste.txt.gz.

Typical problem words/reports (in hyph-errors.txt) are:

No forms for šveicalažžanat
No forms for šveicalažžaneamet
No forms for šveicalažžaneame
No forms for šveicalažžanan
No forms for šveicalažžanis
No forms for šveicalažžaneaset
....
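
To get an overview of such a report, the words can be pulled out of the "No forms for" lines. The following is a minimal Python sketch, assuming every report line has exactly the shape shown above; the function name is hypothetical, and other line types in hyph-errors.txt are simply ignored.

```python
import re

# Matches report lines of the form "No forms for <word>".
NO_FORMS = re.compile(r"^No forms for (\S+)$")


def unrecognised_words(lines):
    """Return the words from all 'No forms for ...' lines, in order of appearance."""
    words = []
    for line in lines:
        match = NO_FORMS.match(line.strip())
        if match:
            words.append(match.group(1))
    return words


if __name__ == "__main__":
    sample = [
        "No forms for šveicalažžanat",
        "No forms for šveicalažžaneamet",
        "....",
    ]
    for word in unrecognised_words(sample):
        print(word)
```

Applied to the real file, the lines would come from open("/tmp/hyph-errors.txt", encoding="utf-8"); the UTF-8 encoding is an assumption.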

TODO:

7. Linguistics

Names and multilinguality

TODO:

  1. finish first version of the editing (Sjur)
    1. working on it
  2. add @type=secondary and @excl=speller,hyph to all names marked with !SUB (Saara)
    1. done
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. start to use the xml file as source file
  7. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  8. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  9. publish the name lexicon on risten.no (Sjur)
  10. add missing parallel names for placenames (linguists)
  11. add informative links between first names like Niillas and Nils (linguists)

Derivation and spellers like Aspell

TODO:

North Sámi

The following words are included in the normative list despite being marked with !SUB:

accompagnerejun
ábuhuvvože
ábuhuvvože      ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in [the meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].

TODO:

Postponed:

9. Spellers

Polderland data generation

All nouns are now generated in PLX, but not yet the other POSes. The closed POSes will also be generated the same way, which means we need paradigm grammars for them as well. In the simplest case that is just one feature/tag combination.

It might also be possible to ask lookup to print out the whole paradigm. This should be investigated.

We need to close the compound specification discussion. We’ll have a meeting for it later in the week with Thomas, Trond and Sjur.

TODO:

Aspell

In preparation for the Aspell work, Børre has sent an e-mail to the Aspell developer e-mail list. Here is the most interesting part of the thread:

> What features in hunspell would you specifically like to have in aspell?

Possibly:

- Max. 65535 affix classes and twofold affix stripping

- Handling conditional affixes, circumfixes, fogemorphemes, forbidden
   words, pseudoroots and homonyms.

- Support complex compoundings

I believe some of these will benefit you.

However I only want to implement them if these is a clear benefit to it.
For example based on what several people have told be complex compounding
rules are not worth it.

Aspell is far more complex then Myspell and each feature needs to
implemented carefully so that it will behave sensibly with the
suggestion code.  Also it is important that the addition of the
feature won't degrade performance, Especially when the feature isn't used.

The whole thread is available [here | http://lists.gnu.org/archive/html/aspell-devel/2006-11/msg00031.html].

TODO when the major part of the PLX conversion is done:

Testing

When the PLX-based speller is ready, use the generated word list as test input: all words should be accepted (coverage self-testing). Then pick a random 1% of the words, change each one randomly at edit distance 1, and run the result through the speller to test for false positives.
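
The two tests could be sketched in Python roughly as follows. This is only an outline under stated assumptions: the speller is represented by a hypothetical accepts(word) function, the alphabet is supplied by the caller, and a mutated word that happens to be in the generated list is skipped because it is a correct word.

```python
import random


def coverage_failures(words, accepts):
    """Generated words that the speller rejects; the list should be empty."""
    return [w for w in words if not accepts(w)]


def mutate(word, alphabet, rng):
    """Return a variant of a non-empty word at edit distance 1."""
    ops = ["substitute", "insert"]
    if len(word) > 1:
        ops.append("delete")  # never delete the only character
    op = rng.choice(ops)
    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(alphabet) + word[i:]
    i = rng.randrange(len(word))
    if op == "delete":
        return word[:i] + word[i + 1:]
    # Substitute with a character different from the original one.
    replacement = rng.choice([c for c in alphabet if c != word[i]])
    return word[:i] + replacement + word[i + 1:]


def false_positives(words, accepts, alphabet, fraction=0.01, seed=0):
    """Mutate a random sample of the words; return mutations the speller accepts."""
    rng = random.Random(seed)
    known = set(words)
    candidates = sorted(known)
    size = min(len(candidates), max(1, round(len(candidates) * fraction)))
    sample = rng.sample(candidates, size)
    mutated = [mutate(w, alphabet, rng) for w in sample]
    # Skip mutations that are themselves in the generated list.
    return [m for m in mutated if m not in known and accepts(m)]
```

The fixed seed makes a run repeatable, so the same sample can be rerun after a speller fix.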

We need a meeting to plan testing. We’ll have a short one this week, and perhaps a longer meeting in Alta.

TODO:

10. Other

Corpus contracts

TODO:

Bug fixing

57 open Divvun/Disamb bugs, and 24 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Task lists as iCal entries

TODO:

Employee seminar in Alta

TODO:

11. Next meeting, closing

The next meeting is 4.12.2006, 09:30 Norwegian time.

The meeting was closed at 11:27.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond