Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. The Árran journey
  4. Documentation - divvun.no
  5. Corpus gathering
  6. Corpus infrastructure
  7. Linguistics
  8. Speller infrastructure
  9. Other issues
  10. Summary, task lists
  11. Closing

1. Opening, agenda review, participants

Opened at 10:12.

Present: Børre, Saara, Sjur, Thomas, Trond

Absent: Maaren, Tomi

Main secretary: Sjur

Agenda accepted with Árran as an additional point.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Árran trip

The fifth Sámi Conference

Killer Whale Safari

A wonderful (put in your favourite travel noun here)!

Meeting with Anders Kintel

He is using FileMaker Pro, with two fields in his database: the Sámi word in one field, and the rest of the lemma article in the other.

We have asked for the first field only, and we will put it in the corpus repository as is, and use it as a regular corpus text (not for disambiguation, though).

Meeting with Bård Eriksen, publisher from Báhko

Very positive. Børre will get back to him around the middle of December. Báhko is publishing everything from Árran (the center).

Presentation

Sjur held a 15-20 min presentation of the Divvun project, and a short speller demo. The demo didn’t go that well, due to the test document. Conclusion: we need to have a pre-made test document, to be able to properly test and demonstrate the speller.

4. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general (Børre, Tomi, Trond, Saara):

For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups:

  1. For the users/linguists: Which corpora exist, and how do I use them? This information is now scattered. Part of the how-to-use information is documented in the catxml documentation. Documentation of which documents are found where, as well as an overall documentation, has not been written, since the corpus is so sparsely populated.
  2. For the collectors: How do I add texts, where do I add them, how do I convert them (this is (partly?) done in the Corpus Conversion document)


Divvun.no down again

Tomcat runs out of memory from time to time. Børre will look into having Forrest generate static HTML pages (forrest site) and serving those from the standard Apache server. He will also look at using Forrestbot to update the site, instead of our homegrown script.

Update: Only one small change needed in our own script. Binary download section should be included.

5. Corpus gathering

Governmental documents (earlier in pdf, now in html)

Børre has gathered files from the Sámediggi. He will go on gathering files from Odin.

Contracts

Update: All SD versions are now synchronised with the templates. Trond met with the lawyer, and she commented on the updated versions. Trond will soon update the contracts.

6. Corpus infrastructure

Updated task list:

  1. Include the xsl files under version control (Børre, Tomi, Saara)
  2. Incorporate language detection as part of the corpus processing (Tomi)
  3. We need a way to deal with hyphenated documents (documents with manually inserted hyphenation marks) in catxml/preprocess. (Tomi, Børre; discussion in the newsgroup: Sjur, Trond, Saara)
    1. Discuss details in the newsgroup
    2. In normal cases hyphenation points should be removed
    3. When testing the robustness of our parsers, as well as when testing the hyphenator, the hyphenation points should be retained:
      1. This is true for examples like "eala-hus". They should be converted to something like "ealahus", in order to both keep the hyphenation information when needed, and get it out of the way when not.
      2. In cases of truncated compounds like "ealahus- ja ...", we want the hyphen to stay untouched, and be part of the linguistic processing.
      3. There are sporadically text books with explicit hyphenation points, like: ea-la-hus. In these documents, all hyphens, without exception, should be converted to .
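
A minimal sketch of the three hyphenation cases listed above. It assumes the soft hyphen (U+00AD) is used as the hyphenation marker; the notes do not name the character, so this is only an assumption. The function names are hypothetical and are not part of catxml/preprocess.

```python
import re

# Assumption: the soft hyphen is used as the invisible hyphenation marker.
SOFT_HYPHEN = "\u00ad"

# A hyphen with a word character directly on both sides, as in "eala-hus".
# A trailing hyphen as in "ealahus- ja" is followed by a space and does not match.
_INTERNAL_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def mark_hyphenation(text, all_hyphens=False):
    """Replace manual hyphenation points with the marker.

    With all_hyphens=True (documents with explicit hyphenation points
    throughout), every hyphen is converted without exception.
    """
    if all_hyphens:
        return text.replace("-", SOFT_HYPHEN)
    return _INTERNAL_HYPHEN.sub(SOFT_HYPHEN, text)


def strip_hyphenation(text):
    """Remove the markers for normal processing."""
    return text.replace(SOFT_HYPHEN, "")
```

Normal corpus processing would call strip_hyphenation on the marked text, while parser robustness tests and hyphenator tests would use the marked text directly. Note that the sketch cannot tell a hyphenation point from a genuine word-internal hyphen; that would need a lexicon or a per-document flag.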

7. Linguistics

Name lexicon

Summary: see the newsgroup

The plan for this project was as follows: Two lines of work run in parallel:

  1. name markup
  2. testing of conversion
  3. eXist as editor:
    1. develop the needed XQueries and interface
    2. synchronisation between risten.no and
    3. test whether eXist as editor is actually working well

I updated the file gt/common/src/proper-nouns.xml with different formats for printing the namelex in the xml. (The line wrap makes it difficult to present them here.) These were among the default ones in the xmltwig package. Other formats are also possible, e.g. having the whole entry on one line, but I found that difficult to read. My favorite is the record_c format, where each <form> is on its own line.

When these two tasks are done (at some point in the future), the conversion will be done.

Status quo on the two lines of work:

Markup of the remaining 400 entries, to be done before conversion starts (people allocated to look at the rest: Maaren, Ilona, Trond, Børre). This week’s status is as follows (exactly 100 names not assigned):

  31 BERN
  19 LONDON
  16 NIILLAS
  15 MARJA
  11 ACCRA
   4 HEANDARAT
   3 ANAR
   1 ALEUHTAT

The technical issues are specified in earlier memos. Conducted by: Tomi, Saara, Sjur. Sjur and Tomi will report back tomorrow (Tuesday) on a plan for using risten.no as editor for our name lexicon.

A very short example is found at common/src/proper-nouns.xml. Saara has made a conversion script which is ready to use. More discussion on the layout of the resulting xml file is needed.

Complex names

Task list for this issue:

The file proper-complex.xml has been added to gt/common/src. NOTE! It is not an xml file, but simply a lexc file taken from the 1.126 version of propernouns-sme-lex.txt and converted to utf-8.

The details of the new XML format need to be further discussed in the newsgroup and integrated with the rest of the XML work and discussion.

North Sámi

Lule Sámi

Sjur, Thomas and Trond will continue with Lule Sámi issues.

Tasks:

Numerals

The issue awaits closure of the propernames project, and is postponed to next week.

8. Speller infrastructure

Nothing this week either.

9. Other

Technical issues

XXE updates

Who has the latest XXE (3.0) and the latest forrest config?

Børre is updating the ones not yet up to speed.

Video conferencing across firewalls

The problem we’ve had with the SD firewall persists, and there don’t seem to be any resources available to help us. Geir Kaaby instead suggested that we try out the Marratech package. So please download the MacOS X client (or get it from me), and I’ll send you the URL to the meeting room as soon as I get it.

Bug fixing

24 open bugs (and 24 risten.no bugs)

Bugzilla update

When Bugzilla is being moved, it should also be updated to the newest version, and the UTF-8 bug should be resolved.

risten.no

AppleCare extended warranty

All Divvun computers (PowerBook G4s) have received an extended warranty to the end of the project period. The warranty product (AppleCare) needs to be registered before it is effective. Please do that as soon as possible when you receive the package, and NO later than 24.11. You should also include your wireless keyboard as part of the registration (you register all Apple hardware covered by AppleCare, which is all equipment bought at the same time: computer and keyboard).

10. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

11. Next meeting, closing

21.11.2005 09:30

Closed at 12:31