Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Speller infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10:08.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Main secretary: Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general (Børre, Tomi, Trond, Saara):

For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups:

  1. For the users/linguists: Which corpora exist, and how do I use them? (This information is now scattered. Part of the how-to-use is documented in the catxml documentation. Which documents are found where, plus an overall documentation, has not been written, since the corpus is so sparsely populated.)
  2. For the collectors: How do I add texts, where do I add them, how do I convert them (this is (partly?) done in the Corpus Conversion document)

test:

Divvun.no down again

Tomcat runs out of memory from time to time. Børre will look into switching to Forrest-generated static HTML pages (forrest site), served from the standard Apache server. He will also look at using Forrestbot as the tool to update the site, instead of our homegrown script.

Update: Only one small change needed in our own script. Binary download section should be included.

4. Corpus gathering

Governmental documents (earlier in pdf, now in html)

Børre has gathered files from the Sámediggi. He will go on gathering files from Odin.

Contracts

Sjur had a meeting with Kimmo Koskenniemi, resolving all the issues that he had with the contracts. Trond and Sjur have also discussed them and changed them a little. The contracts are ready to be sent to the lawyer (who sadly is ill).

5. Corpus infrastructure

Quoting from the convert2xml.pl file:
    my $xsl_file = '';
    my $dir = '';
    my $log_dir = '';
    # [...]
    my $log_file = $log_dir . "/" . $file . ".log";
    open STDERR, '>>', "$log_file" or die "Can't redirect STDERR: $!";

The problem has been analysed and will be corrected (Tomi).

Updated task list:

  1. Set up file and directory permissions (today we all belong to the cvs group) so that only people with root user privileges have write access to the corpus repository, at least for the original files (Børre)
    1. Done
  2. Include the xsl files under version control (Børre, Tomi, Saara)
  3. Incorporate language detection as part of the corpus processing (Tomi)
  4. We need a way to deal with hyphenated documents (documents with (manually) inserted hyphenation marks) in catxml/preprocess. (Tomi, Børre; discussion in the newsgroup: Sjur, Trond, Saara)
    1. Discuss details in the newsgroup
    2. In normal cases, hyphenation points should be removed
    3. When testing the robustness of our parsers, as well as when testing the hyphenator, the hyphenation points should be retained
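
The two modes in task 4 could be sketched roughly as follows. This is only an illustration: the notes do not say which character marks a hyphenation point, so the soft hyphen (U+00AD) is an assumption, and the function name and its parameters are hypothetical, not part of catxml/preprocess.

```python
# Assumption: hyphenation points are marked with the soft hyphen U+00AD.
# The marker actually used in the corpus documents is not stated in the notes.
HYPHENATION_MARK = "\u00ad"


def handle_hyphenation(text: str, keep_marks: bool = False,
                       mark: str = HYPHENATION_MARK) -> str:
    """Remove hyphenation points from text, unless keep_marks is set."""
    if keep_marks:
        # Test mode (parser robustness, hyphenator testing): text is unchanged.
        return text
    # Normal mode: drop every hyphenation point.
    return text.replace(mark, "")


if __name__ == "__main__":
    sample = "hyphen\u00adation"
    print(handle_hyphenation(sample))                         # hyphenation
    print(handle_hyphenation(sample, keep_marks=True) == sample)  # True
```

Keeping the choice in a single flag means the same preprocessing step can serve both the normal corpus run and the test runs.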

Corpus dtd issue

To summarize (taken from Saara’s newsgroup posting of Fri, 11 Nov 2005):

<!ELEMENT collection (#PCDATA) >
<!ELEMENT metadata (complete|incomplete)>
<!ELEMENT complete EMPTY>
<!ELEMENT incomplete EMPTY>

<!ELEMENT wordcount (#PCDATA)>

Saara’s suggestion:

<!ELEMENT availability (free|license)>
<!ELEMENT free EMPTY>
<!ELEMENT license EMPTY>
<!ATTLIST license
	type (type1|type2|..) #REQUIRED
 >

Saara will update the dtd.
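
As a rough illustration of what the suggested availability declaration permits, the sketch below checks an element by hand with Python's standard library. The function name is hypothetical, the list of license types is left open in the suggestion and is therefore not checked, and a validating XML parser would normally do this work against the dtd itself.

```python
import xml.etree.ElementTree as ET


def availability_is_valid(availability: ET.Element) -> bool:
    """Return True if the element is <availability> with exactly one
    empty child: <free/> or <license type="..."/>."""
    if availability.tag != "availability":
        return False
    children = list(availability)
    if len(children) != 1:
        return False
    child = children[0]
    # Both alternatives are declared EMPTY: reject child elements
    # and any non-whitespace text.
    if len(child) > 0 or (child.text or "").strip():
        return False
    if child.tag == "free":
        return True
    if child.tag == "license":
        # The type attribute is #REQUIRED; its allowed values are not checked.
        return "type" in child.attrib
    return False


if __name__ == "__main__":
    licensed = ET.fromstring('<availability><license type="type1"/></availability>')
    both = ET.fromstring('<availability><free/><license type="type1"/></availability>')
    print(availability_is_valid(licensed))  # True
    print(availability_is_valid(both))      # False
```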

6. Linguistics

Name lexicon

Summary: see the newsgroup

The plan for this project was as follows: Two lines of work run in parallel:

  1. name markup
  2. testing of conversion

When these two tasks are done (at some point in the future), the conversion will be done.

Status quo on the two lines of work:

Name markup: some 400 entries remain to be marked up before conversion can start. People allocated to look at the rest: Maaren, Ilona, Trond, Børre. This week’s status is as follows (some 400 names not assigned):

 323 NYSTØ
  32 BERN
  20 LONDON
  18 MARJA
  17 NIILLAS
  12 ACCRA
   5 HEANDARAT
   4 ANAR
   2 ALEUHTAT

The technical issues (specified in earlier memos) are conducted by Tomi, Saara and Sjur. Sjur and Tomi will report back tomorrow, Tuesday, on a plan for using risten.no as editor for our name lexicon.

A very short example is found at common/src/proper-nouns.xml. Saara has made a conversion script which is ready to use. More discussion of the layout of the resulting xml file is needed.

Complex names

In the present lexicon, complex names are treated as a class of first parts (see below), and the last part is stored as a regular simplex name.

With the new XML lexicon format, the complex names should be restored. The present lexicon format can easily be reconstructed (by splitting at the space character), and the list of complex names can also be used for other purposes in the future.

Also, integration with risten.no and the Kven project (and through that, also Kartverket in one way or another) presupposes that we can store the complete and “true” name.

There are ~100 first parts of complex names. The name lexicon contained 739 such complex names before they were broken up on 2004/10/14.
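
The reconstruction mentioned above, splitting a complete name at the space character into first parts and a last part, might look roughly like this. The function name is hypothetical and the example names are only illustrative surface forms; nothing here reflects the actual conversion script.

```python
def split_complex_name(name: str) -> tuple[list[str], str]:
    """Split a complete name into (first parts, last part) at whitespace."""
    parts = name.split()
    if not parts:
        raise ValueError("empty name")
    # Everything before the final token is a first part;
    # the final token is the regular simplex name.
    return parts[:-1], parts[-1]


if __name__ == "__main__":
    print(split_complex_name("Badje Riebejohka"))
    print(split_complex_name("Riebejohka"))
```

A simplex name simply yields an empty list of first parts, so the same function can be run over the whole lexicon.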

First-part tags are now listed separately:

LEXICON ProperNounFirstPart
El% Baradej BERN-sur ;
 FirstTag ;
Badje FirstTag ;
Bajimus FirstTag ;
Bajit FirstTag ;
Bassi FirstTag ;

The format we left a year ago looked like this:

Aleksander% I%:a% suo0lu:Aleksander% I%:a% suollu SUOLU ;
Amerihká% Ovttastuvvan% Stáhtat:Amerihká% Ovttastuvvan% Stáhta ALEUHTAT ;
Amery% jiekn1arav0da:Amery% jiekn1arav'da DEATNU ;
Amundsena-Scotta% stas1uvdna DEATNU ;
Austrália% Álppat:Austrália% Ál'pa ALEUHTAT ;
Badje% Riebejoh0ka:Badje% Riebejoh'ka DEATNU ;
Badje% Stuorjoh0ka:Badje% Stuorjoh'ka DEATNU ;
Bajimus% Fielvuonjáv0ri:Bajimus% Fielvuonjáv'ri DEATNU ;
Bajimus% Molles1jáv0ri:Bajimus% Molles1jáv'ri DEATNU ;

They were broken up with the following rationale:

revision 1.127
date: 2004/10/14 09:38:17;  author: trond;  state: Exp;  lines: +4653 -5153
This is the great % removal revision. The background was that our
pre-composed multiword names, such as Davimus Borsejoh0ka, etc. did
not work. They passed the preprocessor only in the nominative, and not
in other cases. In the worst case, their parts were not recognised as
such, and the result would be a missing analysis. Now, the first part
has been assigned to a separate lexicon, ProperNounFirstPart, that get
the tag +N+Prop+Attr only. This lexicon contains entries like Davimus,
Guhkes, Helse, magnehtalas1, and other first parts of complex
names. These should be disambiguated in sme-dis.rle, leaving the tag
only when there is a N Prop following it. As a result of this, the
file bin/attr.txt is drastically reduced.

Task list for this issue:

Example of how the old lexicon can be used to identify complex name last parts that are also used as simple names:

$ grep Riebejo gt/sme/src/propernoun-sme-lex.txt
Badje% Riebejoh0ka:Badje% Riebejoh'ka DEATNU ;
Riebejoh0ka:Riebejoh'ka DEATNU ;
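
The same check can be run over all entries at once instead of one grep per name. The following is a sketch under assumptions: it handles only the simple entry shapes quoted in these notes (an optional lower side after an unescaped colon, spaces escaped as "% ", one continuation lexicon, a closing semicolon), and the function names are hypothetical. The sample entries are copied from the old-format listing above.

```python
import re
from typing import Dict, Iterable, List, Optional

SAMPLE = """\
Badje% Riebejoh0ka:Badje% Riebejoh'ka DEATNU ;
Riebejoh0ka:Riebejoh'ka DEATNU ;
Bajimus% Fielvuonjáv0ri:Bajimus% Fielvuonjáv'ri DEATNU ;
"""


def upper_side(line: str) -> Optional[str]:
    """Return the upper side of an entry with '% ' unescaped, or None."""
    line = line.strip()
    if not line.endswith(";"):
        return None  # not an entry line (e.g. a LEXICON header)
    body = line[:-1].strip()
    # Split on whitespace that is not escaped with '%'.
    tokens = re.split(r"(?<!%)\s+", body)
    if len(tokens) != 2:
        return None  # only 'entry CONTINUATION' shapes are handled here
    # The upper side ends at the first colon not escaped with '%'.
    upper = re.split(r"(?<!%):", tokens[0], maxsplit=1)[0]
    return upper.replace("% ", " ")


def last_parts_also_simple(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each last part that also occurs as a simple name
    to the complex names ending in it."""
    uppers = [u for u in map(upper_side, lines) if u]
    simple = {u for u in uppers if " " not in u}
    found: Dict[str, List[str]] = {}
    for name in uppers:
        if " " in name:
            last = name.rsplit(" ", 1)[1]
            if last in simple:
                found.setdefault(last, []).append(name)
    return found


if __name__ == "__main__":
    print(last_parts_also_simple(SAMPLE.splitlines()))
    # {'Riebejoh0ka': ['Badje Riebejoh0ka']}
```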

The details of the new XML format need to be further discussed in the newsgroup and integrated with the rest of the XML work and discussion.

North Sámi

Lule Sámi

Sjur, Thomas and Trond will continue with the Lule Sámi issues.

Tasks:

Numerals

The issue awaits closure of the propernames project, and is postponed to next week.

Árran meeting

Børre, Anne Britt and Sjur go to Árran on Wednesday, for meetings on Thursday. Main meeting is with Anders Kintel about using his Lule Sámi dictionary in our projects.

7. Speller infrastructure

Nothing this week either.

8. Other

Technical issues

XXE updates

Who has the latest XXE (3.0) and the latest forrest config?

Børre is updating the ones not yet up to speed.

Video conferencing across firewalls

The problem we’ve had with the SD firewall persists, and there don’t seem to be any resources available to help us. Geir Kaaby instead suggested that we try out the Marratech package. So please download the MacOS X client (or get it from me), and I’ll send you the URL to the meeting room as soon as I get it.

Bug fixing

24 open bugs (and 23 risten.no bugs)

Bugzilla update

When Bugzilla is being moved, it should also be updated to the newest version, and the UTF-8 bug should be resolved.

risten.no

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

21.11.2005 10:00

Closed at 11:47