Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Name lexicon infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 09:45.

Present: Børre, Maaren, Saara, Sjur, Trond

Absent: Thomas, Tomi (sick leave until Jan. 17th)

Main secretary: Trond

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Sick leave.

Tomi

Sick leave.

Trond

3. Documentation

Reviews

Add documentation on our corpus infrastructure and our corpus work in general (Børre, Tomi, Trond). For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups:

  1. For the users/linguists: which corpora exist, and how do I use them? This information is now scattered. Part of the how-to-use information is documented in the catxml documentation; which documents are found where, and an overall documentation, have not been written, since the corpus is so sparsely populated.
    1. The catxml documentation is done, which is mostly what is needed. Do we need more?
    2. Review: Thomas, Maaren, Linda, Ilona, Trond

Saara will update the user documentation, and add new if necessary. We will do the review as part of the meeting next week.

Findings so far: Basic text presentation works fine, but the text-type options do not.

4. Corpus gathering

Contracts

Ready - start using them!

Collecting

A list of people/organisations/companies to contact can be found in an [old meeting memo|/admin/weekly/2005/Meeting_2005-09-05.html#5.+Corpus+gathering]. Based on it, here’s an updated list:

  1. Anders Kintel (Børre)
  2. Newspaper text:
    1. Sámi Instituhtta’s (for the old archive of Min Áigi and Áššu) (Børre)
    2. Áššu has been making a CD since the end of May, so there should be a pile there. Børre suggests that they send us the CDs they have, so that we can look at them, ensure that the routines work, and check that we are able to utilize their format. (Børre)
    3. Min Áigi (Børre)
  3. Commercially published texts
    1. Iđut and key authors there (Børre)
    2. Davvi Girji and key authors there (Børre)
    3. Author organisations’ meetings (Børre)
    4. Key authors one by one
      1. (list of author names) Kerttu Vuolab, Kirsi Paltto, …

List of texts with lower priority (to be gathered when the above list is more or less fixed)

TODO: a lot for Børre!

Odin

We want both parallel and erroneous (unproofed) text files. What we need is direct contact with Odin, in order to get as good coverage as possible.

TODO: Trond, and then Børre to call Ove Sæth to re-establish contact.

Bible texts

We have received Norwegian texts and a contract draft from Bibelselskapet, in essence requiring a separate contract for internet use of extracts of the text. The texts are in the paratext format, an international standard for the bible in electronic form. Most translations are available in this format.

TODO: Trond will accept the contract as is, and then negotiate a separate contract with them for use with the online, searchable parallel corpus when it is ready.

5. Corpus infrastructure

Task list:

  1. Include the xsl files under version control
    1. RCS version control is almost finished, but an issue with access control is still open. Discussed a bit in the meeting, but nothing conclusive. We’ll continue the discussion in the newsgroup.
  2. Incorporate language detection as part of the corpus processing (Saara)
    1. Almost finished. Needs an improved Finnish language model: at present it isn’t able to distinguish Finnish from Sámi (proving the family bonds :-)
  3. We need to review whether automatic hyphen detection alone is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on.
    1. Acceptable results: 90% of all real hyphens correctly tagged.
  4. CGI-admin script to add xsl-file to a corpus file that doesn’t have one (Tomi)
    1. Saara will review the existing code, consult Tomi, and try to make a script to help Børre utilize the template and the infra in place.
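
The minutes do not say which method the language detector uses. As an illustration only, here is a minimal Python sketch of one common approach, comparing character trigram profiles. The function names, the `MODELS` table and the tiny training snippets are all hypothetical and not taken from the project code.

```python
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams in lower-cased, space-padded text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess_language(text: str, models: dict) -> str:
    """Return the label of the model whose profile is closest to the text."""
    profile = trigram_profile(text)
    return max(models, key=lambda lang: similarity(profile, models[lang]))


# Toy training snippets (hypothetical; a usable model needs far more text).
MODELS = {
    "sme": trigram_profile("mun lean sápmelaš ja mun orun Romssas"),
    "fin": trigram_profile("minä olen suomalainen ja minä asun Helsingissä"),
}

print(guess_language("mun orun Romssas", MODELS))
print(guess_language("minä asun Helsingissä", MODELS))
```

Closely related languages share many trigrams, so in an approach like this the result depends heavily on how much and what kind of text each model was built from.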

We will have a major review of all these things next week.
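
The acceptance criterion for hyphen detection (90% of all real hyphens correctly tagged) reads as a recall measure. A minimal Python sketch of how it could be checked, assuming hyphens are identified by character offsets in a manually checked sample; the function name and the sample offsets are hypothetical.

```python
def hyphen_recall(gold: set, predicted: set) -> float:
    """Share of real hyphens (gold offsets) that the detector also tagged."""
    if not gold:
        raise ValueError("no real hyphens to evaluate against")
    return len(gold & predicted) / len(gold)


ACCEPTANCE_THRESHOLD = 0.90  # from the minutes: 90% of real hyphens tagged

# Hypothetical offsets: ten real hyphens, nine of them found, one false hit.
gold = {12, 40, 57, 88, 103, 140, 171, 202, 230, 251}
predicted = {12, 40, 57, 88, 103, 140, 171, 202, 230, 300}

recall = hyphen_recall(gold, predicted)
print(f"recall = {recall:.2f}, acceptable = {recall >= ACCEPTANCE_THRESHOLD}")
```

Note that a criterion phrased this way says nothing about false positives (the offset 300 above does not lower the score); whether those matter is a separate question for the review.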

6. Linguistics

North Sámi

Lule Sámi

Open tasks:

Numerals

The following North Sámi linguistic issues should be settled before going into the numeral project:

  1. Three-part compounds
  2. Diphthong simplification
  3. Derivation

These issues were recently completed in Lule Sámi, and it is more efficient to complete them in North Sámi directly afterwards than to begin a new topic.

Numeral treatment is at different levels in the existing sme and smj parsers, but the issue itself is common to the two languages, and should therefore be treated in parallel.

Numerals in North Sámi: Inventory is listed elsewhere.

Numerals in Lule Sámi: There are 70 lines of code setting up the structure for case inflection of basic numerals.

7. Name lexicon infrastructure

Complex names

Task list for this issue:

XML format

We had our meeting, and the result was pretty close to the structure of the existing risten.no term base. Now the conversion script to xml needs to be updated. The new format is documented in the newsgroup.

Tasks:

  1. make a test lexicon for evaluating the format, set up the editing, and test it (Saara)
  2. update conversion from lexc to xml to reflect new xml format (Saara)
    1. mostly done, some open questions left
  3. testing of conversion
  4. continue the discussion of the name lexicon format (Saara, Tomi, Sjur, Trond)
  5. implement a prototype in eXist
  6. eXist as editor:
    1. develop the needed XQueries and interface
    2. synchronisation between risten.no and
    3. test whether eXist as editor is actually working well
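
The new XML format is documented in the newsgroup and is not reproduced in these minutes, so the following Python sketch only illustrates the general shape of a lexc-to-XML conversion. The element names (`lexicon`, `entry`, `lemma`, `stem`, `contlex`) and the sample entries are hypothetical, and lexc escapes (`%`) and multi-word entries are not handled.

```python
import xml.etree.ElementTree as ET


def parse_lexc_entry(line: str):
    """Split one lexc entry line into (lemma, stem, continuation lexicon)."""
    body = line.split("!", 1)[0].strip()  # "!" starts a lexc comment
    if not body.endswith(";"):
        return None
    parts = body[:-1].split()
    if len(parts) != 2:
        return None
    form, contlex = parts
    lemma, _, stem = form.partition(":")
    return lemma, stem or lemma, contlex


def entries_to_xml(lines) -> ET.Element:
    """Build an XML tree with one <entry> per parsable lexc line."""
    root = ET.Element("lexicon")  # hypothetical element names throughout
    for line in lines:
        parsed = parse_lexc_entry(line)
        if parsed is None:
            continue
        lemma, stem, contlex = parsed
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "lemma").text = lemma
        ET.SubElement(entry, "stem").text = stem
        ET.SubElement(entry, "contlex").text = contlex
    return root


# Hypothetical sample lines; the continuation lexicon names are invented.
sample = ["Romsa:Roms PLACE ;", "! a comment line", "Sjur MALE ;"]
root = entries_to_xml(sample)
print(ET.tostring(root, encoding="unicode"))
```

Keeping the parsing step separate from the tree building makes it easier to test the conversion (task 3) on a small test lexicon before running it on the full name lexicon.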

8. Other

SGL Seminar

Divvun/Disamb Seminar in Tromsø

Maaren is able to attend Monday morning and Tuesday (all day). Sjur will probably arrive in Tromsø at lunchtime on Monday (7.25 from HKI, TOS at 11.00). Trond will check with Linda and Ilona, aiming for a meeting start after lunch on Monday. Saara will stay one day less; remember to adjust the schedule accordingly.

Practical arrangements:

Suggested content for project meeting:

Teaching sessions (list of free thoughts and personal frustrations). There must be one teacher and at least two pupils before we run a course.

Final schedule to be worked out by Trond and Sjur (Tuesday 10 AM?). The above plan will be copied to a separate document which will become the starting point for further work on the seminar planning.

Technical issues

   preprocess file_name.txt OK
   cat file_name.txt | preprocess !!
   catxml file_name.xml | preprocess ??

After some heavy investigation, we got no further. There is no difference between bash versions (2.0x, 3.0x), nor between locales as long as they are UTF-8, nor does it matter whether the locale is stored in LANG or LC_ALL.

One new insight, though: both cat and print give the same errors, indicating that the error is NOT strictly related to cat. Is it after all a conflict between the locale support in OS X and perl?
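
The cause is left open here. One common source of file-versus-pipe differences in general is that the two input paths end up being decoded differently; whether that applies to preprocess has not been established. The Python sketch below (not the Perl code in question) only shows how the same UTF-8 bytes tokenize differently depending on the decoding applied. The sample string and variable names are hypothetical.

```python
import re

WORD = re.compile(r"\w+")

# The bytes as they would be stored in a file or sent through a pipe.
raw = "Sámi áigi".encode("utf-8")

# Path 1: the bytes are decoded as UTF-8 before tokenizing.
as_utf8 = raw.decode("utf-8")

# Path 2: each byte is read as one single-byte (Latin-1) character.
as_latin1 = raw.decode("latin-1")

tokens_utf8 = WORD.findall(as_utf8)
tokens_latin1 = WORD.findall(as_latin1)

print(tokens_utf8)
print(tokens_latin1)
```

In the second path each non-ASCII letter becomes two characters, one of which is not a word character, so words are split in the middle. Comparing the bytes that actually reach the script in the three invocations above would show whether anything of this kind is involved.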

Bug fixing

29 open bugs (and 25 risten.no bugs)

C implementation of preprocess.pl

Do we want to have a C/C++ implementation for speed reasons? Is it going to be included in the spellers?

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

30.01.2006 09:30

Closed at 11:43