Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Board meeting summary
  4. Documentation - divvun.no
  5. Corpus gathering
  6. Corpus infrastructure
  7. Linguistics
  8. Speller infrastructure
  9. Other issues
  10. Summary, task lists
  11. Closing

1. Opening, agenda review, participants

Opened at 10:15.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Main secretary: Trond

Agenda accepted with revisions.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Board meeting summary

The participants were satisfied with the progress of the Divvun project, and were impressed by the alpha version of the speller. The meeting discussed the delay caused by the work done within the risten.no project.

The meeting went through the criteria for participating as well as for selecting subcontractors, and also the list of deliveries.

The suggestion for a permanent maintenance organisation was accepted, and will be brought forward in order to get the financing in place. Work on securing such financing for 2007 and onward has to start now.

The proposed South Sámi project was accepted as well, and it was decided that the search for its financing will run in parallel with the search for financing for the permanent maintenance organisation.

4. Documentation

Documentation tasks:

  1. Add documentation on our corpus infrastructure and our corpus work in general (“To be done by the ones making the corpora”: Børre, Tomi, Trond, Saara).

  2. Now we have 4 documents:

    1. Correct corpus (disamb usage)
    2. Corpus plan (for the disamb corpus cwb)
    3. Corpus conversion, two versions, in infra and in ling. Tomi and Børre have done parallel work ;-(
    4. catxml

For the basic corpora, we need 3 types of documentation, or doc for 3 target groups:

  1. For the users/linguists:
    1. Which corpora exist, and how do I use them? (this info is now scattered)
  2. For the collectors:
    1. How do I add texts, where do I add them, how do I convert them (this is the Corpus conversion doc)
  3. For the programmers:
    1. What did I actually do? (this is partly the catxml doc)

For the work on the graphical user interface, we need documentation as well, in principle along the same lines, except that the user is not the same linguist as above.

  1. add/update Aspell documentation (Tomi)
    1. Some documentation has been written, but there is still work to be done.
  2. as always: document what you’re doing:-) (all)

5. Corpus gathering

Tasks:

Contracts

Tasks:

The most problematic issue:

Who has the copyright of extracted material, like single words, collections of words, syntactic structure (potentially with some words filled in)? We need this to be controlled by us, not by the authors. The exact borderline is hard to define.

North Sámi New Testament

If we don’t hear anything from Bibelselskapet, we will have to use the version we already have.

Lule Sámi Dictionary

We will invite Anders Kintel to a meeting in Tysfjord on Nov 17th, where we will discuss the possibilities and requirements for cooperation. The purpose of the overall meeting in Tysfjord is to prepare the municipality for being included in the legally defined Sámi-speaking area.

Bitte and Børre will participate, as well as Sjur, given that Anders Kintel is available.

6. Corpus infrastructure

Naming conventions and directory structure

Tasks:

  1. Set up a system of file and directory permissions (today we all belong to the cvs group) so that only people with root user privileges have write access to the corpus repository, at least for the original files
  2. Include the xsl files under version control (cvs? rcs?)
  3. Incorporate language detection as part of the corpus processing.
  4. the dir structure is:
    1. one dir for orig, containing also the meta-info and interm. files
    2. another dir for our ready-to-use xml files after conversion
  5. dir structure for web-posted corpus files:
    1. subdivision according to week or month, we start out with month till we see the amount of traffic (yyyy-mm)
  6. we need a way to deal with hyphenated documents in catxml/preprocess:
    1. in normal cases hyphenation points should be removed
    2. when testing the robustness of our parsers, as well as when testing the hyphenator, the hyphenation points should be retained
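
The two hyphenation cases above could be handled by a single switch in the preprocessing step. The following is only a minimal sketch: it assumes that hyphenation points are stored as Unicode soft hyphens (U+00AD), which the minutes do not state, and the function name `strip_hyphenation` is hypothetical and not part of catxml or preprocess.

```python
# Assumption: hyphenation points appear as soft hyphens (U+00AD) in the text.
SOFT_HYPHEN = "\u00ad"


def strip_hyphenation(text: str, keep_hyphenation: bool = False) -> str:
    """Return text for further processing.

    In the normal case the hyphenation points are removed. When testing
    parser robustness or the hyphenator, pass keep_hyphenation=True to
    leave the text untouched.
    """
    if keep_hyphenation:
        return text
    return text.replace(SOFT_HYPHEN, "")
```

With this design the default call gives the normal behaviour, and the test setups opt in explicitly to keeping the hyphenation points.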

Corpus conversion

Pdf to XML

Saara has made a new conversion module, which is almost finished. We’ll return to the issue, its evaluation, etc. at the next meeting.

Task: Saara to prepare for this.

HTML to XML

Tomi has been looking at this, and is making an xsl script for it. The web form developed by Tomi should be extended to allow posting of URLs as well as documents from the local file system.

The URL posting needs to check whether the same URL has been posted before, and if so, whether the page has changed.
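
One possible way to do this check is to store a digest of the page content for every posted URL and compare digests on each new posting. The sketch below is an illustration only: the class `UrlRegistry` and its in-memory dictionary are hypothetical, and a real web form would need persistent storage and would have to fetch the page itself.

```python
import hashlib


class UrlRegistry:
    """Remember a content digest per URL (hypothetical helper, in memory only)."""

    def __init__(self) -> None:
        self._digests: dict[str, str] = {}

    def check(self, url: str, content: bytes) -> str:
        """Classify a posting as 'new', 'unchanged' or 'changed' and record it."""
        digest = hashlib.sha256(content).hexdigest()
        previous = self._digests.get(url)
        self._digests[url] = digest
        if previous is None:
            return "new"
        if previous == digest:
            return "unchanged"
        return "changed"
```

Only postings classified as new or changed would then need to go through the conversion again.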

XHTML to XML

Tomi has been looking at this as well.

Task: Tomi and Saara to present the status quo and suggest routines, merger, etc. at the next meeting.

7. Linguistics

Name lexicon

Summary: see the newsgroup

Motivation:

Needed: A plan for this project:

  a. do the main markup in the present propernoun file
  b. make a script for converting it to xml (to be done one time)
  c. make a script for xml2lexc (to be done by the makefile)
  d. make the tags etc. in the parser
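
Step b of this plan could look roughly like the sketch below. Everything about the formats is an assumption made for illustration: the input is taken to be one entry per line of the form `Lemma LEXICON-tag ;` (the tag being optional), and the XML element and attribute names (`lexicon`, `entry`, `lemma`, `infl`, `type`) are hypothetical, since the minutes do not define the target format.

```python
import re
import xml.etree.ElementTree as ET

# Assumed entry shape: "<lemma> <LEXICON>[-<tag>] ;", e.g. "Oslo LONDON-plc ;"
ENTRY_RE = re.compile(r"^(\S+) ([A-Z-]*[A-Z])(?:-([a-z]+))? ;$")


def lexicon_to_xml(lines):
    """Build a hypothetical XML tree from marked-up propernoun entries."""
    root = ET.Element("lexicon")
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        if match is None:
            continue  # skip comments, lexicon headers and anything unexpected
        lemma, infl, tag = match.groups()
        # Entries without markup get the placeholder type "none".
        ET.SubElement(root, "entry", lemma=lemma, infl=infl, type=tag or "none")
    return root
```

Calling `ET.tostring(lexicon_to_xml(lines), encoding="unicode")` would then serialise the tree for writing to a file.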

Conversion:

  1. This week
    1. clean up the present infl. lexicons (merge BLIND and BERN, VUOLAB and LONDON) - Trond
    2. Make an emacs mode for markup (Saara). Options: fem, mal, sur, plc, org, obj, none. Combinations: surplc
  2. (end of this week and) Next week:
    1. Mark up as much as possible within a week or so (Maaren to do the Sámi names, and to split CNAME into BERN and LONDON, Trond and Børre to look at the rest)
    2. Then convert to xml
    3. Then mark up the rest with correct semantic tags
    4. This means we would need a seventh option, the unspecified name.
    5. Look into efficient editing of the XML lexicon
    6. Look into synchronisation issues with risten.no - we want the names there as well

Status quo: Entries: 35200 Converted: 10116 Time used: 5 h
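
As a rough aid for planning, the remaining effort can be extrapolated from these figures. The calculation below assumes that the remaining entries can be handled at the same average rate as the first 10116, which may well not hold.

```python
entries_total = 35200
entries_converted = 10116
hours_used = 5

# Average rate so far, in entries per hour.
rate = entries_converted / hours_used

# Entries left, and the time they would take at the same rate (an assumption).
entries_remaining = entries_total - entries_converted
hours_remaining = entries_remaining / rate

print(f"{rate:.0f} entries/h, {entries_remaining} entries left, "
      f"about {hours_remaining:.1f} h remaining")
```

Under that assumption roughly 12.4 more hours of markup work would be needed.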

Needed tools: An emacs mode doing this (Saara):

  1. Go to next “ NAME ;” (where NAME is a string of symbols “A-Z-”)
  2. Wait for input, one of these: m f s p o b
  3. Replace “ NAME ;” with “ NAME-mal ;”, “ NAME-fem ;” etc. and go to next “ NAME ;”
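
The logic of these three steps can be sketched outside emacs as well. The Python function below is not the planned emacs mode, only an illustration of the search-and-replace behaviour, with the keystrokes supplied as a string instead of being read interactively. The mapping from keys to tags is an assumption (in particular `o` for org and `b` for obj), and the name `tag_names` is hypothetical.

```python
import re

# Assumed mapping from the input keys to the markup tags.
KEY_TO_TAG = {"m": "mal", "f": "fem", "s": "sur", "p": "plc", "o": "org", "b": "obj"}

# " NAME ;" where NAME consists of the symbols A-Z and "-". Entries that are
# already tagged (e.g. " BERN-fem ;") contain lowercase letters and do not match.
NAME_RE = re.compile(r" ([A-Z-]+) ;")


def tag_names(text: str, keys: str) -> str:
    """Tag successive untagged " NAME ;" occurrences, one key per occurrence."""
    key_iter = iter(keys)

    def replace(match: re.Match) -> str:
        key = next(key_iter, None)
        if key is None:
            return match.group(0)  # no more input: leave the entry as it is
        return f" {match.group(1)}-{KEY_TO_TAG[key]} ;"

    return NAME_RE.sub(replace, text)
```

An unknown key raises a `KeyError` here. Combined options such as surplc are not covered by this sketch.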

Possible refinement: Encode for combined options (both plc and sur, e.g.) already in this phase.

Twol SETS definition issue

The definition of G1, G2, G3 in Lule Sámi is still open, and we would like to have input on this issue.

SUGGESTION (Trond): We have a separate meeting, e.g. Thomas, Trond, Sjur on this issue, on Tuesday.

North Sámi

Lule Sámi

Lule Sámi issues will be discussed at the Tuesday meeting between Sjur, Thomas and Trond.

Numerals

  1. An empirical overview
    1. Numeral generation
    2. Numeral inflection
    3. Numerals as parts of compounds
  2. A clear concept of how we want to treat them
    1. Tagging
  3. A treatment

We will return to this issue after the name conversion.

8. Speller infrastructure

Nothing this week.

9. Other

Technical issues

Bug fixing

13 open bugs (and 24 risten.no bugs)

Buying

risten.no

10. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

11. Next meeting, closing

17.10.2005 10:00

Closed at 12:25