Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Name lexicon infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 09:47.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Main secretary: Tomi

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general (Børre, Tomi, Trond, Saara). For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups:

  1. For the users/linguists: Which corpora exist, and how do I use them? (This info is now scattered.) Part of the HOWTO USE is documented in the catxml docu (what documents are found where, etc.); an overall documentation has not been written, since the corpus is so sparsely populated.
    1. catxml done, which is what is needed mostly. Do we need more?
    2. we need a review: Thomas, Maaren, Linda, Ilona, Trond
  2. For the collectors: How do I add texts, where do I add them, how do I convert them (this is (partly?) done in the Corpus Conversion document)
    1. we need a review of the web interface for corpus uploading - what is still missing?
      1. Review: Sjur, Saara, Trond, Thomas

Review setup and reporting to be posted to the newsgroup, possibly a summary in our documentation.

Saara has made a review template, final version ready today.

Deadline for reviews: by next meeting.


Tomcat->static HTML progress

Almost done. Thor Øivind is back now, and will help with the last URL fixes.

Deadline: in operation by next meeting.

4. Corpus gathering

Contracts

Next step:

  1. Make our versions of the updated Helsinki contracts, and make sure they match our intentions. (Sjur and Trond)
  2. Send them to the SD lawyer and to the University lawyer through formal channels. (Sjur and Trond)

Contract 1 should have the main priority (contract 2 for Trond).

Deadline: This should be finished this week.

Collecting

Børre has completed the HTML version of the Lule Sámi New Testament.

Trond has talked to the editor of Odin; we will get in direct contact with him about routines for text delivery.

5. Corpus infrastructure

Updated task list:

  1. Include the xsl files under version control
    1. RCS to be used, Saara to include it in the (upload) processing of new corpus files, as well as documentation (Børre to review)
  2. Incorporate language detection as part of the corpus processing (Saara)
    1. The tool needs better training material.
  3. We need a way to deal with hyphenated documents (documents with (manually) inserted hyphenation marks) in catxml/preprocess.
    1. Saara has made a hyphen detection script that tries to discriminate between real hyphenation and other uses of hyphens (recall that only real hyphenation should be tagged).

Examples of false positives (hard cases) - these should not be converted:

teknihkalaš<hyph>luonddudieđalaš
Norplus<hyph>prográmma
Mjøs<hyph>lávdegotti
norgga<hyph>ruoŧa
dánska<hyph>norgalaš

Precision, Recall, a repetition

 False positives: Hyphens that should be kept as is
 False negatives: Soft hyphens not recognised.

tp = true positives, fp = false positives
tn = true negatives, fn = false negatives

P = tp/(tp+fp) and R = tp/(tp+fn)

P = (number of real hyphens detected) / (number of hyphens found)
R = (number of real hyphens detected) / (number of real hyphens in the text)
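
As a minimal sketch of the definitions above, the following Python snippet computes P and R from raw counts. The function name `precision_recall` and the counts in the usage lines are made up for illustration; they are not results from the hyphen detection script.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for hyphen detection counts.

    tp: real hyphenation points that were tagged
    fp: other hyphens that were wrongly tagged as hyphenation
    fn: real hyphenation points that were missed
    """
    found = tp + fp  # number of hyphens the detector tagged
    real = tp + fn   # number of real hyphenation points in the text
    precision = tp / found if found else 0.0
    recall = tp / real if real else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=90, fp=5, fn=10)
print(f"P = {p:.2f}, R = {r:.2f}")
```

Note that true negatives (ordinary hyphens correctly left alone) do not enter either measure.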

If we pick only hyphens at line end, then the number of false positives will drop. So will the number of true positives…
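
The line-end-only strategy could be sketched roughly as follows. This is not Saara's script, only an assumed illustration: the function name `tag_line_end_hyphens` is hypothetical, the `<hyph>` marker is copied from the examples above, and the sample text is made up.

```python
import re

# A hyphen directly between a word character and a line break,
# where the next line starts with a word character.
LINE_END_HYPHEN = re.compile(r"(?<=\w)-\n(?=\w)")


def tag_line_end_hyphens(text):
    """Join words split across lines and mark the split point with <hyph>.

    Hyphens inside a line are left untouched.
    """
    return LINE_END_HYPHEN.sub("<hyph>", text)


# Made-up sample: one mid-line hyphen, one hyphen at line end.
sample = "Mjøs-lávde-\ngotti evttohus"
print(tag_line_end_hyphens(sample))
```

A sketch like this leaves compounds such as norgga-ruoŧa alone when they occur inside a line, but it still tags a compound hyphen that happens to fall at a line break, and it misses any hyphenation mark that ended up in the middle of a line after reflowing.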

  1. We need to review whether automatic hyphen detection alone is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on.
  2. Acceptable results: 90% of all real hyphens correctly tagged.

6. Linguistics

North Sámi

(negative of háliidit and gen/acc of mii etc.)
the "in háliit"/"in ??" and maid/*mait and guliid/*guliit
d>t only for lexical stems, not for suffixes and closed class words.
Since we do not have any suffix boundary symbol %>, this is difficult.

hum-tf4-ans175:~/gt/sme trond$ kwic-snt 'h.liid ' corp/*
h.liid
hum-tf4-ans175:~/gt/sme trond$ kwic-snt 'h.liit ' corp/*
h.liit
ielddaválggat, ja NSR:ii ges Sámediggi. Utsi ii háliit dasa dadjat maide, go dál
                     ollašuvvan. Okta gielda ii háliit oassálastit barggus danne
                     ollašuvvan. Okta gielda ii háliit oassálastit barggus danne
aid maid Suodjalus loahpaha. Muhto go stáhta ii háliit oastit, de Suodjalus beas
                   loahpaha. Muhto go stáhta ii háliit oastit, de Suodjalus beas
ččii boahtteáiggi sámi servodaga. Muhto mii eat háliit ruovttoluotta dološáigái.
                  sámi servodaga. Muhto mii eat háliit ruovttoluotta dološáigái.
                                         Dál ii háliit šat joatkit dan birra ság
                                         Dál ii háliit šat joatkit dan birra ság

Status quo after Thomas' last bug fix:
háliit  háliidit+V+TV+Ind+Prs+ConNeg
háliid  háliidit+V+TV+Ind+Prs+ConNeg   <==== should this one be allowed?
maid    maid+Interj
mait    mait    +?

Lule Sámi

Open tasks:

Numerals

The following North Sámi linguistic issues should be settled before going into the numeral project:

  1. Three-part compounds
  2. Diphthong simplification
  3. Derivation

These issues were recently completed for Lule Sámi, and it is more efficient to complete them for North Sámi directly afterwards than to begin a new topic.

Numeral treatment is at different levels in the existing sme and smj parsers, but the issue itself is common to the two languages, and should therefore be treated in parallel.

Numerals in North Sámi: Inventory is listed elsewhere.

Numerals in Lule Sámi: There are 70 lines of code setting up the structure for case inflection of basic numerals.

7. Name lexicon infrastructure

Summary: see the newsgroup

Complex names

Task list for this issue:

XML format

Basis for the progress: separate documents according to project:

Discussion to continue in the newsgroup. Sjur will post a draft XML structure based on the above.

Tasks:

  1. testing of conversion
  2. continue the discussion of the name lexicon format (Saara, Tomi, Sjur, Trond)
  3. implement a prototype in eXist
  4. eXist as editor:
    1. develop the needed XQueries and interface
    2. synchronisation between risten.no and
    3. test whether eXist as editor is actually working well

8. Other

New server

Ordered a new computer last Friday: Quad G5 (PowerMac), with 30” screen. Will be placed in Tromsø. Usage:

Code conventions

Suggestions from Sjur:

These conventions were decided upon, and should be used from now on.

Technical issues

Video conferencing across firewalls

A new server has been bought, but it is still open when it will be installed and usable. It may take quite some time, and the video quality will be low, judging from a simple test. But the most important thing is group voice chat across the SD firewall, which will probably work quite well.

Bug fixing

28 open bugs (and 25 risten.no bugs)

Move Bugzilla

Move Bugzilla to our new server when it arrives, and make it work at both the old URL [http://giellatekno.uit.no/bugzilla/] as well as a similar one based on the name of the new server (e.g. something like [http://project.divvun.no/bugzilla], where the ‘project’ part is still open for discussion) (Thor Øivind and Børre).

risten.no

It is back on the air since last Wednesday, with a small correction on Thursday.

Rucksacks

They are all in Guovdageaidnu. Tomi can pick them up after Christmas on his way back from vacation.

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

19.12.2005 09:30

Closed at 12:02