Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. name lexicon infrastructure
  8. Other issues
    1. speech synthesis
      1. Things happening within SD
      2. the UiT NFR application
    2. Proofing article: deadline Dec. 5.
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 09:56.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Main secretary: Sjur

Agenda accepted with some additions; the speller infrastructure item was replaced with proper name lexicon infrastructure.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general (Børre, Tomi, Trond, Saara):

For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups:

  1. For the users/linguists: Which corpora exist, and how do I use them? This information is now scattered. Part of the how-to-use information is documented in the catxml documentation. A description of which documents are found where, and an overall documentation, have not been written, since the corpus is so sparsely populated.
  2. For the collectors: How do I add texts, where do I add them, how do I convert them (this is (partly?) done in the Corpus Conversion document)

test:

Tomcat->static HTML progress

Currently, all pages are generated directly from XML by Forrest within Tomcat. We will change this so that Forrest pre-generates the HTML (and PDF), and these ready-made files are served directly.

4. Corpus gathering

All the available law texts are downloaded from Odin, with (most of?) their corresponding Norwegian originals.

All available texts from NSI’s web site have been downloaded.

The texts are now in our corpus repository, but not converted to the corpus format. Some bugs relating to this process have been posted to Bugzilla.

Contracts

Update: All SD versions are now synchronised with the templates. Trond met with the lawyer, and she commented on the updated versions. Trond has updated the contracts.

Next step:

  1. Make our versions of the updated Helsinki contracts, and make sure they match our intentions. (Sjur and Trond)
  2. Send them to the SD lawyer and to the University lawyer through formal channels. (Sjur and Trond)

Contract 1 should have the main priority.

5. Corpus infrastructure

Updated task list:

  1. Include the xsl files under version control (Børre, Tomi, Saara)

    1. Saara has started a discussion in the newsgroup - please follow up!
  2. Incorporate language detection as part of the corpus processing (Tomi)
    1. Tomi will start, but possibly hand it over to Saara if it takes too much time (Tomi will have to prioritize the name lexicon/risten.no)
  3. we need a way to deal with hyphenated documents (documents with (manually) inserted hyphenation marks) in catxml/preprocess. (Tomi, Børre, (discussion in the newsgroup:) Sjur, Trond, Saara)
    1. Discuss details in the newsgroup
      1. What needs to be discussed now are the conditions that distinguish the second-to-last and third-to-last points below.
    2. in normal cases hyphenation points should be removed
      1. Done by Saara in preprocess. It is possible to turn it back to a regular hyphen with a flag.
    3. when testing the robustness of our parsers, as well as when testing the hyphenator, the hyphenation points should be retained:
      1. This is true for examples like "eala-hus": they should be converted to something like "ealahus", in order to both keep the hyphenation information when needed, and get it out of the way when not.
      2. In cases of truncated compounds like "ealahus- ja ...", we want the hyphen to stay untouched, and be part of the linguistic processing.
      3. Occasionally there are textbooks with explicit hyphenation points, like: ea-la-hus. In these documents, all hyphens, without exception, should be converted to .
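
The three cases above can be sketched in Python. This is only an illustration, not the actual preprocess code: the minutes do not say which character or markup represents a hyphenation point, so the soft hyphen (U+00AD) is used here as a stand-in, and the function names are hypothetical.

```python
import re

# Assumption: U+00AD (soft hyphen) stands in for the hyphenation marker;
# the minutes do not say which character or markup is actually used.
MARKER = "\u00ad"

# A hyphen with a letter on both sides, e.g. the one in "eala-hus".
INTERNAL_HYPHEN = re.compile(r"(?<=[^\W\d_])-(?=[^\W\d_])")


def mark_hyphenation(text, every_hyphen=False):
    """Turn hyphenation points into MARKER.

    every_hyphen=True models the textbook case, where all hyphens are
    hyphenation points. Otherwise only word-internal hyphens are
    converted, so a truncated compound like "ealahus- ja" keeps its hyphen.
    """
    if every_hyphen:
        return text.replace("-", MARKER)
    return INTERNAL_HYPHEN.sub(MARKER, text)


def strip_hyphenation(text):
    """Drop the markers when the hyphenation information is not wanted."""
    return text.replace(MARKER, "")
```

Calling `strip_hyphenation(mark_hyphenation("eala-hus"))` gives `ealahus`, while `mark_hyphenation("ealahus- ja")` leaves the text unchanged. Note that this sketch would also convert genuine word-internal hyphens, so it only makes sense for documents known to contain inserted hyphenation marks.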

6. Linguistics

Name lexicon

Summary: see the newsgroup

The plan for this project was as follows: Two lines of work run in parallel:

  1. name markup
    1. Done! There are errors in the markup; people are urged to correct them as they pop up.

Complex names

Task list for this issue:

North Sámi

Lule Sámi

Great progress has been made on the G3 issue; only some minor points remain. oa:å has been carried over to ä:e.

Open tasks:

Today’s compilation time:

real 5m17.157s
user 3m26.827s
sys  0m5.070s
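
For comparing compilation times between meetings, the figures can be converted to seconds. A minimal sketch, assuming the numbers come from the shell `time` command in the `NmS.SSSs` format shown above; the function names are hypothetical.

```python
import re

# The timing figures as reported in these minutes.
TIME_OUTPUT = "real 5m17.157s user 3m26.827s sys 0m5.070s"


def to_seconds(duration):
    """Convert a value such as '5m17.157s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", duration)
    if match is None:
        raise ValueError(f"unexpected duration: {duration!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def parse_time_output(text):
    """Map each label (real, user, sys) to its duration in seconds."""
    tokens = text.split()
    return {label: to_seconds(value)
            for label, value in zip(tokens[0::2], tokens[1::2])}


timings = parse_time_output(TIME_OUTPUT)
cpu = timings["user"] + timings["sys"]
print(f"wall clock {timings['real']:.3f}s, CPU {cpu:.3f}s")
```

The wall-clock time is about 317 seconds, of which about 212 seconds is CPU time (user plus sys).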

Numerals

The following North Sámi linguistic issues should be settled before going into the numeral project:

  1. Three-part compounds
  2. Diphthong simplification
  3. Derivation

These issues were recently completed for Lule Sámi, and it is more efficient to complete them for North Sámi directly afterwards than to begin a new topic.

Numeral treatment is at different levels in the existing sme and smj parsers, but the issue itself is common to the two languages, and should therefore be treated in parallel.

Numerals in North Sámi: Inventory is listed elsewhere.

Numerals in Lule Sámi: There are 70 lines of code setting up the structure for case inflection of basic numerals.

7. Name lexicon infrastructure

The Kven place name lexicon has its own information needs:

Linguistic info
 1. headword
 2. inflected form
 3. code for inflected form
 4. boundaries, if any, between name components and name elements
15. Kven name variant(s)
16. Sámi parallel name
17. Norwegian parallel name
18. related names, related toponyms (Finnish: vertailunimi(-nimet))
24. etymology

Geographical/location info:
 5. type of place (Finnish: paikanlaji)
 6. code for name type, according to SSR
 7. cadastral farm number (gnr.) and holding number (bnr.)
+
10. municipality
11. county
14. coordinates

Legal info:
 8. status under the Place Names Act (Lov om stadnamn)
 9. decision-making authority under the Place Names Act (Lov om stadnamn)

Source info:
19. informant explanations
20. informant(s)
21. collector(s), year
22. archive
23. literature, sources

Unclassified:
12. map product
13. map sheet
25. cross-reference to another entry (Norwegian: pilhenvisning, Finnish: nuoliviite)
26. sound file
27. picture(s), illustration(s)
28. other comments, catch-all field ("sekkepost")
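
The grouping above can be written down as a simple lookup table, which makes it easy to check that every field number from 1 to 28 is assigned to exactly one group. This is only a sketch: the field numbers and groups are taken from the list above, while the English field names and the variable and function names are illustrative and not part of any agreed format.

```python
# Field number -> English field name, grouped as in the minutes.
FIELD_GROUPS = {
    "linguistic": {
        1: "headword", 2: "inflected form", 3: "code for inflected form",
        4: "component/element boundaries", 15: "Kven name variants",
        16: "Sami parallel name", 17: "Norwegian parallel name",
        18: "related names", 24: "etymology",
    },
    "geographical": {
        5: "type of place", 6: "name type code (SSR)",
        7: "cadastral numbers", 10: "municipality", 11: "county",
        14: "coordinates",
    },
    "legal": {
        8: "status under the Place Names Act",
        9: "decision-making authority",
    },
    "source": {
        19: "informant explanations", 20: "informants",
        21: "collectors, year", 22: "archive", 23: "literature, sources",
    },
    "unclassified": {
        12: "map product", 13: "map sheet", 25: "cross-reference",
        26: "sound file", 27: "pictures, illustrations",
        28: "other comments",
    },
}


def group_of(field_number):
    """Return the name of the group that a field number belongs to."""
    for group, fields in FIELD_GROUPS.items():
        if field_number in fields:
            return group
    raise KeyError(field_number)


# Every field number should occur once, with no gaps.
all_numbers = sorted(n for fields in FIELD_GROUPS.values() for n in fields)
print(all_numbers == list(range(1, 29)))
```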

Present proposal:

Present risten.no:

Possible new proposal 1: as risten.no

Possible new proposal 2: separate documents:

Porsanger is both a person name and a place name; Porsáŋgu is only a place name, not a person name.

With 5 languages we get 10 Trosterud entries and 5 Timbuktu entries. It would be better to have 2 Trosterud entries and 1 Timbuktu entry, but 15 continuation lexica (contlexica) for these three concepts.
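
The arithmetic behind these numbers can be sketched as follows. The assumptions are labeled in the code: only sme and smj are named in these minutes, so the other three language codes are placeholders, and the reading of Trosterud as one person name plus one place name follows the Porsanger example above.

```python
from collections import Counter

# Assumption: sme and smj are named in the minutes; the rest are placeholders.
LANGUAGES = ["sme", "smj", "lang3", "lang4", "lang5"]

# Assumption: Trosterud is both a person name and a place name,
# Timbuktu only a place name, giving three concepts in total.
CONCEPTS = [("Trosterud", "person"), ("Trosterud", "place"),
            ("Timbuktu", "place")]

# Layout 1: one full entry per concept and language.
per_language = [(name, kind, lang)
                for name, kind in CONCEPTS for lang in LANGUAGES]
entries_per_name = Counter(name for name, _, _ in per_language)

# Layout 2: one entry per concept, holding one continuation lexicon
# reference per language (the values here are placeholder strings).
shared = {(name, kind): {lang: f"contlex-{lang}" for lang in LANGUAGES}
          for name, kind in CONCEPTS}
shared_per_name = Counter(name for name, _ in shared)
contlex_total = sum(len(by_lang) for by_lang in shared.values())

print(dict(entries_per_name), dict(shared_per_name), contlex_total)
```

The second layout keeps the number of entries equal to the number of concepts, while the language-specific information moves into the 15 continuation lexicon references.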

Discussion to continue in the newsgroup.

Tasks:

  1. testing of conversion
  2. continue the discussion of the name lexicon format (Saara, Tomi, Sjur, Trond)
  3. implement a prototype in eXist
  4. eXist as editor:
    1. develop the needed XQueries and interface
    2. synchronisation between risten.no and
    3. test whether eXist as editor is actually working well

8. Other

speech synthesis

Things happening within SD

the UiT NFR application

Application for international cooperation to NFR, Deadline Dec. 1. Preproject deadline

Both the Divvun and Disamb milieus see this as an important development path, and we will participate in the different planning processes. We aim to send an application to NFR by Dec. 1.

Proofing article: deadline Dec. 5.

We go for it - 2 pages.

“The official versions should be prepared according to the technical requirements of Springer LNCS series. You are kindly asked to consult our detailed instructions at the location [http://www.ling.helsinki.fi/events/FSMNLP2005/instructions-official.html].”

(Trond, Sjur and Tomi)

Technical issues

Video conferencing across firewalls

We’re still waiting for a working URL (working from outside SD, that is).

Bug fixing

24 open bugs (and 24 risten.no bugs)

Move Bugzilla

Move Bugzilla to the same server as the other ones (or make it work at the expected URL: http://giellatekno.uit.no/bugzilla/).

TODO, TODO. Thor Øivind.

risten.no

Risten.no crashed badly last week, with no traces of what happened. Tomi and Sjur are working on restoring it, with all data and an updated eXist version.

Tomi will then continue the proper name work.

AppleCare extended warranty

Only Maaren still has to register it, and will do so today.

Rucksacks

Not delivered yet! Sjur will investigate.

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

05.12.2005 09:30

Closed at 12:53