Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 10:51.

Present: Børre, Maaren, Sjur, Steinar, Thomas, Tomi

Absent: Saara, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond

3. Documentation

The open documentation issues fall into these three categories:

TODO:

4. Corpus gathering

Børre has added texts from Min Áigi to the prooftest corpus dir.

TODO:

5. Corpus infrastructure

Alignment

TODO:

6. Infrastructure

TODO:

7. Linguistics

North Sámi

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in [this meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no, since it might be faster and better suited than the official one; local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries, e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)

9. Spellers

OOo speller(s)

TODO after the MS Office Beta is delivered:

Testing

Spelling Error Markup

Procedure for marking up:

  1. pick a file in: /home/steinarn/gt/sme/zcorp/prooftest/bound/sme/news/MinAigi/, e.g.: index2.php_id_647.html.xml
  2. rename it from .xml to .correct.xml: index2.php_id_647.html.correct.xml
  3. copy to your own computer
  4. open in SEE or XMLEditor
  5. add manual markup according to the established convention
  6. when done, copy the file back to victorio - see dir structure below
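The renaming in step 2 can be sketched in Python as follows. This is only an illustration: the function name `corrected_name` is hypothetical, and the sketch assumes that only the final `.xml` suffix changes, as in the example file name above.

```python
from pathlib import PurePosixPath

SUFFIX = ".xml"
CORRECT_SUFFIX = ".correct.xml"


def corrected_name(path: str) -> str:
    """Return the path with its final .xml replaced by .correct.xml."""
    p = PurePosixPath(path)
    if not p.name.endswith(SUFFIX):
        raise ValueError(f"not an .xml file: {path}")
    if p.name.endswith(CORRECT_SUFFIX):
        return str(p)  # already renamed, leave it alone
    # Strip the trailing ".xml" and append ".correct.xml" instead.
    return str(p.with_name(p.name[: -len(SUFFIX)] + CORRECT_SUFFIX))
```

Calling `corrected_name("index2.php_id_647.html.xml")` yields the name shown in step 2. The function only computes the new name; the actual rename and copy are still done by hand or by a separate tool.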

Directory structure and file locations for manually corrected files:

1   prooftest/.../orig/file.html              loading
1   prooftest/.../orig/file.html.xsl          converting
2   prooftest/.../bound/file.html.xml         to this file, copying back to orig as 3a
3a  prooftest/.../orig/file.html.correct.xml  speling§spelling
    working on this manually, using RCS to check in each generation of manual markup
3b  prooftest/.../bound/file.html.xml         <error corr="spelling">speling</error>
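The relation between the manual markup in 3a and the XML in 3b can be illustrated with a minimal Python sketch. It is not the project's actual conversion tool. It assumes that both the misspelling and its correction are single whitespace-delimited tokens joined by `§`, and that the input is the plain text of one text node (not a whole XML file). The function name `markup_to_xml` is hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumption: error and correction are single tokens without whitespace,
# joined by a section sign, e.g. "speling§spelling".
MARKUP = re.compile(r"([^\s§]+)§([^\s§]+)")


def markup_to_xml(text: str) -> str:
    """Replace every error§correction pair with an <error> element."""
    parts = []
    pos = 0
    for m in MARKUP.finditer(text):
        # Escape the untouched text between two marked-up errors.
        parts.append(escape(text[pos:m.start()]))
        error, correction = m.group(1), m.group(2)
        parts.append(
            "<error corr=%s>%s</error>" % (quoteattr(correction), escape(error))
        )
        pos = m.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)
```

Applied to the string `speling§spelling`, the function returns the element shown in 3b. Punctuation attached directly to a token would end up inside the error or the correction, so a real converter would need tokenisation rules that this sketch leaves out.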

TODO:

Testing tools

A first version of the statistics and test result processing is finished and working. It does not work with our current version of Forrest (though it works with the standard Forrest version), and it presently targets only typos.txt analysis, so further improvements are needed. Even so, this version already gives us valuable feedback on areas for improvement.

Test output can temporarily be found on [http://88.114.121.148:8888/doc/proof/spelling/testing/spelltest-typos-plx-sme_20070404.html] (Sjur’s computer - NB! The IP number will change! - but it will probably stay the same for the rest of the week)

TODO:

Regression tests

TODO:

Localisation

We need to translate the info added to our front page (and a separate page) regarding the beta release. Also the press release needs to be translated.

TODO:

Lexicon conversion to the PLX format

Postverbal clitics

Numbers

Numbers as figures need to be generated up front. The output should be:

1	UILH
2	UILH
...
1000000 	UILH
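Generating this list up front could look roughly like the Python sketch below. The tag `UILH` and the upper bound of 1000000 are taken from the output above, and the tab separator is read off the same lines. The function name `number_entries` is hypothetical.

```python
from typing import Iterator


def number_entries(upper: int = 1_000_000, tag: str = "UILH") -> Iterator[str]:
    """Yield one 'number<TAB>tag' line for each integer from 1 to upper."""
    for n in range(1, upper + 1):
        yield f"{n}\t{tag}"
```

To produce a file, the lines can be written out with something like `out.writelines(line + "\n" for line in number_entries())`, where `out` is an open text file.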

TODO:

Compounding restrictions

How to include compounding restriction comment tags in the transducers:

giv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp
=> (using a perl script or similar)
+SgNomCmp+SgGenCmp+PlGenCmpgiv0ri:giv'ri ALBMI ; !
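The rewrite above is suggested as a perl script or similar; the same idea is sketched here in Python. The sketch assumes that the restriction tags all end in `Cmp`, that they follow the `!` comment marker separated by spaces, and that nothing else is in the comment. Lines that do not fit this pattern are returned unchanged. The function name `move_compound_tags` is hypothetical.

```python
import re

# Assumption: the comment consists only of tags such as +SgNomCmp,
# separated by whitespace, directly after the "!" comment marker.
COMMENT_TAGS = re.compile(
    r"^(?P<entry>[^!]*!)\s*(?P<tags>\+\w*Cmp(?:\s+\+\w*Cmp)*)\s*$"
)


def move_compound_tags(line: str) -> str:
    """Move compounding tags from the trailing comment to the line start."""
    m = COMMENT_TAGS.match(line.rstrip("\n"))
    if m is None:
        return line  # no tag comment: leave the line as it is
    tags = "".join(m.group("tags").split())  # drop the spaces between tags
    return tags + m.group("entry")
```

For the entry above, `move_compound_tags("giv0ri:giv'ri ALBMI ; !+SgNomCmp +SgGenCmp +PlGenCmp")` gives the second line of the example. A trailing newline on a rewritten line is dropped, so the caller has to add it back when writing the result to a file.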

TODO:

  1. improve prefix conversion to PLX (Tomi)
  2. improve middle noun conversion to PLX (Tomi)
  3. improve noun + adjective PLX conversion: (Tomi)
    1. compounding stems - how do we generate them? Using the java client? +SgNomCmp+Cmpnd = sáme–, should give the correct compounding stem, shouldn’t it? We want to optionally go from: sáme- NLI to sáme NL: - NLI (->) NL, which means we should be able to extract correct compounding stems using xfst methods only.
    2. compounding tags - we need to obey them when making the transducers. Suggestion - see above.
  4. add propernouns to xfst-based conversion
  5. make conversion test sample; add conversion testing to the make file (Tomi)
  6. improve number conversion (Børre, Tomi)
  7. run xfst-based PLX conversion on victorio, make the result available on our public server (Saara, Sjur)

Public Beta release

Due to the problems with generating the PLX files discussed above, we need to postpone the release. The new target is mid-April, before the “physical” meeting.

DONE:

TODO:

Version identification of speller lexicons

The date stamp isn’t updated automatically; it needs to be.
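One way to get an automatically updated stamp is to compute it at build time, as in the following Python sketch. The actual format of the lexicon version string is not specified here. The `YYYYMMDD` form is an assumption borrowed from the test output file name above, and both the function name `version_stamp` and the `lang-date` layout are hypothetical.

```python
from datetime import date
from typing import Optional


def version_stamp(lang: str, today: Optional[date] = None) -> str:
    """Build a stamp such as 'sme-20070404' from a language code and a date."""
    if today is None:
        today = date.today()  # default: the day the build is run
    return f"{lang}-{today:%Y%m%d}"
```

A build rule could call `version_stamp("sme")` each time the lexicon is compiled, so the stamp never has to be edited by hand.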

TODO:

10. Other

Project meeting IRL

The planned gathering will have to be on 16.-20.4., in Guovdageaidnu. All of Divvun should participate, and some from the UiTø project will as well.

Corpus contracts

TODO:

Bug fixing

51 open Divvun/Disamb bugs, and 23 risten.no bugs

New team member

Per-Eric Kuoljok started working in the Divvun project on April 1. He needs to have all accounts and working equipment in place before we meet in Kautokeino on April 16.

TODO:

11. Next meeting, closing

The next meeting is 23.4.2007, 09:30 Norwegian time.

The meeting was closed at 11:34.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond