Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 10:42.

Present: Børre, Saara, Sjur, Steinar, Thomas, Trond

Absent: Maaren, Tomi

Agenda accepted as is.

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

TODO:

4. Corpus gathering

We should probably concentrate on mending what we have until February 8th.

Børre has been fixing the meta information for many files, adding info on parallel language versions.

TODO:

5. Corpus infrastructure

Aligner

The command line aligner is in place, and most(?)/all(?) potential texts are aligned.

TODO:

Conversion issues

Analysis statistics:

genre	words	missing	% recognised
adm	1493746	28967	98.06
bib	241233	2651	98.90
fac	382353	11371	97.03
fic	70601	7847	88.89
law	284700	6269	97.80
new	4032848	17134	99.58
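The percentages follow directly from the two count columns. A minimal Python sketch that recomputes them and adds an overall figure is given here; the counts are copied from the table, while the function name and the total row are illustrative additions and not part of the analysis tools.

```python
# Word counts and unrecognised-word counts per genre, copied from the table.
stats = {
    "adm": (1493746, 28967),
    "bib": (241233, 2651),
    "fac": (382353, 11371),
    "fic": (70601, 7847),
    "law": (284700, 6269),
    "new": (4032848, 17134),
}


def percent_recognised(words, missing):
    """Share of word tokens that were recognised, in percent."""
    return 100.0 * (words - missing) / words


for genre, (words, missing) in stats.items():
    share = percent_recognised(words, missing)
    print(f"{genre}\t{words}\t{missing}\t{share:.2f}")

# Overall figure across all genres (not in the original table).
total_words = sum(words for words, _ in stats.values())
total_missing = sum(missing for _, missing in stats.values())
overall = percent_recognised(total_words, total_missing)
print(f"total\t{total_words}\t{total_missing}\t{overall:.2f}")
```

Weighting by token count matters here: the news genre dominates the corpus, so the overall share lies much closer to its figure than to that of fiction.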

There are still a lot of conversion errors, and fixing them should be a priority this month. Error types:

So far, we have only had character replacement (even context-free character replacement). We can use XSL string replacement to turn spelling errors into constructs like the following, cf. corpus.dtd and [a previous meeting memo | https://giellalt.uit.no/admin/weekly/2006/Meeting_2006-03-20.html#Correction+tags%3F]:

<error correct="corrected text">erroneous text</error>

Either we use the function Trond found, or we install and use Saxon 8 and XSL 2, which have good string manipulation tools.
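To make the intended result concrete, the following is a rough Python sketch of whole-word replacement that produces the error markup shown above. It is neither the function Trond found nor the Saxon/XSL 2 route, only an illustration of the transformation; the correction list and the function name are hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical correction list: erroneous form -> corrected form.
CORRECTIONS = {
    "recieve": "receive",
}


def mark_errors(text, corrections):
    """Wrap each listed erroneous word in <error correct="...">...</error>.

    The rest of the text is XML-escaped so the result can be placed
    inside an XML element.
    """
    if not corrections:
        return escape(text)
    # Longest forms first, so a longer entry wins over a shorter prefix.
    forms = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b")
    pieces = []
    pos = 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[pos:match.start()]))
        word = match.group(0)
        attr = quoteattr(corrections[word])  # includes the surrounding quotes
        pieces.append(f"<error correct={attr}>{escape(word)}</error>")
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return "".join(pieces)


print(mark_errors("We will recieve the files.", CORRECTIONS))
```

Unlike the context-free character replacement used so far, this matches whole words only, so a listed form is left alone when it occurs inside a longer word.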

Flow:

Do we have good tools for locating conversion errors? Problem: I spot an error in the analysis. Which file is it in? ccat removes the file source info, so we are working blind. Conclusion: we don’t have a good tool for this, only grep.
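Until a better tool exists, the grep approach can be wrapped in a small script that is run on the converted files themselves, before ccat strips the source information. This is only a sketch of what such a helper might look like: the function name, the `.xml` suffix default and the command-line usage are assumptions, not an existing tool.

```python
import sys
from pathlib import Path


def find_word(root, word, suffix=".xml"):
    """Yield (path, line number, line) for each line under root containing word."""
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if word in line:
                    yield path, lineno, line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python find_word.py <corpus directory> <word form>
    for path, lineno, line in find_word(sys.argv[1], sys.argv[2]):
        print(f"{path}:{lineno}: {line}")
```

This does no more than a recursive grep with line numbers, but it keeps the file name attached to each hit, which is the piece of information lost in the analysis output.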

TODO:

6. Infrastructure

Xerox tools wrapped as servers

TODO:

7. Linguistics

Names and multilinguality

TODO:

  1. finish first version of the editing (Sjur)
  2. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  3. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  4. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  5. start to use the xml file as source file
  6. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  7. merge placenames which are erroneously in different entries, e.g. Helsinki, Helsingfors, Helsset (linguists)
  8. publish the name lexicon on risten.no (Sjur)
  9. add missing parallel names for placenames (linguists)
  10. add informative links between first names like Niillas and Nils (linguists)

North Sámi

TODO:

Lule Sámi

TODO:

8. Name lexicon infrastructure

Decided in Tromsø:

Details can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

TODO:

Postponed:

9. Spellers

Polderland data generation

TODO:

Aspell

TODO when the major part of the PLX conversion is done:

Testing

TODO:

Localisation

The Windows installer should be localised. We have received the string file from Polderland; the strings should now be translated.

The Mac installer they use is the standard MacOS X installer, and as sme isn’t generally turned on as a UI language in MacOS X, there is little point in translating it. It could be done, though :-)

TODO:

10. Other

Corpus contracts

TODO:

Bug fixing

55 open Divvun/Disamb bugs, and 23 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

11. Next meeting, closing

The next meeting is 8.1.2007, 09:00 Norwegian time.

The meeting was closed at 12:12.

Appendix - task lists for the next week

Børre

Maaren

Saara

Sjur

Steinar

Thomas

Tomi

Trond