Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Name lexicon infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 09:38.

Present: Børre, Saara, Sjur, Tomi, Trond

Absent: Maaren, Thomas

Main secretary: Trond

Agenda accepted as is. We'll try to finish by 10:55, to allow for joining the celebration of the Sámi national day.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

On sick leave.

Tomi

Trond

3. Documentation

Reviews

Anything? Nothing.

Other documentation

Anything? Documentation has been added for disambiguation.

4. Corpus gathering

Collecting

See a previous meeting memo for what’s to be done.

Sent letter to Iđut and Kåfjord.

TODO: Still a lot for Børre!

Odin

Waiting for Sæth to discuss with colleagues how to implement the cooperation, and to get back to us.

Nothing heard.

Bible texts

TODO:

5. Corpus infrastructure

Task list:

  1. Include the xsl files under version control
    1. RCS version control is almost finished, but an issue with access control is still open. Discussed a bit in the meeting, but nothing conclusive. We’ll continue the discussion in the newsgroup.
  2. Incorporate language detection as part of the corpus processing (Saara)
    1. Almost finished. Needs an improved Finnish language model: at present it cannot distinguish Finnish from Sámi (proving the family bonds :-)
  3. We need to review whether automatic hyphen detection alone is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on.
    1. Acceptable results: 90% of all real hyphens correctly tagged.
  4. CGI-admin script to add xsl-file to a corpus file that doesn’t have one (Saara)
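
The acceptance criterion in item 3 reads as a recall figure: correctly tagged real hyphens divided by all real hyphens. A minimal sketch of that check, assuming the two counts come from a manually checked sample (the function name and the way the counts are obtained are hypothetical, not part of the existing tools):

```shell
# hyphen_recall_ok CORRECT TOTAL
# CORRECT = real hyphens the automatic detection tagged correctly
# TOTAL   = all real hyphens in the manually checked sample
# Succeeds (exit 0) when CORRECT/TOTAL is at least 90%.
hyphen_recall_ok() {
    correct=$1
    total=$2
    # An empty sample cannot be evaluated.
    [ "$total" -gt 0 ] || return 2
    # Integer form of correct/total >= 90/100, avoiding floating point.
    [ $((correct * 100)) -ge $((total * 90)) ]
}
```

For example, `hyphen_recall_ok 452 500 && echo acceptable` would print `acceptable`, since 452 of 500 is above the 90% threshold.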

Things are moving forward, but still more work to do. The list is left as is.

E-mail address in case of upload errors:

corpus@giellatekno.uit.no (-> Børre?) Also for reports about new uploads.

/www/opt/www/cgi-bin/smi/upload.cgi (no Forrest)

http://localhost:8888/upload/upload_corpus_file.html (Forrest)

One option is to ask the cochise team, i.e. royd or steinar, at the address cc.uit.no.

Problems with Greek letters in Word documents that use the font Sam Times Uni(versal) (Børre). Can't we just manually change the letters and fonts in the few documents affected?

We'll set these texts aside for the time being; they'll be put in a directory for broken texts. Such texts can be looked at later, if wanted or needed.

Suggestion for a script for text analysis.

We would like a shadow catalogue ga/ (giella analysed) parallel to the gt/ catalogue, with one file for each directory. One way of getting this is a crontab job that each night (afternoon!) runs the following command for each of the directories admin, bible, facta, ficti, laws, news:

ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt |
  lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg |
  vislcg --grammar src/sme-dis.rle > /usr/local/share/corp/ga/sme/dir.txt

For example:

ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt |
  lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg |
  vislcg --grammar src/sme-dis.rle > /usr/local/share/corp/ga/sme/admin.txt
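
The per-directory run could be wrapped in a small script for the crontab job. This is only a sketch under assumptions: the commands and flags are the ones given above, but the script name, the helper function names, the `--run` switch, and in particular restricting the `ccat` input to the subdirectory `gt/sme/<dir>` are not from the meeting and would need to be checked.

```shell
#!/bin/sh
# nightly_analysis.sh (hypothetical name): one analysed file per corpus directory.
CORP=/usr/local/share/corp
LANG_CODE=sme
DIRS="admin bible facta ficti laws news"

# Print the output file in the ga/ shadow catalogue for one directory.
out_file() {
    printf '%s/ga/%s/%s.txt\n' "$CORP" "$LANG_CODE" "$1"
}

# Run the analysis pipeline for one directory.
analyse_dir() {
    dir=$1
    # ASSUMPTION: the input is limited to the subdirectory; the memo's
    # command reads the whole gt/sme tree.
    ccat -a -r "$CORP/gt/$LANG_CODE/$dir" |
        preprocess --abbr=bin/abbr.txt |
        lookup -flags mbTT -utf8 bin/sme.fst |
        lookup2cg |
        vislcg --grammar src/sme-dis.rle > "$(out_file "$dir")"
}

# Loop over all directories. Plain sh only reports the exit status of the
# last pipeline stage, so only a vislcg failure is caught here.
analyse_all() {
    for dir in $DIRS; do
        analyse_dir "$dir" || echo "analysis failed for $dir" >&2
    done
}

# Nothing runs unless the script is called with --run.
if [ "${1:-}" = "--run" ]; then
    analyse_all
fi
```

Since bin/abbr.txt, bin/sme.fst and src/sme-dis.rle are relative paths, the crontab entry would first have to change to the directory that contains them, for instance `30 16 * * * cd /path/to/gt && sh nightly_analysis.sh --run` (path and time are placeholders).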

TODO:

6. Linguistics

Anything? Nothing.

7. Name lexicon infrastructure

Complex names

TODO:

XML format

TODO:

  1. update conversion from lexc to xml to reflect new xml format (Saara)
    1. mostly done, some open questions left
  2. testing of conversion
  3. eXist as editor:
    1. develop the needed XQueries and interface
    2. data synchronisation between risten.no and
    3. test whether eXist as editor is actually working well

More TODO:

Definitions/terminology:

8. Other

SGL Seminar

Technical issues

To ccat’s defence I must say that cat, in a similar situation, would have given far more error messages (hold on, testing still under way).

   preprocess file_name.txt - OK
   cat file_name.txt | preprocess - bug!!
   catxml file_name.xml | preprocess - ??
   ccat filename | preprocess - bug !!

This bug isn’t a high priority any more, because ccat behaves differently than cat, and because there is the possibility of avoiding cat when working locally.

BUG: close as Won’t fix. (Børre)
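
Should the issue be revisited, the file-argument and piped cases listed above could be compared with a small harness. This is a sketch only: the function name is hypothetical, and it merely checks whether the two invocations produce the same output (error messages included), which may or may not be how the bug shows itself.

```shell
# compare_arg_vs_stdin CMD FILE
# Runs CMD once with FILE as an argument and once with FILE piped in,
# then prints "same" or "differs" depending on the combined output.
compare_arg_vs_stdin() {
    cmd=$1
    file=$2
    # Capture stdout and stderr so that error messages count as a difference.
    from_arg=$("$cmd" "$file" 2>&1)
    # The cat is deliberate: it mirrors the piped case from the list above.
    from_stdin=$(cat "$file" | "$cmd" 2>&1)
    if [ "$from_arg" = "$from_stdin" ]; then
        echo "same"
    else
        echo "differs"
    fi
}
```

Usage would be along the lines of `compare_arg_vs_stdin preprocess file_name.txt`; the same comparison with `ccat` or `catxml` on the left side of the pipe needs a variant of the function.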

Bug fixing

32 open bugs (and 24 risten.no bugs)

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

13.02.2006 09:30

Closed at 10:37