The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
Opened at 14:09.
Present: All
Absent: None
Main secretary: All except Ilona
Agenda accepted as is?
This was our original plan:
| No. | Task | Start | Finish | Status quo |
|---|---|---|---|---|
| 1 | Language-independent preprocessing | 2004-1 | 2004-1 | ok |
| 2 | Infrastructure for disambiguation | 2004-1 | 2004-2 | ok |
| 3 | Corpus interface - prototype | 2004-1 | 2004-4 | (2004-3/2005-3) |
| 4 | Basic work for North Sámi | 2004-1 | 2004-4 | ok |
| 5 | North Sámi disambiguation - prototype | 2004-1 | 2005-2 | ok |
| 6 | Revise morphological analysers | 2004-1 | 2006-4F | ongoing, with Divvun |
| 7 | Basic work for Lule Sámi | 2004-3 | 2005-4 | L |
| 8 | Lule Sámi disambiguation - prototype | 2004-4 | 2005-4 | L |
| 9 | Parallel text corpora - prototype | 2005-1 | 2005-2 | late, 2006-1? |
| 10 | Corpus interface - beta | 2005-1 | 2005-4 | late, 2006-2? |
| 11 | North Sámi disambiguation - beta | 2005-3 | 2005-4 | (2005-1) |
| 12 | Parallel text corpora - beta | 2005-3 | 2006-1 | late |
| 13 | Lule Sámi disambiguation - finished | 2005-4 | 2006-2 | L |
| 14 | North Sámi disambiguation - finished | 2006-1 | 2006-4F | L |
| 15 | Corpus interface - finished | 2006-1 | 2006-4F | |
| 16 | Parallel text corpora - finished | 2006-2 | 2006-4F | |
Comments to Lule Sámi (L):
Lule Sámi is late, since we still have not got a lexicon. Divvun will make a Lule Sámi speller, and we participate in the basic morphological work, but the transducer will be finished so late in our project period that we will not spend much time on developing a parser for it this year. Our board (the KUNSTI program board) has suggested that we omit Lule Sámi if the full smj.fst analyser is delayed much further than it already is. The role of Lule Sámi will be no more than that of a test ground for alternative approaches to North Sámi, with the possible exception of a sme/smj bilingual raw-text corpus.
Here, we give a short, subjective statement of what the status quo is and what we see as the main tasks. Planning proper comes at the next point of the agenda.
The graphical interface is basically ready, although we haven’t been in contact with Lars Nygård at UiO for a while. Also, we still have too little text. Saara is working with the corpus interface to collect texts.
Nothing has been done on the parallel text corpus. We at least have Bibles, and these are already aligned, i.e. the verses have the same numbers. For other texts, we need a text aligner.
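Since the Bible verses share their numbering, pairing them needs no statistical aligner. The following is only a minimal sketch of that idea, assuming each translation has already been read into a dictionary keyed by (book, chapter, verse). The function name, the key layout and the toy data are all hypothetical and not taken from the project's corpus tools.

```python
def align_by_verse(text_a, text_b):
    """Pair verses that carry the same (book, chapter, verse) key in both texts.

    Verses present in only one of the texts are skipped.
    """
    shared_keys = sorted(text_a.keys() & text_b.keys())
    return [(key, text_a[key], text_b[key]) for key in shared_keys]


# Hypothetical toy data; real input would be read from the corpus files.
text_a = {("GEN", 1, 1): "verse 1 in language A",
          ("GEN", 1, 2): "verse 2 in language A"}
text_b = {("GEN", 1, 1): "verse 1 in language B",
          ("GEN", 1, 3): "verse 3 in language B"}

pairs = align_by_verse(text_a, text_b)
```

Texts without such shared numbering cannot be handled this way, which is why a proper text aligner is still needed for them.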
Ilona is working on the lists of missing words from the XML-based corpora. The bulk of the missing list consists of typos, i.e. our lexical coverage is starting to be good (judging from the non-hapaxes). She is now going through the rest lists again, to see whether they are covered by new versions of the
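The way a missing list restricted to non-hapaxes can be produced is sketched below. This is an illustration under stated assumptions, not the script actually used: a plain Python set called `lexicon` stands in for the morphological analyser, and the function name and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def missing_non_hapaxes(tokens, lexicon, min_count=2):
    """Return unrecognised word forms seen at least min_count times.

    `lexicon` is an assumed stand-in for the analyser: a token counts as
    recognised if it is a member of this set.
    """
    counts = Counter(tok for tok in tokens if tok not in lexicon)
    # most_common() sorts by descending frequency
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Hypothetical toy corpus
tokens = ["a", "b", "b", "c", "c", "c", "d"]
lexicon = {"a"}
result = missing_non_hapaxes(tokens, lexicon)
```

Forms that occur only once are dropped, on the assumption that a repeated unknown form is more likely to be a real lexical gap than a one-off typo.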
The analyser is constantly being improved. The generator now overgenerates wildly. This is not so bad for our corpus analyser project, but for other applications it is very bad.
Linda has been working on single words, e.g. oktii, disambiguating most, but not all, cases. She has also been working on numerals, where the problem is that they can vary in meaning. The complete tag list is needed for the corpus interface. The documentation of the tag list is intended (at least) to be up to date at any point.
We will need regular expressions (e.g. for dates) for numerals in the lexc format. This will be looked into, especially by Linda (suggestions) and Saara (comments).
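As a rough illustration of what such a date pattern has to cover, here is a sketch written with Python's `re` module rather than in lexc/xfst notation. It assumes numeric dates of the form D.M.YYYY (for example 25.1.2006); that format, the pattern and the function name are assumptions made for this example, not decisions recorded in these minutes.

```python
import re

# Assumed date format: day.month.year, e.g. 25.1.2006 or 26.01.2006.
# Day 1-31 and month 1-12, each with an optional leading zero; 4-digit year.
DATE = re.compile(
    r"\b(0?[1-9]|[12][0-9]|3[01])\.(0?[1-9]|1[0-2])\.(\d{4})\b"
)


def find_dates(text):
    """Return every substring of text that matches the assumed date format."""
    return [match.group(0) for match in DATE.finditer(text)]


dates = find_dates("25.1.2006 ja 26.01.2006")
```

The pattern checks the numeric ranges only; it does not verify that a given day exists in a given month.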
Abbreviations: links to the non-abbreviated forms would be interesting.
example in the lexicon: geogr ab-dot-adj ; !geografiijas
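If the full form is kept in the comment after the "!" sign, as in the entry above, the link from abbreviation to full form could be extracted mechanically. The sketch below assumes exactly that layout (stem, continuation lexicon, ";" and then "!" followed by the full form). The function name and the returned field names are hypothetical, and real lexc entries can be more complex than this.

```python
def parse_abbreviation_entry(line):
    """Split an entry like 'geogr ab-dot-adj ; !geografiijas' into its parts.

    Assumes the layout: stem, continuation lexicon, ';', then an optional
    '!' comment holding the non-abbreviated form. Returns None otherwise.
    """
    entry, _, comment = line.partition("!")
    fields = entry.replace(";", " ").split()
    if len(fields) != 2:
        return None
    abbreviation, continuation = fields
    return {
        "abbreviation": abbreviation,
        "continuation": continuation,
        "full_form": comment.strip() or None,
    }


parsed = parse_abbreviation_entry("geogr ab-dot-adj ; !geografiijas")
```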
Four large open issues: locativepl/comitativesg, gen/acc, locatives/pxsg3, numerals
gt/sme/corp/examples/
ex-ComLoc.txt ex-PxLoc.txt ex-buot.txt ex-seammas.txt
ex-AdvComp.txt ex-GenAcc.txt ex-Verb.txt ex-maid.txt
ex-Aktio.txt ex-Num.txt ex-alit.txt ex-oktii.txt
Goal: Finish in time, with a disambiguator and a graphical search interface for grammatically analysed Sámi text, monolingual, and perhaps also bilingual.
There are 2 x 2 different tasks here:
Work plan:
Testing: Linda, Trond, Ilona
We will need to differentiate between sentence adverbials and other adverbials.
Tasks:
Goal: Have a disambiguator that is good enough to min
What to do, when.
Texts asap; the rest as outlined above: the interface with a goal at the end of the spring, the grammar with a deadline (not the final one) at the end of the spring.
Eckhard’s message to us:
I have some happy Christmas news: we have received a positive answer to the new PaNoLa application, though with a reduced grant of 200,000, just like last time, so it will probably be the same level of activity as in 2005. We must soon decide where and how we arrange the workshops.
[http://beta.visl.sdu.dk/visl/smi/]
From: Bjarte.Toftaker@hum.uit.no Date: 4 January 2006 09:09:32 GMT+01:00 To: Gulbrand.Alhaug@hum.uit.no, Trond.Trosterud@hum.uit.no The faculty has been asked to prioritise the applications for funding for bilateral research cooperation. You have both applied; could you send me a copy of your applications, so that we can do the prioritising? This is fairly urgent, so it would be good if you could do this as soon as possible.
Just before Christmas, information was posted on the intranet that we have a staff seminar on 25 and 26 January in Karasjok. I hope you have seen this information. The programme will be posted on the intranet, hopefully during next week.
One of the two final weeks of January.
Physical, in Helsinki or Tromsø
Ilona wants to be notified by SMS.