The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
Børre, Maaren, Sjur, Thomas, Tomi, Lene, Trond, Saara
infra to generate word form lists for Polderland/Aspell/HunSpell type spellers. We need to settle on plans for this infrastructure now.
Worst-case scenario: The name project failed, took too long, etc., and we will have to build separate name lexica in traditional lexc format for each language. In the best case we will get it up and running in a reasonable amount of time.
| Time | Tuesday | Wednesday | Thursday |
|---|---|---|---|
| 8:30-10:00 | A Presentation (9:00->) | T lexc2Xspell | Machine update + planning |
| | | L Consequences of eval | Aligner (Trond, Saara) |
| 10:00-10:30 | A Reports, plans | Coffee | A Coffee |
| 10:30-12:00 | a Polderland | T lexc2Xspell | - |
| | a Polderland | L Consequences of eval | - |
| 12:00-13:00 | A Lunch | A Lunch | A Lunch |
| 13:00-14:00 | A G3/Howto/m4/Wiki | A Name lexicon 1 | (exit Saara, Maaren) |
| | 13:45: Coffee | (what shall we store) | - |
| 14:00-14:30 | T Video with PL (1h) | A Coffee | A Coffee |
| | L Evaluate | - | |
| 14:30-16:00 | A *G3/Howto/m4/Wiki | T Name lexicon 2 (how) | - |
| - | - | L (3part) | - |
... | preprocess --abbr=bin/abbr.txt | corrtypos.pl | ..
... | preprocess --abbr=bin/abbr.txt --corr=src/typos.txt | lookup ...
A = all
a = all - Lene
T = Saara, Sjur, Trond, Tomi, Børre
L = Lene, Maaren, Thomas
A Presentation with Polderland
T Video with PL (Saara, Sjur, Trond, Tomi, Børre, Maaren, Thomas)
L Evaluate our linguistic analysers (Lene, Maaren, Thomas)
How good are the tools? (Maaren and Thomas explain to Lene what input she can expect from them)
What does it take to make them better?
Do we need tools for measurement?
Or: Office as usual, working
Machine updates:
Corpus health care
rule for id field in common.xml
default: you are your own id
overriding: your id is different from yourself, and points to another lemma
common.xml
Kautokeino .. nob, nno, eng ... id=Guovdageaidnu
Guovdageaidnu ... sme, sma, smj ... id=Guovdageaidnu
sme.xml (id info is inherited, perhaps)
Kautokeino
Guovdageaidnu
|
`´
sme.lexc (pure lexc file, generated)
Kautokeino contlex-i ;
Guovdageaidnu contlex-j ;
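
The id rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary stands in for common.xml (whose real structure is not shown in these notes), and the names `COMMON`, `entry_id`, `group_by_id` and `lexc_line` are hypothetical. The continuation class names are the placeholders used above.

```python
# Hypothetical in-memory stand-in for common.xml (the real file is XML;
# its actual element and attribute names are not known here).
COMMON = {
    # Explicit override: the id points to another lemma.
    "Kautokeino": {"langs": ["nob", "nno", "eng"], "id": "Guovdageaidnu"},
    # No id given: the default rule applies (you are your own id).
    "Guovdageaidnu": {"langs": ["sme", "sma", "smj"]},
}


def entry_id(lemma, entries=COMMON):
    """Return the explicit id if one is set, otherwise the lemma itself."""
    return entries[lemma].get("id") or lemma


def group_by_id(entries=COMMON):
    """Collect all lemmas that share the same id."""
    groups = {}
    for lemma in entries:
        groups.setdefault(entry_id(lemma, entries), []).append(lemma)
    return groups


def lexc_line(lemma, contlex):
    """Format one generated lexc entry: lemma, continuation class, semicolon."""
    return f"{lemma} {contlex} ;"


if __name__ == "__main__":
    print(group_by_id())
    print(lexc_line("Kautokeino", "contlex-i"))
    print(lexc_line("Guovdageaidnu", "contlex-j"))
```

With the default rule, only the entries that differ from their own lemma need an explicit id, which keeps common.xml small.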
HYPH:
SHORTCOMP:
DIPHSIMPL:
norm: oahpaheddjiid
also: oahpaheaddjiid and oahpaheddjiid
==> include/exclude the G3-sensitivity of the diphthong simplification rules
non-m4-variation (variation in the lexicon):
grep -v '^[CN]^'
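
For reference, a rough Python equivalent of the grep filter above. In grep's basic regular expressions only the first `^` is an anchor; the second one matches a literal caret, so the filter drops lines that begin with `C^` or `N^`. What these prefixes mark is not stated in these notes, and the sample lines and the function name `drop_marked` are made up for illustration.

```python
import re

# Matches a line that starts with "C^" or "N^" (literal caret).
# re.match anchors at the start of the string, like the leading ^ in grep.
MARKED = re.compile(r"[CN]\^")


def drop_marked(lines):
    """Keep only lines that do NOT start with C^ or N^ (like grep -v)."""
    return [line for line in lines if not MARKED.match(line)]


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the real lexicon.
    sample = ["C^foo", "N^bar", "baz", "xC^qux"]
    for line in drop_marked(sample):
        print(line)
```

Only a prefix at the very start of the line counts; a `C^` later in the line is kept.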