August 2006 gathering

Place: Tromsø
Dates: 22-24?

Participants

Børre, Maaren, Sjur, Thomas, Tomi, Lene, Trond, Saara

Topics

Presentation, status quo and plans for divvun and disamb
Linguistic issues
- Normativity
- Lexical coverage
- 3-part compounds
Lang-tech issues:
- finalise Xerox hyphenation (Trond + Sjur 2-3h)
- G3 issue for sme
Technical issues
- The munching deadlock
- M4 macros for hyph/spell/dis versions of TWOL
- our own wiki? for (technical) CG documentation,
Speller plans
Polderland work
- Online meeting with them? Yes. Memo
Proper noun status and action plan
- action plan:
- technical evaluation: how do we make proper noun editing work in Emacs?
- how much more

Commented topics

linguistic issues
- General linguistic discussion (morphology (and syntax?)):
  - What is on top of the priority list?
  - What are our most serious problems?
  - What can we learn from each other?
  - How should we work together across the projects in the remaining months?
- Lexical coverage
- 3-part compounds
lang-tech issues:
- finalise Xerox hyphenation (Trond + Sjur 2-3 timar)
- G3 issue for sme
technical issues
- The munching deadlock
- M4 macros for hyph/spell/dis versions of TWOL
- our own wiki? for (technical) CG documentation, (open source) fst technology and Saami language technology (basically the parts of our current documentation that isn’t project-specific. Goal: to invite and engage a larger community (CG, fst/transducer users, i.e., the grammatically-based bottum-up parsing community) to improve and create documenation for the above topics. Potential partners are Helsinki, Oslo, Odense, Stuttgart.
speller plans
infra to generate word form lists for Polderland/Aspell/HunSpell type spellers. We need to setle on plans for this infrastructure now.
Polderland work
- online meeting with them? Yes, memo available
proper noun status and action plan
- action plan: we need to systematically go through the most typical/useful tasks (problem now: Sjur and Tomi do not edit lexicons enough to really know what is most important)
- technical evaluation: how do we make proper noun editing work in Emacs, and who? Are the coding (as in XML tags/structure) solutions good? (speed is not an issue, at least not on my new Mac:-), and shouldn’t be on the server either)

Worst-case scenario: The name project failed, took too long time, etc., and we will have to build separate name lexica in traditional lexc format for each language. In the best case we will get it up and running in a reasonable amount of time.

corpus collection:
- how much more (in terms of effort and invested time)? Børre needs to start looking at testing pretty soon - we should have tools for people to test this fall, and we need a working feedback infra to make sure we get the respons we need
divvun plans ahead

Time table

Time	Tuesday	Wednesday	Thursday
8:30-10:00	A Presentation (9:00->)	T lexc2Xspell	Machine update + planning
		L Consequences of eval	Aligner (Trond, Saara)
10:00-10:30	A Reports, plans	Coffee	A Coffee
10:30-12:00	a Polderland	T lexc2Xspell	-
	a Polderland	L Consequences of eval	-
12:00-13:00	A Lunch	A Lunch	A Lunch
13:00-14:00	A G3/Howto/m4/Wiki	A Name lexicon 1	(exit Saara, Maaren)
	13:45: Coffee	(what shall we store)	-
14:00-14:30	T Video with PL (1h)	A Coffee	A Coffee
	L Evaluate		-
14:30-16:00	A *G3/Howto/m4/Wiki	T Name lexicon 2 (how)	-
-	-	L (3part)	-

... | preprocess --abbr=bin/abbr.txt | corrtypos.pl | ..
... | preprocess --abbr=bin/abbr.txt --corr=src/typos.txt | lookup ...

A = all
a = all - Lene
T = Saara, Sjur, Trond, Tomi, Børre
L = Lene, Maaren, Thomas

Tuesday afternoon 1

A Presentation with Polderland
T Video with PL (Saara, Sjur, Trond, Tomi, Børre, Maaren, Thomas)
L Evaluate our linguistic analysers (Lene, Maaren, Thomas)
  How good are the tools? (M, T explaining L what input she can expect from M, T)
  What does it take to make them better?
  Do we need tools for measurement
  Or: Office as usual, working

Tuesday afternoon 2

G3: Trond, Sjur, Thomas
Howto: Maaren explains practical linguistic work to Lene <= M&L find out how to cooperate…
m4: Saara, Tomi
Wiki: Børre (todo: write a spec)

Thursday morning

Plenary machine update (all(?)) <==
Aligner (Trond, Saara)

Machine updates:

forrest issues / installations / updates
readline / bash

Thursday evening

G3 (Sjur, Trond, Thomas) <==
Hyphenation (Sjur, Trond) <==
Bible format discussion
Corpus health care
Saami chars in pdf
- Solved! Almost:-) - path needs to be generalised, then checked in to CVS.
i18n finalisation: language selection menu (using dispatcher)

rule for id field in common.xml
default: you are your own id
overriding: your id is different from yourself, and points to another lemma

common.xml
 Kautokeino .. nob, nno, eng ... id=Guovdageaidnu
 Guovdageaidnu ... sme, sma, smj ... id=Guovdageaidnu

sme.xml (id info is inherited, perhaps)
 Kautokeino
 Guovdageaidnu
   |
  `´
sme.lexc (pure lexc file, generated)
 Kautokeino contlex-i ;
 Guovdageaidnu contlex-j ;

M4 flags:

HYPH:

exclude or include the hyphenation marks when making a hyphenator transducer (“#”, “-“ and “^”)

SHORTCOMP:

Permitting truncated two- and three-part compounds or not

DIPHSIMPL:

Exclude/include twol ruleset to allow diphthong simplification.

norm: oahpaheddjiid
also: oahpaheaddjiid and oahpaheddjiid
==> include/exclude the G3-sensitivity of the diphthong simplification rules

non-m4-variation (variation in the lexicon):

Circular transducer or not? Today: grep -v '^[CN]^'
Including SUB entries of not?
EAST, WEST, ALLDIALECTS Differenting between eastern and western dialectal forms, tolerating both
SOUTH Tolerating cleary non-standard (like Locative Sg -n)

Language Technology at UiT

Page Content