Meeting setup
- Date: 4.12.2006
- Time: 09:30 Norwegian time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:48.
Present: Børre, Sjur, Thomas, Tomi, Trond
Absent: Maaren, Saara
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- contact authors who have already received the corpus licensing contract
- consider a script for automatic testing of the spell checker in Word
- Began work. Some AppleScript done
- meeting Wednesday at 9.30 with Sjur to plan testing
- sma discussions with SD (with Sjur, Trond)
- add a simple password protection to risten.no in the G5
- consider infra for testing feedback
- Some plans made during the meeting with Sjur
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- update all forrest installations, including local patches
- fix bugs!
- Done some work on bug 348
Maaren
- investigate the generated word form list sent to Polderland - use the command
make wordlist TARGET=sme in victorio
- investigate unrecognised word forms in the hyphenator
Saara
- finalize server of the Xerox tools.
- help Trond with some shell commands
- re-analyze parallel files
- consider implementing some new features to the corpus files
- add closed POSes to the paradigm gen, if needed.
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- finally got multiuser editing safe against data corruption - after more than
a year!
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- investigate unrecognised word forms in the hyphenator
- nothing done, but this has become much easier with the debug output from
Tomi’s PLX conversion.
- decide how to specify compounding behaviour info in the lexicon
- meeting Tuesday 12.30
- had a meeting, but didn’t finish, and the second part disappeared in other
issues
- sma discussions with SD (with Børre, Trond)
- check why some SUB-marked entries got included in the normative transducer
- consider a script for automatic testing of the spell checker in Word
- discussed with Børre, pseudocode in place
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- make sure we get one/more licenses in Alta
- publish corpus contracts and project infra on NoDaLi-sta
- ask SD/Sig-Britt Persson about some of the South Sámi bible texts
- check whether lookup can be used to generate paradigms for closed POSes
- done, doesn’t look like it
- meeting Wednesday at 9.30 with Børre to plan testing
- done, memo available soon (not checked in) in the gt/doc/proof/ dir
- fix bugs!
- done some for risten.no, gleaned through others
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- redo the derived words work for smj
- decide how to specify compounding behaviour info in the lexicon
- begun
- meeting Tuesday 12.30
- check why some SUB-marked entries got included in the normative transducer
- investigate unrecognised word forms in hyphenator
- test editing of the xml files
- clean terms-sme.xml such that all names have the correct tag for their use
(e.g. @type=secondary)
- write paradigm grammar for the closed POSes
- fix bugs!
Tomi
- make PLX lexicon for all open POSes
- not done - how much is done, what is left?
- there is something fishy in the server: Saara has got it working but I
haven’t, although we have the same compilation
- add derivations to the PLX generation
- add compound stems to the PLX generation
- add closed POS and clitics to PLX generation
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- get more sma texts
- investigate unrecognised word forms in the hyphenator
- decide how to specify compounding behaviour info in the lexicon
- meeting Tuesday 12.30
- Meeting kept, but more to discuss.
- sma discussions with SD (with Børre, Sjur)
- redo the derived words work for smj
3. Documentation
TODO:
- update all forrest installations, including local patches (Børre)
- not yet, will do it in Alta
4. Corpus gathering
Nothing new during last week.
TODO:
- get sma Bible / NT texts (Trond)
- Discussions with the Sámi Parliament about sma (Børre, Sjur, Trond)
- ask SD/Sig-Britt Persson about some of the South Sámi bible texts (Sjur)
5. Corpus infrastructure
More texts to the graphical corpus interface
Trond has talked with Lars, who is writing documentation for the users.
TODO:
- Consider whether to implement the <p> only policy or not (Saara)
- add text to the server (Lars)
- Discuss with Lars (Trond)
Aligner
No more texts yet, Saara has included the aligner in the relevant perl script.
TODO:
- gather more parallel texts (Trond, Børre)
- re-analyze parallel files using the command-line version (Saara)
6. Infrastructure
The server hasn’t been working for Tomi, but is now working again. The
paradigm generator now only generates 16-17 word forms - far too few. It seems
all possessives have disappeared:
sajin NIR # !SUB
sa^jis NIR
sa^jiin NIR
sa^ji NIR
sad^jái NIR
sad^ji NIR
sad^je NIRL - this is done by client
sa^je NIRL
sa^ji NIRL
sa^je NIRL
sajiis NIR # !SUB
sa^jiin NIR
sa^jii^guin NIR
sa^jiid NIR
sa^jii^de NIR
sa^jit NIR
sa^jiid NIRL
sajin NIR #N+Sg+Loc
sa^jis NIR #N+Sg+Loc
sa^jiin NIR #N+Sg+Com
sa^ji NIR #N+Sg+Acc
sad^jái NIR #N+Sg+Ill
sad^ji NIR
sad^je NIRL #N+Sg+Nom
sa^je NIRL #N+Sg+Gen
sa^ji NIRL #N+Sg+Gen
sa^je NIRL #N+Sg+Gen
sajiis NIR #N+Pl+Loc
sa^jiin NIR #N+Pl+Loc
sa^jii^guin NIR #N+Pl+Com
sa^jiid NIR #N+Pl+Acc
sa^jii^de NIR #N+Pl+Ill
sa^jit NIR #N+Pl+Nom
sa^jiid NIRL #N+Pl+Gen
Using fsts:
/opt/smi/sme/bin/isme.fst
/opt/smi/sme/bin/hyph-sme.fst
It should have been using: ifst-norm: inverse-norm.fst.
The file is available to the server, cf. /opt/smi/sme/bin/:
-rwxrwxr-x 1 root cvs 2257 jun 21 10:13 allcaps.fst
-rwxrwxr-x 1 root cvs 92 jun 21 10:16 cap-sme
-rwxrwxr-x 1 root cvs 6995574 des 4 00:38 hyph-sme.fst
-rwxrwxr-x 1 root cvs 1206092 des 4 00:38 isme.fst
-rwxrwxr-x 1 root cvs 3106957 des 4 00:38 isme-norm.fst
-rwxrwxr-x 1 root cvs 674609 des 4 00:38 sme-dis.rle
-rwxrwxr-x 1 root cvs 1251450 des 4 00:38 sme.fst
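A quick way to spot a regression like the missing possessives is to parse the generator output shown above and summarise which tag strings occur. The following is only a minimal sketch: the function names are made up for illustration, the line layout (form, class, optional `#tags`) is read off the listing above, and the assumption that possessive forms carry a tag containing `Px` is ours, not something stated in this memo.

```python
from collections import Counter


def parse_paradigm_line(line):
    """Split 'form CLASS #tags' into (form, plx_class, tags).

    Returns None for lines with fewer than two fields before the '#'.
    The tags part is empty when the line has no '#' comment.
    """
    body, _, comment = line.partition("#")
    fields = body.split()
    if len(fields) < 2:
        return None
    return fields[0], fields[1], comment.strip()


def summarize(lines, possessive_marker="Px"):
    """Count forms per tag string and report whether any possessive tag occurs.

    possessive_marker is an assumption about how possessive forms are tagged.
    """
    records = [r for r in map(parse_paradigm_line, lines) if r is not None]
    by_tags = Counter(tags for _, _, tags in records if tags)
    return {
        "forms": len(records),
        "by_tags": by_tags,
        "has_possessives": any(possessive_marker in tags for _, _, tags in records),
    }


if __name__ == "__main__":
    sample = [
        "sajin NIR #N+Sg+Loc",
        "sa^jis NIR #N+Sg+Loc",
        "sad^ji NIR",
        "sad^je NIRL #N+Sg+Nom",
    ]
    report = summarize(sample)
    print(report["forms"], dict(report["by_tags"]), report["has_possessives"])
```

Run against the full generator output, a report with `has_possessives` false and a form count in the 16-17 range would match the symptom described above.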
TODO:
- remove clitics from the paradigm generator (Saara)
- investigate why possessives have disappeared from the paradigm generator
(Number, also a facultative (?) category, has not disappeared)
(Saara, Tomi)
- make sure the normative generator is used when generating paradigms (Tomi)
Hyphenator
TODO:
- investigate unrecognised word forms (Maaren, Thomas, Trond, Sjur)
- possibly the main cause was found in the meeting (non-normative forms
included in the generated output - these are not recognised by the
hyphenator, which is strictly normative)
7. Linguistics
Names and multilinguality
TODO:
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
well) (the morphological section should be kept intact, in e.g.
propernoun-sme-morph.txt) (Sjur, Saara)
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
(Sjur, Tomi, Saara)
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
(e.g. @type=secondary) (Thomas, Maaren, linguists)
- merge placenames which are erroneously in different entries: e.g. Helsinki,
Helsingfors, Helsset (linguists)
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
(linguists)
Derivation and spellers like Aspell
TODO:
- redo the work for smj, including discussion regarding Actio
(Thomas, Sjur, Trond)
North Sámi
The following words are included in the normative list despite being marked
with !SUB:
accompagnerejun -JUVVON
accompagnerejun accompagneret+V+TV+Der1+Der/j+Der/Pass+PrfPrc
ábuhuvvože -ETNE
ábuhuvvože ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne -ETNE
áccohallagođežedne áccohallat+V+TV+Der3+Der/goahti+Pot+Prs+Du1
In sme-lex.txt:
+Der2+Der/Pass:uvvo DOHPPEINCH ;
+Der/Pass+PrfPrc:un K ; !SUB
+Du1:e K ; !SUB
+Du1:edne K ; !SUB
+Du1:etne K ;
These are generated by make wordlist TARGET=sme, which uses nonrec-sme.fst
(print lower).
The last version of the wordlist does not include the erroneous words anymore.
They seem to have disappeared as part of other changes.
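To follow up on the lexicon side, one could list which continuation entries carry the !SUB comment and which would remain in a normative source. This is a minimal sketch only, assuming the normative source is obtained by dropping lines whose lexc comment starts with SUB; the memo does not say how the build actually does this, and the function names are hypothetical. Escaped exclamation marks (`%!`) in lexc entries are not handled.

```python
def is_sub_marked(line):
    """True if the text after the first '!' (the lexc comment) starts with 'SUB'."""
    _, bang, comment = line.partition("!")
    return bool(bang) and comment.strip().startswith("SUB")


def split_by_sub(lines):
    """Return (normative, sub_marked) lists, preserving the original order."""
    normative, sub_marked = [], []
    for line in lines:
        (sub_marked if is_sub_marked(line) else normative).append(line)
    return normative, sub_marked


if __name__ == "__main__":
    # Entries quoted from sme-lex.txt in this memo.
    entries = [
        "+Der2+Der/Pass:uvvo DOHPPEINCH ;",
        "+Der/Pass+PrfPrc:un K ; !SUB",
        "+Du1:e K ; !SUB",
        "+Du1:edne K ; !SUB",
        "+Du1:etne K ;",
    ]
    normative, sub_marked = split_by_sub(entries)
    print("normative:", normative)
    print("sub-marked:", sub_marked)
```

If a form built from one of the sub-marked entries still turns up in the generated wordlist, the filtering step (or the transducer used by make wordlist) is the place to look.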
TODO:
- investigate the generated word form list sent to Polderland - use the command
make wordlist TARGET=sme in victorio (Maaren, Thomas)
- check why some SUB-marked entries got included in the normative transducer
(Thomas, Sjur)
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
(Thomas, Trond)
- hire new linguist (Sjur)
8. Name lexicon infrastructure
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
to allow selection of exceptions (the found set minus deselected items)
- tag for excluding/including a name from certain applications
- future expansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in
[the meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].
TODO:
- develop the needed XQueries and UI (Sjur, Tomi)
Postponed:
- data synchronisation between risten.no and the cvs repo
- new version of xml2lexc (based on ccat), should handle complex names
correctly: construct entries like we have now from the different parts of a
complex name entry
9. Spellers
Polderland data generation
All other open POSes now included in the paradigm generator, below is a verb
example:
galggaže VIR #V+Pot+Prs+Du1
galggažedne VIR #V+Pot+Prs+Du1
galg^ga^žet^ne VIR #V+Pot+Prs+Du1
galg^ga^žeh^pet VIR #V+Pot+Prs+Pl2
galg^gaš VIR #V+Pot+Prs+ConNeg
galg^ga^žat VIR #V+Pot+Prs+Pl1
galg^ga^žit VIR #V+Pot+Prs+Pl1
galg^ga^žan VIR #V+Pot+Prs+Sg1
galg^ga^žea^ba VIR #V+Pot+Prs+Du3
galg^ga^žat VIR #V+Pot+Prs+Sg2
galg^ga^žeahp^pi VIR #V+Pot+Prs+Du2
galg^ga^žit VIR #V+Pot+Prs+Pl3
galg^ga^ža VIR #V+Pot+Prs+Sg3
galg^ga^leim^me VIR #V+Cond+Prs+Du1
galg^ga^šeim^me VIR #V+Cond+Prs+Du1
galg^ga^leid^det VIR #V+Cond+Prs+Pl2
galg^ga^šeid^det VIR #V+Cond+Prs+Pl2
galg^ga^le VIR #V+Cond+Prs+ConNeg
galg^ga^še VIR #V+Cond+Prs+ConNeg
galg^ga^leim^met VIR #V+Cond+Prs+Pl1
galg^ga^šeim^met VIR #V+Cond+Prs+Pl1
...
TODO:
- write paradigm grammar for the closed POSes (Thomas)
- check whether lookup can be used to generate paradigms for closed POSes
(Sjur)
- decide how to specify compounding behaviour info for the lexicon
(Thomas, Trond, Sjur)
- meeting Tuesday at 12.30
- done, needs to be followed up
- make inflection PLX lexicon for all open POSes (Tomi)
- add closed POS and clitics to PLX generation (Tomi)
- add derivations to the PLX generation (Tomi)
- add compound stems to the PLX generation (Tomi)
Aspell
TODO when the major part of the PLX conversion is done:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the
PLX data generation is finished)
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
When the PLX-based speller is ready: use the generated word list as test input:
all forms should be accepted (coverage self-testing). Then pick a random 1% of
the forms, change each one at random by edit distance 1, and run the result
through the speller (testing for false positives).
We need a meeting to plan testing. We’ll do it shortly this week, and perhaps
a longer meeting in Alta.
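The self-test idea above could be sketched roughly as follows. This is an illustration under stated assumptions, not the planned test script: `accepts` is a hypothetical stand-in for whatever interface the speller ends up exposing, the alphabet is an example only, and mutated forms that happen to be in the word list themselves are skipped because they are not errors.

```python
import random

# Illustrative alphabet only; the real test would use the full sme letter set.
ALPHABET = "abcdefghijklmnoprstuvzáčđŋšŧž"


def mutate_edit_distance_1(word, alphabet, rng):
    """Return a variant of word at edit distance exactly 1."""
    ops = ["insert"]
    if word:
        ops.append("substitute")
    if len(word) > 1:
        ops.append("delete")
    op = rng.choice(ops)
    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(alphabet) + word[i:]
    i = rng.randrange(len(word))
    if op == "delete":
        return word[:i] + word[i + 1:]
    # Substitute with a character different from the original one.
    replacement = rng.choice([c for c in alphabet if c != word[i]])
    return word[:i] + replacement + word[i + 1:]


def self_test(wordlist, accepts, alphabet=ALPHABET, fraction=0.01, seed=0):
    """Return (valid forms rejected, mutated forms accepted)."""
    rng = random.Random(seed)
    words = sorted(set(wordlist))
    known = set(words)
    rejected_valid = [w for w in words if not accepts(w)]
    sample_size = min(len(words), max(1, round(len(words) * fraction)))
    sample = rng.sample(words, sample_size)
    mutated = [mutate_edit_distance_1(w, alphabet, rng) for w in sample]
    mutated = [m for m in mutated if m not in known]
    accepted_mutations = [m for m in mutated if accepts(m)]
    return rejected_valid, accepted_mutations


if __name__ == "__main__":
    lexicon = {"sadji", "sajis", "sajit", "sajiid"}
    rejected, leaked = self_test(lexicon, lambda w: w in lexicon, fraction=1.0)
    print("rejected valid forms:", rejected)
    print("accepted mutated forms:", leaked)
```

Both returned lists should ideally be empty. An accepted mutated form is not automatically a speller bug, since it may be a real word missing from the generated list, so those cases need a manual look.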
TODO:
- meeting Wednesday at 9.30 (Sjur, Børre)
- consider a script (shell+AppleScript?) for automatic testing of MS Word
(Sjur, Børre)
- pseudocode written, some AppleScript
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
(Børre, Sjur)
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra on NoDaLi-sta (Sjur)
Bug fixing
56 open Divvun/Disamb bugs, and 23 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Task lists as iCal entries
TODO:
- update Maaren’s Forrest installation (Børre)
11. Next meeting, closing
The next meeting is 11.12.2006, 09:30 Norwegian time.
The meeting was closed at 11:22.
Appendix - task lists for the next week
Børre
- contact authors who have already received the corpus licensing contract
- continue work on script for automatic testing of the spell checker in Word
- sma discussions with SD (with Sjur, Trond)
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- update all forrest installations, including local patches
- fix bugs!
Maaren
- investigate the generated word form list sent to Polderland - use the command
make wordlist TARGET=sme in victorio
Saara
- finalize server of the Xerox tools.
- help Trond with some shell commands
- re-analyze parallel files
- consider implementing some new features to the corpus files
- add closed POSes to the paradigm gen, if needed.
- investigate why possessives have disappeared from the paradigm generator
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- decide how to specify compounding behaviour info in the lexicon
- sma discussions with SD (with Børre, Trond)
- get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- publish corpus contracts and project infra on NoDaLi-sta
- ask SD/Sig-Britt Persson about some of the South Sámi bible texts
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- decide how to specify compounding behaviour info in the lexicon
- fix bugs!
Tomi
- add closed POS and clitics to PLX generation
- add derivations to the PLX generation
- add compound stems to the PLX generation
- make sure the normative generator is used when generating paradigms
- investigate why possessives have disappeared from the paradigm generator
- fix bugs!
Trond
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- get more sma texts
- decide how to specify compounding behaviour info in the lexicon
- sma discussions with SD (with Børre, Sjur)
- fix bugs!