Meeting setup
- Date: 15.01.2007
- Time: 09.00 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 9:44.
Present: Børre, Maaren, Saara, Sjur, Steinar, Thomas, Tomi, Trond
Absent: none
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- contact authors who have already received the corpus licensing contract
- not done
- continue work on script for automatic testing of the spell checker in Word
- not done
- fix
smetexts in corpus this month- not done
- find missing
nobparallel texts in corpus- not done
- translate Windows installer to
sme- some done, helped Thomas
- work on the Polderland data generation (PLX format conversion)
- done, not finished.
- go through other directories, fix parallelity information for other documents
- not done
- fix bugs!
- not done
Maaren
- investigate the generated word form list sent to Polderland - use the command
make wordlist TARGET=smein victorio- not done
Saara
- fix
smetexts in corpus this month- in progress
- send aligned, xml
nobtexts to Lars - add correction markup to the xml files (string-to-correction markup)
- done, but see newsgroup message
- first new version of xml2lexc in Perl
- done
- fix bugs!
- fixed couple of bugs
Sjur
- name lexicon:
- rewrite the integration with forrest, to get a more flexible integration
with proper i18n, solving some problems with the previous solution, and
make a foundation for better search and editing interfaces.
- search interface finished, editor half-way; still needs some javascript and css tweaks to be really well-behaved, but can b
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- rewrite the integration with forrest, to get a more flexible integration
with proper i18n, solving some problems with the previous solution, and
make a foundation for better search and editing interfaces.
- hire linguist and programmer
- decide how to specify compounding behaviour info in the lexicon
- finally done!
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix stuorra-oslolaš lower case
o - fix bugs!
Steinar
- conversion error screening
- not done
- missing lists
- done some work
- report conversion errors to Saara
- not done
- Go through the Num bugs
- not done
- Look at the actio compound issue when adding from missing lists
- added words
- fix bugs!
- not done
- worked with cg-sets
- done some
Thomas
- refine
smjproper noun lexica, cf. the propernoun-smj-lex.txt- nothing this week
- decide how to specify compounding behaviour info in the lexicon
- decided
- translate Windows installer to
smeandsmj- ready soon
- Actio compounds: The disamb crew is satisfied. Now it is up to the divvun
folks to see whether it is too hard to lexicalise
- nothing this week
- Lack of lowering before hyphen: Twol rewrite.
- nothing this week
- include numbers in the non-recursive transducers
- not done
- Go through the Num bugs
- not done
- Write diphthong hyphenation pseudocode
- done
- fix stuorra-oslolaš lower case
o- not done
- fix bugs!
- worked
Tomi
- add closed POS and clitics to PLX generation
- done with help from Børre
- add derivations to the PLX generation
- not done
- add compound stems to the PLX generation
- not done
- fix bugs!
Trond
- update the
smjproper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt- No smj last week.
- decide how to specify compounding behaviour info in the lexicon **Decided
- Set up work on missing and conversion screening with Steinar and Ilona.
- Done.
- fix
smetexts in corpus this month- Continuously working on this one.
- find missing
nobparallel texts in corpus, go through Saara’s list - report conversion errors to Saara
- Saara has been leading this work…
- Write twol rules for
sme, smjon hyphen-triggered lowering with Thomas- Not done
- Go through the Num bugs
- Not done
- Make numeral testbed
- Not done.
- Rewrite hyphenation-code (pseudocode from Thomas)
sme, smj- Done
- Get input on
smahyphenations- Not done.
- fix stuorra-oslolaš lower case
o- This one I would like to pass over to Tomi.
- include numbers in the non-recursive transducers for
sme, smj- Started work on this one. Split the closed-smX-lex.txt file with Børre.
- fix bugs!.
3. Documentation
Nothing this week.
4. Corpus gathering
Trond finally got the sma texts from Snåsa, quite a lot of text, but not
all. Børre will add it to the corpus repository.
The relevant persons have worked on the tasks below.
TODO:
smetexts: no new additions, fix corpus errors during this month (Børre, Trond, Saara)- missing
nobparallel texts should be added if such wholes are found (Børre, Trond) - Go through the list of missing or errouneous nob texts, based upon Saaras perfect list (Børre, Trond)
- add
smatexts to the corpus repository (Børre)
5. Corpus infrastructure
Lars Nygård has left UiO. Anders Nøklestad is back in his old position.
For us, this means that Anders will be the person to contact for technical
matters, and Kristin Hagen the one for parsing of the nob parallel
texts.
Alignment
TODO:
- go through other directories, fix parallelity information for other documents
(Børre)
- Still to be done.
- re-analyze parallel files using the command-line version (Saara)
- done all existing files
- when aligned, send aligned, xml
nobtexts to Kristin (Saara)- not yet done
Conversion issues
TODO:
- add correction markup to the xml files (string-to-correction markup)
(Saara)
- see news discussion - we will and should allow text corrections concerning character encoding problems.
- report conversion errors to Saara (Trond, Steinar)
- Not done.
6. Infrastructure
Nothing this week.
7. Linguistics
North Sámi
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji (Thomas, Maaren, Steinar)
- Lack of lowering before hyphen: Twol rewrite. (Thomas, Trond)
- fix stuorra-oslolaš lower case
o(Sjur, Thomas, Trond)
Numbers:
One problem we have is to correctly identify base forms of numerals, cf: (the baseform of 16 is given as 6)
guhttanuppelohkái
guhttanuppelohkái guhtta+Num+Sg+Nom
guhttanuppelohkái guhtta+Num+Sg+Acc
TODO:
- discontinous case inflection (but only for maximally three-part compound
numerals) (
viđain/goalmmát/logiinandguvttiin/logiin/viđain) (Thomas, Trond) - produce correct base forms in the analyzer (Thomas, Trond)
- include numbers in the non-recursive transducers (i.e. split the recursive and the non-recursive part of the numerals) (Trond, Thomas)
- Set up test bed for numerals, test and revise (who?)
- Make a test bed
make num-paradigm(Trond) - Go through the Num bugs (Trond, Thomas, Steinar)
- Preprocessing of ordinals at the end of sentences - reported as bug #368. (Trond)
Hyphenation problem
TODO:
- write diphthong hyphenation pseudocode (Thomas)
- done for both
smeandsmj
- done for both
- rewrite hyphenation code (Trond)
- done for both
smeandsmj
- done for both
- ask Ove Lorentz to report on our
smahyphenator (Trond)- Not done.
Lule Sámi
It could actually be that the smj numerals are not recursive. They were made
differently from the sme ones, since Spiik reported them as written sepa-
rately.
TODO:
- refine
smjproper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond) - Lack of lowering/fronting before hyphen: Twol rewrite. (Thomas, Trond)
- include numbers in the non-recursive transducers
- Set up a test bed for numerals, test and revise (who?)
8. Name lexicon infrastructure
Decisions made in Tromsø can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]
Postponed:
- data synchronisation between risten.no and the cvs repo
TODO:
- try to make a first version of xml2lexc in Perl for testing and preparation
for the big jump (Saara)
- done
- restructure interface code for easier maintenance, coding and use
- well under way, still some work
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
- convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
- merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils (linguists)
9. Spellers
Polderland data generation
There is now a decision on compound parts, and compounding can now be included in the PLX generation. Compounding is a sine qua non (a must) for the beta version. The specification is found in this document.
We have a UTF-8 problem with the paradigm server in some cases, some characters are returned as Latin1. When the server runs on G5, everything works fine. But when it is run on victorio, some conversion errors turn up. The problem may be Java-related, according to some net sources, and also with the perl settings in victorio, related to the change in perl setup.
Suggestion: Just use the G5, and not victorio, since there is no time to fix the setup in victorio (the real error).
TODO:
- decide how to specify compounding behaviour info for the lexicon
(Thomas, Trond, Sjur)
- Done!
- add closed POS and clitics to PLX generation (Børre, Tomi)
- Progressing.
- add compound stems to the PLX generation (Børre, Tomi)
- add derivations to the PLX generation (Børre, Tomi)
- Include numerals in the speller (Børre, Tomi)
Aspell
TODO when the major part of the PLX conversion is done:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
TODO:
- get an Intel Mac for testing Windows spellers (Børre, Sjur)
- nothing yet
Localisation
TODO:
- translate Windows installer text to
smeandsmj(Børre, Thomas)- progressing (smj is mostly done, lots lacking in sme)
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra on NoDaLi-sta (Sjur)
- not done
Bug fixing
56 open Divvun/Disamb bugs, and 23 risten.no bugs
11. Next meeting, closing
The next meeting is 22.1.2007, 09:30 Norwegian time.
The meeting was closed at 10:44.
Appendix - task lists for the next week
Boerre
- continue work on script for automatic testing of the spell checker in Word
- fix
smetexts in corpus this month - find missing
nobparallel texts in corpus - translate Windows installer text to
smeandsmj - work on the Polderland data generation (PLX format conversion)
- Concentrate on compounding
- go through other directories, fix parallelity information for other documents
- add
smatexts to the corpus repository - fix bugs!
Maaren
- tasks according to Thomas
Saara
- fix
smetexts in corpus this month - send aligned, xml
nobtexts to Kristen - fix problems with xml2lexc if needed
- fix bugs!
Sjur
- name lexicon:
- restructure interface code for easier maintenance, coding and use
- refactor the rest of the SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix stuorra-oslolaš lower case
o - fix bugs!
Steinar
- Complete the semantic sets in sme-dis.rle
- missing lists
- report conversion errors to Saara
- Look at the actio compound issue when adding from missing lists
- Go through the Num bugs
- fix bugs!
Thomas
- refine
smjproper noun lexica, cf. the propernoun-smj-lex.txt - work with compounding
- translate Windows installer to
smeandsmj - lexicalise actio compounds
- Lack of lowering before hyphen: Twol rewrite.
- Go through the Num bugs
- fix stuorra-oslolaš lower case
o - include basic numbers in the non-recursive transducers
- implement discontinous case inflection for numbers
- produce correct base forms in the analyzer
- fix bugs!
Tomi
- add compound stems to the PLX generation
- add closed POS and clitics to PLX generation
- add derivations to the PLX generation
- fix bugs!
Trond
- update the
smjproper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt (not this week) - fix
smetexts in corpus this month - find missing
nobparallel texts in corpus, go through Saara’s list - report conversion errors to Saara
- Write twol rules for
sme, smjon hyphen-triggered lowering with Thomas - Go through the Num bugs
- Make numeral testbed
- Get input on
smahyphenations - include numbers in the non-recursive transducers for
sme, smj - implement discontinous case inflection for numbers
- produce correct base forms in the analyzer
- fix bugs!.