Meeting setup
- Date: 15.01.2007
- Time: 09.00 Norwegian time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 9:44.
Present: Børre, Maaren, Saara, Sjur, Steinar, Thomas, Tomi, Trond
Absent: none
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- contact authors who have already received the corpus licensing contract
- continue work on script for automatic testing of the spell checker in Word
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus
- translate Windows installer to sme
- work on the Polderland data generation (PLX format conversion)
- go through other directories, fix parallelity information for other documents
- fix bugs!
Maaren
- investigate the generated word form list sent to Polderland, using the command
make wordlist TARGET=sme on victorio
Saara
- fix sme texts in corpus this month
- send aligned, xml nob texts to Lars
- add correction markup to the xml files (string-to-correction markup)
- done, but see newsgroup message
- first new version of xml2lexc in Perl
- fix bugs!
Sjur
- name lexicon:
- rewrite the integration with forrest, to get a more flexible integration
with proper i18n, solving some problems with the previous solution, and
make a foundation for better search and editing interfaces.
- search interface finished, editor half-way; still needs some javascript and
css tweaks to be really well-behaved, but can b
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- decide how to specify compounding behaviour info in the lexicon
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix stuorra-oslolaš lower case o
- fix bugs!
Steinar
- conversion error screening
- missing lists
- report conversion errors to Saara
- Go through the Num bugs
- Look at the actio compound issue when adding from missing lists
- fix bugs!
- worked with cg-sets
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- decide how to specify compounding behaviour info in the lexicon
- translate Windows installer to sme and smj
- Actio compounds: The disamb crew is satisfied. Now it is up to the divvun
folks to see whether it is too hard to lexicalise
- Lack of lowering before hyphen: Twol rewrite.
- include numbers in the non-recursive transducers
- Go through the Num bugs
- Write diphthong hyphenation pseudocode
- fix stuorra-oslolaš lower case o
- fix bugs!
Tomi
- add closed POS and clitics to PLX generation
- done with help from Børre
- add derivations to the PLX generation
- add compound stems to the PLX generation
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological analysis,
cf. the propernoun-smj-lex.txt
- decide how to specify compounding behaviour info in the lexicon
- Decided
- Set up work on missing and conversion screening with Steinar and Ilona.
- fix sme texts in corpus this month
- Continuously working on this one.
- find missing nob parallel texts in corpus, go through Saara’s list
- report conversion errors to Saara
- Saara has been leading this work…
- Write twol rules for sme, smj on hyphen-triggered lowering with Thomas
- Go through the Num bugs
- Make numeral testbed
- Rewrite hyphenation code (pseudocode from Thomas) for sme, smj
- Get input on sma hyphenations
- fix stuorra-oslolaš lower case o
- This one I would like to pass over to Tomi.
- include numbers in the non-recursive transducers for sme, smj
- Started work on this one. Split the closed-smX-lex.txt file with Børre.
- fix bugs!
3. Documentation
Nothing this week.
4. Corpus gathering
Trond finally got the sma texts from Snåsa, quite a lot of text, but not all.
Børre will add it to the corpus repository.
The relevant persons have worked on the tasks below.
TODO:
- sme texts: no new additions; fix corpus errors during this month
(Børre, Trond, Saara)
- missing nob parallel texts should be added if such holes are found
(Børre, Trond)
- Go through the list of missing or erroneous nob texts, based upon Saara’s
perfect list (Børre, Trond)
- add sma texts to the corpus repository (Børre)
5. Corpus infrastructure
Lars Nygård has left UiO. Anders Nøklestad is back in his old position.
For us, this means that Anders will be the person to contact for technical
matters, and Kristin Hagen the one for parsing of the nob parallel texts.
Alignment
TODO:
- go through other directories, fix parallelity information for other documents
(Børre)
- re-analyze parallel files using the command-line version (Saara)
- when aligned, send aligned, xml nob texts to Kristin (Saara)
Conversion issues
TODO:
- add correction markup to the xml files (string-to-correction markup)
(Saara)
- see news discussion - we will and should allow text corrections concerning
character encoding problems.
- report conversion errors to Saara (Trond, Steinar)
6. Infrastructure
Nothing this week.
7. Linguistics
North Sámi
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
(Thomas, Maaren, Steinar)
- Lack of lowering before hyphen: Twol rewrite. (Thomas, Trond)
- fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
Numbers:
One problem is identifying the correct base forms of numerals. In the example
below, the base form of 16 is given as 6:
guhttanuppelohkái
guhttanuppelohkái guhtta+Num+Sg+Nom
guhttanuppelohkái guhtta+Num+Sg+Acc
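A check for this kind of base-form error could be automated along the following lines. This is only a minimal sketch in Python, assuming analyses arrive as lines of the form "wordform lemma+Tag+Tag" as in the example above. The function names and the table of expected base forms are hypothetical and not part of the existing test setup.

```python
def parse_analysis(line):
    """Split 'wordform lemma+Tag+Tag' into (wordform, lemma, tags)."""
    wordform, reading = line.split()
    lemma, *tags = reading.split("+")
    return wordform, lemma, tags


def wrong_base_forms(lines, expected):
    """Return (wordform, lemma, expected lemma) for numeral readings whose lemma differs."""
    errors = []
    for line in lines:
        wordform, lemma, tags = parse_analysis(line)
        if "Num" in tags and wordform in expected and lemma != expected[wordform]:
            errors.append((wordform, lemma, expected[wordform]))
    return errors


analyses = [
    "guhttanuppelohkái guhtta+Num+Sg+Nom",
    "guhttanuppelohkái guhtta+Num+Sg+Acc",
]
# Assumption: the wanted base form of 16 is the full numeral itself.
expected = {"guhttanuppelohkái": "guhttanuppelohkái"}

for wordform, got, wanted in wrong_base_forms(analyses, expected):
    print(f"{wordform}: base form {got}, expected {wanted}")
```

A table of expected base forms like this could be one starting point for a numeral test bed, with each flagged reading pointing at a numeral whose lemma needs revising.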
TODO:
- discontinuous case inflection (but only for maximally three-part compound
numerals) (viđain/goalmmát/logiin and guvttiin/logiin/viđain)
(Thomas, Trond)
- produce correct base forms in the analyzer (Thomas, Trond)
- include numbers in the non-recursive transducers (i.e. split the recursive and
the non-recursive part of the numerals) (Trond, Thomas)
- Set up test bed for numerals, test and revise (who?)
- Make a test bed: make num-paradigm (Trond)
- Go through the Num bugs (Trond, Thomas, Steinar)
- Preprocessing of ordinals at the end of sentences - reported as bug #368.
(Trond)
Hyphenation problem
TODO:
- write diphthong hyphenation pseudocode (Thomas)
- done for both sme and smj
- rewrite hyphenation code (Trond)
- done for both sme and smj
- ask Ove Lorentz to report on our sma hyphenator (Trond)
Lule Sámi
It could actually be that the smj numerals are not recursive. They were made
differently from the sme ones, since Spiik reported them as written separately.
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
(Thomas, Trond)
- Lack of lowering/fronting before hyphen: Twol rewrite. (Thomas, Trond)
- include numbers in the non-recursive transducers
- Set up a test bed for numerals, test and revise (who?)
8. Name lexicon infrastructure
Decisions made in Tromsø can be found in
[the meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].
Postponed:
- data synchronisation between risten.no and the cvs repo
TODO:
- try to make a first version of xml2lexc in Perl for testing and preparation
for the big jump (Saara)
- done
- restructure interface code for easier maintenance, coding and use
- well under way, still some work
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
well) (the morphological section should be kept intact, in e.g.
propernoun-sme-morph.txt) (Sjur, Saara)
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
(Sjur, Tomi, Saara)
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
(e.g. @type=secondary) (Thomas, Maaren, linguists)
- merge placenames which are erroneously in different entries: e.g. Helsinki,
Helsingfors, Helsset (linguists)
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
(linguists)
9. Spellers
Polderland data generation
There is now a decision on compound parts, and compounding can now be
included in the PLX generation. Compounding is a sine qua non (a must) for the
beta version. The specification is found in
this document.
In some cases the paradigm server has a UTF-8 problem: some characters are
returned as Latin1. When the server runs on the G5, everything works fine, but
when it runs on victorio, some conversion errors turn up. According to some net
sources the problem may be Java-related; it may also be related to the Perl
settings on victorio, following the change in the Perl setup there.
Suggestion: Just use the G5, and not victorio, since there is no time to fix the
setup on victorio (the real error).
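Until the setup is fixed, responses with the wrong encoding could at least be flagged on the client side. The following is a minimal sketch in Python, assuming the response is available as raw bytes; the function name decode_response is hypothetical and is not part of the paradigm server or its clients.

```python
def decode_response(raw):
    """Decode bytes as UTF-8; fall back to Latin-1 if they are not valid UTF-8."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this decode cannot fail.
        return raw.decode("latin-1"), "latin-1"


word = "guhttanuppelohkái"
for raw in (word.encode("utf-8"), word.encode("latin-1")):
    text, encoding = decode_response(raw)
    print(encoding, text)
```

Sámi letters such as š and đ do not exist in Latin-1, so a fallback like this is mainly useful for detecting and logging affected responses, not as a repair.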
TODO:
- decide how to specify compounding behaviour info for the lexicon
(Thomas, Trond, Sjur)
- Done!
- add closed POS and clitics to PLX generation (Børre, Tomi)
- Progressing.
- add compound stems to the PLX generation (Børre, Tomi)
- add derivations to the PLX generation (Børre, Tomi)
- Include numerals in the speller (Børre, Tomi)
Aspell
TODO when the major part of the PLX conversion is done:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the
PLX data generation is finished)
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
TODO:
- get an Intel Mac for testing Windows spellers (Børre, Sjur)
Localisation
TODO:
- translate Windows installer text to sme and smj (Børre, Thomas)
- progressing (smj is mostly done, lots lacking in sme)
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra on NoDaLi-sta (Sjur)
Bug fixing
56 open Divvun/Disamb bugs, and 23 risten.no bugs
11. Next meeting, closing
The next meeting is 22.1.2007, 09:30 Norwegian time.
The meeting was closed at 10:44.
Appendix - task lists for the next week
Børre
- continue work on script for automatic testing of the spell checker in Word
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus
- translate Windows installer text to sme and smj
- work on the Polderland data generation (PLX format conversion)
- Concentrate on compounding
- go through other directories, fix parallelity information for other documents
- add sma texts to the corpus repository
- fix bugs!
Maaren
- tasks according to Thomas
Saara
- fix sme texts in corpus this month
- send aligned, xml nob texts to Kristin
- fix problems with xml2lexc if needed
- fix bugs!
Sjur
- name lexicon:
- restructure interface code for easier maintenance, coding and use
- refactor the rest of the SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix stuorra-oslolaš lower case o
- fix bugs!
Steinar
- Complete the semantic sets in sme-dis.rle
- missing lists
- report conversion errors to Saara
- Look at the actio compound issue when adding from missing lists
- Go through the Num bugs
- fix bugs!
Thomas
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- work with compounding
- translate Windows installer to sme and smj
- lexicalise actio compounds
- Lack of lowering before hyphen: Twol rewrite.
- Go through the Num bugs
- fix stuorra-oslolaš lower case o
- include basic numbers in the non-recursive transducers
- implement discontinuous case inflection for numbers
- produce correct base forms in the analyzer
- fix bugs!
Tomi
- add compound stems to the PLX generation
- add closed POS and clitics to PLX generation
- add derivations to the PLX generation
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological analysis,
cf. the propernoun-smj-lex.txt (not this week)
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus, go through Saara’s list
- report conversion errors to Saara
- Write twol rules for sme, smj on hyphen-triggered lowering with Thomas
- Go through the Num bugs
- Make numeral testbed
- Get input on sma hyphenations
- include numbers in the non-recursive transducers for sme, smj
- implement discontinuous case inflection for numbers
- produce correct base forms in the analyzer
- fix bugs!