Meeting setup
- Date: 08.01.2007
- Time: 09.00 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 9:24.
Present: Børre, Saara, Sjur, Steinar, Thomas, Trond
Absent: Maaren, Tomi
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- contact authors who have already received the corpus licensing contract
- continue work on script for automatic testing of the spell checker in Word
- recreate our forrest tarball
- update setup and installation instructions for new users/computers
- create new forrest tarball
- cvs up of the public server should be done for
xtdoc/sd/documentation/
- fix
sme
texts in corpus this month
- find missing
nob
parallel texts in corpus
- aligner status quo
- translate Windows installer to
sme
and smj
- fix bugs!
- other
- had to setup (www.)divvun.no to serve static content directly via apache.
jetty runs out of memory if it has to serve huge files. Only for big/huge
files, the rest of the site is served by Jetty as before.
Maaren
- investigate the generated word form list sent to Polderland - use the command
make wordlist TARGET=sme
in victorio
Saara
- help Trond with some shell commands
- this is related to bug 355.
- fix
sme
texts in corpus this month
- aligner status quo
- send aligned, xml
nob
texts to Lars
- add correction markup to the xml files (string-to-correction markup)
- first new version of xml2lexc in Perl
- started, test version will be available today.
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- continued refactoring and redesign of the risten.no code: risten.no now has
full i18n support, is using the latest Forrest to overcome several
shortcomings of the previous version, and is soon ready for a more flexible
editor
- hire linguist and programmer
- decide how to specify compounding behaviour info in the lexicon
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix forrest installations for Maaren, Disamb
- send Windows installer translation text to Børre and Thomas
- fix bugs!
Steinar
- conversion error screening
- missing lists
- report conversion errors to Saara
- fix bugs!
Thomas
- refine
smj
proper noun lexica, cf. the propernoun-smj-lex.txt
- decide how to specify compounding behaviour info in the lexicon
- translate Windows installer to
sme
and smj
- fix bugs!
Tomi
Trond
- update the
smj
proper noun lexicon, and refine the morphological analysis,
cf. the propernoun-smj-lex.txt
- get more
sma
texts
- aligner status quo
- decide how to specify compounding behaviour info in the lexicon
- Set up work on missing and conversion screening with Steinar.
- fix
sme
texts in corpus this month
- find missing
nob
parallel texts in corpus
- report conversion errors to Saara
- Done. Progress on conversion last week.
- fix bugs!.
- Fixed some, actually, and found new ones
3. Documentation
Børre has done all(!) open issues - thanks!
TODO:
- either fix forrest installations (Sjur), or create a new tarball
(Børre)
- cvs up of the public server should be done for
xtdoc/sd/documentation/
- notice the directory! (Børre)
- update setup and installation instructions (Børre)
4. Corpus gathering
TODO:
sme
texts: no new additions, fix corpus errors during this month
(Børre, Trond, Saara)
- missing
nob
parallel texts should be added if such wholes are found
(Børre, Trond)
- Go through the list of missing or errouneous nob texts, based upon Saaras
perfect list (Børre, Trond)
5. Corpus infrastructure
Aligner
TODO:
- make an overview of status quo (Trond, Børre, Saara).
- go through other directories, fix parallelity information for other documents
(Børre)
- the
sme/admin/
files are now double checked.
- gather more parallel texts (Trond, Børre)
- not any more? (depends on what we are missing and how hard they are to get,
but we will not put much time into it)
- re-analyze parallel files using the command-line version (Saara)
- when aligned, send aligned, xml
nob
texts to Lars (Saara)
Conversion issues
Initial nbsp? Cf these:
ccat -l sme -r zcorp/bound/sme/admin/depts/ |
preprocess –abbr=bin/abbr.txt –corr=src/typos.txt |
lookup -flags mbTT -utf8 -f bin/missing |
grep ‘\?’ |
cut -f1 |
sort |
uniq -c |
sort -nr |
l |
19 –Danne (EM SPACE, U+2003)
84 –Lea
62 Bueng
Use UnicodeChecker to check spurious characters that look inocent (nbsp, em
space, etc).
admin/depts
still have many errors stemming from the problematic PDF
conversion:
TODO:
- add correction markup to the xml files (string-to-correction markup)
(Saara)
- report conversion errors to Saara (Trond, Steinar)
6. Infrastructure
Nothing
7. Linguistics
North Sámi
Compounds with Actio as the first part. Earlier, we accepted such compounds, but
they gave rise to so much disamb errors that we excluded them, and opted for
lexicalisation. Now they turn up, but the solution is probably to lexicalise
them.
bajásšaddaneavttuid
olgunastinlága
ávkkástallanvugiide
Another problem:
- we get
heajus-Oslo
instead of heajos-Oslo
- we get
lojis-Oslo
instead of lojes-Oslo
- we get
álkis-Oslo
instead of álkes-Oslo
The fix here is the same as for smj
below: Hyphen must be included in the
right context triggering lowering.
Third issue: stuorra-oslolaš does not work (that is, initial lower case oslolaš
after compound with hyphen - there is always hyphen in front of name-based
words).
TODO:
- Actio compounds: The disamb crew is satisfied. Now it is up to the divvun
folks to see whether it is too hard to lexicalise
(Thomas, Maaren, Steinar)
- Lack of lowering before hyphen: Twol rewrite. (Thomas, Trond)
- fix stuorra-oslolaš lower case
o
(Sjur, Thomas, Trond)
Numbers:
Needs to be included in the non-rec transducers, and their inflection needs to
be tested. Test specification:
- inflection (case, number, ordinality)
- generation of higher numerals (not inflected)
TODO:
- include numbers in the non-recursive transducers (i.e. split the recursive and
the non-recursive part of the numerals) (Trond, Thomas)
- Set up test bed for numerals, test and revise (who?)
- Make a test bed
make num-paradigm
(Trond)
- Go through the Num bugs (Trond, Thomas, Steinar)
Hyphenation problem
Vowel combinations that are not defined as diphthongs behave like diphthongs
anyway, ie. they don’t get hyphenated. In North sámi we have this problem when
there is a
or u
in second position:
laam
laam laam (expected: la-am)
liam
liam liam
luam
luam luam
lyam
lyam lyam
lium
lium lium
luum
luum luum
lyum
lyum lyum
For Lule sámi ALL Vowel combos behave like diphtongs.
South Sámi needs supervising, too, there are reports indicating it has onset
maximising rather than coda maximising.
TODO:
- write diphthong hyphenation pseudocode (Thomas)
- rewrite hyphenation code (Trond)
- ask Ove Lorentz to report on our sma hyphenator (Trond)
Lule Sámi
We had a twol problem – now fixed by Thomas. We still have a problem with
actor nouns of type juhttsat:juhttse
: when hyphened we get
juhttsa-muorra
instead of juhttse-muorra
. There is a similar problem in
sme
with adjectives (see above).
The fix is to include the hyphen in the right context triggering vowel fronting.
TODO:
- refine
smj
proper noun lexica, cf. the propernoun-smj-lex.txt
(Thomas, Trond)
- Lack of lowering/fronting before hyphen: Twol rewrite. (Thomas, Trond)
- include numbers in the non-recursive transducers
- Set up a test bed for numerals, test and revise (who?)
8. Name lexicon infrastructure
Sjur continued refactoring and redesign of the risten.no code: risten.no now
has full i18n support, is using the latest Forrest to overcome several
shortcomings of the previous version, and is soon ready for a more flexible
editor.
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
to allow selection of exceptions (the found set minus deselected items)
- tag for excluding/including a name from certain applications
- future epxansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in [the meeting
memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]
Postponed:
- data synchronisation between risten.no and the cvs repo
- new version of xml2lexc (based on ccat), should handle complex names correct:
construct entries like we have now from the different parts of a complex name
entry
TODO:
- try to make a first version of xml2lexc in Perl for testing and preparation
for the big jump (Saara)
- soon ready:-)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
well) (the morphological section should be kept intact, in e.g.
propernoun-sme-morph.txt) (Sjur, Saara)
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
(Sjur, Tomi, Saara)
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
(e.g. @type=secondary) (Thomas, Maaren, linguists)
- merge placenames which are errouneously in different entries: e.g. Helsinki,
Helsingfors, Helsset (linguists)
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
(linguists)
9. Spellers
Polderland data generation
TODO:
- decide how to specify compounding behaviour info for the lexicon
(Thomas, Trond, Sjur)
- add closed POS and clitics to PLX generation (Børre, Tomi)
- add compound stems to the PLX generation (Børre, Tomi)
- add derivations to the PLX generation (Børre, Tomi)
- Include numerals in the speller (Børre, Tomi)
Aspell
TODO when the major part of the PLX conversion is done:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the
PLX data generation is finished)
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
TODO:
- get an Intel Mac for testing Windows spellers (Børre, Sjur)
Localisation
TODO:
- send translation text to Børre and Thomas (Sjur)
- translate Windows installer text to
sme
and smj
(Børre, Thomas)
- Børre has started, Thomas (and perhaps Steinar and Maaren)
will continue
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra on NoDaLi-sta (Sjur)
Bug fixing
57 open Divvun/Disamb bugs, and 23 risten.no bugs
11. Next meeting, closing
The next meeting is 15.1.2007, 09:00 Norwegian time.
The meeting was closed at 10:50.
Appendix - task lists for the next week
Boerre
- contact authors who have already received the corpus licensing contract
- continue work on script for automatic testing of the spell checker in Word
- fix
sme
texts in corpus this month
- find missing
nob
parallel texts in corpus
- translate Windows installer to
sme
- fix bugs!
- work on the Polderland data generation (PLX format conversion)
- go through other directories, fix parallelity information for other documents
Maaren
- investigate the generated word form list sent to Polderland - use the command
make wordlist TARGET=sme
in victorio
Saara
- fix
sme
texts in corpus this month
- send aligned, xml
nob
texts to Lars
- add correction markup to the xml files (string-to-correction markup)
- first new version of xml2lexc in Perl
- fix bugs!
Sjur
- name lexicon:
- rewrite the integration with forrest, to get a more flexible integration
with proper i18n, solving some problems with the previous solution, and
make a foundation for better search and editing interfaces.
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- decide how to specify compounding behaviour info in the lexicon
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix stuorra-oslolaš lower case
o
- fix bugs!
Steinar
- conversion error screening
- missing lists
- report conversion errors to Saara
- Go through the Num bugs
- Look at the actio compound issue when adding from missing lists
- fix bugs!
Thomas
- refine
smj
proper noun lexica, cf. the propernoun-smj-lex.txt
- decide how to specify compounding behaviour info in the lexicon
- translate Windows installer to
sme
and smj
- Actio compounds: The disamb crew is satisfied. Now it is up to the divvun
folks to see whether it is too hard to lexicalise
- Lack of lowering before hyphen: Twol rewrite.
- include numbers in the non-recursive transducers
- Go through the Num bugs
- Write diphthong hyphenation pseudocode
- fix stuorra-oslolaš lower case
o
- fix bugs!
Tomi
- add closed POS and clitics to PLX generation
- add derivations to the PLX generation
- add compound stems to the PLX generation
- fix bugs!
Trond
- update the
smj
proper noun lexicon, and refine the morphological analysis,
cf. the propernoun-smj-lex.txt
- decide how to specify compounding behaviour info in the lexicon
- Set up work on missing and conversion screening with Steinar and Ilona.
- fix
sme
texts in corpus this month
- find missing
nob
parallel texts in corpus, go through Saara’s list
- report conversion errors to Saara
- Write twol rules for
sme, smj
on hyphen-triggered lowering with Thomas
- Go through the Num bugs
- Make numeral testbed
- Rewrite hyphenation-code (pseudocode from Thomas)
sme, smj
- Get input on
sma
hyphenations
- fix stuorra-oslolaš lower case
o
- include numbers in the non-recursive transducers for
sme, smj
- fix bugs!.