Meeting setup
- Date: 26.11.2007
- Time: 09.30 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat/Skype
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:15.
Present: Børre, Ilona, Per-Eric, Sjur, Thomas, Tomi
Absent: Risten, Trond
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- move Steinar’s error markup in the xml files to (a copy of) the original
- fix bug 550
- fix Windows CD installation bug
- discuss more parallel texts
- finalise InDesign hyphenator
- update usage and installation documentation
- fix bugs!
- Other:
- Continued work on hunspell
- Updated OS X on xserve, added RAM.
- Made new logos
Ilona
- other
smj
tasks, ask Thomas
- Buy the DVD for Leopard
- Bought, and downloaded Leopard disk image to the computer. Have to burn it to
the DVD and then install it on the computer. Will try to do it today.
- other tasks:
- Done some testing in
sme
speller and reported it to Thomas
Maaren
- lexicalise actio compounds
Per-Eric
- lexicalise words from the Olavi missing list
- derivations tests
- fix bugs!
- Not done anything this week
Risten
- set up CD-printing printer
- get price and schedule for printed CD cover
Saara
- add new XSL/XML headers for proofing test docs
- Set up ways of adding meta-information for proofing correct corpus docs
(source info, used in testing or not, added to lexicon or not)
- add nested error markup to xml conversion
- discuss more parallel texts
Sjur
- work on the XML name editor/risten.no integration
- set up risten.no on the G5 again
- made it, although Trond reports problems
- test new and nested error markup
- improve hyphenation testing
- fix bug 550
- fix Windows CD installation bug
- tried, but it’s not working. Work-around should be documented
- finalise InDesign hyphenator
- update usage and installation documentation
- fix bugs!
- other:
- several improvements to the CD creation process
- worked on the CD cover (proofreading many times)
- faroese speller telephone meeting
Thomas
sme->smj
lexicon conversion to build bilingual lexicon resources
- check for bad hyphenation
- look at test cases still not behaving properly
- paradigm testing
- test the proofing tools with all MS Office applications
- finalise InDesign hyphenator
- update usage and installation documentation
- fix bugs!
Tomi
Trond
sme->smj
lexicon conversion to build bilingual lexicon resources
- Worked on this, refined, work is on schedule
- telephone meeting with Sjur and the faroese group re faroese speller
- discuss more parallel texts
- fix bugs!.
3. Documentation
Nothing new.
TODO:
4. Corpus infrastructure
Nothing.
5. Infrastructure
TODO:
- add Jabber account in iChat (all)
6. Linguistics
North Sámi
Hyphenation is better, but still contains a lot of errors. Sjur will run the
latest hyphenator on our test material, and discuss the test results with the
rest.
TODO:
- test latest hyphenator (Sjur)
- analyse test results (Thomas, Sjur, Trond)
Lule Sámi
Trond and his team have found words to be added to the smj lexicon.
cat smesmj.txt | grep -v 'prop$' | cut -f2 | lookup -flags mbTT -utf8 ~/gt/smj/bin/smj.fst | grep '\?' | l
6581 words in the smesmj.txt lexicon. Disregarding the proper nouns, 1824 are
not recognised by smj-norm.fst or by smj.fst. Many of these are loan words
or they are derivations. Some examples:
čála tjála n
giehtačála giehtatjála n
vuolláičála vuollájtjála n <= :-)
vinjučála vinjotjála n
johtučála jåhtotjála n
čuokkisčála tjuokkestjála n
bajildusčála bajeldustjála n
mála mála n
tjála čála n
giehtatjála giehtačála n <=== :-((
vuollájtjála vuolláičála n
tjuokkestjála čuokkisčála n
vinjotjála vinjučála n
jåhtotjála johtučála n
leapma liebma n
ja/dahje ja/dahje +?
gobba gobba +?
gaiba gaiba +?
struhcca struhcca +?
fáhcca fáhcca +?
suorbmafáhcca suorbmafáhcca +?
vahca vahca +?
ohca ohca +?
juhca juhca +?
It seems to be a mixup of smj and sme in the material. That has to be cleaned
up.
We have to test hyphenation for lulesami as well.
TODO:
- lexicalise words from the Olavi missing list, but check against the pdf
original where in doubt (Per-Eric)
sme->smj
lexicon conversion to build bilingual lexicon resources, and
increase smj
coverage (Trond, Thomas, Svenne). Add the words.
- test hyphenation (Sjur, Thomas)
7. Name lexicon infrastructure
Sjur got risten.no up and running on the G5. Worked only for him, though.
Decisions made in Tromsø can be found in [this meeting
memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]
TODO:
- set up Tomcat and risten.no on the G5 again (Sjur, Børre)
- install risten.no
- did it
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
well) (the morphological section should be kept intact, in e.g.
propernoun-sme-morph.txt) (Sjur, Saara)
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
(Sjur, Tomi, Saara)
- implement data synchronisation between risten.no and
the cvs repo, and possibly other servers (ie the G5 as an alternative server
to the public risten.no - it might be faster and better suited than the
official one; also local installations could be treated the same way)
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
(e.g. @type=secondary) (Thomas, Maaren, linguists)
- merge placenames which are errouneously in different entries: e.g. Helsinki,
Helsingfors, Helsset (linguists)
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
(linguists)
Hunspell
Continuously improving.
TODO:
- Hunspell lexicon conversion (Tomi, Børre)
Testing
Spelling Error Markup
This will wait till after the release.
TODO:
- Set up ways of adding meta-information (source info, used in testing or not,
added to lexicon or not) (Saara)
- move Steinar’s error markup in the xml files to (a copy of) the original
(Børre, Kimme)
- add nested error markup to xml conversion (Saara)
- test new and nested error markup (Sjur)
Automated testing
TODO:
- improve hyphenation testing (Sjur)
MS Office
An important aspect of this testing is to document in the user guide anything
that could be a problem for users.
TODO:
- test the proofing tools with all MS Office applications (Børre, Thomas)
- Thomas has tested all Windows apps - they all work fine with our tools
Open issues based on test results:
smj
482 - still problematic (prefix), 484 - double hyphens suggested, 575 -
name+name = double hyphens in sugg, Svierigadárogielan - still rejected (prefix)
sme
397 - double hyphens (name+name), 419 - fixed, 425 - roman
number, 431 - does not accept the correct string, but DO suggest the same; also
hyphen final forms are accepted, but not the same form when part of a compound,
452 - fixed, 461 - ovda accepted, almost 50 %
(17) gets correct suggestion, 489, 522 - fixed, 524 - fixed, Guovdageainnu-láđđi not accepted.
TODO:
- look at test cases still not behaving properly (Thomas, Tomi)
- check that the
smj
R lexicon is identical to sme
(Thomas)
TODO:
- improve hyphenation testing (Sjur)
Hyphenators
Testing!!!
Release version
Schedule and tasks for the remaining weeks:
TODO:
- set up CD-printing printer (Risten, Leif Åge)
- fix Windows CD installation bug (Sjur, Børre)
- put on hold - work-around should be documented
- get price and schedule for printed CD cover (Risten)
- done: 3980,- + 900,- + VAT for 1000 covers.
- 8 days production time
- 100 covers will be picked up in Tromsø (Børre)
- Print 50 CDs, take them to Oslo (Risten, Julie)
- Burn the CDs in Oslo (Sjur)
- fix remaining bugs - golden master by end of this Monday (all)
- finalise InDesign hyphenator (Sjur, Børre, Thomas)
- testing = hyphenation testing
- documentation
- installation
- update usage and installation documentation (Børre, Thomas, Sjur)
- translate all new documentation (all)
- QA all documentation (all)
- do as much hunsopell as possible (Børre, Tomi)
Actual release
December 12 is the most likely date, before 12:00. Still to be confirmed.
There will be a release party in the afternoon.
9. Other
Corpus contracts
Delayed till after final release.
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
(Sjur)
Bug fixing
When fixing bugs, record the version number containing the fix in the Bugzilla
bug report, such that for each bug, we know exactly when it should have been
fixed, in what file(s) and what version.
83 open Divvun/Disamb bugs (45 of these 83 are speller-related bugs,
38 are other bugs), and 23 risten.no bugs
Software updates
- Leopard, 10.5
- Ilona - Will have it tomorrow or at latest on Wednesday. Probably needs help
with installing.
- Per-Eric - ready for updating tomorrow - Børre will help
10. Next meeting, closing
The next meeting is 03.12.2007, 09:30 Norwegian time.
The meeting was closed at 11:42.
Appendix - task lists for the next week
Boerre
- move Steinar’s error markup in the xml files to (a copy of) the original
- fix bug 550
- finalise InDesign hyphenator
- update usage and installation documentation
- fix bugs!
Ilona
- lexicalise
smj
missing words.
- Help Trond with the
smj
dictionary.
- Install Leopard
Maaren
- lexicalise actio compounds
Per-Eric
- check some unusual words from the Olavi missing list which are still not
lexicalised
- derivations tests
- Install Leopard
- fix bugs!
Risten
- set up CD-printing printer
- test printer
- finish the cd cover and cd design
Saara
- add new XSL/XML headers for proofing test docs
- Set up ways of adding meta-information for proofing correct corpus docs
(source info, used in testing or not, added to lexicon or not)
- add nested error markup to xml conversion
- discuss more parallel texts
Sjur
- fix bug 550
- document Windows CD installation work-around
- finalise InDesign hyphenator
- update usage and installation documentation
- test latest hyphenator
- analyse hyphenation test results
- fix bugs!
Thomas
sme->smj
lexicon conversion to build bilingual lexicon resources
- test hyphenation
- analyse hyphenation test results
- look at test cases still not behaving properly
- paradigm testing
- finalise InDesign hyphenator
- update usage and installation documentation
- check that the
smj
R lexicon is identical to sme
- fix bugs!
Tomi
Trond
sme->smj
lexicon conversion to build bilingual lexicon resources
- test hyphenation
- analyse hyphenation test results
- fix bugs!.