Meeting setup
- Date: 8.10.2007
- Time: 09:30 Norwegian time
- Place: Internet
- Tools: SubEthaEdit, iChat/Skype
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:39.
Present: Børre, Ilona, Per-Eric, Risten, Sjur, Thomas, Tomi, Trond
Absent: none
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- move Steinar’s error markup in the xml files to (a copy of) the original
- Hunspell lexicon conversion
- nouns, adjs and verbs seem to work okay, other POS’es and CPOS’es (?) don’t work as expected
- collect/build an e-mail notify list
- fix bugs!
Ilona
- lexicalise missing words
- Never done… but already quite far along with a missing list made from the
Assu files.
- make sms propernoun-list
- Change NIILLAS-names to ANAR or DUORTNUS.
Maaren
- lexicalise actio compounds
Per-Eric
- expand the smj typos list
- add missing smj words
- lexicalise words from the Olavi missing list
- fix bugs!
Risten
- fixed and open issues to README files
- update translations of README-files - Thursday afternoon
Saara
- add new XSL/XML headers for proofing test docs
- Set up ways of adding meta-information for proofing correct corpus docs
(source info, used in testing or not, added to lexicon or not)
Sjur
- document the AppleScript testing tool
- document the testing procedures
- work on the XML name editor/risten.no integration
- fixed and open issues to README files
- test correct-type markup with latest enhancements
- collect/build an e-mail notify list
- update translations of README-files - Thursday afternoon
- update installation packages
- announce the beta
- today - we need a multilingual e-mail text
- fix bugs!
- yes, and reported new ones
- other things:
- received, installed and tested InDesign hyphenation - works great! (but there are hyphenation errors, we need a hyphenation command line tool to test the behaviour of the Polderland hyphenation; it has been ordered, and should arrive within the next two weeks).
Thomas
- explain compound-tags to Tomi
- add oslolaš type derivation test cases to smj regression file
- sme->smj lexicon conversion to build bilingual lexicon resources
- update translations of README-files - Thursday afternoon
- fix bugs!
Tomi
- make PLX conversion test sample; add conversion testing to the make file
- Hunspell lexicon conversion
- fix stuorra-oslolaš lower case o
- this one is fixed? Yes, the latest regression tests are very good :)
- sme->smj lexicon conversion to build bilingual lexicon resources
- test whether we can revert Makefile changes, and if positive, revert them
- fix bugs!
Trond
- update the smj proper noun lexicon, and refine the morphological
analysis, cf. the propernoun-smj-lex.txt
- fix stuorra-oslolaš lower case o
- add sma texts to the corpus repository
- Analysed the sma texts; they look promising, but will require work in order to be added properly. I suggest postponing this until after Christmas (as I have done so far, also)
- sme->smj lexicon conversion to build bilingual lexicon resources
- Great progress made, still some minor changes left until it is working
- update translations of README-files - Thursday afternoon
- Done (well, it might have been Friday…)
- fix bugs!
3. Documentation
Bugzilla 3.0.x has some nice features we would like to use, like shared,
filtered queries.
TODO:
- add semi-automatic updates of fixed and open issues to README files
(Sjur)
- update Bugzilla (Børre)
4. Corpus gathering
Trond had a look at the sma Bible texts. We will postpone adding them
until after Christmas, when we will hopefully have a dedicated sma project.
Børre has received lots of texts from Torkel Rasmussen; they will
be added this week.
TODO:
- test correct-type markup with latest enhancements (Sjur)
- add texts from Torkel Rasmussen (Børre)
5. Corpus infrastructure
Nothing.
6. Infrastructure
Speller testing is still fluctuating a bit.
7. Linguistics
North Sámi
Čorru > čorut
*Oslolaš with hyphen required, is printed now, but shouldn't
oslolaš - is done correctly now
This one is fixed by the latest changes in the PLX conversion.
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
(Maaren)
- fix stuorra-oslolaš lower case o (Tomi)
Lule Sámi
We have the same oslolaš type of derivation in smj too, but it is formed
differently.
Tjårro > tjårok
*Oslolaš with hyphen required, is printed now, but shouldn't
oslolaš - is done correctly now
Correct compounds are still not recognised:
Stuorafuoskok => input, should be accepted
(218) Stuorauvsuk
(217) Stuorraluohkák
fuoskok
fuoskok Fuossko+N+Prop+Plc+Sg+Gen+Der1+Der/k+N+Sg+Nom
StuorFuoskok Stuorafuoskok 2
(220) Stuorruskak
(220) Stuoruduskak
(219) SUFUR-Fuoskok
Fuoskok is in the PLX lexicon, but does not take part in this type of
compounding. It should actually have been fuoskok.
Fuoskok Fuossko+N+Prop+Plc+Pl+Nom+Clt+ge
Fuoskok Fuossko+N+Prop+Plc+Sg+Gen+Clt+ge
smj propernoun bug issue:
Procedure:
- convert from common base (which means sme base)
- words that cannot be converted should be added to a separate smj lexicon, and
words that should not be converted from sme should be moved to a non-convert
lexicon in sme???
- send to smj morphology
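The conversion procedure above can be pictured as a three-way split of the sme name list. The following is only a rough sketch under stated assumptions: the function `split_names`, the `non_convert` set and the `convert` callable are hypothetical names made up for illustration, and they do not describe the project's actual conversion scripts.

```python
from typing import Callable, Iterable, Optional


def split_names(
    sme_names: Iterable[str],
    non_convert: set[str],
    convert: Callable[[str], Optional[str]],
) -> tuple[dict[str, str], list[str], list[str]]:
    """Split sme proper nouns into three groups (hypothetical helper).

    converted:     sme name -> smj form produced by `convert`
    skipped:       names listed in `non_convert` (kept out of the conversion)
    unconvertible: names for which `convert` returned None; these are the
                   candidates for a separate, hand-maintained smj lexicon
    """
    converted: dict[str, str] = {}
    skipped: list[str] = []
    unconvertible: list[str] = []
    for name in sme_names:
        if name in non_convert:
            skipped.append(name)
            continue
        smj_form = convert(name)
        if smj_form is None:
            unconvertible.append(name)
        else:
            converted[name] = smj_form
    return converted, skipped, unconvertible
```

Only the `converted` group would then be sent on to the smj morphology; the other two groups correspond to the non-convert lexicon and the separate smj lexicon discussed above.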
The original todo was to correct the smj morphology.
Current test shows weaknesses in both camps:
- conversion errors
- words that should not have been converted
- missing smj-unique names
- errors in the morphology
Testing procedures:
- analyse baseforms (as for sme)
- generate a couple of caseforms from the baseforms, and inspect result
Suggestion:
Let us first analyse the proper noun base forms for Lule Sámi, and thereafter
look at the morphology.
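A base-form check of this kind could be scripted roughly as follows. This is a minimal sketch, not the project's test setup: it assumes the analyser output is tab-separated with the input word in the first column and the analysis in the last (as in the Fuoskok lines above), and that unknown words are marked with `+?`, as the Xerox lookup tool does. The function name `unanalysed_baseforms` is hypothetical.

```python
def unanalysed_baseforms(lookup_output: str) -> list[str]:
    """Return the input words for which no analysis was found.

    Assumed input format (one analysis per line, tab-separated):
        word<TAB>analysis          for a recognised word
        word<TAB>word<TAB>+?       for an unknown word
    """
    analyses: dict[str, list[str]] = {}
    for line in lookup_output.splitlines():
        if not line.strip():
            continue  # blank lines separate the result groups
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # not an analysis line
        word, analysis = parts[0], parts[-1]
        analyses.setdefault(word, []).append(analysis)
    # A word counts as unanalysed only if every one of its lines is "+?".
    return [
        word
        for word, found in analyses.items()
        if all(item.endswith("+?") for item in found)
    ]
```

Running the proper noun base forms through the analyser and feeding the output to this function would give the list of names still missing from the smj lexicon, which can be dealt with before the generated case forms are inspected.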
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas)
- it is only about adding smj names now, working on it
- lexicalise words from the Olavi missing list, but check against the pdf
original where in doubt (Per-Eric)
- add oslolaš type derivation test cases to the regression file (Thomas)
- sme->smj lexicon conversion to build bilingual lexicon resources, and
increase smj coverage (Trond, Thomas, Svenne)
- add proper nouns (Thomas, Ilona)
8. Name lexicon infrastructure
This sub-project needs to get up and running soon. Mainly Sjur’s task.
Decisions made in Tromsø can be found in
[this meeting memo|/admin/physical_meetings/tromso-2006-08-propnoun.html].
TODO:
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as
well); the morphological section should be kept intact, in e.g.
propernoun-sme-morph.txt (Sjur, Saara)
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
(Sjur, Tomi, Saara)
- implement data synchronisation between risten.no and
the cvs repo, and possibly other servers (i.e. the G5 as an alternative server
to the public risten.no - it might be faster and better suited than the
official one; also local installations could be treated the same way)
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
(e.g. @type=secondary) (Thomas, Maaren, linguists)
- merge placenames which are erroneously in different entries: e.g. Helsinki,
Helsingfors, Helsset (linguists)
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
(linguists)
9. Spellers
Hunspell
Sami languages are not supported in OpenOffice.org; until that is fixed we will
have to use the same tricks we apply in Microsoft Office 2004 for Mac.
TODO:
- Hunspell lexicon conversion (Tomi, Børre)
- Begin adding support for the Sami languages in OpenOffice.org (Børre)
Testing
Spelling Error Markup
TODO:
- Set up ways of adding meta-information (source info, used in testing or not,
added to lexicon or not) (Saara)
- move Steinar’s error markup in the xml files to (a copy of) the original
(Børre, Kimme)
Automated testing
The infrastructure is about to settle.
TODO:
- document the AppleScript testing tool (Sjur)
- document the testing procedures (Sjur)
TODO:
- fix oslolaš bug (Tomi)
- fixed for sme, still open for smj
We have received the first hyphenation beta for InDesign. It has been tested,
and seems to work fine.
We should give the beta to Min Áigi and Davvi Girji.
TODO:
- make the InDesign hyphenator available to Min Áigi/Davvi Girji (Sjur)
- document the InDesign tools (Sjur)
- add hyphenation testing (Sjur)
Hyphenators
There are some hyphenation errors we need to debug.
TODO:
- get command line hyphenator (Sjur)
- collect list of problematic words for the hyphenator (Sjur, Thomas, all)
New public beta
TODO:
- collect/build an e-mail notify list; we make it simple, a text document with
e-mail addresses (Sjur, Børre, others)
- update list of fixed and known issues - Tuesday afternoon (Sjur, Risten)
- update translations of README-files - Thursday afternoon
(Risten, Thomas, Sjur, Trond)
- update installation packages (Sjur)
- announce the beta (Sjur)
Release version
The CD cover etc. will be worked on by John-Marcus Kuhmunen, and will follow the
SD design rules. He is now waiting for the text to be put on the CD cover and
other places.
TODO:
- write text to go on the CD cover (Risten)
- set up CD-printing printer (Risten)
10. Other
Corpus contracts
Delayed till after final release.
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
(Sjur)
Bug fixing
When fixing bugs, record the version number containing the fix in the Bugzilla
bug report, such that for each bug, we know exactly when it should have been
fixed, in what file(s) and what version.
59 open Divvun/Disamb bugs (26 of these 59 are speller-related bugs,
33 are other bugs), and 23 risten.no bugs
11. Next meeting, closing
The next meeting is 15.10.2007, 09:30 Norwegian time.
Trond will be away.
The meeting was closed at 10:43.
Appendix - task lists for the next week
Børre
- move Steinar’s error markup in the xml files to (a copy of) the original
- Hunspell lexicon conversion
- update Bugzilla to 3.0.x
- begin adding support for the Sami languages in OpenOffice.org
- add texts from Torkel Rasmussen
- fix bugs!
Ilona
- lexicalise missing words
- Will I have new missing lists to do?
- Check the Finnish translation
- add smj proper nouns
- other smj tasks
Maaren
- lexicalise actio compounds
Per-Eric
- expand the smj typos list
- add missing smj words
- lexicalise words from the Olavi missing list
- fix bugs!
Risten
- write text to go on the CD cover
- set up CD-printing printer
Saara
- add new XSL/XML headers for proofing test docs
- Set up ways of adding meta-information for proofing correct corpus docs
(source info, used in testing or not, added to lexicon or not)
Sjur
- document the AppleScript testing tool
- document the testing procedures
- work on the XML name editor/risten.no integration
- test correct-type markup with latest enhancements
- get command line hyphenator for automated testing of the hyph-lexicons
- collect list of problematic words for the hyphenator
- make the InDesign hyphenator available to Min Áigi/Davvi Girji
- document the InDesign tools
- add hyphenation testing
- fix bugs!
Thomas
- sme->smj lexicon conversion to build bilingual lexicon resources
- add smj proper nouns
- fix bugs!
Tomi
- Hunspell lexicon conversion
- sme->smj lexicon conversion to build bilingual lexicon resources
- fix oslolaš bug in smj (Tomi)
- fix bugs!
Trond
- sme->smj lexicon conversion to build bilingual lexicon resources
- fix bugs!