Meeting setup
- Date: 11.06.2007
- Time: 09.30 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat/Skype
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:57.
Present: Børre, Maaren, Per-Eric, Sjur, Steinar, Thomas, Tomi, Trond
Absent: Saara
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- add
sma
texts to the corpus repository
- run all known spelling errors in the prooftest corpus through the speller
- add extraction of all known spelling errors in the regular corpus (not the
prooftest
corpus), and check that they are properly corrected
- update and fix our documentation and infrastructure as Steinar finds
problem areas - low priority
- study the Hunspell formalism in detail
- contact Davvi Girji / Mikal Aase
- install larger disks, new RAM on the G5 when they arrive
- Arrived. Will install it asap.
- move list of known bugs to Bugzilla
- update/check installed file list and paths for Windows
- fix bugs!
Inga
- expand the smj typos list
- add missing smj words
Maaren
- lexicalise actio compounds
- Manually mark speller test documents for typos
Per-Eric
- expand the smj typos list
- add missing smj words
Saara
- improve cgi-bin scripts
- add new XSL/XML headers for proofing test docs
- Try to add files with Lars to the corpus interface.
- fix bugs!
Sjur
- run all known spelling errors in the corpus through the speller
- not done, depends on speller test bench improvements
- document the AppleScript testing tool
- integrate regression self tests with the make file
- improve speller test bench
- worked on it, problems with speller test result processing, perl script
- integrate the ccat speller testing options in the make file
- worked on it, problems with speller test result processing, perl script
- fix internet setup for Per-Eric’s satelite modem
- look over the Bugzilla status mails
- contact Davvi Girji / Mikal Aase
- ask Xerox for a commercial lisense for the xfst tools on the G5
- check with Sámi publishing houses whether support for CS2 is still needed
- checked Min Áigi, Áššu and Davvi Girji - CS2 not needed so far
- fix stuorra-oslolaš lower case
o
- topic for the Drag meeting
ö/ä
vs ø/æ
in speller
- topic for the Drag meeting
- study the Hunspell formalism in detail
- topic for the Drag meeting
- move list of known bugs to Bugzilla
- resend the press release to some channels in Sweden, Finland and Norway
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- fix bugs!
- other:
- finished installation of Parallels Desktop, Windows XP, Office 2007 and our
Windows proofing tools for testing Windows version of the spellers.
Steinar
- Beta testing: Align manually (shorter texts)
- Manually mark speller test texts for typos (making them into gold standards),
add the texts to the orig/catalogue
- Complete the semantic sets in sme-dis.rle
- missing lists
- fix bugs!
Thomas
- work with compounding
- Lack of lowering before hyphen: Twol rewrite.
smj
: öä not accepted, only øæ (except for lexicalised names)
- fix stuorra-oslolaš lower case
o
- investigate why actios of 3-syllable verbs are not accepted by the speller
- had some help with this, we will see
- investigate why some adverbs of 3-syllable adjectives are not accepted by the
speller
- fix bugs!
Tomi
- add compounding restrictions to the PLX conversion
- make PLX conversion test sample; add conversion testing to the make file
- improve prefix and middle-noun PLX conversion
- integrate the
ccat
speller testing options in the Makefile
- first part of multiword expressions not accepted
- open up compounding for all actios
- fix bugs!
Trond
- Work on the web corpus issues
- update the
smj
proper noun lexicon, and refine the morphological
analysis, cf. the propernoun-smj-lex.txt
- Fixed a fatal bug here (1/3 of names restored!), but not worked more
on the morphological issue.
- Go through the Num bugs
- fix stuorra-oslolaš lower case
o
- fix bugs!.
- Closed several, but opened more, I am afraid.
3. Documentation
TODO:
- write form to request corpus user account (Børre, Sjur, Trond)
- document how to apply for access to closed corpus, and details on the corpus
and its use in general (Børre, Sjur, Trond)
- correct and improve it based on feedback from Steinar (Børre)
4. Corpus gathering
Sjur spoke to Davvi Girji, we will send them a list of the authors contacted
and DG will send a list of all works DG has published - ever (not all of them
available in digital form, though)!
TODO:
sme
texts: no new additions, fix corpus errors during this month
(Børre, Trond, Saara)
- missing
nob
parallel texts should be added if such holes are found
(Børre, Trond)
- Go through the list of missing or errouneous
nob
texts, based upon
Saara’s perfect list (Børre, Trond)
- add
sma
texts to the corpus repository (Børre)
- contact Davvi Girji / Mikal Aase (Børre, Sjur)
5. Corpus infrastructure
Nothing this week either.
6. Infrastructure
TODO:
- update and fix our documentation and infrastructure as Steinar finds
problem areas (Børre)
- fix internet setup for Per-Eric’s satelite modem (Sjur, Børre)
- this influences iChat, SEE sharing, and ARD connetions
7. Linguistics
North Sámi
Actio compounds: Maaren and Duomma disagrees about what is correct and
not, needs to be resolved. We need some more clarifications about the system,
which we will do in Drag.
TODO:
- lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
(Maaren)
- vuolgin- and vuolgga- , both are okei vuolggasadji and vuolgindássi for eks
- possibly turn on free compounding as part of the PLX conversions (ie free
compounding in the spellers, but not in the analyzers/transducers)
- fix stuorra-oslolaš lower case
o
(Sjur, Thomas, Trond)
- open up compounding for all actios (Tomi)
Lule Sámi
TODO:
- refine
smj
proper noun lexica, cf. the propernoun-smj-lex.txt
(Thomas, Trond)
ö/ä
vs ø/æ
in speller (Thomas, Sjur)
- lexicalise words from the Olavi missing list, but check against the pdf
original where in doubt (Inga)
- add normativity issues to our normativity document (Inga, Thomas)
- investigate why actios of 3-syllable verbs are not accepted by the speller
(Thomas)
- norm-lookup does not see these, ordinary look-up sees
- these were grepped out because they containted the string
SUB
as part
of their lexicon names. Now the names are changed, and it should be fixed
now, needs to be tested in the new speller
- investigate why some adverbs of 3-syllable adjectives are not accepted by the
speller (Thomas)
- norm-look-up sees some, but not all, ordinary look-up sees
- it seems to be fixed, needs to be tested in the new speller
8. Name lexicon infrastructure
Decisions made in Tromsø can be found in [this meeting
memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]
TODO:
- fix bugs in lexc2xml; add comments to the log element (Saara)
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
well) (the morphological section should be kept intact, in e.g.
propernoun-sme-morph.txt) (Sjur, Saara)
- convert propernoun-($lang)-lex.txt to a derived file from common xml files
(Sjur, Tomi, Saara)
- implement data synchronisation between risten.no and
the cvs repo, and possibly other servers (ie the G5 as an alternative server
to the public risten.no - it might be faster and better suited than the
official one; also local installations could be treated the same way)
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use
(e.g. @type=secondary) (Thomas, Maaren, linguists)
- merge placenames which are errouneously in different entries: e.g. Helsinki,
Helsingfors, Helsset (linguists)
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils
(linguists)
9. Spellers
OOo spellers
Børre, Sjur, Tomi will have a session on this in Drag.
TODO:
- add Hunspell data generation to the lexc2xspell (Tomi - after the
PLX data generation is finished)
- study the Hunspell formalism in detail (Børre, Sjur, Tomi)
Testing
Spelling Error Markup
Text in other languages should not be marked as spelling errors.
TODO:
- Manually mark test texts for typos (making them into gold standards)
(Steinar)
- Set up ways of adding meta-information (source info, used in testing or not,
added to lexicon or not) (Saara)
Sjur is trying to get the ccat typos option integrated in the test targets
in the Makefile. Hopefully done soon.
TODO:
- document the AppleScript testing tool (Sjur)
- improve speller test bench (Sjur)
- integrate the ccat speller testing options in the Makefile (Sjur, Tomi)
Regression tests
Nothing new
TODO:
- add extraction of all known spelling errors in the corpus (not the
prooftest
corpus), and check that they are properly corrected
(Børre, Sjur)
- test the
typos.txt
list, and check that all entries are properly corrected
(Børre, Sjur)
- consider how to do a regression self-test, ie, how to test the full
wordlist (Børre, Sjur)
- extract all the base forms in the lexicon, and run them through the speller
- extract all SUB-marked entries, and run them through the lexicon
- integrate these in the make file (Sjur)
TODO:
- install larger disks, new RAM on the G5 when they arrive (Børre)
- received, will be installed soon.
- ask for mklex for Linux (victorio) from Polderland (Sjur)
- ask Xerox for a commercial lisense for the xfst tools on the G5 (Sjur)
- add compounding restrictions to the PLX conversion (Tomi)
- done, seems correct, but needs more testing when a new speller is ready.
Compounding restrictions
Compounding restrictions are now integrated in the PLX conversion, thanks to
Tomi.
TODO:
- improve prefix conversion to PLX (Tomi)
- done
- improve middle noun conversion to PLX (Tomi)
- done
- improve noun + adjective PLX conversion: (Tomi)
- compounding stems - how do we generate them? Using the java client?
+SgNomCmp+Cmpnd
= sáme–
, should give the correct compounding stem,
shouldn’t it? We want to optionally go from: sáme- NLI
to
sáme NL
: - NLI (->) NL
, which means we should be able to
extract correct compounding stems using xfst methods only.
- done
- compounding tags - we need to obey them when making the transducers.
Suggestion - see above.
- done
- make conversion test sample; add conversion testing to the make file
(Tomi)
- to regression test / QA the PLX conversion.
- not done
Public Beta follow-up
TODO:
- fix clitics (Tomi)
- done after the release, has to be tested
- can be tested in the small speller - tested,
all accepted:
Sjurgo buorrege Trosterutge biilago
- file list in Windows not complete (Børre, Sjur)
- test smj on typos (Børre)
- tried, but got an error, thus skipped. Needs to be checked now.
- celebrate
- NOT done - will do in Drag:)
- resend the press release to some channels in Sweden, Finland and Norway
(Sjur)
- Per-Eric will follow up in Sweden, Tomi in Finland, to make sure we
have got reasonable coverage, or at least enough users for the beta spellers.
Finnish press coverage.
Other finnish institutions to contact could be:
- Samiradio (Tomi) - they’re planning to make a report
- Sami parliament (Tomi)
- Oulu - giellagas (Tomi)
- Lapin yliopisto - Rantala (Trond)
- Helsingin yliopisto - Seurujärvi-Kari (Tomi)
- KOTUS (Sjur)
- Citysaamit (Tomi)
- Oulun saamelaiset (Tomi)
- move list of known errors to Bugzilla (Børre, Sjur)
10. Other
Summer vacation
When are we taking it? Please fill in the table below:
Name |
Starting |
Ending |
Børre |
x |
x |
Maaren |
9.7. |
10.8. |
Per-Eric |
9.7. |
20.7. |
Saara |
2.7 |
3.8 |
Sjur |
x |
x |
Steinar |
x |
x |
Thomas |
9.7. |
12.8. |
Tomi |
9.7. |
5.8. |
Trond |
2.7. |
12.8, but working at the end |
Divvun people also need to send the dates to Julie Eira or
Ellen Mienna Guttorm.
Corpus contracts
TODO:
- publish corpus contracts and project infra as open-source on NoDaLi-sta
(Sjur)
Bug fixing
When fixing bugs, record the version number containing the fix in the Bugzilla
bug report, such that for each bug, we know exactly when it should have been
fixed, in what file(s) and what version.
56 open Divvun/Disamb bugs (21 of these 56 are speller bugs, 35 are
general bugs), and 23 risten.no bugs
TODO:
- look over the Bugzilla status mails (Børre)
The meeting in Drag
The Sámi Parliament board has its meeting June 19-21. We should use Monday 18.
as our travel day, and return on Friday 22. Fly to Bodø, and go by rental car
from there. It is also possible to go by car all the way from Tromsø, and it is even faster. Those going to Bodø are (at least):
Topics for Drag:
- two-level fixes (stuorra-oslolaš)
- OOo/Hunspell
- QA session
- Actio compounding clarifications
- smj work in general
- loan words in -áhta or -áhtta (example: advokáhtta or advokáhta)
SD-ráddi presentation (1 hour):
- demo Divvun
- demo risten.no
- drift av divvun
- drift av risten.nno
- forlenging/nytt prosjekt (ie drift)
- sørsamisk
- terminologi-utvikling
- parallellkorpus
- nordisk samarbeid
Sjur will order rooms for all (except Per-Eric) on Hamarøy Hotell, meeting room either at the Hotel or at Árran. Beds are needed as follows:
- Monday: Sjur, Maaren, Tomi
- Tuesday: Sjur, Maaren, Thomas, Tomi, Trond, Børre
- Wedday: Sjur, Maaren, Tomi, Børre (not at Hamarøy Hotell - it is full)
- Thursday: Sjur, Maaren, Tomi, Børre
TODO:
- order rooms (Sjur)
- order meeting room (Sjur)
- plan presentation (Sjur)
A commercial
An alternative compiler to Xerox is coming up, in
- Haifa
- [FSMNLP conference in
september|http://www.ling.uni-potsdam.de/fsmnlp2007/index.php?show=1]
11. Next meeting, closing
The next meeting is 25.6.2007, 10:30 Norwegian time (or possibly in the
afternoon).
The meeting was closed at 11:28.
Appendix - task lists for the next week
Boerre
- add
sma
texts to the corpus repository
- run all known spelling errors in the prooftest corpus through the speller
- add extraction of all known spelling errors in the regular corpus (not the
prooftest
corpus), and check that they are properly corrected
- update and fix our documentation and infrastructure as Steinar finds
problem areas - low priority
- study the Hunspell formalism in detail
- follow-up contact with Davvi Girji
- install larger disks, new RAM on the G5
- update/check installed file list and paths for Windows
- study the Hunspell formalism in detail
- fix bugs!
Maaren
- lexicalise actio compounds
- Manually mark speller test documents for typos
Per-Eric
- expand the smj typos list
- add missing smj words
- contact media in Sweden about the beta release
Saara
- add new XSL/XML headers for proofing test docs
- Try to add files with Lars to the corpus interface.
- fix bugs!
Sjur
- run all known spelling errors in the corpus through the speller
- document the AppleScript testing tool
- integrate regression self tests with the make file
- improve speller test bench
- integrate the ccat speller testing options in the make file
- fix internet setup for Per-Eric’s satelite modem
- look over the Bugzilla status mails
- ask Xerox for a commercial lisense for the xfst tools on the G5
- check with Sámi publishing houses whether support for CS2 is still needed
- resend the press release to some channels in Sweden, Finland and Norway
- publish corpus contracts and project infra as open-source on NoDaLi-sta
- study the Hunspell formalism in detail
- fix bugs!
Steinar
- Beta testing: Align manually (shorter texts)
- Manually mark speller test texts for typos (making them into gold standards),
add the texts to the orig/catalogue
- Complete the semantic sets in sme-dis.rle
- missing lists
- fix bugs!
Thomas
- work with compounding
- Lack of lowering before hyphen: Twol rewrite.
smj
: öä not accepted, only øæ (except for lexicalised names)
- fix stuorra-oslolaš lower case
o
- add normativity issues to our normativity document
- test new speller for actios of 3-sybbable verbs and adverbs of 3-s adjs.
- fix bugs!
Tomi
- make PLX conversion test sample; add conversion testing to the make file
- integrate the
ccat
speller testing options in the Makefile
- first part of multiword expressions not accepted
- open up compounding for all actios
- contact Finnish institutions about the speller beta release
- study the Hunspell formalism in detail
- add Hunspell data generation/conversion
- fix bugs!
Trond
- Work on the web corpus issues
- update the
smj
proper noun lexicon, and refine the morphological
analysis, cf. the propernoun-smj-lex.txt
- fix bugs!.