Meeting setup
- Date: 05.09.2005
- Time: 10.00 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Summing up last week (Meeting + conference)
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Speller infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:12.
Present: Børre, Saara (half an hour), Sjur, Thomas, Tomi, Trond
Absent: Maaren
Main secretary: Børre
Agenda accepted with new point 3.
2. Reviewing the task list from the last meeting
Børre
- Add crontab specification for the cvs update/export script Tomi made
- Adjust it so it will work with the new server
- Working on it, almost done
- reopen the jspwiki + UTF-8 issue
- Add issue to forrest issue tracker about utf-8 ihtml documents.
- Contact Svenska bibelsällskapet
- discuss with Anders Kintel about possible cooperation
- Not done, been in Helsinki
- Continue writing a contract draft, together with Trond and Sjur.
Coordinate with Kimmo Koskenniemi
- Follow up on CVS mailing:
- check that Thomas and Maaren receive forwarded mail from cochice
- set up Thomas and Maaren
- Thomas is set up, not sure about Maaren
- Investigate compatibility between MySpell and Aspell source files
- Debian has common source files for these.
- Finish giellatekno forrest transition.
- Other tasks:
- Has looked at sfst, begun reading “THE book” (about the Xerox tools)
Maaren
- The missing list, both the overall missing list from our xml corpus,
and a file-for-file review, in order to get different terminology
- Go through the missing list from risten.no
- remember headphones, 2*money and batteries for Helsinki
Sjur
- risten.no bugs and fixes
- Nothing during the last two weeks
- complete the action summary after our half-year evaluation
- follow up on:
- voice group-chat not working to Sámediggi
- This won’t work without a new or upgraded Firewall
- Maaren has problems with SubEthaEdit (can’t connect)
- Maaren’s computer updated and fixed, the problem might be solved,
but I haven’t had a chance to test it.
- To the board:
- write proposal for permanent maintenance organisation
- write draft specification for the outsourced tasks
- write half-yearly project report with progress and bugdet status
- write agenda
- Deadline for the board tasks: 3 weeks ahead of meeting (meeting is
October 4th)
- add Divvun to the UNESCO Hall-of-Fame list
- Plan the Helsinki meeting
- meeting about corpus contract
- project planning with Trond
Thomas
- work on Lule Sami compounding and derivation, check closed POSes
**checked closed POSes, started with derivation
Tomi
- Aspell: Continue working on the affix file
- investigate Aspell/MySpell/OOo issues
- MySpell uses (almost) the same kind of wordlist & affix files as aspell
- OOo is turning over to hunspell, which has a much more powerfull and
linguistically adequate formalism,
- three-part compounding
- decapitalisation of proper nouns when (if?) compounded, and when derived
(the oslolaš issue)
- This is working, but should be added to makefile and CVS
- corpus infrastructure: dtd location (both public and internal)
- corpus infrastructure: file and dir organisation
Trond
- Work on the bug list (Lule Sámi).
- No work on Lule Sámi last week)
- Work on compounds (three-part + oslolaš, with Tomi)
- The oslolaš issue was solved in the Helsinki conference
- Three-part compounds still open
- Work on the corpus interface (with Lars)
- Corpus infrastructure: dtd location
- Made a principled decision in Helsinki, not implemented.
- Get the new version of the New Testament
- Got an html version in Helsinki from a collegue(!), still haven’t got
hold of Helge Almås in Bibelselskapet.
- Check Hans-Ragnars names.
- Look at the Sámi names issue
- Not done. But the CD is on my desk, so we will look into it here in Tromsø.
- Plan the Helsinki meeting
- New coworker
- meeting about corpus contract
- We had a meeting, the work continues.
- check the new giellatekno site
- project planning with Sjur
3. Summing up last week (Meeting + conference)
- Meetings: Our meetings went well.
- Demo: We should have had a prepared text with collected authenthical spelling
errors, and learned how to use KWord… Otherwise our demo performance was as
good as we had capacity to.
- The conference was too advanced from time to time (especially the mathematical
parts), but the twol day (and some of the lectures) was ok.
- We met the community and discussed, that was very positive. E.g. discussion
with the person behind sfst, the Xerox persons, etc.
4. Documentation
The giellatekno.uit.no site is deployed. Børre has read and fixed some of the
technical documents.
The standing invitation remains: Document what you do, when you do it.
Especially document Aspell. The Makefile, how to build, what changes has been
added to the lexicon, and so on.
5. Corpus gathering
Børre, Trond and Sjur had their meeting, and the Helsinki contract is quite good
as it is. There are some points still to be discussed. We’ll continue this week.
Paths forward: We have a contract suggestion. Sjur and Børre should start the
negotiation with the authors and/or publishers to get text. We should make a
priority list for authors, and we should invite ourselves to their meeting to
discuss. We should carry on the discussions both with Iđut and with Davvi Girji.
How to proceed:
- Get the contract suggestion ready
- Translated part 1 ok, part 2 and 3 missing. Done this week
- Get part 4 from Kimmo, and translate it
- Contact our lawyers, at SD and UIT (today, tomorrow).
- When the Norwegian version of the contracts are ready, make
sure they’re corresponding to the Finnish ones in their legal
interpretation (as far as possible); then publish both versions
for others to reuse, with background documentation (in cooperation
with Kimmo Koskenniemi)
- Approach the text owners (see ordered list below)
Independent of the contract work
- Bible: The new testament (Trond)
- Bureaucratic text:
- Sámi Parliament (Børre)
- Sámi Oahpahusráđđi (Børre)
- KRD (Børre, check whether we miss texts (discuss with Trond))
- the Sámi municipalities (Børre)
- Textbooks
- To the extent that text can be got directly from SO.
After the contracts are ready
Sjur and Børre should probably take a Tour-de-Sápmi, and meet with the
most important persons and institutions. Børre as the responsible for
collecting, Sjur as responsible for the project, and representative of SD
The tour should be planned, not in this meeting, but before the contracts
are ready, i.e., it should be planned next week.
- Commercially published texts
- Author organisations’ meetings
- Key authors one by one
- (list of author names) Kerttu Vuolab, Kirsi Paltto,
- Iđut and key authors there (Børre)
- Davvi Girji and key authors there
- Newspaper text:
- Sámi Instituhtta’s (for the old archive of Min Áigi and Áššu)
- Áššu has been making a CD since the end of may, there should be a pile
there. Børre suggests that they send us the CDs they have, so that we
may look at them, and ensure that the routines work, and that we are
able to utilize their format.
- Min Áigi
List of texts with lower priority (to be gathered when the above list is
more or less fixed)
- the Sámi municipalities,
- Authors with smaller production
- Textbooks
6. Corpus infrastructure
Do documentation.
Naming conventions and directory structure
We have a decision from Helsinki:
- have the same directory structure in all three levels, and we also decided
upon the overall structure.
- Path forward: Tomi and Trond to implement the directory structure
7. Linguistics
North Sámi
- New place names received, should be added to our lexicons
- three-part compounds issue still open
- Johnny Andersen has written a letter to us on the treatment of Sámi place
names, we need a contract with “Norge digitalt”, via UFD. We will have to
get such an agreement. This is a task for Thomas and Trond.
Lule Sámi
We do not know when we get the lexicon from Anders Kintel. We need a meeting
with him and with Árran in order to coordinate the work.
Status quo on our parser:
- All the major POSes have been covered
- closed POSes have been checked
- compounding and derivation
**Thomas has begun with the deverbals
8. Speller infrastructure
aSpell
Write documentation here as well.
Munch-list is working, and the affix file is improving. See [previous meeting
memo|/admin/weekly/2005/Meeting_2005-08-15.html].
Issues:
#The phonetic file should be systematically looked into.
1. Check that it works
1. Add more correspondences on an impressionistic basis
- Start work on collecting systematic spelling errors:
- Our in-house file typos.txt
- The soon-to-arrive error texts from newspapers
- The holes in the affix list should be mended
- We should, at some point, evaluate whether this is The Correct Approach to
aspell-type speller building
- Affix file UTF-8 problem should be checked and reported.
- Then there is the UTF-8 root (or whatever) problem: The work-around using
Latin 4 is no solution because of the following latin 4 bug: It treats đ
as ð, and thus gives error on “ođđa” (wants “oðða”, this is since in Latin
4 đ and ð are unified (Latin 4).
Latin 6 (= ISO/IEC 8859-10) or ISO-IR 197 (the old Dihtorlávdegoddi code
page) could be a fix to this.
- The clitics issue: Today we have a manually created affix file in order to
meet the muncher. Strategy: Do the munching without the clitics, but then
enrich the manually-created affix file via a genfisuffix -program in order
to get an automatically created suffix + clitic file to add the compiled
lexicon to. We will have genfisuff taking affix + clitic and makes it into
affix’. Then we use affix to munch but affix’ to spellcheck.
- We must create subcomponents under the Speller
MS Office spellers
Nothing new, see
previous meeting memo.
OpenOffice.org
From last meeting:
The conversion from aspell to myspell will work trivially as soon as the myspell
list becomes smaller.
Issue left open.
Hunspell
Hunspell is presently already working with OOo, and is a much better speller
engine, linguistically speaking (can handle compounds much better than Aspell,
as well as complex inflection and derivation).
For pointers, see the
previous meeting memo.
Issue left open.
Other engines
Børre and Sjur had a long discussion with the author of the SFST library/tool
set. Next on his priority list is a feature for handling spelling errors in
running text analysis. This is principally the same as we want, thus there
should be a good opportunity for making SFST into the spelling engine.
Sjur has a suggestion on how to implement this feature that may be mailed
to the author of SFST.
9. Other
Technical issues
- Fixing the machine for the new coworker
- The mac os / perl bug (at least Trond and Sjur has it):
- utf8 “\xC4” does not map to Unicode at /Users/trond/gt/script/preprocess line 82.
This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6).
It is probably a perl - OS mismatch. (Trond, Thor Øivind, Tomi)
- Sjur has a non-solved Backspace + UTF-8 issue
- 29 open bugs - too much! Have a look at what you can fix.
10. Summary, task list
Børre
- Finish crontab specification for the cvs update/export script Tomi made
- reopen the jspwiki + UTF-8 issue
- Add issue to forrest issue tracker about utf-8 ihtml documents.
- Contact Svenska bibelsällskapet
- discuss with Anders Kintel about possible cooperation
- Follow up on CVS mailing:
- Meet up with Trond about directory structure
- Contact oahpahusossodat and the rest of the SD about texts
- Fixing the machine for the new coworker
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
file-for-file review, in order to get different terminology
- Go through the missing list from risten.no
- Start working on grammatical issues with Thomas and Trond.
Saara
- Get aquainted with the project status quo
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
Sjur
- risten.no bugs and fixes
- complete the action summary after our half-year evaluation
- follow up on:
- voice group-chat not working to Sámediggi
- Maaren has problems with SubEthaEdit (can’t connect)
- To the board:
- write proposal for permanent maintenance organisation
- write draft specification for the outsourced tasks
- write half-yearly project report with progress and bugdet status
- write agenda
- Deadline for the board tasks: 3 weeks ahead of the meeting (the meeting is
October 4th, deadline for the documents is September 13th)
- project planning with Trond
Thomas
- work on Lule Sami compounding and derivation
- Look at Linguistic bugs with Trond.
- Work on the name agreement with “Norge digitalt” with Trond
Tomi
- Aspell: Continue working on the affix file
- three-part compounding
- Add downcasing to makefile and CVS
- corpus infrastructure: dtd location (both public and internal)
- corpus infrastructure: file and dir organisation
Trond
- Work on the bug list (Lule Sámi).
- Work on compounds (three-part, with Tomi)
- Work on the corpus interface (with Lars)
- Corpus infrastructure: dtd location
- Work on the name agreement with “Norge digitalt” with Thomas
- Look at the linguistic aspects of the speller clitics, with
- Get the new version of the New Testament
- Check Hans-Ragnars names.
- New coworker
- translate contract
- check the new giellatekno site
- project planning with Sjur
10. Next meeting, closing
12.09.2005 10:00
Closed at 12:56