Meeting setup

Date: 05.09.2005
Time: 10.00 Norw. time
Place: Wherever we are :-)
Tools: iChat, SubEthaEdit

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Summing up last week (Meeting + conference)
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Linguistics
Speller infrastructure
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 10:12.

Present: Børre, Saara (half an hour), Sjur, Thomas, Tomi, Trond

Absent: Maaren

Main secretary: Børre

Agenda accepted with new point 3.

2. Reviewing the task list from the last meeting

Børre

Add crontab specification for the cvs update/export script Tomi made
- Adjust it so it will work with the new server
  - Working on it, almost done
reopen the jspwiki + UTF-8 issue
- Not done
Add issue to forrest issue tracker about utf-8 ihtml documents.
- Not done
Contact Svenska bibelsällskapet
- Not done
discuss with Anders Kintel about possible cooperation
- Not done, been in Helsinki
Continue writing a contract draft, together with Trond and Sjur. Coordinate with Kimmo Koskenniemi
- Got new documents
Follow up on CVS mailing:
- check that Thomas and Maaren receive forwarded mail from cochice
- set up Thomas and Maaren
  - Thomas is set up, not sure about Maaren
Investigate compatibility between MySpell and Aspell source files
- Debian has common source files for these.
Finish giellatekno forrest transition.
- Done, deployed.
Other tasks:
- Has looked at sfst, begun reading “THE book” (about the Xerox tools)

Maaren

The missing list, both the overall missing list from our xml corpus, and a file-for-file review, in order to get different terminology
- Done?
Go through the missing list from risten.no
- Partially done
remember headphones, 2*money and batteries for Helsinki
- Done

Sjur

risten.no bugs and fixes
- Nothing during the last two weeks
complete the action summary after our half-year evaluation
follow up on:
- voice group-chat not working to Sámediggi
  - This won’t work without a new or upgraded Firewall
- Maaren has problems with SubEthaEdit (can’t connect)
  - Maaren’s computer updated and fixed, the problem might be solved, but I haven’t had a chance to test it.
To the board:
- write proposal for permanent maintenance organisation
- write draft specification for the outsourced tasks
- write half-yearly project report with progress and bugdet status
- write agenda
- Deadline for the board tasks: 3 weeks ahead of meeting (meeting is October 4th)
add Divvun to the UNESCO Hall-of-Fame list
- Not done
Plan the Helsinki meeting
- Done
meeting about corpus contract
- Done
project planning with Trond
- Not done

Thomas

work on Lule Sami compounding and derivation, check closed POSes **checked closed POSes, started with derivation

Tomi

Aspell: Continue working on the affix file
- Work has been done there
investigate Aspell/MySpell/OOo issues
- MySpell uses (almost) the same kind of wordlist & affix files as aspell
- OOo is turning over to hunspell, which has a much more powerfull and linguistically adequate formalism,
three-part compounding
- Not done
decapitalisation of proper nouns when (if?) compounded, and when derived (the oslolaš issue)
- This is working, but should be added to makefile and CVS
corpus infrastructure: dtd location (both public and internal)
- Discussed in Helsinki
corpus infrastructure: file and dir organisation
- Discussed in Helsinki

Trond

Work on the bug list (Lule Sámi).
- No work on Lule Sámi last week)
Work on compounds (three-part + oslolaš, with Tomi)
- The oslolaš issue was solved in the Helsinki conference
- Three-part compounds still open
Work on the corpus interface (with Lars)
- Not done.
Corpus infrastructure: dtd location
- Made a principled decision in Helsinki, not implemented.
Get the new version of the New Testament
- Got an html version in Helsinki from a collegue(!), still haven’t got hold of Helge Almås in Bibelselskapet.
Check Hans-Ragnars names.
- Not done.
Look at the Sámi names issue
- Not done. But the CD is on my desk, so we will look into it here in Tromsø.
Plan the Helsinki meeting
- Done
New coworker
- Formalities start now.
meeting about corpus contract
- We had a meeting, the work continues.
check the new giellatekno site
- Some checking done.
project planning with Sjur
- Not done.

3. Summing up last week (Meeting + conference)

Meetings: Our meetings went well.
Demo: We should have had a prepared text with collected authenthical spelling errors, and learned how to use KWord… Otherwise our demo performance was as good as we had capacity to.
The conference was too advanced from time to time (especially the mathematical parts), but the twol day (and some of the lectures) was ok.
We met the community and discussed, that was very positive. E.g. discussion with the person behind sfst, the Xerox persons, etc.

4. Documentation

The giellatekno.uit.no site is deployed. Børre has read and fixed some of the technical documents.

The standing invitation remains: Document what you do, when you do it.

Especially document Aspell. The Makefile, how to build, what changes has been added to the lexicon, and so on.

5. Corpus gathering

Børre, Trond and Sjur had their meeting, and the Helsinki contract is quite good as it is. There are some points still to be discussed. We’ll continue this week.

Paths forward: We have a contract suggestion. Sjur and Børre should start the negotiation with the authors and/or publishers to get text. We should make a priority list for authors, and we should invite ourselves to their meeting to discuss. We should carry on the discussions both with Iđut and with Davvi Girji.

How to proceed:

Get the contract suggestion ready
1. Translated part 1 ok, part 2 and 3 missing. Done this week
2. Get part 4 from Kimmo, and translate it
3. Contact our lawyers, at SD and UIT (today, tomorrow).
4. When the Norwegian version of the contracts are ready, make sure they’re corresponding to the Finnish ones in their legal interpretation (as far as possible); then publish both versions for others to reuse, with background documentation (in cooperation with Kimmo Koskenniemi)
Approach the text owners (see ordered list below)

Independent of the contract work

Bible: The new testament (Trond)
Bureaucratic text:
1. Sámi Parliament (Børre)
2. Sámi Oahpahusráđđi (Børre)
3. KRD (Børre, check whether we miss texts (discuss with Trond))
4. the Sámi municipalities (Børre)
Textbooks
1. To the extent that text can be got directly from SO.

After the contracts are ready

Sjur and Børre should probably take a Tour-de-Sápmi, and meet with the most important persons and institutions. Børre as the responsible for collecting, Sjur as responsible for the project, and representative of SD

The tour should be planned, not in this meeting, but before the contracts are ready, i.e., it should be planned next week.

Commercially published texts
1. Author organisations’ meetings
2. Key authors one by one
  1. (list of author names) Kerttu Vuolab, Kirsi Paltto,
3. Iđut and key authors there (Børre)
4. Davvi Girji and key authors there
Newspaper text:
1. Sámi Instituhtta’s (for the old archive of Min Áigi and Áššu)
2. Áššu has been making a CD since the end of may, there should be a pile there. Børre suggests that they send us the CDs they have, so that we may look at them, and ensure that the routines work, and that we are able to utilize their format.
3. Min Áigi

List of texts with lower priority (to be gathered when the above list is more or less fixed)

the Sámi municipalities,
Authors with smaller production
Textbooks

6. Corpus infrastructure

Do documentation.

Naming conventions and directory structure

We have a decision from Helsinki:

have the same directory structure in all three levels, and we also decided upon the overall structure.
Path forward: Tomi and Trond to implement the directory structure

7. Linguistics

North Sámi

New place names received, should be added to our lexicons
three-part compounds issue still open
Johnny Andersen has written a letter to us on the treatment of Sámi place names, we need a contract with “Norge digitalt”, via UFD. We will have to get such an agreement. This is a task for Thomas and Trond.

Lule Sámi

We do not know when we get the lexicon from Anders Kintel. We need a meeting with him and with Árran in order to coordinate the work.

Status quo on our parser:

All the major POSes have been covered
closed POSes have been checked
compounding and derivation **Thomas has begun with the deverbals

8. Speller infrastructure

aSpell

Write documentation here as well.

Munch-list is working, and the affix file is improving. See [previous meeting memo|/admin/weekly/2005/Meeting_2005-08-15.html].

Issues: #The phonetic file should be systematically looked into. 1. Check that it works 1. Add more correspondences on an impressionistic basis

Start work on collecting systematic spelling errors:
1. Our in-house file typos.txt
2. The soon-to-arrive error texts from newspapers
The holes in the affix list should be mended
We should, at some point, evaluate whether this is The Correct Approach to aspell-type speller building
Affix file UTF-8 problem should be checked and reported.
Then there is the UTF-8 root (or whatever) problem: The work-around using Latin 4 is no solution because of the following latin 4 bug: It treats đ as ð, and thus gives error on “ođđa” (wants “oðða”, this is since in Latin 4 đ and ð are unified (Latin 4). Latin 6 (= ISO/IEC 8859-10) or ISO-IR 197 (the old Dihtorlávdegoddi code page) could be a fix to this.
The clitics issue: Today we have a manually created affix file in order to meet the muncher. Strategy: Do the munching without the clitics, but then enrich the manually-created affix file via a genfisuffix -program in order to get an automatically created suffix + clitic file to add the compiled lexicon to. We will have genfisuff taking affix + clitic and makes it into affix’. Then we use affix to munch but affix’ to spellcheck.
We must create subcomponents under the Speller

MS Office spellers

Nothing new, see previous meeting memo.

OpenOffice.org

From last meeting:

The conversion from aspell to myspell will work trivially as soon as the myspell list becomes smaller.

Issue left open.

Hunspell

Hunspell is presently already working with OOo, and is a much better speller engine, linguistically speaking (can handle compounds much better than Aspell, as well as complex inflection and derivation). For pointers, see the previous meeting memo.

Issue left open.

Other engines

Børre and Sjur had a long discussion with the author of the SFST library/tool set. Next on his priority list is a feature for handling spelling errors in running text analysis. This is principally the same as we want, thus there should be a good opportunity for making SFST into the spelling engine. Sjur has a suggestion on how to implement this feature that may be mailed to the author of SFST.

9. Other

Technical issues

Fixing the machine for the new coworker
The mac os / perl bug (at least Trond and Sjur has it):
- utf8 “\xC4” does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Thor Øivind, Tomi)
- Sjur has a non-solved Backspace + UTF-8 issue
29 open bugs - too much! Have a look at what you can fix.

10. Summary, task list

Børre

Finish crontab specification for the cvs update/export script Tomi made
reopen the jspwiki + UTF-8 issue
Add issue to forrest issue tracker about utf-8 ihtml documents.
Contact Svenska bibelsällskapet
discuss with Anders Kintel about possible cooperation
Follow up on CVS mailing:
- set up Maaren
Meet up with Trond about directory structure
Contact oahpahusossodat and the rest of the SD about texts
Fixing the machine for the new coworker

Maaren

The missing list, both the overall missing list from our xml corpus, and a file-for-file review, in order to get different terminology
Go through the missing list from risten.no
Start working on grammatical issues with Thomas and Trond.

Saara

Get aquainted with the project status quo
Look at the corpus infrastructure issue
Look at the corpus interface issue with Lars

Sjur

risten.no bugs and fixes
complete the action summary after our half-year evaluation
follow up on:
- voice group-chat not working to Sámediggi
- Maaren has problems with SubEthaEdit (can’t connect)
To the board:
- write proposal for permanent maintenance organisation
- write draft specification for the outsourced tasks
- write half-yearly project report with progress and bugdet status
- write agenda
- Deadline for the board tasks: 3 weeks ahead of the meeting (the meeting is October 4th, deadline for the documents is September 13th)
project planning with Trond

Thomas

work on Lule Sami compounding and derivation
Look at Linguistic bugs with Trond.
Work on the name agreement with “Norge digitalt” with Trond

Tomi

Aspell: Continue working on the affix file
three-part compounding
Add downcasing to makefile and CVS
corpus infrastructure: dtd location (both public and internal)
corpus infrastructure: file and dir organisation

Trond

Work on the bug list (Lule Sámi).
Work on compounds (three-part, with Tomi)
Work on the corpus interface (with Lars)
Corpus infrastructure: dtd location
Work on the name agreement with “Norge digitalt” with Thomas
Look at the linguistic aspects of the speller clitics, with
Get the new version of the New Testament
Check Hans-Ragnars names.
New coworker
translate contract
check the new giellatekno site
project planning with Sjur

10. Next meeting, closing

12.09.2005 10:00

Closed at 12:56

Language Technology at UiT

Page Content

Meeting setup

Agenda

1. Opening, agenda review, participants

2. Reviewing the task list from the last meeting

Børre

Maaren

Sjur

Thomas

Tomi

Trond

3. Summing up last week (Meeting + conference)

4. Documentation

5. Corpus gathering

Independent of the contract work

After the contracts are ready

6. Corpus infrastructure

Naming conventions and directory structure

7. Linguistics

North Sámi

Lule Sámi

8. Speller infrastructure

aSpell

MS Office spellers

OpenOffice.org

Hunspell

Other engines

9. Other

Technical issues

10. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

Sitemap