Meeting setup

Date: 18.12.2006
Time: 09.30 Norw. time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from last week
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09:56.

Present: Børre, Saara, Sjur, Thomas, Tomi

Absent: Maaren, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

contact authors who have already received the corpus licensing contract
- not done
continue work on script for automatic testing of the spell checker in Word
- not done
sma discussions with SD (with Sjur, Trond)
- not done
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- not done
recreate our forrest tarball
- not done
update setup and installation instructions for new users/computers
- not done
fix bugs!
- not done
other
- set up internationalized forrest on our webserver
- worked on fixing a bug in the aligner

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio

Saara

help Trond with some shell commands
- not done
re-analyze parallel files
- decided to analyze some other files instead of these
consider implementing some new features to the corpus files
- not finished
write some Perl documentation
- done
vislcg as server, possibly as feature request to the vislcg devs
- not done
other
- implemented and tested fast handling of multi-word expressions to preprocessor
- number inflection etc. added to the preprocessor
- prepared corpus files for the move to the interfaace
fix bugs!

Sjur

name lexicon:
- refactor SD-terms editor code
  - worked on ideas and requests from the Alta meeting
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
decide how to specify compounding behaviour info in the lexicon
- one more try at moving this forward, not finished
sma discussions with SD (with Børre, Trond)
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
publish corpus contracts and project infra on NoDaLi-sta
fix forrest installations for Maaren, Disamb
fix bugs!
other tasks:
- some work on the smj twol

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- nothing this week
decide how to specify compounding behaviour info in the lexicon
- nothing this week
fix bugs!
- fixed some lule-sámi that were discovered when testing speller

Tomi

add closed POS and clitics to PLX generation
- worked with this one
add derivations to the PLX generation
- not done
add compound stems to the PLX generation
- not done
fix bugs!

Trond

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
get more sma texts
decide how to specify compounding behaviour info in the lexicon
sma discussions with SD (with Børre, Sjur) fix bugs!.

3. Documentation

Børre added the new, fully i18n-ed documentation to our public site.

TODO:

either fix forrest installations (Sjur), or create a new tarball (Børre)
cvs up of the public server should be done for xtdoc/sd/documentation/ (Børre)

4. Corpus gathering

Sjur has received a heap of Bible files from Pia. Børre will add them to the corpus.

TODO:

get sma Bible / NT texts (Trond)
- done
discussions with the Sámi Parliament about sma (Børre, Sjur, Trond)
- done (by Sjur)

5. Corpus infrastructure

Aligner

The aligner produces empty output - not so useful:-) Børre has been working on fixing this bug.

TODO:

gather more parallel texts (Trond, Børre)
re-analyze parallel files using the command-line version (Saara)

6. Infrastructure

Xerox tools wrapped as servers

TODO:

find a way of integrating vislcg as a server, or send a feature request to the vislcg developers (Saara)
- move this to Bugzilla

7. Linguistics

Names and multilinguality

TODO:

finish first version of the editing (Sjur)
test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
start to use the xml file as source file
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
publish the name lexicon on risten.no (Sjur)
add missing parallel names for placenames (linguists)
add informative links between first names like Niillas and Nils (linguists)

North Sámi

The latest change in sme-lex.txt:

 +N+SgNomCmp:   R ; ! gahpirg√°nda, ƒçoahkkinordnet
 +N+SgNomCmp:X7 R ; ! gahperg√°nda, ƒçoahkkenordnet

How much will this overgenerate, would it be better to have two different lexicons, or lexicalise exceptional compounding? (GAHPIR has 2329 members…)

Command to extract the relevant parts of GAHPIR words:

grep GAHPIR noun-sme-lex.txt | cut -d":" -f1 | cut -d" " -f1 |
cut -d"#" -f3 | cut -d"#" -f2 | rev | sort | uniq | rev | l

One possibility is to split GAHPIR into three lexica:

vowel lowering (X7)
no vowel lowering
both for the same lexeme

Another possibility could be to write two-level rules, if lowering/non-lowering follows a certain pattern.

TODO:

go through the class of GAHPIR words, and try to generalise the compounding behaviour (Thomas)
change whatever is needed based on the above generalisation (Thomas, Trond)

Lule Sámi

A lot of work has been done on the sme name lexicon, the smj copy should be updated. Nothing new on the smj proper noun lexicon itself..

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)
update proper noun lexicon with copy of sme lexicon (Trond)

8. Name lexicon infrastructure

Decided in Tromsø:

add logging facilities to the interface
add option to download local copies of the lexicon files directly from the db
batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
tag for excluding/including a name from certain applications
future epxansion: choose what info to display in the single language browser
display existing language entries when adding a new language to a record
add editor to change single, existing entries

Details can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

TODO:

develop the needed XQueries and UI (Sjur, Tomi)

Postponed:

data synchronisation between risten.no and the cvs repo
new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry

9. Spellers

Alpha version

Was received one and a half week ago, and contains spellers for sme and smj, as well as a sme hyphenator. The proofing files can be had from Sjur.

Polderland data generation

TODO:

decide how to specify compounding behaviour info for the lexicon (Thomas, Trond, Sjur)
- new meeting Tuesday Dec. 19, after lunch
add closed POS and clitics to PLX generation (Tomi)
- has been working, not yet finished
add derivations to the PLX generation (Tomi)
add compound stems to the PLX generation (Tomi)

Aspell

TODO when the major part of the PLX conversion is done:

add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

TODO:

get an Intel Mac for testing Windows spellers; get a WinXP license from SD (Børre, Sjur)
- WinXP is sent from SD/Leif Åge to Sjur and Børre.

10. Other

Corpus contracts

TODO:

publish corpus contracts and project infra on NoDaLi-sta (Sjur)

Bug fixing

57 open Divvun/Disamb bugs, and 23 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

New Perl modules

TODO:

write Perl module dependency documentation (Saara)
- done
update setup and installation instructions (Børre)

11. Next meeting, closing

The next meeting is 3.1.2007, 09:30 Norwegian time.

The meeting was closed at 11:09.

Appendix - task lists for the next week

Boerre

contact authors who have already received the corpus licensing contract
continue work on script for automatic testing of the spell checker in Word
recreate our forrest tarball
update setup and installation instructions for new users/computers
create new forrest tarball
cvs up of the public server should be done for xtdoc/sd/documentation/
fix bugs!

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio

Saara

help Trond with some shell commands
re-analyze parallel files
Move to Bugzilla: vislcg server-friendly as feature request to the vislcg devs
fix bugs!

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
decide how to specify compounding behaviour info in the lexicon
get an Intel Mac for testing Windows spellers
publish corpus contracts and project infra on NoDaLi-sta
fix forrest installations for Maaren, Disamb
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
decide how to specify compounding behaviour info in the lexicon
go through the class of GAHPIR words, and try to generalise the compounding behaviour
change whatever is needed based on the above generalisation
fix bugs!

Tomi

add closed POS and clitics to PLX generation
add derivations to the PLX generation
add compound stems to the PLX generation
fix bugs!

Trond

update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
Go through the GAHPIR lexicon (with Thomas)
get more sma texts
decide how to specify compounding behaviour info in the lexicon
fix bugs!.