Meeting setup

Date: 13.11.2006
Time: 09.30 Norw. time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09:50.

Present: Børre, Sjur, Thomas, Tomi, Trond

Absent: Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

contact writers who already have received contracts
- Not done
consider a script for automatic testing of the spell checker in Word
- Not done
consider more testing routines
- Not done
update Maaren’s Forrest installation to r430284
- Not done
sma discussions with SD (with Sjur, Trond)
- Not done
report improvements in aligner back to Øystein
- Cleaned up tca2 to presentable state, sent the sources to him
add a simple password protection to risten.no in the G5
- Not done
consider infra for testing feedback
- Not done
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
- Not done
check corpus contract issue
- Not done
port all i18n work to the main branch (from the i18n branch)
- Not done
update all forrest installations, including local patches
- Not done
fix bugs!
- Not done
other work:
- Cleaned up in forrest. Fixed broken links, helped Trond with gtuit.
- Moved Norwegian texts in Min Áigi
- Searched for documents in the Sami Parliaments archive system
- Worked on the rename script to Richard Valkeapää (NSI). Done.
- Worked on aspell
- Worked on wordlist

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
investigate unrecognised word forms in the hyphenator
create / check the paradigm grammar as exemplified above

Saara

add more texts to the graphical corpus interface
finalize server of the Xerox tools.
- waiting for the grammar for paradigm generator
generate parallel corpus files manually (with Trond)
export corpus tools to location available to all (with cron), cf news disc.
- done by copy_bin.sh and weekly cron script.
help Trond with some shell commands
plan the word form generator / data conversion script
- what was this again..
other:
- Implement upload of multiple files with the same metainfo
  - finally done, should be tested
- create better versions of the bibles
  - done for smj nt, sme nt in progress, see news discussion about the format.
fix bugs!
- done some

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
investigate unrecognised word forms in the hyphenator
- done
decide how to specify compounding behaviour info in the lexicon
- no further discussion
sma discussions with SD (with Børre, Trond)
check why some SUB-marked entries got included in the normative transducer
- not done
remove comparation from -laš derivations
- done
plan the word form generator / data conversion script
- done
consider a script for automatic testing of the spell checker in Word
consider more testing routines
consider infra for testing feedback
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
ask Julie Eira about SD employee seminar
check corpus contract issue
fix bugs!
other:
- done a lot with the sme transducers, especially related to getting control over the derivations

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- nothing done
find and study all derived words in our corpus (with Trond)
- done
suggest which derivations could be generated
- done
investigate unrecognised word forms in hyphenator
- done
decide how to specify compounding behaviour info in the lexicon
- nothing done
check why some SUB-marked entries got included in the normative transducer
- not done
fix bugs!
- worked

Tomi

continue implementation of the speller lexicon conversion
- continues
add hyphenation points to the generated output
- not done
plan the word form generator / data conversion script
- not done
fix bugs!

Trond

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- Had a look at these, but not together with Thomas. Work still to be done.
get more sma texts, first the Bible / NT
- Discussed with Ove Lorentz, not yet Bible
add corpus user accounts and access issues to Bugzilla
fix the corpus tag list in the cwb/ directory
- Discussed corpus tag issues in the disamb team, not ready to revise cwb/ yet
investigate unrecognised word forms in the hyphenator
- Have had problems compiling the hyphenator, concentrated first upon the generator
decide how to specify compounding behaviour info in the lexicon
- Issue still open
sma discussions with SD (with Børre, Sjur)
check corpus contract issue
- Only smj last week, and speller generation automat issues.
fix bugs!.

3. Documentation

TODO:

update all forrest installations, including local patches (Børre)
- did some
port all i18n work to the main branch (from the i18n branch) (Børre)

4. Corpus gathering

TODO:

continue to help NSI to get their corpus (Børre)
- fixed their filename renaming script
get sma Bible / NT texts (Trond)
- still no contact with the priest
Discussions with the Sámi Parliament about sma (Børre, Sjur)
send the filename renaming script to NSI; get their corpus (Børre)

5. Corpus infrastructure

Nothing new on any of the sub-issues.

User accounts and access

TODO:

add the issue with subissues to Bugzilla (Trond)

More texts to the graphical corpus interface:

TODO:

add text to the server (Lars)

Aligner

TODO:

report improvements in aligner back to Øystein (Børre)
- done
gather more parallel texts (Trond)
try out NT alignment strategies (Saara)

Language recognition

TODO:

get more sma texts, first the Bible / NT (Trond)

6. Infrastructure

Xerox tools wrapped as servers

Tomi has modified the server a bit, causing it to NOT work with the perl client at the moment. It should not be a big deal to fix it (three lines?). Saara will fix it.

TODO:

improve and finish the present prototype (Saara)
- done, except for the paradigm generation (needs paradigm grammar)
fix the corpus tag list in the cwb/ directory (Trond)
create / check the paradigm grammar as exemplified above (Thomas)
- not done

Hyphenator

TODO:

make new list of unrecognised forms (Sjur)
investigate unrecognised word forms (Maaren, Thomas, Trond, Sjur)

7. Linguistics

Names and multilinguality

We need a more principled approach to this.

Background: the name lexicon is getting attention from the SD name/terminology sections, and they would like to use our name lexicon also for public searching.

Observations:

1) Multilinguality is always optional.

2) We can observe that “foreign” names in texts follows a domination pattern: majority language forms can be found in minority language texts as real names (“Kautokeino produkter”), whereas minority language names almost always occur in majority language texts as citations. And citations should not be considered a natural part of the text.

3) When looking at our name classification, multilinguality varies according to:

Ani - weak/none? (pet, myth anim.  names)
Fem - weak (informative)
Mal - weak (informative)
Obj - strong
Org - strong
Plc - strong for the national and country names, weak (informative) for foreign
       names
Sur - none
Tit - strong (titles)

Suggestion:

We need to reconsider the all names in all languages policy. That policy is valid only for Fem, Mal, and Sur (and Ani and Tit?). For Obj, Org, Plc the rule should be that if they have multilingual names, each name should only be used in it’s own language. Then we need a modification saying that majority language names can be included in minority language lexicons if attested in our corpus. Also, the majority language varies according to country (obviously), which means that in a speller context, we might consider tailoring spellers for each country, leaving out noise relating to majority language names from another country.

A further issue is whether we should reconsider our cohort policy. Today, Sur and Plc are different readings. An alternative would be to have them as secondary tags, not in conflict with each other:

"<Trosterud>"
        "Trosterud" N Prop Sur Sg Nom <<< @HNOUN
        "Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN

TODO:

separate meeting for discussing this issue, Tuesday 14.11 after lunch (12.30) (Trond, Thomas, Sjur)

Derivation and spellers like Aspell

Done all for sme, needs to be redone for (or ported to) smj.

TODO:

find and study all derived words in our corpus (Thomas and Trond)
- done
suggest which derivations could be generated (Thomas)
- done
lexicalise the rest (Thomas)
- to be done as they are found
redo the work for smj, including discussion regarding Actio (Thomas, Sjur, Trond)

North Sámi

The following words are included in the normative list despite being marked with !SUB:

accompagnerejun
ábuhuvvože
ábuhuvvože      ábuhit+V+TV+Pass+Pot+Prs+Du1
áccohallagođežedne

TODO:

investigate the generated word form list sent to Polderland - use the command make pl-wordlist TARGET=sme in victorio (Maaren, Thomas)
- done, to be redone (many times:-)
check why some SUB-marked entries got included in the normative transducer (Thomas, Sjur)
remove comparation from -laš derivations (Thomas, Sjur)
- done

Lule Sámi

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)
- Trond had a short look at it, needs more work
hire new linguist (Sjur)

8. Name lexicon infrastructure

Decided in Tromsø:

add logging facilities to the interface
add option to download local copies of the lexicon files directly from the db
batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
tag for excluding/including a name from certain applications
future epxansion: choose what info to display in the single language browser
display existing language entries when adding a new language to a record
add editor to change single, existing entries

Details can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

TODO:

develop the needed XQueries and UI (Sjur, Tomi)
add a simple password protection to risten.no in the G5 (Børre)

Postponed:

data synchronisation between risten.no and the cvs repo
new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry

9. Spellers

Speller data generation

Tomi has made further improvements to the lexc2xspell code, and Sjur and Tomi had a short meeting on the feature set of it. We have asked for the full PLX specification to be able to output correct entries.

TODO:

decide how to specify compounding behaviour info for the lexicon (Thomas, Trond, Sjur)
make first version of the PLX data generation (Tomi)
add hyphenation points to the generated output (Tomi)

Automatic testing of the Word spellchecker

It should be possible to write a script that runs texts through Word from the command line, using a combination of shell script and AppleScript. MS Word has the needed AppleScript commands to run the spell checker.

TODO:

consider a script for automatic testing (Sjur, Børre)
consider more testing routines (Sjur, Børre)
consider infra for testing feedback (Børre, Sjur)
get an Intel Mac for testing Windows spellers; get a WinXP license from SD (Børre, Sjur)

Aspell

Børre has worked on the Aspell code, mainly to be able to explain how it works and help out one external, interested person.

Further discussion was forgotten, will be brought up in the next meeting.

TODO:

revitalize the Aspell work (Børre, Sjur, Tomi)

10. Other

Corpus contracts

TODO:

check corpus contract issue in a meeting Wednesday 9.30 (Børre, Sjur, Trond)
publish corpus contracts and infra on NoDaLi-sta (Sjur)

Bug fixing

61 open Divvun/Disamb bugs, and 24 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Task lists as iCal entries

TODO:

update Maaren’s Forrest installation to r430284 (Børre)

Employee seminar in Alta

SD has an employee seminar in Alta 7.-8. December - should we go there? Sjur will ask Julie Eira if we have to go there.

TODO:

ask Julie Eira about SD employee seminar (Sjur)

11. Next meeting, closing

Closed at 10:33.

Appendix - task lists for the next week

Boerre

contact writers who already have received contracts
send file rename script to NSI; get their corpus
consider a script for automatic testing of the spell checker in Word
consider more testing routines
update Maaren’s Forrest installation to r430284
sma discussions with SD (with Sjur, Trond)
add a simple password protection to risten.no in the G5
consider infra for testing feedback
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
check corpus contract issue
port all i18n work to the main branch (from the i18n branch)
update all forrest installations, including local patches
revitalize the Aspell work
check corpus contract issue in a meeting Wednesday 9.30
fix bugs!

Maaren

investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
investigate unrecognised word forms in the hyphenator

Saara

finalize server of the Xerox tools.
help Trond with some shell commands
fix bugs!

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
make new list of unrecognised forms in the hyphenator
investigate unrecognised word forms in the hyphenator
decide how to specify compounding behaviour info in the lexicon
sma discussions with SD (with Børre, Trond)
check why some SUB-marked entries got included in the normative transducer
consider a script for automatic testing of the spell checker in Word
consider more testing routines
consider infra for testing feedback
get an Intel Mac for testing Windows spellers; get a WinXP license from SD
ask Julie Eira about SD employee seminar
check corpus contract issue
meeting discuss names and multilinguality Tuesday 14.11 12.30
redo the derived words work for smj
revitalize the Aspell work
check corpus contract issue in a meeting Wednesday 9.30
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
redo the derived words work for smj
investigate unrecognised word forms in hyphenator
decide how to specify compounding behaviour info in the lexicon
check why some SUB-marked entries got included in the normative transducer
create / check the paradigm grammar
investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio
meeting discuss names and multilinguality Tuesday 14.11 12.30
fix bugs!

Tomi

make first version of the PLX data generation in lexc2xspell
add hyphenation points to the generated output
revitalize the Aspell work
fix bugs!

Trond

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
get more sma texts, first the Bible / NT
add corpus user accounts and access issues to Bugzilla
fix the corpus tag list in the cwb/ directory
investigate unrecognised word forms in the hyphenator
decide how to specify compounding behaviour info in the lexicon
sma discussions with SD (with Børre, Sjur)
check corpus contract issue
meeting discuss names and multilinguality Tuesday 14.11 12.30
redo the derived words work for smj
check corpus contract issue in a meeting Wednesday 9.30
fix bugs!.

Meeting setup

Agenda

1. Opening, agenda review, participants

2. Updated task status since last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

4. Corpus gathering

5. Corpus infrastructure

User accounts and access

More texts to the graphical corpus interface:

Aligner

Language recognition

6. Infrastructure

Xerox tools wrapped as servers

Hyphenator

7. Linguistics

Names and multilinguality

Derivation and spellers like Aspell

North Sámi

Lule Sámi

8. Name lexicon infrastructure

9. Spellers

Speller data generation

Automatic testing of the Word spellchecker

Aspell

10. Other

Corpus contracts

Bug fixing

Task lists as iCal entries

Employee seminar in Alta

11. Next meeting, closing

Appendix - task lists for the next week

Boerre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

Sitemap