Meeting setup

Date: 23.10.2006
Time: 09.30 Norw. time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09:58.

Present: Børre, Maaren, Saara, Sjur

Absent: Thomas, Tomi, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

corpus collection:
- contact Ája (Kåfjord)
  - They will send us documents on a monthly basis
- discuss access to older Min Áigi and Áššu files with Richard Valkeapää
  - Done …
Move norwegian documents in Min Áigi from sme to nob
- Not done
set up Bugzilla automatic reminders for open issues
- Done!
finish Forrest i18n work (pdf)
set up Tomcat on the G5 for use with eXist and the propnouns db, as well as Forrest in its i18n form
- tomcat is up and running
document SquidMan use at SD
- Available at /doc/infras/ichat-through-firewalls.html
fix bugs!
Other
- Done a lot of work on the Bergen aligner.

Maaren

investigate the generated word form list sent to Polderland - use the command make pl-wordlist TARGET=sme in victorio
- have started working
investigate unrecognised input word forms in the hyphenator

Saara

add more texts to the graphical corpus interface
finalize server of the Xerox tools.
- implemented xml-conversions.
generate parallel corpus files manually (with Trond)
- prepared and analyzed a set of files for alignment
Improve text_cat
- not finalized
export corpus tools to location available to all (with cron), cf news disc.
- not done
Improve hyph-filter.pl
- done, with respect to the case conversion and check for #-
help Trond with some shell commands
- not done
fix bugs!

Sjur

name lexicon:
- refactor SD-terms editor code
  - partly done, refactored and completed the classification editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
move corpus user doc issue to Bugzilla
- done
hire linguist and programmer
finish i18n work of Forrest
install eXist and our local copy of risten.no and propnouns on the G5
investigate unrecognised input word forms in the hyphenator
decide how to specify compounding behaviour info in the lexicon
fix bugs!
other tasks:
- hyphenation and normative improvements to the data delivered to Polderland
- delivered hyphenated data to Polderland

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
find and study all derived verbs in our corpus
suggest which derivations could be generated
investigate unrecognised input word forms in hyphenator
investigate the generated word form list sent to Polderland - use the command make pl-wordlist TARGET=sme in victorio
decide how to specify compounding behaviour info in the lexicon

Tomi

continue implementation of the speller lexicon conversion
- implementing, it will be very nice, maybe too nice and complicated
make generator as server, based on Saara’s code
add lexc2xspell code to cvs
- still moving and renaming codefiles
add hyphenation points to the generated output
fix bugs!

Trond

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
Get more sma texts to improve language recognition
study paragraphs with mixed content
add corpus user accounts and access issues to Bugzilla
investigate unrecognised input word forms in the hyphenator
decide how to specify compounding behaviour info in the lexicon
fix bugs!.

3. Documentation

TODO:

finish i18n work (Børre and Sjur)
- set up Tomcat at the faculty server, and install Forrest as war. Needs testing before being deployed on the server! Thus, install it on the G5 first and see that it works, then on the faculty server.
- i18n does not work in PDF (“Table of Content” won’t translate)
  - nothing
Write both user and admin documentation (Børre, review: Sjur, Thomas)
- move these issues into Bugzilla (Sjur)
  - finally done (see [Bug 348|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=348])

4. Corpus gathering

Børre wrote the renaming script for NSI, but it didn’t work on their computer.

Børre has also talked with Ája. They agreed that they will send monthly all their documents from that last month, including parallell language versions.

He has also been digging more into the SD document hierarchy, to try to find more texts, has added a few more to our corpus repository.

One author has signed the corpus contract last week, and sent us a book: Synnøve Persen. She has sent us our first novel - excellent!

TODO:

discuss corpus transfer with NSI (Børre)
- done
continue to contact authors and text producers (Børre)
- done

5. Corpus infrastructure

User accounts and access

TODO:

add the issue with subissues to Bugzilla (Trond)

More texts to the graphical corpus interface:

TODO:

align texts, analyse, and send to Lars (Trond, Saara)
add text to the server (Lars)

Aligner

Børre has been working hard on the Bergen aligner: fixed compiling errors and cleaned the source - compilation now only produces 15 warnings, as opposed to more than 100 warnings and 74 errors earlier.

The aligner now works somewhat automatically, it handles memory issues gracefully without manual intervention. The command line version is coming, so far it loads the input documents, but it does not start aligning by itself.

We’ll stop working on it for this week, as the fixes already done are very useful in themselves.

Some parallell texts in the corpus are now aligned (but many of the texts believed to be in Norwegian were actually in Sámi).

TODO:

gather parallel texts (Trond)
try out NT alignment strategies (Saara)

Language recognition

TODO:

get more sma texts, first the Bible / NT (Trond)
what about paragraphs with mixed content? Build a corpus of such paragraphs (Trond)

6. Infrastructure

Xerox tools wrapped as servers

Tomi tried to make the generator, but no success. Saara will cooperate with Tomi to make something useful this week.

TODO:

improve and finish the present prototype (Saara)
- improved
add generator to the server setup (Tomi)
- tried

Hyphenator

TODO:

Update the sma hyphenator rule set with the insights gained from smj updates (Trond during weekends)
consider case conversion problems (Saara)
- done
check #- when comparing input string with hyphenated string - the hyphen could be part of the input string, keep it if it is, discard the hyphenated string if not (Saara)
- done
investigate unrecognised word forms (Maaren, Thomas, Trond, Sjur)
- nothing

Automatic Bugzilla reminder for untouched bugs

We now receive reminders for untouched bug reports, once a week.

TODO:

fix the remaining issues (Børre)
- done!

M4

Still anything?

It is problematic for the CG rules, as the rule numbering gets mixed up. The hope is that it is possible to get the same effect in CG3.

7. Linguistics

Names and multilinguality

We need a more principled approach to this.

Background: the name lexicon is getting attention from the SD name/terminology sections, and they would like to use our name lexicon also for public searching.

Observations:

1) Multilinguality is always optional.

2) We can observe that “foreign” names in texts follows a domination pattern: majority language forms can be found in minority language texts as real names (“Kautokeino produkter”), whereas minority language names almost always occur in majority language texts as citations. And citations should not be considered a natural part of the text.

3) When looking at our name classification, multilinguality varies according to:

Ani - weak/none? (pet, myth anim.  names)
Fem - weak (informative)
Mal - weak (informative)
Obj - strong
Org - strong
Plc - strong for the national and country names, weak (informative) for foreign
       names
Sur - none
Tit - strong (titles)

Suggestion:

We need to reconsider the all names in all languages policy. That policy is valid only for Fem, Mal, and Sur (and Ani and Tit?). For Obj, Org, Plc the rule should be that if they have multilingual names, each name should only be used in it’s own language. Then we need a modification saying that majority language names can be included in minority language lexicons if attested in our corpus. Also, the majority language varies according to country (obviously), which means that in a speller context, we might consider tailoring spellers for each country, leaving out noise relating to majority language names from another country.

A further issue is whether we should reconsider our cohort policy. Today, Sur and Plc are different readings. An alternative would be to have them as secondary tags, not in conflict with each other:

"<Trosterud>"
        "Trosterud" N Prop Sur Sg Nom <<< @HNOUN
        "Trosterud" N Prop Plc Sg Nom <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom <Sur> <Plc> <<< @HNOUN
"<Trosterud>"
        "Trosterud" N Prop Sg Nom &Sur &Plc <<< @HNOUN

Derivation and spellers like Aspell

find and study all derived verbs in our corpus (Thomas)
suggest which derivations could be generated (Thomas)
lexicalise the rest (Thomas)

North Sámi

Unwanted word forms:

comparation of -laš derivations (they should not be generated in comparative and superlative)

TODO:

investigate the generated word form list sent to Polderland - use the command make pl-wordlist TARGET=sme in victorio (Maaren, Thomas)

Lule Sámi

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)
hire new linguist (Sjur)

8. Name lexicon infrastructure

Sjur has cleaned a lot of risten.no code during last week.

Decided in Tromsø:

add logging facilities to the interface
add option to download local copies of the lexicon files directly from the db
batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
tag for excluding/including a name from certain applications
future epxansion: choose what info to display in the single language browser
display existing language entries when adding a new language to a record
add editor to change single, existing entries

Details can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

TODO:

develop the needed XQueries and UI (Sjur, Tomi)
turn Tomcat on on the G5; send admin username and password to Sjur (Børre)
- done
add eXist and the proper noun interface to the G5 (Sjur)
- eXist installed.
cvs synching of the risten.no code in eXist (read-only) (Børre)

Postponed:

data synchronisation between risten.no and the cvs repo
new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry

9. Spellers

Speller data generation

Derivations during generation of word forms: how do we generate derivations of input words, and their inflections?

Problem example: how do we get from Oslo to oslolaš?

Input:
Oslo

Output:
Oslo
Oslos
...
oslolaš
oslolaččat
...
(+ other derivations and their inflections)

Oslo ->

Oslo+N+Sg+Nom Oslo+der/laš -> oslolaš

oslolaš+N+Sg+Nom Oslo+N+Prop+Der/laš+A+Sg+Nom

Make a list of all possible derivations (see lexc lexicon files), and try them out one at a time:

if it succeeds, take the new stem, and inflect it
it not, try the next

What about compounding stem? How do we generate it?

TODO:

add code to cvs (Tomi)
implement generator server based on Saara’s code (Tomi)
specify compounding behaviour info for the lexicon (Thomas, Trond, Sjur)
add hyphenation points to the generated output (Tomi)
planning meeting for the word form generator / data conversion script (Sjur, Saara, Tomi)

Automatic testing of the Word spellchecker

Ask MS Word to spell check the open documents, and store all unrecognised words into the user dictionary. Possibly using AppleScript to interact with Word.

We should also ask Polderland whether they have tools for this.

This will only test unrecognised words. We also need to test the suggestions, and misspellings not recognised as such.

TODO:

consider a script for automatic testing (Sjur, Børre)
ask Polderland about testing tools (Sjur)
consider more testing routines (Sjur, Børre)

10. Other

Bug fixing

66 open Divvun/Disamb bugs, and 24 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Meetings and the SD Firewall

TODO:

document SquidMan use at SD (Børre)
- done

How do we set environment variables effective for all users

Look into /etc/environment on victorio - NOT FOUNT on the Mac! (only relevant on the G5)

Task lists as iCal entries

TODO:

update Maaren’s Forrest installation to r430284 (Børre)

Employee seminar in Alta

SD has an employee seminar in Alta in December - should we go there? We’ll discuss it in the next meeting.

11. Next meeting, closing

Next meeting 30.10.2006 at 9:30.

Closed at 11:24.

Appendix - task lists for the next week

Boerre

contact writers who already have received contracts
Move norwegian documents in Min Áigi from sme to nob
finish Forrest i18n work (pdf)
cvs synching of the risten.no code in eXist (read-only)
consider a script for automatic testing of the spell checker
consider more testing routines
update Maaren’s Forrest installation to r430284
fix bugs!

Maaren

investigate the generated word form list sent to Polderland - use the command make pl-wordlist TARGET=sme in victorio
investigate unrecognised word forms in the hyphenator

Saara

add more texts to the graphical corpus interface
finalize server of the Xerox tools.
generate parallel corpus files manually (with Trond)
export corpus tools to location available to all (with cron), cf news disc.
help Trond with some shell commands
fix bugs!

Sjur

name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
finish i18n work of Forrest
install our local copy of risten.no and propnouns on the G5
investigate unrecognised word forms in the hyphenator
decide how to specify compounding behaviour info in the lexicon
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
find and study all derived verbs in our corpus
suggest which derivations could be generated
investigate unrecognised word forms in hyphenator
investigate the generated word form list sent to Polderland - use the command make pl-wordlist TARGET=sme in victorio
decide how to specify compounding behaviour info in the lexicon
fix bugs!

Tomi

continue implementation of the speller lexicon conversion
make generator as server, based on Saara’s code
add lexc2xspell code to cvs
add hyphenation points to the generated output
fix bugs!

Trond

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
Get more sma texts to improve language recognition
study paragraphs with mixed content
add corpus user accounts and access issues to Bugzilla
investigate unrecognised word forms in the hyphenator
decide how to specify compounding behaviour info in the lexicon
fix bugs!.