Meeting setup

Date: 12.12.2005
Time: 09.30 Norw. time
Place: Wherever we are :-)
Tools: iChat, SubEthaEdit, phone

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Linguistics
name lexicon infrastructure
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09:47.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Main secretary: Tomi

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Gather public texts
Continue converting text from input format to our xml
Document the corpus directory structure
Ask Thor-Øivind to move bugzilla to our new webserver
- … or make the URL [http://giellatekno.uit.no/bugzilla] point to the present installation
  - He is back from a journey today, !’ll contact him
install new XXE and the new XXE Forrest config for Ilona
- Haven’t seen her online
divvun.no and giellatekno.uit.no:
- Binary files download area
  - Done
- Continue the conversion to static site, using our own script.
  - The script is done, has to be put into cron and Thor-Øivind will have to setup the urls. Test pages are up on [http://129.242.176.166/~sd] and [http://129.242.176.166/~gtuit]
corpus xsl files under version control
- Not done
review the convert2xml.pl script, what works well and what doesn´t.
- Some done
comment review template made by Saara
- Not done, but it was excellent!
fix bugs!
- Fixed bug 227
Cleaned up the Lule Sámi New Testament, added it to the corpus and sent it to Dave van Groothest who will add it to paratext

Maaren

work with risten.no **some done
find decisisons/documents regarding syllable shortening in compounds in the norm; send them to Thomas **not done (didn´t find any documents)
review corpus usage documentation **?? More today
discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February
- discussed with SGL (Laila)

Saara

Look at the corpus infrastructure issue
Look at the corpus interface issue with Lars
Convert the name lexicon from present format to xml
- pending
discuss the new lexicon format in the newsgroup
- not done
document corpus infrastructure, your own parts
- not done
corpus infrastr. user docu review: make a template, and post it in the newsgroup
- done for the web interface, other parts missing
Language detection for the corpus files
- work in progress
Implement addition of tags to the converted corpus files
- the script is ready after some testing with documents that really contain some hyphenated words
Do some testing for bug #211
- not done
update the convert2xml script according to the comments
- done, needs to be documented
review corpus upload user interface
- done some, I think Tomi has to do some necessary fixes before the real review.
corpus xsl files under version control
- not done
xml2lexc update to handle complex names: construct entries like we have now from the different parts of a complex name entry
- not done
update preprocess to handle inflected forms of complex names
- Not done
fix bugs!
- Done some
Make preprocess faster
- Is now almost as fast as can be with Perl. Should be fast enough, though.

Sjur

Lule Sámi twol problems, look again at the sets definition with Thomas and Trond
- nothing
follow up on voice group-chat not working to Sámediggi
- Marratech: Sámediggi has bought a new server for this, but it is very unclean when it will be ready for service. Video quality is BAD, but hopefully voice quality is fine, and will make it possible for us to conduct online group chats with voide
project planning with Trond, continued
- also look at the development processes - specification and testing
  - nothing
Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
  - Finally received an answer - he had forgotten about it
Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
  - nothing
write a background document on the corpus contracts
- nothing
discuss kvensk project support
- nothing
proper names:
- discuss lexicon format
  - nothing
- discuss implementation with Tomi
  - nothing
write public tender documents
- nothing
hyphenation in corpus docs
- discussed in news group
buy:
- new computer
  - done (more further down)
Investigate the never-arriving backpacks
- done, they’re all in Guovdageaidnu
update the SD contract version
- not done
send SD corpus contract to lawyer
- not done
review corpus upload user interface
- this week
comment review template made by Saara
- commented Saara’s proposal
fix bugs!
- commented some

Thomas

work on North sámi compounding and derivation **worked and still working
review corpus upload user interface
- done
review corpus usage documentation **häh?! More on this today.
fix bugs!
- presented a solution to Bug 77

Tomi

Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
  - Not done
corpus infrastructure: dtd location (both public and internal)
- Not done
Document aspell and corpus infrastructure
- Not done
Specification for new catxml in C++
- this includes also placing the source and binary
  - Not done
new proper name lexicon
- remove last part of complex names not used as simplex names
  - Started
- start looking at conversion of the name lexicon from present format to xml
  - Nothing
- discuss the new lexicon format in the newsgroup
  - Nothing
- Look into synchronisation of proper names with risten.no
hyphenation in corpus docs
- Nothing
corpus xsl files under version control
- not done
comment review template made by Saara
- Haven’t really anything to say to it, than it is good :)
fix bugs!
- Fixed some

Trond

Look at the three-part compound issue
- Not done
Work on the CG-related bugs on the bug list (7 open) (numeral related ones postponed.
- Not done
project planning with Sjur, continued
- No serious planning.
discuss kvensk project support with Sjur
Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
- Not done
document corpus infrastructure, your part
- Not written documentation
update the SD contract version
- Not done
send SD corpus contract to lawyer
- Not done
review corpus upload user interface
- Not done
review corpus usage documentation
- Not done
comment review template made by Saara
- Not done
discuss the new lexicon format in the newsgroup
fix bugs!
Other:
- KUNSTI meeting, disambiguation, corpus format. webadr.fst.

3. Documentation

Documentation tasks:

Add documentation on our corpus infrastructure and our corpus work in general (Børre, Tomi, Trond, Saara). For the basic corpora, we need 2 additional types of documentation, or doc for 2 target groups:

For the users/linguists: What corpus are found, how do I use them (this info is now scattered) (Part of the HOWTO USE is documented in the catxml docu, what documents are found where etc + an overall documentation is not written, since the corpus is so sparsely populated)
1. catxml done, which is what is needed mostly. Do we need more?
2. we need a review: Thomas, Maaren, Linda, Ilona, Trond
For the collectors: How do I add texts, where do I add them, how do I convert them (this is (partly?) done in the Corpus Conversion document)
1. we need a review of the web interface for corpus uploading - what is still missing?
  1. Review: Sjur, Saara, Trond, Thomas

Review setup and reporting to be posted to the newsgroup, possibly a summary in our documentation.

Saara has made a review template, final version ready today.

Deadline for reviews: by next meeting.

test:

add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
as always: document what you’re doing:-) (all)

Tomcat->static HTML progress

Almost done. Thor Øivind is back now, and will help with the last URL fixes.

Deadline: in operation by next meeting.

4. Corpus gathering

Contracts

Next step:

Make our versions of the updated Helsinki contracts, and make sure they are according to our intention. (Sjur and Trond)
send them to the SD lawyer and to the University lawyer through formal channels. (Sjur and Trond)

Contract 1 should have the main priority (contract 2 for Trond).

Deadline: This should be finished this week.

Collecting

Børre has completed the HTML version of the Lule Sámi New Testamente.

Trond discussed with the editor of Odin, we will get a direct contact with him on routines for text delivery.

5. Corpus infrastructure

Updated task list:

Include the xsl files under version control
1. RCS to be used, Saara to include it in the (upload) processing of new corpus files, as well as documentation (Børre to review)
Incorporate language detection as part of the corpus processing (Saara)
1. The tool needs better training material.
we need a way to deal with hyphenated documents (documents with (manually) inserted hyphenation marks) in catxml/preprocess.
1. Saara has made a hyphen detection script that tries to discriminate between real hyphenation and other uses of hyphens (recall that only real hyphenation should be tagged ).

Examples of false positives (hard cases) - these should not be converted:

teknihkalaš<hyph>luonddudieđalaš
Norplus<hyph>prográmma
Mjøs<hyph>lávdegotti
norgga<hyph>ruoŧa
dánska<hyph>norgalaš

Precision, Recall, a repetition

 False positives: Hyphens that should be kept as is
 False negatives: Soft hyphens not recognised.

tp = true positives, fp = false positives
tn = true negatives, fn = false negatives

P = tp/tp+fp and R = tp/tp+tn

P = (number of real hyphens detected) / (number of hyphens found)
R = (number of real hyphens detected) / (number of real hyphens in the text)

If we pick only hyph at line end, then the number of false positives will drop. So will the number true positives…

we need to review whether only automatic hyphen detection is good enough, or whether manual post-processing in some form is needed. Delayed until we have some results to base the review on.
Acceptable results: 90% of all real hyphens correctly tagged.

6. Linguistics

North Sámi

three-part compounds issue still open
- look at Lule Sámi, but apply it to second-parts only
  - Thomas is finished with three-part compounds for North Sámi. On the negative side, we note a compilation time of 10m for the sme parser as a whole.
- the exact rules for when shortening happens should be documented (Maaren will send the normativity decisions to Thomas)
  - were not found any
  - Nickel has something, but our corpus, an article by Pekka Sammallahti and Konrad Nielsen contradicts Nickel
  - meeting in the SGL this week (old members) - Thomas to send the compound shortening issue to them as well.
- linguistic analysis/discussion to continue in the newsgroup
- we should include the members-to-be of the Sámi Giellalávdegoddi (SGL) in this discussion, but how? We need a common meeting / normativity seminar / seminar on proofing issues and on language technology in a broader sense with them. Then we also need to open the linguistic issues newsgroup we talked about earlier, and set up their computers and teach them to use the newsgroup. This seminar needs to include the Lule Sámi section of SGL (others are welcome too). Timetable:
  - The Sámediggiráddi is going to appoint new members before Christmas, the meeting is 13.12. (Johan Mikkel Saara is asking people)
  - Contact the SGL when it is elected, in December, and ask them to arrange a normativity seminar, with invited keypersons.
  - seminar in February?
- TODO list:
  - Maaren to discuss with relevant persons on this issue.
number project still open (see below)
diphthong simplification/G3 issue should be carried over from Lule Sámi.
- TODO:
  - Writing the rules (copy from smj, adjust) (Sjur, Thomas and Trond)
  - Bug 77 - clearify whether háliid is acceptable - it is found, thus we need to be able to analyse it, but only as !SUB:

(negative of háliidit and gen/acc of mii etc.)
the "in háliit"/"in ??" and maid/*mait and guliid/*guliit
d>t only for lexical stems, not for suffixes and closed class words.
Since we do not have any suffix boundary symbol %>, this is difficult.

hum-tf4-ans175:~/gt/sme trond$ kwic-snt 'h.liid ' corp/*
h.liid
hum-tf4-ans175:~/gt/sme trond$ kwic-snt 'h.liit ' corp/*
h.liit
ielddaválggat, ja NSR:ii ges Sámediggi. Utsi ii háliit dasa dadjat maide, go dál
                     ollašuvvan. Okta gielda ii háliit oassálastit barggus danne
                     ollašuvvan. Okta gielda ii háliit oassálastit barggus danne
aid maid Suodjalus loahpaha. Muhto go stáhta ii háliit oastit, de Suodjalus beas
                   loahpaha. Muhto go stáhta ii háliit oastit, de Suodjalus beas
ččii boahtteáiggi sámi servodaga. Muhto mii eat háliit ruovttoluotta dološáigái.
                  sámi servodaga. Muhto mii eat háliit ruovttoluotta dološáigái.
                                         Dál ii háliit šat joatkit dan birra ság
                                         Dál ii háliit šat joatkit dan birra ság

Status quo after Thomas' last bug fix:
háliit  háliidit+V+TV+Ind+Prs+ConNeg
háliid  háliidit+V+TV+Ind+Prs+ConNeg   <==== should this one be allowed?
maid    maid+Interj
mait    mait    +?

Lule Sámi

Open tasks:

Derived G3 that looks like G2 are still open.
- Thomas, Trond and Sjur should meet again to finish the rest.

Numerals

The following North Sámi linguistic issues should be settled before going into the numeral project:

Three-part compounds
Diphthong simplification
Derivation

These issues are recently done in Lule Sámi, and it is more efficient to complete them in North Sámi directly thereafter instead of beginning a new topic

Numeral treatment is on different level in the existing sme and smj parsers, but the issue itself is common to the two langauges, and should therefore be treated in parallel.

Numerals in North Sámi: Inventory is listed elsewhere.

Numerals in Lule Sámi: There are 70 lines of code setting up the structure for case inflection of basic numerals.

7. Name lexicon infrastructure

Summary: see the newsgroup

Complex names

Task list for this issue:

find eventual unique second-parts (B-parts of names that do not exist in isolation): Take the 1.126 version, remove the non-last parts, and compare the B-parts of proper-complex.xml with the simplex forms in 1.126. Non-hits should then be removed from the current propernoun-sme-lex.txt.
- Postponed till next week
remove these B-parts from the ordinary name file (Tomi)
- New York => York, Los Angeles =/=> *Angeles
make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (Saara)
- Tomi is working on a C++ version of the xml2lexc tool. He will put it into CVS soon (it is actually catxml, can easily be adopted to the xml2lexc requirements)
the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (Tomi or Saara, but only when the technical details are settled)
The core issue: The preprocess script can handle uninflected (A + B) compounds, but not (A + inflected B). Saara will add the analyzer as part of the preprocess, to be able to handle inflected cases as well.

XML format

Basis for the progress: separate documents according to project:

Containing one common concept or name id field, plus Divvun/Disamb fields only (based on our old proposal)
Containing one common concept or name id field, plus kvensk fields only (based on the field list from Irene)
common document containing one common id field, plus fields common to several bases

Discussion to continue in the newsgroup. Sjur will post a draft XML structure based on the above.

Tasks:

testing of conversion
continue the discussion of the name lexicon format (Saara, Tomi, Sjur, Trond)
implement a prototype in eXist
eXist as editor:
1. develop the needed XQueries and interface
2. synchronisation between risten.no and
3. test whether eXist as editor is actually working well

8. Other

New server

Ordered a new computer last Friday: Quad G5 (PowerMac), with 30” screen. Will be placed in Tromsø. Usage:

video conferencing
lexicon compilation
other heavy processsing
possibly our own Bugzilla there? Yes, we will try that, it is frustrating to rely on others that don’t have time for us
other server services?

Code conventions

Suggestions from Sjur:

80 char linelength (source code, meeting memos)
- twol-smj.txt has 140 - we need to allow some exceptions
do never change whitespace as part of other changes, whitespace cleanup should be checked in separately, with a very clear commit message
- This is nice
- inform all others before, to allow all to check in all changes. Then clean, commit, and tell the others that you are ready.
lexicon sorting is a BAD THING for cvs and change tracking/diffing
- a. sort or arbitrary order b. if sorting at all, alphabetic or reverse
- This is bad for cvs, but good for the linguistic work, and should be done, now and then. The question is whether we should have sort or reversed sort.
- Shall we have shell sort or emacs sort-lines? Note: We do not sort the whole file, but arbitrary parts of the file. Today: propernoun, noun, verb and adj are reverse-sorted with the shell rev | sort | rev command, all other sorted lexica are sorted alphabetically, with the sort-lines command in emacs. Earlier practice: I sorted every time I thought the lexicon was not sorted, or every time I had the time to do that.
  - Decision: We are allowed to add entries both where they belong and where they do not belong (although it is preferred to add the new entry where it belongs). Then we sort often, and we keep everything sorted alphabetically. We avoid reverse sorting in the cvs unless there are large collective projects working on the sublexicon structure (and these should be decided separately)
XXE is fine for larger documents, but imposes it’s own XML serializing conventions, thus not only content changes pop up in diffs, also whitespace changes appear and distract when reviewing version histories.
- Convention: when making smaller adjustments to a text, use a text editor like SubEthaEdit or Emacs in text mode. Use XXE or similar only for larger changes.

These conventions decided upon, and should be used from now on.

Technical issues

The mac os / perl bug (at least Trond and Sjur has it, [Bugzilla #211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]):
- utf8 “\xC4” does not map to Unicode at /Users/trond/gt/script/preprocess line
  1. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Saara, Tomi)
- Test: the result of the last line should indicate whehter this is a problem in cat or a Perl/OS mismatch.
```
 preprocess file_name.txt OK
 cat file_name.txt | preprocess !!
 catxml file_name.xml | preprocess ??
```

Video conferencing across firewalls

A new server is bought, but it is open when it will be installed and useable. It may take quite some time, and the video quality will be low, based on a simple test. But the most important thing is group voice chat across the SD firewall, which probably will work quite ok.

Bug fixing

28 open bugs (and 25 risten.no bugs)

Move Bugzilla

Move Bugzilla to our new server when it arrives, and make it work at both the old URL [http://giellatekno.uit.no/bugzilla/] as well as a similar one based on the name of the new server (e.g. something like [http://project.divvun.no/bugzilla], where the ‘project’ part is still open for discussion) (Thor Øivind and Børre).

risten.no

It is back on the air since last Wednesday, with a small correction on Thursday.

Rugsacks

They are all in Guovdageaidnu. Tomi can pick them up after Christmas on his way back from vacation.

9. Summary, task list