Meeting setup

Date: 12.03.2007
Time: 09.00 Norw. time
Place: Internet
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from last week
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 10:17.

Present: Børre, Sjur, Steinar, Thomas, Trond

Absent: Maaren, Saara, Tomi

Agenda accepted as is.

2. Updated task status since last meeting

Børre

write form to request corpus user account
- not done (can be delayed till after beta)
document how to apply for access to closed corpus, and details on the corpus and its use in general
- not done (can be delayed till after beta)
update and fix our documentation and infrastructure as Steinar finds problem areas
- begun with it
continue work on script for automatic testing of the spell checker in Word
- not done
  - Sjur did it
fix sme texts in corpus this month
- not done (can be delayed till after beta)
find missing nob parallel texts in corpus
- not done (can be delayed till after beta)
add prefixes to the PLX conversion
- not done
add middle nouns to the PLX conversion
- not done
improve number PLX conversion
- not done
go through other directories, fix parallellity information for other documents
- not done
add sma texts to the corpus repository
- not done (can be delayed till after beta)
add info to front page (incl. download links)
- done
write separate page with detailed info (incl. download links)
- done
Improve automatic alignment process
- not done
Store the tested texts, for reference
- not done
Add potential speller test texts
- not done
Set up ways of adding meta-information to speller test docs
- not done
get an Intel Mac for Tomi
- ordered
collect a list of PR recipients, forward to Berit Karen Paulsen
- not done
add version info to the generated speller lexicons
- not done
run all known spelling errors in the corpus through the speller
- not done
test the typos.txt list, and check that all entries are properly corrected
- not done
consider how to do a regression self-test
- not done
fix bugs!

Maaren

lexicalise actio compounds
Manually mark speller test documents for typos

Saara

continue aligning the rest of the parallel files
prepare more files for manual alignment
update lexc2xml with comment field
- done
start improving the corpus interface for Sámi in Oslo.
Set up corpus directories for proofing test documents
- done
Mark-up the added speller test texts, using our existing xml format
- infrastructure is ready, the files?
fix bugs!

Sjur

name lexicon:
- refactor the rest of the SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
  - postponed till after beta release
hire linguist
- planning interview
fix stuorra-oslolaš lower case o
- not done (postponed till after beta)
write form to request corpus user account
- not done (postponed till after beta)
document how to apply for access to closed corpus, and details on the corpus and its use in general
- not done (postponed till after beta)
write press release for the beta
- nothing more done since first draft
get speller test tool from Polderland
- got them, but they were buggy - Polderland is working on a fix
Set up ways of adding meta-information to speller test docs
- not done, but speller test docs are in principle just regular corpus docs, thus the same tools could be used
collect a list of PR recipients
- not done
add version info to the generated speller lexicons
- not done
run all known spelling errors in the corpus through the speller
- not done, we’re not ready for that yet
test the typos.txt list, and check that all entries are properly corrected
- ran typos.txt (first column) through the speller - found many slip-throughs
- also ran the second column of typos.txt through the speller, and found several holes (or errors in the typos.txt file)
consider how to do a regression self-test
- been thinking, but no action so far
fix bugs!
other issues:
- wrote an AppleScript to run texts through the speller via MS Word itself, and wrote some initial Makefile targets to run the typos.txt file through it
- discussed derivation conversion with Tomi
- started discussing improvements to the conversion - presently it is way too slow - almost 1,5 months to convert all verbs! Right now (at the beginning of the meeting), the verb file is parsed and converted up to the verb anástaddat (line 446 of 14 723 lines of verbs). It has been running since Friday afternoon:-( Those 446 lines of lexc entries (and not all lines are entries) are converted to 1 443 714 lines of PLX code.

Steinar

Beta testing: Align manually (shorter texts)
Manually mark speller test texts for typos (making them into gold standards), add the texts to a certain directory
- some texts finished (marked typos) but not added to a directory yet
Infrastructure test: add report to gt/doc/infra/, probably as infrareport.jspwiki
- the report is hopefuly there very soon (if not I will contact the technisian)
Complete the semantic sets in sme-dis.rle
- no work this week
missing lists
- no work this week
Look at the actio compound issue when adding from missing lists
- not done
Align corpus manually
- not done
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- done
work with compounding
- worked and still working
Lack of lowering before hyphen: Twol rewrite.
- not done
fix stuorra-oslolaš lower case o
- not done (delayed till after beta)
translate beta release docs to sme and smj
- not done
Add potential speller test texts
- not done
fix bugs!
- all the time

Tomi

add derivations to the PLX generation
- done
make PLX conversion test sample; add conversion testing to the make file
improve number PLX conversion
update ccat to handle error/correction markup
add version info to the generated speller lexicons
fix bugs!

Trond

Participate in the beta testing setup
- done
Test the beta versions
- done
Work on the parallel corpus issues
- done
- Discuss with Anders
  - done, i.e., with Lars
- Work on the aligner with (Børre)
  - Not done.
- fix sme texts in corpus this month
  - Not done.
- find missing nob parallel texts in corpus, go through Saara’s list
  - Not done.
Postpone these tasks to after the beta:
- update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
- Go through the Num bugs
Improve automatic alignment process
- Not done.
Align corpus manually
- Not done.
Store the tested texts, for reference
- Not done.
Add potential speller test texts
- Not done.
collect a list of PR recipients
- Not done.
fix bugs!.
- Not done.

3. Documentation

The open documentation issues fall into these three categories:

Beta documentation for testers
Documentation for the online corpora
General documentation improvement after Steinar’s test (for open-source release)

TODO:

write form to request corpus user account (Børre, Sjur, Trond)
document how to apply for access to closed corpus, and details on the corpus and its use in general (Børre, Sjur, Trond)
correct and imrove it based on feedback from Steinar (Børre)
beta documentation (see separate beta section below)

4. Corpus gathering

TODO:

sme texts: no new additions, fix corpus errors during this month (Børre, Trond, Saara)
missing nob parallel texts should be added if such holes are found (Børre, Trond)
Go through the list of missing or errouneous nob texts, based upon Saara’s perfect list (Børre, Trond)
add sma texts to the corpus repository (Børre)

5. Corpus infrastructure

Trond has been in Odense, in the Panorama meeting. The goal of Panorama is to create Nordic parallel texts. It will dictate some of the focus of the UiTø work ahead. All corpora will use the UiO interface, but stored locally.

Alignment

TODO

go through other directories (nob dicrectories, sd directories), fix parallellity information for other documents (2 hours) (Børre)
Improve the automatic process:
- Improve the anchor list and realign (Trond, Børre)
- Only adding words does not improve alignment, you have to consider the format as well. If you cut with star e.g guo* too early, wrong word can be selected.
- The documents have still some formatting issues which cause trouble in alignment. (In some documents tables are included to the text, some not, etc.)
- Test and improve settings in the aligner
Align manually (Trond, Steinar) (especially shorter terminological texts) good idea. Saara will look for troublesome texts and prepare them for manual alignment.

6. Infrastructure

TODO:

add report to gt/doc/infra/, probably as infrareport.jspwiki (Steinar)
update and fix our documentation and infrastructure as Steinar finds problem areas (Børre)

7. Linguistics

North Sámi

TODO:

lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji (Maaren)
fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
- postponed till after the public beta

Lule Sámi

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in [this meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

TODO:

fix bugs in lexc2xml; add comments to the log element (Saara)
finish first version of the editing (Sjur)
test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
implement data synchronisation between risten.no and the cvs repo, and possibly other servers (ie the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way)
start to use the xml file as source file
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
publish the name lexicon on risten.no (Sjur)
add missing parallel names for placenames (linguists)
add informative links between first names like Niillas and Nils (linguists)

9. Spellers

OOo speller(s)

TODO after the MS Office Beta is delivered:

add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

Different ways of testing

Impressionistic, functionality: try the program, try all the functions
Impressionistic, coverage: try the program on different texts, look for false positives
Systematic (in order of importance):
1. Make a corpus of texts, from different genres (can be done before 0.2 release)
  1. For each text, detect precision
  2. For each text, detect recall
  3. For each text, detect accuracy

Before beta release: precision is important, but have a look at recall as well.

Definitions

tp - true positives (correctly recognised misspellings)
fp - false positives (correct words errouneously marked as misspellings)
fn - false negatives (misspellings not recognised by the speller)

Recall and precision

precision = tp / ( tp + fp ) = true redlines / all redlines
- can we trust that the redlines are actually errors?
- Task: check all hits
- (test p, are they tp or fp?)
recall = tp / ( tp + fn) = true redlines / all errors in doc
- can we trust that all errors are actually found?
- Task: check every single word
- (test p, are they tp or fp, test n, are they tn or fn?)
accuracy = tp + tn / tp + fp + fn + tn = overall performance

Precision and recall testing

A testbed has been set up (Trond), and some texts are marked for errors and corrections (Steinar). Versions alpha, beta 0.1 and beta 0.2 have been tested.

Types of tests:

Technical testing
Testing for linguistic functionality
Testing for lexical coverage
Testing for normativity
Testing the suggestions

The tester should identify these 4 values:

wds - number of words in the text
tp - correctly identified errors
fp - correctly written but marked as errors
fn - errors not marked as such

The spreadsheet will then calculate precision, recall and accuracy. Steinar has marked some texts like this: Errors are makred§marked with paragraph number followed by the correct form. Way of finding precision: Take out the § entries and evalate them for tp and fp. Way of finding recall: Remove the § entries and count the fn among the remaining words. Then fill in and collect results.

Testing of suggestion should follow the same lines:

errs - number of errors in the text
tp - the intended word is among the suggestions
fp - the intended word is not among the suggestions
fn - no suggestions
tn - (not relevant??)

Ordering of suggestions:

place in the list of the intended correction
- ordered first
- ordered top-five
- ordered below top-five

“Perceived Quality”, ie for all recognised errors/tp:

number of correct suggestions at top
number of correct suggestions among top-five
number of correct suggestions below top-five

Testing on unseen texts

We need to use unknown texts in order to measure the performance of the speller.

Regression tests

We need to ensure that we do not take steps backwars, ie all known spelling errors in the corpus should be correctly identified, with a proper suggestion among the top five. For this purpose we can use the regular corpus with correction markup.

We also need to regression test the PLX conversion. In principle this is easy - just send the full word-form list (as generated by make wordlist TARGET=sme) through the speller. None should be rejected - any word form rejected is in principle a regression. In practice, this is not that easy, since the word list is so huge. We have to investigate alternatives for this testing.

TODO:

add extraction of all known spelling errors in the corpus (not the prooftest corpus), and check that they are properly corrected (Børre, Sjur)
test the typos.txt list, and check that all entries are properly corrected (Børre, Sjur)
consider how to do a regression self-test, ie, how to test the full wordlist (Børre, Sjur)

Testing tools

We have received a set of testing tools from Polderland. They have some technical bugs, but Polderland are working on fixing them. The first (buggy) versions do show that we can use the output to run automated tests on them.

Sjur has written an AppleScript to run arbitrary texts through the MS Word speller directly in Word, and getting the result back in a text file in an easily parsed format. The script is found in gt/script/. The documentation is forthcoming.

TODO:

get updated Polderland testing tools (Sjur)
document the AppleScript testing tool (Sjur)
write tools for statistical analysis of test results (Sjur)

Storing test texts

Test texts should be stored in the corpus catalogue, separated from the ordinary corpus files. They should be marked as to whether their unknown words have been added to the lexicon or not (in the former case, they cannot be used for testing of performance any more, only for regression testing). When the words have been added, the whole text can be transferred to the regular corpus repository.

TODO:

Store the tested texts, for reference (Trond, Børre)
Set up (sub)directories (Saara)
- top-level dir corpus/prooftest/orig/ and corpus/prooftest/xml/
Add potential test texts (Børre, Thomas, Trond, anyone, really)
Manually mark them for typos (making them into gold standards) (Steinar, Maaren)
- erorr§error
Format the added texts in appropriate ways - use our existing xml format, with correct markup as decided earlier (the only thing that separate these documents from regular corpus documents is the directory (tree) in which they reside), thus regular corpus conversion tools, plus error markup (Saara)
- requires changes to ccat to handle error/correction markup (Tomi)
Set up ways of adding meta-information (source info, used in testing or not, added to lexicon or not) (Sjur, Børre)
Conduct tests on new beta versions on the basis of the unspoiled gold standard documents (whoever has time), and fill in data from testing (the testers: who?)
alternatively: make test scripts that will run the tests automatically, collect the numbers, and transform them into test results (Sjur)
include the ones already tested in the testing/ catalogue
test 0.3 on the same texts
test each version before beta release

The b0.3 / 2007.02.26 version

Known errors:

clitics do not work with W class words (uninflected words). Two options:
- generate these with clitics (adds words from 6700 -> 100 000) (Tomi, Saara)
  - done
- ask Polderland to look at it - Sjur will do that
  - Tomi did it, follow-up e-mail discussions by Sjur and Tomi

Localisation

We need to translate the info added to our front page (and a separate page) regarding the beta release. Also the press release needs to be translated.

TODO:

translate beta release docs to sme (Thomas)
translate beta release docs to smj (Thomas)

Conversion from LexC to PLX

Adjectives compile at 60 sec/adjective, i.e. (5000*60) / 3600 = 83 hrs
Nouns compile at 3 sec/noun,            i.e. (23600*3) / 3600 = 19 hrs

Verbs take 13-15 hours to compile.

This is so far acceptable for nouns, but on the edge of being unacceptable for adjectives. These times will multiply many times when we add derivation, meaning we will need more than a week to convert the major POSes from LexC to PLX then.

We need to investigate why adjectives are so slow, and try to improve on the conversion speed.

Update: The conversion is extremely slow after Tomi added derivations to the conversion process. One verb can take an hour to convert (including all its derivations, and their inflections). With the present speed, converting all the verbs will take almost 1,5 months, which is of course completely unacceptable.

Saara has several ideas for how to improve the speed on her end (ie the servers and the paradigm generator). And Sjur has an idea for a very different approach.

abohtta GOAHTI "abbot N" ;  !+SgNomCmp +SgGenCmp +PlGenCmp
   ↓ <- perl-script
+N+SgNomCmp+SgGenCmp+PlGenCmpabohtta GOAHTI "abbot N" ;  !
+N+SgNomCmp+SgGenCmp+PlGenCmpabohtta GOAHTI "abbot N" ;  !
   ↓
+N+SgNomCmp+SgGenCmp+PlGenCmpabohtta+N+Sg+Gen GOAHTI "abbot N" ;  !
+N+SgNomCmp+SgGenCmp+PlGenCmpabohtta:+N+SgNomCmp+SgGenCmp+PlGenCmpabohtta GOAHTI "abbot N" ;  !

-- filter which removes the Cmp-tags from all case forms except SgNom, SgGen
   and PlGen - thus, all forms without Cmp tags will be L only --

The basic idea is to use the Xerox tools to do the conversion for us, by augmenting it with regex-es that in the end will give us the PLX output on the lower side, and then just prin-lower. We know print-lower is extremely fast, and if it can be used to generate PLX, the time problem is solved.

Some brief profiling done by Børre during the meeting showed that hyphenation of the generated paradigms take an unproportionally large amount of time. By commenting out the hyphenation part, the paradigm generation went considerably faster. Based on this, we played with the idea of adding hyphenation directly to the paradigm output - it should be a simple matter of just isme-norm.fst .o. hyph.fst. It would in one step remove one very costly part of the processing, and as well remove the ambiguity and overgeneration from the hyphenated output.

We also played further with using the Xerox tools all the way to producing ready PLX-formatted output on one side. If it can be made, the PLX conversion would become a simple and fast task of printing one side of the language, which is done in minutes (but which requires the specially licensed version of the Xerox tools, available only on victorio).

We need to test that the conversion is correct and gives expected results in all cases, especially regarding compounding and derivation. For that we need a small set of test entries in lexc format, and the corresponding expected output in PLX format. By comparing the actual output with the expected output we get a measure of the quality of the conversion.

TODO:

Look at bottlenecks in existing code (Tomi, Børre)
Look at xfst ways of doing it (Sjur, Trond, …)
add derivations to the PLX generation (Tomi)
1. working on it
add prefixes to the PLX (Børre)
middle nouns (Børre)
make conversion test sample; add conversion testing to the make file (Tomi)
improve number conversion (Børre, Tomi)

Public Beta release

Due to the problems with generating the PLX files discussed above, we need to move the release date further. End of March?

Linguistic issues still open:

derivations (Tomi)
- “solved” in the existing code
numbers 1-20 (Børre)
prefixes (eahpe, ii-) (Børre)
middle nouns (LEXICON: lexc: Rmiddle, plx: L) (Børre)

DONE:

delivered PLX data of sme and smj including compounding
translated Windows installer to sme and smj
installed PLX compiler in G5 at /usr/local/bin/mklex* (one version for sme and one for smj)
added resources needed for compiling PLX lexicons to our cvs repo
tested the beta drop from Polderland - good we did, it is absolutely unacceptable (our responsibiliby - only linguistic errors (poor coverage) found so far)
questions for Polderland:
- version info in the speller
- remaking/updating the installer packages with linguistic updates
add compilation of MS Office spellers part of the Makefile
install Windows and MS Office; test tools on Windows

TODO:

finish press release (Sjur)
add info to front page (incl. download links) (Børre)
write separate page with detailed info (incl. download links) (Børre)
collect a list of PR recipients, forward to Berit Karen Paulsen (Børre, Sjur, Trond)

Version identification of speller lexicons

See the Norwegian spellers for an example, with the trigger string tfosgniL.

Suggestion:

nuvviD -> Divvun
nuvviD -> Dávvisámegiella
nuvviD -> Veršuvdna_1.0b1 (based on cvs tag?)
nuvviD -> 12.2.2007  (automatically generated/added)
nuvviD -> Sjur_Nørstebø_Moshagen
nuvviD -> Børre_Gaup
nuvviD -> Thomas_Omma
nuvviD -> Maaren_Palismaa
nuvviD -> Tomi_Pieski
nuvviD -> Trond_Trosterud
nuvviD -> Saara_Huhmarniemi
nuvviD -> Steinar_Nilsen
nuvviD -> Lene_Antonsen
nuvviD -> Linda_Wiechetek

These correction rules (and their corresponding PLX entries) should be added automatically to the PLX file and the phonetic file as part of the compilation process, to include build date and version number.

TODO:

add version info to the generated speller lexicons (Børre, Sjur, Tomi)

10. Other

Project meeting IRL

Reserve the whole week after easter for a project gathering, probably in Kåfjord. That is, the days 10-13.4.

Corpus contracts

TODO:

publish corpus contracts and project infra as open-source on NoDaLi-sta (Sjur)
- delayed until the public beta is out the door

Bug fixing

57 open Divvun/Disamb bugs, and 23 risten.no bugs

11. Next meeting, closing

The next meeting is 19.3.2007, 09:30 Norwegian time.

The meeting was closed at 11:50.

Appendix - task lists for the next week

Boerre

write form to request corpus user account
document how to apply for access to closed corpus, and details on the corpus and its use in general
update and fix our documentation and infrastructure as Steinar finds problem areas
fix sme texts in corpus this month
find missing nob parallel texts in corpus
add prefixes to the PLX conversion
add middle nouns to the PLX conversion
improve number PLX conversion
go through other directories, fix parallellity information for other documents
add sma texts to the corpus repository
Improve automatic alignment process
Store the tested texts, for reference
Add potential speller test texts
Set up ways of adding meta-information to speller test docs
get an Intel Mac for Tomi
collect a list of PR recipients, forward to Berit Karen Paulsen
add version info to the generated speller lexicons
run all known spelling errors in the corpus through the speller
test the typos.txt list, and check that all entries are properly corrected
consider how to do a regression self-test
Look at bottlenecks in existing PLX conversion code
fix bugs!

Maaren

lexicalise actio compounds
Manually mark speller test documents for typos

Saara

continue aligning the rest of the parallel files
prepare more files for manual alignment
start improving the corpus interface for Sámi in Oslo.
mark-up the added speller test texts, using our existing xml format
improve cgi-bin scripts
fix bugs!

Sjur

hire linguist
finish press release for the beta
Set up ways of adding meta-information to speller test docs
collect a list of PR recipients
add version info to the generated speller lexicons
run all known spelling errors in the corpus through the speller
consider how to do a regression self-test
get updated Polderland testing tools
document the AppleScript testing tool
write tools for statistical analysis of test results
Look at xfst ways of doing PLX conversion
fix bugs!

Steinar

Beta testing: Align manually (shorter texts)
Manually mark speller test texts for typos (making them into gold standards), add the texts to a certain directory
Infrastructure test: add report to gt/doc/infra/, probably as infrareport.jspwiki
Complete the semantic sets in sme-dis.rle
missing lists
Look at the actio compound issue when adding from missing lists
Align corpus manually
fix bugs!

Thomas

work with compounding
Lack of lowering before hyphen: Twol rewrite.
translate beta release docs to sme and smj
Add potential speller test texts
fix bugs!

Tomi

Look at bottlenecks in existing PLX conversion code
improve PLX conversion speed
make PLX conversion test sample; add conversion testing to the make file
improve number PLX conversion
update ccat to handle error/correction markup
add version info to the generated speller lexicons
fix bugs!

Trond

Test the beta versions
Work on the parallel corpus issues
- Discuss with Anders
- Work on the aligner with (Børre)
- fix sme texts in corpus this month
- find missing nob parallel texts in corpus, go through Saara’s list
Postpone these tasks to after the beta:
- update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
- Go through the Num bugs
Improve automatic alignment process
Align corpus manually
Store the tested texts, for reference
Add potential speller test texts
collect a list of PR recipients
Look at xfst ways of doing PLX conversion
fix bugs!.