Meeting setup

Date: 22.01.2007
Time: 09.00 Norw. time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from last week
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
Name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 9:39.

Present: Børre, Maaren, Sjur, Steinar, Thomas, Tomi, Trond

Absent: Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

continue work on script for automatic testing of the spell checker in Word
- not done
fix sme texts in corpus this month
- not done
find missing nob parallel texts in corpus
- not done
translate Windows installer text to sme and smj
- some done
work on the Polderland data generation (PLX format conversion)
- Concentrate on compounding
go through other directories, fix parallelity information for other documents
- not done
add sma texts to the corpus repository
- not done
fix bugs!
- not done
other:
- fixed Forrest for giellatekno.uit.no. It had a strange bug which led to circular building.

Maaren

tasks according to Thomas
- done some

Saara

fix sme texts in corpus this month
- fixed handling of msword document arrays.
send aligned, xml nob texts to Kristen
- will finally be done today.
fix problems with xml2lexc if needed
fix bugs!
- fixed a couple of preprocessor bugs.

Sjur

name lexicon:
- restructure interface code for easier maintenance, coding and use
  - some work, found a recently introduced bug in Forrest’s handling of HTML source files (fixed and committed)
- refactor the rest of the SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
- contacted one more candidate for the linguist position; started writing specification for the programmer position
get an Intel Mac for testing Windows spellers
- gave spec to Børre - asked him to do the order
publish corpus contracts and project infra on NoDaLi-sta
- wrote most of the e-mail for the announcement, but found a couple of loose ends in our infra that need to be tied up
fix stuorra-oslolaš lower case o
fix bugs!
other tasks:
- Polderland follow-ups
- several requests to get clarification on Microsoft Sámi language codes in MS Office / Mac (2004 & 2008)

Steinar

Complete the semantic sets in sme-dis.rle
- done some work
missing lists
- done some work
report conversion errors to Saara
- not done
Look at the actio compound issue when adding from missing lists
- not done
Go through the Num bugs
- not done
fix bugs!
- not done

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- not this week
work with compounding
- worked
translate Windows installer to sme and smj
- finished with smj, don’t want to think about sme
lexicalise actio compounds
- not worked personally
Lack of lowering before hyphen: Twol rewrite.
- not done
Go through the Num bugs
- not done
fix stuorra-oslolaš lower case o
- not done
include basic numbers in the non-recursive transducers
- done
implement discontinuous case inflection for numbers
- not done
produce correct number base forms in the analyzer
- not done
fix bugs!
- not any bugs this week

Tomi

add compound stems to the PLX generation
- done (probably not finished yet)
add closed POS and clitics to PLX generation
- done with help from Børre
add derivations to the PLX generation
- not done
fix bugs!

Trond

update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt (not this week)
- No.
fix sme texts in corpus this month
find missing nob parallel texts in corpus, go through Saara’s list
- Not done. Discussed the issue with Børre, but no.
report conversion errors to Saara
- Done some, but she is so efficient, it is hard to find errors before she has fixed them
Write twol rules for sme, smj on hyphen-triggered lowering with Thomas
- Done.
Go through the Num bugs
- Some
Make numeral testbed
- Some
Get input on sma hyphenations
- Sent query, still no answer.
include numbers in the non-recursive transducers for sme, smj
- Work underway, but the automaton may be circular.
implement discontinuous case inflection for numbers
- Started thinking, nothing done.
produce correct base forms in the analyzer
fix bugs!

3. Documentation

Missing documentation on corpus access: no link to unrestricted part of corpus for download, and no page describing how to apply for access to the corpus.

TODO:

write form to request corpus user account (Børre, Sjur, Trond)
document how to apply for access to closed corpus, and details on the corpus and its use in general (Børre, Sjur, Trond)
add short description on our front page on anonymous cvs and corpus access, with links to relevant documentation (Børre)

4. Corpus gathering

Børre had a short look at the new sma files. Nothing else happened.

TODO:

sme texts: no new additions, fix corpus errors during this month (Børre, Trond, Saara)
missing nob parallel texts should be added if such holes are found (Børre, Trond)
Go through the list of missing or erroneous nob texts, based upon Saara’s perfect list (Børre, Trond)
add sma texts to the corpus repository (Børre)

5. Corpus infrastructure

Alignment

TODO:

go through other directories, fix parallelity information for other documents (Børre)
- not done
when aligned, send aligned, xml nob texts to Kristin (Saara)
- almost ready

Conversion issues

Character encoding errors are vanishing, thanks to Saara. What remains as an annoying problem is that the pdf conversion cuts word forms in two.

đi iešguđet prográmmasurggiin, mat laktásit dear vvašvuhtii, eallindillái ja sám
n/depts/Mangf027.pdf.xml:    <p>Kultuvra ja dear vvašvuohta Kultuvra ja dearvvaš
jat johtui dutkama sámiid eallineavttuid ja dear vvašvuođa birra. Galgá vuoruhit
idfuolahusa doaibmaplánaruđaiguin (ja mielladear vvašvuođa buoridanplánaruđaigui
                          vuhtiiváldima dárbu ár vvoštallojuvvo dan oktavuođas,

Another error type found/still not corrected:
 ánáidgárddiide, giellaoahpahussii ja diehtojuo­ hkin-, ovddidan- ja oaivadanbar

We have two suggestions for fixing some or most of these errors:

String replacement in the xsl files, like / vva/vva/ etc.
testing the string output of the pdf conversion in the morphological analyser:

- test each string returned for morphological recognisability
- if a string is not recognised, and the following string is not recognised either:
  - concatenate the two unrecognised strings, and try again
  - if the concatenation is recognised, then use the concatenated string, if
    not, leave the strings as is
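
The morphological test described in the steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `is_recognised` is a hypothetical callable standing in for a lookup in the morphological analyser, and `toy_lookup` with its small word set is made up for the example. Nothing here is existing project code.

```python
from typing import Callable, List


def mend_split_words(tokens: List[str],
                     is_recognised: Callable[[str], bool]) -> List[str]:
    """Rejoin two adjacent unrecognised strings if their concatenation is recognised."""
    mended = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        # Only try to mend when this string AND the following one are unrecognised.
        if (i + 1 < len(tokens)
                and not is_recognised(current)
                and not is_recognised(tokens[i + 1])):
            joined = current + tokens[i + 1]
            if is_recognised(joined):
                # The concatenation is a known word form: use it and skip both parts.
                mended.append(joined)
                i += 2
                continue
        # Otherwise leave the string as is.
        mended.append(current)
        i += 1
    return mended


# Hypothetical stand-in for the analyser: a fixed set of "known" word forms.
KNOWN = {"Kultuvra", "ja", "dearvvašvuohta"}


def toy_lookup(word: str) -> bool:
    return word in KNOWN


print(mend_split_words(["Kultuvra", "ja", "dear", "vvašvuohta"], toy_lookup))
```

With the toy lookup, the two fragments `dear` and `vvašvuohta` are rejoined, while the recognised strings are left untouched. A real setup would replace `toy_lookup` with a call to the analyser, which is where the extra running time would come from.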

Pro/con:

String replacement is safe (or can be made safe, with conservative use of strings), but will have lower coverage (it leaves more errors untouched).
It is fast, and fits within the setup we have. On the other hand, it takes some time to identify the files.
Morphological testing is safer and has far better error correction performance, but will be slower. It also takes some time to set up the procedure.
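
A conservative string replacement could look roughly like the sketch below. It is written in Python for illustration rather than in the xsl files, and the fragment list `SPLIT_FRAGMENTS` is a hypothetical, hand-picked list based only on the two error patterns quoted in this memo. It assumes that no legitimate word begins with one of the listed fragments.

```python
import re

# Hypothetical, hand-picked fragments seen at the start of cut-off word halves.
# Assumption: no real word form begins with any of these fragments.
SPLIT_FRAGMENTS = ["vva", "vvo"]


def mend_by_replacement(text, fragments=SPLIT_FRAGMENTS):
    """Remove the stray space in front of each listed fragment."""
    for fragment in fragments:
        # Delete the space only when it sits between a word character and the fragment.
        text = re.sub(r"(?<=\w) " + re.escape(fragment), fragment, text)
    return text


print(mend_by_replacement("Kultuvra ja dear vvašvuohta"))
```

Keeping the fragment list short is what keeps this approach safe, and it is also why it leaves more errors untouched.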

TODO:

report conversion errors to Saara (Trond, Steinar)
Have a look at the two suggestions for pdf mending above (Saara)

6. Infrastructure

We need to test the infrastructure from the viewpoint of new corpus users, and external users in general, cf. the e-mail announcement to the NoDaLi mailing list that is about to be made ready.

Test tasks:

check out anonymous cvs, and try to build the sme analyzer
download the open section of the corpus, and try to use it as input to the sme disambiguator
try to apply for a corpus user account, and when the account is created, try to use it to analyze some closed parts of the corpus

Instructions:

only the ones on our web site - no interference from Børre, Trond or others:-)
as soon as you hit a road block, write down the problem in as much detail as possible, then go and ask for help, or wait until the problem has been fixed

TODO:

test our infrastructure and documentation - follow the documentation exactly, and find problem areas - report problems to Børre. Start: At the front page. (Steinar)
update and fix our documentation and infrastructure as Steinar finds problem areas (Børre)

7. Linguistics

North Sámi

TODO:

lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji (Thomas, Maaren, Steinar)
Lack of lowering before hyphen: Twol rewrite. (Thomas, Trond)
- done
fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)

Numbers:

We had a meeting last Friday, memo to be found here

TODO:

discontinuous case inflection (but only for maximally three-part compound numerals) (viđain/goalmmát/logiin and guvttiin/logiin/viđain) (Thomas, Trond)
produce correct number base forms in the analyzer (Thomas, Trond)
include numbers in the non-recursive transducers (i.e. split the recursive and the non-recursive part of the numerals) (Trond, Thomas)
- done
Set up test bed for numerals, test and revise (who?)
- test bed ready - Trond did it
Make a test bed make num-paradigm (Trond)
- done
Go through the Num bugs (Trond, Thomas, Steinar)
Preprocessing of ordinals at the end of sentences - reported as bug #368. (Trond)
- discussed

Classification of compounding with numerals:

Num N   ok golbmafanas
N   Num R not fanasgolbma

Num Num is a restricted type of compounding, very different from free compounding. Num Num is also already implemented in the transducer, but left out, due to circularity.

Written long numerals are marginal, we may consider generating some. Syntax:

* 1-9 + thousand + 1-9 + hundred +
    { 1-9 + ten +  1-9 / 1-9 + teen / 1-9 twenty + 1-9 }

From our corpus:

golmma#beaivásaš
golmma#geardán
golmma#geardásaš
golmma#gielat
golmma#jagáš
golmma#jahkásaš
golmma#jienat
golmma#juvllat
golmma#lanjat
golmma#lágan
golmma#oasat
golmma#oktasaš
golmma#riika
golmma#čiegat

Hyphenation problem

TODO:

ask Ove Lorentz to report on our sma hyphenator (Trond)
- Still waiting for an answer

Lule Sámi

TODO:

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)
Lack of lowering/fronting before hyphen: Twol rewrite. (Thomas, Trond)
include numbers in the non-recursive transducers (Thomas, Trond)
- done
Set up a test bed for numerals, test and revise (Trond)

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]

Postponed:

data synchronisation between risten.no and the cvs repo

TODO:

restructure interface code for easier maintenance, coding and use
- well under way, still some work
  - continued
finish first version of the editing (Sjur)
test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
start to use the xml file as source file
clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
publish the name lexicon on risten.no (Sjur)
add missing parallel names for placenames (linguists)
add informative links between first names like Niillas and Nils (linguists)

9. Spellers

Polderland data generation

PLX conversion has made substantial progress, and we now cover most POSes, with the exception of numerals. Derivations are also still not covered. We now need to send what we have to Polderland, so that they can test it and deliver a first version of the spellers based on our PLX code.

Prefixes should be added as separate entries.

TODO:

add closed POS and clitics to PLX generation (Børre, Tomi)
- done
add compound stems to the PLX generation (Børre, Tomi)
- done
Include numerals in the speller (Børre, Tomi)
add prefixes to the PLX (Børre, Tomi)
add smj to PLX conversion (Børre, Tomi)
add derivations to the PLX generation (Børre, Tomi)

Aspell

TODO when the major part of the PLX conversion is done:

add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

TODO:

get an Intel Mac for testing Windows spellers (Børre, Sjur)
- Sjur has asked Børre to order two new MBP

Localisation

smj is now translated, and should be sent to Polderland.

TODO:

translate Windows installer text to sme and smj (Børre, Thomas)
- smj done
send smj translations to Polderland (Børre)

10. Other

Corpus contracts

TODO:

publish corpus contracts and project infra on NoDaLi-sta (Sjur)
- wrote most of the letters, but found some holes (discussed above)

Bug fixing

55 open Divvun/Disamb bugs, and 23 risten.no bugs

11. Next meeting, closing

The next meeting is 29.1.2007, 09:30 Norwegian time.

The meeting was closed at 10:53.

Appendix - task lists for the next week

Børre

send smj translations to Polderland
write form to request corpus user account
document how to apply for access to closed corpus, and details on the corpus and its use in general
add short description on our front page on anonymous cvs and corpus access, with links to relevant documentation
update and fix our documentation and infrastructure as Steinar finds problem areas
continue work on script for automatic testing of the spell checker in Word
fix sme texts in corpus this month
find missing nob parallel texts in corpus
translate Windows installer text to sme
work on the Polderland data generation (PLX format conversion)
- Concentrate on compounding
go through other directories, fix parallelity information for other documents
add sma texts to the corpus repository
order Intel Macs
fix bugs!

Maaren

tasks according to Thomas

Saara

fix sme texts in corpus this month
send aligned, xml nob texts to Kristen
fix problems with xml2lexc if needed
check the problem with pdf-conversion cutting wordforms.
fix bugs!

Sjur

name lexicon:
- restructure interface code for easier maintenance, coding and use
- refactor the rest of the SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
hire linguist and programmer
publish corpus contracts and project infra on NoDaLi-sta
fix stuorra-oslolaš lower case o
write form to request corpus user account
document how to apply for access to closed corpus, and details on the corpus and its use in general
fix bugs!

Steinar

test our infrastructure and documentation - follow the documentation exactly, and find problem areas - report problems to Børre. Start: At the front page.
Complete the semantic sets in sme-dis.rle
missing lists
report conversion errors to Saara
Look at the actio compound issue when adding from missing lists
lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji
Go through the Num bugs
fix bugs!

Thomas

refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
work with compounding
lexicalise actio compounds
Lack of lowering before hyphen: Twol rewrite.
Go through the Num bugs
fix stuorra-oslolaš lower case o
implement discontinuous case inflection for numbers
produce correct number base forms in the analyzer
fix bugs!

Tomi

add compound stems to the PLX generation
include numerals in the speller
add prefixes to the PLX
add smj to PLX conversion
add derivations to the PLX generation
fix bugs!

Trond

update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
fix sme texts in corpus this month
find missing nob parallel texts in corpus, go through Saara’s list
report conversion errors to Saara
Go through the Num bugs
Make numeral testbed for smj as well
Get input on sma hyphenations
implement discontinuous case inflection for numbers
produce correct number base forms in the analyzer
write form to request corpus user account
document how to apply for access to closed corpus, and details on the corpus and its use in general
fix bugs!