Meeting setup
- Date: 03.01.2007
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from last week
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:42.
Present: Børre, Saara, Sjur, Steinar, Thomas, Trond
Absent: Maaren, Tomi
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- contact authors who have already received the corpus licensing contract
- not done
- continue work on script for automatic testing of the spell checker in Word
- not done
- update setup and installation instructions for new users/computers
- not done
- create new forrest tarball
- not done
- cvs up of the public server should be done for
xtdoc/sd/documentation/- not done
- fix bugs!
- not done
- other:
- fixed link errors in gtuit/doc
Maaren
- investigate the generated word form list sent to Polderland - use the command
make wordlist TARGET=smein victorio
Saara
- help Trond with some shell commands
- re-analyze parallel files
- done, aligner now works.
- Move to Bugzilla: vislcg server-friendly as feature request to the vislcg devs
- done
- fix bugs!
- done some
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- started work on a refactor of some basic code, to allow greater flexibility, easier coding, and hopefully a better interface
- hire linguist and programmer
- decide how to specify compounding behaviour info in the lexicon
- combinatorial behaviour decided; specification of form still not done
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix forrest installations for Maaren, Disamb
- fix bugs!
Thomas
- refine
smjproper noun lexica, cf. the propernoun-smj-lex.txt- we’ve begun
- decide how to specify compounding behaviour info in the lexicon
- on our way
- go through the class of GAHPIR words, and try to generalise the compounding
behaviour
- done
- change whatever is needed based on the above generalisation
- done
- fix bugs!
- worked with bugs
Tomi
- add closed POS and clitics to PLX generation
- add derivations to the PLX generation
- add compound stems to the PLX generation
- fix bugs!
Trond
- update the
smjproper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt- We have done some work on this.
- Go through the GAHPIR lexicon (with Thomas)
- Done
- get more
smatexts- No news.
- decide how to specify compounding behaviour info in the lexicon
- At least we worked on this, what was the final outcome again?
- fix bugs!.
- Done some work on bugs, closed a couple.
3. Documentation
TODO:
- either fix forrest installations (Sjur), or create a new tarball (Børre)
- cvs up of the public server should be done for
xtdoc/sd/documentation/- notice the directory! (Børre)
- update setup and installation instructions (Børre)
4. Corpus gathering
We should probably concentrate upon mending what we have, up to Febrary 8th.
Børre has been fixing the meta information for many files, adding info on parallel language versions.
TODO:
smetexts: Fix what we have for this month (Børre, Trond, Saara)- missing
nobparallel texts could be an issue (Børre, Trond)
5. Corpus infrastructure
Aligner
The command line aligner is in place, and most(?)/all(?) potential texts are aligned.
TODO:
- make an overview of status quo (Trond, Børre, Saara). Børre has added
parallellity status to all
admin/sd/texts - gather more parallel texts (Trond, Børre)
- not any more? (depends on what we are missing and how hard they are to get, but we will not put much time into it)
- re-analyze parallel files using the command-line version (Saara)
- progressing
- when aligned, send aligned, xml
nobtexts to Lars (Saara)
Conversion issues
Analysis statistics:
words missing % recognised
adm 1493746 28967 98,06
bib 241233 2651 98,90
fac 382353 11371 97,03
fic 70601 7847 88,89
law 284700 6269 97,80
new 4032848 17134 99,58
There are still a lot of conversion errors, and fixing them should be a priority this month. Error types:
- pdf files (there will always be some errors here)
- string-string replacement during the xsl conversion
- Conversion errors (but we want to fix the conversion)
- typos (should be marked up using our error markup tags)
- grammatical errors (as above)
So far, we only have had character replacement (even context-less character
replacement). We can use XSL string replacement to correct spelling errors to
constructs like, cf the corpus.dtd and [a previous meeting memo
| https://giellalt.uit.no/admin/weekly/2006/Meeting_2006-03-20.html#Correction+tags%3F]:
<error correct="corrected text">erroneous text</error>
Either we use the function Trond found, or we install and use Saxon 8 and XSL 2, which have good string manipulation tools.
Flow:
- orig
- xsl
- corr-of-textstring (not implemented yet) (we is > we are) (we keep the orig text, and the correction is added as additional info in an attribute, see above)
- xml
- preprocess
- corr-of-token=typos.txt eror>error
- analysis
Do we have good tools for localising conversion errors? Problem: I spot an error
in the analysis. Which file is it in?
ccat removes file source info, and we are thus in the blind. Conclusion: we
don’t have a good tool for this, only grep.
TODO:
- add correction markup to the xml files (string-to-correction markup) (Saara)
- report conversion errors to Saara (Trond, Steinar)
6. Infrastructure
Xerox tools wrapped as servers
TODO:
- find a way of integrating
vislcgas a server, or send a feature request to thevislcgdevelopers (Saara)- move this to Bugzilla
- done
- move this to Bugzilla
7. Linguistics
Names and multilinguality
TODO:
- finish first version of the editing (Sjur)
- test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
- convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
- start to use the xml file as source file
- clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
- merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
- publish the name lexicon on risten.no (Sjur)
- add missing parallel names for placenames (linguists)
- add informative links between first names like Niillas and Nils (linguists)
North Sámi
TODO:
- go through the class of GAHPIR words, and try to generalise the compounding
behaviour (Thomas)
- done
- change whatever is needed based on the above generalisation
(Thomas, Trond)
- done
Lule Sámi
TODO:
- refine
smjproper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)- progressing - good!
- update proper noun lexicon with copy of
smelexicon (Trond)- done
8. Name lexicon infrastructure
Decided in Tromsø:
- add logging facilities to the interface
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
- tag for excluding/including a name from certain applications
- future epxansion: choose what info to display in the single language browser
- display existing language entries when adding a new language to a record
- add editor to change single, existing entries
Details can be found in [the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]
TODO:
- develop the needed XQueries and UI (Sjur, Tomi)
- try to make a first version of xml2lexc in Perl for testing and preparation for the big jump (Saara)
Postponed:
- data synchronisation between risten.no and the cvs repo
- new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
9. Spellers
Polderland data generation
TODO:
- decide how to specify compounding behaviour info for the lexicon
(Thomas, Trond, Sjur)
- partially done - compound stem specification still open
- add closed POS and clitics to PLX generation (Tomi)
- add derivations to the PLX generation (Tomi)
- add compound stems to the PLX generation (Tomi)
Aspell
TODO when the major part of the PLX conversion is done:
- add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
- study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)
Testing
TODO:
- get an Intel Mac for testing Windows spellers (Børre, Sjur)
Localisation
The Windows installer should be localised. We have received the string file from Polderland, now they should be translated.
The Mac installer they use is the standard MacOS X installer, and as sme
isn’t generally turned on as a UI language in MacOS X, there is no big point in
translating it. It could be done, though:-)
TODO:
- send translation text to Børre and Thomas (Sjur)
- translate Windows installer to
smeandsmj(Børre, Thomas)
10. Other
Corpus contracts
TODO:
- publish corpus contracts and project infra on NoDaLi-sta (Sjur)
- still not done
Bug fixing
55 open Divvun/Disamb bugs, and 23 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
11. Next meeting, closing
The next meeting is 8.1.2007, 09:00 Norwegian time.
The meeting was closed at 12:12.
Appendix - task lists for the next week
Boerre
- contact authors who have already received the corpus licensing contract
- continue work on script for automatic testing of the spell checker in Word
- recreate our forrest tarball
- update setup and installation instructions for new users/computers
- create new forrest tarball
- cvs up of the public server should be done for
xtdoc/sd/documentation/ - fix
smetexts in corpus this month - find missing
nobparallel texts in corpus - aligner status quo
- translate Windows installer to
smeandsmj - fix bugs!
Maaren
- investigate the generated word form list sent to Polderland - use the command
make wordlist TARGET=smein victorio
Saara
- help Trond with some shell commands
- fix
smetexts in corpus this month - aligner status quo
- send aligned, xml
nobtexts to Lars - add correction markup to the xml files (string-to-correction markup)
- first new version of xml2lexc in Perl
- fix bugs!
Sjur
- name lexicon:
- refactor SD-terms editor code
- implement missing propnouns editing functions
- implement improvements decided upon in Tromsø
- hire linguist and programmer
- decide how to specify compounding behaviour info in the lexicon
- get an Intel Mac for testing Windows spellers
- publish corpus contracts and project infra on NoDaLi-sta
- fix forrest installations for Maaren, Disamb
- send Windows installer translation text to Børre and Thomas
- fix bugs!
Steinar
- conversion error screening
- missing lists
- report conversion errors to Saara
- fix bugs!
Thomas
- refine
smjproper noun lexica, cf. the propernoun-smj-lex.txt - decide how to specify compounding behaviour info in the lexicon
- translate Windows installer to
smeandsmj - fix bugs!
Tomi
- add closed POS and clitics to PLX generation
- add derivations to the PLX generation
- add compound stems to the PLX generation
- fix bugs!
Trond
- update the
smjproper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt - get more
smatexts - aligner status quo
- decide how to specify compounding behaviour info in the lexicon
- Set up work on missing and converision screening with Steinar.
- fix
smetexts in corpus this month - find missing
nobparallel texts in corpus - report conversion errors to Saara
- fix bugs!.