Meeting setup
- Date: 13.02.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:57.
Present: Børre, Saara, Sjur, Trond
Absent: Maaren, Thomas, Tomi
Main secretary: Børre
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Gather public texts, preferrably also parallel ones
- Gathered some, sitting on a computer at the Sámediggi.
- Continue converting text from input format to our xml
- Converted existing texts using the upload form. Nice experience:-)
- review code and documentation for corpus xsl files under version control
- convert nob and nno bible texts to be used as part of a parallel corpus, and
review the paratext2xml converter as part of the conversion
- convert smj NT to paratext
- close bug 211 as WONTFIX
- fix bugs!
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
and SGL in February/March, including place.
Saara
- continue discussion on the new lexicon format
- Refine language detection for Finnish
- Finish the review of the hyphenation detection.
- Review the handling of xsl-files in corpus infrastructure, including version
control
- Fix the preprocess script and optimize it.
- finalize an improved working version of the CGI and command line scripts for
corpus additions
- update conversion from lexc to xml (proper names) with the latest refinements
- Try to add numeral treatment as part of the analyzator.
- Look at crontab ga/ directory issue with Trond.
- done, but there is a bug.. which should be fixed now.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- delayed till Thomas is back
- project planning with Trond, continued
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- review draft tender document from Finnut
- done, feedback and changes returned
- smj G3 issue with Thomas and Trond
- delayed till Thomas is back
- sme G3 issue with Thomas and Trond
- delayed till Thomas is back
- call EDD/Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development: fix bugs, continue development
- wrote a draft specification of filename conventions, that at the same time
also repeates what has been said earlier regarding dir. structure.
- some coding as well (don’t remember the details any more)
- fix bugs!
- other:
- monthly report for January
- report for 2005 to Nordplus Sprog
Thomas
On sick-leave.
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell and corpus infrastructure
- new proper name lexicon
- remove last part of complex names not used as simplex names
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- new version of xml2lexc (based on ccat), should handle complex names correct:
construct entries like we have now from the different parts of a complex name
entry
- comment review template made by Saara
- fix bugs!
Trond
- Work on corpus texts with Børre.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Look at ga/ directory issue with Saara.
- Done. She has made a script, and I have posted things to the newsgroup.
- News group discussion followup.
- Do a bug report (if not done) on commandline bahaviour in the Xerox tools.
- Hmm, not done, after all.
- Ask for e-mail adress for corpus upload script
- fix bugs!.
3. Documentation
Reviews
XSLT processing part of the corpus infra review is finished. The code is
finished and ready to use, but language identification still needs improvements.
4. Corpus gathering
Collecting
See a previous meeting memo for what’s to be done.
TODO: Send out the rest of the letters (Børre)
Since last meeting:
- Min Áigi
- called Anders Kintel - will sign it as soon as he gets the contract; will also
burn on a CD
- Swedish Sámi Parliament: Grundström will be finished by summer time, now
proofreading
Next:
- calling Olavi Korhonen, his dictionary is now in for printing
- then continuing on the list of orgs/persons to contact
Odin
Waiting for Sæth to discuss with colleagues about how to implement the
cooperation, and return to us.
TODO:
Bible texts
TODO:
- review paratext2xml converter (Børre)
- convert smj NT to paratext. (Børre)
- ask to get fin and swe NT and OT in paratext format. (Trond)
5. Corpus infrastructure
We need more “version control” in the corpus work - we don’t know which version
of the XSL script was used (but roughly whether it was used or not). We need to
document within the XSL file which version of all tools were used, including
the XSL file (common section/template) itself.
Transferring the old gt/sme/corp files to the new corpus repo:
- for the biggest top ten (or so) the orig. should be located and copied to the
new corpus repo
- then these files should be removed from gt/sme/corp/
- all small files could just be forgotten/ignored
Task list:
- Include the xsl files under version control
- RCS version control is almost finished, but an issue with access control is
still open.
- Access control resolved through Unix groups: one group for corpus
maintainers with write access, and another for corpus users, with read-only
access.
- Improve Finnish language detection as part of the corpus processing
- Move to Bugzilla (Saara)
- Review automatic hyphen:
- Acceptable results: 90% of all real hyphens correctly tagged.
- Move to Bugzilla (Saara)
Further discussion about corpus analysis and computer use:
- the new G5 is tremendeously faster than cochise, thus we want to use it
- cochise will continue to be our main corpus repo
- the corpus/gt/ dir will be synchronised with the G5
- corpus analysis and usage will happen on the G5
- we need to develop strong enough security routines for the G5 to fulfill our
obligations towards the text licensers
- we are still using only one processor when analysing - making some simple
multiprocessor analysis script will increase the speed of the analysis a lot
6. Linguistics
Anything? Nothing.
7. Name lexicon infrastructure
Complex names
TODO:
- make sure xml2lexc can handle complex names in ways compatible with our
present tool chain (=reconstruct the lexc format we have now) (Tomi)
- the resulting file format should be identical to our present prop-name
file (=lexc), that can then be converted to our new xml format using the same
script as for the regular names (Tomi or Saara, but only when the
technical details are settled)
- Saara has added the analyzer as part
of the preprocess, but it is slow, and needs to be optimized.
Move these issues to bugzilla (Børre)
Preprocessor optimization
To optimize one could build a targeted transducer only containing the relevant
lexicons for preprocessing. But presently that leads to the following lexicons
being reported as referenced but undefined:
- (Punctuation)
- (Abbreviation)
- (Acronym)
- (Adposition)
- (Negativeverb)
- (Copula)
- (VerbRoot)
- (AdjectiveRoot)
- (At)
- (NounSecond)
- (ALIT)
- (NAMAT)
- (SAS)
Perhaps picking the
hum-tf4-ans142:~/gt/sme/src trond$ grep '% ' adv-sme-lex.txt
earret% eará adv ;
dan% dihte adv ;
...
TODO:
- make a lexc Root lexicon (first 40 lines of sme-lex.txt)
- extract the relevant parts of the relevant lexica from the main transducer
- built from the union of a and b.
Discussion will continue on the newsgroup.
TODO:
- testing of conversion
- eXist as editor:
- develop the needed XQueries and interface
- data synchronisation between risten.no and
- test whether eXist as editor is actually working well
8. Other
SGL Seminar
SGL has now been elected, with the folowing members:
- Rolf Olsen (Else Turi)
- Tor Magne Berg (Marit Breie Henriksen)
- Elle Marja Vars (-)
- Lena Kappfjell (Albert Jåma)
- Heidi Andersen (-)
SGL/normativity seminar:
- all members = potentially/likely all languages
- not all languages, only North Sámi
- date? As early as possible, end of February/beginning of March
- place? Maaren will investigate
Infra for new projects and ideas:
- make Forrest integration work as expected (Børre)
Bug fixing
30 open bugs (and 24 risten.no bugs)
- Add bug report for the Xerox backspace error (Trond)
9. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferrably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Call Ove Sæth and Olavi Korhonen
- Correct Forrest integration for new projects and project ideas
- Move complex name lexicon issue to bugzilla
- fix bugs!
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
and SGL in February/March, including place.
Saara
- continue discussion on the new lexicon format
- Move the issue “Refine language detection for Finnish” to Bugzilla
- Move the issue “Finnish the review of the hyphenation detection” to Bugzilla
- Add version information of the tools to part of the corpus infra.
- Fix the preprocess script and optimize it.
- finalize an improved working version of the CGI and command line scripts for
corpus additions
- update conversion from lexc to xml (proper names) with the latest refinements
- Try to add numeral treatment as part of the analyzator.
- Look at crontab ga/ directory issue with Trond.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- project planning with Trond, continued
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- review draft tender document from Finnut
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development: fix bugs, continue development
- fix bugs!
Thomas
- work on North Sámi compounding and derivation
- review corpus usage documentation
- smj G3 issue with Sjur and Trond
- sme G3 issue with Sjur and Trond
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
(it’s almost done, but there are a couple of loose ends)
- new proper name lexicon
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- new version of xml2lexc (based on ccat), should handle complex names correct:
construct entries like we have now from the different parts of a complex name
entry
- fix bugs!
Trond
- Work on corpus texts with Børre.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Look at ga/ directory issue with Saara.
- News group discussion followup.
- Do a bug report (if not done) on commandline (mis)behaviour in the Xerox tools
- Ask IT guys for an e-mail adress for corpus upload script:
corpus@giellatekno.uit.no (forwarded to Børre)
- fix bugs!.
10. Next meeting, closing
20.02.2006 09:30
Closed at 11:33