Meeting setup
- Date: 06.02.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:38.
Present: Børre, Saara, Sjur, Tomi, Trond
Absent: Maaren, Thomas
Main secretary: Trond
Agenda accepted as is, we’ll try to finish by 10.55, to allow for joining
the celebration of the Sámi national day.
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Sent to Iđut and Kåfjord municipality
- Gather public texts, preferrably also parallel ones
- Continue converting text from input format to our xml
- review code and documentation for corpus xsl files under version control
- fix bugs!
- Other
- The server didn’t get an IP-address using DHCP. It turned out that if the
«Gateway Assistant» was used, then the network card connected to «the
outside world» didn’t get an IP-address using DHCP. Even if all the services
that was started using the «Gateway Assistant» were turned off, this
behaviour went on. The «solution» was to reinstall Mac OS X Server, and not
use the «GA».
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
and SGL in February/March, including place.
Saara
- continue discussion on the new lexicon format
- Refine language detection for Finnish
- Finnish the review of the hyphenation detection.
- Review the handling of xsl-files in corpus infrastructure, including version
control
- almost done, I’ll need some help with the xsl-processing of the
main text.
- Fix the preprocess script and optimize it by building an analyzator
for the multi-part expressions.
- it seems that building a preprocessor-specific analyzator is not possible.
- finalize an improved working version of the CGI and command line scripts for
corpus additions
- update conversion from lexc to xml (proper names) with the latest
refinements
- Try to add numeral treatment as part of the analyzator.
- Change character coding detection to paragraph-based.
- done, use convert2xml.pl with option –multi-coding. This
introduces some errors (due to the small size of some paragraphs), so
the default is still one coding per file.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- follow up on voice group-chat not working to Sámediggi
- Test Marratech when the new Marratech server is in place
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- continue proper name lexicon work and discussion
- did a lot to upgrade the risten.no infrastructure to be multi-collection
aware
- discussions in the newsgroup
- added the test lexicons Saara created to my own instance of risten.no
- public tender:
- waited for and received a draft public tender document from Finnut
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/Christian Emil Ore about national place name lexicon
- fix bugs!
Thomas
On sick leave.
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- corpus infrastructure:
- dtd location (both public and internal)
- cgi-admin script for adding xsl-files
- Document aspell and corpus infrastructure
- ccat: add a -v option - it should return the version of the tool
- new proper name lexicon
- remove last part of complex names not used as simplex names
- start looking at conversion of the name lexicon from present format to xml
- discuss the new lexicon format in the newsgroup
- Look into synchronisation of proper names with risten.no
- new version of xml2lexc (based on catxml, now ccat)
- Not done
- xml2lexc update to handle complex names: construct entries like we have now
from the different parts of a complex name entry
- comment review template made by Saara
- fix bugs!
Trond
- Work on corpus texts with Børre.
- Done some progres wrt. processing of texts.
- 3-part compounds with Sjur and Thomas.
- Had a look at the rule set myself, but awaiting Thomas.
- smj G3 issue with Sjur and Thomas.
- sme G3 issue with Sjur and Thomas.
- fix bugs!
- Worked mostly on disambiguation.
3. Documentation
Reviews
Anything? Nothing.
Other docu
Anything? Documentation has been added for disambiguation.
4. Corpus gathering
Collecting
See a previous meeting memo for what’s to be done.
Sent letter to Iđut and Kåfjord.
TODO: Still a lot for Børre!
Odin
Waiting for Sæth to discuss with colleagues about how to implement the
cooperation, and return to us.
Nothing heard.
Bible texts
TODO:
- write a paratext2xml converter
- Tomi has already done it! Excellent!
- files requiring this converter should have the filename extension .ptx
- Cf. the following nob Old Testament texts: 01GENNBST.u8.PTX
19PSANBST.u8.PTX
- Børre will review the converter as part of adding the Norwegian texts
to our corpus
- convert smj NT to paratext. (Børre)
- ask to get fin and swe NT and OT in paratext format. (Trond)
5. Corpus infrastructure
Task list:
- Include the xsl files under version control
- RCS version control is almost finished, but an issue with access control is
still open. Discussed a bit in the meeting, but nothing conclusive. We’ll
continue the discussion in the newsgroup.
- Incorporate language detection as part of the corpus processing (Saara)
- Almost finished. Needs improved Finnish language model - presently it isn’t
able to distinguish Finnish from Sámi (proving the family bonds:-)
- we need to review whether only automatic hyphen detection is good enough, or
whether manual post-processing in some form is needed. Delayed until we have
some results to base the review on.
- Acceptable results: 90% of all real hyphens correctly tagged.
- CGI-admin script to add xsl-file to a corpus file that doesn’t have one
(Saara)
Things are moving forward, but still more work to do. The list is left as is.
E-mail address in case of upload errors:
corpus@giellatekno.uit.no (-> Børre?) Also for reports about new uploads.
/www/opt/www/cgi-bin/smi/upload.cgi (no Forrest)
http://localhost:8888/upload/upload_corpus_file.html (Forrest)
One option is to ask the cochise team, that would be royd or steinar and the
address cc.uit.no.
*Problems with greek letter in Word documents. With font Sam Times Uni(versal)
- (Børre) Can’t we just manually change
the letters and fonts in the few documents affected?
We forget about these texts for the time being, they’ll be put in a dir. for
broken texts. Such texts can be looked upon later , if wanted/needed.
Suggestion for Script for text analysis.
We would like a shadow catalogue ga/ (giella analysed) parallel to the gt/ catalogue,
with one file for each of the five directories. A way of getting this is to ach night
(afternoon!): Make a crontab job, run the following command, for each
directory admin, bible, facta, ficti, laws, news:
ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt
| lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg
| vislcg --grammar src/sme-dis.rle > /usr/local/share/corp/ga/sme/dir.txt
For example:
ccat -a -r /usr/local/share/corp/gt/sme | preprocess --abbr=bin/abbr.txt
| lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar src/sme-dis.rle
> /usr/local/share/corp/ga/sme/admin.txt
- Today: /usr/local/share/corp/gt/sme/DIR(/)/xml
- Addition: /usr/local/share/corp/ga/sme/dir.txt
TODO:
- Look at the suggestion from Trond (Saara, discuss with Trond
if unclear)
- ask for e-mail adress as specified above (Trond)
6. Linguistics
Anything? Nothing.
7. Name lexicon infrastructure
Complex names
TODO:
- make sure xml2lexc can handle complex names in ways compatible with our
present tool chain (=reconstruct the lexc format we have now) (Tomi)
- the resulting file format should be identical to our present prop-name
file (=lexc), that can then be converted to our new xml format using the same
script as for the regular names (Tomi or Saara, but only when the
technical details are settled)
- Saara has added the analyzer as part
of the preprocess, but it is slow, and needs to be optimized.
TODO:
- update conversion from lexc to xml to reflect new xml format (Saara)
- mostly done, some open questions left
- testing of conversion
- eXist as editor:
- develop the needed XQueries and interface
- data synchronisation between risten.no and
- test whether eXist as editor is actually working well
More TODO:
- read and comment in the news group (all)
- decide upon and set up infra for new projects and project ideas
Definitions/terminology:
- synchronisation in our context is data synchronisation, that is, to
bring the two repositories (CVS and risten.no) in synch regarding the shared
lexicons (propnouns at least).
- code refactoring is the process of reorganising the code by moving general
functions and snippets out of specific functions and into general libraries
for shared access, easier maintenance, and a better organisation.
8. Other
SGL Seminar
- SGL/normativity seminar
- all members = potentially/likely all languages
- not all languages, only North Sámi
- date? As early as possible, end of February/beginning of March
- place? Maaren will investigate
Technical issues
- The mac os / perl bug (at least Trond and Sjur has it, [Bugzilla
#211|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=211]):
- utf8 “\xC4” does not map to Unicode at /Users/trond/gt/script/preprocess line
- This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl
5.8.6). It is probably a perl - OS mismatch. (Trond, Saara,
Tomi)
- 10.4 introduced support for locales in the shell (10.3 and earlier didn’t
know about locales)
- Test: the result of the last line should indicate whether this is a problem
in cat or a Perl/OS mismatch.
- Is this a problem with ccat?
- It doesn’t seem so (3 min and still counting)
- In the end, the bug turned up with ccat as well. I gave the command:
-
zcorp/gt/sme//xml |
preprocess –abbr=bin/abbr.txt |
lookup -flags |
mbTT -utf8 bin/sme.fst |
lookup2cg |
vislcg –grammar=src/sme-dis.rle |
–minimal |
sort |
less |
- and it (in the end) responded:
1729 constraint rules
utf8 "\xA1" does not map to Unicode at /home/trond/gt/script/preprocess line 109, <> chunk 12.
To ccat’s defence I must say that cat, in a similar situation, would have given far
more error messages (hold on, testing still under way).
preprocess file_name.txt - OK
cat file_name.txt | preprocess - bug!!
catxml file_name.xml | preprocess - ??
ccat filename | preprocess - bug !!
This bug isn’t a high priority any more, because ccat behaves differently than
cat, and because there is the possibility of avoiding cat when working locally.
BUG: close as Won’t fix. (Børre)
Bug fixing
32 open bugs (and 24 risten.no bugs)
- Add bug report for the Xerox backspace error (Trond)
9. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferrably also parallel ones
- Continue converting text from input format to our xml
- review code and documentation for corpus xsl files under version control
- convert nob and nno bible texts to be used as part of a parallel corpus, and
review the paratext2xml converter as part of the conversion
- convert smj NT to paratext
- close bug 211 as WONTFIX
- fix bugs!
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
and SGL in February/March, including place.
Saara
- continue discussion on the new lexicon format
- Refine language detection for Finnish
- Finnish the review of the hyphenation detection.
- Review the handling of xsl-files in corpus infrastructure, including version
control
- Fix the preprocess script and optimize it by building an analyzator
for the multi-part expressions.
- finalize an improved working version of the CGI and command line scripts for
corpus additions
- update conversion from lexc to xml (proper names) with the latest refinements
- Try to add numeral treatment as part of the analyzator.
- Look at crontab ga/ directory issue with Trond.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- project planning with Trond, continued
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- review draft tender document from Finnut
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development: fix bugs, continue development
- fix bugs!
Thomas
- work on North Sámi compounding and derivation
- review corpus usage documentation
- smj G3 issue with Sjur and Trond
- sme G3 issue with Sjur and Trond
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell and corpus infrastructure
- new proper name lexicon
- remove last part of complex names not used as simplex names
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- new version of xml2lexc (based on ccat), should handle complex names correct:
construct entries like we have now from the different parts of a complex name
entry
- comment review template made by Saara
- fix bugs!
Trond
- Work on corpus texts with Børre.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Look at ga/ directory issue with Saara.
- News group discussion followup.
- Do a bug report (if not done) on commandline bahaviour in the Xerox tools.
- Ask for e-mail adress for corpus upload script
- fix bugs!.
10. Next meeting, closing
13.02.2006 09:30
Closed at 10:37