Meeting setup
- Date: 14.11.2005
- Time: 10.00 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit, phone
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Speller infrastructure
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:08.
Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi, Trond
Absent: none
Main secretary: Børre
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- Contact oahpahusossodat about texts
- Gather public texts
- Continue converting text from input format to our xml
- convert2xml.pl doesn’t work
- Document the corpus directory structure
- Ask Thor-Øivind to move bugzilla to our new webserver
- … and update Bugzilla at the same time
- install new XXE and the new XXE Forrest config for all (or check that it is
installed and working)
- mark-up names
- move existing corpus docs from gt/ to new corpus repository
- divvun.no and giellatekno.uit.no
- Binary files download area
- Moving to static site, using forrestbot or something else.
- Investigated, will continue with internal script, converting to site.
Maaren
- shall work with Sámi place names only
**done a little bit
- update the last issue in the North Sámi normativity issues document
**done
Saara
- Look at the corpus infrastructure issue
- scripts for transferring the corpus metadata to the mysql database
are ready (but waiting for changes in corpus.dtd and test documents)
- start looking at conversion of the name lexicon from present format to xml
- document corpus infrastructure, your own parts
- the scripts are documented, some updates needed
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
Trond, Wednesday
- continued, still some issues open.
- risten.no bugs and fixes
- done some work on the new eXist version
- discuss risten.no work with Tomi
- Meeting with Risten and Tomi, some work on eXist
- follow up on voice group-chat not working to Sámediggi
- Test Marratech
- URL received, but it was only internal, and when I tried it, the license had
expired:-( They seem to have installed a new license now, but the page does
not work.
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- Discuss the contract issue with Kimmo, return the new version to the lawyer
- had a meeting, updated versions checked with Trond and sent to lawyer
- Follow up on meeting with Anders Kintel
- Meeting confirmed 17.11., Arran, afternoon.
- discuss kvensk project support with Trond
- not with Trond, but with Risten and Tomi, as part of the risten.no updates
- write public tender documents
- buy:
- new computer (project server)?
Thomas
- work on North sámi compounding and derivation
**done some work on three-part compounding
- Look at Linguistic bugs with Trond
**worked with bug 193
- Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
**met and made great progress, G2>G3 issue left
- update the lule sámi normativity issues document
**not done
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- Specification for new catxml in C++
- this includes also placing the source and binary
- discuss risten.no work with Sjur
- discuss about xml-processing with Saara
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- Didn’t we discussed about this?
- start looking at conversion of the name lexicon from present format to xml
- Look into synchronisation of proper names with risten.no
- Other
- Have been looking into internal structure of risten.no
Trond
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
postponed.
- Not worked on other bugs than 193.
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- Not done.
- Work on the name project, mark up names
- Very much done, approximately 500 names left. There will be need for revision
of the marking that has been undertaken, but all in all the name lexicon
should now be ready for translation to xml format.
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas
- Done substantial progress here.
- Also worked on disambiguation with Linda
3. Documentation
Documentation tasks:
Add documentation on our corpus infrastructure and our corpus work in general
(Børre, Tomi, Trond, Saara):
- The directory structure is now settled (as of last meeting), and should be
documented.
For the basic corpora, we need 2 additional types of documentation, or doc for 2 target
groups:
- For the users/linguists: What corpus are found, how do I use them (this
info is now scattered)
(Part of the HOTWO USE is documented in the catxml docu
The what documents are found where etc + an overall documentation is not
written, since the corpus is so sparsely populated)
- For the collectors: How do I add texts, where do I add them, how do I
convert them (this is (partly?) done in the Corpus Conversion document)
test:
- add/update Aspell documentation (Tomi)
- Some documentation has been written, but there still is work to be done.
- as always: document what you’re doing:-) (all)
Divvun.no down again
Tomcat is running out of memory in between. Børre will look into changing
to Forrest generating static html pages (forrest site), and serve those off of
the standard Apache server. He will also look at utilizing Forrestbot as the
tool to update the site, instead of our homegrown script.
Update: Only one small change needed in our own script. Binary download section
should be included.
4. Corpus gathering
Governmental documents (earlier in pdf, now in html)
Børre has gathered files from the Sámediggi
Will go on gathering files from Odin.
Contracts
Sjur had a meeting with Kimmo Koskenniemi, resolving all the issues that he
had with it. Trond and Sjur has also discussed these, changed them a little
bit. The contracts are ready to be sent to the lawyer (who sadly is ill).
5. Corpus infrastructure
- Problems with convert2xml.pl?
- Barfs up at line 91
- Add issue to Bugzilla (always when you find problems!)
Quoting from the convert2xml.pl file:
26 my $xsl_file = '';
27 my $dir = '';
28 my $log_dir = '';
90 my $log_file = $log_dir . "/" . $file . ".log";
91 open STDERR, '>>', "$log_file" or die "Can't redirect STDERR: $!";
Problem analysed and will be corrected (Tomi)
Updated task list:
- Make a system for file and directory permission (today: we all belong to the
cvs group), to only allow people with root user privileges write access to the
corpus repository, at least regarding original files (Børre)
- Done
- Include the xsl files under version control (Børre, Tomi, Saara)
- Incorporate language detection as part of the corpus processing (Tomi)
- we need a way to deal with hyphenated documents (documents with (manually) inserted
hyphenation marks) in catxml/preprocess.
(Tomi, Børre, (discussion in the newsgroup:) Sjur, Trond, Saara)
- Discuss details in the newsgroup
- in normal cases hyphenation points should be removed
- when testing the robustness of our parsers, as well as when testing the
hyphenator, the hyphenation points should be retained
Corpus dtd issue
To summarize (taken from Saara’s newsgroup posting of Fri, 11 Nov 2005:
- Change the person name to firstname and lastname. Agreed/decided.
- Add an element collection:
<!ELEMENT collection (#PCDATA) >
- Agreed.
- Leave the element translator as it is now (easier to read that way and
avoid the changes).
- Add element conserning the completeness of the metadata. I guess the
uncompleteness here means that some information that should be added is
missing. (This info would be for our internal use.)
<!ELEMENT metadata (complete|incomplete)>
<!ELEMENT complete EMPTY>
<!ELEMENT incomplete EMPTY>
- Element for the word count (this is used when counting e.g. the frequencies
of some form etc.). Filling this slot should be part of the standard xsl, then.
<!ELEMENT wordcount #PCDATA>
Saara’s suggestion:
<!ELEMENT availability (free|license)
<!ELEMENT free EMPTY>
<!ELEMENT license EMPTY>
<!ATTLIST license
type (type1|type2|..) #REQUIRED
>
Saara will update the dtd.
6. Linguistics
Name lexicon
Summary: see the newsgroup
The plan for this project was as follows: Two lines of work run in parallel:
- name markup
- testing of conversion
When these two tasks are done (at some point in the future), the conversion will
be done.
Status quo on the two lines of work:
The mark up of the remaining 400 entries until conversion starts (People
allocated look at the rest: Maaren, Ilona, Trond, Børre). This week’s status
quo is as follows (some 400 names not assigned):
323 NYSTØ
32 BERN
20 LONDON
18 MARJA
17 NIILLAS
12 ACCRA
5 HEANDARAT
4 ANAR
2 ALEUHTAT
The technical issues (specified in earlier memos: Conducted by:
Tomi, Saara, Sjur. Sjur and Tomi will tomorrow Tuesday report back
on a plan for using risten.no as editor for our name lexicon
A very short example is found at common/src/proper-nouns.xml.
Saara has made a conversion script which is ready to use. More discussions on
the layout of the resulting xml file is needed.
Complex names
In the present lexicon, complex names are treated as a class of first parts (see
below), and the last part is stored as a regular simplex name.
With the new XML lexicon format, the complex names should be restored. The
present lexicon format can easily be reconstructed (by splitting at the
space character), and the list of complex names can also be used for other
purposes in the future.
Also, integration with risten.no and the kvensk project (and through that, also
Kartverket in one way or another), presupposes that we can store the complete
and “true” name.
There are ~100 first parts of complex names, the name lexicon contained 739
such complex names before they were broken up in 2004/10/14.
First-part tags are now listed separately:
LEXICON ProperNounFirstPart
El% Baradej BERN-sur ;
FirstTag ;
Badje FirstTag ;
Bajimus FirstTag ;
Bajit FirstTag ;
Bassi FirstTag ;
The format we left a year ago looked like this:
Aleksander% I%:a% suo0lu:Aleksander% I%:a% suollu SUOLU ;
Amerihká% Ovttastuvvan% Stáhtat:Amerihká% Ovttastuvvan% Stáhta ALEUHTAT ;
Amery% jiekn1arav0da:Amery% jiekn1arav'da DEATNU ;
Amundsena-Scotta% stas1uvdna DEATNU ;
Austrália% Álppat:Austrália% Ál'pa ALEUHTAT ;
Badje% Riebejoh0ka:Badje% Riebejoh'ka DEATNU ;
Badje% Stuorjoh0ka:Badje% Stuorjoh'ka DEATNU ;
Bajimus% Fielvuonjáv0ri:Bajimus% Fielvuonjáv'ri DEATNU ;
Bajimus% Molles1jáv0ri:Bajimus% Molles1jáv'ri DEATNU ;
They were broken up with the following argumentation:
revision 1.127
date: 2004/10/14 09:38:17; author: trond; state: Exp; lines: +4653 -5153
This is the great % removal revision. The background was that our
pre-composed multiword names, such as Davimus Borsejoh0ka, etc. did
not work. They passed the preprocessor only in the nominative, and not
in other cases. In the worst case, their parts were not recognised as
such, and the result would be a missing analysis. Now, the first part
has been assigned to a separate lexicon, ProperNounFirstPart, that get
the tag +N+Prop+Attr only. This lexicon contains entries like Davimus,
Guhkes, Helse, magnehtalas1, and other first parts of complex
names. These should be disambiguated in sme-dis.rle, leaving the tag
only when there is a N Prop following it. As a result of this, the
file bin/attr.txt is drastically reduced.
Task list for this issue:
- restore complex names from old cvs; c/sh-ould be stored in a separate file
- cvs up -r 1.126 sme/src/propernoun-sme-lex.txt (Trond)
-
grep % propernoun-sme-lex.txt |
iconv -f L1 -t UTF-8 |
less (Trond) |
- Make a file complex-names-lex.txt in gt/common/src/complex-proper-nouns.txt
(Trond)
- find eventual unique second-parts (B-parts of names that do not exist in
isolation)
- remove these B-parts from the ordinary name file (Tomi)
- the resulting file format should be identical to our present prop-name
file (=lexc), that can then be converted to our new xml format using the same
script as for the regular names
- make sure xml2lexc can handle complex names in ways compatible with our
present tool chain (=reconstruct the lexc format we have now)
Example of how the old lexicon can be used to identify complex name last parts
that are also used as simple names:
$grep Riebejo gt/sme/src/propernoun-sme-lex.txt
Badje% Riebejoh0ka:Badje% Riebejoh'ka DEATNU ;
Riebejoh0ka:Riebejoh'ka DEATNU ;
The details of the new XML format needs to be further discussed in the newsgroup
and integrated with the rest of the XML work and discussion.
North Sámi
Lule Sámi
Sjur, Thomas and Trond will cont. Lule Sámi issues.
Tasks:
- update the normativity issues document:
- G3 open issues (S2, some S3, S5, S6 and S7; Sx = Spiik, consonant series)
- Great progress has been made on the G3 issue, just some minor points remain.
Numerals
The issue awaits closure of the propernames project, and is postponed to next week.
Árran meeting
Børre, Anne Britt and Sjur go to Árran on Wednesday, for meetings on
Thursday. Main meeting is with Anders Kintel about using his Lule Sámi
dictionary in our projects.
7. Speller infrastructure
Nothing this week either.
8. Other
Technical issues
- The mac os / perl bug (at least Trond and Sjur has it):
- utf8 “\xC4” does not map to Unicode at /Users/trond/gt/script/preprocess line
- This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl
5.8.6). It is probably a perl - OS mismatch. (Trond, Thor Øivind,
Tomi)
- Another example of the same bug:
- :”\x{00c3}” does not map to utf8 at ../script/preprocess line 113, <> chunk
33.
- One way to “resolve” this is to redirect the error messages to /dev/null:
... | preprocess 2> /dev/null | lookup ...
XXE updates
Who has the latest XXE (3.0) and the latest forrest config?
- Børre - ok
- Trond - ok
- Maaren - XXE, but no config
- Tomi - no
- Thomas - no
- Saara - ok
- Sjur - ok
- Ilona - ok?
- Linda - no
Børre is updating the ones not yet up to speed.
Video conferencing across firewalls
The problem we’ve had with the SD firewall persists, and there doesn’t seem to
be any resources available to help us. Geir Kaaby instead suggested we look
at the Marratech package, and try it out. So please
download the MacOS X client (or get it from me), and I’ll send you the URL to
the meeting room as soon as I get it.
Bug fixing
24 open bugs (and 23 risten.no bugs)
Bugzilla update
When Bugzilla is being moved, it should also be updated to the newest version,
and the UTF-8 bug should be resolved.
risten.no
- Organisation: could Tomi be used, in exchange for more linguistic work by
(old) GIO members? Yes, it is ok, but how much still needs to be evaluated
- it is ok to integrate “kvensk” placenames with risten.no
- this should be integrated with the general proper name work - we want all
proper names integrated with risten.no, df above
- needs further development of risten.no to allow for multiple XML bases to
be presented and maintained in parallel. This is to be further worked on by
Tomi and Sjur
- infrastructure for proper names in place by end of November, if everything
goes well (or according to a tentative plan)
9. Summary, task list
Børre
- Contact oahpahusossodat about texts
- Gather public texts
- Continue converting text from input format to our xml
- Document the corpus directory structure
- Ask Thor-Øivind to move bugzilla to our new webserver
- … and update Bugzilla at the same time
- install new XXE and the new XXE Forrest config for all (or check that it is
installed and working)
- mark-up names
- divvun.no and giellatekno.uit.no
- Binary files download area
- Make the conversion to static site, using our own script.
- hyphenation in corpus docs
- meet with Anders Kintel in Árran
- corpus xsl files under version control
Maaren
- shall work with Sámi place names only
- update the last issue in the North Sámi normativity issues document
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- Convert the name lexicon from present format to xml
- document corpus infrastructure, your own parts
- Look at the hyphenation issue
- Update the corpus.dtd
- corpus xsl files under version control
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
Trond, Wednesday
- risten.no bugs and fixes
- follow up on voice group-chat not working to Sámediggi
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on place names from Norge Digitalt
- write an e-mail to or call Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- discuss kvensk project support with Trond
- proper name integration with risten.no
- discuss risten.no work with Tomi
- write public tender documents
- buy:
- new computer (project server)?
- hyphenation in corpus docs
- meet with Anders Kintel in Árran
Thomas
- work on North sámi compounding and derivation
- Look at Linguistic bugs with Trond
- Continue to meet with Sjur and Trond about and work with the definition of G1, G2, G3
- update the lule sámi normativity issues document
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- Document aspell and corpus infrastructure
- Specification for new catxml in C++
- this includes also placing the source and binary
- discuss about xml-processing with Saara
- look into efficient editing of the xml proper name lexicon (tools, modes, etc)
- start looking at conversion of the name lexicon from present format to xml
- discuss risten.no work with Sjur
- Look into synchronisation of proper names with risten.no
- hyphenation in corpus docs
- corpus xsl files under version control
- add automatic language detection to the corpus processing
Trond
- Send the contract to the university lawyer
- Look into the document hyphenation issue
- Look at the three-part compound issue
- Work on the CG-related bugs on the bug list (7 open) (numeral related ones
postponed.
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- The name project
- Work on the name project, mark up names (400 names left)
- Extract complex names from version 1.126 and save them as a separate file in common.
- discuss kvensk project support with Sjur
- Work on the G3 bug issue with Sjur and Thomas, carry it over to sme.
10. Next meeting, closing
21.11.2005 10:00
Closed at 11:47