Meeting setup
- Date: 27.03.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:56.
Present: Børre, Saara, Sjur, Thomas, Tomi, Trond
Absent: Maaren
Main secretary: Trond
Agenda accepted with additions under “Other”.
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Davvi Girji, NSI (Sámi Instituhtta), Min Áigi, Áššu, DAT,
Báhko (Lule Sámi)
- Gather public texts, preferably also parallel ones
- Some gathered, but not converted
- Continue converting text from input format to our xml
- Tried to convert html documents, but didn’t succeed
- convert nob and nno bible texts to be used as part of a parallel corpus
- waiting for Saara and Tomi
- review the paratext2xml converter
- convert smj NT to paratext
- waiting for the two issues above
- Call Ove Sæth
- Impossible to reach on the phone, sent a mail
- Move complex name lexicon issue to bugzilla
- Send out letters to the Iđut authors
- waiting for address list from Åge Persen, leader of Iđut.
- Add corpus security re G5 syncing as an issue to Bugzilla
- write docu for how to apply for a corpus user account (forms, recipients,
etc)
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- integrate generated corpus repository summaries in the Forrest site
- copy updated DTD’s to the permlink location, or help Saara do it
- Done, and given Saara instructions on how to do it herself.
- send a final e-mail to Iđut and KIO Grafisk about copyright issues and texts
- fix bugs!
- Resolved 197 (Sjur and Thor-Øivind), 241, 259 (by Sjur)
- Misc:
- Added the GPL to our cvs repositories.
Maaren
- work with new missing lists
Saara
- Extract corpus meta info into a standard xml format; set up cron task for the
extraction
- Create a parallel corpus of the New Testaments.
- Implement validation of xml corpus against the dtd.
- Validation is implemented. New errors were found during this
process; almost all are fixed, but some fine-tuning is left.
- Finish corpus dtd documentation, dtd location and permlink reference
- update the corpus dtd with option for correction tags
- copy updated dtd’s to permanent external location
- Update convert2xml.pl to handle two gt-trees (gtfree and gtbound)
- done, but the name of gt-tree is not yet changed.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- review paratext2xml converter.
- paratext2xml had not been implemented. It is now written and part of
the conversion process. Ready for review.
- install sentence aligner.
- Aligner has a graphical interface, so it was not installed on
cochise. The tool is briefly tested and commented.
- test anonymous cvs access and review documentation.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- answer requests/questions
- test anon. read-only cvs, review docu, and send link to Finnut
- corpus repo access to free texts (with Børre)
- conversion of corpus repo summary xml to Forrest xml
- call EDD/Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
- refactor code
- implement inheritance/collection overriding for xsl/css/xquery using sitemaps
- code design for XQueries needed for dict/term editing
- send a final e-mail to Iđut and KIO Grafisk about copyright issues and texts
- sent to Anne-Britt and Per Edvard instead
- add manual editing of corpus files as an issue to Bugzilla (error tags)
- fix bugs!
Thomas
- add incoming Lule Sámi words
- not this week
- work on North Sámi compounding and derivation
- not this week
- smj G3 issue
- not this week
- sme G3 issue
- not this week
- translate stopword list into smj (aligner; list from Trond)
- translated half of it so far
- assist Trond and Linda with the smj disamb work
- done
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
(it’s almost done, but there are a couple of loose ends)
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
construct entries like we have now from the different parts of a complex name
entry
- read aligner docu, install, provide feedback
- translate stopword list into sme (aligner; list from Trond)
- fix bugs!
Trond
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Contacted both. The Finnish one is open to research use; we will get
confirmation from the Church, which is the body holding the formal rights.
The Swedish Bible society will contact me.
- translate stopword list into nno?
- Not done, but partly into Finnish. cvs?
- double check all remaining docs in gt/sme/corp/ for copyright issues
- grammatical searchability in the graphical corpus interface
- Important issue, not done.
- better smj NT text
- Asked the Bible society; nothing received yet.
- work on semantically based sets (sme, smj)
- start and lead discussion and work on semantic features for disamb
- Done some thinking, that’s all.
- fix bugs!
- Tested anon. cvs and corpus upload. Both worked very well.
3. Documentation
Changes and updates because of the Divvun public tender
TODO:
- review anon. cvs: Sjur, Saara, by Wednesday morning
- probably a new main section (sub-tab?) on external access to all our resources
- documentation on how to apply for a user account for the corpus repo
(Børre)
- we need to finish the corpus dtd documentation (Saara)
TODO:
- copy updated DTD’s to the permlink location (Børre or Saara)
4. Corpus gathering
Collecting
See a previous meeting memo for what’s to be done.
TODO: Send out the rest of the letters (Børre)
Børre has sent a letter to the publishers, has talked to Brita Kåven (she
was positive) and to Bård Eriksen (Báhko), and has mailed Dát. He has talked to
Min Áigi; they will have a meeting on Monday (today). Áššu will collect text. He
has contacted Audhild Schanche at NSI; she would look at the contracts and said
they would work out a solution. She also mentioned Dieđut.
Odin
Waiting for Sæth to discuss with colleagues how to implement the
cooperation, and to get back to us.
TODO:
- call Sæth (Børre)
- I have mailed him, not able to reach him by phone. No answer yet, though …
Olavi Korhonen’s Lule Sámi dictionary.
Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.
They are both very positive.
KIO Grafisk and the Iđut books
- Sjur has sent an e-mail explaining the issues as we see them, to Anne Britt
(the head of the project board) and Per Edvard Klemetsen (member of the
board). We will first see whether we can get an agreement with other
publishers, then try to get a meeting with the publishers and a member of the
Sámi Parliament council. If that fails as well we will have no means to get
texts from them. If so, we will forget about Iđut, and go directly to the
authors.
TODO:
- Børre will send letters to the authors
Bible texts
We will get text from Finland. We are awaiting an answer from Sweden. As for the
last html versions from Norway, the people have been very busy the last weeks.
Saara has made the paratext2xml converter.
TODO:
- review paratext2xml converter (Saara)
- converter corrected/made, use suffix .ptx when converting.
- convert smj NT to paratext (Børre)
- Will be done now that the paratext2xml has been finished.
- ask to get fin and swe NT and OT in paratext format. (Trond)
- Work in progress/texts underway.
5. Corpus infrastructure
TODO in transferring the old gt/sme/corp files to the new corpus repo:
- make sure there’s nothing left with a copyright attached to it (Trond)
- Trond will go a second round
- remove the deleted files from the CVS repository (Trond)
Further discussion about corpus analysis and computer use:
- we need to develop strong enough security routines for the G5 to fulfill our
obligations towards the text licensors
- TODO: Børre to move this to bugzilla
TODO dtd usage and documentation:
- corpus dtd documentation:
- structure, content/model and location of the dtd (location = permlink):
http://giellatekno.uit.no/dtd/corpus.dtd
- TODO: Saara to write and finish the docu, also check the dtd link
- add xml validation against our dtd to the corpus conversion process
(Saara)
- done. Some new errors were found, they are almost fixed now.
- add UTF-8 check as part of the validation (Saara)
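A minimal sketch of what the UTF-8 check could look like, in Python rather than in the project's own conversion scripts. The function names and the `*.xml` pattern are assumptions made for illustration; DTD validation itself is not covered here.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8_files(root: Path, pattern: str = "*.xml") -> list[Path]:
    """List files below root (hypothetical layout) that are not valid UTF-8."""
    return sorted(
        path for path in root.rglob(pattern) if not is_valid_utf8(path.read_bytes())
    )
```

Files reported by `find_non_utf8_files` could then be flagged before, or together with, the DTD validation step.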
TODO:
OPEN ISSUES:
- since this is manual editing, we break the automatic regeneration/reconversion
principle. Either we track each manual edit found in the existing version
before re-conversion and apply it again after the re-conversion, or we have to
find another way of preserving such edits across conversion generations.
Either way, this is now left as an open issue, and added to Bugzilla (Sjur)
- the proposed markup is too simplistic for describing more complex error
patterns, e.g. when two different errors overlap or intertwine. One could
allow nested error markup to cover cases with a syntactic error surrounding a
spelling error (one error tag for the syntax error, another inside for the
spelling error). To be further discussed in the newsgroup.
- discussed, and nesting added as well
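To illustrate the nesting idea, here is a small Python sketch. The element name `error` and the attributes `type` and `correct` are hypothetical and are not taken from the corpus dtd; the sample sentence is an English placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a syntax error wrapping a spelling error.
SAMPLE = (
    '<p>They <error type="syntax" correct="are coming">is '
    '<error type="spelling" correct="coming">comming</error>'
    "</error> today.</p>"
)


def list_errors(xml_text):
    """Return (depth, type, surface text) for each error element, outermost first."""
    found = []

    def walk(node, depth):
        for child in node:
            if child.tag == "error":
                found.append((depth, child.get("type"), "".join(child.itertext())))
                walk(child, depth + 1)
            else:
                walk(child, depth)

    walk(ET.fromstring(xml_text), 0)
    return found
```

Nesting only covers errors where one lies completely inside the other; partially overlapping errors cannot be expressed this way in well-formed XML.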
Changes and updates because of the Divvun public tender
User account admin and infra: see [previous
memo|/admin/weekly/2006/Meeting_2006-03-06.html].
TODO: see above under Documentation.
Automatic build of the content of our corpus repo: also see [previous
memo|/admin/weekly/2006/Meeting_2006-03-06.html].
TODO:
- convert from that xml to Forrest document format (Sjur)
- integrate the final Forrest documents into Forrest, and make sure it gets
published (Børre)
Free and non-free texts
More info in a [previous meeting
memo.|/admin/weekly/2006/Meeting_2006-03-13.html]
Newsgroup discussion - whether to rename gt/ to gtbound/ or not:
Saara:
I was thinking of the scripts the users have of their own (if they
have) and trying to avoid changes in already existing system. But if you prefer
gtbound, I’ll do the changes right away.
Sjur:
Let’s evaluate this in the next meeting. I would prefer gtbound/ over just gt/
for clarity’s sake and future newcomers, so if the workload isn’t too high, I
would still like you to change it. We’ll discuss and decide on Monday.
Solution:
Use a symbolic link to handle backwards compatibility.
TODO:
- update scripts to handle this dichotomy. (Saara)
- gt/ vs gtbound/: change to gtbound/, add symbolic link from gt/ to gtbound/
(Saara)
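The rename plus backwards-compatible link could be sketched as follows in Python. This assumes the change is made directly on the file system; the function name is hypothetical, and how the CVS repository itself should treat the rename is not addressed here.

```python
import os
from pathlib import Path


def rename_with_compat_link(parent: Path, old: str = "gt", new: str = "gtbound") -> Path:
    """Rename parent/old to parent/new and leave parent/old behind as a symlink."""
    old_path, new_path = parent / old, parent / new
    if new_path.exists():
        raise FileExistsError(f"{new_path} already exists")
    old_path.rename(new_path)
    # Relative link target, so the link keeps working if parent is moved.
    os.symlink(new, old_path, target_is_directory=True)
    return new_path
```

Scripts that still refer to gt/ keep working through the link, while new scripts can use gtbound/ directly.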
More texts to the graphical corpus interface:
TODO:
- We would like to have more than the NT in the graphical interface (Saara)
- We would like to have grammatical searchability, not only POS. (Saara,
Trond)
- This presupposes a discussion with Oslo. (Trond to start discussion
and Trond and Saara to continue)
- For Lule Sámi: We would like to have a parallel corpus interface with NT
(text only). This presupposes better quality texts (Trond, Børre)
- Better Lule NT text still not made.
- preparations: gather more texts (we are doing this)
- Review the tag list and have it ready for inclusion (gt/cwb/korpustags.txt)
(Trond, Linda)
- Prepare a list of good candidates for first inclusion into the corpus.
(Trond, Linda)
Text upload
The upload is working, but Børre doesn’t receive an automatic message
whenever a new text is being uploaded. Saara has made a procedure, but
hasn’t turned it on, since she didn’t know whether the email address was
working.
TODO:
- Ask for email-address: corpus@giellatekno.uit.no (Børre)
- Make a setup for this email address so that it goes to Børre, and then
turn on the procedure (Saara)
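As an illustration only, a notification message for a new upload could be built like this in Python. The sender address and the function name are assumptions; this is not the procedure Saara has written, and actually sending the message is left out.

```python
from email.message import EmailMessage

NOTIFY_ADDRESS = "corpus@giellatekno.uit.no"  # address named in the TODO above


def build_upload_notice(
    filename: str, uploader: str, sender: str = "upload@example.org"
) -> EmailMessage:
    """Build (but do not send) a notice about a newly uploaded text.

    The default sender is a placeholder, not a real project address.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = NOTIFY_ADDRESS
    msg["Subject"] = f"New corpus upload: {filename}"
    msg.set_content(f"{uploader} uploaded {filename} to the corpus repository.\n")
    return msg
```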
6. Infrastructure
Aligner
We are working on it, there are problems, and the test files are not good
enough.
TODO:
- Read documentation and try out, give feedback to Bergen. (Trond,
Saara, Tomi)
- Trond to send relevant documents to Tomi.
- Translate the anchor list anchor-eng-nor.txt into sme (and fin?)
(Tomi), and into smj (Thomas or Anders Urheim) (and nno?
Trond). Note that the format is “lang1 / lang2”, and that the number of
lines should be kept in order to make it possible to move from one language to
the next.
- Saara to install the aligner, everyone to read the documentation on
Tuesday and Wednesday, and then we have a meeting on it later this week.
- Add the anchor list translations to cvs (Trond)
- add to cvs location: gt/common/src/anchor.txt
- “eng / nob / sme / smj / fin”.
word / ord* / sátni, sáni / sana, sanoj
- contra mono: hard to align
- contra bi: each lg twice
- usage: for eng/nob alignment, use eng/nob, for nob/sme alignment, use
nob/sme
Perhaps best to have all languages in one list, and extract pairs via
cut -d"/" -f1,4
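The same column extraction can be sketched in Python, which also trims the spaces around each field. The function name is hypothetical, and the column order is assumed to follow the "eng / nob / sme / smj / fin" layout above (0-based, so columns 0 and 3 correspond to fields 1 and 4 of cut).

```python
def extract_pair(lines, first, second, sep="/"):
    """Keep two columns (0-based) of each anchor line, one output line per input line."""
    pairs = []
    for number, line in enumerate(lines, start=1):
        fields = [field.strip() for field in line.split(sep)]
        if max(first, second) >= len(fields):
            raise ValueError(f"line {number} has only {len(fields)} fields")
        pairs.append(f"{fields[first]} {sep} {fields[second]}")
    return pairs
```

Because every input line produces exactly one output line, the line numbering stays aligned across all extracted language pairs.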
7. Linguistics
North Sámi
Semantic feature system
TODO:
- decide on a semantic feature system for nouns (Linda).
- Work with semantically based sets (Trond, Linda)
- Return to the infrastructure issue (Trond)
- A full semantic encoding of the lexicon is a future project, outside the
scope of both divvun and disamb, but the ground work for such a project will
be laid now.
Further discussion and details in the [previous meeting
memo.|/admin/weekly/2006/Meeting_2006-03-20.html]
TODO (Trond):
- Discussion testing
- infrastructure
- semiautomatic retagging
Lule Sámi
TODO:
- add the rest of the inc- words (Thomas)
- nothing done this week
- name morphology (Thomas)
- handed Tomi list
- translate Northern Sámi lists and sets to Lule Sámi
- Linda, Trond, with help from mother tongue speakers (Thomas, others).
Work in progress.
8. Name lexicon infrastructure
Complex names
TODO:
- Move xml2lexc complex name issue to bugzilla (Børre)
Editing
TODO on eXist as editor:
- refactor and prepare risten.no for multiple collections:
- develop the Cocoon sitemap to delegate requests to the proper folder level,
such that the most specific code is always used (Sjur)
- done for XQueries and XSLT; only CSS left (needs to be handled differently)
- refactor the code into more and more specific components according to our
folder hierarchy (Tomi, Sjur)
- develop the needed XQueries and interface (Sjur, Tomi)
- data synchronisation between risten.no and the cvs repo (Tomi)
- nothing last week
- test and review when ready
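The "most specific code is always used" rule for the sitemap can be illustrated with a small Python lookup. This is not Cocoon sitemap syntax; the function name and the folder and file names in the example are invented for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_most_specific(collection: str, resource: str, available: set) -> Optional[str]:
    """Return the resource path from the deepest folder that provides it."""
    folder = PurePosixPath(collection)
    while True:
        # At the top level the folder is "." and the candidate is just the file name.
        candidate = str(folder / resource)
        if candidate in available:
            return candidate
        if folder == folder.parent:  # top level reached, nothing found
            return None
        folder = folder.parent
```

For example, with `available = {"edit.xq", "names/edit.xq"}`, a request from the collection `names/places` for `edit.xq` resolves to `names/edit.xq`, while a request from `terms` falls back to the top-level `edit.xq`.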
Data synchronisation task list/specification:
Details in the [previous meeting
memo.|/admin/weekly/2006/Meeting_2006-03-20.html]
9. Spellers
Nothing until the new proper noun lexicon is in place.
10. Other
Divvun admin
The project manager would like all Divvun project
members not working in the SD buildings to write down all hours worked, and a
very brief description of the tasks done. The list should be sent to Sjur
every week on Friday afternoon as you leave work, or the following Monday
morning.
Making such lists is necessary to be able to document to the SD administration
that we are working the hours we should, in case of inquiries or newspaper
stories. I trust (and know) you do, but that is not enough if somebody external
doesn’t. Those working in the SD buildings are using a time clock for the
same purposes, which at the same time enforces a stricter working-hour regime
than what is possible with our self-reporting system.
I have been doing the same thing for myself for a long time, and the benefit is
that it is easy to keep track of extra hours and “avspasering”. There are many
applications out there that will help you, or you can just make a simple list on
your own.
TODO:
- keep a list of worked hours (all Divvun team members)
- start this week, then every week
Divvun project management while Sjur is on paternity leave
Sjur will soon go on paternity leave (expected April 6), and most likely be
away for two weeks. While he is away, somebody else would need to be heading the
Divvun project. The basic tasks are pretty simple for the most part:
TODO:
- set up Monday meetings
- conduct the meeting (or let Trond do it:-)
- finalize the meeting memo afterwards, making sure all tasks discussed have
been properly added to the task list of the relevant persons
- also add the meeting memo template for the next meeting, so that people can
update their tasks as they complete/work on them
- be the main contact person for Finnut Consult AS, and
if there are any requests for information to make an offer for the Divvun
project, delegate that question to the most appropriate one, and return the
answer to Finnut
Børre is temp. Project Manager:-)
Easter vacation/absences
Who? | When?
Børre | from the 10th to the 12th of April
Saara | at work normally
Sjur | no vacation, possibly paternity leave
Thomas | from the 10th to the 12th of April, 3 days
Tomi | from the 10th to the 12th of April, might be at work offline
Trond | don’t know yet
Gobby
TODO:
- install and test it, to prepare for cooperation with non-Mac users (use case:
Lars Nygård in Oslo) (Børre, Tomi, Trond, if it works ok then also the
others)
SubEthaEdit update
SEE 2.3 is released. It is now commercial only, but 2.2 is still available for
free, non-commercial use. Since we already have licenses, this is a non-issue,
and all should upgrade. The new version contains bug fixes, and a few new
features (don’t remember them all, mostly improvements in the UI).
Sjur: I have made a simple, but useful jspwiki mode, for syntax coloring of
our meeting memos:-) It isn’t completely reliable yet, improvements welcome (I
won’t do anything more on it).
Sjur: I have also made a first attempt at an XQuery mode, but that one isn’t
working very well. I’ll give it to interested people, though:-)
TODO:
- upgrade SEE (all)
- install jspwiki mode from Sjur (all interested)
Bug fixing
35 open Divvun/Disamb bugs, and 25 risten.no bugs
11. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Move complex name lexicon issue to bugzilla
- Send out letters to the Iđut authors
- Add corpus security re G5 syncing as an issue to Bugzilla
- write docu for how to apply for a corpus user account (forms, recipients,
etc)
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- integrate generated corpus repository summaries in the Forrest site
- Ask for email-address: corpus@giellatekno.uit.no
- install and test Gobby, install new version of SEE (also for Thomas)
- fix bugs!
Maaren
- will be on sick leave throughout April
Saara
- Create a parallel corpus of the New Testaments.
- change the name of gt/ to gtbound/ and add a symbolic link.
- fix the email address for corpus upload.
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- add utf-8 check to xml-validation of the corpus files.
- install aligner, test it and give feedback
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- answer requests/questions
- corpus repo access to free texts (with Børre)
- conversion of corpus repo summary xml to Forrest xml
- call EDD/Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
- refactor code
- implement inheritance/collection overriding for css using sitemaps
- code design for XQueries needed for dict/term editing
- fix bugs!
Thomas
- add incoming Lule Sámi words
- work on North Sámi compounding and derivation
- smj G3 issue
- sme G3 issue
- translate stopword list into smj (aligner; list from Trond)
- assist Trond and Linda with the smj disamb work
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
(it’s almost done, but there are a couple of loose ends)
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
construct entries like we have now from the different parts of a complex name
entry
- read aligner docu, install, provide feedback
- translate stopword list into sme (aligner; list from Trond)
- install and test Gobby, install new version of SEE
- fix bugs!
Trond
- Translate anchor list into nno, work on sme, fin.
- Add the anchor list translations to cvs
- remove deleted files from the CVS repository (in the Attic)
- grammatical searchability in the graphical corpus interface: revise taglist
- better smj NT text
- Prepare a list of good candidates for first inclusion into the corpus.
- translate Northern Sámi lists and sets to Lule Sámi
- work on semantically based sets (sme, smj)
- start and lead discussion and work on semantic features for disamb
- Install Gobby with support programs, see, etc.
- get a key for Maaren in May
- install aligner, test it and give feedback
- fix bugs!
12. Next meeting, closing
03.04.2006 09:30
Closed at 12:19