Meeting setup
- Date: 13.03.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:40.
Present: Børre, Linda (from topic 7 onwards), Maaren, Saara, Sjur, Thomas,
Tomi, Trond
Absent: none
Main secretary: Tomi
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- http://girji.info/skolehist done …
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- Waiting for paratext versions
- review the paratext2xml converter
- Waiting for paratext versions of nob and nno
- convert smj NT to paratext
- Waiting for paratext versions of nob and nno
- Call Ove Sæth
- Move complex name lexicon issue to bugzilla
- Ask KIO Grafisk to make a test Quark document based on a Word document from us
- Iđut and KIO Grafisk won’t give access to their Quark files, so they don’t
see the need to send it.
- Send out letters to the Iđut authors
- Åge Persen will collect an address list.
- Add corpus security re G5 syncing as an issue to Bugzilla
- set up anon. read-only cvs with Sjur
- write docu for how to apply for a corpus user account (forms, recipients,
etc)
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- integrate generated corpus repository documents in the Forrest site
- give Saara the needed details regarding corpus dtd location on our public
server
- fix bugs!
Maaren
- work with the top-ten list
**done
- translate stopword list from Norw. to smj, fin (aligner, stopword list from
Trond)
**?????? (ask Trond)
Saara
- Extract corpus meta info into a standard xml format; set up cron task for the
extraction
- Create a parallel corpora of the new testaments.
- Implement validation of xml corpus against the dtd.
- Create a group for corpus users.
- learned how to create groups and added one: corpus.
- Finish corpus dtd documentation, dtd location and permlink reference
- The dtd-location still open.
- Update convert2xml.pl to handle two gt-trees (gtfree and gtbound)
- Implemented, some questions left before installation.
- add more texts to the graphical corpus interface.
- Not done. emailed Lars some questions.
- grammatical searchability in the graphical corpus interface
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- project planning with Trond, continued
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- answer requests/questions
- set up anon. read-only cvs with Børre
- done (well, I really did nothing)
- corpus repo access
- still open, but we’ll let them have only the free part for now; further
discussions later
- conversion of corpus repo summary xml to Forrest xml
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
- refactor code
- did some small adjustments - most of the work waiting for the task below
- implement inheritance/collection overriding for xsl/css/xquery using sitemaps
- first version of this system implemented and working for the initial query
page for editors (select a collection - the query page that follows will now
be generated from the most specific version found for the requested
collection)
- fix bugs!
- no bugs fixed last week:-(
Thomas
- add incoming Lule sámi words
**added and still adding
- include the SGL decisions in our normativity document
**done
- include normativity decisions made by Magga and Sammalahti in our normativity
document
**done
- work on North Sámi compounding and derivation
**nothing
- smj G3 issue with Sjur and Trond
**nothing
- sme G3 issue with Sjur and Trond
**nothing
- translate stopword list into smj (aligner; list from Trond)
**not done
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
(it’s almost done, but there are a couple of loose ends)
- new proper name lexicon
- discuss the new lexicon format and other issues in the newsgroup
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- helped Sjur with this one
- new version of xml2lexc (based on ccat), should handle complex names correctly:
construct entries like we have now from the different parts of a complex name
entry
- read aligner docu, install, provide feedback
- implement oslolaš issue for smj
- fix bugs!
Trond
- Clean up the old corp/ directory.
- Work on corpus texts with Børre for parallel NT texts
- Not much, still awaiting response from Bible societies.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Asked my contacts in Norway to get addresses, still no response.
- read aligner docu, install, provide feedback
- Ask for a Quark test file from his sister
- Done, received, also got some from Michael Everson.
- translate stopword list into nno?
- Not done anything with the aligner issue.
- fix bugs!
3. Documentation
Changes and updates because of the Divvun public tender
TODO:
- document anonymous, read-only access to our cvs repo (Børre)
- review: Sjur, Saara, by Wednesday morning
- probably a new main section (sub-tab?) on external access to all our resources
- documentation on how to apply for a user account for the corpus repo
(Børre)
- we need to finish the corpus dtd documentation (Saara)
Permlink location for all our dtd’s (filename will vary, of course):
http://giellatekno.uit.no/dtd/corpus.dtd
This corresponds to the dir ~/gt/public_html/dtd/
on our public web server.
The DTDs have to be copied manually to this location, but since they don’t
change that often, that shouldn’t be a big problem.
TODO:
- copy updated DTD’s to the permlink location (Børre or Saara)
4. Corpus gathering
Collecting
See a previous meeting memo for what’s to be done.
TODO: Send out the rest of the letters (Børre)
Odin
Waiting for Sæth to discuss with colleagues how to implement the cooperation,
and then get back to us.
TODO:
Olavi Korhonen’s Lule Sámi dictionary.
Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.
They are both very positive.
KIO Grafisk and the Iđut books
Iđut and KIO Grafisk won’t give access to their Quark files, due to copyright
issues with fonts and pictures. It is a principle for them.
Citations from one of the discussions we have had with Quark experts:
- Trond: Can you confirm that I can get a quark file from you WITHOUT at the
same time receiving the fonts?
- Michael: Of course. Quark files do not embed fonts. They are in the Fonts
folder.
- Trond: what about pictures?
- Michael: However, if I send you a Quark file and you don’t have the fonts, it
will display with other fonts and will look bad. A low-quality image of a
picture will appear when you import one into Quark, but the main picture that
the printer uses is in a separate file.
TODO:
- Børre will send letters to the authors.
- send an e-mail explaining the issues as we see them, to Iđut and KIO Grafisk,
and with a copy to the head of the project board (Anne Britt). This e-mail
will be our last effort to try to get the Iđut material in its corrected form.
(Sjur and Børre)
Bible texts
TODO:
- review paratext2xml converter (Saara)
- convert smj NT to paratext (Børre)
- ask to get fin and swe NT and OT in paratext format. (Trond)
- Still not done. Trond has contacted Bibelselskapet for a new sme
version, and the issue is under discussion.
5. Corpus infrastructure
TODO in transferring the old gt/sme/corp files to the new corpus repo:
- for the biggest top ten (or so) the orig. should be located and copied to the
new corpus repo (Trond)
- then these files should be removed from gt/sme/corp/ (Trond/Børre)
- all small files could just be forgotten/ignored
- make sure there’s nothing left with a copyright attached to it (Trond)
- Trond will go a second round
TODO for access control:
- Access control to corpus repo resolved through Unix groups: one group for
corpus maintainers with write access, and another for corpus users, with
read-only access. Still not done.
- Saara has asked for a *nix group - use it when created
Further discussion about corpus analysis and computer use:
- we need to develop strong enough security routines for the G5 to fulfill our
obligations towards the text licensors
- TODO: Børre to move this to bugzilla
TODO dtd usage and documentation:
- corpus dtd documentation:
- structure, content/model and location of the dtd (location =
permlink) http://giellatekno.uit.no/dtd/corpus.dtd
TODO: Saara to write and finish the docu, also check the dtd link
- see above under documentation for details.
- add xml validation against our dtd to the corpus conversion process
(Saara)
HTML conversion problem
We need to extract only the table from input like below, since our DTD does not
allow tables to be nested inside paragraphs. Cf newsgroup message from
Saara.
<p>
<table>
...
</table>
</p>
The solution is a simple XSL template that matches only the relevant
structures and “eats” the paragraph before continuing to process the
table:
<xsl:template match="p[table]">
  <xsl:apply-templates select="table"/>
</xsl:template>
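To see what the template above does to a document, here is a minimal sketch of the same unwrapping step in Python, using only the standard library. The function name unwrap_tables and the sample element names (body, tr, td) are assumptions made for illustration; the real conversion is done in XSL. As with the template, any other content of the matched paragraph is dropped.

```python
import xml.etree.ElementTree as ET


def unwrap_tables(root):
    """Replace every <p> that has a <table> child with those tables."""
    # Take a snapshot of all elements first, so the tree is not
    # modified while it is being traversed.
    for parent in list(root.iter()):
        # Walk the children backwards so that insertions do not shift
        # the indexes of children still to be visited.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag == "p" and child.find("table") is not None:
                tables = child.findall("table")
                # Text that followed the paragraph now follows the last table.
                tables[-1].tail = child.tail
                parent.remove(child)
                for offset, table in enumerate(tables):
                    parent.insert(index + offset, table)
    return root


sample = "<body><p>text</p><p><table><tr><td>1</td></tr></table></p></body>"
doc = unwrap_tables(ET.fromstring(sample))
print(ET.tostring(doc, encoding="unicode"))
```

The ordinary paragraph is left untouched, while the paragraph wrapping the table is replaced by the table itself.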
There are many scenarios where information about spelling and other errors is
useful, especially if combined with the correct string. As a simple way of
marking up such info, Sjur proposed the following in the newsgroup:
... this is <error correct="text">tekst</error> with an error...
No problem in adding it to the DTD, together with corresponding info in the
header/meta part. That info should be a single element stating that/whether it
was manually edited for corrections. Absence of this element is the same as the
document NOT being edited.
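As an illustration of how such markup could be used, the following sketch collects (erroneous string, correction) pairs from a marked-up fragment. It assumes only the error element and its correct attribute as proposed above; the wrapping p element and the function name extract_errors are hypothetical. Nested error elements, if they are ever allowed, would simply yield one pair per element.

```python
import xml.etree.ElementTree as ET


def extract_errors(xml_text):
    """Return (erroneous string, correction) pairs for every <error> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for error in root.iter("error"):
        # itertext() also includes the text of any nested elements.
        wrong = "".join(error.itertext())
        pairs.append((wrong, error.get("correct")))
    return pairs


sample = '<p>this is <error correct="text">tekst</error> with an error</p>'
print(extract_errors(sample))  # [('tekst', 'text')]
```

Pairs like these could feed speller test suites, which is one of the scenarios where the combination of error and correct string is useful.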
TODO:
OPEN ISSUES:
- since this is manual editing, we break the automatic regeneration/reconversion
principle. Either we track each change when we find such edits in the
existing version before re-conversion, and apply them again after the
re-conversion, or we have to find another way of preserving them across
conversion generations. Anyway, this is now left as an open issue, and added
to Bugzilla (Sjur)
- the proposed markup is too simplistic for describing more complex error
patterns, e.g. when two different errors overlap or intertwine. One could
allow nested error markup to cover cases with a syntactic error surrounding a
spelling error (one error tag for the syntax error, another inside for the
spelling error). To be further discussed in the newsgroup.
Changes and updates because of the Divvun public tender
User account admin and infra: see [previous
memo|/admin/weekly/2006/Meeting_2006-03-06.html].
TODO: see above under Documentation.
Automatic build of the content of our corpus repo: also see [previous
memo|/admin/weekly/2006/Meeting_2006-03-06.html].
TODO:
- extract meta info into a compact xml document, the xml should be stored in the
Forrest docu tree; set up cron task for the extraction (Saara)
- discuss and decide upon the structure of the generated xml above (Sjur and
Saara, news)
- convert from that xml to Forrest document format (Sjur)
- looked at the generated file, nothing more yet
- integrate the final Forrest documents into Forrest, and make sure it gets
published (Børre)
Free and non-free texts
More info in the [previous meeting
memo.|/admin/weekly/2006/Meeting_2006-03-13.html]
TODO:
- update scripts to handle this dichotomy. (Saara)
More texts to the graphical corpus interface:
6. Infrastructure
We need to set up anonymous, read-only access to our cvs repo as outlined by our
friend in Skolelinux.
Howto/who:
- what do we need?
- web interface? maybe, not required
- command line check-out? yes (Roy Dragseth/Børre)
- need to be able to restrict anonymous cvs to only specific modules
(Roy Dragseth/Børre)
- testing needed: (Saara, Sjur)
Aligner
TODO:
- Read documentation and try out, give feedback to Bergen. (Trond,
Saara, Tomi)
- Trond to send relevant documents to Tomi.
- Translate the stopword list anchor-eng-nor.txt into sme (and fin?)
(Tomi), and into smj (Thomas or Anders Urheim) (and nno? Trond).
Note that the format is “lang1 / lang2”, and that the number of lines should be
kept in order to make it possible to move from one language to the next.
- Saara to install the aligner, everyone to read the documentation on
Tuesday and Wednesday, and then we have a meeting on it later this week.
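The two constraints on the translated stopword lists (same number of lines, “lang1 / lang2” on each line) are easy to check mechanically. The sketch below is only an illustration: the function name is made up, and it assumes nothing about the file beyond a “/” separating the two languages on each non-empty line.

```python
def check_anchor_translation(original_lines, translated_lines):
    """Return a list of problems found in a translated anchor word list."""
    problems = []
    # The translated list must keep the line count of the original, so
    # that line N refers to the same entry in every language pair.
    if len(original_lines) != len(translated_lines):
        problems.append(
            f"line count differs: {len(original_lines)} vs {len(translated_lines)}"
        )
    # Every non-empty line must contain the separator between the languages.
    for number, line in enumerate(translated_lines, start=1):
        if line.strip() and "/" not in line:
            problems.append(f"line {number}: missing '/' separator")
    return problems


print(check_anchor_translation(["a / b", "c / d"], ["x / y", "z"]))
```

An empty result means the translated list can be lined up with the original entry by entry.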
Language recogniser
We don’t have enough Finnish text. We will look at the Helsinki corpus
(Saara).
- This is documented in the bug database.
7. Linguistics
North Sámi
TODO:
- document all past decisions in our normativity document (Thomas)
**done
- decide on a semantic feature system for nouns (Linda).
This is needed for the sme locative/comitative issue. We looked at the feature
system made by Eckhard Bick (Eckhard Bick: “The parsing system “palavras””,
p. 372ff). The top of the feature tree looks like the following:
                                    Concrete
                              +/                \-
               Animate                                Verbal Content
            +/         \-                          +/                \-
      Human              Moving               Control                 Mass
    +/     \-          +/      \-           +/       \-             +/    \-
#humans#  Moving  #vehicles#  Movable  Perfective  Perfective  #features#  Count
.......
TODO:
- Work with semantically based sets (Trond, Linda)
- Return to the infrastructure issue (Trond)
- A full semantic encoding of the lexicon is a future project, outside the
scope of both divvun and disamb, but the ground work for such a project will
be laid now.
Semantic tags we already have:
Actio
Plc (for proper nouns such as “London London+N+Prop+Plc+Sg+Nom”)
Place names:
Now: Tags Plc Sur and combinations (London, Trosterud).
Problem:
19000 Plc
900 SurPlc (Trosterud BERN-surplc) Oslo sur. Berlin Bonn?
10500 Sur
- Solution A:
- Heavily retag the lexicon: London also as Sur (Jack)
- Solution B: Do not use (new..) double tags.
- Plc being the default tag for Plc/Sur (“if it can be Plc, it is Plc”)
- Sur being the tag of things that cannot be places (Andersen)
- Then cg rules turning Plc into &Plc and &Sur, and Sur into &Sur.
- Then rules for interpreting London, Trosterud, etc. as &Sur.
- Then a final rule for removing ambiguous ones (remove Plc &Sur strings).
- Solution C compromise:
- Real placenames Menešjávri
- Convertible placenames (today’s double) England, Bonn, (default)
- Real surnames Andersen, Johansson
TODO (Trond):
- Discussion testing
- infrastructure
- semiautomatic retagging
Lule Sámi
TODO:
- add the rest of the inc- words (Thomas)
**still working on it, should finish this week
- name morphology (Thomas)
**handed Tomi list
- oslolaš for smj (Tomi)
- translate Northern Sámi lists and sets to Lule Sámi
- Linda, Trond, with help from mother tongue speakers (Thomas, others).
Work in progress.
Trond will go to Drag tomorrow. Issues for the trip? No unobvious ones.
8. Name lexicon infrastructure
Complex names
- make sure xml2lexc can handle complex names in ways compatible with our
present tool chain (=reconstruct the lexc format we have now) (Tomi)
- the resulting file format should be identical to our present prop-name
file (=lexc), that can then be converted to our new xml format using the same
script as for the regular names (Tomi or Saara, but only when the
technical details are settled)
TODO:
- Move this issue to bugzilla (Børre)
TODO on eXist as editor:
- refactor and prepare risten.no for multiple collections:
- develop the Cocoon sitemap to delegate requests to the proper folder level,
such that the most specific code is always used (Sjur)
- Progressing well
- refactor the code into more and more specific components according to our
folder hierarchy (Tomi)
- develop the needed XQueries and interface (Sjur, Tomi)
- data synchronisation between risten.no and the cvs repo (Tomi)
- done some, but it didn’t work out, will need to start on a different trail
- test whether eXist as editor is actually working well (linguists)
Data synchronisation task list/specification:
- the xml file needs to be stored/updated in cvs
- there should be no diffs on whitespace and sorting order (to ensure we get
only substantial diffs, and no noise)
- the prop name update cycle should be something like:
- dump the xml from eXist (in proper sorting order)
- check whether there are diffs against cvs; continue only if there are
- update from cvs
- error check: are there conflicts? if yes, send report to
- are there still diffs? if yes, continue:
- check in/commit w. generated comment
- error check: is the document valid and conformant xml? if no, stop and send a
report to
- reimport the xml file into eXist
- question: do we need to lock the file in eXist through this update cycle?
- the update cycle should be a nightly cron job
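The requirement that whitespace and sorting order must never produce diffs can be met by normalising the dump before it is compared with the cvs version. The following is a minimal sketch of that idea, not the actual synchronisation script: the element names (names, entry), the id attribute used as sort key, and both function names are assumptions, since the real lexicon format is still under discussion.

```python
import xml.etree.ElementTree as ET


def normalise(xml_text, key_attr="id"):
    """Serialise a flat lexicon so that only substantial changes cause diffs."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Drop whitespace-only text and tails so that indentation
        # differences never show up in a diff.
        if element.text is not None and not element.text.strip():
            element.text = None
        if element.tail is not None and not element.tail.strip():
            element.tail = None
    # Put the top-level entries in a fixed order (hypothetical sort key).
    root[:] = sorted(root, key=lambda entry: entry.get(key_attr, ""))
    # Write one entry per line, which keeps cvs diffs readable.
    root.text = "\n"
    for entry in root:
        entry.tail = "\n"
    return ET.tostring(root, encoding="unicode")


def has_substantial_diff(dumped_xml, cvs_xml):
    """True if the eXist dump differs from the cvs version in content."""
    return normalise(dumped_xml) != normalise(cvs_xml)


dump = '<names>\n  <entry id="b"/>\n  <entry id="a"/>\n</names>'
in_cvs = '<names><entry id="a"/><entry id="b"/></names>'
print(has_substantial_diff(dump, in_cvs))  # False
```

A nightly job could call has_substantial_diff first and continue with the update, commit and reimport steps only when it returns True. Attribute order is not normalised in this sketch, so the dump would have to write attributes in a stable order.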
9. Spellers
Nothing until the new proper noun lexicon is in place.
10. Other
Gobby
A cross-platform alternative to SubEthaEdit, Gobby, is now available for OS X
through DarwinPorts. I haven’t tested it in collaborative use, but it is worth
looking at for collaboration involving Windows and Linux users.
Requirements for easy install:
- DarwinPorts ([http://darwinports.opendarwin.org/])
- Port Authority (GUI for DarwinPorts)
([http://www.apple.com/downloads/macosx/unix_open_source/portauthority.html])
Install and run the above as admin user. Then find and install Gobby (hint: use
the search field), and wait. It took something like 8 hours on my computer to
download, compile and install all dependencies, but there were no problems,
hiccups or other complications. When finished, open X11, and just type ‘gobby’ in
any terminal window.
Bug fixing
35 open Divvun/Disamb bugs, and 25 risten.no bugs
SPR language policy decision
Last week’s SPR meeting decided upon a language policy. Their decision was the
following (cf. their meeting minutes for the final version):
«A functioning linguistic infrastructure is of decisive importance if the Sámi
languages are to function as languages of everyday use in a modern society.
Such an infrastructure includes a well-developed terminology, monolingual
dictionaries, dictionaries and textbooks between the Sámi languages, exhaustive
grammars that give information on all aspects of the language, and language
programs that make it possible to search for information, find terms, correct
spelling errors and grammar, and translate by machine from one language to
another. To achieve this, representative text collections are needed,
preferably of several hundred million words, both monolingual and parallel,
and grammatical and language technology research is needed. The role of SPR is
to facilitate this work.»
11. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Call Ove Sæth
- Move complex name lexicon issue to bugzilla
- Send out letters to the Iđut authors
- Add corpus security re G5 syncing as an issue to Bugzilla
- write docu for how to apply for a corpus user account (forms, recipients,
etc)
- remove old corpus files from gt/sme/corp/ after Trond has cleaned it
- integrate generated corpus repository summaries in the Forrest site
- copy updated DTD’s to the permlink location, or help Saara do it
- send a final e-mail to Iđut and KIO Grafisk about copyright issues and texts
- fix bugs!
Maaren
- work with new missing lists
Saara
- Extract corpus meta info into a standard xml format; set up cron task for the
extraction
- Create a parallel corpora of the new testaments.
- Implement validation of xml corpus against the dtd.
- Create a group for corpus users.
- Finish corpus dtd documentation, dtd location and permlink reference
- update the corpus dtd with option for correction tags
- copy updated dtd’s to permanent external location
- Update convert2xml.pl to handle two gt-trees (gtfree and gtbound)
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- review paratext2xml converter.
- install sentence aligner.
- test anonymous cvs access and review documentation.
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- answer requests/questions
- test anon. read-only cvs, review docu, and send link to Finnut
- corpus repo access to free texts (with Børre)
- conversion of corpus repo summary xml to Forrest xml
- call EDD/Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
- refactor code
- implement inheritance/collection overriding for xsl/css/xquery using sitemaps
- code design for XQueries needed for dict/term editing
- send a final e-mail to Iđut and KIO Grafisk about copyright issues and texts
- add manual editing of corpus files as an issue to Bugzilla (error tags)
- fix bugs!
Thomas
- add incoming Lule sámi words
- work on North Sámi compounding and derivation
- smj G3 issue
- sme G3 issue
- translate stopword list into smj (aligner; list from Trond)
- assist Trond and Linda with the smj disamb work
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
(it’s almost done, but there are a couple of loose ends)
- new proper name lexicon
- implement data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
construct entries like we have now from the different parts of a complex name
entry
- read aligner docu, install, provide feedback
- translate stopword list into sme (aligner; list from Trond)
- fix bugs!
Trond
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- translate stopword list into nno?
- double check all remaining docs in gt/sme/corp/ for copyright issues
- grammatical searchability in the graphical corpus interface
- better smj NT text
- work on semantically based sets (sme, smj)
- start and lead discussion and work on semantic features for disamb
- fix bugs!
12. Next meeting, closing
27.03.2006 09:30
Maaren will be away the next four weeks, starting next week. After that she will
work in Tromsø all May, and will share office with one of the Tromsø gang then.
She will need her own key for that period (Trond).
Closed at 11:48