Meeting setup

Date: 06.06.2006
Time: 09:30 Norwegian time
Place: Where we are
Tools: SubEthaEdit, iChat

Agenda

Opening, agenda review
Reviewing the task list from two weeks ago
Documentation - divvun.no
Corpus gathering
Corpus infrastructure
Infrastructure
Linguistics
name lexicon infrastructure
Spellers
Other issues
Summary, task lists
Closing

1. Opening, agenda review, participants

Opened at 09:59.

Present: Sjur, Thomas, Trond, Børre, Tomi

Absent: Maaren, Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

corpus collection:
- send out contracts with accompanying letter
  - Synnøve Persen
- Gather public texts, preferably also parallel ones
  - http://finnmarksloven.no gathered, fully parallel
- Send out letters to the rest of the Iđut authors
- send contract to Kurt Tore Andersen
  - Done
- call Brita Kåven again
  - Not done
- contact Ája (Kåfjord)
  - No answer
- Send renaming scripts to R. Valkeapää
  - Done
- call Bård Eriksen
  - Not done
- send contracts to Čálliid Lágádus
  - Not done
corpus conversion:
- Continue converting text from input format to our xml
  - Done
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- convert smj NT to paratext
  - Not done
- complete Min Áigi metadata
  - Some done
corpus access:
- meeting 9.6. (postponed to 15(?).6) at 9.30: discuss and decide upon the exact access policy we want
  - Done
- set up the unix group structure for corpus users (also Saara, Trond)
set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
- Not done
test Gobby (with others)
- done
document use of Gobby within our project if above test is ok
- Not done
fix bugs!

Maaren

On sick leave

Saara

Create a parallel corpus of the New Testaments
add more texts to the graphical corpus interface
Implement parallel corpus upload in web upload script
- not done
Install Gobby
- not done
set up the unix group structure for corpus users (also Børre, Trond)
Test the aligners once again
- not done
refine the xml output of the xml-tagged analyses
- waiting for Tomi’s ccat implementation
convert or adapt the received PHP for paradigm generation to our needs
- done some planning.
remove headers and footers from antiword documents, other improvements
- done, now improving handling of pdf-documents
discuss parallel corpus markup
- done
extend the DTD with parallel markup
- do we want that? In the newsgroup, I proposed to keep the marking separate from corpus db. Sjur: I think I meant the markup in the header to link parallel files, which you have done now.
Implement cleaning the corpus text in the file-specific xsl file
- not done
fix bugs!

Sjur

public tender:
- collect the last clarifying answers, and make a final proposal to the board
  - still no answer from PL. If it doesn’t arrive today, we’ll make our proposal to the board based on the information at hand.
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
- started some work on integrating M4 processing of the twol code - needed to adapt the twol to different scenarios
name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
  - investigation ready, and the concept phase is over, only implementation left
change corpus summary processing to generate smaller pages
- nothing yet
meeting 23.5. at 9.30: discuss and decide upon the exact access policy we want
- done
test Gobby (with others)
- not done
discuss parallel corpus markup
- not really
move the SEE 2.5 extension list to Bugzilla
- not done
fix bugs!
other:
- monthly reports for April and May
- fixed a bug in the Forrest JSPWiki input plugin that produced invalid HTML for nested lists.

Thomas

investigate Actio compounding, first sme, later smj
- done North-sámi three-syll
discuss findings with Maaren, and later all of us
- not done, Maaren is on sick leave
add proper numeral analysis/treatment to smj
- not done
add loanwords (e.g. latin -ere verbs) to smj
- not done
work on compounding
- soon finished
lule sámi incoming words
- finished
bug-fixing
- done some
sme G3 issue
- not done

Tomi

new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
  - not done
- XQuery refactoring and code development for our proper noun editor
  - not done
- new version of xml2lexc (based on ccat), should handle complex names correctly: construct entries like we have now from the different parts of a complex name entry
  - not done
read aligner documentation, install, provide feedback
- not done
Set up the mechanism for the hash-mark transducer package
- not done
refine the xml output of the xml-tagged analyses
- done, but is it ready?
convert or adapt the received PHP for paradigm generation to our needs
- not done
test Gobby (with others)
- not done
fix bugs!

Trond

better smj NT text
- Not done, must finish Swedish work with Saara and Børre first.
get fin, swe, nob and nno NT and OT in paratext format
- Still waiting for Oslo
install aligner, test it and give feedback
- Not done
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
- Not done
meeting 9.6. at 9.30: discuss and decide upon the exact access policy we want
- Done
set up the unix group structure for corpus users (also Børre, Saara)
- Done
test Gobby (with others)
- Done on a regular basis, also last week.
discuss parallel corpus markup
- Discussed with Lars and Saara, made good progress, and we will send texts this week.
fix bugs!

3. Documentation

TODO:

documentation on how to apply for a user account for the corpus repo (Børre)
- we will administer the corpus user accounts ourselves
- We first have to discuss and decide what we want before Børre can write documentation (see further down)
  - done
Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus.
- Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks needed to set up a corpus account.

4. Corpus gathering

Collecting

See a previous meeting memo for what’s to be done.

TODO:

Send out the rest of the letters (Børre)

New contracts:

none in the last 2 weeks

Olavi Korhonen’s Lule Sámi dictionary.

TODO:

set up user account/corpus access for Olavi (Børre)
- we need the infrastructure for corpus access in place first
  - decided, but still not implemented
Contact Olavi Korhonen, to actually get the dictionary

KIO Grafisk and the Iđut books

TODO:

send letter to Kurt Tore Andersen (Børre)
- done
send letters to the other authors (Børre)
- sent to Synnøve Persen
- Erling Persen has lost his files, only source is Iđut

Bible texts

TODO:

get nob and nno NT and OT in paratext format. (Trond)
- waiting for Oslo
convert smj NT to paratext (Børre)
- waiting for an NT in paratext format (whatever language will do)
convert fin, swe to paratext or directly to our XML (Børre)

Davvi Girji

TODO:

call Brita Kåven again towards the end of the week (Børre)
- not done
call the authors (Børre)
- not more

Min Áigi

TODO:

complete metainformation (Børre)
- continued, not finished

Kåfjord

TODO:

contact Ája (Børre)
- no answer, will try again

Sámi Instituhtta

TODO:

send renaming scripts to R. Valkeapää (Børre)
- done

Čálliid Lágádus

[http://www.calliidlagadus.org/]

TODO:

send contracts (Børre)
- not done

Árran

TODO:

call Bård Eriksen (Børre)
- Done

5. Corpus infrastructure

General

Errors in the Antiword conversions were found when parsing the XML corpus.

There are also problems with the PDF conversion. Børre has found a tool which will produce XML output, and has sent the link to Saara.

TODO:

remove headers and footers from antiword documents, other improvements as needed (Saara)
- in the works, also improvements to the PDF conversion

User accounts and access

TODO:

discuss and decide upon the exact access policy we want to give corpus users; meeting on Tuesday May 23, at 9.30 (Børre, Sjur, Trond)
- meeting done, full memo available.

Conclusion: we have the following classes of users:

Users that don’t need a unix shell
- linguists doing research on singleton examples
- historians and other people interested in content, not in form
Users that do need a unix shell
- linguists doing research on texts as a whole
- linguists with separate analysis tools
- language technology developers

Shell access

External users will get their own user account, belonging to the groups myself and bound, and will be able to install their own tools and programs for corpus processing, analysis, etc.

To let members of the bound group run analyses, we need to make some minor adjustments. Like other users, they automatically have full access to the Xerox tools, and the compiled fst’s are available in /opt/smi/sme/bin/sme-num.fst etc. The Xerox tools and vislcg are available in /opt/Xerox/bin. A couple of tools are missing right now, and need to be added to /opt/ by a cron job.

TODO:

make a group bound for our external corpus users, which: (Børre)
- gives access to read our bound texts
- gives access to execute/run the tools in /opt
export to /opt (with cron) tools that the project team members find in their cvs tree (the bound users do not have a cvs tree, and therefore need these tools in /opt in order to conduct linguistic analyses) (Tomi)
- ccat (and some perl scripts?)
- other tools?
make shell script wrappers for the most common commands for user-friendliness (Trond)
write user account form, probably ask for copy of existing ones from the IT centre (Trond and Sjur)
possibly deploy the user account form as an HTML form (Børre)
write documentation for our bound users, with pointers to the ordinary documentation (Børre, Trond)
write documentation for how to apply for a user account (where’s the form, to whom do I send the form, who needs it, etc.) (Børre)
make our own guidelines for the user application processing (Børre)
make a test user (Børre)
test corpus access as test user (Trond)
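
The permission part of the bound group setup could be sketched roughly as follows. This is only an illustration under assumptions: the function name `restrict_to_group` and the chosen mode bits (750 for directories and executables, 640 for other files) are not from the memo, and the group itself (here passed in by name, e.g. "bound") must already exist on the machine.

```python
import os
import shutil
import stat


def restrict_to_group(root, group=None):
    """Give owner and group access to everything under root, and nobody else.

    Hypothetical helper: directories and executables get mode 750,
    other files get mode 640. If a group name is given, the files are
    also handed over to that group (requires ownership or root).
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in paths:
            if os.path.islink(path):
                continue  # leave symlinks alone
            if group is not None:
                shutil.chown(path, group=group)
            if os.path.isdir(path):
                mode = 0o750
            elif os.stat(path).st_mode & stat.S_IXUSR:
                mode = 0o750  # keep tools runnable for the group
            else:
                mode = 0o640
            os.chmod(path, mode)
```

Called as `restrict_to_group("/path/to/bound/texts", group="bound")`, this would make the texts readable for group members only; the path is a placeholder, not the real corpus location.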

Web browser access

Users of only the free corpus won’t need anything but a browser.

Users of the bound corpus will need a username and password to the Oslo computer (until the base is moved to Tromsø). These usernames and passwords will be created and administered by the Oslo people (modulo discussion), later by ourselves.

TODO:

discuss with Oslo
delay other tasks until we are ready to go public?
user management for access to bound texts

More texts to the graphical corpus interface:

TODO:

refine xml-tagged output (Saara and Tomi)
- done, but still open whether it is finished
add text to the server (Lars)

Aligner

Trond and Saara will continue this issue.

We need markup of parallelism in the corpus DTD, at least an indication of which documents belong together. Discussion to continue in the newsgroup (Saara has started it - please respond!).

TODO:

discuss parallel corpus markup (Saara, Trond, Sjur, others)
- done
extend the DTD (Saara)
- done

Language recognition

Still waiting for more smj text to improve it.

Free and non-free texts

Anything? Final check with Børre and Saara - waiting for them to return. Nothing more according to Børre. Only texts which are explicitly marked as free are now included in the free/ directory.

Corpus summary

Forrest goes into an endless loop when processing these files. It happens when converting the XML to Forrest format. More info in our Bugzilla.

TODO:

trim generated corpus summary pages (Sjur)
- suggestion: lump together files with content less than X paragraphs (X < 5?)
- nothing done yet
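
The lumping suggestion could look something like the sketch below. It assumes we already have a paragraph count per corpus file; the function name `plan_summary_pages` and the input format are made up for illustration, and the threshold X is left as a parameter since it has not been decided.

```python
def plan_summary_pages(paragraph_counts, threshold):
    """Split files into those that get their own summary page and
    those that are lumped together on one shared page.

    paragraph_counts: dict mapping file name -> number of paragraphs
    threshold: files with fewer paragraphs than this are lumped
    """
    own_page = sorted(
        name for name, count in paragraph_counts.items() if count >= threshold
    )
    lumped = sorted(
        name for name, count in paragraph_counts.items() if count < threshold
    )
    return own_page, lumped
```

With a threshold of 4, a file with 3 paragraphs ends up in the lumped list, while a file with 4 or more keeps its own page.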

6. Infrastructure

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

TODO:

convert or adapt the received PHP to our needs (Tomi or Saara)
- Saara has started to look at it

Hyphenator

TODO:

correct the treatment of hyphenation of word boundaries and exceptions (fst gymnastics) (Sjur, Trond)
- Sjur has started some work with M4 integration
Update the sme and sma hyphenator rule sets with the insights gained from smj updates
- Thomas has updated sme
- update (some of) sma (Trond during summer vacation)

Automatic Bugzilla reminder for untouched bugs

TODO:

set up Bugzilla to send automatic reminders for bugs not touched in a given timeframe; ask Thor-Øivind for help if needed (Børre)
- in the meantime: Bugzilla as [RSS feed|feed://giellatekno.uit.no/bugzilla/buglist.cgi?query_format=specific&bug_status=open&product=&content=&ctype=rss]

JSPWiki update

Sjur has corrected and improved the JSPWiki parsing in Forrest, and found that mixed lists should not be supported. We should check whether we have any such lists anywhere and correct them; otherwise we risk that they are not rendered in HTML, which would mean a loss of information. Sjur has sent in a patch to Forrest that will correct nested list behaviour, but it will at the same time make mixed lists invisible except for the first level. It seems that mixed lists aren’t part of the wiki format.

Mixed list example:

1. something numbered
    - some sub-thing with a bullet

* something bulleted
    1. some sub-thing with a number

Sjur tried to grep, but multiline pattern matching is beyond him. Tomi: Here’s the pattern to use:

egrep -C 3 -R "^\*.*[$]*{1,16}\#" *

TODO:

grep for all occurrences of ^* followed by a line ^## and vice versa (the patterns above) - send the list of identified documents to Sjur (Tomi)
- done during the meeting
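
Since grep works line by line, a small script may be an easier way to repeat the check later. The following is a minimal sketch, assuming that a mixed list means a line starting with `*` directly followed by a more deeply nested line starting with `#`, or the other way round; the function name `find_mixed_lists` is made up for this memo.

```python
import re

# Leading run of JSPWiki list markers: '*' (bulleted) or '#' (numbered).
MARKER = re.compile(r"^([*#]+)")


def find_mixed_lists(text):
    """Return the 1-based numbers of lines where a nested list item
    uses a different marker type than the item on the line before."""
    hits = []
    prev = None  # marker run of the previous line, None if not a list item
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = MARKER.match(line)
        cur = match.group(1) if match else None
        if prev and cur and len(cur) > len(prev) and cur[0] != prev[0]:
            hits.append(lineno)
        prev = cur
    return hits
```

Running it over each wiki source file and printing the file name together with the returned line numbers would give the list of documents to send to Sjur.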

7. Linguistics

Name double-tagging

Conclusion, in a principled fashion:

hardcoded sem-tags win
There is a sem-tag conversion procedure that follows a hierarchy of sem-tags: any Plc can be interpreted as Sur, etc. (to be spelled out)

TODO:

Make a section under gt/doc/lang/smi/, add a chapter Common linguistic resources under the Languages section in site-frag.xml, and add the document below to that section (Børre)
write guidelines for annotators with respect to name tagging and put them under gt/doc/lang/smi. A short version of the guidelines goes as follows: Systematic ambiguity (the Trosterud (plc/sur) and Aftenposten (org/obj) types) should not be double-tagged, but given primary tags (plc, org) only. Double-tagging should be confined to exceptional cases such as Eira (Fem/Sur/Plc) and Janne (Fem/Mal). These guidelines should be added to our documentation file. (Trond)
when adding new names, only use one sem-tag unless there are known objects which belong to different sem classes(?) (all linguists)
write disamb rules to implement the system above (Trond, Linda)

North Sámi

TODO:

investigate Actio compounding (Thomas)
- done North-sámi three-syll
discuss findings with Maaren, and later all of us (Thomas)
- not done, Maaren is on sick leave

Actio compounding

It is definitely productive. Whether this is a problem for our speller(s), we don’t know yet, but if there’s a lot of overlap or parallel forms with the same verbal stem producing compounds both with and without Actio, it is most likely a problem, unless we correct to both forms (but then we risk correcting to an impossible form, which is also bad).

Whether Actio is used or not in compounding a verbal stem follows from the semantic properties of the Actio. We need to try to identify this property, and formalise it in one way or another, otherwise we will overgenerate. And overgeneration is a speller’s worst enemy :-)

TODO:

investigate and identify under which conditions Actio compounding is possible (Thomas)

Lule Sámi

TODO:

50 unknown words left + 2 abbr. + moaddi etc. (numerals) need more checks (Thomas)
- done except for abbr. and numerals
  - add abbr to a new abbr lexicon file
  - numerals are covered by the next task
add proper numeral analysis/treatment (Thomas)
- not done
add loanwords (e.g. latin -ere verbs) (Thomas)
- not done

8. Name lexicon infrastructure

TODO:

finish refactoring for multiple collections in the search interface (Sjur)
- investigation done, still not implemented
develop the needed XQueries and interface (Sjur, Tomi)
- nothing this week
data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to reformulate the question from our perspective, and bring it up again (Sjur)
  - not yet done

9. Public tender

Nothing received from PL yet. Their extended deadline is today. After that we will decide based on the information we have.

TODO:

final evaluation based on the answers to the clarification letters
- waiting for the PL answer
present the final report to the board, and bring their decision into action

10. Other

Summer vacation

From Bitte: “According to last year’s list, Børre at any rate took 10 days of this year’s vacation and Tomi took 8 days, which means that Børre can take 3 weeks with pay and Tomi 3 weeks and 2 days. They can of course borrow from next year’s vacation if they wish.”

She would like to receive a final vacation plan soon.

Who	When
Børre	24.7 - 20.8
Linda	?
Maaren	on sick leave
Saara	July
Sjur	at least 2 weeks in July, but still open
Thomas	3.7 - 7.8
Trond	3.7 - 14.8 (last two weeks off at summer school)
Tomi	8.7 - 16.7, 2 more weeks in July and/or August

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with [bug 279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279] (Perl locale). Not much help… Saara will contact Roy on this issue.

Gobby

TODO:

install Gobby (Saara)
test it (Tomi, Trond, Sjur, Børre, Lars?)
- done
if successful, document its use within our project (Børre)

SEE 2.5 extensions

Future extensions and wished-for modes:

write a perl script to extract all TODO items and sort them according to responsible person; wrap the script in an AppleScript available from the menu
twol mode
xfst mode
lexc mode
vislcg mode

TODO:

move the above list to Bugzilla (Sjur)
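
To make the TODO extraction idea more concrete, here is a rough sketch of the logic (in Python rather than perl, purely for illustration). It assumes the memo conventions used in this document: a `TODO:` line, a blank line, then task lines ending with the responsible people in parentheses, with status notes on lines starting with `-` or `1.`. The function name `tasks_by_person` and the "(unassigned)" bucket are made up.

```python
import re
from collections import defaultdict

# A task line ends with the responsible people in parentheses, e.g. "(Sjur, Tomi)".
OWNERS = re.compile(r"^(?P<task>.+?)\s*\((?P<owners>[^()]+)\)\s*$")
# Status notes under a task start with "-" or a number like "1.".
STATUS_NOTE = re.compile(r"^(-|\d+\.)\s")


def tasks_by_person(memo_text):
    """Collect the tasks of all TODO blocks, grouped by responsible person."""
    tasks = defaultdict(list)
    state = "outside"  # "outside" -> "armed" (saw TODO:) -> "inside" (reading tasks)
    for line in memo_text.splitlines():
        stripped = line.strip()
        if stripped == "TODO:":
            state = "armed"
        elif not stripped:
            if state == "inside":
                state = "outside"  # a blank line ends the task block
        elif state != "outside":
            state = "inside"
            if STATUS_NOTE.match(stripped):
                continue  # skip status notes and sub-items
            match = OWNERS.match(stripped)
            if match is None:
                tasks["(unassigned)"].append(stripped)
                continue
            for owner in re.split(r",|\band\b|\bor\b", match.group("owners")):
                owner = owner.strip(" ?")
                if owner:
                    tasks[owner].append(match.group("task"))
    return dict(sorted(tasks.items()))
```

The sketch does not handle every variant found in our memos (for instance "review: Sjur" inside the parentheses would become an owner name of its own), so the output would need a quick manual check.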

Task lists as iCal entries

With the latest corrections to the Wiki parsing, and with the tasks at the end of the document, we should be able to use the intermediate XML format to extract the info needed to create iCal entries for all tasks.

iCal entries look like the following:

BEGIN:VTODO
DTSTAMP:20060619T090920Z
ORGANIZER;CN=Børre Gaup:MAILTO:boerre@skolelinux.no
CREATED:20050621T171425Z
UID:libkcal-939001838.216
SEQUENCE:1
LAST-MODIFIED:20050622T050540Z
SUMMARY:Ordne lenker
CLASS:PUBLIC
PRIORITY:5
DUE:20050824T073000Z
COMPLETED:20050622T050540Z
PERCENT-COMPLETE:100
END:VTODO

A reference can be found at Wikipedia.

References should be of the type:

/doc/admin/weekly/2006/Tasks_2006-06-19_Sjur.ics

TODO:

create iCal entries from our meeting memos (Sjur or Børre)
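
Generating the entries themselves could be done along these lines. This is a sketch only: it covers a subset of the fields in the example above, the function names, the UID value and the PRODID string are placeholders, and long lines are not folded as the iCalendar format expects for lines over 75 octets.

```python
from datetime import datetime, timezone


def ical_utc(moment):
    """Format a timezone-aware datetime as an iCalendar UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def escape_text(value):
    """Escape backslash, semicolon, comma and newline in iCalendar text values."""
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def make_vtodo(summary, uid, due, stamp):
    """Build one VTODO block for an open task (due and stamp must be aware datetimes)."""
    lines = [
        "BEGIN:VTODO",
        f"DTSTAMP:{ical_utc(stamp)}",
        f"UID:{uid}",
        f"SUMMARY:{escape_text(summary)}",
        "CLASS:PUBLIC",
        f"DUE:{ical_utc(due)}",
        "PERCENT-COMPLETE:0",
        "END:VTODO",
    ]
    return "\r\n".join(lines) + "\r\n"


def make_calendar(todos):
    """Wrap VTODO blocks in a VCALENDAR; the PRODID value is a placeholder."""
    head = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//memo-tasks//EN"]
    return "\r\n".join(head) + "\r\n" + "".join(todos) + "END:VCALENDAR\r\n"
```

One file per person and meeting, written with the output of `make_calendar`, would match the naming pattern given above.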

11. Next meeting, closing

26.06.2006 09:30

Sjur might be away, will inform you later.

Closed at 11:20.

Appendix - task lists for the next week

Børre

corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- call Brita Kåven again
- contact Ája (Kåfjord)
- call Bård Eriksen
- send contracts to Čálliid Lágádus
- Contact Olavi Korhonen, to actually get the dictionary
corpus conversion:
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- convert smj NT to paratext
- complete Min Áigi metadata
corpus access:
- make a group bound for our external corpus users
- possibly deploy the user account form as an HTML form
- make a test user
- Write both user and admin documentation
set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
document use of Gobby within our project
create document & document entry for name double-tagging
create iCal entries from our meeting memos (or Sjur)
fix bugs!

Maaren

On sick leave

Saara

Create a parallel corpus of the New Testaments
add more texts to the graphical corpus interface
grammatical searchability in the graphical corpus interface
Implement parallel corpus upload in web upload script
Install Gobby
Test the aligners once again
refine the xml output of the xml-tagged analyses
convert or adapt the received PHP for paradigm generation to our needs
remove headers and footers from antiword documents, other improvements
fix bugs!

Sjur

public tender:
- collect the last clarifying answers, and make a final proposal to the board
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
change corpus summary processing to generate smaller pages, see bug report
move the SEE 2.5 extension list to Bugzilla
create iCal entries from our meeting memos (or Børre)
review user and admin documentation for corpus access
write user account form, probably ask for copy of existing ones from the IT centre (with Trond)
fix bugs!

Thomas

investigate productivity of Actio compounding in smj
investigate and identify under which conditions Actio compounding is possible
discuss findings with the rest of us
add proper numeral analysis/treatment to smj
add loanwords (e.g. latin -ere verbs) to smj
work on compounding
sme G3 issue
review user documentation for corpus access
create smj abbr file
fix bugs!

Tomi

new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly: construct entries like we have now from the different parts of a complex name entry
read aligner documentation, install, provide feedback
Set up the mechanism for the hash-mark transducer package
test the new xml output of the xml-tagged analyses
export corpus tools to /opt (with cron)
fix bugs!

Trond

better smj NT text
get fin, swe, nob and nno NT and OT in paratext format
install aligner, test it and give feedback
fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
make shell script wrappers for the most common commands for user-friendliness
write user account form, probably ask for copy of existing ones from the IT centre (with Sjur)
write documentation for our bound users, with pointers to the ordinary documentation
write documentation on double-tagging names
discuss web-only user access management with Oslo
fix bugs!