Meeting setup
- Date: 06.06.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
The meeting was delayed due to the project board having a telephone meeting at
our regular meeting time.
Opened at 13:17.
Present: Sjur, Thomas, Børre, Tomi
Absent: Maaren, Saara, Trond
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- call Brita Kåven again
- contact Ája (Kåfjord)
- call Bård Eriksen
- send contracts to Čálliid Lágádus
- Contact Olavi Korhonen, to actually get the dictionary
- corpus conversion:
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- convert smj NT to paratext
- None of the bible things are done
- complete Min Áigi metadata
- corpus access:
- make a group bound for our external corpus users
- possibly deploy the user account form as an HTML form
- make a test user
- Write both user and admin documentation
- set up Bugzilla automatic reminders for open issues; ask Thor-Øivind if needed
- Built-in feature of Bugzilla, tried to turn it on.
- document use of Gobby within our project
- create document & document entry for name double-tagging
- create iCal entries from our meeting memos (or Sjur)
- fix bugs!
Maaren
Saara
- Create a parallel corpus of the new testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement parallel corpus upload in web upload script
- Install Gobby
- Test the aligners once again
- refine the xml output of the xml-tagged analyses
- convert or adapt the received PHP for paradigm generation to our needs
- remove headers and footers from antiword documents, other improvements
- fix bugs!
Sjur
- public tender:
- collect the last clarifying answers, and make a final proposal to the board
- done. Decision by the board: Polderland
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
transducer
- nothing since last attempt
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
- change corpus-summary processing to generate smaller pages, see bug report
- done, now waiting for feedback from Børre
- move the SEE 2.5 extension list to Bugzilla
- create iCal entries from our meeting memos (or Børre)
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
centre (with Trond)
- fix bugs!
- other tasks:
Thomas
- investigate productivity of Actio compounding in smj
- had a look at the three-syllables
- investigate and identify under which conditions Actio compounding is possible
- have identified some things
- discuss findings with the rest of us
- add proper numeral analysis/treatment to smj
- add loanwords (e.g. Latin -ere verbs) to smj
- work on compounding
- finished in practice, some things still that I need to discuss with Maaren
- sme G3 issue
- review user documentation for corpus access
- create smj abbr file
- fix bugs!
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
construct entries like we have now from the different parts of a complex name
entry
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- install aligner, test it and give feedback
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
transducer
- make shell script wrappers for the most common commands for user-friendliness
- write user account form, probably ask for copy of existing ones from the IT
centre (with Sjur)
- write documentation for our bound users, with pointers to the ordinary
documentation
- write documentation on double-tagging names
- discuss web-only user access management with Oslo
- fix bugs!
3. Documentation
TODO:
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
for an account, on what grounds one can apply, and pointers to documentation
telling how to use the corpus.
- Admin documentation, telling how we set the permissions to the corpus files,
and whatever other processes and tasks needed to set up a corpus account.
4. Corpus gathering
Collecting
See a previous meeting memo for what’s to be done.
TODO:
- Send out the rest of the letters (Børre)
New contracts:
Olavi Korhonen’s Lule Sámi dictionary.
Talked with him, sent him the contract.
TODO:
- Contact Olavi Korhonen, to actually get the dictionary
KIO Grafisk and the Iđut books
TODO:
- send letters to the other authors (Børre)
Bible texts
TODO:
- get nob and nno NT and OT in paratext format. (Trond)
- convert smj NT to paratext (Børre)
- waiting for an NT in paratext format (whatever language will do)
- convert fin, swe to paratext or directly to our XML (Børre)
Davvi Girji
Called Brita Kåven. She can’t say anything about letting us have the texts
until she has talked with the employees, to get a work estimate on their part.
Nothing was decided in their previous meeting. Davvi Girji seems generally very
positive, but has problems finding the best way of actually delivering the texts
to us.
TODO:
- call Brita Kåven again towards the end of the week (Børre)
- call the authors (Børre)
Min Áigi
TODO:
- complete metainformation (Børre)
- done as far as possible: author and language info
- some smj files moved to the smj tree
- still some Norwegian files to be checked
Kåfjord
TODO:
Sámi Instituhtta
When will we get the corpus? We don’t know; Børre will contact NSI again.
TODO:
- contact NSI again (Børre)
Čálliid Lágádus
[http://www.calliidlagadus.org/]
TODO:
Árran
Talked to Bård Eriksen, he needs to discuss more with his coworkers.
TODO:
- continue discussion (Børre)
5. Corpus infrastructure
General
TODO:
- remove headers and footers from antiword documents, other improvements
as needed, including PDF conversion (Saara)
- in the works, also improvements to the PDF conversion
- problems with running the tools, they need some updated libraries in
Victorio
User accounts and access
For details, see the previous meeting memo, as well as the memo from a
[dedicated meeting|/infra/corpus_policy.html].
Shell access
TODO:
- make a group bound for our external corpus users, which: (Børre)
- gives access to read our bound texts
- gives access to execute/run the tools in /opt
- export to /opt (with cron) the tools that the project team members find in
their cvs tree (the bound users do not have a cvs tree, and therefore need
these tools in /opt in order to conduct linguistic analyses) (Tomi)
- ccat (and some perl scripts?)
- other tools?
- make shell script wrappers for the most common commands for user-friendliness
(Trond)
- write user account form, probably ask for copy of existing ones from the IT
centre (Trond and Sjur)
- short discussion with Trond about what we need
- possibly deploy the user account form as an HTML form (Børre)
- waiting for the form to be written
- write documentation for our bound users, with pointers to the ordinary
documentation (Børre, Trond)
- write documentation for how to apply for a user account (where’s the form, to
whom do I send the form, who needs it, etc.) (Børre)
- make our own guidelines for the user application processing (Børre)
- make a test user (Børre)
- test corpus access as test user (Trond)
Web browser access
TODO:
- discuss with Oslo (Trond)
- delay other tasks until we are ready to go public?
- user management for access to bound texts
More texts to the graphical corpus interface:
TODO:
- refine xml-tagged output (Saara and Tomi)
- done, but still open if it is finished
- add text to the server (Lars)
Aligner
More to be said about this? (certainly, but right now?)
Language recognition
Still waiting for more smj text to improve it.
Corpus summary
Forrest goes into an endless loop when processing these files. It happens when
converting the XML to Forrest format. More info in our Bugzilla.
Trimming of the summary pages is now implemented, but a test to skip this
summary option for smaller collections did not work. Thus all genres in all
languages are treated the same.
Improvement suggestion: instead of summarizing files under a certain limit,
one could also split Min Áigi into years, and display it year-wise. That
will require some more info in the generated source file corpus-content.xml.
TODO:
- trim generated corpus summary pages (Sjur)
- suggestion: lump together files with content less than X paragraphs (X < 5?)
- done - Forrest seems to work fine now
- add improvement suggestion to Bugzilla (Børre)
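The lumping suggestion in the TODO list above can be sketched roughly as follows. This is only an illustration in Python, not the actual corpus-summary code: the function name lump_small_files, the (name, paragraph count) input and the default threshold of 5 are assumptions made for the example.

```python
def lump_small_files(files, min_paragraphs=5):
    """Split files into individually listed ones and one lumped group.

    files: iterable of (name, paragraph_count) pairs (hypothetical input;
    the real summary is generated from corpus-content.xml).
    Files with fewer than min_paragraphs paragraphs are lumped together.
    """
    listed = []          # files large enough to get their own entry
    lumped_names = []    # small files shown as a single combined entry
    lumped_total = 0     # total paragraph count of the lumped files
    for name, count in files:
        if count < min_paragraphs:
            lumped_names.append(name)
            lumped_total += count
        else:
            listed.append((name, count))
    return listed, (lumped_names, lumped_total)


if __name__ == "__main__":
    sample = [("a.xml", 12), ("b.xml", 2), ("c.xml", 4), ("d.xml", 5)]
    print(lump_small_files(sample))
```

The year-wise split for Min Áigi would need a different grouping key (the year), which is why it requires extra information in the generated source file.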
6. Infrastructure
Paradigm generation
Goal: Reuse Greenlandic code for paradigm generation.
Saara has given a report on the PHP code in News. Please read.
TODO:
- convert or adapt the received PHP to our needs (Saara)
- code evaluated, evaluation reported
Hyphenator
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
gymnastics) (Sjur, Trond)
- nothing since last meeting
- Update the sma hyphenator rule set with the insights gained from smj updates
(Trond during summer vacation)
Automatic Bugzilla reminder for untouched bugs
TODO:
- set up Bugzilla to send automatic reminders for bugs not touched in a given
timeframe; ask Thor-Øivind for help if needed (Børre)
- turned on the feature, but it doesn’t seem to work - needs some more
investigation
JSPWiki update
Here’s the pattern to use:
egrep -C 3 -R "^\*.*[$]*{1,16}\#" *
TODO:
- grep for all occurrences of ^* followed by a line ^## and vice versa (the
patterns above) (Sjur)
- Tomi tried it, but nothing found
- correct the documents found to have consistent lists (Sjur)
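As a cross-check on the grep approach, the search for mixed lists could also be sketched in Python. This is a minimal sketch under assumptions: it treats any line starting with a run of * or # as a list item, and flags two adjacent list lines whose first marker character differs. The function name mixed_list_lines is made up for the example.

```python
import re

# A list item is assumed to start with one or more '*' or '#' characters.
MARKER = re.compile(r"^([*#]+)")


def mixed_list_lines(text):
    """Return 1-based numbers of lines whose list type differs from the next line.

    A hit means a bullet item (*) is directly followed by a numbered
    item (#), or the other way round.
    """
    lines = text.splitlines()
    hits = []
    for number, (first, second) in enumerate(zip(lines, lines[1:]), start=1):
        a, b = MARKER.match(first), MARKER.match(second)
        if a and b and a.group(1)[0] != b.group(1)[0]:
            hits.append(number)
    return hits


if __name__ == "__main__":
    print(mixed_list_lines("* a\n## b\n# c\n# d\n* e"))
```

Run over each wiki source file, this lists the places to inspect by hand; whether a given hit is really an inconsistent list still has to be judged by the editor.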
7. Linguistics
Derivation and spellers like Aspell
It is impossible to create a model that can dynamically generate new derivations
of verbs, and then let them nominalise for compounding. What we need is to
extract all derived verbs not in the lexicon, and lexicalise most of them. We
could potentially let the 5(?) most productive verbal derivations be
unlexicalised, and generate them from our lexicon directly when creating
entries. A tentative first guess at which derivational suffixes this could be
(for sme):
+st, +l, +h, -goahtit, (o)juvvot (passive)
Later we might consider lexicalising all other derivations found in our corpus.
To make it easier to extract all derived stems, we should enhance the tags used
for derivations in sme to make them easier to grep. The most straightforward
solution is to make the tags follow the same pattern as for smj, +Der/NNN.
Presently sme is only using the NNN part as a tag, where NNN represents the
derivational suffix. That is, there is no single pattern to match against for
sme. Example:
"<laktigohtet>"
"laktit" V TV goahti Ind Prs Pl3 @+FMAINV
Only the tag goahti identifies the word form as a derivation. It should be
extended to Der/goahti as in smj. Then it is easy to grep for the pattern Der/
to get all derived verbs.
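Once the derivation tags carry the Der/ prefix, the extraction could look roughly like the Python sketch below. It assumes the disamb output keeps the layout of the example above (a "<wordform>" line followed by reading lines of the form "lemma" TAG TAG ...) and that the new tag is written Der/NNN without a leading +. The function name derived_readings is made up for the illustration.

```python
import re
from collections import Counter

# Assumed shape of the new tag: Der/ followed by the suffix name.
DER_TAG = re.compile(r"^Der/(\S+)$")


def derived_readings(lines):
    """Yield (wordform, lemma, suffix) for every reading with a Der/ tag."""
    wordform = None
    for raw in lines:
        line = raw.strip()
        # Word form lines look like "<laktigohtet>"
        if line.startswith('"<') and line.endswith('>"'):
            wordform = line[2:-2]
            continue
        if not line.startswith('"'):
            continue
        # Reading lines look like "laktit" V TV Der/goahti ...
        end = line.find('"', 1)
        if end == -1:
            continue
        lemma = line[1:end]
        for tag in line[end + 1:].split():
            match = DER_TAG.match(tag)
            if match:
                yield wordform, lemma, match.group(1)


if __name__ == "__main__":
    sample = [
        '"<laktigohtet>"',
        '\t"laktit" V TV Der/goahti Ind Prs Pl3 @+FMAINV',
    ]
    # Count how often each derivational suffix occurs.
    print(Counter(suffix for _, _, suffix in derived_readings(sample)))
```

Counting per suffix gives a first impression of which derivations are productive enough to be left unlexicalised.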
TODO:
- change tagging of derived stems in the disamb output, to facilitate much
easier extraction of these verbs (Trond)
- not done, the other tasks depend on this one
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
- see source code above, but also consider overgeneration problems, as well as
input from the statistical results in the previous point
- lexicalise the rest (Thomas)
Name double-tagging
TODO:
- Make a section under gt/doc/lang/smi/, add a chapter
Common linguistic resources under the Languages section in site-frag.xml, and
add the document below to that section (Børre)
- write guidelines for annotators with respect to name tagging and put them
under gt/doc/lang/smi. A short version of the guidelines goes as follows:
Systematic ambiguity (the Trosterud (plc/sur) and Aftenposten (org/obj) types)
should not be doubletagged, but given primary tags (plc, org) only.
Doubletagging should be confined to exceptional cases (Eira
(Fem/Sur/Plc), Janne (Fem/Mal), and such things). These guidelines should be
added to our documentation file. (Trond)
- when adding new names, only use one sem-tag unless there are known objects
which belong to different sem classes(?) (all linguists)
- write disamb rules to implement the system above (Trond, Linda)
North Sámi
TODO:
- discuss findings with all of us (Thomas)
- investigate and identify under which conditions Actio compounding is possible
(Thomas)
The following already derived verbs do not readily take further derivation. It
seems that most of them do not appear as Actio forms in the first part of
compounds either. The following holds for both sme and smj:
LEXICON MUITTASJ !Words ending -šit, -skit, -smit, -idit, -ldit, -git and
!5-syllables, formerly directed to MUITAL
+V+TV: MUITALStem ;
### SHOULD be directed here as well:
### Reflexives on -dit
### Reciprocals on -dit, -(a)lit
### Momentatives on -dit, -(a)lit, -ádit, -ihit
### Frequentatives on -(a)lit, -(u)hit, -dit
### Continuatives on -dit, -(u)hit, -nit
### Inchoatives in -nit
### Translatives on -dit
### Essives on -dit and -stit
### Causatives on -dit, -stit
Lule Sámi
TODO:
- add inc abbr to a new abbr lexicon file (Thomas)
- add proper numeral analysis/treatment (Thomas)
- add loanwords (e.g. Latin -ere verbs) (Thomas)
- nothing done to any of these
8. Name lexicon infrastructure
TODO:
- finish refactoring for multiple collections in the search interface
(Sjur)
- develop the needed XQueries and interface (Sjur, Tomi)
- done the inflection menu, but it doesn’t work…
- data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to
reformulate the question from our perspective, and bring it up again
(Sjur)
9. Public tender
We finally received their answer as well.
TODO:
- final evaluation based on the answers to the clarification letters
- present the final report to the board, and bring their decision into action
- done. Decision: Polderland.
- write a contract (mostly done by Finnut, review by Sjur)
- get it signed (Finnut, Lennart Mikkelsen)
10. Other
Summer vacation
Who | When |
Børre | 24.7 - 20.8 |
Linda | ? |
Maaren | on sick leave |
Saara | July |
Sjur | 3.7 - 23.7 + single days at other times |
Thomas | 3.7 - 7.8 |
Trond | 3.7 - 14.8 (last two weeks off at summer school) |
Tomi | 8.7 - 16.7, 2 more weeks in July and/or August |
Bug fixing
43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Please help Saara with [bug 279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279]
(Perl locale). Not much help…
Saara will contact Roy on this issue.
Gobby
TODO:
- install Gobby (Saara)
- document its use within our project (Børre)
- review the document (Thomas)
SEE 2.5 extensions
Future extensions and wish modes:
- write a perl script to extract all TODO items and sort them according to
responsible person; wrap the script in an AppleScript available from the menu
- twol mode
- xfst mode
- lexc mode
- vislcg mode
TODO:
- move the above list to Bugzilla (Sjur)
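The TODO extraction wished for above could work along these lines. The wish is for a perl script; the sketch below is in Python purely for illustration and rests on assumptions: each TODO item sits on a single line, starts with a hyphen, and ends with the responsible persons in parentheses, separated by commas, "and" or "or". The names tasks_by_person and team are invented for the example.

```python
import re
from collections import defaultdict

# "- task text (Name)" with the names in the final parenthesis of the line.
ITEM = re.compile(r"^-\s+(?P<task>.*?)\s*\((?P<who>[^()]*)\)\s*$")
# Names are assumed to be separated by commas, "and" or "or".
SPLIT = re.compile(r"\s*(?:,|\band\b|\bor\b)\s*")


def tasks_by_person(lines, team):
    """Collect the items under TODO: headings, grouped by responsible person.

    team: set of known names; other parentheticals such as "(X < 5?)"
    are ignored.
    """
    tasks = defaultdict(list)
    in_todo = False
    for raw in lines:
        line = raw.strip()
        if line == "TODO:":
            in_todo = True
            continue
        if not line.startswith("-"):
            in_todo = False  # any non-item line ends the TODO list
            continue
        match = ITEM.match(line) if in_todo else None
        if match:
            for person in SPLIT.split(match.group("who")):
                if person in team:
                    tasks[person].append(match.group("task"))
    return dict(sorted(tasks.items()))


if __name__ == "__main__":
    memo = [
        "TODO:",
        "- move the above list to Bugzilla (Sjur)",
        "- create iCal entries from our meeting memos (Sjur or Børre)",
    ]
    print(tasks_by_person(memo, {"Sjur", "Børre"}))
```

Items wrapped over several lines, which do occur in our memos, would have to be joined first; the sketch does not attempt that.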
Task lists as iCal entries
This feature requires that the patch Sjur sent in to Forrest regarding parsing
of nested lists within wiki documents is applied. It was applied to Forrest last
Friday, so everyone who is interested in using this feature from their local
Forrest should update their installation. This is most easily done using
Subversion (svn) to update your local copy.
TODO:
- create iCal entries from our meeting memos (Sjur or Børre)
- update all forrest installations to latest svn HEAD (Børre)
Project meeting in Tromsø in August?
The project board has decided upon a meeting in Tromsø in August. We’ll discuss
the date later this week.
11. Next meeting, closing
Next meeting is undefined due to summer vacation.
Closed at 14:36.
Appendix - task lists for the next week
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord)
- send contracts to Čálliid Lágádus
- Contact Richard Valkepää at NSI about older Min Áigi and Ássu files.
- corpus conversion:
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- make a test user
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
for an account, on what grounds one can apply, and pointers to documentation
telling how to use the corpus.
- Admin documentation, telling how we set the permissions to the corpus files,
and whatever other processes and tasks needed to set up a corpus account.
- set up Bugzilla automatic reminders for open issues
- create document & document entry for name double-tagging
- Update Forrest installations to the latest svn version
- fix bugs!
Maaren
Saara
- Create a parallel corpus of the new testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement parallel corpus upload in web upload script
- Install Gobby
- Test the aligners once again
- refine the xml output of the xml-tagged analyses
- convert or adapt the received PHP for paradigm generation to our needs
- remove headers and footers from antiword documents, other improvements
- fix bugs!
Sjur
- public tender:
- review letters to tenderers, contract for subcontractor
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
transducer
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections (regular search interface)
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
centre (with Trond)
- correct jspwiki docs with mixed lists
- fix bugs!
Thomas
- investigate productivity of even-syllable Actio compounding
- investigate and identify under which conditions even-syllable Actio
compounding is possible
- discuss findings with the rest of us
- add proper numeral analysis/treatment to smj
- add loanwords (e.g. Latin -ere verbs) to smj
- sme G3 issue
- review user documentation for corpus access
- create smj abbr file
- review the document
- Redirect the following three-syllable verbs and prevent them from being
first-part Actios in compounds:
- Reflexives on -dit
- Reciprocals on -dit, -(a)lit
- Momentatives on -dit, -(a)lit, -ádit, -ihit
- Frequentatives on -(a)lit, -(u)hit, -dit
- Continuatives on -dit, -(u)hit, -nit
- Inchoatives in -nit
- Translatives on -dit
- Essives on -dit and -stit
- Causatives on -dit, -stit
- find and study all derived verbs in our corpus (depends on Trond)
- suggest which derivations could be generated (depends on Trond)
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
construct entries like we have now from the different parts of a complex name
entry
- read aligner docu, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- install aligner, test it and give feedback
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
transducer
- make shell script wrappers for the most common commands for user-friendliness
- write user account form, probably ask for copy of existing ones from the IT
centre (with Sjur)
- write documentation for our bound users, with pointers to the ordinary
documentation
- write documentation on double-tagging names
- discuss web-only user access management with Oslo
- change tagging of derived stems in the disamb output, to facilitate much
easier extraction of non-lexicalised derivations
- fix bugs!