Meeting setup
- Date: 28.08.2006
- Time: 09.30 Norw. time
- Place: Where we are
- Tools: SubEthaEdit, iChat
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Infrastructure
- Linguistics
- Name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 09:47.
Present: Saara, Sjur, Thomas, Børre, Tomi, Trond
Absent: Maaren
Agenda accepted as is.
2. Updated task status since last meeting
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord)
- send contracts to Čálliid Lágádus
- Contact Richard Valkepää at NSI about older Min Áigi and Áššu files.
- Not heard anything from him
- corpus conversion:
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- make a test user
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
for an account, on what grounds one can apply, and pointers to documentation
telling how to use the corpus.
- Admin documentation, telling how we set the permissions to the corpus files,
and whatever other processes and tasks needed to set up a corpus account.
- Not done
- set up Bugzilla automatic reminders for open issues
- create document & document entry for name double-tagging
- Update Forrest installations to the latest svn version
- fix bugs!
Maaren
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement parallel corpus upload in web upload script
- not done, waiting for a decision on the cgi-bin scripts
- Install Gobby
- Test the aligners once again
- refine the xml output of the xml-tagged analyses
- convert or adapt the received PHP for paradigm generation to our needs
- discussed possible solutions in the news
- remove headers and footers from antiword documents, other improvements
- antiword done, pdf-documents are to be fixed next
- fix bugs!
Sjur
- public tender:
- review letters to tenderers, contract for subcontractor
- contract finished, signed and countersigned. First common meeting held.
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
transducer
- name lexicon:
- implement editing functions
- more done; plans for the rest are made.
- finalise refactoring for multiple collections (regular search interface)
- mostly done, but a bug in eXist or Cocoon is blocking the final step
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
centre (with Trond)
- fix bugs!
Thomas
- investigate productivity of even-syllable Actio compounding
- investigate and identify under which conditions even-syllable Actio
compounding is possible
- discuss findings with the rest of us
- add proper numeral analysis/treatment to smj
- done to the same extent as in sme
- add loanwords (e.g. Latin -ere verbs) to smj
- sme G3 issue
- started with a load of help from my friends
- review user documentation for corpus access
- create smj abbr file
- copied the sme abbr file and lulefied it
- review the document
- Redirected the following three-syllable verbs to prevent them from being
first-part Actios in compounds:
- Reflexives on -dit
- Reciprocals on -dit, -(a)lit
- Momentatives on -dit, -(a)lit, -ádit, -ihit
- Frequentatives on -(a)lit, -(u)hit, -dit
- Continuatives on -dit, -(u)hit, -nit
- Inchoatives in -nit
- Translatives on -dit
- Essives on -dit and -stit
- Causatives on -dit, -stit
- find and study all derived verbs in our corpus (depends on Trond)
- suggest which derivations could be generated (depends on Trond)
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
construct entries like we have now from the different parts of a complex name
entry
- read aligner documentation, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- No bible work during summer, not received anything, will have to look into this again.
- install aligner, test it and give feedback
- Aligner installed, it is good, but slow and operates manually. Will call Bergen today.
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
transducer
- make shell script wrappers for the most common commands for user friendliness
- write user account form, probably ask for copy of existing ones from the IT
centre (with Sjur)
- write documentation for our bound users, with pointers to the ordinary
documentation
- write documentation on double-tagging names
- discuss web-only user access management with Oslo
- change tagging of derived stems in the disamb output, to facilitate much
easier extraction of non-lexicalised derivations
- prepared for the conversion, not done the actual change.
- fix bugs!
3. Documentation
TODO:
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
for an account, on what grounds one can apply, and pointers to documentation
telling how to use the corpus.
- Admin documentation, telling how we set the permissions to the corpus files,
and whatever other processes and tasks needed to set up a corpus account.
4. Corpus gathering
Nothing has happened during the summer.
Olavi Korhonen’s Lule Sámi dictionary.
Waiting for an answer.
Bible texts
Will have a second round with the Word versions.
TODO:
- get nob and nno NT and OT in paratext format. (Trond)
- convert smj NT to paratext (Børre)
- convert fin, swe to paratext or directly to our XML (Børre)
Kåfjord
TODO:
- contact Ája (Børre)
- talk to Lene about Kåfjord (Børre)
Sámi Instituhtta
When will we get the corpus? We still don’t know; Børre will contact NSI
again.
TODO:
- contact NSI again (Børre)
Čálliid Lágádus
[http://www.calliidlagadus.org/]
TODO:
Árran
TODO:
- contact Bård Eriksen again (Børre)
5. Corpus infrastructure
General
What we would like: a make-type system that keeps track of the file.xml and
file.xsl pairs, always reconverting the former whenever the latter has a newer
date. Cf. Trond’s letter “Makefile for xsl corpus conversion?” in the news.
At first sight, this sounds like a good idea.
TODO:
- remove headers and footers from antiword documents, other improvements
as needed, including PDF conversion (Saara)
- take a second look at the make issue (Tomi)
- comment on Trond’s suggestion in the news (all)
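The make-type check described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming that each file.xsl sits next to its file.xml and shares its base name; the function name `stale_documents` and the directory layout are assumptions, not part of the existing corpus tools.

```python
from pathlib import Path


def stale_documents(corpus_dir):
    """Return the file.xml paths that need reconversion.

    A document counts as stale when its companion file.xsl is newer
    than the file.xml, or when the file.xml does not exist yet.
    """
    stale = []
    for xsl in sorted(Path(corpus_dir).rglob("*.xsl")):
        xml = xsl.with_suffix(".xml")  # assumed pairing: same base name
        if not xml.exists() or xml.stat().st_mtime < xsl.stat().st_mtime:
            stale.append(xml)
    return stale
```

A cron job or a Makefile rule could then run the converter only on the returned paths. A real Makefile would express the same dependency declaratively, which may well be the better fit.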
User accounts and access
For details, see a previous meeting memo, as well as the memo from a
[dedicated meeting|/infra/corpus_policy.html].
Shell access
TODO:
- export to /opt (with cron) tools that the project team members find in
their cvs tree (the bound users do not have a cvs tree, and therefore need
these tools in /opt in order to conduct linguistic analyses) (Tomi)
- make shell script wrappers for the most common commands for user friendliness
(we must think of what commands they are) (Trond)
- a first version of the first script, teaksta.sh, has been checked in, but it
is still not working (the problem is a simple matter of input/output
handling; someone literate in shell scripting should have a look at it)
- write user account form, probably ask for copy of existing ones from the IT
centre (Trond and Sjur)
- possibly deploy the user account form as an HTML form (Børre)
- write documentation for our bound users, with pointers to the ordinary
documentation (Børre, Trond)
- write documentation for how to apply for a user account (where’s the form, to
whom do I send the form, who needs it, etc.) (Børre)
- make our own guidelines for the user application processing (Børre)
- make a test user (Børre)
- test corpus access as test user (Trond)
Web browser access
TODO:
- discuss with Oslo (Trond)
- delay other tasks until we are ready to go public?
- user management for access to bound texts
- short user guide needed before going public (either write one or take whatever has been made in Oslo) (Trond)
More texts to the graphical corpus interface:
TODO:
- add text to the server (Lars)
Aligner
The aligner aligns fine, better than its competitors. Unfortunately it is slow, and dependent upon manual input.
TODO:
- contact Bergen to discuss these issues, ask for a non-manual interface, etc.
(Trond)
Language recognition
Still waiting for more smj and sma text to improve it. We need South
Sámi, since Reindriftsnytt/Boazodoallo-ođđasat is trilingual, nob, sme, sma. We
are presently unable to correctly identify sma.
Corpus summary
The time-based statistics are still missing.
TODO:
- add time-based display as a feature request to Bugzilla (Sjur)
6. Infrastructure
To improve throughput and response time under heavy load, it would really be
nice to have the Xerox tools wrapped up as servers.
TODO:
- decide the programming language to use (Saara)
- find some (almost-)ready-to-use code to build on (Saara)
- implement it (Saara)
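As one possible starting point for the server wrapper, the core idea is to start a tool once and reuse the running process for many requests, instead of paying the start-up cost per request. The sketch below is in Python purely for illustration (the implementation language is still to be decided), and the class name `PersistentTool` is an assumption. It also assumes a tool that answers each input line with exactly one output line; tools that return multi-line answers would need a different end-of-answer test.

```python
import subprocess
import sys


class PersistentTool:
    """Keep one long-running, line-oriented process alive and reuse it."""

    def __init__(self, command):
        self.proc = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes on our side
        )

    def ask(self, line):
        """Send one line to the tool and read one line back."""
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        """Close the pipes and wait for the tool to exit."""
        self.proc.stdin.close()
        self.proc.stdout.close()
        self.proc.wait()


# Stand-in for a real analysis tool: upper-cases every input line.
ECHO_UPPER = [
    sys.executable, "-u", "-c",
    "import sys\nfor line in sys.stdin:\n    print(line.strip().upper(), flush=True)",
]
```

A network front end (sockets, HTTP, or similar) would sit on top of `ask`; that part is left out here because it depends on the language and framework chosen.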
Paradigm generation
Goal: Reuse Greenlandic code for paradigm generation.
Saara has given a report on the PHP code in News. Please read.
Conclusion:
Only to be used as a source of inspiration. We’ll postpone further work until
the server wrapper (see above) is in place.
Hyphenator
TODO:
- correct the treatment of hyphenation of word boundaries and exceptions (fst
gymnastics) (Sjur, Trond)
- Update the sma hyphenator rule set with the insights gained from smj updates
(Trond during summer vacation)
Automatic Bugzilla reminder for untouched bugs
TODO:
- give mail reminders a second try; ask Thor-Øivind for help if needed
(Børre)
M4
Tomi and Saara did a lot in Tromsø. How far is it now? Probably finished today!
TODO:
- finish the work, and check it in (Saara)
7. Linguistics
Derivation and spellers like Aspell
To make it easier to extract all derived stems, we should enhance the tags used
for derivations in sme to make them easier to grep. The most straightforward
solution is to make the tags follow the same pattern as for smj, +Der/NNN.
Presently sme is only using the NNN part as a tag, where NNN represents the
derivational suffix. That is, there is no single pattern to match against for
sme.
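To illustrate why a single +Der/NNN pattern helps, here is a small Python sketch that counts derivational suffix tags with one regular expression. The function name and the sample analysis strings used with it are made up for illustration; only the +Der/NNN tag shape is taken from the text above.

```python
import re
from collections import Counter

# Assumed tag shape: "+Der/" followed by the suffix name, up to the next "+".
DER_TAG = re.compile(r"\+Der/([^+\s]+)")


def count_derivations(analyses):
    """Count derivational suffix tags over an iterable of analysis strings."""
    counts = Counter()
    for analysis in analyses:
        counts.update(DER_TAG.findall(analysis))
    return counts
```

With bare NNN tags there is no common prefix to match, so every suffix would need its own pattern.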
Problematic issue: the disamb output will presently give information only
about non-lexicalised derivations. This can potentially give false data on the
frequency of derivational affixes. To get (more) precise data, there should be
an option in lookup2cg that favours derived analyses over lexicalised ones,
everything else being equal. A similar option for compounds can also be useful
in certain contexts. That is, the present behaviour should be partly turned
upside-down (“select the analysis/-es with the fewest compounds and derivations
available”). The best alternative would probably be to select the next to least
complex analysis, that is, to allow only one compound border or one derivation.
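The selection options discussed above can be sketched as follows. This is a hedged Python illustration, not the lookup2cg logic itself. It assumes that derivations are marked with +Der/ and compound borders with #, and the function names and the `prefer_complexity` parameter are hypothetical.

```python
import re

DER = re.compile(r"\+Der/")


def complexity(analysis):
    """Number of derivation tags plus compound borders in one analysis."""
    return len(DER.findall(analysis)) + analysis.count("#")


def select_analyses(analyses, prefer_complexity=0):
    """Keep analyses at the preferred complexity, else the simplest ones.

    prefer_complexity=0 mimics the present behaviour (fewest compounds
    and derivations); prefer_complexity=1 favours analyses with exactly
    one compound border or one derivation when such an analysis exists.
    """
    if not analyses:
        return []
    preferred = [a for a in analyses if complexity(a) == prefer_complexity]
    if preferred:
        return preferred
    lowest = min(complexity(a) for a in analyses)
    return [a for a in analyses if complexity(a) == lowest]
```

Falling back to the simplest analysis keeps the output non-empty for words that have no analysis at the preferred level.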
TODO:
- change tagging of derived stems in the disamb output, to facilitate much
easier extraction of these verbs (Trond, Tomi)
- Issues: Rewrite the sme-lex.txt, sme-dis.rle, sme-tdis.rle, twol-sme.txt
files (and perhaps other source files). 40 tags and 5 files are involved
(plus possibly 5-10 more: utf8 scripts, and also the tag conversion scripts
in sme/src and in gt/cwb/korpustags.txt). Pseudocode:
For file i, take taglist 33-37 and replace with 39-43, respectively
- find and study all derived verbs in our corpus (Thomas)
- suggest which derivations could be generated (Thomas)
- see source code above, but also consider overgeneration problems, as well as
input from the statistical results in the previous point
- lexicalise the rest (Thomas)
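The tag rewrite from the pseudocode in the TODO list above could look roughly like this in Python. The two tag pairs in `TAG_MAP` are placeholders (the real list of about 40 tags is not reproduced here), the function names are hypothetical, and UTF-8 file encoding is an assumption that may not hold for all source files.

```python
import re
from pathlib import Path

# Placeholder old-to-new tag pairs; the real tag list must be filled in.
TAG_MAP = {"+AAA": "+Der/AAA", "+BBB": "+Der/BBB"}


def retag(text, tag_map=TAG_MAP):
    """Replace each old tag with its new form, matching whole tags only."""
    # Longest tags first, so a short tag never wins over a longer one.
    ordered = sorted(tag_map, key=len, reverse=True)
    alternatives = "|".join(re.escape(tag) for tag in ordered)
    pattern = re.compile(rf"(?:{alternatives})(?![\w/])")
    return pattern.sub(lambda match: tag_map[match.group(0)], text)


def retag_files(paths, tag_map=TAG_MAP, encoding="utf-8"):
    """Rewrite the given files in place; untouched files are not written."""
    changed = []
    for path in map(Path, paths):
        original = path.read_text(encoding=encoding)
        updated = retag(original, tag_map)
        if updated != original:
            path.write_text(updated, encoding=encoding)
            changed.append(path)
    return changed
```

Because the new tags do not contain the old ones as whole tags, running the rewrite twice leaves the files unchanged, which makes it safe to rerun after a partial conversion.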
Semantic double-tagging of names
The policy needs documentation. Thus:
TODO:
- Make a section under gt/doc/lang/smi/, add a chapter
Common linguistic resources under the Languages section in
site-frag.xml, and add the document below to that section (Børre)
- write guidelines for annotators with respect to name tagging and put them
under gt/doc/lang/smi. A short version of the guidelines goes as follows:
Systematic ambiguity (the Trosterud (plc/sur) and Aftenposten (org/obj) types)
should not be double-tagged, but given primary tags (plc, org) only.
Double-tagging should be confined to exceptional cases such as Eira
(Fem/Sur/Plc) and Janne (Fem/Mal). These guidelines should be added to our
documentation file. (Trond)
- make sure all linguists are aware of the guidelines (Trond, Sjur)
- write disamb rules to implement the system above (Trond, Linda)
North Sámi
The following already derived verbs (verbs ending in -šit, -skit, etc.) are not
happy with further derivation. It seems that most of them do not appear as Actio
forms in the first part of compounding either. The following holds for both
sme and smj:
LEXICON MUITTASJ !Words ending -šit, -skit, -smit, -idit, -ldit, -git and 5-syllables, formerly directed to MUITAL
+V+TV: MUITALStem ;
### These derived verbs have now been redirected to MUITTASJ and similar lexica.
### Reflexives on -dit
### Reciprocals on -dit, -(a)lit
### Momentatives on -dit, -(a)lit, -ádit, -ihit
### Frequentatives on -(a)lit, -(u)hit, -dit
### Continuatives on -dit, -(u)hit, -nit
### Inchoatives in -nit
### Translatives on -dit
### Essives on -dit and -stit
### Causatives on -dit, -stit
Examples:
- muittašit > *muittašallat
TODO:
- investigate and identify under which conditions Actio compounding is possible
(Thomas)
- discuss findings with all of us (Thomas)
Lule Sámi
TODO:
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
(Thomas, Trond)
- convert roughly 100 smj names from that file (lines 740-843) to XML
(Saara)
- add inc abbr to a new abbr lexicon file (Thomas)
- copied the sme abbr file and lulefied it
- add proper numeral analysis/treatment (Thomas)
- done to the same extent as in sme
- add loanwords (e.g. Latin -ere verbs) (Thomas)
8. Name lexicon infrastructure
Decided in Tromsø:
- add smj proper noun lexicon file to the output
- remove ^ # 0 from the center ID
- replace spaces with underscores in all IDs
- remove occurrence indicator from language IDs: Agalin_1 (the center/concept ID)
=> Agalin (the language ID), and thus the two Agalins should become one
language entry (but two different concept entries)
- store a redundant copy of the center-file semantic information in the
language-specific files, for processing speed
- add logging facilities
- add option to download local copies of the lexicon files directly from the db
- batch editing (change all entries in the found set), should later be enhanced
to allow selection of exceptions (the found set minus deselected items)
- all names in all languages by default
- tag for excluding/including a name from certain applications
- hide / display ^ during browsing
- future expansion: choose what info to display in the single language browser
- search by (single) language
- display existing language entries when adding a new language to a record
- make searches behave predictably (the hits should be the expected ones)
- add editor to change single, existing entries
Details can be found in
[the meeting memo.|/admin/physical_meetings/tromso-2006-08-propnoun.html]
Language entry example illustrating both the sem-tag on the sense elements, and
the removal of the occurrence indicator:
<entry id="Agalin">
<infl lexc="BERN" />
<senses>
<sense sem="plc" ref="Agalin"/>
<sense sem="sur" ref="Agalin_1"/>
</senses>
</entry>
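Two of the ID rules decided in Tromsø (underscores instead of spaces, and dropping the occurrence indicator to get the language ID) can be sketched in Python as below. The function names are hypothetical, and the ^ # 0 rule is deliberately left out, since its exact form is not spelled out in this memo.

```python
import re

# Trailing occurrence indicator such as "_1" in "Agalin_1".
OCCURRENCE = re.compile(r"_\d+$")


def center_id(name, occurrence=0):
    """Build a center/concept ID: spaces become underscores, repeats get _N."""
    base = name.strip().replace(" ", "_")
    return f"{base}_{occurrence}" if occurrence else base


def language_id(center):
    """Drop the trailing occurrence indicator: Agalin_1 -> Agalin."""
    return OCCURRENCE.sub("", center)


def senses_by_language_entry(center_ids):
    """Group concept IDs under the single language entry they share."""
    grouped = {}
    for cid in center_ids:
        grouped.setdefault(language_id(cid), []).append(cid)
    return grouped
```

For the example entry above, `senses_by_language_entry(["Agalin", "Agalin_1"])` yields one language entry, Agalin, with two sense references. One caveat of this sketch: a name that itself ends in an underscore plus digits would be shortened as well, so real data may need a stricter convention.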
TODO:
- improve lexc2xml conversion (Saara)
- add default smj entries
- exclude ^ # 0 from the center ID
- add an empty element to all entries (center and language files)
- add a last-update attribute to the root element of all files
- finish refactoring for multiple collections in the search interface
(Sjur)
- waiting for a bug fix (Tomi is investigating it)
- develop the needed XQueries and UI (Sjur, Tomi)
- data synchronisation between risten.no and the cvs repo (Tomi)
- discussion started on eXist-list, nothing useful came up. We need to
reformulate the question from our perspective, and bring it up again
(Sjur)
9. Public tender
TODO:
- write a contract (mostly done by Finnut, review by Sjur)
- get it signed (Finnut, Lennart Mikkelsen)
10. Tromsø meeting round-up
TODO:
- check in meeting memos (Sjur)
- Polderland questions. Thomas has already sent the requested info.
- speller development - see the meeting memo. Separate follow-up next week.
- Lule Sámi linguist - Sjur has tried to call a possible candidate, but no
answer so far. He will use e-mail instead. (Sjur)
- order AirPort Express to Tromsø (Sjur)
11. Other
Bug fixing
43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs
Guess: 1/3 of the bugs are fixed already (?)
Please help Saara with [bug
279|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=279]
(Perl locale). Not much help…
Saara will contact Roy on this issue.
Gobby
TODO:
- install Gobby (Trond, Sjur)
- review the document (Thomas)
- done - document accepted:-)
Task lists as iCal entries
TODO:
- update all forrest installations to r430284 (Børre)
cd $FORREST_HOME
svn up -r430284
12. Next meeting, closing
Next meeting 4.9.2006 at 9:30.
Closed at 11:25.
Appendix - task lists for the next week
Børre
- corpus collection:
- send out contracts with accompanying letter
- Gather public texts, preferably also parallel ones
- Send out letters to the rest of the Iđut authors
- contact Ája (Kåfjord), talk to Lene
- send contracts to Čálliid Lágádus
- contact Richard Valkepää at NSI about older Min Áigi and Áššu files
- contact Bård Eriksen again
- corpus conversion:
- convert nob and nno bible texts to be used as part of a parallel corpus
- convert fin, swe to paratext or directly to our XML
- review the paratext2xml converter
- Move Norwegian documents in Min Áigi from sme to nob
- corpus access:
- possibly deploy the user account form as an HTML form
- make a test user
- Write both user and admin documentation (Børre, review: Sjur, Thomas)
- User documentation probably in several languages. This covers how to apply
for an account, on what grounds one can apply, and pointers to documentation
telling how to use the corpus.
- Admin documentation, telling how we set the permissions to the corpus files,
and whatever other processes and tasks needed to set up a corpus account.
- set up Bugzilla automatic reminders for open issues
- create document & document entry for semantic double-tagging of names (for
Trond)
- Update Forrest installations to svn version r430284
- fix bugs!
Maaren
Saara
- Create a parallel corpus of the New Testaments
- add more texts to the graphical corpus interface
- grammatical searchability in the graphical corpus interface
- Implement parallel corpus upload in web upload script
- remove headers and footers from pdf documents
- Implement server of the analysis tools.
- Add more languages to the lexc2xml propernoun conversion.
- Refine the namelex output
- finish M4 work
- fix bugs!
Sjur
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
transducer
- name lexicon:
- implement editing functions
- finalise refactoring for multiple collections
- implement improvements decided upon in Tromsø
- review user and admin documentation for corpus access
- write user account form, probably ask for copy of existing ones from the IT
centre (with Trond)
- add time-based corpus summary as a feature request to Bugzilla
- check in meeting memos from Tromsø
- start hiring process of linguist and programmer
- order AirPort Express to the Tromsø gang
- install Gobby
- fix bugs!
Thomas
- sme G3 issue
- bug-fixing!
- refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
- review user documentation for corpus access
- find and study all derived verbs in our corpus (depends on Trond)
- suggest which derivations could be generated (depends on Trond)
Tomi
- new proper name lexicon
- data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correctly:
construct entries like we have now from the different parts of a complex name
entry
- implement improvements decided upon in Tromsø
- read aligner documentation, install, provide feedback
- Set up the mechanism for the hash-mark transducer package
- test the new xml output of the xml-tagged analyses
- export corpus tools to /opt (with cron)
- Do the sme Der/ change (with Trond)
- consider Trond’s suggestion of a makefile for corpus conversion
- fix bugs!
Trond
- better smj NT text
- get fin, swe, nob and nno NT and OT in paratext format
- contact Bergen about aligner issues
- fst gymnastics to add hyphenation and word boundary marks to hyphenation
transducer
- make shell script wrappers for the most common commands for user friendliness
- write user account form, probably ask for copy of existing ones from the IT
centre (with Sjur)
- write documentation for our bound users, with pointers to the ordinary
documentation
- write documentation on semantic double-tagging of names
- discuss web-only user access management with Oslo
- change tagging of derived stems in the disamb output, to facilitate much
easier extraction of non-lexicalised derivations
- do the sme Der/ change (with Tomi)
- write short user guide for the corpus web interface
- install Gobby
- fix bugs!