Meeting setup
- Date: 27.02.2006
- Time: 09.30 Norw. time
- Place: Wherever we are :-)
- Tools: iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from two weeks ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- name lexicon infrastructure
- Spellers
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10:36.
Present: Børre, Saara, Sjur, Thomas, Trond
Absent: Maaren, Tomi
Main secretary: Trond
Agenda accepted as is.
2. Reviewing the task list from the last meeting
Børre
- send out contracts with accompanying letter
- Gather public texts, preferrably also parallel ones
- Continue converting text from input format to our xml
- Some new texts added to the corpus, from the Sámediggi
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- Not done
- convert smj NT to paratext
- Call Ove Sæth and Olavi Korhonen
- Called Korhonen, he was very positive about sharing his Lule Sámi
dictionary. He asked if he could get access to our corpus as a return
favour, because he is also making a Northern Sámi dictionary.
- Correct Forrest integration for new projects and project ideas
- Move complex name lexicon issue to bugzilla
- fix bugs!
Maaren
- work with risten.no
- Added words. Also added words from a frequency-based missing list.
- discuss with relevant people regarding seminar on proofing tools, normativity
and SGL in February/March, including place.
Saara
- continue discussion on the new lexicon format
- Fix the preprocess script and optimize it.
- update conversion from lexc to xml (proper names) with the latest refinements
- Try to add numeral treatment as part of the analyzator.
- Look at crontab ga/ directory issue with Trond.
- Create a parallel corpora of the new testaments.
- Routine for adding new languages to the propernoun xml-structure (Lule Sámi
name lexicon and links to the termcenter.xml etc.).
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- project planning with Trond, continued
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- review draft tender document from Finnut
- almost finished - will finish today.
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development: fix bugs, continue development
- fix bugs!
- other:
Thomas
- work on North Sámi compounding and derivation
**nothing, been working with Lule sámi namelexicas and incoming words
- review corpus usage documentation
**nothing, been working with Lule sámi namelexicas and incoming words
- smj G3 issue with Sjur and Trond
**nothing, been working with Lule sámi namelexicas and incoming words
- sme G3 issue with Sjur and Trond
**nothing, been working with Lule sámi namelexicas and incoming words
Tomi
On sick leave.
Trond
- Mostly worked on Lule Sámi this week.
- Work on corpus texts with Børre.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Look at ga/ directory issue with Saara.
- Postponed. We discussed it, though…
- News group discussion followup.
- Do a bug report (if not done) on commandline (mis)behaviour in the Xerox tools
- Ask IT guys for an e-mail adress for corpus upload script:
corpus@giellatekno.uit.no (forwarded to Børre)
- Other people have looked into this.
- fix bugs!.
3. Documentation
TODO:
- Integrating the future project plans and ideas in our Forrest documentation
(Børre)
4. Corpus gathering
Collecting
See a previous meeting memo for what’s to be done.
TODO: Send out the rest of the letters (Børre)
Odin
Waiting for Sæth to discuss with colleagues about how to implement the
cooperation, and return to us.
TODO:
Olavi Korhonen’s Lule Sámi dictionary.
Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.
KIO Grafisk and the Iđut books
We need a test file in order to find out whether file conversion from Quark
to InDesign works.
TODO:
- Trond to ask for a Quark test file from his sister
- Børre to ask KIO to send a more elaborate test file, representative for
what we can expect, but without any copyrighted content (and please stress,
that fonts will never leave the computer, they are part of the OS, and are
not included in the Quark file). The file should include text with Sámi and
Norwegian letters, and preferably some dummy pictures as well. Actually,
Børre could create that file in Word and send it to KIO for them to move
to Quark and send back.
- Børre will send letters to the authors.
Bible texts
TODO:
- review paratext2xml converter (Børre)
- convert smj NT to paratext (Børre)
- ask to get fin and swe NT and OT in paratext format. (Trond)
- Still not done. Trond will contact Bibelselskapet for a new sme version, and
for the NT nob and nno, and the fin and swe ones (after Wednesday)
5. Corpus infrastructure
We need more “version control” in the corpus work - we don’t know which version
of the XSL script was used (but roughly whether it was used or not). We need to
document within the XSL file which version of all tools were used, including
the XSL file (common section/template) itself.
Transferring the old gt/sme/corp files to the new corpus repo:
- for the biggest top ten (or so) the orig. should be located and copied to the
new corpus repo
- then these files should be removed from gt/sme/corp/
- all small files could just be forgotten/ignored
TODO:
- Access control to corpus repo resolved through Unix groups: one group for corpus
maintainers with write access, and another for corpus users, with read-only
access. Still not done.
- Saara will ask Roy to do what should be done.
Further discussion about corpus analysis and computer use:
The new G5 is tremendeously faster than cochise, thus we want to use it. But
cochise will continue to be our main corpus repo.
- the corpus/gt/ dir will be synchronised with the G5
- TODO: set up copying script (Saara)
- we need to develop strong enough security routines for the G5 to fulfill our
obligations towards the text licensers
- TODO: Børre to move this to bugzilla
- we are still using only one processor when analysing - making some simple
multiprocessor analysis script will increase the speed of the analysis a lot
- script implemented, but not tested due to copying not in place yet (see
above).
Secure copying:
To get cvs ssh working without password prompting:
ssh-keygen -t rsa
<just type enter to all questions>
chmod 0644 .ssh/id_rsa.pub
scp .ssh/id_rsa.pub <user>@cochise.uit.no:.ssh/authorized_keys2
New tasks:
- corpus dtd documentation:
- structure, content/model and location of the dtd (location =
permlink) http://giellatekno.uit.no/dtd/corpus.dtd
TODO: Saara to write and finish the docu, also check the dtd link
- finalize gt/doc/ling/corpus_conversion_tech.html (and possibly convert it to
xml) (Saara)
- add xml validation against our dtd to the corpus conversion process (Saara)
6. Linguistics
North Sámi
We want to get the decisions from the SGL meeting in December.
TODO:
- get the decisions (Maaren)
Lule Sámi
NFR wrote:
[Programstyret] bed (...) om at det umiddelbart vert laga ein alternativ plan
for resten av prosjektperioden. Denne planen må gå ut frå den endra føresetnaden,
at ein ikkje har tilgang til den lulesamiske ordboka. Programstyret vil at ein
går over til den alternative planen frå 1. mars, dersom tilgangen ikkje er reelt
i orden på det tidspunktet.
The criterion for continuing with Lule Sámi that we set up in our plan was the
following, somewhat stronger:
kriteriet for om det blir satsing på lulesamisk er om vi har ein betaversjon
av den lulesamiske transdusaren, med integrert leksikon, i gang 1. mars.
Moments to our March 1st report:
- We have recieved the dictionary, and integrated most of it (27.2:
78 % of the Kintel words are added)
- Our Lule Sámi disambiguator is up and running as an alpha version
- The analyser runs on 80,0 recall on token and 67,4 on type, on the New Testament
(proper names excluded)
- The analyser has 17773 lexicon lines
- We have 3367 lines of non-allocated Kintel words
- The integration of proper nouns will be done in parallel with Northern Sámi, and
when it suits the Northern Sámi conversion timetable.
Goal for March 1st: Reach the 90 percent lexical recall limit.
Conclusion: we have already met the basic requirements for continuing with Lule
Sámi.
TODO:
- Trond and Thomas: work on the 90 % limit
- Trond: Write a report to NFR.
7. Name lexicon infrastructure
Complex names
- make sure xml2lexc can handle complex names in ways compatible with our
present tool chain (=reconstruct the lexc format we have now) (Tomi)
- the resulting file format should be identical to our present prop-name
file (=lexc), that can then be converted to our new xml format using the same
script as for the regular names (Tomi or Saara, but only when the
technical details are settled)
TODO:
- Move this issue to bugzilla (Børre)
TODO:
- testing of conversion
- eXist as editor:
- develop the needed XQueries and interface (Sjur, Tomi)
- data synchronisation between risten.no and (Sjur, Tomi)
- test whether eXist as editor is actually working well (Thomas, others)
8. Spellers
Nothing while Tomi is on sick leave, and until the new proper name lexicon is in
place.
9. Other
Upcoming Strategy money application deadline at Samisk senter
Possible grant proposal themes include:
- Southern Sámi:
- Grammar research
- corpus collection
- normativity seminar (future speller)
- Lule Sámi:
- Northern Sámi:
- Text-to-speech
- Semantic annotation
- Programming infrastructure:
- Setting up a structure for inflecting dictionaries
SGL Seminar
- all members = potentially/likely all languages
- not all languages, only North Sámi
- date:
- As early as possible, end of March?
- place? Maaren will investigate
Bug fixing
33 open bugs (and 24 risten.no bugs)
10. Summary, task list
Børre
- send out contracts with accompanying letter
- Gather public texts, preferrably also parallel ones
- Continue converting text from input format to our xml
- convert nob and nno bible texts to be used as part of a parallel corpus
- review the paratext2xml converter
- convert smj NT to paratext
- Call Ove Sæth
- Move complex name lexicon issue to bugzilla
- Ask KIO Grafisk to make a test Quark document based on a Word document from us
- Send out letters to the Iđut authors
- Add corpus security re G5 syncing as an issue to Bugzilla
- fix bugs!
Maaren
- work with risten.no
- discuss with relevant people regarding seminar on proofing tools, normativity
and SGL in February/March, including place.
- get the normativity decissions from the December SGL meeting
Saara
- continue discussion on the new lexicon format
- Fix the preprocess script and optimize it.
- Try to add numeral treatment as part of the analyzator.
- Move gt2ga.sh to G5 and implement copying of the gt-dir.
- Create a parallel corpora of the new testaments.
- Routine for adding new languages to the propernoun xml-structure.
- Move to Bugzilla: the analyzer needs to be optimized.
- Implement validation of xml corpus against the dtd.
- Create a group for corpus users.
- Finish corpus dtd documentation, dtd location and permlink reference
- Finish gt/doc/ling/corpus_conversion_tech.html and rename to .xml
- fix bugs!
Sjur
- Follow up the lawyer treatment of the contracts
- Lule Sámi twol problems, with Thomas and Trond
- project planning with Trond, continued
- Follow up on place names from Norge Digitalt
- Evaluate SFST as speller (and analyzer) lexicon
- write a background document on the corpus contracts
- public tender:
- review draft tender document from Finnut
- smj G3 issue with Thomas and Trond
- sme G3 issue with Thomas and Trond
- call EDD/Christian Emil Ore about national place name lexicon
- risten.no/proper noun lexicon development:
- refactor code
- implement inheritance/collection overriding for xsl/css/xquery using sitemaps
- data synchronisation between risten.no and the cvs repository
- fix bugs!
Thomas
- work with Lule sámi 90% goal
- work on North Sámi compounding and derivation
- review corpus usage documentation
- smj G3 issue with Sjur and Trond
- sme G3 issue with Sjur and Trond
Tomi
- move aspell UTF-8 suffix bug to Bugzilla
- corpus infrastructure:
- dtd location (both public and internal)
- Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml
(it’s almost done, but there are a couple of loose ends)
- new proper name lexicon
- discuss the new lexicon format and other issues in the newsgroup
- Look into data synchronisation of proper nouns between risten.no and CVS
- XQuery refactoring and code development for our proper noun editor
- new version of xml2lexc (based on ccat), should handle complex names correct:
construct entries like we have now from the different parts of a complex name
entry
- fix bugs!
Trond
- Bring Lule Sámi up to 90 %
- Write Lule Report to NFR
- Apply for strategy funds
- Work on corpus texts with Børre.
- Contact the Finnish and Swedish Bible societies to get Bible texts.
- Look at ga/ directory issue with Saara
- Ask for a Quark test file from his sister
- News group discussion followup.
- fix bugs!.
11. Next meeting, closing
06.03.2006 09:30
Børre, Saara, Trond will be away.
Closed at 12:00