Meeting setup
- Date: 02.05.2005
- Time: 10.00 Norw. time
- Place: Wherever we are :-)
- Tools: Phone, iChat, SubEthaEdit
Agenda
- Opening, agenda review
- Reviewing the task list from a week ago
- Documentation - divvun.no
- Corpus gathering
- Corpus infrastructure
- Linguistics
- Term db
- Other issues
- Summary, task lists
- Closing
1. Opening, agenda review, participants
Opened at 10.10. Agenda accepted as is.
Present: Maaren, Sjur, Thomas, Tomi, Trond, Børre
Main secretary: Trond (Sjur)
2. Reviewing the task list from the last meeting
- All: report vacation plans
- Trond, Børre, Sjur and Tomi
- A session on the corpus format.
- We had several sessions on the corpus format, and actually came up with
a working model, at /usr/local/share/corp/. We made a dtd proposal, a directory
structure, in essence a home for our corpora.
- Børre
- Contact Kimmo Koskenniemi and Ruth Vatvedt Fjeld about contracts
- Sent an e-mail, no respons so far
- Trond: will call around
- Fix the link problems in the documentation.
- Only a handful of link problems left
- Further discussion with web server administrator about the divvun.no site.
- Monday/Tuesday next week he will do the nexcessary work
- Some minor things with termdb.
- Done some of it, some left
- Review giellatekno.no site suggestion with Trond.
- Did some, found problems with utf-8 .ihtml files.
- Tomi
- continue corpus discussion
- Did so, and updated the conversion script from docbook to our internal format
- corpus location and corpus processing
- Maaren: tries to work with the missing list. Else?
- Maaren has had courses and seminars at the Sámi parliament the whole week.
- Thomas: work with verb transitivity
- Has been working as usual, with verb transitivity (which is much work).
- Sjur:
- … has been working with corpus format and processing.
- The editing tool of the terminology db is almost completed. The technical
problems are solved as well.
- No work on the text license contract or on divvun.no, awaiting further progress
here.
3. Documentation - divvun.no
The humfak server will be set up for divvun.no this week, so that the pentalingual
welcome pages should be ready by next week.
Todo: Translate the current welcome page into 5 lgs, and add some text as well.
We will add more information on the plans and the progress of the project.
The setup incl. sync with CVS
Official opening by Språkstyret medio June. Todo: Ask Anne Britt to put it on the
agenda (21.-22.6. in Tysfjord).
North Sámi as default language. There should be a note explaining that English is
the internal language for documentation so that people won’t complain about the techdoc
not existing in other lgs.
Giellatekno pages will be changed from parallel bilingual text on the same pages
into two parallel sets of pages.
4. Corpus gathering
There is a problem that we have got no answer from Oslo and Helsinki. We need
the documents soon, Sjur and Trond to talk to relevant people.
Northern Sámi texts:
Further issues for corpus gathering: Talk with the Bible translators, the writers
association,
We have written letters to the Departments in Oslo, the Sámi parliament, the
municipalities, etc. Maja Hætta at the Guovdageaidnu suohkan is going to collect
text for us. Sámi Allaskuvla has not responded, Sámi Instituhta are awaiting a
license suggestion for their taped texts, but we will get the newspaper texts right
away.
Lule Sámi texts:
We should talk to the Tysfjord branch of the Parliament on gathering of Lule Sámi
text. There is some text translated to Lule Sámi within the Sámi parliament system,
the challenge is just to find them. There should be Lule Sámi text in Sweden as well,
especially the NT, here, Susanna Kuoljok-Angeus should be contacted.
We need a web crawler for Sámi text. TODO: Contact Knut Hofland in Bergen on this
issue. The Skolelinux project has set up a web-crawler for bokmål and nynorsk, we
could probably also use the same one. The two crawlers should be compared.
5. Corpus infrastructure
We have arrived, or rather are arriving, at the following:
sme/
/orig/<donor>/<year>/file.doc <- dump, **must** be write protected
/int/as-in-orig/file.db.xml <- docbook format
/int/as-in-orig/file.xsl <- file-specific scripts, under version control
/gt/publisher OR author/year/file.xml <- xmlpreprocess
... | lookup ...
We should put some effort into the donor directory. Donor could be ordered according
to person, or according to institution.
We need: The texts
We need: A script for going from file.xml to .. -> preprocess
eventually modifying preprocess as well (but still making preprocess able
to take raw text as input, or we may make an xmlpreprocess along with the
(txt)preprocess).
6. Linguistics
Gone through 5200+1500 verbs, more than 13 000 all in all.
Maren tries to work with the missing list
(again)
Issues:
- General issue: How should we prioritise between work on names, n-v-a cluster,
closed classes
- Detailed view:
- Lexical coverage
- Sámi place names (later, when Hønefoss have been transposing more names)
- Names:
- The LONDON/BERN (-is/-as) issue.
- Distinguishing different name types? (Person, place, …).
- Group parallel names (Guovdageaidnu / Kautokeino(s) / Koutokeino = Koutokeinossa,
Maaren / Maren, Máret / Marit)
- 3-part compounds: sámegieloahpahus. Needed: Linguistic description + lexc
programming skills.
- east-west differences
- indefinite pronouns (closed-sme-lex.txt file)
- Compounding
A priority policy is to first look at the closed classes.
For the other classes
So far, we have done: inflectional morphology of the n-v-a cluster.
Derivational morphology and compounding not been systematically checked.
Linguistic priority list:
- finish the verb transitivity
- closed POS
- compounds
- derivation
- completing the lexicon
- names
7. Term db
Issues left:
- i18n
- l10n
- styling
- graphics
- translations
- more…
New deadline for internal opening: May 13th
New server: June 1st
Official opening: June 17th
8. Other issues
University project:
Will need 1-2 new linguists, for at least one and a half year.
Friday May 6:
- Sjur: taking day off
- Maren: me too
9. Summary, task list
TODO:
- Thomas, Maaren, Børre: Translate divvun front page
- Trond, Tomi, Børre, Sjur: Continue with corpus infrastructure
- Trond:
- Will find out who can help us with corpus agreement contracts
in Oslo when Ruth Vadtvedt Fjeld is away.
- Work with Børre on the giellatekno
- Sjur:
- Call Kimmo about license text & e-mail from Børre
- More termdb work
- Contact Anne Britt about divvun.no and the Språkstyremøte.
- Børre:
- Ask Kåre Tjikkom, Harriet Aira, Karin Tuolja, Susanna Kuoljok-Angeus and
Samuel Gælok about Lule Sámi text.
- Contact Leif Åge adding Trond to divvun e-mail alias.
- Prepare the war file for Tomcat deployment next week
- Work with Trond on the giellatekno pages
- Finish the work on the Termdb
- Contact Skolelinux about their webcrawler, also Knut Hofland (ask Trond/Sjur)
- Tomi:
- Decide the directory structure for corpus originals
- Script for antiword and UTF-8 fix -processing
- Template for xsl manual conversion script
- Script for processing the corpus for preprocessor
- Thomas: continue with verbs and ask Teemu Leskinen for finnish parallel names.
- Maren:
- Have a look at the closed classes, especially the indefinite pronouns.
- Work more on lexical issues.
10. Next meeting, closing
09.05.2005 10.00
Closed at 12.23.