For reference: This is what we decided in Helsinki:
admin/depts/  (governmental departments)
      guovda/ (Guovdageaidnu municipality)
      karas/  (Kárášjohka municipality)
      sd/     (Sámi parliament)
      others/ (everything else)
bible/ot/
      nt/
facta/
ficti/
laws/
news/MinAigi
    /Assu
    /NRK
    /YLE
    /other ```
- Reprocess the old (from new dir.) corpus files
Naming conventions and directory structure
- The original files should be protected using file and directory permissions.
- The meta information (i.e., the xsl translation files) should be under version
control
- Given that our language detection works well, the intermediate files don’t need
to be under version control (the language identification tool is under gt/script,
and it needs to be made part of the corpus processing)
Tasks:
- Make a system of file and directory permissions (today we all belong to the
cvs group) that allows only people with root user privileges write access to the
corpus repository, at least for the original files
- Include the xsl files under version control (cvs? rcs?)
- Incorporate language detection as part of the corpus processing.
- the dir structure is:
- one dir for orig, containing also the meta-info and interm. files
- another dir for our ready-to-use xml files after conversion
- dir structure for web-posted corpus files:
- subdivision by week or month; we start with months (yyyy-mm) until we see
the amount of traffic
- Done
- we need a way to deal with hyphenated documents in catxml/preprocess:
- in normal cases hyphenation points should be removed
- when testing the robustness of our parsers, as well as when testing the
hyphenator, the hyphenation points should be retained
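The two hyphenation modes above can be sketched as a single switch. This is only an illustration in Python, not the catxml/preprocess code: the function name `handle_hyphenation` is hypothetical, and it assumes that a hyphenation point is represented as a soft hyphen (U+00AD), which the memo does not specify.

```python
# Assumption: hyphenation points are stored as soft hyphens (U+00AD).
SOFT_HYPHEN = "\u00ad"


def handle_hyphenation(text: str, keep: bool = False) -> str:
    """Remove hyphenation points by default; retain them when keep is True.

    keep=False is the normal case; keep=True is meant for testing parser
    robustness and for testing the hyphenator.
    """
    if keep:
        return text
    return text.replace(SOFT_HYPHEN, "")
```

Making removal the default matches the normal case, so only the test runs have to ask explicitly for the hyphenation points to be retained.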
Corpus conversion
Pdf to XML
Saara has made a new conversion module, which is almost finished. We’ll return
to this issue (evaluation, etc.) at the next meeting.
Task: Saara to prepare for this presentation, and to write documentation.
perldoc gt/script/samiChar/Decode.pm
(X)HTML to XML
Tomi has been looking at this, and is making an xsl script for it. The web form
developed by Tomi should be augmented to allow posting of URLs as well as
documents from the local file system.
The URL posting needs to check whether the same URL has been posted before, and
if so, whether the page has changed.
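One possible way to do the check on posted URLs is to store a content hash per URL and compare it on each new posting. The sketch below is in Python and rests on several assumptions: the function name `check_posting`, the in-memory dictionary `seen` and the choice of SHA-256 are illustrative only, and a real web form would need persistent storage and would have to fetch the page itself.

```python
import hashlib
from typing import Dict


def check_posting(url: str, content: bytes, seen: Dict[str, str]) -> str:
    """Classify a posted URL as 'new', 'unchanged' or 'changed'.

    `seen` maps each previously posted URL to the hash of the page content
    at the time of the last posting (a hypothetical store).
    """
    digest = hashlib.sha256(content).hexdigest()
    previous = seen.get(url)
    seen[url] = digest  # remember the latest version of the page
    if previous is None:
        return "new"
    if previous == digest:
        return "unchanged"
    return "changed"
```

Comparing hashes rather than full page contents keeps the store small, at the cost of treating any byte-level difference (a changed date stamp, for instance) as a changed page.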
Task: Tomi and Saara to present the status quo and suggest routines, a merger,
etc. at the next meeting.
The documentation for corpus conversion should be added to the
gt/doc/ling/corpus_conversion.xml document.
6. Linguistics
Name lexicon
Summary: see the newsgroup
Motivation:
- Divvun: We want to cross-link different versions of the same locations
in different languages
- Common: We do not want to enter the same names twice. We want a
language-independent name lexicon
- Disamb: Having a richer tag set makes it easier to disambiguate
- Future: Richer analysis makes new applications possible, within
information retrieval, grammar checking, machine translation etc.
Needed: A plan for this project:
a. do the main markup in the present propernoun file
b. make a script for converting it to xml (to be done one time)
c. make a script for xml2lexc (to be done by the makefile)
d. make the tags etc. in the parser
Conversion:
- This week
- clean up the present infl. lexicons (merge BLIND and BERN, VUOLAB and LONDON)
- Trond
- Make an emacs mode for markup (Saara). Options: fem, mal, sur, plc, org,
obj, none. Combinations: surplc
- (end of this week and) Next week:
- Mark up as much as possible within a week or so (Maaren to do the Sámi
names, and to split CNAME into BERN and LONDON, Trond and Børre to
look at the rest)
- Then convert to xml
- Then mark up the rest with correct semantic tags
- This means we would need a seventh option, the unspecified name.
- Look into efficient editing of the XML lexicon
- Look into synchronisation issues with risten.no - we want the names there
as well
Updated status quo:
- Entries: 20000
- Converted: 13500
- Time used: 10 h
Needed tools: An emacs mode doing this (Saara):
- Go to next “ NAME ;” ( where NAME is a string of symbols “A-Z-”)
- Wait for input, one of these: m f s p o b
- Replace “ NAME ;” with “ NAME-mal ;”, “ NAME-fem ;” etc. and go to next
“ NAME ;”
Possible refinement: Encode for combined options (both plc and sur, e.g.)
already in this phase.
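The replacement step of the planned emacs mode can be illustrated outside emacs. The following Python sketch is not the mode itself: it assumes that the keys m, f, s, p, o, b stand for mal, fem, sur, plc, org, obj in that order (the memo lists the keys and the tags separately), and the function name `tag_entry` and the sample entry are hypothetical.

```python
import re

# Assumed mapping from input key to tag; the memo does not spell it out.
KEY_TO_TAG = {"m": "mal", "f": "fem", "s": "sur",
              "p": "plc", "o": "org", "b": "obj"}

# " NAME ;" where NAME is a string of the symbols A-Z and "-".
NAME_PATTERN = re.compile(r" ([A-Z-]+) ;")


def tag_entry(line: str, key: str) -> str:
    """Replace the first " NAME ;" in line with " NAME-tag ;"."""
    tag = KEY_TO_TAG[key]
    return NAME_PATTERN.sub(
        lambda match: f" {match.group(1)}-{tag} ;", line, count=1
    )
```

With a hypothetical entry, `tag_entry("Oslo LONDON ;", "p")` gives `"Oslo LONDON-plc ;"`. An entry that already carries a lowercase tag no longer matches the pattern, so it is left unchanged if it is visited a second time.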
Waiting for emacs mode.
Twol SETS definition issue
The definition of G1, G2, G3 in Lule Sámi is still open, and we would like to
have input on this issue.
Update: it is still not working, see [bug
193|http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=193]
SUGGESTION (Trond): Thomas, Trond and Sjur hold a new meeting on
this issue on Wednesday.
North Sámi
- three-part compounds issue still open, as is the number project.
- The treatment of Sámi place names: we need a contract with “Norge digitalt”,
via UFD.
- Sjur has written an e-mail to the UFD contact person,
Øystein Johannessen, who will look into it soon. He has not responded
beyond saying he will return to it. Sjur brought this up in the board
meeting, and Bjørn Olav Megard will remind Øystein Johannessen
about this issue. Sjur will follow up on this one.
- normativity issues:
- the Giellalávdegoddi meeting was last Friday, and they will have a new meeting in
December. They were not able to make any decisions, and there will be a new
Giellalávdegoddi beginning next year, which won’t make decisions until late
spring. This is a serious problem for the Divvun project.
- Actions: Sjur will bring this to the Divvun board, write a new letter to
the Giellalávdegoddi, emphasizing the needs and timetables of the project
Lule Sámi
Sjur, Thomas and Trond will continue working on the Lule Sámi issues.
Numerals
- An empirical overview
- Numeral generation
- Numeral inflection
- Numerals as parts of compounds
- A clear concept of how we want to treat them
- Tagging
- A treatment
We will return to this issue after the name conversion.
7. Speller infrastructure
Nothing this week either.
8. Other
Technical issues
- The mac os / perl bug (at least Trond and Sjur have it):
- utf8 “\xC4” does not map to Unicode at /Users/trond/gt/script/preprocess line
- This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl
5.8.6). It is probably a perl - OS mismatch. (Trond, Thor Øivind,
Tomi)
- Another example of the same bug:
- :”\x{00c3}” does not map to utf8 at ../script/preprocess line 113, <> chunk
33.
Bug fixing
10 open bugs (and 24 risten.no bugs)
Buying
risten.no
- Organisation: could Tomi be used, in exchange for more linguistic work by
(old) GIO members? Yes, it is ok, but how much still needs to be evaluated
- it is ok to integrate “kvensk” placenames with risten.no
- this should be integrated with the general proper name work - we want all
proper names integrated with risten.no, cf. above
- needs further development of risten.no to allow for multiple XML bases to
be presented and maintained in parallel. This is to be further worked on by
Tomi and Sjur
Meeting with Kvensk revitalisation project
Grammar, dictionary, placename lexicon for Kvensk. They want an infrastructure
similar to the one in Sámi language technology. Trond and Sjur will
discuss how we can help them without taking too much attention away from our
real jobs.
9. Summary, task list
Børre
- Contact oahpahusossodat and the rest of the SD about texts
- Doing some digging into WebSak
- Reorganise the directory structure
- Put all corpus texts into one place
- Continue converting text from input format to our xml
- Have a look at the placenames files.
- Ask Thor-Øivind to move bugzilla to our new webserver.
- Gather public texts
- Work on the name lexicon
Maaren
- The missing list, both the overall missing list from our xml corpus, and a
file-for-file review, in order to get different terminology.
- continue working with the missing list from risten.no
- working with the missing list from risten.no this week (today)
- Start working on Sámi place names
- Start working on normativity issues (numeral issues with Trond?)
Saara
- Look at the corpus infrastructure issue
- Look at the corpus interface issue with Lars
- Convert texts from .doc to .xml, to get a grasp of our corpus format
- make an emacs mode for the name project (cf. specs in the memo above)
- prepare for a presentation of the pdf etc. conversion together with Tomi
for the next meeting.
Sjur
- Lule Sámi twol problems, look again at the sets definition with Thomas and
Trond
- risten.no bugs and fixes
- follow up on the voice group chat to Sámediggi not working
- Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
- project planning with Trond, continued
- also look at the development processes - specification and testing
- Follow up on the meeting with Anders Kintel 17th of November -> ask
Berit Karen Paulsen/Bitte
- Follow up on place names from Norge Digitalt -> remind Bjørn Olav Megard
- Evaluate SFST as speller (and analyzer) lexicon
- more thorough analysis than was possible in Guovdageaidnu
- write a background document on the corpus contracts
- Discuss the contract issue with Trond, return the new version to the lawyer
- write to the board about the lack of progress with the Giellalávdegoddi, and
the problem it causes for the project
- write to the Giellalávdegoddi once more, emphasizing timetable and response
needs in the Divvun project
- discuss kvensk project support with Trond
- write public tender documents
Thomas
- work on Lule Sámi compounding and derivation
- Look at Linguistic bugs with Trond
- Meet with Sjur and Trond about the definition of G1, G2, G3
Tomi
- Aspell: Continue working on the affix file & aspell
- Contact aspell author (UTF-8 thing)
- three-part compounding
- corpus infrastructure: dtd location (both public and internal)
- corpus infrastructure: file and dir organisation
- Document aspell and corpus infrastructure
- Cgi-script for uploading documents to corpus base
- Specification for new catxml in C++
- this includes also placing the source and binary
- clean the script/ directory with Trond
- Common makefile issues
Trond
- Work on the bug list (7 open).
- Get the new version of the New Testament
- project planning with Sjur, continued
- also look at the development processes - specification and testing
- Discuss the contract issue with Sjur, return the new version to the lawyer.
- Work on the name project: Clean up the lexicon file, discuss the emacs mode with
Saara and the work with Maaren and Børre.
- Add documentation on the corpus infrastructure
- clean the script/ dir
- discuss kvensk project support with Sjur
10. Next meeting, closing
24.10.2005 10:00
Closed at 12:11