Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Speller infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10:10.

Present: Børre, Saara, Sjur, Tomi, Trond

Absent: Thomas, Maaren

Main secretary: Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

3. Documentation

Documentation tasks:

  1. Add documentation on our corpus infrastructure and our corpus work in general (“To be done by the ones making the corpora”: Børre, Tomi, Trond, Saara).
  2. Now we have 4 documents:
    1. Correct corpus (disamb usage)
    2. Corpus plan (for the disamb corpus cwb)
    3. catxml

For the basic corpora, we need 3 types of documentation, or doc for 3 target groups:

  1. For the users/linguists: Which corpora exist, and how do I use them? (this info is now scattered)
  2. For the collectors: How do I add texts, where do I add them, how do I convert them (this is the Corpus conversion doc)
  3. For the programmer: What did I actually do? (this is partly the catxml doc)

For the work on the graphical user interface, we need documentation as well, in principle along the same lines, except that the user is not the same linguist as above.

4. Corpus gathering

Governmental documents (earlier in pdf, now in html)

Tasks:

Contracts

Tasks:

The most problematic issue:

Who has the copyright of extracted material, like single words, collections of words, syntactic structure (potentially with some words filled in)? We need this to be controlled by us, not by the authors. The exact borderline is hard to define.

North Sámi New Testament

Lule Sámi New Testament

Svenska Bibelsällskapet is putting the finishing touches on the Lule Sámi translation; we will have it soon.

Lule Sámi Dictionary

Sjur will check whether Berit Karen has contacted Anders Kintel. — She has now sent the invitation.

5. Corpus infrastructure

Naming conventions and directory structure

New suggestions last Friday, with a proposal from Børre and Tomi. We decided to put the original files in this structure:

orig/yyyy-mm/filename.doc
            /filename.doc.xsl
            /filename.doc.xml
            /samefilename.doc => samefilename.doc
            /samefilename.doc => samefilename-1.doc
            /This\ is\ a\ very\ cumbersome\ and\ long\ filename.doc =>
            /This_is_a_very_cumbersome_and_long_filename.doc
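
The renaming shown in the listing above (spaces become underscores, a second file with the same name gets a numeric suffix) can be sketched in Python. This is only an illustration of the proposal; the function name `stored_name` and the `taken` set are hypothetical and not part of any existing script.

```python
from pathlib import PurePosixPath


def stored_name(original_name, taken):
    """Return the name a file is stored under, given the names already taken.

    Assumption: spaces are replaced by underscores, and a clash is resolved
    by appending -1, -2, ... before the extension, as in the listing above.
    """
    name = original_name.replace(" ", "_")
    path = PurePosixPath(name)
    candidate = name
    counter = 0
    while candidate in taken:
        counter += 1
        candidate = f"{path.stem}-{counter}{path.suffix}"
    taken.add(candidate)
    return candidate


taken_names = set()
first = stored_name("samefilename.doc", taken_names)
second = stored_name("samefilename.doc", taken_names)
long_name = stored_name("This is a very cumbersome and long filename.doc", taken_names)
```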

Reasoning:

If the input document is filename.(doc|pdf|html|txt|whatever), it has a title.
The output document is title.xml
sd-2001-1.txt

After a long discussion, we decided on the following:

orig/sme/news/thelongandstupidnameswegetasinputwithunderscore_for_space.doc
             /thelongandstupidnameswegetasinputwithunderscore_for_space.xsl
     sma
     smj
     nob
     fin
     swe
        /news/title2.xml
        /laws/title.xml
        /fict/title.xml ! oops same name as cousin in laws/
        /fact
        /bibl
        /admi
  gt/sme/news/thenewshortandsmartnameweinventedifneeded.xml
              (cf. lines 258-263, for smartness directions)
     sma
     smj
     nob
     fin
     swe
        /news/title2.xml
        /laws/title.xml
        /fict/title.xml ! oops same name as cousin in laws/
        /fact
        /bibl
        /admi
parallel.xml
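
As a rough sketch of the layout above, the following Python function maps a file under orig/ to its converted counterpart under gt/. The language and genre codes are taken from the listing; the function name `converted_path` and its optional `new_name` argument are hypothetical, and the sketch assumes that every original sits exactly at orig/LANG/GENRE/FILE.

```python
from pathlib import PurePosixPath

# Codes copied from the directory listing above.
LANGUAGES = {"sme", "sma", "smj", "nob", "fin", "swe"}
GENRES = {"news", "laws", "fict", "fact", "bibl", "admi"}


def converted_path(orig_path, new_name=None):
    """Map orig/LANG/GENRE/FILE to gt/LANG/GENRE/NAME.xml.

    new_name is an optional shorter name for the converted file; if it is
    not given, the original file name without its extension is used.
    """
    parts = PurePosixPath(orig_path).parts
    if len(parts) != 4 or parts[0] != "orig":
        raise ValueError(f"expected orig/LANG/GENRE/FILE, got {orig_path!r}")
    _, lang, genre, filename = parts
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language code: {lang!r}")
    if genre not in GENRES:
        raise ValueError(f"unknown genre code: {genre!r}")
    stem = new_name if new_name else PurePosixPath(filename).stem
    return str(PurePosixPath("gt", lang, genre, stem + ".xml"))


example = converted_path("orig/sme/news/title2.doc")
renamed = converted_path("orig/sme/news/a_long_input_name.doc", new_name="short")
```

Keeping the language and genre directories identical on both sides means the converted file can always be traced back to its original, apart from the file name itself.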

What parallel.xml could look like:

<paradocs>
    <entry id="1">
        <file lang="sme" orig="yes">sme-file.xml</file>
        <file lang="nob">nob-file.xml</file>
    </entry>
    ...
    <entry id="1234">
        <file lang="sme" orig="yes">sme-OTHERfile.xml</file>
        <file lang="nob">nob-OTHERfile.xml</file>
    </entry>
</paradocs>
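
A file of this shape could be read with a few lines of Python. The sketch below is only an illustration and assumes that the attribute values are quoted, so that the file is well-formed XML; the function names and the sample string are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the format suggested above, with quoted attributes.
SAMPLE = """<paradocs>
    <entry id="1">
        <file lang="sme" orig="yes">sme-file.xml</file>
        <file lang="nob">nob-file.xml</file>
    </entry>
    <entry id="1234">
        <file lang="sme" orig="yes">sme-OTHERfile.xml</file>
        <file lang="nob">nob-OTHERfile.xml</file>
    </entry>
</paradocs>"""


def parallel_files(xml_text):
    """Return {entry id: {language: file name}} for every entry."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.findall("entry"):
        files = {f.get("lang"): f.text for f in entry.findall("file")}
        result[entry.get("id")] = files
    return result


def original_language(xml_text, entry_id):
    """Return the language of the file marked orig="yes", or None."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("entry"):
        if entry.get("id") == entry_id:
            for f in entry.findall("file"):
                if f.get("orig") == "yes":
                    return f.get("lang")
    return None


docs = parallel_files(SAMPLE)
```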

This decision is final!

Further discussion is directed to the news group.

The old task list is repeated for convenience:

  1. Make a system for file and directory permission (today: we all belong to the cvs group), to only allow people with root user privileges write access to the corpus repository, at least regarding original files
  2. Include the xsl files under version control (cvs? rcs?)
  3. Incorporate language detection as part of the corpus processing.
  4. the dir structure is:
    1. one dir for orig, containing also the meta-info and interm. files
    2. another dir for our ready-to-use xml files after conversion
  5. dir structure for web-posted corpus files:
    1. subdivision according to week or month, we start out with month till we see the amount of traffic (yyyy-mm)
      1. Done
  6. we need a way to deal with hyphenated documents in catxml/preprocess:
    1. in normal cases hyphenation points should be removed
    2. when testing the robustness of our parsers, as well as when testing the hyphenator, the hyphenation points should be retained

Corpus conversion

All conversions (doc, pdf, html) are now integrated into one script.

Encoding conversion

perldoc gt/script/samiChar/Decode.pm

One script for converting all the different input formats. The xsl file is not yet taken properly into account.

gt/script/convert2xml.pl

--dir=dir_name  # The directory where the files are searched
--use-decode    # Use the character decoding (for testing)
--xsl=file_name # The name of the xsl file. I am going to change this.

Tasks:

This is Documentation

Pdf to XML

Saara has made a new conversion module; it is almost finished.

Task: Saara to prepare for this presentation, and to make documentation.

(X)HTML to XML

This is implemented by Tomi, under gt/script/xhtml2corpus.xsl. Usage:

tidy --quote-nbsp no --add-xml-decl yes --enclose-block-text yes -asxml -utf8 \
    -language sme file.html |
    xsltproc $HOME/gt/script/xhtml2corpus.xsl - > file.xml
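
For converting many files, the same pipeline could be driven from Python. The sketch below reuses the tidy and xsltproc options from the usage line above; the function names are hypothetical, and it assumes that both programs are installed. tidy exits with status 1 when it only has warnings, so the sketch treats only status 2 and above as a failure.

```python
import subprocess


def build_commands(html_path, stylesheet, lang="sme"):
    """Return the tidy and xsltproc argument lists for one HTML file."""
    tidy_cmd = [
        "tidy", "--quote-nbsp", "no", "--add-xml-decl", "yes",
        "--enclose-block-text", "yes", "-asxml", "-utf8",
        "-language", lang, html_path,
    ]
    # "-" makes xsltproc read the document from standard input.
    xslt_cmd = ["xsltproc", stylesheet, "-"]
    return tidy_cmd, xslt_cmd


def convert(html_path, xml_path, stylesheet, lang="sme"):
    """Run tidy, feed its output to xsltproc, and write the result to xml_path."""
    tidy_cmd, xslt_cmd = build_commands(html_path, stylesheet, lang)
    tidy = subprocess.run(tidy_cmd, stdout=subprocess.PIPE)
    if tidy.returncode >= 2:
        raise RuntimeError(f"tidy failed on {html_path}")
    with open(xml_path, "wb") as out:
        subprocess.run(xslt_cmd, input=tidy.stdout, stdout=out, check=True)


tidy_args, xslt_args = build_commands("file.html", "gt/script/xhtml2corpus.xsl")
```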

Documentation

The documentation for corpus conversion should be added to the gt/doc/ling/corpus_conversion.xml document.

6. Linguistics

Name lexicon

Summary: see the newsgroup

Motivation:

Needed: A plan for this project:

  1. do the main markup in the present propernoun file
  2. make a script for converting it to xml (to be done one time)
  3. make a script for xml2lexc (to be done by the makefile)
    1. There is a sample file for the xml file format in gt/common/src/proper-nouns.xml
    2. There is a working xml2lexc for Komi, written by Saara
  4. make the tags etc. in the parser
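
To make step 3 more concrete, here is a minimal Python sketch of what an xml2lexc conversion might do. It is not Saara's Komi script, and it does not follow the actual format of gt/common/src/proper-nouns.xml: the element names `entry`, `lemma`, `sublexicon` and `sem`, the lexicon name and the sample entries are all assumptions made for the illustration. Where the semantic tag ends up in the analysis string is likewise left open by the plan above.

```python
import xml.etree.ElementTree as ET

# Hypothetical input format; the real proper-nouns.xml may look different.
SAMPLE = """<propernouns>
    <entry>
        <lemma>Romsa</lemma>
        <sublexicon>ROMSA</sublexicon>
        <sem>Plc</sem>
    </entry>
    <entry>
        <lemma>Niillas</lemma>
        <sublexicon>NIILLAS</sublexicon>
        <sem>Mal</sem>
    </entry>
</propernouns>"""


def xml2lexc(xml_text, lexicon_name="ProperNoun"):
    """Turn each entry into a lexc line: lemma+Tag:lemma SUBLEXICON ;"""
    root = ET.fromstring(xml_text)
    lines = [f"LEXICON {lexicon_name}"]
    for entry in root.findall("entry"):
        lemma = entry.findtext("lemma")
        sublexicon = entry.findtext("sublexicon")
        sem = entry.findtext("sem")
        tag = f"+{sem}" if sem else ""
        lines.append(f"{lemma}{tag}:{lemma} {sublexicon} ;")
    return "\n".join(lines)


lexc_text = xml2lexc(SAMPLE)
```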

Conversion:

  1. This week
  2. (end of this week and) Next week:
    1. Then add the +Plc, +Mal, etc. tags in the parser
    2. Mark up as much as possible within a week or so (Maaren to do the Sámi names, and to split CNAME into BERN and LONDON, Ilona to look at C-FI-NEN and other Finnish names, Trond and Børre to look at the rest)
    3. Still to be done:
7985 DEATNU
3836 LONDON
1939 BERN
1388 C-FI-NEN
 692 ACCRA
 471 NYSTØ
 134 MARJA
 118 DUORTNUS
  59 NIILLAS
  45 ALEUHTAT
  43 ANAR
  29 SULLOT
  20 GIEDDI
  17 HEANDARAT
   8 GUOLBBA
   4 VARGGAT
   4 GEAVNNIS
   4 EATNAMAT
   1 ROMSA
  1. list continued:
    1. Then mark up the rest with correct semantic tags
    2. This means we would need a seventh option, the unspecified name.
    3. Then split propernoun-sme-lex.txt into two files: one with the Sámi names, generated by the xml2lexc script, and one manually written file containing the name sublexica (called propernoun-sme-morph.txt or whatever)
    4. Look into efficient editing of the XML lexicon
    5. Then convert to xml
    6. Look into efficient editing of the XML lexicon again
    7. Look into synchronisation issues with risten.no - we want the names there as well

Updated status quo:

Twol SETS definition issue

The definition of G1, G2, G3 in Lule Sámi is still open, and we would like to have input on this issue. We also need a G3 definition for North Sámi.

Update: it is still not working, see bug 193 (http://giellatekno.uit.no/bugzilla/show_bug.cgi?id=193)

SUGGESTION (Trond): Thomas, Trond and Sjur didn’t meet last week and should have a new meeting this Tuesday instead (tomorrow).

North Sámi

Lule Sámi

Sjur, Thomas and Trond will continue with the Lule Sámi issues.

Numerals

  1. An empirical overview
    1. Numeral generation
    2. Numeral inflection
    3. Numerals as parts of compounds
  2. A clear concept of how we want to treat them
    1. Tagging
  3. A treatment

We will return to this issue after the name conversion.

7. Speller infrastructure

Nothing this week either.

8. Other

Technical issues

Video conferencing across firewalls

The problem we’ve had with the SD firewall persists, and there don’t seem to be any resources available to help us. Geir Kaaby instead suggested that we try out the Marratech package. So please download the MacOS X client (or get it from me), and I’ll send you the URL to the meeting room as soon as I get it.

Bug fixing

17 open bugs (and 24 risten.no bugs)

Bugzilla:
 37	nor	P2	Mac	thor.oivind.johansen@hum.ui...	ASSI	Bugzilla is not able to handle the Sámi characters.
197	nor	P2	Mac	boerre@skolelinux.no		NEW	Links to Bugzilla must be checked and corrected for new s...

UTF-8:
 61	nor	P2	Mac	boerre@skolelinux.no		ASSI	mpage barfs on utf-8 input
196	nor	P2	All	boerre.gaup@samediggi.no	NEW	UTF-8 encoded html gets garbled

Corpus:
160	nor	P2	Mac	tomi.pieski@hum.uit.no		NEW	Hyphen not recognised in Genesis
187	nor	P2	All	tomi.pieski@hum.uit.no		ASSI	catxml is undocumented
188	nor	P2	All	tomi.pieski@hum.uit.no		ASSI	catxml crashes if XML/Twig.pm is not installed
198	nor	P2	Mac	tomi.pieski@hum.uit.no		NEW	xsl script for Bible files does not single out chapter he...

Hard to solve:
 77	nor	P2	Mac	trond.trosterud@hum.uit.no	ASSI	consonantchange in the end of verbstem

háliidit: d > t in final position. -ijd is spelled -iid and should be spelled -iit.
We should have had *in háliit*, but we have *in háliid*.

Present situation:
háliit  háliit  +?                      #wrong
háliid  háliidit+V+TV+Ind+Prs+ConNeg    #wrong
maid    maid+Interj                     #ok, but not if háliit is corrected
maid    maid+Adv                        #ok, but not if háliit is corrected
guliid  guolli+N+Pl+Gen                 #ok, but not if háliit is corrected
maid    mii+Pron+Interr+Pl+Acc          #ok, but not if háliit is corrected
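
One reading of the "ok, but not if háliit is corrected" remarks is that a rule changing every word-final d to t would fix háliid but would also rewrite forms that are correct today. The Python sketch below only illustrates that overgeneration on the word forms listed above; it is not the twol rule, and the function name is hypothetical.

```python
def naive_final_d_to_t(word):
    """Change a word-final d to t, with no further conditions.

    This is deliberately too simple: it shows what an unconditioned
    rule would do, not how the rule should be written.
    """
    if word.endswith("d"):
        return word[:-1] + "t"
    return word


# Word forms from the listing above.
forms = ["háliid", "maid", "guliid"]
changed = {form: naive_final_d_to_t(form) for form in forms}
```

Here háliid becomes háliit as wanted, but maid and guliid are changed as well, so the rule needs a context that separates the verb form from the other two.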

G3 definition issue:
 50	nor	P2	Mac	Maren.Palismaa@Samediggi.no	NEW	LEXICON-GEARGGUS and others
 56	nor	P2	Mac	trond.trosterud@hum.uit.no	ASSI	-headdjiid and -heddjiid
186	nor	P2	Mac	trond.trosterud@hum.uit.no	ASSI	No dipht. simpl in actor nouns before uj
193	nor	P2	Mac	trond.trosterud@hum.uit.no	NEW	oa->å dipht. simpl. in actor nouns

Numeral project:
  6	nor	P2	All	tomi.pieski@hum.uit.no		NEW	Num tag is needed in compounds, but stripped in lookup2cg
158	nor	P2	Mac	trond.trosterud@hum.uit.no	ASSI	Num+Sg+Gen+logi
169	nor	P2	Mac	trond.trosterud@hum.uit.no	NEW	golbmalohkása
176	nor	P2	Mac	trond.trosterud@hum.uit.no	NEW	beal+Ord

Bugzilla update

When Bugzilla is being moved, it should also be updated to the newest version, and the UTF-8 bug should be resolved.

Buying

risten.no

Project planning and development processes

Trond is using his project as a test case for an IT guy, Geir Tore Voktor, who is taking a course in project management. Be prepared to answer questions.

9. Summary, task list

Børre

Maaren

Saara

Sjur

Thomas

Tomi

Trond

10. Next meeting, closing

31.10.2005 10:00

Closed at 12:36