Meeting setup

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from a week ago
  3. Documentation - status
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Term db
  8. Other issues?
  9. Summary, task lists
  10. Closing

Last week’s task list:

1. Opening, agenda review

Opened at 10.12. Additions to the agenda:

2. Reviewing the task list from a week ago

3. Documentation - status

We should aim to generate static HTML. “forrest run” serves the pages dynamically, while “forrest site” builds static HTML pages.

X [0] doc/lang/sms/src/verb-sms-lex.txt
      BROKEN: No pipeline matched request: doc/lang/sms/src/verb-sms-lex.txt
X [0] doc/ling/vislcg.html
      BROKEN: /Users/trondtro/xtdoc/sd/src/documentation/content/xdocs/doc/ling/vislcg.xml
      (No such file or directory)
X [0] index_smi.html
      BROKEN: /Users/trondtro/xtdoc/sd/src/documentation/content/xdocs/index_smi.xml
      (No such file or directory)
X [0] doc/lang/smj/docu-sme-flowchart.html
      BROKEN: /Users/trondtro/xtdoc/sd/src/documentation/content/xdocs/doc/lang/smj/
      docu-sme-flowchart.xml (No such file or directory)
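
To triage a report like the one above, a small script can turn it into (page, reason) pairs. The following is only a sketch: it assumes the report always has the shape shown here (an "X [n] page" line followed by one or more indented reason lines), and the function names are hypothetical, not part of Forrest.

```python
import re

# Assumed entry shape, taken from the report above: "X [0] some/page.html"
ENTRY = re.compile(r"^X \[\d+\]\s+(\S+)$")


def parse_broken_report(report):
    """Return a list of (page, reason) pairs from a broken-links report."""
    entries = []
    for raw in report.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match:
            # Start a new entry; the reason is filled in by the lines below it.
            entries.append([match.group(1), ""])
        elif entries:
            # Reason lines may wrap, so append every continuation line.
            if line.startswith("BROKEN:"):
                line = line[len("BROKEN:"):].strip()
            entries[-1][1] = (entries[-1][1] + " " + line).strip()
    return [(page, reason) for page, reason in entries]


def missing_sources(entries):
    """Pages whose source file does not exist at all."""
    return [page for page, reason in entries
            if reason.endswith("(No such file or directory)")]


SAMPLE = """\
X [0] doc/lang/sms/src/verb-sms-lex.txt
      BROKEN: No pipeline matched request: doc/lang/sms/src/verb-sms-lex.txt
X [0] doc/ling/vislcg.html
      BROKEN: /Users/trondtro/xtdoc/sd/src/documentation/content/xdocs/doc/ling/vislcg.xml
      (No such file or directory)
"""

if __name__ == "__main__":
    for page, reason in parse_broken_report(SAMPLE):
        print(page, "->", reason)
```

Separating the "No such file or directory" entries from the "No pipeline matched" ones shows which links point at missing sources and which point at files Forrest cannot render.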

Link problems:

All internal links should be corrected, and links to original source files should be replaced with links to excerpts.

Forrest on cochise is still problematic: no such command

4. Corpus gathering

Thomas did a lot of work last week and received many positive answers (see last week’s meeting memo).

Also Info nuorra. The Ministry of Agriculture in Sweden has sent some texts in Northern, Lule, Southern and Anár Sámi. Some private texts too.

We need a contract to use when making agreements with people and organisations. Børre will look at the one used in Oslo.

5. Corpus infrastructure

The antiword issue: Antiword assumes that everything in a Word document is Unicode, and converts that to whatever 8-bit character set we want (the -m option). Our problem is that we want a many-to-one machine, and not a one-to-many machine.

Use antiword to convert from .doc to DocBook format. Then Perl postprocesses the errors in the .xml files that use Levi and the other private hacks.
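
The many-to-one postprocessing step could look roughly like the sketch below. It is written in Python rather than Perl purely for illustration, and the mapping table is invented: the stand-in characters are placeholders, not the real Levi table or any other actual encoding hack. The names HYPOTHETICAL_MAP, build_table and normalise are likewise assumptions.

```python
# Invented example data: each intended letter lists the stand-in characters
# that different private hacks might have left in the converted text.
# These pairs are placeholders, NOT the real Levi (or any other) table.
HYPOTHETICAL_MAP = {
    "č": ["¢", "¤"],
    "š": ["¹", "³"],
    "đ": ["¶"],
}


def build_table(many_to_one):
    """Invert {target: [sources]} into a table usable by str.translate."""
    table = {}
    for target, sources in many_to_one.items():
        for source in sources:
            key = ord(source)
            # A source character may only ever point at one target.
            if table.get(key, target) != target:
                raise ValueError("conflicting mapping for %r" % source)
            table[key] = target
    return table


def normalise(text, table):
    """Replace every stand-in character with its intended letter."""
    return text.translate(table)


if __name__ == "__main__":
    table = build_table(HYPOTHETICAL_MAP)
    print(normalise("¢áhci", table))
    print(normalise("¤áhci", table))
```

Keeping the table keyed by the target letter makes the many-to-one direction explicit, and the conflict check catches a stand-in character that two hacks use for different letters. Such cases would need per-document handling instead.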

6. Linguistics

Thomas: I have now gone through all 54 Northern Sámi adjectival sublexica in sme-lex.txt and, with the exception of two bugs (reported to Bugzilla), all the paradigms now seem to be generated correctly. While testing, though, I noticed that some of the adjectives themselves are directed to the wrong sublexica. I have started to go through these adjectives. Other issues regarding the comparison of certain adjectives remain to be treated and will be posted in the discussion group.

Maaren: no time last week. Maybe working from home would help?

Trond, Lena: Worked on transitivity in verbs. Checked so far: -ot verbs and -eret ^LOAN words; work has started on DIEHTI verbs. The standard versions of n, a and v are now reverse-sorted in CVS, in order to facilitate systematic lexicon work.
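
As an illustration of why reverse sorting helps, here is a minimal sketch. It assumes that "reverse-sorted" means ordered by the word read from its end, so that entries with the same ending sit next to each other. The word list is a handful of Northern Sámi verbs chosen for illustration, not an excerpt from the lexicon files, and the function name is hypothetical.

```python
def reverse_sort(words):
    """Sort words by their reversed spelling, so shared endings group together."""
    return sorted(words, key=lambda word: word[::-1])


# Illustrative lemmas only; the real lexicon entries carry more information.
VERBS = ["boahtit", "diehtit", "lohkat", "čállit", "vuolgit", "oahppat"]

if __name__ == "__main__":
    for verb in reverse_sort(VERBS):
        # Right-align so the shared endings line up visually.
        print(verb.rjust(10))
```

In the sorted list the two -at verbs come out adjacent, as do the -htit verbs, which is what makes it practical to check one inflection class at a time.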

7. Term db

Sjur and Børre will continue working on that until the official opening on Wednesday the 16th. After that there will be no one to maintain it, and it needs documentation, but the workload will be much smaller (hopefully).

8. Other issues

PaNoLa-Plus

FIN1) Asiakas syö lounasta. (The client is eating lunch.) (FIN1a)
A1
STA:cl(fcl)
S:n(sg,nom)	Asiakas
P:v(fin,pres,ind,3sg)	syö
Od:n(sg,par)	lounasta

FIN82) Soittaessasi viulua unohdin ajan kuluvan. (While you were playing the violin, I forgot how the time passes.) (FIN9b)

A1
STA:cl(fcl)
A:cl(icl)
=P:v(inf2,act,ine+poss2sg)	Soittaessasi
=Od:n(sg,par)	viulua
P:v(fin,imperf,ind,1sg)	unohdin
Od:cl(icl)
=S:n(sg,gen)	ajan
=P:v(pcp1,act,sg,gen)	kuluvan

See [http://visl.sdu.dk/], cf. e.g. Finnish, under languages. The project description states:

“The overall aim of the project is to develop and extend a unified language-description apparatus for the Nordic languages, based on the grammatical conventions and language technology tools used in the VISL system. The focus will be on teaching tools, parallel texts and the spillover effect between the framework languages Danish and Norwegian on the one hand and the smaller Nordic languages on the other.

Sámi: e.g. resource inventory, current CG activities, specific grammatical problems in a contrastive perspective”

What we will need from the project is a view on syntactic analysis (S, Od, P, A, …)
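
To make the notation in the two Finnish examples above concrete, the sketch below reads such an analysis into a flat list of nodes. It assumes each node line has the shape "function:form(features)" optionally followed by whitespace and the word, with leading "=" signs marking embedding depth, as in the examples. The class and function names are hypothetical and not part of the VISL tools.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed line shape, based on the examples above: "=P:v(fin,pres)<TAB>word"
NODE = re.compile(r"^(=*)([A-Za-z]+):([A-Za-z0-9]+)\(([^)]*)\)(?:\s+(\S.*))?$")


@dataclass
class Node:
    depth: int              # number of leading "=" signs
    function: str           # e.g. S, P, Od, A, STA
    form: str               # e.g. n, v, cl
    features: List[str]     # e.g. ["sg", "nom"]
    word: Optional[str]     # None for clause-level nodes


def parse_analysis(text):
    """Parse one analysis block; lines that are not nodes (e.g. "A1") are skipped."""
    nodes = []
    for raw in text.splitlines():
        match = NODE.match(raw.strip())
        if not match:
            continue
        marks, function, form, features, word = match.groups()
        nodes.append(Node(len(marks), function, form,
                          features.split(",") if features else [], word))
    return nodes


SAMPLE = "A1\nSTA:cl(fcl)\nS:n(sg,nom)\tAsiakas\nP:v(fin,pres,ind,3sg)\tsyö\nOd:n(sg,par)\tlounasta\n"

if __name__ == "__main__":
    for node in parse_analysis(SAMPLE):
        print("  " * node.depth + node.function, node.form, node.features, node.word)
```

The function labels (S, P, Od, A) end up in their own field, separate from the morphological features, which is the part of the analysis we are interested in here.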

Easter

Wednesday is a half day; Thursday, Friday and Monday are off.

9. Summary, task lists

10. Closing

Closed at 12.10