Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

View GiellaLT on GitHub divvungiellatekno/giellalt.uit.no

Page Content

Corpus meeting 11.4.2011

Present: Berit Merete, Børre, Ciprian, Tomi, Trond

Agenda

Bugzilla

983 maj P2 xsl conv boerre@skolelinux.no Multiple copies of sma files in corpus

The bug itself

20-or-so left (cf. list in bug). Børre to close it.

Extention

  1. A follow-up would be (is) to follow up for the other languages: in the whole corpus (free+bound).
  2. We also risk having double content (same file, two names)
    1. Detect: sort by line and look for double lines
  3. General problem: test of NOT adding already existing files
    1. Script for doing that?

The last point should await a somewhat working corpus.

301 enh P5 Text cor borre.gaup@samediggi.no Proofed/unproofed parallel files need better treatment

This is not a bug, it is a feature.

“We are receiving both proofread and unproofed documents from Min Áigi, and some of them in parallel (the same doc as both versions). For the time being we are not able to make any real use of that info, but we should. Here are some thoughts about possible processing:”

Suggestion: Make a web page / project for goldstandard improvement, and close this bug. (Trond has commented)

841 maj P2 Text cor ciprian.gerstenberger@uit.no Nested error markup does not get correctly converted to xml

This is the only real bug on this list. So, this is for Cip (and Sjur). Check with Lene for a reasonable deadline (mid august?).

838 nor P3 xsl conv ciprian.gerstenberger@uit.no Log file creation not permitted by some users

Cip closes.

836 cri P2 xsl conv ciprian.gerstenberger@uit.no Replace all xsl:include with xsl:import in corpus stylesh…

Old metafiles. These attributes (xsl:include inside the xsl file for each and every corpus file) does not apply anymore.

TODO: Børre to write an explanation in the bug, and Ciprian to close.

Have a look at line 151 in gt/script/langTools/Converter.pm

There is common.xsl and whatnot combined into a new .xsl file

835 cri P3 xsl conv ciprian.gerstenberger@uit.no Corpus conversion xsl files missing from $CORPUSHOME/bin/

This is Cip’s cup of tea.

941 nor P3 xsl conv sjur.moshagen@samediggi.no Combination of $GTFREE/orig/sme/facta/659_1.pdf.xsl and $…

#14
var whole
runtime error: file /home/boerre/gtsvn/gt/script/corpus/common.xsl line 777
element param
xsltApplyXSLTTemplate: A potential infinite template recursion was detected.
You can adjust xsltMaxDepth (--maxdepth) in order to raise the maximum number
of nested template calls and variables/params (currently set to 3000).
Templates:

This is an error with the $GTHOME/gt/script/corpus/common.xsl file, and B has no idea what error.

The following lines are not a result of this:

para_ref after type03
 Reinbeitedistrikt som presses av
selskap som žnsker å bygge ut
vindmžlleparker står i en vanskelig
situasjon. Et frivillig forlik med
utbygger om en erstatning, kan
vŽre å foretrekke fremfor en usikker kamp i rettssystemet. I distrikt
6 i Varanger valgte man den fžrste

Rather, they are a result of decode.pm.

For Børre, the conversion stops when applying common.xsl on the file.

TODO: Cip to have a look.

884 enh P5 Correct! sjur.moshagen@samediggi.no corpus.dtd seems flawed

To repeat:

xmllint whatever-dtd-file-of-your-choice.dtd

… will give you the same error report.

Status quo

Tomi has work which is not checked in. Check in.

File- and problemspecific problems go to Bugzilla.

Forward

New meeting on monday.