Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


Corpus meeting 11.11.11. :-)

Agenda

Status quo

Improvements in the sme and nob abbr files, and in the anchor file.

Made a style sheet to convert tmx files to html.

Many errors in the tmx files are caused by bad abbr. handling in the nob files.

On time use: half the time goes to adding sentences to the file, the other half to parallelising them.

Throughput on xserve yesterday: 25k bytes per minute.

Possible test: check whether the size of the anchor file affects the speed of tca2.

cd prestable/tmx/smenob/
for file in *.tmx
do
    # Convert each tmx file to html, writing FILE.tmx.html next to it
    xsltproc "$GTHOME/gt/script/corpus/tmx2html.xsl" "$file" > "$file.html"
done

In the following file the texts in the two languages seem to be different:

prestable/tmx/smenob/vuollasa-asahusat.html_id=115192.tmx.html

Filenames and directory structure

Root tmx dir:

prestable/tmx/SOURCELANG2TARGETLANG/

Below this point we follow the directory structure found elsewhere, i.e. GENRE/subdirs/file.tmx. This should give us:

prestable/tmx/nob2sme/admin/depts/regjeringen.no/xxx.tmx
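
The naming scheme above can be sketched as a small shell function. This is only an illustration: tmx_path is a hypothetical helper name, and the .xml source suffix is an assumption based on the converted files listed elsewhere in these notes.

```shell
# tmx_path SRC TGT RELPATH: print the tmx location for a converted corpus file.
# Hypothetical helper, not part of the existing corpus scripts.
tmx_path() {
    src=$1
    tgt=$2
    rel=$3
    # Keep GENRE/subdirs/ unchanged and swap the (assumed) .xml suffix for .tmx.
    printf 'prestable/tmx/%s2%s/%s.tmx\n' "$src" "$tgt" "${rel%.xml}"
}

# Usage with the placeholder file name from the notes:
tmx_path nob sme admin/depts/regjeringen.no/xxx.xml
```

Keeping the language pair in one directory name and reusing the genre path unchanged means a tmx file can be traced back to its source file by string manipulation alone.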

TODO

Black question mark files:

$ grep -lr '�' * | grep -v '\.svn'
nob/admin/depts/other_files/OTP200620070025000SE_12.html.xml
nob/admin/depts/regjeringen.no/7-narmere-om-planbestemmelsene.html_id=571096.xml
sme/admin/depts/other_files/STM_TS007SA.pdf.xml
sme/admin/depts/regjeringen.no/10.html_id=458508.xml
sme/admin/depts/regjeringen.no/2011--rievdadeami-aigi-afghanistanas.html_id=604390.xml
sme/admin/depts/regjeringen.no/7.html_id=458471.xml
sme/admin/depts/regjeringen.no/aigeguovdil.html_id=1150.xml
sme/admin/depts/regjeringen.no/bismagodderait.html_id=449030.xml
sme/admin/depts/regjeringen.no/historihkka.html_id=861.xml
sme/admin/depts/regjeringen.no/horingsbrev.html_id=499754.xml
sme/admin/depts/regjeringen.no/raehus-rahkadahtta-samegiela-doaibmaplan.html_id=514922.xml
sme/admin/depts/regjeringen.no/sami.html_id=615757.xml
smj/admin/depts/other_files/HP_2009_samisk_sprak_lulesam.pdf.xml
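
A possible variant of the search above that does not depend on the terminal encoding: it builds the replacement character (U+FFFD) from its UTF-8 bytes and lets grep skip the .svn directories itself. The function name find_fffd_files is hypothetical, and --exclude-dir is assumed to be available (GNU and BSD grep have it, POSIX grep does not).

```shell
# find_fffd_files [DIR]: list files below DIR (default: .) that contain U+FFFD.
# Hypothetical helper, not part of the existing corpus scripts.
find_fffd_files() {
    # U+FFFD in UTF-8 is the byte sequence EF BF BD (octal 357 277 275).
    pattern=$(printf '\357\277\275')
    # LC_ALL=C makes grep compare raw bytes, whatever the current locale is.
    LC_ALL=C grep -rl --exclude-dir=.svn -e "$pattern" "${1:-.}" | sort
}
```

Run from the corpus root as find_fffd_files . to get a sorted list comparable to the one above.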

Evaluation

Conversion

ŋ is not converted in samediggi-article-3002.html.tmx.html (eŋgelasgiela comes out as e ? gelasgiela):

Son gii biddjo virgái ferte hálddašit davviriikkalaš giela , sámegiela ja e ? gelasgiela .

Personen som blir ansatt må beherske skandinavisk språk , samisk og engelsk .

mnd. (abbreviation for months) is not sentence-final, but the sentence is split after it:

Forøvrig tilsettes arbeidstakere etter gjeldende lover , reglement og overenskomster , herunder lønn og pensjon , samt 6 mnd .
	prøvetid .

A capital-letter initial in a name splits the sentence (boerre: I think the split comes from the full stop after the initial):

Mun doaivvun strategiija maid mii plánet váikkuha ahte bargu ollislaš ja dássásaš bálvalusain sámi álbmoga váste šaddá álkit ja beaktileappot , dadjá várrepresideanta Ragnhild L .
Nystad .

Jeg håper strategien vi legger opp til vil bidra til at arbeidet med å oppnå helhetlige og likeverdige tjenester til det samiske folket vil bli lettere og mer effektivt , sier visepresident Ragnhild L .
	Nystad .
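
Both kinds of false split leave a recognisable trace: a segment that ends in an abbreviation or a single capital letter followed by the tokenised full stop. A rough sketch for finding such segments, assuming a text file with one segment per line; suspicious_splits is a hypothetical helper, and the abbreviation list here contains only mnd as a sample (the real list would come from the abbr files).

```shell
# suspicious_splits FILE: print "LINE:TEXT" for segments that end in a token
# which rarely closes a sentence. Hypothetical helper for illustration only.
suspicious_splits() {
    # Match a single capital letter (an initial such as "L") or the sample
    # abbreviation "mnd", followed by " ." at the end of the line.
    # [[:upper:]] is locale-dependent: non-ASCII initials need a UTF-8 locale.
    # -a forces text mode so that non-ASCII bytes never trigger binary output.
    grep -anE '(^|[[:space:]])([[:upper:]]|mnd) \.$' "$1"
}
```

Every hit is only a candidate: a sentence can legitimately end in a single capital letter, so the output is meant for manual inspection.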

The 1-0 issue

There are two types of 1-0 cases:

  1. The sentence is missing in the other language
  2. The 1-0 status reveals an alignment error (the match is in the neighbour pair)

Trond’s impression: type 1 is by far the most common.

Next meeting

Middle of next week.