Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Corpus meeting 7.11.2011

Conversion status

What is (not) included

In prestable:

Here we count sme words.

Prestable total is 1 612 856.

Catalogue                                  prestable   conv. locally@Trond   conv. apache@vic
admin/sd/samediggi.no/                       197 676                   xxx            233 667
admin/sd/other_files/                        335 543             1 925 954          1 935 348
admin/depts/other_files                      571 250             1 573 658          1 311 716
admin/depts/regjeringen.no                   218 730             1 592 557          1 613 487
prestable/converted/sme/admin/others/laws     13 521               631 361            649 426
Total                                      1 336 720                   xxx          5 743 644

This is enough to start doing sentence alignment, but it also shows that there is quite a lot that is NOT yet good enough for (automatic) inclusion in prestable.
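
The column totals, and the share of converted material that has reached prestable, can be checked with a few lines of Python. This is only a sketch: the figures are copied from the table above (the apache@vic column is used because the local column is incomplete), and the names counts, totals and share_in_prestable are made up for illustration.

```python
# Word counts copied from the table: (prestable, converted on apache@vic).
counts = {
    "admin/sd/samediggi.no/": (197_676, 233_667),
    "admin/sd/other_files/": (335_543, 1_935_348),
    "admin/depts/other_files": (571_250, 1_311_716),
    "admin/depts/regjeringen.no": (218_730, 1_613_487),
    "prestable/converted/sme/admin/others/laws": (13_521, 649_426),
}


def totals(table):
    """Return (prestable total, converted total) over all catalogues."""
    prestable = sum(p for p, _ in table.values())
    converted = sum(c for _, c in table.values())
    return prestable, converted


def share_in_prestable(table):
    """Return, per catalogue, the fraction of converted words that is in prestable."""
    return {name: p / c for name, (p, c) in table.items()}
```

Summing the columns this way reproduces the Total row, and overall roughly 23 % of the words converted on apache@vic are in prestable.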

What does it look like

Document structure

Law bug still open.

Text

For sme, ligatures represent 400 errors. Others?
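
One plausible source of such errors is that pdf extraction leaves typographic ligature characters (for example U+FB01, the fi ligature) in the text, which the analyser then fails to recognise. The following is a minimal sketch of a clean-up step, assuming that this is the kind of ligature error meant here. The function name expand_ligatures is hypothetical and not part of the existing conversion tools.

```python
# Map Unicode Latin ligature code points to their plain letter sequences.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace ligature characters with the letters they stand for."""
    return text.translate(_TABLE)
```

An explicit table is used rather than NFKC normalisation, since NFKC would also rewrite other compatibility characters that should probably stay untouched.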

Of the 1.6 million words, approximately 1/3 come from pdf files.

All words that are divided across lines in pdf files are still lost.
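
A naive way to recover words that were divided at a line break is to rejoin a line-final hyphen with the start of the next line. The sketch below assumes that the lost words come from end-of-line hyphenation in the extracted pdf text. join_hyphenated is a hypothetical helper, not the project's converter, and it will wrongly merge genuine hyphenated compounds that happen to break at a line end.

```python
import re

# A hyphen directly after a word character, followed by a line break
# (with optional spaces or tabs) and another word character.
_HYPHEN_BREAK = re.compile(r"(?<=\w)-[ \t]*\n[ \t]*(?=\w)")


def join_hyphenated(text: str) -> str:
    """Remove end-of-line hyphens and join the two word halves."""
    return _HYPHEN_BREAK.sub("", text)
```

With the input "mear-\nrida" (an illustrative guess at how the two fragments in the law text arose) the function returns "mearrida".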

Sjur and Trond to look at all-caps xfst script.

Do not forget this conversion error (missing đ): Vuo oskuvlla ja joatkkaoahpahusa oktasaš

~/freecorpus$ ccat prestable/converted/sme/laws/other_files/finnmarkkulahka_lov_web.pdf.xml \
  | preprocess | usme | grep '?' | l

Bántideapmi     Bántideapmi     +?
aktivan aktivan +?
váfistit váfistit +?
finnmárkkulága   finnmárkkulága   +?
mear    mear    +?    <=== hyph error
rida    rida    +?
fidnet   fidnet   +?
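
The unknown words can also be collected and counted programmatically. The sketch below assumes that the analyser output has the layout shown above (word form, repeated form and tag as whitespace-separated fields, with +? marking a failed analysis) and that each word is a single token. unknown_words is a hypothetical helper.

```python
from collections import Counter


def unknown_words(lines):
    """Count word forms whose analysis field is '+?' (no analysis found)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Expected layout: form, form, tag; anything after that is ignored.
        if len(fields) >= 3 and fields[2] == "+?":
            counts[fields[0]] += 1
    return counts
```

Fed with the analyser output line by line, for instance from sys.stdin, unknown_words(...).most_common() lists the most frequent unknown forms first, which makes systematic conversion errors easier to spot.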

Next tasks

Sentence alignment

Børre has started sentence alignment of the entire prestable parallel corpus. TCA2 was run with default sentences and with the anchor file in GTHOME/common/.

Targets

cd GTLANG/st/nob/src
make abbr
That is all. The abbreviation file is then passed to preprocess:
cat nobtext | preprocess --abbr=st/nob/bin/abbr.txt

Todo

Milestones

Next meeting

Friday at 10.