The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
In prestable:
Here we count North Sámi (sme) words.
Prestable total is 1 612 856.
Catalogue | prestable | conv. locally@Trond | conv. apache@vic |
---|---|---|---|
admin/sd/samediggi.no/ | 197 676 | xxx | 233 667 |
admin/sd/other_files/ | 335 543 | 1 925 954 | 1 935 348 |
admin/depts/other_files | 571 250 | 1 573 658 | 1 311 716 |
admin/depts/regjeringen.no | 218 730 | 1 592 557 | 1 613 487 |
prestable/converted/sme/admin/others/laws | 13 521 | 631 361 | 649 426 |
Total | 1 336 720 | xxx | 5 743 644 |
This is enough to start doing sentence alignment, but it also shows that there is quite a lot that is NOT yet good enough for (automatic) inclusion in prestable.
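The totals in the table above can be re-checked with a few lines of Python. This is only a sketch: the counts are copied by hand from the table, and the names `ROWS`, `parse_count` and `column_total` are made up for the illustration. A column that contains an `xxx` cell has no total, as in the table.

```python
# Word counts copied from the table above, in column order:
# prestable, conv. locally@Trond, conv. apache@vic.
# "xxx" marks a cell with no value in the notes.
ROWS = {
    "admin/sd/samediggi.no/": ("197 676", "xxx", "233 667"),
    "admin/sd/other_files/": ("335 543", "1 925 954", "1 935 348"),
    "admin/depts/other_files": ("571 250", "1 573 658", "1 311 716"),
    "admin/depts/regjeringen.no": ("218 730", "1 592 557", "1 613 487"),
    "prestable/converted/sme/admin/others/laws": ("13 521", "631 361", "649 426"),
}


def parse_count(cell):
    """Return the integer in a cell like '1 925 954', or None for 'xxx'."""
    cell = cell.strip()
    if cell == "xxx":
        return None
    return int(cell.replace(" ", ""))


def column_total(rows, index):
    """Sum one column; return None if any cell in it is missing."""
    values = [parse_count(cells[index]) for cells in rows.values()]
    if any(value is None for value in values):
        return None
    return sum(values)


if __name__ == "__main__":
    for index, name in enumerate(("prestable", "locally@Trond", "apache@vic")):
        print(name, column_total(ROWS, index))
```

Both complete columns agree with the Total row: 1 336 720 for prestable and 5 743 644 for conv. apache@vic.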
Law bug still open.
For sme, ligatures account for 400 errors. Others?
Of the 1.6 million words, approximately one third come from PDF files.
All words that are split (hyphenated across lines) in the PDFs are still lost.
Sjur and Trond to look at all-caps xfst script.
Do not forget this conversion error (missing đ): Vuo oskuvlla ja joatkkaoahpahusa oktasaš
~/freecorpus$ ccat prestable/converted/sme/laws/other_files/finnmarkkulahka_lov_web.pdf.xml \
| preprocess | usme | grep '?' | l
Bántideapmi Bántideapmi +?
aktivan aktivan +?
váfistit váfistit +?
finnmárkkulága finnmárkkulága +?
mear mear +? <=================================== hyph error
rida rida +?
fidnet fidnet +?
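The same filtering can be done in Python when the unknown words need further processing, for example counting them. This is a minimal sketch that assumes the analyser output has the shape shown above (input word, echoed word, then `+?` for an unknown word). The function name `unknown_words`, the `SAMPLE` data and the tag on the recognised word are hypothetical.

```python
def unknown_words(lines):
    """Yield the input word of every line whose analysis field is '+?'.

    Assumes whitespace-separated fields as in the listing above:
    input word, echoed word, analysis. Anything after the third
    field (such as a hand-written remark) is ignored.
    """
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "+?":
            yield fields[0]


# Three lines taken from the listing above, plus one made-up line
# for a recognised word (the tag "+CC" is only an illustration).
SAMPLE = [
    "Bántideapmi Bántideapmi +?",
    "ja ja+CC",
    "aktivan aktivan +?",
    "mear mear +? <=== hyph error",
]

if __name__ == "__main__":
    for word in unknown_words(SAMPLE):
        print(word)
```

Splitting on whitespace rather than matching a literal `?` avoids picking up lines where the question mark is part of the text itself.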
Børre has started parallelising the entire prestable parallel corpus. TCA2 was run with default sentences and with the anchor file in GTHOME/common/.
cd GTLANG/st/nob/src
make abbr
and that's it.
cat nobtext | preprocess --abbr=st/nob/bin/abbr.txt
/home/boerre/freecorpus/tmp
Friday at 10.