Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Corpus meeting 7.4.2011

Present: Berit Merete, Børre, Ciprian, Tomi, Trond

Agenda

Algorithm for dealing with scanning errors

Finding the files

Analyse the main language text morphologically. Then, for each file:

converted/$lang/catalogue/file.xml -- analyse $lang nodes
Count the missing ones -- … | usme | grep '?'
For each file: register the missing/total ratio.
List the files according to the ratio, and pick the worst files

Priority: converted/sme/admin/

Tomi has tried this with one file. Command:

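# Number of lines output by preprocess for the sme text in the file
linecount=$(ccat -l sme "$1" | preprocess | wc -l)
# Number of those lines that lookup marks with '?' (no analysis found)
errors=$(ccat -l sme "$1" | preprocess | /opt/sami/xerox/c-fsm/ix86-linux2.6-gcc3.4/bin/lookup -flags mbTT -utf8 ~/langtech/gt/sme/bin/sme.fst | grep '?' | wc -l)

echo "lines: $linecount / $errors"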

0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lines: 1535 / 87
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lines: 1703 / 53
for i in $ANALYSED_DIR/$SMILANG*.ccat.txt
time cat $i | $PREPROCESS 2> /dev/null | lookup -q -flags mbTT $GTHOME/gt/$SMILANG/bin/$SMILANG.fst |

Outcome of this: A list of files
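
A minimal sketch of the ranking step, assuming the per-file counts have already been collected as lines of the form "file total missing". The input format, the function name rank_by_missing and the file names are made up for illustration and are not part of the existing tools.

```shell
#!/bin/sh
# rank_by_missing: read "file total missing" lines on stdin and print
# "ratio file", sorted with the highest missing/total ratio first.
# Files with a total of 0 are skipped to avoid division by zero.
rank_by_missing() {
    LC_ALL=C awk '$2 > 0 { printf "%.4f %s\n", $3 / $2, $1 }' | LC_ALL=C sort -rn
}

# Hypothetical file names, with the counts from the two test runs above.
printf '%s\n' 'file1.xml 1535 87' 'file2.xml 1703 53' | rank_by_missing
```

With the two test runs above, the run with 87 of 1535 lines missing has the higher ratio and is listed first.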

TODO:

Finding a cure for improving the files

Status quo for boundcorpus and freecorpus

List of files which are not converted:

freecorpus$ grep "Couldn't convert" tmp/*.log | grep admin | cut -f5 -d" "
/home/apache_corpus/freecorpus/orig/sme/admin/sd/other_files/dc_3_99.doc
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/guided-tour.html_id=313861
/home/apache_corpus/freecorpus/orig/sme/admin/depts/regjeringen.no/samisk.html_id=454913
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/sitemap.html_id=256029
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313733
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313744
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313795
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313850
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313851
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313855
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313857
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313865
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313868
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313883
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=426594
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/state-secretary-karl-eirik-schjott-peder.html_id=439605
/home/apache_corpus/freecorpus/orig/nno/admin/depts/regjeringen.no/statsrad-karl-eirik-schjott-pedersen.html_id=439605
/home/apache_corpus/freecorpus/orig/nob/admin/depts/regjeringen.no/statssekretar-karl-eirik-schjott-pederse.html_id=439605

Command for counting the files in the list:

freecorpus$ grep "Couldn't convert" tmp/*.log | grep admin | cut -f5 -d" " | wc -l

The error in the eng files is trivial. The focus is now on the nob-sme pairs under admin.
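
To see how the unconverted files are spread over the languages, the list can be grouped on the language directory. This is only a sketch: it assumes the paths look like the ones listed above, with the language code as the sixth "/"-separated field, and the function name count_by_lang is hypothetical.

```shell
#!/bin/sh
# count_by_lang: read absolute file paths on stdin and print the number of
# paths per language directory, most frequent language first.
# Assumes paths of the form /home/apache_corpus/freecorpus/orig/LANG/...
count_by_lang() {
    cut -d/ -f6 | sort | uniq -c | sort -rn
}

# Example with three of the paths from the list above.
printf '%s\n' \
    '/home/apache_corpus/freecorpus/orig/sme/admin/sd/other_files/dc_3_99.doc' \
    '/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/sitemap.html_id=256029' \
    '/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313733' |
    count_by_lang
```

The same function can be put at the end of the grep pipeline in place of wc -l.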

TODO

Sentence alignment

TCA2 version update

TCA2 used to work. At some point while we were not touching the code, it stopped working because of the Java upgrade to 1.6. Børre tried to fix it this autumn.

TODO:

TCA2 installing for the rest of us

Postponed until the version question has been clarified.

Anchor list

Trond made an sme-anchor.fst

{biegga} ?* |
{biekka} ?* |
{lássa} ?* |
{lása} ?* |
{viidni} ?* |
{viinni} ?* |
{vuitti} |

Run the corpus through the anchor fst, and spot holes. Fill them.

Split the anchor list in two:

  1. a domain-independent one
  2. a domain-dependent one

TODO:

Sentence length parameter

We need the ratio between the sentence lengths in the two languages.
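
One way to get a first estimate of the ratio, assuming length is measured in characters and that the two files are already known to be translations of each other. The function name length_ratio and its file arguments are hypothetical, and the notes do not say which length unit TCA2 expects.

```shell
#!/bin/sh
# length_ratio FILE1 FILE2: print (characters in FILE1) / (characters in FILE2).
# wc -m counts characters according to the current locale, so a UTF-8 locale
# is needed for letters outside ASCII to count as one character each.
length_ratio() {
    a=$(wc -m < "$1")
    b=$(wc -m < "$2")
    LC_ALL=C awk -v a="$a" -v b="$b" 'BEGIN {
        if (b + 0 == 0) exit 1          # empty second file: no ratio
        printf "%.3f\n", a / b
    }'
}
```

Usage, with hypothetical file names: length_ratio text.sme.txt text.nob.txt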

TODO:

Dice coefficient


Wikipedia: The coefficient may be defined as twice the shared information (intersection) over the sum of the sizes of the two sets: 2 |X ∩ Y| / (|X| + |Y|)

Hofland and Johansson:

For English and Norwegian, a value of more than 0.7 or 0.8 gives reasonable results. For other languages, the acceptable value for the coefficient can be less. The cognate parameter is also read by the program.
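
As an illustration, a small sketch that computes the Dice coefficient for two word forms as 2 * shared / (size of first set + size of second set). It assumes the compared units are the sets of character bigrams of each word. The notes do not say which units TCA2 uses, so that choice and the function name dice are assumptions.

```shell
#!/bin/sh
# dice WORD1 WORD2: Dice coefficient over the sets of character bigrams.
# Whether letters such as á count as one character depends on the awk
# implementation and the locale.
dice() {
    awk -v x="$1" -v y="$2" 'BEGIN {
        nx = 0; ny = 0; shared = 0
        # Collect the distinct bigrams of each word.
        for (i = 1; i < length(x); i++) { g = substr(x, i, 2); if (!(g in gx)) { gx[g] = 1; nx++ } }
        for (i = 1; i < length(y); i++) { g = substr(y, i, 2); if (!(g in gy)) { gy[g] = 1; ny++ } }
        # Count the bigrams that occur in both words.
        for (g in gx) if (g in gy) shared++
        if (nx + ny == 0) { print "0.000"; exit }
        printf "%.3f\n", 2 * shared / (nx + ny)
    }'
}

# Made-up example pair, not taken from the corpus.
dice night nacht
```

A pair would count as cognates when the value is above the chosen threshold, for instance the 0.7 or 0.8 mentioned in the quote.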

Question: Is there a parameter to be set here?

TODO:

Preprocessing

Probably an important candidate.

TODO:

Alternatives to TCA2?

Work ahead

Next (short) corpus meeting: