Corpus meeting 7.4.2011

Present: Berit Merete, Børre, Ciprian, Tomi, Trond

Agenda

Algorithm for dealing with scanning errors
Setningsparallellisering
Ordparallellisering

Algorithm for dealing with scanning errors

Finding the files

Analyse the main language text morphologically. Then, for each file:

converted/$lang/catalogue/file.xml -- analyse $lang nodes
Count the missing ones -- … | usme | grep '\?'
For each file: register the missing/total ratio,
List the files according to the ratio, and pick the worst files

Priority: converted/sme/admin/

Tomi has tried this with one file. Command:

linecount=`ccat -l sme $1 | preprocess | wc -l`
errors=`ccat -l sme $1 | preprocess | /opt/sami/xerox/c-fsm/ix86-linux2.6-gcc3.\
4/bin/lookup -flags mbTT -utf8 ~/langtech/gt/sme/bin/sme.fst | grep '?' | wc -l\`

echo "lines: $linecount / $errors"

0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lines: 1535 / 87
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lines: 1703 / 53

for i in $ANALYSED_DIR/$SMILANG*.ccat.txt
time cat $i | $PREPROCESS 2> /dev/null | lookup -q -flags mbTT $GTHOME/gt/$SMILANG/bin/$SMILANG.fst |

Outcome of this: A list of files

TODO:

File list as described, due this week. Tomi.

Finding a cure for improving the files

What has caused the high error rates?
How can it be fixed?

Status quo for boundcorpus and freecorpus

1700 out of 52000 files in boundcorpus still cause problems for convert2xml.pl
84 files out of 9276 in freecorpus still cause problems for convert2xml.pl. Of these, 18 are in $lang/admin.
Also, look at errors in nob/admin
Look at errors in files with parallel versions.

List of files which are not converted:

freecorpus$ grep "Couldn't convert" tmp/*.log | grep admin | cut -f5 -d" "
/home/apache_corpus/freecorpus/orig/sme/admin/sd/other_files/dc_3_99.doc
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/guided-tour.html_id=313861
/home/apache_corpus/freecorpus/orig/sme/admin/depts/regjeringen.no/samisk.html_id=454913
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/sitemap.html_id=256029
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313733
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313744
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313795
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313850
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313851
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313855
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313857
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313865
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313868
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313883
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=426594
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/state-secretary-karl-eirik-schjott-peder.html_id=439605
/home/apache_corpus/freecorpus/orig/nno/admin/depts/regjeringen.no/statsrad-karl-eirik-schjott-pedersen.html_id=439605
/home/apache_corpus/freecorpus/orig/nob/admin/depts/regjeringen.no/statssekretar-karl-eirik-schjott-pederse.html_id=439605

Command for finding the list:

freecorpus $ grep "Couldn't convert" tmp/*.log | grep admin | cut -f5 -d" " | wc -l

The error in the eng files is trivial. Focus now is on the nob-sme pairs under admin.

TODO

Fix the (very few) remaining nonconverted ones (Børre)

Sentence alignment

TCA2 version update

TCA2 used to work. Some time, during the time we have not touched the code, it stopped working because of Java upgrade to 1.6. Børre tried to fix it this autumn.

oct 5th 2006 – gui-interface does not work for Børre, Trond
sep 29th 2006 – gui-interface works for Børre, Trond

TODO:

Get a newer version (Trond).

TCA2 installing for the rest of us

Postponed to the version question has been clarified.

Anchor list

Trond made an sme-anchor.fst

{biegga} ?* |
{biekka} ?* |
{lássa} ?* |
{lása} ?* |
{viidni} ?* |
{viinni} ?* |
{vuitti} |

Run the corpus through the anchor fst, and spot holes. Fill them.

Split the anchor list in two:

a domain-independent one
a domain-dependent one

TODO:

Trond and Berit Merete to look at this.

Sentence length parameter

We need the ratio.

TODO:

Ciprian to look into it

Dice coefficient

Explanation here Implementations here

Wikipedia: The coefficient may be defined as twice the shared information (intersection) over the combined set (union)

Hofland and Johansson:

For English and Norwegian, a value of more than 0.7 or 0.8 gives reasonable results. For other languages, the acceptable value for the coefficient can be less. The cognate parameter is also read by the program.

Question: Is there a parameter to be set here?

TODO:

Discuss with Bergen (Trond)
Follow-up (all)
- Find the sme:nob Dice coefficient for cognates
- Find the lower length for words to be considered

Preprocessing

Probably an important candidate.

TODO:

Ciprian to look into it.

Alternatives to TCA2?

Maligna
Maca
Others?

Work ahead

All: Read documentation and get a grip of the total picture.
Tomi: Find the bad files and look into them + report (this week)
Trond: Talk to Bergen (this week) + thereafter we install
Berit, Trond: Anchor list
Børre: Conversion
Cip: various counting thing

Next (short) corpus meeting:

Monday afternoon, April 11th.

Language Technology at UiT

Page Content

Corpus meeting 7.4.2011

Agenda

Algorithm for dealing with scanning errors

Finding the files

Finding a cure for improving the files

Status quo for boundcorpus and freecorpus

Sentence alignment

TCA2 version update

TCA2 installing for the rest of us

Anchor list

Sentence length parameter

Dice coefficient

Preprocessing

Alternatives to TCA2?

Work ahead

Next (short) corpus meeting: