The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
Present: Berit Merete, Børre, Ciprian, Tomi, Trond
Analyse the main language text morphologically. Then, for each file:
converted/$lang/catalogue/file.xml -- analyse $lang nodes
Count the missing ones -- … | usme | grep '?'
For each file: register the missing/total ratio.
Sort the files by that ratio, and pick the worst files.
Priority: converted/sme/admin/
Tomi has tried this with one file. Command:
# Total number of lines after preprocessing
linecount=$(ccat -l sme "$1" | preprocess | wc -l)
# Lines the analyser could not analyse (lookup marks them with '?')
errors=$(ccat -l sme "$1" | preprocess | \
  /opt/sami/xerox/c-fsm/ix86-linux2.6-gcc3.4/bin/lookup -flags mbTT -utf8 ~/langtech/gt/sme/bin/sme.fst | \
  grep '?' | wc -l)
echo "lines: $linecount / $errors"
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lines: 1535 / 87
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lines: 1703 / 53
for i in $ANALYSED_DIR/$SMILANG*.ccat.txt
time cat $i | $PREPROCESS 2> /dev/null | lookup -q -flags mbTT $GTHOME/gt/$SMILANG/bin/$SMILANG.fst |
Outcome of this: A list of files
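The ranking step (register the missing/total ratio per file, then list the worst files first) could be sketched roughly as below. This is only an illustration: the function name rank_by_ratio, the input format "FILE TOTAL MISSING" and the file names a.xml and b.xml are assumptions made for the example, and the counts are the two results quoted above.

```shell
# rank_by_ratio (hypothetical helper): read "FILE TOTAL MISSING" lines on
# stdin and print "RATIO FILE", highest missing/total ratio first.
# Lines with a total of 0 are skipped to avoid dividing by zero.
rank_by_ratio() {
    LC_ALL=C awk '$2 > 0 { printf "%.4f %s\n", $3 / $2, $1 }' | LC_ALL=C sort -rn
}

# Example with the two counts from the test runs (file names are made up):
printf '%s\n' 'a.xml 1535 87' 'b.xml 1703 53' | rank_by_ratio
```

The per-file counts would come from the linecount/errors commands; the top of the sorted output then gives the files to look at first.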
TODO:
List of files which are not converted:
freecorpus$ grep "Couldn't convert" tmp/*.log | grep admin | cut -f5 -d" "
/home/apache_corpus/freecorpus/orig/sme/admin/sd/other_files/dc_3_99.doc
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/guided-tour.html_id=313861
/home/apache_corpus/freecorpus/orig/sme/admin/depts/regjeringen.no/samisk.html_id=454913
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/sitemap.html_id=256029
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313733
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313744
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313795
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313850
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313851
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313855
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313857
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313865
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313868
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=313883
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/site-map.html_id=426594
/home/apache_corpus/freecorpus/orig/eng/admin/depts/regjeringen.no/state-secretary-karl-eirik-schjott-peder.html_id=439605
/home/apache_corpus/freecorpus/orig/nno/admin/depts/regjeringen.no/statsrad-karl-eirik-schjott-pedersen.html_id=439605
/home/apache_corpus/freecorpus/orig/nob/admin/depts/regjeringen.no/statssekretar-karl-eirik-schjott-pederse.html_id=439605
Command for counting the files in the list:
freecorpus $ grep "Couldn't convert" tmp/*.log | grep admin | cut -f5 -d" " | wc -l
The error in the eng files is trivial. Focus now is on the nob-sme pairs under admin.
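To see how the unconverted files are spread over the languages, the list can be grouped by language code. A minimal sketch, assuming the language is always the path component directly after "orig" (as in the paths listed above); the function name count_by_lang is made up for the example.

```shell
# count_by_lang (hypothetical helper): read corpus paths on stdin and print
# "COUNT LANG", most frequent language first. The language is taken to be
# the path component that follows "orig".
count_by_lang() {
    LC_ALL=C awk -F/ '{
        for (i = 1; i < NF; i++)
            if ($i == "orig") { n[$(i + 1)]++; break }
    }
    END { for (lang in n) print n[lang], lang }' | LC_ALL=C sort -k1,1nr -k2,2
}

# Intended use, fed by the command from the notes:
#   grep "Couldn't convert" tmp/*.log | grep admin | cut -f5 -d" " | count_by_lang
```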
TODO
TCA2 used to work. At some point while nobody was touching the code, it stopped working because of the Java upgrade to 1.6. Børre tried to fix it this autumn.
TODO:
Postponed until the version question has been clarified.
Trond made an sme-anchor.fst
{biegga} ?* |
{biekka} ?* |
{lássa} ?* |
{lása} ?* |
{viidni} ?* |
{viinni} ?* |
{vuitti} |
Run the corpus through the anchor fst, and spot holes. Fill them.
Split the anchor list in two:
TODO:
We need the ratio.
TODO:
Explanation here Implementations here
Wikipedia: The coefficient may be defined as twice the shared information (intersection) over the combined set (union)
Hofland and Johansson:
For English and Norwegian, a value of more than 0.7 or 0.8 gives reasonable results. For other languages, the acceptable value for the coefficient can be less. The cognate parameter is also read by the program.
Question: Is there a parameter to be set here?
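For reference, the Dice coefficient of two sets X and Y is 2|X ∩ Y| / (|X| + |Y|), i.e. the denominator is the sum of the two set sizes. The sketch below computes it over character bigrams. That choice of unit is an assumption, since these notes do not say what the aligner actually compares, and the function name dice is made up. Because it runs awk in the C locale, non-ASCII letters such as á are counted as several bytes, so scores for Sámi words would be approximate.

```shell
# dice WORD1 WORD2 (hypothetical helper): Dice coefficient over the sets of
# character bigrams of the two words, printed with two decimals.
dice() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN {
        # Collect the distinct bigrams of each word
        for (i = 1; i < length(a); i++) A[substr(a, i, 2)] = 1
        for (i = 1; i < length(b); i++) B[substr(b, i, 2)] = 1
        # Count set sizes and the bigrams the two words share
        for (k in A) { na++; if (k in B) shared++ }
        for (k in B) nb++
        if (na + nb == 0) { print "0.00"; exit }
        printf "%.2f\n", 2 * shared / (na + nb)
    }'
}

dice parliament parlament   # prints 0.82
```

With the 0.7 to 0.8 threshold quoted above, a pair like parliament/parlament would count as cognates, while night/nacht (0.25) would not.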
TODO:
Probably an important candidate.
TODO: