The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
sme-sma-mt meeting 12.8.2013
Present: Francis, Lene, Trond.
The abstract and hence the plan:
Evaluation procedure
A similar study evaluated es2pt by giving pt translators an en original and an es2pt MT text. The paper:
François Masselot, Petra Ribiczey (both Autodesk) and Gema Ramírez-Sánchez (Prompsit): “Using the Apertium Spanish-Brazilian Portuguese machine translation system for localization”. Annual Conference of the European Association for Machine Translation (EAMT), 2010.
Content:
Online:
/opt/mt/README
Apertium Wiki:
Deadlines:
sme-dis.rle vs. Old-sme-dis.rle
Some syntactic tags are missing. Linda used syntactic functions in her rules.
Lene will spend a day or two on that.
We do not use dependency analysis.
Evaluate Francis’ tag conversion: analyse the same sme text with identical morphology and identical disambiguation (dis), once with gt tags and once with Francis’ converted Apertium tags.
Francis to look into that and report differences.
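A minimal sketch of how the two analyses could be compared. It assumes each run yields one reading per token, and that a gt tag corresponds to the lowercased Apertium tag of the same name; the real conversion table may differ, and the function names are hypothetical.

```python
import re

def normalise(reading):
    """Reduce a reading in gt or Apertium notation to (lemma, [lowercased tags]).

    Assumption: gt '+Tag' maps to Apertium '<tag>' by lowercasing only.
    """
    reading = reading.strip().strip("^$")
    if "<" in reading:
        # Apertium notation: lemma<tag1><tag2>
        lemma = reading.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", reading)
    else:
        # gt notation: lemma+Tag1+Tag2
        lemma, *tags = reading.split("+")
    return lemma, [t.lower() for t in tags]

def report_differences(gt_readings, apertium_readings):
    """Return (token index, gt reading, Apertium reading) for every mismatch."""
    diffs = []
    for i, (gt, ap) in enumerate(zip(gt_readings, apertium_readings)):
        if normalise(gt) != normalise(ap):
            diffs.append((i, gt, ap))
    return diffs

if __name__ == "__main__":
    # Illustrative readings only; not taken from a real run.
    gt = ["lávet+N+Sg+Nom", "leat+V+Ind+Prs+Sg3"]
    ap = ["^lávet<n><sg><nom>$", "^leat<v><ind><prs><pl3>$"]
    for index, left, right in report_differences(gt, ap):
        print(index, left, right)
```

Any token reported here points either at a gap in the tag conversion or at a disambiguation rule that behaves differently on the converted tags.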
Two ways of translating positive adjectives in the attributive:
Here are the cases:
1)
2)
3)
guoktečuođigolbmalogi
guokte#čuođi#golbma#logi+Num+Sg+Nom <= change the "#" to "+"?
guoktečuođigolbmalogi+Num+Sg+Nom
guoktečuođigolbmalogi+Num+Sg+Nom guoktečuođigolbmalogi
guoktečuođigolbmalogi+Num+Sg+Nom guoktečuođigolbmalohki
guokte#čuođi#golbma#logi+Num+Sg+Nom
guokte#čuođi#golbma#logi+Num+Sg+Nom guoktečuođigolbmalogi
guokte#čuođi#golbma#logi+Num+Sg+Nom guoktečuođigolbmalohki
guokte#čuođi#golbma#logi+Num+Sg+Nom
göökte#tjuetie#golme#luhkie+Num+Sg+Nom
^göökte+tjuetie+golme+luhkie<num><sg><nom>$
^göökte$ ^tjuetie$ ^golme$ ^luhkie<num><sg><nom>$
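The last three lines above can be reproduced with a small sketch. It assumes "#" is the compound boundary in the gt analysis, that "+" only separates tags within a part, and that tags are simply lowercased; the function names are hypothetical and this is not the conversion used in the pipeline.

```python
def gt_to_apertium(analysis, joiner="+"):
    """Convert a gt analysis to one Apertium lexical unit.

    Compound parts (split on '#') are joined with `joiner`;
    '+Tag' becomes '<tag>' (assumed mapping: lowercase only).
    """
    parts = []
    for part in analysis.split("#"):
        lemma, *tags = part.split("+")
        parts.append(lemma + "".join(f"<{t.lower()}>" for t in tags))
    return "^" + joiner.join(parts) + "$"

def split_compound(unit, joiner="+"):
    """Split a joined lexical unit into one lexical unit per compound part."""
    return " ".join(f"^{p}$" for p in unit.strip("^$").split(joiner))

if __name__ == "__main__":
    unit = gt_to_apertium("göökte#tjuetie#golme#luhkie+Num+Sg+Nom")
    print(unit)                  # ^göökte+tjuetie+golme+luhkie<num><sg><nom>$
    print(split_compound(unit))  # ^göökte$ ^tjuetie$ ^golme$ ^luhkie<num><sg><nom>$
```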
.dix:
<e><p><l>lávet<s n="n"/></l><r>tsietsehthmuerjie<s n="n"/></r></p></e>
<e><p><l>lávet<s n="v"/><s n="tv"/></l><r>provhkedh<s n="v"/><s n="iv"/></r></p></e>
The default pair is listed in the file:
apertium-sme-sma.sme-sma.lrx:
<pron><indef><
<pron><indef><attr>
input:
^lávet<n>$ -> ^lávet<n>/aaa<n>/bbb<n>$
^lávet<v>$ -> ^lávet<v>/xxx<v>/yyy<v>$
sed 's/lávet/aaa/g'
sed 's/lávet/yyy/g'
vs.
sed 's/lávet<n>/aaa/g'
sed 's/lávet<v>/yyy/g'
rules:
1. select aaa for lávet ;
2. select yyy for lávet ;
l: á: v: e: t: :select(aaa)
l: á: v: e: t: :select(yyy)
vs.
l: á: v: e: t: <n>: :select(aaa)
l: á: v: e: t: <v>: :select(yyy)
result:
input: ^lávet<v>/xxx<v>/yyy<v>$ ; rules-matched: 1, 2
input: ^lávet<n>/aaa<n>/bbb<n>$ ; rules-matched: 1, 2
Which rule is chosen: 1 or 2?
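The ambiguity can be illustrated with a toy model. This is not how the lexical selection module matches rules internally; it only mimics the two variants above, where a rule either ignores the part of speech or requires it. The rule lists and function names are hypothetical.

```python
import re

# (source lemma, required POS or None, target lemma to select)
RULES_WITHOUT_POS = [("lávet", None, "aaa"), ("lávet", None, "yyy")]
RULES_WITH_POS = [("lávet", "n", "aaa"), ("lávet", "v", "yyy")]

def parse_unit(unit):
    """Split '^lávet<v>/xxx<v>/yyy<v>$' into source lemma, POS and target lemmas."""
    source, *targets = unit.strip("^$").split("/")
    lemma, pos = re.match(r"([^<]+)<([^>]+)>", source).groups()
    return lemma, pos, [t.split("<", 1)[0] for t in targets]

def matching_rules(unit, rules):
    """Return the 1-based numbers of all rules whose pattern matches the unit."""
    lemma, pos, _targets = parse_unit(unit)
    return [
        number
        for number, (rule_lemma, rule_pos, _select) in enumerate(rules, 1)
        if rule_lemma == lemma and (rule_pos is None or rule_pos == pos)
    ]

if __name__ == "__main__":
    verb = "^lávet<v>/xxx<v>/yyy<v>$"
    noun = "^lávet<n>/aaa<n>/bbb<n>$"
    print(matching_rules(verb, RULES_WITHOUT_POS))  # [1, 2]
    print(matching_rules(noun, RULES_WITHOUT_POS))  # [1, 2]
    print(matching_rules(verb, RULES_WITH_POS))     # [2]
    print(matching_rules(noun, RULES_WITH_POS))     # [1]
```

With the part of speech in the pattern, each input matches exactly one rule, so the question of which rule wins does not arise.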
<rule comment="...">
  <match lemma="lávet" tags="n.*"><select lemma="tsietsehthmuerjie" tags="n.*"/></match>
</rule>
<rule>
  <match lemma="lávet" tags="v.tv.*"><select lemma="provhkedh" tags="v.*"/></match>
</rule>
<rule>
  <match lemma="sáhttit" tags="v.*"><select lemma="maehtedh" tags="v.*"/></match>
  <match lemma="leat" tags="v.*"/>
</rule>
The compound symbol is not the correct one.
Ovttasbargu<n><sgnomcmp><cmp>#šiehtadus<n><sg><nom>
@Ovttasbargu#šiehtadus\<n\>\<sg\>\<nom\><n><sgnomcmp><cmp><n><sg><nom>
We thus want # -> +
Ovttasbargu
fran@eki:~/source/giellatekno-sma/src$ cat tagsets/Makefile.am | grep echo
echo -e "#\t+" >> $@
ovttasbargošiehtadus
ovttasbargošiehtadus ovttasbargu+N+SgNomCmp+Cmp#šiehtadus+N+Sg+Nom
ovttasbargošiehtadus ovttasbargošiehtadus+N+Sg+Nom
echo "Mis lea ovttasbargošiehtadus." | apertium -d . sme-sma
Mijjeste lea @ovttasbargu#šiehtadus\.
^Ovttasbargu<n><sgnomcmp><cmp>/Ektiebarkoe<n><sgnomcmp><cmp>$ ^šiehtadus<n><sg><nom>/latjkoe<n><sg><nom>$
usma:
ektiebarkoe ektiebarkoe+N+Sg+Nom
ektiebarkoelatjkoe ektiebarkoe+N+SgNomCmp+Cmp#latjkoe+N+Sg+Nom
$ hfst-lookup sme-sma.autogen.hfst
Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom>
Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom> Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom>+? inf
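The "+? inf" line above is how hfst-lookup reports an input it cannot process. A minimal sketch for picking such failures out of a longer lookup log, assuming the usual tab-separated three-column output (input, result, weight); the function name is hypothetical.

```python
def failed_lookups(lookup_output):
    """Return the inputs that hfst-lookup echoed back with '+?' and weight 'inf'."""
    failures = []
    for line in lookup_output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # blank separator lines or anything not in three columns
        query, result, weight = fields
        if result == query + "+?" and weight.strip() == "inf":
            failures.append(query)
    return failures

if __name__ == "__main__":
    query = "Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom>"
    log = f"{query}\t{query}+?\tinf\n\n"
    print(failed_lookups(log))
```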
Possessive suffixes (px) on gen/ill
#gyhtjelasse<n><sg><ill><pxsg3>
#tjidtjie<n><sg><gen><pxsg3>
This we want.
Trond, Lene
#sïebredahke<n><sg><ela> TYPO => siebriedahke
#ealjoeh<a><der_lakaan><adv>
#Seamma tïjjen<a>
#sïjhtedh jiehtedh<v><tv><ind><prs><sg3>
buaratjåbpoe/buerebe/bööretjåbpoe
#vedrørende<po> NOB
#onne<a><superl><der_lakaan><adv>
#mubpie<a><ord><attr>
#eejtegh<n><sg><nom>
#jïjtje<pron><refl><gen><pxsg3>
#learohke<n><nomag><pl><ela>
A systematic test:
Extract all lemmas from the bidix and check whether they generate. Make a frequency list of the words that do not generate.
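A rough sketch of the two bookkeeping steps of such a test. It assumes the bidix uses the `<e><p><l>…</l><r>…</r></p></e>` layout shown earlier and that ungenerated forms appear in the output with a leading "#", as in the list above. Running the generator itself is left out, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

def bidix_lemmas(bidix_xml, side="r"):
    """Collect the lemmas on one side ('l' or 'r') of every bidix entry."""
    root = ET.fromstring(bidix_xml)
    lemmas = set()
    for elem in root.iter(side):
        pieces = [elem.text or ""]
        for child in elem:
            if child.tag == "b":  # <b/> stands for a space inside a multiword lemma
                pieces.append(" ")
            pieces.append(child.tail or "")
        lemma = "".join(pieces).strip()
        if lemma:
            lemmas.add(lemma)
    return lemmas

# A '#'-marked form: '#', the lemma (spaces allowed), then the first tag.
FAILED = re.compile(r"#([^<#^$/]+)<")

def failure_frequencies(output_lines):
    """Count how often each lemma shows up as a '#'-marked generation failure."""
    counts = Counter()
    for line in output_lines:
        counts.update(m.strip() for m in FAILED.findall(line))
    return counts.most_common()

if __name__ == "__main__":
    bidix = (
        '<section>'
        '<e><p><l>lávet<s n="n"/></l><r>tsietsehthmuerjie<s n="n"/></r></p></e>'
        '<e><p><l>lávet<s n="v"/><s n="tv"/></l><r>provhkedh<s n="v"/><s n="iv"/></r></p></e>'
        '</section>'
    )
    print(sorted(bidix_lemmas(bidix)))
    print(failure_frequencies(["#tjidtjie<n><sg><gen><pxsg3>", "#eejtegh<n><sg><nom>"]))
```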
iešguđet ládje Adv Adv joekehtslaakan
…
… corresponds to space in the sme analyser.