sme-sma-mt meeting 12.8.2013
Francis, Lene, Trond.
Agenda
- Evaluation
- Plan, overall principles
- Analysis
- Linguistic transfer issues **Px **Inflected forms **Numerals **Lexical selection **Px **…
- Generation
Evaluation
The abstract and hence the plan:
- Show sme2sma as a pilot, that it is feasible. Choose a narrow domain.
Evaluation procedure
- Send text pairs to sma translators: sme2sma and nob.
(the sme2sma text perhaps enriched with missing words)
- Which is quicker: editing sma MT vs translating from scratch
- Method: giving two texts: one to translate and one to edit Two texts to three translators: one nob-only, one nob + smaMT
- Questions:
- Time the task
- Answer question: How did you like the smaMT text?
- hypothesis: smaMT has a less Norwegian syntax, and this can be seen as an asset (?)
There is a similar study evaluating es2pt, giving pt translators an en original and a es2pt MT text. Here is the paper:
“Using the Apertium Spanish-Brazilian Portuguese machine translation system for localization”. François Masselot, Petra Ribiczey (both Autodesk) and Gema Ramírez-Sánchez (Prompsit) Annual Conference of the European Association for Machine Translation in 2010.
Content:
- 2 articles, each one or two pages
- 3 translators
Plan, overall principles
Content:
- sme: Improve the analysis (syntactic functions…)
- sme-sma texts: pick words, add words
- sme-sma mt-tests: improve the syntax, morphosyntax
- sma: Improve the generation (double forms, …)
- Worst-case-fix: word1/word2 => word1
- sma and sme: add missing words to fst
- CG-rules for lexical selection
- Improve/finish sme/src/smi-syn.rle (the file is temporarily in sme/src/)
Online:
- [https://gtweb.uit.no/mt/]
- Update:
- gtweb:
/opt/mt/README
- gtweb:
Apertium Wiki:
- [http://wiki.apertium.org/wiki/Talk:North_S%C3%A1mi_and_South_S%C3%A1mi]
Deadlines:
- Find texts
- Find translators
- 30.8. Send texts to the translators
- 15.9. Receive evaluation from the translators
- 26.9. Conference
Analysis
sme-dis.rle vs. Old-sme-dis.rle
Some syntactic tags are missing. Linda used syntactic functions in her rules.
Lene will spend a day or two on that.
We do not use dependency.
Evaluate Francis’ tag conversion: Analyse the same sme text with identical morphology, and identical dis, but one with gt tags and one with Fran’s converted apertium tags.
Francis to look into that and report differences.
Linguistic issues
Inflected forms
Two ways of translating positive adjectives in the attributive:
- to adjective (attr -> attr)
- to a noun in the genitive
Here are the cases:
1)
2)
3)
- báikkálaččat vs. byjreskisnie => fst? MT
Numerals
guoktečuođigolbmalogi
guokte#čuođi#golbma#logi+Num+Sg+Nom <= change the "#" to "+"?
guoktečuođigolbmalogi+Num+Sg+Nom
guoktečuođigolbmalogi+Num+Sg+Nom guoktečuođigolbmalogi
guoktečuođigolbmalogi+Num+Sg+Nom guoktečuođigolbmalohki
guokte#čuođi#golbma#logi+Num+Sg+Nom
guokte#čuođi#golbma#logi+Num+Sg+Nom guoktečuođigolbmalogi
guokte#čuođi#golbma#logi+Num+Sg+Nom guoktečuođigolbmalohki
guokte#čuođi#golbma#logi+Num+Sg+Nom
göökte#tjuetie#golme#luhkie+Num+Sg+Nom
^göökte+tjuetie+golme+luhkie<num><sg><nom>$
^göökte$ ^tjuetie$ ^golme$ ^luhkie<num><sg><nom>$
Lexical selection
.dix:
<e><p><l>lávet<s n="n"/></l><r>tsietsehthmuerjie<s n="n"/></r></p></e>
<e><p><l>lávet<s n="v"/><s n="tv"/></l><r>provhkedh<s n="v"/><s n="iv"/></r></p></e>
The default pair is listed in the file:
apertium-sme-sma.sme-sma.lrx:
transfer/bidix
<pron><indef><
<pron><indef><attr>
- Tag differences for the whole paradigm: bidix
- Tag differences for parts of the paradigm: t1x-files
input:
^lávet<n>$ -> ^lávet<n>/aaa<n>/bbb<n>$
^lávet<v>$ -> ^lávet<v>/xxx<v>/yyy<v>$
sed 's/lávet/aaa/g'
sed 's/lávet/yyy/g'
vs.
sed 's/lávet<n>/aaa/g'
sed 's/lávet<v>/yyy/g'
rules:
1. select aaa for lávet ;
2. select yyy for lávet ;
l: á: v: e: t: :select(aaa)
l: á: v: e: t: :select(yyy)
vs.
l: á: v: e: t: <n>: :select(aaa)
l: á: v: e: t: <v>: :select(yyy)
result:
input: ^lávet<v>/xxx<v>/yyy<v>$ ; rules-matched: 1, 2
input: ^lávet<n>/aaa<n>/bbb<n>$ ; rules-matched: 1, 2
which rule is chosen ? 1 or 2 ?
<rule comment="...">
<match lemma="lávet" tags="n.*"><select lemma="tsietsehthmuerjie" tags="n.*"/></match>
</rule>
<rule>
<match lemma="lávet" tags="v.tv.*"><select lemma="provhkedh" tags="v.*"/></match>
</rule>
<rule>
<match lemma="sáhttit" tags="v.*"><select lemma="maehtedh" tags="v.*"/></match>
<match lemma="leat" tags="v.*"/>
</rule>
Compounds
The compound symbol is not the correct one.
Ovttasbargu<n><sgnomcmp><cmp>#šiehtadus<n><sg><nom>
@Ovttasbargu#šiehtadus\<n\>\<sg\>\<nom\><n><sgnomcmp><cmp><n><sg><nom>
We thus want # -> +
Ovttasbargu
fran@eki:~/source/giellatekno-sma/src$ cat tagsets/Makefile.am | grep echo
echo -e "#\t+" >> $@
ovttasbargošiehtadus
ovttasbargošiehtadus ovttasbargu+N+SgNomCmp+Cmp#šiehtadus+N+Sg+Nom
ovttasbargošiehtadus ovttasbargošiehtadus+N+Sg+Nom
echo "Mis lea ovttasbargošiehtadus." | apertium -d . sme-sma
Mijjeste lea @ovttasbargu#šiehtadus\.
^Ovttasbargu<n><sgnomcmp><cmp>/Ektiebarkoe<n><sgnomcmp><cmp>$ ^šiehtadus<n><sg><nom>/latjkoe<n><sg><nom>$
usma:
ektiebarkoe ektiebarkoe+N+Sg+Nom
ektiebarkoelatjkoe ektiebarkoe+N+SgNomCmp+Cmp#latjkoe+N+Sg+Nom
$ hfst-lookup sme-sma.autogen.hfst
Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom>
Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom> Ektiebarkoe<n><sgnomcmp><cmp>+latjkoe<n><sg><nom>+? inf
Px
px on gen/ill
#gyhtjelasse<n><sg><ill><pxsg3>
#tjidtjie<n><sg><gen><pxsg3>
px on relatives/reflexive
This we want.
px on nouns -> poss. pron?
Trond, Lene
Generation
#sïebredahke<n><sg><ela> TYPO => siebriedahke
#ealjoeh<a><der_lakaan><adv>
#Seamma tïjjen<a>
#sïjhtedh jiehtedh<v><tv><ind><prs><sg3>
buaratjåbpoe/buerebe/bööretjåbpoe
#vedrørende<po> NOB
#onne<a><superl><der_lakaan><adv>
#mubpie<a><ord><attr>
#eejtegh<n><sg><nom>
#jïjtje<pron><refl><gen><pxsg3>
#learohke<n><nomag><pl><ela>
A systematic test:
Extract all lemma from bidix and check whether they generate. Make frequencylist of words not generating
MWE
iešguđet ládje Adv Adv joekehtslaakan
…
… corresponds to space in the sme analyser.