Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Meeting on smenob transition

April 8th, 2014

Fran, Lene, Sjur, Trond

LEXICAL SELECTION

New file:

apertium-sme-nob.sme-nob.lex

Old file:

dev/archive/apertium-sme-nob.sme-nob.lex

Now we use SELECT/REMOVE rules instead of SUBSTITUTE rules.

The input looks like:

$ echo "biebmu" | apertium -d . sme-nob-biltrans
^biebmu<n><sg><nom><@HNOUN>/føde<n><m><unc><sg><nom><@HNOUN>/mat<n><m><unc><sg><nom><@HNOUN>$^.<clb>/.<sent><clb>$

And the output:

$ echo "biebmu" | apertium -d . sme-nob-lex
^biebmu<n><sg><nom><@HNOUN>/mat<n><m><unc><sg><nom><@HNOUN>$^.<clb>/.<sent><clb>$

This is much easier to debug, although it means rewriting the old rules and, in particular, writing default rules.

SELECT ("mat"i) IF (0 ("<biebmu>"i)) ;

The source-language lemma (“biebmu”) serves as the WORDFORM of the biltrans output, which is why the rule matches it as "<biebmu>".
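
To make the effect of such a rule concrete, here is a minimal Python sketch of what a SELECT does to a single cohort. It is an illustration only: `Cohort` and `select_lemma` are hypothetical names, the case-insensitive comparison merely approximates the `i` flag, and the real work is done by the CG compiler and runtime, not by anything like this.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    wordform: str   # source-language lemma, e.g. "biebmu"
    readings: list  # list of (target lemma, list of tags) pairs


def select_lemma(cohort, wordform, lemma):
    """Approximate SELECT ("lemma"i) IF (0 ("<wordform>"i)).

    If the word form matches and at least one reading has the lemma,
    keep only those readings; otherwise return the cohort unchanged.
    """
    if cohort.wordform.lower() != wordform.lower():
        return cohort
    kept = [r for r in cohort.readings if r[0].lower() == lemma.lower()]
    if not kept:
        return cohort
    return Cohort(cohort.wordform, kept)


tags = ["n", "m", "unc", "sg", "nom", "@HNOUN"]
biebmu = Cohort("biebmu", [("føde", tags), ("mat", tags)])
result = select_lemma(biebmu, "biebmu", "mat")
print([lemma for lemma, _ in result.readings])  # ['mat']
```

A cohort whose word form does not match is returned untouched, which is why default rules are needed for every ambiguous entry.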

word form = biebmu<@HNOUN> (tags are invisible)

readings:

  1. “føde” n m unc sg nom @HNOUN
  2. “mat” n m unc sg nom @HNOUN

$ echo "biebmu" | apertium -d . sme-nob-biltrans
^biebmu<n><sg><nom><@HNOUN>/føde<n><m><unc><sg><nom><@HNOUN>/mat<n><m><unc><sg><nom><@HNOUN>$^.<clb>/.<sent><clb>$

Output of biltrans in CG-style output:

"<biebmu>" n sg nom @HNOUN
    "føde" n m unc sg nom @HNOUN
    "mat" n m unc sg nom @HNOUN
"<.>" clb
    "." sent clb

Lene will update the lex file from dev/archive into the current file.
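
The CG-style view of the biltrans output shown above can be approximated in a few lines of Python, which may help when eyeballing test cases. This is only a sketch: `biltrans_to_cg` and `split_lemma_tags` are hypothetical helper names, escaped characters in the stream are not handled, and cg-conv remains the proper tool for the conversion.

```python
import re

UNIT = re.compile(r"\^(.*?)\$")   # one ^...$ lexical unit
TAG = re.compile(r"<([^<>]+)>")   # one <tag>


def split_lemma_tags(part):
    """Split 'mat<n><m>' into ('mat', ['n', 'm'])."""
    lemma = part.split("<", 1)[0]
    return lemma, TAG.findall(part)


def biltrans_to_cg(stream):
    """Render biltrans output with the source side as the word form
    and each translation as an indented reading."""
    lines = []
    for unit in UNIT.findall(stream):
        source, *targets = unit.split("/")
        lemma, tags = split_lemma_tags(source)
        lines.append(" ".join([f'"<{lemma}>"'] + tags))
        for target in targets:
            lemma, tags = split_lemma_tags(target)
            lines.append("    " + " ".join([f'"{lemma}"'] + tags))
    return "\n".join(lines)


stream = ("^biebmu<n><sg><nom><@HNOUN>/føde<n><m><unc><sg><nom><@HNOUN>"
          "/mat<n><m><unc><sg><nom><@HNOUN>$^.<clb>/.<sent><clb>$")
print(biltrans_to_cg(stream))
```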

REGRESSIONS

sme-nob   boradišgohten.
    - jeg begynte å spise.
    + #spise.

$ echo "boradišgohten" | apertium -d . sme-nob-biltrans
^boradit<v><tv><der_goahti><ind><prt><sg1><@+FMAINV>/spise<vblex><pers><der_goahti><ind><prt><sg1><@+FMAINV>$^.<clb>/.<sent><clb>$

Original:

^boradit<v><tv>+goahti<v><ind><prt><sg1><@+FMAINV>$

So, how do we change der_goahti to +goahti?

boradišgohten
boradišgohten   boradit+V+TV+Der/goahti+V+Ind+Prt+Sg1

borakeahttá
borakeahttá borrat+V+TV+VAbess

$ echo "boradišgohten" | apertium -d . sme-nob-morph
^boradišgohten/boradit<v><tv><der_goahti><v><ind><prt><sg1>/borrat<v><tv><der_d><v><der_goahti><v><ind><prt><sg1>$^./.<clb>$

$ echo "^boradit<v><tv>+goahti<v><ind><prt><sg1><@+FMAINV>$" | apertium-pretransfer | lt-proc -b sme-nob.autobil.bin
^boradit<v><tv>/spise<vblex><pers>$ ^goahti<v><ind><prt><sg1><@+FMAINV>/begynne<vblex><pret><sg><p1><@+FMAINV>$

$ echo "^boradit<v><tv>+goahti<v><ind><prt><sg1><@+FMAINV>$" | apertium-pretransfer | lt-proc -b sme-nob.autobil.bin   | apertium-transfer -b apertium-sme-nob.sme-nob.t1x sme-nob.t1x.bin
^verb<SV><@+FMAINV><ind><pret><p1><sg><pers>{^begynne<vblex><pret>$}$ ^part<part>{^å<part>$}$ ^verb<SV><inf><pers><NC>{^spise<vblex><inf>$}$

-> begynte å spise
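
The notes leave open where and how der_goahti should become +goahti. Purely as an illustration of the string-level change involved, the Python sketch below rewrites `<der_goahti>` in a morph-style reading. The function name `split_derivation` and the choice to touch only the listed derivations are assumptions; this is not the solution adopted in the pipeline.

```python
import re


def split_derivation(reading, derivations=("goahti",)):
    """Rewrite <der_X> as +X for the listed derivations only.

    Other derivation tags (e.g. <der_d>) are left as they are.
    """
    names = "|".join(re.escape(d) for d in derivations)
    return re.sub(rf"<der_({names})>", r"+\1", reading)


print(split_derivation("boradit<v><tv><der_goahti><v><ind><prt><sg1>"))
# boradit<v><tv>+goahti<v><ind><prt><sg1>
```

Applied to the first reading from sme-nob-morph, this gives the same shape as the "Original" analysis above, minus the syntactic tag.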

Focus words

+Foc/ge +ge<pcle>
+Foc/gen    +gen<pcle>
+Foc/ges    +ges<pcle>
+Foc/gis    +gis<pcle>
+Foc/naj    +naj<pcle>
+Foc/ba +ba<pcle>
+Foc/be +be<pcle>
+Foc/hal    +hal<pcle>
+Foc/han    +han<pcle>
+Foc/bat    +bat<pcle>
+Foc/son    +son<pcle>
+Foc/naj+Qst    <qst>+naj<pcle>
+Qst+Foc/son    <qst>+son<pcle>
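
The table above amounts to a string-for-string tag mapping, sketched here in Python. The pairs are copied from the table; the function name `translate_focus`, the longest-first ordering, and the example analysis string are assumptions made for illustration, and the remaining tags (+V, +IV, ...) are deliberately left untranslated.

```python
# Focus-tag pairs copied from the table (Giella tag -> Apertium tags).
FOCUS_TAGS = {
    "+Foc/ge": "+ge<pcle>",
    "+Foc/gen": "+gen<pcle>",
    "+Foc/ges": "+ges<pcle>",
    "+Foc/gis": "+gis<pcle>",
    "+Foc/naj": "+naj<pcle>",
    "+Foc/ba": "+ba<pcle>",
    "+Foc/be": "+be<pcle>",
    "+Foc/hal": "+hal<pcle>",
    "+Foc/han": "+han<pcle>",
    "+Foc/bat": "+bat<pcle>",
    "+Foc/son": "+son<pcle>",
    "+Foc/naj+Qst": "<qst>+naj<pcle>",
    "+Qst+Foc/son": "<qst>+son<pcle>",
}

# Try longer patterns first so "+Foc/gen" is not cut short by "+Foc/ge",
# and the combined Qst patterns win over the plain ones.
_ORDERED = sorted(FOCUS_TAGS.items(), key=lambda kv: len(kv[0]), reverse=True)


def translate_focus(analysis):
    """Replace focus tags in an analysis string; other tags are untouched."""
    for giella, apertium in _ORDERED:
        analysis = analysis.replace(giella, apertium)
    return analysis


# Hypothetical analysis string for "boađátge".
print(translate_focus("boahtit+V+IV+Ind+Prs+Sg2+Foc/ge"))
# boahtit+V+IV+Ind+Prs+Sg2+ge<pcle>
```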

The + makes them into separate words (for the lexical transfer?)
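
As a rough picture of that splitting, the Python sketch below cuts an analysis at each + that sits outside a tag, so the + inside `<@+FMAINV>` is left alone. `split_on_plus` is a hypothetical name; in the pipeline itself this is the job of apertium-pretransfer, which handles more cases than this sketch does.

```python
def split_on_plus(unit):
    """Split 'a<x>+b<y>' on '+' signs that are outside <...> tags."""
    parts, current, depth = [], [], 0
    for ch in unit:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "+" and depth == 0:
            # A top-level '+' closes the current word.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


print(split_on_plus("boradit<v><tv>+goahti<v><ind><prt><sg1><@+FMAINV>"))
# ['boradit<v><tv>', 'goahti<v><ind><prt><sg1><@+FMAINV>']
```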

  1. translate tags to plus notation
  2. disambiguate
  3. split words
  4. send to biltrans and get nob words for the + “words”

$ echo "^boađátge/boahtit<v><iv><ind><prs><sg2>+ge<pcle>$" | cg-conv -a -l
"<boađátge>"
    "boahtit" v iv ind prs sg2
        "ge" pcle

<spectre> Unhammer, did you make any changes to the CG in sme-nob in order to deal with subreadings ?
<Unhammer> can't remember …

From SMA:

LEXICON FINAL1
 ENDLEX ;
 +Foc/ge+Use/Circ:#ge ENDLEX ; !
 +Foc/gan+Use/Circ:#gan ENDLEX ; !
 +Foc/gih+Use/Circ:#gih ENDLEX ; !
 +Foc/gænnah+Use/Circ:#gænnah ENDLEX ; !

Todo

Sjur@Apertium:

Add words from nobsme/inc/False* to smenob

  1. Split the smenob directory into <e src vs.
  2. Keep in a separate directory, loansrc
  3. Add everything in smenob/src to the bidix
  4. Add everything in smenob/loansrc to the bidix as well, but marked r="LR" in sme-nob.dix

Proper nouns

  1. Add to sme-nob (and sme-fin and sme-swe) from geo/xml_src
  2. smi-propernouns.lexc
  3. commons smi