The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
Agenda:
It has been solved in sme. How?
Now: a special tag (an XML node attribute) is added after the lemma. The same attribute is added to the lexc entry, which ties the transducer to the specific lemma in the dictionary.
Example:
Textword lohkki is ambiguous:
| | Nom | Gen | Norwegian |
| 1. | lohkki | lohki | lokk |
| 2. | lohkki | lohkki | lesar |
We want to generate both paradigms, and bind each to the correct lexical entry.
xml:
1.
<e>
<lemma pos="n">
2.
<e>
<lemma pos="actor">
All actors follow the same pattern. To generate the corresponding paradigms:
The corresponding XML entries look like:
1.
<e usage="ped" src="nj">
<lg>
<l pos="n">lohkki</l>
<lc>lohkit</lc>
</lg>
<mg>
<tg>
<t pos="n">lokk</t>
2.
<e usage="ped" src="nj">
<lg>
<l pos="n" subclass="actor">lohkki</l>
</lg>
<mg>
<tg>
<t pos="m">leser</t>
We introduce the @subclass attribute to denote inflectional subclasses, such as proper nouns or the actor inflection shown above.
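As a rough illustration of how @subclass could bind each paradigm to its entry, here is a minimal Python sketch. It is not the project's actual tooling: the `<r>` wrapper element, the closed-off entries, the `PARADIGMS` table, and the `paradigm_for` helper are all assumptions made for the example. The Nom and Gen forms are the ones from the table above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, closed versions of the two <l> elements shown above,
# wrapped in an assumed <r> root so the fragment is well-formed.
ENTRIES_XML = """
<r>
  <e><lg><l pos="n">lohkki</l></lg></e>
  <e><lg><l pos="n" subclass="actor">lohkki</l></lg></e>
</r>
"""

# Forms copied from the Nom/Gen table above, keyed by (lemma, subclass).
# An entry without @subclass gets None as its subclass.
PARADIGMS = {
    ("lohkki", None): {"Nom": "lohkki", "Gen": "lohki"},
    ("lohkki", "actor"): {"Nom": "lohkki", "Gen": "lohkki"},
}

def paradigm_for(l_element):
    """Pick the paradigm that matches the lemma text and its @subclass."""
    key = (l_element.text, l_element.get("subclass"))
    return PARADIGMS[key]

root = ET.fromstring(ENTRIES_XML.strip())
genitives = [paradigm_for(l)["Gen"] for l in root.iter("l")]
print(genitives)  # ['lohki', 'lohkki']
```

The point of the sketch is only that the lemma string alone is not a sufficient key for lohkki, while lemma plus @subclass is.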
In
<e usage="ped" src="nj">
<lg>
<l pos="n" subclass="m1">rett</l>
<lc>lohkit</lc>
</lg>
<mg>
<tg>
<t pos="n">domstol</t>
2.
<e usage="ped" src="nj">
<lg>
<l pos="n" subclass="m2">rett</l>
</lg>
<mg>
<tg>
<t pos="m">matrett</t>
Here we use the inflection class as the subclass. That should uniquely identify the correct paradigm.
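The uniqueness claim can be checked mechanically. The sketch below is an assumption-laden illustration, not existing project code: the `<r>` wrapper, the closed-off entries, and the helper names `lookup_keys` and `ambiguous_keys` are invented for the example. It collects a (lemma, pos, subclass) key for every `<l>` element and reports any key that occurs more than once.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical closed versions of the two "rett" entries above.
WITH_SUBCLASS = """
<r>
  <e><lg><l pos="n" subclass="m1">rett</l></lg></e>
  <e><lg><l pos="n" subclass="m2">rett</l></lg></e>
</r>
"""

# The same two entries without @subclass, for comparison.
WITHOUT_SUBCLASS = """
<r>
  <e><lg><l pos="n">rett</l></lg></e>
  <e><lg><l pos="n">rett</l></lg></e>
</r>
"""

def lookup_keys(xml_text):
    """Return one (lemma, pos, subclass) key per <l> element."""
    root = ET.fromstring(xml_text.strip())
    return [(l.text, l.get("pos"), l.get("subclass")) for l in root.iter("l")]

def ambiguous_keys(keys):
    """Return the keys shared by more than one entry."""
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]

print(ambiguous_keys(lookup_keys(WITH_SUBCLASS)))     # []
print(ambiguous_keys(lookup_keys(WITHOUT_SUBCLASS)))  # [('rett', 'n', None)]
```

A check of this kind could be run over the whole dictionary to find homonymous lemmas that still lack a distinguishing @subclass.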
Friday 20.3. Then subversion.