
Grammar checker meeting 14.6.2016

Present: Kevin, Sjur

Topic: ambiguous tokenisation

LEXICON Root
< {skuvla} 0:" " "@P.Pmatch.Backtrack@" {busse} "+N":0 > ENDLEX;
skuvla+N:skuvla ENDLEX;
busset+V:busse ENDLEX;
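
That is: the first entry pairs the surface «skuvla busse» (the space exists only on the surface side) with the one-token analysis skuvlabusse+N, and the flag diacritic marks the point where backtracking should offer the two-token reanalysis via the two plain entries below it.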

The lexicon above should be able to give this analysis:

"<skuvla busse>"
   "skuvlabusse" N Err/SpaceCmp
   "busset" V
       "skuvla" N

Suggested pmatch filter:

define filter_flags(net) net .o. [?* flag:0 ?*]*;
###  Would this happen online or during compilation?
###  Compilation of the pmatch rules
###  (Leads to overgeneration but can that be limited mechanically?)
###  (Probably scratch this idea, I forgot about the overgeneration problem)
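
For filter_flags to compile, flag has to be bound to the union of the flag-diacritic symbols actually in use. A minimal sketch with the single flag from the toy lexicon above (a real lexicon has many more; illustrative only):

###  Sketch: "flag" as the union of flag diacritics in use; here only the
###  one from the toy lexicon above (illustrative)
define flag [ "@P.Pmatch.Backtrack@" ];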

###  the tag LexCmp indicates a lexicalized compound
define lexicalized_compounds Lexicon .o. ?* LexCmp ?* ;
define allowed_prefixes Lexicon .o. ?* PrefixForms ?* ;
define multitoken_surfaces [filter_flags(lexicalized_compounds.i) .o. [[ allowed_prefixes 0:" " ]+ Lexicon ]].l;
define multitokens multitoken_surfaces .o. [ Lexicon " " ]+ ;
define lexicalized_compounds_with_erroneous_spaces multitoken_surfaces .o. [?* " ":0 ?*] .o. lexicalized_compounds ;
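
To make the intent concrete, a hand trace with the toy forms, assuming a lexicalized entry skuvlabusse+N+LexCmp:skuvlabusse (not in the toy lexicon; purely illustrative, not verified output):

###  Hand trace (illustrative):
###    lexicalized_compounds:      skuvlabusse+N+LexCmp : skuvlabusse
###    multitoken_surfaces:        skuvla busse  (a space inserted at the part boundary)
###    ..._with_erroneous_spaces:  "skuvla busse" mapped back to skuvlabusse+N+LexCmp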

Comments from Kevin in the code:

###  Issues: flag diacritics probably both break the morphosemantics and
###  cause huge memory consumption
###  Idea: if lexicalized_compounds could be made flag-free, that might suffice?
###  What about the flags in the forms ambiguous with lex.cmps?
###  We don't know which forms are ambig. with lex.cmp's, that's why we intersect.
###  Below: that still doesn't work
###  Krister: have recent developments in restricting the compound correction perhaps made this possible and we
###  should try again?
##  Tried heavily restricting, even to just simple nouns, still too much mem
###  Another idea: since we really want to start with surface forms, could we just output a text file with
###  a list of the lexicalized compound surface forms?
##  I.e. analyse a bunch of forms and then … script the lexc to add a tag to lemmas that are ambiguous?
###  Maybe use eg. ospell to generate them
###  It's suggested that I (Sam) try to do this for omorfi, where there are no
###  flag diacritics, just to validate the idea
###  I did originally implement this as form-intersection, which worked just
###  fine where there were no flags :) but unfortunately anything interesting
###  in sme has flags …
###  I don't see how this is less hacky than online backtracking :/
###  Neither do I
###  so what about this RC mentioned in the email thread?
###  That's mainly for avoiding doing multitoken analysis even when there's
###  no possibility of a misspelled compound
###  But thinking again about the rules above ... Isn't it possible to filter
###  out flags from the surfaces?
###  Our meeting is running out of time... but I'll think some more about this possibility and write it up in an email

And further:

###  Trying to remove flag diacritics with foma in order to intersect on forms-only:
###  runs out of ram. Trying to grep them out manually: runs out of ram during
###  minimize or composition. Even grepping out only parts from only the parts of
###  the lexicon we need runs out of ram during later steps of compilation.

###  Top level: a token is a plain lexicon match, a lexicalized compound
###  with an erroneous space, or (when such a reading exists) additionally
###  its multi-token reanalysis
define TOP
LC(Boundary)
[ Lexicon |
lexicalized_compounds_with_erroneous_spaces |
RC(lexicalized_compounds_with_erroneous_spaces) multitokens
] RC(Boundary);

> [ RC(verb) noun] meaning that the current match is first tested to be in the input
> set of "verb" and then processed as "noun". So you could test that you
> have an ambiguity and then trigger that sort of tokenization
###  So something like:
define word Lexicon RC([blank|#]) LC([blank|#]);
###  All analyses that have the SpaceCmp tag:
define spacecmp Lexicon .o. [?* "Err/SpaceCmp" ?*];
###  If something had the SpaceCmp tag, try reanalysing that string as if it were two tokens with a space in the middle:
define token [ RC(spacecmp) word " " word ];
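
Put together, the three defines could feed the top level along these lines (a sketch under the same assumptions as above, i.e. Lexicon and blank are defined elsewhere; untested):

###  Sketch: offer the two-token reanalysis where a SpaceCmp reading exists,
###  and the ordinary one-token match everywhere
define TOP [ token | word ];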

Thoughts on a solution:

skuvlabusse:skuvla%#busse CONTLEX ;
.o.
%# (->) " " ; ! And also add +Err/SpaceCmp
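
The same idea spelled out as xfst-style replace rules (a sketch, assuming Lexicon is the lexc transducer sketched above; untested — and the part lexc plus a lower-side rule alone cannot express is adding +Err/SpaceCmp on the analysis side when the space variant is taken):

! Sketch (untested): the compound boundary %# is realized either as nothing
! (the normal compound) or as a space (the erroneous spelling); the space
! variant would additionally need +Err/SpaceCmp on the analysis side
define BoundaryToZero  %# -> 0 ;
define BoundaryToSpace %# -> " " ;
define Surfaces Lexicon .o. [ BoundaryToZero | BoundaryToSpace ] ;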

Problem statement: How can we get the analyses:

skuvla+N
busset+V

in addition to:

skuvla busse+N

without explicitly having < {skuvla} "+N":0 "@P.Pmatch.Loc@" busset:busse "+V":0 > in lexc (far too many combinations to handle manually), and without running the intersection A ∩ [A " " A] on forms (which doesn't work because of flags vs RAM), i.e. such that in lexc we only say «from here on we want online backtracking».

We have no need to say specifically that “there, that's the backtracking point”, only that “this analysis string requires that we also include reanalyses as multiple tokens”, where the reanalysis has the usual tokenisation boundaries (whitespace in this case).
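
In lexc terms, that wish could be as small as one extra tag on the lexicalized compound, with no information about where to split (the tag name +MultiTok is made up for illustration):

! Hypothetical marking (tag name illustrative): one tag saying "this
! analysis string requires multi-token reanalysis", nothing about where
! the split points are
skuvlabusse+N+MultiTok:skuvlabusse ENDLEX ;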

Abbreviations require the same treatment as compounds unless we want to manually specify paths into PUNCT for all forms ambiguous with punctuated abbreviations:

Abbreviations vs other POS + full stop
su. (sunnuntai or su + .)

We already have a solution for numerals (ordinals vs cardinals + full stop, as in 1000.), since number expressions have fairly unambiguous forms :)
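
For comparison, the numeral case can be sketched roughly like this (illustrative, not the actual rule; %0 and %. because bare 0 is epsilon in xfst regexes and escaping the full stop is safer):

###  Sketch of the numeral tokenisation (illustrative):
define digit [ %0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ] ;
define ordinal digit+ %. ;   ! "1000." as one token (ordinal reading)
define cardinal digit+ ;     ! "1000" as one token; the "." then closes the sentence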

Quotes from the text chat during the meeting:

June 14, 2016
14:06 Kevin: su. is covered by pmatch_input_mark, after all
14:06 Sjur: but now we're trying to find a solution that doesn't use that mark - right?
14:06 Kevin: no, that's something else
14:06 Sjur: and I can't get su. to work
14:07 Sjur: ok
14:07 Kevin: we can look at it, but it shouldn't need backtracking in any case
14:07 Sjur: why not? we want su. = ABBR + su. = Pron & CLB
14:08 Sjur: so we need backtracking, as far as I can tell
14:09 Kevin: Oh, if we're to do it without specifying it fully in the lexicon, OK. I saw it as equivalent to «3.», where it's very easy to give both fully in the lexicon
14:10 Sjur: ok - I had imagined a general solution where we don't build the lexicon for the tokenisation
14:11 Sjur: but if it's simple we can of course do it there :)
14:13 Sjur: «and without running intersection on forms» - do you mean composition?