Present: Kevin, Sjur
Topic: ambiguous tokenization
```
LEXICON Root
< {skuvla} 0:" " "@P.Pmatch.Backtrack@" {busse} "+N":0 > ENDLEX ;
skuvla+N:skuvla ENDLEX ;
busset+V:busse ENDLEX ;
```
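A hedged sketch of how such a tokeniser would be compiled and run, assuming the entries here were assembled into a complete pmatch script (file names hypothetical):

```
hfst-pmatch2fst < tokenise.pmscript > tokeniser.pmhfst
echo "skuvla busse" | hfst-tokenise --giella-cg tokeniser.pmhfst
```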
The lexicon above should be able to give this analysis:

```
"<skuvla busse>"
	"skuvlabusse" N Err/SpaceCmp
	"busset" V
	"skuvla" N
```
Suggested pmatch filter:
```
define filter_flags(net) net .o. [?* flag:0 ?*]* ;
### Would this happen online or during compilation?
### During compilation of the pmatch rules
### (Leads to overgeneration, but can that be limited mechanically?)
### (Probably scratch this idea, I forgot about the overgeneration problem)
### The tag LexCmp indicates a lexicalized compound
define lexicalized_compounds Lexicon .o. ?* LexCmp ?* ;
define allowed_prefixes Lexicon .o. ?* PrefixForms ?* ;
define multitoken_surfaces [filter_flags(lexicalized_compounds.i)
    .o. [[ allowed_prefixes 0:" " ]+ Lexicon ]].l ;
define multitokens multitoken_surfaces .o. [ Lexicon " " ]+ ;
define lexicalized_compounds_with_erroneous_spaces
    multitoken_surfaces .o. [?* " ":0 ?*] .o. lexicalized_compounds ;
```
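One gap in the sketch above: flag is never defined. It would have to be a symbol class enumerating whatever flag diacritics the lexicon actually uses, something like (hypothetical list):

```
define flag [ "@P.Pmatch.Backtrack@" | "@P.Pmatch.Loc@" ] ;
```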
Comments from Kevin in the code:
```
### Issues: flag diacritics probably both break the morphosemantics and
### cause huge memory consumption
### Idea: if lexicalized_compounds could be made flag-free, that might suffice?
### What about the flags in the forms ambiguous with lex.cmps? We don't know
### which forms are ambiguous with lex.cmps; that's why we intersect.
### Below: that still doesn't work
### Krister: have recent developments in restricting the compound correction
### perhaps made this possible, and should we try again?
## Tried heavily restricting, even to just simple nouns; still too much memory
### Another idea: since we really want to start with surface forms, could we
### just output a text file with a list of the lexicalized compound surface forms?
## I.e. analyse a bunch of forms and then … script the lexc to add a tag to
## lemmas that are ambiguous?
### Maybe use e.g. ospell to generate them
### It's suggested that I (Sam) try to do this for omorfi, where there are no
### flag diacritics, just to validate the idea
### I did originally implement this as form-intersection, which worked just fine
### where there were no flags :) but unfortunately anything interesting in sme has flags …
### I don't see how this is less hacky than online backtracking :/
### Neither do I
### So what about the RC mentioned in the email thread?
### That's mainly for avoiding doing multitoken analysis even when there's no
### possibility of a misspelled compound
### But thinking again about the rules above ... isn't it possible to filter
### out flags from the surfaces?
### Our meeting is running out of time... but I'll think some more about this
### possibility and write it up in an email
```
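The text-file idea above could be as simple as projecting the lexicalized compounds to their surface side and listing the strings, assuming the compound set is finite or capped first (a sketch with hypothetical file names, using standard HFST tools):

```
# keep only the surface (output) side, then list all strings
hfst-project --project=output lexicalized_compounds.hfst |
    hfst-fst2strings > compound_surfaces.txt
```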
And further:
```
### Trying to remove flag diacritics with foma in order to intersect on forms only:
### runs out of RAM. Trying to grep them out manually: runs out of RAM during
### minimize or composition. Even grepping them out of only the parts of the
### lexicon we need runs out of RAM during later steps of compilation.
```
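For concreteness, the flag-removal attempt described here would be something like the following foma session (a sketch; eliminating flags compiles each flag constraint into plain transitions, which is exactly where the blow-up happens):

```
foma[0]: load stack lexicon.fst
foma[1]: eliminate flags
foma[1]: lower-side net
foma[1]: save stack surface-forms-noflags.fst
```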
```
define TOP
    LC(Boundary)
    [ Lexicon |
      lexicalized_compounds_with_erroneous_spaces |
      RC(lexicalized_compounds_with_erroneous_spaces) multitokens
    ] RC(Boundary) ;
```
> [ RC(verb) noun] meaning that the current match is first tested to be in the input
> set of "verb" and then processed as "noun". So you could test that you
> have an ambiguity and then trigger that sort of tokenization
```
### So something like:
define word Lexicon RC([blank|#]) LC([blank|#]) ;
### All analyses that have the SpaceCmp tag:
define spacecmp Lexicon .o. ?* Err/SpaceCmp ?* ;
### If something had the SpaceCmp tag, try reanalysing that string as if it
### were two tokens with a space in the middle:
define token [ RC(spacecmp) word " " word ] ;
```
Thoughts on a solution:
```
skuvlabusse:skuvla#busse CONTLEX ;
.o.
%# (->) " " ; ! And also add +Err/SpaceCmp
```
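A minimal regex version of that idea, with a toy one-entry lexicon and an explicit %# boundary (names and symbols hypothetical):

```
! toy compound entry: analysis skuvlabusse+N, surface with a %# boundary
define CompoundLex [ {skuvlabusse} "+N" ] .x. [ {skuvla} %# {busse} ] ;
! realise the boundary either as a space or as nothing
define BoundaryToSpace [ %# (->) " " ] .o. [ %# -> 0 ] ;
define WithSpaceErr CompoundLex .o. BoundaryToSpace ;
```

Applying WithSpaceErr downwards to skuvlabusse+N gives both skuvlabusse and skuvla busse; adding +Err/SpaceCmp only to the space variant would still need a parallel rule on the analysis side, which this sketch leaves open.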
Problem statement: how can we get the analyses

```
skuvla+N
busset+V
```

in addition to

```
skuvla busse+N
```

without explicitly having

```
< {skuvla} "+N":0 "@P.Pmatch.Loc@" busset:busse "+V":0 >
```

in lexc (far too many combinations to handle manually), and without running intersection on forms (which doesn't work because of flags vs. RAM), i.e. so that in lexc we only say "from here on we want online backtracking".
We have no need to say specifically "there, that is the backtracking point", only that "this analysis string requires that we also include reanalyses as multiple tokens", where the reanalysis uses the usual tokenization boundaries (whitespace in this case).
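As a sketch, the wished-for lexc could then be as simple as this, where @RETOKENISE@ is a purely hypothetical marker standing in for whatever mechanism would trigger the reanalysis:

```
LEXICON Root
skuvla+N:skuvla ENDLEX ;
busset+V:busse ENDLEX ;
! hypothetical: mark this analysis as one that must also be offered as a
! multi-token reanalysis, split at the normal token boundaries
skuvlabusse+N@RETOKENISE@:skuvlabusse ENDLEX ;
```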
Abbreviations require the same treatment as compounds unless we want to manually specify paths into PUNCT for all forms ambiguous with punctuated abbreviations:
Abbreviations vs. other POS + full stop: su. (sunnuntai, or su + .)

We already have a solution for numerals (ordinals vs. cardinals + full stop, as in: 1000.), since numbers have fairly unambiguous forms :)
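Schematically, the numeral solution just spells out both paths in the lexicon (a sketch, not the actual GiellaLT source; real digit strings are of course handled by patterns rather than by listing):

```
LEXICON Numerals
1000+Num+Ord:1000. ENDLEX ; ! ordinal: the full stop is part of the token
1000+Num+Card:1000 ENDLEX ; ! cardinal: the full stop is then a separate CLB token
```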
Quotes from the text chat during the meeting:
June 14, 2016
14:06 Kevin: su. is covered by pmatch_input_mark, after all
14:06 Sjur: but now we're trying to find a solution that doesn't use that mark - or are we?
14:06 Kevin: no, that's something different
14:06 Sjur: and I can't get su. to work
14:07 Sjur: ok
14:07 Kevin: we can look at it, but it shouldn't need backtracking in any case
14:07 Sjur: why not? we want su. = ABBR + su. = Pron & CLB
14:08 Sjur: so we need backtracking, as far as I can understand
14:09 Kevin: Oh, if we're to do it without specifying it fully in the lexicon, then OK. I saw it as equivalent to «3.», where it's very easy to give both in full in the lexicon
14:10 Sjur: ok - I was imagining a general solution where we don't build the lexicon for the tokenization
14:11 Sjur: but if it's easy we can of course do it there :)
14:13 Sjur: «and without running intersection on forms» - do you mean composition?