The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
Plans for the summer and autumn:
Funding: the Divvun fund for extra initiatives
Timeline: ready for the public release of iOS 8 (rumoured for September). We are aiming for a beta on 15 September, and a finished version as soon as possible after that.
Design goals:
Possible future variants:
Based on:
Possible issues with hfst-pmatch:
Tommi: You cannot get a tokenise-as-you-analyse setup with ambiguous readings in the middle of the string from pmatch; if “in order to” is the left-to-right longest match (lrlm), there won’t be “in” “order” “to” when using the pmatch applicator.
Sjur: Can this be changed in the pmatch code to collect all paths up until a common tokenisation point?
Tommi: Wouldn’t it in the end be just as much work as rewriting from scratch, and probably harder? Like, using pmatch for this with these specs is like having a hammer and trying very hard to use it on screws because they kind of look like nails.
See [http://www.stanford.edu/~laurik/publications/pmatch] for details on how to use (hfst-)pmatch.
Mike to try out hfst-pmatch for a month, then we evaluate the feasibility of hfst-pmatch as an analysing tokeniser.
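The difference discussed above, between committing to the left-to-right longest match and collecting every path up to a common tokenisation point, can be sketched in a few lines of Python. This is only an illustration over a toy lexicon, not how hfst-pmatch works internally; the names `LEXICON`, `tokenisations` and `longest_match` are hypothetical and the readings are made up for the example.

```python
from typing import Dict, List

# Hypothetical toy lexicon: surface form -> readings (lemma plus tags).
LEXICON: Dict[str, List[str]] = {
    "in order to": ['"in order to" pr'],
    "in": ['"in" pr'],
    "order": ['"order" vblex pres', '"order" n sg'],
    "to": ['"to" pr'],
}


def tokenisations(words: List[str]) -> List[List[str]]:
    """Return every way to split `words` into tokens found in LEXICON."""
    if not words:
        return [[]]
    results: List[List[str]] = []
    # Try longer candidates first, so the longest-match path is listed first.
    for end in range(len(words), 0, -1):
        token = " ".join(words[:end])
        if token in LEXICON:
            for rest in tokenisations(words[end:]):
                results.append([token] + rest)
    return results


def longest_match(words: List[str]) -> List[str]:
    """Left-to-right longest match: commit to the longest token at each step."""
    tokens: List[str] = []
    i = 0
    while i < len(words):
        for end in range(len(words), i, -1):
            token = " ".join(words[i:end])
            if token in LEXICON:
                tokens.append(token)
                i = end
                break
        else:
            # Unknown word: pass it through as its own token.
            tokens.append(words[i])
            i += 1
    return tokens


if __name__ == "__main__":
    words = "in order to".split()
    print(longest_match(words))
    for path in tokenisations(words):
        print(path)
```

With this lexicon, `longest_match` keeps only the multiword token, while `tokenisations` also returns the path through the three single words, which is the ambiguity an analysing tokeniser would need to pass on to later disambiguation.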
^the/the$ ^cat's/cat+'s/cat's$ ^mother/mother$^,/,$ ^in order to/in+order+to/in order to$
Two possible tokenisations:

"<in order to>"
	"in order to" pr

"<in>"
	"in" pr
"<order>"
	"order" vblex pres
	"order" n sg
"<to>"
	"to" pr
Re Unicode regexes: “You can match a single character belonging to the “letter” category with \p{L}. You can match a single character not belonging to that category with \P{L}.” See [http://www.regular-expressions.info/unicode.html] for details.
Which tools support Unicode regexes? PCRE? Yes, I believe so. Any decent and recent programming language with proper ICU-based Unicode support :)
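As a concrete data point on tool support: Python's standard `re` module does not accept `\p{L}`, although the third-party `regex` package does. A minimal sketch of the same "letter category" test using only the standard library follows; the helper names `is_letter` and `letter_runs` are hypothetical, and the sketch treats only general categories starting with `L` as letters, so combining marks in decomposed text are not included.

```python
import unicodedata
from typing import List


def is_letter(ch: str) -> bool:
    """True if `ch` has a Unicode general category Lu, Ll, Lt, Lm or Lo."""
    return unicodedata.category(ch).startswith("L")


def letter_runs(text: str) -> List[str]:
    """Split `text` into maximal runs of letters, roughly like \\p{L}+."""
    runs: List[str] = []
    current: List[str] = []
    for ch in text:
        if is_letter(ch):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


if __name__ == "__main__":
    # Precomposed North Sámi letters count as letters; digits and punctuation do not.
    print(letter_runs("\u010d\u00e1hci, 2 guolli!"))
```

For text that may arrive in decomposed form, normalising to NFC with `unicodedata.normalize` before splitting would be one way to keep base letters and their accents together.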