Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Keyboards and preprocessing

Plans for the summer and autumn:

keyboards for iOS 8 and Android (Lavangen and India? Open call)

Funding: the Divvun fund for extra initiatives

Schedule: ready by the public launch of iOS 8 (rumoured for September); we are aiming for a beta on 15 September, with the final version as soon as possible after that

Design goals:

Possible future variants:

preprocessing

Based on:

Possible issues with hfst-pmatch:

Tommi: With pmatch you cannot get a tokeniser that analyses as it goes and keeps ambiguous readings in the middle of the string; if “in order to” is the left-to-right longest match (lrlm), the pmatch applicator will not also give you “in” “order” “to”.

Sjur: Can this be changed in the pmatch code to collect all paths up until a common tokenisation point?

Tommi: Wouldn’t it in the end be just as much work as rewriting from scratch, and probably harder? Using pmatch for this with these specs is like having a hammer and trying very hard to use it on screws because they kind of look like nails.

See [http://www.stanford.edu/~laurik/publications/pmatch] for details on how to use (hfst-)pmatch.
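To make Tommi's point concrete, here is a minimal Python sketch of a left-to-right longest-match (lrlm) tokeniser. It is not hfst-pmatch and does not use any HFST API; the `FORMS` set and the `lrlm_tokenise` function are hypothetical names invented for this illustration. The sketch only shows why a greedy longest match commits to one segmentation and never returns the shorter alternatives.

```python
# Hypothetical toy set of known surface forms (an assumption for this sketch,
# not taken from any real analyser).
FORMS = {"in order to", "in", "order", "to", "win"}


def lrlm_tokenise(words, forms=FORMS):
    """Greedy left-to-right longest match over a list of words."""
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in forms:
                tokens.append(candidate)
                i = j
                break
        else:
            # Unknown word: emit it as its own token and move on.
            tokens.append(words[i])
            i += 1
    return tokens


print(lrlm_tokenise("in order to win".split()))
# → ['in order to', 'win']
```

Once “in order to” has matched, the separate tokens “in”, “order” and “to” are never produced, which is the limitation discussed above.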

work for Mike

Mike to try out hfst-pmatch for a month, then we evaluate the feasibility of hfst-pmatch as an analysing tokeniser.

wishlist for tokeniser

Two possible tokenisations:

"<in order to>"
    "in order to" pr

"<in>"
    "in" pr

"<order>"
    "order" vblex pres
    "order" n sg

"<to>"
    "to" pr
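The behaviour on the wishlist, keeping both tokenisations, can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the planned implementation: the `LEXICON` dictionary and the functions `tokenisations` and `to_cg` are hypothetical, and the output only imitates the cohort layout shown above.

```python
# Hypothetical toy lexicon: surface form -> list of (lemma, tags) readings.
LEXICON = {
    "in order to": [("in order to", "pr")],
    "in": [("in", "pr")],
    "order": [("order", "vblex pres"), ("order", "n sg")],
    "to": [("to", "pr")],
}


def tokenisations(words, lexicon=LEXICON):
    """Return every way to split `words` into forms found in the lexicon."""
    if not words:
        return [[]]
    results = []
    # Longest candidate first, so the longest-match path is listed first.
    for end in range(len(words), 0, -1):
        form = " ".join(words[:end])
        if form in lexicon:
            for rest in tokenisations(words[end:], lexicon):
                results.append([form] + rest)
    return results


def to_cg(tokens, lexicon=LEXICON):
    """Render one tokenisation as cohorts in a CG-like layout."""
    lines = []
    for form in tokens:
        lines.append('"<%s>"' % form)
        for lemma, tags in lexicon[form]:
            lines.append('    "%s" %s' % (lemma, tags))
    return "\n".join(lines)


paths = tokenisations("in order to".split())
for path in paths:
    print(to_cg(path))
    print()
```

For the input “in order to” this yields two paths: the single multiword token, and the three separate tokens with both readings of “order”.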

Re Unicode regexes: “You can match a single character belonging to the ‘letter’ category with \p{L}. You can match a single character not belonging to that category with \P{L}.” See [http://www.regular-expressions.info/unicode.html] for details.

Which tools support Unicode regexes? PCRE? Yes, I believe so, as does any decent and recent programming language with proper ICU-based Unicode support :)
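As a small illustration of the “letter” category: Python's built-in `re` module does not accept \p{L} (the third-party `regex` package does), so the sketch below uses the standard `unicodedata` module to get roughly the same effect. The function names `is_letter` and `letter_runs` are hypothetical, and the NFC normalisation step is an added assumption so that precomposed letters such as “á” are treated as single letters.

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    """True if `ch` is in a Unicode general category starting with L (like \\p{L})."""
    return unicodedata.category(ch).startswith("L")


def letter_runs(text):
    """Return the maximal runs of letters in `text`, roughly like \\p{L}+."""
    # Assumption: normalise to NFC so base letter + combining mark becomes one letter.
    text = unicodedata.normalize("NFC", text)
    return ["".join(group) for letter, group in groupby(text, key=is_letter) if letter]


print(letter_runs("Sámegiella lea čáppat!"))
```

Digits, punctuation and whitespace fall outside the L categories, so they act as separators here, which is the \P{L} side of the quote above.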