Language Technology at UiT

26.09.2013

present:

Sjur
Linda

grammar checker project plan

0 intro

working definition: errors that cannot be resolved by the spellchecker
Excluding real word errors by default

1 done until now:

error type classification
- lexical errors (&lex-majuscule)
- morphosyntactic errors (&msyn-inf_not_actio)
- syntactic errors (&syn-case_congruence)
- real-word errors (&real-vuosttaš)
- correct tags (&corr-not-compound)
additional error types
- punctuation errors
- number formatting errors
- capitalisation errors
- specific syntactic grammar for the grammar checker philosophy: sme-gramdis.rle, rules are marked (REMOVE:GramPo)
- grammarchecker grammar: sme-gramchk.rle
- publication: Constraint Grammar based Correction of Grammatical Errors for North Sámi LREC 2012

2 todo:

practical things:
- move SME (and GC) from old to new infrastructure
- meetings with Francis
maintenance:
- add/change/update semantic/syntactic tags
work on things started:
- Duommá’s 250 word list (compounds that lead to real word errors) - excluding real word errors by default
- rules for valency example sentences collected in gramchkcorpus.txt
errors:
- find out which types of errors are most frequent
- error corpus - size?? other sources??
  - $GTFREE/goldstandard/orig/sme (xserve)
  - main/gt/sme/src/gramchk/gramchkcorpus.txt
possible classes?
presentation:
- sponsor-demonstrations
- release early/often (Open Source principles)
- we cannot make a Microsoft Office grammar checker - prohibited by MS - users can protest by writing to them ;) (we can only deliver to LibreOffice)
- look at a graphic grammarchecker (voikko - Finnish)
- http://wiki.apertium.org/wiki/Spellchecking
rules:
- for real word errors: which semantic tags can be combined? - dálkkádat + rap + poarta
- bigrams and statistics for compounds?
- fix/annotate grammatical errors (compounds) already in preprocessing/tokenization/morphological analysis (i.e. treat space as compound border for relevant POS’s) (other ideas - Eckhard?)
- hfst-proc må truleg oppdaterast for å gje alle analyser av potensielle samansetjingsfeil

Samansetjingsfeil - særskriving:

[N Nom]         [N ...] ===== kasusfeil (Gen not Nom) / sammensettingsfeil
[N Nom/N Gen]   [N ...] =====
[N Gen]         [N ...] =====
[N Nom+VR]      [N ...] ===== med vokalreduksjon (VR) - alltid feil
[N Nom/N Gen+VR][N ...] ===== --"--
[N Gen+VR]      [N ...] ===== --"--

VR = Vokalreduksjon

what is one word?

stavekontroll - space before and after
tokenizer:
- space as a possible sign in a compound (in the case: [[N Nom] [[N …] the error tag can get annotated right away)
- CG needs to clean up - disambiguate
tools to be used:
- dependencies
- valencies
- semantic roles
- semantic prototypes
- evaluation:
  - precision and recall
  - how much has been resolved

Sitemap