Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


We want to extend (some of) the corpus files with markup for spelling and other errors, to use them as gold standards for testing our spellers (and in the future other tools as well). The markup is done manually, and needs to follow certain rules.

ISL - Icelandic markup

Description of the error classification for ISL:

  1. Unclassified errors - {wrong}§{correct}

Errors of an unknown type.

  2. Orthographic errors, non-words - {wrong}${error classification|correct}

Traditional misspellings confined to a single (error) string, that is, errors that can be detected and corrected without analysing the surrounding words. In the resulting xml, the element is named <errorort>. These errors always produce non-words in the text, so a speller should be able to detect them.
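To make the two markup shapes above concrete, here is a minimal Python sketch that pulls `{wrong}§{correct}` and `{wrong}${error classification|correct}` spans out of a line of text. It is an illustration only, not the converter used for the corpus: the names `ErrorMarkup`, `MARKUP_RE` and `find_errors` are hypothetical, the sample strings and the field values `pos` and `sub` are invented placeholders, and the sketch assumes that braces do not nest and that the correction itself contains no `|`.

```python
import re
from typing import List, NamedTuple, Optional


class ErrorMarkup(NamedTuple):
    """One marked-up error (hypothetical container, not part of any real tool)."""
    wrong: str
    marker: str                    # "§" (unclassified) or "$" (orthographic)
    classification: Optional[str]  # text before "|", or None if there is no "|"
    correct: str


# Assumption: braces do not nest inside the error or the correction.
MARKUP_RE = re.compile(r"\{(?P<wrong>[^{}]*)\}(?P<marker>[§$])\{(?P<body>[^{}]*)\}")


def find_errors(text: str) -> List[ErrorMarkup]:
    """Return every {wrong}§{...} or {wrong}${...} span found in text."""
    found = []
    for match in MARKUP_RE.finditer(text):
        # Split on the last "|" so that a body without "|" is all correction.
        classification, sep, correct = match.group("body").rpartition("|")
        found.append(
            ErrorMarkup(
                wrong=match.group("wrong"),
                marker=match.group("marker"),
                classification=classification if sep else None,
                correct=correct,
            )
        )
    return found


if __name__ == "__main__":
    # Invented sample line; "pos" and "sub" stand in for real field values.
    sample = "a {teh}§{the} b {wrnog}${typo,pos,sub|wrong} c"
    for error in find_errors(sample):
        print(error)
```

Keeping the classification as an uninterpreted string here means the sketch makes no claim about which position or subtype values are valid.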

Error types

vow - {error}${vow,position,subtype|correct}

Errors involving an incorrect vowel.

con - {error}${con,position,subtype|correct}

Errors involving an incorrect consonant.

typo - {error}${typo,position,subtype|correct}

Typographical errors: slips of the hand or fingers while typing, as distinct from spelling errors.

cap - {error}${cap,position,subtype|correct}

An error in capitalization.

meta - {error}${meta,position,subtype|correct}

Metathesis (transposition) of letters. Two or more letters can be involved, though three is the most commonly seen.

abp - {error}${abp,subtype|correct}

Incorrect punctuation in an abbreviation.

cmp type 1 - {error}${cmp,subtype|correct}

Errors in compounding words. The first part of the compound is used in the wrong form.

cmp type 2 - {error}${cmp,wordclasses,subtype|correct}

Errors in compounding words. Two or more words are incorrectly written together as one word.

cmp type 3 - {error}${cmp,slash,subtype|correct}

Errors in compounding words. A slash is used to join words that should be written as separate words.
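The field layouts listed under "Error types" can be summarised in a small Python sketch that splits an error classification (the part before `|`) into named fields. This is a sketch under stated assumptions rather than the project's own validator: `FIELD_LAYOUTS` and `split_classification` are hypothetical names, the values `pos` and `sub` in the usage lines are placeholders, and the sketch assumes that `slash` in cmp type 3 is written literally, which is how it tells type 3 apart from type 2.

```python
from typing import Dict

# Field names that follow the error type, as listed in the descriptions above.
FIELD_LAYOUTS = {
    "vow": ("position", "subtype"),
    "con": ("position", "subtype"),
    "typo": ("position", "subtype"),
    "cap": ("position", "subtype"),
    "meta": ("position", "subtype"),
    "abp": ("subtype",),
}


def split_classification(classification: str) -> Dict[str, str]:
    """Split e.g. "vow,pos,sub" into {"type": ..., "position": ..., "subtype": ...}."""
    error_type, *fields = [part.strip() for part in classification.split(",")]

    if error_type == "cmp":
        if len(fields) == 1:
            names = ("subtype",)                # cmp type 1
        elif len(fields) == 2 and fields[0] == "slash":
            names = ("slash", "subtype")        # cmp type 3 (assumes literal "slash")
        elif len(fields) == 2:
            names = ("wordclasses", "subtype")  # cmp type 2
        else:
            raise ValueError(f"unexpected number of fields for cmp: {classification!r}")
    elif error_type in FIELD_LAYOUTS:
        names = FIELD_LAYOUTS[error_type]
        if len(fields) != len(names):
            raise ValueError(f"unexpected number of fields: {classification!r}")
    else:
        raise ValueError(f"unknown error type: {error_type!r}")

    return {"type": error_type, **dict(zip(names, fields))}


if __name__ == "__main__":
    # "pos" and "sub" are placeholders, not real position or subtype values.
    print(split_classification("vow,pos,sub"))
    print(split_classification("abp,sub"))
    print(split_classification("cmp,slash,sub"))
```

Rejecting unknown types and wrong field counts with an exception is a design choice for the sketch; since the markup is written by hand, failing loudly helps surface slips in the markup itself.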
