The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages.
Korp is a corpus tool and Karp is a lexicon tool, both from the Swedish Språkbanken. We want to install them locally.
The Korp code:
Links to the Karp code are forthcoming.
Menu:
Generation of lemgrams from lexc:
Use dict-isme-norm.fst, generator-dict-gt-norm.xfst, or generator-dict-gt-norm.hfst. We remove the variant tags (v1, v2, ...) from the fst, because it is better for the user that all variants of the same paradigm end up in the same lemgram. Many fst lemmas have more than one entry in lexc, so the lemma list should be made unique before generating forms. I suggest that we start with these files:
For nouns, we pick three different lists: the ordinary nouns, the actors (NomAg), and the G3-marked nouns. For the other parts of speech, one command is enough. Commands to filter (ir)relevant forms:
*Ordinary words:
egrep -v "(G3|ACTOR|CmpN/Only|ShCmp|RCmpnd|\+V\+|^\!)"
*Actors (NomAg):
grep N+NomAg
*G3-marked nouns:
grep N+G3
egrep -v "(ENDLEX|\+V|^\!)"
egrep -v "(LEXICON|Der| Rreal | R |^\!)"
egrep -v "(LEXICON| K |^\!)"
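As a rough illustration of how the filtering and deduplication steps could fit together for the ordinary nouns, here is a minimal sketch. The sample entries, the continuation lexicon names, the `+N` check, and the helper name `unique_lemmas` are all invented for the example and are not taken from the actual lexc files; only the exclusion pattern follows the ordinary-words command above (written with `grep -E`, which is equivalent to `egrep`).

```shell
#!/bin/sh
# Hypothetical sample of lexc noun entries (invented for illustration only).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
LEXICON NounRoot
! Invented comment line
guolli+N:guolli GUOLLI ;
guolli+N:guolli GUOLLI ;
beana+v1+N:beana BEANA ;
beana+v2+N:beatnag BEANA ;
oahpaheaddji+N+NomAg:oahpaheaddji ACTOR ;
viessu+N+G3:viessu VIESSU ;
EOF

# unique_lemmas FILE: print one line per ordinary noun lemma.
# (Hypothetical helper, not part of the Giellatekno tool chain.)
unique_lemmas() {
  # Drop G3 nouns, actors, compound-only entries, verbs and comment lines.
  grep -E -v '(G3|ACTOR|CmpN/Only|ShCmp|RCmpnd|\+V\+|^!)' "$1" |
    grep -F '+N' |               # assumption: noun entries carry a +N tag
    cut -d: -f1 |                # keep the upper side (lemma plus tags)
    sed -E 's/\+v[0-9]+//g' |    # remove the variant tags v1, v2, ...
    cut -d+ -f1 |                # keep the bare lemma
    sort -u                      # one line per lemma
}

unique_lemmas "$sample"
```

With this sample, the duplicated entry and the two tagged variants collapse to one line each, so the output is `beana` followed by `guolli`. The resulting list could then be fed to the generator fst to produce the forms for each lemgram.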