The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
To enable tags in a natural language instead of the quite cryptic tags we
normally use (like +N
etc.), you need to do the following:
$GTLANG/src/tagsets/
$GTLANG/src/tagsets/Makefile.am
$GTLANG/src/Makefile.am
The name of the regexes and the name of the localised fst’s are related,
such that tagsets/foo.regex
corresponds to analyser-foo-desc.hfst
etc.
Typically you would want to use the language code of the natural language the
tags are written in.
Since the natural language of the tags will vary a lot depending on the
language of the analyser, the regex is language specific. But to get a starting
point, have a look at langs/sme/
, or use it as a starting point:
cp langs/sme/src/tagsets/nob.regex langs/YOURLANG/src/tagsets/LANGCODE.regex
Replace YOURLANG
and LANGCODE
with the relevant language codes.
Then start to edit the file to get what you want. For North Sami (sme
) we
have tags in Norwegian Bokmål (nob
).
For the build system to properly build the fst’s that are going to change the
tags, you need to tell it that there is a source file and some targets to be
built. This you specify in $GTLANG/src/tagsets/Makefile.am
.
Open this file, and list the source file(s) in the variable GT_TAGSET_SRCS
.
Then list the hfst
and xfst
targets in the appropriate sections for the
variable GT_TAGSETS
.
In the North Sami case, these files are named: nob.regex
, nob.hfst
and
nob.xfst
.
You also need to tell the build system that you have a new set of analysers and
generators you want to build. This is done in the file
$GTLANG/src/Makefile.am
. For North Sami, we want to build e.g. the file
analyser-nob-desc.hfst
. You tell the build system this by adding that name
to the variable GT_ANALYSERS_HFST
. And similar for other files. Have a look
at North Sámi to see how it is done for the rest of the analysers and generators
you may want.
When these three steps are done, you can type make
, and you should soon be
greeted with a new set of analysers and generators!