The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
Merge gt/, kt/ and st/ into one dir named gtlangs/
Move script/ out of any language dirs
Omorfi, which has a dir structure and build system along the lines of what we want
There are some details further down, but the meat of the plan is found [on a separate page | NewInfraPlan.html]. The same goes for the progress.
Another design goal is that once you are within $GTHOME/gtlangs/, you should be able to just ‘make’, and all languages should be built. If you instead cd into one of the language subfolders and ‘make’, only that language should be built. Probably obvious, but I just wanted to put it in print.
The basic dir structure could be something like this:
$GTHOME/
    gtcore/
        scripts/    # the old gt/script/ dir
        mk-files/   # shared core mk-files
        templates/  # src file templates and dir structure
        shared/     # old common/ - shared linguistic src files
    gtlangs/
        sme/
        smj/
        sma/
        fao/
        kom/
        langgroups/
            smi/
Comments: the dirs in $GTHOME/gtcore/ are intended to be used as follows:
mk-files/ - the bulk of the build system is found here. In each language dir there is only a very simple (auto)make file that imports or includes the core makefile at the same relative position, as well as a local (auto)make file for local overrides. This will make it easy to improve or change the build for all languages, and at the same time provide enough flexibility to do language-specific experiments.
templates/ - contains a complete directory structure and source file templates for all tools and purposes for our languages. This dir tree is used for two purposes: to populate a new language dir tree, and to add new files to existing languages when new source file templates are added. That is, when new functionality is added, with its corresponding source files in this directory, all languages will automatically be updated to get source file templates for this new functionality.
Longer term, one can consider the following additions:
$GTHOME/
    gtcore/        # as above
    gtlangs/       # as above
    gtlangpairs/   # language pairs, typically dictionaries and MT
    gtlanggroups/  # multilingual resources, typically terminology
                   # collections and shared name resources
The idea is to gather resources that are specific to the given language pairs within these directories. They should also serve as the starting point for ‘‘CS’s Dream’’ (Cip’s and Sjur’s Dream), where all monolingual information is stored in gtlangs/, and all multilingual information is stored in one of the two dirs indicated above. Language pair names are directional, indicating the source and target languages.
In this scenario, resources for an MT application would then probably be divided among three dir trees: gtlangs/ for the monolingual resources, gtlangpairs/ for the transfer dictionaries, and gtlanggroups/ for terminology resources.
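The two uses of gtcore/templates/ described earlier (populating a new language dir, and adding newly introduced template files to existing language dirs) could be sketched roughly as below. This is only an illustration in Python, not the actual build tooling: the function name sync_templates and the rule that existing files are never overwritten are assumptions.

```python
import shutil
from pathlib import Path


def sync_templates(template_dir, lang_dir):
    """Copy every template file that is missing from lang_dir.

    Assumption: files that already exist in the language dir are never
    overwritten, so the same call can both populate a new language dir
    and top up an existing one. Returns the relative paths that were added.
    """
    template_dir = Path(template_dir)
    lang_dir = Path(lang_dir)
    added = []
    for src in sorted(template_dir.rglob("*")):
        rel = src.relative_to(template_dir)
        dest = lang_dir / rel
        if src.is_dir():
            # Mirror the directory structure, including empty dirs.
            dest.mkdir(parents=True, exist_ok=True)
            continue
        if dest.exists():
            continue  # keep the language's own version of the file
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        added.append(rel.as_posix())
    return added
```

Running the function once per language dir after a new template has been added would give every language the new file, while leaving already edited sources untouched.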
Filenames need to be standardised, as well as the use of filename extensions. The extension should reflect the content type. A possible list of extensions could be:
.lexc - LexC source file
.xfscript - xfst script file
.regex - xfst regex file
.twolc - twolc source file
.xfst - compiled transducer, Xerox type
.hfst - compiled transducer, HFST type
.hfstol - compiled transducer, optimised HFST type
.cg - generic constraint grammar (CG) file
.cg2 - CG2 source file (probably not used anymore, but listed to ensure we can differentiate between major versions of the CG formalism)
.cg3 - VISL CG3 constraint grammar source file
.cg3bin - binary/compiled VISL CG3 constraint grammar file
There are probably other file types we need to handle; add more extensions here as needed.
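A table like this is easy to turn into a small lookup, for instance to check that files in a language dir follow the convention. The sketch below just restates the list as a Python dict; the function name content_type and the example file names are made up for illustration.

```python
from pathlib import Path

# Extension -> content type, restating the list above.
CONTENT_TYPES = {
    ".lexc": "LexC source file",
    ".xfscript": "xfst script file",
    ".regex": "xfst regex file",
    ".twolc": "twolc source file",
    ".xfst": "compiled transducer, Xerox type",
    ".hfst": "compiled transducer, HFST type",
    ".hfstol": "compiled transducer, optimised HFST type",
    ".cg": "generic constraint grammar (CG) file",
    ".cg2": "CG2 source file",
    ".cg3": "VISL CG3 constraint grammar source file",
    ".cg3bin": "binary/compiled VISL CG3 constraint grammar file",
}


def content_type(filename):
    """Return the content type for a file name, or None if its
    extension is not (yet) in the list."""
    return CONTENT_TYPES.get(Path(filename).suffix)


# Hypothetical file names, only to show the lookup.
print(content_type("nouns.lexc"))
print(content_type("disambiguation.cg3"))
print(content_type("README"))
```

A None result marks a file whose extension still has to be added to the list, or renamed to fit it.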
So far we have used ISO 639-2 codes for all languages, and applied them both to dir names and as part of file names. We should probably move to (the relevant subparts of) proper locale codes, following the standards used by the rest of the world. This means changing all sme strings to se, nob to nb, etc.
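As a rough sketch of what such a renaming could look like, the snippet below maps three-letter codes to two-letter ones and rewrites the code where it stands as a separate token in a name. Only sme to se and nob to nb come from the text above; the other entries are the ISO 639-1 codes for languages mentioned on this page, added for illustration. The fallback of keeping the three-letter code where no two-letter code exists (as for smj and sma) is an assumption, and the function names and file names are hypothetical.

```python
import re

# Three-letter code -> two-letter ISO 639-1 code.
# sme and nob are from the text; fao, kom and kal are added for illustration.
SHORT_CODES = {"sme": "se", "nob": "nb", "fao": "fo", "kom": "kv", "kal": "kl"}

# Match a known code only when it is not part of a longer word.
_CODE_RE = re.compile(
    r"(?<![A-Za-z])(" + "|".join(SHORT_CODES) + r")(?![A-Za-z])"
)


def locale_code(code):
    """Return the short code, or the code itself when no short code
    is known (assumption: e.g. smj and sma stay as they are)."""
    return SHORT_CODES.get(code, code)


def rename(name):
    """Replace language codes that appear as separate tokens in a
    dir or file name."""
    return _CODE_RE.sub(lambda m: SHORT_CODES[m.group(1)], name)


print(locale_code("smj"))
print(rename("sme-lex.lexc"))
print(rename("abbr-nob.txt"))
```

Matching only separate tokens avoids rewriting longer words that happen to contain a language code.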
Small and big in the list below refer to the size of the linguistic resources. Simplifying a bit, the size is roughly equal to the number of lexc entries.
[...] fao, which is reasonably big but still not too complex. Create the basic dir tree, and use svn copy to copy over the fao sources, so that the old fao dir remains intact and usable all the time (the old dir will be removed only when everything is working OK).
[...] fao [...]
[...] fao has)
[...] kal (which uses xfst instead of twolc and thus provides a slightly new use case). Make sure all build targets are working as they should, and extend the build system, template files, etc. as needed. kal probably has more requirements than fao.
We need to ensure that nothing changes in the output of the transducers as part of the remake, unless there are some intended changes (e.g. unifying tags across languages). It is probably best to first do the infra remake, and only later do such tag unification in the output. So what we need for each language is:
The testing then amounts to ensuring that the output is the same from both the old and new transducers. This should guarantee stability in the output, and thus reliability from a linguistic point of view.
There might be problems with this testing scenario in cases where we want to change tags as part of the infrastructure remake. One example could be that we want to standardise some of the compounding tags, to ensure that a compound filter works the same for all languages. Another is that some tags that are visible now will be removed from the output of the new transducers, since they really should not be part of the output even in the old transducers (e.g. the +Der1 tags).
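The comparison itself can be kept very simple. The sketch below assumes that the output of the old and the new transducer for the same word list has already been captured as text; how the transducers are run is left out. Tags that are intentionally removed in the new output can be listed so that they do not show up as differences. The function name compare_outputs and the sample analyses are made up for illustration.

```python
def compare_outputs(old_output, new_output, removed_tags=()):
    """Compare two analyser outputs line by line.

    removed_tags lists tags that are intentionally dropped in the new
    transducer; they are stripped from the old output before comparing
    (plain substring removal, so list only unambiguous tag strings).
    Returns (line_number, old_line, new_line) for every mismatch.
    """
    def normalise(line):
        for tag in removed_tags:
            line = line.replace(tag, "")
        return line.rstrip()

    old_lines = [normalise(line) for line in old_output.splitlines()]
    new_lines = [line.rstrip() for line in new_output.splitlines()]
    mismatches = []
    for index in range(max(len(old_lines), len(new_lines))):
        old = old_lines[index] if index < len(old_lines) else None
        new = new_lines[index] if index < len(new_lines) else None
        if old != new:
            mismatches.append((index + 1, old, new))
    return mismatches


# Made-up sample output, only to show the call.
old = "word\tword+Der1+N+Sg\nother\tother+V\n"
new = "word\tword+N+Sg\nother\tother+V\n"
print(compare_outputs(old, new))
print(compare_outputs(old, new, removed_tags=("+Der1",)))
```

An empty result means the two outputs agree, apart from the tags that were listed as intentionally removed.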
Should we build end-user applications in a separate dir tree, one tree for each application, or should the applications be included in the regular language dirs? As long as the application basically only involves one technology and a few files, it would probably seem easiest to build the application as part of the other builds for that language. One such example is spell checkers, which basically are an application of normative transducers.
But as soon as the application builds on multiple technologies, requires several installation packages for different platforms and a multitude of files as part of the application, it might get more complicated. In this case it might be easier to maintain separate directory trees for each application. Take Oahpa as an example, which uses several transducers, disambiguators, SQL data, and user interface files.
There is no easy answer to this; we probably have to try both, and see how things develop.