The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
This is an overview of the file structure of each language catalogue in the GiellaLT infrastructure, i.e. of the directories giella-core, langs, startup-langs and experimental-langs located under our main catalogue. The README file in $GTHOME also describes some basic properties of the infrastructure.
The starting point is the mother catalogue $GTHOME (called main if you follow the standard checkout procedure).
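As a minimal sketch of how this layout can be used, the function below looks for a language folder in each of the three language catalogues under $GTHOME. The function name find_lang_dir is hypothetical and not part of the infrastructure; the sketch also assumes that the language folders sit directly inside langs, startup-langs and experimental-langs.

```shell
#!/bin/sh
# find_lang_dir is a hypothetical helper, not part of GiellaLT.
# Assumption: language folders sit directly under the three catalogues.
find_lang_dir() {
    lang=$1
    for group in langs startup-langs experimental-langs; do
        if [ -d "$GTHOME/$group/$lang" ]; then
            # Print the first match and stop searching
            echo "$GTHOME/$group/$lang"
            return 0
        fi
    done
    # No catalogue contains a folder for this language
    return 1
}
```

A call such as `find_lang_dir sme` prints the path of the folder if it exists and returns a non-zero status otherwise.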
In these catalogues each language has its own catalogue (folder), named with its ISO 639-3 code. Each language folder contains the following subfolders:
am-shared (automake shared, common commands for fst compilation, never changed here, only in giella-core/langs-templates)
doc (folder where we write language-specific documentation)
m4 (support files for compilation)
misc (a grab bag dir for anything you don’t want in svn - all files are ignored)
src (this is where the linguistic source files are)
    filters (language-specific filters)
    hyphenation (each phenomenon its folder)
    fst (see below)
    orthography (capital letters, spellrelax)
    phonetics (for different scripts for phonetic transcription)
    cg3 (see below)
test (see below)
tools (tools we build, both proofing tools and other tools)
Here’s a tree view of the structure, shown for the undefined language und (thus the dir structure of all languages):
und$ tree -d
.
├── am-shared
├── doc
│   └── resources
│       └── images
├── m4
├── misc
├── src
│   ├── filters
│   ├── hyphenation
│   ├── fst
│   │   ├── affixes
│   │   └── stems
│   ├── orthography
│   ├── phonetics
│   ├── cg3
│   ├── tagsets
│   └── transcriptions
├── test
│   ├── data
│   ├── src
│   │   ├── morphology
│   │   ├── phonology
│   │   └── syntax
│   └── tools
│       └── spellcheckers
└── tools
    ├── analysers
    ├── datagrammarcheckers
    ├── grammarcheckers
    ├── hyphenators
    ├── mt
    │   ├── apertium
    │   └── cgbased
    │       └── filters
    ├── shellscripts
    ├── tokenisers
    └── spellcheckers
        ├── fstbased
        │   ├── foma
        │   └── hfst
        └── listbased
            └── hunspell
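The top level of this tree can be checked mechanically. The sketch below reports which of the seven top-level subfolders are missing from a given language folder; the function name check_lang_dir is hypothetical and only illustrates the layout described above.

```shell
#!/bin/sh
# check_lang_dir is a hypothetical helper, not part of GiellaLT.
# It prints one line per missing top-level subfolder and returns
# 0 when all of them are present, 1 otherwise.
check_lang_dir() {
    lang_dir=$1
    missing=0
    for sub in am-shared doc m4 misc src test tools; do
        if [ ! -d "$lang_dir/$sub" ]; then
            echo "missing: $sub"
            missing=1
        fi
    done
    return "$missing"
}
```

It would be called with the path to a language folder, for example `check_lang_dir "$GTHOME/langs/sme"`, assuming that is where the folder was checked out.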
Some directories are described in further detail below.
The src/fst folder contains root.lexc (defines tags and basic parts of speech). The folder might also contain some other lexc files, like:
clitics.lexc
compounding.lexc
The makefile defines two, possibly four, variables (the first two must be defined).
We define all the source files we need to build the transducers. The build system will take care of putting them together and compiling them.
include $(top_srcdir)/am-shared/lesc-include.am
This statement includes the majority of the build instructions. You should never need to touch the included file.
The orthography folder takes care of initial capitalisation and of spellrelax. Note that spellrelax is now language-specific.
The phonetics folder contains files for conversion to IPA.
The cg3 folder contains disambiguation.cg3. The files functions.cg3 and dependency.cg3 for sma/sme/smj are in giella-shared/smi/src/cg3/. Faroese also uses the common dependency.cg3, but has its own functions.cg3.
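The sharing rule described here can be written down as a small lookup. The sketch below is only an illustration: the function name cg3_path is hypothetical, fao is taken to be the folder name for Faroese, and language-specific files are assumed to live in $GTHOME/langs/LANG/src/cg3.

```shell
#!/bin/sh
# cg3_path is a hypothetical helper, not part of GiellaLT.
# Assumptions: Faroese is "fao", and language-specific cg3 files
# live in $GTHOME/langs/<lang>/src/cg3.
cg3_path() {
    lang=$1
    file=$2
    shared="$GTHOME/giella-shared/smi/src/cg3"
    own="$GTHOME/langs/$lang/src/cg3"
    case "$lang:$file" in
        # sma/sme/smj take both files from the shared folder
        sma:functions.cg3|sme:functions.cg3|smj:functions.cg3)
            echo "$shared/$file" ;;
        sma:dependency.cg3|sme:dependency.cg3|smj:dependency.cg3)
            echo "$shared/$file" ;;
        # Faroese shares dependency.cg3 only
        fao:dependency.cg3)
            echo "$shared/$file" ;;
        # Everything else, including disambiguation.cg3, is language-specific
        *)
            echo "$own/$file" ;;
    esac
}
```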
The test folder contains several subdirectories for different kinds of tests. Each test is wrapped in a shell script whose exit value depends on the outcome of the test.
If you need a new test, just write a shell script that follows this convention, and add it to the TESTS variable in the Makefile.am file. That’s it.
Also make sure that the shell script uses AM variables for all references to files and commands outside the script that cannot be assumed to be universally available. This makes the test scripts portable.
So far we have morphology tests only, but there is a setup for syntax and phonology as well.
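A test wrapper along these lines might look like the sketch below. It follows the exit-status convention of the Automake test harness (0 for pass, 77 for skip, 99 for hard error, anything else for fail), which is assumed here to be the convention meant above. The function name run_test and its three file arguments are hypothetical, and the lookup command is read from a LOOKUP environment variable as a stand-in for a value supplied by the build system.

```shell
#!/bin/sh
# Sketch of a test wrapper; run_test and its arguments are hypothetical.
# Exit statuses follow the Automake harness convention:
#   0 = pass, 77 = skip, 99 = hard error, anything else = fail.
run_test() {
    analyser=$1   # transducer under test
    input=$2      # one word form per line
    expected=$3   # expected lookup output

    # Skip instead of failing when the transducer was not built
    [ -f "$analyser" ] || return 77

    # Missing test data is a set-up problem, not a linguistic failure
    [ -f "$input" ] && [ -f "$expected" ] || return 99

    # Assumption: LOOKUP comes from the build system; fall back to hfst-lookup.
    # The pipeline's status is that of diff: 0 when the output matches.
    "${LOOKUP:-hfst-lookup}" "$analyser" < "$input" 2>/dev/null \
        | diff - "$expected" > /dev/null
}
```

A script using this function would end with something like `run_test "$analyser" "$input" "$expected"; exit $?`, so that the status reaches the test harness.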
What kind of and how many transducers we produce varies from language to language.
The binary files are stored in $LANG/src and follow fixed naming conventions.
For some languages (at least sme and sma) the content of the transducers and the differences between them are explained on the documentation page of each language.
$GTHOME also contains the catalogues giella-core and giella-shared. The former is for language-independent technical files, and the latter is for language-independent files, with one subfolder for each language.
The catalogue giella-core contains scripts used for all languages.