Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Quasicode For Keeping Track Of Transducers

This document follows the structure of cgi-interface.xsl and cgi-index.xsl, which specify the cgi-bin processes each language is subject to. The processes themselves are governed by conf.pl and other cgi-bin scripts in the www directory.

The source files are compiled in the maintainer’s home directory on gtweb (and for oahpa, on gtoahpa). A suitable command for compiling the fsts might be to run makeall.sh (and thereafter go for a cup of coffee).

After the fsts are compiled, the person who compiled them should copy them manually or semi-manually to the /opt/smi directory.

The reason this is not done automatically (say, by a cron job) is that we want to be sure the fsts are checked before they are put online. Given the large number of fsts, we still need to make the process semi-automatic, with scripts along the lines of update mhr. This document paves the way for such a setup.

In short: we want a single per-language button to press, called e.g. web-update LANG.
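As a rough illustration of what such a command could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `parse_args` and `web_update`, the `--apply` option, and the choice to default to a dry run (so that the fsts can still be checked before they go online). It takes a ready-made list of (source, destination) pairs and says nothing about how that list is produced.

```python
import argparse
import shutil
from pathlib import Path


def parse_args(argv):
    """Parse `web-update LANG [--apply]` (hypothetical interface)."""
    parser = argparse.ArgumentParser(prog="web-update")
    parser.add_argument("lang", help="language code, e.g. sme or mhr")
    parser.add_argument("--apply", action="store_true",
                        help="actually copy files (default: only list them)")
    return parser.parse_args(argv)


def web_update(pairs, apply=False):
    """Copy each (source, destination) pair, or only report it in a dry run.

    Missing source files are reported and skipped.
    Returns the destinations that were (or would be) written.
    """
    written = []
    for src, dst in pairs:
        src, dst = Path(src), Path(dst)
        if not src.is_file():
            print(f"MISSING  {src}")
            continue
        if apply:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
        print(f"{'COPY' if apply else 'WOULD COPY'}  {src} -> {dst}")
        written.append(dst)
    return written
```

Run without `--apply`, the sketch only lists what it would copy, which keeps the manual check in the loop; the actual copy happens only when the maintainer asks for it explicitly.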

A (now outdated) script that does this, and that may serve as inspiration, is:

$GTLANG/gt/script/fst2opt

Source files are in these directories:

$GTHOME/langs/LANG
$GTHOME/gt/sme/src

List of file names

This list is taken from the sme directory and is thus a maximal version. Other languages contain fewer files.

Name in /opt/smi/LANG/bin  | Name in langs/LANG/src
---------------------------+--------------------------
abbr.txt                   = (not in the new infra yet)
clock-LANG.fst             = transcriptor-clock2text-desc.xfst
corr.txt                   = (not in the new infra yet)
date-LANG.fst              = transcriptor-date2text-desc.xfst
dict-iLANG-norm.fst        = generator-dict-gt-norm.xfst
dict-LANG-norm.fst         = analyser-dict-gt-norm.xfst
hyphrules-LANG.fst         = hyphenation/hyphenation.xfst
hyph-LANG.fst              = (not in the new infra yet)
hyph-LANG.save             = (not in the new infra yet)
iclock-LANG.fst            = transcriptor-text2clock-desc.xfst
idate-LANG.fst             = transcriptor-text2date-desc.xfst
iLANG.fst                  = generator-gt-desc.xfst
iLANG-GG.restr.fst         = (not in the new infra)
iLANG-norm.fst             = generator-gt-norm.xfst
korpustags.LANG.txt        = ../test/data/korpustags.LANG.txt
oahpa-iLANG-norm.fst       = generator-oahpa-gt-norm.xfst
paradigm_full.LANG.txt     = ../test/data/paradigm_full.LANG.txt (temporary place)
paradigm_min.LANG.txt      = ../test/data/paradigm_min.LANG.txt (temporary place)
paradigm.LANG.txt          = ../test/data/paradigm.LANG.txt (temporary place)
paradigm_standard.LANG.txt = ../test/data/paradigm_standard.LANG.txt (temporary place)
ped-LANG.fst input to s/v  = ???
ped-tol-LANG.fst           = ???
phon-LANG.fst              = phonetics/text2ipa.xfst
LANG-dep.bin               = syntax/dependency.bin
LANG-dep.rle               = syntax/dependency.cg3
LANG-dis.bin               = syntax/disambiguation.bin
LANG-dis.rle               = syntax/disambiguation.cg3
LANG.fst                   = analyser-gt-desc.xfst (see LANG-site.fst!)
LANG-inum.fst              = transcriptor-text2numbers-desc.xfst
LANG-norm.fst              = analyser-gt-norm.xfst
LANG-num.fst               = transcriptor-numbers2text-desc.xfst
LANG-site.fst              = analyser-gt-desc.xfst (sme.fst without sem; see LANG-site.fst!)
smi-syn.rle                = syntax/syntax.cg3
typos.fst                  = ../test/data/typos.fst
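The table above could be kept as data and expanded per language. The following is a minimal sketch under stated assumptions: it covers only a subset of the rows, the names `FILE_MAP` and `copy_plan` are hypothetical, and it assumes that sources live under `$GTHOME/langs/LANG/src` and that installed files go to `/opt/smi/LANG/bin`, as the column headers suggest.

```python
import os
from pathlib import Path

# Installed name -> source name relative to langs/LANG/src.
# Subset of the table; "LANG" in an installed name is replaced
# by the language code when the plan is built.
FILE_MAP = {
    "LANG.fst": "analyser-gt-desc.xfst",
    "LANG-norm.fst": "analyser-gt-norm.xfst",
    "iLANG.fst": "generator-gt-desc.xfst",
    "iLANG-norm.fst": "generator-gt-norm.xfst",
    "LANG-num.fst": "transcriptor-numbers2text-desc.xfst",
    "LANG-inum.fst": "transcriptor-text2numbers-desc.xfst",
    "phon-LANG.fst": "phonetics/text2ipa.xfst",
    "LANG-dis.rle": "syntax/disambiguation.cg3",
}


def copy_plan(lang, gthome=None, dest_root="/opt/smi"):
    """Return a list of (source, destination) paths for one language.

    gthome falls back to the GTHOME environment variable (assumption),
    and to the current directory if that is unset.
    """
    gthome = Path(gthome or os.environ.get("GTHOME", "."))
    src_dir = gthome / "langs" / lang / "src"
    bin_dir = Path(dest_root) / lang / "bin"
    return [(src_dir / source, bin_dir / installed.replace("LANG", lang))
            for installed, source in FILE_MAP.items()]
```

For example, `copy_plan("sme", gthome="/gt")` pairs `/gt/langs/sme/src/analyser-gt-desc.xfst` with `/opt/smi/sme/bin/sme.fst`. Building the plan touches no files, so it can be inspected before anything is copied.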

Basic files for all

We have analysis and paradigm for all languages

LANG.fst       = src/analyser-gt-desc.xfst
iLANG.fst      = src/generator-gt-desc.xfst
iLANG-norm.fst = src/generator-gt-norm.xfst

Language-specific files

The distribution is governed by cgi-interface.xsl. Thus, one possibility is to just copy all files over.

cgi-interface.xsl

Syntax

if dependency="1", copy:
LANG/src/syntax/dependency.bin /opt/smi/LANG/bin/LANG-dep.bin
LANG/src/syntax/dependency.cg3 /opt/smi/LANG/bin/LANG-dep.rle

if nodisamb="0", copy:
LANG/src/syntax/disambiguation.bin /opt/smi/LANG/bin/LANG-dis.bin
LANG/src/syntax/disambiguation.cg3 /opt/smi/LANG/bin/LANG-dis.rle
LANG/src/...	/opt/smi/LANG/bin/abbr.txt

if nohyph="0", copy:
LANG/src/hyphenation/hyphenation.xfst  /opt/smi/LANG/bin/hyph-LANG.fst

if orth="1" & lang="kal", copy:
LANG/src/transcriptions/kleinschmidt2norm.xfst /opt/smi/LANG/bin/orth-LANG.fst

if translate_dan="1", copy:
LANG/

if translate_nob="1", copy:
LANG/

Other transducers

if orth="1" & lang="ipk", copy:

if phon="1", copy:
LANG/src/phonetics/text2ipa.xfst /opt/smi/LANG/bin/phon-LANG.fst
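The flag-driven rules in the Syntax and Other transducers lists could be expressed as a small rule table. This is only a sketch: it includes just the rules whose source and destination paths are fully spelled out above, the names `RULES` and `select_copies` are hypothetical, and it assumes the attributes from cgi-interface.xsl have already been read into a plain dictionary of strings. It also assumes that a flag missing from the dictionary triggers no copy, which may not match the real defaults.

```python
# (flag, value that triggers the copy, source under LANG/src, installed name)
RULES = [
    ("dependency", "1", "syntax/dependency.bin", "LANG-dep.bin"),
    ("dependency", "1", "syntax/dependency.cg3", "LANG-dep.rle"),
    ("nodisamb", "0", "syntax/disambiguation.bin", "LANG-dis.bin"),
    ("nodisamb", "0", "syntax/disambiguation.cg3", "LANG-dis.rle"),
    ("nohyph", "0", "hyphenation/hyphenation.xfst", "hyph-LANG.fst"),
    ("phon", "1", "phonetics/text2ipa.xfst", "phon-LANG.fst"),
]


def select_copies(lang, flags):
    """Return (source, destination) strings for every rule whose flag matches.

    flags maps attribute names to string values, e.g. {"dependency": "1"}.
    A flag that is absent from the mapping never triggers a copy (assumption).
    """
    pairs = []
    for flag, wanted, source, installed in RULES:
        if flags.get(flag) == wanted:
            pairs.append((f"{lang}/src/{source}",
                          f"/opt/smi/{lang}/bin/{installed.replace('LANG', lang)}"))
    return pairs
```

Keeping the rules as data means a new flag only needs a new row, and the same table can be printed for review before any files are moved.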

Paradigm generation

if dialpara="1"

if minpara="1"

if standardpara="1"

if fullpara="1"

cgi-index.xml

if nonum="0"

LANG/src/transcriptions/transcriptor-numbers2text-raw.xfst /opt/smi/LANG/bin/LANG-num.fst

if oahpa="1"

src/transcriptions/transcriptor-clock2text-raw.xfst /opt/smi/LANG/bin/iclock-LANG.fst

src/transcriptions/transcriptor-date2text-raw.xfst /opt/smi/LANG/bin/idate-LANG.fst

Invert these transducers and copy them to clock-LANG.fst and date-LANG.fst

CHECK: Oahpa may be using fsts stored in the /home/oahpa directory.