The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages.
This page documents the scripts and the Makefile used as test tools.
There are five perl scripts, all located in $CVSROOT/gt/script/testing/, and a Makefile, one copy for each language, located in $CVSROOT/gt/smX/testing/ (where smX is the ISO code of your favourite Saami language). The Southern Saami (sma) Makefile is used as the development version, and serves as the original from which the others are copied.
Only the calling conventions and the return values of the different scripts are described below. For details, see the scripts themselves: they are pretty simple and fairly well commented (and if not, complain to me).
To create a base file for making test cases by combining a tag list and a word form list. This way we only have to write the tag list once for each POS.
ARG1: input file with inflectional tags, one tag on each line; normally one of the files listed below (the filenames are not hardcoded, but given by the Makefile):
noun-codes.txt
verb-codes.txt
adj-codes.txt
ARG2: input file with inflected word forms, in the same order as the tags; two or more alternate word forms on the same line are separated by a comma ONLY.
The output is a repeating, tab-separated list of three fields, each such triple separated with a newline:
Field 1: the baseform of the word
Field 2: a morphological tag
Field 3: the word form(s) corresponding to the tag; in the case of two or more alternative word forms, they are separated by a comma ONLY (no space).
Used in front of one of the scripts described below that read the testbase file, to create the actual test cases and the corresponding facit files.
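The merge step can be illustrated with a minimal Python sketch. This is not the actual Perl script, only an approximation of the behaviour described above. It assumes that the baseform is passed in explicitly, since this page does not say where merge-codesNforms.pl takes it from. The function name, the tags and the word forms are invented placeholders.

```python
def merge_codes_and_forms(baseform, tags, forms):
    """Pair each tag with the word form(s) found on the same line number.

    Returns one tab-separated triple per tag:
    baseform, tag, word form(s).
    """
    if len(tags) != len(forms):
        raise ValueError("tag list and word form list differ in length")
    return [f"{baseform}\t{tag}\t{form}" for tag, form in zip(tags, forms)]


# Placeholder data, for illustration only (not real language data).
tags = ["+N+Sg+Nom", "+N+Sg+Gen"]
forms = ["form1", "form2a,form2b"]  # alternatives: comma only, no space
testbase = merge_codes_and_forms("lemma", tags, forms)
print("\n".join(testbase))
```

Because the pairing is purely positional, the sketch refuses to run when the two lists differ in length; whether the real script checks this is not stated here.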
To create the input file for generating a paradigm by combining a tag list and a base form of a given word.
ARG1: input file with inflectional tags, one tag on each line; normally one of the files listed below (the filenames are not hardcoded, but given by the Makefile):
noun-codes.txt
verb-codes.txt
adj-codes.txt
ARG2: a word in its base form. The word has to belong to one of the major POSes N, A or V.
The output is a list of baseform plus codes corresponding to the whole paradigm. There is one such combination on each line.
The output can be used directly as input for xfst, to generate the word forms that make up the paradigm.
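As a rough Python sketch of the idea (again not the Perl script itself), the paradigm input is just the baseform combined with every tag in the list. The sketch assumes that each tag already carries its own leading separator, so that plain concatenation gives the right string; the function name and the sample values are hypothetical.

```python
def paradigm_input(baseform, tags):
    """Return one 'baseform plus codes' string per tag in the tag list."""
    # Assumption: tags look like "+N+Sg+Nom", so no extra separator is needed.
    return [baseform + tag for tag in tags]


# Placeholder data, for illustration only.
lines = paradigm_input("lemma", ["+N+Sg+Nom", "+N+Sg+Gen", "+N+Pl+Nom"])
print("\n".join(lines))
```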
To extract from a created testbase file the separate parts needed as input data for testing word form generation.
The input is a testbase file created with merge-codesNforms.pl (script 1 above), with the three fields baseform, inflectional codes, and word form(s) corresponding to the inflectional codes.
The output is a test file for word form generation testing: one line for each inflection, consisting of the baseform with the inflectional codes appended. This is the input format required by the Xerox xfst tool.
Use the output as input to the Xerox xfst tool (done in the Makefile).
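The extraction described here amounts to keeping fields 1 and 2 of every testbase line and gluing them together. A minimal Python sketch, assuming the three-field, tab-separated testbase format described earlier (the function name and sample lines are made up for the example):

```python
def gen_test_lines(testbase_lines):
    """Build generation test input: baseform with inflectional codes appended."""
    out = []
    for line in testbase_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        baseform, codes, _forms = line.split("\t")
        out.append(baseform + codes)
    return out


# Placeholder testbase content, for illustration only.
sample = ["lemma\t+N+Sg+Nom\tform1\n", "lemma\t+N+Sg+Gen\tform2a,form2b\n"]
gtest = gen_test_lines(sample)
print("\n".join(gtest))
```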
To create the expected output from a generation test run, such that the actual test results can be compared with it. Based on the comparison, one can make further reports on the success of the test run.
The input is the testbase file as created above.
The output is a list of word forms in the same format as produced by the Xerox tools, extracted from the testbase file, with one word form on each line.
Use the output of this script to diff against the actual test result (done in the Makefile). Any differences indicate possible errors in the morphological description.
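The facit extraction can be sketched in Python as follows. The sketch assumes that comma-separated alternatives in field 3 each end up on a line of their own, in the order given; how the real script orders or sorts them is not described on this page. All names and data are illustrative.

```python
def gen_facit_lines(testbase_lines):
    """Extract the expected word forms (field 3), one word form per line."""
    out = []
    for line in testbase_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        _baseform, _codes, forms = line.split("\t")
        # Assumption: alternatives are separated by a comma only (no space).
        out.extend(forms.split(","))
    return out


# Placeholder testbase content, for illustration only.
sample = ["lemma\t+N+Sg+Nom\tform1", "lemma\t+N+Sg+Gen\tform2a,form2b"]
gfacit = gen_facit_lines(sample)
print("\n".join(gfacit))
```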
To create a test file (or a facit file) for morphological analysis by printing all the possible word forms with the corresponding analysis at the end, formatted almost like the output from the Xerox xfst tool. Some further postprocessing is needed both for making the test case and for creating the facit file. This is done in the Makefile.
The input is a testbase file as created above.
The output is a two-field, tab-separated list: the word form, followed by the corresponding analysis.
Where there is more than one alternative word form, the alternatives are split onto separate lines.
Use to create the basis for word form analysis testing. Further sorting and cutting (field 1 as test data, field 2 as facit data) is needed, and is done in the Makefile.
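A Python sketch of this step, including the cutting that the Makefile is said to do afterwards. It assumes that field 1 is the word form and field 2 is the analysis (baseform plus codes), which is inferred from the description above rather than taken from the script; the names and sample data are hypothetical.

```python
def analysis_pairs(testbase_lines):
    """Return (word form, analysis) pairs, one pair per alternative word form."""
    pairs = []
    for line in testbase_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        baseform, codes, forms = line.split("\t")
        for form in forms.split(","):
            pairs.append((form, baseform + codes))
    return pairs


# Placeholder testbase content, for illustration only.
sample = ["lemma\t+N+Sg+Nom\tform1", "lemma\t+N+Sg+Gen\tform2a,form2b"]
pairs = analysis_pairs(sample)

# Field 1 alone is the test data; the full two-field lines serve as the facit.
test_data = [form for form, _analysis in pairs]
facit = ["\t".join(pair) for pair in pairs]
print("\n".join(facit))
```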
Whereas the perl scripts above are pretty short and simple, the Makefile used to automatise testing is pretty long and complex. Thus, the documentation is split into the following sections:
Below is outlined the flow of action for the test bed. The example file is from South Sami, but the flow itself is language independent. The flow diagram illustrates word form generation.
         -----------   ===========                          The corresponding
"Files", | Scripts | & || Tools ||                          make target
=============================================================================
"noun-codes.txt"    "n-even-col6-ie-full.txt"
                \             |
            ------------------------
            | merge-codesNforms.pl |
            ------------------------
                       ||
                       \/
         "n-even-col6-ie-full.testbase"                     n-%.testbase
            ||                     ||
            ||                     \/
            ||          ------------------------
            ||          |   make-gen-test.pl   |
            ||          ------------------------
            ||                     ||
            ||                     \/
            ||        "n-even-col6-ie-full.gtest"           %.gtest
            ||                     ||
            ||                     \/
            ||    ------------------------------------
            ||    | n-even-col6-ie-full-gtest-script |
            ||    ------------------------------------
            ||                     ||
            ||                     \/
            ||               ==============
            ||               ||   xfst   ||
            ||               ==============
            ||                     ||
            ||                     \/
            ||       "n-even-col6-ie-full.gresult"          %.gresult
            ||                     ||
            \/                     ||
--------------------------         ||
| make-gen-test-facit.pl |         ||
--------------------------         ||
            ||                     ||
            \/                     ||
"n-even-col6-ie-full.gfacit"       ||                       %.gfacit
            ||                     ||
            \/                     \/
                    =================
                    ||    diff     ||
                    =================
                            ||
                            \/
              "n-even-col6-ie-full.greport"                 %.greport
                            ||
all *.greport files - \ \ \ \/ / / /
                    =================
                    ||     cat     ||
                    =================
                            ||
                            \/
                      "n-g.summary"                         n-g.summary
The above scheme is repeated more or less identically for word form analysis, with the exception that there is no separate -facit.pl script: the same script is used for producing both test input and test facit, with the help of some postprocessing in the Makefile.
The scheme for paradigm generation is much simpler, and it should be possible to read the Makefile directly. If not, complain to me!
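The comparison step in the generation flow (expected forms against actual xfst output) can be approximated in Python with the standard difflib module. This is only a sketch of the idea: the Makefile uses the diff tool, and the actual layout of the .greport files is not described on this page. The function name and sample data are invented.

```python
import difflib


def generation_report(facit_lines, result_lines):
    """Return unified-diff lines; an empty list means the test run passed."""
    return list(
        difflib.unified_diff(
            facit_lines,
            result_lines,
            fromfile="gfacit",
            tofile="gresult",
            lineterm="",
        )
    )


# Placeholder data: the second generated form differs from the expected one.
expected = ["form1", "form2"]
actual = ["form1", "formX"]
report = generation_report(expected, actual)
print("\n".join(report))
```

Lines starting with a minus sign are expected forms that were not generated, and lines starting with a plus sign are generated forms that were not expected.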
The following built-in variables are used:
the make program used. This is useful, for example, when starting a make command in another directory from within a Makefile, to ensure that both use the same make.
The following variables defined by me are used:
make documentation. That’s why.
make documentation.
When defined, the variable names are written as such; when referenced, they are enclosed in parentheses and prefixed with a dollar sign. Example: $(TEMP) is how the variable TEMP is referenced.
The main sections of the Makefile are the following: