Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

Testing tools for the Sámi language technology project

This page documents the scripts and the Makefile used as test tools. There are five Perl scripts, all located in $CVSROOT/gt/script/testing/, and a Makefile. Each language has its own copy of the Makefile, located in $CVSROOT/gt/smX/testing/ (where smX is the ISO code of your favourite Saami language). The Southern Saami (sma) Makefile is the development version and serves as the original from which the others are copied.

Only the calling conventions and the return values of the scripts are described below. For details, see the scripts themselves: they are pretty simple and fairly well commented (and if not, complain to me).

  1. merge-codesNforms.pl

Purpose:

To create a base file for making test cases by combining a tag list and a word form list. This way we only have to write the tag list once for each POS.

Input:

Output stream:

One line per inflection, each consisting of three tab-separated fields: baseform, inflectional codes, and the word form(s) corresponding to the inflectional codes.

Usage

The output (the testbase file) is used as input to make-gen-test.pl, make-gen-test-facit.pl, or make-ana-test.pl, which create the actual test cases and the corresponding facit files.
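The merge step can be illustrated with a minimal Python sketch. It is not the Perl script itself: the function name, the input shapes (one code string per list entry, one word form entry per code, in the same order), the example tags, and the placeholder word forms are all assumptions made for illustration.

```python
def merge_codes_and_forms(codes, forms, baseform):
    """Pair each tag string with its word form(s) and build testbase lines.

    Assumption: codes and forms are parallel lists, so entry i of the
    word form list belongs to entry i of the tag list.
    """
    if len(codes) != len(forms):
        raise ValueError("tag list and word form list differ in length")
    # Three tab-separated fields: baseform, inflectional codes, word form(s).
    return [f"{baseform}\t{code}\t{form}" for code, form in zip(codes, forms)]


codes = ["+N+Sg+Nom", "+N+Sg+Gen"]   # hypothetical tag list, written once per POS
forms = ["form-a", "form-b"]         # placeholder word forms, not real data
for line in merge_codes_and_forms(codes, forms, "lemma"):
    print(line)
```

Because the tag list is a separate argument, the same list can be reused for every word of the same part of speech.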

  2. merge-codesNword.pl

Purpose:

To create the input file for generating a paradigm by combining a tag list and a base form of a given word.

Input:

Output stream:

A list of baseform plus codes corresponding to the whole paradigm. There is one such combination on each line.

Usage

The output can be directly used as input for xfst, to generate the word forms that make up the paradigm.
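As a rough illustration of the output described above, the following Python sketch appends every tag string to a single baseform. The function name and the example tags are hypothetical, and the sketch assumes that each tag string already carries its leading separator.

```python
def merge_codes_and_word(codes, baseform):
    """Return one 'baseform + codes' string per tag string in the list."""
    # Assumption: each code string starts with its own separator ("+..."),
    # so plain concatenation gives the combination for the generator.
    return [baseform + code for code in codes]


codes = ["+N+Sg+Nom", "+N+Sg+Gen", "+N+Pl+Nom"]   # hypothetical tag list
for line in merge_codes_and_word(codes, "lemma"):
    print(line)
```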

  3. make-gen-test.pl

Purpose:

To extract from a created testbase file the separate parts needed as input data for testing word form generation.

Input:

A testbase file created with 1. merge-codesNforms.pl, containing three fields: baseform, inflectional codes, and the word form(s) corresponding to the inflectional codes.

Output stream:

Test file for word form generation testing: one line for each inflection, consisting of baseform and inflectional codes appended. This is the input format required by the Xerox xfst tool.

Usage

Use as input to the Xerox xfst tool (done in the Makefile).
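The extraction can be sketched in Python as follows. This is only an illustration of the described behaviour, assuming a tab-separated testbase line with exactly three fields; the function name and sample data are hypothetical.

```python
def make_gen_test(testbase_lines):
    """Build generation test input: baseform with the codes appended."""
    test_lines = []
    for line in testbase_lines:
        baseform, codes, _forms = line.rstrip("\n").split("\t")
        # The word form field is dropped; it belongs in the facit file.
        test_lines.append(baseform + codes)
    return test_lines


testbase = ["lemma\t+N+Sg+Nom\tform-a", "lemma\t+N+Sg+Gen\tform-b"]
print("\n".join(make_gen_test(testbase)))
```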

  4. make-gen-test-facit.pl

Purpose:

To create the expected output from a generation test run, such that the actual test results can be compared with it. Based on the comparison, one can make further reports on the success of the test run.

Input:

Testbase file as created above.

Output stream:

A list of word forms in the same format as produced by the Xerox tools, extracted from the testbase file. One word form on each line.

Usage

Use the output of this script to diff against the actual test result (done in the Makefile). Any differences indicate possible errors in the morphological description.
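A minimal Python sketch of the facit extraction is given here, assuming the same three-field, tab-separated testbase line as above. The function name and the sample data are hypothetical. The sketch does not try to reproduce the exact output format of the Xerox tools, nor does it handle lines with several alternative word forms.

```python
def make_gen_test_facit(testbase_lines):
    """Return the expected word form (third field) of each testbase line."""
    return [line.rstrip("\n").split("\t")[2] for line in testbase_lines]


testbase = ["lemma\t+N+Sg+Nom\tform-a", "lemma\t+N+Sg+Gen\tform-b"]
facit = make_gen_test_facit(testbase)
print("\n".join(facit))
```

The resulting list has one entry per inflection, in the same order as the test input, which is what makes a line-by-line comparison with the generated forms meaningful.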

  5. make-ana-test.pl

Purpose:

To create a test file (or a facit file) for morphological analysis by printing all the possible word forms, each followed by the corresponding analysis, formatted almost like the output from the Xerox xfst tool. Some further postprocessing is needed both for making the test case and for creating the facit file. This is done in the Makefile.

Input:

A testbase file as created above.

Output stream:

A two-field, tab-separated list: the word form in field 1 and the corresponding analysis in field 2.

Where there is more than one alternative word form, the alternatives are split onto separate lines.

Usage

Use to create the basis for word form analysis testing. Further sorting and cutting (field 1 as test data, field 2 as facit data) is needed, and is done in the Makefile.
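The splitting and the later cut into test data and facit data might look roughly like this in Python. The sketch rests on several assumptions: the alternative word forms are taken to be comma-separated (the real separator is not documented here), the analysis is taken to be the baseform with the codes appended, and the function name and sample data are hypothetical.

```python
def make_ana_test(testbase_lines, sep=","):
    """Return 'word form<TAB>analysis' rows, one row per alternative form."""
    rows = []
    for line in testbase_lines:
        baseform, codes, forms = line.rstrip("\n").split("\t")
        for form in forms.split(sep):
            # Each alternative gets its own line with the same analysis.
            rows.append(f"{form.strip()}\t{baseform}{codes}")
    return rows


testbase = ["lemma\t+N+Sg+Gen\tform-b1, form-b2", "lemma\t+N+Sg+Nom\tform-a"]
rows = sorted(make_ana_test(testbase))          # sorting, as done in the Makefile
test_data = [row.split("\t")[0] for row in rows]   # field 1: input for analysis
facit_data = [row.split("\t")[1] for row in rows]  # field 2: expected analyses
print(test_data)
print(facit_data)
```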

  6. Makefile

Whereas the perl scripts above are pretty short and simple, the Makefile used to automatise testing is pretty long and complex. Thus, the documentation is split into the following sections:

Flow diagram for testing

The flow of actions in the test bed is outlined below. The example file is from Southern Saami, but the flow itself is language independent. The flow diagram illustrates word form generation.

            -----------    ===========           The corresponding
   "Files", | Scripts |  & || Tools ||              make target
==================================================================
"noun-codes.txt"  "n-even-col6-ie-full.txt"
        \                 |
         ------------------------
         | merge-codesNforms.pl |
         ------------------------
                     ||
                     \/
    "n-even-col6-ie-full.testbase"                 n-%.testbase
       ||                  ||
       ||                  \/
       ||       ------------------------
       ||       |   make-gen-test.pl   |
       ||       ------------------------
       ||                  ||
       ||                  \/
       ||      "n-even-col6-ie-full.gtest"            %.gtest
       ||                  ||
       ||                  \/
       ||   ------------------------------------
       ||   | n-even-col6-ie-full-gtest-script |
       ||   ------------------------------------
       ||                  ||
       ||                  \/
       ||            ==============
       ||            ||   xfst   ||
       ||            ==============
       ||                  ||
       ||                  \/
       ||        "n-even-col6-ie-full.gresult"       %.gresult
       ||                              ||
       \/                              ||
  --------------------------           ||
  | make-gen-test-facit.pl |           ||
  --------------------------           ||
                 ||                    ||
                 \/                    ||
      "n-even-col6-ie-full.gfacit"     ||            %.gfacit
                        ||             ||
                        \/             \/
                        =================
                        ||     diff    ||
                        =================
                                ||
                                \/
                  "n-even-col6-ie-full.greport"      %.greport
                                ||
 all *.greport files - \  \  \  \/  / /  /
                        =================
                        ||     cat     ||
                        =================
                                ||
                                \/
                          "n-g.summary"             n-g.summary

The above scheme is repeated more or less identically for word form analysis, except that there is no separate -facit.pl script: the same script produces both the test input and the test facit, with the help of some postprocessing in the Makefile.
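The last two steps of the flow diagram (diff, then cat) can be approximated in Python with the standard difflib module. This is a sketch of the idea only: the Makefile calls the diff and cat tools, whose output format differs from the unified diff used here, and the function names and sample data are hypothetical.

```python
import difflib


def make_report(facit_lines, result_lines, name):
    """Compare expected and actual word forms; an empty report means no differences."""
    return list(difflib.unified_diff(
        facit_lines, result_lines,
        fromfile=f"{name}.gfacit", tofile=f"{name}.gresult", lineterm=""))


def make_summary(reports):
    """Concatenate all reports into one summary, as cat does for the *.greport files."""
    summary = []
    for report in reports:
        summary.extend(report)
    return summary


ok = make_report(["form-a", "form-b"], ["form-a", "form-b"], "n-test-1")
bad = make_report(["form-a", "form-b"], ["form-a", "form-x"], "n-test-2")
print("\n".join(make_summary([ok, bad])))
```

Any line in the summary points to a word form where the generated result and the facit disagree, and therefore to a possible error in the morphological description.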

The scheme for paradigm generation is much simpler, and it should be possible to read the Makefile directly. If not, complain to me!

Variables Used

Predefined Variables

The following built-in variables are used:

Variables I have defined

The following variables defined by me are used:

When a variable is defined, its name is written as is. When it is referenced, the name is enclosed in parentheses and prefixed with a dollar sign. Example: $(TEMP) is how the variable TEMP is referenced.

Main sections of the Makefile

The main sections of the Makefile are the following: