Language Technology at UiT

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages



The flowchart shows parsing done with two different analysers: Method 1, for hfst analysers, where pre- and postprocessing are integrated into the morphological analysis, and Method 2, for xfst analysers, which uses Perl for pre- and postprocessing of the text.

Method 1 (hfst)

        Action taken..              ..by means of the commands:
        **************              ****************************
+------------------------------+
|     take incoming text       |    cat filename.txt |
+------------------------------+
             \/
+------------------------------+
| divide it into sentences and |
| words and give each token    |
| all possible morphological   |    hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |
| analyses                     |
+------------------------------+
             \/
+------------------------------+
| disambiguating the m-analysis|
| (= picking only the relevant |    vislcg3 -g src/syntax/disambiguation.cg3 |
| morphological analysis)      |
+------------------------------+
             \/
+------------------------------+
| adding syntactic functions   |    vislcg3 -g src/syntax/functions.cg3 |
+------------------------------+
             \/
+------------------------------+
| adding dependency relations  |    vislcg3 -g src/syntax/dependency.cg3
+------------------------------+
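
The steps of Method 1 can be chained into one pipeline. The following is a minimal sketch that only uses the commands and paths listed in the flowchart; the function name analyse_hfst is a hypothetical wrapper introduced here for illustration, not part of the GiellaLT tools. It assumes that hfst-tokenise and vislcg3 are installed and that it is run from the language directory.

```shell
#!/bin/sh
# Method 1 (hfst) as a single pipeline.
# analyse_hfst is a hypothetical wrapper name, not a GiellaLT command.
# Run it from the directory of the language you work with.
analyse_hfst() {
    # $1: the text file to analyse
    cat "$1" |
    # tokenise and give each token all possible morphological analyses
    hfst-tokenise -g tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst |
    # keep only the relevant morphological analysis
    vislcg3 -g src/syntax/disambiguation.cg3 |
    # add syntactic functions
    vislcg3 -g src/syntax/functions.cg3 |
    # add dependency relations
    vislcg3 -g src/syntax/dependency.cg3
}
```

Usage, under the same assumptions: source the file and call analyse_hfst filename.txt; the analysis is written to standard output.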

Method 2 (xfst)

Method 2 differs only in that the morphological analysis is divided into three separate components.

        Action taken..              ..by means of the commands:
        **************              ****************************
+------------------------------+
|     take incoming text       |    cat filename.txt |
+------------------------------+
             \/
+------------------------------+
| preprocessing it:            |
| moving one word per line,    |    preprocess --abbr=bin/abbr.txt |
| finding sentence boundaries  |
+------------------------------+
             \/
+------------------------------+
| morphological analysis:      |
| give each word all possible  |    lookup -flags mbTT -utf8 src/analyser-gt-desc.xfst |
| analyses                     |
+------------------------------+
             \/
+------------------------------+
| processing the output into   |
| a format that fits the       |    lookup2cg |
| disambiguator, using a Perl  |
| script                       |
+------------------------------+
             \/
       ...
  ( then continue with disambiguation as shown above )

The commands assume that you are in the directory of the language you are working with. Method 2 can also be used with hfst (via hfst-lookup), but Method 1 works only with hfst.
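
Method 2 can be chained in the same way. This is a minimal sketch built only from the commands in the flowchart, followed by the three vislcg3 steps from Method 1; the function name analyse_xfst is a hypothetical wrapper introduced here for illustration. It assumes that preprocess, lookup, lookup2cg and vislcg3 are installed and that it is run from the language directory.

```shell
#!/bin/sh
# Method 2 (xfst) as a single pipeline.
# analyse_xfst is a hypothetical wrapper name, not a GiellaLT command.
# Run it from the directory of the language you work with.
analyse_xfst() {
    # $1: the text file to analyse
    cat "$1" |
    # one word per line, sentence boundaries found
    preprocess --abbr=bin/abbr.txt |
    # give each word all possible analyses
    lookup -flags mbTT -utf8 src/analyser-gt-desc.xfst |
    # convert the lookup output to the format the disambiguator expects
    lookup2cg |
    # from here on, the steps are the same as in Method 1
    vislcg3 -g src/syntax/disambiguation.cg3 |
    vislcg3 -g src/syntax/functions.cg3 |
    vislcg3 -g src/syntax/dependency.cg3
}
```

Usage, under the same assumptions: source the file and call analyse_xfst filename.txt.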