Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages


    Agenda

    1. Opening, agenda review
    2. Reviewing the task list from last week
    3. Documentation - divvun.no
    4. Corpus gathering
    5. Corpus infrastructure
    6. Infrastructure
    7. Linguistics
    8. Name lexicon infrastructure
    9. Spellers
    10. Other issues
    11. Summary, task lists
    12. Closing

    1. Opening, agenda review, participants

    Opened at 10:10, continued at 13:02.

    Present: Børre, Maaren, Saara, Sjur, Steinar, Thomas, Tomi, Trond

    Absent: none

    Agenda accepted as is.

    2. Updated task status since last meeting

    Børre

    Maaren

    Saara

    Sjur

    Steinar

    Thomas

    Tomi

    Trond

    3. Documentation

    The open documentation issues fall into these three categories:

    TODO:

    4. Corpus gathering

    The disamb project would like parallel texts relevant for terminological work, i.e. bilingual domain-specific texts. Using our parallel corpus (interface), we can give back a much-requested feature, filling some of the gaps in risten.no, as well as providing data for terminology workers.

    TODO:

    5. Corpus infrastructure

    Alignment

    The aligner output has too many errors. Possible tasks to improve it have been added to the TODO list below.

    TODO

    Conversion issues

    TODO:

    6. Infrastructure

    Steinar has gone through all of the documentation except access to the corpus, written a report, and sent it to Børre. The next step is to go through the report and fix whatever should be fixed.

    TODO:

    7. Linguistics

    Numbers:

    TODO:

    North Sámi

    TODO:

    Lule Sámi

    TODO:

    8. Name lexicon infrastructure

    Decisions made in Tromsø can be found in [this meeting memo](/admin/physical_meetings/tromso-2006-08-propnoun.html).

    lexc2xml conversion bug

    Budej3ju Budeju
    
    Where do we draw the line between "words" and variants belonging to the same
    entry? Cf:
    

    Trondheim ? Trondhjem ? Nidaros (separate entry)

    
    An example where stem variation is just an expression of paradigm variation:
    

    Buckingham^shire:Buckingham^shire3 ACCRA-plc ; !SUB Buckingham^shire:Buckingham^shire ACCRA-plc ;

    Note: the final vowel e3, used for e-stems that NEVER may have illatives

    Buckinghamshire+N+Sg+Ill
    Buckinghamshire+N+Sg+Ill Buckinghamshirii
    Buckinghamshire+N+Sg+Ill Buckinghamshirej <== sic!

    
    In SD-terms, `Budejju` and `Budeju` would be stored as separate entries, but
    linked to each other, and with one marked as a sub of the other. Thus:
    
    
    Budej3ju Comment text Budeju
    
    TODO:
    1. fix bugs in lexc2xml; add comments to the log element (**Saara**)
    1. finish first version of the editing (**Sjur**)
    1. test editing of the xml files. If ok, then: (**Sjur, Thomas, Trond**)
    1. make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as
      well) (the morphological section should be kept intact, in e.g.
      propernoun-sme-morph.txt) (**Sjur, Saara**)
    1. convert propernoun-($lang)-lex.txt to a derived file from common xml files
      (**Sjur, Tomi, Saara**)
    1. implement data synchronisation between [risten.no](http://www.risten.no) and
       the cvs repo, and possibly other servers (i.e. the G5 as an alternative server
       to the public risten.no - it might be faster and better suited than the
       official one; also local installations could be treated the same way)
    1. start to use the xml file as source file
    1. clean terms-sme.xml such that all names have the correct tag for their use
      (e.g. @type=secondary) (**Thomas, Maaren, linguists**)
    1. merge placenames which are erroneously in different entries: e.g. Helsinki,
      Helsingfors, Helsset (**linguists**)
    1. publish the name lexicon on risten.no (**Sjur**)
    1. add missing parallel names for placenames (**linguists**)
    1. add informative links between first names like Niillas and Nils
      (**linguists**)
    
    # 9. Spellers
    
    ## MS Office speller
    
    There are no language codes available for Office 2004, and there won't be any,
    if I interpret the e-mail I just received correctly. It might be added to
    Office 2008 (at least they will try). This is the reply we got from Microsoft:
    
    

    Apologies about the delay. This is indeed a shortcoming of the application. The product team is currently looking at how it can be addressed moving forward.

    For Office 2004, the only global language setting I’m aware of is the one in the Custom tab in the document properties window. There you can set the Name field to “Language” then set the Type to text and enter the value “Saami.” Obviously this doesn’t provide object-model support via language settings.

    I hope this helps a little…

    
    The official *Microsoft* language codes for Sámi languages are:
    

    hex   dec   language-country combo
    ----  ----  ----------------------
    043b  1083  Northern Sami - Norway
    083b  2107  Northern Sami - Sweden
    0c3b  3131  Northern Sami - Finland
    103b  4155  Lule Sami - Norway
    143b  5179  Lule Sami - Sweden
    183b  6203  Southern Sami - Norway
    1c3b  7227  Southern Sami - Sweden
    203b  8251  Skolt Sami - Finland
    243b  9275  Inari Sami - Finland
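The hex and decimal columns above are the same number in two notations. Windows language identifiers pack a primary language ID into the low 10 bits and a sublanguage ID into the bits above it, so all nine codes share the primary ID 0x3b and differ only in the sublanguage. A minimal sketch that checks the table and shows the split (the function names are made up for this illustration):

```python
# Rows copied from the table above: (hex code, decimal code, language - country).
SAMI_CODES = [
    ("043b", 1083, "Northern Sami - Norway"),
    ("083b", 2107, "Northern Sami - Sweden"),
    ("0c3b", 3131, "Northern Sami - Finland"),
    ("103b", 4155, "Lule Sami - Norway"),
    ("143b", 5179, "Lule Sami - Sweden"),
    ("183b", 6203, "Southern Sami - Norway"),
    ("1c3b", 7227, "Southern Sami - Sweden"),
    ("203b", 8251, "Skolt Sami - Finland"),
    ("243b", 9275, "Inari Sami - Finland"),
]

PRIMARY_MASK = 0x3FF  # low 10 bits hold the primary language identifier


def split_langid(hex_code):
    """Return (primary_language_id, sublanguage_id) for a hex language code."""
    value = int(hex_code, 16)
    return value & PRIMARY_MASK, value >> 10


def mismatched_rows(rows):
    """Return the rows whose hex and decimal columns disagree."""
    return [row for row in rows if int(row[0], 16) != row[1]]


if __name__ == "__main__":
    print("rows with inconsistent hex/dec:", mismatched_rows(SAMI_CODES))
    for hex_code, dec, name in SAMI_CODES:
        primary, sub = split_langid(hex_code)
        print(f"{hex_code}  {dec:5d}  primary=0x{primary:02x}  sub={sub}  {name}")
```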

    
    ## OOo speller(s)
    
    TODO after the MS Office Beta is delivered:
    * add Aspell/Hunspell data generation to the lexc2xspell (**Tomi** - after the
      PLX data generation is finished)
    * study Hunspell, perhaps also Soikko (**Børre, Sjur, Tomi**)
    
    ## Testing
    
    ### Precision and recall testing
    
    A testbed has been set up (**Trond**), and some texts are marked for errors and
    corrections (**Steinar**). Versions alpha, beta 0.1 and beta 0.2 have been
    tested.
    
    Types of tests:
    
    1. Technical testing
    1. Testing for linguistic functionality
    1. Testing for lexical coverage
    1. Testing for normativity
    1. Testing the suggestions
    
    **TODO:**
    * get an Intel Mac for Tomi (**Sjur**)
        - not yet
    * Include a testbed and results in the cvs (gt/doc/proof/spelling/testing)
      (**Trond, Børre**)
        - textid - nu_wds - tp - fp - tn - fn - prec - rec - acc - spellid - ref_to_txt
    * Store the tested texts, for reference (**Trond, Børre**)
    
    The tester should identify these 4 values:
    
    * wds - number of words in the text
    * tp - correctly identified errors
    * fp - correctly written but marked as errors
    * fn - errors not marked as such
    
    The spreadsheet will then calculate precision, recall and accuracy. **Steinar**
    has marked some texts like this: each error is followed by a paragraph sign (§)
    and the correct form, as in `makred§marked`. To find precision: pull out the §
    entries and evaluate them for tp and fp. To find recall: remove the § entries
    and count the fn among the remaining words. Then fill in and collect the results.
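As a rough sketch of what the spreadsheet computes from the four values the tester fills in: the class name is hypothetical, and tn is derived under the assumption that every word which is not a tp, fp or fn counts as a true negative.

```python
from dataclasses import dataclass


@dataclass
class SpellerCounts:
    wds: int  # number of words in the text
    tp: int   # correctly identified errors
    fp: int   # correctly written words marked as errors
    fn: int   # errors not marked as such

    @property
    def tn(self):
        # Assumption: all remaining words are correct words left unmarked.
        return self.wds - self.tp - self.fp - self.fn

    def precision(self):
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    def recall(self):
        errors = self.tp + self.fn
        return self.tp / errors if errors else 0.0

    def accuracy(self):
        return (self.tp + self.tn) / self.wds if self.wds else 0.0


if __name__ == "__main__":
    # Invented numbers, purely to show the calculation.
    counts = SpellerCounts(wds=1000, tp=40, fp=10, fn=10)
    print(counts.precision(), counts.recall(), counts.accuracy())
```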
    
    Testing of suggestions should follow the same lines:
    
    * errs - number of errors in the text
    * tp - the intended word is among the suggestions
    * fp - the intended word is not among the suggestions
    * fn - no suggestions
    * tn - (not relevant??)
    
    Ordering of suggestions:
    * place in the list of the intended correction
        - ordered first
        - ordered top-five
        - ordered below top-five
    
    "Perceived Quality", ie for all recognised errors/tp:
    * number of correct suggestions at top
    * number of correct suggestions among top-five
    * number of correct suggestions below top-five
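The suggestion categories above could be counted along these lines. This is only a sketch: the function names are hypothetical, and "top-five" is read here as positions 2 to 5, kept apart from "first".

```python
from collections import Counter


def rank_bucket(intended, suggestions):
    """Classify where the intended correction sits in the suggestion list."""
    if not suggestions:
        return "no suggestions"    # fn in the list above
    if intended not in suggestions:
        return "not suggested"     # fp in the list above
    position = suggestions.index(intended) + 1
    if position == 1:
        return "first"
    if position <= 5:
        return "top-five"
    return "below top-five"


def summarise(cases):
    """Count buckets over (intended_word, suggestion_list) pairs."""
    return Counter(rank_bucket(intended, suggestions)
                   for intended, suggestions in cases)


if __name__ == "__main__":
    cases = [
        ("biilaviessu", ["biilaviessu"]),  # example from the b0.3 notes
        ("marked", []),                    # invented case: no suggestions
    ]
    print(summarise(cases))
```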
    
    ### Testing on unseen texts
    
    We need to use unknown texts in order to measure the performance of the speller.
    
    ### Regression tests
    
    We need to ensure that we do not take steps backwards, i.e. all known spelling
    errors in the corpus should be correctly identified, with a proper suggestion
    among the top five. For this purpose we can use the regular corpus with
    correction markup.
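A regression check of this kind might look as follows. The speller interface (`is_correct`, `suggest`) is a hypothetical stand-in, not the Polderland API, and the toy lexicon exists only to make the sketch runnable.

```python
def regression_failures(known_errors, is_correct, suggest, top_n=5):
    """Return the known errors that the speller no longer handles.

    known_errors: (misspelling, correction) pairs taken from the correction markup.
    is_correct:   callable word -> bool (hypothetical speller interface).
    suggest:      callable word -> list of suggestions (hypothetical).
    """
    failures = []
    for misspelling, correction in known_errors:
        if is_correct(misspelling):
            failures.append((misspelling, correction, "not flagged"))
        elif correction not in suggest(misspelling)[:top_n]:
            failures.append((misspelling, correction, "correction not in top list"))
    return failures


# Toy stand-in for a speller, for illustration only.
LEXICON = {"biilaviessu", "marked"}
SUGGESTIONS = {"biillaviessu": ["biilaviessu"], "makred": ["marked"]}


def toy_is_correct(word):
    return word in LEXICON


def toy_suggest(word):
    return SUGGESTIONS.get(word, [])


if __name__ == "__main__":
    known = [("biillaviessu", "biilaviessu"), ("makred", "marked")]
    print(regression_failures(known, toy_is_correct, toy_suggest))
```

An empty result means no regression; any entry in the list names the error and the reason it failed.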
    
    ### Storing test texts
    
    Test texts should be stored in the corpus catalogue, separated from the ordinary
    corpus files. They should be marked as to whether their unknown words have been
    added to the lexicon or not (in the former case, they cannot be used for testing
    of performance any more, only for regression testing). When the words have been
    added, the whole text can be transferred to the regular corpus repository.
    
    **TODO:**
    * get speller test tool from **Polderland** (**Sjur**)
    * Set up (sub)directories (**Børre, Saara**)
    * Add potential test texts (**Børre, Thomas, Trond, anyone, really**)
    * Manually mark them for typos (making them into gold standards)
      (**Steinar, Maaren**)
    * Format the added texts in appropriate ways - use our existing xml format, with
      correct markup as decided earlier (the only thing that separates these
      documents from regular corpus documents is the directory (tree) in which they
      reside), thus regular corpus conversion tools, plus error markup (**Børre**,
      **Saara**, **Steinar**?)
    * Set up ways of adding meta-information (source info, used in testing or not,
      added to lexicon or not) (**Sjur, Børre**)
    * Set up test record page in gt/doc/proof/spelling/testing/ (**Børre**)
    * Conduct tests on new beta versions on the basis of the unspoiled gold standard
      documents (**whoever has time**), and fill in data from testing (the testers:
      **who?**)
    * alternatively: make test scripts that will run the tests automatically,
      collect the numbers, and transform them into test results (**who?**)
      dependent upon the functionality of the Polderland tools.
    
    * include the ones already tested in the testing/ catalogue
    * test 0.3 on the same texts
    * test each version before beta release
    
    ## The b0.3 / 2007.02.26 version
    
    Subjective impression: With the b0.3 version we have surpassed the alpha
    version as far as useful tools go. **Trond** will now (for the first time)
    migrate from Dutch (Alpha) to Catalan (latest internal beta candidate) in his
    daily work of producing lectures in Sámi.
    
    Writing biillaviessu, ránskkabiila (gen+nom) we get an error, and the suggestion
    biilaviessu (nom+nom). Writing biilaviessu, ránskabiila (nom+nom) we get ok.
    Given this, it seems compounding behaves as it should, i.e. according to the
    default compounding scheme. No entries have received compound override
    tags yet, thus the default behaviour is expected everywhere.
    
    SgNom does not have any L tag - but SgNomCmp does. Thus, `biilaviessu` is NOT
    a fp, it is correctly recognised and corrected, as shown above. The PLX entries
    look as follows:
    
    

    sát^ne^gir^ji NIR +N+Sg+Nom
    sát^ne^gir^ji GaIALR +N+Sg+Gen
    sát^ne^gir^ji GaAL +SgGenCmp

    vies^su NIR
    bii^la NIR +N+Sg+Nom
    biil^la GaIALR
    bii^la NAL +SgNomCmp
    biil^la GaAL

    
    The two `bii^la` entries above in effect create one entry
    `bii^la  NIALR` with the correct compounding info. Thus, basic compounding is
    now working as expected, ie according to the defaults.
    
    Known errors:
    * clitics do not work with W class words (uninflected words)
        - two options:
            - generate these with clitics (increases the word count from 6700 to
              100 000) (**Tomi, Saara**)
            - ask Polderland to look at it - **Sjur** will do that
    
    ## Localisation
    
    We need to translate the info added to our front page (and a separate page)
    regarding the beta release. Also the press release needs to be translated.
    
    TODO:
    * translate beta release docs to `sme` (**Thomas**)
    * translate beta release docs to `smj` (**Thomas**)
    
    ## Lexicon conversion to the PLX format
    
    We need to test that the conversion is correct and gives expected results in all
    cases, especially regarding compounding and derivation. For that we need a small
    set of test entries in lexc format, and the corresponding expected output in PLX
    format. By comparing the actual output with the expected output we get a measure
    of the quality of the conversion.
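One way to compare actual and expected output is sketched below. The file labels and the function name are hypothetical, and entries are compared as whole lines exactly as they are written in this memo, which may not match the real PLX file syntax.

```python
import difflib


def compare_conversion(expected_lines, actual_lines):
    """Return (share of expected entries reproduced, unified diff lines)."""
    expected = [line.strip() for line in expected_lines if line.strip()]
    actual = [line.strip() for line in actual_lines if line.strip()]
    produced = set(actual)
    matched = sum(1 for entry in expected if entry in produced)
    share = matched / len(expected) if expected else 1.0
    diff = list(difflib.unified_diff(expected, actual,
                                     "expected.plx", "actual.plx",
                                     lineterm=""))
    return share, diff


if __name__ == "__main__":
    expected = ["sát^ne^gir^ji NIR", "sát^ne^gir^ji GaIALR", "sát^ne^gir^ji GaAL"]
    actual = ["sát^ne^gir^ji NIR", "sát^ne^gir^ji GaIALR"]  # one entry missing
    share, diff = compare_conversion(expected, actual)
    print(f"{share:.0%} of expected entries produced")
    print("\n".join(diff))
```

Hooked into the make file, a non-empty diff could simply fail the build.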
    
    **TODO:**
    1. add derivations to the PLX generation (**Tomi**)
    1. add prefixes to the PLX (**Børre**)
    1. middle nouns (**Børre**)
    1. make conversion test sample; add conversion testing to the make file
      (**Tomi**)
    1. improve number conversion (**Børre, Tomi**)
    
    ## Public Beta release
    
    Tentative public beta release: after the initial linguistic bugs and poor
    coverage, it is now moved to Thursday 15.3. - this time with derivations and
    numbers included:-)
    
    Internal deadlines:
    * A date for when lexical updates should be checked in, in
      order to make it to the beta.
    * A plan for how many pre-betas we should compile, and when(?)
        - alpha = Dutch (`sme`) + French (`smj`)
        - beta 0.1 = the first Catalan (`sme`) + Basque (`smj`)
        - beta 0.2 = the second Catalan
        - beta 0.3 =  26. or 27.: compound beta
        - beta 0.4 = 2.3.: first derivation beta, also including numbers, prefixes.
        - beta 0.5 = 7.3.: final derivation beta, also including middle nouns
    
    Linguistic issues still open:
    * derivations (**Tomi**)
    * numbers 1-20 (**Børre**)
    * prefixes (eahpe, ii-) (**Børre**)
    * middle nouns (LEXICON: lexc: Rmiddle, plx: L) (**Børre**)
    
    The middle nouns are:  `beai, beal, geaš, oahpaheai, oai, vuol`. They are also
    marginally used initially (not found in the corpus):
    
    

    beai+ShCmp:beai Rreal ; (not used init in our corpus)
    beal+ShCmp:beal Rreal ; (init with Num -goalmmat, -guđát, -nuppi, lexicalized)
    geaš+ShCmp:geaš Rreal ; (not used init in our corpus)
    oahpaheai+ShCmp:oahpaheai Rreal ; init, but then actually 2-part
    oai+ShCmp:oai Rreal ; (not used in corpus init oaivuolli (SUB? yes!)
    vuol+ShCmp:vuol Rreal ; (not used in our corpus)

    
    The **PLX** format does not allow encoding a stem as middle-only. For the public
    beta we will encode them as Left-only (which is really non-right), and evaluate
    their effect on the quality of the speller as we progress.
    
    DONE:
    * delivered PLX data of `sme` and `smj` including compounding
    * translated Windows installer to `sme` and `smj`
    * installed PLX compiler in G5 at `/usr/local/bin/mklex*` (one version for
      `sme` and one for `smj`)
    * added resources needed for compiling PLX lexicons to our cvs repo
    * tested the beta drop from Polderland - good we did, it is absolutely
      unacceptable (our responsibility - only linguistic errors (poor coverage)
      found so far)
    
    TODO:
    * write press release (**Sjur**)
        - done first draft, see `xtdoc/sd/.../xdocs/pr/`
    * add info to front page (incl. download links) (**Børre**)
    * write separate page with detailed info (incl. download links) (**Børre**)
        - **Sjur** wrote a start
    * compile new speller lexicons using the mklex* tools on the G5, following the
      instructions given in the documentation from Polderland (in mklex beta drop)
      (**Tomi, Børre**)
        - done regularly now
    * add compilation of MS Office spellers as part of the Makefile (**Tomi**)
    * install Windows and MS Office; test tools on Windows (**Børre, Thomas**)
    * collect a list of PR recipients, forward to Berit Karen Paulsen
    * questions for Polderland (**Børre**):
        - version info in the speller?
        - remaking/updating the installer packages with linguistic updates - who?
    
    ## Version identification of speller lexicons
    
    See the Norwegian spellers for an example, with the trigger string `tfosgniL`.
    
    Suggestion:
    

    nuvviD -> Divvun
    nuvviD -> Veršuvdna_1.0b1 (based on cvs tag?)
    nuvviD -> 12.2.2007 (automatically generated/added)
    nuvviD -> Sjur_Nørstebø_Moshagen
    nuvviD -> Børre_Gaup
    nuvviD -> Thomas_Omma
    nuvviD -> Maaren_Palismaa
    nuvviD -> Tomi_Pieski
    nuvviD -> Trond_Trosterud
    nuvviD -> Saara_Huhmarniemi
    nuvviD -> Steinar_Nilsen
    nuvviD -> Lene_Antonsen
    nuvviD -> Linda_Wiechetek

    
    These correction rules (and their corresponding PLX entries) should be added
    automatically to the PLX file and the phonetic file as part of the compilation
    process, to include build date and version number.
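A build step could generate these rules roughly as follows. This is a sketch only: the `->` notation mirrors the suggestion above rather than the actual syntax of the PLX or phonetic files, and the function name and parameters are assumptions.

```python
from datetime import date

TRIGGER = "nuvviD"  # "Divvun" reversed, as in the suggestion above


def version_rules(version, build_date, contributors):
    """Return correction rules exposing version, build date and contributors."""
    targets = [
        "Divvun",
        "Veršuvdna_" + version,
        f"{build_date.day}.{build_date.month}.{build_date.year}",
    ]
    # Spaces are replaced by underscores, as in the names listed above.
    targets += [name.replace(" ", "_") for name in contributors]
    return [f"{TRIGGER} -> {target}" for target in targets]


if __name__ == "__main__":
    for rule in version_rules("1.0b1", date.today(), ["Børre Gaup", "Tomi Pieski"]):
        print(rule)
```

The version string would presumably come from the cvs tag, which is left open in the suggestion above.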
    
    ## Conversion from LexC to PLX
    
    

    Adjectives compile at 60 sec/adjective, i.e. (5000 * 60) / 3600 = 83 hrs
    Nouns compile at 3 sec/noun, i.e. (23600 * 3) / 3600 = 19 hrs

    This is so far acceptable for nouns, but on the edge of being unacceptable for adjectives. These times will multiply when we add derivation, meaning we will then need more than a week to convert the major parts of speech from LexC to PLX.

    We need to investigate why adjectives are so slow, and try to improve on the conversion speed.
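The estimates can be re-run as the lexicon or the per-entry time changes. The sketch below reads the figures as 5000 adjectives at 60 seconds each and 23600 nouns at 3 seconds each, which matches the stated 83 and 19 hours. The `slowdown` factor for derivation is a made-up parameter, since no actual factor has been measured.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_WEEK = 7 * 24


def conversion_hours(entries, seconds_per_entry, slowdown=1.0):
    """Estimated LexC-to-PLX conversion time in hours."""
    return entries * seconds_per_entry * slowdown / SECONDS_PER_HOUR


if __name__ == "__main__":
    adjectives = conversion_hours(5000, 60)
    nouns = conversion_hours(23600, 3)
    print(f"adjectives: {adjectives:.1f} h, nouns: {nouns:.1f} h")

    # Hypothetical: how large a slowdown pushes the total past one week?
    total = adjectives + nouns
    print(f"a slowdown above {HOURS_PER_WEEK / total:.2f}x exceeds one week")
```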

    Testing

    Different ways of testing:

    1. Impressionistic, functionality: try the program, try all the functions
    2. Impressionistic, coverage: try the program on different texts, look for false positives
    3. Systematic (in order of importance):
      1. Make a corpus of texts, from different genres (can be done before 0.2 release)
        1. For each text, detect precision
        2. For each text, detect recall
        3. For each text, detect accuracy

    Before beta release: precision is important, but have a look at recall as well.

    Recall and precision

    Definitions:
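Assuming the standard definitions are meant, with tp, fp, tn and fn as used in the testbed columns (true and false positives, true and false negatives):

```latex
\[
\text{precision} = \frac{tp}{tp + fp}, \qquad
\text{recall} = \frac{tp}{tp + fn}, \qquad
\text{accuracy} = \frac{tp + tn}{tp + fp + tn + fn}
\]
```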

    10. Other

    Corpus contracts

    TODO:

    Bug fixing

    57 open Divvun/Disamb bugs, and 23 risten.no bugs

    11. Next meeting, closing

    The next meeting is 5.3.2007, 09:30 Norwegian time.

    The meeting was first closed at 11:05, then finally at 14:55.

    Appendix - task lists for the next week

    Børre

    Maaren

    Saara

    Sjur

    Steinar

    Thomas

    Tomi

    Trond