Meeting setup
- Date: 18.10.2010
- Time: 09.30 Norw. time
- Place: Internet
- Tools: SubEthaEdit, iChat
Agenda
- Sma issues
- Forthcoming finsme dictionary
- Info topics
Cf. one of the following, depending on context:
- the upper bar of the SEE window (provided you use the JSPWiki syntax mode)
- the TOC in Forrest-rendered output, like HTML and PDF
Opening, agenda review, participants
Sma issues
Need to correct dictionary against speller
Wrong dict forms in speller? There are misspellings in the smanob dictionary,
and the same misspellings often turn up in the speller as well. Check that
misspellings in the dictionary are not replicated in the speller.
The fst and the dictionary show different Sg3 forms (the Sg3 forms, marked
p3p, are taken from the Hemnes dictionary). The p3p forms missing in the fst
have now been manually marked with XXX in the dictionary.
The XXX marks can mean:
- there is a difference between analyser and p3p or wordclass info (often comes from smaswe)
- there are several forms given by the analyser, one of which can correspond to
the p3p, and that means that we have to mark more than one verb class
in smanob (if both are correct) - how do we do that?
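The comparison behind the XXX marks could be sketched roughly as below. This is only an illustration: the function name, the status labels and the placeholder forms are assumptions, not part of the existing dictionary or fst tooling, and the "missing" case is added here for completeness.

```python
def p3p_status(p3p, analyser_forms):
    """Classify one dictionary verb (hypothetical helper).

    p3p            -- the Sg3 form recorded in the dictionary
    analyser_forms -- set of Sg3 forms the analyser gives for the lemma
    """
    if not analyser_forms:
        return "missing"      # the analyser gives no Sg3 form at all
    if p3p not in analyser_forms:
        return "mismatch"     # analyser and p3p disagree
    if len(analyser_forms) > 1:
        return "ambiguous"    # several analyser forms, one matches p3p
    return "ok"


# Placeholder strings only, not real sma forms.
print(p3p_status("formA", {"formA", "formB"}))
```

Entries that come out as "mismatch" or "ambiguous" correspond to the two meanings of XXX listed above.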
Status quo for normativity
The letter on loan words was sent to SGM on Friday. We also sent a letter
in June-July last year. Are there any unresolved normativity issues now? Not that we
are aware of.
How shall Divvun and Gt cooperate on the sma work during what is left of this year?
Lene has a list of dictionary verbs not in the analyser. During this work some
issues have turned up:
- missing verbs (and other words)
- misspellings and errors in both the transducer and smanob (cf above)
Way of work:
- Read through smanob and look for errors (adj, noun)
- Check whether these errors are in the normative analyser
- mark problematic words as non-includes
- reverse sorting the fst lexicon, look at continuation lexicon
- unify lemma forms for all fst entries (i.e. multiple entries for the same word
should have the same baseform/lemma)
- sort and unique fst entries - double forms in sma lexicon shall be unified
- learn typical error patterns from the dictionary correction work
(-e vs -ie as stems, -a- vs -aa- or -ie- vs. -ïe- as root vowel),
grep similar patterns out of the fst code, and proofread them
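The sorting and pattern steps in the list above could look roughly like this. It is a minimal sketch assuming a simplified one-entry-per-line format of the kind "lemma CONTLEX ;"; real lexc entries are richer, and the function names and placeholder entries are hypothetical.

```python
def lemma_of(entry):
    """Return the first whitespace-separated field of an entry line."""
    return entry.split()[0]


def sort_unique_reversed(entries):
    """Drop duplicate lines, then sort by the reversed lemma so that
    entries with the same ending end up next to each other."""
    unique = {e.strip() for e in entries if e.strip()}
    return sorted(unique, key=lambda e: (lemma_of(e)[::-1], e))


def with_ending(entries, ending):
    """Pick out the entries whose lemma ends in the given string."""
    return [e for e in entries if lemma_of(e).endswith(ending)]


# Placeholder entries only, not real sma words or continuation lexica.
entries = ["bbbe V1 ;", "aaie V2 ;", "bbbe V1 ;", "ccce V1 ;"]
for line in sort_unique_reversed(entries):
    print(line)
```

With the reverse sort, neighbouring entries that share an ending but point to different continuation lexica are easy to spot by eye, and with_ending can pull out candidate lists for the -e vs -ie check.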
The three i-s of sma - is this true?
- i as i
- ï as i and ï (klinïgke / klinigke)
- ï shall not be i (but is made so via spellrelax)
Before we implement it we shall find out whether the second category
exists.
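One way to look for candidates for the second category is to search a word list for forms that differ only in i vs ï. The sketch below assumes a plain list of word forms as input; the function name is hypothetical, and the only real forms used are the pair mentioned above.

```python
import unicodedata
from collections import defaultdict


def i_variants(words):
    """Group words that become identical when ï is folded to i,
    and return only the groups with more than one spelling."""
    groups = defaultdict(set)
    for word in words:
        # Normalise so that precomposed and decomposed ï compare equal.
        word = unicodedata.normalize("NFC", word)
        groups[word.replace("ï", "i")].add(word)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


print(i_variants(["klinïgke", "klinigke", "abc"]))
```

Each group returned is a word attested with both spellings; whether both spellings are correct is still a question for the linguists.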
TODO:
- sort and unique nav entries in sma (Trond)
- discuss the possible third i in sma (Maja, Thomas)
- go through the XXX in the dictionary verbs
(Maja, Marit, Sissel, Thomas, Toini)
- sma linguist meeting - date & time to be decided
(Lene, Maja, Marit, Sissel, Thomas, Toini, Trond)
Forthcoming finsme dictionary
We do not quite know when it comes, but what shall we do with it when it comes?
The dictionary is made by KOTUS (fin-sme-fin).
If there is money: could Johanna Ijäs do something with this?
Info topics
A new programmer in the sma-oahpa project
Ryan for 2 months.
Cip's journey to Iceland
Here are my very minimalistic notes on tools:
- corpus_search tool (try it!) -
[http://corpussearch.sourceforge.net/]
- DATR (vs. XFST try it!) -
[http://www.informatics.sussex.ac.uk/research/groups/nlp/datr/]
- GLOSSA (Anders Nøklestad - GLOSSA is FLOSS, use it)
- [http://www.hf.uio.no/iln/tjenester/sprak/glossa/index.html]
- [http://tekstlab.uio.no/glossa//html/GLOSSA_manual.html]
- CLAN (Carnegie Mellon University) (try it) -
[http://childes.psy.cmu.edu/clan/]
- MOLTO (MT open source Aarne Ranta on GF) -
[http://www.molto-project.eu/]
- CORD (corpora historical English) -
[http://www.helsinki.fi/varieng/CoRD/index.html]
- TVÄRSLÅ (the Scandinavian lexicon is accessible for single-word
queries on the net, but not available for developers) -
[http://ordbok.nada.kth.se/]
The amount of work for installing GLOSSA here is noticeable, yes, but
it will make data updating easier. Also, we might more easily be able to
tailor the interface to our needs.
The OBT uses the Oslo fullform list for morphological analysis, with
a separate component for compound analysis. For syntax, they use CG.
In Bergen they are going to implement a separate corpus interface,
corpuscle.
Summing up the sma-oahpa week
- Aajege was in Tromsø with 3 people 11-15.10
- they have collected 2500 words for the Oahpa lexicon from textbooks - now unified with
smanob
- this resulted in 900 new lemmas in the smanob dictionary for nouns, verbs and
adjectives (the other word classes have not been counted)
- Aajege works on the Oahpa lemmas directly in smanob: they improve translations
(which were taken straight from the textbooks in no particular order), and add
semantic sets and possibly info on direction in Leksa - in part they replace the
original smanob translation
- we must find out at which level we add translations to other languages in
the same lexicon (sme, swe). The consequence will be that we will have a sma file
instead of smanob, smaswe etc. and a nob file instead of nobsma, nobsme…
- we are starting to become friends with XMLmind
- alpha MorfaC, beta MorfaS and beta Leksa shall be finished during
the coming weeks, and be tested on adult course participants 19-21.11 - by then
a new version of ÅD should also be compiled
- some new functions are being added to Leksa -> they will also be included in smeOahpa:
- pronouns
- translation comments
- net dictionary included in the Oahpa interface
- the next work meeting is in Røros 22.11 (Trond and Lene go there); in addition
we use iChat meetings - Thursdays after lunch
Whitespace and empty element diffs:
- <book name="s4" />
+ <book name="s4"/>
- <stem class="trisyllabic"></stem>
+ <stem class="trisyllabic"/>
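Until the editor issue is fixed, diffs of this kind could be filtered out by comparing canonical forms of the XML rather than the raw text. The sketch below uses canonicalize from Python's standard library (Python 3.8 or later); the helper name same_xml is an assumption, and this is not how the current diffs are produced.

```python
from xml.etree.ElementTree import canonicalize


def same_xml(a, b):
    """Return True if two XML snippets are identical after canonicalisation.

    Canonical XML writes every empty element as a start/end tag pair and
    ignores whitespace inside tags, so the two kinds of diff shown above
    disappear.
    """
    return canonicalize(a) == canonicalize(b)


print(same_xml('<book name="s4" />', '<book name="s4"/>'))
print(same_xml('<stem class="trisyllabic"></stem>', '<stem class="trisyllabic"/>'))
```

Whitespace inside text content is still treated as significant here, so real changes to translations or lemmas would still show up as differences.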
TODO:
- compile new version of ÅD (Ciprian)
- fix the XMLMind XML Editor whitespace issue (Sjur)