The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
Tuesday afternoon
Conversion from original to internal format: manually add meta-information on writer, mainlang, etc. to the xsl file. Manually (and temporarily) edit the int file, adding language information in multilingual documents. Run xmldiff on the result and add its output to the xsl file.
Problem with xmldiff: the document structure of the int file could change, rendering the generated references in the xmldiff part of the xsl file invalid. Typically, xmldiff output contains XPath expressions such as:
/doc/section[2]/p[3]
If a new version of antiword or our DocBook2GT generates a different structure, the p[n]-type references will identify the wrong node, or nothing at all.
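The fragility of positional references can be reproduced in a few lines. The following is a minimal sketch, not part of our tool chain: the two sample documents and the `lookup` helper are invented for illustration, and Python's standard `xml.etree.ElementTree` stands in for whatever applies the xmldiff output.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of the converter for some document.
OLD = """<doc>
  <section><p>Intro</p></section>
  <section>
    <p>First</p>
    <p>Second</p>
    <p>Third, in another language</p>
  </section>
</doc>"""

# Same source document, but an imagined newer converter version
# emits one extra <p> at the start of the second section.
NEW = """<doc>
  <section><p>Intro</p></section>
  <section>
    <p>Heading</p>
    <p>First</p>
    <p>Second</p>
    <p>Third, in another language</p>
  </section>
</doc>"""

# An imagined version that produces fewer paragraphs.
SHORT = """<doc>
  <section><p>Intro</p></section>
  <section><p>First</p></section>
</doc>"""


def lookup(xml_text, path):
    """Return the text of the node the path points at, or None if there is none."""
    node = ET.fromstring(xml_text).find(path)
    return None if node is None else node.text


# ElementTree paths are relative to the root element,
# so /doc/section[2]/p[3] is written without the leading /doc.
PATH = "section[2]/p[3]"

old_hit = lookup(OLD, PATH)      # the paragraph that was annotated by hand
new_hit = lookup(NEW, PATH)      # a different paragraph: the wrong node
short_hit = lookup(SHORT, PATH)  # no node at all
```

The same path selects the intended paragraph in the first document, a different paragraph in the second, and nothing in the third, which is exactly the failure described above.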
Solution to this problem: version control of the intermediate xml file. There are two reasons for this: 1) the manual addition of language information, 2) new generations of the internal xml file. After this is done, apply the xsl file and generate the final document.
We then started to look for free software for automatic language recognition, and found some alternatives. We picked TextCat, a Perl-based tool released under a GPL license. It is now installed in gt/scripts/, and rudimentary user documentation can be found on our intranet.
What is left to do now is to integrate language recognition with the corpus processing, adding language information automatically to at least the document (if it is missing or incorrect), but preferably to all elements that contain text in a language different from the main language of the document.
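The integration step could look roughly like the sketch below. It rests on several assumptions: the language is stored in an `xml:lang` attribute (the internal format may use something else), the recognizer is passed in as a plain function `guess_language` (a stand-in for a call to TextCat, whose interface is not shown here), and `toy_guess` is a deliberately naive placeholder, not a real recognizer.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree sees it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tag_languages(root, guess_language, main_lang=None):
    """Mark the document language, then mark each text element that deviates from it."""
    if main_lang is None:
        # Fall back to an existing attribute, then to guessing from the whole text.
        main_lang = root.get(XML_LANG) or guess_language("".join(root.itertext()))
    root.set(XML_LANG, main_lang)

    for elem in root.iter():
        if elem is root:
            continue
        text = (elem.text or "").strip()
        if not text:
            continue  # nothing to classify
        lang = guess_language(text)
        if lang != main_lang:
            elem.set(XML_LANG, lang)
    return root


def toy_guess(text):
    """Placeholder recognizer: calls anything containing the word 'og' Norwegian."""
    return "nob" if " og " in f" {text.lower()} " else "sme"


doc = ET.fromstring(
    "<doc><p>Mun lean Guovdageainnus.</p><p>Jeg bor og jobber her.</p></doc>"
)
tag_languages(doc, toy_guess, main_lang="sme")
```

Only the second paragraph receives its own language attribute; elements in the main language are left untouched, so the markup stays small for mostly monolingual documents.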
The name lexicon meeting turned into a very productive meeting, ending in a new classification for names, as well as a draft XML document. The new classification is:
Each name can have more than one of these classes.
We made an xml fragment illustrating the suggested document format. The fragment is found at:
gt/common/src/proper-nouns.xml
and contains the following:
<?xml version='1.1' encoding="UTF-8"?>
<!DOCTYPE propndict PUBLIC "-//DIVVUN//DTD Proper Noun Dictionary V1.0//EN"
"http://www.divvun.no/dtd/prop-noun-dict-v10.dtd">
<propndict>
  <entry id="Guovdageaidnu">
    <sem>
      plc sur
    </sem>
    <form lg="sme">
      <cnt>DEATNU</cnt>
    </form>
    <form lg="smj">
      <lm>Guovdageajdno</lm>
    </form>
    <form lg="sma">
      <lm>Goevtegeajno</lm>
    </form>
    <form lg="fin">
      <lm>Koutokeino</lm>
    </form>
    <form lg="eng nno nob swe">
      <lm>Kautokeino</lm>
    </form>
  </entry>
  <entry id="India">
    <sem>
      fem plc
    </sem>
    <form lg="sme">
      <cnt>ACCRA</cnt>
    </form>
    <form lg="smj">
      <cnt>AKKRA</cnt>
    </form>
    <form lg="sma">
      <cnt>IE_PNOUN</cnt>
    </form>
    <form lg="fin">
      <lm>Intia</lm>
      <cnt>14</cnt>
      <sem>plc</sem>
    </form>
    <form lg="fin">
      <lm>India</lm>
      <cnt>14</cnt>
      <sem>fem</sem>
    </form>
    <form lg="eng nno nob swe"/>
  </entry>
</propndict>
Note:
The entries above are not necessarily correct; their purpose is to illustrate how to handle some different cases.
The following should be noted regarding the above XML document and its interpretation:
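To show how such a file might be consumed, here is a small sketch that reads one entry and collects its lemmas per language. It assumes that the `lg` attribute is a space-separated list of language codes and that `sem` holds space-separated class tags, as the sample suggests; the `forms_by_language` helper and the shortened `SAMPLE` string are made up for this illustration, and forms without an `lm` element are simply skipped.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed excerpt of the fragment above (declaration and DOCTYPE omitted).
SAMPLE = """<propndict>
  <entry id="Guovdageaidnu">
    <sem>
      plc sur
    </sem>
    <form lg="sme"><cnt>DEATNU</cnt></form>
    <form lg="smj"><lm>Guovdageajdno</lm></form>
    <form lg="fin"><lm>Koutokeino</lm></form>
    <form lg="eng nno nob swe"><lm>Kautokeino</lm></form>
  </entry>
</propndict>"""


def forms_by_language(entry):
    """Map each language code to the list of lemmas given for it in this entry."""
    result = {}
    for form in entry.findall("form"):
        lemma = form.findtext("lm")
        if lemma is None:
            continue  # form without <lm>: left out of this sketch
        # One <form> may serve several languages at once.
        for lang in form.get("lg", "").split():
            result.setdefault(lang, []).append(lemma.strip())
    return result


root = ET.fromstring(SAMPLE)
entry = root.find("entry")

classes = entry.findtext("sem", default="").split()  # class tags of the entry
forms = forms_by_language(entry)                     # lemmas keyed by language
```

A list is kept per language because one language can have several forms with different classes, as the two `fin` forms in the second entry show.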
The languages we considered were:
Others could easily be added later.
The next steps to be taken are:
Not covered.
Not covered.
We discussed project plans and planning a lot, as well as the possible tools to use. We tried or looked at a few, and settled on Merlin. It has nice OS X integration, but also drawbacks, the main one being that it really is a single-user application. This makes it hard to share and access project files and information among several users, which limits the usefulness of the application quite a lot. Nevertheless, the pros outweigh the cons, at least compared to the alternatives.
There’s an online Handbook for Merlin, as well as a pdf version.
Real project plans are (finally!) forthcoming:-)
We did not discuss Árran.