Language Technology at UiT The Arctic University of Norway

The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages

# Proper noun lexicon

## What?

* (non-)application (speller, disamb, IR, etc.) <appl ...>
* link/merge/group two (or more) IDs => button in the c-record entry edit view (with a search option / multiple-choice list)
* during the initial conversion from lexc, remove ^, # and 0 from the IDs? Only from the center/concept ID; the IDs of the language entries should still contain these characters. This makes the concept IDs easier to search and to use for merging and lookup. Because the characters are kept in the language IDs, we don't lose any information that would otherwise have to be retained somewhere else, so the file size only goes down, never up.
* hide / display ^ during browsing? (not in the entry display)
* future: choose what info to display in the language browser, to keep the number of fields down. This keeps browsing easy even when there is more info for each language, and limits the amount of info shown in any given session. All the info will always be displayed in the entry display.

## How?

* search by language (using the language files, not the common file) - done
* when adding a new lang to an existing record, display the existing languages

## Bugs

* make searches work as expected: Bj* returns Am^bjørg - done using regular expressions: "^Bj*" (without quotes) will match as expected
* change existing records: not yet possible!

## Conversion

* add default smj (sma, nob, nno?) entries
* exclude '^', '#' and '0' from the center IDs
* add an initially empty element as last child of ( elements can then just be appended to the log element); should be added to both the center / concept file and the language files.
* add a `last-update` attribute to the root element, to keep a timestamp on the document; the timestamp should be either an integer of the form `YYYYMMDDHHMMSS` or a datetime string in standard format; the value should be the time of the conversion

```
======= termcenter.xml =========
Before merge:
ORG </entry>
ORG </entry>
After merge:
ORG </entry>
plc </entry>
======= terms-sme.xml =========
<== (today: NIILLAS-plc) (use only one?)
=> ref="Bb" after merge
======= terms-nob.xml =========
=> ref="Bb" after merge
=========================
```
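Two of the conversion steps listed above, stripping `^`, `#` and `0` from the center IDs and stamping the root element with `last-update`, could be sketched roughly as follows. This is only an illustration, not the actual conversion script: the helper names `center_id` and `stamp_root`, the element names `termcenter` and `entry`, the `id` attribute, and the fixed example date are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Characters to drop from center/concept IDs only; language IDs keep them.
STRIP_FROM_CENTER_ID = str.maketrans("", "", "^#0")


def center_id(lexc_id: str) -> str:
    """Return the center/concept ID: the lexc ID without '^', '#' and '0'."""
    return lexc_id.translate(STRIP_FROM_CENTER_ID)


def stamp_root(root: ET.Element, when: datetime) -> None:
    """Record the conversion time on the root element as YYYYMMDDHHMMSS."""
    root.set("last-update", when.strftime("%Y%m%d%H%M%S"))


# Hypothetical root and entry element names, for illustration only.
root = ET.Element("termcenter")
entry = ET.SubElement(root, "entry", {"id": center_id("Am^bjørg")})

# Placeholder date; a real conversion would pass datetime.now().
stamp_root(root, datetime(2006, 1, 2, 15, 4, 5))
print(ET.tostring(root, encoding="unicode"))
```

Note that this removes every literal `0` from the center ID, so it assumes that `0` never occurs as a meaningful digit in a name.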
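The search fix noted under Bugs can be illustrated with Python's `re` module. This is a sketch under assumptions: it supposes the search is a case-insensitive regular-expression search over entry IDs, and the function name `search_ids` and the sample IDs are made up. It also shows that in standard regex syntax the wildcard query `Bj*` corresponds to `^Bj` (or `^Bj.*`), because `j*` on its own means "zero or more j".

```python
import re


def search_ids(pattern: str, ids: list[str]) -> list[str]:
    """Return the IDs in which the regular expression matches anywhere."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [i for i in ids if rx.search(i)]


# Made-up sample IDs for illustration.
ids = ["Am^bjørg", "Bjørn", "Berit"]

unanchored = search_ids("bj", ids)  # also matches inside Am^bjørg
anchored = search_ids("^Bj", ids)   # only IDs that start with "Bj"
loose = search_ids("^Bj*", ids)     # "j*" allows zero j's, so Berit matches too
```

If the search backend treats the query as a plain regular expression, `^Bj` is therefore the safer form to type when only names beginning with "Bj" are wanted.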