The Divvun and Giellatekno teams build language technology aimed at minority and indigenous languages
Present: Berit Merete, Børre, Ciprian, Tomi, Trond
Goal: Functioning corpus
The process has not been run, so we do not have new results.
Run the same routine for nob.
Missing in nob:
converted/nob/admin/guovda/1.doc.xml
/home/apache_corpus/freecorpus/converted/sme/admin/depts/other_files
8.9000 26196 2334 STM200420050011000SE_PDFS.pdf.xml
Rá Rá +?
ehusa ehusa +?
jahkedie jahkedie +?
áhusáššiid áhusáššiid +?
8.4300 30893 2605 STM200420050044000SE_PDFS.pdf.xml
jahkedie jahkedie +?
áhus áhus +?
Rá Rá +?
ádallamat ádallamat +?
8.3300 7320 610 Reindrift_Omraadeprotokoll_til_konvensjon_mellom_Norge_Sverige_Nordsamisk.pdf.xml
7.1500 14438 1033 273777-raportti_saami.pdf.xml
6.0100 57637 3464 OTP200620070025000SE_PDFS.pdf.xml
5.6600 1535 87 faktablad_nordsamiska_wordversion.doc.xml
4.5900 8931 410 260965-h-2179s_2.pdf.xml
4.4800 3325 149 sami_rapporter_bruk_samisk_flagg_SA.pdf.xml
4.4700 18766 840 203210-q-1066_samisk_lav.pdf.xml
4.4600 3874 173 sami_rapport_sametinget_vedlegg4_SA.pdf.xml
/home/apache_corpus/freecorpus/converted/sme/admin/depts/regjeringen.no
30.4900 341 104 130-000-ruvnnu-kvena-proeavttaide.html_id=573764.xml
Rejeerinki Rejeerinki +?
anttaa anttaa +?
rahhaa rahhaa +?
Porsangin Porsangin +?
kolmekieliselle kolmekieliselle +?
laulukirjale laulukirjale +?
26.6600 30 8 plakater-til-valgdagen.html_id=575739.xml
26.6600 15 4 neahttakarta-.html_id=313865.xml
25.0000 12 3 neahttakarta.html_id=223274.xml
24.2400 33 8 nytt-og-nytting.html_id=544857.xml
23.5200 17 4 neahttakarta-.html_id=313868.xml
23.2500 43 10 neahttakarta-.html_id=313744.xml
22.8500 35 8 gulaskuddannotahtta.html_id=588787.xml
22.8500 35 8 adreassalistu.html_id=588788.xml
22.2200 18 4 ohcanveahkki-.html_id=446705.xml
21.8700 32 7 forskrifter.html_id=623.xml
21.4200 42 9 julebesok-til-oslo-fengsel.html_id=629537.xml
/home/apache_corpus/freecorpus/converted/sme/admin/guovda
50.0000 10 5 GUOVDAGEAINNU_NUORAIDSKUVLLA_OAHPAHEDDJIID_PLÁKÁHTTA.doc.xml
33.3300 12 4 GUOVDAGEAINNU_NUORAIDSKUVLLA_OHPPIID_PLÁKÁHTTA.doc.xml
29.9500 227 68 KS_áššelistu_24.06.2004.doc.xml
13.8400 65 9 Gártnetluohkka_ÁRVVOŠTALLANSKOVVI_22.04.03.doc.xml
12.3100 138 17 Bajasdoallansiehtadus_FKB-data_Guovdageainnu_suohkanis_05.05.05.doc.xml
10.3800 10409 1081 1_2.doc.xml
8.3500 431 36 vinterskole.doc.xml
8.3100 493 41 Sakspapirer_på_samisk_31.10.03.doc.xml
8.1300 209 17 MEAHCCESKUVLA.doc.xml
7.9600 427 34 Mearraskuvla.doc.xml
/home/apache_corpus/freecorpus/converted/sme/admin/others
15.6800 1326 208 uito-ohpenplana.txt.xml
15.1500 66 10 Reglement_Djupvik_havn.doc.xml
13.0800 107 14 VÁLGADIKKI.doc.xml
10.6700 637 68 skuterløyer_2006.doc.xml
9.4300 53 5 valgalistut_almmuhus.doc.xml
8.9200 112 10 SKJEMA___AMBULLERENDE.doc.xml
8.8800 45 4 Oversetting,_følgebrev.doc.xml
7.3600 95 7 UTBETALINGSANMODNING.doc.xml
7.1700 237 17 RETN.LINJER___KULTUR.doc.xml
7.0500 85 6 Reguleringsplan.doc.xml
/home/apache_corpus/freecorpus/converted/sme/admin/sd/other_files
38.1800 6270 2394 dc1990-4.pdf.xml
26.0400 14338 3734 satnelistu.doc.xml
25.3600 138 35 stedsnavn4.doc.xml
20.7600 6592 1369 dc1991-2.pdf.xml
15.0600 9294 1400 dč1994-2.pdf.xml
14.7100 9357 1377 dc1990-3.pdf.xml
14.2600 1311 187 dc1993-3.pdf.xml
13.9800 12240 1712 dc1990-1.pdf.xml
13.3600 11341 1516 dč1994-1.pdf.xml
13.1200 160 21 64547_1_P.doc.xml
/home/apache_corpus/freecorpus/converted/sme/admin/sd/samediggi.no
40.0000 5 2 samediggi-article-788.html.xml
40.0000 5 2 samediggi-article-315.html.xml
40.0000 5 2 samediggi-article-227.html.xml
40.0000 5 2 samediggi-article-225.html.xml
27.5500 196 54 samediggi-article-2933.html.xml
25.0000 8 2 samediggi-article-3179.html.xml
21.7300 23 5 samediggi-article-3114.html.xml
20.0000 35 7 samediggi-article-3217.html.xml
18.5100 27 5 samediggi-article-3451.html.xml
17.5700 165 29 samediggi-article-3683.html.xml
17.0800 158 27 samediggi-article-2738.html.xml
16.3900 61 10 samediggi-article-505.html.xml
16.0000 25 4 samediggi-article-2485.html.xml
Tomi will look into this and discuss any unclear points with Børre.
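A minimal sketch of how the listing above could be read back in for further checking. It assumes that each numeric row is "percentage, total words, unknown words, file name" (the numbers are consistent with that reading, with the percentage apparently truncated to two decimals), and that a line starting with "/" names the directory for the rows that follow. The function names `parse_report` and `recomputed_percent` are made up for this illustration and are not part of the existing tools.

```python
import re
from collections import defaultdict

# Assumed row layout: "<percent> <total words> <unknown words> <file name>"
ROW = re.compile(r"^(\d+\.\d+)\s+(\d+)\s+(\d+)\s+(\S.*)$")


def parse_report(lines):
    """Group numeric rows under the most recent directory line."""
    report = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("/"):
            current = line
            continue
        match = ROW.match(line)
        # Lines such as "Rá Rá +?" (unknown word samples) are skipped here.
        if match and current is not None:
            pct, total, unknown, name = match.groups()
            report[current].append((float(pct), int(total), int(unknown), name))
    return dict(report)


def recomputed_percent(total, unknown):
    """Share of unknown words, as a percentage of all words."""
    return 100.0 * unknown / total


sample = [
    "/home/apache_corpus/freecorpus/converted/sme/admin/guovda",
    "50.0000 10 5 GUOVDAGEAINNU_NUORAIDSKUVLLA_OAHPAHEDDJIID_PLÁKÁHTTA.doc.xml",
    "Rá Rá +?",
    "10.3800 10409 1081 1_2.doc.xml",
]
report = parse_report(sample)
for directory, rows in report.items():
    for pct, total, unknown, name in rows:
        # Print the reported and the recomputed percentage side by side.
        print(f"{pct:8.4f} {recomputed_percent(total, unknown):8.4f} {name}")
```

Sorting the parsed rows by total word count rather than by percentage might help separate large, badly converted documents from very short files, where a handful of unknown words already gives a high percentage.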
TODO
Ultimate goal:
[dstroke]
[dstr juoga
oke]
I found this error yesterday:
<tuv lang="sme-NO">
<seg>Sámediggi gávnnaha 1unddo1ažžan ahte fy1kagielda váldá oasi giellanjuolggadusaid ovttastahttimii ja di1álašvuodaid 1 áhčimiidda gielddain mat gu11et doaibmaguv1ui Finnmárkku fylkkas .</seg>
</tuv>
And this:
<tuv lang="sme-NO">
<seg>Dan lassin lea bálkkašumi vuoiti vuođđudan alccesis duodjefitnodaga , ja lea máhtolašvuođainis ja hutkái ¬vuođainis ožžon alla árvvu duodjeealáhusas .</seg>
</tuv>
And this - đ is missing:
<tuv lang="sme-NO">
<seg>daid ektui , ja ahte gielddat ieža oidnet dárbbu doallat aktiivvalaš oktavuo a Sámedikkiin go galgá bargat kulturhistorjjá sihkkarastimiin , duo aštemiin , dutkamiin ja gaskkustemiin .</seg>
</tuv>
Same error - đ is missing:
<tuv lang="sme-NO">
<seg>Orru ahte dán gealdagasas dat lea sámi kultuvra vuoittahallan ja ahte eiseválddiid dáiddaáŋgiruššamat vuo uduvvojit minoritehtakultuvrra siskilkeahtesvuhtii .</seg>
</tuv>
Son !! boahtán (!! pro ii)
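Some of the error types above can be caught with simple surface checks before the text reaches the analyser. The sketch below is only a heuristic under stated assumptions: it flags tokens where a digit sits next to a letter (as in "fy1kagielda") and tokens containing "¬". It does not catch the missing đ, since those cases only leave a gap in the word and would need the analyser or a word list. The function name `suspicious_tokens` is hypothetical.

```python
import re

# A letter directly followed or preceded by a digit, e.g. "fy1kagielda".
# [^\W\d_] matches any Unicode letter in Python 3 string patterns.
LETTER_DIGIT = re.compile(r"[^\W\d_]\d|\d[^\W\d_]")


def suspicious_tokens(segment):
    """Return (token, reason) pairs for tokens that look like conversion errors."""
    hits = []
    for token in segment.split():
        if LETTER_DIGIT.search(token):
            hits.append((token, "digit inside word"))
        elif "¬" in token:
            hits.append((token, "stray ¬ character"))
    return hits


# Short excerpts from the segments quoted above.
for text in ("ahte fy1kagielda váldá oasi", "hutkái ¬vuođainis ožžon"):
    for token, reason in suspicious_tokens(text):
        print(f"{token}\t{reason}")
```

The digit check will also flag legitimate tokens that mix letters and digits (file names, case numbers), so the output is meant as a list of candidates for manual review, not as a filter.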
Status quo:
This message shows up when converting orig, and the issue is still open.
Trond has talked to Knut Hofland. We will get a new TCA2 version.
TODO
Trond made an anchor.fst that unfortunately turned out to be flawed. A new one is finished and looks OK, but it has not yet been tested or checked in. The open question is whether to take nob or sme as the starting point.
TODO
Has anyone checked the output? No.
The cronjob did this
TODO
Make sure we have a fresh version on Thursday. (Børre)
Error report, have a look:
tmp/STM200720080028000DDDPDFS.pdf.log:Conversion failed: Couldn't convert /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200720080028000DDDPDFS.pdf to intermediate xml format
tmp/STM200820090039000DDDPDFS.pdf.log:Conversion failed: Couldn't convert /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200820090039000DDDPDFS.pdf to intermediate xml format
tmp/STM200820090043000DDDPDFS.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/STM200820090043000DDDPDFS.pdf
tmp/Samiske_tall_forteller_3_NO.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/Samiske_tall_forteller_3_NO.pdf
tmp/Samiske_tall_forteller_II_Norsk.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/Samiske_tall_forteller_II_Norsk.pdf
tmp/retningslinjerforverneplanarbeid_sametinget.pdf.log:Conversion failed: Wasn't able to categorize the language(s) inside the text /Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/retningslinjerforverneplanarbeid_sametinget.pdf
drwxr-xr-x 4 cipriangerstenberger staff 136 7 apr 22:54 orig
drwxr-xr-x 201 cipriangerstenberger staff 6834 11 apr 13:29 tmp
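To get an overview of the error report, the failure lines can be grouped by reason. The sketch below assumes the line format seen above ("<log file>:Conversion failed: <message>") and knows only the two messages that occur in this report; the names `failure_reason` and `summarize` are invented for this example.

```python
import re
from collections import Counter

# Assumed format: "<path to .log>:Conversion failed: <message>"
LINE = re.compile(r"^(?P<log>[^:]+\.log):Conversion failed: (?P<message>.*)$")


def failure_reason(message):
    """Map a failure message to a short category."""
    if message.startswith("Couldn't convert"):
        return "conversion to intermediate xml failed"
    if message.startswith("Wasn't able to categorize the language"):
        return "language could not be categorized"
    return "other"


def summarize(lines):
    """Count failures per category; lines in another format are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            counts[failure_reason(match.group("message"))] += 1
    return counts


sample = [
    "tmp/STM200720080028000DDDPDFS.pdf.log:Conversion failed: Couldn't convert "
    "/Users/cipriangerstenberger/extra_gtsvn/new_fad/orig/nob/admin/depts/"
    "other_files/STM200720080028000DDDPDFS.pdf to intermediate xml format",
    "tmp/Samiske_tall_forteller_3_NO.pdf.log:Conversion failed: Wasn't able to "
    "categorize the language(s) inside the text /Users/cipriangerstenberger/"
    "extra_gtsvn/new_fad/orig/nob/admin/depts/other_files/"
    "Samiske_tall_forteller_3_NO.pdf",
]
counts = summarize(sample)
for reason, number in counts.items():
    print(f"{number}\t{reason}")
```

Applied to the six lines of the report above, this grouping would give two failed conversions to intermediate xml and four documents whose language could not be categorized.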