
Commit 14b5ec0

Article update
1 parent 42d0e42 commit 14b5ec0

13 files changed (+147, -431 lines)

article/accuracies.csv

+24-24
@@ -1,28 +1,28 @@
 Epoch,CNN 1,CNN 2,CNN L,CNN L2,CNN w/o P,CNN N,CNN N 2,CNN L N,CNN L N 2,LSTM
-1,0.113865115473319,0.114842598051554,0.114195771601993,0.115835972718203,0.124521020683575,0.116527369560072,0.117429342790644,0.118438665735091,0.12033569284938,0.670462112879332
-2,0.046573883373152,0.046952832947328,0.047154250968127,0.046973550386671,0.052795819635434,0.047942750374486,0.048351091373045,0.048382771515501,0.048883984032243,0.668534556036643
-3,0.032694267109229,0.033033151428673,0.033280165905021,0.033360172860326,0.038818476028614,0.033720806391147,0.034038842824658,0.034184350584684,0.034354573703034,0.668378072623554
-4,0.026167214350533,0.026028367156286,0.026535042628588,0.026626830707364,0.031159712122058,0.026795169315774,0.02708861149642,0.027360685260013,0.027410148472623,0.668301192784179
-5,0.022008203308508,0.021994135071392,0.022488486767629,0.022683917405557,0.026208359567099,0.022613827432933,0.022897579075835,0.023170538136169,0.023339582982562,0.668256849272082
-6,0.019333275395343,0.019188345427759,0.019652754424256,0.019871564157376,0.022846316828775,0.019852554645879,0.020091619331483,0.020402554235748,0.020472278672618,0.66822136060947
-7,0.017191578073764,0.017174622671695,0.017636345444039,0.017785418009972,0.02025711226732,0.017702349902885,0.017976854370235,0.018332694952508,0.01830498856786,0.6681710468547
-8,0.015669044972837,0.015712366488998,0.016035391425753,0.016230992533413,0.018312343279831,0.016024948659686,0.016308253284692,0.016584884904732,0.01666422469114,0.668147657806964
-9,0.014462196761079,0.014294683938217,0.014831616148017,0.014970849858079,0.016752462491008,0.014812545204204,0.015090807492492,0.015456409141034,0.015448172509144,0.668118449522211
-10,0.013438037172303,0.01330412186394,0.013755708385421,0.013918037426764,0.01540361210404,0.013804012233179,0.014075514188161,0.014411296319643,0.01446253188623,0.668098743781558
-11,0.012614994064857,0.012388132844903,0.012907409961913,0.013092883495141,0.014325385593043,0.013042634371017,0.013200233379072,0.013567152115073,0.013424532527562,0.668066170005666
-12,0.011771994479459,0.011713160402408,0.012180195351373,0.012280231713345,0.013272779247436,0.012203698733701,0.012486865174958,0.012674308327806,0.012784990722253,0.668039920722751
-13,0.011200418432657,0.011081268179252,0.01155810675541,0.011789928899615,0.012566024011806,0.011771388267325,0.01185007607857,0.012110650314653,0.012181608975067,
-14,0.010673281670367,0.010509989099465,0.010956866740381,0.011073622622577,0.011777548090822,0.011125323825938,0.011269465597529,0.011554662297737,0.011558063195284,
-15,0.010155257868354,,0.010511980868521,0.010671981508129,0.011111998167427,0.010563416314357,0.010783351513197,0.01112991053508,0.01114224524391,
-16,0.009772817370016,,0.010126495669208,0.010249616529669,0.010586049911754,0.010161953321671,0.010339107201585,0.010276295343639,0.010683886237,
-17,0.009413273869357,,0.009674179078031,,0.010068297018136,0.009834481921024,0.009995451586674,0.009653073300594,0.010201461724811,
-18,,,0.009365777759049,,0.009601207305955,0.009462615105118,,0.009206381508441,0.009953611758449,
-19,,,0.009100519493314,,0.009212811686055,0.008768952755244,,0.008868569803463,0.00950575169014,
-20,,,,,0.008849915140144,,,0.008619247401446,0.009227716098966,
-21,,,,,0.008547240577846,,,0.008581858224226,0.008976117437202,
-22,,,,,0.008132688840701,,,,0.008787673159773,
-23,,,,,0.007695368266619,,,,0.008245814139646,
-24,,,,,0.007401325128144,,,,,
+1,0.113865115473319,0.114842598051554,0.114195771601993,0.115835972718203,0.124521020683575,0.116527369560072,0.117429342790644,0.118438665735091,0.12033569284938,0.17801982850368
+2,0.046573883373152,0.046952832947328,0.047154250968127,0.046973550386671,0.052795819635434,0.047942750374486,0.048351091373045,0.048382771515501,0.048883984032243,0.138875722616842
+3,0.032694267109229,0.033033151428673,0.033280165905021,0.033360172860326,0.038818476028614,0.033720806391147,0.034038842824658,0.034184350584684,0.034354573703034,0.129460469536855
+4,0.026167214350533,0.026028367156286,0.026535042628588,0.026626830707364,0.031159712122058,0.026795169315774,0.02708861149642,0.027360685260013,0.027410148472623,0.124081683395905
+5,0.022008203308508,0.021994135071392,0.022488486767629,0.022683917405557,0.026208359567099,0.022613827432933,0.022897579075835,0.023170538136169,0.023339582982562,0.120446059375332
+6,0.019333275395343,0.019188345427759,0.019652754424256,0.019871564157376,0.022846316828775,0.019852554645879,0.020091619331483,0.020402554235748,0.020472278672618,0.117896886024935
+7,0.017191578073764,0.017174622671695,0.017636345444039,0.017785418009972,0.02025711226732,0.017702349902885,0.017976854370235,0.018332694952508,0.01830498856786,0.115843590591404
+8,0.015669044972837,0.015712366488998,0.016035391425753,0.016230992533413,0.018312343279831,0.016024948659686,0.016308253284692,0.016584884904732,0.01666422469114,0.11436728350412
+9,0.014462196761079,0.014294683938217,0.014831616148017,0.014970849858079,0.016752462491008,0.014812545204204,0.015090807492492,0.015456409141034,0.015448172509144,0.113044516415113
+10,0.013438037172303,0.01330412186394,0.013755708385421,0.013918037426764,0.01540361210404,0.013804012233179,0.014075514188161,0.014411296319643,0.01446253188623,0.112150318365572
+11,0.012614994064857,0.012388132844903,0.012907409961913,0.013092883495141,0.014325385593043,0.013042634371017,0.013200233379072,0.013567152115073,0.013424532527562,0.111504298775433
+12,0.011771994479459,0.011713160402408,0.012180195351373,0.012280231713345,0.013272779247436,0.012203698733701,0.012486865174958,0.012674308327806,0.012784990722253,0.110764644605603
+13,0.011200418432657,0.011081268179252,0.01155810675541,0.011789928899615,0.012566024011806,0.011771388267325,0.01185007607857,0.012110650314653,0.012181608975067,0.110169274810619
+14,0.010673281670367,0.010509989099465,0.010956866740381,0.011073622622577,0.011777548090822,0.011125323825938,0.011269465597529,0.011554662297737,0.011558063195284,0.110021979940934
+15,0.010155257868354,,0.010511980868521,0.010671981508129,0.011111998167427,0.010563416314357,0.010783351513197,0.01112991053508,0.01114224524391,0.109612528576851
+16,0.009772817370016,,0.010126495669208,0.010249616529669,0.010586049911754,0.010161953321671,0.010339107201585,0.010276295343639,0.010683886237,0.109364766782617
+17,0.009413273869357,,0.009674179078031,,0.010068297018136,0.009834481921024,0.009995451586674,0.009653073300594,0.010201461724811,0.109138422803464
+18,,,0.009365777759049,,0.009601207305955,0.009462615105118,,0.009206381508441,0.009953611758449,0.108897648882189
+19,,,0.009100519493314,,0.009212811686055,0.008768952755244,,0.008868569803463,0.00950575169014,0.108934719837909
+20,,,,,0.008849915140144,,,0.008619247401446,0.009227716098966,0.108827389706558
+21,,,,,0.008547240577846,,,0.008581858224226,0.008976117437202,0.108966476859621
+22,,,,,0.008132688840701,,,,0.008787673159773,0.10890695855371
+23,,,,,0.007695368266619,,,,0.008245814139646,0.109034213741739
+24,,,,,0.007401325128144,,,,,0.109217857399867
 25,,,,,0.007038912817985,,,,,
 26,,,,,0.00690626405045,,,,,
 27,,,,,0.006711490155905,,,,,
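The change above replaces the LSTM column, which previously plateaued around 0.668, with a normally decreasing loss curve. A minimal sketch of sanity-checking the updated column (using the Epoch and LSTM values copied from the first ten `+` rows of the diff; the inline snippet and variable names are illustrative, not part of the repository):

```python
import csv
from io import StringIO

# Epoch and LSTM columns of the updated accuracies.csv,
# copied from the "+" side of the diff above (first ten rows).
CSV_SNIPPET = """Epoch,LSTM
1,0.17801982850368
2,0.138875722616842
3,0.129460469536855
4,0.124081683395905
5,0.120446059375332
6,0.117896886024935
7,0.115843590591404
8,0.11436728350412
9,0.113044516415113
10,0.112150318365572
"""

# Parse the snippet and extract the LSTM loss curve.
rows = list(csv.DictReader(StringIO(CSV_SNIPPET)))
lstm_loss = [float(r["LSTM"]) for r in rows]

# The corrected column decreases monotonically over these epochs,
# unlike the old values that were stuck near 0.668.
assert all(a > b for a, b in zip(lstm_loss, lstm_loss[1:]))
print(f"LSTM loss: {lstm_loss[0]:.3f} -> {lstm_loss[-1]:.3f} over {len(lstm_loss)} epochs")
```

Running it prints `LSTM loss: 0.178 -> 0.112 over 10 epochs`, confirming the curve now behaves like the CNN columns.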

article/article.bib

+8-8
@@ -238,7 +238,7 @@ @misc{jean_baptiste_camps_2019_2630574
 Cochet, A. and
 Ing, L. and
 Levêque, P.},
-title = {{Jean-Baptiste-Camps/Geste: Geste: un corpus de
+title = {{Jean-Baptiste-Camps/Geste: Geste: un corpus de
 chansons de geste, 2016-…}},
 month = apr,
 year = 2019,
@@ -254,13 +254,13 @@ @misc{germanica
 }
 
 @misc{edh,
-author = { Witschel, C. and
-Alföldy, G. and
-Cowey, J. M.S. and
-Feraudi-Gruénais, F. and
-Gräf, B. and
-Grieshaber (IT), F. and
-Klar, R. and
+author = { Witschel, C. and
+Alföldy, G. and
+Cowey, J. M.S. and
+Feraudi-Gruénais, F. and
+Gräf, B. and
+Grieshaber (IT), F. and
+Klar, R. and
 Osnabrügge, J. and },
 title = {{Epigraphic Database Heidelberg}},
 year = 2019,

article/article.pdf

2.08 KB
Binary file not shown.

article/article.tex

+15-14
@@ -9,8 +9,8 @@
 
 \title{Evaluating Deep Learning Methods for Tokenization of Space-less texts in Old French and Latin}
 \author[1]{Thibault Clérice}
-\affil[1]{École nationale des Chartes, France}
-\affil[2]{Université Lyon 3, France}
+\affil[1]{École nationale des Chartes, France}
+\affil[2]{Université Lyon 3, France}
 
 \corrauthor{Thibault Clérice}{thibault.clerice@chartes.psl.eu}
 
@@ -126,19 +126,20 @@ \subsubsection{Results}
 \begin{tikzpicture}
 \begin{axis}[
 width=\linewidth, % Scale the plot to \linewidth
-grid=major,
+grid=major,
 grid style={dashed,gray!30},
 xlabel=Epoch, % Set the labels
 ylabel=Accuracy,
 legend style={at={(0.5,-0.2)},anchor=north},
 x tick label style={rotate=90,anchor=east}
 ]
-\addplot table[x=Epoch,y=CNN 1,col sep=comma] {accuracies.csv};
-\addplot table[x=Epoch,y=CNN L,col sep=comma] {accuracies.csv};
-\addplot table[x=Epoch,y=CNN w/o P,col sep=comma] {accuracies.csv};
-\addplot table[x=Epoch,y=CNN N,col sep=comma] {accuracies.csv};
-\addplot table[x=Epoch,y=CNN L N,col sep=comma] {accuracies.csv};
-\legend{CNN, CNN L, CNN P, CNN N, CNN L N}
+\addplot table[x=Epoch,y=CNN 1,col sep=comma] {accuracies.csv};
+\addplot table[x=Epoch,y=CNN L,col sep=comma] {accuracies.csv};
+\addplot table[x=Epoch,y=CNN w/o P,col sep=comma] {accuracies.csv};
+\addplot table[x=Epoch,y=CNN N,col sep=comma] {accuracies.csv};
+\addplot table[x=Epoch,y=CNN L N,col sep=comma] {accuracies.csv};
+\addplot table[x=Epoch,y=LSTM,col sep=comma] {accuracies.csv};
+\legend{CNN, CNN L, CNN P, CNN N, CNN L N, LSTM}
 \end{axis}
 \end{tikzpicture}
 \caption{Training Loss (Cross-entropy) until plateau was reached. N = normalized, L = Lower, P = no position embedding. LSTM was removed as it did not go below 0.65}
@@ -157,7 +158,7 @@ \subsubsection{Results}
 CNN P & \textbf{0.993} & \textbf{0.990}& \textbf{0.991} & \textbf{0.990} & 2432 & \textbf{2114} \\
 CNN N & 0.991 & 0.987 & 0.988 & 0.988 & 2756 & 3312 \\
 CNN L N & 0.992 & 0.988 & 0.989 & 0.988 & 2500 & 3567 \\
-LSTM & 0.741 & 0.184 & 0.500 & 0.269 & 169094 & 0 \\ \hline
+LSTM & 0.939 & 0.637 & 0.918 & 0.720 & 21174 & 18662 \\ \hline
 \end{tabular}
 \caption{Scores over the test dataset. \\\hspace{\textwidth}For models: N = normalized, L = Lower, P = no position embedding. \\\hspace{\textwidth}In headers, FN = False Negative, FP = False Positive}
 \label{tab:scores}
@@ -270,7 +271,7 @@ \subsubsection{Medieval Latin corpora}
 \textbf{Example:}
 
 \begin{itemize}
-\item Input : nonparvamremtibi
+\item Input : nonparvamremtibi
 \item Output : non parvam rem tibi
 
 \end{itemize}
@@ -313,21 +314,21 @@ \subsubsection{Latin epigraphic corpora}
 \textbf{Example:}
 
 \begin{itemize}
-\item Input : DnFlClIuliani
+\item Input : DnFlClIuliani
 \item Output : D n Fl Cl Iuliani
 \end{itemize}
 
 \subsection{Discussion}
 
 Aside from a graphical challenge, word segmentation in OCR from manuscripts can actually be treated as a NLP task. Word segmentation for some text can be even difficult for humanist, as shown by the manuscript sample, and as such, it seems that the post-processing of OCR through tools like this one can be a better way to achieve data-mining of raw datasets.
 
-The negligible effects of the different normalization methods (lower-casing; ASCII reduction; both) were surprising. The presence of certain MUFI characters might provide enough information about segmentation and be of sufficient quantity for them not to impact the network weights.
+The negligible effects of the different normalization methods (lower-casing; ASCII reduction; both) were surprising. The presence of certain MUFI characters might provide enough information about segmentation and be of sufficient quantity for them not to impact the network weights.
 
 While the baseline performed unexpectedly well on the test corpus, the CNN model definitely performed better on a completely unknown corpus. In this context, the proposed model actually shows its ability to carry over unknown corpora in a better way than classical n-gram approaches. In light of the high accuracy of the CNN model, the model should perform the same way independently of the language in Medieval Western Europe,.
 
 \subsection{Conclusion}
 
-Achieving 0.99 accuracy on word segmentation with a corpus as large as 25,000 test samples seems to be the first step for a more thorough data mining of OCRed manuscript. Given the results, studying the importance of normalization and lowering should probably be a further step, as it might be of high influence in smaller corpora.
+Achieving 0.99 accuracy on word segmentation with a corpus as large as 25,000 test samples seems to be the first step for a more thorough data mining of OCRed manuscript. Given the results, studying the importance of normalization and lowering should probably be a further step, as it might be of high influence in smaller corpora.
 
 \subsection{Acknowledgements}
 
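The Input/Output examples quoted in the article.tex diff (`nonparvamremtibi` → `non parvam rem tibi`) describe the word-segmentation task the article evaluates. A minimal sketch of that task format, framing segmentation as per-character boundary labels; the function names and the 0/1 label scheme are illustrative assumptions, not the article's actual implementation:

```python
def make_example(gold: str):
    """Turn a segmented string into a (space-less input, boundary labels) pair.

    Each character of the space-less input gets a label: 1 if a word
    boundary follows it, 0 otherwise. This is a common framing of the
    task; the article's exact label scheme may differ.
    """
    words = gold.split()
    text = "".join(words)
    labels = []
    for w in words:
        labels.extend([0] * (len(w) - 1) + [1])
    return text, labels


def resegment(text: str, labels) -> str:
    """Reinsert a space after every character labelled 1."""
    out = []
    for ch, lab in zip(text, labels):
        out.append(ch)
        if lab:
            out.append(" ")
    return "".join(out).strip()


# Round-trip the Medieval Latin example from the diff above.
text, labels = make_example("non parvam rem tibi")
assert text == "nonparvamremtibi"
assert resegment(text, labels) == "non parvam rem tibi"
```

A model then only has to predict the label sequence; recovering the segmented text is the deterministic `resegment` step.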
article/chars.csv

-195
This file was deleted.

article/code_excerpt.py

-9
This file was deleted.

0 commit comments
