article/article.tex (+33 -41)
@@ -2,6 +2,9 @@
\usepackage[utf8]{inputenc}
\usepackage{array}
\usepackage{pgfplots}
+\usepackage{tabularx}
+\newcolumntype{C}{>{\arraybackslash}X} % centered "X" column

\title{Evaluating Deep Learning Methods for Tokenization of Space-less texts in Old French}
\author[1]{Thibault Clérice}
@@ -24,11 +27,10 @@
\section{Introduction}

% To Read: Stutzmann article.
-% Talk more about the orthographic variation of the language
Tokenization of space-less strings is a task that is specifically difficult for computers when compared to "whathumancando". \textit{Scripta continua} is a writing practice in which words are not separated by spaces; it appears to have disappeared around the 8th century (see \citet{zanna1998lecture}). Nevertheless, spacing can remain somewhat erratic in the writings of later centuries, as shown by Figure \ref{fig:4lines}, a document from the 13th century. In the context of text mining of HTR or OCR output, lemmatization and tokenization of medieval Western languages can be a pre-processing step for further research, sustaining analyses such as authorship attribution \textbf{CITE JBCAMPS ?}.
@@ -48,7 +50,7 @@ \subsubsection{Encoding of input and decoding}
The model is based on traditional text input encoding where each character is transcoded to an index. The output of the model is a mask that needs to be applied to the input: in the mask, characters are classified either as word boundary or word content (\textit{cf.} Table \ref{lst:input_output_example}).
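To make the masking scheme concrete, the following minimal sketch (in Python, with illustrative names rather than Boudams' actual API) shows how a space-less input can be encoded as indices and how a predicted boundary mask can be decoded back into a segmented string.

\begin{verbatim}
# Illustrative sketch only: names and mask conventions are ours,
# not Boudams' actual implementation.
WC, WB = 0, 1  # word content vs. word boundary classes

def encode(text, char2idx):
    # each character of the space-less input becomes an integer index
    return [char2idx[c] for c in text]

def decode(text, mask):
    # re-insert a space after every character classified as a boundary
    out = []
    for char, label in zip(text, mask):
        out.append(char)
        if label == WB:
            out.append(" ")
    return "".join(out).strip()

# decode("ladamehaitee", [0,1,0,0,0,1,0,0,0,0,0,1]) -> "la dame haitee"
\end{verbatim}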
-\begin{table}
+\begin{table}[!ht]
\centering
\begin{tabular}{@{}ll@{}}
\hline
@@ -63,7 +65,7 @@ \subsubsection{Encoding of input and decoding}
For evaluation purposes, and to reduce the number of input classes, we propose two options for data transcoding: a lower-case normalization and a "reduction to the ASCII character set" feature (\textit{cf.} Figure \ref{fig:normalization}). On this point, many issues were found with the transliteration of medieval paleographic characters present in the original datasets, as they are badly interpreted by the \texttt{unidecode} python package: indeed, \texttt{unidecode} simply removes characters it does not understand. I built a secondary package named \texttt{mufidecode} (\citet{thibault_clerice_2019_3237731}), which takes precedence over the unidecode equivalency tables when the data is known to the Medieval Unicode Font Initiative (MUFI, \citet{mufi}).
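As an illustration, a minimal sketch of the two transcoding options follows; the \texttt{mufidecode} call signature is an assumption based on the package description rather than a verified API.

\begin{verbatim}
# Sketch of the two normalization options described above.
# The mufidecode() signature is assumed (MUFI tables first,
# unidecode-style fallback); check the package documentation.
from mufidecode import mufidecode

def normalize(text, lower=False, to_ascii=False):
    if lower:
        text = text.lower()
    if to_ascii:
        # plain unidecode.unidecode() would silently drop unknown
        # MUFI glyphs instead of transliterating them
        text = mufidecode(text)
    return text
\end{verbatim}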
-\begin{figure}
+\begin{figure}[!ht]
\centering
\includegraphics[width=\linewidth]{carbon.png}
\caption{Different pre-processing possibilities. The option with join=False was kept, as it keeps abbreviations marked as single characters. Note how \texttt{unidecode} loses the P WITH BAR.}
@@ -96,14 +98,14 @@ \subsubsection{Datasets}
The input was generated by grouping at least 2 and at most 8 words per sample. With a probability of 0.2, a noise character could be added (the noise character was set to the DOT ('.')), and, with a probability of 0.3, some words were randomly carried over from one sample to the next, with a maximum of 1 word kept. If a minimum size of 7 characters was not met by the input sample, another word was added to the chain. A maximum input size of 100 characters was kept. The resulting corpora vary in sample size, as shown by Figure \ref{fig:word_sizes}. The corpora contain 193 different characters when not normalized, among which some MUFI characters appear only a few hundred times (\textit{cf.} Table \ref{tab:mufi_examples}).
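The following sketch illustrates the sample-generation procedure with the parameters given above; variable names and control flow are ours, not the exact Boudams implementation.

\begin{verbatim}
import random

# Illustrative sketch of the sample generation described above.
MIN_WORDS, MAX_WORDS = 2, 8
NOISE_CHAR, NOISE_P = ".", 0.2   # dot added with probability 0.2
KEEP_P = 0.3                     # carry-over probability (at most one word)
MIN_CHARS, MAX_CHARS = 7, 100

def make_samples(words):
    kept = []                    # words carried over from the previous sample
    i = 0
    while i < len(words):
        sample = list(kept)
        sample += words[i:i + random.randint(MIN_WORDS, MAX_WORDS)]
        i += len(sample) - len(kept)
        # add words until the minimum sample size is reached
        while len(" ".join(sample)) < MIN_CHARS and i < len(words):
            sample.append(words[i])
            i += 1
        if random.random() < NOISE_P:
            sample.insert(random.randrange(len(sample) + 1), NOISE_CHAR)
        kept = [random.choice(sample)] if random.random() < KEEP_P else []
        yield " ".join(sample)[:MAX_CHARS]
\end{verbatim}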
-\begin{figure}
+\begin{figure}[!ht]
\centering
\includegraphics[width=\linewidth]{length.png}
\caption{Distribution of word size over the train, dev and test corpora}
\label{fig:word_sizes}
\end{figure}

-\begin{table}
+\begin{table}[!ht]
\begin{tabular}{llll}
\hline
& Train dataset & Dev dataset & Test dataset \\ \hline
@@ -119,23 +121,23 @@ \subsubsection{Results}
The training used a learning rate of 0.00005 for each CNN model and 0.001 for the LSTM one, with a batch size of 64. Training reached a plateau fairly quickly for each model (\textit{cf.} Figure \ref{fig:loss}). Every model except the LSTM reached a very low loss and a high accuracy on the test set (\textit{cf.} Table \ref{tab:scores}).
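For concreteness, the hyper-parameters above could be set up as follows; the optimizer choice (Adam) and the PyTorch framing are assumptions made purely for illustration, as the text does not specify them.

\begin{verbatim}
import torch

# Reported hyper-parameters; the optimizer choice is an assumption.
BATCH_SIZE = 64
LEARNING_RATES = {"cnn": 5e-5, "lstm": 1e-3}

def make_optimizer(model, kind="cnn"):
    return torch.optim.Adam(model.parameters(), lr=LEARNING_RATES[kind])
\end{verbatim}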
\addplot table[x=Epoch,y=CNN L N,col sep=comma] {accuracies.csv};
\legend{CNN, CNN L, CNN P, CNN N, CNN L N}
\end{axis}
\end{tikzpicture}
@@ -144,7 +146,7 @@ \subsubsection{Results}
\end{center}
\end{figure}

-\begin{table}
+\begin{table}[!ht]
\centering
\begin{tabular}{lllll}
\hline
@@ -162,31 +164,31 @@ \subsubsection{Results}
\subsubsection{Example of outputs}

-The following inputs has been tagged with the CNN P model. Batch are constructed around the regex \texttt{\\W} with package \texttt{regex}. This explains why inputs such as \texttt{".i."} are automatically tagged as \texttt{" . i . "} by the tool. The input was stripped of its spaces before tagging, we only show the ground truth by commodity.
-
-\begin{itemize}
-\item\texttt{Truth :} Aies joie et leesce en ton cuer car tu auras une fille qui aura .i. fil qui sera de molt grant merite devant Dieu et de grant los entre les homes.
-\item\texttt{CNN :} Aiesjoie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\item\texttt{CNN lower:} Aies joie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\item\texttt{CNN without position:} Aiesjoie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\item\texttt{CNN Normalize:} Aies joie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\end{itemize}
+The following inputs have been tagged with the CNN P model. Batches are constructed around the regular expression \texttt{\\W} with the package \texttt{regex}. This explains why inputs such as \texttt{".i."} are automatically tagged as \texttt{" . i . "} by the tool. The input was stripped of its spaces before tagging; we show the ground truth only for convenience (see the short sketch after Table \ref{tab:example_output}).
+Aies joie et leesce en ton cuer car tu auras une fille qui aura .i. fil qui sera de molt grant merite devant Dieu et de grant los entre les homes.Conforte toi et soies liee car tu portes en ton ventre .i. fil qui son lieu aura devant Dieu et qui grant honnor fera a toz ses parenz. & Aies joie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes . Confort e toi et soies liee car tu portes en ton ventre . i . fil qui son lieu aura devant Dieu et qui grant honnor fera a toz ses parenz .
+\\\hline
+\end{tabularx}
+\caption{Output examples on a text from outside the dataset}
+\label{tab:example_output}
+\end{table}
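As referenced above, here is a minimal sketch of that pre-processing; the function names are ours, but \texttt{regex.split} is used as described in the text.

\begin{verbatim}
import regex

# Sketch of the pre-processing described above: the ground truth is
# split on non-word characters (so ".i." becomes " . i . ") and
# spaces are stripped from the input before tagging.
def ground_truth(text):
    return " ".join(t for t in regex.split(r"(\W)", text) if t.strip())

def model_input(text):
    return text.replace(" ", "")

# ground_truth("qui aura .i. fil") -> "qui aura . i . fil"
# model_input("qui aura .i. fil")  -> "quiaura.i.fil"
\end{verbatim}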
\subsection{Discussion}
-We believe that, aside from a graphical challenge, word segmentation in OCR from manuscripts can actually be treated from a text point of view
+We believe that, aside from being a graphical challenge, word segmentation in OCR of manuscripts can also be treated from a textual point of view, as an NLP task. Word segmentation of some texts can be difficult even for humanists, and as such we believe that post-processing OCR output with tools like Boudams can be a better way to enable data mining of such datasets. In light of the high accuracy of the model, we believe it should perform equally well independently of the language, within Medieval Western Europe.
-% "The purpose of the discussion is to interpret and describe the significance of your findings in light of what was already known about the research problem being investigated and to explain any new understanding or insights that emerged as a result of your study of the problem. The discussion will always connect to the introduction by way of the research questions or hypotheses you posed and the literature you reviewed, but the discussion does not simply repeat or rearrange the first parts of your paper; the discussion clearly explain how your study advanced the reader's understanding of the research problem from where you left them at the end of your review of prior research."
+We were surprised by the negligible effect of the different normalization methods (lower-casing; ASCII reduction; both). The presence of certain MUFI characters might provide enough information about segmentation, and they might occur in sufficient numbers not to skew the network weights.
\subsection{Conclusion}
-While
-% The conclusion is intended to help the reader understand why your research should matter to them after they have finished reading the paper. A conclusion is not merely a summary of the main topics covered or a re-statement of your research problem, but a synthesis of key points and, if applicable, where you recommend new areas for future research. For most college-level research papers, one or two well-developed paragraphs is sufficient for a conclusion, although in some cases, three or more paragraphs may be required.
+Achieving 0.99 accuracy on word segmentation with a corpus as large as 25,000 test samples seems to be a first step toward more extensive data mining of OCRed manuscripts. As an afterthought, we wonder whether normalization and lower-casing would matter more depending on the size of the corpora and their content.
-\subsection{Acknowledgement}
+\subsection{Acknowledgements}
Boudams was made possible by two open-source repositories from which I learned and reused bits of the implementation of certain modules, and without which this paper would not have been possible: \citet{enrique_manjavacas_2019_2654987} and \citet{bentrevett}. This tool was originally intended for post-processing OCR for the presentation \citet{pinchecampsclerice} at DH2019 in Utrecht.