
Commit fbb68de

First draft of the article
1 parent ce1394c commit fbb68de

7 files changed: +37 / -48 lines

article/article.pdf

49.5 KB
Binary file not shown.

article/article.tex

+33-41
@@ -2,6 +2,9 @@
 \usepackage[utf8]{inputenc}
 \usepackage{array}
 \usepackage{pgfplots}
+\usepackage{tabularx}
+\newcolumntype{C}{>{\centering\arraybackslash}X} % centered "X" column
+

 \title{Evaluating Deep Learning Methods for Tokenization of Space-less texts in Old French}
 \author[1]{Thibault Clérice}
@@ -24,11 +27,10 @@
 \section{Introduction}

 % To Read : Stutzmann article.
-% Say more about the orthographic variation of the language

 Tokenization of space-less strings is a task that is particularly difficult for computers when compared to "whathumancando". \textit{Scripta continua} is a writing practice in which words are not separated by spaces; it appears to have disappeared around the 8th century (see \citet{zanna1998lecture}). Nevertheless, spacing can be somewhat erratic in later centuries' writings, as shown by Figure \ref{fig:4lines}, a document from the 13th century. In the context of text mining of HTR or OCR output, lemmatization and tokenization of medieval western languages can be a pre-processing step that sustains further analyses such as authorship attribution \textbf{CITE JBCAMPS ?}.

-\begin{figure}
+\begin{figure}[!ht]
 \centering
 \includegraphics[width=\linewidth]{4-lines-p0215.png}

@@ -48,7 +50,7 @@ \subsubsection{Encoding of input and decoding}

 The model is based on traditional text input encoding: each character is transcoded to an index. The output of the model is a mask to be applied to the input: in this mask, each character is classified either as a word boundary or as word content (\textit{cf.} Table \ref{lst:input_output_example}).

-\begin{table}
+\begin{table}[!ht]
 \centering
 \begin{tabular}{@{}ll@{}}
 \hline
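To make the encode/decode step above concrete, here is a minimal sketch of the idea in Python, not the actual Boudams implementation: characters are mapped to indices, and a predicted boundary/content mask is applied back to the raw input to restore spaces.

# Minimal sketch of the encoding/decoding idea described above, not Boudams' actual code.
WORD_BOUNDARY = 1  # a space should follow this character
WORD_CONTENT = 0   # this character belongs to the current word


def build_vocab(corpus):
    # Map every character seen in the corpus to an integer index.
    return {char: idx for idx, char in enumerate(sorted(set("".join(corpus))))}


def encode(text, vocab):
    # Transcode each character of the space-less input to its index.
    return [vocab[char] for char in text]


def decode(text, mask):
    # Apply the predicted mask to the raw input to restore word boundaries.
    out = []
    for char, label in zip(text, mask):
        out.append(char)
        if label == WORD_BOUNDARY:
            out.append(" ")
    return "".join(out).strip()


vocab = build_vocab(["aiesjoie"])
print(encode("aiesjoie", vocab))                     # e.g. [0, 2, 1, 5, ...]
print(decode("aiesjoie", [0, 0, 0, 1, 0, 0, 0, 1]))  # "aies joie"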
@@ -63,7 +65,7 @@ \subsubsection{Encoding of input and decoding}

 For evaluation purposes, and to reduce the number of input classes, we propose two options for data transcoding: a lower-case normalization and a "reduction to the ASCII character set" feature (\textit{cf.} Figure \ref{fig:normalization}). On this point, many issues were found with the transliteration of medieval paleographic characters that were part of the original datasets, as they are badly handled by the \texttt{unidecode} Python package: \texttt{unidecode} simply removes characters it does not understand. I built a secondary package named \texttt{mufidecode} (\citet{thibault_clerice_2019_3237731}) which takes precedence over unidecode's equivalency tables when the character is known to the Medieval Unicode Font Initiative (MUFI, \citet{mufi}).

-\begin{figure}
+\begin{figure}[!ht]
 \centering
 \includegraphics[width=\linewidth]{carbon.png}
 \caption{Different possibilities of pre-processing. The option with join=False was kept, as it keeps abbreviations marked as single characters. Note how \texttt{unidecode} loses the P WITH BAR}
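As a hedged illustration of the issue just described, the snippet below contrasts the two packages on a MUFI abbreviation character; it assumes that the mufidecode package exposes a top-level mufidecode() function, and the exact transliterations depend on the package versions used.

# Hedged illustration only: outputs depend on package versions, and the
# mufidecode() entry point is assumed rather than quoted from its documentation.
from unidecode import unidecode
from mufidecode import mufidecode

source = "\ua751er"  # U+A751, a 'p with stroke' abbreviation, plus plain letters

print(unidecode(source))   # per the article, the unknown MUFI character is dropped
print(mufidecode(source))  # MUFI-aware transliteration keeps a trace of the abbreviation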
@@ -96,14 +98,14 @@ \subsubsection{Datasets}

 The input was generated by grouping between 2 and 8 words per sample. With a probability of 0.2, a noise character could be added (the noise character was set to DOT ('.')), and, with a probability of 0.3, at most one word was randomly carried over from one sample to the next. If a minimum size of 7 characters was not met in the input sample, another word was added to the chain. A maximum input size of 100 characters was kept. The resulting corpora vary in sample sizes, as shown by Figure \ref{fig:word_sizes}. The corpora contain 193 different characters when not normalized, among which some MUFI characters appear only a few hundred times (Table \ref{tab:mufi_examples}).

-\begin{figure}
+\begin{figure}[!ht]
 \centering
 \includegraphics[width=\linewidth]{length.png}
 \caption{Distribution of word size over the train, dev and test corpora}
 \label{fig:word_sizes}
 \end{figure}

-\begin{table}
+\begin{table}[!ht]
 \begin{tabular}{llll}
 \hline
 & Train dataset & Dev dataset & Test dataset \\ \hline
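The generation procedure described above can be summarized by the following sketch; the names and the exact control flow are illustrative, not the paper's actual generation code.

import random

NOISE_CHAR = "."   # noise character mentioned above
NOISE_PROBA = 0.2  # probability of appending the noise character
CARRY_PROBA = 0.3  # probability of carrying at most one word over to the next sample
MIN_WORDS, MAX_WORDS = 2, 8
MIN_CHARS, MAX_CHARS = 7, 100


def make_samples(words):
    # Illustrative only: groups words into samples following the stated parameters.
    samples, carried = [], []
    while words:
        sample, carried = carried, []
        target = random.randint(MIN_WORDS, MAX_WORDS)
        # Keep adding words until the word target and the 7-character minimum are met.
        while words and (len(sample) < target or len("".join(sample)) < MIN_CHARS):
            sample.append(words.pop(0))
        if random.random() < NOISE_PROBA:
            sample.append(NOISE_CHAR)
        if random.random() < CARRY_PROBA:
            carried = [sample[-1]]  # at most one word reused in the next sample
        ground_truth = " ".join(sample)[:MAX_CHARS]
        samples.append((ground_truth.replace(" ", ""), ground_truth))
    return samples


print(make_samples("aies joie et leesce en ton cuer car tu auras une fille".split()))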
@@ -119,23 +121,23 @@ \subsubsection{Results}

 The training parameters were a learning rate of 0.00005 for each CNN model and 0.001 for the LSTM one, with a batch size of 64. Training reached a plateau fairly quickly for each model (\textit{cf.} Figure \ref{fig:loss}). Every model except the LSTM reached a very low loss and a high accuracy on the test set (\textit{cf.} Table \ref{tab:scores}).

-\begin{figure}
+\begin{figure}[!ht]
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
 width=\linewidth, % Scale the plot to \linewidth
-grid=major,
+grid=major,
 grid style={dashed,gray!30},
 xlabel=Epoch, % Set the labels
 ylabel=Accuracy,
 legend style={at={(0.5,-0.2)},anchor=north},
 x tick label style={rotate=90,anchor=east}
 ]
-\addplot table[x=Epoch,y=CNN 1,col sep=comma] {accuracies.csv};
-\addplot table[x=Epoch,y=CNN L,col sep=comma] {accuracies.csv};
-\addplot table[x=Epoch,y=CNN w/o P,col sep=comma] {accuracies.csv};
-\addplot table[x=Epoch,y=CNN N,col sep=comma] {accuracies.csv};
-\addplot table[x=Epoch,y=CNN L N,col sep=comma] {accuracies.csv};
+\addplot table[x=Epoch,y=CNN 1,col sep=comma] {accuracies.csv};
+\addplot table[x=Epoch,y=CNN L,col sep=comma] {accuracies.csv};
+\addplot table[x=Epoch,y=CNN w/o P,col sep=comma] {accuracies.csv};
+\addplot table[x=Epoch,y=CNN N,col sep=comma] {accuracies.csv};
+\addplot table[x=Epoch,y=CNN L N,col sep=comma] {accuracies.csv};
 \legend{CNN, CNN L, CNN P, CNN N, CNN L N}
 \end{axis}
 \end{tikzpicture}
@@ -144,7 +146,7 @@ \subsubsection{Results}
 \end{center}
 \end{figure}

-\begin{table}
+\begin{table}[!ht]
 \centering
 \begin{tabular}{lllll}
 \hline
@@ -162,31 +164,31 @@ \subsubsection{Results}

 \subsubsection{Example of outputs}

-The following inputs has been tagged with the CNN P model. Batch are constructed around the regex \texttt{\\W} with package \texttt{regex}. This explains why inputs such as \texttt{".i."} are automatically tagged as \texttt{" . i . "} by the tool. The input was stripped of its spaces before tagging, we only show the ground truth by commodity.
-
-\begin{itemize}
-\item \texttt{Truth :} Aies joie et leesce en ton cuer car tu auras une fille qui aura .i. fil qui sera de molt grant merite devant Dieu et de grant los entre les homes.
-\item \texttt{Input :} Aiesjoieetleesceentoncuercartuaurasunefillequiaura.i.filquiserademoltgrantmeritedevantDieuetdegrantlosentreleshomes.
-\item \texttt{CNN :} Aiesjoie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\item \texttt{CNN lower:} Aies joie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\item \texttt{CNN without position:} Aiesjoie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\item \texttt{CNN Normalize:} Aies joie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\end{itemize}
-
+The following inputs have been tagged with the CNN P model. Batches are constructed around the regular expression \texttt{\\W} with the \texttt{regex} package. This explains why inputs such as \texttt{".i."} are automatically tagged as \texttt{" . i . "} by the tool. The input was stripped of its spaces before tagging; we only show the ground truth for convenience.

+\begin{table}[!ht]
+\centering
+\begin{tabularx}{\textwidth}{|C|C|}
+\hline
+\textbf{Ground truth} & \textbf{Tokenized output} \\\hline
+Aies joie et leesce en ton cuer car tu auras une fille qui aura .i. fil qui sera de molt grant merite devant Dieu et de grant los entre les homes.Conforte toi et soies liee car tu portes en ton ventre .i. fil qui son lieu aura devant Dieu et qui grant honnor fera a toz ses parenz. & Aies joie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes . Confort e toi et soies liee car tu portes en ton ventre . i . fil qui son lieu aura devant Dieu et qui grant honnor fera a toz ses parenz .
+\\\hline
+\end{tabularx}
+\caption{Output examples on a text from outside the dataset}
+\label{tab:example_output}
+\end{table}
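The pre-tokenization around \W mentioned above can be illustrated as follows; this is one plausible way to obtain the " . i . " behaviour with the regex package, not necessarily the tool's exact code.

import regex

# Illustrative only: non-word characters become standalone tokens, which is
# one way to reproduce the ".i." -> " . i . " behaviour described above.
def pre_tokenize(ground_truth):
    return " ".join(regex.findall(r"\w+|[^\w\s]", ground_truth))

print(pre_tokenize("qui aura .i. fil"))  # "qui aura . i . fil"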

 \subsection{Discussion}

-We believe that, aside from a graphical challenge, word segmentation in OCR from manuscripts can actually be treated from a text point of view
+We believe that, aside from being a graphical challenge, word segmentation in OCR output from manuscripts can actually be treated from a textual point of view, as an NLP task. Word segmentation of some texts can be difficult even for humanists; as such, we believe that post-processing OCR through tools like Boudams can be a better way to enable data mining of such datasets. In light of the high accuracy of the model, we believe it should perform comparably regardless of which Medieval Western European language is involved.

-% "The purpose of the discussion is to interpret and describe the significance of your findings in light of what was already known about the research problem being investigated and to explain any new understanding or insights that emerged as a result of your study of the problem. The discussion will always connect to the introduction by way of the research questions or hypotheses you posed and the literature you reviewed, but the discussion does not simply repeat or rearrange the first parts of your paper; the discussion clearly explain how your study advanced the reader's understanding of the research problem from where you left them at the end of your review of prior research."
+We were surprised by the negligible effects of the different normalization methods (lower-casing; ASCII reduction; both). The presence of certain MUFI characters might provide enough information about segmentation, and they might be numerous enough not to impact the network weights.

 \subsection{Conclusion}

-While
-% The conclusion is intended to help the reader understand why your research should matter to them after they have finished reading the paper. A conclusion is not merely a summary of the main topics covered or a re-statement of your research problem, but a synthesis of key points and, if applicable, where you recommend new areas for future research. For most college-level research papers, one or two well-developed paragraphs is sufficient for a conclusion, although in some cases, three or more paragraphs may be required.
+Achieving 0.99 accuracy on word segmentation with a test corpus as large as 25,000 samples seems to be a first step toward larger-scale data mining of OCRed manuscripts. As a follow-up, we wonder whether the importance of normalization and lower-casing grows with the size of the corpora and depends on their content.

-\subsection{Acknowledgement}
+\subsection{Acknowledgements}

 Boudams has been made possible by two open-source repositories from which I learned and copied bits of implementation of certain modules, and without which none of this paper would have been possible: \citet{enrique_manjavacas_2019_2654987} and \citet{bentrevett}. This tool was originally intended for post-processing OCR for the presentation \citet{pinchecampsclerice} at DH2019 in Utrecht.

@@ -199,22 +201,12 @@ \subsection{Acknowledgement}

 \section{Annex 1 : Confusion of CNN without position embeddings}

-\begin{figure}
+\begin{figure}[!ht]
 \centering
 \includegraphics[width=\linewidth]{confusion.png}
 \caption{Confusion matrix of the CNN model without position embedding}
 \label{fig:confusion_matrix}
 \end{figure}


-
-\section{Annex 2}
-Cras tristique vel nisi at aliquet. Proin egestas erat sit amet velit lobortis imperdiet. Integer et arcu sapien. Etiam id blandit
-sapien. Nam tempus lacus ac massa semper, vel laoreet turpis rutrum. Mauris eget nibh vitae justo porta imperdiet sed vel
-ligula. In imperdiet, augue vel condimentum convallis, neque augue imperdiet neque, eget dapibus nunc mauris ultricies
-tortor. Nam eget nunc egestas, blandit lectus non, aliquam nunc. Cras sed quam vitae arcu ornare lobortis. Ut ut lacus
-hendrerit, convallis orci sit amet, commodo nunc. Pellentesque eget tincidunt tortor. Nunc ornare molestie mauris id vehicula.
-Suspendisse pharetra tortor metus, sit amet fermentum tellus vehicula ut.
-
-
 \end{document}

article/ground_truth.txt

+2
@@ -0,0 +1,2 @@
+Aies joie et leesce en ton cuer car tu auras une fille qui aura .i. fil qui sera de molt grant merite devant Dieu et de grant los entre les homes.
+Conforte toi et soies liee car tu portes en ton ventre .i. fil qui son lieu aura devant Dieu et qui grant honnor fera a toz ses parenz.

article/input.tokenized.txt

+1
@@ -0,0 +1 @@
+Aies joie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes . Confort e toi et soies liee car tu portes en ton ventre . i . fil qui son lieu aura devant Dieu et qui grant honnor fera a toz ses parenz .

article/input.txt

+1-1
@@ -1 +1 @@
-Aiesjoieetleesceentoncuercartuaurasunefillequiaura.i.filquiserademoltgrantmeritedevantDieuetdegrantlosentreleshomes.
+Aiesjoieetleesceentoncuercartuaurasunefillequiaura.i.filquiserademoltgrantmeritedevantDieuetdegrantlosentreleshomes.Confortetoietsoieslieecartuportesentonventre.i.filquisonlieuauradevantDieuetquigranthonnorferaatozsesparenz.

article/output.txt

-6
This file was deleted.
