article/article.tex (+33 -41)
@@ -2,6 +2,9 @@
\usepackage[utf8]{inputenc}
\usepackage{array}
\usepackage{pgfplots}
+\usepackage{tabularx}
+\newcolumntype{C}{>{\arraybackslash}X} % centered "X" column

\title{Evaluating Deep Learning Methods for Tokenization of Space-less texts in Old French}
\author[1]{Thibault Clérice}
@@ -24,11 +27,10 @@
\section{Introduction}

% To Read: Stutzmann article.
-% Talk more about the orthographic variation of the language
Tokenization of space-less strings is a task that is specifically difficult for computers when compared to "whathumancando". \textit{Scripta continua} is a writing practice in which words are not separated by spaces; it appears to have disappeared around the 8th century (see \citet{zanna1998lecture}). Nevertheless, spacing can remain somewhat erratic in the writings of later centuries, as shown by Figure \ref{fig:4lines}, a document from the 13th century. In the context of text mining of HTR or OCR output, lemmatization and tokenization of medieval Western languages can be a pre-processing step for further research, sustaining analyses such as authorship attribution \textbf{CITE JBCAMPS ?}.
@@ -48,7 +50,7 @@ \subsubsection{Encoding of input and decoding}
The model is based on traditional text input encoding where each character is transcoded to an index. The output of the model is a mask that needs to be applied to the input: in the mask, characters are classified either as word boundary or word content (\textit{cf.} Table \ref{lst:input_output_example}).
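To make the masking scheme concrete, the following minimal sketch (in Python, with illustrative names rather than Boudams' actual API) shows how a space-less input can be encoded as indices and how a predicted boundary mask can be decoded back into a segmented string.

\begin{verbatim}
# Illustrative sketch only: names and mask conventions are ours,
# not Boudams' actual implementation.
WC, WB = 0, 1  # word content vs. word boundary classes

def encode(text, char2idx):
    # each character of the space-less input becomes an integer index
    return [char2idx[c] for c in text]

def decode(text, mask):
    # re-insert a space after every character classified as a boundary
    out = []
    for char, label in zip(text, mask):
        out.append(char)
        if label == WB:
            out.append(" ")
    return "".join(out).strip()

# decode("ladamehaitee", [0,1,0,0,0,1,0,0,0,0,0,1]) -> "la dame haitee"
\end{verbatim}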
-\begin{table}
+\begin{table}[!ht]
\centering
\begin{tabular}{@{}ll@{}}
\hline
@@ -63,7 +65,7 @@ \subsubsection{Encoding of input and decoding}
For evaluation purposes, and to reduce the number of input classes, we propose two options for data transcoding: a lower-case normalization and a "reduction to the ASCII character set" feature (\textit{cf.} Figure \ref{fig:normalization}). On this point, many issues were found with the transliteration of medieval paleographic characters present in the original datasets, as they are badly interpreted by the \texttt{unidecode} python package: indeed, \texttt{unidecode} simply removes characters it does not understand. I built a secondary package named \texttt{mufidecode} (\citet{thibault_clerice_2019_3237731}), which takes precedence over the unidecode equivalency tables when the data is known to the Medieval Unicode Font Initiative (MUFI, \citet{mufi}).
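As an illustration, a minimal sketch of the two transcoding options follows; the \texttt{mufidecode} call signature is an assumption based on the package description rather than a verified API.

\begin{verbatim}
# Sketch of the two normalization options described above.
# The mufidecode() signature is assumed (MUFI tables first,
# unidecode-style fallback); check the package documentation.
from mufidecode import mufidecode

def normalize(text, lower=False, to_ascii=False):
    if lower:
        text = text.lower()
    if to_ascii:
        # plain unidecode.unidecode() would silently drop unknown
        # MUFI glyphs instead of transliterating them
        text = mufidecode(text)
    return text
\end{verbatim}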
-\begin{figure}
+\begin{figure}[!ht]
\centering
\includegraphics[width=\linewidth]{carbon.png}
\caption{Different pre-processing possibilities. The option with join=False was kept, as it keeps abbreviations marked as single characters. Note how \texttt{unidecode} loses the P WITH BAR.}
@@ -96,14 +98,14 @@ \subsubsection{Datasets}
The input was generated by grouping at least 2 and at most 8 words per sample. With a probability of 0.2, a noise character could be added (the noise character was set to the DOT ('.')), and, with a probability of 0.3, some words were randomly carried over from one sample to the next, with a maximum of 1 word kept. If a minimum size of 7 characters was not met by the input sample, another word was added to the chain. A maximum input size of 100 characters was kept. The resulting corpora vary in sample size, as shown by Figure \ref{fig:word_sizes}. The corpora contain 193 different characters when not normalized, among which some MUFI characters appear only a few hundred times (\textit{cf.} Table \ref{tab:mufi_examples}).
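The following sketch illustrates the sample-generation procedure with the parameters given above; variable names and control flow are ours, not the exact Boudams implementation.

\begin{verbatim}
import random

# Illustrative sketch of the sample generation described above.
MIN_WORDS, MAX_WORDS = 2, 8
NOISE_CHAR, NOISE_P = ".", 0.2   # dot added with probability 0.2
KEEP_P = 0.3                     # carry-over probability (at most one word)
MIN_CHARS, MAX_CHARS = 7, 100

def make_samples(words):
    kept = []                    # words carried over from the previous sample
    i = 0
    while i < len(words):
        sample = list(kept)
        sample += words[i:i + random.randint(MIN_WORDS, MAX_WORDS)]
        i += len(sample) - len(kept)
        # add words until the minimum sample size is reached
        while len(" ".join(sample)) < MIN_CHARS and i < len(words):
            sample.append(words[i])
            i += 1
        if random.random() < NOISE_P:
            sample.insert(random.randrange(len(sample) + 1), NOISE_CHAR)
        kept = [random.choice(sample)] if random.random() < KEEP_P else []
        yield " ".join(sample)[:MAX_CHARS]
\end{verbatim}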
-\begin{figure}
+\begin{figure}[!ht]
\centering
\includegraphics[width=\linewidth]{length.png}
\caption{Distribution of word size over the train, dev and test corpora}
\label{fig:word_sizes}
\end{figure}

-\begin{table}
+\begin{table}[!ht]
\begin{tabular}{llll}
\hline
& Train dataset & Dev dataset & Test dataset \\ \hline
@@ -119,23 +121,23 @@ \subsubsection{Results}
The training used a learning rate of 0.00005 for each CNN model and 0.001 for the LSTM one, with a batch size of 64. Training reached a plateau fairly quickly for each model (\textit{cf.} Figure \ref{fig:loss}). Every model except the LSTM reached a very low loss and a high accuracy on the test set (\textit{cf.} Table \ref{tab:scores}).
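For concreteness, the hyper-parameters above could be set up as follows; the optimizer choice (Adam) and the PyTorch framing are assumptions made purely for illustration, as the text does not specify them.

\begin{verbatim}
import torch

# Reported hyper-parameters; the optimizer choice is an assumption.
BATCH_SIZE = 64
LEARNING_RATES = {"cnn": 5e-5, "lstm": 1e-3}

def make_optimizer(model, kind="cnn"):
    return torch.optim.Adam(model.parameters(), lr=LEARNING_RATES[kind])
\end{verbatim}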
\addplot table[x=Epoch,y=CNN L N,col sep=comma] {accuracies.csv};
\legend{CNN, CNN L, CNN P, CNN N, CNN L N}
\end{axis}
\end{tikzpicture}
@@ -144,7 +146,7 @@ \subsubsection{Results}
\end{center}
\end{figure}

-\begin{table}
+\begin{table}[!ht]
\centering
\begin{tabular}{lllll}
\hline
@@ -162,31 +164,31 @@ \subsubsection{Results}
\subsubsection{Example of outputs}

-The following inputs has been tagged with the CNN P model. Batch are constructed around the regex \texttt{\\W} with package \texttt{regex}. This explains why inputs such as \texttt{".i."} are automatically tagged as \texttt{" . i . "} by the tool. The input was stripped of its spaces before tagging, we only show the ground truth by commodity.
-
-\begin{itemize}
-\item\texttt{Truth :} Aies joie et leesce en ton cuer car tu auras une fille qui aura .i. fil qui sera de molt grant merite devant Dieu et de grant los entre les homes.
-\item\texttt{CNN :} Aiesjoie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\item\texttt{CNN lower:} Aies joie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\item\texttt{CNN without position:} Aiesjoie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\item\texttt{CNN Normalize:} Aies joie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes .
-\end{itemize}
+The following inputs have been tagged with the CNN P model. Batches are constructed around the regular expression \texttt{\\W} with the package \texttt{regex}. This explains why inputs such as \texttt{".i."} are automatically tagged as \texttt{" . i . "} by the tool. The input was stripped of its spaces before tagging; we show the ground truth only for convenience (see the short sketch after Table \ref{tab:example_output}).
+Aies joie et leesce en ton cuer car tu auras une fille qui aura .i. fil qui sera de molt grant merite devant Dieu et de grant los entre les homes.Conforte toi et soies liee car tu portes en ton ventre .i. fil qui son lieu aura devant Dieu et qui grant honnor fera a toz ses parenz. & Aies joie et leesce en ton cuer car tu auras une fille qui aura . i . fil qui sera de molt grant merite devant Dieu et de grant los entre les homes . Confort e toi et soies liee car tu portes en ton ventre . i . fil qui son lieu aura devant Dieu et qui grant honnor fera a toz ses parenz .
+\\\hline
+\end{tabularx}
+\caption{Output examples on a text from outside the dataset}
+\label{tab:example_output}
+\end{table}
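As referenced above, here is a minimal sketch of that pre-processing; the function names are ours, but \texttt{regex.split} is used as described in the text.

\begin{verbatim}
import regex

# Sketch of the pre-processing described above: the ground truth is
# split on non-word characters (so ".i." becomes " . i . ") and
# spaces are stripped from the input before tagging.
def ground_truth(text):
    return " ".join(t for t in regex.split(r"(\W)", text) if t.strip())

def model_input(text):
    return text.replace(" ", "")

# ground_truth("qui aura .i. fil") -> "qui aura . i . fil"
# model_input("qui aura .i. fil")  -> "quiaura.i.fil"
\end{verbatim}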
\subsection{Discussion}
-We believe that, aside from a graphical challenge, word segmentation in OCR from manuscripts can actually be treated from a text point of view
+We believe that, aside from being a graphical challenge, word segmentation in OCR of manuscripts can also be treated from a textual point of view, as an NLP task. Word segmentation of some texts can be difficult even for humanists, and as such we believe that post-processing OCR output with tools like Boudams can be a better way to enable data mining of such datasets. In light of the high accuracy of the model, we believe it should perform equally well independently of the language, within Medieval Western Europe.
-% "The purpose of the discussion is to interpret and describe the significance of your findings in light of what was already known about the research problem being investigated and to explain any new understanding or insights that emerged as a result of your study of the problem. The discussion will always connect to the introduction by way of the research questions or hypotheses you posed and the literature you reviewed, but the discussion does not simply repeat or rearrange the first parts of your paper; the discussion clearly explain how your study advanced the reader's understanding of the research problem from where you left them at the end of your review of prior research."
+We were surprised by the negligible effect of the different normalization methods (lower-casing; ASCII reduction; both). The presence of certain MUFI characters might provide enough information about segmentation, and they might occur in sufficient numbers not to skew the network weights.
\subsection{Conclusion}
-While
-% The conclusion is intended to help the reader understand why your research should matter to them after they have finished reading the paper. A conclusion is not merely a summary of the main topics covered or a re-statement of your research problem, but a synthesis of key points and, if applicable, where you recommend new areas for future research. For most college-level research papers, one or two well-developed paragraphs is sufficient for a conclusion, although in some cases, three or more paragraphs may be required.
+Achieving 0.99 accuracy on word segmentation with a corpus as large as 25,000 test samples seems to be a first step toward more extensive data mining of OCRed manuscripts. As an afterthought, we wonder whether normalization and lower-casing would matter more depending on the size of the corpora and their content.
-\subsection{Acknowledgement}
+\subsection{Acknowledgements}
Boudams was made possible by two open-source repositories from which I learned and reused bits of the implementation of certain modules, and without which this paper would not have been possible: \citet{enrique_manjavacas_2019_2654987} and \citet{bentrevett}. This tool was originally intended for post-processing OCR for the presentation \citet{pinchecampsclerice} at DH2019 in Utrecht.