% use UTF-8 encoding in editors such as TeXworks
% !TEX encoding = UTF-8 Unicode
% !TEX TS-program = pdflatex
\documentclass[%
corpo=13.5pt,
twoside,
% stile=classica,
oldstyle,
% autoretitolo,
tipotesi=magistrale,
greek,
evenboxes
]{toptesi}
\usepackage[utf8]{inputenc}% input encoding
\usepackage[T1]{fontenc}% font encoding
\usepackage{lmodern}% font choice
\usepackage{listings} % code listing
\usepackage{mathtools} % math
\usepackage{eucal} % math calligraphy
\usepackage{amsfonts} % blackboard bold letters (like R for real values)
\usepackage{url} % URLs usage: \url{https://example.com}
\usepackage{cite} % cite bibtex entries
\usepackage{hyperref} % internal references to text, chapters, etc.
\def\UrlBreaks{\do\/\do-} % break urls also using "-"
\usepackage{makecell}
\renewcommand\theadalign{bc}
\renewcommand\theadfont{\bfseries}
\renewcommand\theadgape{\Gape[4pt]}
\renewcommand\cellgape{\Gape[4pt]}
\usepackage{tabularx, booktabs, multirow, subcaption} % tables
\usepackage[table]{xcolor}
\newcolumntype{b}{X}
\newcolumntype{m}{>{\hsize=.5\hsize}X}
\newcolumntype{s}{>{\hsize=.4\hsize}X}
\newcolumntype{z}{>{\hsize=.3\hsize}X}
\newcolumntype{k}{>{\hsize=.2\hsize}X}
\newcolumntype{i}{>{\hsize=.2\hsize}>{\centering\arraybackslash}X}
\newcolumntype{u}{>{\hsize=.1\hsize}X}
\newcolumntype{Y}{>{\centering\arraybackslash}X} %centering
\newcommand*\rot{\rotatebox{90}} % multirow rotation
\usepackage{color}
\definecolor{gray}{rgb}{0.4,0.4,0.4}
\definecolor{darkblue}{rgb}{0.0,0.0,0.6}
\definecolor{cyan}{rgb}{0.0,0.6,0.6}
\definecolor{codebackground}{rgb}{0.95,0.95,0.92}
\lstset{
basicstyle=\ttfamily,
columns=fullflexible,
showstringspaces=false,
commentstyle=\color{gray}\upshape,
aboveskip=12.5pt,
belowskip=12.5pt
}
\lstdefinelanguage{XML}
{
morestring=[b]",
morestring=[s]{>}{<},
morecomment=[s]{<?}{?>},
stringstyle=\color{black},
identifierstyle=\color{darkblue},
keywordstyle=\color{cyan},
morekeywords={xmlns,version}
}
% See the toptesi-it.pdf documentation for the
% precautions needed to obtain a file that is
% truly compliant with the archivability standards.
\usepackage{hyperref}
\hypersetup{%
pdfpagemode={UseOutlines},
bookmarksopen,
pdfstartview={FitH},
colorlinks,
linkcolor={blue},
citecolor={blue},
urlcolor={blue}
}
%%%%%%% Local definitions
\newtheorem{osservazione}{Osservazione}% Standard LaTeX
\ExtendCaptions{english}{Abstract}{Acknowledgements}
\begin{document}\errorcontextlines=9
% set English as primary language
\english
%%%%%%%%%%%%%%%%%%%%
% BEGIN front page %
%%%%%%%%%%%%%%%%%%%%
\begin{ThesisTitlePage}*
\ateneo{Politecnico di Torino}
\nomeateneo{DEPARTMENT OF CONTROL AND COMPUTER ENGINEERING}
\CorsoDiLaureaIn{Master of Science in}
\corsodilaurea{Computer Engineering}
\TesiDiLaurea{Master Degree Thesis}
\titolo{Deep Learning on Academic Knowledge Graphs}
\sottotitolo{
Predicting new facts in a novel semantic graph built
on top of the Politecnico di Torino scholarly data
}
\CandidateName{Candidate}
\candidato{Giovanni \textsc{Garifo}}
\AdvisorName{Supervisors}
\relatore{Dr.~Antonio Vetrò}
\secondorelatore{Prof.~Juan Carlos De Martin}
\sedutadilaurea{\textsc{Academic~Year} 2018-2019}%
\logosede[6cm]{logopolito}
\end{ThesisTitlePage}
%%%%%%%%%%%%%%%%%%
% END front page %
%%%%%%%%%%%%%%%%%%
% binding offset
%\setbindingcorrection{3mm}
\makeatletter
\newenvironment{miadedica}{
\clearpage
\if@twoside
\ifodd\c@page\else\thispagestyle{empty}\null\clearpage\fi
\fi
\thispagestyle{empty}%
\list{}{\labelwidth\z@
\leftmargin.73\textwidth
\parindent\z@
\raggedright\LARGE\itshape}\item[]
\normalsize
}
\begin{miadedica}
To Monia\\
To my Grandfather
\end{miadedica}
\paginavuota
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% -------------------------------- Abstract ---------------------------------- %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\sommario
The publication and sharing of new research results is one of the
main goals of an academic institution. In recent years, many efforts have been
made to collect and organize scientific knowledge through new,
comprehensive data repositories.
Achieving this goal requires new tools that are able not only to store data,
but also to describe it.
Knowledge graphs are a particular class of graphs used to
semantically describe human knowledge in a specific domain by linking
entities through labeled and directed edges.
This work presents a novel semantic graph built on top of the
scholarly data produced by the Politecnico di Torino (Polito), and shows how
state-of-the-art machine learning techniques have been used to
predict new links in this graph.
The graph, built by leveraging Semantic Web technologies, links together
publications, researchers, fields of study and scientific journals in order
to build a knowledge base that describes the Politecnico di Torino scientific
community.
This new academic graph has been called the \emph{Polito Knowledge Graph}.
The prediction of non-existent links between graph nodes is one of the
most challenging tasks in the field of statistical relational learning on graph
data, mainly because, in order to obtain meaningful predictions, the vector
representations of the entities must embed their semantic characteristics.
To accomplish this goal, a deep neural network derived
from the image recognition field and specifically adapted to the task of
representation learning on graph data has been used.
This architecture made it possible to obtain representations learnt
directly from the graph structure itself, without requiring any prior
knowledge or feature engineering.
These representations yielded meaningful predictions
that have been used to power a recommendation system for the suggestion of useful
insights.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ------------------------------- acknowledgements --------------------------- %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\ringraziamenti
I would like to express my deep gratitude to Giuseppe Futia, for his useful
advice, his support in the revision of this work and his enthusiastic
encouragement.
I would also like to express my gratitude to the supervisors of this work,
Dr. Antonio Vetrò and Prof. Juan Carlos De Martin, for their useful comments
and revisions.
Thanks also go to all my colleagues at the
Nexa Center for Internet \& Society, who supported me during my master's
degree studies.
Special thanks must be given to Marco Conoscenti and Selina Fenoglietto for
the advice and support given during these years.
I also want to thank my family and friends, with a particular mention to
Angelo Cataldo, a lifelong friend.
This work is dedicated to Monia Fontana, without whose support I would not
have been able to face the challenges that life has posed to me in recent
years, and to my grandfather, Giovanni Rotolo, who passed away shortly after
the start of my master's degree, and whom I hope to make proud by completing my
university studies.
% normally this line is not needed and is commented out
\tablespagetrue\figurespagetrue
\indici
\mainmatter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ---------------------------- Introduction ---------------------------------- %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Introduction}
\section{Motivation}
Graphs power some of the most complex IT services available
today, a notable example being the Google search engine\footnote{\url{https://blog.google/products/search/introducing-knowledge-graph-things-not/}}.
Graphs can be used to represent almost any kind of information, and they are
particularly well suited to representing the structure of complex systems and
describing the relationships between their elements.
Over the last decade, much effort has been put into leveraging the power
of graphs to represent human knowledge and to build search tools capable of
querying and understanding the semantic relations within them.
RDF\footnote{The Resource Description Framework will be introduced in
section \ref{subsec:semanticweb}} graphs are a
particular class of graphs that can be used to build knowledge
bases. Ontologies are used to shape such knowledge repositories, in order to
have a semantically coherent representation of the domain knowledge.
Given a domain and an ontology, RDF graphs allow building a structured
representation of the knowledge in that domain.
Modern machine learning techniques can be used to mine latent information
from such graphs. One of the main challenges in this field is how to learn
meaningful representations of entities and relations that embed
the underlying knowledge. Such representations can then be used to evaluate
new links within the graph or to classify unseen nodes.
Deep learning techniques have proved to be particularly effective at
representation learning tasks, being able to learn latent
representations without any prior knowledge other than the graph structure.
\section{Goal and Contribution}
% issue
Knowledge sharing is one of the main goals of research organizations. In
particular, universities are among the most interested in making their
research results publicly available. Today most universities have embraced the Open
Science movement, making their scientific publications publicly available
through web portals.
An example of such tools is the Institutional Research Information
System (IRIS)\footnote{\url{https://iris.polito.it/}}, an
institutional publication repository developed by the Cineca Consortium and
used by the Politecnico di Torino to store and share all the scientific papers
published by its researchers.
IRIS allows users to explore the published papers by searching for a field of study,
which is matched against the keywords inserted by the authors of the publications.
The current implementation has some limitations: being
inserted by the authors, the keywords can be acronyms, can contain
abbreviations or initials written without capital letters, or can be misspelled.
In addition, they do not represent unique, field-wide semantic concepts, but
are simple character strings.
The consequence is that the IRIS search engine is unable to correctly
retrieve all the publications about a specific research topic, because the
system cannot match the searched field of study with an unambiguous
\emph{semantic entity}, but only with character strings that are neither
uniquely identified nor semantically linked to each other, and that are also
prone to lexical errors.
% goal
The goal of this work is to overcome these limitations and to enable new possibilities for
exploring and obtaining insights about the scientific community of the
Politecnico di Torino. A new semantics-empowered search engine is
one possible solution, allowing coherent and
precise results to be retrieved.
At the foundation of such a semantic search engine there must be a data
structure capable of representing semantic relations and concepts. Once such a
knowledge base of the scholarly data is obtained, it can be enhanced and
completed by automatically extracting latent information through
advanced machine learning algorithms.
% contribution
The next chapters present a newly built, structured and
semantically coherent representation of the scholarly data produced by the
Politecnico di Torino, and show how implicit facts can be automatically
extracted from this knowledge repository by leveraging knowledge base
completion techniques, implemented by means of an advanced link prediction
algorithm.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ------------------------------- Background --------------------------------- %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Background}
\section{Semantic Web}
\subsection{From a Web of Contents to a Web of Data}
The World Wide Web has been developed as a tool to easily access
documents and to navigate through them by following hyperlinks.
This simple description already resembles the structure of a graph:
documents can be thought of as nodes and hyperlinks as edges.
The unstoppable growth
of the \emph{Web graph} led to the rise of new tools to explore such
complexity. Search engines were developed to easily navigate this
giant graph. The first approaches were based on analytical evaluations,
such as the number of times a document has been linked, as in the case of the
PageRank \cite{page1999} algorithm developed by Google.
\newline
The Web rapidly became one of the most innovative technologies ever built,
allowing information to be retrieved quickly and easily as never before.
The next evolutionary step was to think about a Web exploitable not only by
human beings but also by machines. In order to build such a comprehensive
system, where information is not only machine-readable but
machine-understandable, the World Wide Web had to move from a web of content to
a web of data.
\newline
The World Wide Web Consortium (W3C) introduced the Semantic Web as an extension
of the existing WWW standards. Its primary goal has been
to define a framework to describe and query the semantic information contained
in the documents available on the Web, so as to allow machines to understand
the semantic information contained in web pages. In the vision of Tim
Berners-Lee, the father of the WWW, this would lead to the transition from a
World Wide Web to a Giant Global Graph\footnote{\url{https://web.archive.org/web/20160713021037/http://dig.csail.mit.edu/breadcrumbs/node/215}},
where a web page contains metadata that provides a machine with the
information needed to understand the concepts and meanings expressed in it.
\subsection{The Semantic Web Building Blocks}
\label{subsec:semanticweb}
The three key components of the Semantic Web standard are:
\begin{enumerate}
\item OWL: the Web Ontology Language\cite{mcguinness2004}
\item RDF: the Resource Description Framework\cite{lassila1998}
\item SPARQL: The SPARQL Protocol and RDF Query Language\cite{ducharme2013}
\end{enumerate}
\bigskip
OWL is a language used to define ontologies. In this context, an ontology
is defined as a collection of concepts, relations and constraints between
these concepts that describes an area of interest or a domain.
OWL makes it possible to classify things in terms of their meaning by describing
their membership in classes and subclasses defined by the ontology: if
a thing is defined as a member of a class, this means that it shares the
same semantic meaning as all the other members of that class. The result of
such classification is a taxonomy that defines a hierarchy of how things
are semantically interrelated in the domain under analysis.
The instances of OWL classes are called individuals, and can be related
to other individuals or classes by means of properties. Each individual
can be characterized with additional information using literals, which
represent data values such as strings, dates or integers.
\newline
The Resource Description Framework defines a standard model for the
description, modelling and interchange of resources on the Web.
The first component of the framework is the \emph{RDF Model and Syntax},
which defines a data model that describes how the RDF resources should be
represented. The basic model consists of only three object types: resource,
property, and statement.
A resource is uniquely identified by a Uniform Resource Identifier (URI).
A property can be either a resource attribute or a relation between resources.
A statement describes a property of a resource, and is defined as a triple
composed of a subject (the resource), a predicate (the property) and an
object (a literal or another resource).
The second component of the framework is the \emph{RDF Schema} (RDFS),
which defines a basic vocabulary for describing RDF resources and the
relationships between them. Many of the vocabularies and ontologies available
today are built on top of RDFS, such as the Friend of a Friend (FOAF)
ontology \cite{brickley2007}, for describing social networks, or the one
maintained by the Dublin Core Metadata Initiative \cite{weibel1998}, which
defines common terms and relations used in the definition of metadata for
digital resources.
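The triple model can be made concrete with a short example. The following is
a minimal sketch, assuming the Python \texttt{rdflib} library; the
\texttt{example.org} namespace and the resource names are hypothetical
placeholders used only for illustration.
\begin{lstlisting}[language=Python, frame=single]
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# hypothetical namespace, for illustration only
EX = Namespace("http://example.org/")

g = Graph()
# statement: subject (resource), predicate (property), object (resource)
g.add((EX.paper1, RDF.type, EX.Publication))
# statement whose object is a literal (a data value)
g.add((EX.paper1, RDFS.label, Literal("An example publication")))

print(g.serialize(format="turtle"))
\end{lstlisting}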
\begin{figure}[ht]
\centering
\includegraphics[scale=0.6]{img/owl-ontology-example.png}
\caption{An example of ontology defined using OWL and RDF Schema.}
\label{fig:owl-ontology-example}
\end{figure}
SPARQL is a query language for triplestores, a class of Database
Management Systems (DBMS) specialized in storing RDF databases. Such systems
often expose endpoints that can be used to query the database and obtain
results. Given the complexity of the stored data, the query language has
been designed to be as simple as possible, for instance by allowing the use
of variables, whose names are preceded by a question mark.
The syntax of SPARQL is heavily derived from SQL, with some
minor adaptations to make it better suited to querying graph data. The
following is an example of a query that selects the labels
(human-readable descriptions) of all the entities that
match a given resource type.
\begin{lstlisting}[
language=sparql,
frame=single,
]
PREFIX plants: <http://example.org/plants/>
SELECT ?name
WHERE {
?subject rdf:type plants:flowers .
?subject rdfs:label ?name .
}
\end{lstlisting}
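Such a query can also be submitted programmatically to a triplestore endpoint.
The following is a minimal sketch, assuming the Python \texttt{SPARQLWrapper}
library; the endpoint URL and the \texttt{plants} vocabulary are hypothetical.
\begin{lstlisting}[language=Python, frame=single]
from SPARQLWrapper import SPARQLWrapper, JSON

# hypothetical SPARQL endpoint, for illustration only
sparql = SPARQLWrapper("http://example.org/sparql")
sparql.setQuery("""
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX plants: <http://example.org/plants/>
    SELECT ?name
    WHERE {
      ?subject rdf:type plants:flowers .
      ?subject rdfs:label ?name .
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["name"]["value"])
\end{lstlisting}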
\subsection{Knowledge Bases as Knowledge Repositories}
% table: comparison of some of the biggest existing KG
\begin{table}
\footnotesize
\centering
\caption{
Comparison of some of the biggest industry-scale knowledge graphs
developed to date. Table adapted from \cite{noy2019}.
}
\label{tab:kg-comparison}
\begin{tabularx}{1.0\textwidth}{ s b b m }
\toprule
& \textbf{Data model} & \textbf{Graph size} & \textbf{Development stage} \\
\midrule
\textbf{Microsoft} & Entities, relations and attributes defined in an ontology. & 2 billion entities and 55 billion facts. & Actively used in products. \\
\midrule
\textbf{Google} & Strongly typed entities, relations with domain and range inference. & 1 billion entities, 70 billion assertions. & Actively used in products. \\
\midrule
\textbf{Facebook} & All of the attributes and relations are structured and strongly typed. & 50 million primary entities, 500 million assertions. & Actively used in products. \\
\midrule
\textbf{eBay} & Entities and relations, well-structured and strongly typed. & 100 million products, more than 1 billion triples. & Early stages of development. \\
\midrule
\textbf{IBM} & Entities and relations with evidence information associated with them. & Various sizes. Proven at scale: documents >100 million, relationships >5 billion, entities >100 million. & Actively used in products and by clients. \\
\bottomrule
\end{tabularx}
\end{table}
Even though the rise of the Semantic Web has slowed down
due to the complexity of its vision, many new projects were born from its
enabling technologies. Both for-profit and
non-profit organizations have put effort into building complex knowledge
repositories starting from the knowledge already available on the Web. A notable
example is the DBpedia\footnote{\url{https://wiki.dbpedia.org/}} project, which
developed a structured knowledge base from the semi-structured data available on
Wikipedia.
Another example is the
\emph{Google Knowledge Graph}\footnote{\url{https://blog.google/products/search/introducing-knowledge-graph-things-not/}},
which is used to enhance the capabilities of the Google search engine and virtual
assistant, allowing precise information to be retrieved about everything that
has been classified in its ontology and described in its knowledge base, or
the \emph{Open Academic Graph}\footnote{\url{https://www.openacademic.ai/oag/}},
a scientific knowledge graph that collects more than three hundred million
academic papers. A comparison of some of the biggest knowledge graphs
developed to date is available in Table \ref{tab:kg-comparison}.
From an implementation perspective, knowledge bases can be created by defining
an ontology and a vocabulary for a specific domain of knowledge using OWL and
RDF Schema, and then by describing the concepts of that domain using the
RDF Model and Syntax.
The resulting RDF document can then be stored in a triplestore and queried
using SPARQL.
The main effort in building knowledge bases is to have a correct understanding
and prior knowledge of the domain of interest, to avoid the risk of
mischaracterizing or misrepresenting concepts.
\begin{figure}[h]
\centering
\includegraphics[scale=0.4]{img/geranium-knowledge-base-example.png}
\caption{An extract of the Polito Knowledge Graph whose details will be
described in section \ref{sec:buildingpkg}.}
\label{fig:geranium-knowledge-base-example}
\end{figure}
If all the requirements and precautions are met, a well-formed knowledge base may
prove to be a critical resource for an organization. It makes it possible to
build new services on top of it, and also to improve the existing knowledge inside
the organization by reasoning over the available knowledge, in order
to derive implicit facts from the existing entities and relationships.
Another field of application is the development of Expert Systems\footnote{\url{https://en.wikipedia.org/wiki/Expert_system}},
AI software that emulates a human decision-making process by
navigating the knowledge base and taking decisions as in a rule-based system.
Today's knowledge bases are commonly composed of tens of
thousands of nodes and hundreds of thousands of edges.
Considering such dimensions, storing and querying giant graphs requires
the adoption of
specialized DBMSs that are capable of efficiently storing and querying the RDF
input representation. Moreover, performing analyses and gathering statistics
from such giant graphs requires highly efficient algorithms in
order to retrieve the desired output in an acceptable time.
The availability of such a complex and informative data structure opens up
interesting scenarios, especially when thinking about
the latent information that can be extracted from it. In
fact, a knowledge base is a structured representation of
human knowledge in a specific field, and thus its comprehensiveness is restricted
by human understanding.
\section{Learning on Graphs}
\label{sec:learninggraphs}
\subsection{Representation Learning}
Machine learning (ML) algorithms are used to learn models from the
available data, with the final goal of obtaining a set of parameters
that are fine-tuned to identify the characteristics seen in the data
used for training. The trained models can then be used to
recognize unseen inputs by leveraging the knowledge embedded
in such parameters.
An important task in the ML field is the learning of machine-understandable
representations of the input data, a task known as representation learning.
Natural Language Processing is one of the research branches that in
the past years has made great use of machine learning algorithms, both for
language recognition and for embedding the meaning of words into word
vectors.
One of the most successful algorithms for representation learning
of words is Word2Vec \cite{mikolov2013}, where the model is trained to
learn a vector representation for each word in a vocabulary.
In Word2Vec, the meaning of a word is related to the context in
which the word is frequently used, so two words are recognized as similar if
they are used in similar contexts. In the vector space of the learnt
representations, words that have similar meanings have a higher
cosine similarity\footnote{
Cosine similarity is a heuristic method to measure the
similarity between two vectors by computing the cosine of the angle between
them:
$similarity(A,B) = \cos(\theta) = \frac{A \cdot B}{\Vert A \Vert \Vert B \Vert}$
}
than dissimilar ones.
For instance, the cosine similarity between the word vectors of ``Man''
and ``King'' is roughly the same as the one between ``Woman'' and ``Queen'',
since such words are used in similar contexts. This has opened up new scenarios
for language processing, since it makes it possible to perform vector
operations on word vectors, yielding interesting results.
An example can be seen in Figure \ref{fig:word2vec}.
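As an illustration, the cosine similarity and the vector arithmetic described
above can be computed in a few lines of code. The following is a minimal sketch
using NumPy; the vectors are random placeholders, not actual Word2Vec embeddings.
\begin{lstlisting}[language=Python, frame=single]
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# placeholder 100-dimensional "word vectors" (random, for illustration only)
rng = np.random.default_rng(0)
man, king, woman, queen = rng.normal(size=(4, 100))

print(cosine_similarity(man, king))
# with real Word2Vec embeddings, king - man + woman would be close to queen
print(cosine_similarity(king - man + woman, queen))
\end{lstlisting}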
\begin{figure}[t]
\centering
\includegraphics[scale=0.4]{img/word2vec.png}
\caption{
Word vectors allow vector operations to be performed; the results
obtained reflect the fact that Word2Vec is capable of embedding the
meaning of the words.
}
\label{fig:word2vec}
\end{figure}
This idea of words being characterized by the context in which they are used
can be generalized and applied to other fields of research, such as
representation learning on graphs.
Graphs are composed of nodes and edges, and are used to describe complex
systems, such as social networks or the interactions in a molecular biology
system. To apply machine learning algorithms to such data structures,
vector representations of nodes and edges are needed in order
to learn from the available data and predict new facts.
Such vector representations are often referred to as \emph{embeddings} because
they should embed the characteristics of the graph nodes, so that similar nodes
have similar embeddings. For instance, in a scholarly knowledge base, publications
with the same authors and similar subjects should have similar embeddings.
Early approaches required these representations to be learned from
handcrafted feature vectors, a task that required not
only a considerable amount of effort, but also a deep understanding of the domain
of interest. This has long been one of the main obstacles in
representation learning tasks, since the person who knows the domain and the
person who has to engineer the features are unlikely to be the same individual.
\subsection{Deep Learning on Graphs}
In recent years a major shift towards deep architectures has taken place
in machine learning, mainly thanks to the development of highly
parallelized hardware architectures that are able to efficiently compute
the vector and matrix multiplications that
are at the basis of any machine learning task.
Deep learning algorithms are able to extract relevant features from
raw data by applying simple mathematical operations, such as convolution, to
the input data.
One of the most successful applications of deep neural networks
is image recognition, where matrix representations of images are
convolved with learnt filters that are able to
extract the relevant features needed to recognize patterns present
in the input images.
Deep learning techniques have also proven to perform well in the field of
representation learning for graph data.
As can be seen in Figure \ref{fig:pixels-as-graph}, a digital image
is composed of pixels, which can be thought of as nodes in a graph where
each pixel is connected by an edge to its immediate neighbors. This suggests
that the techniques used for images can be adapted, with
some major changes, to representation learning on graphs, as well as
to other fields of research, such as learning on manifolds.
\begin{figure}[h]
\centering
\includegraphics[scale=0.4]{img/pixels-as-graph.png}
\caption{A digital image can be thought of as a graph.}
\label{fig:pixels-as-graph}
\end{figure}
One of the issues when working with graph data is that graphs
are commonly built to describe complex systems, such as the knowledge of a
domain or field in the case of knowledge graphs, and are thus composed of
a fairly high number of nodes and edges. The matrices used to
store the graph structure can therefore explode in dimensionality, becoming
impractical as input data. Moreover, graphs are not
regular structures with a given shape and size, such as the matrix of pixels
of an image, but live in an irregular domain, which leads to highly
irregular structures.
The first issue can be solved by randomly sampling the graph at each
training epoch, the immediate drawback being that more than one epoch
is required to train over all graph nodes. The second issue can instead be
solved by adapting existing algorithms to work on irregular domains.
One possible approach, which has proven to work well, is
the one based on convolutions.
Graph Convolutional Networks (GCNs) \cite{kipf2016} are a class of
semi-supervised deep learning algorithms for graphs which are based on the
same convolution and backpropagation operations as the well-known
Convolutional Neural Networks \cite{krizhevsky2012} (CNNs) used for feature
learning on images.
The main difference between CNNs and GCNs lies in how the convolution is
performed, whereas the backpropagation phase is the same as the one
used to update the parameters of CNNs, with a task-specific loss function.
In a CNN the input matrix of each network layer, which is the pixel matrix
of the input image for the first layer, is convolved with some layer-specific
convolutional filters, whose parameters are then updated during the
backpropagation phase.
GCNs work similarly: at the \emph{l}-th layer
of the network the feature vector of each node is convolved with the feature
vectors of its immediate neighbors by means of a convolutional filter, so that
after \emph{l} layers each node has aggregated information from its \emph{l}-hop
neighborhood. This operation is done by applying the following transformation:
\begin{equation} \label{gcn1}
H^{(l+1)}=\sigma(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)})
\end{equation}
Where $H^{(l)} \in\mathbb{R}^{N \times d^{(l)}}$ is the matrix of the nodes'
hidden representations, which is the output of the previous layer, with $N$ being
the number of nodes in the graph and $d^{(l)}$ being the dimensionality of
the current layer. For the first
layer, $H^{(0)}$ is equal to the input feature matrix $X$, where each row can be
initialized with the respective node features or with a one-hot encoded vector.
For the last layer, $H^{(l+1)}$ is the embeddings matrix.
$\tilde{A}$ is the adjacency matrix of the graph with added self-loops.
$\tilde{D}$ is the node degree matrix of $\tilde{A}$ and is used to
normalize it.
$W^{(l)}\in\mathbb{R}^{d^{(l)} \times d^{(l+1)}}$ is the convolutional filter that
is shared among all nodes and is unique for each layer.
The shape of this filter directly determines the dimensionality of the
resulting embeddings; in fact, $H^{(l+1)}$ is an $N\times d^{(l+1)}$ matrix.
Finally, $\sigma$ is a non-linear activation function, for instance $ReLU$.
Looking at the update rule of a single node embedding makes it clearer how
the convolution is actually performed.
The forward rule to update the embedding of a single node at the $l$-th layer
of the network is the following:
\begin{equation} \label{gcn2}
h^{(l+1)}_{i}=\sigma(\sum_{j\in\eta_{i}} \frac{1}{c_{i,j}}h_j^{(l)}W^{{(l)}})
\end{equation}
Where $\eta_i$ is the set of neighbors of node $i$, which contains the node
itself because of the added self-loop, and $c_{i,j}$ is a
normalization constant obtained from the multiplication of the adjacency
and degree matrices.
The updated feature vector $h^{(l+1)}_{i}$ of the node $i$ is
obtained by performing the following operations:
\begin{enumerate}
\item The feature vectors $h_j^{(l)}$ of its neighbors are transformed by
the matrix multiplication with the layer filter $W^{(l)}$.
\item The resulting $1 \times d^{(l+1)}$ feature vectors
are multiplied by their respective normalization constants and summed
together.
\item The non-linear function is applied to the result of the summation,
obtaining the updated feature vector of the node $i$.
\end{enumerate}
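The propagation rule above can be summarized in a few lines of code. The
following is a minimal NumPy sketch of a single GCN layer (dense matrices,
random filter), intended only to illustrate equation \ref{gcn1}, not as an
efficient implementation.
\begin{lstlisting}[language=Python, frame=single]
import numpy as np

def relu(x):
    return np.maximum(0, x)

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D~^-1/2 A~ D~^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])        # add self-loops
    d = A_tilde.sum(axis=1)                 # node degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D~^-1/2
    return relu(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

# toy undirected graph with 3 nodes: 0-1, 1-2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H0 = np.eye(3)                 # one-hot input features (N x d0)
W0 = np.random.rand(3, 2)      # layer filter (d0 x d1)
H1 = gcn_layer(A, H0, W0)      # node embeddings, shape (3, 2)
print(H1)
\end{lstlisting}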
By applying such transformations to all nodes, at the
$k$-th layer a node is represented by its transformed feature vector, which
embeds the structural information within the node's $k$-hop neighborhood.
As a consequence, the number of layers of the network is a
hyperparameter that controls how much information from distant nodes is
collected in each node embedding.
This represents a fundamental architectural difference between CNNs and GCNs:
while the former are commonly built by stacking many layers, the latter
rely on wider architectures, due to the dimensionality of the
graphs involved, consisting of a fairly small number of layers, in order to
characterize the nodes only through their immediate surroundings.
\begin{figure}[h]
\centering
\includegraphics[scale=0.4]{img/gcn.png}
\caption{Representation of a GCN updating the feature vector of the node \emph{0}
by summing the convolved features of its adjacent nodes.}
\label{fig:gcn}
\end{figure}
As a result of the propagation step, each node embedding is characterized by
its context, just as in Word2Vec, but in a non-Euclidean domain.
For example, in a social graph where each person is characterized by their
friends, hobbies, places visited and so on, two people will have similar
embeddings if they are both linked to the same nodes, and thus share the
same interests and social connections.
The node embeddings obtained by applying a GCN or one of its variants can
then be used to perform learning tasks on the graph; two examples are
the classification of unseen nodes and the prediction of non-existent
edges. The latter is one of the most interesting tasks, since it makes it
possible to complete the information contained in the graph by predicting
unseen facts.
\subsection{Link Prediction on Knowledge Graphs}
\label{subsec:linkprediction}
Predicting new facts is one of the most common tasks in the field
of knowledge graph completion. The goal of this task is to predict new, unseen
triples that correspond to missing facts in the knowledge base and that can
later be added to the graph.
Deep learning techniques for link prediction are based on the following two
main steps:
\begin{enumerate}
\item Train an \emph{encoder} model
that is able to embed the node features and produce meaningful embeddings.
\item Apply a factorization model that acts as a \emph{decoder}, which is
used to score the unseen triples under evaluation.
\end{enumerate}
Deep learning techniques such as GCNs can be exploited to obtain meaningful
node embeddings, but they fall short when dealing with graphs where
nodes are connected by different relations (multi-relational graphs).
In fact, if a single-layer GCN were used to obtain the embeddings of two nodes
that share the same neighborhood, but are connected to their neighbors via
different relations, the resulting embeddings would be very similar, even though
the two nodes clearly do not share the same characteristics. For example, in a
producer-consumer setting one node could be the producer and the other
the consumer, both having a common neighborhood of nodes
which are produced by the former and consumed by the latter.
If a GCN were used, the embeddings obtained would be very similar, even though
the two nodes clearly do not have the same role and do not belong to the
same class.
To overcome this limitation, changes to the basic GCN architecture have been
proposed, so as to obtain models that work well when applied to multi-relational
graphs.
\newline
The Relational Graph Convolutional Network \cite{schlichtkrull2018} (R-GCN) is an
extension of GCNs which is focused on modeling multi-relational graphs composed
of labeled and directed edges, and thus is particularly capable of embedding
both the nodes and the relations of a knowledge graph.
R-GCNs can be used for both link prediction and node classification tasks.
At a high level the R-GCN architecture can be seen as a special case of
the \emph{message passing framework} \cite{gilmer2017}, which groups together
under a common scheme most of the existing neural models for graph data.
The framework defines two major phases: a per-edge message computation and a
per-node message reduction.
In the first phase a function or linear transformation is applied to each edge
to obtain an edge-specific message.
Then, in the reduce phase, the embedding of each node is updated by aggregating
all the messages of its incoming edges.
These two phases can be combined in the following equation:
\begin{equation}
h_i^{(l+1)} = \sigma \left(
\sum_{m\in{\CMcal{M}_i}} g_m(h_i^{(l)}, h_j^{(l)})
\right)
\end{equation}
Where $\CMcal{M}_i$ is the set of incoming messages for node $i$ and
$g_m$ is a message-specific transformation.
\newline
The idea behind the R-GCN is to have a different set of parameters
for each relation.
At each step inside the network, the feature vector of a node is updated by
convolving the features of its immediate neighbors with a convolutional filter
that depends on the kind of relation that connects the nodes.
The forward
rule to update the embedding of a node at the \emph{l}-th layer is the following:
\begin{equation}
h^{(l+1)}_{i} = \sigma \left(
W_0^{(l)}h_i^{(l)} + \sum_{r\in\CMcal{R}}\sum_{j\in\CMcal{N}_i^r}
\frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)}
\right)
\end{equation}
Where $h_i^{(l)}$ is the embedding (or, for the first layer, the input feature
vector) of the node \emph{i}, $W_0^{(l)}$ is the learnt kernel for the
self-loop relation, $\CMcal{N}_i^r$ is the
set of indices of the neighbors of node \emph{i} under the relation
$r\in\CMcal{R}$, with $\CMcal{R}$ being the set of all the relations present in
the graph. $W_r^{(l)}\in\mathbb{R}^{d^{(l+1)}\times d^{(l)}}$ is the learnt
filter for the relation $r$. As in the GCN architecture, $\sigma$ is a
non-linear activation function and $c_{i,r}$ is a normalization constant,
commonly set to $|\CMcal{N}_i^r|$.
From a message passing framework perspective, the message function (per-edge
transformation) is the linear transformation $W_r h_j$, and the reduce
function (per-node transformation) is just the sum of
all the messages computed for the edges connected to each node.
As can be seen, the update rule looks similar to the one for GCNs (\ref{gcn2}),
with the major difference that in the case of an R-GCN the filters used to
convolve the feature vectors of neighboring nodes are relation-specific, so the
number of filters at each layer is equal to the number of relations in
the graph. As a consequence, the kind of relation that connects the node
to its neighbors plays an important role in determining the transformed node
embedding.
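A simplified version of this forward rule can be sketched as follows. This is
a minimal NumPy sketch assuming dense inputs and no regularization; the
variable names are illustrative and it is not the implementation used in this
work.
\begin{lstlisting}[language=Python, frame=single]
import numpy as np

def relu(x):
    return np.maximum(0, x)

def rgcn_layer(H, W0, W_rel, neighbors):
    """One R-GCN step:
    h_i' = ReLU(W0 h_i + sum_r sum_{j in N_i^r} W_r h_j / |N_i^r|).

    H: (N, d_in) node features; W0: (d_in, d_out) self-loop filter;
    W_rel: dict relation -> (d_in, d_out) filter;
    neighbors: dict relation -> list of neighbor-index lists, one per node.
    """
    H_new = H @ W0                            # self-loop term
    for r, W_r in W_rel.items():
        for i, neigh in enumerate(neighbors[r]):
            if neigh:                         # N_i^r is not empty
                c = len(neigh)                # normalization constant
                H_new[i] += (H[neigh] @ W_r).sum(axis=0) / c
    return relu(H_new)

# toy graph: 3 nodes, 2 relations
H0 = np.eye(3)                                # one-hot input features
W0 = np.random.rand(3, 2)
W_rel = {"writes": np.random.rand(3, 2), "cites": np.random.rand(3, 2)}
neighbors = {"writes": [[1], [], []], "cites": [[], [2], []]}
print(rgcn_layer(H0, W0, W_rel, neighbors))
\end{lstlisting}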
\begin{figure}[h]
\centering
\includegraphics[scale=0.48]{img/rgcn-encoder.png}
\caption{R-GCN encoder visualization. Image taken from \cite{schlichtkrull2018}.}
\label{fig:rgcn-encoder}
\end{figure}
Some form of regularization is required in order to avoid overfitting on rare
relations, and thus obtain a more general model, and also to avoid
the rapid growth in the number of parameters of the network for
highly multi-relational graphs.
\newpage
One of the solutions proposed by the original
paper \cite{schlichtkrull2018} is to decompose each relation filter $W_r$
using basis decomposition:
\begin{equation}
W_r^{(l)} = \sum_{b=1}^B a_{r,b}^{(l)} V_b^{(l)}
\end{equation}
This makes it possible to store only the relation-specific coefficients and the
basis matrices, which are shared by all the relations.
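The decomposition can be sketched as follows, continuing the NumPy notation of
the previous listing; the number of relations, bases and the dimensions are
arbitrary illustrative values.
\begin{lstlisting}[language=Python, frame=single]
import numpy as np

num_relations, B, d_in, d_out = 10, 2, 16, 16
V = np.random.rand(B, d_in, d_out)    # basis matrices, shared by all relations
a = np.random.rand(num_relations, B)  # relation-specific coefficients
W = np.einsum("rb,bio->rio", a, V)    # W[r] = sum_b a[r, b] * V[b]
print(W.shape)                        # (num_relations, d_in, d_out)
\end{lstlisting}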
The model obtained by training an R-GCN can then be used to build an encoder
that, given a graph as input, outputs the embeddings of all nodes and
the relation parameters. A factorization method can then be used to
evaluate unseen facts in the graph, exploiting the resulting embeddings.
Such methods are used as scoring functions in order to obtain, starting from
the entity embeddings, a real value that can be used to score the unseen
triples under evaluation.
\newline
DistMult \cite{yang2014} is one of the most common and simplest factorization
methods used to score unseen triples. Given a triple $(s, r, o)$, where $s$
is the source node, $o$ is the destination node, and $r$ is the relation of the
edge that links the source to the destination, DistMult computes
an associated real-valued score as follows:
\begin{equation}
score_{(s,r,o)} = f(s,r,o) = e_s^T R_r e_o
\end{equation}
Where $e_s$ and $e_o$ are the embeddings of the source and the destination node,
obtained by means of an encoder model such as the R-GCN, and
$R_r\in\mathbb{R}^{d \times d}$ is obtained by transforming the embedding
of the relation $r$ into a diagonal matrix. This relation embedding is not
learnt by the R-GCN encoder, but is a randomly initialized vector that
represents the relation.
The score obtained by applying the factorization method above can then be used
to evaluate whether the triple $(s,r,o)$ is a good candidate to be added to the
graph: a high score has to be interpreted as high confidence of the model
that the triple should belong to the knowledge base.
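The scoring function can be written directly from its definition. The
following is a minimal NumPy sketch; the embeddings are random placeholders
rather than the output of a trained encoder.
\begin{lstlisting}[language=Python, frame=single]
import numpy as np

def distmult_score(e_s, r, e_o):
    """DistMult score: e_s^T diag(r) e_o."""
    return float(e_s @ np.diag(r) @ e_o)

d = 8
e_s, e_o = np.random.rand(d), np.random.rand(d)  # node embeddings (placeholders)
r = np.random.rand(d)                            # relation embedding (placeholder)
print(distmult_score(e_s, r, e_o))               # higher score = more plausible triple
\end{lstlisting}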
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% -------------------------- Approach and methodology ------------------------ %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Approach and Methodology}
\label{ch:approach}
This chapter introduces the architecture developed to build, enhance
and visualize the Polito Knowledge Graph (PKG), an academic RDF graph
built to organize in a structured and semantically coherent way the publications
produced by the researchers of the Politecnico di Torino. The graph
also includes publication-related entities, such as authors, journals,
and fields of study.
The architecture is composed of three main modules, which at a high level
can be seen as forming a producer-consumer architecture, where producers
generate the graph data and consumers use it.
The architecture, which is shown in Figure \ref{fig:pipeline}, is structured
as a pipeline composed of the following three modules:
\begin{enumerate}
\item The \emph{Builder}, which creates a first version of the
RDF graph.
\item The \emph{Enhancer}, which implements ML techniques to predict
unseen facts. Such facts can then be added to the graph in order to
complete the knowledge base.
\item The \emph{Viewer}, a web application that allows users to query
and visualize the graph data.
\end{enumerate}
The \emph{Builder} acts as a producer, taking as input the IRIS data
and producing a set of RDF statements that together compose the
Polito Knowledge Graph.
The \emph{Enhancer} acts as both a producer and a consumer, since it
takes the PKG as input and uses it to predict unseen facts that can
later be added as new RDF statements.
The \emph{Viewer} consumes the graph by storing it in a triplestore and exposing
a public web interface for querying and visualizing the data.
\newline
\newline
\begin{figure}[h]
\centering
\includegraphics[scale=0.8]{img/pipeline.png}
\caption{Schema of the pipelined software architecture developed to build,
enhance and visualize the Polito Knowledge Graph.
The legend is shown in the bottom left corner of the figure.
The circled numbers are used to identify the steps in the pipeline.
}
\label{fig:pipeline}
\end{figure}
\newpage
\section{Builder Module}
The Builder input is a dump of the
IRIS\footnote{\url{https://iris.polito.it/}} database, which is the platform
used by the Politecnico di Torino to store and share all the
scientific papers written by its researchers. The dump is a
JSON file that contains all the information available for the scientific
papers published over a five-year period, from 2013 to 2017.
The Builder's goal is to translate all the information contained in the
dump into a set of semantically coherent RDF triples. To do so, an ontology that
describes the domain of the IRIS scholarly data has been defined.
The Builder uses the defined ontology as a reference to analyze each record in
the JSON dump, and builds facts as RDF triples by matching the information
contained in each record with the concepts defined by the ontology.
For example, given that a publication may have more than one contributor, the
defined ontology differentiates the principal author from the
contributors by using two different statements to tie the corresponding entities
to the publication.
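The translation of a dump record into RDF statements can be sketched as
follows. This is a minimal illustration assuming the Python \texttt{rdflib}
library; the namespace, the property names and the record fields are
hypothetical placeholders, not the actual PKG ontology or IRIS schema.
\begin{lstlisting}[language=Python, frame=single]
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

# hypothetical namespace and properties, not the actual PKG ontology
PKG = Namespace("http://example.org/pkg/")

def record_to_triples(record, graph):
    """Translate one (hypothetical) IRIS JSON record into RDF statements."""
    pub = URIRef(PKG["publication/" + str(record["id"])])
    graph.add((pub, RDF.type, PKG.Publication))
    graph.add((pub, RDFS.label, Literal(record["title"])))
    # the ontology distinguishes the principal author from the contributors
    graph.add((pub, PKG.author, URIRef(PKG["author/" + record["author_id"]])))
    for cid in record.get("contributor_ids", []):
        graph.add((pub, PKG.contributor, URIRef(PKG["author/" + cid])))
    return graph

record = {"id": 1, "title": "A paper", "author_id": "a1", "contributor_ids": ["a2"]}
print(record_to_triples(record, Graph()).serialize(format="turtle"))
\end{lstlisting}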
One of the goals of this work is to link each publication to its research
fields through unique and field-wide semantic concepts.
To do so, the publication abstract has been used
as input text for TellMeFirst (TMF) \cite{rocha2015}, a tool for the automatic
extraction of semantic keywords from texts. Such semantic keywords, from now on
referred to as \emph{topics}, are retrieved from the DBpedia taxonomy
and are uniquely identified by their corresponding URI.
The use of TMF made it possible to automatically tie each publication to
its relevant topics by adding the corresponding RDF statements to the graph.
Since the topics are added as uniquely identified graph entities, all the
publications that share a topic are linked to the same semantic entity.
The result of this process is a set of RDF statements that constitutes the
first version of the Polito Knowledge Graph, a semantically coherent description
of the publications produced by the Polito researchers, where each
publication is uniquely identified and linked by means of semantic relations