This package provides some basic necessary tools to do attribute value normalization and string clustering in data cleaning and integration pipelines.
At the moment, due to incompatibility of Linux and Mac versions of PyQt5 package, we have two separate branches for Linux and Mac user interfaces. The master
branch contains the version which works on Linux and the pyqt5mac-dev
branch contains the version which works in Mac.
To install this package:
-
First install PyQt5, e.g. using one of the following methods:
-
If you have Anaconda installed on your machine, then use the following command:
$ conda install pyqt
-
Otherwise follow the instructions at http://pyqt.sourceforge.net/Docs/PyQt5/installation.html
-
Then install the package using one of the following methods:
-
Run the following command:
pip install git+https://github.com/adelaneh/py_valuenormalization.git
-
First clone the package source code using the following command:
$ git clone https://github.com/adelaneh/py_valuenormalization
Then enter the source code root folder (
py_valuenormalization
) and install the package using the following command:$ python setup.py install
You can use
--prefix
to change the destination folder for installing the package (see the help using$ python setup.py --help
).
To use this package, import it by running the following python command:
>>> import py_valuenormalization as vn
We have developed two main approaches to normalize values, namely:
- Manual value normalization
- Clustering-based value normalization
In manual value normalization, you merge values into clusters to normalize them:
-
First load your data values into a list using the following command:
>>> vals = vn.read_from_file('PATH-TO-TEXT-FILE')
where the file at
PATH-TO-TEXT-FILE
contains the values to be normalized, one data value per line. You can download one of our sample datasets from https://github.com/adelaneh/py_manual_vn/tree/master/py_valuenormalization/data. -
Now run the command:
>>> res = vn.normalize_values(vals)
This will open the value normalization application which gives you instructions on how to normalize the input values.
-
Finally when you finish the normalization process and close the above application, the normalization results are returned in the variable
res
: it is a dictionary where each key is the label of a cluster of data values, and the corresponding value is the set of data values in this cluster.
In clustering-based value normalization, you first cluster the values using one of the following three method:
- Regular hierarchical agglomerative clustering (HAC)
- Smart clustering, which finds the best HAC parameter settings using input training data
- Hybrid clustering, which finds a clustering of input values such that you need to spend minimal time to clean up the clustering
To cluster the values, follow these steps:
-
First load your data values into a list using the following command:
>>> vals = vn.read_from_file('PATH-TO-TEXT-FILE')
where the file at
PATH-TO-TEXT-FILE
contains the values to be normalized, one data value per line. You can download one of our sample datasets from https://github.com/adelaneh/py_manual_vn/tree/master/py_valuenormalization/data. -
Then use one of the following sequences of commands to cluster the values:
-
Regular HAC:
>>> hac = vn.HierarchicalClustering(vals)
>>> clusts = hac.cluster(sim_measure = '3gram Jaccard', linkage = 'single', thr = 0.7)
where
vals
is the set of input values,sim_measure
,linkage
andthr
are standard HAC parameters, andclusts
is a dictionary where each key is the label of a cluster of data values, and the corresponding value is the set of data values in this cluster. -
Smart clustering:
>>> (_, training_pairs) = vn.calibrate_normalization_cost_model(vals)
>>> smc = vn.SmartClustering(vals, training_pairs)
>>> (clusts, best_setting) = smc.cluster()
where
training_pairs
is a dictionary where each key is a value pair(v1, v2)
withv1
andv2
being distinct input values, and the corresponding value isTrue
ifv1
andv2
refer to the same entity andFalse
otherwise.The output consists of a dictionary
clusts
and a tuplebest_setting
. Each key of the dictionaryclusts
is the label of a cluster of data values, and the corresponding value is the set of data values in this cluster.best_setting = (agrscore, simk, lnk, thr)
is a tuple of agreement score and HAC parameter settings using whichclusts
is obtained.agrscore
is the agreement score betweenclusts
andtraining_pairs
; i.e. the fraction of the value pairs intraining_pairs
which agree withclusts
.sim_measure
,linkage
andthr
are the standard HAC parameters settings using whichclusts
is obtained. -
Hybrid clustering:
>>> (cm, _) = vn.calibrate_normalization_cost_model(vals)
>>> hybhac = vn.HybridClustering(vals, cm)
>>> (clusts, mcl) = hybhac.cluster()
where
cm
is a cost model used by hybrid clustering algorithm to find the clustering of the input data set that requires minimum effort by you to clean it up.The outpurt consists of a dictionary
clusts
and an integermcl
. Each key of the dictionaryclusts
is the label of a cluster of data values, and the corresponding value is the set of data values in this cluster.mcl
is the maximum size of the clusters inclusts
.
-
Now you can clean up the clusters obtained above to arrive at the correct clustering of the input values. This phase consists of two main steps:
- Split step, where you split clusters containing values referring to more than one real-world entity into smaller clusters each of which contains values referring to a single entity
- Merge steps, in which you merge clusters referring to the same entity
To clean up the clustering results run the following command:
```>>> clean_clusts = vn.normalize_clusters(clusts)```
where clusts
is a dictionary where each key is the label of a cluster of data values, and the corresponding value is the set of data values in this cluster. This will open a graphical user interface to clean up clusts
and the results with be returned in clean_clusts
which is a dictionary where each key is the label of a cluster of data values, and the corresponding value is the set of data values in this cluster.