py_stringclustering

This project seeks to build a Python-based collection of commands for clustering a set of strings.

Given a set of strings D, the goal of string clustering is to partition D such that every pair of strings in the same partition refers to the same real-world entity, and no two strings in different partitions refer to the same real-world entity. A typical string clustering session involves six steps:

  1. Read the data into memory
  2. Blocking: remove obviously non-matching string pairs to reduce the set of pairs considered for similarity score calculation
  3. Calculate pairwise similarity scores between blocked string pairs
  4. Generate a similarity matrix based on the result of the previous step
  5. Execute a clustering algorithm using the similarity matrix created above
  6. Generate string clusters based on the labels assigned by the clustering algorithm
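
The six steps above can be sketched in plain Python as follows. This is a minimal illustration using only the standard library, not the py_stringclustering API: the function names (`tokens`, `jaccard`, `cluster_strings`), the token-overlap blocking rule, the Jaccard similarity measure, the 0.5 threshold, and the connected-components clustering are all assumptions chosen to keep the example short.

```python
from collections import defaultdict
from itertools import combinations


def tokens(s):
    """Return the set of lowercased whitespace-separated tokens of s."""
    return set(s.lower().split())


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cluster_strings(strings, threshold=0.5):
    """Hypothetical end-to-end sketch of the six-step workflow."""
    # Step 1: the data is assumed to be in memory already, as a list.
    n = len(strings)

    # Step 2: blocking. Only pairs sharing at least one token are candidates.
    index = defaultdict(set)
    for i, s in enumerate(strings):
        for tok in tokens(s):
            index[tok].add(i)
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))

    # Step 3: pairwise similarity scores for the blocked pairs only.
    scores = {(i, j): jaccard(strings[i], strings[j]) for i, j in candidates}

    # Step 4: similarity matrix (pairs removed by blocking stay at 0.0).
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (i, j), score in scores.items():
        sim[i][j] = sim[j][i] = score

    # Step 5: clustering. Here: connected components of the graph that
    # links every pair with similarity >= threshold, found via union-find.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)
    labels = [find(i) for i in range(n)]

    # Step 6: group the strings by the label assigned to each of them.
    clusters = defaultdict(list)
    for s, label in zip(strings, labels):
        clusters[label].append(s)
    return sorted(clusters.values())


names = [
    "Apple Inc",
    "International Business Machines",
    "apple inc",
    "Microsoft",
    "international business machines corp",
]
clusters = cluster_strings(names)
for cluster in clusters:
    print(cluster)
```

With these sample names, the two spellings of each company end up in the same cluster and "Microsoft" forms a cluster of its own. Any other similarity measure or clustering algorithm could be substituted in steps 3 and 5 without changing the overall shape of the workflow.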

Existing clustering packages do not offer simple commands and workflows that cover all of these steps. py_stringclustering aims to support the entire workflow.

The package is free, open-source, and BSD-licensed.

Dependencies

The required dependencies to build the package are:

  • pandas (provides data structures to store and manage tables)
  • numpy (used to store similarity matrices and required by pandas)

Platforms

py_stringclustering has been tested on Linux, OS X and Windows.
