GitHub - adelaneh/py_stringclustering: Scalable String Clustering in Python

py_stringclustering

This project seeks to build a Python-based collection of commands for clustering a set of strings.

Given a set of strings D, the goal of string clustering is to partition D so that every pair of strings in the same partition refers to the same real-world entity, and no two strings in different partitions refer to the same real-world entity. A typical string clustering session involves six steps:

1. Read the data into memory.
2. Blocking: remove obvious non-matching string pairs to reduce the set of pairs considered for similarity score calculation.
3. Calculate pairwise similarity scores between the blocked string pairs.
4. Generate a similarity matrix from the result of the previous step.
5. Execute a clustering algorithm on the similarity matrix.
6. Generate string clusters from the labels assigned by the clustering algorithm.
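
The six steps can be sketched end to end with only pandas and numpy. This is a minimal illustration, not the py_stringclustering API: the function names (`qgrams`, `jaccard`, `cluster_strings`), the sample data, and the `threshold` parameter are all hypothetical, and the specific choices (blocking on shared character 2-grams, Jaccard similarity, connected components of the thresholded similarity graph as the clustering algorithm) are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np
import pandas as pd


def qgrams(s, q=2):
    """Return the set of lowercase character q-grams of s."""
    s = s.lower()
    return {s[i:i + q] for i in range(max(len(s) - q + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_strings(strings, threshold=0.5):
    # Step 1: read the data into memory.
    df = pd.DataFrame({"name": strings})
    grams = [qgrams(s) for s in df["name"]]

    # Step 2: blocking. Only pairs that share at least one q-gram
    # become candidates; all other pairs are never scored.
    blocks = defaultdict(list)
    for idx, gram_set in enumerate(grams):
        for gram in gram_set:
            blocks[gram].append(idx)
    candidates = {pair for ids in blocks.values()
                  for pair in combinations(ids, 2)}

    # Steps 3 and 4: score the candidate pairs and store the scores
    # in a symmetric similarity matrix (unscored pairs stay at 0).
    sim = np.eye(len(df))
    for i, j in candidates:
        sim[i, j] = sim[j, i] = jaccard(grams[i], grams[j])

    # Step 5: clustering. Merge the labels of every pair whose
    # similarity reaches the threshold (connected components).
    labels = np.arange(len(df))
    for i, j in zip(*np.where(np.triu(sim, k=1) >= threshold)):
        labels[labels == labels[j]] = labels[i]

    # Step 6: group the strings by their assigned label.
    df["label"] = labels
    return df.groupby("label")["name"].apply(list).tolist()


names = ["Apple Inc.", "Apple Inc", "apple inc",
         "Microsoft Corp", "Microsoft Corp.", "IBM"]
clusters = cluster_strings(names)
print(clusters)
```

With this sample data the three Apple variants and the two Microsoft variants should each end up in one cluster, and "IBM" in a cluster of its own. Connected components is the simplest choice here; any algorithm that accepts a precomputed similarity matrix could take its place in step 5.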

Current clustering packages do not provide easy-to-use, straightforward commands and workflows to perform all the above steps. py_stringclustering seeks to support all the steps involved in the above workflow.

The package is free, open-source, and BSD-licensed.

Important links

Project Homepage: https://sites.google.com/site/anhaidgroup/projects/magellan/py_stringclustering
Code repository: https://github.com/anhaidgroup/py_stringclustering
Issue Tracker: https://github.com/anhaidgroup/py_stringclustering/issues

Dependencies

The dependencies required to build the package are:

pandas (provides data structures to store and manage tables)
numpy (used to store similarity matrices and required by pandas)

Platforms

py_stringclustering has been tested on Linux, OS X and Windows.
