MapReduce-KMeans-Clustering

This project implements the K-means clustering algorithm using Java and Hadoop MapReduce. It processes a large dataset of 3-dimensional points to iteratively assign points to clusters and update cluster centroids until convergence. The project demonstrates big data processing techniques, distributed computation, and scalable clustering using multiple iterations.

Key features:

- Multi-iteration K-means implementation.
- Customizable number of clusters (K) and iterations (R).
- Developed using IntelliJ IDEA and designed for Hadoop-compatible environments.

Dataset Creation

Points Dataset:
- 5,000+ 3-dimensional points (x, y, z).
- x ranges from 0 to 10,000; y and z are randomly generated within a defined range.
Seed Points:
- A file containing K randomly chosen seed points, where K is a configurable parameter.
Output Files:
- 3d_points_dataset.csv: Dataset of 3D points.
- seed_points_K.csv: Initial cluster centers.
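
The dataset description above can be sketched in plain Java as follows. This is a minimal illustration, not the project's actual generator: the class name `DatasetSketch`, the integer coordinates, the fixed random seeds, and the reuse of the 0 to 10,000 range for y and z are all assumptions, as is drawing the seed points from the same range rather than sampling them from the dataset.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

final class DatasetSketch {
    // Assumption: the README only states the x range; y and z reuse it here.
    static final int MAX_COORD = 10_000;

    /** Returns n random integer points, each as {x, y, z} in [0, MAX_COORD]. */
    static int[][] randomPoints(int n, long seed) {
        Random rng = new Random(seed);
        int[][] pts = new int[n][3];
        for (int[] p : pts) {
            for (int d = 0; d < 3; d++) {
                p[d] = rng.nextInt(MAX_COORD + 1); // upper bound is exclusive
            }
        }
        return pts;
    }

    /** Writes one "x,y,z" line per point. */
    static void writeCsv(String path, int[][] pts) throws IOException {
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(path)))) {
            for (int[] p : pts) {
                out.println(p[0] + "," + p[1] + "," + p[2]);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int k = 5; // configurable number of seed points
        writeCsv("3d_points_dataset.csv", randomPoints(5000, 42L));
        // Assumption: K is substituted into the seed file name.
        writeCsv("seed_points_K" + k + ".csv", randomPoints(k, 7L));
    }
}
```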

K-means Variants

This project implements and compares four variations of K-means clustering using Hadoop MapReduce:

Task 1: Single-Iteration K-means (R=1): Executes one iteration of the K-means algorithm to assign points to clusters and compute new centers.
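
A single iteration consists of an assignment step followed by a center update. The sketch below shows that logic in plain Java without any Hadoop API, so it can be run on its own. In a MapReduce job the assignment would typically happen in the Mapper and the averaging in the Reducer, but the class and method names here (`KMeansStep`, `nearest`, `iterate`) are hypothetical, and keeping the old center for an empty cluster is an assumed convention.

```java
final class KMeansStep {
    /** Squared Euclidean distance between two 3-D points. */
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < 3; d++) {
            double diff = a[d] - b[d];
            s += diff * diff;
        }
        return s;
    }

    /** Assignment step: index of the center closest to p. */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (dist2(p, centers[c]) < dist2(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Update step: each new center is the mean of the points assigned to it. */
    static double[][] iterate(double[][] points, double[][] centers) {
        int k = centers.length;
        double[][] sums = new double[k][3];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearest(p, centers);
            for (int d = 0; d < 3; d++) {
                sums[c][d] += p[d];
            }
            counts[c]++;
        }
        double[][] next = new double[k][];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                next[c] = centers[c].clone(); // assumption: empty cluster keeps its center
            } else {
                next[c] = new double[] {
                    sums[c][0] / counts[c], sums[c][1] / counts[c], sums[c][2] / counts[c]
                };
            }
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0, 0}, {2, 0, 0}, {10, 10, 10}, {12, 10, 10}};
        double[][] centers = {{1, 1, 1}, {9, 9, 9}};
        for (double[] c : iterate(points, centers)) {
            System.out.println(c[0] + "," + c[1] + "," + c[2]);
        }
    }
}
```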

Task 2: Basic Multi-Iteration K-means (R=5): Executes the K-means algorithm for a fixed number of iterations (R=5), without checking for early convergence.

Task 3: Advanced Multi-Iteration K-means with Early Termination: Stops as soon as the cluster centers remain unchanged between two consecutive iterations, or move by less than a predefined threshold.
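
One way to express the early-termination test is to compare each center before and after an iteration and stop once none has moved farther than the threshold. The sketch below is an assumption about how such a check could look, not the project's driver code: `EarlyStop`, `converged`, and `iterationsUsed` are hypothetical names, and the iteration itself is passed in as a function so the example stays independent of Hadoop. A threshold of 0 corresponds to the "centers remain unchanged" condition.

```java
import java.util.function.UnaryOperator;

final class EarlyStop {
    /** True if no center moved farther than threshold (Euclidean distance). */
    static boolean converged(double[][] prev, double[][] next, double threshold) {
        for (int c = 0; c < prev.length; c++) {
            double moved2 = 0;
            for (int d = 0; d < 3; d++) {
                double diff = next[c][d] - prev[c][d];
                moved2 += diff * diff;
            }
            // Compare squared values to avoid a square root.
            if (moved2 > threshold * threshold) {
                return false;
            }
        }
        return true;
    }

    /** Runs step at most maxIterations times; returns how many iterations were run. */
    static int iterationsUsed(double[][] start, UnaryOperator<double[][]> step,
                              int maxIterations, double threshold) {
        double[][] current = start;
        for (int i = 1; i <= maxIterations; i++) {
            double[][] next = step.apply(current);
            if (converged(current, next, threshold)) {
                return i; // early termination
            }
            current = next;
        }
        return maxIterations;
    }
}
```

The boolean returned by `converged` on the last iteration is the kind of value that could be written out as the convergence flag alongside the final centers.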

Task 4: Optimized Multi-Iteration K-means: Introduces Hadoop optimizations:
- Uses a combiner to reduce intermediate data size.
- Improves Mapper and Reducer logic for faster convergence.
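
A combiner can only shrink the intermediate data safely if the values it emits can be merged again later, because Hadoop may apply a combiner zero, one, or several times. For K-means this usually means passing partial coordinate sums together with a point count instead of partial averages. The plain-Java sketch below illustrates that idea under those assumptions; `PartialSum` is a hypothetical value type, not a class taken from this project. In a real job, a Reducer-style class doing the same merge would be registered with `job.setCombinerClass(...)`.

```java
final class PartialSum {
    final double x, y, z;
    final long count;

    PartialSum(double x, double y, double z, long count) {
        this.x = x;
        this.y = y;
        this.z = z;
        this.count = count;
    }

    /** What a mapper would emit for one point assigned to a cluster. */
    static PartialSum of(double[] p) {
        return new PartialSum(p[0], p[1], p[2], 1);
    }

    /** Associative merge: usable in both the combiner and the reducer. */
    PartialSum merge(PartialSum o) {
        return new PartialSum(x + o.x, y + o.y, z + o.z, count + o.count);
    }

    /** Final reducer step: divide the sums by the total count. */
    double[] centroid() {
        return new double[] {x / count, y / count, z / count};
    }

    public static void main(String[] args) {
        // Split A holds three points of one cluster, split B holds one.
        PartialSum a = of(new double[] {0, 0, 0})
                .merge(of(new double[] {2, 2, 2}))
                .merge(of(new double[] {4, 4, 4}));
        PartialSum b = of(new double[] {10, 10, 10});
        double[] c = a.merge(b).centroid();
        System.out.println(c[0] + "," + c[1] + "," + c[2]);
    }
}
```

Averaging the two per-split means directly would weight the single point in split B as heavily as the three points in split A, which is why the count travels with the sums.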

Output Variations

The project produces two types of outputs:

Cluster Centers: Final cluster centers with a flag indicating whether convergence was reached.
Clustered Data: The dataset with points labeled by their assigned cluster centers.

Clustering Evaluation

Evaluation Metric: Silhouette Score

Measures the quality of clustering by comparing intra-cluster cohesion with inter-cluster separation.

Implemented using MapReduce:

Mapper: Computes distances between points and clusters.
Reducer: Aggregates results to compute the Silhouette score for each cluster and overall dataset.

Purpose: Evaluate and compare the clustering quality of the different K-means variations.
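
For reference, the standard Silhouette definition for a point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other points in its own cluster and b(i) is the smallest mean distance to the points of any other cluster. The single-machine sketch below computes it with pairwise distances. It is an illustration of the metric only: the class name `Silhouette` is hypothetical, the score of 0 for singleton or degenerate cases is a common convention assumed here, and the project's MapReduce version may compute or approximate the score differently (for example, using distances to cluster centers).

```java
final class Silhouette {
    /** Euclidean distance between two 3-D points. */
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < 3; d++) {
            double diff = a[d] - b[d];
            s += diff * diff;
        }
        return Math.sqrt(s);
    }

    /** Per-point silhouette scores; labels[i] is the cluster index of pts[i] in [0, k). */
    static double[] scores(double[][] pts, int[] labels, int k) {
        int n = pts.length;
        int[] sizes = new int[k];
        for (int l : labels) {
            sizes[l]++;
        }
        double[] s = new double[n];
        for (int i = 0; i < n; i++) {
            int own = labels[i];
            if (sizes[own] <= 1) {
                continue; // convention: singleton clusters score 0
            }
            // Total distance from point i to the points of each cluster.
            double[] total = new double[k];
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    total[labels[j]] += dist(pts[i], pts[j]);
                }
            }
            double a = total[own] / (sizes[own] - 1);
            double b = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                if (c != own && sizes[c] > 0) {
                    b = Math.min(b, total[c] / sizes[c]);
                }
            }
            double max = Math.max(a, b);
            // Leave the score at 0 when no other cluster exists or all distances are 0.
            if (!Double.isInfinite(b) && max > 0) {
                s[i] = (b - a) / max;
            }
        }
        return s;
    }

    /** Overall score: mean of the per-point scores. */
    static double mean(double[] v) {
        double sum = 0;
        for (double x : v) {
            sum += x;
        }
        return v.length == 0 ? 0 : sum / v.length;
    }
}
```

This pairwise form costs O(n²) distance computations, which is one reason a distributed or center-based variant is attractive for larger datasets.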

Installation and Usage

Prerequisites:

Java Development Kit (JDK) 8 or higher.
Apache Hadoop installed and configured.
IntelliJ IDEA (or any preferred IDE).


Mandar-1007/MapReduce-KMeans-Clustering
