
Spark Kafka Streaming Application

[Flow diagram]

Overview:

This project is a Dockerised pub/sub application written in Python that publishes data to Kafka and consumes and aggregates it using the Spark processing framework.
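
To make the publishing side concrete, below is a minimal producer sketch assuming the kafka-python client listed under Architecture. The broker address, topic name and event fields are illustrative placeholders, not taken from this repository.

# Minimal producer sketch, assuming the kafka-python client.
# Broker address, topic name and event fields are illustrative placeholders.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # assumed broker address inside the Docker network
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # serialise dicts as JSON bytes
)

event = {"user_id": 42, "action": "click", "ts": int(time.time())}
producer.send("user-behaviour", value=event)  # publish one user-behaviour event
producer.flush()  # block until all buffered messages are delivered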

Requirements:

  1. A Dockerised Kafka application that generates user-behaviour data and publishes it to a Kafka topic (sketched above).
  2. A Dockerised PySpark application that reads the real-time data from Kafka and performs aggregations on the raw stream (see the sketch after this list).
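
The consumer side of requirement 2 could look like the following PySpark Structured Streaming sketch. The schema, topic and broker address are the same illustrative assumptions as in the producer sketch, and the Kafka source package must be on the Spark classpath.

# Minimal consumer sketch with PySpark Structured Streaming (Spark 2.4.x).
# Run with the Kafka source on the classpath, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 consumer.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("spark-kafka-consumer").getOrCreate()

# Assumed shape of the user-behaviour events published by the producer.
schema = StructType([
    StructField("user_id", LongType()),
    StructField("action", StringType()),
    StructField("ts", LongType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker address
       .option("subscribe", "user-behaviour")            # assumed topic name
       .load())

# Kafka delivers bytes; decode the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*"))

# Aggregate in real time: running count of events per action.
counts = events.groupBy("action").count()

query = (counts.writeStream
         .outputMode("complete")  # re-emit the full aggregate each trigger
         .format("console")
         .start())
query.awaitTermination()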

Architecture:

The architecture involves the following tech stack:

  • kafka-python v1.4.7
  • Spark v2.4.4
  • Python v3.7.2
  • Docker

What can be done better

  1. Push the data into a persistent store such as InfluxDB, MySQL or Postgres (see the sketch after this list).
  2. Stream real-time aggregations from the persistent store into dashboards built in Grafana.
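
As a sketch of improvement 1, the aggregated stream from the consumer example above could be persisted with foreachBatch (available since Spark 2.4) and a JDBC sink. The Postgres URL, table name and credentials are placeholder assumptions, and the Postgres JDBC driver would need to be on the Spark classpath. Grafana could then read its dashboards straight from that table.

# Sketch of improvement 1, assuming Postgres as the persistent store.
# URL, table name and credentials are placeholders.
def write_to_postgres(batch_df, batch_id):
    # Append each micro-batch of aggregates to a Postgres table.
    (batch_df.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://postgres:5432/analytics")
     .option("dbtable", "action_counts")
     .option("user", "spark")
     .option("password", "spark")
     .mode("append")
     .save())

# `counts` is the aggregated stream from the consumer sketch above.
query = (counts.writeStream
         .outputMode("update")            # emit only changed aggregate rows
         .foreachBatch(write_to_postgres)
         .start())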

How to run this application

Clone the repository:

$ git clone https://github.com/ashshetty90/spark-kafka-application.git

After cloning, run the command below in the root directory of the project. Make sure Docker is installed on your machine.

$ docker-compose up -d

The above command runs the Kafka producer and the Spark consumer inside Docker containers.

$ docker container ps

This shows all the containers currently running inside Docker.

$ docker logs <container_id>

The logs for each container can be seen by running the above command.

Screenshots

Sample outputs:

[Kafka producer output]

[Spark consumer output]
