Skip to content

This Reddit Post Analysis Workflow collects and processes Reddit data using Apache Spark and Delta Lake. It transforms raw data, applies sentiment analysis, and extracts TF-IDF features. The pipeline ensures reliable, high-quality data storage and supports continuous analytics.

Notifications You must be signed in to change notification settings

Redgerd/Reddit-Post-Analysis-Workflow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Reddit-to-Delta-Live

Python Azure Databricks PySpark

Overview

The Reddit-to-Delta-Live project ingests data from Reddit in near real-time and processes it through a Delta Live pipeline. The raw Reddit data is ingested into the Bronze Layer, enriched in the Silver Layer, and transformed into the final Gold Layer, which is stored in Delta format for further analysis. The pipeline is designed to handle large volumes of data efficiently and ensure that the data is processed, cleaned, and enriched for downstream use.

Project Architecture

  • Bronze Layer: Raw data ingestion from Reddit via API. This layer stores unprocessed data, such as post ID, title, author, score, and creation timestamp in Delta tables.
  • Silver Layer: Data refinement and cleaning. Duplicates are removed, missing values are handled, and the data is transformed and enriched using sentiment and emotion analysis.
  • Gold Layer: Final processed data, including sentiment and emotion features, stored in Delta format, ready for analysis or reporting.

Data Flow

  1. Data Ingestion: Reddit posts are ingested via the Reddit API and stored in the Bronze Layer as raw data in Delta format.
  2. Data Transformation: In the Silver Layer, the data is cleaned and transformed. This includes handling missing data, removing duplicates, and performing aggregations.
  3. Enrichment: Sentiment and emotion analysis is performed on Reddit post titles and descriptions, and features like TF-IDF are extracted.
  4. Final Output: The cleaned and enriched data is stored in the Gold Layer in Delta format, ready for further analysis.

Technologies Used

  • Python: For data processing and transformation.
  • Azure Databricks: For managing and running Spark-based workflows in the cloud.
  • PySpark: For distributed data processing and transformation.

WhatsApp Image 2024-12-21 at 19 51 03_b606a2c6

About

This Reddit Post Analysis Workflow collects and processes Reddit data using Apache Spark and Delta Lake. It transforms raw data, applies sentiment analysis, and extracts TF-IDF features. The pipeline ensures reliable, high-quality data storage and supports continuous analytics.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published