Reddit-to-Delta-Live

Overview

The Reddit-to-Delta-Live project ingests Reddit data in near real time and processes it through a Delta Live Tables (DLT) pipeline. Raw Reddit data lands in the Bronze Layer, is cleaned and enriched in the Silver Layer, and is written to the final Gold Layer in Delta format for further analysis. The pipeline is designed to handle large volumes of data, so that downstream consumers receive cleaned, enriched records.

Project Architecture

Bronze Layer: Raw data ingestion from Reddit via the Reddit API. This layer stores unprocessed fields, such as post ID, title, author, score, and creation timestamp, in Delta tables.
Silver Layer: Data refinement and cleaning. Duplicates are removed, missing values are handled, and the data is transformed and enriched using sentiment and emotion analysis.
Gold Layer: Final processed data, including sentiment and emotion features, stored in Delta format, ready for analysis or reporting.
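
The Bronze and Silver steps described above can be sketched in plain Python. This is a minimal illustration, not the project's actual notebook code: the field names (id, title, author, score, created_utc) are assumed to follow the Reddit API's JSON keys, and the helper names to_bronze and to_silver, along with the "[unknown]" placeholder, are hypothetical. In the real pipeline these steps would run as PySpark transformations over Delta tables.

```python
from datetime import datetime, timezone

# Assumed Reddit API field names for the raw columns listed in the Bronze Layer.
BRONZE_FIELDS = ("id", "title", "author", "score", "created_utc")


def to_bronze(raw_post):
    """Keep only the raw fields the Bronze layer stores; no cleaning yet."""
    return {field: raw_post.get(field) for field in BRONZE_FIELDS}


def to_silver(bronze_rows):
    """Drop duplicate post IDs and handle missing values (hypothetical rules)."""
    seen_ids = set()
    silver_rows = []
    for row in bronze_rows:
        post_id = row.get("id")
        # Rows without an ID cannot be deduplicated, so they are skipped.
        if post_id is None or post_id in seen_ids:
            continue
        seen_ids.add(post_id)
        created_utc = row.get("created_utc")
        silver_rows.append({
            "id": post_id,
            "title": (row.get("title") or "").strip(),
            "author": row.get("author") or "[unknown]",
            "score": int(row.get("score") or 0),
            # Convert the epoch timestamp to a timezone-aware datetime.
            "created_at": (
                datetime.fromtimestamp(created_utc, tz=timezone.utc)
                if created_utc is not None
                else None
            ),
        })
    return silver_rows
```

For example, calling to_silver([to_bronze(p) for p in posts]) on a list of raw post dictionaries returns one cleaned row per unique post ID, keeping the first occurrence of each.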

Data Flow

Data Ingestion: Reddit posts are ingested via the Reddit API and stored in the Bronze Layer as raw data in Delta format.
Data Transformation: In the Silver Layer, the data is cleaned and transformed. This includes handling missing data, removing duplicates, and performing aggregations.
Enrichment: Sentiment and emotion analysis is run on Reddit post titles and descriptions, and text features such as TF-IDF weights are extracted.
Final Output: The cleaned and enriched data is stored in the Gold Layer in Delta format, ready for further analysis.
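
The enrichment step can be illustrated with a small, self-contained sketch. The source does not say which sentiment model or TF-IDF implementation the project uses, so everything here is an assumption: the word lists are a toy lexicon standing in for a real sentiment model, emotion analysis is not covered, and the function names tokenize, sentiment_score, and tf_idf are hypothetical. The TF-IDF variant shown is the plain form, term frequency times log(N / document frequency), without smoothing.

```python
import math
import re
from collections import Counter

# Toy lexicon, assumed for illustration only; a real pipeline would use a trained model.
POSITIVE = {"great", "love", "good", "amazing"}
NEGATIVE = {"bad", "hate", "terrible", "broken"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Return (positive - negative) word count divided by token count, in [-1, 1]."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return (positive - negative) / len(tokens)


def tf_idf(documents):
    """Compute unsmoothed TF-IDF weights, one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

With this definition, a term that appears in every title gets a weight of zero, so only words that distinguish one post from another contribute to the features.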

Technologies Used

Python: For data processing and transformation.
Azure Databricks: For managing and running Spark-based workflows in the cloud.
PySpark: For distributed data processing and transformation.

Folders and files

DLT Pipeline V1 Reddit
Pyspark
Bronze.ipynb
README.md
Sentiment Analysis.ipynb
Silver.html

Redgerd/Reddit-Post-Analysis-Workflow
