The Reddit-to-Delta-Live project ingests data from Reddit in near real-time and processes it through a Delta Live pipeline. The raw Reddit data is ingested into the Bronze Layer, enriched in the Silver Layer, and transformed into the final Gold Layer, which is stored in Delta format for further analysis. The pipeline is designed to handle large volumes of data efficiently and ensure that the data is processed, cleaned, and enriched for downstream use.
- Bronze Layer: Raw data ingestion from Reddit via API. This layer stores unprocessed data, such as post ID, title, author, score, and creation timestamp in Delta tables.
- Silver Layer: Data refinement and cleaning. Duplicates are removed, missing values are handled, and the data is transformed and enriched using sentiment and emotion analysis.
- Gold Layer: Final processed data, including sentiment and emotion features, stored in Delta format, ready for analysis or reporting.
- Data Ingestion: Reddit posts are ingested via the Reddit API and stored in the Bronze Layer as raw data in Delta format.
- Data Transformation: In the Silver Layer, the data is cleaned and transformed. This includes handling missing data, removing duplicates, and performing aggregations.
- Enrichment: Sentiment and emotion analysis is performed on Reddit post titles and descriptions, and features like TF-IDF are extracted.
- Final Output: The cleaned and enriched data is stored in the Gold Layer in Delta format, ready for further analysis.
- Python: For data processing and transformation.
- Azure Databricks: For managing and running Spark-based workflows in the cloud.
- PySpark: For distributed data processing and transformation.