Predict Book Ratings 📚: Project Overview

Created a tool that estimates a book's average rating (test Mean Absolute Error: 0.21) to help bibliophiles pick their next book to read.

Pulled 11,123 books from Kaggle using the pandas and opendatasets libraries in Python.

Applied Linear Regression and Random Forest Regression, and tuned hyperparameters with RandomizedSearchCV to find the best model.

Built a client-facing API using Flask.

Code Used

Python version: Python 3.7.11

Packages: pandas, opendatasets, seaborn, matplotlib, numpy, nltk, wordcloud, pickle, flask, json

For Web Framework Requirements: pip install -r requirements.txt

Resources Used

The dataset from Kaggle

How to download Kaggle datasets to Jupyter notebook guide

Language code table

Flask productionization

Instructions for Git LFS

Cheatsheet for Markdown

Data Collection

Pulled the dataset from Kaggle: 11,123 books with 12 columns:

bookID
title
authors
average_rating
isbn
isbn13
language_code
num_pages
ratings_count
text_reviews_count
publication_date
publisher
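
A minimal sketch of the collection step, assuming the dataset is downloaded with opendatasets and then read with pandas. The function names `download_dataset` and `load_books` are hypothetical, the dataset URL is left as a parameter, and stripping whitespace from column names is a defensive assumption rather than something the notebooks are known to do.

```python
import pandas as pd

# The 12 columns listed above.
EXPECTED_COLUMNS = [
    "bookID", "title", "authors", "average_rating", "isbn", "isbn13",
    "language_code", "num_pages", "ratings_count", "text_reviews_count",
    "publication_date", "publisher",
]


def download_dataset(dataset_url):
    """Download a Kaggle dataset (asks for Kaggle credentials when needed)."""
    import opendatasets as od  # imported lazily so loading works without it

    od.download(dataset_url)


def load_books(csv_source):
    """Read the books CSV and verify that all expected columns are present."""
    df = pd.read_csv(csv_source)
    # Assumption: header names may carry stray whitespace, so normalize them.
    df.columns = df.columns.str.strip()
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[EXPECTED_COLUMNS]
```

`load_books` accepts a file path or any file-like object, so the column check can be tried on a small sample before running it on the full file.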

Data Cleaning

After pulling the data, I cleaned the dataset so it was ready to use for the model. The following changes were made:

Extracted the year from the publication_date column into a new publication_year column.
Calculated the age of each book by subtracting its publication year from 2022 (the current year at the time of writing).
Dropped the isbn column since the isbn13 column is kept.
Replaced language codes with their full names and merged the English variants (e.g. en-ca, en-us) into English.
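
The cleaning steps above can be sketched with pandas roughly as follows. This is an illustration under assumptions, not the notebook's actual code: the month/day/year date format, the `age` and `language` column names, and the small `LANGUAGE_NAMES` mapping (a partial, illustrative subset of a language code table) are all assumed.

```python
import pandas as pd

REFERENCE_YEAR = 2022  # the "current year" used in the text

# Assumption: illustrative subset only; a full mapping would come from a
# language code table. Keys are lowercased codes.
LANGUAGE_NAMES = {
    "eng": "English", "en-us": "English", "en-gb": "English",
    "en-ca": "English", "fre": "French", "spa": "Spanish", "ger": "German",
}


def clean_books(df):
    """Return a cleaned copy of the raw books DataFrame."""
    out = df.copy()

    # Assumption: dates look like 9/16/2006; unparseable values become NaT.
    dates = pd.to_datetime(out["publication_date"], format="%m/%d/%Y",
                           errors="coerce")
    out["publication_year"] = dates.dt.year

    # Age of the book relative to the reference year.
    out["age"] = REFERENCE_YEAR - out["publication_year"]

    # isbn13 is kept, so the shorter isbn column is dropped.
    out = out.drop(columns=["isbn"])

    # Map codes to full names; unknown codes are left unchanged.
    codes = out["language_code"].str.strip().str.lower()
    out["language"] = codes.map(LANGUAGE_NAMES).fillna(out["language_code"])
    return out
```

Mapping every English variant to the same name is what merges en-ca, en-us, and similar codes into a single English category.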

Exploratory Data Analysis

Visualized the cleaned data to see trends and correlations between attributes and to check for outliers.

Created bar graphs, box plots, a scatter plot, and a heat map for numerical variables.
Created Pie Chart and WordCloud for categorical variables.
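
The outlier and correlation checks can also be done numerically. The sketch below is not taken from the notebook: it swaps in the 1.5 x IQR rule (the same rule box-plot whiskers conventionally use) in place of reading outliers off a plot, and computes the correlation matrix that a heat map would display. The helper names are hypothetical.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def numeric_correlations(df, columns):
    """Pearson correlation matrix for the given numeric columns."""
    return df[columns].corr()
```

For example, `iqr_outliers(df["num_pages"])` would flag unusually long or short books, and `numeric_correlations(df, ["num_pages", "ratings_count", "average_rating"])` gives the values behind a heat map of those columns.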

Model Building

Categorical variables were transformed:

The language variable was converted to dummy variables.
The authors, title, and publisher variables were encoded.

Data were split into train (80%) and test (20%) sets.
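
A hedged sketch of the encoding and split described above. The text does not say which encoding was used for authors, title, and publisher, so integer category codes are an assumption here, as are the `language` column name, the helper names, and the random seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def encode_features(df):
    """One-hot encode language; integer-encode the high-cardinality columns."""
    out = pd.get_dummies(df, columns=["language"])
    for col in ["authors", "title", "publisher"]:
        # Assumption: plain integer codes, one per distinct value.
        out[col] = out[col].astype("category").cat.codes
    return out


def split_data(df, target="average_rating", seed=42):
    """Split into 80% train and 20% test."""
    X = df.drop(columns=[target])
    y = df[target]
    return train_test_split(X, y, test_size=0.2, random_state=seed)
```

Fixing `random_state` keeps the split reproducible, so scores from different models are compared on the same held-out books.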

I used two models (Linear Regression and Random Forest Regression) to predict book ratings and evaluated them by using MAE (Mean Absolute Error).
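
The comparison can be sketched with scikit-learn as below. The data here is synthetic and the hyperparameter ranges are placeholders chosen for illustration; neither comes from the project. `compare_models` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split


def compare_models(X_train, X_test, y_train, y_test, seed=42):
    """Fit both models and return their test MAE."""
    linear = LinearRegression().fit(X_train, y_train)

    # Randomized search over a small, illustrative parameter space.
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=seed),
        param_distributions={
            "n_estimators": [20, 50],
            "max_depth": [3, 5, None],
            "min_samples_leaf": [1, 5],
        },
        n_iter=4,
        cv=3,
        scoring="neg_mean_absolute_error",
        random_state=seed,
    )
    search.fit(X_train, y_train)
    forest = search.best_estimator_

    return {
        "linear_regression": mean_absolute_error(y_test, linear.predict(X_test)),
        "random_forest": mean_absolute_error(y_test, forest.predict(X_test)),
    }


# Synthetic stand-in data: ratings near 4.0 driven by one feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 4.0 + 0.2 * X[:, 0] + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scores = compare_models(X_train, X_test, y_train, y_test)
print(scores)
```

Scoring the search with negative MAE keeps the tuning objective aligned with the metric used to compare the two models.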

Model Performance Evaluation

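The Random Forest Regression model achieved a slightly lower test MAE than the Linear Regression model (0.210 vs. 0.223), so it performed better on the metric used in this project. Its train scores are far better than its test scores (R^2 of 0.888 vs. -0.019), which indicates that it overfits the training data.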

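| Linear Regression | Train | Test |
| --- | --- | --- |
| MSE | 0.122 | 0.098 |
| R^2 | 0.042 | 0.008 |
| MAE | 0.228 | 0.223 |
| RMSE | 0.350 | 0.313 |

| Random Forest Regression | Train | Test |
| --- | --- | --- |
| MSE | 0.014 | 0.101 |
| R^2 | 0.888 | -0.019 |
| MAE | 0.079 | 0.210 |
| RMSE | 0.120 | 0.318 |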
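
For reference, the four metrics in the tables can be computed from their definitions as in the sketch below (plain Python, with a hypothetical `regression_metrics` helper). The last lines check that the reported test RMSE values equal the square root of the reported test MSE values.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, R^2, MAE, and RMSE for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "R^2": 1 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(mse),
    }


# Consistency check on the reported test-set values: RMSE = sqrt(MSE).
reported_test_mse = {"Linear Regression": 0.098, "Random Forest Regression": 0.101}
for name, mse in reported_test_mse.items():
    print(name, round(math.sqrt(mse), 3))  # 0.313 and 0.318
```

Because R^2 is 1 minus the ratio of residual to total sum of squares, a negative value such as the Random Forest test score of -0.019 means the predictions have a larger squared error than always predicting the mean rating of the test set.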

Productionization

The Flask API endpoint was built and hosted on a local web server. It takes in a request and returns a predicted book rating.
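
A minimal sketch of such an endpoint, under stated assumptions: the `/predict` route, the `{"input": [...]}` request shape, and the `{"response": ...}` reply are hypothetical, and `ConstantModel` is a placeholder standing in for the pickled model so the example runs on its own.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


class ConstantModel:
    """Placeholder for the trained model; always predicts the same rating."""

    def predict(self, rows):
        return [4.0 for _ in rows]


# In the real service this would be the model loaded from its pickle file.
model = ConstantModel()


@app.route("/predict", methods=["POST"])
def predict():
    """Read a feature list from the JSON body and return a predicted rating."""
    payload = request.get_json(silent=True) or {}
    features = payload.get("input")
    if not isinstance(features, list) or not features:
        return jsonify(error="expected JSON of the form {'input': [...]}"), 400
    prediction = model.predict([features])[0]
    return jsonify(response=float(prediction))
```

The endpoint can be exercised without starting a server through Flask's test client, for example `app.test_client().post("/predict", json={"input": [1, 2, 3]})`.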

Notes

My saved model was over 100 MB, and GitHub does not accept files larger than 100 MB. I used Git LFS (see Instructions for Git LFS under Resources Used) and it worked for me, so I am noting it here for future reference.

Thanks :)

cerenkasap/book_ratings
