Created a tool that estimates a book's average rating (Mean Absolute Error of about 0.21) to help bibliophiles pick the next book to read.
Pulled 11,123 books from Kaggle using the pandas and opendatasets libraries in Python.
Applied Linear Regression and Random Forest Regression, and tuned hyperparameters with randomized search cross-validation (RandomizedSearchCV in scikit-learn) to find the best model.
Built a client-facing API using Flask.
Python version: Python 3.7.11
Packages: pandas, opendatasets, seaborn, matplotlib, numpy, nltk, wordcloud, pickle, flask, json
For Web Framework Requirements: pip install -r requirements.txt
How to download Kaggle datasets to Jupyter notebook guide
Pulled the dataset from Kaggle. It contains 11,123 books with 12 columns:
- bookID
- title
- authors
- average_rating
- isbn
- isbn13
- language_code
- num_pages
- ratings_count
- text_reviews_count
- publication_date
- publisher
After pulling the data, I cleaned the dataset so it was ready for modeling. The changes were as follows:
- Extracted the year from the publication_date column into a new publication_year column.
- Calculated the age of each book by subtracting its publication year from 2022 (the current year at the time of writing).
- Dropped the isbn column, since the isbn13 column already identifies each book.
- Replaced language codes with their full names and merged the English variants (e.g. en-ca, en-us, etc.) into English.
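The cleaning steps above could look roughly like the sketch below. It is not the project's actual code: the sample rows, the `clean_books` helper, the `language` column name, and the partial `LANGUAGE_NAMES` mapping are all made up for illustration, and the date format is assumed to be month/day/year.

```python
import pandas as pd

# Tiny made-up sample that reuses column names from the Kaggle dataset.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "isbn": ["0000000001", "0000000002", "0000000003"],
    "isbn13": ["9780000000001", "9780000000002", "9780000000003"],
    "language_code": ["eng", "en-US", "spa"],
    "publication_date": ["9/16/2006", "11/1/2003", "5/5/1998"],
})

REFERENCE_YEAR = 2022  # the year used in the write-up

# Hypothetical, partial mapping; extend it for the other codes in the data.
LANGUAGE_NAMES = {
    "eng": "English",
    "en-us": "English",
    "en-gb": "English",
    "en-ca": "English",
    "spa": "Spanish",
}


def clean_books(df):
    """Return a cleaned copy of df following the steps listed above."""
    out = df.copy()
    # Assumed date format: month/day/year; unparseable dates become NaT.
    dates = pd.to_datetime(out["publication_date"], format="%m/%d/%Y", errors="coerce")
    out["publication_year"] = dates.dt.year
    out["age"] = REFERENCE_YEAR - out["publication_year"]
    # isbn13 is kept, so the shorter isbn column is redundant.
    out = out.drop(columns=["isbn"])
    # Lowercase the codes so "en-US" and "en-us" map the same way;
    # codes missing from the mapping are left unchanged.
    codes = out["language_code"].str.lower()
    out["language"] = codes.map(LANGUAGE_NAMES).fillna(out["language_code"])
    return out


cleaned = clean_books(books)
print(cleaned[["title", "publication_year", "age", "language"]])
```

Using `errors="coerce"` keeps the function from failing on a malformed date, at the cost of a missing `publication_year` for that row, so such rows would need to be checked before modeling.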
Visualized the cleaned data to see trends and correlations between attributes and to check for outliers.
Categorical variables were transformed:
- The language variable was converted to dummy variables.
- authors, title, and publisher variables were encoded.
Data were split into train (80%) and test (20%) sets.
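A minimal sketch of the encoding and the 80/20 split follows. The write-up does not say which encoder was used for authors, title, and publisher, so integer category codes are assumed here; the sample data and the `random_state` value are also made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample; the real dataset has many more rows and columns.
books = pd.DataFrame({
    "title": [f"Book {i}" for i in range(10)],
    "authors": ["A", "B", "A", "C", "B", "A", "C", "C", "B", "A"],
    "publisher": ["P1", "P2", "P1", "P1", "P2", "P3", "P3", "P1", "P2", "P3"],
    "language": ["English", "Spanish"] * 5,
    "num_pages": [100, 250, 320, 80, 410, 150, 275, 500, 60, 230],
    "average_rating": [3.9, 4.1, 3.5, 4.0, 3.8, 4.2, 3.7, 4.3, 3.6, 4.0],
})

# One dummy (indicator) column per language.
encoded = pd.get_dummies(books, columns=["language"])

# Assumption: high-cardinality text columns become integer category codes.
for col in ["title", "authors", "publisher"]:
    encoded[col] = encoded[col].astype("category").cat.codes

X = encoded.drop(columns=["average_rating"])
y = encoded["average_rating"]

# 80% train, 20% test; random_state only makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Integer codes impose an arbitrary order on authors and publishers. Tree-based models tolerate that better than linear models, which may be one reason the two models behave so differently on this data.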
I used two models (Linear Regression and Random Forest Regression) to predict book ratings and evaluated them by using MAE (Mean Absolute Error).
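The Random Forest Regression model achieved a lower test MAE than the Linear Regression model (0.210 vs 0.223). Its large gap between train and test scores and its negative test R^2 suggest it overfits, so the improvement is modest.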
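The sketch below shows how the two models could be fitted and compared on MAE, with the Random Forest tuned by scikit-learn's `RandomizedSearchCV`. The synthetic data, the parameter ranges, `n_iter`, and `cv` are assumptions chosen to keep the example small; they are not the values used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the book features and ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.9 + 0.2 * X[:, 0] + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: ordinary least squares.
linear = LinearRegression().fit(X_train, y_train)

# Hypothetical search space; each iteration samples one combination.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="neg_mean_absolute_error",  # sklearn maximizes, hence the negation
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# Compare both models on the held-out test set.
results = {
    "linear": mean_absolute_error(y_test, linear.predict(X_test)),
    "random_forest": mean_absolute_error(
        y_test, search.best_estimator_.predict(X_test)
    ),
}
print(search.best_params_)
print(results)
```

Scoring the search on cross-validated MAE rather than training error matters here, because a Random Forest can fit the training set almost perfectly while generalizing poorly.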
Linear Regression | Train | Test |
---|---|---|
MSE | 0.122 | 0.098 |
R^2 | 0.042 | 0.008 |
MAE | 0.228 | 0.223 |
RMSE | 0.350 | 0.313 |
Random Forest Regression | Train | Test |
---|---|---|
MSE | 0.014 | 0.101 |
R^2 | 0.888 | -0.019 |
MAE | 0.079 | 0.210 |
RMSE | 0.120 | 0.318 |
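For reference, the four metrics in these tables can be computed from true and predicted ratings as sketched below. The helper name `regression_metrics` and the sample ratings are made up. The example also shows why R^2 can be negative: a model that does worse than always predicting the mean of the true ratings scores below zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, R^2, MAE, and RMSE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    # Total variation of the true values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = mse * n
    r2 = 1 - ss_res / ss_tot
    return {"MSE": mse, "R^2": r2, "MAE": mae, "RMSE": math.sqrt(mse)}


y_true = [4.0, 3.5, 4.5, 3.0]

# Always predicting the mean of y_true gives R^2 = 0 by definition.
baseline = regression_metrics(y_true, [3.75] * 4)

# Predictions that are worse than the mean give a negative R^2.
worse = regression_metrics(y_true, [3.0, 4.5, 3.0, 4.5])

print(baseline)
print(worse)
```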
The Flask API endpoint was built and hosted on a local web server. It takes in a request and returns a predicted book rating.
My model file was over 100 MB, and GitHub doesn't accept files larger than 100 MB. I tried this method and it worked for me, so I'm attaching it for future reference.
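A minimal sketch of such an endpoint is shown below. The `/predict` route, the `features` payload key, and the `predicted_rating` response key are assumptions, not the project's actual interface. `PlaceholderModel` stands in for the pickled regressor so the example runs on its own, and its fixed output is a placeholder value.

```python
from flask import Flask, jsonify, request


class PlaceholderModel:
    """Stand-in for the trained regressor; returns a fixed placeholder rating."""

    def predict(self, rows):
        return [4.0 for _ in rows]


app = Flask(__name__)
# In the real project the model would be loaded from a pickle file instead.
model = PlaceholderModel()


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a malformed body.
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    if not isinstance(features, list) or not features:
        return jsonify({"error": "expected a non-empty 'features' list"}), 400
    prediction = model.predict([features])[0]
    return jsonify({"predicted_rating": float(prediction)})


# Exercise the endpoint with Flask's built-in test client (no server needed).
client = app.test_client()
response = client.post("/predict", json={"features": [320, 1500, 40, 16]})
result = response.get_json()
print(response.status_code, result)
```

To serve it locally, call `app.run()` or use the `flask run` command; the test client above is only a convenient way to try the route without starting a server.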
Thanks :)