This repository contains an example of using dbd database prototyping tool for loading Kaggle dataset files to a database.
The dbd tool supports Kaggle datasets since its version 0.8.3.
This example supports loading of Kaggle files to SQLite, Postgres, and MySQL databases. The dbd tool supports more database engines. Modifying the example to support Snowflake, Redshift, or BigQuery should be trivial. Let me know (by submitting an Issue) if you need help.
There are two examples available:
- Kaggle Omicron dataset that loads COVID-19 Omicron variant worldwide data. This example's sources are here.
- Kaggle NYT COVID-19 dataset that loads New York Times COVID-19 US data. This example's sources are here
Here are quick steps that assume that you already have Python 3.8+ and Python virtual environment installed on your computer. If not, please refer to the Installation chapter below.
Also, the instructions below are for SQLite. Read the Using MySQL or Postgres database chapter below if you have MySQL or Postgres.
- Modify the
KAGGLE_USERNAME
andKAGGLE_KEY
environment variables in thebin/env.sh
(Linux and MacOS) orbin\env.bat
scripts.
To use the Kaggle API, sign up for a Kaggle account at https://www.kaggle.com. Then go to the 'Account' tab of your user profile (https://www.kaggle.com/<username>/account
) and select 'Create API Token'. This will trigger the download of kaggle.json, a file containing your API credentials. The file looks like this:
{"username": "<kaggle-username>", "key": "<kaggle-key>"}
Set the KAGGLE_USERNAME
and KAGGLE_KEY
environment variables in the bin/env.sh
or bin\env.bat
script to to the <kaggle-username>
and <kaggle-key>
values.
NOTE: You can double-quote ("
) these values on Linux or MacOS, but don't quote them on Windows.
-
Install dbd by running
bin/install.sh
on Linux or MacOS orbin\install.bat
on Windows. -
Set your environment by calling
bin/setenv.sh
orbin\setenv.bat
-
Run the example using a script in the
bin
directory
on Linux or MacOS:
cd etl
../bin/omicron_sqlite.sh
on Windows:
cd etl
..\bin\omicron_sqlite.bat
OR run the example directly
on Linux or MacOS:
source bin/setenv.sh
cd etl/omicron
dbd --project sqlite.project run .
on Windows:
bin\setenv.bat
cd etl\omicron
dbd --project sqlite.project run .
You need to execute the bin/setenv.sh
or bin\setenv.bat
script just once per your terminal session.
You can easily modify files in the model directory of these examples to load your favorite Kaggle dataset. Here are the high-level steps:
-
Start your terminal or cmd and execute the
bin/setenv.sh
orbin\setenv.bat
script. -
Create a new dbd project by executing dbd
init
command
dbd init my-kaggle-dataset
- Modify the contents of the
my-kaggle-dataset/model
directory that has been generated. You most probably want to delete the demo data files and create a newmy-kaggle-dataset/model/name.ref
file with a reference to a Kaggle dataset:
kaggle://yamqwe/omicron-covid19-variant-daily-cases>covid-variants.csv
The reference is an URL with kaggle://
schema (dbd supports many other sources), the Kaggle dataset identificator (e.g. yamqwe/omicron-covid19-variant-daily-cases
) and finally the name of a file inside the dataset (e.g. covid-variants.csv
).
As the reference is lovcated in the file my-kaggle-dataset/model/name.ref
, it creates a new name
table in the target database and populates it with the data.
- Edit database connection in the
model/dbd.profile
file. The dbd uses SQLite by default as it doesn't require any database server. See the chapter below on how to modify themodel/dbd.profile
file to load data to MySQL or Postgres.
Here is an example dbd.profile
file:
databases:
mysql:
db.url: "mysql+pymysql://{{ MYSQL_USER }}:{{ MYSQL_PASSWORD }}@{{ MYSQL_HOST }}/{{ MYSQL_DB }}?charset=utf8mb4"
pgsql:
db.url: "postgresql://{{ POSTGRES_USER }}:{{ POSTGRES_PASSWORD }}@{{ POSTGRES_HOST }}/{{ POSTGRES_DB }}"
sqlite:
db.url: "sqlite:///omicron.db"
The dbd.profile
file can reference your environment variables for database parameters (e.g. host, port, password).
The dbd.profile
is usually located in your home directory and contains all database connections that you work with.
- Make sure that your
dbd.project
file references and existing database connection from thedbd.profile
config file. For example:
model: model
database: sqlite
- Load data from the referenced Kaggle dataset to a database
cd my-kaggle-dataset
dbd run .
dbd picks up the dbd.profile
and dbd.project
files from the current working directory. You can use --profile
and --project
commandline options to use different configuration files.
- If you want to modify the default column's data types or add database constraints like primary or foreign keys, checks or indexeas, you can create a YAML file with the same base name to override the defaults.
This chapter describes how to install Python3 and virtual environment on your computer.
Just install Python from the Microsoft Store. Search for Python 3.9
app and install it.
Then proceed with the steps in the Running the examples chapter.
Use Homebrew
# install homebrew itself
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# install homebrew Python
brew install python@3.9
brew install virtualenv
Then proceed with the steps in the Running the examples chapter.
sudo yum install python3
sudo yum install python3-virtualenv
Then proceed with the steps in the Running the examples chapter.
sudo apt install python3
sudo apt install python3-venv
Then proceed with the steps in the Running the examples chapter.
First, you have to specify database connection parameters in the bin/env.sh
script on Linux or MacOS:
# Postgres environment
export POSTGRES_USER=zsvoboda
export POSTGRES_PASSWORD=
export POSTGRES_HOST=localhost
export POSTGRES_DB=postgres
# MySQL environment
export MYSQL_USER=root
export MYSQL_PASSWORD=
export MYSQL_HOST=localhost
export MYSQL_DB=public
or bin/env.bat
on Windows:
rem Postgres environment
set POSTGRES_USER=zsvoboda
set POSTGRES_PASSWORD=
set POSTGRES_HOST=localhost
set POSTGRES_DB=postgres
rem MySQL environment
set MYSQL_USER=root
set MYSQL_PASSWORD=
set MYSQL_HOST=localhost
set MYSQL_DB=public
Then you can run the examples by either use a corresponding script in the bin
directory. For example, bin/omicron_pgsql.sh
/ bin\omicron_pgsql.bat
or bin/covid_mysql.sh
/ bin\covid_mysql.bat
.
OR
you can run dbd with a corresponding profile file (e.g. pgsql.project
or mysql.project
) directly.
For example:
on Linux or MacOS:
source bin/setenv.sh
cd etl/omicron
dbd --project pgsql.project run .
on Windows:
bin\setenv.bat
cd etl\omicron
dbd --project mysql.project run .
Refer to the complete dbd documentation here
These examples are licensed under BSD license.