by DataTalksClub
$ dvc init
Initialized DVC repository.
You can now commit the changes to git.
+---------------------------------------------------------------------+
| |
| DVC has enabled anonymous aggregate usage analytics. |
| Read the analytics documentation (and how to opt-out) here: |
| <https://dvc.org/doc/user-guide/analytics> |
| |
+---------------------------------------------------------------------+
What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
curl https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -o iris.csv
(using Git Bash on Windows 11), and store it in the data/ directory.
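Putting those two steps together, a minimal sketch of the download step (the target path data/iris.csv is an assumption):
$ mkdir -p data
$ curl https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -o data/iris.csv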
First we run dvc init (its output is shown above).
We don't want to push the dataset itself to Git because that doesn't scale; we can add remote storage for the data instead.
Next, we track the dataset with DVC:
$ dvc add data/iris.csv
100% Adding...|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:00, 7.85file/s]
To track the changes with git, run:
git add 'data\.gitignore' 'data\iris.csv.dvc'
To enable auto staging, run:
dvc config core.autostage true
Then we add, commit, and push iris.csv.dvc to our code repository (note that iris.csv itself is now ignored via .gitignore).
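For example (the commit message is just illustrative):
$ git add data/.gitignore data/iris.csv.dvc
$ git commit -m "Track iris.csv with DVC"
$ git push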
That's the beauty of DVC: we pushed only the small .dvc file containing the hash, not the actual data file. We have effectively created a data repository!
Let's make a change. After some time, say we add a new line to data/iris.csv:
5.4,2.5,3.3,1.2,Iris-versicolor
DVC should detect that we changed a tracked file, just like Git does, so if we run dvc status we get:
$ dvc status
data\iris.csv.dvc:
changed outs:
modified: data\iris.csv
so we add it again:
dvc add data/iris.csv
100% Adding...|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:00, 12.31file/s]
To track the changes with git, run:
git add 'data\iris.csv.dvc'
To enable auto staging, run:
dvc config core.autostage true
Git should now detect that iris.csv.dvc has changed, since the data is different and therefore the hash stored in it has changed.
$ git status
On branch main
Your branch is up to date with 'origin/main'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: data/iris.csv.dvc
Now let's commit and push to our repo, and then learn how to access previous versions.
We could also add a storage remote:
dvc remote add local /tmp/dvc-storage
and push the data to it with:
dvc push -r local
For simplicity, let's create the 'remote' storage in a local directory with mkdir /tmp/dvcstore
and add it as the default remote with:
dvc remote add -d myremote /tmp/dvcstore
Use dvc pull to retrieve your data from the remote. Usually we run it after git pull or git clone (as the DVC guide mentions).
To go back to the previous data version, run:
git checkout HEAD~1 data/iris.csv.dvc
and then dvc checkout (plus dvc pull if that version of the data is not in the local cache).
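A full round trip might look like this sketch (run dvc pull first if the old data is not cached locally):
$ git checkout HEAD~1 data/iris.csv.dvc   # restore the previous .dvc file
$ dvc checkout                            # sync data/iris.csv with that .dvc file
$ git checkout HEAD data/iris.csv.dvc     # return to the latest .dvc file
$ dvc checkout                            # and sync the data back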
Let's create some stages which are going to be part of our ML pipeline. We are going to use the command:
dvc stage add -n process \
-p data_source,processed_data_source \
-d src/process.py -d data/iris.csv \
-o data/processed \
python src/process.py
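dvc stage add records the stage in dvc.yaml. Given the flags above, it should look roughly like this (exact formatting may differ):
$ cat dvc.yaml
stages:
  process:
    cmd: python src/process.py
    deps:
    - data/iris.csv
    - src/process.py
    params:
    - data_source
    - processed_data_source
    outs:
    - data/processed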
Once you've added a stage, you can run the pipeline with dvc repro.
Let's also add the training stage:
dvc stage add -n training \
-p n_estimators,random_state,processed_data_source,model_path \
-d src/process.py -d data/processed/processed_iris.csv \
-o model/model.pkl \
python src/training.py
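The -p flags refer to keys in params.yaml. A hypothetical params.yaml consistent with the flags above could look like this (all paths and values are assumptions, except random_state=42 and the later n_estimators change, which appear in the output below):
$ cat params.yaml
data_source: data/iris.csv
processed_data_source: data/processed/processed_iris.csv
test_data_path: data/processed/test_iris.csv
model_path: model/model.pkl
n_estimators: 100
random_state: 42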
running dvc dag
we get:
+-------------------+
| data/iris.csv.dvc |
+-------------------+
*
*
*
+---------+
| process |
+---------+
*
*
*
+----------+
| training |
+----------+
let's commit our work and continue with Metrics, Plots, and Parameters
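For instance (the file list and commit message are illustrative):
$ git add dvc.yaml dvc.lock data/.gitignore
$ git commit -m "Add process and training stages"
$ git push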
Adding the final stage:
dvc stage add -n evaluate \
-p model_path,test_data_path \
-d src/training.py -d model/model.pkl -d data/processed/processed_iris.csv -d data/processed/test_iris.csv \
-o eval \
python src/evaluation.py
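The new stage only runs after another dvc repro; stages that have already run and whose dependencies are unchanged get skipped:
$ dvc repro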
to view the metrics we run dvc metrics show
Path               accuracy    step
eval/metrics.json  0.93333     0
also the plots:
dvc plots show
now let's change the n_estimators parameter to n_estimators: 54
in params.yaml
if we run dvc status
we get:
(dvc_venv) ubuntu@ip-172-31-1-194:~/POW_DVC$ dvc status
training:
changed deps:
params.yaml:
modified: n_estimators
So DVC detected that n_estimators changed. If we run dvc repro we get:
$ dvc repro
'data/iris.csv.dvc' didn't change, skipping
Stage 'process' didn't change, skipping
Running stage 'training':
> python src/training.py
Model saved to model/model.pkl
Test Accuracy: 0.9642857142857143
Updating lock file 'dvc.lock'
Running stage 'evaluate':
> python src/evaluation.py
WARNING:dvclive:Some DVCLive features are unsupported in `dvc repro`.
To use DVCLive with a DVC Pipeline, run it with `dvc exp run`.
Model loaded successfully: RandomForestClassifier(n_estimators=54, random_state=42)
Updating lock file 'dvc.lock'
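To see what the parameter change did to the metrics, we can compare the workspace against the last committed run:
$ dvc params diff    # shows the old and new value of n_estimators vs. HEAD
$ dvc metrics diff   # shows the accuracy before and after the change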
Questions: How do we switch between data versions? (See the git checkout / dvc checkout steps above.)
Troubleshooting:
- If you get the following error, you probably need to initialize version control first with git init:
ERROR: failed to initiate DVC - C:\Users\AX-St\MyGIthub\POW_DVC is not tracked by any supported SCM tool (e.g. Git). Use `--no-scm` if you don't want to use any SCM or `--subdir` if initializing inside a subdirectory of a parent SCM repository.
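The fix is simply to initialize Git before DVC:
$ git init
$ dvc init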
- If you want to rewrite a stage with corrected dependencies, parameters, etc., use the --force flag, for example:
dvc stage add --force -n evaluate \
-p model_path,test_data_path \
-d src/training.py -d model/model.pkl -d data/processed/processed_iris.csv -d data/processed/test_iris.csv \
-o eval \
python src/evaluation.py
Then you will probably also have to run dvc repro --force.