User Guide

Project Organization

├── Makefile                   <- Makefile with project operations like `make test` or `make docs`
├── README.md                  <- The top-level README for developers using this project.
├── data
│   ├── raw                    <- The initial raw data extract from the data warehouse.
│   ├── interim                <- Intermediate data that has been transformed but is not used directly for modeling or evaluation.
│   ├── processed              <- The final, canonical data sets for modeling and evaluation.
│   └── external               <- Data from third party sources.
│
├── docs                       <- Documentation templates guiding you through documentation expectations; structured
│   │                             as a Sphinx project to automate generation of formatted documents; see sphinx-doc.org for details
│   ├── figures                <- Generated graphics and figures to be used in reporting
│   ├── model_documentation    <- Markdown templates for model documentation and model risk management.
│   └── _build                 <- Generated documentation as HTML, PDF, LaTeX, etc. Do not edit this directory manually.
│
├── models                     <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks                  <- Jupyter notebooks. Naming convention is a number (for ordering),
│                                 the creator's initials, and a short `-` delimited description, e.g.
│                                 `1.0-jqp-initial-data-exploration.ipynb`.
├── references                 <- Data dictionaries, manuals, important papers, etc
│
├── pyproject.toml             <- Project configuration file; see [`setuptools documentation`](https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html)
│
├── poetry.lock                <- The lock file for reproducing the analysis environment, e.g.
│                                 generated with `poetry lock`; see https://python-poetry.org/docs/
│
├── .pre-commit-config.yaml    <- Default configuration for pre-commit hooks enforcing style and formatting standards
│                                 via `isort`, `brunette`, and `flake8`
│
├── setup.cfg                  <- Project packaging and style configuration
│
├── setup.py                   <- Project installation script; makes the project pip installable (`pip install -e .`) so
│                                 the project module can be imported
│
├── .dvc                       <- Data versioning cache and configuration using dvc; see https://dvc.org
│   └── config                 <- INI-formatted configuration file for the DVC project; defines the default remote data storage location
│
├── tests                      <- Automated test scripts
│   ├── data                   <- Tests for data download and/or generation scripts
│   │   └── test_make_dataset.py
│   │
│   ├── features               <- Tests for feature generation scripts
│   │   └── test_build_features.py
│   │
│   ├── models                 <- Tests for model training and prediction scripts
│   │   ├── test_predict_model.py
│   │   └── test_train_model.py
│   │
│   └── visualization          <- Tests for visualization scripts
│       └── test_visualize.py
│
├── milton_maps                <- Source code for use in this project.
│   │
│   ├── __init__.py                               <- Makes milton_maps a Python package
│   │
│   ├── data                                      <- Scripts to download or generate data
│   │   └── make_dataset.py                       <- Utility CLI script to extract data from database tables to local parquet files.
│   │
│   ├── features                                  <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models                                    <- Scripts to train models and then use trained models to make
│   │   │                                            predictions (write outputs to `PROJECT_ROOT/models`)
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization                             <- Scripts to create exploratory and results-oriented visualizations
│       └── visualize.py                             (write outputs to `PROJECT_ROOT/reports/figures`)

DVC How To…

Of the tools incorporated in this project template, DVC is likely the least familiar to the typical Data Scientist and has more of a learning curve to use effectively. DVC's own documentation is good and well worth reviewing in greater detail.

Below are the DVC commands you will use most frequently during model development, discussed in the order that they're typically used in a standard workflow. Consult the DVC command reference for a full listing of commands and greater detail.

Check whether code and data are aligned: dvc status

dvc status shows changes in the project pipelines, as well as file mismatches either between the cache and workspace, or between the cache and remote storage. Use it to diagnose whether data and code are aligned and see which stages of the pipeline will run if dvc repro is called.
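
For example (a sketch; the --cloud flag compares the local cache against remote storage instead of the workspace):

$ dvc status            # compare workspace files against dvc.lock and .dvc files
$ dvc status --cloud    # compare the local cache against the default remote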

Align code and (previously-calculated) data: dvc checkout

dvc checkout updates DVC-tracked files and directories in the workspace based on current dvc.lock and .dvc files. This command is usually needed after git checkout, git clone, or any other operation that changes the current dvc.lock or .dvc files in the project (though the installed git hooks frequently automate this step). It restores the versions of all DVC-tracked data files and directories referenced in DVC metadata files from the cache to the workspace.
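
A typical sequence after switching branches looks like this (the branch name is illustrative):

$ git checkout feature/new-model   # updates dvc.lock and .dvc files to that revision
$ dvc checkout                     # restores the matching data versions from the cache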

Define the pipeline DAG

You define the Directed Acyclic Graph (DAG) of your pipeline in the dvc.yaml file. The individual scripts or tasks in the pipeline are called “stages”.

dvc.yaml defines a list of stages, the commands required to run them, their input data and parameter dependencies, and their output artifacts, metrics, and plots.

dvc.yaml uses the YAML 1.2 format and a human-friendly schema explained in detail in the DVC documentation. DVC provides CLI commands to edit the dvc.yaml file, but it is generally easiest to edit it manually.

dvc.yaml files are designed to be small enough so you can easily version them with Git along with other DVC files and your project’s code.

Add a stage to the model pipeline

Let’s look at a sample stage. It depends on the script file it runs as well as on a raw data input:

stages:
  prepare:
    cmd: source src/cleanup.sh
    deps:
    - src/cleanup.sh
    - data/raw.csv
    outs:
    - data/clean.csv

A new stage can be added to the model pipeline DAG in one of two ways:

  1. Directly edit the pipeline in dvc.yaml files. (recommended)

  2. Use the CLI command dvc stage add, a limited command-line interface to set up pipelines. For example:

$ dvc stage add --name train \
                --deps src/model.py \
                --deps data/clean.csv \
                --outs data/predict.dat \
                python src/model.py data/clean.csv

would add the following to dvc.yaml

stages:
  prepare:
    ...
  train:
    cmd: python src/model.py data/clean.csv
    deps:
    - src/model.py
    - data/clean.csv
    outs:
    - data/predict.dat

Tip

One advantage of using dvc stage add is that it will verify the validity of the arguments provided (otherwise stage definition won’t be checked until execution). A disadvantage is that some advanced features such as templating are not available this way.

Calculate (or reproduce) pipeline outputs: dvc repro

dvc repro reproduces complete or partial pipelines by running their stage commands as needed in the correct order. This is similar to make in software build automation, in that DVC captures “build requirements” (stage dependencies) and determines which stages need to run based on whether their outputs are “up to date”. Unlike make, it caches the pipeline’s outputs along the way.
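
A few common invocations (a sketch, reusing the train stage from the example above):

$ dvc repro                # run every stage whose dependencies have changed
$ dvc repro train          # reproduce only the train stage and its upstream dependencies
$ dvc repro --force train  # re-run train even if its outputs are up to date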

Define pipeline parameters

Parameters are any values used inside your code to tune analytical results. For example, a random forest classifier may require a maximum depth value. Machine learning experimentation often involves defining and searching hyperparameter spaces to improve the resulting model metrics.

Your source code should read params from structured parameters files (params.yaml by default). You can use the params field of dvc.yaml to tell DVC which parameters each stage depends on. When a param value has changed, dvc repro and dvc exp run invalidate any stages that depend on it and reproduce them.
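
As a sketch, a hypothetical train stage might read its hyperparameters from params.yaml and declare them in dvc.yaml like this (the parameter names are illustrative):

# params.yaml
train:
  max_depth: 8
  n_estimators: 200

# dvc.yaml (excerpt)
stages:
  train:
    cmd: python src/model.py data/clean.csv
    deps:
    - src/model.py
    - data/clean.csv
    params:
    - train.max_depth
    - train.n_estimators
    outs:
    - data/predict.dat

With this in place, editing train.max_depth in params.yaml invalidates the train stage on the next dvc repro.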

Run one or more pipelines with varying parameters: dvc exp run

dvc exp run runs or resumes a DVC experiment based on a DVC pipeline. DVC experiment tracking allows tracking of results from multiple runs of the pipeline with varying parameter values (as defined in params.yaml) without requiring each run to be associated with its own git commit [1]. When run with --temp or from the queue, it also executes the experiment in an isolated copy of the workspace, so that edits you make while the experiment is running will not affect results.
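
For example, reusing the hypothetical train.max_depth parameter sketched above, you could run an experiment with a modified value and tabulate the results:

$ dvc exp run --set-param train.max_depth=12
$ dvc exp show    # display params and metrics across experiments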

Caution

The experiment tracking feature in DVC is dangerous/confusing to use with a dirty git repository. DVC creates a copy of the experiment’s workspace in .dvc/tmp/exps/ and runs it there. Git-ignored files/dirs are excluded from queued/temp runs to avoid committing unwanted files into Git (e.g. once successful experiments are persisted). Under some circumstances, unstaged, git-tracked files are automatically staged and included in the isolated workspace. The expected behavior isn’t 100% clear in the documentation, so I just avoid using dvc exp run in a dirty repo.

Tip

Rule of thumb: if you want to run the pipeline to test whether uncommitted changes are correct and ready to be committed, use dvc repro. dvc exp run is best used with a clean repository where you want to “experiment” with results from parameter changes, not code changes.

Visit the DVC documentation and its helpful hands-on tutorial to learn about the experiment-tracking features offered by DVC.

Record changes to code and data: dvc commit

dvc commit records changes to files or directories tracked by DVC. It stores the current contents of tracked files and directories in the cache, and updates dvc.lock and .dvc metadata files as needed. This forces DVC to accept any changed contents of tracked data currently in the workspace. [2]
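
Typical usage (the file path is illustrative):

$ dvc commit                 # accept and cache the current contents of all tracked outputs
$ dvc commit data/clean.csv  # or target a specific tracked file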

Manage remote storage endpoints: dvc remote

dvc remote provides a set of commands to set up and manage remote storage: add, default, list, modify, remove, and rename.
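
For example, to register an S3 bucket as the default remote (the remote name and bucket path are illustrative):

$ dvc remote add --default myremote s3://my-bucket/dvcstore
$ dvc remote list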

Write or get data from remote storage: dvc push|pull

dvc push uploads tracked files or directories to remote storage based on the current dvc.yaml and .dvc files.

dvc pull downloads tracked files or directories from remote storage based on the current dvc.yaml and .dvc files, and makes them visible in the workspace.
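
A typical round trip between two machines looks like this (a sketch):

$ dvc push    # machine A: upload cached data to the default remote
$ git push    # machine A: share the dvc.lock and .dvc pointer files via git

$ git pull    # machine B: get the updated pointer files
$ dvc pull    # machine B: fetch the matching data into the workspace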

Manually track externally-created files: dvc add

dvc add tells DVC to track versions of data that is not created by the DVC pipeline in dvc.yaml. DVC allows tracking of such datasets using .dvc files as lightweight pointers to your data in the cache. The dvc add command is used to track and update your data by creating or updating .dvc files, similar to the usage of git add to add source code updates to git.
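
For example, to track a manually downloaded file (the path is an illustrative assumption; dvc add prints the exact git add command to run):

$ dvc add data/external/town_boundaries.zip
$ git add data/external/town_boundaries.zip.dvc data/external/.gitignore
$ git commit -m "Track external town boundaries data with DVC"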

Tip

If a file is generated as a stage output of dvc.yaml, you do not need to run dvc add to track changes. dvc repro does this for you.

Review results metrics and their changes: dvc metrics show|diff

dvc metrics provides a set of commands to display and compare metrics: show and diff.

In order to follow the performance of machine learning experiments, DVC has the ability to mark stage outputs or other files as metrics. These metrics are project-specific floating-point or integer values e.g. AUC, ROC, false positives, etc.

In pipelines, metrics files are typically generated by user data processing code, and are tracked using the -m (--metrics) and -M (--metrics-no-cache) options of dvc stage add.
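
As a sketch, a hypothetical evaluation stage that produces an uncached metrics file might be added like this:

$ dvc stage add --name evaluate \
                --deps src/evaluate.py \
                --deps data/predict.dat \
                -M metrics.json \
                python src/evaluate.py data/predict.dat metrics.json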

Compare results of two different pipeline runs or experiments

It’s a good idea to use git tags to identify project revisions that you will want to share with others and/or reference in discussions.

$ git tag -a my-great-experiment [revision]

creates an annotated tag at the given revision. If revision is left out, HEAD is used.

You can compare pipeline metrics across any two git revisions with

$ dvc metrics diff [rev1] [rev2]

where rev1 and rev2 are any git commit hash, tag, or branch name.
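
For example, assuming the tag created above and a main branch:

$ dvc metrics diff main my-great-experiment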

DVC internal directories and files

You shouldn’t need to muck around with DVC’s internals to use the tool successfully, but having a high-level understanding of how the tool works can increase your confidence and help you solve problems that inevitably arise. See the DVC documentation for more detail.

DVC creates a hidden directory in your project at .dvc/ relative to your project root folder, which contains the directories and files needed for DVC operation. The cache structure is similar to the structure of a .git/ folder in a git repository, if that’s something you’re familiar with.

  • .dvc/config: This is the default DVC configuration file. It can be edited by hand or with the dvc config command.

  • .dvc/config.local: This is an optional Git-ignored configuration file that overrides options in .dvc/config. This is useful when you need to specify sensitive values (secrets) which should not reach the Git repo (credentials, private locations, etc.). This config file can also be edited by hand or with dvc config --local; see the sketch after this list.

  • .dvc/cache: Default location of the cache directory. By default, the data files and directories in the workspace will only contain links to the data files in the cache. See dvc config cache for related configuration options, including changing its location.

Important

Note that DVC includes the cache directory in .gitignore during initialization. No data tracked by DVC should ever be pushed to the Git repository, only the DVC files (*.dvc or dvc.lock) that are needed to locate or reproduce that data.

  • .dvc/cache/runs: Default location of the run cache.

  • .dvc/plots: Directory for plot templates

  • .dvc/tmp: Directory for miscellaneous temporary files

  • .dvc/tmp/updater: This file is used to store the latest available version of DVC. It’s used to remind the user to upgrade when the installed version is behind.

  • .dvc/tmp/updater.lock: Lock file for .dvc/tmp/updater

  • .dvc/tmp/lock: Lock file for the entire DVC project

  • .dvc/tmp/rwlock: JSON file that contains read and write locks for specific dependencies and outputs, to allow safely running multiple DVC commands in parallel

  • .dvc/tmp/exps: This directory will contain workspace copies used for temporary or queued experiments.
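
As referenced in the .dvc/config.local bullet above, here is a sketch of storing a credential locally so it never reaches the Git repo (the remote name and option value are illustrative; dvc remote modify --local writes to .dvc/config.local):

$ dvc remote modify --local myremote access_key_id 'my-secret-key'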

DVC pre-commit hooks

make initialize installs several DVC pre-commit hooks to simplify your DVC+git workflow.

  • The post-checkout hook executes dvc checkout after git checkout to automatically update the workspace with the correct data file versions.

  • The pre-commit hook executes dvc status before git commit to inform the user about the differences between cache and workspace.

  • The pre-push hook executes dvc push before git push to upload files and directories tracked by DVC to the default DVC remote.
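
These are the same hooks that DVC installs via its own dvc install command; if you ever need to reinstall them manually (an assumption about what make initialize wraps), you can run:

$ dvc install    # installs the post-checkout, pre-commit, and pre-push git hooks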

Poetry How To…

Add software dependencies to the project

The poetry add command adds required packages to your pyproject.toml and installs them. This updates pyproject.toml and poetry.lock, which should both be committed to the git repository.

If you do not specify a version constraint, poetry will choose a suitable one based on the available package versions.

$ poetry add requests pendulum

# Allow >=2.0.5, <3.0.0 versions
$ poetry add pendulum@^2.0.5

# Allow >=2.0.5, <2.1.0 versions
$ poetry add pendulum@~2.0.5

# Allow >=2.0.5 versions, without upper bound
$ poetry add "pendulum>=2.0.5"

# Allow only 2.0.5 version
$ poetry add pendulum==2.0.5

Documentation How To…

Build the docs

Just run

$ make docs

from the project root and the documentation website will be generated in docs/_build/html.

Use advanced markup supported by MyST Parser

This project uses Sphinx combined with the MyST Parser plugin to enable you to write documentation in Markdown while retaining many of the powerful features of the reStructuredText (rst) format.

Check out the MyST plugin documentation for an overview of everything that’s possible.

Admonitions

You can create admonitions:

:::{tip}
Let's give readers a helpful hint!
:::

produces

Tip

Let’s give readers a helpful hint!

Equations

You can add equations in LaTeX:

:::{math}
:label: mymath
(a + b)^2 &= a^2 + 2ab + b^2 \\
          &= (a + b)(a + b)
:::

produces

(1)   (a + b)^2 = a^2 + 2ab + b^2
                = (a + b)(a + b)

You can reference them too!

Equation {eq}`mymath` is a quadratic equation.

produces

Equation (1) is a quadratic equation.

Footnotes

You can also make footnotes:

Look at this footnote [^example-footnote]

[^example-footnote]: This is a footnote.

produces

Look at this footnote [3]

Serve the documentation website

To build and share the documentation website:

  • make docs builds the project documentation to docs/_build/html.

  • make start-doc-server runs a local HTTP server for the documentation website.

  • make stop-doc-server stops a running http server.