Getting started
This section will enable you to reproduce the project’s analyses and data from a fresh clone of the repository.
Software Prerequisites
This project uses Poetry to recreate an identical analytical software environment on each developer’s machine. To bootstrap this environment, you will need an existing installation of:
Python version 3.10
Poetry v1.5.1+
For instructions to install these prerequisites for the first time, consult First Time Setup.
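A quick way to confirm both prerequisites before bootstrapping is sketched below. The helper names version_ge, extract_version, and check_prereqs are hypothetical and not part of the repository. The sketch assumes that python3 is the interpreter on your PATH and that your sort supports the -V (version sort) option.

```shell
# Hypothetical helpers for checking prerequisites; not part of the repository.

# Succeeds when version $1 is at least version $2 (needs `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints the first dotted version number found in a tool's --version output.
extract_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Reports whether Python 3.10 and Poetry 1.5.1+ are available.
check_prereqs() {
    py=$(extract_version "$(python3 --version 2>&1)")
    po=$(extract_version "$(poetry --version 2>&1)")
    case "$py" in
        3.10|3.10.*) echo "Python $py: ok" ;;
        *) echo "Python 3.10 required, found: ${py:-none}"; return 1 ;;
    esac
    if [ -n "$po" ] && version_ge "$po" "1.5.1"; then
        echo "Poetry $po: ok"
    else
        echo "Poetry 1.5.1+ required, found: ${po:-none}"; return 1
    fi
}
```

After pasting these definitions into a shell session, running check_prereqs prints one line per tool and returns a non-zero status if either requirement is not met.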
Environment set up
Begin by cloning the model development repository
$ git clone git@github.com:ahasha/milton_maps.git
and change to the root directory of the project. The command
$ make initialize
will install the project’s software environment using Poetry and install the project’s git hooks, which enforce consistent code style, run linting, and trigger DVC actions that keep data and code versions synchronized.
Then run
$ poetry shell
to enter the virtual environment associated with the project. Typing exit will leave the Poetry shell, analogous to deactivate for a standard virtual environment.
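If you would rather not open an interactive shell, poetry run executes a single command inside the same environment. A minimal sketch follows; the wrapper name in_project_env is hypothetical, and it assumes you call it from inside the project directory.

```shell
# Hypothetical wrapper: run one command in the project's Poetry environment
# without entering an interactive shell.
in_project_env() {
    poetry run "$@"
}
```

For example, in_project_env python --version should report the interpreter that Poetry selected for the project.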
Get project data
This project uses DVC to store and track versioned data. Similar to how git works, the DVC cache is a hidden storage folder (by default .dvc/cache) containing all versions of all files and directories tracked by DVC. Its content-addressable structure lets DVC load into the workspace only the version of the tracked data that corresponds to the current state of the code in the git repository.
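To make the content-addressable idea concrete, the sketch below derives a storage path from a file’s MD5 digest, so identical contents always map to the same location. It illustrates the principle only and is not DVC’s implementation: the function name cas_path is hypothetical, the exact directory layout inside .dvc/cache differs between DVC versions, and md5sum is assumed to be available (macOS ships md5 instead).

```shell
# Illustration of content addressing; not DVC's actual implementation.
# Prints a path derived purely from the contents of the file given as $1.
cas_path() {
    digest=$(md5sum "$1" | cut -d ' ' -f 1)
    prefix=$(printf '%s' "$digest" | cut -c 1-2)
    rest=$(printf '%s' "$digest" | cut -c 3-)
    # First two hex characters become a directory, the remainder the file name.
    printf '%s/%s\n' "$prefix" "$rest"
}
```

Two files with the same bytes produce the same path regardless of their names, which is what lets a cache of this kind store each distinct version exactly once.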
A shared DVC cache (analogous to a shared git remote on github.com) is located in Google Cloud Storage at gs://hasha-ds-portfolio-projects/milton_maps/.
As long as you can access this cloud bucket from your current working environment, you can populate your local cache (analogous to running git clone to get the latest version of a codebase) by running
$ dvc pull
from any directory inside the project.
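Before pulling, it can help to confirm that DVC is on your PATH and to see which remote is configured. A minimal sketch, assuming the project’s environment is active; the function name fetch_project_data is hypothetical.

```shell
# Hypothetical helper: check for DVC, show the configured remotes, then pull.
fetch_project_data() {
    if ! command -v dvc >/dev/null 2>&1; then
        echo "dvc not found; activate the project environment first" >&2
        return 1
    fi
    dvc remote list   # show configured remotes and their URLs
    dvc pull          # download tracked data into the cache and workspace
}
```

If the pull fails with a permissions error, the likely cause is missing access to the cloud bucket rather than a problem with the repository itself.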
Reproduce the project results
Once the initial setup steps are complete, you can run the analysis pipeline to create the notebook data inputs with the dvc repro command.
$ dvc repro
This command generates pipeline results by executing the sequence of stages defined in dvc.yaml to calculate all outputs referenced in dvc.lock. DVC checks each stage to determine whether it needs to run: if the stage’s output checksum is referenced in dvc.lock and the output already exists in the cache or the workspace, the stage is skipped.
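To see what a reproduction would do before running it, DVC offers several read-only commands. A short sketch follows; the wrapper name preview_pipeline is hypothetical.

```shell
# Hypothetical helper: inspect the pipeline without executing any stage.
preview_pipeline() {
    dvc dag            # print the stage dependency graph from dvc.yaml
    dvc status         # list stages whose dependencies or outputs changed
    dvc repro --dry    # show the commands that would run, without running them
}
```

If the dry run lists no commands, every stage is already up to date and a plain dvc repro should skip them all.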
If you wish to recalculate pipeline outputs to verify that they match reported results, run
$ dvc repro -f --no-commit
The -f flag tells DVC to recalculate all pipeline stages, even if there are no changes to their input dependencies. The --no-commit flag tells DVC not to store the outputs of this execution in the cache. If the pipeline has large outputs, a single byte difference will cause an output’s checksum to change, and DVC will add the new copy permanently to the cache. With many binary data formats, you can get a different file checksum even if the data contents are functionally identical. Using --no-commit lets you verify that the results have been appropriately reproduced without bloating the cache with functionally identical data files. If you later want to add the results to the cache, you can do so by running dvc commit.
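One way to check a forced re-run against the recorded results is to compare dvc.lock before and after. This is a minimal sketch, assuming dvc.lock is still rewritten when --no-commit is used (worth confirming for your DVC version); the function name verify_repro and the backup file name dvc.lock.before are hypothetical.

```shell
# Hypothetical helper: force a full re-run and report whether dvc.lock changed.
# Run from the project root, where dvc.lock lives.
verify_repro() {
    cp dvc.lock dvc.lock.before || return 1
    dvc repro -f --no-commit || return 1
    if cmp -s dvc.lock.before dvc.lock; then
        echo "dvc.lock unchanged: outputs match the recorded checksums"
        rm -f dvc.lock.before
    else
        echo "dvc.lock changed: compare with 'diff dvc.lock.before dvc.lock'"
        return 1
    fi
}
```

A changed checksum does not by itself mean the results differ, since some binary formats produce different bytes for functionally identical data, so treat a mismatch as a prompt to inspect the affected outputs.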