1.4. Git

What is Git?

Git is a distributed version control system (VCS) that tracks changes to source code and lets multiple developers and data scientists work on the same project in parallel. It is known for its data integrity, performance, and support for complex, non-linear workflows, which makes it a cornerstone of modern software development and MLOps.

[Figure: Git comic from XKCD]

Why is Git essential for MLOps?

In MLOps, Git is more than just a code repository; it’s a foundational tool for ensuring reproducibility, collaboration, and governance.

  • Reproducibility: Git lets you version not only your code but also your configurations, model parameters, and experiment definitions. By tagging specific commits, you can return to the exact code and configuration behind any experiment or model version.
  • Collaboration: MLOps projects involve diverse teams (data scientists, ML engineers, developers). Git provides a structured environment for collaboration, allowing team members to work on different features or experiments in parallel using branches, and then merge their work systematically.
  • Traceability and Auditing: Git maintains a complete history of every change, including who made it and why. This is critical for debugging, understanding the evolution of a model, and meeting regulatory compliance requirements.
  • Automation (GitOps): Git repositories can serve as the "source of truth" for CI/CD pipelines. A git push can automatically trigger processes for testing, validating, training, and deploying a model, embedding automation at the core of your workflow.
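The reproducibility point above can be sketched with an annotated tag. This is a minimal illustration, not a prescribed convention: the file name config.yml, the tag name exp-baseline-v1, and the demo identity are all hypothetical, and the commands run in a throwaway directory.

```shell
# Work in a temporary directory so no existing repository is touched.
cd "$(mktemp -d)"
git init -q

# Repository-local identity for this demo only (placeholder values).
git config user.name "Demo User"
git config user.email "demo@example.com"

# Commit a hypothetical training configuration.
printf 'learning_rate: 0.01\n' > config.yml
git add config.yml
git commit -q -m "Add baseline training config"

# Attach an annotated tag to the commit behind the experiment.
git tag -a exp-baseline-v1 -m "Baseline run, learning_rate=0.01"

# Later: list tags and resolve the tag back to its exact commit.
git tag --list
git rev-parse --short "exp-baseline-v1^{commit}"
```

Checking out the tag later (git checkout exp-baseline-v1) restores the code and configuration as they were at that commit.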

How do you install Git?

Git is available for all major operating systems. The official Git Installation Guide provides detailed, platform-specific instructions.

# Example: Install on macOS with Homebrew
brew install git

# Example: Install on Debian/Ubuntu with apt
sudo apt-get update
sudo apt-get install git

# After installation, verify it
git --version

How should you use Git in your project?

A typical Git workflow involves a few key commands. For a comprehensive start, refer to GitHub's Git Tutorial.

  1. Initialize a Repository: To start tracking a new project, navigate to your project directory and run git init. This creates a new local repository.
  2. Stage Files: Use git add <file> to select which changes you want to include in your next commit. You can add specific files or use git add . to stage all changes in the current directory.
  3. Check Status: Run git status frequently. It shows which files are staged, modified but not staged, and which are untracked. This helps you stay aware of your project's state.
  4. Commit Changes: A commit is a snapshot of your staged changes. Use git commit -m "Your descriptive message" to save your changes to the repository's history. A clear message explains the "why" behind a change, which is invaluable for your future self and your team.
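The four steps above can be sketched end to end as follows. The file train.py, the commit message, and the demo identity are hypothetical placeholders, and the commands run in a temporary directory.

```shell
# Work in a temporary directory so no existing repository is touched.
cd "$(mktemp -d)"

# 1. Initialize a repository.
git init -q
git config user.name "Demo User"          # demo identity, local to this repo
git config user.email "demo@example.com"

# Create a hypothetical file to track.
printf 'print("training...")\n' > train.py

# 3. Check status: train.py shows up as untracked (??).
git status --short

# 2. Stage the file, then check again: it is now staged (A).
git add train.py
git status --short

# 4. Commit the staged snapshot with a descriptive message.
git commit -q -m "Add training script skeleton"

# Inspect the resulting history.
git log --oneline
```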

What should you commit to your repository?

A clean repository is an efficient one. Not every file belongs in Git.

  • DO Commit: Source code (.py, .R), configuration files (.yml, .toml), documentation (.md), and scripts.
  • DO NOT Commit:
    • Secrets: Never commit sensitive data like API keys, passwords, or database credentials. Use environment variables or a secrets management tool instead.
    • Large Files: Datasets, model checkpoints, and other large binary files should be handled by Git Large File Storage (LFS) rather than committed directly (GitHub, for example, rejects files larger than 100 MB). Git LFS stores small pointer files in the repository while keeping the large files in separate storage, which prevents your repository from becoming bloated and slow.
    • Temporary Files: Caches, logs, build artifacts, and environment-specific files (like .venv/ or mlruns/) do not belong in the repository.

To enforce these rules, use a .gitignore file in your project's root directory. This file tells Git which files and directories to ignore.

# .gitignore for a typical MLOps project
# For more examples, see https://git-scm.com/docs/gitignore

# Environments & Dependencies
.env
.venv/
/env/
/venv/
/node_modules/

# Caches & Logs
.cache/
.coverage*
.mypy_cache/
.pytest_cache/
.ruff_cache/
__pycache__/
*.py[cod]
*.log

# Build & Distribution
/build/
/dist/
/site/

# Editor & OS-specific
.idea/
.vscode/
.DS_Store
.ipynb_checkpoints/

# MLOps Artifacts & Outputs
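# Ignore directory contents (/* rather than /) so .gitkeep can be re-included
/data/*
/datasets/*
/mlruns/*
/outputs/*
/models/*
!**/.gitkeep

The !**/.gitkeep entry is a common convention for tracking otherwise empty directories. Git does not track empty directories, so placing a .gitkeep file inside one allows the directory structure itself to be committed. The artifact directories are written as /data/* rather than /data/ because Git cannot re-include a file whose parent directory is itself excluded.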
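One way to confirm that ignore rules behave as intended is git check-ignore. The sketch below uses a reduced, hypothetical .gitignore and made-up file names in a temporary directory. It ignores the contents of data/ with /data/* so that the negated .gitkeep rule can still apply.

```shell
# Work in a temporary directory so no existing repository is touched.
cd "$(mktemp -d)"
git init -q

# Reduced, illustrative .gitignore: ignore everything inside data/
# except .gitkeep, plus all log files.
printf '%s\n' '/data/*' '!**/.gitkeep' '*.log' > .gitignore

# Hypothetical project files.
mkdir data
touch data/.gitkeep data/train.csv run.log

# For each ignored path, -v reports the file, line, and pattern that matched.
git check-ignore -v data/train.csv run.log

# Untracked files Git would still pick up: .gitignore and data/.gitkeep.
git status --short --untracked-files=all
```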

What is a good branching strategy for ML projects?

A branching strategy keeps your repository organized and your main branch stable. A simple and effective model is Feature Branching:

  1. main Branch: This branch represents the production-ready state of your project. All code here should be tested, validated, and deployable. Direct commits to main are typically forbidden.
  2. Feature Branches: For any new work, whether a new feature, a bug fix, or an ML experiment, create a new branch from main (e.g., git checkout -b experiment-with-new-optimizer, or git switch -c experiment-with-new-optimizer in Git 2.23 and later).
  3. Pull Requests (PRs): Once your work on the feature branch is complete, you open a Pull Request to merge it into main. This is a formal request for review: teammates can provide feedback, and automated checks (like tests and linting) run before the merge occurs.

This strategy isolates work, keeps merge conflicts small and visible at review time, and ensures that the main branch remains a reliable source of truth.
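The local half of this workflow might look like the sketch below. The file names, commit messages, optimizer value, and demo identity are hypothetical. The final push is commented out because it assumes a remote named origin, which this throwaway repository does not have.

```shell
# Work in a temporary directory so no existing repository is touched.
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"          # demo identity, local to this repo
git config user.email "demo@example.com"

# Seed main with an initial commit.
printf '# demo project\n' > README.md
git add README.md
git commit -q -m "Initial commit"
git branch -M main                        # make sure the branch is named main

# Start a feature branch from main.
git checkout -q -b experiment-with-new-optimizer

# Do the experimental work on the branch.
printf 'optimizer: adamw\n' > config.yml
git add config.yml
git commit -q -m "Try AdamW optimizer"

# main is untouched; the branch is one commit ahead of it.
git log --oneline main
git log --oneline main..experiment-with-new-optimizer

# With a hosting remote configured as "origin" (an assumption, not set up here),
# publish the branch and then open a Pull Request in the hosting service's UI:
# git push -u origin experiment-with-new-optimizer
```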

Additional Resources