1.4. Git
What is Git?
Git is a distributed version control system (VCS) essential for managing projects of any scale. It excels at tracking changes in source code, enabling multiple developers and data scientists to collaborate seamlessly. Git is renowned for its data integrity, performance, and support for complex, non-linear workflows, making it a cornerstone of modern software development and MLOps.

Why is Git essential for MLOps?
In MLOps, Git is more than just a code repository; it’s a foundational tool for ensuring reproducibility, collaboration, and governance.
- Reproducibility: Git allows you to version control not only your code but also your configurations, model parameters, and experiment definitions. By tagging specific commits, you can return to the exact code and configuration behind any experiment or model version.
- Collaboration: MLOps projects involve diverse teams (data scientists, ML engineers, developers). Git provides a structured environment for collaboration, allowing team members to work on different features or experiments in parallel using branches, and then merge their work systematically.
- Traceability and Auditing: Git maintains a complete history of every change, including who made it and why. This is critical for debugging, understanding the evolution of a model, and meeting regulatory compliance requirements.
- Automation (GitOps): Git repositories can serve as the "source of truth" for CI/CD pipelines. A `git push` can automatically trigger processes for testing, validating, training, and deploying a model, embedding automation at the core of your workflow.
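The reproducibility point above can be sketched with an annotated tag. This is a minimal illustration rather than a prescribed workflow: the file `params.yml`, the tag name `exp-baseline`, and the user identity are all hypothetical, and the script runs in a throwaway directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a temporary repository so nothing touches a real project.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"          # hypothetical identity, local to this repo
git config user.email "user@example.com"

# Version a (hypothetical) experiment configuration alongside the code.
echo "learning_rate: 0.01" > params.yml
git add params.yml
git commit -q -m "Add baseline training parameters"

# Mark the exact commit behind the experiment with an annotated tag.
git tag -a exp-baseline -m "Baseline experiment"

# Work continues: the parameters change in a later commit.
echo "learning_rate: 0.001" > params.yml
git commit -q -am "Lower learning rate"

# Read the tagged version of the file without switching branches.
tagged_params="$(git show exp-baseline:params.yml)"
echo "$tagged_params"
```

The tag pins the code and configuration only; data and environment would need their own versioning to reproduce a run end to end.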
How do you install Git?
Git is available for all major operating systems. The official Git Installation Guide provides detailed, platform-specific instructions.
# Example: Install on macOS with Homebrew
brew install git
# Example: Install on Debian/Ubuntu with apt
sudo apt-get update
sudo apt-get install git
# After installation, verify it
git --version
How should you use Git in your project?
A typical Git workflow involves a few key commands. For a comprehensive start, refer to GitHub's Git Tutorial.
- Initialize a Repository: To start tracking a new project, navigate to your project directory and run `git init`. This creates a new local repository.
- Stage Files: Use `git add <file>` to select which changes you want to include in your next commit. You can add specific files or use `git add .` to stage all changes in the current directory.
- Check Status: Run `git status` frequently. It shows which files are staged, modified but not staged, and which are untracked. This helps you stay aware of your project's state.
- Commit Changes: A commit is a snapshot of your staged changes. Use `git commit -m "Your descriptive message"` to save your changes to the repository's history. A clear message explains the "why" behind a change, which is invaluable for your future self and your team.
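Put together, the four steps might look like the following sketch. It runs in a temporary directory, and the file `train.py` and the user identity are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Use a temporary directory as the project folder.
repo="$(mktemp -d)"
cd "$repo"

# 1. Initialize a repository.
git init -q
git config user.name "Example User"          # hypothetical identity, local to this repo
git config user.email "user@example.com"

# Create a (hypothetical) file to track.
echo 'print("training...")' > train.py

# 2. Check status: the new file is untracked ("?? train.py").
git status --short

# 3. Stage the file, then check again: it is now staged ("A  train.py").
git add train.py
git status --short

# 4. Commit the staged snapshot with a descriptive message.
git commit -q -m "Add training script skeleton"

# The working tree is clean and the history holds one commit.
commit_count="$(git rev-list --count HEAD)"
echo "commits: $commit_count"
```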
What should you commit to your repository?
A clean repository is an efficient one. Not every file belongs in Git.
- DO Commit: Source code (`.py`, `.R`), configuration files (`.yml`, `.toml`), documentation (`.md`), and scripts.
- DO NOT Commit:
  - Secrets: Never commit sensitive data like API keys, passwords, or database credentials. Use environment variables or a secrets management tool instead.
  - Large Files: Datasets, model checkpoints, and other large binary files (>100MB) should be handled by Git Large File Storage (LFS). Git LFS stores pointers in the repository while keeping the large files in separate storage, preventing your repository from becoming bloated and slow.
  - Temporary Files: Caches, logs, build artifacts, and environment-specific files (like `.venv/` or `mlruns/`) do not belong in the repository.
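As a rough sketch of the Git LFS point above, the snippet below registers a file pattern with LFS in a throwaway repository. It assumes the separate `git-lfs` extension is installed and skips the step otherwise; the `*.pt` pattern (PyTorch checkpoints) is only an illustrative choice.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a temporary repository.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Git LFS is a separate extension, so check for it before using it.
if git lfs version >/dev/null 2>&1; then
    # Enable the LFS filters for this repository only.
    git lfs install --local >/dev/null
    # Route files matching the (illustrative) pattern through LFS.
    git lfs track "*.pt" >/dev/null
    # The rule is recorded in .gitattributes, which should itself be committed.
    cat .gitattributes
    lfs_status="tracked"
else
    echo "git-lfs is not installed; skipping the LFS setup"
    lfs_status="skipped"
fi
```

Files matching the pattern that are added after this point are stored as small pointer files in the repository, with the content kept in LFS storage.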
To enforce these rules, use a `.gitignore` file in your project's root directory. This file tells Git which files and directories to ignore.
# .gitignore for a typical MLOps project
# For more examples, see https://git-scm.com/docs/gitignore
# Environments & Dependencies
.env
.venv/
/env/
/venv/
/node_modules/
# Caches & Logs
.cache/
.coverage*
.mypy_cache/
.pytest_cache/
.ruff_cache/
__pycache__/
*.py[cod]
*.log
# Build & Distribution
/build/
/dist/
/site/
# Editor & OS-specific
.idea/
.vscode/
.DS_Store
.ipynb_checkpoints/
# MLOps Artifacts & Outputs
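/data/*
/datasets/*
/mlruns/*
/outputs/*
/models/*
!**/.gitkeep
The `!**/.gitkeep` entry is a common convention for tracking otherwise empty directories. Git does not track empty directories, so placing a `.gitkeep` file inside one allows the directory structure itself to be committed. The artifact directories above are written as `/data/*` rather than `/data/` because Git cannot re-include a file when its parent directory is itself excluded; ignoring the directory's contents instead lets the negated pattern take effect.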
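One way to check that ignore rules behave as intended is `git check-ignore`, which reports whether a path is ignored. The sketch below uses a trimmed-down `.gitignore` and hypothetical file names in a temporary repository. It assumes the directory rule is written as `/data/*`, since a negated pattern cannot rescue a file whose parent directory is excluded outright.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a temporary repository.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# A trimmed-down .gitignore: ignore logs and the contents of data/,
# but keep .gitkeep placeholder files.
cat > .gitignore <<'EOF'
*.log
/data/*
!**/.gitkeep
EOF

# Hypothetical files to test the rules against.
mkdir -p data
touch data/.gitkeep data/train.csv run.log

# check-ignore -q exits 0 when the path is ignored, non-zero otherwise.
check() {
    if git check-ignore -q "$1"; then
        echo "$1: ignored"
    else
        echo "$1: not ignored"
    fi
}

check run.log
check data/train.csv
check data/.gitkeep
```

Running `git check-ignore -v <path>` instead prints the file, line, and pattern responsible for the decision, which helps when a rule does not match as expected.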
What is a good branching strategy for ML projects?
A branching strategy keeps your repository organized and your main branch stable. A simple and effective model is Feature Branching:
- `main` Branch: This branch represents the production-ready state of your project. All code here should be tested, validated, and deployable. Direct commits to `main` are typically forbidden.
- Feature Branches: For any new work, whether it's a new feature, a bug fix, or an ML experiment, create a new branch from `main` (e.g., `git checkout -b experiment-with-new-optimizer`).
- Pull Requests (PRs): Once your work on the feature branch is complete, you open a Pull Request to merge it into `main`. This is a formal request for review, allowing teammates to provide feedback and automated checks (like tests and linting) to run before the merge occurs.
This strategy isolates work, reduces the risk of conflicting changes, and helps keep the `main` branch a reliable source of truth.
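The branch lifecycle can be sketched locally as follows. Pull Requests live on a hosting platform, so a local `git merge --no-ff` stands in for the reviewed merge here; the file `config.yml` and the user identity are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a temporary repository.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"          # hypothetical identity, local to this repo
git config user.email "user@example.com"

# Baseline commit on the default branch.
echo "optimizer: sgd" > config.yml
git add config.yml
git commit -q -m "Add baseline config"
git branch -M main                           # make sure the branch is named main

# Start the experiment on its own branch, created from main.
git checkout -q -b experiment-with-new-optimizer
echo "optimizer: adam" > config.yml
git commit -q -am "Try Adam optimizer"

# Stand-in for a reviewed Pull Request: merge the branch back into main,
# keeping an explicit merge commit in the history.
git checkout -q main
git merge -q --no-ff -m "Merge experiment-with-new-optimizer" experiment-with-new-optimizer

merged_value="$(cat config.yml)"
echo "$merged_value"
git log --oneline --graph
```

Until the merge, `main` still holds the baseline configuration, so the experiment can be abandoned by deleting the branch without touching the stable history.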