
4.2. Testing

What are software tests?

Software tests are automated procedures designed to verify that a piece of software behaves as expected. They are fundamental to ensuring reliability, functionality, and preventing unintended changes, known as regressions.

Tests are typically categorized by their scope:

  • Unit Tests: Focus on the smallest testable parts of an application, such as individual functions or methods, in isolation. For example, a unit test might verify that a data normalization function correctly scales a feature to a [0, 1] range (a minimal sketch of such a test follows this list).
  • Regression Tests: Ensure that recent code changes have not adversely affected existing features. These tests are run frequently to catch bugs introduced during development.
  • End-to-End (E2E) Tests: Simulate a complete user workflow from start to finish. In an MLOps context, an E2E test could involve running an entire prediction pipeline, from raw data input to model prediction output, to ensure all components integrate correctly.
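
As a concrete illustration of the unit test bullet above, here is a minimal sketch that checks a hypothetical normalize function (not part of any project code shown here) against the [0, 1] expectation:

import numpy as np

def normalize(values: np.ndarray) -> np.ndarray:
    """Scale values to the [0, 1] range (hypothetical project function)."""
    return (values - values.min()) / (values.max() - values.min())

def test_normalize_scales_to_unit_range():
    # Given: A feature column with arbitrary values
    values = np.array([2.0, 5.0, 11.0])

    # When: The feature is normalized
    result = normalize(values)

    # Then: All values fall within [0, 1], with the extremes mapped to the bounds
    assert result.min() == 0.0
    assert result.max() == 1.0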

Why is testing critical in MLOps projects?

Testing is indispensable in any professional software project, but it carries unique importance in MLOps:

  • Quality Assurance: Confirms that data pipelines, models, and APIs meet their specified requirements.
  • Regression Prevention: In MLOps, regressions can be subtle, such as a slight drop in model accuracy. A robust test suite helps catch these issues before they impact users.
  • Confidence in Refactoring: Allows developers to improve and optimize code with confidence, knowing that a suite of tests will validate the continued correctness of their logic.
  • Living Documentation: Tests serve as executable examples of how the code is intended to be used, which is often clearer than static documentation.

While print() statements are useful for immediate debugging, they provide no lasting guarantees. Automated tests, on the other hand, continuously validate behavior with every code change. This is especially vital in a dynamically typed language like Python, where many errors only surface at runtime rather than being caught by a compiler.

Which tool should you use for Python testing?

While Python has a built-in unittest module, its syntax can be verbose. We strongly recommend pytest, a modern, powerful framework that simplifies test creation and scales from simple functions to complex applications.

A pytest test is just a function with an assert statement:

# content of tests/test_sample.py
def inc(x: int) -> int:
    return x + 1

def test_answer():
    # Assert that the function's output matches the expected behavior
    assert inc(3) == 4

To execute pytest across your project:

# Install pytest (one-time)
uv add --dev pytest

# Run pytest on the `tests` directory
uv run pytest tests/
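
During development, it is often useful to run only a subset of the suite; pytest supports selecting tests by file or by keyword expression (the file name below assumes the test layout shown later in this section):

# Run a single test file
uv run pytest tests/test_models.py

# Run only tests whose names match a keyword expression
uv run pytest -k "schema" tests/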

You can extend pytest's functionality with plugins:

  • pytest-cov: Generates code coverage reports to identify which parts of your codebase are not being tested.
# Generate a coverage report for the `src` directory
uv run pytest --cov=src/ tests/
  • pytest-xdist: Executes tests in parallel, significantly reducing runtime by utilizing all available CPU cores.
# Run tests in parallel using all available cores
uv run pytest -n auto tests/
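
Both plugins are separate packages and must be added as development dependencies before they can be used, for example:

# Install the plugins alongside pytest (one-time)
uv add --dev pytest-cov pytest-xdist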

How should you configure your project for testing?

First, prevent pytest cache files from being committed to Git by adding .pytest_cache/ to your .gitignore file.
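
For instance, the relevant .gitignore entries could look like this (the .coverage file and htmlcov/ folder only appear if you generate coverage reports with pytest-cov):

# .gitignore
.pytest_cache/
.coverage
htmlcov/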

Next, enable pytest support in VS Code by adding the following to your .code-workspace file:

{
    "settings": {
        "python.testing.pytestEnabled": true,
        "python.testing.pytestArgs": [
            "tests"
        ]
    }
}

Finally, define global pytest configurations in your pyproject.toml file to ensure consistency:

[tool.pytest.ini_options]
# Run tests with high verbosity and measure coverage of the `src` directory
addopts = "--verbosity=2 --cov=src"
# Add the `src` directory to the Python path for imports
pythonpath = ["src"]

[tool.coverage.run]
# Measure branch coverage to check if `if/else` statements are tested
branch = true
source = ["src"]
omit = ["**/__main__.py"] # Exclude non-testable files

How should you structure your tests?

Organize tests in a dedicated tests/ directory that mirrors your source code structure. For a module like src/bikes/models.py, the corresponding test file should be tests/test_models.py. Test function names must be prefixed with test_.

src/
    bikes/
        models.py
        metrics.py
        datasets.py
tests/
    test_models.py
    test_metrics.py
    test_datasets.py

This structure keeps your test suite organized and easy to navigate. A popular pattern for structuring individual tests is Given-When-Then, which clearly separates setup, execution, and validation.

# `datasets` and `schemas` are project modules (e.g., under src/bikes/),
# and `inputs_reader` is a fixture provided elsewhere (e.g., in tests/conftest.py)
def test_inputs_schema_is_valid(inputs_reader: datasets.Reader) -> None:
    # Given: A predefined data schema and an inputs reader
    schema = schemas.InputsSchema

    # When: The input data is read
    data = inputs_reader.read()

    # Then: The data conforms to the schema
    assert schema.check(data) is not None, "Input data validation failed!"

How can you define reusable test components?

pytest fixtures are functions that provide a fixed baseline for tests to build upon. They are ideal for setting up reusable objects like datasets, models, or temporary file paths, eliminating redundant code.

Fixtures can be defined in individual test files or in a central tests/conftest.py file to be shared across the entire test suite.

# in tests/conftest.py
import os
from pathlib import Path

import pytest

@pytest.fixture(scope="session")
def tests_path() -> str:
    """Return the path of the tests folder."""
    file_path = os.path.abspath(__file__)
    parent_directory = os.path.dirname(file_path)
    return parent_directory

@pytest.fixture(scope="function")
def tmp_outputs_path(tmp_path: Path) -> str:
    """Return a tmp path for the outputs dataset."""
    # `tmp_path` is pytest's built-in fixture providing a per-test temporary directory
    return os.path.join(tmp_path, "outputs.parquet")

The scope parameter controls the fixture's lifecycle. A session-scoped fixture is created once for the entire test run, while a function-scoped fixture is recreated for every test.
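
To use these fixtures, declare them as parameters of a test function and pytest resolves them by name. A minimal sketch, assuming the conftest.py above:

import os

def test_paths_are_available(tests_path: str, tmp_outputs_path: str) -> None:
    # Given/When: pytest injects the fixtures defined in conftest.py
    # Then: The paths point to the expected locations
    assert os.path.isdir(tests_path), "The tests folder should exist"
    assert tmp_outputs_path.endswith("outputs.parquet"), "Unexpected output file name"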

How can you avoid repetitive test scenarios?

To test a function against multiple input scenarios without writing duplicate code, use the @pytest.mark.parametrize decorator. This feature allows you to run the same test function with different argument sets.

import pytest

@pytest.mark.parametrize(
    "name, interval, greater_is_better",
    [
        ("mean_squared_error", [0, float("inf")], False),
        ("mean_absolute_error", [0, float("inf")], False),
        ("r2_score", [float("-inf"), 1.0], True),
    ],
)
def test_sklearn_metric_properties(
    name: str, interval: list, greater_is_better: bool
) -> None:
    # This test will run three times, once for each set of parameters.
    assert isinstance(name, str)
    assert isinstance(interval, list)
    assert isinstance(greater_is_better, bool)
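
Parametrization also makes it easy to cover expected outputs and edge cases in one place. Here is a sketch that reuses the inc function from the earlier example:

import pytest

def inc(x: int) -> int:
    return x + 1

@pytest.mark.parametrize(
    "value, expected",
    [
        (3, 4),   # nominal case
        (0, 1),   # zero input
        (-1, 0),  # negative input
    ],
)
def test_inc_returns_expected(value: int, expected: int) -> None:
    assert inc(value) == expected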

How do you validate program outputs and exceptions?

pytest provides built-in fixtures for capturing output and testing for exceptions.

  • capsys: Captures anything written to standard output (stdout) and standard error (stderr).
  • pytest.raises: A context manager that asserts a specific exception is raised.
import json
import pytest

def test_json_output_is_valid(capsys: pytest.CaptureFixture[str]) -> None:
    # Given: A program that prints a JSON object
    print(json.dumps({"key": "value"}))

    # When: The output is captured
    captured = capsys.readouterr()

    # Then: The output is a valid JSON with no errors
    assert captured.err == "", "Stderr should be empty"
    assert json.loads(captured.out), "Stdout should be a valid JSON"

def test_main_raises_error_on_no_configs() -> None:
    # Given: A function that requires configuration
    def run_main(argv: list) -> None:
        if not argv:
            raise RuntimeError("No configs provided.")

    # When/Then: Calling the function with no arguments raises a RuntimeError
    with pytest.raises(RuntimeError) as error:
        run_main([])

    assert "No configs provided." in str(error.value)

How do you test code with randomness?

Machine learning code often involves randomness (e.g., train/test splits, model initialization). To make tests deterministic and reproducible, always fix the random seed, either by passing a random_state argument or by seeding the global random generators, so that repeated runs produce identical results.

import numpy as np
from sklearn.model_selection import train_test_split

def test_data_split_is_reproducible():
    # Given: A dataset and a fixed random state
    X = np.arange(100).reshape(50, 2)
    y = np.arange(50)

    # When: The data is split with a fixed random_state
    X_train_1, _, y_train_1, _ = train_test_split(X, y, random_state=42)
    X_train_2, _, y_train_2, _ = train_test_split(X, y, random_state=42)

    # Then: The splits are identical
    np.testing.assert_array_equal(X_train_1, X_train_2)
    np.testing.assert_array_equal(y_train_1, y_train_2)
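
When code relies on global random state rather than an explicit random_state argument, a fixture can reset the seed before every test. A minimal sketch, assuming it lives in tests/conftest.py and that NumPy's legacy global generator is the source of randomness:

import random

import numpy as np
import pytest

@pytest.fixture(autouse=True)
def set_random_seeds() -> None:
    """Seed the global random generators before each test."""
    random.seed(42)
    np.random.seed(42)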

What are best practices for writing unit tests?

  1. Write Clear, Readable Tests: A test should be easy to understand. Use descriptive names and the Given-When-Then structure to clarify intent.
  2. Keep Tests Independent and Isolated: Each test should be able to run on its own, without depending on the state left by other tests.
  3. Use Fixtures for Setup and Teardown: Leverage fixtures to manage setup and cleanup logic, keeping tests clean and focused.
  4. Test for Edge Cases: Go beyond the "happy path." Test for invalid inputs, empty data, and other edge conditions that could cause failures.
  5. Aim for High Test Coverage: Strive for at least 80% code coverage. This ensures most of your codebase is validated and encourages a culture of quality (a way to enforce this threshold is shown after this list).
  6. Keep Tests Fast: Slow tests slow down development. Optimize test performance to ensure the suite can be run quickly and frequently.
  7. Run Tests Automatically: Integrate your test suite into a CI/CD workflow to catch issues early and often.
  8. Review and Refactor Tests: Just like production code, tests should be reviewed and updated as the codebase evolves.
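
To enforce the coverage goal from practice 5, coverage.py supports a fail_under setting, and pytest-cov offers an equivalent command-line flag; a minimal sketch:

# In pyproject.toml: fail the test run if total coverage drops below 80%
[tool.coverage.report]
fail_under = 80

# Or enforce it directly from the command line
uv run pytest --cov=src --cov-fail-under=80 tests/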

Additional Resources