4.0. Typing

What is typing in programming?

Typing is the practice of assigning a specific data type (e.g., string, integer, boolean) to variables, function arguments, and return values. This discipline governs how data is used, ensuring that operations are performed on compatible types.

Programming languages approach typing in different ways:

Static Typing: Types are checked at compile time, before the code is run. This catches errors early in the development cycle. Examples include Java, C++, and Rust.
Dynamic Typing: Types are checked at runtime, as the code executes. This offers flexibility but can lead to unexpected errors. Python, Ruby, and JavaScript are dynamically typed.
Gradual Typing: A hybrid approach that allows developers to introduce static type checking into a dynamically typed language. Python (version 3.5+) supports this through optional type hints, so a codebase can be annotated incrementally while unannotated code keeps running unchanged.

Additionally, type systems can be categorized by their strictness:

Strong Typing: The language enforces strict rules about how types can interact. For example, you cannot add an integer to a string without explicit conversion (str(1) + "s"). Python is strongly typed.
Weak Typing: The language automatically converts types when performing operations (a process called type coercion), which can sometimes lead to unpredictable results (e.g., in JavaScript, 1 + "s" evaluates to "1s").
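
The two axes above are independent, and Python sits at "dynamic" on one and "strong" on the other. The following minimal sketch illustrates both; the variable names are chosen only for this illustration.

```python
# Dynamic typing: a name can be rebound to values of different types.
value = 1
first_type = type(value).__name__
value = "one"
second_type = type(value).__name__

# Strong typing: mixing int and str without conversion raises TypeError.
number = 1
try:
    number + "s"
    mixed_ok = True
except TypeError:
    mixed_ok = False

# An explicit conversion makes the operation well-defined.
explicit = str(number) + "s"

print(first_type, second_type, mixed_ok, explicit)  # int str False 1s
```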

Why is typing useful in programs?

Adopting a typing discipline, especially in large-scale MLOps projects, delivers significant advantages:

Fewer Bugs: Catching type-related errors before runtime prevents a common class of bugs that can be difficult to track down in production.
Improved Readability: Type annotations act as a form of documentation, making it immediately clear what kind of data a function expects and returns.
Better IDE Support: Modern IDEs like VS Code leverage type hints to provide intelligent code completion, error highlighting, and safe refactoring.
Enhanced Collaboration: A clear type system ensures that all team members understand the data structures and interfaces, reducing integration friction.
Greater Confidence: Writing typed code provides a safety net, giving developers more confidence when modifying or extending the codebase.

While adding types requires an initial effort, the long-term payoff in code quality, maintainability, and reliability is substantial.

What is the relation between Python and typing?

Python is a dynamically and strongly typed language. Historically, developers did not need to declare variable types. However, this flexibility can become a liability in complex applications.

Since Python 3.5, the language officially supports gradual typing through type hints. These are optional annotations that specify the expected types. It's crucial to understand that the Python interpreter itself does not enforce these hints at runtime; they are primarily for static analysis tools.

Consider this simple function without type hints:

def print_n_times(message, n):
    for _ in range(n):
        print(message)

With type hints, the function's intent becomes much clearer:

def print_n_times(message: str, n: int) -> None:
    for _ in range(n):
        print(message)

The annotation -> None explicitly states that the function does not return a value. Using type hints is a best practice for modern Python development, as it significantly improves code clarity and robustness.
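
To make the point about runtime enforcement concrete, here is a small sketch using a hypothetical repeat function (not part of the example above). The hints are stored as metadata on the function, but the interpreter never consults them when the function is called.

```python
def repeat(message: str, n: int) -> str:
    """Return the message repeated n times."""
    return message * n


# Type hints are stored as plain metadata on the function object.
hints = repeat.__annotations__

# Correct usage.
ok = repeat("ab", 2)  # "abab"

# Incorrect usage: the interpreter does not check the hints, so this call
# succeeds and silently returns an int instead of a str.
# A static type checker would flag the first argument.
silent = repeat(3, 4)  # 12

print(hints)
print(ok, silent)
```

The second call is the kind of silent bug that a static analysis tool catches before the code runs.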

To learn more, consult the official typing module documentation and the Mypy cheatsheet.

How can you add types to a dataframe?

You can add types to a dataframe with a data validation library like Pandera. Pandera provides a flexible and intuitive API to define schemas for dataframe-like objects, making data processing pipelines more robust and readable.

Key features of Pandera include:

Unified Schema Definition: Define a schema once and use it to validate various dataframe libraries, including pandas, Dask, Modin, and PySpark.
Type and Property Checks: Enforce column data types, value ranges, and other properties.
Advanced Validation: Perform complex statistical validations, such as hypothesis testing.
Seamless Integration: Use function decorators to integrate data validation directly into your pipelines.
Pydantic-Style Models: Define schemas using a class-based API that feels familiar to Pydantic users.
Data Synthesis: Generate synthetic data from a schema for property-based testing.

Here is an example of a Pandera schema for validating input data in an MLOps project:

import pandera as pa
import pandera.typing as papd
import pandera.typing.common as padt

class InputsSchema(pa.DataFrameModel):
    """Schema for the project inputs."""

    instant: papd.Index[padt.UInt32] = pa.Field(ge=0, check_name=True)
    dteday: papd.Series[padt.DateTime] = pa.Field()
    season: papd.Series[padt.UInt8] = pa.Field(isin=[1, 2, 3, 4])
    yr: papd.Series[padt.UInt8] = pa.Field(ge=0, le=1)
    mnth: papd.Series[padt.UInt8] = pa.Field(ge=1, le=12)
    hr: papd.Series[padt.UInt8] = pa.Field(ge=0, le=23)
    holiday: papd.Series[padt.Bool] = pa.Field()
    weekday: papd.Series[padt.UInt8] = pa.Field(ge=0, le=6)
    workingday: papd.Series[padt.Bool] = pa.Field()
    weathersit: papd.Series[padt.UInt8] = pa.Field(ge=1, le=4)
    temp: papd.Series[padt.Float16] = pa.Field(ge=0, le=1)
    atemp: papd.Series[padt.Float16] = pa.Field(ge=0, le=1)
    hum: papd.Series[padt.Float16] = pa.Field(ge=0, le=1)
    windspeed: papd.Series[padt.Float16] = pa.Field(ge=0, le=1)
    casual: papd.Series[padt.UInt32] = pa.Field(ge=0)
    registered: papd.Series[padt.UInt32] = pa.Field(ge=0)

How can you improve type safety in classes?

While Python's standard-library @dataclass is useful, it does not check field types at runtime. Pydantic goes further by validating and serializing data at runtime.

Why use Pydantic?

Type Enforcement: Pydantic validates data against your type hints and raises clear errors if the data is invalid.
High Performance: In Pydantic v2, the core validation logic is written in Rust, which keeps validation overhead low.
JSON Schema Support: Automatically generate JSON Schemas from your models for seamless API integration.
Strict and Lax Modes: Choose between strict type checking or allowing Pydantic to coerce data into the correct type.
Broad Compatibility: Works with many standard library types, including dataclasses and TypedDict.
Rich Ecosystem: Integrates with popular libraries like FastAPI, Typer, and SQLModel.

Here is an example of using Pydantic to define a configuration object in an MLOps codebase:

import pydantic as pdt

class GridCVSearcher(pdt.BaseModel):
    """Grid searcher with cross-fold validation for better model performance metrics."""

    n_jobs: int | None = None
    refit: bool = True
    verbose: int = 3
    error_score: str | float = "raise"
    return_train_score: bool = False

How can you check types in Python?

The standard tool for static type checking in Python is Mypy. It can be run from the command line or integrated directly into your IDE.

uv add --group check mypy
uv run mypy src/ tests/

While Mypy is the most established type checker, several alternatives exist, some of which are noticeably faster on large codebases:

pyright: Developed by Microsoft; it serves as the engine for Pylance in VS Code.
pyre-check: A performant type checker developed by Meta.
pytype: A static type analyzer from Google that can infer types for untyped code.

Mypy remains a popular choice due to its maturity and extensive plugin ecosystem, which allows it to understand and validate libraries like Pydantic and Pandera.

How can you configure Mypy for your project?

You can configure Mypy by adding a [tool.mypy] section to your pyproject.toml file. This ensures consistent type checking for all developers on the project. Remember to add the .mypy_cache/ directory to your .gitignore file.

Here is a recommended Mypy configuration:

[tool.mypy]
# Improve error messages for better readability.
pretty = true

# Specify the target Python version.
python_version = "3.13"

# Type-check the bodies of functions that have no type annotations.
check_untyped_defs = true

# Suppress errors about missing stubs for third-party libraries.
ignore_missing_imports = true

# Enable plugins for libraries like Pandera and Pydantic.
plugins = ["pandera.mypy", "pydantic.mypy"]

If you need to bypass type checking for a specific line or file, you can use a type: ignore comment:

def func(a: int, b: int) -> bool:  # type: ignore[empty-body]
    # This function body is intentionally empty for now.
    pass

For more options, refer to the Mypy configuration documentation.

What are the best practices for typing in Python?

Apply the 80/20 Rule: Prioritize adding types where they provide the most value, such as in function signatures and data model definitions.
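Master the typing Module: Become familiar with the tools in the typing module, like Protocol, TypedDict, Literal, Optional, and Union. On recent Python versions, prefer the built-in generics (list[str], dict[str, int]) over typing.List and typing.Dict, and the X | None syntax over Optional[X].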
Use Implicit Typing Where Appropriate: You don't need to annotate every single variable. A type checker can often infer the type from its assignment (e.g., x = 5 is clearly an int).
Use typing.Any as a Last Resort: Avoid using Any whenever possible, as it effectively disables type checking for that variable.
Integrate Type Checking into CI/CD: Run Mypy as part of your continuous integration pipeline to catch type errors before they reach production.
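
Several of these practices can be combined in a few lines. The sketch below is illustrative only: the Predictor protocol, MeanPredictor class, and mean_absolute_error function are hypothetical names. It annotates function signatures, relies on inference for local variables, and uses Protocol instead of Any to describe what a model must provide.

```python
from typing import Protocol


class Predictor(Protocol):
    """Anything with a compatible predict method counts as a Predictor."""

    def predict(self, inputs: list[float]) -> list[float]: ...


class MeanPredictor:
    """Satisfies Predictor structurally, without inheriting from it."""

    def __init__(self, mean: float) -> None:
        self.mean = mean

    def predict(self, inputs: list[float]) -> list[float]:
        return [self.mean for _ in inputs]


def mean_absolute_error(
    model: Predictor, inputs: list[float], targets: list[float]
) -> float:
    # Local variables are left unannotated: a checker infers list[float].
    predictions = model.predict(inputs)
    errors = [abs(p - t) for p, t in zip(predictions, targets)]
    return sum(errors) / len(errors)


score = mean_absolute_error(MeanPredictor(2.0), [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(round(score, 4))  # 0.6667
```

Typing the model parameter as Predictor rather than Any keeps the function open to any model class while still letting a type checker verify the predict call.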