4.0. Typing
What is programming typing?
Typing in programming involves designating specific data types for variables, functions, and classes within a programming language. This concept is critical for managing how data is stored, processed, and interacted within software applications.
Programming languages are categorized into three main types based on how they handle typing:
- Static typing: In statically typed languages, the data type of a variable is known at compile time, which means that type checking is done during the compilation of the program. Examples include Java, C++, and Haskell. This approach allows for early detection of type-related errors, contributing to more robust and error-resistant code.
- Dynamic typing: Dynamically typed languages determine the type of a variable at runtime. This flexibility allows for more rapid development but can introduce type-related errors that are harder to detect early in the development process. Examples of dynamically typed languages are Ruby, JavaScript, PHP, and Python.
- Gradual typing: Gradual typing offers a blend of static and dynamic typing, allowing developers to choose when to enforce type constraints. This approach provides the flexibility of dynamic typing while still enabling the benefits of static type checking where it's most useful. Languages that support gradual typing include TypeScript, Dart, and Python (from version 3.5 onwards with type annotations).
Additionally, languages can have either a weak or strong type system:
- Weak typing: In languages with weak typing, type coercion is common, allowing for more flexibility in operations between different types but at the risk of unexpected behavior or errors (e.g., 1 + "s" => "1s").
- Strong typing: Strongly typed languages enforce stricter rules about interactions between data types, reducing the chances of runtime errors due to unexpected type conversions but requiring more explicit declarations and conversions by the developer (e.g., 1 + "s" => error, str(1) + "s" = "1s").
Why is typing useful in programs?
The role of typing in programming, especially in complex or large-scale projects, is invaluable for several reasons:
- Early Bug Detection: Typing helps in identifying potential type-related issues at the early stages of development, preventing bugs that could become costly and complex to resolve later.
- Enhanced Code Clarity: Type annotations clarify the expected data types for function inputs and outputs, making the code more readable and understandable.
- Improved Development Workflow: Adopting typing encourages a disciplined coding practice, resulting in fewer errors and enhanced code quality.
- Facilitates Collaboration: In team settings, clear type annotations ensure that all members understand the data structures and function interfaces, leading to more effective collaboration.
- Integration with IDEs: Advanced IDEs utilize type hints to offer superior code completion, error highlighting, and refactoring capabilities.
Although specifying types requires additional effort, this investment significantly improves the codebase's quality.
What is the relation between Python and typing?
Python is primarily recognized as a strong and dynamically typed language, allowing programmers to write code without specifying types explicitly. This approach is straightforward but may not be scalable for larger projects. Since Python 3.5, the language has supported gradual typing, enabling developers to annotate types. This feature enhances code clarity and aids in error prevention, especially during development.
For instance, a simple function without type annotations in Python might look like this:
def print_n_times(message, n):
for _ in range(n):
print(message)
However, for better clarity and to take advantage of gradual typing, the same function with type annotations would be:
def print_n_times(message: str, n: int) -> None:
for _ in range(n):
print(message)
Incorporating type annotations is highly recommended for the benefits they bring in terms of code clarity and early error detection, except in some cases where the effort might not justify the value.
It's important to note that Python types are checked during development time, meaning they're used to verify the program's logic and flow rather than affecting runtime performance or optimization.
To dive deeper into Python typing, exploring resources such as the Mypy cheatsheet and Python's built-in typing module is beneficial.
Is it possible to provide types for a dataframe?
It's possible to provide types for dataframes using the Pandera library. Pandera offers a flexible and expressive API for validating data in dataframe-like objects, enhancing the readability and robustness of data processing pipelines.
Pandera allows for:
- Defining a schema once and validating different dataframe types, including pandas, dask, modin, and pyspark.pandas.
- Checking the types and properties of columns in a pandas DataFrame or values in a pandas Series.
- Performing complex statistical validations, such as hypothesis testing.
- Integrating seamlessly with data analysis and processing pipelines through function decorators.
- Using a class-based API for dataframe models, similar to pydantic, and validating dataframes with typing syntax.
- Synthesizing data from schema objects for property-based testing.
- Validating dataframes lazily to execute all validation rules before raising an error.
- Integrating with a rich ecosystem of Python tools like pydantic, fastapi, and mypy.
Here's an example schema for validating a dataframe in an MLOps codebase:
import pandera as pa
import pandera.typing as papd
import pandera.typing.common as padt
class InputsSchema(pa.DataFrameModel):
"""Schema for the project inputs."""
instant: papd.Index[padt.UInt32] = pa.Field(ge=0, check_name=True)
dteday: papd.Series[padt.DateTime] = pa.Field()
season: papd.Series[padt.UInt8] = pa.Field(isin=[1, 2, 3, 4])
yr: papd.Series[padt.UInt8] = pa.Field(ge=0, le=1)
mnth: papd.Series[padt.UInt8] = pa.Field(ge=1, le=12)
hr: papd.Series[padt.UInt8] = pa.Field(ge=0, le=23)
holiday: papd.Series[padt.Bool] = pa.Field()
weekday: papd.Series[padt.UInt8] = pa.Field(ge=0, le=6)
workingday: papd.Series[padt.Bool] = pa.Field()
weathersit: papd.Series[padt.UInt8] = pa.Field(ge=1, le=4)
temp: papd.Series[padt.Float16] = pa.Field(ge=0, le=1)
atemp: papd.Series[padt.Float16] = pa.Field(ge=0, le=1)
hum: papd.Series[padt.Float16] = pa.Field(ge=0, le=1)
windspeed: papd.Series[padt.Float16] = pa.Field(ge=0, le=1)
casual: papd.Series[padt.UInt32] = pa.Field(ge=0)
registered: papd.Series[padt.UInt32] = pa.Field(ge=0)
Is it possible to provide better types for classes?
Pydantic enhances the native class syntax by validating class attributes and providing a cleaner, more efficient syntax.
Features of Pydantic include:
- Validation and serialization powered by type hints, integrating seamlessly with IDEs and static analysis tools.
- High performance due to core validation logic written in Rust.
- Capability to emit JSON Schema for easy integration with other tools.
- Support for both strict and lax modes for data validation.
- Validation for many standard library types, including dataclasses and TypedDicts.
- Extensive customization options for validators and serializers.
- A rich ecosystem of integrations with popular libraries like FastAPI and SQLModel.
- Reliability proven by widespread use across various industries and projects.
Example usage in an MLOps codebase:
import pydantic as pdt
class GridCVSearcher(pdt.BaseModel):
"""Grid searcher with cross-fold validation for better model performance metrics."""
n_jobs: int | None = None
refit: bool = True
verbose: int = 3
error_score: str | float = "raise"
return_train_score: bool = False
How can you check your types with Python?
Mypy is the primary tool for type checking in Python, providing command-line and IDE integration options.
uv add --group checkers mypy
uv run mypy src/ tests/
Faster alternatives to mypy include:
- pyright: Static Type Checker for Python. MIT, Microsoft
- pyre-check: Performant type-checking for python. MIT, Meta
- pytype: A static type analyzer for Python code. Apache-2, Google
Compared to other alternatives, Mypy supports additional plugins as we are doing to see below.
How can you configure mypy to improve your validation workflow?
To enhance your validation workflow, you can configure mypy in your project's pyproject.toml
. Before committing code, it's advisable to run mypy across your codebase to ensure type correctness. You can ignore the .mypy_cache/
folders generated by mypy by adding them to your .gitignore
.
Example mypy configuration in pyproject.toml
:
[tool.mypy]
# improve error messages
pretty = true
# specify the python version
python_version = "3.12"
# check untyped definitions
check_untyped_defs = true
# all missing imports in code
ignore_missing_imports = true
# enable additional mypy plugins
plugins = ["pandera.mypy", "pydantic.mypy"]
If you need to ignore mypy for entire file or single line, you can add the following comment:
def func(a: int, b: int) -> bool: # type: ignore[empty-body]
pass
More configuration options are available in the mypy documentation.
What are the best practices for providing types in Python?
- Follow the 80-20 rule: Focus on annotating types where it brings the most benefit.
- Familiarize yourself with the typing module: for further understanding Python types.
- Use implicit typing judiciously, as not all variables require explicit annotations.
- Employ
typing.Any
sparingly when specific types are not necessary or known. - Leverage tools like mypy for continuous type checking during development.