3.1. Modules

What are Python modules?

A Python module is a file ending in .py that contains Python definitions and statements. Modules are the cornerstone of code organization in Python, allowing you to group related functions, classes, and variables into a single, manageable unit.

Think of a module as a self-contained namespace. When you import a module, you gain access to the objects defined within it, promoting code reuse and preventing naming conflicts.

You can inspect a module's properties programmatically. To find its file location, use the __file__ attribute, and to list its contents, use the dir() function:

# Discovering a module's file path
# (note: built-in C modules such as math have no __file__ attribute,
# so we inspect a pure-Python module here)
import os
print(os.__file__)

# Listing the names defined in a module
print(dir(os))

Why are modules critical for MLOps projects?

In MLOps, projects quickly evolve beyond simple scripts. Modules are essential for building robust, scalable, and maintainable machine learning systems. They provide several key benefits:

  • Organization: By separating concerns, modules make your codebase easier to navigate. For instance, you can have distinct modules for data loading (datasets.py), feature engineering (features.py), model definitions (models.py), and training pipelines (training.py).
  • Reusability: A function to normalize data, once defined in a utility module, can be imported and reused across different experiments and services (e.g., training and inference), as sketched below.
  • Collaboration: When team members work on different modules, it reduces merge conflicts and allows for parallel development. A data scientist can refine the modeling logic in models.py while an ML engineer optimizes the data pipeline in datasets.py.
  • Testability: Encapsulating logic within modules makes it easier to write targeted unit tests, ensuring each component of your ML system works as expected.

Without a modular structure, a project becomes a monolithic script that is difficult to debug, test, and extend.
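
As an example of reusability, a shared normalization helper could live in a small utility module. A minimal sketch, using a hypothetical bikes.utils module and normalize function:

# src/bikes/utils.py (hypothetical utility module)
import numpy as np

def normalize(values: np.ndarray) -> np.ndarray:
    """Scale values to zero mean and unit variance."""
    return (values - values.mean()) / values.std()

Both the training and the inference code can then share one implementation via from bikes.utils import normalize, so a preprocessing fix lands in every consumer at once.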

How do you create a Python module?

Creating a module is as simple as creating a new file with a .py extension inside your project's source directory. For example, within a src/bikes package, you could define modules for handling data and models:

$ touch src/bikes/datasets.py
$ touch src/bikes/models.py

  • src/bikes/datasets.py might contain functions to load raw data from a CSV file, clean it, and split it into training and testing sets.
  • src/bikes/models.py could define the architecture of your machine learning model using a framework like Scikit-learn or PyTorch.

These files are now modules that can be imported elsewhere in your project.
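
As a starting point, datasets.py might look like the following. This is a minimal sketch, assuming a CSV input and scikit-learn for the split (the function names are illustrative):

# src/bikes/datasets.py (illustrative content)
import pandas as pd
from sklearn.model_selection import train_test_split

def load_data(path: str) -> pd.DataFrame:
    """Load the raw bike-sharing data from a CSV file."""
    return pd.read_csv(path)

def split_data(df: pd.DataFrame, test_size: float = 0.2):
    """Split the data into training and testing subsets."""
    return train_test_split(df, test_size=test_size, random_state=42)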

How do you import from your own modules?

Python uses a list of directories called sys.path to find modules during an import statement. When you install your project in editable mode (e.g., with uv pip install -e .), your project's source directory is added to this path.

This allows you to use absolute imports, which is the recommended practice for clarity and avoiding ambiguity:

# Assuming 'bikes' is your package in the 'src' directory
from bikes.datasets import load_data
from bikes.models import create_pipeline

# You can inspect sys.path to see where Python looks for modules
import sys
print(sys.path)

Using absolute imports from your project's root makes your code more readable, and your import statements stay less brittle when files are reorganized, compared to relative imports (from . import models).
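
To see the difference side by side, consider a hypothetical training.py inside the package importing a sibling module:

# src/bikes/training.py (hypothetical module inside the package)

# Absolute import: states explicitly where the module lives
from bikes.models import create_pipeline

# Relative import: depends on this file's position within the package
from .models import create_pipeline

# (In practice you would pick one style; both resolve to the same module here.)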

What is an effective way to organize modules?

A powerful strategy for structuring modules in an MLOps project is to separate code that performs I/O from pure domain logic. This pattern, inspired by concepts like Domain-Driven Design, isolates the predictable, testable parts of your code from the unpredictable parts that interact with the outside world.

Consider this structure:

  • Domain Layer: Contains the core logic of your application. This code is pure, deterministic, and has no external dependencies like databases or APIs.
    • domain/models.py: Defines model architectures or pipelines.
    • domain/features.py: Contains pure functions for feature transformations.
  • I/O Layer: Manages interactions with external systems. This is where side-effects (reading files, querying databases, making API calls) happen.
    • io/datasets.py: Handles loading data from sources (e.g., S3, SQL) and saving artifacts.
    • io/services.py: Connects to external services like model registries or monitoring dashboards.
  • Application Layer: Orchestrates the domain and I/O layers to perform high-level tasks.
    • training.py: A script that uses io.datasets to load data, domain.features to process it, domain.models to define a model, and io.datasets again to save the trained artifact.
    • inference.py: An entrypoint for serving predictions, combining I/O and domain logic.

This separation makes your core logic highly testable and reusable, as it doesn't depend on specific infrastructure.
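
A minimal sketch of this layering across three files (the column names and helper bodies are hypothetical):

# src/bikes/domain/features.py: pure, deterministic, no external dependencies
def engineer_features(df):
    """Derive new features from raw columns; no side-effects."""
    out = df.copy()
    out["temp_squared"] = out["temp"] ** 2  # hypothetical derived feature
    return out

# src/bikes/io/datasets.py: side-effects (file access) are isolated here
import pandas as pd

def load_data(path):
    """Read raw data from an external source."""
    return pd.read_csv(path)

# src/bikes/training.py: the application layer orchestrates the other two
from bikes.domain.features import engineer_features
from bikes.io.datasets import load_data

def run_training(path):
    df = load_data(path)          # I/O layer
    return engineer_features(df)  # domain layer

Because engineer_features touches nothing outside its arguments, it can be unit tested with a small in-memory DataFrame, with no files or databases involved.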

What are the risks of using modules?

The primary risk with modules is executing code with side-effects upon import. A side-effect is any operation that affects state outside its local scope, such as modifying a file, connecting to a database, or even printing to the console.

If a module performs a heavy computation or a destructive operation at the top level, it will be executed the moment it's imported, which can lead to slow, unpredictable, and dangerous behavior.

Unsafe Example:

# unsafe_module.py
import pandas as pd

# Side-effect: This large file is loaded into memory on import
print("Loading large dataset...")
df = pd.read_csv("very_large_dataset.csv")
print("Dataset loaded.")

# main.py
print("Importing the unsafe module...")
import unsafe_module  # This line will trigger the file loading
print("Import finished.")

To prevent this, all executable code should be placed inside functions or guarded by an if __name__ == "__main__": block. This ensures the code only runs when the module is executed as a script, not when it's imported.

Safe Practice:

# safe_module.py
import pandas as pd

def load_large_dataset():
    """Loads the dataset when explicitly called."""
    print("Loading large dataset...")
    df = pd.read_csv("very_large_dataset.csv")
    print("Dataset loaded.")
    return df

# This block only runs when you execute `python safe_module.py`
if __name__ == "__main__":
    # This is a safe place for script-level logic or tests
    data = load_large_dataset()
    print("Module executed directly. Data shape:", data.shape)
