3.4. Configurations

What are configurations?

Software configurations are parameters and constants that control your application's behavior, which are kept separate from the source code to allow for greater flexibility. These can be supplied through environment variables, configuration files, or command-line interface (CLI) arguments.

For example, a YAML configuration file provides a human-readable way to define settings:

# Example: confs/training.yaml
job:
  KIND: TrainingJob # Specifies the type of job to run
  inputs:
    KIND: ParquetReader # Defines the reader for input data
    path: data/inputs.parquet # Path to the input dataset
  targets:
    KIND: ParquetReader # Defines the reader for target data
    path: data/targets.parquet # Path to the target dataset

This separation allows you to change parameters like file paths or component types without modifying the application's code, making it adaptable to different environments and experiments.
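
To make this concrete, here is a minimal sketch of how an application might turn the KIND keys above into objects. The REGISTRY dictionary, the ParquetReader dataclass, and the build function are hypothetical names chosen for illustration; they are an assumption about one possible design, not the package's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ParquetReader:
    """Hypothetical reader component holding the path of a Parquet dataset."""

    path: str


# Hypothetical registry mapping each KIND value to the class that implements it.
REGISTRY = {"ParquetReader": ParquetReader}


def build(settings: dict) -> object:
    """Instantiate the component named by the KIND key with the remaining settings."""
    params = dict(settings)  # copy so the caller's dict is left untouched
    kind = params.pop("KIND")
    if kind not in REGISTRY:
        raise ValueError(f"Unknown KIND: {kind!r}")
    return REGISTRY[kind](**params)


# Same structure as the `inputs` entry of confs/training.yaml, already parsed.
inputs = build({"KIND": "ParquetReader", "path": "data/inputs.parquet"})
print(inputs)
```

Swapping the reader then only requires changing the KIND value in the YAML file and registering the new class once.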

Why is externalizing configurations crucial in MLOps?

Externalizing configurations is a cornerstone of robust MLOps practices because it decouples your code from the environment it runs in. This is critical for:

Environment Switching: Seamlessly transition between development, staging, and production environments by simply swapping configuration files for different databases, file paths, and API keys.
Reproducible Experimentation: Tweak model hyperparameters (e.g., learning rate, batch size) or select different algorithms for training runs by defining them in configuration files, ensuring experiments are easy to track and reproduce.
Scalability and Deployment: Adjust resource allocations (e.g., CPU, memory, GPU) or change deployment targets (e.g., local, cloud, edge) without touching the core logic.

Which configuration file format is best?

While JSON and TOML are viable options, YAML is often the preferred choice in the Python ecosystem for its readability with deeply nested settings and its support for comments (which JSON lacks), which is invaluable for documenting complex settings.

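However, a critical security consideration with YAML in Python is that PyYAML's yaml.load() can construct arbitrary Python objects when used with an unsafe loader, which lets a malicious file execute code. Always use yaml.safe_load() to parse YAML files: it only builds plain data types such as dictionaries, lists, strings, and numbers, which closes this vulnerability.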
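
The difference can be sketched as follows, assuming the PyYAML package is installed. The YAML text is a shortened copy of the training example, and the malicious string is a made-up payload used only to show that the safe loader refuses Python-specific tags.

```python
import yaml  # provided by the PyYAML package

text = """
job:
  KIND: TrainingJob
  inputs:
    KIND: ParquetReader
    path: data/inputs.parquet
"""

# safe_load only builds plain Python data (dict, list, str, int, float, ...).
config = yaml.safe_load(text)
print(config["job"]["inputs"]["path"])

# A document that asks for a Python object is rejected by the safe loader.
malicious = "!!python/object/apply:os.system ['echo unsafe']"
try:
    yaml.safe_load(malicious)
except yaml.YAMLError as error:
    print(f"Rejected: {type(error).__name__}")
```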

How do you provide configurations to an application?

The most common and flexible method is to pass configuration files via the command-line interface (CLI). This approach allows you to layer configurations, where a base file can be overridden by environment-specific or experiment-specific files.

For example, you could combine a default configuration with a training-specific one:

$ bikes defaults.yaml training.yaml --verbose

Here, defaults.yaml might contain shared settings, while training.yaml holds parameters specific to a training job. The --verbose flag is an additional CLI argument that can control application behavior, such as logging levels.
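
The layering idea can be illustrated with plain dictionaries. This is a simplified sketch of the override semantics, not how any particular library implements merging; the merge function and the sample values are made up for the example.

```python
def merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins; nested dicts are merged recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Stand-ins for the parsed content of defaults.yaml and training.yaml.
defaults = {"job": {"KIND": "TrainingJob", "inputs": {"path": "data/inputs.parquet"}}}
training = {"job": {"inputs": {"path": "data/sample.parquet"}}}

config = merge(defaults, training)
print(config)
# {'job': {'KIND': 'TrainingJob', 'inputs': {'path': 'data/sample.parquet'}}}
```

Files listed later on the command line play the role of `override`, so the order of the arguments decides which value wins.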

What are the best Python libraries for managing configurations?

A powerful combination for modern Python applications is using OmegaConf for parsing and Pydantic for validation.

Parsing with OmegaConf: It excels at loading, merging, and managing hierarchical configurations from YAML files. It also supports variable interpolation, allowing you to reference one part of the configuration from another.
Validation with Pydantic: It ensures your application receives the data in the format it expects. By defining a schema, Pydantic validates data types, enforces constraints, and provides clear error messages, preventing bugs and failures during runtime.

Here is how OmegaConf can parse and merge the configuration files passed on the command line:

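import argparse
import typing as T

import omegaconf as oc

Config = oc.ListConfig | oc.DictConfig

def parse_file(path: str) -> Config:
    """Parse a config file from a path."""
    return oc.OmegaConf.load(path)

def merge_configs(configs: T.Sequence[Config]) -> Config:
    """Merge a list of configs into a single config (later configs override earlier ones)."""
    return oc.OmegaConf.merge(*configs)

def main(argv: list[str] | None = None) -> Config:
    """Parse the CLI arguments, then load and merge the given config files."""
    # Assumed CLI definition: one or more config files, merged in the order given.
    parser = argparse.ArgumentParser(description="Run a job from config files.")
    parser.add_argument("files", nargs="+", help="Config files to merge, in order.")
    args = parser.parse_args(argv)
    files = [parse_file(file) for file in args.files]
    return merge_configs(files)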

This pattern gives you the flexibility of layered YAML files. Pairing it with Pydantic adds validation against typed schemas at runtime, which is a significant improvement over passing plain Python dictionaries around.

Using Pydantic for configuration validation and default values ensures that your application behaves as expected: mismatches or errors in configuration files are caught early, which avoids failures after long-running jobs. This is an improvement over Python's dicts, as each key is validated and declared explicitly in your code base:

import pydantic as pdt

class TrainTestSplitter(pdt.BaseModel):
    """Split a dataframe into a train and test set.

    Parameters:
        shuffle (bool): shuffle the dataset. Default is False.
        test_size (int | float): number/ratio for the test set.
        random_state (int): random state for the splitter object.
    """

    shuffle: bool = False
    test_size: int | float
    random_state: int = 42

When should you use environment variables over configuration files?

Environment variables are ideal for settings that change between deployments or contain sensitive data. The Twelve-Factor App methodology recommends storing configuration that varies between deployments in the environment.

Use environment variables for:
- Secrets: API keys, database passwords, and other credentials should never be hardcoded or stored in version-controlled files.
- System-Level Settings: Environment-specific details like an MLFLOW_TRACKING_URI or DATABASE_URL.

$ MLFLOW_TRACKING_URI=./mlruns bikes one two three

In this example, the MLflow tracking URI is passed to the bikes program through its environment, so it never has to appear in the configuration files.
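
On the Python side, such variables can be read with the standard library. This is a minimal sketch: the "./mlruns" fallback and the require_env helper are assumptions made for illustration, not part of the bikes package.

```python
import os

# Read the tracking URI from the environment, falling back to a local folder.
# The "./mlruns" default is an assumption for this sketch.
tracking_uri = os.environ.get("MLFLOW_TRACKING_URI", "./mlruns")
print(tracking_uri)


def require_env(name: str) -> str:
    """Return the value of a required environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Optional settings get a default, while secrets go through a helper like require_env so that a missing credential stops the program at startup instead of failing later.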

What are the best practices for configuration management?

To build a scalable and maintainable configuration system, follow these best practices:

Security:
- Always use yaml.safe_load() to prevent arbitrary code execution.
- Store secrets and credentials in environment variables or a secure vault, not in configuration files.
Robustness:
- Validate Early: Use a library like Pydantic to validate configurations at startup to catch errors immediately.
- Handle Errors Gracefully: Implement robust error handling for file I/O and parsing operations.
- Use Context Managers: Ensure files are properly opened and closed to prevent resource leaks.
Maintainability:
- Provide Sensible Defaults: Set default values for optional parameters to make the application easier to use.
- Document Everything: Use comments within your configuration files to explain what each parameter does.
- Keep it Consistent: Maintain a consistent format and structure across all configuration files.
- Consider Versioning: For large projects, version your configuration schema to manage changes over time.

You can use the Configs section of your notebooks to initialize the configuration files for your Python package.