3.4. Configurations
What are configurations?
Software configurations consist of parameters or constants for the operation of your program, externalized to allow flexibility and adaptability. Configurations can be provided through various means, such as environment variables, configuration files, or command-line interface (CLI) arguments. For instance, a YAML configuration file might look like this:
job:
KIND: TrainingJob
inputs:
KIND: ParquetReader
path: data/inputs.parquet
targets:
KIND: ParquetReader
path: data/targets.parquet
This structure allows for easy adjustment of parameters like file paths or job kinds, facilitating the program's operation across diverse environments and use cases.
Why do you need to write configurations?
Configurations enhance your code's flexibility, making it adaptable to different environments and scenarios without source code modifications. This separation of code from its execution environment boosts portability and simplifies updates or changes, much like adjusting settings in an application without altering its core functionality.
Which file format should you use for configurations?
When choosing a format for configuration files, common options include JSON, TOML, and YAML. YAML is frequently preferred for its readability, ease of use, and ability to include comments, which can be particularly helpful for documentation and maintenance. However, it's essential to be aware of YAML's potential for loading malicious content; therefore, always opt for safe loading practices.
How should you pass configuration files to your program?
Passing configuration files to your program typically utilizes the CLI, offering a straightforward method to integrate configurations with additional command options or flags. For example, executing a command like:
$ bikes defaults.yaml training.yaml --verbose
This example enables the combination of configuration files with verbosity options for more detailed logging. This flexibility is also extendable to configurations stored on cloud services, provided your application supports such paths.
Which toolkit should you use to parse and load configurations?
For handling configurations in Python, OmegaConf offers a powerful solution with features like YAML loading, deep merging, variable interpolation, and read-only configurations. It's particularly suited for complex settings and hierarchical structures. Additionally, for applications involving cloud storage, cloudpathlib facilitates direct loading from services like AWS, GCP, and Azure.
import typing as T
import omegaconf as oc
Config = oc.ListConfig | oc.DictConfig
def parse_file(path: str) -> Config:
"""Parse a config file from a path."""
return oc.OmegaConf.load(path)
def merge_configs(configs: T.Sequence[Config]) -> Config:
"""Merge a list of config into a single config."""
return oc.OmegaConf.merge(*configs)
args = parser.parse_args(argv)
files = [configs.parse_file(file) for file in args.files]
config = configs.merge_configs(files)
Utilizing Pydantic for configuration validation and default values ensures that your application behaves as expected by catching mismatches or errors in configuration files early in the process, thereby avoiding potential failures after long-running jobs. This is an improvement over Python's dicts as each key are validated and mentioned explicitly in your code base:
import pydantic as pdt
class TrainTestSplitter(pdt.BaseModel):
"""Split a dataframe into a train and test set.
Parameters:
shuffle (bool): shuffle the dataset. Default is False.
test_size (int | float): number/ratio for the test set.
random_state (int): random state for the splitter object.
"""
shuffle: bool = False
test_size: int | float
random_state: int = 42
When should you use environment variables instead of configurations files?
Environment variables are more suitable for simple configurations or when dealing with sensitive information that shouldn't be stored in files, even though they lack the structure and type-safety of dedicated configuration files. They are universally supported and easily integrated but may become cumbersome for managing complex or numerous settings.
$ MLFLOW_TRACKING_URI=./mlruns bikes one two three
In this example, the MLFLOW_TRACKING_URI
is passed as an environment variable to the bikes
program, while the command also accepts 3 positional arguments: one, two, and three.
What are the best practices for writing and loading configurations?
To ensure effective configuration management:
- Always use
yaml.safe_load()
to prevent the execution of arbitrary code. - Utilize context managers for handling files to ensure proper opening and closing.
- Implement robust error handling for I/O operations and parsing.
- Validate configurations against a schema to confirm correctness.
- Avoid storing sensitive information in plain text; instead, use secure mechanisms.
- Provide defaults for optional parameters to enhance usability.
- Document configurations with comments for clarity.
- Maintain a consistent format across configuration files for better readability.
- Consider versioning your configuration format to manage changes effectively in larger projects.
You can use the Configs
section of your notebooks to initialize the configuration files for your Python package.