
2.2. Configs

What are configs?

Configurations, often abbreviated as "configs," are a set of parameters or settings that adapt the behavior of your code. By defining configs, you add a layer of flexibility and customization, making it easy to adjust critical variables without touching the core logic of your codebase. This strategy improves not only the usability of your code but also its adaptability across scenarios.

Here's a practical illustration of configs within a notebook context:

# Define paths for caching and training data
CACHE_PATH = '../.cache/'
TRAIN_DATA_PATH = '../data/train.csv'
# Configure random state for reproducibility
RANDOM_STATE = 0
# Setup dataset parameters for testing and shuffling
SHUFFLE = True
TEST_SIZE = 0.2
TARGET = "SalePrice"
# Parameters for pipeline configurations
CV = 5
SCORING = "neg_mean_squared_error"
PARAM_GRID = {
    "regressor__max_depth": [12, 15, 18, 21],
    "regressor__n_estimators": [150, 200, 250, 300],
}
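
To see how these constants plug into downstream code, here is a minimal sketch, assuming the CSV at TRAIN_DATA_PATH exists and contains only numeric features, and that the pipeline wraps a random forest under the step name "regressor" (which is what the PARAM_GRID keys imply):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Load the training data and separate the features from the target
train = pd.read_csv(TRAIN_DATA_PATH)
X, y = train.drop(columns=[TARGET]), train[TARGET]
# Split the dataset according to the dataset configs defined above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, shuffle=SHUFFLE, random_state=RANDOM_STATE
)
# Build and tune the pipeline according to the pipeline configs
pipeline = Pipeline([("regressor", RandomForestRegressor(random_state=RANDOM_STATE))])
search = GridSearchCV(pipeline, param_grid=PARAM_GRID, cv=CV, scoring=SCORING)
search.fit(X_train, y_train)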

Why should you create configs?

Incorporating configs into your projects is a reflection of best practices in software development. This approach ensures your code remains:

  • Flexible: Facilitating effortless adaptations and changes to different datasets or experimental scenarios.
  • Easy to Maintain: Streamlining the process of making updates or modifications without needing to delve deep into the core logic.
  • User-Friendly: Providing a straightforward means for users to tweak the notebook's functionality to their specific requirements without extensive coding interventions.
  • Explicit: Avoiding hard-coded values and magic numbers by naming and documenting key variables, so the notebook stays understandable and reviewable by others (see the sketch after this list).

Effectively, configurations act as a universal "remote control" for your code, offering an accessible interface for fine-tuning its behavior.
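
For instance, here is a minimal sketch of the difference between a magic number and a named config (the variable names are hypothetical):

rows = list(range(100))  # stand-in dataset

# Magic number: the meaning of 0.2 is opaque and must be hunted down to change
test_rows = int(len(rows) * 0.2)

# Named config: documented once, adjustable in one place
TEST_SIZE = 0.2  # fraction of rows reserved for evaluation
test_rows = int(len(rows) * TEST_SIZE)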

Which configs can you provide out of the box?

When it comes to data science projects, several common configurations are frequently utilized, including:

  • Data Processing: Parameters like SHUFFLE, TEST_SIZE, and RANDOM_STATE are instrumental in controlling how data is prepared and manipulated.
  • Model Parameters: Definitions such as N_ESTIMATORS and MAX_DEPTH cater to tuning machine learning model behaviors.
  • Execution Settings: Variables like BATCH_SIZE and EPOCHS are crucial for defining the operational aspects of iterative processes, with LIMIT setting constraints on dataset sizes.

An example of how you might define some of these settings is as follows:

# Shuffle the dataset to mitigate selection bias
SHUFFLE = True
# Setting aside a portion of the data for testing purposes
TEST_SIZE = 0.2
# Ensuring reproducibility across experiments through fixed randomness
RANDOM_STATE = 0
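
Model and execution settings can be declared the same way; the values below are purely illustrative:

# Model parameters for a tree-based regressor
N_ESTIMATORS = 200
MAX_DEPTH = 15
# Execution settings for iterative training
BATCH_SIZE = 32
EPOCHS = 10
# Cap the number of rows loaded during development
LIMIT = 10_000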

How should you organize the configs in your notebook?

A logical and functional organization of your configurations can significantly enhance the readability and maintainability of your code. Grouping configs based on their purpose or domain of application is advisable:

## Paths

Define input and output paths ...

## Randomness

Configure settings to fix randomness ...

## Dataset

Specifications on how to load and transform datasets ...

## Pipelines

Details on defining and executing model pipelines ...

Such categorization makes it easier for both users and developers to navigate and modify configurations as needed.
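
Applied to the constants from the first example, a grouped layout could look like this:

## Paths
CACHE_PATH = '../.cache/'
TRAIN_DATA_PATH = '../data/train.csv'

## Randomness
RANDOM_STATE = 0

## Dataset
SHUFFLE = True
TEST_SIZE = 0.2
TARGET = "SalePrice"

## Pipelines
CV = 5
SCORING = "neg_mean_squared_error"
PARAM_GRID = {
    "regressor__max_depth": [12, 15, 18, 21],
    "regressor__n_estimators": [150, 200, 250, 300],
}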

What are options?

In the context of data science notebooks, options are akin to configurations but are specifically tied to the behavior and presentation of libraries such as pandas, matplotlib, and scikit-learn. These options offer a means to customize various aspects, including display settings and output formats, to suit individual needs or project requirements.

Here's an example showcasing the use of options in a notebook:

import pandas as pd
import sklearn

# Configure pandas display settings
pd.options.display.max_rows = None
pd.options.display.max_columns = None
# Adjust sklearn output format
sklearn.set_config(transform_output="pandas")

Why do you need to pass options?

Library defaults may not always cater to your specific needs or the demands of your project. For instance:

  • pandas truncates the rows and columns it displays by default, hiding data you may want to inspect.
  • matplotlib's default figure size is relatively small, which can make detailed plots hard to read.
  • scikit-learn transformers return NumPy arrays by default, dropping the column names that pandas DataFrames carry.

Adjusting these options helps tailor the working environment to better fit your workflow and analytical needs, ensuring that outputs are both informative and visually accessible.
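
As a quick check of the first point, this sketch shows pandas truncating a long frame until the option is relaxed:

import pandas as pd

df = pd.DataFrame({"value": range(100)})
print(df)  # truncated: display.max_rows defaults to 60

pd.options.display.max_rows = None
print(df)  # all 100 rows are now rendered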

How should you configure library options?

To optimize your working environment, consider customizing the settings of key libraries according to your project's needs. Here are some guidelines:

For Pandas:

import pandas as pd

# Adjust maximum display settings for rows and columns
pd.options.display.max_rows = None
pd.options.display.max_columns = None
# Increase maximum column width to improve readability
pd.options.display.max_colwidth = None

For Matplotlib:

import matplotlib.pyplot as plt

# Customize default figure size for better visibility
plt.rcParams['figure.figsize'] = (20, 10)

For Scikit-learn:

import sklearn

# Modify the output format to return pandas dataframes instead of numpy arrays
sklearn.set_config(transform_output='pandas')
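
To verify the effect, here is a minimal sketch (it requires scikit-learn 1.2 or later, where transform_output was introduced):

import pandas as pd
import sklearn
from sklearn.preprocessing import StandardScaler

sklearn.set_config(transform_output="pandas")
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
scaled = StandardScaler().fit_transform(df)
print(type(scaled))  # a pandas DataFrame instead of a numpy array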
