5.5. AI/ML Experiments
What is an AI/ML experiment?
An AI/ML experiment is a systematic and iterative process for building robust machine learning models. It involves testing different algorithms, tuning hyperparameters, and using various datasets to discover the optimal configuration for a specific predictive task. Each experiment is a structured trial designed to measure the impact of changes on model performance, such as accuracy, efficiency, and reliability.
Why is experiment tracking essential in AI/ML?
In MLOps, the complexity and often non-deterministic nature of model development require a disciplined approach. Experiment tracking provides the necessary structure, much like a lab notebook for a scientist. Key benefits include:
- Ensuring Reproducibility: Tracking guarantees that every aspect of an experiment—code, data, environment, and parameters—is recorded. This allows you and your team to reliably replicate and verify results.
- Optimizing Hyperparameters: It provides a systematic way to test and compare different hyperparameter configurations, helping you pinpoint the settings that maximize model performance.
- Organizing Your Work: By logging experiments and using tags, you can categorize runs by model type, dataset, or objective. This organization is crucial for managing complex projects and quickly retrieving past results.
- Monitoring Performance: Tracking metrics during a run offers real-time insight into how adjustments affect model behavior, enabling faster, data-driven decisions.
- Seamless Framework Integration: Modern tracking tools integrate with popular AI/ML frameworks, creating a unified and streamlined workflow across your entire toolchain.
Which experiment tracking solution should you use?
Numerous solutions are available for tracking AI/ML experiments. Major cloud platforms like Google Cloud (Vertex AI), Azure (Azure ML), and AWS (SageMaker) offer powerful, integrated MLOps capabilities. Specialized commercial tools like Weights & Biases and Neptune AI also provide excellent features.
For those starting out or preferring an open-source, framework-agnostic solution, MLflow is an outstanding choice. It is versatile, robust, and integrates with a wide array of ML libraries.
To install MLflow, run:
uv add mlflow
To verify the installation and start the MLflow UI server locally:
uv run mlflow doctor
uv run mlflow server
For a more permanent setup using Docker, you can use a docker-compose.yml file to launch the MLflow server:
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.14.1
    ports:
      - 5000:5000
    environment:
      - MLFLOW_HOST=0.0.0.0
    command: mlflow server
Run docker compose up to start the service. For information on production-grade deployments, refer to the MLflow documentation.
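If you use this setup, point MLflow clients at the running server rather than a local directory. A minimal sketch, assuming the server from the compose file above is reachable at http://localhost:5000:

import mlflow

# Point the MLflow client at the server started via Docker Compose
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_registry_uri("http://localhost:5000")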
How do you configure MLflow in a project?
To integrate MLflow, you first need to configure it to store experiment data. You can start by setting the tracking and registry URIs to a local directory, such as ./mlruns. Then, define an experiment name to group related runs.
Enabling MLflow's autologging is highly recommended. It automatically captures metrics, parameters, and models from popular ML libraries without requiring explicit logging statements.
import mlflow
# Configure MLflow to save data to a local directory
mlflow.set_tracking_uri("./mlruns")
mlflow.set_registry_uri("./mlruns")
# Set a name for the experiment
mlflow.set_experiment(experiment_name="Bike Sharing Demand Prediction")
# Enable autologging for automatic tracking
mlflow.autolog()
To start a new run, wrap your training code within an MLflow run context. This allows you to add descriptive metadata and enable system metric logging.
with mlflow.start_run(
    run_name="Forecast Model with Feature Engineering",
    description="Training a model with an enhanced feature set.",
    log_system_metrics=True,
) as run:
    # Your model training and evaluation code goes here
    print(f"MLflow Run ID: {run.info.run_id}")
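For instance, here is a minimal end-to-end sketch of a run with autologging enabled, using scikit-learn and synthetic placeholder data (the estimator and features are illustrative assumptions, not the project's actual pipeline):

import mlflow
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Enable autologging so parameters, training metrics, and the model are captured
mlflow.autolog()

# Synthetic placeholder data standing in for the real dataset
X, y = make_regression(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="Forecast Model with Feature Engineering") as run:
    # Autologging records the estimator's hyperparameters and training metrics
    model = RandomForestRegressor(n_estimators=100, max_depth=5)
    model.fit(X_train, y_train)
    # Metrics computed after fit (e.g., via model.score) may also be captured by autologging
    print("Test R2:", model.score(X_test, y_test))
    print(f"MLflow Run ID: {run.info.run_id}")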
What information can you track in an experiment?
While MLflow's autologging captures a wealth of information automatically, you can enhance it with manual logging for greater detail:
- Parameters: Log key-value parameters with mlflow.log_param() or a dictionary of parameters with mlflow.log_params().
- Metrics: Record single metrics over time (e.g., per epoch) with mlflow.log_metric() or a dictionary of metrics with mlflow.log_metrics().
- Inputs: Log dataset details and context with mlflow.log_input(), including tags for better categorization.
- Tags: Assign custom tags to a run for improved filtering and organization using mlflow.set_tag() or mlflow.set_tags().
- Artifacts: Save output files, such as plots, model files, or data samples, with mlflow.log_artifact() for single files or mlflow.log_artifacts() for directories.
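As a minimal sketch, the following combines several of these calls in a single run; the parameter names, metric values, and file contents are illustrative placeholders:

import mlflow

with mlflow.start_run(run_name="Manual Logging Example"):
    # Parameters: one at a time or as a dictionary
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_params({"n_estimators": 200, "max_depth": 6})

    # Metrics: log repeatedly with a step to build a curve
    for epoch, loss in enumerate([0.9, 0.7, 0.6]):
        mlflow.log_metric("train_loss", loss, step=epoch)
    mlflow.log_metrics({"val_rmse": 0.42, "val_r2": 0.87})

    # Tags: free-form metadata for filtering runs later
    mlflow.set_tag("model", "xgboost")
    mlflow.set_tags({"dataset": "v2", "status": "candidate"})

    # Artifacts: save any local file (or a directory with log_artifacts)
    with open("feature_importance.txt", "w") as f:
        f.write("hour: 0.31\ntemperature: 0.22\n")
    mlflow.log_artifact("feature_importance.txt")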
How can you compare experiments to find the best model?
Comparing experiments is essential for model selection. MLflow provides two powerful ways to do this: its web UI and its programmatic API.
Comparing via the MLflow Web UI
The MLflow UI offers an intuitive, visual way to compare runs.
- Launch the MLflow Server: If it's not running, start it with uv run mlflow server.
- Select Runs: Navigate to the experiment page, where all runs are listed. Use the checkboxes to select the runs you want to compare.
- Click Compare: A "Compare" button will appear. Clicking it opens a detailed view that places the selected runs side-by-side.
- Analyze Results: This view provides a comprehensive summary of parameters, metrics, and artifacts for each run. You can use it to identify which configurations yielded the best performance.
Comparing Programmatically
Programmatic comparison is ideal for automated analysis and custom reporting.
- Query Runs: Use mlflow.search_runs() to fetch run data into a pandas DataFrame. You can filter by experiment, metrics, parameters, or tags.

import mlflow

# Fetch runs from specific experiments
experiment_ids = ["1", "2"]
runs_df = mlflow.search_runs(experiment_ids)

- Filter and Sort: With the data in a DataFrame, you can use pandas to sort and filter the results to find the top-performing runs based on your criteria.

# Find the best run based on validation accuracy
best_run = runs_df.sort_values("metrics.validation_accuracy", ascending=False).iloc[0]
print(f"Best Run ID: {best_run.run_id}")

- Visualize Comparisons: Use libraries like Matplotlib or Seaborn to create custom visualizations that make comparisons clear and intuitive.

import matplotlib.pyplot as plt

# Plot validation accuracy for the top 5 runs
top_5_runs = runs_df.sort_values("metrics.validation_accuracy", ascending=False).head(5)
plt.figure(figsize=(12, 7))
plt.bar(top_5_runs["run_id"].str[:7], top_5_runs["metrics.validation_accuracy"])
plt.title("Comparison of Validation Accuracy Across Top 5 Runs")
plt.xlabel("Run ID")
plt.ylabel("Validation Accuracy")
plt.show()
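Beyond post-hoc filtering in pandas, mlflow.search_runs() also accepts server-side filters and ordering. A brief sketch, reusing the validation_accuracy metric name and xgboost tag assumed above:

import mlflow

# Let the tracking backend do the filtering and ordering
best_runs_df = mlflow.search_runs(
    experiment_ids=["1", "2"],
    filter_string="metrics.validation_accuracy > 0.9 AND tags.model = 'xgboost'",
    order_by=["metrics.validation_accuracy DESC"],
    max_results=5,
)
print(best_runs_df[["run_id", "metrics.validation_accuracy"]])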
What are some best practices for experiment tracking?
To maximize the value of your experiments, adopt these practices:
- Use Clear Naming Conventions: Give experiments and runs descriptive names to make them easily identifiable (e.g., PROD_Retraining_ResNet50 vs. test_run_1).
- Align with Business Metrics: Ensure that the metrics you track are directly relevant to project goals and business outcomes.
- Leverage Nested Runs: Use nested runs to organize complex experiments, such as hyperparameter tuning, where each child run explores a different parameter set.
with mlflow.start_run(run_name="Hyperparameter Search") as parent_run:
    params = [0.01, 0.02, 0.03]
    for p in params:
        with mlflow.start_run(nested=True, run_name=f"alpha_{p}") as child_run:
            mlflow.log_param("alpha", p)
            # ... training logic ...
            mlflow.log_metric("val_loss", val_loss)
- Tag Extensively: Use tags to add metadata like the dataset version, model type, or evaluation status (e.g., dataset:v2, model:xgboost, status:production_candidate).
- Track Progress Over Time: Log metrics at each step or epoch to visualize the learning process and diagnose issues like overfitting.
# Inside your training loop
mlflow.log_metric(key="train_loss", value=train_loss, step=epoch)
- Register Promising Models: When a run produces a high-quality model, log it to the MLflow Model Registry to version it and prepare it for deployment.
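A minimal sketch of registering a model from a completed run; the run ID and registry name are placeholders, and it assumes the model artifact was logged under the default "model" path (as autologging does):

import mlflow

run_id = "<your-run-id>"  # placeholder: the ID of the run that produced the model
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="bike_sharing_forecaster",  # assumed registry name for illustration
)
print(f"Registered version {result.version} of {result.name}")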
How does experiment tracking fit into the MLOps lifecycle?
Experiment tracking is a cornerstone of the MLOps lifecycle, bridging the gap between development and production.
- Development: It provides the tools to systematically explore and refine models.
- CI/CD Integration: The artifacts and metadata from experiments feed directly into continuous integration and deployment pipelines. For example, a CI pipeline can automatically trigger when a new model is registered, running tests and preparing it for deployment.
- Production Monitoring: The metrics and parameters from training runs serve as a baseline for monitoring the model's performance in production. If performance degrades (a concept known as model drift), the tracked experiments provide a clear, reproducible history to inform retraining and debugging efforts.
By maintaining a detailed record of every experiment, you create a transparent, auditable, and efficient workflow that accelerates the entire MLOps cycle.
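As one concrete bridge between training and monitoring, the metrics of a completed run can be pulled back as a baseline for comparison with production metrics. A brief sketch, assuming the run ID behind the deployed model is known:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Fetch the training metrics of the run behind the deployed model
baseline_run = client.get_run("<deployed-model-run-id>")  # placeholder run ID
baseline_metrics = baseline_run.data.metrics
print(baseline_metrics)

# Compare live production metrics against this baseline to detect drift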