7. Observability

Observability in Machine Learning Operations (MLOps) gives teams insight into the performance, behavior, and health of AI/ML models and their supporting infrastructure in production. It encompasses the practices and tools that let teams understand how models are performing, detect issues early, and make data-driven decisions to optimize and maintain these systems. This chapter covers several key aspects of observability, along with strategies for building more reliable and effective MLOps pipelines.

  • 7.0. Reproducibility: Explore how to make machine learning experiments and pipelines more reproducible using MLflow Projects, enabling others to verify findings, share knowledge, and build upon existing work.
  • 7.1. Monitoring: Learn the fundamental principles and tools for monitoring AI/ML models, focusing on tracking key metrics, setting up alerts, and understanding changes in model behavior using the MLflow Evaluate API and Evidently.
  • 7.2. Alerting: Understand how to design effective alert systems to promptly notify stakeholders of potential issues with models or infrastructure using tools like Slack, Discord, Datadog, and PagerDuty.
  • 7.3. Lineage: Delve into data and model lineage, discovering how to track the origin and transformation of data and models throughout the ML lifecycle using MLflow Dataset.
  • 7.4. Costs and KPIs: Explore techniques for managing costs associated with running AI/ML workloads and for defining and tracking key performance indicators (KPIs) aligned with business goals, using MLflow Tracking for analysis.
  • 7.5. Explainability: Explore the concept of explainable AI, focusing on techniques like SHAP to understand model predictions and build trust in AI systems.
  • 7.6. Infrastructure: Discover the importance of infrastructure monitoring, learning how to track resource usage and performance metrics to optimize efficiency and costs through MLflow System Metrics.