<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ayub Shah</title>
    <description>The latest articles on Forem by Ayub Shah (@ayubshah014sys).</description>
    <link>https://forem.com/ayubshah014sys</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906545%2F9b00e6e0-15d4-41a5-8a69-a61d354056ec.jpg</url>
      <title>Forem: Ayub Shah</title>
      <link>https://forem.com/ayubshah014sys</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ayubshah014sys"/>
    <language>en</language>
    <item>
      <title>Model Drift Detection: Stop Silent Failures Before They Kill Your Model (2026)</title>
      <dc:creator>Ayub Shah</dc:creator>
      <pubDate>Sat, 02 May 2026 07:34:40 +0000</pubDate>
      <link>https://forem.com/ayubshah014sys/model-drift-detection-stop-silent-failures-before-they-kill-your-model-2026-1fio</link>
      <guid>https://forem.com/ayubshah014sys/model-drift-detection-stop-silent-failures-before-they-kill-your-model-2026-1fio</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mlopslab.org/model-drift-detection-tutorial/" rel="noopener noreferrer"&gt;mlopslab.org&lt;/a&gt; — updated weekly. 0 sponsors, 0 affiliate links.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚡ The problem in one sentence:&lt;/strong&gt; Your model shipped and worked great on day one. Now, weeks later, it's making worse decisions — silently, without throwing a single error. Drift detection is how you catch this before the damage is done.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What is model drift detection?&lt;/li&gt;
&lt;li&gt;Three types of drift you must monitor&lt;/li&gt;
&lt;li&gt;Why it matters in production&lt;/li&gt;
&lt;li&gt;Statistical methods for drift detection&lt;/li&gt;
&lt;li&gt;Tools comparison&lt;/li&gt;
&lt;li&gt;Step-by-step tutorial with Evidently AI&lt;/li&gt;
&lt;li&gt;When drift is detected — what to do&lt;/li&gt;
&lt;li&gt;FAQ&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. What is model drift detection?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Model drift detection&lt;/strong&gt; is the practice of monitoring ML models in production to identify when they start degrading due to changes in real-world data.&lt;/p&gt;

&lt;p&gt;Without it, a model that worked perfectly at deployment starts making worse predictions — often silently, without any errors or alerts. This is one of the most common reasons ML projects fail in production. By the time you notice the problem, you've already lost revenue, damaged user trust, or made critical decisions based on stale model outputs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📉 &lt;strong&gt;The silent killer:&lt;/strong&gt; Most teams only discover drift when a stakeholder complains. By then, the model has been wrong for weeks — sometimes months. A drift detection system would have flagged it the day the shift started.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Three types of drift you must monitor
&lt;/h2&gt;

&lt;p&gt;There are three distinct failure modes, and they require different detection strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data drift
&lt;/h3&gt;

&lt;p&gt;Input feature distributions change over time. Your model encounters data it was never trained on. This is the most common type and the easiest to detect — you're comparing distributions, not outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A fraud detection model trained on 2024 transaction patterns encounters completely different spending behavior in 2026. Feature distributions shift, and accuracy silently collapses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concept drift
&lt;/h3&gt;

&lt;p&gt;The relationship between inputs and outputs changes. What the model learned is no longer valid in the current world — even if the input data looks similar, the correct answer has changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A house price model trained pre-COVID fails badly after remote work permanently shifts housing demand dynamics. The features are the same; the world has changed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prediction drift
&lt;/h3&gt;

&lt;p&gt;The distribution of model outputs shifts over time — even before you can measure accuracy. This is a leading indicator that something upstream has changed and is often the earliest signal you'll get.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; A recommendation model starts surfacing entirely different categories as user behavior shifts after a product redesign.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;The hard truth:&lt;/strong&gt; These three types rarely appear in isolation. A shift in the input data often accompanies a change in the input–output relationship, and both tend to surface first as a shift in predictions. Monitor all three.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Why it matters in production
&lt;/h2&gt;

&lt;p&gt;The business impact of undetected drift breaks down into three categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revenue loss&lt;/strong&gt; — Bad recommendations, wrong pricing, and failed fraud detection translate directly to lost money. A single undetected fraud spike or pricing error can cost more than an entire year of monitoring infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User trust&lt;/strong&gt; — Users notice when your model is wrong before you do. Once trust is damaged, it's extremely hard to win back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance&lt;/strong&gt; — In regulated industries like finance and healthcare, model monitoring isn't optional; regulators frequently require it. Unmonitored model degradation is an audit finding.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Business case:&lt;/strong&gt; One properly implemented drift detection system can save months of debugging time and prevent millions in revenue loss. It pays for itself on the first incident it catches.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. Statistical methods for drift detection
&lt;/h2&gt;

&lt;p&gt;Three methods cover the vast majority of production use cases:&lt;/p&gt;

&lt;h3&gt;
  
  
  PSI — Population Stability Index
&lt;/h3&gt;

&lt;p&gt;The most widely used metric in production drift detection. PSI measures how much a distribution has shifted between a reference (training) sample and a current (production) sample.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PSI value&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 0.1&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;td&gt;None required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.1 – 0.25&lt;/td&gt;
&lt;td&gt;Moderate shift&lt;/td&gt;
&lt;td&gt;Investigate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt; 0.25&lt;/td&gt;
&lt;td&gt;Major shift&lt;/td&gt;
&lt;td&gt;Retrain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;PSI is fast to compute and easy to explain to non-technical stakeholders, which is why it's the default choice for most teams.&lt;/p&gt;
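
&lt;p&gt;If you want to see what the number actually measures, PSI is simple enough to compute by hand. Here's a minimal NumPy sketch (not the exact implementation any particular library uses; bin edges and zero-handling vary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def psi(reference, current, n_bins=10):
    """Population Stability Index between two 1-D samples."""
    # Bin edges come from the reference (training) distribution
    edges = np.histogram_bin_edges(reference, bins=n_bins)
    # Clip production values into the training range so nothing falls outside the bins
    current = np.clip(current, edges[0], edges[-1])
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # A small floor avoids log(0) on empty bins
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# A one-standard-deviation mean shift lands well inside the "major shift" band
rng = np.random.default_rng(0)
print(psi(rng.normal(0, 1, 10_000), rng.normal(1.0, 1, 10_000)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;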

&lt;h3&gt;
  
  
  KS Test — Kolmogorov-Smirnov
&lt;/h3&gt;

&lt;p&gt;A non-parametric statistical test that compares two samples and returns a p-value. A low p-value (&amp;lt; 0.05) means you can reject the hypothesis that both samples were drawn from the same distribution. It is more statistically rigorous than PSI and holds up better on smaller sample sizes.&lt;/p&gt;
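
&lt;p&gt;In Python this is a single SciPy call. A minimal sketch, reusing the same reference-versus-current framing (the synthetic data is only for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 800)      # training-time values of one feature
current = rng.normal(0.4, 1.2, 800)    # recent production values of the same feature

# Two-sample KS test: the statistic is the max distance between the empirical CDFs
result = ks_2samp(reference, current)
if result.pvalue &lt; 0.05:
    print(f"Drift suspected (KS statistic={result.statistic:.3f}, p={result.pvalue:.4f})")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;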

&lt;h3&gt;
  
  
  Distribution plots
&lt;/h3&gt;

&lt;p&gt;Visual comparison of feature distributions over time. Look for shifts in mean, variance, shape changes, or the appearance of new modes. Essential for communicating drift results to stakeholders and debugging which features are causing the problem.&lt;/p&gt;
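
&lt;p&gt;A minimal plotting sketch with matplotlib, where &lt;code&gt;reference&lt;/code&gt; and &lt;code&gt;current&lt;/code&gt; stand in for the training-time and production values of a single feature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import matplotlib.pyplot as plt

def plot_feature_drift(reference, current, feature_name):
    """Overlay reference vs. current histograms for a single feature."""
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.hist(reference, bins=30, alpha=0.5, density=True, label="reference (training)")
    ax.hist(current, bins=30, alpha=0.5, density=True, label="current (production)")
    ax.set_title(f"Distribution shift: {feature_name}")
    ax.set_xlabel(feature_name)
    ax.set_ylabel("density")
    ax.legend()
    fig.savefig(f"drift_{feature_name}.png", dpi=150)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;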

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; If any feature's mean, variance, or category shares shift by more than 15–20% relative to the reference, investigate immediately. Don't wait for accuracy to drop.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Tools comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evidently AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Self-hosted drift reports, full customization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WhyLabs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS (free tier)&lt;/td&gt;
&lt;td&gt;Teams without dedicated ML infra&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prometheus + Grafana&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Drift as time-series metrics, custom alerting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MLflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Teams already using MLflow for experiment tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This tutorial uses &lt;strong&gt;Evidently AI&lt;/strong&gt; — it's free, self-hosted, runs as a pip install, and automatically produces detailed HTML reports with per-feature drift tests and scores.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Step-by-step tutorial with Evidently AI
&lt;/h2&gt;

&lt;p&gt;⏱ &lt;em&gt;~30 minutes end-to-end&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The setup assumes you have a FastAPI model serving endpoint already running. If not, the logging and detection steps still apply — just swap the FastAPI parts for however you're serving predictions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Install Evidently
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;evidently
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2 — Log predictions from your FastAPI endpoint
&lt;/h3&gt;

&lt;p&gt;Add prediction logging to your &lt;code&gt;/predict&lt;/code&gt; endpoint. Every prediction gets stored to a JSONL file for later drift analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add this to your FastAPI /predict endpoint
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_prediction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;probability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;log_entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;probability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;probability&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;predictions.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call &lt;code&gt;log_prediction()&lt;/code&gt; inside your endpoint every time you serve a result. The JSONL format appends one JSON object per line — it's cheap, crash-safe, and trivial to read back.&lt;/p&gt;
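
&lt;p&gt;For context, here is roughly where that call sits. This is a minimal sketch rather than the exact service from this tutorial: the request schema, feature names, and the already-loaded &lt;code&gt;model&lt;/code&gt; object are placeholders for whatever your endpoint uses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    # Placeholder schema: swap in your real feature fields
    amount: float
    account_age_days: float

@app.post("/predict")
def predict(request: PredictionRequest):
    features = request.dict()
    # "model" is assumed to be your already-loaded classifier
    probability = float(model.predict_proba([list(features.values())])[0][1])
    prediction = int(probability &gt;= 0.5)

    # The one extra line that makes drift analysis possible later
    log_prediction(features, prediction, probability)

    return {"prediction": prediction, "probability": probability}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;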

&lt;h3&gt;
  
  
  Step 3 — Load your reference (training) distribution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;evidently&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ColumnMapping&lt;/span&gt;

&lt;span class="c1"&gt;# Load the same data the model was trained on
&lt;/span&gt;&lt;span class="n"&gt;reference_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/training_features.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reference_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prediction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;training_predictions&lt;/span&gt;
&lt;span class="n"&gt;reference_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;probability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;training_probabilities&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is your reference baseline — the distribution your model expects to see. Here &lt;code&gt;training_predictions&lt;/code&gt; and &lt;code&gt;training_probabilities&lt;/code&gt; are the model's own outputs on the training set, saved when you trained it. Evidently will compare everything in production against this baseline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4 — Run the drift detection report
&lt;/h3&gt;

&lt;p&gt;Evidently's &lt;code&gt;DataDriftPreset&lt;/code&gt; automatically runs a statistical drift test on every feature (it picks a test per column, such as KS or Wasserstein, based on feature type and sample size) and produces a visual HTML report.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;evidently.report&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Report&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;evidently.metric_preset&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataDriftPreset&lt;/span&gt;

&lt;span class="c1"&gt;# Load recent production predictions
&lt;/span&gt;&lt;span class="n"&gt;current_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;predictions.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run the report
&lt;/span&gt;&lt;span class="n"&gt;data_drift_report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;DataDriftPreset&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
&lt;span class="n"&gt;data_drift_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;reference_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reference_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;current_data&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save as HTML
&lt;/span&gt;&lt;span class="n"&gt;data_drift_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;drift_report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;drift_report.html&lt;/code&gt; in your browser. Evidently shows a per-feature breakdown with drift scores, p-values, and distribution overlay plots for every feature in your dataset.&lt;/p&gt;
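
&lt;p&gt;One gotcha worth flagging: the log format from Step 2 stores the features as a nested object on each line, so reading it directly gives you a single &lt;code&gt;features&lt;/code&gt; column of dicts rather than one column per feature. Flatten it before running the report so the columns line up with your reference data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Read the raw prediction log written in Step 2
raw_log = pd.read_json("predictions.jsonl", lines=True)

# Expand the nested "features" dicts into one column per feature
current_data = pd.DataFrame(raw_log["features"].tolist())
current_data["prediction"] = raw_log["prediction"].values
current_data["probability"] = raw_log["probability"].values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;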

&lt;h3&gt;
  
  
  Step 5 — Configure alerts (Slack / PagerDuty)
&lt;/h3&gt;

&lt;p&gt;Don't just generate reports — trigger alerts automatically. The report is useful for debugging, but you need push notifications so your team is alerted the moment drift appears.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;evidently.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DatasetDriftMetric&lt;/span&gt;

&lt;span class="n"&gt;drift_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DatasetDriftMetric&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;drift_metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reference&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reference_data&lt;/span&gt;
&lt;span class="n"&gt;drift_metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_data&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;drift_metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_result&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;drift_detected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠ DRIFT DETECTED — investigate immediately&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
    &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://hooks.slack.com/services/YOUR_WEBHOOK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🚨 Model drift detected in production! Check drift_report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the Slack webhook with PagerDuty, email, or any HTTP webhook your team uses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6 — Automate with cron or Airflow
&lt;/h3&gt;

&lt;p&gt;Drift detection should run on a schedule, not manually. For simplicity, add it to cron:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run drift detection every day at 9:00 AM&lt;/span&gt;
&lt;span class="c"&gt;# Add with: crontab -e&lt;/span&gt;
0 9 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; python3 /opt/ml/drift_detection.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For teams that need retry logic, backfill, or alerting on the pipeline itself, wrap the detection script in an Airflow DAG instead.&lt;/p&gt;
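
&lt;p&gt;As a rough idea of what that wrapper could look like, here's a minimal DAG sketch. It assumes your script exposes a &lt;code&gt;run_drift_check()&lt;/code&gt; function (a hypothetical name) that loads the data, runs the report, and fires the alert:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal Airflow DAG sketch; assumes drift_detection.py exposes run_drift_check()
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from drift_detection import run_drift_check  # hypothetical helper from your script

with DAG(
    dag_id="daily_drift_check",
    start_date=datetime(2026, 1, 1),
    schedule="0 9 * * *",  # same 9:00 AM schedule as the cron example
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_drift_detection",
        python_callable=run_drift_check,
        retries=2,  # the retry behavior cron can't give you
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;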

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro tip on frequency:&lt;/strong&gt; Run detection daily for revenue-critical models, weekly for lower-stakes ones. The cost of a single missed drift event vastly outweighs the cost of running scheduled checks.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  7. When drift is detected — what to do
&lt;/h2&gt;

&lt;p&gt;Detection is only half the job. Here's the response workflow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Identify the drifting features.&lt;/strong&gt; Open the Evidently report and look at which specific features are flagged. Sort by drift score, descending. Often just one or two features are responsible, which narrows your investigation significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Diagnose the root cause.&lt;/strong&gt; Is it seasonal? A data pipeline bug? A real-world behavioral shift? Drift detection tells you &lt;em&gt;what&lt;/em&gt; changed, not &lt;em&gt;why&lt;/em&gt;. You still need to investigate upstream — check your data pipeline, talk to the product team, look at recent product changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Trigger retraining if drift is confirmed.&lt;/strong&gt; If the drift is real and significant, retrain on newer labeled data. Don't retrain blindly — confirm you have sufficient new labeled data first. Retraining on insufficient data can make performance worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Recalibrate your thresholds.&lt;/strong&gt; Update your alert thresholds based on what you learned. Some drift is acceptable for your use case (seasonal variation, for example). Tune PSI/KS thresholds to minimize false alarms without missing real incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Document the incident.&lt;/strong&gt; Add it to your model's changelog. Include what drifted, what the root cause was, and how you resolved it. This becomes your team's institutional knowledge for the next incident.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔁 &lt;strong&gt;Retraining strategy:&lt;/strong&gt; Don't retrain reflexively every time an alert fires. Only retrain when drift is confirmed AND you have sufficient new labeled data. Premature retraining on noisy data is a common mistake that creates more instability.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  8. FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How is drift detection different from just monitoring accuracy?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Accuracy monitoring requires labeled ground-truth data, which is often delayed or unavailable in real time. Drift detection works on inputs alone — you can catch problems the moment production data starts diverging from training data, before any labels are needed. Think of drift detection as an early warning system and accuracy monitoring as confirmation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much production data do I need before running drift detection?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For PSI to be statistically meaningful, you generally want at least 500–1,000 recent predictions as your "current" window. With smaller samples, KS test tends to be more reliable. Start collecting logs from day one, even if you don't run reports immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if my model has hundreds of features?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Evidently handles this automatically — it runs tests per feature and aggregates them into a dataset-level drift score. In practice, flag features with PSI &amp;gt; 0.25 or KS p-value &amp;lt; 0.05, then focus your investigation on the top 5–10 by drift score. Feature importance can also help prioritize which drifting features actually affect model output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use this approach for LLMs or generative models?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The statistical methods in this tutorial (PSI, KS test) work on structured tabular data. For LLMs, drift detection looks different — you'd monitor prompt distribution shifts, output length changes, semantic similarity between batches, or task-specific evaluation metrics. See &lt;a href="https://mlopslab.org/llm-observability/" rel="noopener noreferrer"&gt;LLM Observability: The ML Engineer's Practical Guide&lt;/a&gt; for that use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does retraining always fix drift?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not always. If the drift is caused by a data pipeline bug, retraining on corrupted data makes things worse. If it's concept drift (the world changed), you need new labeled data that reflects the new reality — retraining on old data does nothing. Always diagnose before retraining.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Model drift is not an edge case — it's the default outcome for any model in production long enough. The question isn't whether your model will drift, it's whether you'll find out before your users do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The minimum viable setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log every prediction → run Evidently daily → alert on &lt;code&gt;drift_detected=True&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination, running on a cron schedule, gives you complete coverage with maybe a day of implementation work.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔗 &lt;strong&gt;Next step:&lt;/strong&gt; Add &lt;code&gt;log_prediction()&lt;/code&gt; to your serving code today. Even if you don't set up Evidently yet, having the logs means you can run drift analysis retroactively. The habit of logging is the foundation.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Related articles on MLOpsLab
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/mlflow-tutorial-how-to-track-machine-learning-experiments-2026/" rel="noopener noreferrer"&gt;MLflow Tutorial: Track ML Experiments Like a Pro (2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/kubeflow-vs-airflow-which-pipeline-tool-should-you-use-for-ml/" rel="noopener noreferrer"&gt;Kubeflow vs Airflow: Which Pipeline Tool Should You Use for ML?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/ml-pipeline-tutorial-build-your-first-production-ml-pipeline-2026/" rel="noopener noreferrer"&gt;ML Pipeline Tutorial: Build Your First Production ML Pipeline (2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/mlops-roadmap-2026-how-to-become-an-ml-engineer-step-by-step/" rel="noopener noreferrer"&gt;MLOps Roadmap 2026: How to Become an ML Engineer Step by Step&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/llm-observability/" rel="noopener noreferrer"&gt;LLM Observability: The ML Engineer's Practical Guide (2026)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Evidently AI Documentation. &lt;a href="https://docs.evidentlyai.com" rel="noopener noreferrer"&gt;https://docs.evidentlyai.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Gama, J., et al. (2014). A Survey on Concept Drift Adaptation. &lt;em&gt;ACM Computing Surveys&lt;/em&gt;, 46(4). &lt;a href="https://doi.org/10.1145/2523813" rel="noopener noreferrer"&gt;https://doi.org/10.1145/2523813&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;FastAPI Documentation. &lt;a href="https://fastapi.tiangolo.com" rel="noopener noreferrer"&gt;https://fastapi.tiangolo.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Written by Ayub Shah — ML Engineering student, MLOps enthusiast. Testing every tool so you don't have to. No sponsors, no affiliate links.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;→ More at &lt;a href="https://mlopslab.org" rel="noopener noreferrer"&gt;mlopslab.org&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>MLflow Tutorial: How to Track ML Experiments Like a Pro (2026)</title>
      <dc:creator>Ayub Shah</dc:creator>
      <pubDate>Fri, 01 May 2026 19:04:47 +0000</pubDate>
      <link>https://forem.com/ayubshah014sys/mlflow-tutorial-how-to-track-ml-experiments-like-a-pro-2026-362f</link>
      <guid>https://forem.com/ayubshah014sys/mlflow-tutorial-how-to-track-ml-experiments-like-a-pro-2026-362f</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mlopslab.org/mlflow-tutorial/" rel="noopener noreferrer"&gt;mlopslab.org/mlflow-tutorial&lt;/a&gt; — updated weekly. 0 sponsors, 0 affiliate links.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚡ Quick answer:&lt;/strong&gt; MLflow is an open-source platform that tracks everything about your ML experiments — parameters, metrics, model artifacts, and code versions — so you can reproduce any result and never lose a winning configuration again. You'll have your first experiment tracked in under 20 minutes.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What is MLflow?&lt;/li&gt;
&lt;li&gt;Before you start&lt;/li&gt;
&lt;li&gt;Step 1 — Install MLflow&lt;/li&gt;
&lt;li&gt;Step 2 — Start the tracking server&lt;/li&gt;
&lt;li&gt;Step 3 — Write your first tracking script&lt;/li&gt;
&lt;li&gt;Step 4 — View results in the UI&lt;/li&gt;
&lt;li&gt;Step 5 — Compare multiple runs&lt;/li&gt;
&lt;li&gt;What to learn next&lt;/li&gt;
&lt;li&gt;FAQ&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. What is MLflow?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MLflow is an open-source platform that tracks everything about your ML experiments&lt;/strong&gt; — parameters, metrics, model artifacts, and code versions — so you can reproduce any result and never lose a winning configuration again.&lt;/p&gt;

&lt;p&gt;Without experiment tracking, most ML engineers waste hours rerunning experiments they've already done — or ship models they can't reproduce. MLflow eliminates both problems permanently.&lt;/p&gt;

&lt;p&gt;At its core, MLflow gives you four things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tracking&lt;/strong&gt; — log parameters, metrics, and artifacts for every run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Projects&lt;/strong&gt; — package code so it's reproducible on any machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt; — a standard format to package models for deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registry&lt;/strong&gt; — a central hub to manage model lifecycle (staging → production)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This tutorial covers the Tracking component, which is where 90% of the day-to-day value lives.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Note:&lt;/strong&gt; MLflow is model-framework agnostic. It works with scikit-learn, PyTorch, TensorFlow, XGBoost, Keras, LightGBM — anything you're already using.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Before you start
&lt;/h2&gt;

&lt;p&gt;You need three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.8+&lt;/strong&gt; — run &lt;code&gt;python --version&lt;/code&gt; to check&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pip installed&lt;/strong&gt; — comes with Python 3.4+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Basic ML knowledge&lt;/strong&gt; — you should know what "training a model" and "accuracy" mean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's it. No Docker, no AWS account, no paid tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Step 1 — Install MLflow
&lt;/h2&gt;

&lt;p&gt;⏱ &lt;em&gt;2 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;MLflow is a single pip install. It includes the tracking server, the UI, and the full Python API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;mlflow scikit-learn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mlflow &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# mlflow, version 2.x.x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Using a virtual environment?&lt;/strong&gt; Run &lt;code&gt;python -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate&lt;/code&gt; before installing. Recommended to keep your environment clean.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. Step 2 — Start the tracking server
&lt;/h2&gt;

&lt;p&gt;⏱ &lt;em&gt;1 minute&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In a terminal, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mlflow ui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2026-04-15 10:23:01 +0000] [INFO] Starting gunicorn 21.2.0
[2026-04-15 10:23:01 +0000] [INFO] Listening at: http://127.0.0.1:5000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;strong&gt;&lt;a href="http://localhost:5000" rel="noopener noreferrer"&gt;http://localhost:5000&lt;/a&gt;&lt;/strong&gt; in your browser — you'll see an empty MLflow dashboard. Leave this terminal running.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Port conflict?&lt;/strong&gt; If port 5000 is taken (common on macOS), run &lt;code&gt;mlflow ui --port 5001&lt;/code&gt; and visit &lt;code&gt;http://localhost:5001&lt;/code&gt; instead.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. Step 3 — Write your first tracking script
&lt;/h2&gt;

&lt;p&gt;⏱ &lt;em&gt;10 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Create a file called &lt;code&gt;train.py&lt;/code&gt; and paste this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow.sklearn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_iris&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;

&lt;span class="c1"&gt;# Configuration — change these to experiment
&lt;/span&gt;&lt;span class="n"&gt;N_ESTIMATORS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;MAX_DEPTH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="n"&gt;RANDOM_STATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;

&lt;span class="c1"&gt;# Load data
&lt;/span&gt;&lt;span class="n"&gt;iris&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_iris&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RANDOM_STATE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Name your experiment (MLflow creates it if it doesn't exist)
&lt;/span&gt;&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iris-classifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Train model
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;N_ESTIMATORS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MAX_DEPTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RANDOM_STATE&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Evaluate
&lt;/span&gt;    &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weighted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Log everything to MLflow
&lt;/span&gt;    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_estimators&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N_ESTIMATORS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_depth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_DEPTH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;f1_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sklearn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;random-forest-model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | F1: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;active_run&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python train.py
&lt;span class="c"&gt;# Accuracy: 0.9667 | F1: 0.9667&lt;/span&gt;
&lt;span class="c"&gt;# Run ID: a1b2c3d4e5f6...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MLflow created an &lt;code&gt;mlruns/&lt;/code&gt; folder in your working directory. That's where everything is stored locally.&lt;/p&gt;

&lt;h3&gt;
  
  
  What each MLflow call does
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Call&lt;/th&gt;
&lt;th&gt;What it logs&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mlflow.set_experiment()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Groups runs under a named experiment&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"iris-classifier"&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mlflow.log_param()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A single key-value config value&lt;/td&gt;
&lt;td&gt;&lt;code&gt;n_estimators=100&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mlflow.log_metric()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A numeric result (can be stepped over time)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;accuracy=0.967&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mlflow.sklearn.log_model()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The trained model artifact + signature&lt;/td&gt;
&lt;td&gt;Serialized RandomForest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;It worked!&lt;/strong&gt; Every run gets a unique run ID, timestamp, and its own folder under &lt;code&gt;mlruns/&lt;/code&gt;. Nothing overwrites anything.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Step 4 — View results in the MLflow UI
&lt;/h2&gt;

&lt;p&gt;⏱ &lt;em&gt;2 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Go back to &lt;strong&gt;&lt;a href="http://localhost:5000" rel="noopener noreferrer"&gt;http://localhost:5000&lt;/a&gt;&lt;/strong&gt;. You'll now see your &lt;code&gt;iris-classifier&lt;/code&gt; experiment with one run logged.&lt;/p&gt;

&lt;p&gt;Click the run to see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parameters tab&lt;/strong&gt; — &lt;code&gt;n_estimators&lt;/code&gt;, &lt;code&gt;max_depth&lt;/code&gt;, &lt;code&gt;random_state&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics tab&lt;/strong&gt; — &lt;code&gt;accuracy&lt;/code&gt;, &lt;code&gt;f1_score&lt;/code&gt; with a time-series chart&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artifacts tab&lt;/strong&gt; — the serialized model, ready to load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2pf6nnrqvn9hl1scms8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2pf6nnrqvn9hl1scms8.png" alt="MLflow UI showing metric tracking dashboard" width="800" height="438"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: MLflow tracking UI — parameters and metrics are visualized automatically per run&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  7. Step 5 — Compare multiple runs
&lt;/h2&gt;

&lt;p&gt;⏱ &lt;em&gt;5 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is where MLflow pays off. Run &lt;code&gt;train.py&lt;/code&gt; a few more times with different parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Edit N_ESTIMATORS and MAX_DEPTH in train.py between runs, then:&lt;/span&gt;
python train.py  &lt;span class="c"&gt;# run 2: n_estimators=50, max_depth=3&lt;/span&gt;
python train.py  &lt;span class="c"&gt;# run 3: n_estimators=200, max_depth=10&lt;/span&gt;
python train.py  &lt;span class="c"&gt;# run 4: n_estimators=10, max_depth=2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the MLflow UI, check the checkboxes next to multiple runs and click &lt;strong&gt;"Compare"&lt;/strong&gt;. You'll get a side-by-side table of every parameter and metric across all runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv0j8hom7z8kbuxoa8lu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv0j8hom7z8kbuxoa8lu.png" alt="MLflow run comparison table" width="800" height="608"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: Compare runs side-by-side — MLflow shows exactly which parameters produced the best results&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can now answer: &lt;em&gt;"Which configuration gave us the best result, and can we reproduce it?"&lt;/em&gt; Every run's parameters, metrics, and model artifact are tied to its run ID, so the answer is one lookup away.&lt;/p&gt;
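
&lt;p&gt;Reproducing the winner programmatically is one call. A minimal sketch; paste in the run ID of your best run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlflow.sklearn

# The run ID of the best run from the comparison view (placeholder value)
best_run_id = "a1b2c3d4e5f6"

# Load the exact model artifact that run logged under "random-forest-model"
model = mlflow.sklearn.load_model(f"runs:/{best_run_id}/random-forest-model")
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))  # one iris sample
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;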

&lt;blockquote&gt;
&lt;p&gt;🏆 &lt;strong&gt;Pro tip:&lt;/strong&gt; In the UI, click any metric column header to sort runs by that metric. The best run floats to the top instantly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  8. What to learn next
&lt;/h2&gt;

&lt;p&gt;Once you have basic tracking working, these are the natural next steps in order of complexity:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Registry&lt;/strong&gt; — register your best run's model, then promote it from "Staging" to "Production" with one click. Gives you a version-controlled model store with transition history.&lt;/p&gt;
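
&lt;p&gt;If you prefer code over the UI for that flow, registering a run's model is a single call (the model name here is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlflow

# Register the model logged by a run as a new version of "iris-classifier"
result = mlflow.register_model(
    "runs:/a1b2c3d4e5f6/random-forest-model",  # placeholder run ID
    "iris-classifier",
)
print(result.version)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;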

&lt;p&gt;&lt;strong&gt;Log more metrics&lt;/strong&gt; — use &lt;code&gt;mlflow.log_metric("loss", loss, step=epoch)&lt;/code&gt; inside your training loop to track metrics over time, not just at the end. The UI plots them automatically.&lt;/p&gt;
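
&lt;p&gt;A minimal sketch of what that looks like in a loop (the loss values are made-up placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlflow

mlflow.set_experiment("iris-classifier")
with mlflow.start_run():
    for epoch, loss in enumerate([0.92, 0.61, 0.44, 0.37, 0.33]):  # placeholder losses
        # Each call adds one point to the metric's curve in the UI
        mlflow.log_metric("loss", loss, step=epoch)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;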

&lt;p&gt;&lt;strong&gt;Serve your model&lt;/strong&gt; — run &lt;code&gt;mlflow models serve -m runs:/&amp;lt;RUN_ID&amp;gt;/random-forest-model --port 8080&lt;/code&gt; to expose your logged model as a REST API endpoint. No extra code needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remote tracking server&lt;/strong&gt; — instead of &lt;code&gt;mlflow ui&lt;/code&gt; on localhost, point your team at one shared PostgreSQL-backed server: &lt;code&gt;mlflow server --backend-store-uri postgresql://...&lt;/code&gt;. Every engineer's runs go to the same place.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between MLflow and Weights &amp;amp; Biases?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MLflow is fully open-source and self-hostable — your data never leaves your infrastructure. W&amp;amp;B is cloud-first with a better UI and more advanced features (sweeps, reports), but costs money at scale. For teams that need data sovereignty or are cost-sensitive, MLflow wins. See the &lt;a href="https://mlopslab.org/mlflow-vs-weights-biases-which-actually-saves-engineering-time/" rel="noopener noreferrer"&gt;full MLflow vs W&amp;amp;B comparison&lt;/a&gt; for a detailed breakdown.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can MLflow track deep learning training loops?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Use &lt;code&gt;mlflow.log_metric("loss", loss, step=epoch)&lt;/code&gt; inside your epoch loop and MLflow plots the full training curve. It also has autologging support for PyTorch Lightning, Keras, and Hugging Face — one line enables automatic logging of all metrics, params, and the final model.&lt;/p&gt;
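
&lt;p&gt;For reference, the one-line version looks like this (framework-specific variants such as &lt;code&gt;mlflow.sklearn.autolog()&lt;/code&gt; exist too):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlflow

# Call once before training; params, metrics, and the final model are logged automatically
mlflow.autolog()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;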

&lt;p&gt;&lt;strong&gt;What happens to my runs if I delete &lt;code&gt;mlruns/&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They're gone. For anything beyond local experimentation, set up a proper backend store (SQLite at minimum, PostgreSQL for teams) and an artifact store (S3, GCS, or Azure Blob). Then your runs survive machine restarts and are shareable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does MLflow work with open-source models like Llama or Mistral?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes — MLflow has a &lt;code&gt;mlflow.transformers&lt;/code&gt; flavor for Hugging Face models and supports custom Python function flavors for anything else. You can log any model as long as you can serialize it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does MLflow compare to ClearML?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both are strong open-source options. ClearML has a richer built-in UI and experiment orchestration features out of the box. MLflow has a larger ecosystem and better framework integrations. See the &lt;a href="https://mlopslab.org/mlflow-vs-clearml-which-open-source-mlops-tool-actually-wins-2026/" rel="noopener noreferrer"&gt;MLflow vs ClearML breakdown&lt;/a&gt; for a production-focused comparison.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;MLflow experiment tracking isn't optional once you're running more than a handful of experiments. The "I'll remember which config worked best" approach breaks fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The minimum viable setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pip install mlflow&lt;/code&gt; → &lt;code&gt;mlflow ui&lt;/code&gt; → &lt;code&gt;mlflow.log_param()&lt;/code&gt; + &lt;code&gt;mlflow.log_metric()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination gives you full reproducibility with maybe 30 minutes of implementation work.&lt;/p&gt;

&lt;p&gt;Don't set up the perfect MLflow infrastructure before you ship. Start local, log everything, move to a shared server when you have a team. The habit of logging compounds.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔗 &lt;strong&gt;Next step:&lt;/strong&gt; Run the &lt;code&gt;train.py&lt;/code&gt; above → check your first run in the UI at &lt;code&gt;localhost:5000&lt;/code&gt;. That's the first 15 minutes. Everything else follows from having that first run visible.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Related articles on MLOpsLab
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/mlflow-vs-weights-biases-which-actually-saves-engineering-time/" rel="noopener noreferrer"&gt;MLflow vs Weights &amp;amp; Biases: Which Actually Saves Engineering Time?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/mlflow-vs-clearml-which-open-source-mlops-tool-actually-wins-2026/" rel="noopener noreferrer"&gt;MLflow vs ClearML: Which Open Source MLOps Tool Actually Wins (2026)?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/how-to-deploy-a-machine-learning-model-with-docker-and-mlflow-2026-tutorial/" rel="noopener noreferrer"&gt;How to Deploy a Machine Learning Model with Docker &amp;amp; MLflow (2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/llm-observability/" rel="noopener noreferrer"&gt;LLM Observability: The ML Engineer's Practical Guide (2026)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;MLflow Documentation. &lt;a href="https://mlflow.org/docs/latest/index.html" rel="noopener noreferrer"&gt;https://mlflow.org/docs/latest/index.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chen, A., et al. (2020). Developments in MLflow: A System to Accelerate the Machine Learning Lifecycle. DEEM Workshop, ACM SIGMOD. &lt;a href="https://doi.org/10.1145/3399579.3399867" rel="noopener noreferrer"&gt;https://doi.org/10.1145/3399579.3399867&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;scikit-learn Documentation. &lt;a href="https://scikit-learn.org/stable/" rel="noopener noreferrer"&gt;https://scikit-learn.org/stable/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Written by Ayub Shah — ML Engineering student, MLOps enthusiast. Testing every tool so you don't have to. No sponsors, no affiliate links.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;→ More at &lt;a href="https://mlopslab.org" rel="noopener noreferrer"&gt;mlopslab.org&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What is LLM Observability? The ML Engineer's Practical Guide (2026)</title>
      <dc:creator>Ayub Shah</dc:creator>
      <pubDate>Fri, 01 May 2026 17:49:00 +0000</pubDate>
      <link>https://forem.com/ayubshah014sys/what-is-llm-observability-the-ml-engineers-practical-guide-2026-1l4h</link>
      <guid>https://forem.com/ayubshah014sys/what-is-llm-observability-the-ml-engineers-practical-guide-2026-1l4h</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mlopslab.org/llm-observability/" rel="noopener noreferrer"&gt;mlopslab.org/llm-observability&lt;/a&gt; — updated weekly. 0 sponsors, 0 affiliate links.&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚡ Quick answer:&lt;/strong&gt; LLM observability is the practice of collecting metrics, traces, and logs from large language model applications to monitor behavior, catch failures, control costs, and improve output quality — in real time. Unlike traditional APM, it handles non-deterministic outputs, prompt/response pairs, token costs, hallucination rates, and multi-step agent chains that standard monitoring tools were never built for.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;LLM observability: the actual definition&lt;/li&gt;
&lt;li&gt;Why traditional APM fails for LLMs&lt;/li&gt;
&lt;li&gt;Why it matters in 2026&lt;/li&gt;
&lt;li&gt;The three pillars: metrics, traces, logs&lt;/li&gt;
&lt;li&gt;Key LLM observability metrics&lt;/li&gt;
&lt;li&gt;Best LLM observability tools (2026)&lt;/li&gt;
&lt;li&gt;How to implement it in Python — step by step&lt;/li&gt;
&lt;li&gt;RAG observability: what's different&lt;/li&gt;
&lt;li&gt;Common mistakes to avoid&lt;/li&gt;
&lt;li&gt;FAQ&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. LLM observability: the actual definition
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LLM observability&lt;/strong&gt; is the ability to understand what your large language model is doing, why it's doing it, and whether it's doing it well — while it's running in production.&lt;/p&gt;

&lt;p&gt;The formal definition: it's the process of instrumenting LLM applications to collect structured data (metrics, traces, logs) about inputs, outputs, latency, token usage, and downstream behavior — then making that data queryable and actionable.&lt;/p&gt;

&lt;p&gt;But here's the part most definitions skip: &lt;strong&gt;LLMs are non-deterministic&lt;/strong&gt;. The same prompt can produce different outputs. That single fact breaks every assumption traditional application monitoring was built on.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Note:&lt;/strong&gt; "Observability" comes from control theory — a system is observable if you can infer its internal state from its outputs. For LLMs, the "internal state" is opaque by design. Observability is how you compensate for that opacity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A complete LLM observability setup lets you answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why did this prompt return garbage output on Tuesday at 3pm?&lt;/li&gt;
&lt;li&gt;How many tokens did we burn last week, and on which features?&lt;/li&gt;
&lt;li&gt;Is our retrieval step actually finding relevant context, or just noise?&lt;/li&gt;
&lt;li&gt;Which user flows are generating the most hallucinations?&lt;/li&gt;
&lt;li&gt;Did our prompt change last Wednesday improve or hurt response quality?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without observability, you're guessing at all of the above.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why traditional APM fails for LLMs
&lt;/h2&gt;

&lt;p&gt;You might already have Datadog, New Relic, or Prometheus running. They're great tools. They will &lt;strong&gt;not&lt;/strong&gt; help you monitor an LLM application properly. Here's why:&lt;/p&gt;

&lt;h3&gt;
  
  
  Traditional APM vs LLM Observability
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Traditional APM&lt;/th&gt;
&lt;th&gt;LLM Observability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output nature&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deterministic — same input → same output&lt;/td&gt;
&lt;td&gt;Non-deterministic — same prompt → different outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Binary (HTTP 200 vs 500)&lt;/td&gt;
&lt;td&gt;Output can be grammatically correct but factually wrong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance definition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Speed + uptime&lt;/td&gt;
&lt;td&gt;Relevance, factual accuracy, coherence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not applicable&lt;/td&gt;
&lt;td&gt;First-class concern with dedicated metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tracing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fixed execution paths&lt;/td&gt;
&lt;td&gt;Spans across prompt → retrieval → generation → re-ranking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost tracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;td&gt;Token cost per request is critical (it's your AWS bill)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clear: stack traces, exceptions&lt;/td&gt;
&lt;td&gt;"Silent failures" — plausible-sounding wrong answers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most dangerous failure mode in LLM production is the &lt;strong&gt;silent failure&lt;/strong&gt;: the model returns a 200 OK with a confident, fluent, completely wrong answer. Your APM sees green. Your users are getting misinformation. You have no idea.&lt;/p&gt;

&lt;p&gt;That's the problem LLM observability is built to solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Why it matters in 2026
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. You're paying per token — and it adds up fast
&lt;/h3&gt;

&lt;p&gt;GPT-4o charges ~$5 per million input tokens. Claude Opus is $15. If you're running a RAG pipeline that sends 3,000-token prompts for every user query, and you have 10,000 daily active users, you're burning through tokens fast.&lt;/p&gt;

&lt;p&gt;Without observability, you have zero visibility into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which features are expensive&lt;/li&gt;
&lt;li&gt;Which prompts are bloated&lt;/li&gt;
&lt;li&gt;Which retrieval chunks are redundant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A 40% cost reduction is realistic&lt;/strong&gt; just from instrumenting your token usage and trimming waste.&lt;/p&gt;
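
&lt;p&gt;To make that concrete, a back-of-the-envelope check using the numbers from this section (the price, prompt size, and traffic figures are the assumptions above, not measurements):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough daily input-token cost for the RAG example above
PRICE_PER_1M_INPUT = 5.00      # USD, the ~$5 per 1M input tokens assumed above
TOKENS_PER_PROMPT = 3_000
DAILY_USERS = 10_000
QUERIES_PER_USER = 1           # assumption: one query per user per day

daily_tokens = TOKENS_PER_PROMPT * DAILY_USERS * QUERIES_PER_USER
daily_cost = daily_tokens / 1_000_000 * PRICE_PER_1M_INPUT

print(f"{daily_tokens:,} input tokens/day, ${daily_cost:,.0f}/day, ${daily_cost * 30:,.0f}/month")
# 30,000,000 input tokens/day, $150/day, $4,500/month, before any output tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;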

&lt;h3&gt;
  
  
  2. Hallucinations don't throw exceptions
&lt;/h3&gt;

&lt;p&gt;When a SQL query fails, you get an error. When an LLM confidently fabricates a legal clause, a medical dosage, or a product spec — you get a 200 OK.&lt;/p&gt;

&lt;p&gt;The only way to catch this is output evaluation: either automated (LLM-as-judge, assertion checks) or via user feedback signals — both of which require an observability layer to collect and route.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. LLM apps are increasingly multi-step
&lt;/h3&gt;

&lt;p&gt;A modern RAG agent might do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query rewriting → vector search → reranking → generation → post-processing → tool calls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any step can fail silently. Without distributed tracing across all those steps, you have no way to know which node in the chain is degrading your quality.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Tip:&lt;/strong&gt; If you're already logging prompts and responses to a database, you have the raw material for LLM observability. The difference is structure, aggregation, and making that data queryable — which is what proper tooling does.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. The three pillars: metrics, traces, logs
&lt;/h2&gt;

&lt;p&gt;LLM observability, like traditional observability, rests on three data types. But each has LLM-specific meaning:&lt;/p&gt;

&lt;h3&gt;
  
  
  📊 Metrics — aggregated numbers over time
&lt;/h3&gt;

&lt;p&gt;Latency percentiles, token consumption per day, error rates, hallucination rate, TTFT (time to first token), user thumbs-up/down ratio.&lt;/p&gt;

&lt;p&gt;These are your dashboards — the signals that tell you whether the system is healthy at a glance.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔍 Traces — the execution path of a single request
&lt;/h3&gt;

&lt;p&gt;A trace for an LLM request spans every step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input received → prompt constructed → retrieval triggered → chunks fetched → LLM called → response parsed → returned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traces tell you &lt;em&gt;where&lt;/em&gt; time and tokens were spent on a specific request and let you drill into failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  📋 Logs — raw structured records of events
&lt;/h3&gt;

&lt;p&gt;Every prompt sent, every response received, every retrieved chunk, every tool call. Logs are the ground truth — unsampled, timestamped, filterable.&lt;/p&gt;

&lt;p&gt;They're what you reach for during incident investigation when metrics tell you &lt;em&gt;something is wrong&lt;/em&gt; but not exactly what.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A mature LLM observability setup collects all three and links them:&lt;/strong&gt; a metric spike points you to a trace, a trace links to the logs of that specific exchange.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Warning:&lt;/strong&gt; Logging raw prompts and responses raises data privacy and compliance considerations. If users send PII, it ends up in your logs. Make sure you have a redaction or anonymization strategy before you log at full fidelity in production.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. Key LLM observability metrics
&lt;/h2&gt;

&lt;p&gt;These are the metrics that actually matter — not the generic list you'll find everywhere, but the ones that show up when something goes wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⏱️ Latency metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; (Time To First Token)&lt;/td&gt;
&lt;td&gt;Latency before streaming starts&lt;/td&gt;
&lt;td&gt;User-perceived speed — low TTFT feels fast even if total is high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;TPS&lt;/strong&gt; (Tokens Per Second)&lt;/td&gt;
&lt;td&gt;Generation speed&lt;/td&gt;
&lt;td&gt;Degrades under load — track p50, p95, p99&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;End-to-end latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total request time including retrieval + generation&lt;/td&gt;
&lt;td&gt;What SLAs are measured against&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
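
&lt;p&gt;TTFT is easy to measure yourself with a streaming call. A minimal sketch with the OpenAI Python SDK (the model and prompt are placeholders; chunk count is only a rough proxy for token count):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
ttft = None
chunks_seen = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain MLflow in two sentences."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    if ttft is None:
        ttft = time.perf_counter() - start   # time to first token
    chunks_seen += 1

total = time.perf_counter() - start
print(f"TTFT: {ttft:.2f}s, total: {total:.2f}s, {chunks_seen} content chunks")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;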

&lt;h3&gt;
  
  
  💸 Cost metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Input tokens/request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompt tokens per call&lt;/td&gt;
&lt;td&gt;Where cost bloat hides — long system prompts, noisy chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Input+output tokens × model price&lt;/td&gt;
&lt;td&gt;Unit economics for your feature&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Daily token burn rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total tokens across all requests&lt;/td&gt;
&lt;td&gt;Set alerts here — a loop bug shows up here before your bill does&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
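
&lt;p&gt;Cost per request falls straight out of the usage object the API already returns. A hedged sketch (the per-million prices in the dict are illustrative; check your provider's current rates):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

# Illustrative USD prices per 1M tokens; not authoritative, update from your provider
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def cost_of(response, model):
    """Dollar cost of one chat completion, computed from its token usage."""
    usage = response.usage
    price = PRICES[model]
    return (usage.prompt_tokens * price["input"]
            + usage.completion_tokens * price["output"]) / 1_000_000

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is MLflow?"}],
)
print(f"${cost_of(response, 'gpt-4o-mini'):.6f} for this request")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;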

&lt;h3&gt;
  
  
  🎯 Quality metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does answer stay grounded in retrieved context?&lt;/td&gt;
&lt;td&gt;Unfaithful answers are hallucinations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Relevance score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is the answer relevant to what was asked?&lt;/td&gt;
&lt;td&gt;Factually correct but wrong-topic answers still fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User feedback rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Thumbs up/down, ratings, correction events&lt;/td&gt;
&lt;td&gt;Highest-signal quality metric — direct from users&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Note:&lt;/strong&gt; Quality metrics are the hardest to collect automatically. Start with user feedback signals (explicit) and retry/abandon rate (implicit). Then layer in automated evaluation once you have a baseline.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Best LLM observability tools (2026)
&lt;/h2&gt;

&lt;p&gt;Honest breakdown. I've tested all of these. No affiliate links, no vendor bias.&lt;/p&gt;

&lt;h3&gt;
  
  
  🦜 Langfuse — &lt;em&gt;Best open-source default&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Self-hostable, developer-first LLM tracing. Best OSS option if you want full data control and a clean SDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-hostable via Docker (free)&lt;/li&gt;
&lt;li&gt;SDKs for Python, JS, LangChain, LlamaIndex&lt;/li&gt;
&lt;li&gt;Prompt management + version tracking&lt;/li&gt;
&lt;li&gt;Dataset + evaluation workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Most teams. Start here.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔥 Arize Phoenix — &lt;em&gt;Best for embedding analysis&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;ML observability platform with strong LLM support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenInference tracing standard&lt;/li&gt;
&lt;li&gt;Embedding drift &amp;amp; cluster visualization&lt;/li&gt;
&lt;li&gt;Built-in evals (hallucination, toxicity)&lt;/li&gt;
&lt;li&gt;Works fully offline / local&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams already using Arize for traditional ML monitoring.&lt;/p&gt;




&lt;h3&gt;
  
  
  ⚡ Helicone — &lt;em&gt;Fastest to set up&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Proxy-based approach — zero SDK changes. One header = instant logging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-line integration (proxy URL swap)&lt;/li&gt;
&lt;li&gt;Real-time cost dashboard&lt;/li&gt;
&lt;li&gt;Request caching (reduces cost)&lt;/li&gt;
&lt;li&gt;10k req/month free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Cost tracking, teams that want zero implementation overhead.&lt;/p&gt;
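
&lt;p&gt;The swap looks roughly like this. The base URL and header name are how I remember Helicone's proxy setup, so treat both as assumptions and verify against their current docs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from openai import OpenAI

# Assumed Helicone proxy endpoint and auth header; verify in Helicone's documentation
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer sk-helicone-..."},
)

# The rest of your code stays identical; requests now appear in the Helicone dashboard
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;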




&lt;h3&gt;
  
  
  🌊 W&amp;amp;B Weave — &lt;em&gt;Best if you're already on W&amp;amp;B&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Weights &amp;amp; Biases' LLM observability layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native W&amp;amp;B integration&lt;/li&gt;
&lt;li&gt;Automatic function tracing via decorator&lt;/li&gt;
&lt;li&gt;Evaluation pipelines built-in&lt;/li&gt;
&lt;li&gt;Free for individual use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams using W&amp;amp;B for experiment tracking.&lt;/p&gt;
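
&lt;p&gt;The decorator pattern looks roughly like this (a sketch; assumes the &lt;code&gt;weave&lt;/code&gt; package and a W&amp;amp;B account, and the project name is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import weave
from openai import OpenAI

weave.init("llm-observability-demo")  # placeholder project name

client = OpenAI()

@weave.op()
def answer(question):
    """Weave records this function's inputs, outputs, and the nested OpenAI call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

answer("What is MLflow used for?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;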




&lt;h3&gt;
  
  
  📡 OpenTelemetry — &lt;em&gt;Most flexible, most work&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Vendor-neutral observability standard. Build your own pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vendor-neutral (ship to any backend)&lt;/li&gt;
&lt;li&gt;OpenLLMetry SDK for LLM spans&lt;/li&gt;
&lt;li&gt;Works with Jaeger, Tempo, Datadog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise, multi-backend infrastructure.&lt;/p&gt;




&lt;h3&gt;
  
  
  🐕 Datadog LLM Observability — &lt;em&gt;Enterprise grade, enterprise price&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unified with existing Datadog APM&lt;/li&gt;
&lt;li&gt;Auto-instrumentation for OpenAI/Anthropic&lt;/li&gt;
&lt;li&gt;Cluster analysis for prompt patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Existing Datadog shops with budget.&lt;/p&gt;




&lt;h3&gt;
  
  
  Quick comparison table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Open Source&lt;/th&gt;
&lt;th&gt;Self-hostable&lt;/th&gt;
&lt;th&gt;RAG support&lt;/th&gt;
&lt;th&gt;Evals built-in&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Langfuse&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Most teams — best OSS default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Arize Phoenix&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Embedding analysis, ML teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Cost tracking, fastest setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;W&amp;amp;B Weave&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;W&amp;amp;B users, experiment correlation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenTelemetry&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Enterprise, multi-backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Datadog LLM Obs&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;Existing Datadog shops&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Recommendation:&lt;/strong&gt; Start with Langfuse. Open source, self-hostable with Docker in 5 minutes, clean Python SDK, covers 90% of what you need. Graduate to OpenTelemetry when you need unified tracing across complex multi-service infra.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  7. How to implement it in Python — step by step
&lt;/h2&gt;

&lt;p&gt;Enough theory. Here's how you actually do it. We'll use &lt;strong&gt;Langfuse&lt;/strong&gt; — the best open-source option — for the full flow from a simple LLM call to a RAG pipeline with spans, scores, and cost tracking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Set up Langfuse (self-hosted via Docker)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and start Langfuse locally&lt;/span&gt;
git clone https://github.com/langfuse/langfuse.git
&lt;span class="nb"&gt;cd &lt;/span&gt;langfuse
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Langfuse UI will be at http://localhost:3000&lt;/span&gt;
&lt;span class="c"&gt;# Create a project and grab your API keys&lt;/span&gt;

&lt;span class="c"&gt;# Install the Python SDK&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;langfuse openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Basic LLM call with full tracing
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Langfuse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;  &lt;span class="c1"&gt;# drop-in replacement
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Init — reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST from env
&lt;/span&gt;&lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Langfuse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;public_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pk-lf-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-lf-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# or https://cloud.langfuse.com
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This single import swap gives you automatic tracing
# of every OpenAI call: prompt, response, tokens, latency, cost
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful MLOps assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is MLflow used for?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="c1"&gt;# Optional: tag this trace for filtering in the UI
&lt;/span&gt;    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mlops-qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;u_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# All trace data is now visible in Langfuse UI — zero extra code needed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The import swap is the key. &lt;code&gt;from langfuse.openai import openai&lt;/code&gt; patches the OpenAI client and captures everything automatically: token counts, cost, latency, the full prompt and response.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Custom spans for multi-step pipelines
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Langfuse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;langfuse_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observe&lt;/span&gt;

&lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Langfuse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# @observe creates a span for this function automatically
&lt;/span&gt;&lt;span class="nd"&gt;@observe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Simulated vector store retrieval&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# In production: call your Chroma / Pinecone / Weaviate here
&lt;/span&gt;    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow is an open source platform for ML lifecycle management...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow Tracking logs parameters, metrics, and artifacts...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# Log retrieval metadata to the span
&lt;/span&gt;    &lt;span class="n"&gt;langfuse_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_current_observation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;

&lt;span class="nd"&gt;@observe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Assemble the final prompt from query + retrieved context&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer using only the context below.

Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@observe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# The root trace — wraps the whole pipeline
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rag_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: retrieve — traced as a child span
&lt;/span&gt;    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: build prompt — traced as a child span
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: generate — traced via patched OpenAI client
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: score the output quality (0-1 scale)
&lt;/span&gt;    &lt;span class="n"&gt;langfuse_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score_current_trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer_quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# replace with your eval logic
&lt;/span&gt;        &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Auto-scored: retrieval found relevant chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;

&lt;span class="c1"&gt;# Run it
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rag_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is MLflow used for?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Flush traces before script exits
&lt;/span&gt;&lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Automated quality scoring (LLM-as-judge)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;raw_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# unpatched — don't trace the judge calls
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_faithfulness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    LLM-as-judge: score whether the answer is faithful to the retrieved context.
    Returns a score from 0.0 (hallucination) to 1.0 (fully grounded).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;judge_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are evaluating an AI assistant&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s answer for faithfulness.

RETRIEVED CONTEXT:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

QUESTION: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

ANSWER: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Task: Score whether the answer is ONLY based on the retrieved context (not hallucinated).
Respond with JSON only: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 0.0-1.0, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brief explanation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}
0.0 = completely hallucinated | 1.0 = fully grounded in context&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;judge_prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Post the score back to Langfuse for any trace
&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_faithfulness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-trace-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# from langfuse_context.get_current_trace_id()
&lt;/span&gt;    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;faithfulness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Capture user feedback signals
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Langfuse&lt;/span&gt;

&lt;span class="n"&gt;langfuse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Langfuse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_user_feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thumbs_up&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Record user feedback against the trace that generated the response&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;langfuse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;trace_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_feedback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;thumbs_up&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;comment&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Example: in your FastAPI endpoint
# @app.post("/feedback")
# async def feedback(trace_id: str, positive: bool, comment: str = None):
#     handle_user_feedback(trace_id, positive, comment)
#     return {"status": "recorded"}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this implementation, your Langfuse dashboard shows: every trace, constituent spans (retrieval → prompt build → generation), token counts, latency by step, faithfulness scores, and user feedback — all correlated.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Pro tip:&lt;/strong&gt; Get the current &lt;code&gt;trace_id&lt;/code&gt; inside any &lt;code&gt;@observe&lt;/code&gt;-decorated function with &lt;code&gt;langfuse_context.get_current_trace_id()&lt;/code&gt;. Store this in your response payload so you can link user feedback back to the exact trace.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  8. RAG observability: what's different
&lt;/h2&gt;

&lt;p&gt;RAG pipelines have unique failure modes that generic LLM observability doesn't capture.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG-specific metrics to track
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Good range&lt;/th&gt;
&lt;th&gt;Bad signal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context precision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Are retrieved chunks actually relevant?&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.8&lt;/td&gt;
&lt;td&gt;Low → noisy retrieval, poor embedding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context recall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Did retrieval find all needed chunks?&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.75&lt;/td&gt;
&lt;td&gt;Low → answer is incomplete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Faithfulness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Is the answer grounded in context?&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.85&lt;/td&gt;
&lt;td&gt;Low → hallucination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Answer relevance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Does the answer address what was asked?&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.8&lt;/td&gt;
&lt;td&gt;Low → model answering wrong question&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time spent in vector search&lt;/td&gt;
&lt;td&gt;&amp;lt; 200ms&lt;/td&gt;
&lt;td&gt;High → index needs optimization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chunk token count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Avg tokens per retrieved chunk&lt;/td&gt;
&lt;td&gt;200–600&lt;/td&gt;
&lt;td&gt;Too high → inflated cost, diluted signal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The RAG failure nobody talks about: context stuffing
&lt;/h3&gt;

&lt;p&gt;The most common undetected RAG failure: retrieval returns chunks that look semantically similar to the query but &lt;strong&gt;don't contain the actual answer&lt;/strong&gt;. The model then either hallucinates or returns a plausible-sounding non-answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context precision catches this.&lt;/strong&gt; Track it per query, and set an alert if it drops below 0.6 for more than 5% of requests.&lt;/p&gt;
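
&lt;p&gt;A minimal sketch of that alert, assuming you already compute a per-query context precision score somewhere upstream (the window size and the &lt;code&gt;alert&lt;/code&gt; sink are placeholders; the 0.6 and 5% thresholds are the ones above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import deque

WINDOW = 1_000            # most recent requests to evaluate
THRESHOLD = 0.6           # per-query context precision floor
MAX_BAD_FRACTION = 0.05   # alert if more than 5% of the window falls below the floor

recent_scores = deque(maxlen=WINDOW)

def alert(message):
    # Placeholder: wire this to Slack, PagerDuty, or whatever you already use
    print("ALERT:", message)

def record_context_precision(score):
    """Call once per request with that request's context precision score."""
    recent_scores.append(score)
    bad = sum(1 for s in recent_scores if s &amp;lt; THRESHOLD)
    if len(recent_scores) == WINDOW and bad / WINDOW &amp;gt; MAX_BAD_FRACTION:
        alert(f"Context precision degraded: {bad}/{WINDOW} requests below {THRESHOLD}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;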

&lt;h3&gt;
  
  
  Measuring RAG quality with RAGAS
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ragas.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;answer_relevancy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context_precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context_recall&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;

&lt;span class="c1"&gt;# Collect your RAG pipeline outputs
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is MLflow used for?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow is used for experiment tracking...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contexts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow is an open source platform...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow Tracking logs...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ground_truth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLflow manages the ML lifecycle including tracking...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run RAGAS evaluation — gives you all 4 RAG metrics at once
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;faithfulness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer_relevancy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_precision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_recall&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.94, 'context_recall': 0.81}
&lt;/span&gt;
&lt;span class="c1"&gt;# Then post these scores to Langfuse for the corresponding trace
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  9. Common mistakes to avoid
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Logging everything with no retention policy
&lt;/h3&gt;

&lt;p&gt;Storing every raw prompt and response forever will balloon your storage costs. Set a 30–90 day retention window. Sample high-volume low-value traces (e.g., 1 in 10 for healthy routine calls), and keep 100% of error traces and scored traces.&lt;/p&gt;
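&lt;p&gt;A minimal sketch of that policy, assuming a plain trace dict; the &lt;code&gt;should_keep_trace&lt;/code&gt; helper and the field names are hypothetical, so wire it to whatever your tracing SDK actually exposes:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import random

KEEP_ONE_IN = 10      # sample 1 in 10 healthy routine traces
RETENTION_DAYS = 90   # hard-delete raw prompts/responses after this window

def should_keep_trace(trace: dict):
    # Always keep traces that errored or that carry a quality score
    if trace.get("error") or trace.get("scores"):
        return True
    # Sample the rest: roughly 1 in KEEP_ONE_IN survives
    return random.randrange(KEEP_ONE_IN) == 0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;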

&lt;h3&gt;
  
  
  ❌ Treating latency as the only quality signal
&lt;/h3&gt;

&lt;p&gt;Fast bad answers are worse than slow good ones. Build quality metrics from day one — even if it's just a user thumbs-up/down. Don't let "it's fast" become your proxy for "it's working."&lt;/p&gt;
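&lt;p&gt;A minimal sketch of wiring a thumbs-up/down into your traces, assuming the Langfuse Python SDK's scoring call (the method is &lt;code&gt;score()&lt;/code&gt; in SDK v2 and &lt;code&gt;create_score()&lt;/code&gt; in v3, so check the docs for your version):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from env

def record_user_feedback(trace_id: str, thumbs_up: bool):
    # Attach a 0/1 quality score to the trace that produced the answer
    langfuse.score(
        trace_id=trace_id,
        name="user-feedback",
        value=1 if thumbs_up else 0,
    )
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;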

&lt;h3&gt;
  
  
  ❌ Adding observability as an afterthought
&lt;/h3&gt;

&lt;p&gt;If you retrofit tracing into a production system with no span structure, you'll get a flat blob of logs with no actionable signal. Instrument at the architecture level — define your spans (retrieval, generation, eval) from the first prototype.&lt;/p&gt;
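&lt;p&gt;A sketch of what "instrument at the architecture level" can look like with Langfuse's observe decorator (the import path differs between SDK versions; the retriever and LLM client below are hypothetical placeholders):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;from langfuse.decorators import observe   # SDK v3: from langfuse import observe

@observe()  # child span, named after the function
def retrieve(query: str):
    return vector_store.search(query, k=5)    # hypothetical retriever

@observe()  # child span for the generation step
def generate(query: str, contexts: list):
    return llm.complete(query, contexts)      # hypothetical LLM client

@observe()  # root trace: one span per pipeline stage, nested automatically
def answer(query: str):
    contexts = retrieve(query)
    return generate(query, contexts)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;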

&lt;h3&gt;
  
  
  ❌ Not separating judge calls from production traces
&lt;/h3&gt;

&lt;p&gt;If you're using an LLM to evaluate your LLM's outputs, those evaluation calls &lt;strong&gt;must&lt;/strong&gt; use an unpatched client. Otherwise you get recursive tracing (the judge's calls show up as production traffic), inflated token counts, and meaningless cost data.&lt;/p&gt;
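&lt;p&gt;With Langfuse's OpenAI integration, for example, production calls go through the patched import while the judge uses the plain client (a sketch; adapt to whichever tracer you run):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# Production path: patched client, every call becomes a trace
from langfuse.openai import openai as traced_openai

# Judge path: plain, unpatched client, so evaluation calls never show up
# as production traffic or inflate token and cost numbers
from openai import OpenAI

judge_client = OpenAI()

def judge_faithfulness(answer: str, context: str):
    resp = judge_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Is this answer grounded in the context? Answer yes or no.\nContext: {context}\nAnswer: {answer}",
        }],
    )
    return resp.choices[0].message.content
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;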

&lt;h3&gt;
  
  
  ❌ Ignoring PII in logs
&lt;/h3&gt;

&lt;p&gt;Users will paste email addresses, names, and medical details into your LLM app. In production, run a PII redaction pass before traces are written to storage. If you handle EU users, this is not optional (GDPR).&lt;/p&gt;
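&lt;p&gt;A minimal regex-based redaction pass (a sketch; for anything serious use a dedicated PII detector such as Microsoft Presidio):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str):
    # Replace obvious identifiers before the trace is written to storage
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

sample = "Contact me at jane.doe@example.com or +1 415 555 0199"
print(redact_pii(sample))   # Contact me at [EMAIL] or [PHONE]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;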




&lt;h2&gt;
  
  
  10. FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's the difference between LLM monitoring and LLM observability?
&lt;/h3&gt;

&lt;p&gt;Monitoring tracks predefined metrics (latency, error rate) and alerts when they cross thresholds.&lt;/p&gt;

&lt;p&gt;Observability is broader — it's the ability to ask arbitrary questions about your system's behavior from its outputs, including things you didn't anticipate when you set up the system.&lt;/p&gt;

&lt;p&gt;In practice: &lt;strong&gt;monitoring tells you &lt;em&gt;something is wrong&lt;/em&gt;, observability helps you figure out &lt;em&gt;why and what&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Prometheus and Grafana for LLM observability?
&lt;/h3&gt;

&lt;p&gt;Yes, for system-level metrics (latency, throughput, error rate, token counts). Expose these via a &lt;code&gt;/metrics&lt;/code&gt; endpoint and scrape with Prometheus (a minimal sketch follows below).&lt;/p&gt;

&lt;p&gt;But you'll still need a purpose-built tool like Langfuse or Phoenix for prompt/response tracing, RAG-specific metrics, and quality evaluation. Prometheus doesn't understand the semantic content of LLM outputs.&lt;/p&gt;
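&lt;p&gt;A sketch of the system-metrics side with &lt;code&gt;prometheus_client&lt;/code&gt; (the metric names are made up; pick your own conventions):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;from prometheus_client import Counter, Histogram, start_http_server

LLM_REQUESTS = Counter("llm_requests_total", "LLM calls", ["model", "status"])
LLM_LATENCY = Histogram("llm_request_seconds", "LLM call latency", ["model"])
LLM_TOKENS = Counter("llm_tokens_total", "Tokens used", ["model", "direction"])

start_http_server(9100)   # exposes /metrics for Prometheus to scrape

# After each LLM call, record what happened:
LLM_REQUESTS.labels(model="gpt-4o-mini", status="ok").inc()
LLM_LATENCY.labels(model="gpt-4o-mini").observe(1.8)
LLM_TOKENS.labels(model="gpt-4o-mini", direction="prompt").inc(412)
LLM_TOKENS.labels(model="gpt-4o-mini", direction="completion").inc(186)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;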

&lt;h3&gt;
  
  
  How do you detect hallucinations automatically?
&lt;/h3&gt;

&lt;p&gt;Three main approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness scoring&lt;/strong&gt; — use an LLM judge to check if the answer is grounded in retrieved context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assertion checks&lt;/strong&gt; — programmatic rules for your domain (e.g., "answer must not contain dates before 2020")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic similarity&lt;/strong&gt; — compare answer embedding to context embedding; low similarity suggests "off-context" generation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are perfect. Start with LLM-as-judge faithfulness scoring combined with user feedback signals (approach 3 is sketched below).&lt;/p&gt;
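&lt;p&gt;A sketch of approach 3, assuming &lt;code&gt;sentence-transformers&lt;/code&gt; and a hypothetical cut-off around 0.5; tune the threshold on your own data:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def context_similarity(answer: str, contexts: list):
    # Cosine similarity between the answer and its retrieved context;
    # low similarity suggests the answer was generated "off context"
    answer_emb = model.encode(answer, convert_to_tensor=True)
    context_emb = model.encode(" ".join(contexts), convert_to_tensor=True)
    return float(util.cos_sim(answer_emb, context_emb))

score = context_similarity(
    "MLflow is used for experiment tracking.",
    ["MLflow is an open source platform for the ML lifecycle."],
)
print(score)   # e.g. 0.7; flag answers well below your chosen cut-off for review
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;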

&lt;h3&gt;
  
  
  Is LLM observability the same as MLOps?
&lt;/h3&gt;

&lt;p&gt;MLOps is the broader practice of operationalizing machine learning — including training pipelines, experiment tracking, model deployment, and monitoring.&lt;/p&gt;

&lt;p&gt;LLM observability is a specific subset focused on monitoring LLM-powered applications in production. It overlaps with MLOps but the tooling differs: token costs, prompt management, and output quality evaluation rather than model drift and retraining pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the cheapest way to start?
&lt;/h3&gt;

&lt;p&gt;Self-host Langfuse via Docker (free). Use the Python SDK with the OpenAI import swap (5 lines of code, shown below). You'll have full tracing, token tracking, and a queryable UI for $0.&lt;/p&gt;

&lt;p&gt;Your only cost is the server running Langfuse — a $5/month DigitalOcean droplet is enough for early-stage projects.&lt;/p&gt;
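&lt;p&gt;The "5 lines" in practice look roughly like this, assuming the Langfuse OpenAI drop-in (keys live in environment variables):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# Before: import openai
from langfuse.openai import openai   # drop-in replacement, same interface

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is MLflow used for?"}],
)
# The call above is now traced: latency, tokens, and cost land in Langfuse
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;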

&lt;h3&gt;
  
  
  Does LLM observability work with open-source models (Llama, Mistral)?
&lt;/h3&gt;

&lt;p&gt;Yes. Langfuse and Phoenix work with any model via their generic SDK (you manually log inputs/outputs). For models served via vLLM or Ollama with an OpenAI-compatible API, the OpenAI import swap works directly.&lt;/p&gt;

&lt;p&gt;Token cost tracking requires manual calculation, since open-source model servers don't report costs (a minimal example follows).&lt;/p&gt;
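&lt;p&gt;Manual cost tracking can be as simple as multiplying token counts by your own per-token rates (the rates below are placeholders reflecting your infra cost, not real prices):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;PROMPT_RATE_PER_1K = 0.0002       # placeholder: cost per 1K prompt tokens
COMPLETION_RATE_PER_1K = 0.0004   # placeholder: cost per 1K completion tokens

def estimate_cost(prompt_tokens: int, completion_tokens: int):
    return (
        prompt_tokens / 1000 * PROMPT_RATE_PER_1K
        + completion_tokens / 1000 * COMPLETION_RATE_PER_1K
    )

# vLLM/Ollama return token counts in the OpenAI-compatible usage field
usage = {"prompt_tokens": 412, "completion_tokens": 186}
print(estimate_cost(usage["prompt_tokens"], usage["completion_tokens"]))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;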




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;LLM observability isn't optional at production scale. The "it works in testing" mindset breaks fast when real users send unexpected inputs, when retrieval quality degrades silently, when a token-hungry prompt pattern starts inflating your inference bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stack to start with:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Langfuse&lt;/strong&gt; for tracing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAGAS&lt;/strong&gt; for RAG quality metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User feedback signals&lt;/strong&gt; for ground truth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination gives you 80% of what you need with maybe a day of implementation work.&lt;/p&gt;

&lt;p&gt;Don't build the perfect observability system before shipping. Instrument as you build. Add quality metrics when you have baseline data to compare against. The value compounds.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔗 &lt;strong&gt;Next step:&lt;/strong&gt; Set up Langfuse locally → instrument one LLM call → check the trace in the UI. That's the first 20 minutes. Everything else follows from having that first trace visible.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Related articles on MLOpsLab
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/ml-pipeline-tutorial-build-your-first-production-ml-pipeline-2026/" rel="noopener noreferrer"&gt;ML Pipeline Tutorial: Build Your First Production ML Pipeline (2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/model-drift-detection-tutorial-how-to-monitor-ml-models-in-production-2026/" rel="noopener noreferrer"&gt;Model Drift Detection: Monitor ML Models in Production (2026)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlopslab.org/mlops-roadmap-2026-how-to-become-an-ml-engineer-step-by-step/" rel="noopener noreferrer"&gt;MLOps Roadmap 2026: How to Become an ML Engineer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Dong, L., Lu, Q., &amp;amp; Zhu, L. (2024). AgentOps: Enabling Observability of LLM Agents. arXiv. &lt;a href="https://arxiv.org/abs/2411.05285" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2411.05285&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv. &lt;a href="https://arxiv.org/abs/2309.15217" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2309.15217&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Langfuse Documentation. &lt;a href="https://langfuse.com/docs" rel="noopener noreferrer"&gt;https://langfuse.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;OpenTelemetry Semantic Conventions for LLM systems. &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/specs/semconv/gen-ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Vesely, K., &amp;amp; Lewis, M. (2024). Real-Time Monitoring and Diagnostics of ML Pipelines. Journal of Systems and Software, 185, 111136.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Written by Ayub Shah — ML Engineering student, MLOps enthusiast. Testing every tool so you don't have to. No sponsors, no affiliate links.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;→ More at &lt;a href="https://mlopslab.org" rel="noopener noreferrer"&gt;mlopslab.org&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>mlops</category>
      <category>llm</category>
      <category>python</category>
    </item>
  </channel>
</rss>
