<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Raluca Crisan</title>
    <description>The latest articles on Forem by Raluca Crisan (@rraluca07).</description>
    <link>https://forem.com/rraluca07</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3923190%2Ff6ac5cbb-6592-4cc2-9548-02456fd8970c.png</url>
      <title>Forem: Raluca Crisan</title>
      <link>https://forem.com/rraluca07</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rraluca07"/>
    <language>en</language>
    <item>
      <title>Turning Observability into a Tunable Search Space</title>
      <dc:creator>Raluca Crisan</dc:creator>
      <pubDate>Sun, 10 May 2026 13:24:39 +0000</pubDate>
      <link>https://forem.com/rraluca07/turning-observability-into-a-search-space-5dhc</link>
      <guid>https://forem.com/rraluca07/turning-observability-into-a-search-space-5dhc</guid>
      <description>&lt;p&gt;In the Mlops world, people have long used DAGs/graphs or at least the consensus has been that best practice was to use them. With AI and agents, the types of graphs used for orchestration or instrumentation are more varied, but they share the same core idea: capturing and tracking the artifacts produced by a pipeline, along with their parent and child relationships. The reason for this is intuitive: a common pattern across data, ML, and agent pipelines is a sequence of steps that can be represented, stored, and discovered through a graph structure. Tracking, observing, and optimizing this sequence broadly supports monitoring, reproducibility, orchestration, backfills, retraining, and related workflows.&lt;br&gt;
&lt;a href="https://docs.etiq.ai/" rel="noopener noreferrer"&gt;Etiq&lt;/a&gt; is a tool that creates a lineage and captures artifacts (a bit like a DAG or similar graph) on executed code. The question I’m addressing is what impact would such a tool have on different types of coding agents and coding agents architectures given how it seems tailor made for these types of agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment Description
&lt;/h2&gt;

&lt;p&gt;This first quick project tries to incorporate AutoML-style tuning into a coding agent whose main task is to solve various data-science challenges. Everything but the Etiq tool is more or less vibe-coded, and the agent repo is not designed to be used; it is just there to illustrate a point and for some quick hypothesis testing. &lt;/p&gt;

&lt;p&gt;The main flow of the agent is: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is given a data-science-style task &lt;/li&gt;
&lt;li&gt;A code generator writes and runs a script to answer the task until it finds a configuration that runs &lt;/li&gt;
&lt;li&gt;Etiq tracks the artifacts and lineage (graph) of this baseline script, including the intermediate data objects, the model object, and the flow between them&lt;/li&gt;
&lt;li&gt;A small, controlled search space is derived from the script’s configuration and passed to &lt;a href="https://github.com/automl/smac3" rel="noopener noreferrer"&gt;SMAC&lt;/a&gt; (a known AutoML tuner) for optimization. &lt;/li&gt;
&lt;li&gt;Each SMAC trial reruns the same script with a different configuration, captures the resulting metric, and stores the full attempt record (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
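
&lt;p&gt;As a minimal sketch of those last two steps (assuming the SMAC3 2.x facade API; the script name, knob names, and the convention of printing the validation score to stdout are illustrative, not the actual agent repo):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
import subprocess
import tempfile

from ConfigSpace import Configuration, ConfigurationSpace
from smac import HyperparameterOptimizationFacade, Scenario

# Search space derived from the baseline script's literal call arguments;
# knob names here are hard-coded purely for illustration.
configspace = ConfigurationSpace({
    "n_estimators": (50, 500),              # integer range
    "learning_rate": (0.01, 0.3),           # float range
    "impute_strategy": ["mean", "median"],  # categorical
})

def run_trial(config: Configuration, seed: int = 0):
    """Rerun the baseline script with one configuration and return a loss."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(dict(config), f)
        config_path = f.name
    # Assumption: the generated script reads the config file and prints its
    # validation score as the last line of stdout.
    result = subprocess.run(
        ["python", "baseline_pipeline.py", "--config", config_path],
        capture_output=True, text=True, check=True,
    )
    score = float(result.stdout.strip().splitlines()[-1])
    return 1.0 - score  # SMAC minimizes, so invert a higher-is-better metric

scenario = Scenario(configspace, n_trials=50, deterministic=True)
smac = HyperparameterOptimizationFacade(scenario, run_trial)
incumbent = smac.optimize()  # best configuration found across the trials
&lt;/code&gt;&lt;/pre&gt;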

&lt;p&gt;The current system turns the DAG into a tunable space via a number of rules: only include executed nodes, block certain node classes, inspect only literal call arguments when deciding the type of tuning, drop low-impact knobs, and optionally add safe remove-node controls. But this is really a rather arbitrary part of the process. &lt;/p&gt;
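
&lt;p&gt;As a rough illustration of those rules (the node records below are hypothetical, not what Etiq actually returns):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical rule filter over executed lineage nodes; the real agent
# derives the node records from the Etiq lineage graph.
BLOCKED_CLASSES = {"DataLoad", "TargetEncode"}       # node classes never tuned
LOW_IMPACT = {"verbose", "n_jobs", "random_state"}   # knobs dropped as low impact

def derive_knobs(nodes):
    """Keep only executed nodes, allowed classes, and literal call arguments."""
    knobs = {}
    for node in nodes:
        if not node["executed"]:
            continue                      # only include executed nodes
        if node["node_class"] in BLOCKED_CLASSES:
            continue                      # block certain node classes
        for arg, value in node["call_args"].items():
            if arg in LOW_IMPACT:
                continue                  # drop low-impact knobs
            if isinstance(value, (int, float, str, bool)):
                # only literal arguments decide the type of tuning
                knobs[f"{node['name']}.{arg}"] = value
    return knobs

nodes = [
    {"name": "impute", "node_class": "Impute", "executed": True,
     "call_args": {"strategy": "mean"}},
    {"name": "train", "node_class": "GradientBoosting", "executed": True,
     "call_args": {"n_estimators": 200, "learning_rate": 0.1, "n_jobs": 4}},
]
print(derive_knobs(nodes))
# {'impute.strategy': 'mean', 'train.n_estimators': 200, 'train.learning_rate': 0.1}
&lt;/code&gt;&lt;/pre&gt;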

&lt;p&gt;Also, in this implementation, the lineage edges are not used to create dependency constraints between knobs. The current implementation is therefore better described as executed-node-to-source-control tuning than true DAG-topology optimization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwwbxjif04avnomddm2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwwbxjif04avnomddm2i.png" alt=" " width="319" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics and benchmark
&lt;/h2&gt;

&lt;p&gt;The hardest part here was finding what to compare against. The real question is whether this set-up helps with something: performance, time to best result, or cost in terms of LLM API calls. But what is a fair comparison point?&lt;/p&gt;

&lt;p&gt;The starting point is always an executable script produced by an LLM tasked with addressing the given data science problem. After this initial step, four different approaches were explored to see whether the impact of using an Etiq/DAG could be isolated. The tasks themselves were adapted from MLE-Bench: only five of them, and for structured data only. Well-performing solutions to the tasks are short scripts (no more than a few hundred lines of code each).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk4fkidmu6grdmpcig62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk4fkidmu6grdmpcig62.png" alt=" " width="665" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While it quickly became apparent that this type of comparison is fraught, some lessons were learned (by me).&lt;/p&gt;

&lt;p&gt;Because the No-DAG + SMAC approach also needed a tunable space, a kind of ad-hoc space was procured through a combination of AST parsing and rules. The implementation and the idea were quite half-baked, and although it ran on a few of the tasks, it was problematic. What SMAC truly optimizes in this instance is the model. When the whole-pipeline DAG pretense was dropped and SMAC only optimized the model, it all made a lot more sense. In both cases the DAG + SMAC approach outperformed the No-DAG + SMAC one, in the second instance because it optimized the data prep as well as the model (and, as we all know, data matters!). The difference was not too large, which is a trend and also makes sense given the small tasks/pipelines the comparison was run on. &lt;/p&gt;

&lt;p&gt;The harder but more interesting lesson (which made me think a bit more about the logic behind what I’m trying to do) was that the free LLM search usually outperforms everything else (or one cannot tell the difference). Again, when the LLM search was constrained by an ad-hoc, made-up search space using parsing and arbitrary rules (to make the comparison seem more ‘fair’), the no-DAG LLM search also failed or slightly underperformed the DAG version. But when the search was completely free, the LLM-only (no DAG) approach did outperform. &lt;/p&gt;

&lt;h3&gt;
  
  
  Why having a DAG can help, and in which instances
&lt;/h3&gt;

&lt;p&gt;When thinking about it a bit harder, this finding kind of made sense. &lt;br&gt;
There are a few potential benefits of the DAG + LLM search approach vs. the free LLM search approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower cost via localization (in theory the DAG acts as a kind of context compression, and I would emphasize ‘in theory’ here)&lt;/li&gt;
&lt;li&gt;Better search/higher overall performance &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost is a trickier story, but generally the second benefit only really shows up for context windows that are large enough. For smaller pipelines/scripts, like the ones produced to answer the benchmark used here, it really doesn’t matter; if anything, it makes things worse. At the very beginning, when the pipeline is still small, the model benefits from seeing the whole design end to end. A DAG can potentially start to become useful once the pipeline has stabilized into recognizable stages, is large enough, and/or most changes become local. At that point, the DAG could help because it externalizes structure that the no-DAG agent would otherwise have to rediscover from code again and again. Additionally, if the pipeline/codebase truly is too large for one context/attention window, then the DAG is an appropriate search optimization approach. &lt;/p&gt;

&lt;p&gt;Before concluding, it is worth making a quick detour to see if this DAG-based idea appears in other ‘nearby’ areas. &lt;/p&gt;

&lt;p&gt;First, semantic search seems to me the closest comparison to a DAG-based approach because both aim to avoid resending the full script on every iteration. However, one localizes context by similarity, while a DAG localizes context by explicit dependency structure. E.g. a DAG can show which pipeline stage feeds another, what is upstream or downstream, and which artifacts connect different components. In data-science and ML-type pipelines, the important code may matter because of execution order, dataflow, or artifact dependencies, not because it looks textually similar to the request.&lt;/p&gt;
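
&lt;p&gt;A small sketch of what localizing by dependency structure means in practice, over a toy lineage graph (node names are made up):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy lineage graph: localize context by dependency structure rather than
# by textual similarity. Node names are invented for illustration.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("load_raw", "clean"),
    ("clean", "join_customers"),
    ("join_customers", "feature_engineering"),
    ("feature_engineering", "train_model"),
    ("train_model", "evaluate"),
])

# If a change targets feature_engineering, the graph tells us exactly which
# stages feed it and which stages it feeds, regardless of how textually
# similar their code is to the request.
target = "feature_engineering"
print("upstream:", nx.ancestors(lineage, target))      # load_raw, clean, join_customers
print("downstream:", nx.descendants(lineage, target))  # train_model, evaluate
&lt;/code&gt;&lt;/pre&gt;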

&lt;p&gt;Second, looking at it from the coding agent angle, most coding agents do not natively extract a pipeline DAG and use it to guide local rewrites. Aider is the closest mainstream example, but it uses a repository graph rather than a true pipeline or dataflow DAG. And tools like Cline, Roo Code, and Sourcegraph Cody mainly rely on semantic search, AST/file analysis, and repository maps.&lt;br&gt;
DAG-like approaches may appear more often in context engines and MCP tools than in mainstream coding agents, but they are primarily based on static analysis, not runtime observation. Static-analysis tools usually parse files into AST-like structures and combine them into a repository-level index or graph, which can indeed be very useful for general coding. But for data-science pipelines, runtime DAGs are often more relevant, because failures and performance issues depend on the specific data and configuration used.&lt;br&gt;
I believe the reason we don’t really see these DAGs in practice is two-fold. One, they are extremely hard to produce reliably and then integrate in a useful manner. Two, and more importantly, their benefits only show up on either large codebases, where the main approach is so far semantics-based, or specific types of long-horizon agents, which don’t really show up that often in practice. In the next blogpost, I will try to explore whether setting up these observability DAGs as part of a long-horizon architecture itself improves performance (or doesn’t). &lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automl</category>
      <category>observability</category>
    </item>
    <item>
      <title>3 Levels of Observability for Coding Agents</title>
      <dc:creator>Raluca Crisan</dc:creator>
      <pubDate>Sun, 10 May 2026 11:45:43 +0000</pubDate>
      <link>https://forem.com/rraluca07/3-levels-of-observability-for-coding-agents-1ce3</link>
      <guid>https://forem.com/rraluca07/3-levels-of-observability-for-coding-agents-1ce3</guid>
      <description>&lt;p&gt;In this blogpost I am exploring how a framework that translates code into a graph fits within the observability stack. Intuitively, something that helps decompose a pipeline code into its corresponding elements should help - it should help coding agent with testing  &amp;amp; verification, optimized debugging, auditability. It should help support a host of coding agent architectures, especially with longer-term horizons. But this blogpost is trying to explore less of what it can do, and more of where this framework fits in.&lt;/p&gt;

&lt;p&gt;First, as a quick reminder: the framework I’m exploring - &lt;a href="https://docs.etiq.ai/" rel="noopener noreferrer"&gt;Etiq&lt;/a&gt; - maps your code and traces artifacts and their lineage deterministically and without manual instrumentation (a bit like extreme auto-logging), and it works for data and AI pipelines. It does so using a mix of static analysis and runtime execution.&lt;/p&gt;

&lt;p&gt;Second, the observability stack for agent-derived code is a bit hard to pin down, but it roughly fits into three buckets: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent ‘orchestration’ - state/memory store &lt;/li&gt;
&lt;li&gt;Telemetry &lt;/li&gt;
&lt;li&gt;Anything that helps you assess what actually happens in the code &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The diagram below represents a high-level view of a coding agent structure - or at least the main idea. In a coding agent, an orchestrator manages the task, asks the LLM what to do next, invokes tools, runs code in an isolated environment, and formats and checks the results before returning outputs. At a high level it uses something that can be described as a plan-act-verify loop, with complexity increasing depending on the agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5zm9zkaqyl61awyoi3r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5zm9zkaqyl61awyoi3r.png" alt=" " width="512" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Translated into our three buckets, we have the below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvss3cnrqlbvmcdpdsupi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvss3cnrqlbvmcdpdsupi.png" alt=" " width="512" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Light green = state / memory store &amp;amp; separately artifact store&lt;br&gt;
Light blue = OpenTelemetry&lt;br&gt;
Pink = QA &amp;amp; test/grading record&lt;/p&gt;

&lt;p&gt;The first bucket - light green on the diagram - helps provide the agent context. That context is essential for spotting potential issues, because it shows the shape of the run and what was intended, e.g. why a patch was made, whether the agent originally intended to modify one file before branching into a different fix, etc. This bucket provides what the system believed it was doing, plus the end artifact store: the final outputs produced by an end-to-end run.&lt;/p&gt;

&lt;p&gt;The second bucket, the light blue one, is the runtime execution capture via &lt;a href="https://opentelemetry.io/docs/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt;. This layer captures traces, metrics, and logs, which in a coding-agent system can include model and tool-call spans, subprocess execution, HTTP and database activity, timings, statuses, exit codes, service-to-service requests, and logs and metrics surrounding the run.&lt;/p&gt;

&lt;p&gt;Runtime telemetry provides evidence that does not depend on whether the agent was honest, accurate, or even aware of what happened. The process either ran or it did not; the HTTP request either happened or it did not. OpenTelemetry shows what the platform observed rather than what the agent claimed. It can answer questions such as whether the model call happened, whether the patch step executed, whether the script ran, if/where latency occurred, and which retry loop consumed most of the time. &lt;/p&gt;
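
&lt;p&gt;For reference, instrumenting one of those subprocess or tool-call spans with the OpenTelemetry Python SDK looks roughly like this (span and attribute names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal OpenTelemetry span around running a generated script; span and
# attribute names are illustrative, not from a specific agent implementation.
import subprocess

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("coding-agent")

with tracer.start_as_current_span("run_generated_script") as span:
    proc = subprocess.run(["python", "generated_pipeline.py"], capture_output=True)
    span.set_attribute("process.exit_code", proc.returncode)
    span.set_attribute("process.command", "python generated_pipeline.py")
&lt;/code&gt;&lt;/pre&gt;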

&lt;p&gt;The third bucket - the pink one - looks in more detail at what happens with the code that the agent produced in this run. It can cover code logic, unit tests, static analysis, and capturing vulnerabilities. And with the Etiq framework it can have in-depth observability on the executed code beyond OpenTelemetry. Let’s say this is an agent that creates workflows based on various data feeds. At some point it calls an LLM, but prior to this call it does 10 steps that are just data processing; once the LLM returns an answer, this gets joined up with another data source and the pipeline keeps going. The green bucket would provide us with the agent’s intention in writing this code and, hopefully, a coherent plan; the blue telemetry bucket would capture the API calls to the LLM and to the initial data sources and would associate the full code with them. But for the 10 interim steps, there is no way to log them in an observability framework short of instructing the agent itself to capture the artifacts and associate them with the appropriate functions. Semantic search does not have a direct link to the produced interim artifacts. And this is where a framework like Etiq comes in - one that is able to log granular interim artifact/function pairs and their lineage. &lt;/p&gt;
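
&lt;p&gt;To make the gap concrete: without a framework of this kind, capturing those 10 interim steps means the agent has to generate instrumentation itself, something like the hypothetical snippet below around every transformation it writes - which is exactly the probabilistic, easy-to-forget step we would like to avoid.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical manual instrumentation the agent would have to emit itself;
# names and paths are illustrative only.
import json
from pathlib import Path

ARTIFACT_DIR = Path("artifacts")
ARTIFACT_DIR.mkdir(exist_ok=True)

def log_interim(step_name, df, parents):
    """Persist an interim dataframe plus minimal lineage metadata."""
    df.to_parquet(ARTIFACT_DIR / f"{step_name}.parquet")
    meta = {"step": step_name, "parents": parents,
            "rows": len(df), "columns": list(df.columns)}
    (ARTIFACT_DIR / f"{step_name}.json").write_text(json.dumps(meta))

# The agent would have to remember to call this after each of the 10
# data-processing steps, e.g.:
#   cleaned = clean(raw)
#   log_interim("clean", cleaned, parents=["load_raw"])
&lt;/code&gt;&lt;/pre&gt;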

&lt;p&gt;In the case of a very simple example code generation agent with the following structure: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvulo6t9mdrc3eye0d4x9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvulo6t9mdrc3eye0d4x9.png" alt=" " width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The orchestration would capture details on each of the agent’s nodes (below, just for example purposes):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69jhk0scqnav8bkjng4j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69jhk0scqnav8bkjng4j.png" alt=" " width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The OpenTelemetry logging would capture information as per below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnsarolgr1kh1adb8il6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnsarolgr1kh1adb8il6.png" alt=" " width="800" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And Etiq would log the detail of what actually ran during the code execution for the given run:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzl0qzesyuwm71pyv73ld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzl0qzesyuwm71pyv73ld.png" alt=" " width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The information produced via the Etiq framework serves a few different purposes: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It captures interim artifact/function pairs, thus allowing verification, test harnesses, and checks on them - this enables the kind of granular testing that data and AI pipelines need (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;It optimizes debugging as it can point exactly to the function that is producing the wrong interim step&lt;/li&gt;
&lt;li&gt;It provides a level of auditability that OpenTelemetry, agent orchestration, or end-artifact capture cannot, as it traces the lineage of data through the pipeline &lt;/li&gt;
&lt;/ul&gt;
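
&lt;p&gt;For the first point, once interim artifact/function pairs are available as records, the checks on them can be ordinary assertions, roughly along these lines (the record structure and column names below are hypothetical, not Etiq’s actual schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical checks over a captured interim artifact/function pair.
import pandas as pd

def check_artifact(record):
    """Run simple granular checks on one interim artifact."""
    df = pd.read_parquet(record["artifact_path"])
    failures = []
    if df.empty:
        failures.append(f"{record['producer']}: produced an empty dataframe")
    missing = [c for c in record["expected_columns"] if c not in df.columns]
    if missing:
        failures.append(f"{record['producer']}: missing columns {missing}")
    if df["customer_id"].duplicated().any():
        failures.append(f"{record['producer']}: duplicate keys after join")
    return failures

record = {
    "producer": "join_customers",
    "artifact_path": "artifacts/join_customers.parquet",
    "expected_columns": ["customer_id", "order_total", "segment"],
}
print(check_artifact(record))
&lt;/code&gt;&lt;/pre&gt;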

&lt;p&gt;Fundamentally, it is great that we are able to observe what the system is trying to do and what it stores at the end as code or output artifacts, and it is equally important that we can capture the API calls and tool calls to the data sources, LLMs, the sandbox in which the code runs, etc. But there is currently a gap when it comes to observing the executed code the system produces. And the solution to this gap is an observability framework beyond what we currently have in the space: namely, a framework that can trace the interim artifacts produced by the code and their producer functions, and map their relationships, so they can be tested, debugged, and audited. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>agents</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>The observability gap for data science and analytics agents</title>
      <dc:creator>Raluca Crisan</dc:creator>
      <pubDate>Sun, 10 May 2026 11:05:31 +0000</pubDate>
      <link>https://forem.com/rraluca07/the-observability-gap-for-data-science-and-analytics-agents-3cnd</link>
      <guid>https://forem.com/rraluca07/the-observability-gap-for-data-science-and-analytics-agents-3cnd</guid>
      <description>&lt;p&gt;Databricks and similar enterprise data platforms have spent a great deal of effort and time to full-proof their product suite with relevant observability and tracing. Not surprisingly this is needed as part of enterprise support especially in regulated sectors. But for the specific case of sophisticated data science and analytics agents there is a gap in the observability suite not just for Databricks but across all big and small analytics and data science agent providers.&lt;/p&gt;

&lt;p&gt;In the case of Databricks, even with notebooks as a primary user interface, given the offerings across data lineage, data management, and MLflow, the level of control and tracing is no doubt high. However, both large vendors like Databricks and Snowflake and smaller analytics and data science agent suppliers share an observability gap. The gap is inherent to coding agent architectures and does not apply equally to all agents. A text-to-SQL assistant can be wrong in an ‘obvious’ way: the result makes no sense. A multi-step Python or Spark pipeline produced by an agent is different. Even when written by a human, it’s hard to unpick pipeline logic given the endless combinations of joins, data issues, and data characteristics. This problem doesn’t go away when an agent is involved. E.g. Genie can plan a solution, run code, use cell outputs to improve results, and fix errors automatically. The question is what, beyond the initial reasoning and the final artifact, can be inspected in this instance, and what can be reliably (not probabilistically) logged. &lt;/p&gt;

&lt;p&gt;To achieve their objectives, these more sophisticated data science and analytics agents need to create relatively complex multi-step pipelines. Past the initial data retrieval and the final storage step, the pipelines themselves are just arbitrary code. Observability for these types of scripts, when they are human-written, is covered by a whole area of companies in the MLOps space, including Databricks’ own MLflow. But it is unclear what observability is out there when this code is produced by agents - short of asking the agent itself to instrument the code (probabilistically), thus somewhat defeating the purpose of observability in the first place. &lt;/p&gt;

&lt;p&gt;Now that we’ve narrowed the observability gap from the bigger data platform context to a specific area - the ‘executed pipeline code’ part of these more sophisticated analytics and data science agent workflows - my first question was whether MLflow or a different ‘off-the-shelf’ tool in the ecosystem can fill this gap directly. For why OpenTelemetry is not enough here, please see the previous blogpost.&lt;/p&gt;

&lt;p&gt;Unsurprisingly, MLflow is heading in the direction of more granular instrumentation with the least amount of effort - on anyone’s part, human or agent. For classic ML, a single mlflow.autolog() call can automatically capture params, metrics, models, datasets, and artifacts around supported training APIs, while for GenAI and agent workflows, one-line tracing primitives like @mlflow.trace, mlflow.trace(...), and mlflow.start_span() add function- and block-level visibility, including parent-child relationships, inputs, outputs, exceptions, and execution time. &lt;/p&gt;
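
&lt;p&gt;Both styles amount to a line or two of code, roughly as below (the function and span names are just examples):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# The two instrumentation styles mentioned above, shown side by side.
import mlflow

# Classic ML: autolog captures params, metrics, models and datasets
# around supported training APIs (e.g. scikit-learn fit calls).
mlflow.autolog()

# GenAI/agent workflows: function- and block-level tracing.
@mlflow.trace
def summarise(document):
    # inputs, outputs, exceptions and timing are recorded on the span
    return document[:200]

with mlflow.start_span(name="retrieval") as span:
    span.set_inputs({"query": "quarterly revenue"})
    docs = ["doc-1", "doc-2"]
    span.set_outputs({"documents": docs})
&lt;/code&gt;&lt;/pre&gt;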

&lt;p&gt;My initial experiments with instrumenting agent-created code with MLflow deterministically have allowed me to track the models as experiments, which was a good step in the right direction 👍, but of course I cannot track data transformations - with MLflow or with anything else that I’m familiar with. &lt;br&gt;
Tracking with autolog was the better option for me - rather than the tracing functions - because I’m not really tracking the agent; I’m trying to track what’s happening in the code produced by the agent when it runs. Below is some example basic tracking:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5x6sk8daei5jef0jrng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5x6sk8daei5jef0jrng.png" alt=" " width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The gap is of course tracking what actually happens inside the pipeline outside the model itself: all the data operations for which no observability is present. While the code itself is the best evidence in other use cases, for pipeline-type structures where the outcomes are heavily influenced by the particulars of the data, the code is not enough; observability on both the code and its runtime execution is needed. For these data science and analytics agents, the code they produce (outside the model itself) is currently a black box. An example table of interim artifacts is below (made using &lt;a href="https://docs.etiq.ai/" rel="noopener noreferrer"&gt;Etiq&lt;/a&gt;), which at the moment tooling like MLflow does not capture for agent-written code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69bbs07fybmr2c7vco5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69bbs07fybmr2c7vco5q.png" alt=" " width="512" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this space we were brainwashed to believe that observability matters at all costs; however, in this instance, given the perception of coding agents in the market, I feel an argument might have to be made for why it really matters. &lt;br&gt;
First, it’s about auditability. Truly, not everyone cares about this and not everyone should. But in regulated sectors like finance or healthcare it matters. For model validation in, e.g., finance, the type of data lineage documentation required involves more than what gets stored in Unity Catalog, Delta Lake, or MLflow model tracking - all useful components. This type of use case needs to reflect the transformations that happen in the code itself once executed, and teams currently do this manually. At the moment, the use of semiautonomous coding agents for these use cases is minimal, but this is not where the enterprise stack is going.&lt;/p&gt;

&lt;p&gt;Second, observability for these more sophisticated agents also touches other related risks, such as reproducibility, error propagation across longer pipelines, and general control issues for agent-generated code. &lt;br&gt;
Without observability, it is harder to track ‘semantic mistakes’ the agent might make, such as not using the correct metric definition, or applying the analysis or model to the wrong population. A bad transformation early in the pipeline affects everything downstream. I’m not sure exactly what level of observability is needed to mitigate the potential issues, but without any we would certainly struggle. &lt;/p&gt;

&lt;p&gt;Reproducibility is another area that does require some level of observability: if transformation execution is not observable, the final notebook may not be a faithful record of the run that produced the result. Similarly, we would struggle to compare agent runs over time (or rather without observability we would struggle more).&lt;/p&gt;

&lt;p&gt;The key argument for in-depth observability on agent-generated code is enterprise-level control, especially for regulated sectors. Usage of these sophisticated data science and analytics agents in regulated sectors might be small to begin with, relative to the size of the overall data platform offering. However, as Databricks and large enterprise data platforms feel the pressure from coding agents and foundational models, there just aren’t that many avenues left to go into. If Databricks’ long-term position is around providing the governed system in which semiautonomous enterprise agents can actually run, then any observability gap will prove problematic. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>agents</category>
      <category>databricks</category>
    </item>
  </channel>
</rss>
