<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: David Campbell</title>
    <description>The latest articles on Forem by David Campbell (@davidcampbelldc).</description>
    <link>https://forem.com/davidcampbelldc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3889415%2Fa2d1bb05-72e2-4c31-9f05-14008ba8b9b0.png</url>
      <title>Forem: David Campbell</title>
      <link>https://forem.com/davidcampbelldc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/davidcampbelldc"/>
    <language>en</language>
    <item>
      <title>From HAR File to Running Load Test in 60 Seconds With AI</title>
      <dc:creator>David Campbell</dc:creator>
      <pubDate>Tue, 21 Apr 2026 19:02:04 +0000</pubDate>
      <link>https://forem.com/davidcampbelldc/from-har-file-to-running-load-test-in-60-seconds-with-ai-283j</link>
      <guid>https://forem.com/davidcampbelldc/from-har-file-to-running-load-test-in-60-seconds-with-ai-283j</guid>
      <description>&lt;p&gt;The traditional workflow for creating a performance test goes like this. Record a user journey. Import it into your load testing tool. Run it. Watch it fail. Then spend hours doing detective work: tracing dynamic values through request-response chains, writing regex extractors, debugging why the extractor captured the wrong thing, running again, and repeating until the script works.&lt;/p&gt;

&lt;p&gt;For a typical enterprise application, that process takes 5-25 hours. Per script. And the script breaks the next time the application changes a token format.&lt;/p&gt;

&lt;p&gt;AI replaces this workflow. Not incrementally – structurally. Here is what it actually looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Record a HAR File (2 minutes)
&lt;/h2&gt;

&lt;p&gt;Open your browser's dev tools (F12), switch to the Network tab, and navigate through the user journey you want to test. Log in, navigate, perform the business action, verify the result, log out. Save the recording as a HAR file. Done.&lt;/p&gt;

&lt;p&gt;A couple of recording tips that matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Firefox over Chrome.&lt;/strong&gt; Chrome's recent versions default to a sanitised export that strips cookies and auth tokens. It can also drop large response bodies - meaning dynamic values that need correlating may not appear in the file. Firefox produces more complete captures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For complex enterprise apps, use a proxy recorder&lt;/strong&gt; (Fiddler, mitmproxy). A proxy captures complete request-response pairs without browser export quirks. Nothing gets dropped, nothing gets sanitised.&lt;/p&gt;

&lt;p&gt;HAR files are JSON - structured, parseable, machine-readable. An AI agent can reason about the traffic without format translation. This is a meaningful advantage over tool-specific recording formats like JMX or LoadRunner scripts.&lt;/p&gt;
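
&lt;p&gt;Because a HAR file is ordinary JSON, you can inspect one with a few lines of Python. A minimal sketch (the file name &lt;code&gt;recording.har&lt;/code&gt; is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

with open("recording.har", encoding="utf-8") as f:
    har = json.load(f)

# Every captured request/response pair lives under log.entries in HAR 1.2
for entry in har["log"]["entries"]:
    request, response = entry["request"], entry["response"]
    print(response["status"], request["method"], request["url"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;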

&lt;h2&gt;
  
  
  Step 2: Import and Let the Pipeline Work (30-90 seconds)
&lt;/h2&gt;

&lt;p&gt;When you import a HAR file into an AI-powered pipeline, several things happen automatically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Noise filtering.&lt;/strong&gt; A typical browser recording contains 200-500 entries. Most are CDN requests, analytics, fonts, and third-party widgets. The pipeline classifies each domain and filters irrelevant traffic. The 40-80 actual API calls and page loads are what matter.&lt;/p&gt;
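
&lt;p&gt;To make that concrete, here is a rough sketch of the filtering idea in Python. The allow-list and file name are illustrative, not taken from any particular tool:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from urllib.parse import urlparse

FIRST_PARTY = {"app.example.com", "api.example.com"}      # the domains under test
STATIC_EXTENSIONS = (".css", ".js", ".png", ".woff2", ".svg", ".ico")

with open("recording.har", encoding="utf-8") as f:
    entries = json.load(f)["log"]["entries"]

relevant = []
for entry in entries:
    url = urlparse(entry["request"]["url"])
    # Keep first-party API calls and page loads; drop CDN assets, analytics, fonts
    if url.hostname in FIRST_PARTY and not url.path.endswith(STATIC_EXTENSIONS):
        relevant.append(entry)

print(f"kept {len(relevant)} of {len(entries)} entries")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;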

&lt;p&gt;&lt;strong&gt;Assertion generation.&lt;/strong&gt; Before any correlation work begins, the pipeline analyses responses and generates validation rules. Status code checks, content-type validation, and - most valuably - soft failure detection. A 200 response containing &lt;code&gt;"success": false&lt;/code&gt; in the body. A login form on a page that should show a dashboard. These silent failures are invisible to HTTP status codes. The AI catches them before testing begins.&lt;/p&gt;
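
&lt;p&gt;A toy version of that soft-failure check, to show the shape of the rule rather than any product's actual implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

def is_soft_failure(status, body_text):
    """Flags a 200 whose body signals an application-level error."""
    if status != 200:
        return False
    try:
        body = json.loads(body_text)
    except ValueError:
        # Non-JSON body: fall back to a crude page-content marker (illustrative only)
        return "Sign in to continue" in body_text
    return body.get("success") is False or "error" in body

print(is_soft_failure(200, '{"success": false, "error": "quota exceeded"}'))   # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;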

&lt;p&gt;&lt;strong&gt;Observation.&lt;/strong&gt; Every value in every response is compared against every subsequent request. The scanner flags values that change and records where each one first appeared. It assigns uniqueness scores: a 36-character UUID scores high (safe to replace globally), a user ID of "1" scores low (needs boundary-aware replacement to avoid corrupting unrelated data).&lt;/p&gt;
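
&lt;p&gt;In spirit, the observation pass looks something like the sketch below. The token pattern and the uniqueness formula are crude stand-ins for the real scoring, which is not public:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import re

TOKEN_RE = re.compile(r"[A-Za-z0-9_\-]{16,}")    # crude stand-in for real value detection

def find_candidates(entries):
    candidates = []
    for i, entry in enumerate(entries):
        body = (entry["response"].get("content") or {}).get("text") or ""
        for value in set(TOKEN_RE.findall(body)):
            for later in entries[i + 1:]:
                # Does a value from this response reappear in a later request?
                if value in json.dumps(later["request"]):
                    candidates.append({"value": value,
                                       "first_seen": i,
                                       "score": min(1.0, len(set(value)) / 20)})
                    break
    return candidates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;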

&lt;p&gt;&lt;strong&gt;Decision and extraction.&lt;/strong&gt; AI agents evaluate each candidate. For JSON responses, a JSONPath extractor is preferred – precise, readable, resistant to formatting changes. For HTML or text, a regex specialist builds expressions tuned for the specific regex engine of the target tool (JMeter's ORO engine has syntax quirks a generic regex will trip over). Each extraction rule is validated against the recorded data before insertion.&lt;/p&gt;
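
&lt;p&gt;Roughly, the decision looks like this sketch - JSONPath for JSON, a boundary regex built from the literal text around the recorded value for everything else. The helper name is hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

def choose_extractor(content_type, body_text, value):
    if "json" in content_type:
        # Prefer a JSONPath-style extractor for JSON responses
        return {"type": "jsonpath", "expression": json_path_to(body_text, value)}   # hypothetical helper
    # Otherwise build a regex from the literal context around the recorded value,
    # so it captures whatever value appears there on the next run
    start = body_text.index(value)
    left = re.escape(body_text[max(0, start - 20):start])
    right = re.escape(body_text[start + len(value):start + len(value) + 10])
    return {"type": "regex", "expression": left + "(.+?)" + right}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;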

&lt;p&gt;&lt;strong&gt;Replacement.&lt;/strong&gt; Hardcoded values get swapped for variable references. High-uniqueness values get global replacement. Low-uniqueness values get boundary-aware replacement - the pipeline checks each occurrence in context before committing.&lt;/p&gt;
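
&lt;p&gt;The difference between the two replacement modes is easiest to see in code. A minimal sketch, using JMeter-style &lt;code&gt;${...}&lt;/code&gt; variable references:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

def substitute(request_text, value, variable_ref, unique):
    if unique:
        # High-uniqueness values (UUIDs, long tokens) are safe to replace everywhere
        return request_text.replace(value, variable_ref)
    # Low-uniqueness values are only replaced where they stand alone as a token
    return re.sub(r"\b" + re.escape(value) + r"\b", variable_ref, request_text)

print(substitute("/users/1/orders?page=10", "1", "${user_id}", unique=False))
# /users/${user_id}/orders?page=10  -- the "1" inside "10" is left alone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;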

&lt;p&gt;&lt;strong&gt;Proof.&lt;/strong&gt; The test runs. Every extraction and substitution is checked against actual server responses. Failures are classified: was the observation wrong, the decision wrong, or did the application change? Each classification triggers a different repair path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safety net.&lt;/strong&gt; A final scan catches anything the main loop missed: NOT_FOUND patterns, dynamic parameters outside the original candidate list, new tokens that appeared during the run. A QA agent validates across six categories: configuration, assertions, correlation, data and variables, scripts, and load profile.&lt;/p&gt;

&lt;p&gt;The full pipeline completes in under two minutes for a standard web application. Complex enterprise apps with hundreds of entries: five to ten minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Comes Out
&lt;/h2&gt;

&lt;p&gt;The output is not a fragile script. It is an engineered test asset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extractors&lt;/strong&gt; for every dynamic value, validated against real server responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assertions&lt;/strong&gt; including soft failure detection invisible to status codes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A QA report&lt;/strong&gt; flagging remaining issues with proposed fixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A baseline&lt;/strong&gt; - the pipeline knows what "working" looks like for this application, so when things change, self-healing kicks in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare that to the old way:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;AI Pipeline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time to working script&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5-25 hours&lt;/td&gt;
&lt;td&gt;1-10 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynamic values caught&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Depends on engineer skill&lt;/td&gt;
&lt;td&gt;Comprehensive scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Assertions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Usually none&lt;/td&gt;
&lt;td&gt;Auto-generated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance on app change&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full re-correlation&lt;/td&gt;
&lt;td&gt;Self-healing diff&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why HAR Files Over Script Recorders
&lt;/h2&gt;

&lt;p&gt;Most load testing tools ship with proxy-based script recorders. They work. But they bind you to one tool (JMeter's recorder produces JMX, LoadRunner's produces C). A HAR file is tool-agnostic — the same recording can produce a JMeter test plan or a Locust Python script.&lt;/p&gt;
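
&lt;p&gt;For illustration, this is the kind of Locust script such a pipeline might emit for a simple login-then-browse journey. The endpoints and field names are invented; the point is that the correlated token comes from the live response, not the recording:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from locust import HttpUser, task, between

class RecordedJourney(HttpUser):
    wait_time = between(1, 3)

    @task
    def login_and_browse(self):
        resp = self.client.post("/api/login", json={"user": "demo", "password": "demo"})
        token = resp.json().get("token", "")      # extracted fresh on every iteration
        headers = {"Authorization": f"Bearer {token}"}
        self.client.get("/api/dashboard", headers=headers)
        self.client.get("/api/orders", headers=headers)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;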

&lt;p&gt;There is also a practical issue: JMeter's proxy recorder cannot handle WebSocket connections. When the browser attempts a WebSocket upgrade, the proxy recording stops. Modern applications that rely on real-time communication become partially or entirely unrecordable. A browser-based HAR recording keeps going.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Limitations
&lt;/h2&gt;

&lt;p&gt;HAR files capture &lt;em&gt;what&lt;/em&gt; was sent but not &lt;em&gt;why&lt;/em&gt;. They show a token in a request header but not which JavaScript function generated it. For most correlation work, what-was-sent is enough. But as recording technology evolves, capturing the "why" alongside the "what" will unlock new capabilities.&lt;/p&gt;

&lt;p&gt;The AI pipeline is not magic. It handles the mechanical, repetitive work — the work that requires thoroughness and patience rather than creativity or judgment. When something genuinely novel appears (a bespoke authentication flow no system has seen before), human expertise still matters. The AI compresses the common case so that human attention can focus where it actually adds value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The quickest way to see this in action:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open Firefox, hit F12, go to the Network tab&lt;/li&gt;
&lt;li&gt;Navigate through any web application (a login flow works well)&lt;/li&gt;
&lt;li&gt;Right-click in the Network panel and "Save All as HAR"&lt;/li&gt;
&lt;li&gt;Import that file into an AI-powered testing platform&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What used to take hours of manual scripting now takes minutes. The time saved is real. What you do with that time - actual performance analysis, tuning, strategic testing - is where the engineering happens.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article is adapted from &lt;a href="https://leanpub.com/ai-performance-engineering" rel="noopener noreferrer"&gt;AI Performance Engineering: How Agentic AI Is Transforming Load Testing&lt;/a&gt; by David Campbell. The book covers the full workflow in depth, including the agent architecture, a time-motion study, self-healing tests, and a build-your-own guide. Available on Leanpub.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://loadmagic.ai" rel="noopener noreferrer"&gt;LoadMagic&lt;/a&gt; offers a free tier if you want to try the HAR-to-test workflow yourself.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>performance</category>
      <category>productivity</category>
      <category>testing</category>
    </item>
    <item>
      <title>Why AI Correlation Is Harder Than You Think (And What 25 Years of Pain Taught Me)</title>
      <dc:creator>David Campbell</dc:creator>
      <pubDate>Mon, 20 Apr 2026 17:23:50 +0000</pubDate>
      <link>https://forem.com/davidcampbelldc/why-ai-correlation-is-harder-than-you-think-and-what-20-years-of-pain-taught-me-16o4</link>
      <guid>https://forem.com/davidcampbelldc/why-ai-correlation-is-harder-than-you-think-and-what-20-years-of-pain-taught-me-16o4</guid>
      <description>&lt;p&gt;Every performance tester knows the feeling. You record a user journey, hit replay, and watch your script crash within seconds. The culprit is almost always the same: dynamic data. Session tokens, CSRF values, authentication keys – they all change between requests. If your script replays the values it recorded rather than extracting fresh ones from server responses, the test is dead on arrival.&lt;/p&gt;

&lt;p&gt;The process of fixing this – identifying dynamic values, finding their origin in a previous server response, and extracting them for reuse – is called &lt;strong&gt;correlation&lt;/strong&gt;. It is the single most time-consuming and frustrating part of performance test preparation. A simple script might need a handful of correlations. A complex enterprise application (Salesforce, SAP, a modern microservices checkout) can require dozens or even hundreds.&lt;/p&gt;

&lt;p&gt;I have spent most of my career wrestling with this problem. First as a tester, then as a consultant helping teams dig out of correlation backlogs, and now as the founder of a platform built to solve it. Along the way I built a framework for thinking about the different approaches the industry has tried.&lt;/p&gt;

&lt;p&gt;I call it the Correlation Spectrum: five levels of capability, from fully manual through to fully autonomous. Understanding these levels is not academic. It determines whether your performance testing programme is viable, efficient, or too expensive to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Level 1: Manual Correlation (With AI Hints)
&lt;/h2&gt;

&lt;p&gt;The engineer opens a recorded script containing hundreds of HTTP requests. They find a failing request, compare the recorded response to the replayed response, spot a value that changed, then manually search backward through prior responses to find where it first appeared. They write an extraction rule, insert it after the originating response, and replace the hardcoded value with a variable reference.&lt;/p&gt;

&lt;p&gt;Modern tools at this level may use lightweight AI to suggest "this looks dynamic" or auto-generate a regex once the engineer has identified the target. But the detective work remains human-driven.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem is scale.&lt;/strong&gt; Manual correlation effort does not grow linearly with complexity – it grows exponentially. Each additional dynamic value increases the search space and the likelihood of cascading errors, where fixing one correlation breaks another. For scripts with thirty or more correlation candidates, manual effort can hit 40+ hours per script. At that point, teams abandon scripts rather than maintain them.&lt;/p&gt;

&lt;p&gt;I call this the "Script Museum": test assets that sit unused because they are too expensive to keep current.&lt;/p&gt;

&lt;h2&gt;
  
  
  Level 2: Rules-Based Frameworks
&lt;/h2&gt;

&lt;p&gt;The tool ships with a library of predefined rules organised by framework. Record against a .NET application and it auto-detects &lt;code&gt;__VIEWSTATE&lt;/code&gt;, &lt;code&gt;__EVENTVALIDATION&lt;/code&gt;, and ASP.NET session IDs. All the leading commercial tools (LoadRunner, BlazeMeter, NeoLoad, OctoPerf) implement some version of this, with success rates of 60-90% on well-matched stacks.&lt;/p&gt;

&lt;p&gt;The limitation is inherent: rules only work for patterns the vendor has already seen. Custom frameworks, bespoke auth mechanisms, anything not in the library – missed. And that last 10-40% often represents the &lt;em&gt;hardest&lt;/em&gt; correlations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Level 3: Algorithmic Diffing
&lt;/h2&gt;

&lt;p&gt;Compare recordings, spot values that change, generate extractors. Framework-agnostic. Scales better than manual work. But algorithms are smart, not intelligent. They tell you &lt;em&gt;what&lt;/em&gt; changed but struggle with &lt;em&gt;why&lt;/em&gt; it changed and whether it matters. False positives require review. There is no learning between sessions — each new script starts from scratch.&lt;/p&gt;
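
&lt;p&gt;The core of the technique fits in a few lines. A sketch, assuming two HAR-style recordings of the same journey:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from urllib.parse import urlparse, parse_qs

def changed_query_params(entries_a, entries_b):
    """Query parameters whose values differ between two recordings of the same journey."""
    changed = set()
    for a, b in zip(entries_a, entries_b):
        qa = parse_qs(urlparse(a["request"]["url"]).query)
        qb = parse_qs(urlparse(b["request"]["url"]).query)
        for name in qa:
            if name in qb and qa[name] != qb[name]:
                changed.add(name)           # likely dynamic: a correlation candidate
    return changed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;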

&lt;h2&gt;
  
  
  Level 4: General-Purpose AI
&lt;/h2&gt;

&lt;p&gt;LLMs enter the workflow. Feed the recorded traffic to GPT/Claude/Gemini and let it reason about what needs correlating. The AI understands that &lt;code&gt;"csrfToken": "abc123"&lt;/code&gt; is a security token. It can trace authentication flows and reason about error messages.&lt;/p&gt;

&lt;p&gt;But general-purpose AI was not built for this problem, and it shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context window limits&lt;/strong&gt;: A complex recording has thousands of requests. Feeding it all in exceeds limits or forces summarisation that loses critical detail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No accumulated knowledge&lt;/strong&gt;: Each session starts fresh. The AI re-discovers patterns it solved last week.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-specific blindness&lt;/strong&gt;: Generating a valid regex is one thing. Generating a regex safe for JMeter's ORO engine, which has specific syntax quirks and boundary handling, is another.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination risk&lt;/strong&gt;: Plausible-looking but incorrect JSONPaths and regex patterns propagate without warning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ran a stopwatch study on this exact scenario. Same test plan, same HAR recording. Manual correlation with ChatGPT as a coding assistant versus automated correlation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Automated&lt;/th&gt;
&lt;th&gt;Manual + ChatGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;75 seconds&lt;/td&gt;
&lt;td&gt;25 min 20 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context switches&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Human errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Candidates found&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6 of 6&lt;/td&gt;
&lt;td&gt;2 of 6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headline is 20x faster. But the coverage gap matters more: the manual run missed four out of six dynamic values. The "finished" script was sending stale data to the server. A test that looks correct but sends hardcoded tokens is worse than no test at all — it creates false confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Level 5: Where It Gets Interesting - Specialised AI With Persistent Knowledge
&lt;/h2&gt;

&lt;p&gt;This is the level I have been building toward. The key insight: &lt;strong&gt;correlation is not one problem. It is three.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observe&lt;/strong&gt;: Scan every value in every response against every subsequent request. Flag what changes. Detect encodings. Fingerprint frameworks. This is mechanical work — no intelligence required, just thoroughness. Machines do this in milliseconds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decide&lt;/strong&gt;: Which values &lt;em&gt;need&lt;/em&gt; extraction? What type of extractor? Where does it go? How does it interact with other correlations? This requires understanding, not pattern matching.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prove&lt;/strong&gt;: Run the test. Did the extractor capture the right value? Did the server accept the request? This is ground truth from a real execution.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most tools collapse all three into a single operation. When they break, you cannot tell which step failed. A broken extractor might mean the pattern was wrong (bad observation), the extraction strategy was wrong (bad decision), or the application changed (invalid proof). Single-layer tools show you a red result and leave you guessing.&lt;/p&gt;
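
&lt;p&gt;A minimal sketch of what keeping the layers separate buys you - the names and return values are illustrative, not any tool's actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class Observation:             # what changed, and where it first appeared
    value: str
    source_request: int

@dataclass
class Decision:                # how to extract it and where to substitute it
    observation: Observation
    extractor: str             # e.g. a JSONPath or regex expression

def classify_failure(observation_still_valid, decision_still_valid):
    """Proof layer: attribute a failed run to the layer that actually broke."""
    if not observation_still_valid:
        return "observation wrong: re-scan the recording"
    if not decision_still_valid:
        return "decision wrong: rebuild the extractor"
    return "application changed: repair against the baseline"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;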

&lt;h3&gt;
  
  
  Why separation enables self-healing
&lt;/h3&gt;

&lt;p&gt;When the three layers are separate, self-healing becomes diagnostic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application update detected:
  - Observation layer: "Token format changed from opaque to JWT"
  - Decision layer: "Extraction strategy (regex on refresh_token) still correct,
    but response structure moved from flat JSON to nested auth.tokens object"
  - Proof layer: "Updated JSONPath works. Stamping golden baseline."

Result: 3 minutes, one targeted fix
vs. 20 minutes re-running the entire pipeline and hoping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system builds a &lt;strong&gt;Golden Map&lt;/strong&gt; - a snapshot of the world view at the moment everything was proven to work. When something breaks, the first response is not AI investigation. It is a deterministic diff against the Golden Map and a restore of what changed. Faster than an LLM call, cheaper (no API tokens), more predictable (same input, same output every time).&lt;/p&gt;

&lt;p&gt;AI agents only get involved when the restore fails - meaning the &lt;em&gt;application itself&lt;/em&gt; changed, not just the test configuration.&lt;/p&gt;
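
&lt;p&gt;A sketch of that "diff before AI" ordering. The file name and the two helpers are hypothetical placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

def drift_against_baseline(current, baseline):
    """Keys whose current value no longer matches the proven-good baseline."""
    return {k: baseline[k] for k in baseline if current.get(k) != baseline[k]}

with open("golden_map.json", encoding="utf-8") as f:       # illustrative file name
    golden = json.load(f)

drift = drift_against_baseline(read_current_config(), golden)    # hypothetical loader
for key, known_good in drift.items():
    restore_setting(key, known_good)    # deterministic restore, no LLM call (hypothetical helper)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;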

&lt;h3&gt;
  
  
  The compounding effect
&lt;/h3&gt;

&lt;p&gt;The real power is what happens over time. Each layer feeds the next cycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The observation layer builds a world view. First import: sparse. Tenth import: the system already knows where tokens live.&lt;/li&gt;
&lt;li&gt;The decision layer accumulates proven strategies. A JSONPath that worked for this Salesforce token last month still applies today.&lt;/li&gt;
&lt;li&gt;The proof layer builds a golden baseline. Drift is detected and investigated, not discovered during a production test run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap between specialised AI and every other approach widens with every session:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Profile&lt;/th&gt;
&lt;th&gt;Manual Estimate&lt;/th&gt;
&lt;th&gt;Automated Estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Simple login flow&lt;/strong&gt; (measured)&lt;/td&gt;
&lt;td&gt;25 min&lt;/td&gt;
&lt;td&gt;75 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Standard commercial app&lt;/strong&gt; (projected)&lt;/td&gt;
&lt;td&gt;3-5 hours&lt;/td&gt;
&lt;td&gt;~10 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Complex financial system&lt;/strong&gt; (projected)&lt;/td&gt;
&lt;td&gt;~1 week&lt;/td&gt;
&lt;td&gt;~1-2 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Heavy enterprise (SAP-scale)&lt;/strong&gt; (projected)&lt;/td&gt;
&lt;td&gt;8-10 weeks&lt;/td&gt;
&lt;td&gt;~1-2 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Manual costs accelerate (more candidates = more cascading errors). Automated costs stay near-linear.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question to Ask Your Tooling
&lt;/h2&gt;

&lt;p&gt;Whatever correlation approach you use today, ask this: &lt;strong&gt;when a test breaks, can the system tell you whether the observation was wrong, the decision was wrong, or the application changed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the answer is no, you are working with a single-layer tool. It may work at small scale. But as applications grow and change accelerates, you will spend more time on diagnostic work that a well-separated architecture handles by design.&lt;/p&gt;

&lt;p&gt;Correlation improves through architecture, not through better pattern matching or bigger language models.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article is adapted from &lt;a href="https://leanpub.com/ai-performance-engineering" rel="noopener noreferrer"&gt;AI Performance Engineering: How Agentic AI Is Transforming Load Testing&lt;/a&gt; by David Campbell. The book covers the full Correlation Spectrum framework, the three-layer architecture, a minute-by-minute time-motion study, self-healing tests, and a step-by-step guide to building your own AI testing pipeline. Available on Leanpub.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to see the automated correlation workflow in action, &lt;a href="https://loadmagic.ai" rel="noopener noreferrer"&gt;LoadMagic&lt;/a&gt; has a free tier.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>performance</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
