Forem: tokozen

Building a RAG Pipeline That Stays Fresh with Live Web Data

tokozen — Tue, 05 May 2026 06:02:44 +0000

You build a RAG pipeline, embed your documents, stand up a vector store, and it works great. Then three months later, users start complaining that the answers are wrong. Your product pricing changed. A regulation was updated. A library released a breaking version. The documents you indexed at setup time are now lying to your users.

The fix is not to re-index more aggressively. The fix is to stop treating the web as a one-time data source and start treating it as a live feed that your pipeline can query at retrieval time.

Here is how to wire that up.

The Core Problem with Static RAG

Standard RAG looks like this: ingest documents, chunk them, embed them, store vectors, retrieve on query, generate. Every step happens at ingest time except retrieval and generation. That is fine for a corporate knowledge base with a weekly update cycle. It breaks down when:

You are answering questions about prices, regulations, or news
Your users ask about things that happened after your last ingest
The authoritative source is a website that updates continuously

The mental model shift is to treat retrieval as a two-stage process. First, check your local vector store for stable reference content (your docs, FAQs, internal data). Second, run a live web query for anything time-sensitive and merge those results into the context window before generation.

What the Pipeline Looks Like

Here is the revised flow:

User sends a query
Classify the query: does it need fresh data?
If yes, fire a web search, scrape the top results, pull clean text
Combine local retrieval results with fresh web content
Pack both into the prompt context and generate

The classification step does not need to be fancy. A simple heuristic works: if the query contains words like "current", "latest", "today", "now", "price", or a named entity that changes frequently, route it to the live path. You can also ask the LLM itself with a one-shot classifier prompt.

The scraping step is where most implementations fall apart. Raw HTML is terrible context. You want clean, structured text. Anakin's Scrape API handles this well: you give it a URL and it returns the page content as clean markdown or plain text, stripping nav, ads, and boilerplate. That matters a lot because every token of garbage HTML in your context window is a token not used for actual reasoning.

A Concrete Implementation

import os
import requests
from openai import OpenAI

ANAKIN_API_KEY = os.environ["ANAKIN_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

client = OpenAI(api_key=OPENAI_API_KEY)

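def needs_fresh_data(query: str) -> bool:
    # Match whole words so "now" does not fire on "know" or "snow".
    trigger_words = {"current", "latest", "today", "now", "price", "prices",
                     "pricing", "recent", "2024", "2025"}
    words = {w.strip(".,?!:;\"'()") for w in query.lower().split()}
    return not trigger_words.isdisjoint(words)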

def web_search(query: str, num_results: int = 3) -> list[dict]:
    resp = requests.get(
        "https://api.anakin.ai/v1/search",
        headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
        params={"q": query, "limit": num_results},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

def scrape_url(url: str) -> str:
    resp = requests.post(
        "https://api.anakin.ai/v1/scrape",
        headers={"Authorization": f"Bearer {ANAKIN_API_KEY}", "Content-Type": "application/json"},
        json={"url": url, "format": "markdown"},
        timeout=15,
    )
    if resp.status_code != 200:
        return ""
    return resp.json().get("content", "")[:3000]  # cap characters per source

def build_context(query: str, local_chunks: list[str]) -> str:
    sections = []

    if local_chunks:
        sections.append("## Internal Knowledge\n" + "\n\n".join(local_chunks))

    if needs_fresh_data(query):
        search_results = web_search(query)
        web_sections = []
        for result in search_results:
            url = result.get("url", "")
            title = result.get("title", url)
            content = scrape_url(url)
            if content:
                web_sections.append(f"### {title}\nSource: {url}\n\n{content}")
        if web_sections:
            sections.append("## Live Web Results\n" + "\n\n".join(web_sections))

    return "\n\n".join(sections)

def answer(query: str, local_chunks: list[str]) -> str:
    context = build_context(query, local_chunks)
    system_prompt = (
        "You are a helpful assistant. Answer the user's question using the provided context. "
        "If you cite information from a web source, mention the source URL. "
        "If the context does not contain enough information, say so clearly."
    )
    user_message = f"Context:\n{context}\n\nQuestion: {query}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

# Example usage
local_knowledge = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Enterprise plans include SSO and audit logs.",
]
query = "What is the current price of GPT-4o API calls?"
print(answer(query, local_knowledge))

A few things worth noting in this code:

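The scrape_url function caps content at 3000 characters per source. You will want to tune this based on how many sources you pull and what your context window budget looks like. Three sources at 3000 characters each come to about 9000 characters, which is roughly 2,000 to 3,000 tokens of English text at the usual three to four characters per token, so even with your local chunks added you stay well inside a 16k-token budget.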
needs_fresh_data is intentionally simple. In production you would replace this with a proper classifier or even a quick LLM call with a boolean output schema.
The system prompt tells the model to cite sources when using web content. This matters for user trust and for debugging when something goes wrong.

Keeping Latency Manageable

The obvious concern with adding live web requests to a RAG pipeline is latency. A search call plus three scrape calls adds up. A few approaches that help:

Run search and scrape in parallel using asyncio or concurrent.futures. The three scrape calls should happen simultaneously, not serially.
Cache search results for identical or near-identical queries with a short TTL, maybe 5 to 15 minutes depending on how time-sensitive your domain is.
Set aggressive timeouts. A scrape that takes more than 10 seconds is probably not worth waiting for (the 15-second timeout in the example above is on the generous side). Fail gracefully and use whatever you got.
Limit scraping to the top two or three results rather than five. Marginal sources past that point rarely change the answer quality, but they do add latency.

With parallelism and reasonable timeouts, the added latency is roughly one search call plus the slowest scrape, not the sum of every call. In practice that can often be kept to a couple of seconds for the retrieval step.
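
Here is a minimal sketch of the caching, parallelism, and timeout ideas, using only the standard library. It is not tied to any particular scraping API: scrape_all takes whatever scrape function you already have (the scrape_url above would do), and the names TTLCache, scrape_all, and budget_seconds are made up for illustration. Treat the numbers as placeholders to tune.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout
from typing import Callable


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # entry is too old, drop it
            return None
        return value

    def set(self, key: str, value) -> None:
        self._store[key] = (self.clock(), value)


def scrape_all(urls: list[str], scrape_fn: Callable[[str], str],
               budget_seconds: float = 10.0) -> dict[str, str]:
    """Run scrape_fn over all urls at once; keep whatever finishes in time."""
    results: dict[str, str] = {}
    if not urls:
        return results
    pool = ThreadPoolExecutor(max_workers=len(urls))
    futures = {pool.submit(scrape_fn, url): url for url in urls}
    try:
        for future in as_completed(futures, timeout=budget_seconds):
            try:
                content = future.result()
            except Exception:
                continue  # one failed source should not sink the others
            if content:
                results[futures[future]] = content
    except FuturesTimeout:
        pass  # time budget spent: answer with what has arrived so far
    finally:
        # Do not block on stragglers; they end when their own request timeout fires.
        pool.shutdown(wait=False, cancel_futures=True)
    return results
```

In build_context you would swap the serial loop for something like scrape_all([r["url"] for r in search_results], scrape_url), and check the cache (keyed on the lowercased query) before calling web_search.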

Where to Go from Here

The pattern above handles the common case: detect, search, scrape, merge. But there is a more interesting direction when you need depth rather than breadth. Anakin's Agentic Search does the research loop for you, returning a synthesized answer with sources rather than raw search results. That can be a better fit when the query needs multi-step reasoning across sources rather than a simple "what does this page say."

The thing I would tackle next is making the freshness classifier smarter. Right now it is a keyword list. A better version would use embeddings to detect semantic similarity to a set of "time-sensitive topics" defined for your specific domain. That gets you higher precision without much more code.
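
To make that concrete, here is a rough sketch of the similarity check on its own. It assumes you already have embedding vectors for the query and for a handful of domain-specific "time-sensitive" topic phrases (from whatever embedding model you use; that call is not shown). The function names and the 0.8 threshold are placeholders, not tuned values.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_time_sensitive(query_vec: list[float],
                      topic_vecs: list[list[float]],
                      threshold: float = 0.8) -> bool:
    """True if the query sits close to any of the time-sensitive topic vectors."""
    return any(cosine(query_vec, topic) >= threshold for topic in topic_vecs)
```

You would embed the topic phrases once at startup and only embed the query per request, so the extra cost is one embedding call plus a few dot products.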

Static RAG is a starting point, not a destination. The web is your database. Treat retrieval accordingly.

Anti-bot without the arms race: what Camoufox does differently

tokozen — Tue, 28 Apr 2026 06:18:46 +0000

You spin up Playwright, navigate to a site, and get a CAPTCHA or a 403 before the real page ever renders. Your request headers were not what gave you away. The site fingerprinted your browser.

This is the core problem with headless automation: most anti-bot systems do not just check whether you look like a bot by your behavior. They check what your browser is at a runtime level, things like the values returned by navigator.webdriver, the shape of the AudioContext API, your canvas rendering hash, font metrics, screen geometry, and dozens of other signals that are consistent across real user sessions but drift or become obviously synthetic in headless environments.

The typical response is a cat-and-mouse loop: someone finds a leak, writes a patch in JavaScript, the detection vendors update their fingerprinting logic, repeat. Camoufox takes a different approach.

What Camoufox actually patches

Camoufox is a modified Firefox build. The patches live at the C++ and Rust level inside the browser engine, not in a JavaScript shim injected at runtime. That distinction matters because JS-layer patches can be detected: a site can inspect whether navigator.__proto__ has been tampered with, check timing consistency between API calls, or look for the characteristic fingerprint of Object.defineProperty overrides. Engine-level patches do not leave those artifacts because the values come from the compiled binary itself.

The specific things Camoufox addresses include:

Canvas fingerprinting: real browsers produce slightly different canvas renders depending on GPU driver and OS. Headless Chrome on a Linux server produces a consistent, recognizable hash. Camoufox injects controlled noise at the rendering level so hashes vary realistically across sessions.
WebGL vendor strings: these normally reflect actual hardware. In a container you get Mesa software rendering strings, which are a strong signal. Camoufox spoofs these to plausible hardware values.
Screen and window geometry: headless browsers often report a viewport that does not match any screen, or a screen size of 0x0. Camoufox lets you configure these to match common real-world resolutions.
navigator.webdriver: obviously the first thing everyone patches, but Camoufox also handles the surrounding properties that tend to be absent or misshapen in automation contexts.
Font enumeration: the set of fonts available in a container differs from a desktop OS. Camoufox allows configuring which fonts are reported.

Because these patches live in the engine, you get them without writing any JavaScript injection code yourself.

Using Camoufox from Python

Camoufox ships a Python wrapper that handles launching the patched browser and configuring fingerprint parameters. Here is a minimal working example that loads a page and dumps the fingerprint values a detection library would see:

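import asyncio
from browserforge.fingerprints import Screen
from camoufox.async_api import AsyncCamoufox

async def main():
    async with AsyncCamoufox(
        headless=True,
        os="windows",           # target OS persona
        # Camoufox expects a browserforge Screen constraint here, not a plain dict
        screen=Screen(max_width=1920, max_height=1080),
        fonts=["Arial", "Times New Roman", "Calibri"],
    ) as browser: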
        page = await browser.new_page()

        # Navigate to a fingerprint audit page
        await page.goto("https://abrahamjuliot.github.io/creepjs/")

        # Grab some of the values the page reports back
        webdriver_flag = await page.evaluate("navigator.webdriver")
        canvas_hash = await page.evaluate("""
            () => {
                const c = document.createElement('canvas');
                const ctx = c.getContext('2d');
                ctx.fillStyle = 'red';
                ctx.fillRect(0, 0, 50, 50);
                return c.toDataURL().slice(-20);
            }
        """)
        webgl_vendor = await page.evaluate("""
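            () => {
                const gl = document.createElement('canvas').getContext('webgl');
                if (!gl) return null;  // WebGL is unavailable in this context
                const ext = gl.getExtension('WEBGL_debug_renderer_info');
                // Fall back to the masked vendor if the debug extension is blocked
                return ext ? gl.getParameter(ext.UNMASKED_VENDOR_WEBGL)
                           : gl.getParameter(gl.VENDOR);
            }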
        """)

        print(f"webdriver: {webdriver_flag}")
        print(f"canvas tail hash: {canvas_hash}")
        print(f"webgl vendor: {webgl_vendor}")

        await page.close()

asyncio.run(main())

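On a clean Playwright Chromium install you would typically see webdriver: True and, on a server without a GPU, a software-rendering vendor string (SwiftShader or Mesa, depending on the setup). With Camoufox configured as above, webdriver should print False, the same value an ordinary browser reports, alongside a spoofed GPU vendor.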

The os parameter tells Camoufox which persona to build. Choosing "windows" means the font list, screen geometry defaults, and some navigator properties are calibrated to a typical Windows desktop profile. You can also use "macos" or "linux".

Where this does and does not help

Camoufox handles the static fingerprint layer well. It is harder to fool behavioral analysis, which looks at things like mouse movement entropy, scroll patterns, timing between interactions, and whether you click precisely in the center of every button.

If you are doing research scraping or building data pipelines where you control the interaction flow, you can work around behavioral heuristics by adding realistic delays and some randomization to your action sequences. Camoufox does not do this for you, but it removes the browser-layer signals so behavioral analysis is the only remaining vector.
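
As a rough illustration of "realistic delays and some randomization", here is a small standard-library sketch. It only produces the numbers; human_delay and click_point are hypothetical helper names, and the default ranges are guesses, not values calibrated against any detector.

```python
import random
from typing import Optional


def human_delay(base: float = 0.4, jitter: float = 0.8,
                long_pause_chance: float = 0.1,
                rng: Optional[random.Random] = None) -> float:
    """Return a pause in seconds: base plus uniform jitter, occasionally tripled."""
    rng = rng or random
    delay = base + rng.uniform(0.0, jitter)
    if rng.random() < long_pause_chance:
        delay *= 3  # the odd longer pause, as if the user stopped to read
    return delay


def click_point(box: dict, rng: Optional[random.Random] = None) -> tuple[float, float]:
    """Pick a point inside the middle 60% of a bounding box instead of its exact center.

    box uses the x / y / width / height keys that Playwright's bounding_box() returns.
    """
    rng = rng or random
    x = box["x"] + box["width"] * rng.uniform(0.2, 0.8)
    y = box["y"] + box["height"] * rng.uniform(0.2, 0.8)
    return x, y
```

With the Playwright-style page object Camoufox gives you, the intended use is along the lines of awaiting asyncio.sleep(human_delay()) between actions and passing click_point(await locator.bounding_box()) to page.mouse.click. Uniform jitter is the simplest thing that breaks perfect regularity; it will not satisfy a detector that models full mouse trajectories.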

There is also the question of IP reputation. A perfectly spoofed browser fingerprint coming from a datacenter IP that has been seen making thousands of requests still gets blocked at the network layer. Pairing Camoufox with residential or mobile proxies gives you coverage at both layers.

For sites that use session-based auth and require you to maintain cookies across requests, Camoufox supports persistent contexts the same way Playwright does. If you are building something that needs to stay logged in across scraping runs, you can serialize the browser storage state to disk and restore it on the next run, same API.

What I would actually try next

If you are hitting detection walls and have already tried header spoofing and basic evasion without success, Camoufox is worth running against a fingerprinting audit tool like CreepJS or BrowserLeaks before pointing it at your target. See what signals still leak. The audit pages give you a structured view of exactly which APIs are returning anomalous values.

Engine-level patching is not magic, but it is a more durable foundation than stacking JS overrides. When detection vendors update their fingerprinting logic, a JS patch is often the first thing to break, because the override itself is detectable. A C++ patch to the canvas rendering pipeline leaves no such artifact, so it tends to survive those updates longer.

The source is on GitHub under a Mozilla Public License variant. If you are doing this at scale and need something customized, the patch set is auditable and the build process is documented.

Authenticated Scraping: Why Session Persistence Matters

tokozen — Fri, 24 Apr 2026 05:54:58 +0000

You write a scraper that logs in, grabs a token, and then makes a request to a protected endpoint. It works once. You run it again five minutes later and get a 401. You add a retry. It works. Then it stops working entirely when the site starts fingerprinting your requests.

This is the session persistence problem, and it trips up a lot of scrapers that handle auth in a simplified way.

What "Authenticated Scraping" Actually Involves

Most people treat authentication as a one-time step: POST credentials, receive a cookie or token, attach it to subsequent requests. That model works fine for simple REST APIs. For real web applications, especially ones built for human users, it falls apart quickly.

Here's what actually happens when a person logs into a web app:

The browser sends credentials via a form POST.
The server sets a session cookie (often HttpOnly, Secure, SameSite=Lax).
Subsequent requests carry that cookie automatically.
The server may also rotate the cookie value on each request, or issue a CSRF token that must accompany state-changing requests.
If the browser goes quiet for too long, the session expires. The server might also tie the session to IP, User-Agent, or a device fingerprint.

A naive scraper might grab the initial cookie and replay it indefinitely. But if the site rotates session tokens, replaying an old value causes an immediate logout or a redirect to /login. If the site checks the User-Agent or Accept-Language headers for consistency, a mismatch triggers a challenge page.

Session persistence means maintaining the full browser-like state across requests: cookies, headers, timing, and sometimes JavaScript execution for SPAs that build auth state on the client side.

Where Naive Implementations Break

Here are the specific failure modes I see most often:

Cookie expiry without refresh. Sessions have TTLs. If your scraper sleeps for an hour between page fetches, the session may be dead when it wakes up. You need logic to detect a redirect to a login page (watch for Location: /login in a 302, or a 200 response whose URL or body content signals you've been kicked out) and re-authenticate.

CSRF token rotation. Some apps embed a CSRF token in the page HTML and require it on every POST. If you cache the token from login and reuse it, the second POST fails. You need to parse the current page's token before each state-changing request.

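Missing cookies from redirects. A bare requests.get() call follows redirects and carries cookies across the hops of that one call, but it discards them as soon as the call returns, and resp.cookies only holds what the final response set (cookies from intermediate hops sit on the responses in resp.history). The next bare call starts with an empty cookie jar. Using a requests.Session() object handles this correctly because it stores cookies across all requests in the session. Forgetting to use a session object is a common source of intermittent auth failures.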

JavaScript-gated auth flows. OAuth and SSO flows often rely on JavaScript to exchange tokens, redirect, and set cookies. An HTTP-only client never executes that code, so it never completes the handshake. You need a real browser for these.
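
The "have I been kicked out?" check from the first failure mode is worth isolating into one function so every fetch goes through it. Below is a sketch using only the standard library. The function name, the /login path, and the password-field marker are assumptions for illustration; your target site will need its own signals.

```python
from urllib.parse import urlparse

REDIRECT_CODES = (301, 302, 303, 307, 308)


def _is_login_path(url: str, login_path: str) -> bool:
    """True if the URL's path is the login page or something beneath it."""
    path = urlparse(url).path
    return path == login_path or path.startswith(login_path + "/")


def looks_logged_out(status_code: int, final_url: str, location: str = "",
                     body: str = "", login_path: str = "/login",
                     body_markers: tuple = ('type="password"',)) -> bool:
    """Heuristic: does this response indicate that the session is gone?"""
    if status_code == 401:
        return True
    # Redirects not followed: the Location header points at the login page.
    if status_code in REDIRECT_CODES and _is_login_path(location, login_path):
        return True
    # Redirects followed: we ended up on the login page with a 200.
    if _is_login_path(final_url, login_path):
        return True
    # Same URL, but the body is the login form (assumed marker: a password field).
    return any(marker in body for marker in body_markers)
```

Comparing the parsed path rather than searching the whole URL for "login" avoids false alarms on pages like /blog/login-tips. With requests following redirects you would pass resp.status_code, resp.url, and resp.text; the Location branch only matters if you turn redirects off.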

A Concrete Example with Session Handling

Here's a Python example using requests.Session to log in, detect session expiry, and re-authenticate transparently:

import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://example.com/login"
PROTECTED_URL = "https://example.com/dashboard/data"
CREDENTIALS = {"username": "user@example.com", "password": "hunter2"}

def get_csrf_token(session, url):
    resp = session.get(url)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    token_input = soup.find("input", {"name": "csrf_token"})
    if not token_input:
        raise ValueError("CSRF token not found on page")
    return token_input["value"]

def login(session):
    csrf = get_csrf_token(session, LOGIN_URL)
    payload = {**CREDENTIALS, "csrf_token": csrf}
    resp = session.post(LOGIN_URL, data=payload, allow_redirects=True)
    resp.raise_for_status()
    # Confirm we're actually logged in, not silently redirected back
    if "dashboard" not in resp.url:
        raise RuntimeError(f"Login failed, landed at: {resp.url}")
    print("Logged in successfully")

def fetch_protected(session):
    resp = session.get(PROTECTED_URL)
    # Detect silent redirect to login page
    if "login" in resp.url or resp.status_code == 401:
        print("Session expired, re-authenticating...")
        login(session)
        resp = session.get(PROTECTED_URL)
    resp.raise_for_status()
    return resp.json()

with requests.Session() as s:
    s.headers.update({
        "User-Agent": "Mozilla/5.0 (compatible; MyCrawler/1.0)",
        "Accept-Language": "en-US,en;q=0.9",
    })
    login(s)
    data = fetch_protected(s)
    print(data)

A few things worth noting here. The requests.Session object persists cookies across all calls automatically, including cookies set on redirects. The CSRF token is fetched fresh from the login page before each login attempt. After the POST, we check resp.url rather than the status code alone, because many apps return a 200 with the login form when credentials are wrong. The fetch_protected function detects a stale session by inspecting the final URL after any redirects.

This handles most form-based auth flows. It does not handle JavaScript-rendered auth, multi-factor prompts, or fingerprint-based bot detection.

When You Need a Real Browser

Some situations require actual browser automation:

OAuth flows that rely on JavaScript redirects and postMessage between frames.
Apps using WebAuthn or device-bound credentials.
Sites that serve a blank HTML shell and populate auth state entirely via JavaScript.
Anti-bot systems (Cloudflare, PerimeterX, DataDome) that run behavioral challenges.

For these cases, a stateful browser session over CDP (Chrome DevTools Protocol) gives you a real browser context with proper JavaScript execution, cookie storage, and behavioral signals. You log in once through the browser, then reuse that browser context for subsequent requests. The session state (cookies, local storage, IndexedDB entries) persists exactly as a human user's would.

Anakin's Browser Sessions product works this way: you get a CDP-accessible browser instance where you can authenticate interactively and then hand off the live session to automated scraping. For sites with particularly aggressive bot detection, this is often the only path that works reliably.

What I'd Do Next

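If you're building something that needs to scrape behind a login wall, start by mapping exactly what the login flow does. Open DevTools, go to the Network tab, turn on Preserve log so entries survive the redirects, and walk through the login manually. Look at document requests as well as XHR and Fetch, because a classic form login is a plain page navigation and will not show up under the XHR filter. Note every cookie that gets set, every redirect, every token in the request or response body.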

Then decide: can an HTTP client replay this, or does it require JavaScript execution? Most traditional web apps can be handled with a session-aware HTTP client and careful cookie and CSRF management. SPAs and OAuth-heavy flows usually need a real browser.

The session expiry detection logic is the part most people skip initially and then scramble to add later. Build it in from the start. A scraper that silently returns stale or wrong data because its session expired is harder to debug than one that fails loudly with a re-authentication attempt.

Scrape vs Crawl vs Map: Picking the Right Anakin API for the Job

tokozen — Tue, 21 Apr 2026 05:39:58 +0000

You have a website you need data from. You open the docs, see three APIs that all sound like they touch web pages, and have to make a choice. Scrape, Crawl, Map. The names feel intuitive until you actually need to pick one.

This article breaks down what each one does, where it fits, and what the failure modes look like if you reach for the wrong one.

What Each API Actually Does

Scrape takes a single URL and returns clean, structured content from that page. You get the main text, metadata, possibly tables or links, depending on how you configure it. One request in, one page out. It is the right tool when you already know exactly which page holds the data you want.

Crawl starts at a URL and follows links, returning content from every page it visits up to some depth or page limit. You use it when the data you want is spread across a site and you do not know the exact URLs ahead of time. Think documentation sites, blog archives, or any site where the index page links to the content pages.

Map does not scrape content at all. It traverses a domain and returns every discoverable URL, nothing more. No page content, just a list of addresses. It is fast because it only needs to find links, not render and parse each page.

Here is a quick decision table:

You want...	Use
The content of one specific page	Scrape
The content of many pages on a site	Crawl
A list of all URLs on a site	Map
URLs first, then selectively scrape some	Map + Scrape

A Concrete Example: Building a RAG Knowledge Base

Say you are building a RAG pipeline over a software product's documentation. The docs live at docs.example.com. You want to ingest every page.

The wrong instinct here is to jump straight to Crawl. Before you do that, run Map to understand what you are dealing with.

import requests

ANAKIN_API_KEY = "your_api_key"

# Step 1: Map the docs site to get all URLs
map_response = requests.post(
    "https://api.anakin.ai/v1/map",
    headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
    json={"url": "https://docs.example.com"},
    timeout=30,
)
map_response.raise_for_status()  # fail loudly on auth or quota errors

urls = map_response.json().get("urls", [])
print(f"Found {len(urls)} URLs")

# Filter out non-content pages
content_urls = [
    u for u in urls
    if not any(skip in u for skip in ["/changelog", "/search", "/404"])
]
print(f"Scraping {len(content_urls)} content pages")

# Step 2: Scrape each page individually
results = []
for url in content_urls[:10]:  # test with first 10
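    scrape_response = requests.post(
        "https://api.anakin.ai/v1/scrape",
        headers={"Authorization": f"Bearer {ANAKIN_API_KEY}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=30,
    )
    if not scrape_response.ok:
        # Skip pages that fail rather than indexing an error payload
        print(f"Skipping {url}: HTTP {scrape_response.status_code}")
        continue
    data = scrape_response.json()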
    results.append({
        "url": url,
        "title": data.get("title"),
        "content": data.get("markdown")
    })

print(f"Scraped {len(results)} pages")

This pattern gives you control. You can filter URLs before scraping, prioritize certain sections, and avoid wasting API calls on pages that will not add anything useful to your index (changelog entries, auto-generated search pages, and so on).

If you had used Crawl directly, you would have gotten all of that content automatically, but you would have less visibility into what was included until after the fact.

When Crawl Is the Right Choice

Crawl makes sense when you want comprehensive coverage and do not need to filter ahead of time. A few good cases:

Ingesting a blog where every post is worth including
Archiving a site before a migration
Building a competitive analysis tool that needs the full text of a competitor's site

The thing to watch with Crawl is depth configuration. A site that looks small can have thousands of pages once you follow pagination, tag pages, and user-profile URLs. Set a page limit and check the shape of what comes back before running it at full scale.

Crawl also handles the link-following logic for you. If a docs site has a sidebar with 200 links and you would have to manually extract and deduplicate them, Crawl saves you that work. Map does the same thing for URL discovery, but without the content.

When to Combine Map and Scrape

There is a pattern that shows up in more sophisticated pipelines: Map to discover, filter in code, then Scrape selectively. This is the approach in the example above.

It adds a step, but it gives you a clean separation between "what exists on this site" and "what I actually want." That matters when:

The site has sections you want to exclude (API reference pages that are too terse to be useful in a RAG context, for example)
You want to check which URLs you have already indexed before scraping again
You are building an incremental update system where you only re-scrape pages that have changed

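For the incremental case, you would run Map periodically to detect new URLs, compare against your stored URL list, and then Scrape only the new ones. That is a lot cheaper than re-crawling the whole site every time. Keep in mind that Map only tells you which URLs appeared or disappeared. To catch edits to pages you already have, you still need to re-scrape on some schedule and compare a content hash or a Last-Modified value.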
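
The comparison step is just a set difference once the URLs are normalized. Here is a small sketch of that piece on its own, with no API calls: mapped stands for the list a Map call returned and known for the URLs already in your index. Both function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Drop the fragment and trailing slash and lowercase the host, so trivial variants match."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def diff_urls(mapped: list[str], known: set[str]) -> tuple[list[str], list[str]]:
    """Return (new, removed) URLs, comparing normalized forms."""
    current = {normalize(u) for u in mapped}
    indexed = {normalize(u) for u in known}
    return sorted(current - indexed), sorted(indexed - current)
```

Normalizing first matters more than it looks: without it, /guide and /guide/#install count as two different pages and you pay to scrape the same content twice. The removed list is useful for pruning stale vectors from the index.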

The Failure Mode to Avoid

The most common mistake is using Scrape in a loop when Crawl would handle it better, or using Crawl when you only needed one page.

Scraping in a loop without any link discovery means you have to manually maintain the list of URLs. If the site adds a new page, you miss it. Crawl handles that automatically.

Using Crawl for a single known URL is just overhead. You will get the content of that page plus everything linked from it, which is not what you wanted, and you will pay for the extra pages.

Map is the one that tends to get overlooked. It feels redundant if you are already planning to crawl. But if you want to do anything intelligent with URL selection before fetching content, Map gives you that information at low cost.

What I Would Do Next

If you are starting a new data ingestion project, run Map first. Look at what comes back. Understand the structure of the site: how many pages, what the URL patterns look like, whether there are sections worth skipping. Then decide whether Crawl covers your needs or whether you want the finer control of Map plus selective Scraping.

That five-minute audit at the start will save you from over-fetching junk pages or under-fetching content that was one link deeper than you expected.

How Agentic Search Actually Works: The Research Loop Link-Fetching Agents Miss

tokozen — Thu, 16 Apr 2026 18:59:06 +0000

Most agent pipelines treat web search as a single-shot tool call: send a query, get back some URLs, fetch one or two of them, stuff the text into the context window, move on. That works fine for lookup tasks. "What is the capital of France?" does not need a research loop.

But real research tasks do. "What are the current funding trends in open-source AI infrastructure?" or "How does Company X's pricing compare to its three main competitors?" requires following threads, noticing gaps, and issuing follow-up queries. A single fetch-and-summarize pass almost always misses the part of the answer that was buried in a secondary source, a forum thread, or a page the first result happened to link to.

That is the gap agentic search is supposed to fill. Here is what the actual loop looks like, and why the naive version falls short.

What a naive search agent does

A typical ReAct-style agent calls a search tool, gets back a list of results, picks the top one or two, fetches the content, and hands it to the LLM. The LLM either answers from that or gives up.

The failure mode is quiet. The agent returns an answer, often a confident one, but it is based on whatever happened to rank highest in that one query. If the first result is a marketing page, a paywalled article, or a three-year-old blog post, the answer reflects that without the agent noticing.

Three concrete problems:

Single query coverage: one phrasing of a question surfaces a different slice of the web than a slightly different phrasing. No single query covers the topic.
No gap detection: the agent does not evaluate whether the retrieved content actually answers the question. It feeds whatever it got to the LLM and lets the LLM figure it out.
No follow-up: if the first batch of results is insufficient, the agent has no mechanism to try again with a refined query or drill into a promising link.

The research loop that fixes this

Agentic search replaces the single-shot pattern with a loop that has three phases: query generation, result evaluation, and follow-up decision.

Phase 1: generate multiple queries. Given a research goal, the LLM generates three to five distinct queries that approach the topic from different angles. Not just synonyms, but genuinely different framings: a factual lookup, a comparison query, a "what are people saying about" query, a date-scoped query.

Phase 2: fetch, extract, and score. For each query, fetch the top results. Extract clean text (not raw HTML with nav bars and cookie banners, but the actual prose). Score each chunk against the original research goal: is this relevant? does it introduce new information? does it contradict something already retrieved?

Phase 3: decide to continue or stop. If the retrieved content covers the goal, synthesize and return. If there are still gaps, generate new queries targeting those gaps specifically, and loop. Most well-scoped research tasks converge in two to four iterations.

Here is a minimal version of that loop in Python using Anakin's Agentic Search API, which handles the fetch-and-extract step and returns results with sources attached:

import requests
import os

ANAKIN_API_KEY = os.environ["ANAKIN_API_KEY"]
ANAKIN_ENDPOINT = "https://api.anakin.ai/v1/agentic-search"

def agentic_search(goal: str, max_iterations: int = 3) -> dict:
    collected_sources = []
    queries_tried = []
    current_query = goal

    for iteration in range(max_iterations):
        print(f"Iteration {iteration + 1}: querying '{current_query}'")

        response = requests.post(
            ANAKIN_ENDPOINT,
            headers={
                "Authorization": f"Bearer {ANAKIN_API_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "query": current_query,
                "include_sources": True,
            },
            timeout=30,
        )
        response.raise_for_status()
        data = response.json()

        answer = data.get("answer", "")
        sources = data.get("sources", [])
        collected_sources.extend(sources)
        queries_tried.append(current_query)

        # Ask the LLM whether the answer covers the goal
        # or whether a follow-up query is needed.
        # In a real pipeline this is an LLM call.
        # Here we fake it with a length heuristic for illustration.
        if len(answer.split()) > 150:
            print(f"Goal covered after {iteration + 1} iteration(s).")
            return {"answer": answer, "sources": collected_sources}

        # Generate a follow-up query targeting the gap.
        # In production: call your LLM with the goal, the answer so far,
        # and ask it to produce a more specific query.
        current_query = f"{goal} detailed analysis {iteration + 1}"

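    # Out of iterations: hand back the last partial answer instead of discarding it.
    print("Max iterations reached. Returning partial results.")
    return {
        "answer": answer if queries_tried else "",
        "sources": collected_sources,
    }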


if __name__ == "__main__":
    result = agentic_search(
        goal="What are the main approaches to reducing LLM inference costs in 2024?"
    )
    print(result["answer"])
    for src in result["sources"]:
        print(" -", src.get("url"), src.get("title"))

The key thing the loop adds is the gap-detection step. Even with the fake heuristic above, the structure forces you to ask: did I actually get what I needed? That question is absent from the single-shot pattern.

What clean source attribution changes

The other thing a proper agentic search loop enables is traceable answers. When you fetch raw pages yourself and concatenate the text into a prompt, you lose the mapping between claims and sources. The LLM synthesizes across everything and you cannot tell which sentence came from where.

When each result comes back with its source URL attached to the specific text chunk it came from, you can build a citation index. For RAG pipelines this matters a lot: the final answer can include inline citations, and a downstream verification step can re-fetch the source to confirm a claim has not changed since it was indexed.

For agent memory, this is also useful. If the agent stores what it has already fetched (by URL or by a hash of the content), it avoids re-fetching the same page on the next iteration and can detect when two sources contradict each other.
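
A minimal sketch of that memory, keyed by URL and by a hash of the whitespace-normalized text, might look like the following. FetchMemory is a made-up name, and this only covers the "skip what I already have" half; spotting contradictions between sources needs an LLM or NLI step that is not shown.

```python
import hashlib


class FetchMemory:
    """Remembers fetched pages by URL and by content hash."""

    def __init__(self):
        self.by_url: dict[str, str] = {}   # url -> content hash
        self.by_hash: dict[str, str] = {}  # content hash -> first url seen with it

    def seen_url(self, url: str) -> bool:
        return url in self.by_url

    def remember(self, url: str, text: str) -> bool:
        """Record a page; return True only if its content has not been seen before."""
        normalized = " ".join(text.split())  # collapse whitespace before hashing
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        self.by_url[url] = digest
        if digest in self.by_hash:
            return False  # same content reached via a different URL
        self.by_hash[digest] = url
        return True
```

Check seen_url before fetching, and only add a page's text to the context when remember returns True. Hashing the content as well as the URL catches syndicated articles and mirror pages that would otherwise be counted as independent sources.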

Where to take this next

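The loop I showed above carries very little state across iterations: it collects sources and records the queries it tried, but nothing reads them back to steer the next step. A more robust version would maintain a shared context object that accumulates: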

All queries tried (to avoid rephrasing the same one)
All URLs fetched (to skip duplicates)
A running summary of what is known vs. what is still unknown
A confidence score that drives the stop condition

The stop condition is the hardest part to get right. Too aggressive and the agent stops after one good-looking result. Too lenient and it loops until it hits the token or cost limit. In practice, a small LLM call that scores coverage against a checklist derived from the original goal works better than a word-count heuristic.
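
One way to picture that shared context object, with a checklist-driven stop condition, is the sketch below. Everything in it is hypothetical: the ResearchState name, the idea that an LLM call (not shown) turns the goal into checklist items and reports which ones each iteration covered, and the 0.8 coverage and 4-iteration defaults.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchState:
    goal: str
    checklist: dict[str, bool]  # checklist item -> covered yet?
    queries_tried: list[str] = field(default_factory=list)
    urls_fetched: set[str] = field(default_factory=set)
    iterations: int = 0

    def record(self, query: str, urls: list[str], covered: list[str]) -> list[str]:
        """Log one iteration and return the URLs that had not been fetched before."""
        self.iterations += 1
        self.queries_tried.append(query)
        fresh = [u for u in urls if u not in self.urls_fetched]
        self.urls_fetched.update(fresh)
        for item in covered:
            if item in self.checklist:
                self.checklist[item] = True
        return fresh

    @property
    def coverage(self) -> float:
        """Fraction of checklist items covered (1.0 for an empty checklist)."""
        if not self.checklist:
            return 1.0
        return sum(self.checklist.values()) / len(self.checklist)

    def gaps(self) -> list[str]:
        """Checklist items still uncovered; these drive the next queries."""
        return [item for item, done in self.checklist.items() if not done]

    def should_stop(self, min_coverage: float = 0.8, max_iterations: int = 4) -> bool:
        return self.coverage >= min_coverage or self.iterations >= max_iterations
```

The coverage fraction plays the role of the confidence score, and gaps() is what you would hand to the query generator on the next pass. The iteration cap is a backstop so a checklist item that the web simply cannot answer does not keep the loop running.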

The agents that produce genuinely useful research outputs are not the ones with the best base model. They are the ones with the tightest loop: query, evaluate, follow up, stop when done.