<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: John Rooney</title>
    <description>The latest articles on Forem by John Rooney (@john_rooney).</description>
    <link>https://forem.com/john_rooney</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3129763%2F0afdc059-5402-40b8-9c98-07d6831b0839.jpg</url>
      <title>Forem: John Rooney</title>
      <link>https://forem.com/john_rooney</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/john_rooney"/>
    <language>en</language>
    <item>
      <title>Writing production-ready Scrapy spiders with opencode</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 08 Apr 2026 09:19:39 +0000</pubDate>
      <link>https://forem.com/extractdata/writing-production-ready-scrapy-spiders-with-opencode-ea2</link>
      <guid>https://forem.com/extractdata/writing-production-ready-scrapy-spiders-with-opencode-ea2</guid>
      <description>&lt;p&gt;AI-enabled code editors can now conjure scraping code on command. But anyone who has used a generic coding agent to build a spider knows what comes next: a plausible-looking file that falls apart the moment it hits a real website. The selectors are fragile, the error handling is missing, and the structure ignores everything Scrapy actually expects from production code.&lt;/p&gt;

&lt;p&gt;The problem is not the AI. It's the prompts, the context, and knowing where to let the agent drive and where to stay in control. This article walks through using &lt;a href="https://opencode.ai" rel="noopener noreferrer"&gt;opencode&lt;/a&gt; to build Scrapy spiders that are actually deployable, covering setup, the prompts that work, and the pitfalls that will burn you if you are not careful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx2ycq5iahc2vpvushw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx2ycq5iahc2vpvushw9.png" alt="building spiders with opencode"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why opencode works well for scraping projects
&lt;/h2&gt;

&lt;p&gt;Most AI coding agents are designed around general-purpose software projects. opencode is different in a few important ways: it is terminal-native, model-agnostic, and designed to operate inside your actual working directory. It reads your project, understands your file structure, and writes code into the files that already exist rather than pasting snippets into a chat window.&lt;/p&gt;

&lt;p&gt;For Scrapy projects specifically, this matters. A spider is not a standalone script. It depends on items, settings, middlewares, pipelines, and page objects. An agent that can see all of those files at once produces far better output than one operating on a blank context.&lt;/p&gt;

&lt;p&gt;opencode also supports custom commands stored as Markdown files. That means you can encode your own Scrapy conventions as reusable prompts and call them every time you start a new spider, without retyping the same context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting set up
&lt;/h2&gt;

&lt;p&gt;Install opencode with the one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://opencode.ai/install | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On macOS and Linux, the Homebrew tap gives you the fastest updates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;anomalyco/tap/opencode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows, use WSL for the best experience. The &lt;code&gt;choco install opencode&lt;/code&gt; path works, but the terminal experience is noticeably smoother inside a Linux environment.&lt;/p&gt;

&lt;p&gt;Once installed, connect your model provider. The &lt;code&gt;/connect&lt;/code&gt; command in the terminal user interface walks you through it. If you want to avoid managing API keys from multiple providers, opencode Zen gives you a curated set of pre-tested models through a single subscription at &lt;a href="https://opencode.ai/auth" rel="noopener noreferrer"&gt;opencode.ai/auth&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For scraping work, choose a model with a large context window. Spider files, page objects, items, and a sample HTML fixture can easily fill 20,000 tokens before you have written a single prompt. A model with at least a 64k-token context window is the practical minimum.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialize your Scrapy project first
&lt;/h3&gt;

&lt;p&gt;Before you open opencode, scaffold your Scrapy project as you normally would:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject myproject
&lt;span class="nb"&gt;cd &lt;/span&gt;myproject
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then initialize opencode inside the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;opencode init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an &lt;code&gt;AGENTS.md&lt;/code&gt; file. Commit it. opencode reads this file on every session to understand how your project is structured. Fill it with the conventions your project follows: which item classes exist, which middlewares are active, whether you are using &lt;a href="https://scrapy-poet.readthedocs.io/" rel="noopener noreferrer"&gt;scrapy-poet&lt;/a&gt; page objects, and which version of &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; or other HTTP backends you are using. The more context &lt;code&gt;AGENTS.md&lt;/code&gt; carries, the less you repeat yourself in prompts.&lt;/p&gt;

&lt;p&gt;A minimal &lt;code&gt;AGENTS.md&lt;/code&gt; for a Scrapy project looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project conventions&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Python 3.12, Scrapy 2.12
&lt;span class="p"&gt;-&lt;/span&gt; All spiders use scrapy-poet page objects (never parse in the spider class itself)
&lt;span class="p"&gt;-&lt;/span&gt; Item classes are defined in items.py using dataclasses
&lt;span class="p"&gt;-&lt;/span&gt; Zyte API is configured via scrapy-zyte-api; ZYTE_API_KEY is in .env
&lt;span class="p"&gt;-&lt;/span&gt; Settings live in settings.py; never hardcode values in spider files
&lt;span class="p"&gt;-&lt;/span&gt; All spiders output to JSON Lines via FEEDS setting
&lt;span class="p"&gt;-&lt;/span&gt; Test fixtures live in tests/fixtures/ as .html files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The prompts that actually work
&lt;/h2&gt;

&lt;p&gt;Generic prompts produce generic code. The prompts below are tested patterns that produce Scrapy-idiomatic output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting a new spider
&lt;/h3&gt;

&lt;p&gt;The most common mistake is asking opencode to "write a spider for X." That produces a working script, not a Scrapy spider. Be specific about structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Create&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Scrapy&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;books&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toscrape&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Uses&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;poet&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt; &lt;span class="n"&gt;called&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;BookDetailPage&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Extracts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;star&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Handles&lt;/span&gt; &lt;span class="n"&gt;pagination&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;following&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Stores&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Does&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;put&lt;/span&gt; &lt;span class="nb"&gt;any&lt;/span&gt; &lt;span class="n"&gt;CSS&lt;/span&gt; &lt;span class="n"&gt;selector&lt;/span&gt; &lt;span class="n"&gt;logic&lt;/span&gt; &lt;span class="n"&gt;inside&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;itself&lt;/span&gt;

&lt;span class="n"&gt;Start&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="n"&gt;objects&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spiders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;books&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The explicit constraint against putting selectors in the spider class is important. Without it, the agent will inline everything, which defeats scrapy-poet's purpose and makes the code harder to test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Asking for resilient selectors
&lt;/h3&gt;

&lt;p&gt;Generated selectors are often too specific. They target a class that is only present on one layout variant, or chain through five levels of nesting that will break on the next site deploy.&lt;/p&gt;

&lt;p&gt;Prompt the agent to justify its selector choices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;&lt;span class="nt"&gt;Write&lt;/span&gt; &lt;span class="nt"&gt;the&lt;/span&gt; &lt;span class="nt"&gt;CSS&lt;/span&gt; &lt;span class="nt"&gt;selectors&lt;/span&gt; &lt;span class="nt"&gt;for&lt;/span&gt; &lt;span class="nt"&gt;BookDetailPage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;For&lt;/span&gt; &lt;span class="nt"&gt;each&lt;/span&gt; &lt;span class="nt"&gt;field&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;explain&lt;/span&gt; &lt;span class="nt"&gt;why&lt;/span&gt; &lt;span class="nt"&gt;you&lt;/span&gt; &lt;span class="nt"&gt;chose&lt;/span&gt;
&lt;span class="nt"&gt;that&lt;/span&gt; &lt;span class="nt"&gt;selector&lt;/span&gt; &lt;span class="nt"&gt;over&lt;/span&gt; &lt;span class="nt"&gt;alternatives&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;Prefer&lt;/span&gt; &lt;span class="nt"&gt;attribute-based&lt;/span&gt; &lt;span class="nt"&gt;selectors&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;like&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;itemprop&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="nt"&gt;or&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;data-&lt;/span&gt;&lt;span class="o"&gt;*])&lt;/span&gt; &lt;span class="nt"&gt;over&lt;/span&gt; &lt;span class="nt"&gt;class&lt;/span&gt; &lt;span class="nt"&gt;names&lt;/span&gt; &lt;span class="nt"&gt;where&lt;/span&gt; &lt;span class="nt"&gt;both&lt;/span&gt; &lt;span class="nt"&gt;options&lt;/span&gt; &lt;span class="nt"&gt;exist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces more defensive selectors and, more importantly, gives you enough reasoning to judge whether to accept them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding error handling
&lt;/h3&gt;

&lt;p&gt;The agent will skip error handling unless you ask for it explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Add&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="n"&gt;handling&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;BookDetailPage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;warning&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;None &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;do&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="n"&gt;star&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="n"&gt;cannot&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;around&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsing&lt;/span&gt; &lt;span class="n"&gt;fails&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Never assume the agent will add graceful degradation on its own. It optimizes for the happy path.&lt;/p&gt;
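&lt;p&gt;As a sketch of what the requested degradation looks like once generated, here are the two rules from the prompt above as standalone helpers. The helper names are illustrative, not taken from a real page object:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging
import re

logger = logging.getLogger(__name__)

# books.toscrape.com encodes ratings as a CSS class like "star-rating Three"
WORD_TO_STARS = {"Zero": 0, "One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(raw):
    """Return the price as a float, or None (with a warning) if missing."""
    if not raw:
        logger.warning("price missing from page")
        return None
    match = re.search(r"[\d.]+", raw)
    return float(match.group()) if match else None

def parse_star_rating(css_class):
    """Map a rating class to an int, defaulting to 0 when unparseable."""
    for word, stars in WORD_TO_STARS.items():
        if css_class and word in css_class.split():
            return stars
    return 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The point is not the helpers themselves but that each failure mode has an explicit, logged outcome instead of an unhandled exception.&lt;/p&gt;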

&lt;h3&gt;
  
  
  Writing tests
&lt;/h3&gt;

&lt;p&gt;opencode is genuinely useful for generating pytest fixtures and test scaffolding. Give it a concrete fixture to work from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;BookDetailPage&lt;/span&gt; &lt;span class="n"&gt;using&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;HTML&lt;/span&gt; &lt;span class="n"&gt;fixture&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt;
&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;fixtures&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;book_detail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;extracted&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;non&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="n"&gt;greater&lt;/span&gt; &lt;span class="n"&gt;than&lt;/span&gt; &lt;span class="n"&gt;zero&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;In stock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Out of stock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;star_rating&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;integer&lt;/span&gt; &lt;span class="n"&gt;between&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="n"&gt;Use&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt; &lt;span class="n"&gt;parametrize&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;testing&lt;/span&gt; &lt;span class="n"&gt;multiple&lt;/span&gt; &lt;span class="n"&gt;fixture&lt;/span&gt; &lt;span class="n"&gt;variants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pitfalls to watch for
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The agent assumes the HTML is static
&lt;/h3&gt;

&lt;p&gt;By default, any spider the agent generates will use &lt;code&gt;response.css()&lt;/code&gt; or &lt;code&gt;response.xpath()&lt;/code&gt; on raw HTML. If your target site renders content with JavaScript, those selectors return nothing. Before you run any generated spider, check whether the target page is JavaScript-rendered by viewing source in your browser. If the data you need is absent from the raw HTML, prompt the agent to use &lt;a href="https://www.zyte.com/zyte-api/headless-browser/" rel="noopener noreferrer"&gt;Zyte API's headless browser&lt;/a&gt; or a Playwright download handler instead of a plain HTTP request.&lt;/p&gt;
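&lt;p&gt;One way to automate that view-source check is to look for a value you expect to scrape in the raw HTML before writing any selectors. A minimal sketch (fetch the body with any HTTP client; the sample pages here are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def rendered_server_side(raw_html, expected_value):
    """True if the value you plan to scrape is already in the raw HTML,
    i.e. the page does not depend on JavaScript to render it."""
    return expected_value in raw_html

# A server-rendered page ships the data directly:
static_page = "&lt;html&gt;&lt;body&gt;&lt;h1&gt;A Light in the Attic&lt;/h1&gt;&lt;/body&gt;&lt;/html&gt;"

# A JS-rendered shell ships an empty mount point plus a bundle instead:
js_shell = "&lt;html&gt;&lt;body&gt;&lt;div id=app&gt;&lt;/div&gt;&lt;script src=bundle.js&gt;&lt;/script&gt;&lt;/body&gt;&lt;/html&gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the check fails on the raw body, no amount of selector tweaking will help; switch the request to a rendering backend first.&lt;/p&gt;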

&lt;h3&gt;
  
  
  Selectors written against one page break on others
&lt;/h3&gt;

&lt;p&gt;The agent writes selectors against whatever HTML you give it. If you paste a single product page, it will produce selectors that work on that product page. Run the spider against 10 or 20 URLs from the same site before treating the selectors as reliable.&lt;/p&gt;

&lt;p&gt;Ask the agent to help you validate coverage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;Here are three different product page HTML snippets from the same site (pasted below).
Identify any selectors in BookDetailPage that would fail on snippet 2 or snippet 3,
and suggest more robust alternatives.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Context window exhaustion mid-session
&lt;/h3&gt;

&lt;p&gt;Long sessions that involve large HTML files, multiple spider files, and back-and-forth debugging will eventually exhaust the model's context. When this happens, the agent starts contradicting earlier decisions or forgetting your project conventions.&lt;/p&gt;

&lt;p&gt;The fix is to keep sessions short and focused. One session per spider, or one session per refactor task. Use your &lt;code&gt;AGENTS.md&lt;/code&gt; to carry conventions across sessions rather than re-explaining them in chat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generated settings override your existing configuration
&lt;/h3&gt;

&lt;p&gt;When the agent writes setup instructions, it often suggests adding settings directly to &lt;code&gt;settings.py&lt;/code&gt;. If you already have a settings file, this can clobber existing values or introduce conflicts. Review every settings change the agent proposes before accepting it.&lt;/p&gt;
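&lt;p&gt;For example, with the JSON Lines convention from &lt;code&gt;AGENTS.md&lt;/code&gt;, &lt;code&gt;settings.py&lt;/code&gt; should keep a single &lt;code&gt;FEEDS&lt;/code&gt; dict and have any agent-proposed feeds merged into it. The path template below is illustrative; &lt;code&gt;%(name)s&lt;/code&gt; and &lt;code&gt;%(time)s&lt;/code&gt; are Scrapy's documented feed URI parameters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# settings.py (fragment): one FEEDS dict is the single source of truth.
# A second, agent-generated FEEDS assignment further down the file would
# silently replace this one, so merge new feeds into the existing dict.
FEEDS = {
    "output/%(name)s-%(time)s.jsonl": {
        "format": "jsonlines",
        "encoding": "utf8",
        "overwrite": False,
    },
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;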

&lt;h3&gt;
  
  
  The agent does not know about anti-bot measures
&lt;/h3&gt;

&lt;p&gt;opencode has no knowledge of whether a site actively blocks scrapers. It will happily generate a spider that will be blocked immediately in production. Anti-bot handling, rate limiting, and request fingerprinting are your responsibility to layer in. &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; handles the blocking and fingerprinting side; you still need to configure the integration yourself rather than expecting the agent to know it is necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful custom commands for scraping
&lt;/h2&gt;

&lt;p&gt;opencode custom commands let you encode reusable prompts as Markdown files in &lt;code&gt;~/.config/opencode/commands/&lt;/code&gt;. Here are three worth setting up for any Scrapy workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;user:new-spider&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# New scrapy-poet spider&lt;/span&gt;

Create a new Scrapy spider for the URL provided by the user.
&lt;span class="p"&gt;-&lt;/span&gt; Use scrapy-poet page objects (list page + detail page if applicable)
&lt;span class="p"&gt;-&lt;/span&gt; Put all selector logic in page objects, nothing in the spider class
&lt;span class="p"&gt;-&lt;/span&gt; Use item dataclasses from items.py (create new ones if needed)
&lt;span class="p"&gt;-&lt;/span&gt; Include pagination handling
&lt;span class="p"&gt;-&lt;/span&gt; Add logging for missing fields (warning level)
&lt;span class="p"&gt;-&lt;/span&gt; Write page objects first, then the spider

Ask the user for the target URL before starting.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;user:harden-selectors&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Harden selectors&lt;/span&gt;

Review the page objects in the current file. For each CSS or XPath selector:
&lt;span class="p"&gt;1.&lt;/span&gt; Identify whether it targets a class, ID, tag, or attribute
&lt;span class="p"&gt;2.&lt;/span&gt; If it targets a class name, suggest an attribute-based or structural alternative
&lt;span class="p"&gt;3.&lt;/span&gt; Flag any selectors that chain more than three levels deep as fragile

Output a revised version of the file with improved selectors and inline comments
explaining each change.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;user:gen-tests&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Generate pytest tests&lt;/span&gt;

Given a page object file and an HTML fixture provided by the user:
&lt;span class="p"&gt;1.&lt;/span&gt; Write a pytest test file that covers all extracted fields
&lt;span class="p"&gt;2.&lt;/span&gt; Test that required fields are non-null and the correct type
&lt;span class="p"&gt;3.&lt;/span&gt; Test that optional fields handle absence gracefully (None, not exception)
&lt;span class="p"&gt;4.&lt;/span&gt; Use parametrize if multiple fixture variants are present

Ask for the fixture file path before starting.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where opencode fits in the workflow
&lt;/h2&gt;

&lt;p&gt;Think of opencode as a fast first-draft tool, not an autonomous spider factory. The right workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scaffold the project and write &lt;code&gt;AGENTS.md&lt;/code&gt; manually&lt;/li&gt;
&lt;li&gt;Use opencode to generate page objects and the spider skeleton&lt;/li&gt;
&lt;li&gt;Review every selector by hand before trusting it&lt;/li&gt;
&lt;li&gt;Run the spider against a sample of real URLs and inspect the output&lt;/li&gt;
&lt;li&gt;Use opencode to patch failures and write tests&lt;/li&gt;
&lt;li&gt;Handle anti-bot, rate limiting, and deployment yourself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent saves the most time on the repetitive structural work: boilerplate item classes, pagination logic, field extraction scaffolding, and test stubs. The judgment calls around which selectors are robust, whether a site is JavaScript-rendered, and how to handle blocking remain entirely in your hands.&lt;/p&gt;

&lt;p&gt;That division of labor is what makes this approach work at production scale rather than just for prototypes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;Install opencode, initialize it in an existing Scrapy project, and start with the &lt;code&gt;user:new-spider&lt;/code&gt; custom command above. Pick a publicly accessible, static site like &lt;a href="https://books.toscrape.com" rel="noopener noreferrer"&gt;books.toscrape.com&lt;/a&gt; to test the workflow before applying it to a site with more complexity.&lt;/p&gt;

&lt;p&gt;For the JavaScript-rendered sites and anything with active anti-bot measures, pair opencode's code generation with &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; to handle the access layer. You can &lt;a href="https://app.zyte.com/account/signup/zyteapi" rel="noopener noreferrer"&gt;sign up for a free trial&lt;/a&gt; and have a working integration running in minutes. The &lt;a href="https://docs.zyte.com/" rel="noopener noreferrer"&gt;Zyte documentation&lt;/a&gt; covers the scrapy-zyte-api configuration in detail.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>ai</category>
      <category>opencode</category>
      <category>scrapy</category>
    </item>
    <item>
      <title>Build Scrapy spiders in 23.54 seconds with this free Claude skill</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:50:39 +0000</pubDate>
      <link>https://forem.com/extractdata/build-scrapy-spiders-in-2354-seconds-with-this-free-claude-skill-50em</link>
      <guid>https://forem.com/extractdata/build-scrapy-spiders-in-2354-seconds-with-this-free-claude-skill-50em</guid>
      <description>&lt;p&gt;I built a Claude skill that generates &lt;a href="https://docs.scrapy.org/en/latest" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; spiders in under 30 seconds — ready to run, ready to extract good data. In this post I'll walk through what I built, the design decisions behind it, and where I think it can go next.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;The skill takes a single input: a category or product listing URL. From there, Claude generates a complete, runnable Scrapy spider as a single Python script. No project setup, no configuration files, no boilerplate to write. Just a script you can run immediately.&lt;/p&gt;

&lt;p&gt;Here's what that looks like in practice. I opened Claude Code in an empty folder with dependencies installed, activated the skill, and said: "Create a spider for this site" — and pasted a URL.&lt;/p&gt;

&lt;p&gt;Within seconds, the script was generated. I ran it, watched the products roll in, piped the output through &lt;code&gt;jq&lt;/code&gt;, and had clean structured product data. Start to finish: under a minute.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/2pQD412kJIw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a single-file script, not a full Scrapy project?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.scrapy.org/en/latest/topics/practices.html" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; is usually a full project — multiple files, lots of moving parts, a proper setup process. Running it from a script instead is generally discouraged for production work, but for this use case it's actually the right call.&lt;/p&gt;

&lt;p&gt;The goal here is what I'd call pump-and-dump scraping: give Claude a URL, get a spider, run it for a couple of days, move on. It's not designed to scrape millions of products every day for years. For that kind of scale you need proper infrastructure, robust monitoring, and serious logging. This isn't that — and that's intentional.&lt;/p&gt;

&lt;p&gt;What you do get, even in the single-file approach, is almost everything Scrapy offers: middleware, automatic retries, and concurrency handling. You'd have to build all of that yourself with a plain &lt;code&gt;requests&lt;/code&gt; script. Scrapy gives it to you for free, even when running from a script.&lt;/p&gt;
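&lt;p&gt;The single-file pattern itself is plain Scrapy: a &lt;code&gt;CrawlerProcess&lt;/code&gt; started from the script's main block, as described in the "run Scrapy from a script" docs. A minimal skeleton, with a placeholder spider rather than the skill's actual output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import scrapy
from scrapy.crawler import CrawlerProcess

class SingleFileSpider(scrapy.Spider):
    # Placeholder spider: the skill generates the real extraction logic.
    name = "single_file"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for title in response.css("article.product_pod h3 a::attr(title)").getall():
            yield {"title": title}

if __name__ == "__main__":
    # CrawlerProcess gives the script Scrapy's retries, middleware,
    # and concurrency handling without a project scaffold.
    process = CrawlerProcess(settings={
        "FEEDS": {"items.jsonl": {"format": "jsonlines"}},
        "LOG_LEVEL": "WARNING",
    })
    process.crawl(SingleFileSpider)
    process.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Everything lives in one file, but the crawl still runs through Scrapy's full engine.&lt;/p&gt;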

&lt;h2&gt;
  
  
  The key design decision: AI extraction
&lt;/h2&gt;

&lt;p&gt;The other major call I made was to lean entirely on &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;'s AI extraction rather than generating CSS or XPath selectors.&lt;/p&gt;

&lt;p&gt;Specifically, the skill uses two extraction types chained together: &lt;code&gt;productNavigation&lt;/code&gt; on the category or listing page, which returns product URLs and the next page link, and &lt;code&gt;product&lt;/code&gt; on each product URL, which returns structured product data including name, price, availability, brand, SKU (stock keeping unit), description, images, and more.&lt;/p&gt;

&lt;p&gt;This means the spider doesn't need to know anything about the structure of the site it's crawling. There are no selectors to generate, no schema to define, no user confirmation step. The AI on &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte's&lt;/a&gt; end handles all of that. It does cost slightly more than a raw HTTP request, but given how little time it takes to go from URL to working spider, the trade-off makes sense.&lt;/p&gt;

&lt;p&gt;I've hardcoded &lt;code&gt;httpResponseBody&lt;/code&gt; as the extraction source — it's faster and more cost-efficient than browser rendering. If a site is JavaScript-heavy and you're not getting the data you need, you can switch to &lt;code&gt;browserHtml&lt;/code&gt; with a one-line change. The spider logs a warning to remind you of this.&lt;/p&gt;
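&lt;p&gt;As a rough sketch, the request parameters for the two chained extraction types look like this. Field names follow Zyte API's documented schema; the URLs are placeholders:&lt;/p&gt;

```python
# Roughly the Zyte API request parameters for the two chained extraction
# types. URLs are placeholders; EXTRACT_FROM is the one-line switch.
EXTRACT_FROM = "httpResponseBody"  # change to "browserHtml" for JS-heavy sites

navigation_request = {
    "url": "https://example.com/category/shoes",
    "productNavigation": True,
    "productNavigationOptions": {"extractFrom": EXTRACT_FROM},
}

product_request = {
    "url": "https://example.com/product/123",
    "product": True,
    "productOptions": {"extractFrom": EXTRACT_FROM},
}
```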

&lt;h2&gt;
  
  
  The use case is deliberately narrow
&lt;/h2&gt;

&lt;p&gt;This skill is designed for e-commerce sites, and only e-commerce sites. That's not a limitation I stumbled into — it's a feature.&lt;/p&gt;

&lt;p&gt;Because the scope is narrow, the spider structure is simple and predictable: category pages with pagination, product links, and detail pages. &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;'s &lt;code&gt;productNavigation&lt;/code&gt; and &lt;code&gt;product&lt;/code&gt; extraction types handle this reliably. Widening the scope to arbitrary crawling would require a lot more of Scrapy's machinery and would quickly exceed what makes sense for a lightweight script like this.&lt;/p&gt;

&lt;p&gt;What it doesn't do: deep subcategory crawling, link discovery, or full-site crawls. If a page renders all its products without pagination, that works fine — the next page just returns nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging and output
&lt;/h2&gt;

&lt;p&gt;I replaced &lt;a href="https://docs.scrapy.org/en/latest/topics/practices.html" rel="noopener noreferrer"&gt;Scrapy's&lt;/a&gt; default logging with Rich logging, which gives cleaner terminal output. Scrapy's logs are verbose in ways that aren't useful when you're running a short-lived script — I wanted something concise enough that if something went wrong, it would be obvious at a glance.&lt;/p&gt;

&lt;p&gt;Output goes to a &lt;code&gt;.jsonl&lt;/code&gt; file named after the spider, alongside a plain &lt;code&gt;.log&lt;/code&gt; file. Both are derived from the spider name, which is itself derived from the domain. Run &lt;code&gt;example_com.py&lt;/code&gt;, get &lt;code&gt;example_com.jsonl&lt;/code&gt; and &lt;code&gt;example_com.log&lt;/code&gt;.&lt;/p&gt;
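&lt;p&gt;One plausible way to derive those names from the start URL (the skill's actual helper may differ):&lt;/p&gt;

```python
from urllib.parse import urlparse

def spider_name(url: str) -> str:
    # "https://www.example.com/shop" -> "example_com"
    host = urlparse(url).netloc
    if host.startswith("www."):
        host = host[4:]
    return host.replace(".", "_").replace("-", "_")

name = spider_name("https://www.example.com/shop")
jsonl_file, log_file = f"{name}.jsonl", f"{name}.log"
```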

&lt;h2&gt;
  
  
  Where this goes next
&lt;/h2&gt;

&lt;p&gt;The immediate next step I have in mind is selector-based extraction as an alternative path — useful for sites where the AI extraction isn't quite right, or where you want more control over exactly what gets pulled.&lt;/p&gt;

&lt;p&gt;The longer-term vision is running this fully agentically. URLs get submitted somewhere — a queue, a database table, a form — an agent picks them up, builds the spider, and maybe runs a quick validation. The spider then goes into a pool to be run on a schedule, and data lands in a database rather than a flat file. Give Claude access to a virtual private server (VPS) via terminal and most of this is achievable without much extra infrastructure. The skill is already the hard part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Download the skill
&lt;/h2&gt;

&lt;p&gt;The skill is free to download and use. It's a single &lt;code&gt;.skill&lt;/code&gt; file you can install directly into Claude Code. You'll need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy scrapy-zyte-api rich
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.zyte.com/blog/scrapy-in-2026-modern-async-crawling/" rel="noopener noreferrer"&gt;Scrapy 2.13 &lt;/a&gt;or above is required for &lt;code&gt;AsyncCrawlerProcess&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The link to the repo and the skill download are in the video description, and &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;here&lt;/a&gt;. If you've built something similar, or have thoughts on the design decisions — especially around the extraction approach or the logging setup — I'd love to hear it in the comments. GitHub links to your own scrapers are very welcome too.&lt;/p&gt;

&lt;p&gt;If you're interested in more agentic scraping patterns, I also built a Claude skill that helps spiders recover from excessive bans — you can watch that video &lt;a href="https://youtu.be/bF24BLZWlOk" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>scrapy</category>
      <category>claudeskills</category>
      <category>webscraping</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a Self-Healing Web Scraper to Auto-Solve 403s</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Mon, 16 Mar 2026 11:09:31 +0000</pubDate>
      <link>https://forem.com/extractdata/i-built-a-self-healing-web-scraper-to-auto-solve-403s-1jg1</link>
      <guid>https://forem.com/extractdata/i-built-a-self-healing-web-scraper-to-auto-solve-403s-1jg1</guid>
      <description>&lt;p&gt;Web scraping has a recurring enemy: the 403. Sites add bot detection, anti-scraping tools update their challenges, and scrapers that worked fine last week start silently failing. The usual fix is manual — check the logs, diagnose the cause, update the config, redeploy. I wanted to see if an agent could handle that loop instead.&lt;/p&gt;

&lt;p&gt;So I built a self-healing scraper. After each crawl, a Claude-powered agent reads the failure logs, probes the broken domains with escalating fetch strategies, and rewrites the config automatically. By the next run, it's already fixed itself.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/bF24BLZWlOk"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The project has two parts: a scraper and a self-healing agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  The scraper
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;main.py&lt;/code&gt; is a straightforward Python scraper driven entirely by a &lt;code&gt;config.json&lt;/code&gt; file. Each domain entry tells the scraper which URLs to fetch and how to fetch them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"zyte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"browser_html"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"https://www.bookstocsrape.co.uk/products/..."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are three fetch modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct&lt;/strong&gt; — a plain &lt;code&gt;requests.get()&lt;/code&gt;. Fast, free, works for sites that don't block bots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; (&lt;code&gt;httpResponseBody&lt;/code&gt;)&lt;/strong&gt; — routes the request through Zyte's residential proxy network. Good for sites that block datacenter IPs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; (&lt;code&gt;browserHtml&lt;/code&gt;)&lt;/strong&gt; — spins up a real browser via Zyte, executes JavaScript, and returns the fully-rendered DOM. Required for sites using JS-based bot challenges.&lt;/li&gt;
&lt;/ul&gt;
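&lt;p&gt;The dispatch between those modes reduces to a few lines of config inspection. This is a sketch of the logic, not the real &lt;code&gt;main.py&lt;/code&gt;:&lt;/p&gt;

```python
def fetch_mode(entry: dict) -> str:
    # Escalation order: direct -> httpResponseBody -> browserHtml.
    # The agent flips these flags in config.json; the scraper just obeys.
    if not entry.get("zyte"):
        return "direct"
    if entry.get("browser_html"):
        return "browserHtml"
    return "httpResponseBody"
```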

&lt;p&gt;Every request is logged to &lt;code&gt;scraper.log&lt;/code&gt; in the same format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-03-14 09:12:01 url=https://... domain_id=scan status=200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a request throws any exception, it's recorded as a 403. That keeps the log clean and gives the agent a consistent signal to act on.&lt;/p&gt;
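&lt;p&gt;A sketch of that logging convention, including the collapse of exceptions into a 403 (the real scraper's internals may differ):&lt;/p&gt;

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)

def log_fetch(url: str, domain_id: str, fetch) -> int:
    try:
        status = fetch(url)
    except Exception:
        # Any failure is normalised to a 403: one consistent signal for the agent
        status = 403
    logging.info("url=%s domain_id=%s status=%s", url, domain_id, status)
    return status

status = log_fetch("https://example.com/", "books", lambda u: 200)
```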

&lt;h3&gt;
  
  
  The self-healing agent
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;agent.py&lt;/code&gt; is a Claude-powered agent that runs after each crawl. It uses the &lt;a href="https://github.com/anthropics/claude-agent-sdk-python" rel="noopener noreferrer"&gt;Claude Agent SDK&lt;/a&gt; and has access to three tools: &lt;code&gt;Read&lt;/code&gt;, &lt;code&gt;Bash&lt;/code&gt;, and &lt;code&gt;Edit&lt;/code&gt; — enough to operate completely autonomously.&lt;/p&gt;

&lt;p&gt;The agent works through a staged process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read the log&lt;/strong&gt; — finds every domain that returned a 403&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-reference the config&lt;/strong&gt; — skips domains already configured to use Zyte&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 1 probe&lt;/strong&gt; — uses the &lt;code&gt;zyte-api&lt;/code&gt; CLI to fetch one URL per failing domain with &lt;code&gt;httpResponseBody&lt;/code&gt;, then inspects the page &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Challenge detection&lt;/strong&gt; — if the title contains phrases like &lt;em&gt;"Just a moment"&lt;/em&gt;, &lt;em&gt;"Checking your browser"&lt;/em&gt;, or &lt;em&gt;"Verifying you are human"&lt;/em&gt;, the page is flagged as a bot challenge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 2 probe&lt;/strong&gt; — challenge pages are re-probed using &lt;code&gt;browserHtml&lt;/code&gt;, which runs a real browser to bypass JS-based detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config update&lt;/strong&gt; — the agent edits &lt;code&gt;config.json&lt;/code&gt; directly, setting &lt;code&gt;zyte: true&lt;/code&gt; and/or &lt;code&gt;browser_html: true&lt;/code&gt; for domains that now work&lt;/li&gt;
&lt;/ol&gt;
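&lt;p&gt;The challenge-detection step boils down to a title check, roughly:&lt;/p&gt;

```python
CHALLENGE_PHRASES = (
    "just a moment",
    "checking your browser",
    "verifying you are human",
)

def is_bot_challenge(title: str) -> bool:
    # Flag pages whose <title> matches known bot-challenge phrasing;
    # flagged domains get re-probed with browserHtml in stage 2.
    t = title.lower()
    return any(phrase in t for phrase in CHALLENGE_PHRASES)
```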

&lt;p&gt;The next crawl automatically uses the right fetch strategy. No manual intervention needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Config-driven, not code-driven
&lt;/h3&gt;

&lt;p&gt;Everything lives in &lt;code&gt;config.json&lt;/code&gt;. Adding a new domain is a one-liner, and the scraper doesn't need to know anything about individual sites — it just reads the config and follows instructions. The agent writes to the same file, so the loop closes itself naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graduated fetch strategy
&lt;/h3&gt;

&lt;p&gt;Not every site needs an expensive browser render. By escalating from direct to &lt;code&gt;httpResponseBody&lt;/code&gt; to &lt;a href="https://www.zyte.com/zyte-api/headless-browser/" rel="noopener noreferrer"&gt;&lt;code&gt;browserHtml&lt;/code&gt;&lt;/a&gt; only when necessary, I keep costs manageable. Browser renders are slower and consume more API credits — reserving them for sites that actually need them makes a meaningful difference at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Letting the agent handle the heuristics
&lt;/h3&gt;

&lt;p&gt;The challenge detection logic — matching titles against known bot-detection phrases — is exactly the kind of fuzzy heuristic that's tedious to maintain as code but natural for a language model to reason about. Claude also handles edge cases gracefully: if the &lt;code&gt;zyte-api&lt;/code&gt; CLI isn't installed, if the log is empty, if a domain is already correctly configured. A rule-based script would need explicit handling for every one of those scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  The limitations
&lt;/h2&gt;

&lt;p&gt;It's worth being honest about where this approach falls short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's reactive, not proactive.&lt;/strong&gt; The agent only runs after a failed crawl. If a site starts blocking mid-run, those URLs fail silently until the next cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Title-based detection is fragile.&lt;/strong&gt; Most bot-challenge pages say &lt;em&gt;"Just a moment…"&lt;/em&gt; — but a legitimate site could theoretically use that phrase. A false positive would cause the scraper to wastefully use browser rendering where it isn't needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One URL per domain.&lt;/strong&gt; The agent probes only the first failing URL for each domain. Different URL patterns on the same domain can have different bot-detection behaviour, which this doesn't account for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No rollback.&lt;/strong&gt; Once the config is updated, there's no way to detect if a Zyte setting later stops working and revert it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost opacity.&lt;/strong&gt; The scraper logs HTTP status codes, not &lt;a href="https://www.zyte.com/pricing/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; credit consumption. There's no visibility into what each domain actually costs to fetch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I'd take it next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Smarter challenge detection.&lt;/strong&gt; Rather than keyword-matching on the title, the agent could read the full page HTML and make a more nuanced call — is this a product page, a login wall, or a soft block with a CAPTCHA? Each requires a different response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive monitoring.&lt;/strong&gt; A lightweight probe running daily against each configured domain, independent of the main crawl, would let the agent update the config &lt;em&gt;before&lt;/em&gt; a full scrape run hits a known-bad configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-URL config.&lt;/strong&gt; Right now &lt;code&gt;zyte&lt;/code&gt; and &lt;code&gt;browser_html&lt;/code&gt; are set at the domain level. Some sites serve static product pages on one path and JS-rendered category pages on another — granular per-URL settings would handle that cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured data extraction.&lt;/strong&gt; Right now &lt;code&gt;parse_page&lt;/code&gt; only pulls the page title. The natural next step is structured product extraction — price, availability, name, images — either via CSS selectors in the config or Zyte's &lt;code&gt;product&lt;/code&gt; extraction type, which uses ML models to parse product data from any page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent parallelism.&lt;/strong&gt; The self-healing loop is currently a single agent. As the config grows, a coordinator could spawn one subagent per failing domain, each running its own probe pipeline concurrently. The Claude Agent SDK supports subagents natively, so this would be a relatively small change.&lt;/p&gt;

&lt;p&gt;The core idea is simple: a scraper that observes its own failures and reconfigures itself. What I found interesting about building it wasn't the scraping itself — it was seeing how little scaffolding the agent actually needed. Three tools, a clear task, and it handles the diagnostic work that would otherwise fall to me.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>agents</category>
      <category>ai</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How I get Claude to build HTML parsing code the way I want it</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 11 Mar 2026 07:13:06 +0000</pubDate>
      <link>https://forem.com/extractdata/html-parsing-with-claude-skills-extracting-structured-data-from-raw-html-1ai0</link>
      <guid>https://forem.com/extractdata/html-parsing-with-claude-skills-extracting-structured-data-from-raw-html-1ai0</guid>
      <description>&lt;p&gt;Getting HTML off a page is only the first step. Once you have it, the real work begins: pulling out the data that actually matters — product names, prices, ratings, specifications — in a clean, structured format you can actually do something with.&lt;/p&gt;

&lt;p&gt;That's what the parser skill is for. If you haven't read the introduction to skills in our fetcher post, it's worth a quick look first. But the short version is this: a skill is a &lt;code&gt;SKILL.md&lt;/code&gt; file that gives Claude precise, reusable instructions for using a specific tool. The parser skill is one of three that together form a complete web scraping pipeline.&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/zytelabs" rel="noopener noreferrer"&gt;
        zytelabs
      &lt;/a&gt; / &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;
        claude-webscraping-skills
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;claude-webscraping-skills&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;A collection of claude skills and other tools to assist your web-scraping needs.&lt;/p&gt;
&lt;p&gt;video explanations:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://youtu.be/HH0Q9OfKLu0" rel="nofollow noopener noreferrer"&gt;https://youtu.be/HH0Q9OfKLu0&lt;/a&gt;
&lt;a href="https://youtu.be/P2HhnFRXm-I" rel="nofollow noopener noreferrer"&gt;https://youtu.be/P2HhnFRXm-I&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Other reading:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="nofollow noopener noreferrer"&gt;https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/&lt;/a&gt;
&lt;a href="https://www.zyte.com/blog/supercharging-web-scraping-with-claude-skills/" rel="nofollow noopener noreferrer"&gt;https://www.zyte.com/blog/supercharging-web-scraping-with-claude-skills/&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;Other claude tools for web scraping&lt;/h4&gt;

&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/apscrapes/zyte-fetch-page-content-mcp-server" rel="noopener noreferrer"&gt;zyte-fetch-page-content-mcp-server&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;A Model Context Protocol (MCP) server that runs locally using docker desktop mcp-toolkit and helps you extract clean, LLM-friendly content from any webpage using the Zyte API. Perfect for AI assistants that need to read and understand web content. by &lt;a href="https://github.com/apscrapes" rel="noopener noreferrer"&gt;Ayan Pahwa&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start="2"&gt;
&lt;li&gt;&lt;a href="https://joshuaodmark.com/blog/improve-claude-code-webfetch-with-zyte-api" rel="nofollow noopener noreferrer"&gt;Improve Claude Code WebFetch with Zyte API&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;When Claude encounters a WebFetch failure, it reads the CLAUDE.md instructions and makes a curl request to the Zyte API endpoint. The API returns base64-encoded HTML, which Claude decodes and processes just like it would with a normal WebFetch response. by &lt;a href="https://joshuaodmark.com/" rel="nofollow noopener noreferrer"&gt;Joshua Odmark&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start="3"&gt;
&lt;li&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="nofollow noopener noreferrer"&gt;Claude skills vs MCP vs Web Scraping CoPilot&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  What is a skill?
&lt;/h2&gt;

&lt;p&gt;A skill is a small markdown file that tells Claude how to use a specific script or tool — what it does, when to use it, and step-by-step how to run it. Claude reads the file and follows the instructions as part of a broader workflow, with no manual intervention required.&lt;/p&gt;

&lt;p&gt;Skills are composable by design. The fetcher skill hands raw HTML to the parser skill, which hands structured JSON to the compare skill. Each one does one job well, and they're built to work together.&lt;/p&gt;

&lt;p&gt;The parser skill's front matter sets out its purpose immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;parser&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;structured&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;HTML.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Tries&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;JSON-LD&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;via"&lt;/span&gt;
&lt;span class="s"&gt;Extruct first, falls back to CSS selectors via Parsel.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Two methods, one fallback. That single description line captures the entire logic of the skill.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/HH0Q9OfKLu0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;
&lt;h2&gt;
  
  
  What the parser skill does
&lt;/h2&gt;

&lt;p&gt;The parser skill takes raw HTML as input and returns a structured JSON object. It uses two extraction methods in sequence, trying the more reliable one first and falling back to the more flexible one if needed.&lt;/p&gt;

&lt;p&gt;The primary method uses Extruct to find JSON-LD data embedded in the page. JSON-LD is a structured data format that many modern sites include in their HTML specifically to make their content machine-readable — it's used for search engine optimisation and data portability. When it's present, Extruct can read it cleanly and reliably, with no need to write or maintain selectors.&lt;/p&gt;

&lt;p&gt;If no usable JSON-LD is found, the skill falls back to Parsel, which uses CSS selectors to locate data heuristically across the page. This is more flexible but inherently tied to the page's visual structure, which can change.&lt;/p&gt;
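&lt;p&gt;The control flow is a straightforward try-then-fall-back. Here's a stdlib-only approximation; the real script uses Extruct and Parsel, so treat this as a sketch of the logic, not the implementation:&lt;/p&gt;

```python
import json
import re

def parse_product(html: str) -> dict:
    # Primary path: JSON-LD embedded in a <script type="application/ld+json"> tag
    m = re.search(
        r'<script type="application/ld\+json">\s*(\{.*?\})\s*</script>',
        html, re.S,
    )
    if m:
        try:
            return {"method": "extruct", "data": json.loads(m.group(1))}
        except json.JSONDecodeError:
            pass
    # Fallback path: heuristic selectors (Parsel in the real script)
    title = re.search(r"<title>(.*?)</title>", html, re.S)
    return {
        "method": "parsel",
        "data": {"name": title.group(1).strip() if title else None},
    }
```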
&lt;h2&gt;
  
  
  When to use it
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## When to use&lt;/span&gt;
Use this skill when you have raw HTML and need to extract structured data from
it — product details, prices, specs, ratings, or any page content.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In practice, that means the parser skill is almost always the second step in a pipeline — running immediately after the fetcher skill has retrieved your HTML. It works with any page type, and handles the most common product fields out of the box.&lt;/p&gt;
&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Save the HTML to a temporary file &lt;span class="sb"&gt;`page.html`&lt;/span&gt;
&lt;span class="p"&gt;2.&lt;/span&gt; Run &lt;span class="sb"&gt;`parser.py`&lt;/span&gt; against it:
   python parser.py page.html
&lt;span class="p"&gt;
3.&lt;/span&gt; The script outputs a JSON object. Check the &lt;span class="sb"&gt;`method`&lt;/span&gt; field:
&lt;span class="p"&gt;   -&lt;/span&gt; "extruct" — clean structured data was found, use it directly
&lt;span class="p"&gt;   -&lt;/span&gt; "parsel" — fell back to CSS selectors, review fields for completeness
&lt;span class="p"&gt;4.&lt;/span&gt; If key fields are missing from the Parsel output, ask the user which fields
   they need and re-run with --fields:
   python parser.py page.html --fields "price,rating,brand"
&lt;span class="p"&gt;
5.&lt;/span&gt; Return the parsed JSON to the conversation for use in the Compare skill.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;method&lt;/code&gt; field in the output is particularly useful. It tells you immediately how the data was extracted and how much trust to place in it. An &lt;code&gt;"extruct"&lt;/code&gt; result is clean and stable. A &lt;code&gt;"parsel"&lt;/code&gt; result is worth reviewing, especially if you're working with an unusual page layout.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--fields&lt;/code&gt; flag is a practical escape hatch. Rather than requiring you to dig into the script when key data is missing, it lets you specify exactly what you need and re-run — a much more efficient loop.&lt;/p&gt;
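&lt;p&gt;A minimal sketch of how that CLI shape might be wired; the actual &lt;code&gt;parser.py&lt;/code&gt; may structure it differently:&lt;/p&gt;

```python
import argparse

def build_cli() -> argparse.ArgumentParser:
    # Hypothetical mirror of the parser.py command-line interface
    cli = argparse.ArgumentParser(description="Extract structured data from an HTML file")
    cli.add_argument("html_file", help="path to the saved page, e.g. page.html")
    cli.add_argument("--fields", help='comma-separated fields, e.g. "price,rating,brand"')
    return cli

args = build_cli().parse_args(["page.html", "--fields", "price,rating,brand"])
wanted = args.fields.split(",") if args.fields else None
```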
&lt;h2&gt;
  
  
  Why prefer Extruct?
&lt;/h2&gt;

&lt;p&gt;The notes section of the skill file makes this explicit:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Notes&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Always prefer the Extruct path — it is more stable and requires no maintenance
&lt;span class="p"&gt;-&lt;/span&gt; Parsel selectors are generated heuristically and may need adjustment for
  unusual page layouts
&lt;span class="p"&gt;-&lt;/span&gt; Run once per page; pass all outputs together into the Compare skill
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Parsel selectors break when sites redesign. JSON-LD, by contrast, is structured data the site publishes independently of its visual layout. A site can completely overhaul its design and its JSON-LD will often remain untouched. That stability is worth prioritising wherever possible.&lt;/p&gt;
&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;Once you've run the parser skill across all your target pages, you have a set of structured JSON objects ready to compare. That's where the compare skill picks up — generating tables, summaries, and side-by-side analysis from the extracted data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Do you need a skill?
&lt;/h2&gt;

&lt;p&gt;The parser skill works well when the data you need maps cleanly onto fields that Extruct or Parsel can find — product names, prices, ratings, and similar structured attributes that sites commonly expose through JSON-LD or consistent HTML patterns. For that category of task, the skill is fast to apply and requires no custom code.&lt;/p&gt;

&lt;p&gt;See our post about Skills vs MCP vs Web Scraping Copilot (our VS Code Extension):&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;zyte.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;But not every extraction problem fits that mould. If you're working with pages that don't include JSON-LD and have highly irregular layouts, Parsel's heuristic selectors may return incomplete or inconsistent results, and you'll spend time debugging field by field. In those cases, a purpose-built extraction script using Parsel or BeautifulSoup directly — with selectors you've written and tested against the specific target — will be more reliable.&lt;/p&gt;

&lt;p&gt;For larger-scale or more complex extraction work, Zyte API's automatic extraction capabilities go further still. Rather than relying on selectors at all, automatic extraction uses AI to identify and return structured data from a page without requiring you to specify fields or maintain selector logic. If you're extracting data from many different site structures, or you need extraction to keep working through site redesigns without manual intervention, that's a more robust foundation than a skill-based approach. The parser skill is best understood as a practical middle ground: fast to use, good enough for a wide range of common cases, and easy to slot into a pipeline — but not a replacement for extraction tooling built for scale or resilience.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>skills</category>
      <category>webscraping</category>
      <category>zyte</category>
    </item>
    <item>
      <title>I gave Claude access to a web scraping API</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 11 Mar 2026 07:10:43 +0000</pubDate>
      <link>https://forem.com/extractdata/the-fetcher-skill-reliable-html-scraping-with-automatic-fallback-5f02</link>
      <guid>https://forem.com/extractdata/the-fetcher-skill-reliable-html-scraping-with-automatic-fallback-5f02</guid>
      <description>&lt;p&gt;If you've worked with Claude for any length of time, you've probably noticed it can do a lot more than answer questions. With the right setup, it can take actions — running scripts, processing files, working through multi-step workflows autonomously. Skills are what make that possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a skill?
&lt;/h2&gt;

&lt;p&gt;A skill is a small, self-contained instruction set that tells Claude how to use a specific tool or script to accomplish a well-defined task. Technically, it's a markdown file — a &lt;code&gt;SKILL.md&lt;/code&gt; — that describes what a tool does, when to reach for it, and exactly how to run it. Claude reads that file and follows the instructions as part of a larger workflow.&lt;/p&gt;

&lt;p&gt;Skills are designed to be composable. Each one does one thing well, and they're built to hand off to each other. The fetcher skill retrieves HTML. The parser skill extracts data from it. The compare skill turns multiple parsed outputs into a structured comparison. Together, they form a complete scraping pipeline — and Claude orchestrates the whole thing.&lt;/p&gt;

&lt;p&gt;See our skills here: &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;https://github.com/zytelabs/claude-webscraping-skills&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The skill format looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetcher&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetches&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;HTML&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;httpx,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;automatic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Zyte&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;if&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;blocked."&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That front matter is what Claude uses to match the right skill to the right task. The description is deliberately precise: it tells Claude not just what the skill does, but how it does it, so Claude can reason about whether it's the right tool for the job.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/HH0Q9OfKLu0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  What the fetcher skill does
&lt;/h2&gt;

&lt;p&gt;The fetcher skill's job is exactly what it sounds like: given a URL, fetch the raw HTML and return it. It uses httpx as its primary HTTP client — a modern, performant Python library well suited to scraping workloads.&lt;/p&gt;

&lt;p&gt;What makes it more than a simple wrapper is the fallback logic. A significant number of sites actively block automated requests. Without a fallback, a blocked request just fails, and you're left manually diagnosing why. The fetcher skill handles this automatically. If a request comes back with a &lt;code&gt;BLOCKED&lt;/code&gt; status, it retries via Zyte API, which provides built-in unblocking. Most of the time, you get your HTML without ever needing to intervene.&lt;/p&gt;
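&lt;p&gt;The fallback amounts to a small decision tree. Here is a minimal sketch of that flow; the callables and statuses mirror the skill's behaviour, but this is not the actual &lt;code&gt;fetcher.py&lt;/code&gt;:&lt;/p&gt;

```python
# Minimal sketch of the fetcher's fallback flow (not the actual fetcher.py).
# fetch_httpx / fetch_zyte are stand-in callables returning (status, html).
def fetch_with_fallback(url, fetch_httpx, fetch_zyte):
    status, html = fetch_httpx(url)
    if status == "BLOCKED":
        # Retry through Zyte API, the equivalent of re-running with --zyte
        status, html = fetch_zyte(url)
    if status != "OK":
        return None  # surface the failure rather than silently dropping the URL
    return html
```

&lt;p&gt;The cheap client always goes first; the heavier, unblocking-capable path is only paid for when it is actually needed.&lt;/p&gt;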

&lt;h2&gt;
  
  
  When to use it
&lt;/h2&gt;

&lt;p&gt;The skill's &lt;code&gt;SKILL.md&lt;/code&gt; is explicit about this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## When to use&lt;/span&gt;
Use this skill when the user provides one or more URLs and asks you to fetch,
retrieve, scrape, or get the HTML or page content.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, that means any time you're starting a scraping or data extraction task and you have a URL to work from. It's the entry point for the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The instructions in the skill file are straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Run &lt;span class="sb"&gt;`fetcher.py`&lt;/span&gt; with the URL as an argument:
   python fetcher.py &lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;
2.&lt;/span&gt; If the script returns a successful HTML response, return the HTML to the
   conversation for use in the next step.
&lt;span class="p"&gt;3.&lt;/span&gt; If the script returns a &lt;span class="sb"&gt;`BLOCKED`&lt;/span&gt; status, re-run with the &lt;span class="sb"&gt;`--zyte`&lt;/span&gt; flag:
   python fetcher.py &lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt; --zyte
&lt;span class="p"&gt;
4.&lt;/span&gt; Inform the user if a URL could not be fetched after both attempts.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two-step process keeps things efficient. httpx is fast and lightweight, so it handles the majority of requests without needing to route through Zyte API. The fallback only kicks in when it's needed. If both attempts fail, Claude surfaces that to you clearly rather than silently moving on.&lt;/p&gt;

&lt;p&gt;For multiple URLs, the script runs once per URL — there's no batching — so Claude loops through a list sequentially.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transparency about failure
&lt;/h2&gt;

&lt;p&gt;One detail worth highlighting is the final instruction: inform the user if a URL could not be fetched after both attempts. This might seem obvious, but it reflects a design principle worth being explicit about. A skill that silently drops failed URLs would produce incomplete data downstream, and you might not notice until you're looking at a comparison table with missing rows. Surfacing failures immediately keeps the pipeline honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;The fetcher skill's output is raw HTML — exactly what the parser skill expects as its input. The two are designed to be used in sequence. Once you have the HTML, the parser skill takes over, extracting structured data through JSON-LD or CSS selectors depending on what the page contains.&lt;/p&gt;

&lt;p&gt;That handoff is documented in the skill's notes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Notes&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; For multiple URLs, run the script once per URL
&lt;span class="p"&gt;-&lt;/span&gt; Pass the raw HTML output into the Parser skill for extraction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline continues from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do you need a skill?
&lt;/h2&gt;

&lt;p&gt;Skills are a good fit when you have a well-defined, repeatable task that benefits from consistent behaviour across many runs. Fetching HTML from a URL is a clear example: the inputs and outputs are predictable, the fallback logic is always the same, and packaging that into a skill means Claude applies it reliably without you having to re-explain the process each time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="noopener noreferrer"&gt;Read our break down of Skills vs MCP vs Web Scraping Copilot here - our VS Code extension&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That said, skills aren't always the right tool. If you only need to fetch a handful of pages once, asking Claude to write a quick httpx script directly may be faster and more flexible. Similarly, if your target sites have unusual behaviour — rate limiting, JavaScript rendering, login walls, or multi-step navigation — a bespoke Scrapy spider built with Zyte API gives you far more control than a general-purpose fetch wrapper. Scrapy's middleware architecture, item pipelines, and scheduling make it better suited to large-scale or complex crawls where you need precise control over every aspect of the request cycle.&lt;/p&gt;

&lt;p&gt;The fetcher skill sits in the middle: more structured than an ad hoc script, less complex than a full Scrapy project. It's the right choice when you want Claude to handle straightforward retrieval as part of a larger automated workflow, without the overhead of setting up and maintaining a dedicated spider.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>skills</category>
      <category>webscraping</category>
      <category>zyte</category>
    </item>
    <item>
      <title>Why your Python request gets 403 Forbidden</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 04 Mar 2026 09:11:29 +0000</pubDate>
      <link>https://forem.com/extractdata/why-your-python-request-gets-403-forbidden-4h3p</link>
      <guid>https://forem.com/extractdata/why-your-python-request-gets-403-forbidden-4h3p</guid>
      <description>&lt;p&gt;If you’ve had your HTTP request blocked despite using correct headers, cookies, and clean IPs, there’s a chance you are running into one of the simplest forms of blocking, and one of the most confusing for beginners.&lt;/p&gt;

&lt;p&gt;Chances are, you will recognise the problem. You found the hidden API, and your request works perfectly in Postman... but it fails instantly within your Python code.&lt;/p&gt;

&lt;p&gt;It’s called TLS fingerprinting. But the good news is, you can solve it. In fact, when I showed this to some developers at &lt;a href="https://www.extractsummit.io/" rel="noopener noreferrer"&gt;Extract Summit&lt;/a&gt;, they couldn’t believe how straightforward it was to fix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gl264gftdhxminz531h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gl264gftdhxminz531h.png" alt="bruno request" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CAPTION: “I copied the request -&amp;gt; matching headers, cookies and IP, but it still failed?”&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Your TLS fingerprint
&lt;/h2&gt;

&lt;p&gt;Let’s start with a question. How do the servers and websites know you’ve moved from Postman to making the request in Python? What do they see that you can’t? The key is your TLS fingerprint.&lt;/p&gt;

&lt;p&gt;To use an analogy: We’ve effectively written a different name on a sticker and stuck it to our t-shirt, hoping to get past the bouncer at a bar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your nametag (headers)&lt;/strong&gt; says "Chrome."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But your t-shirt logo (TLS handshake)&lt;/strong&gt; very obviously says "Python."&lt;/p&gt;

&lt;p&gt;It’s a dead giveaway. This mismatch is spotted immediately. We need to change our t-shirt to match the nametag.&lt;/p&gt;

&lt;p&gt;To understand &lt;em&gt;how&lt;/em&gt; they spot the “logo”, we need to look at the initial &lt;strong&gt;“Client Hello”&lt;/strong&gt; packet. There are three key pieces of information exchanged here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cipher suites:&lt;/strong&gt; The encryption methods the client supports.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLS extensions:&lt;/strong&gt; Extra features (like specific elliptic curves).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key exchange algorithms:&lt;/strong&gt; How the client and server agree on a shared secret.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These details differ between clients because Python’s &lt;code&gt;requests&lt;/code&gt; library uses &lt;strong&gt;OpenSSL&lt;/strong&gt;, while Chrome uses Google's &lt;strong&gt;BoringSSL&lt;/strong&gt;. While the two share some underlying logic, their signatures are notably different. And that’s the problem.&lt;/p&gt;
&lt;h3&gt;
  
  
  OpenSSL vs. BoringSSL
&lt;/h3&gt;

&lt;p&gt;The root cause of this mismatch lies in the underlying libraries.&lt;/p&gt;

&lt;p&gt;Python’s &lt;code&gt;requests&lt;/code&gt; library relies on &lt;strong&gt;OpenSSL&lt;/strong&gt;, the standard cryptographic library found on almost every Linux server. It is robust, predictable, and remarkably consistent.&lt;/p&gt;

&lt;p&gt;Chrome, however, uses &lt;strong&gt;BoringSSL&lt;/strong&gt;, Google’s own fork of OpenSSL. BoringSSL is designed specifically for the chaotic nature of the web and it behaves very differently.&lt;/p&gt;

&lt;p&gt;The biggest giveaway between the two is a mechanism called &lt;strong&gt;GREASE&lt;/strong&gt; (Generate Random Extensions And Sustain Extensibility).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="s2"&gt;"TLS_GREASE (0xFAFA)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="err"&gt;....&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chrome (BoringSSL) intentionally inserts random, garbage values into the TLS handshake - specifically, in the cipher suites and extensions lists. It does this to "grease the joints" of the internet, ensuring that servers don't crash when they encounter unknown future parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is one of the key changes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chrome:&lt;/strong&gt; Always includes these random GREASE values (e.g., &lt;code&gt;0x0a0a&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python (OpenSSL):&lt;/strong&gt; &lt;em&gt;Never&lt;/em&gt; includes them. It only sends valid, known ciphers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, when an anti-bot system sees a handshake claiming to be "Chrome 120" but lacking these random GREASE values, it knows instantly that it is dealing with a script. It’s not just that your shirt has the wrong logo; it’s that your shirt is &lt;em&gt;too clean&lt;/em&gt;.&lt;/p&gt;
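&lt;p&gt;The check is easy to approximate. RFC 8701 reserves sixteen two-byte GREASE values of the form 0xNaNa (0x0a0a, 0x1a1a, up to 0xfafa). A toy detector in the spirit of what an anti-bot system does:&lt;/p&gt;

```python
# GREASE values (RFC 8701): 0x0a0a, 0x1a1a, ..., 0xfafa
GREASE_VALUES = {(v << 8) | v for v in range(0x0A, 0xFB, 0x10)}

def looks_greased(cipher_suites):
    """True if the offered cipher list contains a GREASE value, as Chrome's always does."""
    return any(suite in GREASE_VALUES for suite in cipher_suites)
```

&lt;p&gt;A handshake whose headers claim Chrome but whose cipher list fails this check is a strong bot signal.&lt;/p&gt;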

&lt;h2&gt;
  
  
  JA3 hash
&lt;/h2&gt;

&lt;p&gt;Anti-bot companies take all that handshake data and combine it into a single string called a &lt;strong&gt;JA3 fingerprint&lt;/strong&gt;.&lt;/p&gt;
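&lt;p&gt;The recipe itself is public: take five Client Hello fields (TLS version, ciphers, extensions, elliptic curves, point formats), dash-join the decimal values within each field, comma-join the fields, then MD5 the string. A minimal sketch of that recipe, following the original JA3 specification:&lt;/p&gt;

```python
import hashlib

# Sketch of the JA3 recipe: five Client Hello fields, dashes within a
# field, commas between fields, MD5 of the resulting string.
def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    parts = [str(tls_version)] + [
        "-".join(str(v) for v in field)
        for field in (ciphers, extensions, curves, point_formats)
    ]
    ja3_string = ",".join(parts)
    return hashlib.md5(ja3_string.encode()).hexdigest()
```

&lt;p&gt;The &lt;code&gt;ja3&lt;/code&gt; and &lt;code&gt;ja3_hash&lt;/code&gt; values reported by fingerprinting endpoints are exactly this comma-joined string and its MD5.&lt;/p&gt;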

&lt;p&gt;Salesforce invented this years ago to detect malware, but it found its way into our industry as a simple, effective way to fingerprint TLS clients. Security vendors have built databases of these fingerprints.&lt;/p&gt;

&lt;p&gt;It is relatively straightforward to identify and block &lt;em&gt;any&lt;/em&gt; request coming from Python’s default library because its JA3 hash is static and well-known.&lt;/p&gt;

&lt;p&gt;Running this code snippet yields the JSON response below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ja3_info&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tls.peet.ws/api/clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note the empty akamai_hash: requests speaks HTTP/1.1, so there is no HTTP/2 fingerprint to record&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"771,4866-4867-4865-49196-49200-49195-49199-52393-52392-49188-49192-49187-49191-159-158-107-103-255,0-11-10-16-22-2
3-49-13-43-45-51-21,29-23-30-25-24-256-257-258-259-260,0-1-2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja3_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a48c0d5f95b1ef98f560f324fd275da1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja4"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t13d1812h1_85036bcba153_375ca2c5e164"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja4_r"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"t13d1812h1_0067,006b,009e,009f,00ff,1301,1302,1303,c023,c024,c027,c028,c02b,c02c,c02f,c030,cca8,cca9_000a,000b,000
d,0016,0017,002b,002d,0031,0033_0403,0503,0603,0807,0808,0809,080a,080b,0804,0805,0806,0401,0501,0601,0303,0301,030
2,0402,0502,0602"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"akamai"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"akamai_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peetprint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"772-771|1.1|29-23-30-25-24-256-257-258-259-260|1027-1283-1539-2055-2056-2057-2058-2059-2052-2053-2054-1025-1281-15
37-771-769-770-1026-1282-1538|1||4866-4867-4865-49196-49200-49195-49199-52393-52392-49188-49192-49187-49191-159-158
-107-103-255|0-10-11-13-16-21-22-23-43-45-49-51"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peetprint_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"76017c4a71b7a055fb2a9a5f70f05112"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Putting the above JA3 hash into ja3.zone clearly shows this is a python3 request, using urllib3:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmrqcpccqc08rcm5wz34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmrqcpccqc08rcm5wz34.png" alt="JA3 Zone" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s the solution?
&lt;/h2&gt;

&lt;p&gt;As mentioned, simply changing headers and IP addresses won’t make a difference, as these are not part of the TLS handshake. We need to change the ciphers and extensions to look more like what a browser would send.&lt;/p&gt;

&lt;p&gt;The best way to achieve this in Python is to swap &lt;code&gt;requests&lt;/code&gt; for a modern, TLS-friendly library like &lt;strong&gt;curl_cffi&lt;/strong&gt; or &lt;strong&gt;rnet&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is how easy it is to switch to &lt;strong&gt;curl_cffi&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;curl_cffi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="c1"&gt;# note the impersonate argument &amp;amp; import above
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ja3_info&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tls.peet.ws/api/clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chrome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"akamai_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"52d84b11737d980aef856699f885ca86"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytpcxmljsdlf7nq02rq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytpcxmljsdlf7nq02rq3.png" alt="Our new hash" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CAPTION: Note - I searched via the akamai_hash here as the fingerprint from the JA3 hash wasn’t in this particular database.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;By adding that &lt;code&gt;impersonate&lt;/code&gt; parameter, you are effectively putting on the correct t-shirt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Make &lt;code&gt;curl_cffi&lt;/code&gt; or &lt;code&gt;rnet&lt;/code&gt; your default HTTP library in Python. This should be your first port of call before spinning up a full headless browser.&lt;/p&gt;

&lt;p&gt;A simple change (which brings benefits like async capabilities) means you don’t fall foul of TLS fingerprinting. &lt;code&gt;curl_cffi&lt;/code&gt; even has a &lt;code&gt;requests&lt;/code&gt;-like API, meaning it's often a drop-in replacement.&lt;/p&gt;

</description>
      <category>api</category>
      <category>networking</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Hybrid scraping: The architecture for the modern web</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 25 Feb 2026 17:05:36 +0000</pubDate>
      <link>https://forem.com/extractdata/hybrid-scraping-the-architecture-for-the-modern-web-4p9h</link>
      <guid>https://forem.com/extractdata/hybrid-scraping-the-architecture-for-the-modern-web-4p9h</guid>
      <description>&lt;p&gt;If you scrape the modern web, you probably know the pain of the &lt;strong&gt;JavaScript challenge&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before you can access any data, the website forces your browser to execute a snippet of JavaScript code. It calculates a result, sends it back to an endpoint for verification, and often captures extensive fingerprinting data in the process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl32q8pbypgq6l22kner6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl32q8pbypgq6l22kner6.png" alt="browser checks" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you pass this test, the server assigns you a &lt;strong&gt;session cookie&lt;/strong&gt;. This cookie acts as your "access pass." It tells the website, &lt;em&gt;"This user has passed the challenge,"&lt;/em&gt; so you don’t have to re-run the JavaScript test on every single page load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ot3jm7249g98cuwbno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ot3jm7249g98cuwbno.png" alt="devtool shows storage token" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For web scrapers, this mechanism creates a massive inefficiency.&lt;/p&gt;

&lt;p&gt;It &lt;em&gt;looks&lt;/em&gt; like you are forced to use a headless browser (like Puppeteer or Playwright) for every single request just to handle that initial check. But browsers are heavy: they are slow, and they consume massive amounts of RAM and bandwidth.&lt;/p&gt;

&lt;p&gt;Running a browser for thousands of requests can quickly become an infrastructure nightmare. You end up paying for CPU cycles just to render a page when all you wanted was the JSON payload.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution: Hybrid scraping
&lt;/h3&gt;

&lt;p&gt;The answer to this problem is a technique I’ve started calling &lt;strong&gt;hybrid scraping&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This involves using the browser &lt;em&gt;only&lt;/em&gt; to open the initial request, grab the cookie, and create a session. Once you have them, you extract that session data and hand it over to a standard, lightweight HTTP client.&lt;/p&gt;

&lt;p&gt;This architecture gives you the &lt;strong&gt;access&lt;/strong&gt; of a browser with the &lt;strong&gt;speed and efficiency&lt;/strong&gt; of a script.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing this in Python
&lt;/h2&gt;

&lt;p&gt;To build this in Python, we need two specific packages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A browser:&lt;/strong&gt; We will use &lt;strong&gt;ZenDriver&lt;/strong&gt;, a modern wrapper for headless Chrome that handles the "undetected" configuration for us.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP client:&lt;/strong&gt; We will use &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;&lt;strong&gt;rnet&lt;/strong&gt;&lt;/a&gt;, a Rust-based HTTP client for Python.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;But why &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;rnet&lt;/a&gt;?&lt;/em&gt; Well, within the initial TLS handshake where the client/server “hello” is sent, the information traded here can be fingerprinted, taking in things like the TLS version and the ciphers available for encryption. This can be hashed into a fingerprint and profiled.&lt;/p&gt;

&lt;p&gt;Python’s &lt;a href="https://github.com/psf/requests" rel="noopener noreferrer"&gt;requests&lt;/a&gt; package, which uses &lt;a href="https://docs.python.org/3/library/urllib.html" rel="noopener noreferrer"&gt;urllib&lt;/a&gt; from the standard library, has a very distinctive TLS fingerprint, containing ciphers (amongst other things) that aren’t seen in a browser. This makes it very easy to spot. Both &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;rnet&lt;/a&gt;, and other options such as &lt;a href="https://github.com/lexiforest/curl_cffi" rel="noopener noreferrer"&gt;curl-cffi&lt;/a&gt;, are able to send a TLS fingerprint similar to that of a browser. This reduces the chances of our request being blocked.&lt;/p&gt;

&lt;p&gt;Here is how we assemble the pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Load the page (The handshake)
&lt;/h3&gt;

&lt;p&gt;First, we define our browser logic. Notice that we are not trying to parse HTML here. Our only goal is to visit the site, pass the initial JavaScript challenge, and extract the session cookies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zendriver&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Use ZenDriver to launch a browser, navigate to the page, 
    and retrieve the cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Hit the homepage to trigger the check
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait briefly for the JS challenge to complete
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

    &lt;span class="c1"&gt;# Extract the cookies
&lt;/span&gt;    &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What’s happening here:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We launch the browser, visit the site, and wait just one second for the JS challenge to run (on a slow connection you may need a longer wait, or an explicit readiness check). Once we have the cookies, we call &lt;code&gt;browser.stop()&lt;/code&gt;. This is the most important line: we do not want a browser instance wasting resources when we don’t need it.&lt;/p&gt;
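&lt;p&gt;As a defensive sketch of the same launch/stop sequence (not part of the original script): wrapping the navigation in &lt;code&gt;try/finally&lt;/code&gt; guarantees the browser is stopped even if the page load or challenge wait raises. The &lt;code&gt;start&lt;/code&gt; callable is injected here purely for illustration; in the real script it would be zendriver's &lt;code&gt;zd.start&lt;/code&gt;.&lt;/p&gt;

```python
import asyncio

async def get_cookies_safe(start, url, wait=1.0):
    """Launch a browser via `start`, pass the JS challenge, return cookies.

    The try/finally ensures browser.stop() always runs, so no orphaned
    headless Chrome process is left consuming RAM after an error.
    """
    browser = await start()
    try:
        await browser.get(url)
        await asyncio.sleep(wait)  # let the JS challenge complete
        return await browser.cookies.get_all()
    finally:
        await browser.stop()
```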

&lt;h3&gt;
  
  
  Step 2: Use the cookies
&lt;/h3&gt;

&lt;p&gt;Now that we have the "access pass," we can switch to our lightweight HTTP client. We take those cookies and inject them into the rnet client headers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rnet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Emulation&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Make a fast request using RNet with the borrowed cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;referer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Format the browser cookies into a simple HTTP header string
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookie_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cookie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# We use Emulation.Chrome142 to change the TLS Fingerprint.
&lt;/span&gt;    &lt;span class="c1"&gt;# This is site dependent - but worth using
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emulation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Emulation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chrome142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/api/products?page=1&amp;amp;limit=8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What’s happening here:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;We convert the browser's cookie format into a standard header string. Note the &lt;code&gt;Emulation.Chrome142&lt;/code&gt; parameter. We are layering two techniques here: hybrid scraping (using real browser cookies) and TLS fingerprinting (using an HTTP client that mimics a modern Chrome handshake). This double-layer approach covers both the JavaScript challenge and the network-level fingerprint check.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: Many HTTP clients have a cookie jar that you could also use; for this example, sending the header directly worked perfectly).&lt;/em&gt;&lt;/p&gt;
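&lt;p&gt;The cookie-formatting glue code can be pulled out into a small standalone helper. This sketch accepts both attribute-style cookie objects (as zendriver returns) and plain dicts, which makes it reusable if you swap browser libraries; the cookie names below are made up for the example.&lt;/p&gt;

```python
from types import SimpleNamespace

def format_cookie_header(cookies):
    """Join browser cookies into a single 'name=value; name2=value2' header string."""
    parts = []
    for c in cookies:
        # Support both dict-style and attribute-style cookie objects
        name = c["name"] if isinstance(c, dict) else c.name
        value = c["value"] if isinstance(c, dict) else c.value
        parts.append(f"{name}={value}")
    return "; ".join(parts)

cookies = [SimpleNamespace(name="cf_clearance", value="abc123"),
           {"name": "session", "value": "xyz"}]
print(format_cookie_header(cookies))  # cf_clearance=abc123; session=xyz
```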

&lt;h3&gt;
  
  
  Step 3: Run the code
&lt;/h3&gt;

&lt;p&gt;Finally, we tie it together. For this demo, we use a simple argparse flag to show the difference with and without the browser cookies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="c1"&gt;# The Decision Logic
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Run the heavy browser
&lt;/span&gt;
    &lt;span class="c1"&gt;# Always run the fast HTTP client
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status Code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Body:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Get the complete script
&lt;/h2&gt;

&lt;p&gt;Want to run this yourself? We’ve put the full, copy-pasteable script (including the argument parser and imports) in the block below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv init
uv add zendriver rnet rich
&lt;span class="c"&gt;# linux/mac&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="c"&gt;# windows&lt;/span&gt;
.venv&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zendriver&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rnet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Emulation&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rich&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="k"&gt;print&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Make an HTTP GET request using rnet with the provided cookies. Cookies are sent in the headers. Note for this site we need the referer too.
    Return the Response Object.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;referer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookie_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Adjust based on the actual structure of the cookie object from zendriver
&lt;/span&gt;            &lt;span class="c1"&gt;# If it's a dict: cookie['name'], cookie['value']
&lt;/span&gt;            &lt;span class="c1"&gt;# If it's an object: cookie.name, cookie.value
&lt;/span&gt;            &lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cookie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emulation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Emulation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chrome142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/api/products?page=1&amp;amp;limit=8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Use zendriver to launch a browser, navigate to a page, and retrieve cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status Code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Body:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Make HTTP request with optional browser cookies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cookies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Set to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to launch browser and get cookies, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to skip (default: false)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pros and Cons of Hybrid Scraping
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reduces RAM usage massively compared to pure browser scraping.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Higher complexity:&lt;/strong&gt; You must manage two libraries (zendriver and rnet) and the glue code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP requests complete in milliseconds. Browsers take seconds.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;State management:&lt;/strong&gt; You need logic to handle cookie expiry. If the cookie dies, you must "wake up" the browser.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You get the verification of a real browser without the drag.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; You are debugging two points of failure: the browser's ability to solve the challenge, and the client's ability to fetch data.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
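&lt;p&gt;The "wake up the browser" state management from the table can be sketched as a small retry wrapper. The &lt;code&gt;fetch&lt;/code&gt; and &lt;code&gt;refresh_cookies&lt;/code&gt; callables are injected for illustration; in the real script they would be &lt;code&gt;http_request_rnet&lt;/code&gt; and &lt;code&gt;get_cookies&lt;/code&gt;, and the response here is modelled as a plain dict rather than an rnet response object.&lt;/p&gt;

```python
import asyncio

async def fetch_with_refresh(fetch, refresh_cookies, cookies=None, max_refresh=1):
    """Try the fast HTTP path; on a block, refresh cookies via the browser and retry.

    The heavy browser (refresh_cookies) only runs when the cheap HTTP
    request (fetch) is actually blocked, never preemptively.
    """
    for attempt in range(max_refresh + 1):
        resp = await fetch(cookies)
        if resp["status"] == 200:
            return resp
        if attempt == max_refresh:
            break  # out of refresh budget; return the blocked response
        cookies = await refresh_cookies()  # expensive browser path
    return resp
```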

&lt;h3&gt;
  
  
  &lt;strong&gt;Final thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For smaller jobs, it might be easier to just use the browser; the benefits won’t necessarily outweigh the extra complexity required.&lt;/p&gt;

&lt;p&gt;But for production pipelines, &lt;strong&gt;this approach is the standard.&lt;/strong&gt; It treats the browser as a luxury resource: used only when strictly necessary to unlock the door, so the HTTP client can do the real work. It’s this session and state management that allows you to scrape harder-to-access sites effectively and efficiently.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If building this orchestration layer yourself feels like too much overhead, this is exactly what the &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;&lt;strong&gt;Zyte API&lt;/strong&gt;&lt;/a&gt; handles internally. We manage the browser/HTTP switching logic automatically, so you just make a single request and get the data.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Stop Scraping HTML - There's a better way.</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 19:16:43 +0000</pubDate>
      <link>https://forem.com/zyte/stop-scraping-html-theres-a-better-way-34nl</link>
      <guid>https://forem.com/zyte/stop-scraping-html-theres-a-better-way-34nl</guid>
      <description>&lt;p&gt;&lt;strong&gt;The "API-First" Reverse Engineering Method&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most common mistakes I see developers make is firing up their code editor too early. They open VS Code, &lt;code&gt;pip install requests beautifulsoup4&lt;/code&gt;, and immediately start trying to parse &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;

&lt;p&gt;If you are scraping a modern e-commerce site or Single Page Application (SPA), this is the wrong approach. It’s brittle, it’s slow, and it breaks the moment the site updates its CSS.&lt;/p&gt;

&lt;p&gt;The secret to scalable scraping isn't better parsing; it's finding the API that the website uses to populate itself. Here is the exact workflow I use to turn a complex parsing job into a clean, reliable JSON pipeline.&lt;/p&gt;




&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/7nHqyTbK5K0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: The Discovery (XHR Filtering)
&lt;/h2&gt;

&lt;p&gt;Modern websites are rarely static. They typically use a "Frontend/Backend" architecture where the browser loads a skeleton page and then fetches the actual data via a background API call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your goal is to use that call.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Open Developer Tools:&lt;/strong&gt; Right-click and inspect the page, then navigate to the &lt;strong&gt;Network&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Filter the Noise:&lt;/strong&gt; Click the &lt;strong&gt;Fetch/XHR&lt;/strong&gt; filter. We don't care about CSS, images, or fonts. We only care about data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Trigger the Request:&lt;/strong&gt; Refresh the page. Watch the waterfall.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa36wcl032q8u5iy02pmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa36wcl032q8u5iy02pmk.png" alt="Find the request"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If nothing of note appears here, try different pages: trigger pagination, click buttons that load content, and watch what shows up in the request list.&lt;/p&gt;

&lt;p&gt;You are looking for requests that return JSON. They are often named intuitively, like &lt;code&gt;graphql&lt;/code&gt;, &lt;code&gt;search&lt;/code&gt;, &lt;code&gt;products&lt;/code&gt;, or &lt;code&gt;api&lt;/code&gt;. When you click "Preview" on these requests, you won't see HTML; you will see a structured object containing every piece of data you need—prices, descriptions, SKU numbers—already parsed and clean.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; Once you find a candidate URL, test it immediately in the browser console or URL bar. Try changing query parameters like &lt;code&gt;page=1&lt;/code&gt; to &lt;code&gt;page=2&lt;/code&gt;. If the JSON response changes to show the next page of products, you have found your "Golden Endpoint."&lt;/p&gt;
&lt;/blockquote&gt;
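&lt;p&gt;Once you are scripting that &lt;code&gt;page=1&lt;/code&gt; to &lt;code&gt;page=2&lt;/code&gt; check, it is just query-string editing. A minimal helper using the standard library (the URL below is hypothetical):&lt;/p&gt;

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def with_param(url, key, value):
    """Return `url` with the query parameter `key` set to `value`."""
    parts = urlsplit(url)
    qs = parse_qs(parts.query)
    qs[key] = [str(value)]
    return urlunsplit(parts._replace(query=urlencode(qs, doseq=True)))

url = "https://example.com/api/products?page=1"
print(with_param(url, "page", 2))  # https://example.com/api/products?page=2
```

&lt;p&gt;Loop this over page numbers and diff the JSON responses: if the data changes, you have found your "Golden Endpoint."&lt;/p&gt;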




&lt;h2&gt;
  
  
  Phase 2: The "Clean Room" Isolation
&lt;/h2&gt;

&lt;p&gt;Finding the endpoint is only step one. Now you need to determine the minimum viable request required to access it programmatically.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Copy as cURL:&lt;/strong&gt; Right-click the request in Chrome DevTools and select &lt;em&gt;Copy as cURL&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Import to a Client:&lt;/strong&gt; Open an API client like Bruno, Postman, or Insomnia. Import the cURL command.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Baseline Test:&lt;/strong&gt; Hit "Send." It should work perfectly because you are sending everything—every cookie, every header, and the exact session token your browser just generated.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v89ljhkswdp3t3m0qdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v89ljhkswdp3t3m0qdx.png" alt="Add the request to Bruno, Postman, or similar"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Load-Bearing" Header Game
&lt;/h3&gt;

&lt;p&gt;Efficient scrapers don't send 2KB of headers. You need to strip this down. Start unchecking headers one by one and resending the request:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Remove the Cookie header:&lt;/strong&gt; Does it break? (Usually, yes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove the Referer:&lt;/strong&gt; Does it break? (Often, yes—sites check this to ensure the request came from their own frontend).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove the User-Agent:&lt;/strong&gt; Does it break?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check the Parameters:&lt;/strong&gt; Can you change &lt;code&gt;limit=10&lt;/code&gt; to &lt;code&gt;limit=100&lt;/code&gt; to get more data in one shot?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eventually, you will be left with the "skeleton key": the absolute minimum headers required to get a &lt;code&gt;200 OK&lt;/code&gt;. Usually, this consists of a User-Agent, a Referer, and a specific Auth Token or Session Cookie.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfy16tpr6rwpsgbfvklv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfy16tpr6rwpsgbfvklv.png" alt="Headers in Bruno"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 3: The Infrastructure Trap (The "Bonded" Token)
&lt;/h2&gt;

&lt;p&gt;This is where most developers hit a wall. You take your cleaned-up request, put it into a Python script, and... &lt;code&gt;403 Forbidden&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Why? You have the right URL and the right headers.&lt;/p&gt;

&lt;p&gt;In my analysis of modern scraping targets, I found that API endpoints increasingly perform a &lt;strong&gt;Cryptographic Binding&lt;/strong&gt; check.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The IP Link:&lt;/strong&gt; The Auth Token/Cookie you copied from your browser was generated for that specific IP address. When you run your script (likely on a server, VPN, or different proxy), the site sees a mismatch between the token's origin IP and your current request IP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Expiry Clock:&lt;/strong&gt; These tokens are ephemeral. They are designed to expire; you will need to investigate how long they remain valid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are just looping through a list of URLs with a static token, you will burn out your access almost immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 4: Architecting the Solution
&lt;/h2&gt;

&lt;p&gt;To make this work at scale, you cannot simply write a script. You need to build a &lt;strong&gt;Hybrid Architecture&lt;/strong&gt; that manages state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjfrdjef30hc4d11gpuk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjfrdjef30hc4d11gpuk.png" alt="Architecture Image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You need to engineer a system that takes all of the above into account and monitors the session lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Storage Unit:&lt;/strong&gt; You need a database (like Redis) to store a "Session Object." This object must contain:

&lt;ul&gt;
&lt;li&gt;The Auth Token (Cookie).&lt;/li&gt;
&lt;li&gt;The IP Address used to generate it.&lt;/li&gt;
&lt;li&gt;The Creation Time.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;The Browser Worker:&lt;/strong&gt; You need a headless browser (&lt;code&gt;Nodriver&lt;/code&gt;/&lt;code&gt;Camoufox&lt;/code&gt;) to visit the site, execute the JavaScript, generate the token, and save it to your Storage Unit.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;The HTTP Worker:&lt;/strong&gt; Your actual scraper. It doesn't browse; it pulls the Token + IP combination from storage and hits the API directly.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;The Rotation Logic:&lt;/strong&gt; You need logic that checks the token age.

&lt;ul&gt;
&lt;li&gt;Is the token older than 5 minutes? Stop.&lt;/li&gt;
&lt;li&gt;Spin up the Browser Worker.&lt;/li&gt;
&lt;li&gt;Generate a new Token.&lt;/li&gt;
&lt;li&gt;Update the Storage Unit.&lt;/li&gt;
&lt;li&gt;Resume scraping.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
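&lt;p&gt;The Session Object and the rotation check above can be sketched in a few lines of Python. This is a minimal illustration, not production code: the &lt;code&gt;browser_refresh&lt;/code&gt; worker and the storage wiring are assumed, and the five-minute expiry is a placeholder you must measure per site:&lt;/p&gt;

```python
import time
from dataclasses import dataclass

TOKEN_MAX_AGE = 300  # seconds; placeholder, measure the real expiry per site

@dataclass
class Session:
    """The 'Session Object' kept in the Storage Unit (e.g. a Redis hash)."""
    token: str         # the Auth Token / Cookie
    ip: str            # the proxy IP the token was generated on
    created_at: float  # epoch seconds at generation time

def needs_refresh(session, now=None):
    """Rotation check: has the token outlived its useful window?"""
    now = time.time() if now is None else now
    return (now - session.created_at) > TOKEN_MAX_AGE

# The HTTP worker's loop, with browser_refresh as the hypothetical
# Browser Worker and storage as the hypothetical Storage Unit:
#
# if needs_refresh(session):
#     session = browser_refresh(proxy_ip=session.ip)  # same IP as the token
#     storage.save(session)
# resp = http_get(api_url, token=session.token, proxy=session.ip)
```

Keeping the IP inside the Session Object is what lets the HTTP worker honor the cryptographic binding: the token and the IP always travel together.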




&lt;h2&gt;
  
  
  The Hidden Overhead
&lt;/h2&gt;

&lt;p&gt;Suddenly, your simple scraping job requires a Proxy Management System (to ensure the Browser and HTTP worker share the same IP), a Browser Management System (to handle the heavy lifting of token generation), and a State Manager.&lt;/p&gt;

&lt;p&gt;This is why "just scraping the API" is harder than it looks. The code to fetch the data is minimal: often just one function. But the infrastructure required to maintain the identity that grants access to that data is massive.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_source=devto&amp;amp;utm_medium=post_cta&amp;amp;utm_campaign=zyte_api" rel="noopener noreferrer"&gt;Zyte&lt;/a&gt;&lt;/strong&gt;, we abstract this entire architecture. Our API handles the browser fingerprinting, the IP, and the session rotation automatically. You simply send us the URL, and we handle the "Hybrid" complexity in the background, delivering you the clean JSON response without the infrastructure headache.&lt;/p&gt;

&lt;p&gt;Want more? &lt;a href="https://www.zyte.com/join-community/?utm_source=devto&amp;amp;utm_medium=post_cta&amp;amp;utm_campaign=joincommunity" rel="noopener noreferrer"&gt;Join our community&lt;/a&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Modern Scrapy Developer's Guide (Part 3): Auto-Generating Page Objects with Web Scraping Co-pilot</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 18:55:04 +0000</pubDate>
      <link>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-3-auto-generating-page-objects-with-web-scraping-59fj</link>
      <guid>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-3-auto-generating-page-objects-with-web-scraping-59fj</guid>
      <description>&lt;p&gt;Welcome to Part 3 of our Modern Scrapy series.&lt;/p&gt;

&lt;p&gt;In Part 2, we refactored our spider to use Page Objects. That refactor was a huge improvement, but it was still a lot of &lt;em&gt;manual&lt;/em&gt; work. We had to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Manually create our &lt;code&gt;BookItem&lt;/code&gt; and &lt;code&gt;BookListPage&lt;/code&gt; schemas.&lt;/li&gt;
&lt;li&gt;Manually create the &lt;code&gt;bookstoscrape_com.py&lt;/code&gt; Page Object file.&lt;/li&gt;
&lt;li&gt;Manually use &lt;code&gt;scrapy shell&lt;/code&gt; to find all the CSS selectors.&lt;/li&gt;
&lt;li&gt;Manually write all the &lt;code&gt;@field&lt;/code&gt; parsers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What if you could do all of that in about 30 seconds?&lt;/p&gt;

&lt;p&gt;In this guide, we'll show you how to use the &lt;strong&gt;Web Scraping Co-pilot&lt;/strong&gt; (our VS Code extension) to &lt;strong&gt;automatically write 100% of your Items, Page Objects, and even your unit tests.&lt;/strong&gt; We'll take our simple spider from Part 1 and upgrade it to the professional &lt;code&gt;scrapy-poet&lt;/code&gt; architecture from Part 2, but this time, the AI will do all the heavy lifting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites &amp;amp; Setup
&lt;/h2&gt;

&lt;p&gt;This tutorial assumes you have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Completed Part 1 (see above)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Studio Code&lt;/strong&gt; installed.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Web Scraping Co-pilot&lt;/strong&gt; extension (which we'll install now).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 1: Installing Web Scraping Co-pilot
&lt;/h2&gt;

&lt;p&gt;Inside VS Code, go to the "Extensions" tab and search for &lt;code&gt;Web Scraping Co-pilot&lt;/code&gt; (published by Zyte).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfdxrrer9xtx4by0evfk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfdxrrer9xtx4by0evfk.png" alt="Web Scraping Copilot" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once installed, you'll see a new icon in your sidebar. Open it, and it will automatically detect your Scrapy project. It may ask to install a few dependencies like &lt;code&gt;pytest&lt;/code&gt;—allow it to do so. This setup process ensures your environment is ready for AI-powered generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Auto-Generating our &lt;code&gt;BookItem&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Let's start with the spider from Part 1. Our goal is to create a Page Object for our &lt;code&gt;BookItem&lt;/code&gt; and add &lt;em&gt;even more fields&lt;/em&gt; than we did in Part 2.&lt;/p&gt;

&lt;p&gt;In the Co-pilot chat window:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select "Web Scraping."&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Write a prompt like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Create a page object for the item BookItem using the sample URL &lt;a href="https://books.toscrape.com/catalogue/the-host_979/index.html" rel="noopener noreferrer"&gt;https://books.toscrape.com/catalogue/the-host_979/index.html&lt;/a&gt;"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The co-pilot will now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check your project:&lt;/strong&gt; It will confirm you have &lt;code&gt;scrapy-poet&lt;/code&gt; and &lt;code&gt;pytest&lt;/code&gt; (and will offer to install them if you don't).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add &lt;code&gt;scrapy-poet&lt;/code&gt; settings:&lt;/strong&gt; It will automatically add the &lt;code&gt;ADDONS&lt;/code&gt; and &lt;code&gt;SCRAPY_POET_DISCOVER&lt;/code&gt; settings to your &lt;code&gt;settings.py&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create your &lt;code&gt;items.py&lt;/code&gt;:&lt;/strong&gt; It will create a new &lt;code&gt;BookItem&lt;/code&gt; class, but this time it will &lt;em&gt;intelligently add all the fields it can find on the page&lt;/em&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/items.py (Auto-Generated!)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;

&lt;span class="nd"&gt;@attrs.define&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The structured data we extract from a book *detail* page.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- New!
&lt;/span&gt;    &lt;span class="n"&gt;number_of_reviews&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;-- New!
&lt;/span&gt;    &lt;span class="n"&gt;upc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;             &lt;span class="c1"&gt;# &amp;lt;-- New!
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create Fixtures:&lt;/strong&gt; It creates a &lt;code&gt;fixtures&lt;/code&gt; folder with the saved HTML and expected JSON output for testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write the Page Object:&lt;/strong&gt; It creates the &lt;code&gt;tutorial/pages/bookstoscrape_com.py&lt;/code&gt; file and writes the &lt;em&gt;entire&lt;/em&gt; Page Object, complete with all parsing logic and selectors, for &lt;em&gt;all&lt;/em&gt; the new fields.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/pages/bookstoscrape_com.py (Auto-Generated!)
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;web_poet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;returns&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tutorial.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;

&lt;span class="nd"&gt;@handle_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[books.toscrape.com/catalogue](https://books.toscrape.com/catalogue)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@returns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookDetailPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This Page Object handles parsing data from book detail pages.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;

    &lt;span class="c1"&gt;# All of this was written for us!
&lt;/span&gt;    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.availability::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;number_of_reviews&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table tr:last-child td::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table tr:first-child td::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In 30 seconds, the Co-pilot has done everything we did manually in Part 2, but &lt;em&gt;better&lt;/em&gt;—it even added more fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Running the AI-Generated Tests
&lt;/h2&gt;

&lt;p&gt;The best part? The Co-pilot &lt;em&gt;also&lt;/em&gt; wrote unit tests for you. It created a &lt;code&gt;tests&lt;/code&gt; folder with &lt;code&gt;test_bookstoscrape_com.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can just click "Run Tests" in the Co-pilot UI (or run &lt;code&gt;pytest&lt;/code&gt; in your terminal).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pytest
================ test session starts ================
...
tests/test_bookstoscrape_com.py::test_book_detail[book_0] PASSED
tests/test_bookstoscrape_com.py::test_book_detail[book_1] PASSED
...
================ 8 tests passed in 0.10s ================

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your parsing logic is now fully tested, and you didn't write a single line of test code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Refactoring the Spider (The Easy Way)
&lt;/h2&gt;

&lt;p&gt;Now, we just update our &lt;code&gt;tutorial/spiders/books.py&lt;/code&gt; to use this new architecture, just like in Part 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tutorial/spiders/books.py

import scrapy
# Import our new, auto-generated Item class
from tutorial.items import BookItem

class BooksSpider(scrapy.Spider):
    name = "books"
    # ... (rest of spider from Part 1) ...

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # We just tell Scrapy to call parse_book
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # We ask for the BookItem, and scrapy-poet does the rest!
    async def parse_book(self, response, book: BookItem):
        yield book

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Auto-Generating our &lt;code&gt;BookListPage&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;We can repeat the exact same process for our list page to finish the refactor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt the Co-pilot:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Create a page object for the list item BookListPage using the sample URL &lt;a href="https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html" rel="noopener noreferrer"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/a&gt;"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Co-pilot will create the &lt;code&gt;BookListPage&lt;/code&gt; item in &lt;code&gt;items.py&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It will create the &lt;code&gt;BookListPageObject&lt;/code&gt; in &lt;code&gt;bookstoscrape_com.py&lt;/code&gt; with the parsers for &lt;code&gt;book_urls&lt;/code&gt; and &lt;code&gt;next_page_url&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It will write and pass the tests.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now we can update our spider one last time to be &lt;em&gt;fully&lt;/em&gt; architected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tutorial/spiders/books.py (FINAL VERSION)

import scrapy
from tutorial.items import BookItem, BookListPage # Import both

class BooksSpider(scrapy.Spider):
    # ... (name, allowed_domains, url) ...

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # We now ask for the BookListPage item!
    async def parse_listpage(self, response, page: BookListPage):

        # All parsing logic is GONE from the spider.
        for url in page.book_urls:
            yield response.follow(url, callback=self.parse_book)

        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response, book: BookItem):
        yield book

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our spider is now just a "crawler." It has zero parsing logic. All the hard work of finding selectors and writing parsers was automated by the Co-pilot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The "Hybrid Developer"
&lt;/h2&gt;

&lt;p&gt;The Web Scraping Co-pilot doesn't replace you. It &lt;em&gt;accelerates&lt;/em&gt; you. It automates the 90% of work that is "grunt work" (finding selectors, writing boilerplate, creating tests) so you can focus on the 10% of work that matters: &lt;strong&gt;crawling logic, strategy, and handling complex sites.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is how we, as the maintainers of Scrapy, build spiders professionally.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What's Next? Join the Community.&lt;/p&gt;

&lt;p&gt;
💬 &lt;a href="https://discord.com/invite/extract-data-community-993441606642446397" rel="noopener noreferrer"&gt;TALK&lt;/a&gt;: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.&lt;br&gt;
▶️ &lt;a href="https://www.youtube.com/@zytedata" rel="noopener noreferrer"&gt;WATCH&lt;/a&gt;: This post was based on our video! Watch the full walkthrough on our YouTube channel.&lt;br&gt;
📩 &lt;a href="https://www.zyte.com/join-community/" rel="noopener noreferrer"&gt;READ&lt;/a&gt;: Want more? In Part 2, we'll cover Scrapy Items and Pipelines. Get the Extract newsletter so you don't miss it.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>tutorial</category>
      <category>scrapy</category>
      <category>programming</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Modern Scrapy Developer's Guide (Part 2): Page Objects with scrapy-poet</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 18:49:10 +0000</pubDate>
      <link>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-2-page-objects-with-scrapy-poet-5b6l</link>
      <guid>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-2-page-objects-with-scrapy-poet-5b6l</guid>
      <description>&lt;p&gt;Welcome to Part 2 of our Modern Scrapy series. In Part 1, we built a working spider that crawls and scrapes an entire category. But if you look at our code, it's already getting messy. Our &lt;code&gt;parse_listpage&lt;/code&gt; and &lt;code&gt;parse_book&lt;/code&gt; functions are mixing two different jobs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Crawling Logic:&lt;/strong&gt; Finding the next page and following links.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parsing Logic:&lt;/strong&gt; Finding the data (name, price) with CSS selectors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What happens when a selector changes? Or when you want to test your parsing logic? You have to run the whole spider. This is slow, hard to maintain, and difficult to test.&lt;/p&gt;

&lt;p&gt;In this guide, we'll fix this by refactoring our spider to a professional, modern standard using &lt;strong&gt;Scrapy Items&lt;/strong&gt; and &lt;strong&gt;Page Objects&lt;/strong&gt; (via &lt;code&gt;scrapy-poet&lt;/code&gt;). We will completely separate our crawling logic from our parsing logic. This will make our code cleaner, infinitely easier to test, and scalable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'll Build
&lt;/h2&gt;

&lt;p&gt;We will refactor our spider from Part 1. The spider itself will &lt;em&gt;only&lt;/em&gt; handle crawling (following links). All the parsing logic will be moved into dedicated "Page Object" classes. &lt;code&gt;scrapy-poet&lt;/code&gt; will automatically inject the correct, parsed item into our spider.&lt;/p&gt;

&lt;p&gt;Look at how clean our spider's &lt;code&gt;parse_book&lt;/code&gt; function becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The NEW parse_book function
# Where did the parsing logic go?! (Hint: scrapy-poet)

    async def parse_book(self, response, book: BookItem):
        # 'book' is a BookItem, magically injected and parsed
        # by scrapy-poet before this function is even called.
        # We just yield it.
        yield book

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;This tutorial builds directly on &lt;a href="https://www.zyte.com/learn/the-modern-scrapy-developers-guide/" rel="noopener noreferrer"&gt;Part 1: Building Your First Crawling Spider&lt;/a&gt;. Please complete that guide first, as we will be modifying the spider we built there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: The "Why" (Separation of Concerns)
&lt;/h2&gt;

&lt;p&gt;Our current spider is a monolith. The &lt;code&gt;BooksSpider&lt;/code&gt; class knows &lt;em&gt;how to crawl&lt;/em&gt; (find next page links, find product links) and &lt;em&gt;how to parse&lt;/em&gt; (extract &lt;code&gt;h1&lt;/code&gt; tags, extract &lt;code&gt;p.price_color&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;This is bad. If we want to reuse our parsing logic, or test it without re-crawling the web, we can't.&lt;/p&gt;

&lt;p&gt;The "Page Object" pattern solves this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Spider's Job:&lt;/strong&gt; Crawling. Its &lt;em&gt;only&lt;/em&gt; job is to navigate from page to page and yield &lt;code&gt;Requests&lt;/code&gt; or &lt;code&gt;Items&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Page Object's Job:&lt;/strong&gt; Parsing. Its &lt;em&gt;only&lt;/em&gt; job is to take a &lt;code&gt;response&lt;/code&gt; and extract structured data from it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;scrapy-poet&lt;/code&gt; is a library that automatically connects our spider to the correct Page Object.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Create Our "Schema" (Scrapy Items)
&lt;/h2&gt;

&lt;p&gt;First, let's define the data we're scraping. Instead of messy dictionaries, we'll use Scrapy Items. Scrapy comes with &lt;code&gt;attrs&lt;/code&gt;, a fantastic library for this.&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;tutorial/items.py&lt;/code&gt; and add two classes: one for our book data and one for our list page data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/items.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;

&lt;span class="nd"&gt;@attrs.define&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The structured data we extract from a book *detail* page.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="nd"&gt;@attrs.define&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookListPage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The data and links we extract from a *list* page.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;book_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;
    &lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is our "schema." It makes our code type-safe and easier to read.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Install and Configure &lt;code&gt;scrapy-poet&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;scrapy-poet&lt;/code&gt; is a separate package we need to install.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install scrapy-poet
uv add scrapy-poet
# or: pip install scrapy-poet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we must enable it in &lt;code&gt;tutorial/settings.py&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/settings.py
&lt;/span&gt;
&lt;span class="c1"&gt;# Add this to enable the scrapy-poet add-on
&lt;/span&gt;&lt;span class="n"&gt;ADDONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy_poet.Addon&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Add this to tell scrapy-poet where to find our Page Objects
# 'tutorial.pages' means a folder named 'pages' in our 'tutorial' module
&lt;/span&gt;&lt;span class="n"&gt;SCRAPY_POET_DISCOVER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tutorial.pages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Create Page Objects for Parsing
&lt;/h2&gt;

&lt;p&gt;Now for the magic. Let's create the &lt;code&gt;tutorial/pages&lt;/code&gt; module.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt; &lt;span class="n"&gt;tutorial&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;
&lt;span class="n"&gt;touch&lt;/span&gt; &lt;span class="n"&gt;tutorial&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside this new folder, create a file named &lt;code&gt;bookstoscrape_com.py&lt;/code&gt;. This file will hold all the parsing logic for &lt;code&gt;bookstoscrape.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is the most complex part, but it's a "set it and forget it" pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/pages/bookstoscrape_com.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;web_poet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;returns&lt;/span&gt;

&lt;span class="c1"&gt;# Import our Item schemas
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tutorial.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt;

&lt;span class="c1"&gt;# This class handles all book DETAIL pages
&lt;/span&gt;&lt;span class="nd"&gt;@handle_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[books.toscrape.com/catalogue](https://books.toscrape.com/catalogue)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@returns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookDetailPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This Page Object handles parsing data from book detail pages.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# The @field decorator tells scrapy-poet: "run this function
&lt;/span&gt;    &lt;span class="c1"&gt;# and put the result into the 'name' field of the BookItem."
&lt;/span&gt;    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;

&lt;span class="c1"&gt;# This class handles all book LIST pages (categories)
&lt;/span&gt;&lt;span class="nd"&gt;@handle_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[books.toscrape.com/catalogue/category](https://books.toscrape.com/catalogue/category)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@returns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BookListPage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookListPageObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This Page Object handles parsing data from category/list pages.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;book_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article.product_pod h3 a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li.next a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at that! All our messy &lt;code&gt;response.css()&lt;/code&gt; calls are now neatly organized in their own classes, completely separate from our spider. The &lt;code&gt;@handle_urls&lt;/code&gt; decorator tells &lt;code&gt;scrapy-poet&lt;/code&gt; which Page Object to use for which URL.&lt;/p&gt;
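&lt;p&gt;To build intuition for what &lt;code&gt;@handle_urls&lt;/code&gt; is doing, here is a conceptual sketch of the routing idea in plain Python. This is &lt;em&gt;not&lt;/em&gt; scrapy-poet's real implementation (which uses a proper URL-matching library); it only illustrates the pattern: a decorator registers each Page Object against a URL pattern, and a lookup picks the most specific match.&lt;/p&gt;

```python
# Conceptual sketch only -- not scrapy-poet's actual internals.
# A registry maps URL patterns to Page Object classes.
_registry = {}


def handle_urls(pattern):
    def decorator(cls):
        _registry[pattern] = cls
        return cls
    return decorator


@handle_urls("books.toscrape.com/catalogue/category")
class ListPage: ...


@handle_urls("books.toscrape.com/catalogue")
class DetailPage: ...


def page_object_for(url: str):
    # Prefer the longest (most specific) matching pattern.
    matches = [p for p in _registry if p in url]
    return _registry[max(matches, key=len)] if matches else None
```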

&lt;h2&gt;
  
  
  Step 5: Refactor the Spider (The Payoff)
&lt;/h2&gt;

&lt;p&gt;Now, let's go back to &lt;code&gt;tutorial/spiders/books.py&lt;/code&gt; and refactor it. It becomes &lt;em&gt;much&lt;/em&gt; simpler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/spiders/books.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="c1"&gt;# Import our new Item classes
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tutorial.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BooksSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;books&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toscrape.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# We still start the same way
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The 'page: BookListPage' is new.
&lt;/span&gt;    &lt;span class="c1"&gt;# We ask for the BookListPage item, and scrapy-poet injects it.
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# 1. Get the parsed book URLs from the Page Object
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# We follow each URL, but our callback no longer
&lt;/span&gt;            &lt;span class="c1"&gt;# needs to do any work!
&lt;/span&gt;            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Get the next page URL from the Page Object
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The 'book: BookItem' is new.
&lt;/span&gt;    &lt;span class="c1"&gt;# We ask for the BookItem, and scrapy-poet injects it.
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Our parsing logic is GONE.
&lt;/span&gt;        &lt;span class="c1"&gt;# The 'book' variable is already a fully-populated
&lt;/span&gt;        &lt;span class="c1"&gt;# BookItem, parsed by our BookDetailPage Page Object.
&lt;/span&gt;
        &lt;span class="c1"&gt;# We just yield it.
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our spider is now &lt;em&gt;only&lt;/em&gt; responsible for crawling. All parsing is handled by &lt;code&gt;scrapy-poet&lt;/code&gt; and our Page Objects. This code is clean, testable, and incredibly easy to read.&lt;/p&gt;

&lt;p&gt;When you run &lt;code&gt;scrapy crawl books -o books.json&lt;/code&gt;, the output will be identical to Part 1, but the architecture is now far easier to maintain, test, and extend.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Hard Part": Why This Still Breaks
&lt;/h3&gt;

&lt;p&gt;We've built a professional, well-architected Scrapy spider. But we've just made a cleaner version of a spider that will still fail on a real-world site.&lt;/p&gt;

&lt;p&gt;This architecture is beautiful, but it doesn't solve the "real" problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;❌ IP Blocks:&lt;/strong&gt; You're still hitting the site from one IP. You will be blocked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;❌ CAPTCHAs:&lt;/strong&gt; When a CAPTCHA appears, the spider has no way to solve it, and the crawl fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;❌ JavaScript:&lt;/strong&gt; If the prices were loaded by JS, our &lt;code&gt;response.css()&lt;/code&gt; selectors would find &lt;em&gt;nothing&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've just organized our failing code.&lt;/p&gt;

&lt;p&gt;The "Easy Way": Zyte API as a Universal Page Object&lt;br&gt;
scrapy-poet is a great way to organise your scrapy code, making your projects easier to build, collaborate and maintain. However, it doesn't change the fact we are not doing anything to avoid web scraping bans.&lt;/p&gt;

&lt;p&gt;With a Zyte API account, we can add the settings below to route every request in our Scrapy project through Zyte API, which handles bans and proxy rotation for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# add scrapy-zyte-api python library
uv add scrapy-zyte-api
# settings.py
ZYTE_API_KEY = "YOUR_API_KEY"

ADDONS = {
    "scrapy_zyte_api.Addon": 500,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the power of combining a great architecture (Scrapy) with a powerful service (Zyte API).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion &amp;amp; Next Steps
&lt;/h2&gt;

&lt;p&gt;Today you elevated your spider from a simple script to a professional-grade crawler. You learned the "Separation of Concerns" principle, defined data with &lt;code&gt;Items&lt;/code&gt;, and separated parsing logic with &lt;code&gt;scrapy-poet&lt;/code&gt;'s Page Objects.&lt;/p&gt;

&lt;p&gt;This is the modern way to build robust, testable, and scalable Scrapy spiders.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's Next? Join the Community
&lt;/h3&gt;

&lt;p&gt;💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.&lt;br&gt;
▶️ WATCH: This post was based on our video! Watch the full walkthrough on our YouTube channel.&lt;br&gt;
📩 READ: Want more? Get the Extract newsletter so you don't miss the next part of this series.&lt;/p&gt;

&lt;p&gt;And if you're ready to skip the "Hard Part" entirely, get your free API key and try the "Easy Way."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hubs.li/Q03YmnDF0" rel="noopener noreferrer"&gt;&lt;strong&gt;Start Your Free Zyte Trial&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>scrapy</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Modern Scrapy Developer's Guide (Part 1): Building Your First Spider</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 18:41:29 +0000</pubDate>
      <link>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-1-building-your-first-spider-4gc2</link>
      <guid>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-1-building-your-first-spider-4gc2</guid>
      <description>&lt;p&gt;Scrapy can feel daunting. It's a massive, powerful framework, and the documentation can be overwhelming for a newcomer. Where do you even begin?&lt;/p&gt;

&lt;p&gt;In this definitive guide, we will walk you through, step-by-step, how to build a real, multi-page crawling spider. You will go from an empty folder to a clean JSON file of structured data in about 15 minutes. We'll use modern, &lt;code&gt;async&lt;/code&gt;/&lt;code&gt;await&lt;/code&gt; Python and cover project setup, finding selectors, following links (crawling), and saving your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'll Build
&lt;/h2&gt;

&lt;p&gt;We will build a Scrapy spider that crawls the "Fantasy" category on &lt;a href="https://books.toscrape.com/" rel="noopener noreferrer"&gt;books.toscrape.com&lt;/a&gt;, follows the "Next" button to crawl every page in that category, follows the link for every book, and scrapes the name, price, and URL from all 48 books, saving the result to a clean &lt;code&gt;books.json&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Here's a preview of our final spider code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The final spider we'll build
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BooksSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;books&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toscrape.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html](https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;product_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article.product_pod h3 a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;product_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;next_page_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li.next a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prerequisites &amp;amp; Setup
&lt;/h2&gt;

&lt;p&gt;Before we start, you'll need Python 3.x installed. We'll also be using a virtual environment to keep our dependencies clean. You can use standard &lt;code&gt;pip&lt;/code&gt; or a modern package manager like &lt;code&gt;uv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;First, let's create a project folder and activate a virtual environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a new folder&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;scrapy_project
&lt;span class="nb"&gt;cd &lt;/span&gt;scrapy_project

&lt;span class="c"&gt;# Option 1: Using standard pip + venv&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate  &lt;span class="c"&gt;# On Windows, use: .venv\Scripts\activate&lt;/span&gt;

&lt;span class="c"&gt;# Option 2: Using uv (a fast, modern alternative)&lt;/span&gt;
uv init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's install Scrapy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Option 1: Using pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy

&lt;span class="c"&gt;# Option 2: Using uv&lt;/span&gt;
uv add scrapy
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Initialize Your Project
&lt;/h2&gt;

&lt;p&gt;With Scrapy installed, we can use its built-in command-line tools to generate our project boilerplate.&lt;/p&gt;

&lt;p&gt;First, create the project itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The 'scrapy startproject' command creates the project structure&lt;/span&gt;
&lt;span class="c"&gt;# The '.' tells it to use the current folder&lt;/span&gt;
scrapy startproject tutorial &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see a &lt;code&gt;tutorial&lt;/code&gt; folder and a &lt;code&gt;scrapy.cfg&lt;/code&gt; file appear. This folder contains all your project's logic.&lt;/p&gt;
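&lt;p&gt;For orientation, the generated layout looks roughly like this (the standard Scrapy project template; exact file contents vary by Scrapy version):&lt;/p&gt;

```
scrapy_project/
├── scrapy.cfg          # deploy/config entry point
└── tutorial/
    ├── __init__.py
    ├── items.py        # data schemas
    ├── middlewares.py  # request/response hooks
    ├── pipelines.py    # post-processing of scraped items
    ├── settings.py     # project configuration
    └── spiders/
        └── __init__.py # your spiders live here
```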

&lt;p&gt;Next, we'll generate our first spider.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The 'genspider' command creates a new spider file&lt;/span&gt;
&lt;span class="c"&gt;# Usage: scrapy genspider &amp;lt;spider_name&amp;gt; &amp;lt;allowed_domain&amp;gt;&lt;/span&gt;
scrapy genspider books toscrape.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you look in &lt;code&gt;tutorial/spiders/&lt;/code&gt;, you'll now see &lt;code&gt;books.py&lt;/code&gt;. This is where we'll write our code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Configure Your Settings
&lt;/h2&gt;

&lt;p&gt;Before we write our spider, let's quickly adjust two settings in &lt;code&gt;tutorial/settings.py&lt;/code&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ROBOTSTXT_OBEY&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, Scrapy respects robots.txt files. This is a good practice, but our test site (toscrape.com) doesn't have one, which can cause a 404 error in our logs. We'll turn it off for this tutorial.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/settings.py
&lt;/span&gt;
&lt;span class="c1"&gt;# Find this line and change it to False
&lt;/span&gt;&lt;span class="n"&gt;ROBOTSTXT_OBEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Concurrency&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Scrapy is polite by default and runs slowly. Since toscrape.com is a test site built for scraping, we can speed it up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/settings.py
&lt;/span&gt;
&lt;span class="c1"&gt;# Uncomment or add these lines
&lt;/span&gt;&lt;span class="n"&gt;CONCURRENT_REQUESTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="n"&gt;DOWNLOAD_DELAY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; These settings are for this test site only. When scraping in the wild, you must be mindful of your target site and use respectful &lt;code&gt;DOWNLOAD_DELAY&lt;/code&gt; and &lt;code&gt;CONCURRENT_REQUESTS&lt;/code&gt; values.&lt;/p&gt;
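&lt;p&gt;For real-world crawls, a safer starting point than raw concurrency is Scrapy's built-in AutoThrottle extension, which adapts the request rate to how quickly the server responds. A conservative sketch (the values are illustrative, tune them per site):&lt;/p&gt;

```python
# settings.py -- a gentler profile for real-world targets (illustrative values)
ROBOTSTXT_OBEY = True                  # respect robots.txt outside of test sites
DOWNLOAD_DELAY = 1.0                   # base delay between requests
AUTOTHROTTLE_ENABLED = True            # adapt request rate to observed latency
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per remote site
```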

&lt;h2&gt;
  
  
  Step 3: Finding Our Selectors (with &lt;code&gt;scrapy shell&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;To scrape a site, we need to tell Scrapy &lt;em&gt;what&lt;/em&gt; data to get. We do this with CSS selectors. The &lt;code&gt;scrapy shell&lt;/code&gt; is the best tool for this.&lt;/p&gt;

&lt;p&gt;Let's launch the shell on our target category page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy shell https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will download the page and give you an interactive shell with a &lt;code&gt;response&lt;/code&gt; object.&lt;/p&gt;

&lt;p&gt;You can even type &lt;code&gt;view(response)&lt;/code&gt; to open the page in your browser exactly as Scrapy sees it!&lt;/p&gt;

&lt;p&gt;Let's find the data we need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find all Book Links:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By inspecting the page, we see each book is in an &lt;code&gt;article.product_pod&lt;/code&gt;. The link is inside an &lt;code&gt;h3&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In scrapy shell:
&amp;gt;&amp;gt;&amp;gt; response.css("article.product_pod h3 a::attr(href)").getall()
[
  '../../../../the-host_979/index.html',
  '../../../../the-hunted_978/index.html',
  ...
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;getall()&lt;/code&gt; gives us a clean list of all the URLs.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Find the "Next" Page Link:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At the bottom, we find the "Next" button in an &lt;code&gt;li.next&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In scrapy shell:
&amp;gt;&amp;gt;&amp;gt; response.css("li.next a::attr(href)").get()
'page-2.html'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;get()&lt;/code&gt; gives us the single link we need for pagination.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Find the Book Data (on a product page):&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, let's open a shell on a product page to find the selectors for our data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Exit the shell and open a new one:
scrapy shell https://books.toscrape.com/catalogue/the-host_979/index.html

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In scrapy shell:
&amp;gt;&amp;gt;&amp;gt; response.css("h1::text").get()
'The Host'

&amp;gt;&amp;gt;&amp;gt; response.css("p.price_color::text").get()
'£25.82'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perfect. We now have all the selectors we need.&lt;/p&gt;
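&lt;p&gt;It can help to collect everything from this step in one place before writing the spider. A plain-Python summary of the selectors we just verified (the &lt;code&gt;SELECTORS&lt;/code&gt; dict is only a notebook for us, not something Scrapy requires):&lt;/p&gt;

```python
# CSS selectors discovered in scrapy shell, keyed by what they extract
SELECTORS = {
    "book_links": "article.product_pod h3 a::attr(href)",  # category pages
    "next_page": "li.next a::attr(href)",                  # category pages
    "name": "h1::text",                                    # product pages
    "price": "p.price_color::text",                        # product pages
}
```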

&lt;h2&gt;
  
  
  Step 4: Building the Spider (Crawling &amp;amp; Parsing)
&lt;/h2&gt;

&lt;p&gt;Now, let's open &lt;code&gt;tutorial/spiders/books.py&lt;/code&gt; and write our spider. Here is the clean, final version.&lt;/p&gt;

&lt;p&gt;Delete the boilerplate in &lt;code&gt;books.py&lt;/code&gt; and replace it with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/spiders/books.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BooksSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;books&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toscrape.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# This is our starting URL (the first page of the Fantasy category)
&lt;/span&gt;    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# This is the modern, async version of 'start_requests'
&lt;/span&gt;    &lt;span class="c1"&gt;# It's called once when the spider starts.
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# We yield our first request, sending the response to 'parse_listpage'
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This function handles the *category page*
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# 1. Get all product URLs using the selector we found
&lt;/span&gt;        &lt;span class="n"&gt;product_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article.product_pod h3 a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. For each product URL, follow it and send the response to 'parse_book'
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;product_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Find the 'Next' page URL
&lt;/span&gt;        &lt;span class="n"&gt;next_page_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li.next a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# 4. If a 'Next' page exists, follow it and send the response
&lt;/span&gt;        &lt;span class="c1"&gt;#    back to *this same function*
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This function handles the *product page*
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# We yield a dictionary of the data we want
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code is clean and efficient. &lt;code&gt;response.follow&lt;/code&gt; is smart enough to handle the relative URLs (like &lt;code&gt;page-2.html&lt;/code&gt;) for us.&lt;/p&gt;
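&lt;p&gt;If you're curious what &lt;code&gt;response.follow&lt;/code&gt; is doing with those relative links: it resolves them against the current page URL, the same way the standard library's &lt;code&gt;urllib.parse.urljoin&lt;/code&gt; does. A standalone sketch using the pagination link from the shell session:&lt;/p&gt;

```python
from urllib.parse import urljoin

# The category page the spider is currently on
base = "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

# 'page-2.html' is the relative link extracted with li.next a::attr(href);
# urljoin replaces the last path segment, just as response.follow does
next_page = urljoin(base, "page-2.html")
print(next_page)
```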

&lt;h2&gt;
  
  
  Step 5: Running The Spider &amp;amp; Saving Data
&lt;/h2&gt;

&lt;p&gt;We're ready to run. Go to your terminal (at the project root) and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl books
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see Scrapy start up, and in the logs, you'll see all 48 items being scraped!&lt;/p&gt;

&lt;p&gt;But we want to &lt;em&gt;save&lt;/em&gt; this data. Scrapy has a built-in "Feed Exporter" that makes this easy. We just use the &lt;code&gt;-o&lt;/code&gt; (output) flag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl books -o books.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will run the spider again, but this time you'll see a new &lt;code&gt;books.json&lt;/code&gt; file in your project root, containing all 48 items, perfectly structured. (Note that &lt;code&gt;-o&lt;/code&gt; appends to an existing file, which produces invalid JSON on repeat runs; use &lt;code&gt;-O&lt;/code&gt; to overwrite instead.)&lt;/p&gt;
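&lt;p&gt;The exported feed is a plain JSON array, so it's easy to sanity-check with the standard library. A minimal sketch (the sample row below is hard-coded to mirror the feed's shape; a real check would load &lt;code&gt;books.json&lt;/code&gt; from disk):&lt;/p&gt;

```python
import json

# One row in the shape parse_book yields; a real check would read books.json
sample_feed = '[{"name": "The Host", "price": "£25.82", "url": "https://books.toscrape.com/catalogue/the-host_979/index.html"}]'

books = json.loads(sample_feed)
for book in books:
    # Every item should carry the three fields yielded by parse_book
    assert {"name", "price", "url"}.issubset(book)

print(len(books))
```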

&lt;h2&gt;
  
  
  Conclusion &amp;amp; Next Steps
&lt;/h2&gt;

&lt;p&gt;Today you built a powerful, modern, async Scrapy crawler. You learned how to set up a project, find selectors, follow links, and handle pagination.&lt;/p&gt;

&lt;p&gt;This is just the starting point.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h3&gt;
  
  
  What's Next? Join the Community.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;💬 TALK:&lt;/strong&gt; Stuck on this Scrapy code? &lt;a href="https://discord.com/invite/extract-data-community-993441606642446397" rel="noopener noreferrer"&gt;Ask the maintainers and 5k+ devs in our Discord.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;▶️ WATCH:&lt;/strong&gt; This post was based on our video! &lt;a href="https://www.youtube.com/@zytedata" rel="noopener noreferrer"&gt;&lt;strong&gt;Watch the full walkthrough on our YouTube channel.&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📩 READ:&lt;/strong&gt; Want more? In Part 2, we'll cover Scrapy Items and Pipelines. &lt;a href="https://www.zyte.com/join-community/" rel="noopener noreferrer"&gt;Get the Extract newsletter so you don't miss it.&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

</description>
      <category>programming</category>
      <category>webscraping</category>
      <category>scrapy</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stop Writing Messy Spiders. The Professional Way with Scrapy-Poet</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Fri, 03 Oct 2025 14:12:08 +0000</pubDate>
      <link>https://forem.com/extractdata/stop-writing-messy-spiders-the-professional-way-with-scrapy-poet-6og</link>
      <guid>https://forem.com/extractdata/stop-writing-messy-spiders-the-professional-way-with-scrapy-poet-6og</guid>
      <description>&lt;p&gt;If you were building a web app, you wouldn't cram your database queries and business logic into your API routes. That would be a maintenance nightmare. So why do we accept this in our Scrapy projects? We build massive, unwieldy spiders where crawling logic and parsing logic are tangled together in huge &lt;code&gt;parse&lt;/code&gt; methods.&lt;/p&gt;

&lt;p&gt;There’s a better way. It’s time to introduce a clean separation of concerns to your spiders.&lt;/p&gt;

&lt;p&gt;In this guide, I’ll introduce you to &lt;strong&gt;Scrapy-Poet&lt;/strong&gt;, the official integration of the &lt;code&gt;web-poet&lt;/code&gt; library. It allows you to use a powerful architectural pattern called &lt;strong&gt;Page Objects&lt;/strong&gt;, which separates your parsing logic from your spider's crawling duties. The result? Cleaner, more maintainable, and highly testable code.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Glimpse of the Future: The Page Object Pattern
&lt;/h2&gt;

&lt;p&gt;Let's look at the difference. A traditional spider is often a long file with complex &lt;code&gt;parse_item&lt;/code&gt; methods full of CSS and XPath selectors.&lt;/p&gt;

&lt;p&gt;The Scrapy-Poet way is different. Your spider becomes a clean, concise crawling manager. Its only jobs are to manage requests, follow links, and hand off the response to the correct Page Object.&lt;/p&gt;

&lt;p&gt;Look how clean this spider is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# products.py (The Spider)
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProductsSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ItemListPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The spider yields the Page Object, which handles parsing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_product&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ProductPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The product page object extracts the final item.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what's missing? There are &lt;strong&gt;no selectors&lt;/strong&gt; in the spider! The &lt;code&gt;parse&lt;/code&gt; methods simply declare which Page Object (&lt;code&gt;ItemListPage&lt;/code&gt; or &lt;code&gt;ProductPage&lt;/code&gt;) they expect, and Scrapy-Poet injects it, fully parsed. The spider's job is now pure navigation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Meet Your New Best Friend: The Page Object
&lt;/h2&gt;

&lt;p&gt;So where did all the parsing logic go? It moved into &lt;strong&gt;Page Objects&lt;/strong&gt;. A Page Object is a simple Python class dedicated to understanding and extracting data from a single type of web page.&lt;/p&gt;

&lt;h3&gt;
  
  
  The List Page Object
&lt;/h3&gt;

&lt;p&gt;Here’s the Page Object for our product listing page. Its only job is to find all the product URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pages/list.py (The List Page Object)
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ItemListPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A page object for parsing product listing pages.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;product_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extracts all product URLs from the page.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.product a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s a simple class that inherits from &lt;code&gt;WebPage&lt;/code&gt; and has one method, &lt;code&gt;product_urls&lt;/code&gt;, decorated with &lt;code&gt;@field&lt;/code&gt; (both &lt;code&gt;WebPage&lt;/code&gt; and &lt;code&gt;field&lt;/code&gt; come from the &lt;code&gt;web_poet&lt;/code&gt; package). This method contains the selector and logic to get the links. That's it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Product Page Object
&lt;/h3&gt;

&lt;p&gt;The detail page is more complex, but the principle is the same. Each piece of data we want gets its own method, cleanly mapping item fields to selectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pages/product.py (The Product Page Object)
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProductPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A page object for parsing product detail pages.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# This method defines the final, structured item to be returned
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;to_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# ... other fields
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;h1.product-title::text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;price_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.price::text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_clean_price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# You can call helper methods
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_clean_price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price_str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ... (data cleaning logic here)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All the logic for finding, extracting, and cleaning product data is now neatly organized in one place. If a selector breaks, you know exactly which file to open.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Incredible Benefits (Why You'll Never Go Back)
&lt;/h2&gt;

&lt;p&gt;Adopting this pattern isn't just about tidiness; it unlocks professional-grade benefits.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Radically Improved Maintainability
&lt;/h3&gt;

&lt;p&gt;When a website changes its layout (and it will), you no longer have to hunt through a giant spider file.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Price selector changed?&lt;/strong&gt; Open &lt;code&gt;pages/product.py&lt;/code&gt; and fix the &lt;code&gt;price&lt;/code&gt; method.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need to add a new field?&lt;/strong&gt; Add it to your Item and then add a new &lt;code&gt;@field&lt;/code&gt; method in the corresponding Page Object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your spider remains completely untouched. This isolation makes maintenance fast and painless.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Finally, Real Testability 🧪
&lt;/h3&gt;

&lt;p&gt;This is the game-changer. You can now write unit tests for your parsing logic &lt;strong&gt;without ever running the spider&lt;/strong&gt;. Using a framework like &lt;code&gt;pytest&lt;/code&gt;, you can feed saved HTML files directly to your Page Objects and assert that they extract the correct data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tests/test_product_page.py
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_product_page_parsing&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Load a saved HTML file as a fixture
&lt;/span&gt;    &lt;span class="n"&gt;html_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixtures/product.html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProductPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;html_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Assert that your selectors work as expected
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Awesome Product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;99.99&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means you can validate your selectors in milliseconds, making your spiders incredibly robust and reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Supercharged Team Collaboration 🤝
&lt;/h3&gt;

&lt;p&gt;This pattern establishes a clear, repeatable structure for your projects. When a new developer joins the team, the architecture is self-explanatory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;items.py&lt;/code&gt; defines the data shape.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;pages/&lt;/code&gt; directory contains all parsing logic.&lt;/li&gt;
&lt;li&gt;Spiders in the &lt;code&gt;spiders/&lt;/code&gt; directory handle only crawling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This consistency makes it easy for anyone to contribute effectively right away.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Get Started (It's Easier Than You Think)
&lt;/h2&gt;

&lt;p&gt;Integrating Scrapy-Poet into your project is straightforward.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install &lt;code&gt;scrapy-poet&lt;/code&gt;:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy-poet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Activate it in &lt;code&gt;settings.py&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# settings.py
&lt;/span&gt;
&lt;span class="n"&gt;Addons&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy_poet.Addon&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tell it where to find your Page Objects:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# settings.py
&lt;/span&gt;
&lt;span class="c1"&gt;# This points to the directory where you'll store your page objects.
&lt;/span&gt;&lt;span class="n"&gt;SCRAPY_POET_DISCOVER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_project.pages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Create a &lt;code&gt;pages&lt;/code&gt; directory in your project, make it a Python module by adding an &lt;code&gt;__init__.py&lt;/code&gt; file, and start building your Page Objects. That’s all it takes.&lt;/p&gt;
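&lt;p&gt;The resulting project layout (module names here are illustrative) looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;your_project/
├── settings.py        # ADDONS and SCRAPY_POET_DISCOVER live here
├── items.py           # data shape
├── pages/
│   ├── __init__.py
│   └── product.py     # Page Objects (parsing logic)
└── spiders/
    └── products.py    # spiders (crawling only)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;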

&lt;p&gt;This is the exact pattern we use internally at Zyte to build and maintain our spiders at scale, and it’s highly recommended by the Scrapy maintainers themselves. It makes your code more structured, more testable, and ultimately, more professional.&lt;/p&gt;

&lt;p&gt;Say goodbye to monolithic spiders and hello to a cleaner, more powerful way of scraping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Code Project
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/johnatzyte/scrapy-poet-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>scrapy</category>
      <category>webscraping</category>
      <category>programming</category>
      <category>zyte</category>
    </item>
  </channel>
</rss>
