Forem: Extract by Zyte

How to write and publish a Python package to PyPI

John Rooney — Mon, 11 May 2026 09:57:41 +0000

I wanted to publish my Scrapy download handler to PyPi - UV made it incredibly easy. Here's how.

At some point, every scraping developer writes the same middleware twice.

The first time is in project A. It works well, so when project B comes along you copy it across. Then project C. Then a colleague asks for it. You email them the file. They make changes. Now there are four versions of the same code living in four different repositories, diverging slowly, none of them getting each other's bug fixes.

The solution is a package. Publish it once to the Python Package Index (PyPI) and every project that needs it can install it with pip install your-package. Updates go to everyone at once. The code has a home.

This guide walks through the full process using uv, a fast, modern Python toolchain that replaces pip, virtualenv, pip-tools, twine, and build with a single tool. We will write a reusable Scrapy download handler, structure it as a proper Python package, test it, and publish it to PyPI.

By the end, the package will be installable with:

uv add scrapy-random-delay

And usable in any Scrapy project with two lines in settings.py:

DOWNLOAD_HANDLERS = {
    "http":  "scrapy_random_delay.RandomDelayHandler",
    "https": "scrapy_random_delay.RandomDelayHandler",
}
RANDOM_DELAY_RANGE = (0.5, 3.0)  # seconds

Installing uv

If you do not have uv installed:

# macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or via pip if you prefer
pip install uv

Verify the installation:

uv --version
# uv 0.5.0 (or later)

uv is written in Rust and is dramatically faster than pip for dependency resolution and installation: typically 10 to 100 times faster. More importantly for this guide, it provides a complete project management workflow that makes creating and publishing packages significantly simpler than the traditional pip, build, and twine toolchain.

What we are building

A Scrapy download handler that adds a random delay drawn from a configurable range before each request, with optional per-domain configuration. It is realistic, self-contained, and exactly the kind of code worth packaging: useful across multiple projects but simple enough to understand completely.

How Scrapy download handlers work

A download handler is responsible for a URL scheme. Scrapy's default HTTP handler takes a request, makes the network call, and returns the response. A custom handler wraps this: add headers, impose a delay, swap the underlying HTTP library. The API is simple:

class MyHandler:
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    async def download_request(self, request, spider):
        # return a Response
        pass

    def close(self):
        pass

Our handler delegates actual HTTP work to Scrapy's built-in HTTPDownloadHandler and adds the delay logic around it.

Creating the project

uv init scaffolds a new package project in one command:

uv init --package scrapy-random-delay
cd scrapy-random-delay

The --package flag tells uv to create a proper importable package rather than a single-script project. The generated structure looks like this:

scrapy-random-delay/
├── src/
│   └── scrapy_random_delay/
│       └── __init__.py
├── pyproject.toml
└── README.md

uv uses the src/ layout by default, placing your package inside src/ so it cannot be accidentally imported from the project root during development, which can mask packaging errors.

Look at what uv generated in pyproject.toml:

[project]
name = "scrapy-random-delay"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = []

[project.scripts]
scrapy-random-delay = "scrapy_random_delay:main"

[build-system]
requires = ["uv_build>=0.11.7,<0.12.0"]
build-backend = "uv_build"

Two things to note: the [project.scripts] entry is scaffolding you can delete, since the package exports a handler class rather than a command-line entry point. The build backend is uv_build, uv's own backend introduced in recent versions. We will fill in the rest of the metadata as we go.

Adding dependencies

Add Scrapy as a runtime dependency:

uv add scrapy

Add development dependencies — pytest and pytest-asyncio for testing:

uv add --dev pytest pytest-asyncio

uv writes these into pyproject.toml automatically and creates a uv.lock lockfile that pins exact versions for reproducible installs. You do not need to manually edit pyproject.toml for dependencies.

The resulting pyproject.toml dependencies section:

[project]
dependencies = [
    "scrapy>=2.12",
]

[dependency-groups]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.24",
]

Writing the package

Add the handler module:

touch src/scrapy_random_delay/handler.py

# src/scrapy_random_delay/handler.py
import asyncio
import random
import logging
from typing import Tuple, Optional

from scrapy import Request
from scrapy.crawler import Crawler
from scrapy.http import Response
from scrapy.settings import Settings
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler as HTTPDownloadHandler

logger = logging.getLogger(__name__)


class RandomDelayHandler:
    """
    A Scrapy download handler that adds a random delay before each request.

    Settings
    --------
    RANDOM_DELAY_RANGE : tuple[float, float]
        (min_seconds, max_seconds) delay range. Default: (0.5, 2.0)

    RANDOM_DELAY_PER_DOMAIN : dict[str, tuple[float, float]]
        Per-domain overrides. Keys are domain strings, values are (min, max) tuples.
        Example: {"api.example.com": (1.0, 4.0), "cdn.example.com": (0.0, 0.1)}

    RANDOM_DELAY_VERBOSE : bool
        Log the actual delay chosen for each request. Default: False
    """

    DEFAULT_RANGE: Tuple[float, float] = (0.5, 2.0)

    def __init__(self, settings: Settings, crawler: Crawler):
        self._settings = settings
        self._range = self._parse_range(
            settings.get("RANDOM_DELAY_RANGE", self.DEFAULT_RANGE)
        )
        self._per_domain: dict = settings.getdict("RANDOM_DELAY_PER_DOMAIN", {})
        self._verbose: bool = settings.getbool("RANDOM_DELAY_VERBOSE", False)
        self._delegate = HTTPDownloadHandler(settings, crawler)

    @classmethod
    def from_crawler(cls, crawler: Crawler) -> "RandomDelayHandler":
        return cls(crawler.settings, crawler)

    def _parse_range(self, value) -> Tuple[float, float]:
        try:
            low, high = float(value[0]), float(value[1])
        except (TypeError, ValueError, IndexError) as e:
            raise ValueError(
                f"RANDOM_DELAY_RANGE must be a two-element sequence of numbers, got: {value!r}"
            ) from e

        if low < 0 or high < 0:
            raise ValueError("RANDOM_DELAY_RANGE values must be non-negative")
        if low > high:
            raise ValueError(
                f"RANDOM_DELAY_RANGE minimum ({low}) must not exceed maximum ({high})"
            )
        return low, high

    def _get_delay_range(self, request: Request) -> Tuple[float, float]:
        from urllib.parse import urlparse
        domain = urlparse(request.url).netloc
        if domain in self._per_domain:
            return self._parse_range(self._per_domain[domain])
        return self._range

    async def download_request(self, request: Request, spider) -> Response:
        low, high = self._get_delay_range(request)
        delay = random.uniform(low, high)

        if self._verbose:
            logger.debug(f"Random delay: {delay:.2f}s before {request.url}")

        await asyncio.sleep(delay)
        return await self._delegate.download_request(request, spider)

    def close(self):
        self._delegate.close()

Update __init__.py to export the handler and set the version:

# src/scrapy_random_delay/__init__.py
from scrapy_random_delay.handler import RandomDelayHandler

__version__ = "0.1.0"
__all__ = ["RandomDelayHandler"]

Writing the tests

Create the test directory:

mkdir tests
touch tests/__init__.py tests/conftest.py tests/test_handler.py

# tests/conftest.py
import pytest
from unittest.mock import MagicMock
from scrapy.http import Request, Response
from scrapy.utils.test import get_crawler


@pytest.fixture
def crawler():
    return get_crawler()


@pytest.fixture
def make_request():
    def _make(url="https://example.com/products", **kwargs):
        return Request(url=url, **kwargs)
    return _make


@pytest.fixture
def mock_response():
    return MagicMock(spec=Response)

# tests/test_handler.py
import pytest
from unittest.mock import AsyncMock, MagicMock, patch

from scrapy.utils.test import get_crawler
from scrapy_random_delay import RandomDelayHandler


def make_handler(settings_dict: dict = None) -> RandomDelayHandler:
    crawler = get_crawler(settings_dict=settings_dict or {})
    with patch("scrapy_random_delay.handler.HTTPDownloadHandler"):
        return RandomDelayHandler(crawler.settings, crawler)


class TestRangeValidation:
    def test_default_range_applied_when_not_configured(self):
        handler = make_handler()
        assert handler._range == (0.5, 2.0)

    def test_custom_range_accepted(self):
        handler = make_handler({"RANDOM_DELAY_RANGE": (1.0, 5.0)})
        assert handler._range == (1.0, 5.0)

    def test_equal_values_accepted(self):
        handler = make_handler({"RANDOM_DELAY_RANGE": (2.0, 2.0)})
        assert handler._range == (2.0, 2.0)

    def test_zero_range_accepted(self):
        handler = make_handler({"RANDOM_DELAY_RANGE": (0.0, 0.0)})
        assert handler._range == (0.0, 0.0)

    def test_inverted_range_raises(self):
        with pytest.raises(ValueError, match="minimum.*must not exceed maximum"):
            make_handler({"RANDOM_DELAY_RANGE": (5.0, 1.0)})

    def test_negative_range_raises(self):
        with pytest.raises(ValueError, match="non-negative"):
            make_handler({"RANDOM_DELAY_RANGE": (-1.0, 2.0)})

    def test_invalid_type_raises(self):
        with pytest.raises(ValueError, match="two-element sequence"):
            make_handler({"RANDOM_DELAY_RANGE": "bad-value"})


class TestPerDomainOverrides:
    def test_per_domain_range_returned_for_matching_domain(self, make_request):
        handler = make_handler({
            "RANDOM_DELAY_RANGE": (0.5, 1.0),
            "RANDOM_DELAY_PER_DOMAIN": {"api.example.com": [2.0, 4.0]},
        })
        request = make_request(url="https://api.example.com/products")
        assert handler._get_delay_range(request) == (2.0, 4.0)

    def test_default_range_returned_for_unmatched_domain(self, make_request):
        handler = make_handler({
            "RANDOM_DELAY_RANGE": (0.5, 1.0),
            "RANDOM_DELAY_PER_DOMAIN": {"api.example.com": [2.0, 4.0]},
        })
        request = make_request(url="https://other.example.com/products")
        assert handler._get_delay_range(request) == (0.5, 1.0)


class TestDownloadRequest:
    @pytest.mark.asyncio
    async def test_delegates_to_inner_handler(self, make_request, mock_response):
        handler = make_handler({"RANDOM_DELAY_RANGE": (0.0, 0.0)})
        handler._delegate.download_request = AsyncMock(return_value=mock_response)

        result = await handler.download_request(make_request(), MagicMock())

        assert result is mock_response
        handler._delegate.download_request.assert_awaited_once()

    @pytest.mark.asyncio
    async def test_delay_is_within_range(self, make_request, mock_response):
        import time
        handler = make_handler({"RANDOM_DELAY_RANGE": (0.05, 0.1)})
        handler._delegate.download_request = AsyncMock(return_value=mock_response)

        start = time.monotonic()
        await handler.download_request(make_request(), MagicMock())
        elapsed = time.monotonic() - start

        assert 0.05 <= elapsed <= 0.5  # generous upper bound for slow CI

    @pytest.mark.asyncio
    async def test_verbose_mode_logs_delay(self, make_request, mock_response, caplog):
        import logging
        handler = make_handler({
            "RANDOM_DELAY_RANGE": (0.0, 0.0),
            "RANDOM_DELAY_VERBOSE": True,
        })
        handler._delegate.download_request = AsyncMock(return_value=mock_response)

        with caplog.at_level(logging.DEBUG, logger="scrapy_random_delay.handler"):
            await handler.download_request(make_request(), MagicMock())

        assert "Random delay" in caplog.text


class TestClose:
    def test_close_delegates_to_inner_handler(self):
        handler = make_handler()
        handler._delegate.close = MagicMock()
        handler.close()
        handler._delegate.close.assert_called_once()

Configure pytest in pyproject.toml by adding this section:

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths    = ["tests"]

Run the tests with uv:

uv run pytest

uv run executes the command inside the project's virtual environment, which uv manages automatically. No source .venv/bin/activate needed.

Completing pyproject.toml

Fill in the project metadata. uv has already set the basics: add the rest:

[project]
name            = "scrapy-random-delay"
version         = "0.1.0"
description     = "A Scrapy download handler that adds a configurable random delay before each request."
readme          = "README.md"
license         = { file = "LICENSE" }
requires-python = ">=3.9"
authors         = [
    { name = "Your Name", email = "you@example.com" },
]
keywords        = ["scrapy", "web-scraping", "download-handler", "rate-limiting"]
classifiers     = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Framework :: Scrapy",
    "Topic :: Internet :: WWW/HTTP",
    "Topic :: Software Development :: Libraries :: Python Modules",
]
dependencies = [
    "scrapy>=2.8",
]

[dependency-groups]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.24",
]

[project.urls]
Homepage      = "https://github.com/yourname/scrapy-random-delay"
Documentation = "https://github.com/yourname/scrapy-random-delay#readme"
Repository    = "https://github.com/yourname/scrapy-random-delay.git"
Issues        = "https://github.com/yourname/scrapy-random-delay/issues"

[build-system]
requires      = ["uv_build>=0.11.7,<0.12.0"]
build-backend = "uv_build"

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths    = ["tests"]

The classifiers list is used by PyPI for filtering and discovery. The Framework :: Scrapy classifier ensures your package appears when someone browses the Scrapy ecosystem. Browse the full list at pypi.org/classifiers.

Writing the README

# scrapy-random-delay

A Scrapy download handler that adds a configurable random delay before each request.

Scrapy's built-in `DOWNLOAD_DELAY` setting adds a fixed delay between requests.
This package draws a delay from a uniform distribution between a minimum and maximum,
which produces more natural request timing and is configurable per domain.

## Installation

    uv add scrapy-random-delay
    # or
    pip install scrapy-random-delay

## Usage

Add to your Scrapy `settings.py`:

    DOWNLOAD_HANDLERS = {
        "http":  "scrapy_random_delay.RandomDelayHandler",
        "https": "scrapy_random_delay.RandomDelayHandler",
    }

    RANDOM_DELAY_RANGE = (0.5, 3.0)   # seconds (min, max). Default: (0.5, 2.0)

### Per-domain configuration

    RANDOM_DELAY_PER_DOMAIN = {
        "api.example.com": (1.0, 4.0),   # slower for sensitive endpoints
        "cdn.example.com": (0.0, 0.1),   # faster for static assets
    }

### Verbose logging

    RANDOM_DELAY_VERBOSE = True   # logs the delay chosen for each request

## Settings reference

| Setting | Type | Default | Description |
|---|---|---|---|
| `RANDOM_DELAY_RANGE` | `(float, float)` | `(0.5, 2.0)` | Min and max delay in seconds |
| `RANDOM_DELAY_PER_DOMAIN` | `dict` | `{}` | Per-domain delay overrides |
| `RANDOM_DELAY_VERBOSE` | `bool` | `False` | Log each delay to DEBUG |

## Compatibility

- Python 3.9+
- Scrapy 2.8+

## License

MIT

Publishing to PyPI

Create accounts

You need two accounts: test.pypi.org for the test registry, and pypi.org for the real registry that pip install and uv add use. Use the test registry first, since it resets periodically and will not pollute the real index with test uploads. Enable two-factor authentication on both, as PyPI requires it for publishing.

Publish to Test PyPI

uv has publishing built in via uv publish. No need to install twine or build separately: uv handles the build step automatically.

uv publish --publish-url https://test.pypi.org/legacy/

You will be prompted for your Test PyPI username (__token__) and your API token. Create a token at test.pypi.org/manage/account/token.

Verify the Test PyPI installation

uv add scrapy-random-delay --index https://test.pypi.org/simple/ --extra-index https://pypi.org/simple/

The --extra-index flag allows uv to find dependencies like scrapy from the real PyPI, since they will not be on Test PyPI.

Verify it works:

import scrapy_random_delay
print(scrapy_random_delay.__version__)   # 0.1.0
from scrapy_random_delay import RandomDelayHandler
print(RandomDelayHandler)

Publish to the real PyPI

Once satisfied:

uv publish

Your package is live. Within a few minutes anyone can install it:

uv add scrapy-random-delay
# or
pip install scrapy-random-delay

Using a token stored in the environment

Rather than typing your token on every publish, store it in an environment variable:

export UV_PUBLISH_TOKEN=pypi-your-token-here
uv publish

Or configure it per-project in a .env file (never commit this):

# .env
UV_PUBLISH_TOKEN=pypi-your-token-here

# .gitignore
.env

Automating releases with GitHub Actions

Manually publishing works fine, but automating it means a new version ships by pushing a version tag, with no manual steps:

# .github/workflows/publish.yml
name: Publish to PyPI

on:
  push:
    tags:
      - "v*"   # triggers on tags like v0.1.0, v1.2.3

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Run tests
        run: uv run pytest

  publish:
    needs: test
    runs-on: ubuntu-latest
    environment: pypi

    permissions:
      id-token: write   # required for trusted publishing

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Publish to PyPI
        run: uv publish
        # Uses trusted publishing via OIDC — no token stored in secrets needed
        # Configure at pypi.org/manage/account/publishing/

The workflow is notably short compared to the traditional pip, build, and twine approach, since uv handles the build and publish in a single command.

Trusted publishing is worth setting up. It authenticates the GitHub Actions workflow directly with PyPI using OpenID Connect, so you never need to store a PyPI token as a GitHub secret. Configure it once at pypi.org under your account's Publishing settings, specifying your GitHub username, repository name, and workflow file name.

To cut a release:

# Update version in pyproject.toml and src/scrapy_random_delay/__init__.py
git commit -am "Release v0.2.0"
git tag v0.2.0
git push && git push --tags

The workflow runs, tests pass, the package publishes.

Bumping versions

uv does not yet have a built-in version bump command, but the process is straightforward. The version lives in two places: pyproject.toml and __init__.py, and both need updating together.

For a small package, doing this by hand is fine. For a larger project, bump2version or python-semantic-release automate the version increment, changelog update, git commit, and tag in one command:

uv add --dev bump2version
uv run bump2version patch   # 0.1.0 → 0.1.1
uv run bump2version minor   # 0.1.0 → 0.2.0
uv run bump2version major   # 0.1.0 → 1.0.0

Configure it in pyproject.toml:

[tool.bumpversion]
current_version = "0.1.0"
commit          = true
tag             = true

[[tool.bumpversion.files]]
filename = "pyproject.toml"

[[tool.bumpversion.files]]
filename = "src/scrapy_random_delay/__init__.py"

Now bumping the version, committing, tagging, and triggering the publish workflow is a single command.

Testing against multiple Python versions

Add a test matrix to the workflow to catch version-specific issues before they reach users:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Run tests on Python ${{ matrix.python-version }}
        run: uv run --python ${{ matrix.python-version }} pytest

uv handles Python version management directly: --python 3.9 downloads and uses that Python version without any additional tooling like pyenv.

Keeping your package maintainable

Pin your minimum Scrapy version. scrapy>=2.8 in dependencies sets a floor. Check which Scrapy version introduced the APIs you use and do not allow anything older. uv add scrapy will pick the current latest: tighten the floor to the oldest version you have tested against.

Keep a CHANGELOG. Users upgrading a dependency want to know if anything changed. Keep a CHANGELOG.md with an ## Unreleased section at the top that becomes the next version entry when you tag a release.

Be conservative with dependencies. Every package in dependencies becomes a constraint users must satisfy alongside their own dependencies. A Scrapy extension that depends on pandas and numpy is painful to integrate. Keep the list to what is genuinely required.

Handle deprecation cleanly. When you remove a setting or change an API, keep the old behavior working for at least one minor version and log a deprecation warning. Users who update without reading the changelog will thank you.

Lock files are for applications, not libraries. Commit uv.lock if this is an application. Do not commit it if this is a library, since your users' projects will resolve their own dependency versions and a committed lockfile does not help them.

The complete file tree

After following this guide, your project looks like this:

scrapy-random-delay/
├── src/
│   └── scrapy_random_delay/
│       ├── __init__.py          # exports RandomDelayHandler, sets __version__
│       └── handler.py           # the handler implementation
├── tests/
│   ├── __init__.py
│   ├── conftest.py              # shared fixtures
│   └── test_handler.py          # test suite
├── .github/
│   └── workflows/
│       └── publish.yml          # automated test and publish workflow
├── pyproject.toml               # project metadata, dependencies, tool config
├── uv.lock                      # locked dependency versions (omit for libraries)
├── README.md                    # installation and usage documentation
├── LICENSE                      # MIT or whichever license you choose
└── .gitignore                   # include .env, __pycache__, dist/, *.egg-info/

Next steps

Publishing a Scrapy extension to PyPI is the same process regardless of what the extension does: the download handler is just the example. The same pyproject.toml structure, the same uv publish command, and the same GitHub Actions workflow apply to middlewares, pipelines, item types, and extensions. If you want to go deeper on Scrapy's component system and understand where each type of extension fits, the Modern Scrapy Developer's Guide covers the full architecture from spiders through to deployment on Scrapy Cloud.

How to tell if a page uses JavaScript rendering (and what to do about it)

John Rooney — Mon, 11 May 2026 09:55:57 +0000

You write a scraper, test your selectors in the browser, and everything looks right. Then you run the spider and get back nothing. This is the most common point of confusion for developers new to web scraping: the browser shows you the data, your scraper does not find it, and the two are looking at completely different things.

The browser executes JavaScript before you see anything on screen, but your scraper, unless you specifically configure it to do otherwise, does not. It sees the raw HTML the server sent before any JavaScript ran, and on a modern web application that raw HTML is often just a shell: a few <div> tags, some <script> elements, and no actual content.

Figuring out whether a page uses JavaScript rendering takes about two minutes and a browser you already have open, and once you know what you are dealing with, the path forward is clear.

The two-minute test

Open the page you want to scrape. Right-click anywhere on the page and select View Page Source: not Inspect, not DevTools, but View Page Source. This shows you the raw HTML the server sent before the browser ran any JavaScript.

Now search that source for a piece of text you can see on the rendered page: a product name, a price, a headline, anything specific.

If you find it, the content is in the HTML and your scraper can extract it without JavaScript rendering. If you do not find it, the content was injected by JavaScript after the page loaded, your scraper will not find it either, and you need a different approach.

That is the entire test. Everything else in this guide is detail.

Understanding why this happens

It helps to understand the three different ways a page can deliver its content, because each one requires a different scraping approach.

Server-side rendering. The server builds the complete HTML page and sends it. When the browser receives it, all the content is already in the markup. Wikipedia works this way, many news sites work this way, and older e-commerce platforms work this way. This is the easiest case for scraping: requests and Beautiful Soup are sufficient.

Client-side rendering. The server sends a minimal HTML shell with almost no content, plus a JavaScript bundle. The JavaScript runs in the browser, fetches data from an API, and builds the DOM dynamically, which means the content never exists in the original HTML. React, Vue, and Angular applications built as single-page applications typically work this way, and they require either browser rendering or finding and calling the underlying API directly.

Hybrid rendering. The server sends an HTML page with some content already in it (enough for search engines and initial paint), and JavaScript then enhances the page by adding more data, enabling interactivity, and loading supplementary content. Many modern e-commerce and content sites work this way, and depending on which data you need, you may or may not need JavaScript rendering.

The View Source test tells you which case you are in.

Confirming it with Python

The manual test is fast, but if you want to confirm programmatically or build a check into your scraping toolchain, a comparison of the raw response against the rendered DOM is definitive.

import requests
from bs4 import BeautifulSoup

def check_for_js_rendering(url: str, search_text: str) -> dict:
    """
    Fetch a page with plain requests and check whether expected text is present.
    Returns a diagnostic dict.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

    response = requests.get(url, headers=headers, timeout=15)

    result = {
        "url": url,
        "status_code": response.status_code,
        "content_length": len(response.text),
        "search_text": search_text,
        "found_in_raw_html": search_text.lower() in response.text.lower(),
        "likely_js_rendered": False,
        "notes": [],
    }

    soup = BeautifulSoup(response.text, "lxml")

    # Check for signals that suggest client-side rendering
    script_tags = soup.find_all("script")
    result["script_count"] = len(script_tags)

    # Common JS framework root elements
    js_roots = soup.select("#root, #app, #__next, #__nuxt, [data-reactroot]")
    if js_roots:
        result["notes"].append(f"Found JS framework root element: {js_roots[0]}")
        result["likely_js_rendered"] = True

    # Very little visible text relative to page size is a strong signal
    visible_text = soup.get_text(strip=True)
    text_ratio = len(visible_text) / max(len(response.text), 1)
    result["text_ratio"] = round(text_ratio, 3)
    if text_ratio < 0.05 and len(response.text) > 5000:
        result["notes"].append(f"Low text ratio ({text_ratio:.1%}) suggests JS-rendered content")
        result["likely_js_rendered"] = True

    # Explicit not-found confirmation
    if not result["found_in_raw_html"]:
        result["likely_js_rendered"] = True
        result["notes"].append(f"'{search_text}' not found in raw HTML — almost certainly JS rendered")

    return result


# Usage
result = check_for_js_rendering(
    url="https://example.com/products/headphones",
    search_text="Wireless Headphones"
)

print(f"Status: {result['status_code']}")
print(f"Content length: {result['content_length']} bytes")
print(f"Text ratio: {result['text_ratio']:.1%}")
print(f"Found in raw HTML: {result['found_in_raw_html']}")
print(f"Likely JS rendered: {result['likely_js_rendered']}")
for note in result["notes"]:
    print(f"  -> {note}")

Reading the signals

Even before searching for specific text, certain patterns in the raw HTML tell you what you are dealing with.

Signal one: the HTML is nearly empty

Open View Source and immediately scroll down. A server-rendered page will have recognizable HTML structure: <header>, <main>, product listings, article text, navigation. A client-side rendered page will often look something like this:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>My Store</title>
    <link rel="stylesheet" href="/static/css/main.abc123.css">
  </head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.def456.js"></script>
  </body>
</html>

A <div id="root"> or <div id="app"> with nothing inside it is the unmistakable signature of a React or Vue single-page application. Everything visible in the browser was injected by JavaScript into that empty div.

from bs4 import BeautifulSoup
import requests

def has_empty_app_root(url: str) -> bool:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")

    # These are the standard mounting points for major JS frameworks
    for selector in ("#root", "#app", "#__next", "#__nuxt", "[data-reactroot]"):
        el = soup.select_one(selector)
        if el and not el.get_text(strip=True):
            print(f"Found empty JS root: {selector}")
            return True

    return False

Signal two: JavaScript framework fingerprints

Even on hybrid-rendered pages, certain markers identify which JavaScript framework is in use:

def identify_js_framework(html: str) -> list[str]:
    """Identify JavaScript frameworks present on the page."""
    frameworks = []

    checks = [
        ("React",         ["data-reactroot", "data-reactid", "__REACT_DEVTOOLS"]),
        ("Next.js",       ["__NEXT_DATA__", "_next/static", "__next"]),
        ("Vue",           ["data-v-", "__vue__", "nuxt__", "__NUXT__"]),
        ("Angular",       ["ng-version", "_nghost", "ng-app"]),
        ("Nuxt",          ["__NUXT__", "__nuxt"]),
        ("Gatsby",        ["___gatsby", "__PATH_PREFIX__"]),
        ("Svelte",        ["__svelte", "svelte-"]),
    ]

    for framework, markers in checks:
        if any(marker in html for marker in markers):
            frameworks.append(framework)

    return frameworks


response = requests.get("https://example.com", headers={"User-Agent": "Mozilla/5.0"})
frameworks = identify_js_framework(response.text)
if frameworks:
    print(f"JS frameworks detected: {', '.join(frameworks)}")
    print("Page likely requires JavaScript rendering for full content")
else:
    print("No major JS framework detected — likely server-rendered")

Signal three: the Network tab in DevTools

For any page where you are unsure, the Network tab in browser DevTools gives you a definitive picture. Open DevTools, click the Network tab, and reload the page.

Look at the first request: the one for the HTML document itself. Click it and look at the Response tab. If the response body contains the data you want, you do not need JavaScript rendering. If it contains an empty shell, you do.

While you are there, also filter to Fetch/XHR and check whether the data you want is arriving via an API call in the background, such as a request to /api/products returning JSON. If it is, you may not need browser rendering at all, because you can call that API directly from Python, which is faster and more reliable than rendering the full page. The guide on intercepting XHR and fetch requests covers this in detail.

Signal four: the page works without CSS but breaks without JavaScript

A quick test that sometimes reveals the answer is to open your browser's developer settings, disable JavaScript, and reload the page. If the content disappears or the page shows a "Please enable JavaScript" message, the content is JavaScript-rendered.

What to do about it

Once you have confirmed the page requires JavaScript rendering, you have three paths forward. They are not mutually exclusive: the right choice depends on what the page is doing, how much data you need, and how much complexity you want to manage.

Path one: find and call the API directly

This is always the first thing to try, because when it works it is the best outcome. Many JavaScript-rendered pages load their data from a JSON API, and if you can find that API, you can call it directly from Python and get clean, structured data without running a browser at all.

Open the Network tab in DevTools, filter to Fetch/XHR, reload the page, and look for requests returning JSON that contains the data you want. Right-click any promising request and select Copy, then Copy as cURL.

import requests

# Reproduced from the cURL command copied from DevTools
response = requests.get(
    "https://example.com/api/v2/products",
    params={"category": "electronics", "page": 1},
    headers={
        "Accept": "application/json",
        "Referer": "https://example.com/products",
        "User-Agent": "Mozilla/5.0",
    },
)

data = response.json()
products = data["products"]
print(f"Found {len(products)} products via API")

If the API requires authentication tokens that are generated by JavaScript on page load, you can extract them first and then call the API directly:

from playwright.sync_api import sync_playwright
import requests

def get_api_token(url: str) -> str | None:
    """Extract an API token from a JS-rendered page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_load_state("networkidle")

        # Try common locations for auth tokens
        token = (
            page.evaluate("() => window.__INITIAL_STATE__?.auth?.token") or
            page.evaluate("() => localStorage.getItem('token')") or
            page.get_attribute('meta[name="api-token"]', "content")
        )
        browser.close()
        return token

token = get_api_token("https://example.com/products")
if token:
    response = requests.get(
        "https://example.com/api/products",
        headers={"Authorization": f"Bearer {token}"},
    )
    products = response.json()

Path two: use Playwright or Selenium to render the page

When the API approach is not viable (because the data is not available via a clean API, or the authentication is too complex to reproduce), use a browser automation library to render the page and scrape the resulting DOM.

Playwright is the recommended choice for new projects, since it is faster than Selenium, has a cleaner async API, and supports Chromium, Firefox, and WebKit.

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_with_playwright(url: str) -> list[dict]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Block images and fonts to speed up page loads
        page.route(
            "**/*.{png,jpg,jpeg,gif,webp,svg,woff,woff2,ttf,eot}",
            lambda route: route.abort()
        )

        page.goto(url)

        # Wait for the content you actually need, not just page load
        page.wait_for_selector("article.product", timeout=10000)

        # Parse with Beautiful Soup — the HTML is now fully rendered
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "lxml")
    products = []
    for article in soup.select("article.product"):
        products.append({
            "name": article.select_one(".product-title").get_text(strip=True)
                    if article.select_one(".product-title") else None,
            "price": article.select_one(".price").get_text(strip=True)
                     if article.select_one(".price") else None,
        })

    return products

For async crawls:

import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

async def scrape_page(browser, url: str) -> list[dict]:
    page = await browser.new_page()
    try:
        await page.route(
            "**/*.{png,jpg,jpeg,gif,webp}",
            lambda route: route.abort()
        )
        await page.goto(url)
        await page.wait_for_selector("article.product")
        html = await page.content()
        soup = BeautifulSoup(html, "lxml")
        return [
            {
                "name": a.select_one(".product-title").get_text(strip=True)
                        if a.select_one(".product-title") else None,
            }
            for a in soup.select("article.product")
        ]
    finally:
        await page.close()

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        urls = [
            "https://example.com/products?page=1",
            "https://example.com/products?page=2",
        ]
        tasks = [scrape_page(browser, url) for url in urls]
        results = await asyncio.gather(*tasks)
        await browser.close()
        return [item for page_items in results for item in page_items]

products = asyncio.run(main())

Choosing what to wait for

The most common mistake with Playwright is waiting for the wrong thing. page.wait_for_load_state("load") fires when the initial HTML and scripts have loaded, not when the JavaScript has finished rendering content. Use one of these instead:

# Wait for a specific element you need to be present
page.wait_for_selector(".product-listing")

# Wait for network activity to stop (risky — some pages have background polling)
page.wait_for_load_state("networkidle")

# Wait for a specific API response to complete
with page.expect_response("**/api/products**") as resp_info:
    page.goto(url)
response = resp_info.value
data = response.json()  # intercept the API response directly

# Wait for an element to contain specific text
page.wait_for_function("document.querySelector('.price')?.textContent?.includes('£')")

Path three: use Zyte API

If you are running scrapers at scale, managing a fleet of browser instances is a significant operational burden, since each browser process uses substantial memory (300–500 MB at minimum), headless browsers require careful configuration to avoid detection, and the infrastructure to run them reliably across many concurrent jobs requires real engineering investment.

Zyte API handles browser rendering and bot detection at the infrastructure level. You send a standard HTTP request and get back rendered HTML, with the browser execution, proxy rotation, and fingerprint management handled by the platform.

import requests

response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://example.com/products",
        "browserHtml": True,   # request browser-rendered HTML
    },
)

data = response.json()
rendered_html = data["browserHtml"]

# Parse with Beautiful Soup as normal
from bs4 import BeautifulSoup
soup = BeautifulSoup(rendered_html, "lxml")
products = soup.select("article.product")
print(f"Found {len(products)} products")

In Scrapy, Zyte API integrates via the scrapy-zyte-api package:

# settings.py
ZYTE_API_KEY = "YOUR_API_KEY"
DOWNLOAD_HANDLERS = {
    "http":  "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}

# Spider: request browser-rendered HTML for specific requests
def start_requests(self):
    yield scrapy.Request(
        "https://example.com/products",
        meta={
            "zyte_api_automap": True,
            "zyte_api": {"browserHtml": True},
        },
    )

How to decide

Is the data in the raw HTML? Run the View Source test. If yes, use requests and Beautiful Soup or Scrapy and skip browser rendering entirely.

Is the data loaded via a visible API call? Check the Network tab with the Fetch/XHR filter. If yes, call the API directly from Python, since it will be faster and more reliable than any rendering approach.

Is the data accessible via the API but protected by a token that requires JavaScript to generate? Extract the token with a single Playwright page load, then call the API directly for the actual data scraping: one browser load, many API calls.

Do you need to interact with the page (scroll, click, fill forms, navigate tabs) to reveal the data? Use Playwright or Selenium.

Are you running this at scale and do not want to manage browser infrastructure? Use Zyte API.

When rendering is slower than you expect

Browser rendering is ten to twenty times slower than a plain HTTP request and uses significantly more memory, but a few practical adjustments make a meaningful difference.

Block unnecessary resources. Images, fonts, video, and tracking scripts add load time without contributing to the data you need:

page.route(
    "**/*.{png,jpg,jpeg,gif,webp,svg,ico,woff,woff2,ttf,mp4,webm}",
    lambda route: route.abort()
)

# Also block common tracking and analytics domains
def block_trackers(route):
    blocked = ["google-analytics.com", "googletagmanager.com",
               "facebook.net", "doubleclick.net", "hotjar.com"]
    if any(domain in route.request.url for domain in blocked):
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_trackers)

Reuse browser contexts rather than browser instances. Creating a new browser is expensive; creating a new context within an existing browser is cheap:

async with async_playwright() as p:
    browser = await p.chromium.launch(headless=True)

    async def scrape(url):
        # New context per task — isolated cookies and storage, reuses browser process
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(url)
        html = await page.content()
        await context.close()
        return html

    results = await asyncio.gather(*[scrape(url) for url in urls])
    await browser.close()

Wait for specific elements rather than network idle. networkidle waits until there are no active network connections for 500 ms, and many pages never reach this state because they have background analytics pings. Waiting for the specific element you need is faster and more reliable.

Checking your work

Once you have set up rendering, confirm it is actually returning the content you need:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def verify_rendering(url: str, expected_text: str) -> bool:
    """Confirm that rendering produces the expected content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_load_state("networkidle")
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "lxml")
    found = expected_text.lower() in soup.get_text().lower()
    print(f"'{expected_text}': {'found ✓' if found else 'not found ✗'}")
    return found

verify_rendering("https://example.com/products/headphones", "Wireless Headphones")

Next steps

Once you know a page requires JavaScript rendering and have chosen your approach, the next challenge is often what happens to the data after you extract it, since cleaning, deduplication, and storage work the same regardless of how the HTML was obtained. If you went the API interception route, the guide on intercepting XHR and fetch requests in the browser covers the full workflow from DevTools discovery to paginated API calls in Python. If you are running Playwright at scale inside Scrapy, web scraping dynamic websites with Zyte API covers how to handle rendering and unblocking without managing browser infrastructure yourself.

What to do when websites change and your spider doesn't know

John Rooney — Mon, 11 May 2026 09:37:11 +0000

Empty-field-rate monitoring catches selectors that return nothing. It does not catch selectors that return something wrong. The most damaging form of schema drift is the kind where a selector keeps producing values, the values are syntactically reasonable, and they are no longer the values you wanted. A price selector that quietly starts returning the financing instalment instead of the sticker price will pass every non-empty check while corrupting your data for as long as the drift goes unnoticed. That is the failure mode this post is about.

Drift comes in several flavours

People talk about "schema drift" as if it were one thing, but in scraping practice there are several kinds of drift, each of which fails differently and demands a different defence.

Drift type	Meaning	Example
DOM/layout drift	The page structure changes	Product cards move from table rows to grid cards
Data contract drift	The meaning or format of a field changes	Price changes from numeric text to "Contact us"
Navigation drift	Discovery paths change	Pagination links disappear, replaced by infinite scroll
Output schema drift	The spider output changes shape	A field is renamed or removed in the item definition

The first kind is the most familiar. The second is the most dangerous. When extraction returns nothing, you can catch it with a simple non-empty assertion. When extraction returns something plausible but wrong, the validation has to be semantic, and most pipelines have nothing in place to do that work.

Consider the financing-price example. Before the redesign, your selector .product-price matched a <span> containing the value $129.99. After the redesign, the same class name is reused for a marketing element that displays $11/mo with affirm. Your extractor still returns a string. The string still contains a dollar sign and a number. A naive validator looks at it, decides it is a price, and accepts it. The data is wrong, but nothing in the pipeline knows that.

The dangerous failure is not always when extraction returns nothing. It is when extraction returns something plausible but wrong.

The empty-field-rate metric from the previous post in this series will catch DOM drift that produces blanks. It will not catch data contract drift that produces something that just looks like a real value. For that, you need an extra layer of defence.

Structural fingerprints as smoke alarms

One way to catch a site change before the data goes wrong is to monitor the page structure itself, separately from the data you extract. The basic idea is simple: hash a fragment of the page that should remain stable, store the hash as a baseline, and compare future fetches against it. If the hash changes, something about the page changed, and you have an early warning.

The naive implementation, hashing the raw HTML of the page or the product container, is too noisy to be useful. Modern pages contain rotating ads, A/B test variants, randomised CSS class names from build tools, recommendation widgets, inventory banners, and inline analytics scripts, all of which change between requests without anything meaningful changing on the page. A raw hash will fire constantly and you will learn to ignore it.

The better pattern is a normalised structural fingerprint. The goal is to capture the shape of the page, the hierarchy of tags and the semantic attributes, while discarding everything that varies cosmetically.

from hashlib import sha256
from lxml import html
from copy import deepcopy

VOLATILE_TAGS = {"script", "style", "noscript", "iframe"}
VOLATILE_ATTRS_PREFIX = ("data-track", "data-analytics", "data-test-id-")

def normalize_subtree(element):
    """Return a string representation of structure only, not content or noise."""
    el = deepcopy(element)

    # remove volatile tags entirely
    for tag in VOLATILE_TAGS:
        for node in el.iter(tag):
            node.getparent().remove(node) if node.getparent() is not None else None

    parts = []
    for node in el.iter():
        # keep tag and stable semantic attributes only
        attrs = []
        for k, v in sorted(node.attrib.items()):
            if k.startswith("aria-") or k in {"role", "itemprop", "itemtype"}:
                attrs.append(f"{k}={v}")
        parts.append(f"<{node.tag} {' '.join(attrs)}>")
    return "".join(parts)

def fingerprint(html_str, container_xpath):
    tree = html.fromstring(html_str)
    container = tree.xpath(container_xpath)
    if not container:
        return None
    return sha256(normalize_subtree(container[0]).encode()).hexdigest()

The principle behind the normalisation is to keep the things that should be stable across requests (tag hierarchy, ARIA roles, microdata attributes, intentional data-* attributes) and drop the things that are not (text content, generated class names, scripts, ads, tracking IDs). What remains is a structural fingerprint that changes when the developer of the target site changes the page, and is mostly stable otherwise.

A note on A/B testing: even with normalisation, a single hash mismatch is not a reliable signal of a real change. The site might be serving you a different test variant than the one you fingerprinted last week, and the difference is genuine without being a redesign. The right pattern is to sample more than one fetch before concluding that drift has occurred, and to treat a single mismatch as a prompt for review rather than an automatic alert.

Use fingerprints as smoke alarms, not verdicts. When the hash changes, fire a review task. Do not abort the crawl, do not roll back the deployment, and do not page anyone in the middle of the night. The fingerprint is telling you to look at the page; it is not telling you the page is broken.

Live canary checks before production runs

The fingerprint catches changes after they happen. The canary check catches them before they cost you a full crawl of bad data. The pattern is straightforward: pick a small, stable set of representative URLs, fetch them, run your current extraction logic against them, and assert that the critical fields come back with plausible values.

import pytest
import requests
from myproject.extractors import extract_product

CANARY_URLS = [
    "https://example.com/product/12345",
    "https://example.com/product/67890",
]

@pytest.mark.parametrize("url", CANARY_URLS)
def test_extraction_canary(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    item = extract_product(response.text)

    assert item["title"], f"empty title for {url}"
    assert item["price"], f"empty price for {url}"
    assert _looks_like_price(item["price"]), (
        f"price {item['price']!r} for {url} does not look like a price"
    )
    assert item["availability"] in {"in_stock", "out_of_stock", "preorder"}, (
        f"unexpected availability {item['availability']!r} for {url}"
    )

def _looks_like_price(value):
    import re
    # rejects "$11/mo" style strings, accepts "$129.99" and "129,99 €"
    return bool(re.fullmatch(r"[^\d]?\d{1,3}([.,]\d{3})*([.,]\d{2})?[^\d]?", value.strip()))

The semantic checks are what make this useful. Asserting that the title is non-empty is fine, but asserting that the price actually looks like a price is what catches the financing-string failure mode. The check on availability against a known set rejects values that are syntactically valid strings but no longer in the contract.

Wiring this into CI is a question of cadence. Running canary checks on every commit will produce noise from transient network issues and rate limiting. Running them on a schedule (every few hours, or before each production deployment) gives you a useful signal without the false-positive churn. Failed runs should store the fetched HTML, the extracted item, and the assertion that failed, all as artifacts you can inspect later. A canary that fires and discards the evidence is a canary that wastes your time when you go to investigate.

# .github/workflows/canary.yml
name: Extraction canary
on:
  schedule:
    - cron: "0 */4 * * *"
  workflow_dispatch:

jobs:
  canary:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest tests/canary -v
      - if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: canary-failures
          path: tests/canary/artifacts/

The same A/B caveat applies here. If a canary fails on a single fetch, retry on a fresh request before alerting. If it fails consistently across multiple fetches, the change is real.

Where this connects to the rest of the stack

If your spider is part of a system that pulls a lot of data from a small set of high-value sites, an alternative to maintaining selectors and canaries is to skip the selector-based approach for some content types entirely. Zyte API's pageContent data type, released in late 2025, is one example of a route around the problem: it returns the cleaned main content of a page without you having to maintain selectors at all, which means there is no selector to drift against. That trade-off is not right for every project, especially when you need fine-grained structured fields, but it is worth knowing about when the maintenance cost of a selector-based pipeline starts to dominate.

For pipelines that stay selector-based, the combination of structural fingerprints and canary checks is the strongest defence available. Fingerprints flag that the page changed; canaries verify that your extraction still works against the changed page. Neither is sufficient on its own, and both together still rely on the metrics from the previous post to catch the failure modes they miss.

What to do next

Pick the three or four most valuable URLs in your crawl and write canary checks for them with semantic assertions, not just non-empty checks. Add a normalised structural fingerprint for the same URLs and store the baseline. Run both on a schedule before your next production deployment. That alone will catch most of the silent-failure cases that empty-field-rate monitoring lets through.

In the final post of this series, we will look at the third leg of production-ready scraping: making sure that when something does go wrong mid-run, you can restart the crawl without duplicating data or corrupting state.

Scrapy AutoThrottle: How to tune crawl speed without getting blocked

John Rooney — Thu, 30 Apr 2026 19:58:32 +0000

AutoThrottle is one of Scrapy's most useful production features and one of the most commonly misconfigured. Most guides tell you to add four lines to your settings file and move on. This one explains what the algorithm is actually doing, what its assumptions are, why those assumptions sometimes break down, and how to tune it for real crawls.

If you have enabled AutoThrottle and are not sure whether it is working, or if your crawl speed is not what you expect despite having it turned on, this is the post to read.

What AutoThrottle actually measures

The most common misconception about AutoThrottle: it adjusts based on HTTP response codes. It does not. It adjusts based on response latency, the time between sending a request and receiving a response.

AutoThrottle never looks at whether you are receiving 429s, 503s, or any other status code that might indicate rate limiting or blocking. When responses come back quickly, it assumes the server has capacity and reduces the delay between requests. When responses slow down, it assumes load and increases the delay. The goal is to maintain a target number of in-flight concurrent requests to the server, adjusting the inter-request delay to keep actual concurrency close to that target.

This works well when server latency is an accurate signal of server load. It breaks in several common situations:

Content delivery network (CDN)-cached responses. A CDN edge node returns cached content at sub-millisecond latency. AutoThrottle sees very fast responses and reduces the delay aggressively, potentially hammering the origin server, which AutoThrottle never directly observes.
Silent rate limiting. Some sites return a 200 with a soft-block page, a CAPTCHA, a antiban challenge, or empty results, in fast response time. AutoThrottle interprets this as a healthy server and keeps the rate high. You are being blocked; the algorithm does not know.
Rate limiting via response code. A server that returns 429 in milliseconds looks, to AutoThrottle, like a fast healthy server. Latency-based throttling is irrelevant when the server is enforcing a request cap by policy rather than by load.

Knowing what AutoThrottle measures is what lets you decide when to use it. Before tuning delay settings, it is worth understanding your request requirements thoroughly: the post "The recipe for a request: Scaling data extraction" argues for investigating those requirements carefully before committing to a crawl strategy.

The settings that matter

Five settings control AutoThrottle behavior. Three of them matter for most crawls.

AUTOTHROTTLE_ENABLED = True

# The delay before the algorithm takes over — used for the first few requests
AUTOTHROTTLE_START_DELAY = 1.0

# Ceiling on the computed delay — prevents runaway backoff on very slow servers
AUTOTHROTTLE_MAX_DELAY = 60.0

# Target number of in-flight concurrent requests to the server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Log per-request throttling decisions — useful for tuning
AUTOTHROTTLE_DEBUG = False

`AUTOTHROTTLE_TARGET_CONCURRENCY`

This is the most commonly misunderstood setting. It is not a hard concurrency cap; it is a target. AutoThrottle computes a delay to try to maintain approximately this many in-flight requests at any given time, given the observed latency.

The default of 1.0 means AutoThrottle aims for one request in flight at a time. At 200ms average response time, that translates to roughly five requests per second. At 50ms, roughly 20 requests per second. Raising TARGET_CONCURRENCY makes the crawl more aggressive; at 4.0 with 200ms average latency, AutoThrottle aims for about 20 requests per second.

Also note the interaction with CONCURRENT_REQUESTS_PER_DOMAIN: if that cap is lower than what AutoThrottle's computed delay would allow, the cap takes precedence. Both settings need to be sized for your throughput target.

`AUTOTHROTTLE_MAX_DELAY`

Without this, a slow or overloaded server can push computed delays into minutes. The default of 60 seconds is reasonable for polite crawling, but it means a blocked or very slow server will stall your crawl for up to a minute between requests. Set it proportional to your acceptable throughput floor: a ceiling of 10 seconds means the slowest you crawl is one request per 10 seconds.

`AUTOTHROTTLE_START_DELAY`

This is the delay used for the first few requests before the algorithm has enough latency samples to make good decisions. Set it to something in the range of the delay you would use if setting DOWNLOAD_DELAY manually. Too low and the first few requests come in a burst; too high and you waste time at the start of each domain.

Reading the debug output

The most direct way to understand what AutoThrottle is doing is to enable debug logging:

AUTOTHROTTLE_DEBUG = True

This logs one line per request showing the throttle decision:

2026-01-15 09:42:11 [scrapy.extensions.throttle] DEBUG: slot: example.com
  prev/curr concurrency: 3/4
  prev/curr latency: 0.24s/0.31s
  target latency: 0.31s
  delay: 0.08s -> 0.10s

slot is the domain this decision applies to.

prev/curr concurrency shows how many requests were in-flight on the previous request versus the current one, indicating whether actual concurrency is matching the target.

prev/curr latency is the signal AutoThrottle is responding to. Rising latency drives the delay up; falling latency drives it down.

target latency is what the algorithm is aiming for, derived from TARGET_CONCURRENCY and current observed latency.

delay is the computed inter-request delay before and after this adjustment. This is the number that feeds into the actual crawl rate.

What to look for during tuning:

Delay stuck at MAX_DELAY consistently. The server is responding slowly or you are being rate limited by latency (rare). Consider raising MAX_DELAY if the server is legitimately slow, or investigate whether you are being blocked.
Delay near zero consistently. AutoThrottle is running with almost no restriction. Either the server is very fast or TARGET_CONCURRENCY is too high relative to CONCURRENT_REQUESTS_PER_DOMAIN.
Concurrency consistently below target. The server is slower than expected and AutoThrottle cannot achieve the target without excessive load. Lower TARGET_CONCURRENCY.

When to use AutoThrottle, a fixed delay, or neither

AutoThrottle makes sense when you are crawling a domain where server response time is a reliable signal of server load, you want adaptive behavior that maximizes throughput without overloading the server, and you are not under a specific documented rate limit. That covers most general-purpose crawls.

Use a fixed DOWNLOAD_DELAY when you have a specific rate limit to respect: an API with documented caps, or a crawl policy agreed with the site operator. Fixed delays give you auditable, predictable behavior. The Scrapy 2026 release improved retry logic and delay handling, making fixed-delay crawls more reliable in high-volume scenarios.

DOWNLOAD_DELAY = 1.0           # one second between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # randomise between 0.5s and 1.5s

RANDOMIZE_DOWNLOAD_DELAY (on by default when DOWNLOAD_DELAY is set) makes the timing pattern look less mechanical, which helps with some simple bot detection approaches.

Use neither when you are crawling many small domains in parallel where per-domain load is negligible, or when you are using a scraping API such as Zyte API that manages politeness for you. Adding a delay in those cases only reduces throughput without benefiting the target server.

Patterns that trick AutoThrottle

CDN caching

Many high-traffic sites serve content from CDN edges that return cached responses in under 10ms. AutoThrottle sees fast responses and reduces the delay, sometimes to near zero. The origin server may be under significant load from your crawl; CDN latency tells you nothing about it. If you are crawling a CDN-backed site and need to be polite to the origin, use a fixed DOWNLOAD_DELAY rather than AutoThrottle.

Silent rate limiting

A site that returns a soft-block page in fast response time looks identical to a successfully served page from AutoThrottle's perspective. You are being blocked; the algorithm sees only fast responses.

The diagnostic: if your item rate drops while response latency stays low and AutoThrottle is not backing off, you are being silently blocked. Slowing down will not fix it. The problem is elsewhere in the request fingerprint.

Variable-latency backends

Some sites have pages that respond in 30ms (static HTML, heavily cached) and pages that take 800ms (product pages with real-time inventory lookups). AutoThrottle makes decisions based on recent latency samples and will oscillate: fast pages drive the delay down, then a batch of slow pages backs it up. If the fast pages are not the ones you need to throttle for, lower TARGET_CONCURRENCY to give more headroom, or separate fast and slow page types into different crawl jobs with different throttle settings.

Settings cheat sheet

Recommended starting points for three common scenarios:

Scenario	`TARGET_CONCURRENCY`	`MAX_DELAY`	`START_DELAY`
Polite crawl of a single domain	1.0	10	2.0
Faster crawl of a cooperative server	4.0	30	1.0
Multi-domain crawl, many parallel slots	2.0	60	1.0

Enable AUTOTHROTTLE_DEBUG = True, run a short crawl, read the output, and adjust from there.

A note on blocking vs rate limiting

AutoThrottle addresses one problem: sending requests too fast for the server to handle comfortably. It does not address fingerprint-based detection, including TLS fingerprints, HTTP header patterns, browser behavior signatures, or IP reputation. Modern anti-bot systems primarily detect scrapers through these signals, not through request rate alone.

If you are following AutoThrottle's guidance on request rate and still getting blocked, the rate is probably not the issue. The problem is in the request fingerprint, and that requires a different solution.

What the Scrapy Maintainer Thinks About AI-Generated Scrapers

Neha Setia — Sat, 11 Apr 2026 15:13:41 +0000

I sat down with Adrian Chaves, one of the lead Scrapy maintainers, who also works at Zyte, to ask him the questions I've been chewing on since Zyte launched Web Scraping Copilot: what happens when an LLM writes your spider(the web scraping code)? What gets easier? What doesn't change?

His answers surprised me. A few highlights:

On vibe coding: Adrian has thoughts about developers treating scraper generation as a black box, and why Scrapy's design philosophy matters more, not less, when an LLM is writing the code.
The bottleneck isn't what you think. He argues the hard part of scraping in 2026 isn't writing code. It's reading pages. And that's the part LLMs still struggle with.
What "good design meeting the future halfway" means. Why frameworks like Scrapy that were built for humans are turning out to be the best frameworks for AI agents too.
Where LLMs actually help. The concrete places where AI makes a scraper developer's life better, and where it just adds complexity.

Full conversation is on the Zyte blog, linked below. If you're building scrapers, thinking about adding AI to your extraction pipeline, or just curious what someone who's been maintaining one of the most widely used scraping frameworks for years thinks about all of this, it's worth a read.

Read the full interview on zyte.com

Happy to discuss in the comments.
What are you using AI for in your scraping workflow right now, and where have you hit walls?

Tags: web scraping • scrapy • ai • opus • anti-bot • Claude AI • sonnet • open source

Stop using Python `requests` for web scraping: there are better & modern libraries instead

Ayan Pahwa — Thu, 09 Apr 2026 11:09:31 +0000

While the 'Requests' library remains the default choice for many Python developers due to its reliability and extensive documentation, the Python HTTP landscape has evolved considerably.

Modern alternatives now offer significant advantages, including built-in asynchronous support, HTTP/2 compatibility, enhanced performance, and up-to-date TLS handling.

This article introduces and compares three such contemporary clients: HTTPX, curl_cffi, and rnet, detailing their unique features and practical applications.

The problem with Requests for web scraping

It's important to clarify Requests' limitations before proceeding; for simple API interactions with well-behaved endpoints, it still remains the de facto standard.

However, a major drawback of the Requests library when it comes to web scraping is its predictable HTTP client fingerprint. This fingerprint, a unique combination of TLS version, cipher suites, HTTP headers, and connection characteristics, is sent with every request, and is well-known and cataloged by anti-bot systems.

Consequently, if you're interacting with any endpoint, including APIs or services protected by anti-ban vendors, your request can be blocked purely based on how the requests library identifies itself. This happens even before your credentials or payload are scrutinized, highlighting a significant limitation when targeting systems that perform client-side validation.

In addition to issues like fingerprinting, a major limitation of the requests library is its lack of native asynchronous support. This absence of async capability is particularly problematic when handling workloads that involve numerous HTTP requests. Without it, the calls execute sequentially, and the program's thread remains blocked for the entire duration of each individual request.

For straightforward scenarios, the standard requests API call remains perfectly functional, as demonstrated in a quick example.

import requests

response = requests.get(
    "https://jsonplaceholder.typicode.com/posts/1",
    timeout=10,
)
response.raise_for_status()
data = response.json()
print(data["title"])

Clean and simple. For a one-off call to a standard REST API, this is fine. The gaps start showing when you need concurrency, HTTP/2, or when the target endpoint does any kind of client validation.

Install the Alternatives

pip install httpx       or  uv add https
pip install curl-cffi       or  uv add curl-cffi
pip install rnet        or  uv add rnet &&
                    uv add asyncio

1. HTTPX

HTTPX is the most direct upgrade from Requests as the API is nearly identical. If you know Requests, you already know most of HTTPX. What it adds is first-class async support, HTTP/2, and a more modern internal architecture.

Where it differs from Requests is the explicit use of a Client context manager (strongly recommended over module-level function calls) and the AsyncClient for async usage. This gives you connection pooling and proper resource cleanup by default.

HTTPX is the right starting point if you're looking for a migration that requires minimal code changes.

Example: Sync

import httpx

with httpx.Client(timeout=10.0) as client:
    response = client.get("https://jsonplaceholder.typicode.com/posts/1")
    response.raise_for_status()
    data = response.json()

print(data["title"])

Example: Async (calling the Zyte API)

Async is where HTTPX really earns its keep. Here it's used to fire multiple requests to the Zyte API concurrently, each request blocks on the server side until extraction is complete, but your event loop stays free to send others in parallel:

import os
import asyncio
import httpx

API_KEY = os.environ["ZYTE_API_KEY"]
ENDPOINT = "https://api.zyte.com/v1/extract"

urls = [
    "https://example.com",
    "https://httpbin.org",
]

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    response = await client.post(
        ENDPOINT,
        json={"url": url, "browserHtml": True},
        auth=(API_KEY, ""),
    )
    response.raise_for_status()
    return response.json()

async def main():
    async with httpx.AsyncClient(timeout=60.0) as client:
        results = await asyncio.gather(*[fetch(client, url) for url in urls])
    for result in results:
        print(result["url"], "—", len(result["browserHtml"]), "chars")

asyncio.run(main())

Notes:

raise_for_status() raises httpx.HTTPStatusError on 4xx/5xx responses.
HTTP/2 support requires pip install httpx[http2] and passing http2=True to the client.
The 60-second timeout accounts for the Zyte API's server-side blocking behavior — it holds the connection open until extraction completes.

2. curl_cffi

curl_cffi wraps libcurl with Python bindings and adds something HTTPX doesn't have: TLS fingerprint impersonation. It can show the TLS handshake of Chrome, Firefox, Safari, and other browsers. For API calls hitting endpoints protected by anti-ban or similar systems, this can be the difference between getting a response and getting a 403.

The interface closely mirrors Requests, with the addition of the impersonate parameter. It supports both sync and async usage. For most API calls where fingerprinting isn't a concern, curl_cffi behaves just like Requests, the impersonate parameter is opt-in.

Example: Sync

from curl_cffi import requests

response = requests.get(
    "https://jsonplaceholder.typicode.com/posts/1",
    impersonate="chrome",
    timeout=10,
)
response.raise_for_status()
data = response.json()
print(data["title"])

Example: Async (calling the Zyte API)

import os
import asyncio
from curl_cffi.requests import AsyncSession

API_KEY = os.environ["ZYTE_API_KEY"]
ENDPOINT = "https://api.zyte.com/v1/extract"

payload = {
    "url": "https://example.com",
    "browserHtml": True,
}

async def call_zyte_api():
    async with AsyncSession(impersonate="chrome") as session:
        response = await session.post(
            ENDPOINT,
            json=payload,
            auth=(API_KEY, ""),
            timeout=60,
        )
        response.raise_for_status()
        data = response.json()
        print(data["url"], "—", len(data["browserHtml"]), "chars")

asyncio.run(call_zyte_api())

Notes:

impersonate="chrome" sends Chrome's TLS fingerprint on every request made through this session.
Other supported values include "firefox", "safari", "chrome110", and more — check the curl-cffi docs for the full list.
The sync interface (from curl_cffi import requests) is nearly identical to the requests module, making it the easiest drop-in if you only need sync.

3. rnet

rnet is the newest of the three. Like a lot of modern Python, it's built on Rust, making it async-first and performance-oriented. Like curl_cffi, it supports TLS impersonation, but its primary differentiator is throughput. It is designed for high-concurrency workloads where you're firing many requests simultaneously.

The API surface is different from Requests, so it's not a drop-in replacement. But the patterns are clean and modern, and for async-heavy workloads it's worth the minor adjustment.

Example: Sample library code

import asyncio
from rnet import Impersonate, Client


async def main():
    # Build a client
    client = Client(impersonate=Impersonate.Firefox139)

    # Use the API you're already familiar with
    resp = await client.get("https://tls.peet.ws/api/all")

    # Print the response
    print(await resp.text())


if __name__ == "__main__":
    asyncio.run(main())

Notes:

rnet is async-first; sync support is limited.
Response body methods like .json() and .text() are awaitable.
The Rust core makes it particularly well-suited for high-throughput concurrent workloads.

Comparison Table

Feature	Requests	HTTPX	curl_cffi	rnet
Sync Support	✅ Yes	✅ Yes	✅ Yes	⚠️ Limited
Async support	❌ No	✅ Yes	✅ Yes	✅ Yes (primary)
HTTP/2	❌ No	✅ With extra dependencies	✅ Via libcurl	✅ Built-in
Performance	Baseline	Good	Good–High	High
TLS changes	❌ No	❌ No	✅ Yes	✅ Yes

When to use which

Use Requests for simple, one-off scripts, internal tooling, or any situation where you're hitting a cooperative API endpoint and don't need concurrency. Nothing wrong with it in that context.

Use HTTPX when you need async, want the closest migration path from Requests, or need HTTP/2. It's the safest default upgrade for most projects.

Use curl_cffi when TLS fingerprint control matters, whether that's because you're hitting an anti-ban wall or an API with strict client validation, or any service that checks how a client identifies itself at the TLS layer.

Use rnet when raw async performance is the priority. Its Rust foundation makes it the strongest choice for high-concurrency workloads where you're firing many requests simultaneously and need low overhead.

The optimal choice is determined by several factors: your concurrency requirements, the target endpoint's sensitivity to client identification, and the desired similarity between the new code and your existing requests implementation.

Writing production-ready Scrapy spiders with opencode

John Rooney — Wed, 08 Apr 2026 09:19:39 +0000

AI-enabled code editors can now conjure scraping code on command. But anyone who has used a generic coding agent to build a spider knows what comes next: a plausible-looking file that falls apart the moment it hits a real website. The selectors are fragile, the error handling is missing, and the structure ignores everything Scrapy actually expects from production code.

The problem is not the AI. It's the prompts, the context, and knowing where to let the agent drive and where to stay in control. This article walks through using opencode to build Scrapy spiders that are actually deployable, covering setup, the prompts that work, and the pitfalls that will burn you if you are not careful.

Why opencode works well for scraping projects

Most AI coding agents are designed around general-purpose software projects. opencode is different in one important way: it is terminal-native, model-agnostic, and designed to operate inside your actual working directory. It reads your project, understands your file structure, and writes code into the files that already exist rather than pasting snippets into a chat window.

For Scrapy projects specifically, this matters. A spider is not a standalone script. It depends on items, settings, middlewares, pipelines, and page objects. An agent that can see all of those files at once produces far better output than one operating on a blank context.

opencode also supports custom commands stored as Markdown files. That means you can encode your own Scrapy conventions as reusable prompts and call them every time you start a new spider, without retyping the same context.

Getting set up

Install opencode with the one-liner:

curl -fsSL https://opencode.ai/install | bash

On macOS and Linux, the Homebrew tap gives you the fastest updates:

brew install anomalyco/tap/opencode

On Windows, use WSL for the best experience. The choco install opencode path works but the terminal experience is noticeably smoother inside a Linux environment.

Once installed, connect your model provider. The /connect command in the terminal user interface walks you through it. If you want to avoid managing API keys from multiple providers, opencode Zen gives you a curated set of pre-tested models through a single subscription at opencode.ai/auth.

For scraping work, choose a model with a large context window. Spider files, page objects, items, and a sample HTML fixture can easily fill 20,000 tokens before you have written a single prompt. Models with at least 64k context are the practical minimum.

Initialize your Scrapy project first

Before you open opencode, scaffold your Scrapy project as you normally would:

scrapy startproject myproject
cd myproject

Then initialize opencode inside the project root:

opencode init

This creates an AGENTS.md file. Commit it. opencode reads this file on every session to understand how your project is structured. Fill it with the conventions your project follows: which item classes exist, which middlewares are active, whether you are using scrapy-poet page objects, and which version of Zyte API or other HTTP backends you are using. The more context AGENTS.md carries, the less you repeat yourself in prompts.

A minimal AGENTS.md for a Scrapy project looks like this:

# Project conventions

- Python 3.12, Scrapy 2.12
- All spiders use scrapy-poet page objects (never parse in the spider class itself)
- Item classes are defined in items.py using dataclasses
- Zyte API is configured via scrapy-zyte-api; ZYTE_API_KEY is in .env
- Settings live in settings.py; never hardcode values in spider files
- All spiders output to JSON Lines via FEEDS setting
- Test fixtures live in tests/fixtures/ as .html files

The prompts that actually work

Generic prompts produce generic code. The prompts below are tested patterns that produce Scrapy-idiomatic output.

Starting a new spider

The most common mistake is asking opencode to "write a spider for X." That produces a working script, not a Scrapy spider. Be specific about structure:

Create a Scrapy spider for https://books.toscrape.com that:
- Uses a scrapy-poet page object called BookListPage for list pages and BookDetailPage for detail pages
- Extracts: title, price, availability, star rating, and product URL
- Handles pagination by following the "next" link
- Stores results in a BookItem dataclass in items.py
- Does not put any CSS selector logic inside the spider class itself

Start with the page objects in pages.py, then write the spider in spiders/books.py.

The explicit constraint against putting selectors in the spider class is important. Without it, the agent will inline everything, which defeats scrapy-poet's purpose and makes the code harder to test.

Asking for resilient selectors

Generated selectors are often too specific. They target a class that is only present on one layout variant, or chain through five levels of nesting that will break on the next site deploy.

Prompt the agent to justify its selector choices:

Write the CSS selectors for BookDetailPage. For each field, explain why you chose
that selector over alternatives. Prefer attribute-based selectors (like [itemprop] or
[data-*]) over class names where both options exist.

This produces more defensive selectors and, more importantly, gives you enough reasoning to judge whether to accept them.

Adding error handling

The agent will skip error handling unless you ask for it explicitly:

Add error handling to BookDetailPage:
- If price is missing, log a warning and return None (do not raise an exception)
- If star rating cannot be parsed, default to 0
- Add a try/except around the availability field and log the raw text if parsing fails

Never assume the agent will add graceful degradation on its own. It optimizes for the happy path.

Writing tests

opencode is genuinely useful for generating pytest fixtures and test scaffolding. Give it a concrete fixture to work from:

Write a pytest test for BookDetailPage using the HTML fixture at
tests/fixtures/book_detail.html. Test that:
- title is extracted as a non-empty string
- price is a float greater than zero
- availability is one of: "In stock", "Out of stock"
- star_rating is an integer between 0 and 5

Use pytest parametrize if testing multiple fixture variants.

Pitfalls to watch for

The agent assumes the HTML is static

By default, any spider the agent generates will use response.css() or response.xpath() on raw HTML. If your target site renders content with JavaScript, those selectors return nothing. Before you run any generated spider, check whether the target page is JavaScript-rendered by viewing source in your browser. If the data you need is absent from the raw HTML, prompt the agent to use Zyte API's headless browser or a Playwright download handler instead of a plain HTTP request.

Selectors written against one page break on others

The agent writes selectors against whatever HTML you give it. If you paste a single product page, it will produce selectors that work on that product page. Run the spider against 10 or 20 URLs from the same site before treating the selectors as reliable.

Ask the agent to help you validate coverage:

Here are three different product page HTML snippets from the same site (pasted below).
Identify any selectors in BookDetailPage that would fail on snippet 2 or snippet 3,
and suggest more robust alternatives.

Context window exhaustion mid-session

Long sessions that involve large HTML files, multiple spider files, and back-and-forth debugging will eventually exhaust the model's context. When this happens, the agent starts contradicting earlier decisions or forgetting your project conventions.

The fix is to keep sessions short and focused. One session per spider, or one session per refactor task. Use your AGENTS.md to carry conventions across sessions rather than re-explaining them in chat.

Generated settings override your existing configuration

When the agent writes setup instructions, it often suggests adding settings directly to settings.py. If you already have a settings file, this can clobber existing values or introduce conflicts. Review every settings change the agent proposes before accepting it.

The agent does not know about anti-bot measures

opencode has no knowledge of whether a site actively blocks scrapers. It will happily generate a spider that will be blocked immediately in production. Anti-bot handling, rate limiting, and request fingerprinting are your responsibility to layer in. Zyte API handles the blocking and fingerprinting side; you still need to configure the integration yourself rather than expecting the agent to know it is necessary.

Useful custom commands for scraping

opencode custom commands let you encode reusable prompts as Markdown files in ~/.config/opencode/commands/. Here are three worth setting up for any Scrapy workflow.

`user:new-spider`

# New scrapy-poet spider

Create a new Scrapy spider for the URL provided by the user.
- Use scrapy-poet page objects (list page + detail page if applicable)
- Put all selector logic in page objects, nothing in the spider class
- Use item dataclasses from items.py (create new ones if needed)
- Include pagination handling
- Add logging for missing fields (warning level)
- Write page objects first, then the spider

Ask the user for the target URL before starting.

`user:harden-selectors`

# Harden selectors

Review the page objects in the current file. For each CSS or XPath selector:
1. Identify whether it targets a class, ID, tag, or attribute
2. If it targets a class name, suggest an attribute-based or structural alternative
3. Flag any selectors that chain more than three levels deep as fragile

Output a revised version of the file with improved selectors and inline comments
explaining each change.

`user:gen-tests`

# Generate pytest tests

Given a page object file and an HTML fixture provided by the user:
1. Write a pytest test file that covers all extracted fields
2. Test that required fields are non-null and the correct type
3. Test that optional fields handle absence gracefully (None, not exception)
4. Use parametrize if multiple fixture variants are present

Ask for the fixture file path before starting.

Where opencode fits in the workflow

Think of opencode as a fast first-draft tool, not an autonomous spider factory. The right workflow is:

Scaffold the project and write AGENTS.md manually
Use opencode to generate page objects and the spider skeleton
Review every selector by hand before trusting it
Run the spider against a sample of real URLs and inspect the output
Use opencode to patch failures and write tests
Handle anti-bot, rate limiting, and deployment yourself

The agent saves the most time on the repetitive structural work: boilerplate item classes, pagination logic, field extraction scaffolding, and test stubs. The judgment calls around which selectors are robust, whether a site is JavaScript-rendered, and how to handle blocking remain entirely in your hands.

That division of labor is what makes this approach work at production scale rather than just for prototypes.

Try it yourself

Install opencode, initialize it in an existing Scrapy project, and start with the user:new-spider custom command above. Pick a publicly accessible, static site like books.toscrape.com to test the workflow before applying it to a site with more complexity.

For the JavaScript-rendered sites and anything with active anti-bot measures, pair opencode's code generation with Zyte API to handle the access layer. You can sign up for a free trial and have a working integration running in minutes. The Zyte documentation covers the scrapy-zyte-api configuration in detail.

Small models, big ideas: what Google Gemma and MoE mean for developers

Ayan Pahwa — Tue, 07 Apr 2026 12:50:03 +0000

We at zyte-devrel try to stay plugged into what is happening in the AI and developer tooling space, not just because it is interesting, but because a lot of it starts having real implications for how we build and think about web data pipelines. Lately, one development that has had us genuinely curious is Google's new Gemma 4 model family, and specifically the direction it points toward with Mixture of Experts (MoE) architecture.

This is not a deep tutorial. It is more of a "hey, here is what we have been poking at" - the kind of update we would share in a Slack channel or over coffee. If you wanna participate in such discussions, our discord is always a welcoming platform.

What is Gemma 4?

Gemma has been dubbed as stripped down versions of Google Gemini. The new Gemma 4 is Google's latest family of open-weight language models, released last week. The lineup covers four sizes:

2B: ultra-efficient, built for mobile and edge devices
4B: enhanced multimodal capabilities, still edge-deployable
26B: sparse model using Mixture of Experts architecture (more on this below)
31B: dense model for more demanding tasks

All four variants support multimodal input (text and images), over 140 languages, a 128K-256K token context window, and agentic workflows with tool use and JSON output. The 2B and 4B models are specifically designed to run fully offline on modern edge devices like smartphones, with no internet dependency at all.

According to Google's Gemma 4 model page, the family ranks third among open-weighted models on the LM Arena leaderboard and uses 2.5 times fewer tokens than comparable models for equivalent tasks.

The Gemma 4 26B MoE, specially caught my attention because unlike other variants it's based on MoE architecture and it does make a difference :

What is MoE, and why does it matter?

Mixture of Experts (MoE) is one of those ideas that sounds complex but is actually pretty intuitive once you hear the analogy.

In a traditional dense neural network, every parameter in the model activates for every input. It is like calling your entire company into a meeting every time someone has a question. It works, but it is expensive.

MoE works differently. Instead of one large model doing everything, you have a set of smaller "expert" sub-networks, each specialized in different patterns, plus a router that looks at each incoming token and decides which one or two experts to activate. Most of the model sits idle at any given moment.

The result: you get the quality of a much larger model at a fraction of the inference cost.

The Gemma 4 26B model is a great illustration of this. It has 26 billion total parameters, but during inference it only activates around 3.8 billion of them. You get near-26B quality at roughly 3.8B compute cost. That is the MoE advantage in one number.

Other models that take the same approach:

Mixtral 8x7B: eight experts, two active per token; it outperforms Llama 2 70B on most benchmarks at far lower inference cost
Kimi: Moonshot AI's model, also MoE-based, has been making similar waves in the open-model space

For a deep dive on how MoE works under the hood, the Hugging Face guide to mixture of experts is well worth the read.

Since the models are free, if you have the right machine you can host them lcoally using Ollama or call them using API services like OpenRouter.

My prefered way of using a new mode is through Claude, but I believe Gemma4 has a different tool calling structure so it is not compatible yet, but you can use it with LMStudio, or skil all that because you can now

Run Gemma 4 offline on an iPhone

Here is the part worth sharing, because it genuinely surprised us.

Using the Google Edge AI Gallery app from the App Store, you can load a Gemma 4 model and run it with airplane mode on. No API calls, no cloud round-trips, no data leaving the device. Just the model running locally on your phone.

The experience is not going to replace a foundational frontier model for complex reasoning. But that is not the point. For quick classification, summarization, or just experimenting with local inference, the 2B and 4B variants are remarkably capable, and there are zero API costs with no data leaving your device. And since it is multi-modal you can practically point your phone camera to a paper recipt and ask it to save the details in a spreadhseet.

If you have not tried running a local large language model (LLM) yet, this is probably the lowest-friction entry point on hardware you already own.

Why should developers building data pipelines care?

Here is where it connects back to what a lot of us are building.

When LLMs run on-device or at the edge, the calculus around data pipelines shifts in a few useful directions:

Tokens are getting expensive and when a model as good as Gemma 4 or Qwen-3.5 is free and open-weighted it's a welcome development. Everyone's complaining about running out of their claude usage quota last couple of weeks or getting huge bills, thanks to giving Opus API Keys to OpenClaw. These things can be significantly addressed using Open Models.

No API round-trips: on-device inference eliminates latency from cloud API calls. For classification tasks running inside a scraping pipeline, this is a meaningful difference.
Data privacy: running extraction locally means scraped content never leaves your infrastructure. For regulated industries or sensitive datasets, that is a significant advantage.
Cost at scale: if you are doing high-volume classification — is this a product page? is this content in the target language? — running a small local model beats paying per-token at scale.
Edge preprocessing: a small LLM can filter and classify pages before they ever reach a more expensive cloud model for deeper analysis, and I am personally looking forward to run them on SBCs like a Raspberry Pi.
Open Weights: people often confuse open-weights models with open-source models, while the lines may be blurry and even I don't fully understand the difference, one thing I know for sure is that Gemma 4 is available under the Apache 2.0 license, which allows building and selling products on top of it and open-weights allows you to fine-tune it for your use-case or application.

Here's me playing it with it on my iPhone 16, completely offline:

Just checking in

We do not have grand proclamations here. This is a space that is moving fast, and we are learning alongside everyone else.

If you have been experimenting with local LLMs in your scraping or data extraction workflows, we would genuinely love to hear about it. Drop a comment below, or find us on the Zyte discord and read more interesting blogs on Zyte Blog.

If you want to try this yourself, here are three good starting points:

Google Edge Gallery: available on the App Store and Playstore, runs Gemma 4 locally on iOS
Gemma models on Hugging Face: for running on desktop or server
Google's Gemma 4 model page: full family overview, benchmarks, and architecture details

Build Scrapy spiders in 23.54 seconds with this free Claude skill

John Rooney — Mon, 30 Mar 2026 17:50:39 +0000

I built a Claude skill that generates Scrapy spiders in under 30 seconds — ready to run, ready to extract good data. In this post I'll walk through what I built, the design decisions behind it, and where I think it can go next.

What it does

The skill takes a single input: a category or product listing URL. From there, Claude generates a complete, runnable Scrapy spider as a single Python script. No project setup, no configuration files, no boilerplate to write. Just a script you can run immediately.

Here's what that looks like in practice. I opened Claude Code in an empty folder with dependencies installed, activated the skill, and said: "Create a spider for this site" — and pasted a URL.

Within seconds, the script was generated. I ran it, watched the products roll in, piped the output through jq, and had clean structured product data. Start to finish: under a minute.

Why a single-file script, not a full Scrapy project?

Scrapy is usually a full project — multiple files, lots of moving parts, a proper setup process. Running it from a script instead is generally discouraged for production work, but for this use case it's actually the right call.

The goal here is what I'd call pump-and-dump scraping: give Claude a URL, get a spider, run it for a couple of days, move on. It's not designed to scrape millions of products every day for years. For that kind of scale you need proper infrastructure, robust monitoring, and serious logging. This isn't that — and that's intentional.

What you do get, even in the single-file approach, is almost everything Scrapy offers: middleware, automatic retries, and concurrency handling. You'd have to build all of that yourself with a plain requests script. Scrapy gives it to you for free, even when running from a script.

The key design decision: AI extraction

The other major call I made was to lean entirely on Zyte API's AI extraction rather than generating CSS or XPath selectors.

Specifically, the skill uses two extraction types chained together: productNavigation on the category or listing page, which returns product URLs and the next page link, and product on each product URL, which returns structured product data including name, price, availability, brand, SKU (stock keeping unit), description, images, and more.

This means the spider doesn't need to know anything about the structure of the site it's crawling. There are no selectors to generate, no schema to define, no user confirmation step. The AI on Zyte's end handles all of that. It does cost slightly more than a raw HTTP request, but given how little time it takes to go from URL to working spider, the trade-off makes sense.

I've hardcoded httpResponseBody as the extraction source — it's faster and more cost-efficient than browser rendering. If a site is JavaScript-heavy and you're not getting the data you need, you can switch to browserHtml with a one-line change. The spider logs a warning to remind you of this.

The use case is deliberately narrow

This skill is designed for e-commerce sites, and only e-commerce sites. That's not a limitation I stumbled into — it's a feature.

Because the scope is narrow, the spider structure is simple and predictable: category pages with pagination, product links, and detail pages. Zyte API's productNavigation and product extraction types handle this reliably. Widening the scope to arbitrary crawling would require a lot more of Scrapy's machinery and would quickly exceed what makes sense for a lightweight script like this.

What it doesn't do: deep subcategory crawling, link discovery, or full-site crawls. If a page renders all its products without pagination, that works fine — the next page just returns nothing.

Logging and output

I replaced Scrapy's default logging with Rich logging, which gives cleaner terminal output. Scrapy's logs are verbose in ways that aren't useful when you're running a short-lived script — I wanted something concise enough that if something went wrong, it would be obvious at a glance.

Output goes to a .jsonl file named after the spider, alongside a plain .log file. Both are derived from the spider name, which is itself derived from the domain. Run example_com.py, get example_com.jsonl and example_com.log.

Where this goes next

The immediate next step I have in mind is selector-based extraction as an alternative path — useful for sites where the AI extraction isn't quite right, or where you want more control over exactly what gets pulled.

The longer-term vision is running this fully agentically. URLs get submitted somewhere — a queue, a database table, a form — an agent picks them up, builds the spider, and maybe runs a quick validation. The spider then goes into a pool to be run on a schedule, and data lands in a database rather than a flat file. Give Claude access to a virtual private server (VPS) via terminal and most of this is achievable without much extra infrastructure. The skill is already the hard part.

Download the skill

The skill is free to download and use. It's a single .skill file you can install directly into Claude Code. You'll need:

pip install scrapy scrapy-zyte-api rich
export ZYTE_API_KEY=your_key_here

Scrapy 2.13 or above is required for AsyncCrawlerProcess.

The link to the repo and the skill download are in the video description, and here. If you've built something similar, or have thoughts on the design decisions — especially around the extraction approach or the logging setup — I'd love to hear it in the comments. GitHub links to your own scrapers are very welcome too.

If you're interested in more agentic scraping patterns, I also built a Claude skill that helps spiders recover from excessive bans — you can watch that video here.

I Built a Self-Healing Web Scraper to Auto-Solve 403s

John Rooney — Mon, 16 Mar 2026 11:09:31 +0000

Web scraping has a recurring enemy: the 403. Sites add bot detection, anti-scraping tools update their challenges, and scrapers that worked fine last week start silently failing. The usual fix is manual — check the logs, diagnose the cause, update the config, redeploy. I wanted to see if an agent could handle that loop instead.

So I built a self-healing scraper. After each crawl, a Claude-powered agent reads the failure logs, probes the broken domains with escalating fetch strategies, and rewrites the config automatically. By the next run, it's already fixed itself.

How it works

The project has two parts: a scraper and a self-healing agent.

The scraper

main.py is a straightforward Python scraper driven entirely by a config.json file. Each domain entry tells the scraper which URLs to fetch and how to fetch them:

{
  "id": "books",
  "zyte": true,
  "browser_html": false,
  "urls": ["https://www.bookstocsrape.co.uk/products/..."]
}

There are three fetch modes:

Direct — a plain requests.get(). Fast, free, works for sites that don't block bots.
Zyte API (httpResponseBody) — routes the request through Zyte's residential proxy network. Good for sites that block datacenter IPs.
Zyte API (browserHtml) — spins up a real browser via Zyte, executes JavaScript, and returns the fully-rendered DOM. Required for sites using JS-based bot challenges.

Every request is logged to scraper.log in the same format:

2026-03-14 09:12:01 url=https://... domain_id=scan status=200

If a request throws any exception, it's recorded as a 403. That keeps the log clean and gives the agent a consistent signal to act on.

The self-healing agent

agent.py is a Claude-powered agent that runs after each crawl. It uses the Claude Agent SDK and has access to three tools: Read, Bash, and Edit — enough to operate completely autonomously.

The agent works through a staged process:

Read the log — finds every domain that returned a 403
Cross-reference the config — skips domains already configured to use Zyte
Stage 1 probe — uses the zyte-api CLI to fetch one URL per failing domain with httpResponseBody, then inspects the page <title>
Challenge detection — if the title contains phrases like "Just a moment", "Checking your browser", or "Verifying you are human", the page is flagged as a bot challenge
Stage 2 probe — challenge pages are re-probed using browserHtml, which runs a real browser to bypass JS-based detection
Config update — the agent edits config.json directly, setting zyte: true and/or browserHtml: true for domains that now work

The next crawl automatically uses the right fetch strategy. No manual intervention needed.

Design decisions

Config-driven, not code-driven

Everything lives in config.json. Adding a new domain is a one-liner, and the scraper doesn't need to know anything about individual sites — it just reads the config and follows instructions. The agent writes to the same file, so the loop closes itself naturally.

Graduated fetch strategy

Not every site needs an expensive browser render. By escalating from direct to httpResponseBody to [browserHtml](https://www.zyte.com/zyte-api/headless-browser/) only when necessary, I keep costs manageable. Browser renders are slower and consume more API credits — reserving them for sites that actually need them makes a meaningful difference at scale.

Letting the agent handle the heuristics

The challenge detection logic — matching titles against known bot-detection phrases — is exactly the kind of fuzzy heuristic that's tedious to maintain as code but natural for a language model to reason about. Claude also handles edge cases gracefully: if the zyte-api CLI isn't installed, if the log is empty, if a domain is already correctly configured. A rule-based script would need explicit handling for every one of those scenarios.

The limitations

It's worth being honest about where this approach falls short.

It's reactive, not proactive. The agent only runs after a failed crawl. If a site starts blocking mid-run, those URLs fail silently until the next cycle.

Title-based detection is fragile. Most bot-challenge pages say "Just a moment…" — but a legitimate site could theoretically use that phrase. A false positive would cause the scraper to wastefully use browser rendering where it isn't needed.

One URL per domain. The agent probes only the first failing URL for each domain. Different URL patterns on the same domain can have different bot-detection behaviour, which this doesn't account for.

No rollback. Once the config is updated, there's no way to detect if a Zyte setting later stops working and revert it automatically.

Cost opacity. The scraper logs HTTP status codes, not Zyte API credit consumption. There's no visibility into what each domain actually costs to fetch.

Where I'd take it next

Smarter challenge detection. Rather than keyword-matching on the title, the agent could read the full page HTML and make a more nuanced call — is this a product page, a login wall, or a soft block with a CAPTCHA? Each requires a different response.

Proactive monitoring. A lightweight probe running daily against each configured domain, independent of the main crawl, would let the agent update the config before a full scrape run hits a known-bad configuration.

Per-URL config. Right now zyte and browser_html are set at the domain level. Some sites serve static product pages on one path and JS-rendered category pages on another — granular per-URL settings would handle that cleanly.

Structured data extraction. Right now parse_page only pulls the page title. The natural next step is structured product extraction — price, availability, name, images — either via CSS selectors in the config or Zyte's product extraction type, which uses ML models to parse product data from any page.

Multi-agent parallelism. The self-healing loop is currently a single agent. As the config grows, a coordinator could spawn one subagent per failing domain, each running its own probe pipeline concurrently. The Claude Agent SDK supports subagents natively, so this would be a relatively small change.

The core idea is simple: a scraper that observes its own failures and reconfigures itself. What I found interesting about building it wasn't the scraping itself — it was seeing how little scaffolding the agent actually needed. Three tools, a clear task, and it handles the diagnostic work that would otherwise fall to me.

How I get Claude to build HTML parsing code the way I want it

John Rooney — Wed, 11 Mar 2026 07:13:06 +0000

Getting HTML off a page is only the first step. Once you have it, the real work begins: pulling out the data that actually matters — product names, prices, ratings, specifications — in a clean, structured format you can actually do something with.

That's what the parser skill is for. If you haven't read the introduction to skills in our fetcher post, it's worth a quick look first. But the short version is this: a skill is a SKILL.md file that gives Claude precise, reusable instructions for using a specific tool. The parser skill is one of three that together form a complete web scraping pipeline.

zytelabs / claude-webscraping-skills

claude-webscraping-skills

A collection of claude skills and other tools to assist your web-scraping needs.

video explanations:

https://youtu.be/HH0Q9OfKLu0 https://youtu.be/P2HhnFRXm-I

Other claude tools for web scraping

zyte-fetch-page-content-mcp-server

A Model Context Protocol (MCP) server that runs locally using docker desktop mcp-toolkit and help you extracts clean, LLM-friendly content from any webpage using the Zyte API. Perfect for AI assistants that need to read and understand web content. by Ayan Pahwa

Improve Claude Code WebFetch with Zyte API

When Claude encounters a WebFetch failure, it reads the CLAUDE.md instructions and makes a curl request to the Zyte API endpoint. The API returns base64-encoded HTML, which Claude decodes and processes just like it would with a normal WebFetch response. by Joshua Odmark

Claude skills vs MCP vs Web Scraping CoPilot

View on GitHub

What is a skill?

A skill is a small markdown file that tells Claude how to use a specific script or tool — what it does, when to use it, and step-by-step how to run it. Claude reads the file and follows the instructions as part of a broader workflow, with no manual intervention required.

Skills are composable by design. The fetcher skill hands raw HTML to the parser skill, which hands structured JSON to the compare skill. Each one does one job well, and they're built to work together.

The parser skill's front matter sets out its purpose immediately:

---
name: parser
description: "Extracts structured product data from raw HTML. Tries JSON-LD via"
Extruct first, falls back to CSS selectors via Parsel.
---

Two methods, one fallback. That single description line captures the entire logic of the skill.

What the parser skill does

The parser skill takes raw HTML as input and returns a structured JSON object. It uses two extraction methods in sequence, trying the more reliable one first and falling back to the more flexible one if needed.

The primary method uses Extruct to find JSON-LD data embedded in the page. JSON-LD is a structured data format that many modern sites include in their HTML specifically to make their content machine-readable — it's used for search engine optimisation and data portability. When it's present, Extruct can read it cleanly and reliably, with no need to write or maintain selectors.

If no usable JSON-LD is found, the skill falls back to Parsel, which uses CSS selectors to locate data heuristically across the page. This is more flexible but inherently tied to the page's visual structure, which can change.

When to use it

## When to use
Use this skill when you have raw HTML and need to extract structured data from
it — product details, prices, specs, ratings, or any page content.

In practice, that means the parser skill is almost always the second step in a pipeline — running immediately after the fetcher skill has retrieved your HTML. It works with any page type, and handles the most common product fields out of the box.

How it works

## Instructions
1. Save the HTML to a temporary file `page.html`
2. Run `parser.py` against it:
   python parser.py page.html

3. The script outputs a JSON object. Check the `method` field:
   - "extruct" — clean structured data was found, use it directly
   - "parsel" — fell back to CSS selectors, review fields for completeness
4. If key fields are missing from the Parsel output, ask the user which fields
   they need and re-run with --fields:
   python parser.py page.html --fields "price,rating,brand"

5. Return the parsed JSON to the conversation for use in the Compare skill.

The method field in the output is particularly useful. It tells you immediately how the data was extracted and how much trust to place in it. An "extruct" result is clean and stable. A "parsel" result is worth reviewing, especially if you're working with an unusual page layout.

The --fields flag is a practical escape hatch. Rather than requiring you to dig into the script when key data is missing, it lets you specify exactly what you need and re-run — a much more efficient loop.

Why prefer Extruct?

The notes section of the skill file makes this explicit:

## Notes
- Always prefer the Extruct path — it is more stable and requires no maintenance
- Parsel selectors are generated heuristically and may need adjustment for
  unusual page layouts
- Run once per page; pass all outputs together into the Compare skill

Parsel selectors break when sites redesign. JSON-LD, by contrast, is structured data the site publishes independently of its visual layout. A site can completely overhaul its design and its JSON-LD will often remain untouched. That stability is worth prioritising wherever possible.

What comes next

Once you've run the parser skill across all your target pages, you have a set of structured JSON objects ready to compare. That's where the compare skill picks up — generating tables, summaries, and side-by-side analysis from the extracted data.

Do you need a skill?

The parser skill works well when the data you need maps cleanly onto fields that Extruct or Parsel can find — product names, prices, ratings, and similar structured attributes that sites commonly expose through JSON-LD or consistent HTML patterns. For that category of task, the skill is fast to apply and requires no custom code.

See our post about Skills vs MCP vs Web Scraping Copilot (our VS Code Extension):

zyte.com

But not every extraction problem fits that mould. If you're working with pages that don't include JSON-LD and have highly irregular layouts, Parsel's heuristic selectors may return incomplete or inconsistent results, and you'll spend time debugging field by field. In those cases, a purpose-built extraction script using Parsel or BeautifulSoup directly — with selectors you've written and tested against the specific target — will be more reliable.

For larger-scale or more complex extraction work, Zyte API's automatic extraction capabilities go further still. Rather than relying on selectors at all, automatic extraction uses AI to identify and return structured data from a page without requiring you to specify fields or maintain selector logic. If you're extracting data from many different site structures, or you need extraction to keep working through site redesigns without manual intervention, that's a more robust foundation than a skill-based approach. The parser skill is best understood as a practical middle ground: fast to use, good enough for a wide range of common cases, and easy to slot into a pipeline — but not a replacement for extraction tooling built for scale or resilience.

I gave Claude access to a web scraping API

John Rooney — Wed, 11 Mar 2026 07:10:43 +0000

If you've worked with Claude for any length of time, you've probably noticed it can do a lot more than answer questions. With the right setup, it can take actions — running scripts, processing files, working through multi-step workflows autonomously. Skills are what make that possible.

What is a skill?

A skill is a small, self-contained instruction set that tells Claude how to use a specific tool or script to accomplish a well-defined task. Technically, it's a markdown file — a SKILL.md — that describes what a tool does, when to reach for it, and exactly how to run it. Claude reads that file and follows the instructions as part of a larger workflow.

Skills are designed to be composable. Each one does one thing well, and they're built to hand off to each other. The fetcher skill retrieves HTML. The parser skill extracts data from it. The compare skill turns multiple parsed outputs into a structured comparison. Together, they form a complete scraping pipeline — and Claude orchestrates the whole thing.

See our skills here: https://github.com/zytelabs/claude-webscraping-skills

The skill format looks like this:

---
name: fetcher
description: "Fetches raw HTML from a URL using httpx, with automatic fallback to Zyte API if blocked."
---

That front matter is what Claude uses to match the right skill to the right task. The description is deliberately precise: it tells Claude not just what the skill does, but how it does it, so Claude can reason about whether it's the right tool for the job.

What the fetcher skill does

The fetcher skill's job is exactly what it sounds like: given a URL, fetch the raw HTML and return it. It uses httpx as its primary HTTP client — a modern, performant Python library well suited to scraping workloads.

What makes it more than a simple wrapper is the fallback logic. A significant number of sites actively block automated requests. Without a fallback, a blocked request just fails, and you're left manually diagnosing why. The fetcher skill handles this automatically. If a request comes back with a BLOCKED status, it retries via Zyte API, which provides built-in unblocking. Most of the time, you get your HTML without ever needing to intervene.

When to use it

The skill's SKILL.md is explicit about this:

## When to use
Use this skill when the user provides one or more URLs and asks you to fetch,
retrieve, scrape, or get the HTML or page content.

In practice, that means any time you're starting a scraping or data extraction task and you have a URL to work from. It's the entry point for the pipeline.

How it works

The instructions in the skill file are straightforward:

## Instructions
1. Run `fetcher.py` with the URL as an argument:
   python fetcher.py <url>

2. If the script returns a successful HTML response, return the HTML to the
   conversation for use in the next step.
3. If the script returns a `BLOCKED` status, re-run with the `--zyte` flag:
   python fetcher.py <url> --zyte

4. Inform the user if a URL could not be fetched after both attempts.

The two-step process keeps things efficient. httpx is fast and lightweight, so it handles the majority of requests without needing to route through Zyte API. The fallback only kicks in when it's needed. If both attempts fail, Claude surfaces that to you clearly rather than silently moving on.

For multiple URLs, the script runs once per URL — there's no batching — so Claude loops through a list sequentially.

Transparency about failure

One detail worth highlighting is the final instruction: inform the user if a URL could not be fetched after both attempts. This might seem obvious, but it reflects a design principle worth being explicit about. A skill that silently drops failed URLs would produce incomplete data downstream, and you might not notice until you're looking at a comparison table with missing rows. Surfacing failures immediately keeps the pipeline honest.

What comes next

The fetcher skill's output is raw HTML — exactly what the parser skill expects as its input. The two are designed to be used in sequence. Once you have the HTML, the parser skill takes over, extracting structured data through JSON-LD or CSS selectors depending on what the page contains.

That handoff is documented in the skill's notes:

## Notes
- For multiple URLs, run the script once per URL
- Pass the raw HTML output into the Parser skill for extraction

The pipeline continues from there.

Do you need a skill?

Skills are a good fit when you have a well-defined, repeatable task that benefits from consistent behaviour across many runs. Fetching HTML from a URL is a clear example: the inputs and outputs are predictable, the fallback logic is always the same, and packaging that into a skill means Claude applies it reliably without you having to re-explain the process each time.

Read our break down of Skills vs MCP vs Web Scraping Copilot here - our VS Code extension

That said, skills aren't always the right tool. If you only need to fetch a handful of pages once, asking Claude to write a quick httpx script directly may be faster and more flexible. Similarly, if your target sites have unusual behaviour — rate limiting, JavaScript rendering, login walls, or multi-step navigation — a bespoke Scrapy spider built with Zyte API gives you far more control than a general-purpose fetch wrapper. Scrapy's middleware architecture, item pipelines, and scheduling make it better suited to large-scale or complex crawls where you need precise control over every aspect of the request cycle.

The fetcher skill sits in the middle: more structured than an ad hoc script, less complex than a full Scrapy project. It's the right choice when you want Claude to handle straightforward retrieval as part of a larger automated workflow, without the overhead of setting up and maintaining a dedicated spider.