<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: John Rooney</title>
    <description>The latest articles on Forem by John Rooney (@john_rooney).</description>
    <link>https://forem.com/john_rooney</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3129763%2F0afdc059-5402-40b8-9c98-07d6831b0839.jpg</url>
      <title>Forem: John Rooney</title>
      <link>https://forem.com/john_rooney</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/john_rooney"/>
    <language>en</language>
    <item>
      <title>Writing production-ready Scrapy spiders with opencode</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 08 Apr 2026 09:19:39 +0000</pubDate>
      <link>https://forem.com/extractdata/writing-production-ready-scrapy-spiders-with-opencode-ea2</link>
      <guid>https://forem.com/extractdata/writing-production-ready-scrapy-spiders-with-opencode-ea2</guid>
      <description>&lt;p&gt;AI-enabled code editors can now conjure scraping code on command. But anyone who has used a generic coding agent to build a spider knows what comes next: a plausible-looking file that falls apart the moment it hits a real website. The selectors are fragile, the error handling is missing, and the structure ignores everything Scrapy actually expects from production code.&lt;/p&gt;

&lt;p&gt;The problem is not the AI. It's the prompts, the context, and knowing where to let the agent drive and where to stay in control. This article walks through using &lt;a href="https://opencode.ai" rel="noopener noreferrer"&gt;opencode&lt;/a&gt; to build Scrapy spiders that are actually deployable, covering setup, the prompts that work, and the pitfalls that will burn you if you are not careful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx2ycq5iahc2vpvushw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx2ycq5iahc2vpvushw9.png" alt="building spiders with opencode"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why opencode works well for scraping projects
&lt;/h2&gt;

&lt;p&gt;Most AI coding agents are designed around general-purpose software projects. opencode is different in a few important ways: it is terminal-native, model-agnostic, and designed to operate inside your actual working directory. It reads your project, understands your file structure, and writes code into the files that already exist rather than pasting snippets into a chat window.&lt;/p&gt;

&lt;p&gt;For Scrapy projects specifically, this matters. A spider is not a standalone script. It depends on items, settings, middlewares, pipelines, and page objects. An agent that can see all of those files at once produces far better output than one operating on a blank context.&lt;/p&gt;

&lt;p&gt;opencode also supports custom commands stored as Markdown files. That means you can encode your own Scrapy conventions as reusable prompts and call them every time you start a new spider, without retyping the same context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting set up
&lt;/h2&gt;

&lt;p&gt;Install opencode with the one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://opencode.ai/install | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On macOS and Linux, the Homebrew tap gives you the fastest updates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;anomalyco/tap/opencode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Windows, use WSL for the best experience. The &lt;code&gt;choco install opencode&lt;/code&gt; path works, but the terminal experience is noticeably smoother inside a Linux environment.&lt;/p&gt;

&lt;p&gt;Once installed, connect your model provider. The &lt;code&gt;/connect&lt;/code&gt; command in the terminal user interface walks you through it. If you want to avoid managing API keys from multiple providers, opencode Zen gives you a curated set of pre-tested models through a single subscription at &lt;a href="https://opencode.ai/auth" rel="noopener noreferrer"&gt;opencode.ai/auth&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For scraping work, choose a model with a large context window. Spider files, page objects, items, and a sample HTML fixture can easily fill 20,000 tokens before you have written a single prompt. A model with at least a 64k-token context window is the practical minimum.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialize your Scrapy project first
&lt;/h3&gt;

&lt;p&gt;Before you open opencode, scaffold your Scrapy project as you normally would:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject myproject
&lt;span class="nb"&gt;cd &lt;/span&gt;myproject
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then initialize opencode inside the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;opencode init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an &lt;code&gt;AGENTS.md&lt;/code&gt; file. Commit it. opencode reads this file on every session to understand how your project is structured. Fill it with the conventions your project follows: which item classes exist, which middlewares are active, whether you are using &lt;a href="https://scrapy-poet.readthedocs.io/" rel="noopener noreferrer"&gt;scrapy-poet&lt;/a&gt; page objects, and which version of &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; or other HTTP backends you are using. The more context &lt;code&gt;AGENTS.md&lt;/code&gt; carries, the less you repeat yourself in prompts.&lt;/p&gt;

&lt;p&gt;A minimal &lt;code&gt;AGENTS.md&lt;/code&gt; for a Scrapy project looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project conventions&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Python 3.12, Scrapy 2.12
&lt;span class="p"&gt;-&lt;/span&gt; All spiders use scrapy-poet page objects (never parse in the spider class itself)
&lt;span class="p"&gt;-&lt;/span&gt; Item classes are defined in items.py using dataclasses
&lt;span class="p"&gt;-&lt;/span&gt; Zyte API is configured via scrapy-zyte-api; ZYTE_API_KEY is in .env
&lt;span class="p"&gt;-&lt;/span&gt; Settings live in settings.py; never hardcode values in spider files
&lt;span class="p"&gt;-&lt;/span&gt; All spiders output to JSON Lines via FEEDS setting
&lt;span class="p"&gt;-&lt;/span&gt; Test fixtures live in tests/fixtures/ as .html files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The prompts that actually work
&lt;/h2&gt;

&lt;p&gt;Generic prompts produce generic code. The prompts below are tested patterns that produce Scrapy-idiomatic output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting a new spider
&lt;/h3&gt;

&lt;p&gt;The most common mistake is asking opencode to "write a spider for X." That produces a working script, not a Scrapy spider. Be specific about structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Create&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Scrapy&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;books&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toscrape&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Uses&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;poet&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt; &lt;span class="n"&gt;called&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;BookDetailPage&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Extracts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;star&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Handles&lt;/span&gt; &lt;span class="n"&gt;pagination&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;following&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Stores&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Does&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;put&lt;/span&gt; &lt;span class="nb"&gt;any&lt;/span&gt; &lt;span class="n"&gt;CSS&lt;/span&gt; &lt;span class="n"&gt;selector&lt;/span&gt; &lt;span class="n"&gt;logic&lt;/span&gt; &lt;span class="n"&gt;inside&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;itself&lt;/span&gt;

&lt;span class="n"&gt;Start&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="n"&gt;objects&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spiders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;books&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The explicit constraint against putting selectors in the spider class is important. Without it, the agent will inline everything, which defeats scrapy-poet's purpose and makes the code harder to test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Asking for resilient selectors
&lt;/h3&gt;

&lt;p&gt;Generated selectors are often too specific. They target a class that is only present on one layout variant, or chain through five levels of nesting that will break on the next site deploy.&lt;/p&gt;

&lt;p&gt;Prompt the agent to justify its selector choices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;&lt;span class="nt"&gt;Write&lt;/span&gt; &lt;span class="nt"&gt;the&lt;/span&gt; &lt;span class="nt"&gt;CSS&lt;/span&gt; &lt;span class="nt"&gt;selectors&lt;/span&gt; &lt;span class="nt"&gt;for&lt;/span&gt; &lt;span class="nt"&gt;BookDetailPage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;For&lt;/span&gt; &lt;span class="nt"&gt;each&lt;/span&gt; &lt;span class="nt"&gt;field&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;explain&lt;/span&gt; &lt;span class="nt"&gt;why&lt;/span&gt; &lt;span class="nt"&gt;you&lt;/span&gt; &lt;span class="nt"&gt;chose&lt;/span&gt;
&lt;span class="nt"&gt;that&lt;/span&gt; &lt;span class="nt"&gt;selector&lt;/span&gt; &lt;span class="nt"&gt;over&lt;/span&gt; &lt;span class="nt"&gt;alternatives&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;Prefer&lt;/span&gt; &lt;span class="nt"&gt;attribute-based&lt;/span&gt; &lt;span class="nt"&gt;selectors&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;like&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;itemprop&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="nt"&gt;or&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;data-&lt;/span&gt;&lt;span class="o"&gt;*])&lt;/span&gt; &lt;span class="nt"&gt;over&lt;/span&gt; &lt;span class="nt"&gt;class&lt;/span&gt; &lt;span class="nt"&gt;names&lt;/span&gt; &lt;span class="nt"&gt;where&lt;/span&gt; &lt;span class="nt"&gt;both&lt;/span&gt; &lt;span class="nt"&gt;options&lt;/span&gt; &lt;span class="nt"&gt;exist&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces more defensive selectors and, more importantly, gives you enough reasoning to judge whether to accept them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding error handling
&lt;/h3&gt;

&lt;p&gt;The agent will skip error handling unless you ask for it explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Add&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="n"&gt;handling&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;BookDetailPage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;warning&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;None &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;do&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;If&lt;/span&gt; &lt;span class="n"&gt;star&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="n"&gt;cannot&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Add&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;around&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsing&lt;/span&gt; &lt;span class="n"&gt;fails&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Never assume the agent will add graceful degradation on its own. It optimizes for the happy path.&lt;/p&gt;
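&lt;p&gt;As a sketch of what the requested degradation looks like once generated, here are the two rules from the prompt above as standalone helpers. The helper names are illustrative, not taken from a real page object:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging
import re

logger = logging.getLogger(__name__)

# books.toscrape.com encodes ratings as a CSS class like "star-rating Three"
WORD_TO_STARS = {"Zero": 0, "One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(raw):
    """Return the price as a float, or None (with a warning) if missing."""
    if not raw:
        logger.warning("price missing from page")
        return None
    match = re.search(r"[\d.]+", raw)
    return float(match.group()) if match else None

def parse_star_rating(css_class):
    """Map a rating class to an int, defaulting to 0 when unparseable."""
    for word, stars in WORD_TO_STARS.items():
        if css_class and word in css_class.split():
            return stars
    return 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The point is not the helpers themselves but that each failure mode has an explicit, logged outcome instead of an unhandled exception.&lt;/p&gt;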

&lt;h3&gt;
  
  
  Writing tests
&lt;/h3&gt;

&lt;p&gt;opencode is genuinely useful for generating pytest fixtures and test scaffolding. Give it a concrete fixture to work from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Write&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;BookDetailPage&lt;/span&gt; &lt;span class="n"&gt;using&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;HTML&lt;/span&gt; &lt;span class="n"&gt;fixture&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt;
&lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;fixtures&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;book_detail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;extracted&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;non&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="n"&gt;greater&lt;/span&gt; &lt;span class="n"&gt;than&lt;/span&gt; &lt;span class="n"&gt;zero&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;In stock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Out of stock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;star_rating&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;integer&lt;/span&gt; &lt;span class="n"&gt;between&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="n"&gt;Use&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt; &lt;span class="n"&gt;parametrize&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;testing&lt;/span&gt; &lt;span class="n"&gt;multiple&lt;/span&gt; &lt;span class="n"&gt;fixture&lt;/span&gt; &lt;span class="n"&gt;variants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pitfalls to watch for
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The agent assumes the HTML is static
&lt;/h3&gt;

&lt;p&gt;By default, any spider the agent generates will use &lt;code&gt;response.css()&lt;/code&gt; or &lt;code&gt;response.xpath()&lt;/code&gt; on raw HTML. If your target site renders content with JavaScript, those selectors return nothing. Before you run any generated spider, check whether the target page is JavaScript-rendered by viewing source in your browser. If the data you need is absent from the raw HTML, prompt the agent to use &lt;a href="https://www.zyte.com/zyte-api/headless-browser/" rel="noopener noreferrer"&gt;Zyte API's headless browser&lt;/a&gt; or a Playwright download handler instead of a plain HTTP request.&lt;/p&gt;
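&lt;p&gt;One way to automate that view-source check is to look for a value you expect to scrape in the raw HTML before writing any selectors. A minimal sketch (fetch the body with any HTTP client; the sample pages here are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def rendered_server_side(raw_html, expected_value):
    """True if the value you plan to scrape is already in the raw HTML,
    i.e. the page does not depend on JavaScript to render it."""
    return expected_value in raw_html

# A server-rendered page ships the data directly:
static_page = "&lt;html&gt;&lt;body&gt;&lt;h1&gt;A Light in the Attic&lt;/h1&gt;&lt;/body&gt;&lt;/html&gt;"

# A JS-rendered shell ships an empty mount point plus a bundle instead:
js_shell = "&lt;html&gt;&lt;body&gt;&lt;div id=app&gt;&lt;/div&gt;&lt;script src=bundle.js&gt;&lt;/script&gt;&lt;/body&gt;&lt;/html&gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the check fails on the raw body, no amount of selector tweaking will help; switch the request to a rendering backend first.&lt;/p&gt;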

&lt;h3&gt;
  
  
  Selectors written against one page break on others
&lt;/h3&gt;

&lt;p&gt;The agent writes selectors against whatever HTML you give it. If you paste a single product page, it will produce selectors that work on that product page. Run the spider against 10 or 20 URLs from the same site before treating the selectors as reliable.&lt;/p&gt;

&lt;p&gt;Ask the agent to help you validate coverage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;Here are three different product page HTML snippets from the same site (pasted below).
Identify any selectors in BookDetailPage that would fail on snippet 2 or snippet 3,
and suggest more robust alternatives.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Context window exhaustion mid-session
&lt;/h3&gt;

&lt;p&gt;Long sessions that involve large HTML files, multiple spider files, and back-and-forth debugging will eventually exhaust the model's context. When this happens, the agent starts contradicting earlier decisions or forgetting your project conventions.&lt;/p&gt;

&lt;p&gt;The fix is to keep sessions short and focused. One session per spider, or one session per refactor task. Use your &lt;code&gt;AGENTS.md&lt;/code&gt; to carry conventions across sessions rather than re-explaining them in chat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generated settings override your existing configuration
&lt;/h3&gt;

&lt;p&gt;When the agent writes setup instructions, it often suggests adding settings directly to &lt;code&gt;settings.py&lt;/code&gt;. If you already have a settings file, this can clobber existing values or introduce conflicts. Review every settings change the agent proposes before accepting it.&lt;/p&gt;
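&lt;p&gt;For example, with the JSON Lines convention from &lt;code&gt;AGENTS.md&lt;/code&gt;, &lt;code&gt;settings.py&lt;/code&gt; should keep a single &lt;code&gt;FEEDS&lt;/code&gt; dict and have any agent-proposed feeds merged into it. The path template below is illustrative; &lt;code&gt;%(name)s&lt;/code&gt; and &lt;code&gt;%(time)s&lt;/code&gt; are Scrapy's documented feed URI parameters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# settings.py (fragment): one FEEDS dict is the single source of truth.
# A second, agent-generated FEEDS assignment further down the file would
# silently replace this one, so merge new feeds into the existing dict.
FEEDS = {
    "output/%(name)s-%(time)s.jsonl": {
        "format": "jsonlines",
        "encoding": "utf8",
        "overwrite": False,
    },
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;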

&lt;h3&gt;
  
  
  The agent does not know about anti-bot measures
&lt;/h3&gt;

&lt;p&gt;opencode has no knowledge of whether a site actively blocks scrapers. It will happily generate a spider that will be blocked immediately in production. Anti-bot handling, rate limiting, and request fingerprinting are your responsibility to layer in. &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; handles the blocking and fingerprinting side; you still need to configure the integration yourself rather than expecting the agent to know it is necessary.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful custom commands for scraping
&lt;/h2&gt;

&lt;p&gt;opencode custom commands let you encode reusable prompts as Markdown files in &lt;code&gt;~/.config/opencode/commands/&lt;/code&gt;. Here are three worth setting up for any Scrapy workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;user:new-spider&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# New scrapy-poet spider&lt;/span&gt;

Create a new Scrapy spider for the URL provided by the user.
&lt;span class="p"&gt;-&lt;/span&gt; Use scrapy-poet page objects (list page + detail page if applicable)
&lt;span class="p"&gt;-&lt;/span&gt; Put all selector logic in page objects, nothing in the spider class
&lt;span class="p"&gt;-&lt;/span&gt; Use item dataclasses from items.py (create new ones if needed)
&lt;span class="p"&gt;-&lt;/span&gt; Include pagination handling
&lt;span class="p"&gt;-&lt;/span&gt; Add logging for missing fields (warning level)
&lt;span class="p"&gt;-&lt;/span&gt; Write page objects first, then the spider

Ask the user for the target URL before starting.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;user:harden-selectors&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Harden selectors&lt;/span&gt;

Review the page objects in the current file. For each CSS or XPath selector:
&lt;span class="p"&gt;1.&lt;/span&gt; Identify whether it targets a class, ID, tag, or attribute
&lt;span class="p"&gt;2.&lt;/span&gt; If it targets a class name, suggest an attribute-based or structural alternative
&lt;span class="p"&gt;3.&lt;/span&gt; Flag any selectors that chain more than three levels deep as fragile

Output a revised version of the file with improved selectors and inline comments
explaining each change.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;code&gt;user:gen-tests&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Generate pytest tests&lt;/span&gt;

Given a page object file and an HTML fixture provided by the user:
&lt;span class="p"&gt;1.&lt;/span&gt; Write a pytest test file that covers all extracted fields
&lt;span class="p"&gt;2.&lt;/span&gt; Test that required fields are non-null and the correct type
&lt;span class="p"&gt;3.&lt;/span&gt; Test that optional fields handle absence gracefully (None, not exception)
&lt;span class="p"&gt;4.&lt;/span&gt; Use parametrize if multiple fixture variants are present

Ask for the fixture file path before starting.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where opencode fits in the workflow
&lt;/h2&gt;

&lt;p&gt;Think of opencode as a fast first-draft tool, not an autonomous spider factory. The right workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scaffold the project and write &lt;code&gt;AGENTS.md&lt;/code&gt; manually&lt;/li&gt;
&lt;li&gt;Use opencode to generate page objects and the spider skeleton&lt;/li&gt;
&lt;li&gt;Review every selector by hand before trusting it&lt;/li&gt;
&lt;li&gt;Run the spider against a sample of real URLs and inspect the output&lt;/li&gt;
&lt;li&gt;Use opencode to patch failures and write tests&lt;/li&gt;
&lt;li&gt;Handle anti-bot, rate limiting, and deployment yourself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent saves the most time on the repetitive structural work: boilerplate item classes, pagination logic, field extraction scaffolding, and test stubs. The judgment calls around which selectors are robust, whether a site is JavaScript-rendered, and how to handle blocking remain entirely in your hands.&lt;/p&gt;

&lt;p&gt;That division of labor is what makes this approach work at production scale rather than just for prototypes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;Install opencode, initialize it in an existing Scrapy project, and start with the &lt;code&gt;user:new-spider&lt;/code&gt; custom command above. Pick a publicly accessible, static site like &lt;a href="https://books.toscrape.com" rel="noopener noreferrer"&gt;books.toscrape.com&lt;/a&gt; to test the workflow before applying it to a site with more complexity.&lt;/p&gt;

&lt;p&gt;For the JavaScript-rendered sites and anything with active anti-bot measures, pair opencode's code generation with &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; to handle the access layer. You can &lt;a href="https://app.zyte.com/account/signup/zyteapi" rel="noopener noreferrer"&gt;sign up for a free trial&lt;/a&gt; and have a working integration running in minutes. The &lt;a href="https://docs.zyte.com/" rel="noopener noreferrer"&gt;Zyte documentation&lt;/a&gt; covers the scrapy-zyte-api configuration in detail.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>ai</category>
      <category>opencode</category>
      <category>scrapy</category>
    </item>
    <item>
      <title>Build Scrapy spiders in 23.54 seconds with this free Claude skill</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:50:39 +0000</pubDate>
      <link>https://forem.com/extractdata/build-scrapy-spiders-in-2354-seconds-with-this-free-claude-skill-50em</link>
      <guid>https://forem.com/extractdata/build-scrapy-spiders-in-2354-seconds-with-this-free-claude-skill-50em</guid>
      <description>&lt;p&gt;I built a Claude skill that generates &lt;a href="https://docs.scrapy.org/en/latest" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; spiders in under 30 seconds — ready to run, ready to extract good data. In this post I'll walk through what I built, the design decisions behind it, and where I think it can go next.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;The skill takes a single input: a category or product listing URL. From there, Claude generates a complete, runnable Scrapy spider as a single Python script. No project setup, no configuration files, no boilerplate to write. Just a script you can run immediately.&lt;/p&gt;

&lt;p&gt;Here's what that looks like in practice. I opened Claude Code in an empty folder with dependencies installed, activated the skill, and said: "Create a spider for this site" — and pasted a URL.&lt;/p&gt;

&lt;p&gt;Within seconds, the script was generated. I ran it, watched the products roll in, piped the output through &lt;code&gt;jq&lt;/code&gt;, and had clean structured product data. Start to finish: under a minute.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/2pQD412kJIw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a single-file script, not a full Scrapy project?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.scrapy.org/en/latest/topics/practices.html" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; is usually a full project — multiple files, lots of moving parts, a proper setup process. Running it from a script instead is generally discouraged for production work, but for this use case it's actually the right call.&lt;/p&gt;

&lt;p&gt;The goal here is what I'd call pump-and-dump scraping: give Claude a URL, get a spider, run it for a couple of days, move on. It's not designed to scrape millions of products every day for years. For that kind of scale you need proper infrastructure, robust monitoring, and serious logging. This isn't that — and that's intentional.&lt;/p&gt;

&lt;p&gt;What you do get, even in the single-file approach, is almost everything Scrapy offers: middleware, automatic retries, and concurrency handling. You'd have to build all of that yourself with a plain &lt;code&gt;requests&lt;/code&gt; script. Scrapy gives it to you for free, even when running from a script.&lt;/p&gt;
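&lt;p&gt;The single-file pattern itself is plain Scrapy: a &lt;code&gt;CrawlerProcess&lt;/code&gt; started from the script's main block, as described in the "run Scrapy from a script" docs. A minimal skeleton, with a placeholder spider rather than the skill's actual output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import scrapy
from scrapy.crawler import CrawlerProcess

class SingleFileSpider(scrapy.Spider):
    # Placeholder spider: the skill generates the real extraction logic.
    name = "single_file"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for title in response.css("article.product_pod h3 a::attr(title)").getall():
            yield {"title": title}

if __name__ == "__main__":
    # CrawlerProcess gives the script Scrapy's retries, middleware,
    # and concurrency handling without a project scaffold.
    process = CrawlerProcess(settings={
        "FEEDS": {"items.jsonl": {"format": "jsonlines"}},
        "LOG_LEVEL": "WARNING",
    })
    process.crawl(SingleFileSpider)
    process.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Everything lives in one file, but the crawl still runs through Scrapy's full engine.&lt;/p&gt;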

&lt;h2&gt;
  
  
  The key design decision: AI extraction
&lt;/h2&gt;

&lt;p&gt;The other major call I made was to lean entirely on &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;'s AI extraction rather than generating CSS or XPath selectors.&lt;/p&gt;

&lt;p&gt;Specifically, the skill uses two extraction types chained together: &lt;code&gt;productNavigation&lt;/code&gt; on the category or listing page, which returns product URLs and the next page link, and &lt;code&gt;product&lt;/code&gt; on each product URL, which returns structured product data including name, price, availability, brand, SKU (stock keeping unit), description, images, and more.&lt;/p&gt;

&lt;p&gt;This means the spider doesn't need to know anything about the structure of the site it's crawling. There are no selectors to generate, no schema to define, no user confirmation step. The AI on &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte's&lt;/a&gt; end handles all of that. It does cost slightly more than a raw HTTP request, but given how little time it takes to go from URL to working spider, the trade-off makes sense.&lt;/p&gt;

&lt;p&gt;I've hardcoded &lt;code&gt;httpResponseBody&lt;/code&gt; as the extraction source — it's faster and more cost-efficient than browser rendering. If a site is JavaScript-heavy and you're not getting the data you need, you can switch to &lt;code&gt;browserHtml&lt;/code&gt; with a one-line change. The spider logs a warning to remind you of this.&lt;/p&gt;
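&lt;p&gt;As a rough sketch, the request parameters for the two chained extraction types look like this. Field names follow Zyte API's documented schema; the URLs are placeholders:&lt;/p&gt;

```python
# Roughly the Zyte API request parameters for the two chained extraction
# types. URLs are placeholders; EXTRACT_FROM is the one-line switch.
EXTRACT_FROM = "httpResponseBody"  # change to "browserHtml" for JS-heavy sites

navigation_request = {
    "url": "https://example.com/category/shoes",
    "productNavigation": True,
    "productNavigationOptions": {"extractFrom": EXTRACT_FROM},
}

product_request = {
    "url": "https://example.com/product/123",
    "product": True,
    "productOptions": {"extractFrom": EXTRACT_FROM},
}
```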

&lt;h2&gt;
  
  
  The use case is deliberately narrow
&lt;/h2&gt;

&lt;p&gt;This skill is designed for e-commerce sites, and only e-commerce sites. That's not a limitation I stumbled into — it's a feature.&lt;/p&gt;

&lt;p&gt;Because the scope is narrow, the spider structure is simple and predictable: category pages with pagination, product links, and detail pages. &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;'s &lt;code&gt;productNavigation&lt;/code&gt; and &lt;code&gt;product&lt;/code&gt; extraction types handle this reliably. Widening the scope to arbitrary crawling would require a lot more of Scrapy's machinery and would quickly exceed what makes sense for a lightweight script like this.&lt;/p&gt;

&lt;p&gt;What it doesn't do: deep subcategory crawling, link discovery, or full-site crawls. If a page renders all its products without pagination, that works fine — the next page just returns nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging and output
&lt;/h2&gt;

&lt;p&gt;I replaced &lt;a href="https://docs.scrapy.org/en/latest/topics/practices.html" rel="noopener noreferrer"&gt;Scrapy's&lt;/a&gt; default logging with Rich logging, which gives cleaner terminal output. Scrapy's logs are verbose in ways that aren't useful when you're running a short-lived script — I wanted something concise enough that if something went wrong, it would be obvious at a glance.&lt;/p&gt;

&lt;p&gt;Output goes to a &lt;code&gt;.jsonl&lt;/code&gt; file named after the spider, alongside a plain &lt;code&gt;.log&lt;/code&gt; file. Both are derived from the spider name, which is itself derived from the domain. Run &lt;code&gt;example_com.py&lt;/code&gt;, get &lt;code&gt;example_com.jsonl&lt;/code&gt; and &lt;code&gt;example_com.log&lt;/code&gt;.&lt;/p&gt;
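&lt;p&gt;One plausible way to derive those names from the start URL (the skill's actual helper may differ):&lt;/p&gt;

```python
from urllib.parse import urlparse

def spider_name(url: str) -> str:
    # "https://www.example.com/shop" -> "example_com"
    host = urlparse(url).netloc
    if host.startswith("www."):
        host = host[4:]
    return host.replace(".", "_").replace("-", "_")

name = spider_name("https://www.example.com/shop")
jsonl_file, log_file = f"{name}.jsonl", f"{name}.log"
```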

&lt;h2&gt;
  
  
  Where this goes next
&lt;/h2&gt;

&lt;p&gt;The immediate next step I have in mind is selector-based extraction as an alternative path — useful for sites where the AI extraction isn't quite right, or where you want more control over exactly what gets pulled.&lt;/p&gt;

&lt;p&gt;The longer-term vision is running this fully agentically. URLs get submitted somewhere — a queue, a database table, a form — an agent picks them up, builds the spider, and maybe runs a quick validation. The spider then goes into a pool to be run on a schedule, and data lands in a database rather than a flat file. Give Claude access to a virtual private server (VPS) via terminal and most of this is achievable without much extra infrastructure. The skill is already the hard part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Download the skill
&lt;/h2&gt;

&lt;p&gt;The skill is free to download and use. It's a single &lt;code&gt;.skill&lt;/code&gt; file you can install directly into Claude Code. You'll need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy scrapy-zyte-api rich
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.zyte.com/blog/scrapy-in-2026-modern-async-crawling/" rel="noopener noreferrer"&gt;Scrapy 2.13 &lt;/a&gt;or above is required for &lt;code&gt;AsyncCrawlerProcess&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The link to the repo and the skill download are in the video description, and &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;here&lt;/a&gt;. If you've built something similar, or have thoughts on the design decisions — especially around the extraction approach or the logging setup — I'd love to hear it in the comments. GitHub links to your own scrapers are very welcome too.&lt;/p&gt;

&lt;p&gt;If you're interested in more agentic scraping patterns, I also built a Claude skill that helps spiders recover from excessive bans — you can watch that video &lt;a href="https://youtu.be/bF24BLZWlOk" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>scrapy</category>
      <category>claudeskills</category>
      <category>webscraping</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a Self-Healing Web Scraper to Auto-Solve 403s</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Mon, 16 Mar 2026 11:09:31 +0000</pubDate>
      <link>https://forem.com/extractdata/i-built-a-self-healing-web-scraper-to-auto-solve-403s-1jg1</link>
      <guid>https://forem.com/extractdata/i-built-a-self-healing-web-scraper-to-auto-solve-403s-1jg1</guid>
      <description>&lt;p&gt;Web scraping has a recurring enemy: the 403. Sites add bot detection, anti-scraping tools update their challenges, and scrapers that worked fine last week start silently failing. The usual fix is manual — check the logs, diagnose the cause, update the config, redeploy. I wanted to see if an agent could handle that loop instead.&lt;/p&gt;

&lt;p&gt;So I built a self-healing scraper. After each crawl, a Claude-powered agent reads the failure logs, probes the broken domains with escalating fetch strategies, and rewrites the config automatically. By the next run, it's already fixed itself.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/bF24BLZWlOk"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The project has two parts: a scraper and a self-healing agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  The scraper
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;main.py&lt;/code&gt; is a straightforward Python scraper driven entirely by a &lt;code&gt;config.json&lt;/code&gt; file. Each domain entry tells the scraper which URLs to fetch and how to fetch them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"zyte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"browser_html"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"https://www.bookstocsrape.co.uk/products/..."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are three fetch modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct&lt;/strong&gt; — a plain &lt;code&gt;requests.get()&lt;/code&gt;. Fast, free, works for sites that don't block bots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; (&lt;code&gt;httpResponseBody&lt;/code&gt;)&lt;/strong&gt; — routes the request through Zyte's residential proxy network. Good for sites that block datacenter IPs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; (&lt;code&gt;browserHtml&lt;/code&gt;)&lt;/strong&gt; — spins up a real browser via Zyte, executes JavaScript, and returns the fully-rendered DOM. Required for sites using JS-based bot challenges.&lt;/li&gt;
&lt;/ul&gt;
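&lt;p&gt;The dispatch between those modes reduces to a few lines of config inspection. This is a sketch of the logic, not the real &lt;code&gt;main.py&lt;/code&gt;:&lt;/p&gt;

```python
def fetch_mode(entry: dict) -> str:
    # Escalation order: direct -> httpResponseBody -> browserHtml.
    # The agent flips these flags in config.json; the scraper just obeys.
    if not entry.get("zyte"):
        return "direct"
    if entry.get("browser_html"):
        return "browserHtml"
    return "httpResponseBody"
```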

&lt;p&gt;Every request is logged to &lt;code&gt;scraper.log&lt;/code&gt; in the same format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-03-14 09:12:01 url=https://... domain_id=scan status=200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a request throws any exception, it's recorded as a 403. That keeps the log clean and gives the agent a consistent signal to act on.&lt;/p&gt;
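&lt;p&gt;A sketch of that logging convention, including the collapse of exceptions into a 403 (the real scraper's internals may differ):&lt;/p&gt;

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)

def log_fetch(url: str, domain_id: str, fetch) -> int:
    try:
        status = fetch(url)
    except Exception:
        # Any failure is normalised to a 403: one consistent signal for the agent
        status = 403
    logging.info("url=%s domain_id=%s status=%s", url, domain_id, status)
    return status

status = log_fetch("https://example.com/", "books", lambda u: 200)
```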

&lt;h3&gt;
  
  
  The self-healing agent
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;agent.py&lt;/code&gt; is a Claude-powered agent that runs after each crawl. It uses the &lt;a href="https://github.com/anthropics/claude-agent-sdk-python" rel="noopener noreferrer"&gt;Claude Agent SDK&lt;/a&gt; and has access to three tools: &lt;code&gt;Read&lt;/code&gt;, &lt;code&gt;Bash&lt;/code&gt;, and &lt;code&gt;Edit&lt;/code&gt; — enough to operate completely autonomously.&lt;/p&gt;

&lt;p&gt;The agent works through a staged process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read the log&lt;/strong&gt; — finds every domain that returned a 403&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-reference the config&lt;/strong&gt; — skips domains already configured to use Zyte&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 1 probe&lt;/strong&gt; — uses the &lt;code&gt;zyte-api&lt;/code&gt; CLI to fetch one URL per failing domain with &lt;code&gt;httpResponseBody&lt;/code&gt;, then inspects the page &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Challenge detection&lt;/strong&gt; — if the title contains phrases like &lt;em&gt;"Just a moment"&lt;/em&gt;, &lt;em&gt;"Checking your browser"&lt;/em&gt;, or &lt;em&gt;"Verifying you are human"&lt;/em&gt;, the page is flagged as a bot challenge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 2 probe&lt;/strong&gt; — challenge pages are re-probed using &lt;code&gt;browserHtml&lt;/code&gt;, which runs a real browser to bypass JS-based detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config update&lt;/strong&gt; — the agent edits &lt;code&gt;config.json&lt;/code&gt; directly, setting &lt;code&gt;zyte: true&lt;/code&gt; and/or &lt;code&gt;browser_html: true&lt;/code&gt; for domains that now work&lt;/li&gt;
&lt;/ol&gt;
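&lt;p&gt;The challenge-detection step boils down to a title check, roughly:&lt;/p&gt;

```python
CHALLENGE_PHRASES = (
    "just a moment",
    "checking your browser",
    "verifying you are human",
)

def is_bot_challenge(title: str) -> bool:
    # Flag pages whose <title> matches known bot-challenge phrasing;
    # flagged domains get re-probed with browserHtml in stage 2.
    t = title.lower()
    return any(phrase in t for phrase in CHALLENGE_PHRASES)
```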

&lt;p&gt;The next crawl automatically uses the right fetch strategy. No manual intervention needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Config-driven, not code-driven
&lt;/h3&gt;

&lt;p&gt;Everything lives in &lt;code&gt;config.json&lt;/code&gt;. Adding a new domain is a one-liner, and the scraper doesn't need to know anything about individual sites — it just reads the config and follows instructions. The agent writes to the same file, so the loop closes itself naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graduated fetch strategy
&lt;/h3&gt;

&lt;p&gt;Not every site needs an expensive browser render. By escalating from direct to &lt;code&gt;httpResponseBody&lt;/code&gt; to &lt;a href="https://www.zyte.com/zyte-api/headless-browser/" rel="noopener noreferrer"&gt;&lt;code&gt;browserHtml&lt;/code&gt;&lt;/a&gt; only when necessary, I keep costs manageable. Browser renders are slower and consume more API credits — reserving them for sites that actually need them makes a meaningful difference at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Letting the agent handle the heuristics
&lt;/h3&gt;

&lt;p&gt;The challenge detection logic — matching titles against known bot-detection phrases — is exactly the kind of fuzzy heuristic that's tedious to maintain as code but natural for a language model to reason about. Claude also handles edge cases gracefully: if the &lt;code&gt;zyte-api&lt;/code&gt; CLI isn't installed, if the log is empty, if a domain is already correctly configured. A rule-based script would need explicit handling for every one of those scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  The limitations
&lt;/h2&gt;

&lt;p&gt;It's worth being honest about where this approach falls short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's reactive, not proactive.&lt;/strong&gt; The agent only runs after a failed crawl. If a site starts blocking mid-run, those URLs fail silently until the next cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Title-based detection is fragile.&lt;/strong&gt; Most bot-challenge pages say &lt;em&gt;"Just a moment…"&lt;/em&gt; — but a legitimate site could theoretically use that phrase. A false positive would cause the scraper to wastefully use browser rendering where it isn't needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One URL per domain.&lt;/strong&gt; The agent probes only the first failing URL for each domain. Different URL patterns on the same domain can have different bot-detection behaviour, which this doesn't account for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No rollback.&lt;/strong&gt; Once the config is updated, there's no way to detect if a Zyte setting later stops working and revert it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost opacity.&lt;/strong&gt; The scraper logs HTTP status codes, not &lt;a href="https://www.zyte.com/pricing/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; credit consumption. There's no visibility into what each domain actually costs to fetch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I'd take it next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Smarter challenge detection.&lt;/strong&gt; Rather than keyword-matching on the title, the agent could read the full page HTML and make a more nuanced call — is this a product page, a login wall, or a soft block with a CAPTCHA? Each requires a different response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive monitoring.&lt;/strong&gt; A lightweight probe running daily against each configured domain, independent of the main crawl, would let the agent update the config &lt;em&gt;before&lt;/em&gt; a full scrape run hits a known-bad configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-URL config.&lt;/strong&gt; Right now &lt;code&gt;zyte&lt;/code&gt; and &lt;code&gt;browser_html&lt;/code&gt; are set at the domain level. Some sites serve static product pages on one path and JS-rendered category pages on another — granular per-URL settings would handle that cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured data extraction.&lt;/strong&gt; Right now &lt;code&gt;parse_page&lt;/code&gt; only pulls the page title. The natural next step is structured product extraction — price, availability, name, images — either via CSS selectors in the config or Zyte's &lt;code&gt;product&lt;/code&gt; extraction type, which uses ML models to parse product data from any page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent parallelism.&lt;/strong&gt; The self-healing loop is currently a single agent. As the config grows, a coordinator could spawn one subagent per failing domain, each running its own probe pipeline concurrently. The Claude Agent SDK supports subagents natively, so this would be a relatively small change.&lt;/p&gt;

&lt;p&gt;The core idea is simple: a scraper that observes its own failures and reconfigures itself. What I found interesting about building it wasn't the scraping itself — it was seeing how little scaffolding the agent actually needed. Three tools, a clear task, and it handles the diagnostic work that would otherwise fall to me.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>agents</category>
      <category>ai</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How I get Claude to build HTML parsing code the way I want it</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 11 Mar 2026 07:13:06 +0000</pubDate>
      <link>https://forem.com/extractdata/html-parsing-with-claude-skills-extracting-structured-data-from-raw-html-1ai0</link>
      <guid>https://forem.com/extractdata/html-parsing-with-claude-skills-extracting-structured-data-from-raw-html-1ai0</guid>
      <description>&lt;p&gt;Getting HTML off a page is only the first step. Once you have it, the real work begins: pulling out the data that actually matters — product names, prices, ratings, specifications — in a clean, structured format you can actually do something with.&lt;/p&gt;

&lt;p&gt;That's what the parser skill is for. If you haven't read the introduction to skills in our fetcher post, it's worth a quick look first. But the short version is this: a skill is a &lt;code&gt;SKILL.md&lt;/code&gt; file that gives Claude precise, reusable instructions for using a specific tool. The parser skill is one of three that together form a complete web scraping pipeline.&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/zytelabs" rel="noopener noreferrer"&gt;
        zytelabs
      &lt;/a&gt; / &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;
        claude-webscraping-skills
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;claude-webscraping-skills&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;A collection of claude skills and other tools to assist your web-scraping needs.&lt;/p&gt;
&lt;p&gt;video explanations:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://youtu.be/HH0Q9OfKLu0" rel="nofollow noopener noreferrer"&gt;https://youtu.be/HH0Q9OfKLu0&lt;/a&gt;
&lt;a href="https://youtu.be/P2HhnFRXm-I" rel="nofollow noopener noreferrer"&gt;https://youtu.be/P2HhnFRXm-I&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Other reading:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="nofollow noopener noreferrer"&gt;https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/&lt;/a&gt;
&lt;a href="https://www.zyte.com/blog/supercharging-web-scraping-with-claude-skills/" rel="nofollow noopener noreferrer"&gt;https://www.zyte.com/blog/supercharging-web-scraping-with-claude-skills/&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;Other claude tools for web scraping&lt;/h4&gt;

&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/apscrapes/zyte-fetch-page-content-mcp-server" rel="noopener noreferrer"&gt;zyte-fetch-page-content-mcp-server&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;A Model Context Protocol (MCP) server that runs locally using docker desktop mcp-toolkit and helps you extract clean, LLM-friendly content from any webpage using the Zyte API. Perfect for AI assistants that need to read and understand web content. by &lt;a href="https://github.com/apscrapes" rel="noopener noreferrer"&gt;Ayan Pahwa&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start="2"&gt;
&lt;li&gt;&lt;a href="https://joshuaodmark.com/blog/improve-claude-code-webfetch-with-zyte-api" rel="nofollow noopener noreferrer"&gt;Improve Claude Code WebFetch with Zyte API&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;When Claude encounters a WebFetch failure, it reads the CLAUDE.md instructions and makes a curl request to the Zyte API endpoint. The API returns base64-encoded HTML, which Claude decodes and processes just like it would with a normal WebFetch response. by &lt;a href="https://joshuaodmark.com/" rel="nofollow noopener noreferrer"&gt;Joshua Odmark&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start="3"&gt;
&lt;li&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="nofollow noopener noreferrer"&gt;Claude skills vs MCP vs Web Scraping CoPilot&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  What is a skill?
&lt;/h2&gt;

&lt;p&gt;A skill is a small markdown file that tells Claude how to use a specific script or tool — what it does, when to use it, and step-by-step how to run it. Claude reads the file and follows the instructions as part of a broader workflow, with no manual intervention required.&lt;/p&gt;

&lt;p&gt;Skills are composable by design. The fetcher skill hands raw HTML to the parser skill, which hands structured JSON to the compare skill. Each one does one job well, and they're built to work together.&lt;/p&gt;

&lt;p&gt;The parser skill's front matter sets out its purpose immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;parser&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;structured&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;HTML.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Tries&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;JSON-LD&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;via"&lt;/span&gt;
&lt;span class="s"&gt;Extruct first, falls back to CSS selectors via Parsel.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Two methods, one fallback. That single description line captures the entire logic of the skill.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/HH0Q9OfKLu0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;
&lt;h2&gt;
  
  
  What the parser skill does
&lt;/h2&gt;

&lt;p&gt;The parser skill takes raw HTML as input and returns a structured JSON object. It uses two extraction methods in sequence, trying the more reliable one first and falling back to the more flexible one if needed.&lt;/p&gt;

&lt;p&gt;The primary method uses Extruct to find JSON-LD data embedded in the page. JSON-LD is a structured data format that many modern sites include in their HTML specifically to make their content machine-readable — it's used for search engine optimisation and data portability. When it's present, Extruct can read it cleanly and reliably, with no need to write or maintain selectors.&lt;/p&gt;

&lt;p&gt;If no usable JSON-LD is found, the skill falls back to Parsel, which uses CSS selectors to locate data heuristically across the page. This is more flexible but inherently tied to the page's visual structure, which can change.&lt;/p&gt;
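&lt;p&gt;The control flow is a straightforward try-then-fall-back. Here's a stdlib-only approximation; the real script uses Extruct and Parsel, so treat this as a sketch of the logic, not the implementation:&lt;/p&gt;

```python
import json
import re

def parse_product(html: str) -> dict:
    # Primary path: JSON-LD embedded in a <script type="application/ld+json"> tag
    m = re.search(
        r'<script type="application/ld\+json">\s*(\{.*?\})\s*</script>',
        html, re.S,
    )
    if m:
        try:
            return {"method": "extruct", "data": json.loads(m.group(1))}
        except json.JSONDecodeError:
            pass
    # Fallback path: heuristic selectors (Parsel in the real script)
    title = re.search(r"<title>(.*?)</title>", html, re.S)
    return {
        "method": "parsel",
        "data": {"name": title.group(1).strip() if title else None},
    }
```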
&lt;h2&gt;
  
  
  When to use it
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## When to use&lt;/span&gt;
Use this skill when you have raw HTML and need to extract structured data from
it — product details, prices, specs, ratings, or any page content.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In practice, that means the parser skill is almost always the second step in a pipeline — running immediately after the fetcher skill has retrieved your HTML. It works with any page type, and handles the most common product fields out of the box.&lt;/p&gt;
&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Save the HTML to a temporary file &lt;span class="sb"&gt;`page.html`&lt;/span&gt;
&lt;span class="p"&gt;2.&lt;/span&gt; Run &lt;span class="sb"&gt;`parser.py`&lt;/span&gt; against it:
   python parser.py page.html
&lt;span class="p"&gt;
3.&lt;/span&gt; The script outputs a JSON object. Check the &lt;span class="sb"&gt;`method`&lt;/span&gt; field:
&lt;span class="p"&gt;   -&lt;/span&gt; "extruct" — clean structured data was found, use it directly
&lt;span class="p"&gt;   -&lt;/span&gt; "parsel" — fell back to CSS selectors, review fields for completeness
&lt;span class="p"&gt;4.&lt;/span&gt; If key fields are missing from the Parsel output, ask the user which fields
   they need and re-run with --fields:
   python parser.py page.html --fields "price,rating,brand"
&lt;span class="p"&gt;
5.&lt;/span&gt; Return the parsed JSON to the conversation for use in the Compare skill.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;method&lt;/code&gt; field in the output is particularly useful. It tells you immediately how the data was extracted and how much trust to place in it. An &lt;code&gt;"extruct"&lt;/code&gt; result is clean and stable. A &lt;code&gt;"parsel"&lt;/code&gt; result is worth reviewing, especially if you're working with an unusual page layout.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--fields&lt;/code&gt; flag is a practical escape hatch. Rather than requiring you to dig into the script when key data is missing, it lets you specify exactly what you need and re-run — a much more efficient loop.&lt;/p&gt;
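&lt;p&gt;A minimal sketch of how that CLI shape might be wired; the actual &lt;code&gt;parser.py&lt;/code&gt; may structure it differently:&lt;/p&gt;

```python
import argparse

def build_cli() -> argparse.ArgumentParser:
    # Hypothetical mirror of the parser.py command-line interface
    cli = argparse.ArgumentParser(description="Extract structured data from an HTML file")
    cli.add_argument("html_file", help="path to the saved page, e.g. page.html")
    cli.add_argument("--fields", help='comma-separated fields, e.g. "price,rating,brand"')
    return cli

args = build_cli().parse_args(["page.html", "--fields", "price,rating,brand"])
wanted = args.fields.split(",") if args.fields else None
```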
&lt;h2&gt;
  
  
  Why prefer Extruct?
&lt;/h2&gt;

&lt;p&gt;The notes section of the skill file makes this explicit:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Notes&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Always prefer the Extruct path — it is more stable and requires no maintenance
&lt;span class="p"&gt;-&lt;/span&gt; Parsel selectors are generated heuristically and may need adjustment for
  unusual page layouts
&lt;span class="p"&gt;-&lt;/span&gt; Run once per page; pass all outputs together into the Compare skill
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Parsel selectors break when sites redesign. JSON-LD, by contrast, is structured data the site publishes independently of its visual layout. A site can completely overhaul its design and its JSON-LD will often remain untouched. That stability is worth prioritising wherever possible.&lt;/p&gt;
&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;Once you've run the parser skill across all your target pages, you have a set of structured JSON objects ready to compare. That's where the compare skill picks up — generating tables, summaries, and side-by-side analysis from the extracted data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Do you need a skill?
&lt;/h2&gt;

&lt;p&gt;The parser skill works well when the data you need maps cleanly onto fields that Extruct or Parsel can find — product names, prices, ratings, and similar structured attributes that sites commonly expose through JSON-LD or consistent HTML patterns. For that category of task, the skill is fast to apply and requires no custom code.&lt;/p&gt;

&lt;p&gt;See our post about Skills vs MCP vs Web Scraping Copilot (our VS Code Extension):&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;zyte.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;But not every extraction problem fits that mould. If you're working with pages that don't include JSON-LD and have highly irregular layouts, Parsel's heuristic selectors may return incomplete or inconsistent results, and you'll spend time debugging field by field. In those cases, a purpose-built extraction script using Parsel or BeautifulSoup directly — with selectors you've written and tested against the specific target — will be more reliable.&lt;/p&gt;

&lt;p&gt;For larger-scale or more complex extraction work, Zyte API's automatic extraction capabilities go further still. Rather than relying on selectors at all, automatic extraction uses AI to identify and return structured data from a page without requiring you to specify fields or maintain selector logic. If you're extracting data from many different site structures, or you need extraction to keep working through site redesigns without manual intervention, that's a more robust foundation than a skill-based approach. The parser skill is best understood as a practical middle ground: fast to use, good enough for a wide range of common cases, and easy to slot into a pipeline — but not a replacement for extraction tooling built for scale or resilience.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>skills</category>
      <category>webscraping</category>
      <category>zyte</category>
    </item>
    <item>
      <title>I gave Claude access to a web scraping API</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 11 Mar 2026 07:10:43 +0000</pubDate>
      <link>https://forem.com/extractdata/the-fetcher-skill-reliable-html-scraping-with-automatic-fallback-5f02</link>
      <guid>https://forem.com/extractdata/the-fetcher-skill-reliable-html-scraping-with-automatic-fallback-5f02</guid>
      <description>&lt;p&gt;If you've worked with Claude for any length of time, you've probably noticed it can do a lot more than answer questions. With the right setup, it can take actions — running scripts, processing files, working through multi-step workflows autonomously. Skills are what make that possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a skill?
&lt;/h2&gt;

&lt;p&gt;A skill is a small, self-contained instruction set that tells Claude how to use a specific tool or script to accomplish a well-defined task. Technically, it's a markdown file — a &lt;code&gt;SKILL.md&lt;/code&gt; — that describes what a tool does, when to reach for it, and exactly how to run it. Claude reads that file and follows the instructions as part of a larger workflow.&lt;/p&gt;

&lt;p&gt;Skills are designed to be composable. Each one does one thing well, and they're built to hand off to each other. The fetcher skill retrieves HTML. The parser skill extracts data from it. The compare skill turns multiple parsed outputs into a structured comparison. Together, they form a complete scraping pipeline — and Claude orchestrates the whole thing.&lt;/p&gt;

&lt;p&gt;See our skills here: &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;https://github.com/zytelabs/claude-webscraping-skills&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The skill format looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetcher&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetches&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;HTML&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;httpx,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;automatic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Zyte&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;if&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;blocked."&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That front matter is what Claude uses to match the right skill to the right task. The description is deliberately precise: it tells Claude not just what the skill does, but how it does it, so Claude can reason about whether it's the right tool for the job.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/HH0Q9OfKLu0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  What the fetcher skill does
&lt;/h2&gt;

&lt;p&gt;The fetcher skill's job is exactly what it sounds like: given a URL, fetch the raw HTML and return it. It uses httpx as its primary HTTP client — a modern, performant Python library well suited to scraping workloads.&lt;/p&gt;

&lt;p&gt;What makes it more than a simple wrapper is the fallback logic. A significant number of sites actively block automated requests. Without a fallback, a blocked request just fails, and you're left manually diagnosing why. The fetcher skill handles this automatically. If a request comes back with a &lt;code&gt;BLOCKED&lt;/code&gt; status, it retries via Zyte API, which provides built-in unblocking. Most of the time, you get your HTML without ever needing to intervene.&lt;/p&gt;
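&lt;p&gt;The fallback amounts to a small decision tree. Here is a minimal sketch of that flow; the callables and statuses mirror the skill's behaviour, but this is not the actual &lt;code&gt;fetcher.py&lt;/code&gt;:&lt;/p&gt;

```python
# Minimal sketch of the fetcher's fallback flow (not the actual fetcher.py).
# fetch_httpx / fetch_zyte are stand-in callables returning (status, html).
def fetch_with_fallback(url, fetch_httpx, fetch_zyte):
    status, html = fetch_httpx(url)
    if status == "BLOCKED":
        # Retry through Zyte API, the equivalent of re-running with --zyte
        status, html = fetch_zyte(url)
    if status != "OK":
        return None  # surface the failure rather than silently dropping the URL
    return html
```

&lt;p&gt;The cheap client always goes first; the heavier, unblocking-capable path is only paid for when it is actually needed.&lt;/p&gt;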

&lt;h2&gt;
  
  
  When to use it
&lt;/h2&gt;

&lt;p&gt;The skill's &lt;code&gt;SKILL.md&lt;/code&gt; is explicit about this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## When to use&lt;/span&gt;
Use this skill when the user provides one or more URLs and asks you to fetch,
retrieve, scrape, or get the HTML or page content.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, that means any time you're starting a scraping or data extraction task and you have a URL to work from. It's the entry point for the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The instructions in the skill file are straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Run &lt;span class="sb"&gt;`fetcher.py`&lt;/span&gt; with the URL as an argument:
   python fetcher.py &lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;
2.&lt;/span&gt; If the script returns a successful HTML response, return the HTML to the
   conversation for use in the next step.
&lt;span class="p"&gt;3.&lt;/span&gt; If the script returns a &lt;span class="sb"&gt;`BLOCKED`&lt;/span&gt; status, re-run with the &lt;span class="sb"&gt;`--zyte`&lt;/span&gt; flag:
   python fetcher.py &lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt; --zyte
&lt;span class="p"&gt;
4.&lt;/span&gt; Inform the user if a URL could not be fetched after both attempts.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two-step process keeps things efficient. httpx is fast and lightweight, so it handles the majority of requests without needing to route through Zyte API. The fallback only kicks in when it's needed. If both attempts fail, Claude surfaces that to you clearly rather than silently moving on.&lt;/p&gt;

&lt;p&gt;For multiple URLs, the script runs once per URL — there's no batching — so Claude loops through a list sequentially.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transparency about failure
&lt;/h2&gt;

&lt;p&gt;One detail worth highlighting is the final instruction: inform the user if a URL could not be fetched after both attempts. This might seem obvious, but it reflects a design principle worth being explicit about. A skill that silently drops failed URLs would produce incomplete data downstream, and you might not notice until you're looking at a comparison table with missing rows. Surfacing failures immediately keeps the pipeline honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;The fetcher skill's output is raw HTML — exactly what the parser skill expects as its input. The two are designed to be used in sequence. Once you have the HTML, the parser skill takes over, extracting structured data through JSON-LD or CSS selectors depending on what the page contains.&lt;/p&gt;

&lt;p&gt;That handoff is documented in the skill's notes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Notes&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; For multiple URLs, run the script once per URL
&lt;span class="p"&gt;-&lt;/span&gt; Pass the raw HTML output into the Parser skill for extraction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline continues from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do you need a skill?
&lt;/h2&gt;

&lt;p&gt;Skills are a good fit when you have a well-defined, repeatable task that benefits from consistent behaviour across many runs. Fetching HTML from a URL is a clear example: the inputs and outputs are predictable, the fallback logic is always the same, and packaging that into a skill means Claude applies it reliably without you having to re-explain the process each time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="noopener noreferrer"&gt;Read our break down of Skills vs MCP vs Web Scraping Copilot here - our VS Code extension&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That said, skills aren't always the right tool. If you only need to fetch a handful of pages once, asking Claude to write a quick httpx script directly may be faster and more flexible. Similarly, if your target sites have unusual behaviour — rate limiting, JavaScript rendering, login walls, or multi-step navigation — a bespoke Scrapy spider built with Zyte API gives you far more control than a general-purpose fetch wrapper. Scrapy's middleware architecture, item pipelines, and scheduling make it better suited to large-scale or complex crawls where you need precise control over every aspect of the request cycle.&lt;/p&gt;

&lt;p&gt;The fetcher skill sits in the middle: more structured than an ad hoc script, less complex than a full Scrapy project. It's the right choice when you want Claude to handle straightforward retrieval as part of a larger automated workflow, without the overhead of setting up and maintaining a dedicated spider.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>skills</category>
      <category>webscraping</category>
      <category>zyte</category>
    </item>
    <item>
      <title>Why your Python request gets 403 Forbidden</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 04 Mar 2026 09:11:29 +0000</pubDate>
      <link>https://forem.com/extractdata/why-your-python-request-gets-403-forbidden-4h3p</link>
      <guid>https://forem.com/extractdata/why-your-python-request-gets-403-forbidden-4h3p</guid>
      <description>&lt;p&gt;If you’ve had your HTTP request blocked despite using correct headers, cookies, and clean IPs, there’s a chance you are running into one of the simplest forms of blocking, and one of the most confusing for beginners.&lt;/p&gt;

&lt;p&gt;Chances are, you will recognise the problem. You found the hidden API, and your request works perfectly in Postman... but it fails instantly within your Python code.&lt;/p&gt;

&lt;p&gt;It’s called TLS fingerprinting. But the good news is, you can solve it. In fact, when I showed this to some developers at &lt;a href="https://www.extractsummit.io/" rel="noopener noreferrer"&gt;Extract Summit&lt;/a&gt;, they couldn’t believe how straightforward it was to fix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gl264gftdhxminz531h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gl264gftdhxminz531h.png" alt="bruno request" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CAPTION: “I copied the request -&amp;gt; matching headers, cookies and IP, but it still failed?”&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Your TLS fingerprint
&lt;/h2&gt;

&lt;p&gt;Let’s start with a question. How do the servers and websites know you’ve moved from Postman to making the request in Python? What do they see that you can’t? The key is your TLS fingerprint.&lt;/p&gt;

&lt;p&gt;To use an analogy: We’ve effectively written a different name on a sticker and stuck it to our t-shirt, hoping to get past the bouncer at a bar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your nametag (headers)&lt;/strong&gt; says "Chrome."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But your t-shirt logo (TLS handshake)&lt;/strong&gt; very obviously says "Python."&lt;/p&gt;

&lt;p&gt;It’s a dead giveaway. This mismatch is spotted immediately. We need to change our t-shirt to match the nametag.&lt;/p&gt;

&lt;p&gt;To understand &lt;em&gt;how&lt;/em&gt; they spot the “logo”, we need to look at the initial &lt;strong&gt;“Client Hello”&lt;/strong&gt; packet. There are three key pieces of information exchanged here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cipher suites:&lt;/strong&gt; The encryption methods the client supports.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLS extensions:&lt;/strong&gt; Extra features (like specific elliptic curves).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key exchange algorithms:&lt;/strong&gt; How the client and server agree on a shared secret.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These details differ between clients because Python’s &lt;code&gt;requests&lt;/code&gt; library uses &lt;strong&gt;OpenSSL&lt;/strong&gt;, while Chrome uses Google's &lt;strong&gt;BoringSSL&lt;/strong&gt;. While the two share some underlying logic, their signatures are notably different. And that’s the problem.&lt;/p&gt;
&lt;h3&gt;
  
  
  OpenSSL vs. BoringSSL
&lt;/h3&gt;

&lt;p&gt;The root cause of this mismatch lies in the underlying libraries.&lt;/p&gt;

&lt;p&gt;Python’s &lt;code&gt;requests&lt;/code&gt; library relies on &lt;strong&gt;OpenSSL&lt;/strong&gt;, the standard cryptographic library found on almost every Linux server. It is robust, predictable, and remarkably consistent.&lt;/p&gt;

&lt;p&gt;Chrome, however, uses &lt;strong&gt;BoringSSL&lt;/strong&gt;, Google’s own fork of OpenSSL. BoringSSL is designed specifically for the chaotic nature of the web and it behaves very differently.&lt;/p&gt;

&lt;p&gt;The biggest giveaway between the two is a mechanism called &lt;strong&gt;GREASE&lt;/strong&gt; (Generate Random Extensions And Sustain Extensibility).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="s2"&gt;"TLS_GREASE (0xFAFA)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="err"&gt;....&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chrome (BoringSSL) intentionally inserts random, garbage values into the TLS handshake - specifically, in the cipher suites and extensions lists. It does this to "grease the joints" of the internet, ensuring that servers don't crash when they encounter unknown future parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is one of the key changes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chrome:&lt;/strong&gt; Always includes these random GREASE values (e.g., &lt;code&gt;0x0a0a&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python (OpenSSL):&lt;/strong&gt; &lt;em&gt;Never&lt;/em&gt; includes them. It only sends valid, known ciphers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, when an anti-bot system sees a handshake claiming to be "Chrome 120" but lacking these random GREASE values, it knows instantly that it is dealing with a script. It’s not just that your shirt has the wrong logo; it’s that your shirt is &lt;em&gt;too clean&lt;/em&gt;.&lt;/p&gt;
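&lt;p&gt;The check is easy to approximate. RFC 8701 reserves sixteen two-byte GREASE values of the form 0xNaNa (0x0a0a, 0x1a1a, up to 0xfafa). A toy detector in the spirit of what an anti-bot system does:&lt;/p&gt;

```python
# GREASE values (RFC 8701): 0x0a0a, 0x1a1a, ..., 0xfafa
GREASE_VALUES = {(v << 8) | v for v in range(0x0A, 0xFB, 0x10)}

def looks_greased(cipher_suites):
    """True if the offered cipher list contains a GREASE value, as Chrome's always does."""
    return any(suite in GREASE_VALUES for suite in cipher_suites)
```

&lt;p&gt;A handshake whose headers claim Chrome but whose cipher list fails this check is a strong bot signal.&lt;/p&gt;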

&lt;h2&gt;
  
  
  JA3 hash
&lt;/h2&gt;

&lt;p&gt;Anti-bot companies take all that handshake data and combine it into a single string called a &lt;strong&gt;JA3 fingerprint&lt;/strong&gt;.&lt;/p&gt;
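&lt;p&gt;The recipe itself is public: take five Client Hello fields (TLS version, ciphers, extensions, elliptic curves, point formats), dash-join the decimal values within each field, comma-join the fields, then MD5 the string. A minimal sketch of that recipe, following the original JA3 specification:&lt;/p&gt;

```python
import hashlib

# Sketch of the JA3 recipe: five Client Hello fields, dashes within a
# field, commas between fields, MD5 of the resulting string.
def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    parts = [str(tls_version)] + [
        "-".join(str(v) for v in field)
        for field in (ciphers, extensions, curves, point_formats)
    ]
    ja3_string = ",".join(parts)
    return hashlib.md5(ja3_string.encode()).hexdigest()
```

&lt;p&gt;The &lt;code&gt;ja3&lt;/code&gt; and &lt;code&gt;ja3_hash&lt;/code&gt; values reported by fingerprinting endpoints are exactly this comma-joined string and its MD5.&lt;/p&gt;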

&lt;p&gt;Salesforce invented this years ago to detect malware, but it found its way into our industry as a simple, effective way to fingerprint TLS clients. Security vendors have built databases of these fingerprints.&lt;/p&gt;

&lt;p&gt;It is relatively straightforward to identify and block &lt;em&gt;any&lt;/em&gt; request coming from Python’s default library because its JA3 hash is static and well-known.&lt;/p&gt;

&lt;p&gt;Running this code snippet yields the JSON response below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ja3_info&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tls.peet.ws/api/clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note the empty akamai_hash: requests speaks HTTP/1.1, so there is no HTTP/2 fingerprint to record&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"771,4866-4867-4865-49196-49200-49195-49199-52393-52392-49188-49192-49187-49191-159-158-107-103-255,0-11-10-16-22-2
3-49-13-43-45-51-21,29-23-30-25-24-256-257-258-259-260,0-1-2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja3_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a48c0d5f95b1ef98f560f324fd275da1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja4"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t13d1812h1_85036bcba153_375ca2c5e164"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja4_r"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"t13d1812h1_0067,006b,009e,009f,00ff,1301,1302,1303,c023,c024,c027,c028,c02b,c02c,c02f,c030,cca8,cca9_000a,000b,000
d,0016,0017,002b,002d,0031,0033_0403,0503,0603,0807,0808,0809,080a,080b,0804,0805,0806,0401,0501,0601,0303,0301,030
2,0402,0502,0602"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"akamai"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"akamai_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peetprint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"772-771|1.1|29-23-30-25-24-256-257-258-259-260|1027-1283-1539-2055-2056-2057-2058-2059-2052-2053-2054-1025-1281-15
37-771-769-770-1026-1282-1538|1||4866-4867-4865-49196-49200-49195-49199-52393-52392-49188-49192-49187-49191-159-158
-107-103-255|0-10-11-13-16-21-22-23-43-45-49-51"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peetprint_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"76017c4a71b7a055fb2a9a5f70f05112"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Putting the above JA3 hash into ja3.zone clearly shows this is a python3 request, using urllib3:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmrqcpccqc08rcm5wz34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmrqcpccqc08rcm5wz34.png" alt="JA3 Zone" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s the solution?
&lt;/h2&gt;

&lt;p&gt;As mentioned, simply changing headers and IP addresses won’t make a difference, as these are not part of the TLS handshake. We need to change the ciphers and extensions to look more like what a browser would send.&lt;/p&gt;

&lt;p&gt;The best way to achieve this in Python is to swap &lt;code&gt;requests&lt;/code&gt; for a modern, TLS-friendly library like &lt;strong&gt;curl_cffi&lt;/strong&gt; or &lt;strong&gt;rnet&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is how easy it is to switch to &lt;strong&gt;curl_cffi&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;curl_cffi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="c1"&gt;# note the impersonate argument &amp;amp; import above
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ja3_info&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tls.peet.ws/api/clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chrome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"akamai_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"52d84b11737d980aef856699f885ca86"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytpcxmljsdlf7nq02rq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytpcxmljsdlf7nq02rq3.png" alt="Our new hash" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CAPTION: Note - I searched via the akamai_hash here as the fingerprint from the JA3 hash wasn’t in this particular database.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;By adding that &lt;code&gt;impersonate&lt;/code&gt; parameter, you are effectively putting on the correct t-shirt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Make &lt;code&gt;curl_cffi&lt;/code&gt; or &lt;code&gt;rnet&lt;/code&gt; your default HTTP library in Python. This should be your first port of call before spinning up a full headless browser.&lt;/p&gt;

&lt;p&gt;A simple change (which brings benefits like async capabilities) means you don’t fall foul of TLS fingerprinting. &lt;code&gt;curl_cffi&lt;/code&gt; even has a &lt;code&gt;requests&lt;/code&gt;-like API, meaning it's often a drop-in replacement.&lt;/p&gt;

</description>
      <category>api</category>
      <category>networking</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Hybrid scraping: The architecture for the modern web</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 25 Feb 2026 17:05:36 +0000</pubDate>
      <link>https://forem.com/extractdata/hybrid-scraping-the-architecture-for-the-modern-web-4p9h</link>
      <guid>https://forem.com/extractdata/hybrid-scraping-the-architecture-for-the-modern-web-4p9h</guid>
      <description>&lt;p&gt;If you scrape the modern web, you probably know the pain of the &lt;strong&gt;JavaScript challenge&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before you can access any data, the website forces your browser to execute a snippet of JavaScript code. It calculates a result, sends it back to an endpoint for verification, and often captures extensive fingerprinting data in the process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl32q8pbypgq6l22kner6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl32q8pbypgq6l22kner6.png" alt="browser checks" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you pass this test, the server assigns you a &lt;strong&gt;session cookie&lt;/strong&gt;. This cookie acts as your "access pass." It tells the website, &lt;em&gt;"This user has passed the challenge,"&lt;/em&gt; so you don’t have to re-run the JavaScript test on every single page load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ot3jm7249g98cuwbno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ot3jm7249g98cuwbno.png" alt="devtool shows storage token" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For web scrapers, this mechanism creates a massive inefficiency.&lt;/p&gt;

&lt;p&gt;It &lt;em&gt;looks&lt;/em&gt; like you are forced to use a headless browser (like Puppeteer or Playwright) for every single request just to handle that initial check. But browsers are heavy: they are slow, and they consume massive amounts of RAM and bandwidth.&lt;/p&gt;

&lt;p&gt;Running a browser for thousands of requests can quickly become an infrastructure nightmare. You end up paying for CPU cycles just to render a page when all you wanted was the JSON payload.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution: Hybrid scraping
&lt;/h3&gt;

&lt;p&gt;The answer to this problem is a technique I’ve started calling &lt;strong&gt;hybrid scraping&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This involves using the browser &lt;em&gt;only&lt;/em&gt; to open the initial request, grab the cookie, and create a session. Once you have them, you extract that session data and hand it over to a standard, lightweight HTTP client.&lt;/p&gt;

&lt;p&gt;This architecture gives you the &lt;strong&gt;access&lt;/strong&gt; of a browser with the &lt;strong&gt;speed and efficiency&lt;/strong&gt; of a script.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing this in Python
&lt;/h2&gt;

&lt;p&gt;To build this in Python, we need two specific packages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A browser:&lt;/strong&gt; We will use &lt;strong&gt;ZenDriver&lt;/strong&gt;, a modern wrapper for headless Chrome that handles the "undetected" configuration for us.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP client:&lt;/strong&gt; We will use &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;&lt;strong&gt;rnet&lt;/strong&gt;&lt;/a&gt;, a Rust-based HTTP client for Python.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;But why &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;rnet&lt;/a&gt;?&lt;/em&gt; Well, within the initial TLS handshake where the client/server “hello” is sent, the information traded here can be fingerprinted, taking in things like the TLS version and the ciphers available for encryption. This can be hashed into a fingerprint and profiled.&lt;/p&gt;

&lt;p&gt;Python’s &lt;a href="https://github.com/psf/requests" rel="noopener noreferrer"&gt;requests&lt;/a&gt; package, which uses &lt;a href="https://docs.python.org/3/library/urllib.html" rel="noopener noreferrer"&gt;urllib&lt;/a&gt; from the standard library, has a very distinctive TLS fingerprint, containing ciphers (amongst other things) that aren’t seen in a browser. This makes it very easy to spot. Both &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;rnet&lt;/a&gt;, and other options such as &lt;a href="https://github.com/lexiforest/curl_cffi" rel="noopener noreferrer"&gt;curl-cffi&lt;/a&gt;, are able to send a TLS fingerprint similar to that of a browser. This reduces the chances of our request being blocked.&lt;/p&gt;

&lt;p&gt;Here is how we assemble the pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Load the page (The handshake)
&lt;/h3&gt;

&lt;p&gt;First, we define our browser logic. Notice that we are not trying to parse HTML here. Our only goal is to visit the site, pass the initial JavaScript challenge, and extract the session cookies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zendriver&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Use ZenDriver to launch a browser, navigate to the page, 
    and retrieve the cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Hit the homepage to trigger the check
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait briefly for the JS challenge to complete
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

    &lt;span class="c1"&gt;# Extract the cookies
&lt;/span&gt;    &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What’s happening here:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We launch the browser, visit the site, and wait just one second for the JS challenge to run (on a slow connection you may need a longer wait, or an explicit readiness check). Once we have the cookies, we call &lt;code&gt;browser.stop()&lt;/code&gt;. This is the most important line: we do not want a browser instance wasting resources when we don’t need it.&lt;/p&gt;
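&lt;p&gt;As a defensive sketch of the same launch/stop sequence (not part of the original script): wrapping the navigation in &lt;code&gt;try/finally&lt;/code&gt; guarantees the browser is stopped even if the page load or challenge wait raises. The &lt;code&gt;start&lt;/code&gt; callable is injected here purely for illustration; in the real script it would be zendriver's &lt;code&gt;zd.start&lt;/code&gt;.&lt;/p&gt;

```python
import asyncio

async def get_cookies_safe(start, url, wait=1.0):
    """Launch a browser via `start`, pass the JS challenge, return cookies.

    The try/finally ensures browser.stop() always runs, so no orphaned
    headless Chrome process is left consuming RAM after an error.
    """
    browser = await start()
    try:
        await browser.get(url)
        await asyncio.sleep(wait)  # let the JS challenge complete
        return await browser.cookies.get_all()
    finally:
        await browser.stop()
```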

&lt;h3&gt;
  
  
  Step 2: Use the cookies
&lt;/h3&gt;

&lt;p&gt;Now that we have the "access pass," we can switch to our lightweight HTTP client. We take those cookies and inject them into the rnet client headers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rnet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Emulation&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Make a fast request using RNet with the borrowed cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;referer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Format the browser cookies into a simple HTTP header string
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookie_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cookie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# We use Emulation.Chrome142 to change the TLS Fingerprint.
&lt;/span&gt;    &lt;span class="c1"&gt;# This is site dependent - but worth using
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emulation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Emulation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chrome142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/api/products?page=1&amp;amp;limit=8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What’s happening here:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;We convert the browser's cookie format into a standard header string. Note the &lt;code&gt;Emulation.Chrome142&lt;/code&gt; parameter. We are layering two techniques here: hybrid scraping (using real browser cookies) and TLS fingerprinting (using an HTTP client that mimics a modern Chrome handshake). This double-layer approach covers both the JavaScript challenge and the network-level fingerprint check.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: Many HTTP clients have a cookie jar that you could also use; for this example, sending the header directly worked perfectly).&lt;/em&gt;&lt;/p&gt;
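&lt;p&gt;The cookie-formatting glue code can be pulled out into a small standalone helper. This sketch accepts both attribute-style cookie objects (as zendriver returns) and plain dicts, which makes it reusable if you swap browser libraries; the cookie names below are made up for the example.&lt;/p&gt;

```python
from types import SimpleNamespace

def format_cookie_header(cookies):
    """Join browser cookies into a single 'name=value; name2=value2' header string."""
    parts = []
    for c in cookies:
        # Support both dict-style and attribute-style cookie objects
        name = c["name"] if isinstance(c, dict) else c.name
        value = c["value"] if isinstance(c, dict) else c.value
        parts.append(f"{name}={value}")
    return "; ".join(parts)

cookies = [SimpleNamespace(name="cf_clearance", value="abc123"),
           {"name": "session", "value": "xyz"}]
print(format_cookie_header(cookies))  # cf_clearance=abc123; session=xyz
```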

&lt;h3&gt;
  
  
  Step 3: Run the code
&lt;/h3&gt;

&lt;p&gt;Finally, we tie it together. For this demo, we use a simple argparse flag to show the difference with and without the browser cookies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="c1"&gt;# The Decision Logic
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Run the heavy browser
&lt;/span&gt;
    &lt;span class="c1"&gt;# Always run the fast HTTP client
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status Code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Body:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Get the complete script
&lt;/h2&gt;

&lt;p&gt;Want to run this yourself? We’ve put the full, copy-pasteable script (including the argument parser and imports) in the block below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv init
uv add zendriver rnet rich
&lt;span class="c"&gt;# linux/mac&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="c"&gt;# windows&lt;/span&gt;
.venv&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zendriver&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rnet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Emulation&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rich&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="k"&gt;print&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Make an HTTP GET request using rnet with the provided cookies. Cookies are sent in the headers. Note for this site we need the referer too.
    Return the Response Object.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;referer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookie_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Adjust based on the actual structure of the cookie object from zendriver
&lt;/span&gt;            &lt;span class="c1"&gt;# If it's a dict: cookie['name'], cookie['value']
&lt;/span&gt;            &lt;span class="c1"&gt;# If it's an object: cookie.name, cookie.value
&lt;/span&gt;            &lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cookie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emulation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Emulation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chrome142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/api/products?page=1&amp;amp;limit=8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Use zendriver to launch a browser, navigate to a page, and retrieve cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status Code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Body:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Make HTTP request with optional browser cookies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cookies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Set to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to launch browser and get cookies, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to skip (default: false)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pros and Cons of Hybrid Scraping
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reduces RAM usage massively compared to pure browser scraping.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Higher complexity:&lt;/strong&gt; You must manage two libraries (zendriver and rnet) and the glue code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP requests complete in milliseconds. Browsers take seconds.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;State management:&lt;/strong&gt; You need logic to handle cookie expiry. If the cookie dies, you must "wake up" the browser.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You get the verification of a real browser without the drag.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; You are debugging two points of failure: the browser's ability to solve the challenge, and the client's ability to fetch data.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
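&lt;p&gt;The "wake up the browser" state management from the table can be sketched as a small retry wrapper. The &lt;code&gt;fetch&lt;/code&gt; and &lt;code&gt;refresh_cookies&lt;/code&gt; callables are injected for illustration; in the real script they would be &lt;code&gt;http_request_rnet&lt;/code&gt; and &lt;code&gt;get_cookies&lt;/code&gt;, and the response here is modelled as a plain dict rather than an rnet response object.&lt;/p&gt;

```python
import asyncio

async def fetch_with_refresh(fetch, refresh_cookies, cookies=None, max_refresh=1):
    """Try the fast HTTP path; on a block, refresh cookies via the browser and retry.

    The heavy browser (refresh_cookies) only runs when the cheap HTTP
    request (fetch) is actually blocked, never preemptively.
    """
    for attempt in range(max_refresh + 1):
        resp = await fetch(cookies)
        if resp["status"] == 200:
            return resp
        if attempt == max_refresh:
            break  # out of refresh budget; return the blocked response
        cookies = await refresh_cookies()  # expensive browser path
    return resp
```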

&lt;h3&gt;
  
  
  &lt;strong&gt;Final thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For smaller jobs, it might be easier to just use the browser; the benefits won’t necessarily outweigh the extra complexity required.&lt;/p&gt;

&lt;p&gt;But for production pipelines, &lt;strong&gt;this approach is the standard.&lt;/strong&gt; It treats the browser as a luxury resource: used only when strictly necessary to unlock the door, so the HTTP client can do the real work. It’s this session and state management that allows you to scrape harder-to-access sites effectively and efficiently.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If building this orchestration layer yourself feels like too much overhead, this is exactly what the &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;&lt;strong&gt;Zyte API&lt;/strong&gt;&lt;/a&gt; handles internally. We manage the browser/HTTP switching logic automatically, so you just make a single request and get the data.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Stop Scraping HTML - There's a better way.</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 19:16:43 +0000</pubDate>
      <link>https://forem.com/zyte/stop-scraping-html-theres-a-better-way-34nl</link>
      <guid>https://forem.com/zyte/stop-scraping-html-theres-a-better-way-34nl</guid>
      <description>&lt;p&gt;&lt;strong&gt;The "API-First" Reverse Engineering Method&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most common mistakes I see developers make is firing up their code editor too early. They open VS Code, &lt;code&gt;pip install requests beautifulsoup4&lt;/code&gt;, and immediately start trying to parse &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;

&lt;p&gt;If you are scraping a modern e-commerce site or Single Page Application (SPA), this is the wrong approach. It’s brittle, it’s slow, and it breaks the moment the site updates its CSS.&lt;/p&gt;

&lt;p&gt;The secret to scalable scraping isn't better parsing; it's finding the API that the website uses to populate itself. Here is the exact workflow I use to turn a complex parsing job into a clean, reliable JSON pipeline.&lt;/p&gt;




&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/7nHqyTbK5K0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: The Discovery (XHR Filtering)
&lt;/h2&gt;

&lt;p&gt;Modern websites are rarely static. They typically use a "Frontend/Backend" architecture where the browser loads a skeleton page and then fetches the actual data via a background API call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your goal is to use that call.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Open Developer Tools:&lt;/strong&gt; Right-click and inspect the page, then navigate to the &lt;strong&gt;Network&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Filter the Noise:&lt;/strong&gt; Click the &lt;strong&gt;Fetch/XHR&lt;/strong&gt; filter. We don't care about CSS, images, or fonts. We only care about data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Trigger the Request:&lt;/strong&gt; Refresh the page. Watch the waterfall.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa36wcl032q8u5iy02pmk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa36wcl032q8u5iy02pmk.png" alt="Find the request"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If nothing of note appears here, try different pages: trigger pagination, click buttons that load content, and watch what shows up in the request list.&lt;/p&gt;

&lt;p&gt;You are looking for requests that return JSON. They are often named intuitively, like &lt;code&gt;graphql&lt;/code&gt;, &lt;code&gt;search&lt;/code&gt;, &lt;code&gt;products&lt;/code&gt;, or &lt;code&gt;api&lt;/code&gt;. When you click "Preview" on these requests, you won't see HTML; you will see a structured object containing every piece of data you need—prices, descriptions, SKU numbers—already parsed and clean.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; Once you find a candidate URL, test it immediately in the browser console or URL bar. Try changing query parameters like &lt;code&gt;page=1&lt;/code&gt; to &lt;code&gt;page=2&lt;/code&gt;. If the JSON response changes to show the next page of products, you have found your "Golden Endpoint."&lt;/p&gt;
&lt;/blockquote&gt;
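&lt;p&gt;Once you are scripting that &lt;code&gt;page=1&lt;/code&gt; to &lt;code&gt;page=2&lt;/code&gt; check, it is just query-string editing. A minimal helper using the standard library (the URL below is hypothetical):&lt;/p&gt;

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def with_param(url, key, value):
    """Return `url` with the query parameter `key` set to `value`."""
    parts = urlsplit(url)
    qs = parse_qs(parts.query)
    qs[key] = [str(value)]
    return urlunsplit(parts._replace(query=urlencode(qs, doseq=True)))

url = "https://example.com/api/products?page=1"
print(with_param(url, "page", 2))  # https://example.com/api/products?page=2
```

&lt;p&gt;Loop this over page numbers and diff the JSON responses: if the data changes, you have found your "Golden Endpoint."&lt;/p&gt;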




&lt;h2&gt;
  
  
  Phase 2: The "Clean Room" Isolation
&lt;/h2&gt;

&lt;p&gt;Finding the endpoint is only step one. Now you need to determine the minimum viable request required to access it programmatically.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Copy as cURL:&lt;/strong&gt; Right-click the request in Chrome DevTools and select &lt;em&gt;Copy as cURL&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Import to a Client:&lt;/strong&gt; Open an API client like Bruno, Postman, or Insomnia. Import the cURL command.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Baseline Test:&lt;/strong&gt; Hit "Send." It should work perfectly because you are sending everything—every cookie, every header, and the exact session token your browser just generated.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v89ljhkswdp3t3m0qdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v89ljhkswdp3t3m0qdx.png" alt="Add the request to Bruno, Postman, or similar"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Load-Bearing" Header Game
&lt;/h3&gt;

&lt;p&gt;Efficient scrapers don't send 2KB of headers. You need to strip this down. Start unchecking headers one by one and resending the request:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Remove the Cookie header:&lt;/strong&gt; Does it break? (Usually, yes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove the Referer:&lt;/strong&gt; Does it break? (Often, yes—sites check this to ensure the request came from their own frontend).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove the User-Agent:&lt;/strong&gt; Does it break?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check the Parameters:&lt;/strong&gt; Can you change &lt;code&gt;limit=10&lt;/code&gt; to &lt;code&gt;limit=100&lt;/code&gt; to get more data in one shot?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Eventually, you will be left with the "skeleton key": the absolute minimum headers required to get a &lt;code&gt;200 OK&lt;/code&gt;. Usually, this consists of a User-Agent, a Referer, and a specific Auth Token or Session Cookie.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfy16tpr6rwpsgbfvklv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfy16tpr6rwpsgbfvklv.png" alt="Headers in Bruno"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 3: The Infrastructure Trap (The "Bonded" Token)
&lt;/h2&gt;

&lt;p&gt;This is where most developers hit a wall. You take your cleaned-up request, put it into a Python script, and... &lt;code&gt;403 Forbidden&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Why? You have the right URL and the right headers.&lt;/p&gt;

&lt;p&gt;In my analysis of modern scraping targets, I found that API endpoints increasingly perform a &lt;strong&gt;Cryptographic Binding&lt;/strong&gt; check.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The IP Link:&lt;/strong&gt; The Auth Token/Cookie you copied from your browser was generated for that specific IP address. When you run your script (likely on a server, VPN, or different proxy), the site sees a mismatch between the token's origin IP and your current request IP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Expiry Clock:&lt;/strong&gt; These tokens are ephemeral. They are designed to expire; you will need to investigate how long they remain valid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are just looping through a list of URLs with a static token, you will burn out your access almost immediately.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 4: Architecting the Solution
&lt;/h2&gt;

&lt;p&gt;To make this work at scale, you cannot simply write a script. You need to build a &lt;strong&gt;Hybrid Architecture&lt;/strong&gt; that manages state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjfrdjef30hc4d11gpuk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjfrdjef30hc4d11gpuk.png" alt="Architecture Image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You need to engineer a system that takes all of the above into account and monitors the session lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Storage Unit:&lt;/strong&gt; You need a database (like Redis) to store a "Session Object." This object must contain:

&lt;ul&gt;
&lt;li&gt;The Auth Token (Cookie).&lt;/li&gt;
&lt;li&gt;The IP Address used to generate it.&lt;/li&gt;
&lt;li&gt;The Creation Time.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;The Browser Worker:&lt;/strong&gt; You need a headless browser (&lt;code&gt;Nodriver&lt;/code&gt;/&lt;code&gt;Camoufox&lt;/code&gt;) to visit the site, execute the JavaScript, generate the token, and save it to your Storage Unit.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;The HTTP Worker:&lt;/strong&gt; Your actual scraper. It doesn't browse; it pulls the Token + IP combination from storage and hits the API directly.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;The Rotation Logic:&lt;/strong&gt; You need logic that checks the token age.

&lt;ul&gt;
&lt;li&gt;Is the token older than 5 minutes? Stop.&lt;/li&gt;
&lt;li&gt;Spin up the Browser Worker.&lt;/li&gt;
&lt;li&gt;Generate a new Token.&lt;/li&gt;
&lt;li&gt;Update the Storage Unit.&lt;/li&gt;
&lt;li&gt;Resume scraping.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
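&lt;p&gt;The Session Object and the rotation check above can be sketched in a few lines of Python. This is a minimal illustration, not production code: the &lt;code&gt;browser_refresh&lt;/code&gt; worker and the storage wiring are assumed, and the five-minute expiry is a placeholder you must measure per site:&lt;/p&gt;

```python
import time
from dataclasses import dataclass

TOKEN_MAX_AGE = 300  # seconds; placeholder, measure the real expiry per site

@dataclass
class Session:
    """The 'Session Object' kept in the Storage Unit (e.g. a Redis hash)."""
    token: str         # the Auth Token / Cookie
    ip: str            # the proxy IP the token was generated on
    created_at: float  # epoch seconds at generation time

def needs_refresh(session, now=None):
    """Rotation check: has the token outlived its useful window?"""
    now = time.time() if now is None else now
    return (now - session.created_at) > TOKEN_MAX_AGE

# The HTTP worker's loop, with browser_refresh as the hypothetical
# Browser Worker and storage as the hypothetical Storage Unit:
#
# if needs_refresh(session):
#     session = browser_refresh(proxy_ip=session.ip)  # same IP as the token
#     storage.save(session)
# resp = http_get(api_url, token=session.token, proxy=session.ip)
```

Keeping the IP inside the Session Object is what lets the HTTP worker honor the cryptographic binding: the token and the IP always travel together.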




&lt;h2&gt;
  
  
  The Hidden Overhead
&lt;/h2&gt;

&lt;p&gt;Suddenly, your simple scraping job requires a Proxy Management System (to ensure the Browser and HTTP worker share the same IP), a Browser Management System (to handle the heavy lifting of token generation), and a State Manager.&lt;/p&gt;

&lt;p&gt;This is why "just scraping the API" is harder than it looks. The code to fetch the data is minimal: often just one function. But the infrastructure required to maintain the identity that grants access to that data is massive.&lt;/p&gt;

&lt;p&gt;At &lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_source=devto&amp;amp;utm_medium=post_cta&amp;amp;utm_campaign=zyte_api" rel="noopener noreferrer"&gt;Zyte&lt;/a&gt;&lt;/strong&gt;, we abstract this entire architecture. Our API handles the browser fingerprinting, the IP, and the session rotation automatically. You simply send us the URL, and we handle the "Hybrid" complexity in the background, delivering you the clean JSON response without the infrastructure headache.&lt;/p&gt;

&lt;p&gt;Want more? &lt;a href="https://www.zyte.com/join-community/?utm_source=devto&amp;amp;utm_medium=post_cta&amp;amp;utm_campaign=joincommunity" rel="noopener noreferrer"&gt;Join our community&lt;/a&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Modern Scrapy Developer's Guide (Part 3): Auto-Generating Page Objects with Web Scraping Co-pilot</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 18:55:04 +0000</pubDate>
      <link>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-3-auto-generating-page-objects-with-web-scraping-59fj</link>
      <guid>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-3-auto-generating-page-objects-with-web-scraping-59fj</guid>
      <description>&lt;p&gt;Welcome to Part 3 of our Modern Scrapy series.&lt;/p&gt;

&lt;p&gt;In Part 2, we refactored our spider to use Page Objects. That refactor was a huge improvement, but it was still a lot of &lt;em&gt;manual&lt;/em&gt; work. We had to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Manually create our &lt;code&gt;BookItem&lt;/code&gt; and &lt;code&gt;BookListPage&lt;/code&gt; schemas.&lt;/li&gt;
&lt;li&gt;Manually create the &lt;code&gt;bookstoscrape_com.py&lt;/code&gt; Page Object file.&lt;/li&gt;
&lt;li&gt;Manually use &lt;code&gt;scrapy shell&lt;/code&gt; to find all the CSS selectors.&lt;/li&gt;
&lt;li&gt;Manually write all the &lt;code&gt;@field&lt;/code&gt; parsers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What if you could do all of that in about 30 seconds?&lt;/p&gt;

&lt;p&gt;In this guide, we'll show you how to use the &lt;strong&gt;Web Scraping Co-pilot&lt;/strong&gt; (our VS Code extension) to &lt;strong&gt;automatically write 100% of your Items, Page Objects, and even your unit tests.&lt;/strong&gt; We'll take our simple spider from Part 1 and upgrade it to the professional &lt;code&gt;scrapy-poet&lt;/code&gt; architecture from Part 2, but this time, the AI will do all the heavy lifting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites &amp;amp; Setup
&lt;/h2&gt;

&lt;p&gt;This tutorial assumes you have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Completed Part 1 (see above)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Studio Code&lt;/strong&gt; installed.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Web Scraping Co-pilot&lt;/strong&gt; extension (which we'll install now).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 1: Installing Web Scraping Co-pilot
&lt;/h2&gt;

&lt;p&gt;Inside VS Code, go to the "Extensions" tab and search for &lt;code&gt;Web Scraping Co-pilot&lt;/code&gt; (published by Zyte).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfdxrrer9xtx4by0evfk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfdxrrer9xtx4by0evfk.png" alt="Web Scraping Copilot" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once installed, you'll see a new icon in your sidebar. Open it, and it will automatically detect your Scrapy project. It may ask to install a few dependencies like &lt;code&gt;pytest&lt;/code&gt;—allow it to do so. This setup process ensures your environment is ready for AI-powered generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Auto-Generating our &lt;code&gt;BookItem&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Let's start with the spider from Part 1. Our goal is to create a Page Object for our &lt;code&gt;BookItem&lt;/code&gt; and add &lt;em&gt;even more fields&lt;/em&gt; than we did in Part 2.&lt;/p&gt;

&lt;p&gt;In the Co-pilot chat window:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select "Web Scraping."&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Write a prompt like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Create a page object for the item BookItem using the sample URL &lt;a href="https://books.toscrape.com/catalogue/the-host_979/index.html" rel="noopener noreferrer"&gt;https://books.toscrape.com/catalogue/the-host_979/index.html&lt;/a&gt;"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The co-pilot will now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check your project:&lt;/strong&gt; It will confirm you have &lt;code&gt;scrapy-poet&lt;/code&gt; and &lt;code&gt;pytest&lt;/code&gt; (and will offer to install them if you don't).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add &lt;code&gt;scrapy-poet&lt;/code&gt; settings:&lt;/strong&gt; It will automatically add the &lt;code&gt;ADDONS&lt;/code&gt; and &lt;code&gt;SCRAPY_POET_DISCOVER&lt;/code&gt; settings to your &lt;code&gt;settings.py&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create your &lt;code&gt;items.py&lt;/code&gt;:&lt;/strong&gt; It will create a new &lt;code&gt;BookItem&lt;/code&gt; class, but this time it will &lt;em&gt;intelligently add all the fields it can find on the page&lt;/em&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/items.py (Auto-Generated!)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;

&lt;span class="nd"&gt;@attrs.define&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The structured data we extract from a book *detail* page.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- New!
&lt;/span&gt;    &lt;span class="n"&gt;number_of_reviews&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;-- New!
&lt;/span&gt;    &lt;span class="n"&gt;upc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;             &lt;span class="c1"&gt;# &amp;lt;-- New!
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create Fixtures:&lt;/strong&gt; It creates a &lt;code&gt;fixtures&lt;/code&gt; folder with the saved HTML and expected JSON output for testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write the Page Object:&lt;/strong&gt; It creates the &lt;code&gt;tutorial/pages/bookstoscrape_com.py&lt;/code&gt; file and writes the &lt;em&gt;entire&lt;/em&gt; Page Object, complete with all parsing logic and selectors, for &lt;em&gt;all&lt;/em&gt; the new fields.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/pages/bookstoscrape_com.py (Auto-Generated!)
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;web_poet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;returns&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tutorial.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;

&lt;span class="nd"&gt;@handle_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[books.toscrape.com/catalogue](https://books.toscrape.com/catalogue)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@returns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookDetailPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This Page Object handles parsing data from book detail pages.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;

    &lt;span class="c1"&gt;# All of this was written for us!
&lt;/span&gt;    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.availability::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;number_of_reviews&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table tr:last-child td::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table tr:first-child td::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In 30 seconds, the Co-pilot has done everything we did manually in Part 2, but &lt;em&gt;better&lt;/em&gt;—it even added more fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Running the AI-Generated Tests
&lt;/h2&gt;

&lt;p&gt;The best part? The Co-pilot &lt;em&gt;also&lt;/em&gt; wrote unit tests for you. It created a &lt;code&gt;tests&lt;/code&gt; folder with &lt;code&gt;test_bookstoscrape_com.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can just click "Run Tests" in the Co-pilot UI (or run &lt;code&gt;pytest&lt;/code&gt; in your terminal).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pytest
================ test session starts ================
...
tests/test_bookstoscrape_com.py::test_book_detail[book_0] PASSED
tests/test_bookstoscrape_com.py::test_book_detail[book_1] PASSED
...
================ 8 tests passed in 0.10s ================

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your parsing logic is now fully tested, and you didn't write a single line of test code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Refactoring the Spider (The Easy Way)
&lt;/h2&gt;

&lt;p&gt;Now, we just update our &lt;code&gt;tutorial/spiders/books.py&lt;/code&gt; to use this new architecture, just like in Part 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tutorial/spiders/books.py

import scrapy
# Import our new, auto-generated Item class
from tutorial.items import BookItem

class BooksSpider(scrapy.Spider):
    name = "books"
    # ... (rest of spider from Part 1) ...

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # We just tell Scrapy to call parse_book
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # We ask for the BookItem, and scrapy-poet does the rest!
    async def parse_book(self, response, book: BookItem):
        yield book

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Auto-Generating our &lt;code&gt;BookListPage&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;We can repeat the exact same process for our list page to finish the refactor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt the Co-pilot:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Create a page object for the list item BookListPage using the sample URL &lt;a href="https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html" rel="noopener noreferrer"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/a&gt;"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Co-pilot will create the &lt;code&gt;BookListPage&lt;/code&gt; item in &lt;code&gt;items.py&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It will create the &lt;code&gt;BookListPageObject&lt;/code&gt; in &lt;code&gt;bookstoscrape_com.py&lt;/code&gt; with the parsers for &lt;code&gt;book_urls&lt;/code&gt; and &lt;code&gt;next_page_url&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It will write and pass the tests.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now we can update our spider one last time to be &lt;em&gt;fully&lt;/em&gt; architected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tutorial/spiders/books.py (FINAL VERSION)

import scrapy
from tutorial.items import BookItem, BookListPage # Import both

class BooksSpider(scrapy.Spider):
    # ... (name, allowed_domains, url) ...

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # We now ask for the BookListPage item!
    async def parse_listpage(self, response, page: BookListPage):

        # All parsing logic is GONE from the spider.
        for url in page.book_urls:
            yield response.follow(url, callback=self.parse_book)

        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response, book: BookItem):
        yield book

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our spider is now just a "crawler." It has zero parsing logic. All the hard work of finding selectors and writing parsers was automated by the Co-pilot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The "Hybrid Developer"
&lt;/h2&gt;

&lt;p&gt;The Web Scraping Co-pilot doesn't replace you. It &lt;em&gt;accelerates&lt;/em&gt; you. It automates the 90% of work that is "grunt work" (finding selectors, writing boilerplate, creating tests) so you can focus on the 10% of work that matters: &lt;strong&gt;crawling logic, strategy, and handling complex sites.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is how we, as the maintainers of Scrapy, build spiders professionally.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What's Next? Join the Community.&lt;/p&gt;

&lt;p&gt;
💬 &lt;a href="https://discord.com/invite/extract-data-community-993441606642446397" rel="noopener noreferrer"&gt;TALK&lt;/a&gt;: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.&lt;br&gt;
▶️ &lt;a href="https://www.youtube.com/@zytedata" rel="noopener noreferrer"&gt;WATCH&lt;/a&gt;: This post was based on our video! Watch the full walkthrough on our YouTube channel.&lt;br&gt;
📩 &lt;a href="https://www.zyte.com/join-community/" rel="noopener noreferrer"&gt;READ&lt;/a&gt;: Want more? In Part 2, we'll cover Scrapy Items and Pipelines. Get the Extract newsletter so you don't miss it.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>tutorial</category>
      <category>scrapy</category>
      <category>programming</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Modern Scrapy Developer's Guide (Part 2): Page Objects with scrapy-poet</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 18:49:10 +0000</pubDate>
      <link>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-2-page-objects-with-scrapy-poet-5b6l</link>
      <guid>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-2-page-objects-with-scrapy-poet-5b6l</guid>
      <description>&lt;p&gt;Welcome to Part 2 of our Modern Scrapy series. In Part 1, we built a working spider that crawls and scrapes an entire category. But if you look at our code, it's already getting messy. Our &lt;code&gt;parse_listpage&lt;/code&gt; and &lt;code&gt;parse_book&lt;/code&gt; functions are mixing two different jobs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Crawling Logic:&lt;/strong&gt; Finding the next page and following links.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parsing Logic:&lt;/strong&gt; Finding the data (name, price) with CSS selectors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What happens when a selector changes? Or when you want to test your parsing logic? You have to run the whole spider. This is slow, hard to maintain, and difficult to test.&lt;/p&gt;

&lt;p&gt;In this guide, we'll fix this by refactoring our spider to a professional, modern standard using &lt;strong&gt;Scrapy Items&lt;/strong&gt; and &lt;strong&gt;Page Objects&lt;/strong&gt; (via &lt;code&gt;scrapy-poet&lt;/code&gt;). We will completely separate our crawling logic from our parsing logic. This will make our code cleaner, infinitely easier to test, and scalable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'll Build
&lt;/h2&gt;

&lt;p&gt;We will refactor our spider from Part 1. The spider itself will &lt;em&gt;only&lt;/em&gt; handle crawling (following links). All the parsing logic will be moved into dedicated "Page Object" classes. &lt;code&gt;scrapy-poet&lt;/code&gt; will automatically inject the correct, parsed item into our spider.&lt;/p&gt;

&lt;p&gt;Look at how clean our spider's &lt;code&gt;parse_book&lt;/code&gt; function becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The NEW parse_book function
# Where did the parsing logic go?! (Hint: scrapy-poet)

    async def parse_book(self, response, book: BookItem):
        # 'book' is a BookItem, magically injected and parsed
        # by scrapy-poet before this function is even called.
        # We just yield it.
        yield book

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;This tutorial builds directly on &lt;a href="https://www.zyte.com/learn/the-modern-scrapy-developers-guide/" rel="noopener noreferrer"&gt;Part 1: Building Your First Crawling Spider&lt;/a&gt;. Please complete that guide first, as we will be modifying the spider we built there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: The "Why" (Separation of Concerns)
&lt;/h2&gt;

&lt;p&gt;Our current spider is a monolith. The &lt;code&gt;BooksSpider&lt;/code&gt; class knows &lt;em&gt;how to crawl&lt;/em&gt; (find next page links, find product links) and &lt;em&gt;how to parse&lt;/em&gt; (extract &lt;code&gt;h1&lt;/code&gt; tags, extract &lt;code&gt;p.price_color&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;This is bad. If we want to reuse our parsing logic, or test it without re-crawling the web, we can't.&lt;/p&gt;

&lt;p&gt;The "Page Object" pattern solves this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Spider's Job:&lt;/strong&gt; Crawling. Its &lt;em&gt;only&lt;/em&gt; job is to navigate from page to page and yield &lt;code&gt;Requests&lt;/code&gt; or &lt;code&gt;Items&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Page Object's Job:&lt;/strong&gt; Parsing. Its &lt;em&gt;only&lt;/em&gt; job is to take a &lt;code&gt;response&lt;/code&gt; and extract structured data from it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;scrapy-poet&lt;/code&gt; is a library that automatically connects our spider to the correct Page Object.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Create Our "Schema" (Scrapy Items)
&lt;/h2&gt;

&lt;p&gt;First, let's define the data we're scraping. Instead of messy dictionaries, we'll use Scrapy Items. Scrapy comes with &lt;code&gt;attrs&lt;/code&gt;, a fantastic library for this.&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;tutorial/items.py&lt;/code&gt; and add two classes: one for our book data and one for our list page data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/items.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;

&lt;span class="nd"&gt;@attrs.define&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The structured data we extract from a book *detail* page.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="nd"&gt;@attrs.define&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookListPage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The data and links we extract from a *list* page.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;book_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;
    &lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is our "schema." It makes our code type-safe and easier to read.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Install and Configure &lt;code&gt;scrapy-poet&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;scrapy-poet&lt;/code&gt; is a separate package we need to install.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install scrapy-poet
uv add scrapy-poet
# or: pip install scrapy-poet

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we must enable it in &lt;code&gt;tutorial/settings.py&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/settings.py
&lt;/span&gt;
&lt;span class="c1"&gt;# Add this to enable the scrapy-poet add-on
&lt;/span&gt;&lt;span class="n"&gt;ADDONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy_poet.Addon&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Add this to tell scrapy-poet where to find our Page Objects
# 'tutorial.pages' means a folder named 'pages' in our 'tutorial' module
&lt;/span&gt;&lt;span class="n"&gt;SCRAPY_POET_DISCOVER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tutorial.pages&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Create Page Objects for Parsing
&lt;/h2&gt;

&lt;p&gt;Now for the magic. Let's create the &lt;code&gt;tutorial/pages&lt;/code&gt; module.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt; &lt;span class="n"&gt;tutorial&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;
&lt;span class="n"&gt;touch&lt;/span&gt; &lt;span class="n"&gt;tutorial&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside this new folder, create a file named &lt;code&gt;bookstoscrape_com.py&lt;/code&gt;. This file will hold all the parsing logic for &lt;code&gt;bookstoscrape.com&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is the most complex part, but it's a "set it and forget it" pattern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/pages/bookstoscrape_com.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;web_poet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;returns&lt;/span&gt;

&lt;span class="c1"&gt;# Import our Item schemas
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tutorial.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt;

&lt;span class="c1"&gt;# This class handles all book DETAIL pages
&lt;/span&gt;&lt;span class="nd"&gt;@handle_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[books.toscrape.com/catalogue](https://books.toscrape.com/catalogue)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@returns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookDetailPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This Page Object handles parsing data from book detail pages.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# The @field decorator tells scrapy-poet: "run this function
&lt;/span&gt;    &lt;span class="c1"&gt;# and put the result into the 'name' field of the BookItem."
&lt;/span&gt;    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;

&lt;span class="c1"&gt;# This class handles all book LIST pages (categories)
&lt;/span&gt;&lt;span class="nd"&gt;@handle_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[books.toscrape.com/catalogue/category](https://books.toscrape.com/catalogue/category)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@returns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BookListPage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookListPageObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    This Page Object handles parsing data from category/list pages.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;book_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article.product_pod h3 a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is our parsing logic from Part 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li.next a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at that! All our messy &lt;code&gt;response.css()&lt;/code&gt; calls are now neatly organized in their own classes, completely separate from our spider. The &lt;code&gt;@handle_urls&lt;/code&gt; decorator tells &lt;code&gt;scrapy-poet&lt;/code&gt; which Page Object to use for which URL.&lt;/p&gt;
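&lt;p&gt;To build intuition for what &lt;code&gt;@handle_urls&lt;/code&gt; is doing, here is a conceptual sketch of the routing idea in plain Python. This is &lt;em&gt;not&lt;/em&gt; scrapy-poet's real implementation (which uses a proper URL-matching library); it only illustrates the pattern: a decorator registers each Page Object against a URL pattern, and a lookup picks the most specific match.&lt;/p&gt;

```python
# Conceptual sketch only -- not scrapy-poet's actual internals.
# A registry maps URL patterns to Page Object classes.
_registry = {}


def handle_urls(pattern):
    def decorator(cls):
        _registry[pattern] = cls
        return cls
    return decorator


@handle_urls("books.toscrape.com/catalogue/category")
class ListPage: ...


@handle_urls("books.toscrape.com/catalogue")
class DetailPage: ...


def page_object_for(url: str):
    # Prefer the longest (most specific) matching pattern.
    matches = [p for p in _registry if p in url]
    return _registry[max(matches, key=len)] if matches else None
```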

&lt;h2&gt;
  
  
  Step 5: Refactor the Spider (The Payoff)
&lt;/h2&gt;

&lt;p&gt;Now, let's go back to &lt;code&gt;tutorial/spiders/books.py&lt;/code&gt; and refactor it. It becomes &lt;em&gt;much&lt;/em&gt; simpler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/spiders/books.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="c1"&gt;# Import our new Item classes
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tutorial.items&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BooksSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;books&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toscrape.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# We still start the same way
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The 'page: BookListPage' is new.
&lt;/span&gt;    &lt;span class="c1"&gt;# We ask for the BookListPage item, and scrapy-poet injects it.
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BookListPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# 1. Get the parsed book URLs from the Page Object
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# We follow each URL, but our callback no longer
&lt;/span&gt;            &lt;span class="c1"&gt;# needs to do any work!
&lt;/span&gt;            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Get the next page URL from the Page Object
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# The 'book: BookItem' is new.
&lt;/span&gt;    &lt;span class="c1"&gt;# We ask for the BookItem, and scrapy-poet injects it.
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BookItem&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Our parsing logic is GONE.
&lt;/span&gt;        &lt;span class="c1"&gt;# The 'book' variable is already a fully-populated
&lt;/span&gt;        &lt;span class="c1"&gt;# BookItem, parsed by our BookDetailPage Page Object.
&lt;/span&gt;
        &lt;span class="c1"&gt;# We just yield it.
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our spider is now &lt;em&gt;only&lt;/em&gt; responsible for crawling. All parsing is handled by &lt;code&gt;scrapy-poet&lt;/code&gt; and our Page Objects. This code is clean, testable, and incredibly easy to read.&lt;/p&gt;

&lt;p&gt;When you run &lt;code&gt;scrapy crawl books -o books.json&lt;/code&gt;, the output will be identical to Part 1, but the architecture is now far easier to maintain, test, and extend.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Hard Part": Why This Still Breaks
&lt;/h3&gt;

&lt;p&gt;We've built a professional, well-architected Scrapy spider. But we've just made a cleaner version of a spider that will still fail on a real-world site.&lt;/p&gt;

&lt;p&gt;This architecture is beautiful, but it doesn't solve the "real" problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;❌ IP Blocks:&lt;/strong&gt; You're still hitting the site from one IP. You will be blocked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;❌ CAPTCHAs:&lt;/strong&gt; When a CAPTCHA appears, the spider has no way to solve it, and the crawl fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;❌ JavaScript:&lt;/strong&gt; If the prices were loaded by JS, our &lt;code&gt;response.css()&lt;/code&gt; selectors would find &lt;em&gt;nothing&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've just organized our failing code.&lt;/p&gt;

&lt;p&gt;The "Easy Way": Zyte API as a Universal Page Object&lt;br&gt;
scrapy-poet is a great way to organise your scrapy code, making your projects easier to build, collaborate and maintain. However, it doesn't change the fact we are not doing anything to avoid web scraping bans.&lt;/p&gt;

&lt;p&gt;With a Zyte API account, we can add the settings below to route every request in our Scrapy project through Zyte API, which handles bans and proxy rotation for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# add scrapy-zyte-api python library
uv add scrapy-zyte-api
# settings.py
ZYTE_API_KEY = "YOUR_API_KEY"

ADDONS = {
    "scrapy_zyte_api.Addon": 500,
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the power of combining a great architecture (Scrapy) with a powerful service (Zyte API).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion &amp;amp; Next Steps
&lt;/h2&gt;

&lt;p&gt;Today you elevated your spider from a simple script to a professional-grade crawler. You learned the "Separation of Concerns" principle, defined data with &lt;code&gt;Items&lt;/code&gt;, and separated parsing logic with &lt;code&gt;scrapy-poet&lt;/code&gt;'s Page Objects.&lt;/p&gt;

&lt;p&gt;This is the modern way to build robust, testable, and scalable Scrapy spiders.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's Next? Join the Community
&lt;/h3&gt;

&lt;p&gt;💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.&lt;br&gt;
▶️ WATCH: This post was based on our video! Watch the full walkthrough on our YouTube channel.&lt;br&gt;
📩 READ: Want more? Get the Extract newsletter so you don't miss the next part of this series.&lt;/p&gt;

&lt;p&gt;And if you're ready to skip the "Hard Part" entirely, get your free API key and try the "Easy Way."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hubs.li/Q03YmnDF0" rel="noopener noreferrer"&gt;&lt;strong&gt;Start Your Free Zyte Trial&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>scrapy</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>The Modern Scrapy Developer's Guide (Part 1): Building Your First Spider</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Tue, 16 Dec 2025 18:41:29 +0000</pubDate>
      <link>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-1-building-your-first-spider-4gc2</link>
      <guid>https://forem.com/zyte/the-modern-scrapy-developers-guide-part-1-building-your-first-spider-4gc2</guid>
      <description>&lt;p&gt;Scrapy can feel daunting. It's a massive, powerful framework, and the documentation can be overwhelming for a newcomer. Where do you even begin?&lt;/p&gt;

&lt;p&gt;In this definitive guide, we will walk you through, step-by-step, how to build a real, multi-page crawling spider. You will go from an empty folder to a clean JSON file of structured data in about 15 minutes. We'll use modern, &lt;code&gt;async&lt;/code&gt;/&lt;code&gt;await&lt;/code&gt; Python and cover project setup, finding selectors, following links (crawling), and saving your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'll Build
&lt;/h2&gt;

&lt;p&gt;We will build a Scrapy spider that crawls the "Fantasy" category on &lt;a href="https://books.toscrape.com/" rel="noopener noreferrer"&gt;books.toscrape.com&lt;/a&gt;, follows the "Next" button to crawl every page in that category, follows the link for every book, and scrapes the name, price, and URL from all 48 books, saving the result to a clean &lt;code&gt;books.json&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Here's a preview of our final spider code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The final spider we'll build
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BooksSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;books&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toscrape.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html](https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;product_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article.product_pod h3 a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;product_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;next_page_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li.next a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prerequisites &amp;amp; Setup
&lt;/h2&gt;

&lt;p&gt;Before we start, you'll need Python 3.x installed. We'll also be using a virtual environment to keep our dependencies clean. You can use standard &lt;code&gt;pip&lt;/code&gt; or a modern package manager like &lt;code&gt;uv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;First, let's create a project folder and activate a virtual environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a new folder&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;scrapy_project
&lt;span class="nb"&gt;cd &lt;/span&gt;scrapy_project

&lt;span class="c"&gt;# Option 1: Using standard pip + venv&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate  &lt;span class="c"&gt;# On Windows, use: .venv\Scripts\activate&lt;/span&gt;

&lt;span class="c"&gt;# Option 2: Using uv (a fast, modern alternative)&lt;/span&gt;
uv init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's install Scrapy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Option 1: Using pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy

&lt;span class="c"&gt;# Option 2: Using uv&lt;/span&gt;
uv add scrapy
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Initialize Your Project
&lt;/h2&gt;

&lt;p&gt;With Scrapy installed, we can use its built-in command-line tools to generate our project boilerplate.&lt;/p&gt;

&lt;p&gt;First, create the project itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The 'scrapy startproject' command creates the project structure&lt;/span&gt;
&lt;span class="c"&gt;# The '.' tells it to use the current folder&lt;/span&gt;
scrapy startproject tutorial &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see a &lt;code&gt;tutorial&lt;/code&gt; folder and a &lt;code&gt;scrapy.cfg&lt;/code&gt; file appear. This folder contains all your project's logic.&lt;/p&gt;
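&lt;p&gt;For orientation, the generated layout looks roughly like this (the standard Scrapy project template; exact file contents vary by Scrapy version):&lt;/p&gt;

```
scrapy_project/
├── scrapy.cfg          # deploy/config entry point
└── tutorial/
    ├── __init__.py
    ├── items.py        # data schemas
    ├── middlewares.py  # request/response hooks
    ├── pipelines.py    # post-processing of scraped items
    ├── settings.py     # project configuration
    └── spiders/
        └── __init__.py # your spiders live here
```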

&lt;p&gt;Next, we'll generate our first spider.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The 'genspider' command creates a new spider file&lt;/span&gt;
&lt;span class="c"&gt;# Usage: scrapy genspider &amp;lt;spider_name&amp;gt; &amp;lt;allowed_domain&amp;gt;&lt;/span&gt;
scrapy genspider books toscrape.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you look in &lt;code&gt;tutorial/spiders/&lt;/code&gt;, you'll now see &lt;code&gt;books.py&lt;/code&gt;. This is where we'll write our code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Configure Your Settings
&lt;/h2&gt;

&lt;p&gt;Before we write our spider, let's quickly adjust two settings in &lt;code&gt;tutorial/settings.py&lt;/code&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ROBOTSTXT_OBEY&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default, Scrapy respects robots.txt files. This is a good practice, but our test site (toscrape.com) doesn't have one, which can cause a 404 error in our logs. We'll turn it off for this tutorial.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/settings.py
&lt;/span&gt;
&lt;span class="c1"&gt;# Find this line and change it to False
&lt;/span&gt;&lt;span class="n"&gt;ROBOTSTXT_OBEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Concurrency&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Scrapy is polite by default and runs slowly. Since toscrape.com is a test site built for scraping, we can speed it up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/settings.py
&lt;/span&gt;
&lt;span class="c1"&gt;# Uncomment or add these lines
&lt;/span&gt;&lt;span class="n"&gt;CONCURRENT_REQUESTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="n"&gt;DOWNLOAD_DELAY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; These settings are for this test site only. When scraping in the wild, you must be mindful of your target site and use respectful &lt;code&gt;DOWNLOAD_DELAY&lt;/code&gt; and &lt;code&gt;CONCURRENT_REQUESTS&lt;/code&gt; values.&lt;/p&gt;
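&lt;p&gt;For real-world crawls, a safer starting point than raw concurrency is Scrapy's built-in AutoThrottle extension, which adapts the request rate to how quickly the server responds. A conservative sketch (the values are illustrative, tune them per site):&lt;/p&gt;

```python
# settings.py -- a gentler profile for real-world targets (illustrative values)
ROBOTSTXT_OBEY = True                  # respect robots.txt outside of test sites
DOWNLOAD_DELAY = 1.0                   # base delay between requests
AUTOTHROTTLE_ENABLED = True            # adapt request rate to observed latency
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per remote site
```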

&lt;h2&gt;
  
  
  Step 3: Finding Our Selectors (with &lt;code&gt;scrapy shell&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;To scrape a site, we need to tell Scrapy &lt;em&gt;what&lt;/em&gt; data to get. We do this with CSS selectors. The &lt;code&gt;scrapy shell&lt;/code&gt; is the best tool for this.&lt;/p&gt;

&lt;p&gt;Let's launch the shell on our target category page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy shell https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will download the page and give you an interactive shell with a &lt;code&gt;response&lt;/code&gt; object.&lt;/p&gt;

&lt;p&gt;You can even type &lt;code&gt;view(response)&lt;/code&gt; to open the page in your browser exactly as Scrapy sees it!&lt;/p&gt;

&lt;p&gt;Let's find the data we need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find all Book Links:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By inspecting the page, we see each book is in an &lt;code&gt;article.product_pod&lt;/code&gt;. The link is inside an &lt;code&gt;h3&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In scrapy shell:
&amp;gt;&amp;gt;&amp;gt; response.css("article.product_pod h3 a::attr(href)").getall()
[
  '../../../../the-host_979/index.html',
  '../../../../the-hunted_978/index.html',
  ...
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;getall()&lt;/code&gt; gives us a clean list of all the URLs.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Find the "Next" Page Link:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At the bottom, we find the "Next" button in an &lt;code&gt;li.next&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In scrapy shell:
&amp;gt;&amp;gt;&amp;gt; response.css("li.next a::attr(href)").get()
'page-2.html'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;get()&lt;/code&gt; gives us the single link we need for pagination.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Find the Book Data (on a product page):&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, let's open a shell on a product page to find the selectors for our data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Exit the shell and open a new one:
scrapy shell https://books.toscrape.com/catalogue/the-host_979/index.html

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In scrapy shell:
&amp;gt;&amp;gt;&amp;gt; response.css("h1::text").get()
'The Host'

&amp;gt;&amp;gt;&amp;gt; response.css("p.price_color::text").get()
'£25.82'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perfect. We now have all the selectors we need.&lt;/p&gt;
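&lt;p&gt;It can help to collect everything from this step in one place before writing the spider. A plain-Python summary of the selectors we just verified (the &lt;code&gt;SELECTORS&lt;/code&gt; dict is only a notebook for us, not something Scrapy requires):&lt;/p&gt;

```python
# CSS selectors discovered in scrapy shell, keyed by what they extract
SELECTORS = {
    "book_links": "article.product_pod h3 a::attr(href)",  # category pages
    "next_page": "li.next a::attr(href)",                  # category pages
    "name": "h1::text",                                    # product pages
    "price": "p.price_color::text",                        # product pages
}
```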

&lt;h2&gt;
  
  
  Step 4: Building the Spider (Crawling &amp;amp; Parsing)
&lt;/h2&gt;

&lt;p&gt;Now, let's open &lt;code&gt;tutorial/spiders/books.py&lt;/code&gt; and write our spider. Here is the clean, final version.&lt;/p&gt;

&lt;p&gt;Delete the boilerplate in &lt;code&gt;books.py&lt;/code&gt; and replace it with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tutorial/spiders/books.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BooksSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;books&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toscrape.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# This is our starting URL (the first page of the Fantasy category)
&lt;/span&gt;    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# This is the modern, async version of 'start_requests'
&lt;/span&gt;    &lt;span class="c1"&gt;# It's called once when the spider starts.
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# We yield our first request, sending the response to 'parse_listpage'
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This function handles the *category page*
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# 1. Get all product URLs using the selector we found
&lt;/span&gt;        &lt;span class="n"&gt;product_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article.product_pod h3 a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. For each product URL, follow it and send the response to 'parse_book'
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;product_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Find the 'Next' page URL
&lt;/span&gt;        &lt;span class="n"&gt;next_page_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li.next a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# 4. If a 'Next' page exists, follow it and send the response
&lt;/span&gt;        &lt;span class="c1"&gt;#    back to *this same function*
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_page_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_listpage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This function handles the *product page*
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_book&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# We yield a dictionary of the data we want
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p.price_color::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code is clean and efficient. &lt;code&gt;response.follow&lt;/code&gt; is smart enough to handle the relative URLs (like &lt;code&gt;page-2.html&lt;/code&gt;) for us.&lt;/p&gt;
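&lt;p&gt;If you're curious what &lt;code&gt;response.follow&lt;/code&gt; is doing with those relative links: it resolves them against the current page URL, the same way the standard library's &lt;code&gt;urllib.parse.urljoin&lt;/code&gt; does. A standalone sketch using the pagination link from the shell session:&lt;/p&gt;

```python
from urllib.parse import urljoin

# The category page the spider is currently on
base = "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

# 'page-2.html' is the relative link extracted with li.next a::attr(href);
# urljoin replaces the last path segment, just as response.follow does
next_page = urljoin(base, "page-2.html")
print(next_page)
```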

&lt;h2&gt;
  
  
  Step 5: Running The Spider &amp;amp; Saving Data
&lt;/h2&gt;

&lt;p&gt;We're ready to run. Go to your terminal (at the project root) and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl books
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see Scrapy start up, and in the logs, you'll see all 48 items being scraped!&lt;/p&gt;

&lt;p&gt;But we want to &lt;em&gt;save&lt;/em&gt; this data. Scrapy has a built-in "Feed Exporter" that makes this easy. We just use the &lt;code&gt;-o&lt;/code&gt; (output) flag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrapy crawl books -o books.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will run the spider again, but this time you'll see a new &lt;code&gt;books.json&lt;/code&gt; file in your project root, containing all 48 items, perfectly structured. (Note that &lt;code&gt;-o&lt;/code&gt; appends to an existing file, which produces invalid JSON on repeat runs; use &lt;code&gt;-O&lt;/code&gt; to overwrite instead.)&lt;/p&gt;
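&lt;p&gt;The exported feed is a plain JSON array, so it's easy to sanity-check with the standard library. A minimal sketch (the sample row below is hard-coded to mirror the feed's shape; a real check would load &lt;code&gt;books.json&lt;/code&gt; from disk):&lt;/p&gt;

```python
import json

# One row in the shape parse_book yields; a real check would read books.json
sample_feed = '[{"name": "The Host", "price": "£25.82", "url": "https://books.toscrape.com/catalogue/the-host_979/index.html"}]'

books = json.loads(sample_feed)
for book in books:
    # Every item should carry the three fields yielded by parse_book
    assert {"name", "price", "url"}.issubset(book)

print(len(books))
```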

&lt;h2&gt;
  
  
  Conclusion &amp;amp; Next Steps
&lt;/h2&gt;

&lt;p&gt;Today you built a powerful, modern, async Scrapy crawler. You learned how to set up a project, find selectors, follow links, and handle pagination.&lt;/p&gt;

&lt;p&gt;This is just the starting point.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h3&gt;
  
  
  What's Next? Join the Community.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;💬 TALK:&lt;/strong&gt; Stuck on this Scrapy code? &lt;a href="https://discord.com/invite/extract-data-community-993441606642446397" rel="noopener noreferrer"&gt;Ask the maintainers and 5k+ devs in our Discord.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;▶️ WATCH:&lt;/strong&gt; This post was based on our video! &lt;a href="https://www.youtube.com/@zytedata" rel="noopener noreferrer"&gt;&lt;strong&gt;Watch the full walkthrough on our YouTube channel.&lt;/strong&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📩 READ:&lt;/strong&gt; Want more? In Part 2, we'll cover Scrapy Items and Pipelines. &lt;a href="https://www.zyte.com/join-community/" rel="noopener noreferrer"&gt;Get the Extract newsletter so you don't miss it.&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

</description>
      <category>programming</category>
      <category>webscraping</category>
      <category>scrapy</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stop Writing Messy Spiders. The Professional Way with Scrapy-Poet</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Fri, 03 Oct 2025 14:12:08 +0000</pubDate>
      <link>https://forem.com/extractdata/stop-writing-messy-spiders-the-professional-way-with-scrapy-poet-6og</link>
      <guid>https://forem.com/extractdata/stop-writing-messy-spiders-the-professional-way-with-scrapy-poet-6og</guid>
      <description>&lt;p&gt;If you were building a web app, you wouldn't cram your database queries and business logic into your API routes. That would be a maintenance nightmare. So why do we accept this in our Scrapy projects? We build massive, unwieldy spiders where crawling logic and parsing logic are tangled together in huge &lt;code&gt;parse&lt;/code&gt; methods.&lt;/p&gt;

&lt;p&gt;There’s a better way. It’s time to introduce a clean separation of concerns to your spiders.&lt;/p&gt;

&lt;p&gt;In this guide, I’ll introduce you to &lt;strong&gt;Scrapy-Poet&lt;/strong&gt;, the official integration of the &lt;code&gt;web-poet&lt;/code&gt; library. It allows you to use a powerful architectural pattern called &lt;strong&gt;Page Objects&lt;/strong&gt;, which separates your parsing logic from your spider's crawling duties. The result? Cleaner, more maintainable, and highly testable code.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Glimpse of the Future: The Page Object Pattern
&lt;/h2&gt;

&lt;p&gt;Let's look at the difference. A traditional spider is often a long file with complex &lt;code&gt;parse_item&lt;/code&gt; methods full of CSS and XPath selectors.&lt;/p&gt;

&lt;p&gt;The Scrapy-Poet way is different. Your spider becomes a clean, concise crawling manager. Its only jobs are to manage requests, follow links, and hand off the response to the correct Page Object.&lt;/p&gt;

&lt;p&gt;Look how clean this spider is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# products.py (The Spider)
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProductsSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ItemListPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The spider yields the Page Object, which handles parsing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;follow_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_product&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ProductPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The product page object extracts the final item.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what's missing? There are &lt;strong&gt;no selectors&lt;/strong&gt; in the spider! The &lt;code&gt;parse&lt;/code&gt; methods simply declare which Page Object (&lt;code&gt;ItemListPage&lt;/code&gt; or &lt;code&gt;ProductPage&lt;/code&gt;) they expect, and Scrapy-Poet injects it, fully parsed. The spider's job is now pure navigation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Meet Your New Best Friend: The Page Object
&lt;/h2&gt;

&lt;p&gt;So where did all the parsing logic go? It moved into &lt;strong&gt;Page Objects&lt;/strong&gt;. A Page Object is a simple Python class dedicated to understanding and extracting data from a single type of web page.&lt;/p&gt;

&lt;h3&gt;
  
  
  The List Page Object
&lt;/h3&gt;

&lt;p&gt;Here’s the Page Object for our product listing page. Its only job is to find all the product URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pages/list.py (The List Page Object)
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ItemListPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A page object for parsing product listing pages.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;product_urls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extracts all product URLs from the page.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.product a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s a simple class that inherits from &lt;code&gt;WebPage&lt;/code&gt; and has one method, &lt;code&gt;product_urls&lt;/code&gt;, decorated with &lt;code&gt;@field&lt;/code&gt; (both &lt;code&gt;WebPage&lt;/code&gt; and &lt;code&gt;field&lt;/code&gt; come from the &lt;code&gt;web_poet&lt;/code&gt; package). This method contains the selector and logic to get the links. That's it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Product Page Object
&lt;/h3&gt;

&lt;p&gt;The detail page is more complex, but the principle is the same. Each piece of data we want gets its own method, cleanly mapping item fields to selectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pages/product.py (The Product Page Object)
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProductPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WebPage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A page object for parsing product detail pages.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# This method defines the final, structured item to be returned
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;to_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# ... other fields
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;h1.product-title::text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@field&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;price_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.price::text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_clean_price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# You can call helper methods
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_clean_price&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price_str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ... (data cleaning logic here)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All the logic for finding, extracting, and cleaning product data is now neatly organized in one place. If a selector breaks, you know exactly which file to open.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Incredible Benefits (Why You'll Never Go Back)
&lt;/h2&gt;

&lt;p&gt;Adopting this pattern isn't just about tidiness; it unlocks professional-grade benefits.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Radically Improved Maintainability
&lt;/h3&gt;

&lt;p&gt;When a website changes its layout (and it will), you no longer have to hunt through a giant spider file.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Price selector changed?&lt;/strong&gt; Open &lt;code&gt;pages/product.py&lt;/code&gt; and fix the &lt;code&gt;price&lt;/code&gt; method.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need to add a new field?&lt;/strong&gt; Add it to your Item and then add a new &lt;code&gt;@field&lt;/code&gt; method in the corresponding Page Object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your spider remains completely untouched. This isolation makes maintenance fast and painless.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Finally, Real Testability 🧪
&lt;/h3&gt;

&lt;p&gt;This is the game-changer. You can now write unit tests for your parsing logic &lt;strong&gt;without ever running the spider&lt;/strong&gt;. Using a framework like &lt;code&gt;pytest&lt;/code&gt;, you can feed saved HTML files directly to your Page Objects and assert that they extract the correct data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tests/test_product_page.py
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_product_page_parsing&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Load a saved HTML file as a fixture
&lt;/span&gt;    &lt;span class="n"&gt;html_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixtures/product.html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProductPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;html_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Assert that your selectors work as expected
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Awesome Product&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;99.99&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means you can validate your selectors in milliseconds, making your spiders incredibly robust and reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Supercharged Team Collaboration 🤝
&lt;/h3&gt;

&lt;p&gt;This pattern establishes a clear, repeatable structure for your projects. When a new developer joins the team, the architecture is self-explanatory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;items.py&lt;/code&gt; defines the data shape.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;pages/&lt;/code&gt; directory contains all parsing logic.&lt;/li&gt;
&lt;li&gt;Spiders in the &lt;code&gt;spiders/&lt;/code&gt; directory handle only crawling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This consistency makes it easy for anyone to contribute effectively right away.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Get Started (It's Easier Than You Think)
&lt;/h2&gt;

&lt;p&gt;Integrating Scrapy-Poet into your project is straightforward.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install &lt;code&gt;scrapy-poet&lt;/code&gt;:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy-poet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Activate it in &lt;code&gt;settings.py&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# settings.py
&lt;/span&gt;
&lt;span class="n"&gt;Addons&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy_poet.Addon&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tell it where to find your Page Objects:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# settings.py
&lt;/span&gt;
&lt;span class="c1"&gt;# This points to the directory where you'll store your page objects.
&lt;/span&gt;&lt;span class="n"&gt;SCRAPY_POET_DISCOVER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_project.pages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Create a &lt;code&gt;pages&lt;/code&gt; directory in your project, make it a Python module by adding an &lt;code&gt;__init__.py&lt;/code&gt; file, and start building your Page Objects. That’s all it takes.&lt;/p&gt;
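&lt;p&gt;The resulting project layout (module names here are illustrative) looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;your_project/
├── settings.py        # ADDONS and SCRAPY_POET_DISCOVER live here
├── items.py           # data shape
├── pages/
│   ├── __init__.py
│   └── product.py     # Page Objects (parsing logic)
└── spiders/
    └── products.py    # spiders (crawling only)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;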

&lt;p&gt;This is the exact pattern we use internally at Zyte to build and maintain our spiders at scale, and it’s highly recommended by the Scrapy maintainers themselves. It makes your code more structured, more testable, and ultimately, more professional.&lt;/p&gt;

&lt;p&gt;Say goodbye to monolithic spiders and hello to a cleaner, more powerful way of scraping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Code Project
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/johnatzyte/scrapy-poet-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>scrapy</category>
      <category>webscraping</category>
      <category>programming</category>
      <category>zyte</category>
    </item>
  </channel>
</rss>
