<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Extract by Zyte</title>
    <description>The latest articles on Forem by Extract by Zyte (@extractdata).</description>
    <link>https://forem.com/extractdata</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F11159%2F9b0ab14b-3550-4e5e-b996-02b33c0912fa.jpg</url>
      <title>Forem: Extract by Zyte</title>
      <link>https://forem.com/extractdata</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/extractdata"/>
    <language>en</language>
    <item>
      <title>Build Scrapy spiders in 23.54 seconds with this free Claude skill</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:50:39 +0000</pubDate>
      <link>https://forem.com/extractdata/build-scrapy-spiders-in-2354-seconds-with-this-free-claude-skill-50em</link>
      <guid>https://forem.com/extractdata/build-scrapy-spiders-in-2354-seconds-with-this-free-claude-skill-50em</guid>
      <description>&lt;p&gt;I built a Claude skill that generates &lt;a href="https://docs.scrapy.org/en/latest" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; spiders in under 30 seconds — ready to run, ready to extract good data. In this post I'll walk through what I built, the design decisions behind it, and where I think it can go next.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;The skill takes a single input: a category or product listing URL. From there, Claude generates a complete, runnable Scrapy spider as a single Python script. No project setup, no configuration files, no boilerplate to write. Just a script you can run immediately.&lt;/p&gt;

&lt;p&gt;Here's what that looks like in practice. I opened Claude Code in an empty folder with dependencies installed, activated the skill, and said: "Create a spider for this site" — and pasted a URL.&lt;/p&gt;

&lt;p&gt;Within seconds, the script was generated. I ran it, watched the products roll in, piped the output through &lt;code&gt;jq&lt;/code&gt;, and had clean structured product data. Start to finish: under a minute.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/2pQD412kJIw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a single-file script, not a full Scrapy project?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.scrapy.org/en/latest/topics/practices.html" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; is usually a full project — multiple files, lots of moving parts, a proper setup process. Running it from a script instead is generally discouraged for production work, but for this use case it's actually the right call.&lt;/p&gt;

&lt;p&gt;The goal here is what I'd call pump-and-dump scraping: give Claude a URL, get a spider, run it for a couple of days, move on. It's not designed to scrape millions of products every day for years. For that kind of scale you need proper infrastructure, robust monitoring, and serious logging. This isn't that — and that's intentional.&lt;/p&gt;

&lt;p&gt;What you do get, even in the single-file approach, is almost everything Scrapy offers: middleware, automatic retries, and concurrency handling. You'd have to build all of that yourself with a plain &lt;code&gt;requests&lt;/code&gt; script. Scrapy gives it to you for free, even when running from a script.&lt;/p&gt;

&lt;h2&gt;
  
  
  The key design decision: AI extraction
&lt;/h2&gt;

&lt;p&gt;The other major call I made was to lean entirely on &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;'s AI extraction rather than generating CSS or XPath selectors.&lt;/p&gt;

&lt;p&gt;Specifically, the skill uses two extraction types chained together: &lt;code&gt;productNavigation&lt;/code&gt; on the category or listing page, which returns product URLs and the next page link, and &lt;code&gt;product&lt;/code&gt; on each product URL, which returns structured product data including name, price, availability, brand, SKU (stock keeping unit), description, images, and more.&lt;/p&gt;
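&lt;p&gt;The chaining step can be sketched as a small helper over a simplified &lt;code&gt;productNavigation&lt;/code&gt; result (the real Zyte API response carries more fields than shown here):&lt;/p&gt;

```python
def urls_from_navigation(nav):
    """Collect product URLs plus the next-page link from a (simplified)
    productNavigation result; each URL then gets a `product` extraction request."""
    urls = [item["url"] for item in nav.get("items", [])]
    next_page = (nav.get("nextPage") or {}).get("url")
    if next_page:
        urls.append(next_page)
    return urls
```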

&lt;p&gt;This means the spider doesn't need to know anything about the structure of the site it's crawling. There are no selectors to generate, no schema to define, no user confirmation step. The AI on &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte's&lt;/a&gt; end handles all of that. It does cost slightly more than a raw HTTP request, but given how little time it takes to go from URL to working spider, the trade-off makes sense.&lt;/p&gt;

&lt;p&gt;I've hardcoded &lt;code&gt;httpResponseBody&lt;/code&gt; as the extraction source — it's faster and more cost-efficient than browser rendering. If a site is JavaScript-heavy and you're not getting the data you need, you can switch to &lt;code&gt;browserHtml&lt;/code&gt; with a one-line change. The spider logs a warning to remind you of this.&lt;/p&gt;
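&lt;p&gt;That one-line change amounts to flipping the extraction source in the request parameters, roughly like this (a sketch following Zyte API's product extraction options, not the skill's exact code):&lt;/p&gt;

```python
def product_request_params(extract_from="httpResponseBody"):
    # Pass extract_from="browserHtml" for JavaScript-heavy sites; this is
    # the one-line change the spider's warning points you at.
    return {
        "product": True,
        "productOptions": {"extractFrom": extract_from},
    }
```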

&lt;h2&gt;
  
  
  The use case is deliberately narrow
&lt;/h2&gt;

&lt;p&gt;This skill is designed for e-commerce sites, and only e-commerce sites. That's not a limitation I stumbled into — it's a feature.&lt;/p&gt;

&lt;p&gt;Because the scope is narrow, the spider structure is simple and predictable: category pages with pagination, product links, and detail pages. &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;'s &lt;code&gt;productNavigation&lt;/code&gt; and &lt;code&gt;product&lt;/code&gt; extraction types handle this reliably. Widening the scope to arbitrary crawling would require a lot more of Scrapy's machinery and would quickly exceed what makes sense for a lightweight script like this.&lt;/p&gt;

&lt;p&gt;What it doesn't do: deep subcategory crawling, link discovery, or full-site crawls. If a page renders all its products without pagination, that works fine — the next page just returns nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging and output
&lt;/h2&gt;

&lt;p&gt;I replaced &lt;a href="https://docs.scrapy.org/en/latest/topics/practices.html" rel="noopener noreferrer"&gt;Scrapy's&lt;/a&gt; default logging with Rich logging, which gives cleaner terminal output. Scrapy's logs are verbose in ways that aren't useful when you're running a short-lived script — I wanted something concise enough that if something went wrong, it would be obvious at a glance.&lt;/p&gt;

&lt;p&gt;Output goes to a &lt;code&gt;.jsonl&lt;/code&gt; file named after the spider, alongside a plain &lt;code&gt;.log&lt;/code&gt; file. Both are derived from the spider name, which is itself derived from the domain. Run &lt;code&gt;example_com.py&lt;/code&gt;, get &lt;code&gt;example_com.jsonl&lt;/code&gt; and &lt;code&gt;example_com.log&lt;/code&gt;.&lt;/p&gt;
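&lt;p&gt;The naming rule is simple enough to sketch in a few lines (assuming a straightforward domain-to-identifier mapping; the skill's actual logic may differ in detail):&lt;/p&gt;

```python
from urllib.parse import urlparse

def output_names(url):
    """Derive the spider filename and its .jsonl/.log outputs from the domain."""
    domain = urlparse(url).netloc
    if domain.startswith("www."):
        domain = domain[4:]
    stem = domain.replace(".", "_").replace("-", "_")
    return (stem + ".py", stem + ".jsonl", stem + ".log")
```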

&lt;h2&gt;
  
  
  Where this goes next
&lt;/h2&gt;

&lt;p&gt;The immediate next step I have in mind is selector-based extraction as an alternative path — useful for sites where the AI extraction isn't quite right, or where you want more control over exactly what gets pulled.&lt;/p&gt;

&lt;p&gt;The longer-term vision is running this fully agentically. URLs get submitted somewhere — a queue, a database table, a form — an agent picks them up, builds the spider, and maybe runs a quick validation. The spider then goes into a pool to be run on a schedule, and data lands in a database rather than a flat file. Give Claude access to a virtual private server (VPS) via terminal and most of this is achievable without much extra infrastructure. The skill is already the hard part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Download the skill
&lt;/h2&gt;

&lt;p&gt;The skill is free to download and use. It's a single &lt;code&gt;.skill&lt;/code&gt; file you can install directly into Claude Code. You'll need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy scrapy-zyte-api rich
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://www.zyte.com/blog/scrapy-in-2026-modern-async-crawling/" rel="noopener noreferrer"&gt;Scrapy 2.13 &lt;/a&gt;or above is required for &lt;code&gt;AsyncCrawlerProcess&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The link to the repo and the skill download are in the video description, and &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;here&lt;/a&gt;. If you've built something similar, or have thoughts on the design decisions — especially around the extraction approach or the logging setup — I'd love to hear it in the comments. GitHub links to your own scrapers are very welcome too.&lt;/p&gt;

&lt;p&gt;If you're interested in more agentic scraping patterns, I also built a Claude skill that helps spiders recover from excessive bans — you can watch that video &lt;a href="https://youtu.be/bF24BLZWlOk" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>scrapy</category>
      <category>claudeskills</category>
      <category>webscraping</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a Self-Healing Web Scraper to Auto-Solve 403s</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Mon, 16 Mar 2026 11:09:31 +0000</pubDate>
      <link>https://forem.com/extractdata/i-built-a-self-healing-web-scraper-to-auto-solve-403s-1jg1</link>
      <guid>https://forem.com/extractdata/i-built-a-self-healing-web-scraper-to-auto-solve-403s-1jg1</guid>
      <description>&lt;p&gt;Web scraping has a recurring enemy: the 403. Sites add bot detection, anti-scraping tools update their challenges, and scrapers that worked fine last week start silently failing. The usual fix is manual — check the logs, diagnose the cause, update the config, redeploy. I wanted to see if an agent could handle that loop instead.&lt;/p&gt;

&lt;p&gt;So I built a self-healing scraper. After each crawl, a Claude-powered agent reads the failure logs, probes the broken domains with escalating fetch strategies, and rewrites the config automatically. By the next run, it's already fixed itself.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/bF24BLZWlOk"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The project has two parts: a scraper and a self-healing agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  The scraper
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;main.py&lt;/code&gt; is a straightforward Python scraper driven entirely by a &lt;code&gt;config.json&lt;/code&gt; file. Each domain entry tells the scraper which URLs to fetch and how to fetch them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"books"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"zyte"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"browser_html"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"https://www.bookstocsrape.co.uk/products/..."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are three fetch modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct&lt;/strong&gt; — a plain &lt;code&gt;requests.get()&lt;/code&gt;. Fast, free, works for sites that don't block bots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; (&lt;code&gt;httpResponseBody&lt;/code&gt;)&lt;/strong&gt; — routes the request through Zyte's residential proxy network. Good for sites that block datacenter IPs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; (&lt;code&gt;browserHtml&lt;/code&gt;)&lt;/strong&gt; — spins up a real browser via Zyte, executes JavaScript, and returns the fully-rendered DOM. Required for sites using JS-based bot challenges.&lt;/li&gt;
&lt;/ul&gt;
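&lt;p&gt;Selecting between the three modes comes down to two config flags, which can be sketched as (field names taken from the &lt;code&gt;config.json&lt;/code&gt; example above):&lt;/p&gt;

```python
def choose_mode(entry):
    """Map one config.json domain entry to a fetch mode."""
    if not entry.get("zyte"):
        return "direct"          # plain requests.get()
    if entry.get("browser_html"):
        return "browserHtml"     # full browser render via Zyte
    return "httpResponseBody"    # Zyte proxy fetch, no rendering
```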

&lt;p&gt;Every request is logged to &lt;code&gt;scraper.log&lt;/code&gt; in the same format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-03-14 09:12:01 url=https://... domain_id=scan status=200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a request throws any exception, it's recorded as a 403. That keeps the log clean and gives the agent a consistent signal to act on.&lt;/p&gt;
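&lt;p&gt;The logging wrapper is roughly this shape (a sketch of the described behaviour, not the file's exact code):&lt;/p&gt;

```python
import logging

def fetch_and_log(url, domain_id, fetch):
    """Run a fetch callable and log one line per request; any exception
    collapses to a 403 so the agent sees one consistent failure signal."""
    try:
        status = fetch(url)
    except Exception:
        status = 403
    logging.info("url=%s domain_id=%s status=%s", url, domain_id, status)
    return status
```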

&lt;h3&gt;
  
  
  The self-healing agent
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;agent.py&lt;/code&gt; is a Claude-powered agent that runs after each crawl. It uses the &lt;a href="https://github.com/anthropics/claude-agent-sdk-python" rel="noopener noreferrer"&gt;Claude Agent SDK&lt;/a&gt; and has access to three tools: &lt;code&gt;Read&lt;/code&gt;, &lt;code&gt;Bash&lt;/code&gt;, and &lt;code&gt;Edit&lt;/code&gt; — enough to operate completely autonomously.&lt;/p&gt;

&lt;p&gt;The agent works through a staged process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read the log&lt;/strong&gt; — finds every domain that returned a 403&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-reference the config&lt;/strong&gt; — skips domains already configured to use Zyte&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 1 probe&lt;/strong&gt; — uses the &lt;code&gt;zyte-api&lt;/code&gt; CLI to fetch one URL per failing domain with &lt;code&gt;httpResponseBody&lt;/code&gt;, then inspects the page &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Challenge detection&lt;/strong&gt; — if the title contains phrases like &lt;em&gt;"Just a moment"&lt;/em&gt;, &lt;em&gt;"Checking your browser"&lt;/em&gt;, or &lt;em&gt;"Verifying you are human"&lt;/em&gt;, the page is flagged as a bot challenge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 2 probe&lt;/strong&gt; — challenge pages are re-probed using &lt;code&gt;browserHtml&lt;/code&gt;, which runs a real browser to bypass JS-based detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config update&lt;/strong&gt; — the agent edits &lt;code&gt;config.json&lt;/code&gt; directly, setting &lt;code&gt;zyte: true&lt;/code&gt; and/or &lt;code&gt;browser_html: true&lt;/code&gt; for domains that now work&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The next crawl automatically uses the right fetch strategy. No manual intervention needed.&lt;/p&gt;
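&lt;p&gt;The challenge-detection step (stage 4) is a simple title heuristic, which can be sketched as:&lt;/p&gt;

```python
CHALLENGE_PHRASES = ("just a moment", "checking your browser", "verifying you are human")

def is_bot_challenge(title):
    """Flag a page as a bot challenge if its title matches a known phrase."""
    lowered = title.lower()
    return any(phrase in lowered for phrase in CHALLENGE_PHRASES)
```

&lt;p&gt;In practice the agent reasons about this fuzzily rather than running a fixed list, which is part of why it handles edge cases better than a rule-based script would.&lt;/p&gt;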

&lt;h2&gt;
  
  
  Design decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Config-driven, not code-driven
&lt;/h3&gt;

&lt;p&gt;Everything lives in &lt;code&gt;config.json&lt;/code&gt;. Adding a new domain is a one-liner, and the scraper doesn't need to know anything about individual sites — it just reads the config and follows instructions. The agent writes to the same file, so the loop closes itself naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graduated fetch strategy
&lt;/h3&gt;

&lt;p&gt;Not every site needs an expensive browser render. By escalating from direct to &lt;code&gt;httpResponseBody&lt;/code&gt; to &lt;a href="https://www.zyte.com/zyte-api/headless-browser/" rel="noopener noreferrer"&gt;&lt;code&gt;browserHtml&lt;/code&gt;&lt;/a&gt; only when necessary, I keep costs manageable. Browser renders are slower and consume more API credits — reserving them for sites that actually need them makes a meaningful difference at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Letting the agent handle the heuristics
&lt;/h3&gt;

&lt;p&gt;The challenge detection logic — matching titles against known bot-detection phrases — is exactly the kind of fuzzy heuristic that's tedious to maintain as code but natural for a language model to reason about. Claude also handles edge cases gracefully: if the &lt;code&gt;zyte-api&lt;/code&gt; CLI isn't installed, if the log is empty, if a domain is already correctly configured. A rule-based script would need explicit handling for every one of those scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  The limitations
&lt;/h2&gt;

&lt;p&gt;It's worth being honest about where this approach falls short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's reactive, not proactive.&lt;/strong&gt; The agent only runs after a failed crawl. If a site starts blocking mid-run, those URLs fail silently until the next cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Title-based detection is fragile.&lt;/strong&gt; Most bot-challenge pages say &lt;em&gt;"Just a moment…"&lt;/em&gt; — but a legitimate site could theoretically use that phrase. A false positive would cause the scraper to wastefully use browser rendering where it isn't needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One URL per domain.&lt;/strong&gt; The agent probes only the first failing URL for each domain. Different URL patterns on the same domain can have different bot-detection behaviour, which this doesn't account for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No rollback.&lt;/strong&gt; Once the config is updated, there's no way to detect if a Zyte setting later stops working and revert it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost opacity.&lt;/strong&gt; The scraper logs HTTP status codes, not &lt;a href="https://www.zyte.com/pricing/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; credit consumption. There's no visibility into what each domain actually costs to fetch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I'd take it next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Smarter challenge detection.&lt;/strong&gt; Rather than keyword-matching on the title, the agent could read the full page HTML and make a more nuanced call — is this a product page, a login wall, or a soft block with a CAPTCHA? Each requires a different response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive monitoring.&lt;/strong&gt; A lightweight probe running daily against each configured domain, independent of the main crawl, would let the agent update the config &lt;em&gt;before&lt;/em&gt; a full scrape run hits a known-bad configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-URL config.&lt;/strong&gt; Right now &lt;code&gt;zyte&lt;/code&gt; and &lt;code&gt;browser_html&lt;/code&gt; are set at the domain level. Some sites serve static product pages on one path and JS-rendered category pages on another — granular per-URL settings would handle that cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured data extraction.&lt;/strong&gt; Right now &lt;code&gt;parse_page&lt;/code&gt; only pulls the page title. The natural next step is structured product extraction — price, availability, name, images — either via CSS selectors in the config or Zyte's &lt;code&gt;product&lt;/code&gt; extraction type, which uses ML models to parse product data from any page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent parallelism.&lt;/strong&gt; The self-healing loop is currently a single agent. As the config grows, a coordinator could spawn one subagent per failing domain, each running its own probe pipeline concurrently. The Claude Agent SDK supports subagents natively, so this would be a relatively small change.&lt;/p&gt;

&lt;p&gt;The core idea is simple: a scraper that observes its own failures and reconfigures itself. What I found interesting about building it wasn't the scraping itself — it was seeing how little scaffolding the agent actually needed. Three tools, a clear task, and it handles the diagnostic work that would otherwise fall to me.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>agents</category>
      <category>ai</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How I get Claude to build HTML parsing code the way I want it</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 11 Mar 2026 07:13:06 +0000</pubDate>
      <link>https://forem.com/extractdata/html-parsing-with-claude-skills-extracting-structured-data-from-raw-html-1ai0</link>
      <guid>https://forem.com/extractdata/html-parsing-with-claude-skills-extracting-structured-data-from-raw-html-1ai0</guid>
      <description>&lt;p&gt;Getting HTML off a page is only the first step. Once you have it, the real work begins: pulling out the data that actually matters — product names, prices, ratings, specifications — in a clean, structured format you can actually do something with.&lt;/p&gt;

&lt;p&gt;That's what the parser skill is for. If you haven't read the introduction to skills in our fetcher post, it's worth a quick look first. But the short version is this: a skill is a &lt;code&gt;SKILL.md&lt;/code&gt; file that gives Claude precise, reusable instructions for using a specific tool. The parser skill is one of three that together form a complete web scraping pipeline.&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/zytelabs" rel="noopener noreferrer"&gt;
        zytelabs
      &lt;/a&gt; / &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;
        claude-webscraping-skills
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;claude-webscraping-skills&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;A collection of claude skills and other tools to assist your web-scraping needs.&lt;/p&gt;
&lt;p&gt;video explanations:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://youtu.be/HH0Q9OfKLu0" rel="nofollow noopener noreferrer"&gt;https://youtu.be/HH0Q9OfKLu0&lt;/a&gt;
&lt;a href="https://youtu.be/P2HhnFRXm-I" rel="nofollow noopener noreferrer"&gt;https://youtu.be/P2HhnFRXm-I&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Other reading:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="nofollow noopener noreferrer"&gt;https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/&lt;/a&gt;
&lt;a href="https://www.zyte.com/blog/supercharging-web-scraping-with-claude-skills/" rel="nofollow noopener noreferrer"&gt;https://www.zyte.com/blog/supercharging-web-scraping-with-claude-skills/&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;Other claude tools for web scraping&lt;/h4&gt;

&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/apscrapes/zyte-fetch-page-content-mcp-server" rel="noopener noreferrer"&gt;zyte-fetch-page-content-mcp-server&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;A Model Context Protocol (MCP) server that runs locally using Docker Desktop's MCP Toolkit and helps you extract clean, LLM-friendly content from any webpage using the Zyte API. Perfect for AI assistants that need to read and understand web content. by &lt;a href="https://github.com/apscrapes" rel="noopener noreferrer"&gt;Ayan Pahwa&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start="2"&gt;
&lt;li&gt;&lt;a href="https://joshuaodmark.com/blog/improve-claude-code-webfetch-with-zyte-api" rel="nofollow noopener noreferrer"&gt;Improve Claude Code WebFetch with Zyte API&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;When Claude encounters a WebFetch failure, it reads the CLAUDE.md instructions and makes a curl request to the Zyte API endpoint. The API returns base64-encoded HTML, which Claude decodes and processes just like it would with a normal WebFetch response. by &lt;a href="https://joshuaodmark.com/" rel="nofollow noopener noreferrer"&gt;Joshua Odmark&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ol start="3"&gt;
&lt;li&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="nofollow noopener noreferrer"&gt;Claude skills vs MCP vs Web Scraping CoPilot&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  What is a skill?
&lt;/h2&gt;

&lt;p&gt;A skill is a small markdown file that tells Claude how to use a specific script or tool — what it does, when to use it, and step-by-step how to run it. Claude reads the file and follows the instructions as part of a broader workflow, with no manual intervention required.&lt;/p&gt;

&lt;p&gt;Skills are composable by design. The fetcher skill hands raw HTML to the parser skill, which hands structured JSON to the compare skill. Each one does one job well, and they're built to work together.&lt;/p&gt;

&lt;p&gt;The parser skill's front matter sets out its purpose immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;parser&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracts&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;structured&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;product&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;HTML.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Tries&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;JSON-LD&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;via"&lt;/span&gt;
&lt;span class="s"&gt;Extruct first, falls back to CSS selectors via Parsel.&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Two methods, one fallback. That single description line captures the entire logic of the skill.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/HH0Q9OfKLu0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;
&lt;h2&gt;
  
  
  What the parser skill does
&lt;/h2&gt;

&lt;p&gt;The parser skill takes raw HTML as input and returns a structured JSON object. It uses two extraction methods in sequence, trying the more reliable one first and falling back to the more flexible one if needed.&lt;/p&gt;

&lt;p&gt;The primary method uses Extruct to find JSON-LD data embedded in the page. JSON-LD is a structured data format that many modern sites include in their HTML specifically to make their content machine-readable — it's used for search engine optimisation and data portability. When it's present, Extruct can read it cleanly and reliably, with no need to write or maintain selectors.&lt;/p&gt;

&lt;p&gt;If no usable JSON-LD is found, the skill falls back to Parsel, which uses CSS selectors to locate data heuristically across the page. This is more flexible but inherently tied to the page's visual structure, which can change.&lt;/p&gt;
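&lt;p&gt;That try-first, fall-back-second logic can be sketched like this (a simplified stand-in for the skill's &lt;code&gt;parser.py&lt;/code&gt;, which uses Extruct and Parsel rather than the toy inputs shown here):&lt;/p&gt;

```python
import json

def parse_with_fallback(json_ld_blocks, selector_result):
    """Prefer JSON-LD; fall back to selector output and tag the method used.
    json_ld_blocks: raw JSON-LD strings found in the page.
    selector_result: dict built from CSS selectors."""
    for block in json_ld_blocks:
        try:
            data = json.loads(block)
        except ValueError:
            continue
        if isinstance(data, dict) and data.get("@type") == "Product":
            return {"method": "extruct", "data": data}
    return {"method": "parsel", "data": selector_result}
```

&lt;p&gt;The &lt;code&gt;method&lt;/code&gt; tag in the result is what lets the later steps decide how much to trust the output.&lt;/p&gt;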
&lt;h2&gt;
  
  
  When to use it
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## When to use&lt;/span&gt;
Use this skill when you have raw HTML and need to extract structured data from
it — product details, prices, specs, ratings, or any page content.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In practice, that means the parser skill is almost always the second step in a pipeline — running immediately after the fetcher skill has retrieved your HTML. It works with any page type, and handles the most common product fields out of the box.&lt;/p&gt;
&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Save the HTML to a temporary file &lt;span class="sb"&gt;`page.html`&lt;/span&gt;
&lt;span class="p"&gt;2.&lt;/span&gt; Run &lt;span class="sb"&gt;`parser.py`&lt;/span&gt; against it:
   python parser.py page.html
&lt;span class="p"&gt;
3.&lt;/span&gt; The script outputs a JSON object. Check the &lt;span class="sb"&gt;`method`&lt;/span&gt; field:
&lt;span class="p"&gt;   -&lt;/span&gt; "extruct" — clean structured data was found, use it directly
&lt;span class="p"&gt;   -&lt;/span&gt; "parsel" — fell back to CSS selectors, review fields for completeness
&lt;span class="p"&gt;4.&lt;/span&gt; If key fields are missing from the Parsel output, ask the user which fields
   they need and re-run with --fields:
   python parser.py page.html --fields "price,rating,brand"
&lt;span class="p"&gt;
5.&lt;/span&gt; Return the parsed JSON to the conversation for use in the Compare skill.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;method&lt;/code&gt; field in the output is particularly useful. It tells you immediately how the data was extracted and how much trust to place in it. An &lt;code&gt;"extruct"&lt;/code&gt; result is clean and stable. A &lt;code&gt;"parsel"&lt;/code&gt; result is worth reviewing, especially if you're working with an unusual page layout.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--fields&lt;/code&gt; flag is a practical escape hatch. Rather than requiring you to dig into the script when key data is missing, it lets you specify exactly what you need and re-run — a much more efficient loop.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why prefer Extruct?
&lt;/h2&gt;

&lt;p&gt;The notes section of the skill file makes this explicit:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Notes&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Always prefer the Extruct path — it is more stable and requires no maintenance
&lt;span class="p"&gt;-&lt;/span&gt; Parsel selectors are generated heuristically and may need adjustment for
  unusual page layouts
&lt;span class="p"&gt;-&lt;/span&gt; Run once per page; pass all outputs together into the Compare skill
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Parsel selectors break when sites redesign. JSON-LD, by contrast, is structured data the site publishes independently of its visual layout. A site can completely overhaul its design and its JSON-LD will often remain untouched. That stability is worth prioritising wherever possible.&lt;/p&gt;
&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;Once you've run the parser skill across all your target pages, you have a set of structured JSON objects ready to compare. That's where the compare skill picks up — generating tables, summaries, and side-by-side analysis from the extracted data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Do you need a skill?
&lt;/h2&gt;

&lt;p&gt;The parser skill works well when the data you need maps cleanly onto fields that Extruct or Parsel can find — product names, prices, ratings, and similar structured attributes that sites commonly expose through JSON-LD or consistent HTML patterns. For that category of task, the skill is fast to apply and requires no custom code.&lt;/p&gt;

&lt;p&gt;See our post about Skills vs MCP vs Web Scraping Copilot (our VS Code Extension):&lt;/p&gt;

&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;zyte.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;But not every extraction problem fits that mould. If you're working with pages that don't include JSON-LD and have highly irregular layouts, Parsel's heuristic selectors may return incomplete or inconsistent results, and you'll spend time debugging field by field. In those cases, a purpose-built extraction script using Parsel or BeautifulSoup directly — with selectors you've written and tested against the specific target — will be more reliable.&lt;/p&gt;

&lt;p&gt;For larger-scale or more complex extraction work, Zyte API's automatic extraction capabilities go further still. Rather than relying on selectors at all, automatic extraction uses AI to identify and return structured data from a page without requiring you to specify fields or maintain selector logic. If you're extracting data from many different site structures, or you need extraction to keep working through site redesigns without manual intervention, that's a more robust foundation than a skill-based approach. The parser skill is best understood as a practical middle ground: fast to use, good enough for a wide range of common cases, and easy to slot into a pipeline — but not a replacement for extraction tooling built for scale or resilience.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>skills</category>
      <category>webscraping</category>
      <category>zyte</category>
    </item>
    <item>
      <title>I gave Claude access to a web scraping API</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 11 Mar 2026 07:10:43 +0000</pubDate>
      <link>https://forem.com/extractdata/the-fetcher-skill-reliable-html-scraping-with-automatic-fallback-5f02</link>
      <guid>https://forem.com/extractdata/the-fetcher-skill-reliable-html-scraping-with-automatic-fallback-5f02</guid>
      <description>&lt;p&gt;If you've worked with Claude for any length of time, you've probably noticed it can do a lot more than answer questions. With the right setup, it can take actions — running scripts, processing files, working through multi-step workflows autonomously. Skills are what make that possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a skill?
&lt;/h2&gt;

&lt;p&gt;A skill is a small, self-contained instruction set that tells Claude how to use a specific tool or script to accomplish a well-defined task. Technically, it's a markdown file — a &lt;code&gt;SKILL.md&lt;/code&gt; — that describes what a tool does, when to reach for it, and exactly how to run it. Claude reads that file and follows the instructions as part of a larger workflow.&lt;/p&gt;

&lt;p&gt;Skills are designed to be composable. Each one does one thing well, and they're built to hand off to each other. The fetcher skill retrieves HTML. The parser skill extracts data from it. The compare skill turns multiple parsed outputs into a structured comparison. Together, they form a complete scraping pipeline — and Claude orchestrates the whole thing.&lt;/p&gt;

&lt;p&gt;See our skills here: &lt;a href="https://github.com/zytelabs/claude-webscraping-skills" rel="noopener noreferrer"&gt;https://github.com/zytelabs/claude-webscraping-skills&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The skill format looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetcher&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetches&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;HTML&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;httpx,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;automatic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Zyte&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;if&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;blocked."&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That front matter is what Claude uses to match the right skill to the right task. The description is deliberately precise: it tells Claude not just what the skill does, but how it does it, so Claude can reason about whether it's the right tool for the job.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/HH0Q9OfKLu0"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  What the fetcher skill does
&lt;/h2&gt;

&lt;p&gt;The fetcher skill's job is exactly what it sounds like: given a URL, fetch the raw HTML and return it. It uses httpx as its primary HTTP client — a modern, performant Python library well suited to scraping workloads.&lt;/p&gt;

&lt;p&gt;What makes it more than a simple wrapper is the fallback logic. A significant number of sites actively block automated requests. Without a fallback, a blocked request just fails, and you're left manually diagnosing why. The fetcher skill handles this automatically. If a request comes back with a &lt;code&gt;BLOCKED&lt;/code&gt; status, it retries via Zyte API, which provides built-in unblocking. Most of the time, you get your HTML without ever needing to intervene.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use it
&lt;/h2&gt;

&lt;p&gt;The skill's &lt;code&gt;SKILL.md&lt;/code&gt; is explicit about this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## When to use&lt;/span&gt;
Use this skill when the user provides one or more URLs and asks you to fetch,
retrieve, scrape, or get the HTML or page content.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, that means any time you're starting a scraping or data extraction task and you have a URL to work from. It's the entry point for the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;The instructions in the skill file are straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Run &lt;span class="sb"&gt;`fetcher.py`&lt;/span&gt; with the URL as an argument:
   python fetcher.py &lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;
2.&lt;/span&gt; If the script returns a successful HTML response, return the HTML to the
   conversation for use in the next step.
&lt;span class="p"&gt;3.&lt;/span&gt; If the script returns a &lt;span class="sb"&gt;`BLOCKED`&lt;/span&gt; status, re-run with the &lt;span class="sb"&gt;`--zyte`&lt;/span&gt; flag:
   python fetcher.py &lt;span class="nt"&gt;&amp;lt;url&amp;gt;&lt;/span&gt; --zyte
&lt;span class="p"&gt;
4.&lt;/span&gt; Inform the user if a URL could not be fetched after both attempts.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two-step process keeps things efficient. httpx is fast and lightweight, so it handles the majority of requests without needing to route through Zyte API. The fallback only kicks in when it's needed. If both attempts fail, Claude surfaces that to you clearly rather than silently moving on.&lt;/p&gt;

&lt;p&gt;For multiple URLs, the script runs once per URL — there's no batching — so Claude loops through a list sequentially.&lt;/p&gt;
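The control flow is simple enough to sketch in Python. The fetch functions below are hypothetical stand-ins for `python fetcher.py <url>` and the `--zyte` retry; only the decision logic mirrors the skill:

```python
# Sketch of the fetcher skill's control flow. fetch_plain and fetch_zyte are
# hypothetical stand-ins for `python fetcher.py <url>` and the `--zyte` retry.

def fetch_with_fallback(url, fetch_plain, fetch_zyte):
    status, html = fetch_plain(url)
    if status == "OK":
        return html
    if status == "BLOCKED":             # site rejected the plain client
        status, html = fetch_zyte(url)  # retry through the unblocking API
        if status == "OK":
            return html
    return None                         # surfaced, never silently dropped

# Simulated responses: the second site blocks plain HTTP clients.
plain = lambda u: ("BLOCKED", None) if "protected" in u else ("OK", "<html>ok</html>")
zyte = lambda u: ("OK", "<html>unblocked</html>")

urls = ["https://example.com", "https://protected.example.com"]
results = {u: fetch_with_fallback(u, plain, zyte) for u in urls}  # one run per URL
print(results)
```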

&lt;h2&gt;
  
  
  Transparency about failure
&lt;/h2&gt;

&lt;p&gt;One detail worth highlighting is the final instruction: inform the user if a URL could not be fetched after both attempts. This might seem obvious, but it reflects a design principle worth being explicit about. A skill that silently drops failed URLs would produce incomplete data downstream, and you might not notice until you're looking at a comparison table with missing rows. Surfacing failures immediately keeps the pipeline honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;The fetcher skill's output is raw HTML — exactly what the parser skill expects as its input. The two are designed to be used in sequence. Once you have the HTML, the parser skill takes over, extracting structured data through JSON-LD or CSS selectors depending on what the page contains.&lt;/p&gt;

&lt;p&gt;That handoff is documented in the skill's notes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Notes&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; For multiple URLs, run the script once per URL
&lt;span class="p"&gt;-&lt;/span&gt; Pass the raw HTML output into the Parser skill for extraction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline continues from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do you need a skill?
&lt;/h2&gt;

&lt;p&gt;Skills are a good fit when you have a well-defined, repeatable task that benefits from consistent behaviour across many runs. Fetching HTML from a URL is a clear example: the inputs and outputs are predictable, the fallback logic is always the same, and packaging that into a skill means Claude applies it reliably without you having to re-explain the process each time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.zyte.com/blog/claude-skills-vs-mcp-vs-web-scraping-copilot/" rel="noopener noreferrer"&gt;Read our break down of Skills vs MCP vs Web Scraping Copilot here - our VS Code extension&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That said, skills aren't always the right tool. If you only need to fetch a handful of pages once, asking Claude to write a quick httpx script directly may be faster and more flexible. Similarly, if your target sites have unusual behaviour — rate limiting, JavaScript rendering, login walls, or multi-step navigation — a bespoke Scrapy spider built with Zyte API gives you far more control than a general-purpose fetch wrapper. Scrapy's middleware architecture, item pipelines, and scheduling make it better suited to large-scale or complex crawls where you need precise control over every aspect of the request cycle.&lt;/p&gt;

&lt;p&gt;The fetcher skill sits in the middle: more structured than an ad hoc script, less complex than a full Scrapy project. It's the right choice when you want Claude to handle straightforward retrieval as part of a larger automated workflow, without the overhead of setting up and maintaining a dedicated spider.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>skills</category>
      <category>webscraping</category>
      <category>zyte</category>
    </item>
    <item>
      <title>I built a Claude Code skill that screenshots any website (and it handles anti-bot sites too)</title>
      <dc:creator>Ayan Pahwa</dc:creator>
      <pubDate>Fri, 06 Mar 2026 14:40:32 +0000</pubDate>
      <link>https://forem.com/extractdata/i-built-a-claude-code-skill-that-screenshots-any-website-and-it-handles-anti-bot-sites-too-2m4b</link>
      <guid>https://forem.com/extractdata/i-built-a-claude-code-skill-that-screenshots-any-website-and-it-handles-anti-bot-sites-too-2m4b</guid>
      <description>&lt;p&gt;TLDR;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Automate screenshot capture for any URL with JavaScript rendering and anti-ban protection — straight from your AI assistant.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/P2HhnFRXm-I"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;Taking a screenshot of a webpage sounds trivial, until you need to do it at scale. Modern websites throw every obstacle imaginable in your way: JavaScript-rendered content that only appears after a React bundle loads, bot-detection systems that serve blank pages to automated headless browsers, geo-blocked content, and CAPTCHAs that appear the moment traffic patterns look non-human. For a handful of URLs you can get away with Puppeteer or Playwright. For hundreds or thousands? You need infrastructure built for the job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnca4lwviajb7vavqwio5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnca4lwviajb7vavqwio5.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Zyte API was designed specifically for this problem. It handles JavaScript rendering, anti-bot fingerprinting, rotating proxies, and headless browser management so you don't have to. And what better way to drive it than straight from the LLM that's supplying the URLs? That's why I created the zyte-screenshots Claude Skill, which triggers the entire workflow (API call, base64 decode, PNG save to your filesystem) just by chatting with Claude.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll walk through exactly how the skill works, how to set it up, and how to use it to capture production-quality screenshots of any URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Use the Zyte API for Screenshots?
&lt;/h2&gt;

&lt;p&gt;Before diving into the skill itself, it's worth understanding what makes the Zyte API uniquely suited to screenshot capture at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Full JavaScript Rendering
&lt;/h3&gt;

&lt;p&gt;Single-page applications built with React, Vue, Angular, or Next.js don't serve their content in the raw HTML response; they render it client-side after the page loads. Tools that capture the raw HTTP response get a blank shell. Zyte's screenshot endpoint fires a real headless browser, waits for the DOM to fully settle, then captures the final rendered state.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Anti-Bot and Anti-Ban Protection
&lt;/h3&gt;

&lt;p&gt;Enterprise-grade sites use fingerprinting libraries to detect automation. They check TLS fingerprints, browser headers, canvas rendering patterns, mouse movement entropy, and dozens of other signals. Zyte's infrastructure is battle-tested to pass these checks, so your screenshots won't come back as an "Access Denied" page.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Scale Without Infrastructure
&lt;/h3&gt;

&lt;p&gt;Managing a fleet of headless browser instances, proxy rotation, retries, and residential IP pools is a serious engineering investment. Zyte abstracts all of this into a single API call.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. One API, Any URL
&lt;/h3&gt;

&lt;p&gt;Whether the target is a static HTML page, a JS-heavy SPA, a behind-login dashboard (with session cookies), or a geo-restricted site, the same API call structure works. The skill you're about to install uses this endpoint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.zyte.com/account/signup/zyteapi" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9a7qsf6b03h1rrfzw5g.png" alt="Zyte API signup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is the zyte-screenshots Claude Skill?
&lt;/h2&gt;

&lt;p&gt;Claude Skills are reusable instruction packages that extend Claude's capabilities with domain-specific workflows. The &lt;strong&gt;zyte-screenshots&lt;/strong&gt; skill teaches Claude how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept a URL from the user in natural language&lt;/li&gt;
&lt;li&gt;Read the ZYTE_API_KEY environment variable&lt;/li&gt;
&lt;li&gt;Construct and execute the correct curl command against &lt;a href="https://api.zyte.com/v1/extract" rel="noopener noreferrer"&gt;https://api.zyte.com/v1/extract&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Pipe the JSON response through jq and base64 --decode to produce a PNG file&lt;/li&gt;
&lt;li&gt;Derive a clean filename from the URL (e.g. &lt;a href="https://quotes.toscrape.com" rel="noopener noreferrer"&gt;https://quotes.toscrape.com&lt;/a&gt; becomes quotes.toscrape.png)&lt;/li&gt;
&lt;li&gt;Report the exact file path and describe what's visible in the screenshot in one sentence&lt;/li&gt;
&lt;/ul&gt;
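The filename derivation is easy to reproduce. A sketch that matches the example above (strip the scheme, drop the final domain label, append `.png`; the skill's exact rules may differ):

```python
from urllib.parse import urlparse

def screenshot_filename(url: str) -> str:
    """Derive a clean PNG filename from a URL's hostname."""
    host = urlparse(url).hostname or "screenshot"
    host = host.removeprefix("www.")
    stem = host.rsplit(".", 1)[0]  # drop the final label (.com, .org, ...)
    return f"{stem}.png"

print(screenshot_filename("https://quotes.toscrape.com"))  # quotes.toscrape.png
```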

&lt;p&gt;In practice, this means you can open Claude, say &lt;strong&gt;"screenshot &lt;a href="https://example.com" rel="noopener noreferrer"&gt;https://example.com&lt;/a&gt;"&lt;/strong&gt;, and have a pixel-perfect PNG on your filesystem in seconds: no browser, no script, no Puppeteer config.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before installing the skill, make sure you have the following:&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;curl&lt;/strong&gt;: Pre-installed on macOS and most Linux distributions. On Windows, use WSL or Git Bash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;jq&lt;/strong&gt;: A lightweight JSON processor. Install via &lt;code&gt;brew install jq&lt;/code&gt; (macOS) or &lt;code&gt;sudo apt install jq&lt;/code&gt; (Ubuntu/Debian).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;base64&lt;/strong&gt;: Standard on all Unix-like systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude desktop app&lt;/strong&gt; with Skills support enabled.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A Zyte API Key
&lt;/h3&gt;

&lt;p&gt;Sign up at &lt;a href="https://www.zyte.com/" rel="noopener noreferrer"&gt;zyte.com&lt;/a&gt; and navigate to your API credentials. The free tier includes enough credits to get started with testing. Copy your API key; you'll set it as an environment variable in a moment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro tip:&lt;/strong&gt; Set your ZYTE_API_KEY in your shell profile (~/.zshrc or ~/.bashrc) so it's always available: &lt;code&gt;export ZYTE_API_KEY="your_key_here"&lt;/code&gt;, or pass it along with your prompt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Installing the Skill
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Download the Skill from GitHub
&lt;/h3&gt;

&lt;p&gt;The skill is open source and available at &lt;a href="https://github.com/apscrapes/claude-zyte-screenshots" rel="noopener noreferrer"&gt;github.com/apscrapes/claude-zyte-screenshots&lt;/a&gt;. Download the latest release ZIP from the repository's Releases page, or clone it directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/apscrapes/claude-zyte-screenshots.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Import into Claude
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open the Claude desktop app or go to Claude.ai&lt;/li&gt;
&lt;li&gt;Navigate to Settings → Skills&lt;/li&gt;
&lt;li&gt;Click Import Skill&lt;/li&gt;
&lt;li&gt;Select the SKILL.md file from the downloaded ZIP&lt;/li&gt;
&lt;li&gt;Claude will confirm the skill has been installed&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 3: Set Your API Key
&lt;/h3&gt;

&lt;p&gt;In your terminal, export your Zyte API key (or pass it along with your prompt):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ZYTE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_zyte_api_key_here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make this permanent across terminal sessions, add the export line to your &lt;code&gt;~/.zshrc&lt;/code&gt; or &lt;code&gt;~/.bashrc&lt;/code&gt; file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the Skill: Step-by-Step
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Basic Screenshot
&lt;/h3&gt;

&lt;p&gt;Once the skill is installed and your API key is set, using it is as simple as asking Claude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Screenshot https://quotes.toscrape.com using zyte-screenshots skill
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Verify ZYTE_API_KEY is available in the environment&lt;/li&gt;
&lt;li&gt;Derive the output filename: quotes.toscrape.png&lt;/li&gt;
&lt;li&gt;Execute the following command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://api.zyte.com/v1/extract &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ZYTE_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;: &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://quotes.toscrape.com",
    "screenshot": true
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
| jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.screenshot'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
| &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;--decode&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; quotes.toscrape.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Verify the file is non-empty&lt;/li&gt;
&lt;li&gt;Open and inspect the image&lt;/li&gt;
&lt;li&gt;Return the file location and a one-line visual description&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example output from Claude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Screenshot captured!

📁 Location: /home/you/quotes.toscrape.png

🖼️  What's in it: A clean light-themed page listing inspirational quotes
   with author attributions and tag filters in the sidebar.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Under the Hood: How the API Call Works
&lt;/h2&gt;

&lt;p&gt;Let's break down the exact curl command the skill executes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; https://api.zyte.com/v1/extract &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ZYTE_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;: &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "url": "https://target-site.com",
    "screenshot": true
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
| jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.screenshot'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
| &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;--decode&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; output.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;curl -s&lt;/code&gt;&lt;/strong&gt; — Silent mode; suppresses progress output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;-u "$ZYTE_API_KEY":&lt;/code&gt;&lt;/strong&gt; — HTTP Basic Auth. Zyte uses the API key as the username with an empty password.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;-H "Content-Type: application/json"&lt;/code&gt;&lt;/strong&gt; — Tells the API to expect a JSON body.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;-d '{...}'&lt;/code&gt;&lt;/strong&gt; — The JSON request body. Setting &lt;code&gt;screenshot: true&lt;/code&gt; instructs Zyte to return a base64-encoded PNG of the fully rendered page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;| jq -r '.screenshot'&lt;/code&gt;&lt;/strong&gt; — Extracts the raw base64 string from the JSON response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;| base64 --decode&lt;/code&gt;&lt;/strong&gt; — Decodes the base64 string into binary PNG data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;&amp;gt; output.png&lt;/code&gt;&lt;/strong&gt; — Writes the binary data to a PNG file.&lt;/p&gt;

&lt;p&gt;The Zyte API handles everything in between — spinning up a headless Chromium instance, loading the page with real browser fingerprints, waiting for JavaScript execution to complete, and rendering the final DOM to a pixel buffer.&lt;/p&gt;
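If you'd rather build the same call from Python instead of shelling out to curl, the mapping is direct. This stdlib-only sketch constructs the request without sending it (supply a real key and pass it to `urlopen` to run it for real):

```python
import base64
import json
from urllib.request import Request

def build_screenshot_request(url: str, api_key: str) -> Request:
    """Mirror the curl call: key-as-username basic auth, JSON body with
    "screenshot": true."""
    payload = json.dumps({"url": url, "screenshot": True}).encode()
    token = base64.b64encode(f"{api_key}:".encode()).decode()  # empty password
    return Request(
        "https://api.zyte.com/v1/extract",
        data=payload,  # the presence of a body makes this a POST
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )

req = build_screenshot_request("https://quotes.toscrape.com", "YOUR_KEY")
print(req.get_header("Content-type"))  # application/json
```

Once the response comes back, the `jq`/`base64` steps collapse to decoding `response_json["screenshot"]` with `base64.b64decode` and writing the bytes to the PNG file.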

&lt;p&gt;This was a fun weekend project I put together. Let me know your thoughts on our Discord, feel free to play around with it, and say hi if you build any useful Claude skills or MCP servers of your own.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: web scraping • Zyte API • screenshots at scale • JavaScript rendering • anti-bot • Claude AI • Claude Skills • automation • headless browser • site APIs&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>webscraping</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why your Python request gets 403 Forbidden</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 04 Mar 2026 09:11:29 +0000</pubDate>
      <link>https://forem.com/extractdata/why-your-python-request-gets-403-forbidden-4h3p</link>
      <guid>https://forem.com/extractdata/why-your-python-request-gets-403-forbidden-4h3p</guid>
      <description>&lt;p&gt;If you’ve had your HTTP request blocked despite using correct headers, cookies, and clean IPs, there’s a chance you are running into one of the simplest forms of blocking, and one of the most confusing for beginners.&lt;/p&gt;

&lt;p&gt;Chances are, you will recognise the problem. You found the hidden API, and your request works perfectly in Postman... but it fails instantly within your Python code.&lt;/p&gt;

&lt;p&gt;It’s called TLS fingerprinting. But the good news is, you can solve it. In fact, when I showed this to some developers at &lt;a href="https://www.extractsummit.io/" rel="noopener noreferrer"&gt;Extract Summit&lt;/a&gt;, they couldn’t believe how straightforward it was to fix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gl264gftdhxminz531h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gl264gftdhxminz531h.png" alt="bruno request" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CAPTION: “I copied the request -&amp;gt; matching headers, cookies and IP, but it still failed?”&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Your TLS fingerprint
&lt;/h2&gt;

&lt;p&gt;Let’s start with a question. How do the servers and websites know you’ve moved from Postman to making the request in Python? What do they see that you can’t? The key is your TLS fingerprint.&lt;/p&gt;

&lt;p&gt;To use an analogy: We’ve effectively written a different name on a sticker and stuck it to our t-shirt, hoping to get past the bouncer at a bar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your nametag (headers)&lt;/strong&gt; says "Chrome."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But your t-shirt logo (TLS handshake)&lt;/strong&gt; very obviously says "Python."&lt;/p&gt;

&lt;p&gt;It’s a dead giveaway. This mismatch is spotted immediately. We need to change our t-shirt to match the nametag.&lt;/p&gt;

&lt;p&gt;To understand &lt;em&gt;how&lt;/em&gt; they spot the “logo”, we need to look at the initial &lt;strong&gt;“Client Hello”&lt;/strong&gt; packet. There are three key pieces of information exchanged here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cipher suites:&lt;/strong&gt; The encryption methods the client supports.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLS extensions:&lt;/strong&gt; Extra features (like specific elliptic curves).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key exchange algorithms:&lt;/strong&gt; How they agree on a password.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These details differ between clients because Python’s &lt;code&gt;requests&lt;/code&gt; library uses &lt;strong&gt;OpenSSL&lt;/strong&gt;, while Chrome uses Google's &lt;strong&gt;BoringSSL&lt;/strong&gt;. While they share some underlying logic, their signatures are notably different. And that’s the problem.&lt;/p&gt;
&lt;h3&gt;
  
  
  OpenSSL vs. BoringSSL
&lt;/h3&gt;

&lt;p&gt;The root cause of this mismatch lies in the underlying libraries.&lt;/p&gt;

&lt;p&gt;Python’s &lt;code&gt;requests&lt;/code&gt; library relies on &lt;strong&gt;OpenSSL&lt;/strong&gt;, the standard cryptographic library found on almost every Linux server. It is robust, predictable, and remarkably consistent.&lt;/p&gt;

&lt;p&gt;Chrome, however, uses &lt;strong&gt;BoringSSL&lt;/strong&gt;, Google’s own fork of OpenSSL. BoringSSL is designed specifically for the chaotic nature of the web and it behaves very differently.&lt;/p&gt;

&lt;p&gt;The biggest giveaway between the two is a mechanism called &lt;strong&gt;GREASE&lt;/strong&gt; (Generate Random Extensions And Sustain Extensibility).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="s2"&gt;"TLS_GREASE (0xFAFA)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="err"&gt;....&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chrome (BoringSSL) intentionally inserts random, garbage values into the TLS handshake - specifically, in the cipher suites and extensions lists. It does this to "grease the joints" of the internet, ensuring that servers don't crash when they encounter unknown future parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is one of the key changes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chrome:&lt;/strong&gt; Always includes these random GREASE values (e.g., &lt;code&gt;0x0a0a&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python (OpenSSL):&lt;/strong&gt; &lt;em&gt;Never&lt;/em&gt; includes them. It only sends valid, known ciphers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, when an anti-bot system sees a handshake claiming to be "Chrome 120" but lacking these random GREASE values, it knows instantly that it is dealing with a script. It’s not just that your shirt has the wrong logo; it’s that your shirt is &lt;em&gt;too clean&lt;/em&gt;.&lt;/p&gt;
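&lt;p&gt;To make this concrete, here is a minimal sketch (not tied to any particular anti-bot product) of the pattern GREASE values follow and how a detector might strip them before comparing cipher lists:&lt;/p&gt;

```python
def is_grease(value):
    """Return True if a 16-bit cipher/extension ID matches the GREASE pattern.

    GREASE values (RFC 8701) repeat a single byte whose low nibble is 0xA:
    0x0A0A, 0x1A1A, ..., 0xFAFA.
    """
    hex4 = format(value, "04x")
    return hex4[0:2] == hex4[2:4] and hex4[1] == "a"


def strip_grease(values):
    """Drop GREASE entries, as fingerprinting tools do before hashing."""
    return [v for v in values if not is_grease(v)]


# A Chrome-like cipher list leads with a GREASE value; Python's never does.
chrome_like = [0xFAFA, 0x1301, 0x1302, 0x1303]
print(strip_grease(chrome_like))  # [4865, 4866, 4867]
```

&lt;p&gt;A handshake that claims to be Chrome but contains no GREASE entries at all is exactly the tell described above.&lt;/p&gt;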

&lt;h2&gt;
  
  
  JA3 hash
&lt;/h2&gt;

&lt;p&gt;Anti-bot companies take all that handshake data and combine it into a single string called a &lt;strong&gt;JA3 fingerprint&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Salesforce invented this years ago to detect malware, but it found its way into our industry as a simple, effective way to fingerprint TLS clients. Security vendors have since built databases of these fingerprints.&lt;/p&gt;

&lt;p&gt;It is relatively straightforward to identify and block &lt;em&gt;any&lt;/em&gt; request coming from Python’s default library because its JA3 hash is static and well-known.&lt;/p&gt;
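&lt;p&gt;The construction itself is simple. As a rough sketch (field order per the public JA3 spec): join the TLS version, cipher IDs, extension IDs, elliptic curves, and point formats into one string, then take its MD5 digest. The values below are toy inputs for illustration only:&lt;/p&gt;

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the canonical comma/dash-separated JA3 string."""
    return ",".join([
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ])


def ja3_hash(ja3):
    """The JA3 fingerprint is the MD5 hex digest of the JA3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Toy values; real lists come straight from the ClientHello.
s = ja3_string(771, [4865, 4866], [0, 11, 10], [29, 23], [0])
print(s)  # 771,4865-4866,0-11-10,29-23,0
print(ja3_hash(s))
```

&lt;p&gt;Because Python’s handshake is identical on every run, this digest never changes, which is exactly what makes it so easy to blocklist.&lt;/p&gt;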

&lt;p&gt;Running this code snippet yields the JSON response below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ja3_info&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tls.peet.ws/api/clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note the empty &lt;code&gt;akamai&lt;/code&gt; and &lt;code&gt;akamai_hash&lt;/code&gt; values ("-"): &lt;code&gt;requests&lt;/code&gt; speaks HTTP/1.1, so there is no HTTP/2 fingerprint&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"771,4866-4867-4865-49196-49200-49195-49199-52393-52392-49188-49192-49187-49191-159-158-107-103-255,0-11-10-16-22-2
3-49-13-43-45-51-21,29-23-30-25-24-256-257-258-259-260,0-1-2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja3_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a48c0d5f95b1ef98f560f324fd275da1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja4"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"t13d1812h1_85036bcba153_375ca2c5e164"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ja4_r"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"t13d1812h1_0067,006b,009e,009f,00ff,1301,1302,1303,c023,c024,c027,c028,c02b,c02c,c02f,c030,cca8,cca9_000a,000b,000
d,0016,0017,002b,002d,0031,0033_0403,0503,0603,0807,0808,0809,080a,080b,0804,0805,0806,0401,0501,0601,0303,0301,030
2,0402,0502,0602"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"akamai"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"akamai_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peetprint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="s2"&gt;"772-771|1.1|29-23-30-25-24-256-257-258-259-260|1027-1283-1539-2055-2056-2057-2058-2059-2052-2053-2054-1025-1281-15
37-771-769-770-1026-1282-1538|1||4866-4867-4865-49196-49200-49195-49199-52393-52392-49188-49192-49187-49191-159-158
-107-103-255|0-10-11-13-16-21-22-23-43-45-49-51"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"peetprint_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"76017c4a71b7a055fb2a9a5f70f05112"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Putting the above JA3 hash into ja3.zone clearly shows this is a Python 3 request using urllib3:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmrqcpccqc08rcm5wz34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmrqcpccqc08rcm5wz34.png" alt="JA3 Zone" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s the solution?
&lt;/h2&gt;

&lt;p&gt;As mentioned, simply changing headers and IP addresses won’t make a difference, as these are not part of the TLS handshake. We need to change the ciphers and extensions to match what a browser would send.&lt;/p&gt;

&lt;p&gt;The best way to achieve this in Python is to swap &lt;code&gt;requests&lt;/code&gt; for a modern, TLS-friendly library like &lt;strong&gt;curl_cffi&lt;/strong&gt; or &lt;strong&gt;rnet&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is how easy it is to switch to &lt;strong&gt;curl_cffi&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;curl_cffi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="c1"&gt;# note the impersonate argument &amp;amp; import above
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ja3_info&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tls.peet.ws/api/clean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chrome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time the response includes an HTTP/2 (Akamai) fingerprint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"akamai_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"52d84b11737d980aef856699f885ca86"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytpcxmljsdlf7nq02rq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytpcxmljsdlf7nq02rq3.png" alt="Our new hash" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: I searched via the akamai_hash here, as the fingerprint from the JA3 hash wasn’t in this particular database.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By adding that &lt;code&gt;impersonate&lt;/code&gt; parameter, you are effectively putting on the correct t-shirt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Make &lt;code&gt;curl_cffi&lt;/code&gt; or &lt;code&gt;rnet&lt;/code&gt; your default HTTP library in Python. This should be your first port of call before spinning up a full headless browser.&lt;/p&gt;

&lt;p&gt;A simple change (which also brings benefits like async support) means you don’t fall foul of TLS fingerprinting. &lt;code&gt;curl_cffi&lt;/code&gt; even has a &lt;code&gt;requests&lt;/code&gt;-like API, meaning it’s often a drop-in replacement.&lt;/p&gt;

</description>
      <category>api</category>
      <category>networking</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Hybrid scraping: The architecture for the modern web</title>
      <dc:creator>John Rooney</dc:creator>
      <pubDate>Wed, 25 Feb 2026 17:05:36 +0000</pubDate>
      <link>https://forem.com/extractdata/hybrid-scraping-the-architecture-for-the-modern-web-4p9h</link>
      <guid>https://forem.com/extractdata/hybrid-scraping-the-architecture-for-the-modern-web-4p9h</guid>
      <description>&lt;p&gt;If you scrape the modern web, you probably know the pain of the &lt;strong&gt;JavaScript challenge&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before you can access any data, the website forces your browser to execute a snippet of JavaScript code. It calculates a result, sends it back to an endpoint for verification, and often captures extensive fingerprinting data in the process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl32q8pbypgq6l22kner6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl32q8pbypgq6l22kner6.png" alt="browser checks" width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you pass this test, the server assigns you a &lt;strong&gt;session cookie&lt;/strong&gt;. This cookie acts as your "access pass." It tells the website, &lt;em&gt;"This user has passed the challenge,"&lt;/em&gt; so you don’t have to re-run the JavaScript test on every single page load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ot3jm7249g98cuwbno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9ot3jm7249g98cuwbno.png" alt="devtool shows storage token" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For web scrapers, this mechanism creates a massive inefficiency.&lt;/p&gt;

&lt;p&gt;It &lt;em&gt;looks&lt;/em&gt; like you are forced to use a headless browser (like Puppeteer or Playwright) for every single request just to handle that initial check. But browsers are heavy: they are slow, and they consume massive amounts of RAM and bandwidth.&lt;/p&gt;

&lt;p&gt;Running a browser for thousands of requests can quickly become an infrastructure nightmare. You end up paying for CPU cycles just to render a page when all you wanted was the JSON payload.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution: Hybrid scraping
&lt;/h3&gt;

&lt;p&gt;The answer to this problem is a technique I’ve started calling &lt;strong&gt;hybrid scraping&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This involves using the browser &lt;em&gt;only&lt;/em&gt; to open the initial request, grab the cookie, and create a session. Once you have them, you extract that session data and hand it over to a standard, lightweight HTTP client.&lt;/p&gt;

&lt;p&gt;This architecture gives you the &lt;strong&gt;access&lt;/strong&gt; of a browser with the &lt;strong&gt;speed and efficiency&lt;/strong&gt; of a script.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing this in Python
&lt;/h2&gt;

&lt;p&gt;To build this in Python, we need two specific packages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A browser:&lt;/strong&gt; We will use &lt;strong&gt;ZenDriver&lt;/strong&gt;, a modern wrapper for headless Chrome that handles the "undetected" configuration for us.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP client:&lt;/strong&gt; We will use &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;&lt;strong&gt;rnet&lt;/strong&gt;&lt;/a&gt;, a Rust-based HTTP client for Python.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;But why &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;rnet&lt;/a&gt;?&lt;/em&gt; Well, the information traded in the initial TLS handshake, where the client and server exchange their “hello” messages, can be fingerprinted: the TLS version, the cipher suites offered for encryption, and more. This can be hashed into a fingerprint and profiled.&lt;/p&gt;

&lt;p&gt;Python’s &lt;a href="https://github.com/psf/requests" rel="noopener noreferrer"&gt;requests&lt;/a&gt; package, which is built on &lt;a href="https://github.com/urllib3/urllib3" rel="noopener noreferrer"&gt;urllib3&lt;/a&gt;, has a very distinctive TLS fingerprint, containing ciphers (amongst other things) that aren’t seen in a browser. This makes it very easy to spot. Both &lt;a href="https://github.com/0x676e67/rnet" rel="noopener noreferrer"&gt;rnet&lt;/a&gt; and other options such as &lt;a href="https://github.com/lexiforest/curl_cffi" rel="noopener noreferrer"&gt;curl_cffi&lt;/a&gt; are able to send a TLS fingerprint similar to that of a browser. This reduces the chances of our request being blocked.&lt;/p&gt;

&lt;p&gt;Here is how we assemble the pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Load the page (The handshake)
&lt;/h3&gt;

&lt;p&gt;First, we define our browser logic. Notice that we are not trying to parse HTML here. Our only goal is to visit the site, pass the initial JavaScript challenge, and extract the session cookies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zendriver&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Use ZenDriver to launch a browser, navigate to the page, 
    and retrieve the cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Hit the homepage to trigger the check
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait briefly for the JS challenge to complete
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

    &lt;span class="c1"&gt;# Extract the cookies
&lt;/span&gt;    &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What’s happening here:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We launch the browser, visit the site, and wait just one second for the JS challenge to run. Once we have the cookies, we call &lt;code&gt;browser.stop()&lt;/code&gt;. This is the most important line: we do not want a browser instance wasting resources when we don’t need it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Use the cookies
&lt;/h3&gt;

&lt;p&gt;Now that we have the "access pass," we can switch to our lightweight HTTP client. We take those cookies and inject them into the rnet client headers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rnet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Emulation&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Make a fast request using RNet with the borrowed cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;referer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Format the browser cookies into a simple HTTP header string
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookie_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cookie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# We use Emulation.Chrome142 to change the TLS Fingerprint.
&lt;/span&gt;    &lt;span class="c1"&gt;# This is site dependent - but worth using
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emulation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Emulation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chrome142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/api/products?page=1&amp;amp;limit=8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What’s happening here:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;We convert the browser's cookie format into a standard header string. Note the &lt;code&gt;Emulation.Chrome142&lt;/code&gt; parameter. We are layering two techniques here: hybrid scraping (reusing real browser cookies) and TLS impersonation (sending a browser-like fingerprint from a modern HTTP client). This double-layer approach covers all our bases.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Note: Many HTTP clients have a cookie jar that you could also use; for this example, sending the header directly worked perfectly).&lt;/em&gt;&lt;/p&gt;
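&lt;p&gt;The cookie-to-header conversion is worth isolating, since it is the glue between the two halves of the pipeline. Here is a minimal sketch, using a hypothetical &lt;code&gt;Cookie&lt;/code&gt; dataclass as a stand-in for whatever object your browser driver actually returns:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Cookie:
    # Hypothetical stand-in: real zendriver cookie objects expose
    # (at least) a name and a value attribute.
    name: str
    value: str


def cookie_header(cookies):
    """Collapse browser cookies into a single Cookie request header value."""
    return "; ".join(f"{c.name}={c.value}" for c in cookies)


cookies = [Cookie("session", "abc123"), Cookie("cf_clearance", "xyz")]
print(cookie_header(cookies))  # session=abc123; cf_clearance=xyz
```

&lt;p&gt;If your driver returns dicts instead of objects, swap the attribute access for key lookups; the joining logic stays the same.&lt;/p&gt;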

&lt;h3&gt;
  
  
  Step 3: Run the code
&lt;/h3&gt;

&lt;p&gt;Finally, we tie it together. For this demo, we use a simple argparse flag to show the difference with and without the cookie.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="c1"&gt;# The Decision Logic
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Run the heavy browser
&lt;/span&gt;
    &lt;span class="c1"&gt;# Always run the fast HTTP client
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status Code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Body:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
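&lt;p&gt;For completeness, the flag wiring is only a few lines. A sketch of the entry point (the &lt;code&gt;main&lt;/code&gt; invocation is shown commented out, since it depends on the async functions defined above):&lt;/p&gt;

```python
import argparse


def build_parser():
    """One flag toggles whether the heavy browser step runs first."""
    parser = argparse.ArgumentParser(description="Hybrid scraping demo")
    parser.add_argument(
        "--use-cookies",
        action="store_true",
        help="Launch the headless browser first and reuse its cookies",
    )
    return parser


# Entry point would look like:
# if __name__ == "__main__":
#     args = build_parser().parse_args()
#     asyncio.run(main(args.use_cookies))
print(build_parser().parse_args([]).use_cookies)  # False
```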



&lt;h2&gt;
  
  
  Get the complete script
&lt;/h2&gt;

&lt;p&gt;Want to run this yourself? We’ve put the full, copy-pasteable script (including the argument parsers and imports) in the block below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv init
uv add zendriver rnet rich
&lt;span class="c"&gt;# linux/mac&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="c"&gt;# windows&lt;/span&gt;
.venv&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;zendriver&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rnet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Emulation&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rich&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="k"&gt;print&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Make an HTTP GET request using rnet with the provided cookies. Cookies are sent in the headers. Note for this site we need the referer too.
    Return the Response Object.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;referer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookie_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Adjust based on the actual structure of the cookie object from zendriver
&lt;/span&gt;            &lt;span class="c1"&gt;# If it's a dict: cookie['name'], cookie['value']
&lt;/span&gt;            &lt;span class="c1"&gt;# If it's an object: cookie.name, cookie.value
&lt;/span&gt;            &lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cookie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emulation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Emulation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chrome142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com/api/products?page=1&amp;amp;limit=8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Use zendriver to launch a browser, navigate to a page, and retrieve cookies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;zd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://auto.hylnd7.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requests_style_cookies&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;http_request_rnet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status Code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Body:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Make HTTP request with optional browser cookies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cookies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Set to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to launch browser and get cookies, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to skip (default: false)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pros and Cons of Hybrid Scraping
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reduces RAM usage massively compared to pure browser scraping.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Higher complexity:&lt;/strong&gt; You must manage two libraries (zendriver and rnet) and the glue code.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP requests complete in milliseconds. Browsers take seconds.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;State management:&lt;/strong&gt; You need logic to handle cookie expiry. If the cookie dies, you must "wake up" the browser.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You get the verification of a real browser without the drag.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; You are debugging two points of failure: the browser's ability to solve the challenge, and the client's ability to fetch data.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
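&lt;p&gt;&lt;em&gt;The “wake up the browser” state management described in the table can be sketched as a small retry loop. This is a minimal illustration, not code from the script above: &lt;code&gt;http_get&lt;/code&gt; and &lt;code&gt;get_cookies&lt;/code&gt; are hypothetical stand-ins for the rnet request and the zendriver cookie harvest shown earlier.&lt;/em&gt;&lt;/p&gt;

```python
import asyncio

async def fetch_with_refresh(url, http_get, get_cookies, max_refreshes=2):
    """Try the cheap HTTP path first; on failure, refresh cookies via the browser."""
    cookies = None
    for _ in range(max_refreshes + 1):
        resp = await http_get(url, cookies)
        if resp.status == 200:
            return resp
        # Challenge page or expired session: pay the browser cost once,
        # then go back to fast HTTP requests with the fresh cookies.
        cookies = await get_cookies()
    raise RuntimeError("could not obtain a valid session")
```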

&lt;h3&gt;
  
  
  &lt;strong&gt;Final thoughts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For smaller jobs, it may be easier to use the browser alone; the benefits won’t necessarily outweigh the extra complexity required.&lt;/p&gt;

&lt;p&gt;But for production pipelines, &lt;strong&gt;this approach is the standard.&lt;/strong&gt; It treats the browser as a luxury resource: used only when strictly necessary to unlock the door, so the HTTP client can do the real work. It’s this session and state management that allows you to scrape harder-to-access sites effectively and efficiently.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If building this orchestration layer yourself feels like too much overhead, this is exactly what the &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;&lt;strong&gt;Zyte API&lt;/strong&gt;&lt;/a&gt; handles internally. We manage the browser/HTTP switching logic automatically, so you just make a single request and get the data.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Raspberry Pi &amp; E-ink scrapes &amp; displays the price of Gold today</title>
      <dc:creator>Ayan Pahwa</dc:creator>
      <pubDate>Tue, 24 Feb 2026 08:39:31 +0000</pubDate>
      <link>https://forem.com/extractdata/how-i-trade-gold-using-e-ink-live-data-and-an-old-raspberry-pi-42ag</link>
      <guid>https://forem.com/extractdata/how-i-trade-gold-using-e-ink-live-data-and-an-old-raspberry-pi-42ag</guid>
      <description>&lt;p&gt;They say that “data is the new oil”, but there’s another hot commodity that’s setting markets alight - precious metals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtbqu72m01w8auk29wfd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtbqu72m01w8auk29wfd.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the last 12 months, the &lt;a href="https://uk.finance.yahoo.com/quote/GC%3DF/" rel="noopener noreferrer"&gt;value of gold&lt;/a&gt; has surged about 75%, while &lt;a href="https://www.gold.co.uk/silver-price/" rel="noopener noreferrer"&gt;silver has boomed&lt;/a&gt; more than 200%. That’s why I, like a growing number of others, now trade in the metal markets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdanyckktjsbtvn1e2b3p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdanyckktjsbtvn1e2b3p.png" alt=" " width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These days, it is possible to buy &lt;em&gt;digital&lt;/em&gt; versions of precious metals. But I think of myself as a &lt;em&gt;collector&lt;/em&gt; - I like to buy &lt;em&gt;real&lt;/em&gt;, solid coins or bullions whenever I get a chance.&lt;/p&gt;

&lt;p&gt;In the last two years, I have acquired a small collection of gold bullion and silver coins, which has appreciated healthily. But I am not planning to sell and book a profit just yet. In fact, I want to buy more, especially when there’s a dip in the price.&lt;/p&gt;

&lt;p&gt;There’s just one problem with this hobby: retail prices for physical gold and silver bullion differ widely from stock exchanges’ spot prices, and keeping track of them manually is cumbersome, especially with a full-time job.&lt;/p&gt;

&lt;p&gt;To take advantage of the dips and price arbitrage, I need to &lt;em&gt;automate&lt;/em&gt; my decisions. To buy gold old-style, I need a key resource from the modern trading toolset - data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turn data into gold
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="http://gjc.org.in/" rel="noopener noreferrer"&gt;All-India Gem And Jewellery Domestic Council&lt;/a&gt; (GJC), a national trade federation for the promotion and growth of trade in gems and jewellery across India, is the go-to site for the latest retail rates for gold and silver.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgjjdai9hw0oyihva84n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgjjdai9hw0oyihva84n.png" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alas, it doesn’t offer an API to access that data. But fear not - with web scraping skills and &lt;a href="https://www.zyte.com/zyte-api/" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;, I can extract these prices quickly and regularly.&lt;/p&gt;

&lt;p&gt;And I can do it using some of the tech I love to tinker with.&lt;/p&gt;

&lt;p&gt;I call it &lt;a href="https://github.com/apscrapes/ExtractToInk" rel="noopener noreferrer"&gt;ExtractToInk&lt;/a&gt; - a custom project that pulls the latest prices onto a two-inch, 250x122 e-ink display powered by a retired Raspberry Pi (total cost under US$50).&lt;/p&gt;

&lt;p&gt;This is the story of how I power my quest for rapid riches using cheap old hardware and the &lt;a href="https://www.zyte.com/blog/zyte-leads-proxyway-2025-web-scraping-api-report/" rel="noopener noreferrer"&gt;world’s best web scraping engine&lt;/a&gt; - and how you can, too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mining for data
&lt;/h2&gt;

&lt;p&gt;Like many modern sites, GJC’s combines JavaScript-rendered HTML with protection mechanisms - technologies that can break brittle, traditional scraping solutions.&lt;/p&gt;

&lt;p&gt;This project connects all the dots:&lt;/p&gt;

&lt;p&gt;Web → Extract → Parse → Render → Physical display&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech stack
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hardware&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raspberry Pi (tested on a Pi Zero 2 W, though it should run on any Raspberry Pi board)
&lt;/li&gt;
&lt;li&gt;Pimoroni Inky pHAT (Black, SSD1608)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3
&lt;/li&gt;
&lt;li&gt;Zyte API: to get rendered HTML
&lt;/li&gt;
&lt;li&gt;BeautifulSoup: to parse HTML
&lt;/li&gt;
&lt;li&gt;Pillow and Inky Python libraries: for e-ink display stuff&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let’s get building.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Prepare hardware
&lt;/h2&gt;

&lt;p&gt;Set up your Raspberry Pi. In my case, I am using Raspberry Pi OS &lt;a href="https://www.raspberrypi.com/documentation/computers/getting-started.html" rel="noopener noreferrer"&gt;booted from the SD card&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furl2uiive3xweo4opha6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furl2uiive3xweo4opha6.png" alt=" " width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Depending on which display you use, it will most probably be connected to the Pi over the I2C or SPI bus - so enable the appropriate interface by entering:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo raspi-config&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now attach your e-ink display and do a quick reboot.&lt;/p&gt;

&lt;p&gt;You might need to install libraries to use your e-ink display.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd77cowp2x5kj1hmvupf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkd77cowp2x5kj1hmvupf.png" alt=" " width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Fetching rendered HTML with Zyte API
&lt;/h2&gt;

&lt;p&gt;The source site, GJC, renders prices dynamically using JavaScript - something that can make plain HTTP requests unreliable.&lt;/p&gt;

&lt;p&gt;No problem. By accessing the page through Zyte API, we can set &lt;code&gt;browserHTML&lt;/code&gt; mode to return the page content as though rendered in an actual browser.&lt;/p&gt;

&lt;p&gt;Instead of fighting JavaScript, we let Zyte handle it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;html = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=(ZYTE_API_KEY, ""),
    json={"url": URL, "browserHtml": True},
).json()["browserHtml"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: there is no Selenium here, and no headless browsers to manage. This is much more reliable for production-style scraping.&lt;/p&gt;
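&lt;p&gt;&lt;em&gt;Wrapped up with basic error handling, the call might look like the sketch below. This is not the project’s actual code; the injectable &lt;code&gt;post&lt;/code&gt; parameter is my addition so the helper can be exercised without network access.&lt;/em&gt;&lt;/p&gt;

```python
def fetch_browser_html(url, api_key, post=None):
    """Fetch browser-rendered HTML for `url` via Zyte API's browserHtml mode."""
    if post is None:
        import requests  # only needed when making real calls
        post = requests.post
    resp = post(
        "https://api.zyte.com/v1/extract",
        auth=(api_key, ""),
        json={"url": url, "browserHtml": True},
        timeout=60,
    )
    resp.raise_for_status()  # surface auth/quota errors instead of parsing an error page
    return resp.json()["browserHtml"]
```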

&lt;h2&gt;
  
  
  Step 3: Parsing with CSS selectors
&lt;/h2&gt;

&lt;p&gt;Once we have clean HTML, parsing becomes straightforward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkwbp7ffd758ci8uu01r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkwbp7ffd758ci8uu01r.png" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Gold prices
&lt;/h3&gt;

&lt;p&gt;Let’s locate the actual prices in the page mark-up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for row in soup.select(".gold_rate table tr"):
    label = row.select_one("td strong")
    values = row.select("td strong")

    if not label or len(values) &amp;lt; 2:
        continue

    text = label.get_text(strip=True)
    priceText = values[1].get_text()

    if "Standard Rate Buying" in text:
        goldBuying = re.search(r"\d[\d,]*", priceText).group(0)

    if "Standard Rate Selling" in text:
        goldSelling = re.search(r"\d[\d,]*", priceText).group(0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’re deliberately using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSS selectors (easy to find from your browser’s DevTools).
&lt;/li&gt;
&lt;li&gt;Minimal regular expressions (only for numeric extraction).
&lt;/li&gt;
&lt;li&gt;Defensive checks to avoid brittle parsing.&lt;/li&gt;
&lt;/ul&gt;
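&lt;p&gt;&lt;em&gt;The numeric extraction can be packaged as a small helper. This is a sketch of the idea only; &lt;code&gt;extract_price&lt;/code&gt; and the sample strings are mine, not from the repository.&lt;/em&gt;&lt;/p&gt;

```python
import re

def extract_price(text):
    """Pull the first comma-grouped number out of a label like 'Rate 78,500/10g'."""
    m = re.search(r"\d[\d,]*", text)
    # Strip the thousands separators so the value can be compared numerically.
    return int(m.group(0).replace(",", "")) if m else None
```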

&lt;h3&gt;
  
  
  Silver prices
&lt;/h3&gt;

&lt;p&gt;Silver appears outside the main table, so we filter it carefully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for strong in soup.select("p &amp;gt; strong"):
    text = strong.get_text(" ", strip=True)

    if "Standard Rate Selling" in text and not strong.find_parent("table"):
        silver = re.search(r"\d[\d,]*", text).group(0)
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Rendering for e-ink
&lt;/h2&gt;

&lt;p&gt;For this project, I did not want to pipe data into a web dashboard on a computer monitor.&lt;/p&gt;

&lt;p&gt;E-ink is always-on, low power, distraction-free and perfect for “ambient information” like this.&lt;/p&gt;

&lt;p&gt;So, it’s a great fit for data like prices, weather, status indicators and system health.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwp8mf7r0lr5iibk3h8b5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwp8mf7r0lr5iibk3h8b5.png" alt=" " width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But e-ink displays are not normal screens.&lt;/p&gt;

&lt;p&gt;They are typically black-and-white, have high contrast and are slow to refresh.&lt;/p&gt;

&lt;p&gt;What’s more, no two e-ink displays are made the same way. Every vendor ships different support packages, so whichever one you end up using, make sure to read the documentation and adapt the code accordingly.&lt;/p&gt;

&lt;p&gt;In my case, I am using the &lt;a href="https://learn.pimoroni.com/article/getting-started-with-inky-phat" rel="noopener noreferrer"&gt;Pimoroni Inky pHAT&lt;/a&gt;. The supplied Python library has great built-in examples to get you up and running quickly. I used its helper functions to render text on the display; for example, the built-in &lt;code&gt;draw.text()&lt;/code&gt; function comes in handy:&lt;/p&gt;

&lt;h3&gt;
  
  
  Draw silver selling price
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    draw.text((x, y), f"Silver : {silverPrice}", fill=(0, 0, 0), font=fontBig)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
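&lt;p&gt;&lt;em&gt;Put together, the render step might look like the sketch below. It assumes the Pillow and Pimoroni &lt;code&gt;inky&lt;/code&gt; libraries mentioned above; the &lt;code&gt;format_lines&lt;/code&gt; helper and the layout numbers are mine, so adjust them for your display and font.&lt;/em&gt;&lt;/p&gt;

```python
def format_lines(gold_buying, gold_selling, silver):
    """Build the text lines to draw on the 250x122 panel."""
    return [
        f"Gold buy : {gold_buying}",
        f"Gold sell: {gold_selling}",
        f"Silver   : {silver}",
    ]

def render(lines):
    from PIL import Image, ImageDraw  # Pillow
    from inky.auto import auto        # autodetects the attached Inky board

    display = auto()
    img = Image.new("P", (display.width, display.height), display.WHITE)
    draw = ImageDraw.Draw(img)
    y = 4
    for line in lines:
        draw.text((4, y), line, fill=display.BLACK)
        y += 18  # rough line height; tune for your font
    display.set_image(img)
    display.show()
```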



&lt;h2&gt;
  
  
  Taking it further
&lt;/h2&gt;

&lt;p&gt;I built this project to use web data thoughtfully: connecting it to the physical world and building pipelines that feel calm, reliable, and purposeful. When I am at my desk, the project actively shows me the current prices so I can buy new coins if I spot a price drop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf1n493o7cimspvfjrjw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdf1n493o7cimspvfjrjw.jpg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I can further extend this to place automatic orders on the website and secure a coin at my desired strike price.&lt;/p&gt;

&lt;p&gt;If you want to take this further, you could also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run it via &lt;code&gt;cron&lt;/code&gt; on a schedule: the website I am targeting only refreshes prices twice a day, so my cron job runs every 12 hours, but if you need faster data, you can scrape a site with more frequent updates.
&lt;/li&gt;
&lt;li&gt;Add more commodities or currencies.
&lt;/li&gt;
&lt;li&gt;Turn it into a &lt;code&gt;systemd&lt;/code&gt; service so it runs at boot.
&lt;/li&gt;
&lt;li&gt;Swap e-ink for another output (PDF, LED, dashboard).&lt;/li&gt;
&lt;/ul&gt;
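&lt;p&gt;&lt;em&gt;For reference, a twice-daily schedule looks like this in a crontab. The interpreter and script paths are placeholders, not the repository’s actual layout.&lt;/em&gt;&lt;/p&gt;

```shell
# crontab -e
# Run every 12 hours (the source site only refreshes prices twice a day).
0 */12 * * * /usr/bin/python3 /home/pi/ExtractToInk/extract_to_ink.py
```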

&lt;p&gt;If you’re exploring Zyte API, or looking for real-world scraping examples beyond CSVs and JSON files, this project is a great place to start.&lt;/p&gt;

&lt;p&gt;You can get my code in the &lt;a href="https://github.com/apscrapes/ExtractToInk" rel="noopener noreferrer"&gt;ExtractToInk GitHub repository&lt;/a&gt; now.&lt;/p&gt;

</description>
      <category>python</category>
      <category>raspberrypi</category>
      <category>webdev</category>
      <category>sideprojects</category>
    </item>
    <item>
      <title>Holiday Gift Guide 2025: For Developers, Web Scrapers &amp; Everyone In Between</title>
      <dc:creator>Ayan Pahwa</dc:creator>
      <pubDate>Fri, 19 Dec 2025 07:55:12 +0000</pubDate>
      <link>https://forem.com/extractdata/holiday-gift-guide-2025-for-developers-web-scrapers-everyone-in-between-2hbp</link>
      <guid>https://forem.com/extractdata/holiday-gift-guide-2025-for-developers-web-scrapers-everyone-in-between-2hbp</guid>
      <description>&lt;p&gt;It’s that time of the year when the coffee gets stronger, commits get messier, and everyone agrees to finally refactor that script in January. And let’s be honest, most of us won’t. But while it lasts let’s celebrate the season of enjoying laid back family time and exchanging gifts. &lt;br&gt;
And to make gifting a little easier, we asked the Zyte team and community to share what they would love to receive. So if you’ve got a developer, a web scraper, or someone who just really enjoys arguing with APIs in your life… Here's your cheat sheet.&lt;br&gt;
Grab a hot drink, settle into your favourite debugging position, and let’s dive in. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: this post is not sponsored. All recommendations come from the community and the author's personal experience with these products; no URLs are provided, but you should be able to find them easily.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;1. Book: Soul of a New Machine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsk3748t6ioitfnv4qms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhsk3748t6ioitfnv4qms.png" alt=" " width="572" height="892"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the dev who loves origin stories. It’s a classic engineering tale that reminds us why we fell in love with building things in the first place. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Logitech MX Master mouse&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkn5l6tujar1d04fvqyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkn5l6tujar1d04fvqyf.png" alt=" " width="570" height="812"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Smooth scrolling. Perfect ergonomics. Side buttons that feel like cheat codes for productivity. This is the mouse equivalent of finally finding that one undocumented API endpoint you needed all year.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. AeroPress + Coffee Subscription&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facfz2yudv63bxnq41916.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facfz2yudv63bxnq41916.png" alt=" " width="734" height="866"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If coffee is the real runtime powering your favourite developer, this combo is basically a performance upgrade. AeroPress means fast, clean brews; fresh beans mean they might actually fix that bug before lunch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Spider Plush Toy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltyo0r4rwnfq2ybtb482.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltyo0r4rwnfq2ybtb482.png" alt=" " width="636" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For every web scraper who proudly identifies as “part human, part spider”. It sits silently on your desk, reminding you to obey robots.txt (…most of the time).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Git Merch Based on Their Commit History&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlexoulpxhdj1354vou5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlexoulpxhdj1354vou5.png" alt=" " width="720" height="918"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A mug that says “I survived the merge conflict of 2025.”&lt;br&gt;
A T-shirt celebrating their 3,000-day streak.&lt;br&gt;
Or maybe a gentle reminder that “WIP” is not a personality.&lt;br&gt;
Funny, personal, and guaranteed to make them smile during standup.&lt;br&gt;
Link: &lt;a href="https://gitmerch.com/" rel="noopener noreferrer"&gt;https://gitmerch.com/&lt;/a&gt; (not sponsored); just enter their GitHub username&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Nothing Headphones&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxulp8u0vgwuj70i6dhli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxulp8u0vgwuj70i6dhli.png" alt=" " width="548" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sleek, transparent, and great for tuning out noisy offices or noisy families. Perfect for deep work, debugging, or pretending you can’t hear someone asking, “Can you take a quick look at this?”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Mechanical Keyboard (NuPhy / Keychron)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwvejewr2qxjcsgilo9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwvejewr2qxjcsgilo9p.png" alt=" " width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Developers don’t just type on these; they perform.&lt;br&gt;
Clicky keys, gorgeous layouts, RGB that could guide aircraft: what’s not to love? Warning: once they switch, they’ll never stop talking about actuation force.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Bambu Lab A1 Mini 3D Printer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyve77kf7h4f7palr2hdt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyve77kf7h4f7palr2hdt.png" alt=" " width="540" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the dev who already has too many hobbies.&lt;br&gt;
They’ll print brackets, cable holders, figurines, replacement parts… and things no one can identify but everyone politely admires. A creator’s playground in miniature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. BenQ Monitor Light Bar&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesxgg5au2821ji4v9h01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fesxgg5au2821ji4v9h01.png" alt=" " width="676" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Immediate upgrade to any desk setup. Reduces eye strain, looks clean, and helps developers see their code clearly even during late-night “just one more function” sessions, all without taking up extra space on their desk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. 100W GaN Charger&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feznqyferd6siq0nsnwa1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feznqyferd6siq0nsnwa1.png" alt=" " width="738" height="690"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tiny but absurdly powerful, just like that one script they wrote at 3 AM that still runs in production. A GaN charger keeps everything powered: laptop, phone, tablet, e-reader, existential dread, everything.&lt;/p&gt;




&lt;p&gt;If you love someone who spends their days coaxing data out of websites, obsessing over keyboard switches, or whispering sweet nothings to their terminal, this list has something that will make them light up.&lt;br&gt;
Here’s to a warm, restful holiday season… and to fewer bugs in 2026. Happy gifting, happy coding, and as always, happy scraping! 🕷️✨&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>holiday</category>
      <category>python</category>
      <category>developers</category>
    </item>
    <item>
      <title>The AI Web Scraper: One Workflow to Scrape Anything (n8n Part 3)</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Tue, 16 Dec 2025 14:00:59 +0000</pubDate>
      <link>https://forem.com/extractdata/web-scraping-with-n8n-part-3-the-ai-web-scraper-one-workflow-scrape-anything-3e4n</link>
      <guid>https://forem.com/extractdata/web-scraping-with-n8n-part-3-the-ai-web-scraper-one-workflow-scrape-anything-3e4n</guid>
      <description>&lt;p&gt;Welcome to the finale of our n8n web scraping series!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf"&gt;Part 1&lt;/a&gt;, we covered the basics: fetching a single page and parsing it with CSS selectors.&lt;/li&gt;
&lt;li&gt;In &lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365"&gt;Part 2&lt;/a&gt;, we tackled the tricky mechanics: pagination loops, infinite scroll, and network capture.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;But if you’ve been following along, you know there is still one massive headache in web scraping: &lt;strong&gt;"New Site = New Workflow."&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every time you want to scrape a different website, you have to open the browser inspector, hunt for new &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; classes, debug why &lt;code&gt;price_color&lt;/code&gt; isn't working, and rewrite your entire flow. It’s exhausting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Today, we change that.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this final part, we are going to build an &lt;strong&gt;Automated AI Scraper&lt;/strong&gt; - a single n8n workflow that can scrape almost anything (Online Stores, Article/News Sites, Job Boards, and more) without you changing a single node.&lt;/p&gt;

&lt;p&gt;Whether you are a developer looking to save hours of coding, or a non-technical user who just needs the data without the headache, this tool is designed for you.&lt;/p&gt;

&lt;center&gt;
  &lt;h4&gt;Watch the Walkthrough 🎬&lt;/h4&gt;
&lt;/center&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/QLuvyOCwYT4"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 TL;DR:&lt;/strong&gt; Want to start scraping immediately? We have packaged this entire workflow into a ready-to-use template. 👉 &lt;a href="https://n8n.io/workflows/11637-automate-data-extraction-with-zyte-ai-products-jobs-articles-and-more/" rel="noopener noreferrer"&gt;Automate Data Extraction with Zyte AI (Products, Jobs, Articles &amp;amp; More)&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Concept: “AI-Driven Architecture”
&lt;/h1&gt;

&lt;p&gt;The idea: we make n8n stop looking for specific elements (CSS selectors) and start caring about what we actually want (data).&lt;/p&gt;

&lt;p&gt;We are leveraging &lt;a href="https://docs.zyte.com/zyte-api/usage/extract/index.html?utm_campaign=Discord_n8n_blog_p3_z_docs_auto_extract&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=Discord" rel="noopener noreferrer"&gt;Zyte’s AI-powered extraction&lt;/a&gt;. Instead of saying "Find the text inside &lt;code&gt;.product_pod h3 a&lt;/code&gt;", we simply send the URL and say &lt;code&gt;product: true&lt;/code&gt;. The AI analyzes the visual layout of the page and figures it out, even if the website changes its code tomorrow.&lt;/p&gt;
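&lt;p&gt;As a sketch of what that "selector-free" contract looks like (the payload shape follows Zyte's docs; the endpoint URL and authentication are omitted here for brevity):&lt;/p&gt;

```python
import json

# Minimal sketch of a Zyte API extraction payload: we declare WHAT we want
# ("product": True) and let the AI decide HOW to find it on the page.
# The schema key is the only thing that changes per category;
# no CSS selectors anywhere.
def build_extract_payload(url: str, schema: str = "product") -> dict:
    return {"url": url, schema: True}

print(json.dumps(build_extract_payload("https://example.com/some-product")))
```

&lt;p&gt;Swapping &lt;code&gt;product&lt;/code&gt; for another schema key is all it takes to repoint the same request at a different category.&lt;/p&gt;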

&lt;p&gt;To handle every scenario, we designed &lt;strong&gt;three distinct pipelines:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;The "AI Extraction" Pipeline (Automatic)&lt;/strong&gt;&lt;br&gt;
This is the core of the workflow. You simply select a Category (e.g., &lt;code&gt;E-commerce&lt;/code&gt; or &lt;code&gt;Article&lt;/code&gt;) and a &lt;code&gt;Goal&lt;/code&gt;, and the workflow automatically routes you to one of two paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct Extraction (Fast):&lt;/strong&gt; If you just need data from the current page (like a "Single Product" or a "Simple List"), the workflow sends a single smart request. No loops, no waiting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Two-Phase" Architecture (For AI crawling):&lt;/strong&gt; If your goal involves "All Pages" or "Visiting Items," the workflow activates a robust recursive loop:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1 (The Crawler):&lt;/strong&gt; It maps out the URLs you need (looping through pagination or grabbing item links from a list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2 (The Scraper):&lt;/strong&gt; It visits every mapped URL one by one to extract the rich details you asked for.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The "SERP" Pipeline (For SEO Data)&lt;/strong&gt;&lt;br&gt;
Need search rankings? We included a dedicated path for &lt;strong&gt;Search Engine Results Pages (SERP)&lt;/strong&gt;. It uses the specific serp schema to automatically extract organic results, ads, and knowledge panels without you needing to parse complex HTML.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The "Manual" Mode (For Raw Control)&lt;/strong&gt;&lt;br&gt;
Sometimes you don't need AI. We added a "General" path that gives you raw &lt;code&gt;browserHtml&lt;/code&gt;, HTTP responses, or Screenshots so you can parse specific data yourself.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s get building.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: The Control Center
&lt;/h2&gt;

&lt;p&gt;In previous parts, we hardcoded URLs into our nodes. For this tool, that won’t work. We need a flexible User Interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Main Interface (Form Trigger)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh417c7sm4xjtwu6t74sp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh417c7sm4xjtwu6t74sp.png" alt="N8N Form AI Web Scraper Submission"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use an &lt;strong&gt;n8n Form Trigger&lt;/strong&gt; node as the entry point. This turns your workflow into a clean web app that anyone on your team can use.&lt;/p&gt;

&lt;p&gt;The Main Form collects three key inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Target URL&lt;/strong&gt; (Text): The website you want to scrape.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Site Category&lt;/strong&gt; (Dropdown): Options like &lt;code&gt;Online Store&lt;/code&gt;, &lt;code&gt;Article/News&lt;/code&gt;, &lt;code&gt;Job Post&lt;/code&gt;, &lt;code&gt;General &amp;amp; More&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zyte API Key&lt;/strong&gt; (Password): Securely input the key so it isn't hardcoded in the workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Smart Routing (The Switch Node)
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens. Immediately after the form, we use a &lt;strong&gt;Switch Node&lt;/strong&gt; ("Route by Category") that directs the traffic into distinct lanes.&lt;/p&gt;

&lt;p&gt;This logic is crucial because different categories require different inputs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The AI Lane (Store, News, Jobs):&lt;/strong&gt; If you select a structured category, the workflow routes you to a &lt;strong&gt;Secondary Form&lt;/strong&gt; asking for your "Extraction Goal" (e.g., &lt;em&gt;Scrape this page&lt;/em&gt; vs. &lt;em&gt;Crawl ALL pages&lt;/em&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The SEO Lane:&lt;/strong&gt; If you select "SERP (Search Engine Results)," it bypasses extra forms and goes straight to the specialized SERP scraper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Manual Lane (General):&lt;/strong&gt; If you select "General," it routes you to a different &lt;strong&gt;Manual Options Form&lt;/strong&gt; where you can choose specific technical actions (e.g., &lt;em&gt;Take Screenshot&lt;/em&gt;, &lt;em&gt;Get Browser HTML&lt;/em&gt;, &lt;em&gt;Network Capture&lt;/em&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture ensures you only see options relevant to your goal.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Pipeline 1 – AI Extraction
&lt;/h2&gt;

&lt;p&gt;If the user selects E-commerce, News/Blog/Article, or Jobs, they enter the AI pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The "AI Extraction Goal Form" (Refining Scope)
&lt;/h3&gt;

&lt;p&gt;Since "scraping" can mean anything from checking one price to archiving an entire blog, we present a secondary form here to define the scope. You simply tell the workflow what you need: a quick &lt;strong&gt;Single Item&lt;/strong&gt; lookup, a &lt;strong&gt;List&lt;/strong&gt; from the current page, or a full &lt;strong&gt;Multi-Page Crawl&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j4jy725csctiadkvw6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j4jy725csctiadkvw6y.png" alt="AI Extraction Goal"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Brain (Config Generator)
&lt;/h3&gt;

&lt;p&gt;We place a &lt;strong&gt;Code Node&lt;/strong&gt; (the "Zyte Config Generator") to translate your form choices into technical instructions.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejpt2m8v567dzbovrnou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejpt2m8v567dzbovrnou.png" alt="N8N Code node: Zyte Config Generator"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you select &lt;strong&gt;"Online Store"&lt;/strong&gt; → It maps to the Zyte schema &lt;code&gt;product&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;If you select &lt;strong&gt;"Article Site"&lt;/strong&gt; → It maps to &lt;code&gt;article&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;If you choose &lt;strong&gt;"Get List"&lt;/strong&gt; → It targets &lt;code&gt;productList&lt;/code&gt; (or &lt;code&gt;articleList&lt;/code&gt;) to extract an array of items.
&lt;/li&gt;
&lt;li&gt;If you choose &lt;strong&gt;"Crawl All Pages"&lt;/strong&gt; → It switches the target to &lt;code&gt;productNavigation&lt;/code&gt; (or &lt;code&gt;articleNavigation&lt;/code&gt;) to activate the crawler loop.&lt;/li&gt;
&lt;/ul&gt;
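&lt;p&gt;The mapping is small enough to sketch directly. n8n Code nodes run JavaScript, but the logic is language-agnostic, so here it is in Python (the category and goal labels are illustrative, not the exact form values):&lt;/p&gt;

```python
# Sketch of the "Zyte Config Generator" routing: form choices in,
# Zyte schema name out. The suffixes mirror Zyte's naming convention
# (product / productList / productNavigation).
BASE_SCHEMAS = {"Online Store": "product", "Article Site": "article"}

def pick_schema(category: str, goal: str) -> str:
    base = BASE_SCHEMAS[category]
    if goal == "Crawl All Pages":
        return base + "Navigation"  # activates the crawler loop
    if goal == "Get List":
        return base + "List"        # array of items from the current page
    return base                     # single-item extraction

print(pick_schema("Online Store", "Crawl All Pages"))  # productNavigation
```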

&lt;h3&gt;
  
  
  3. The 5 Strategies
&lt;/h3&gt;

&lt;p&gt;Based on your "Extraction Goal," the workflow automatically routes to one of 5 specific branches:&lt;/p&gt;

&lt;h4&gt;
  
  
  A. &lt;strong&gt;Single Item:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Fast execution. Scrapes details of one URL.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We send a single request to the Zyte API with our specific target schema (e.g., &lt;code&gt;product: true&lt;/code&gt;). The AI analyzes the page layout and returns a structured JSON object with the price, name, and details instantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  B. &lt;strong&gt;List (Current Page):&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Returns a clean JSON array of items found on the provided URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8gf3goep2mm652d2x25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8gf3goep2mm652d2x25.png" alt="Scrape Details AI This Page - N8N"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similar to the single item strategy, but instead of asking for one object, we request a List schema (like &lt;code&gt;productList&lt;/code&gt;). The AI identifies the repeating elements on the page and returns them as a clean array.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Design Note:&lt;/strong&gt; You might notice that the nodes for Strategies A and B look identical. That is because the heavy lifting (choosing between &lt;code&gt;product&lt;/code&gt; vs &lt;code&gt;productList&lt;/code&gt;) is actually handled upstream by the &lt;strong&gt;Config Generator&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧑‍💻 Best Practice:&lt;/strong&gt; In your own production automations, you should usually &lt;strong&gt;combine these into a single node&lt;/strong&gt; to keep your canvas clean. However, for this template, we kept them separate. This makes the logic visually intuitive and allows you to add specific post-processing (like a unique filter) to the &lt;em&gt;List&lt;/em&gt; path without accidentally breaking the &lt;em&gt;Single Item&lt;/em&gt; path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  C. &lt;strong&gt;Details (Current Page):&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;A hybrid approach. It scans the current list, finds item links, and visits them one by one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav0mtxeob8ibbw711rzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav0mtxeob8ibbw711rzx.png" alt="Scrape List AI - N8N"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use a two-step logic: first, we request a &lt;code&gt;navigation&lt;/code&gt; schema to identify all item links on the current page. Then, we split that list and use a loop to visit each URL individually to extract the full details.&lt;/li&gt;
&lt;/ul&gt;
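&lt;p&gt;The first half of that two-step logic amounts to "collect the item links from the navigation response". A sketch in Python (the exact response shape, &lt;code&gt;productNavigation.items[].url&lt;/code&gt;, is an assumption to verify against Zyte's navigation schema docs):&lt;/p&gt;

```python
# Phase A of the "Details" strategy: request a navigation schema, then pull
# every item link out of the response. Phase B (not shown) loops over these
# URLs and requests full details for each one.
def item_urls(navigation_response: dict) -> list:
    nav = navigation_response.get("productNavigation", {})
    return [item["url"] for item in nav.get("items", [])]

sample = {"productNavigation": {"items": [
    {"url": "https://example.com/p/1"},
    {"url": "https://example.com/p/2"},
]}}
print(item_urls(sample))  # ['https://example.com/p/1', 'https://example.com/p/2']
```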

&lt;h4&gt;
  
  
  D. Crawl List (All Pages):
&lt;/h4&gt;

&lt;p&gt;Activates the &lt;strong&gt;Crawler (Phase 1)&lt;/strong&gt; to loop through pagination and build a massive master list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71sejledynmnmg1ly4oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71sejledynmnmg1ly4oh.png" alt="Scrape List - n8n"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This enables the pagination loop. The workflow fetches the current page's list, saves the items to a global "Backpack" (memory), detects the "Next Page" link automatically, and loops back to repeat the process until it reaches the end.&lt;/li&gt;
&lt;/ul&gt;
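&lt;p&gt;Stripped of the n8n nodes, this is a classic "follow the next-page link until it runs out" crawl. A minimal Python model, with &lt;code&gt;fetch_page&lt;/code&gt; standing in for the real Zyte request and the field names assumed for illustration:&lt;/p&gt;

```python
# The "Backpack" pattern: accumulate items across pages, follow the
# detected next-page link, and stop at the end (with a safety cap so a
# bad next-page detector can never loop forever).
def crawl_all(fetch_page, start_url, max_pages=100):
    backpack, url, pages = [], start_url, 0
    while url and pages < max_pages:
        data = fetch_page(url)
        backpack.extend(data.get("items", []))
        url = data.get("nextPage")  # None once we reach the last page
        pages += 1
    return backpack

# Fake two-page site for illustration:
site = {"p1": {"items": ["a", "b"], "nextPage": "p2"},
        "p2": {"items": ["c"], "nextPage": None}}
print(crawl_all(site.__getitem__, "p1"))  # ['a', 'b', 'c']
```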

&lt;h4&gt;
  
  
  E. Crawl Details (All Pages):
&lt;/h4&gt;

&lt;p&gt;The ultimate mode. It crawls all pages (Phase 1) AND visits every single item found (Phase 2).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx24quf6lcfjw47lwev69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx24quf6lcfjw47lwev69.png" alt="scrape details AI - n8n"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This uses our robust &lt;strong&gt;"Two-Phase" architecture&lt;/strong&gt;. Phase 1 loops through pagination specifically to map out every item URL. Once the map is complete, Phase 2 takes over to visit every single URL one by one and extract the deep data.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 3: Pipeline 2 – SERP (Search Engine Results)
&lt;/h2&gt;

&lt;p&gt;If you select "Search Engine Results" in the main form, the workflow takes a direct path to the &lt;strong&gt;SERP Node&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is a single HTTP Request node configured with the &lt;code&gt;serp&lt;/code&gt; schema.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; Your target Search URL (e.g., a query on a search engine).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Structured JSON containing organic results, ad positions, and knowledge panels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmrzrt20779qz6v3phuj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmrzrt20779qz6v3phuj.png" alt="SERP Extraction: Scrape with n8n"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is the fastest way to get reliable &lt;strong&gt;SERP data&lt;/strong&gt; for rank tracking or brand monitoring, handling complex layouts automatically.&lt;/p&gt;
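&lt;p&gt;Once the structured JSON comes back, rank tracking is a one-liner per result. A sketch of pulling the organic listings out (field names like &lt;code&gt;organicResults&lt;/code&gt; follow Zyte's serp schema but should be checked against the API reference):&lt;/p&gt;

```python
# Turn a serp response into (rank, url) pairs for rank tracking.
def organic_rankings(serp_response: dict) -> list:
    results = serp_response.get("serp", {}).get("organicResults", [])
    return [(r.get("rank"), r.get("url")) for r in results]

sample = {"serp": {"organicResults": [
    {"rank": 1, "url": "https://example.com", "name": "Example"},
]}}
print(organic_rankings(sample))  # [(1, 'https://example.com')]
```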




&lt;h2&gt;
  
  
  Step 4: Pipeline 3 – Manual / General Mode
&lt;/h2&gt;

&lt;p&gt;Sometimes you need to scrape a unique dashboard or a niche directory, or you just want to debug the raw HTML yourself. That’s why we included the &lt;strong&gt;"Manual"&lt;/strong&gt; path.&lt;/p&gt;

&lt;p&gt;If you select "General / Other" in the form, you are presented with a secondary form offering 5 raw tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Browser HTML:&lt;/strong&gt; Returns the full rendered DOM (great for the &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf#:~:text=Step%203%3A%20Extract%20the%20HTML%20content"&gt;custom parsing logic we built in Part 1&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Response Body:&lt;/strong&gt; Useful for API endpoints.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Capture:&lt;/strong&gt; Intercepts background XHR/Fetch requests (&lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365#network-capture"&gt;as we learned in Part 2&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infinite Scroll:&lt;/strong&gt; Automatically scrolls to the bottom before capturing HTML. (&lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365#infinite-scroll"&gt;see the infinite scroll guide in Part 2&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot:&lt;/strong&gt; Returns a PNG snapshot of the page. (&lt;a href="https://dev.to/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365#screenshots"&gt;view the setup steps here&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;
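&lt;p&gt;Under the hood, each manual tool is just a different flag on the same request. A sketch of three of the mappings (flag names per the Zyte API reference; network capture and infinite scroll need extra &lt;code&gt;actions&lt;/code&gt; configuration and are omitted here):&lt;/p&gt;

```python
# Map a manual-mode menu choice onto the corresponding Zyte request flag.
TOOL_FLAGS = {
    "Browser HTML": "browserHtml",             # full rendered DOM
    "HTTP Response Body": "httpResponseBody",  # raw response, good for APIs
    "Screenshot": "screenshot",                # PNG snapshot of the page
}

def manual_payload(url: str, tool: str) -> dict:
    return {"url": url, TOOL_FLAGS[tool]: True}

print(manual_payload("https://example.com", "Screenshot"))
```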

&lt;p&gt;This ensures your scraping tool always has a fallback, even on the most obscure websites.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Result &amp;amp; Output&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Regardless of which pipeline you chose (AI, SERP, or Manual), all data converges at a final &lt;strong&gt;Data Collector&lt;/strong&gt; node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v2klg4a7j20laopexq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v2klg4a7j20laopexq3.png" alt="AI Scrape, General, Image - n8n output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use a &lt;strong&gt;Convert to File&lt;/strong&gt; node to transform that JSON into a clean &lt;strong&gt;CSV file&lt;/strong&gt; or &lt;strong&gt;Image file&lt;/strong&gt; (for screenshots), ready for download directly in the browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6cabe6x1notavtkejep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6cabe6x1notavtkejep.png" alt="n8n scrape results "&gt;&lt;/a&gt;&lt;/p&gt;



&lt;h3&gt;
  
  
  Get the Workflow
&lt;/h3&gt;

&lt;p&gt;We have packaged this entire logic – the forms, the smart routing, the crawler loops, and the safety checks – into a single template you can import right now from the n8n community.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://n8n.io/workflows/11637-automate-data-extraction-with-zyte-ai-products-jobs-articles-and-more/" rel="noopener noreferrer"&gt;&lt;strong&gt;Automate Data Extraction with Zyte AI (Products, Jobs, Articles &amp;amp; More)&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;And that’s a wrap on our n8n web scraping series! 🎬&lt;/p&gt;

&lt;p&gt;From building your first simple scraper in &lt;strong&gt;Part 1&lt;/strong&gt;, to mastering pagination in &lt;strong&gt;Part 2&lt;/strong&gt;, we have now arrived at the ultimate goal: an &lt;strong&gt;Intelligent Scraper&lt;/strong&gt; that adapts to the web so you don't have to.&lt;/p&gt;

&lt;p&gt;You now have a tool that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gets You Data With Ease:&lt;/strong&gt; Automatically extracts structured fields (like prices, images, and articles) without you needing to hunt for CSS selectors or manage CAPTCHAs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduces Maintenance:&lt;/strong&gt; Adapts to layout changes automatically.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gives You Control:&lt;/strong&gt; Lets you switch between AI automation and manual debugging instantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This template is ready for you to fork, modify, and deploy.&lt;/p&gt;

&lt;p&gt;Thanks for joining us on this journey! If you build something cool, or if you run into a challenge that stumps you, come share it in the &lt;a href="https://discord.com/invite/eN83rMWqAt" rel="noopener noreferrer"&gt;&lt;strong&gt;Extract Data Community&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy scraping! 🚀🕷️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webscraping</category>
      <category>data</category>
      <category>python</category>
    </item>
    <item>
      <title>Build Your Own Holiday Deal Tracker with Python, Zyte API &amp; IFTTT 🎁</title>
      <dc:creator>Ayan Pahwa</dc:creator>
      <pubDate>Fri, 21 Nov 2025 11:06:18 +0000</pubDate>
      <link>https://forem.com/extractdata/build-your-own-holiday-deal-tracker-with-python-zyte-api-ifttt-3p78</link>
      <guid>https://forem.com/extractdata/build-your-own-holiday-deal-tracker-with-python-zyte-api-ifttt-3p78</guid>
      <description>&lt;p&gt;&lt;em&gt;(A simple, developer-friendly project to help you catch Black Friday, Cyber Monday, and Christmas deals before anyone else)&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Watch the video tutorial:&lt;/p&gt;

&lt;h2&gt;
  
  
  

  &lt;iframe src="https://www.youtube.com/embed/7o-S-FY5-k8"&gt;
  &lt;/iframe&gt;



&lt;/h2&gt;

&lt;p&gt;It’s that time of the year again, everyone’s hunting for discounts, limited-time deals, and that one item you’ve been keeping an eye on all year. You know the drill: tabs open everywhere, price tracker sites breaking under traffic, browser extensions that promise magic but fail right when The Deal drops.&lt;/p&gt;

&lt;p&gt;So this year, I decided to build something different, something that actually works when I need it the most.&lt;/p&gt;

&lt;p&gt;A personal price-alert system, powered by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 🐍&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zyte.com/" rel="noopener noreferrer"&gt;Zyte’s&lt;/a&gt; Automatic Extraction API 🕷️&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://ifttt.com/" rel="noopener noreferrer"&gt;IFTTT&lt;/a&gt; mobile notification trigger 📱&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And honestly? It ended up being one of the simplest, most useful, and most reliable holiday projects I’ve made in a long time. &lt;/p&gt;

&lt;p&gt;❌ No HTML parsing.&lt;br&gt;
❌ No complex CSS selectors.&lt;br&gt;
❌ No brittle scrapers that break during peak traffic.&lt;br&gt;
✅ Just a clean API call, structured product data, and a push notification when the price hits your set target. &lt;/p&gt;

&lt;p&gt;Let me walk you through the whole thing:&lt;/p&gt;


&lt;h2&gt;
  
  
  🎁 Why Build Your Own Deal Tracker?
&lt;/h2&gt;

&lt;p&gt;Because holiday deals don’t wait for anyone. Every year I get messages from friends asking,&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Bro, what’s the best price tracker? Nothing seems to work today.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And they’re right. Most free services throttle, break, or go down completely during Black Friday and Cyber Monday because everyone hits them at once.&lt;/p&gt;

&lt;p&gt;Meanwhile, with a few lines of Python and Zyte’s AI-powered extraction, you can build something that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works only for you&lt;/li&gt;
&lt;li&gt;Handles bans, bot detection, and CAPTCHAs&lt;/li&gt;
&lt;li&gt;Checks as often as you want (run it on your PC, GitHub Actions, a server, or a Raspberry Pi)&lt;/li&gt;
&lt;li&gt;Doesn’t rely on brittle selectors&lt;/li&gt;
&lt;li&gt;Doesn’t choke on JavaScript&lt;/li&gt;
&lt;li&gt;Doesn’t get rate-limited&lt;/li&gt;
&lt;li&gt;Sends you a mobile ping instantly when price drops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s the perfect mix of practical dev fun + a tool you’ll actually use.&lt;/p&gt;


&lt;h2&gt;
  
  
  🎄 Why Zyte?
&lt;/h2&gt;

&lt;p&gt;Scraping e-commerce sites for price data is usually a pain: blocking, JavaScript rendering, cookie rules, bot detection, dynamic DOMs, infinite variations in product page layouts... you get it.&lt;/p&gt;

&lt;p&gt;The magic here is &lt;strong&gt;&lt;a href="https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/response/200/product" rel="noopener noreferrer"&gt;Zyte’s Automatic Extraction&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You literally tell the AI-powered Zyte API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "url": "&amp;lt;product-url&amp;gt;",
  "product": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and it returns structured fields like: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;price &lt;/li&gt;
&lt;li&gt;currency &lt;/li&gt;
&lt;li&gt;product name &lt;/li&gt;
&lt;li&gt;sku &lt;/li&gt;
&lt;li&gt;images &lt;/li&gt;
&lt;li&gt;stock availability &lt;/li&gt;
&lt;li&gt;description &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t write selectors. You don’t parse HTML. You don’t chase CSS changes. You just get clean data. And that makes this holiday project absurdly simple.&lt;/p&gt;
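&lt;p&gt;Under the hood that request is plain HTTP. Here’s a minimal sketch using only Python’s standard library (the project itself uses the official &lt;code&gt;zyte-api&lt;/code&gt; client, shown later); the endpoint and Basic Auth scheme follow Zyte’s HTTP API, and the key shown is a placeholder:&lt;/p&gt;

```python
import base64
import json
import urllib.request

ZYTE_ENDPOINT = "https://api.zyte.com/v1/extract"

def build_extract_request(product_url, api_key):
    """Build the POST request for Zyte's automatic product extraction.

    The API key goes in as the Basic Auth username (empty password),
    and {"product": True} switches on the ML-powered product extraction.
    """
    body = json.dumps({"url": product_url, "product": True}).encode()
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return urllib.request.Request(
        ZYTE_ENDPOINT,
        data=body,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )

# Sending it is a network call, so it is commented out here:
# resp = json.loads(urllib.request.urlopen(build_extract_request(url, key)).read())
# structured_fields = resp["product"]
```

&lt;p&gt;The response JSON’s &lt;code&gt;product&lt;/code&gt; key holds the structured fields listed above.&lt;/p&gt;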




&lt;h2&gt;
  
  
  🎅 What We’re Building
&lt;/h2&gt;

&lt;p&gt;A script that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Takes a product URL, a target price, your Zyte API key, and your IFTTT Webhook key&lt;/li&gt;
&lt;li&gt;Fetches structured product data using Zyte API&lt;/li&gt;
&lt;li&gt;Checks if the product price is ≤ your target&lt;/li&gt;
&lt;li&gt;If yes → triggers an IFTTT event&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your phone instantly notifies you, with the product URL:&lt;br&gt;
“Your product dropped to your target price. Go get it!”&lt;/p&gt;
&lt;h2&gt;
  
  
  You can run this:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;manually (when you want)&lt;/li&gt;
&lt;li&gt;on a cron job (in the background)&lt;/li&gt;
&lt;li&gt;in GitHub Actions&lt;/li&gt;
&lt;li&gt;on a Raspberry Pi&lt;/li&gt;
&lt;li&gt;on a cloud function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your call.&lt;/p&gt;


&lt;h2&gt;
  
  
  🛠 Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Clone your project
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/apscrapes/zyte-sale-alert.git
cd zyte-sale-alert
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Create your virtual environment
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m venv venv
source venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Install dependencies
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Create and add secrets in .env file
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ZYTE_API_KEY=your_zyte_key_here   # obtained from Zyte
IFTTT_KEY=your_webhooks_key       # obtained from IFTTT, see next step
IFTTT_EVENT=price_drop            # name of your IFTTT applet, see next step
TARGET_PRICE=149.99               # example
PRODUCT_URL=https://example.com/product/123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Create an IFTTT Webhook Applet&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;a. Download the &lt;a href="https://ifttt.com/" rel="noopener noreferrer"&gt;IFTTT&lt;/a&gt; mobile app. It’s paid, but you get a 7-day trial, and it has tons of no-code automations you can build, so I think it’s worth it. &lt;/p&gt;

&lt;p&gt;b. Click create new automation / applet&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uje1n43d1odgh872t1g.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uje1n43d1odgh872t1g.jpg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;c. In “IF” field, select “Webhooks” &amp;gt; Add Event Name &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3j8wuf71g071i1wo949c.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3j8wuf71g071i1wo949c.jpg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbtj4o3uiwrv3hq1yzee.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbtj4o3uiwrv3hq1yzee.jpg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;d. In “THEN” field select Notification &amp;gt; App Notification &amp;gt; JSON Payload &amp;gt; High Priority&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkbr84te5l7gsd86oilz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkbr84te5l7gsd86oilz.jpg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: instead of a notification, you can also set up other automations, such as receiving an email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcao83gt5mwlzh1fy0sd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcao83gt5mwlzh1fy0sd.jpg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyabr1st8f5aia47sggtc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyabr1st8f5aia47sggtc.jpg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;e. Here's how it should look when it's successfully created:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bdxpa8pyosqkr39c431.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bdxpa8pyosqkr39c431.jpg" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;Once the automation is created, add your IFTTT_KEY and IFTTT_EVENT to your .env file.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set product parameters
The main script is at src/pricedrop.py. Edit the following variables:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PRODUCT_URL = ""
DESIRED_PRICE = 250
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;PRODUCT_URL is the URL of the product you want to track, and DESIRED_PRICE is the price at which you want to be notified. That’s it. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the project
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python src/pricedrop.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You can run it manually or set up a cron job to run at regular intervals. Whenever the price drops to or below your target, you’ll get a notification from the IFTTT app on your phone with the product URL, so you can order right away. &lt;/p&gt;
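&lt;p&gt;If you’d rather keep everything in one Python process instead of cron, a minimal polling loop works too (a sketch; &lt;code&gt;check&lt;/code&gt; stands in for one run of the price script and is not from the repo):&lt;/p&gt;

```python
import time

def poll(check, interval_seconds=3600, max_runs=None):
    """Call check() repeatedly, sleeping between runs.

    check is whatever function performs one price check; max_runs
    exists mainly so the loop can be stopped deterministically.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        check()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break  # done; skip the final sleep
        time.sleep(interval_seconds)
```

&lt;p&gt;A cron job is still the lighter option for long-running setups, since nothing has to stay resident in memory between checks.&lt;/p&gt;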

&lt;p&gt;Let’s understand how it works. &lt;/p&gt;

&lt;h2&gt;
  
  
  🧩 Core Logic (Short Version)
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resp = client.get({"url": url, "product": True})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;"product": True&lt;/code&gt; tells the Zyte API that the page we’re scraping contains a product, so the ML-powered scraper returns all the relevant fields: price, quantity, description, currency, etc. &lt;/p&gt;

&lt;p&gt;And that’s literally all you need.&lt;/p&gt;

&lt;p&gt;The reason this works so beautifully is that Zyte is handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JS rendering&lt;/li&gt;
&lt;li&gt;blocking&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;browser simulation&lt;/li&gt;
&lt;li&gt;extraction logic&lt;/li&gt;
&lt;li&gt;AI-powered field detection&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;In detail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Importing all the necessary libraries
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import sys
import requests
from zyte_api import ZyteAPI
from dotenv import load_dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Setting required variables
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PRODUCT_URL = "https://outdoor.hylnd7.com/product/a1b2c3d4-e5f6-4a7b-8c9d-000000000293"
DESIRED_PRICE = 250
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we’ve set up a sample product link whose price we want to track, and the price at or below which we want to be notified. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loading the API keys from .env file
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;load_dotenv()

# from Zyte API
ZYTE_API_KEY = os.getenv("ZYTE_API_KEY")

# from IFTTT Service applets
EVENT_NAME = os.getenv("IFTTT_EVENT")
IFTTT_KEY = os.getenv("IFTTT_KEY")

if not ZYTE_API_KEY:
    print("ERROR: ZYTE_API_KEY not found in environment.")
    sys.exit(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the project root directory, create a .env file and add the Zyte API key (shown after you log in at zyte.com) and the IFTTT Webhooks key you get after creating the automation applet.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Function to trigger the mobile notification:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def trigger_ifttt(event_name, key, value1):
    url = f"https://maker.ifttt.com/trigger/{event_name}/json/with/key/{key}"

    payload = {
        "value1": value1,
    }

    # fire the webhook; the IFTTT applet turns it into a notification
    requests.post(url, json=payload, timeout=30)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this function does is make an API call to IFTTT. The applet is set up so that whenever a call arrives with a payload, it sends a mobile notification containing that payload, which in this case is the product URL, so you can open the product page directly and buy it before it goes out of stock. Smart, right? 😉&lt;/p&gt;
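&lt;p&gt;If you want to unit-test this logic without hitting the network, one option is to split URL and payload construction from the actual send (a sketch; &lt;code&gt;build_ifttt_trigger&lt;/code&gt; is my name, not from the repo):&lt;/p&gt;

```python
def build_ifttt_trigger(event_name, key, value1):
    """Return the (url, payload) pair for an IFTTT Webhooks trigger.

    The /json/ variant of the Maker endpoint forwards the JSON body
    to the applet, which renders it into the notification text.
    """
    url = f"https://maker.ifttt.com/trigger/{event_name}/json/with/key/{key}"
    return url, {"value1": value1}
```

&lt;p&gt;The sending step then stays a one-liner (&lt;code&gt;requests.post(url, json=payload)&lt;/code&gt;) that you can mock out in tests.&lt;/p&gt;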

&lt;ul&gt;
&lt;li&gt;Scraping init
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client = ZyteAPI(api_key=ZYTE_API_KEY)

payload = {
    "url": PRODUCT_URL,
    "product": True,
}

resp = client.get(payload)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By making a request to the Zyte API with &lt;code&gt;"product": True&lt;/code&gt;, we’re asking Zyte to treat the URL as a product page, so it uses its ML capabilities to fetch product-relevant details, the price in this case. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compare the price to the target
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if price_float &amp;lt;= DESIRED_PRICE:
            trigger_ifttt(EVENT_NAME, IFTTT_KEY, value1 = PRODUCT_URL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the product’s price drops to or below our target, the script calls the IFTTT function, triggering the notification.&lt;/p&gt;
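&lt;p&gt;The comparison is worth wrapping in a small pure function, since Zyte returns &lt;code&gt;product.price&lt;/code&gt; as a string (a sketch; &lt;code&gt;should_alert&lt;/code&gt; is my name, not the repo’s):&lt;/p&gt;

```python
def should_alert(resp, desired_price):
    """True when the extracted price is at or below the target.

    Zyte returns the price as a string (e.g. "229.99"), so convert
    before comparing; a page with no extracted price never alerts.
    """
    price = (resp.get("product") or {}).get("price")
    if price is None:
        return False
    return float(price) <= desired_price
```

&lt;p&gt;Keeping it pure makes the alert condition trivial to test with canned responses, no API calls needed.&lt;/p&gt;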




&lt;h2&gt;
  
  
  🌟 Make It Even Better
&lt;/h2&gt;

&lt;p&gt;You can extend this to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track multiple URLs&lt;/li&gt;
&lt;li&gt;Log daily prices to CSV&lt;/li&gt;
&lt;li&gt;Plot graphs&lt;/li&gt;
&lt;li&gt;Send WhatsApp alerts&lt;/li&gt;
&lt;li&gt;Push to a Telegram bot&lt;/li&gt;
&lt;li&gt;Use GitHub Actions to check every hour&lt;/li&gt;
&lt;li&gt;Deploy as a Streamlit dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Zyte handles the extraction. You build the magic on top.&lt;/p&gt;
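&lt;p&gt;For example, tracking multiple URLs is just a loop over (URL, target) pairs. A sketch with the fetch and alert steps injected so it stays testable offline (all names and URLs here are hypothetical):&lt;/p&gt;

```python
# Hypothetical watchlist: (product URL, target price) pairs
WATCHLIST = [
    ("https://example.com/product/123", 149.99),
    ("https://example.com/product/456", 89.00),
]

def check_watchlist(watchlist, fetch, alert):
    """Run one pass over the watchlist and return the URLs that alerted.

    fetch(url) should return Zyte's response dict for the URL and
    alert(url) fires the notification; both are injected so the loop
    can be exercised without network access.
    """
    hits = []
    for url, target in watchlist:
        product = fetch(url).get("product") or {}
        price = product.get("price")
        if price is not None and float(price) <= target:
            alert(url)
            hits.append(url)
    return hits
```

&lt;p&gt;In the real script, &lt;code&gt;fetch&lt;/code&gt; would be the Zyte API call and &lt;code&gt;alert&lt;/code&gt; the IFTTT trigger.&lt;/p&gt;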




&lt;h2&gt;
  
  
  🧘 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;I love projects like this because they hit the sweet spot between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;seasonal usefulness &lt;/li&gt;
&lt;li&gt;real-world scraping challenges&lt;/li&gt;
&lt;li&gt;a clean developer experience&lt;/li&gt;
&lt;li&gt;a fun weekend build&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're new to the world of web scraping like me, this shows how powerful the right tools can be. If you're experienced, it’s refreshing to skip the boilerplate and let Zyte handle the messy parts.&lt;/p&gt;

&lt;p&gt;And honestly, there’s something fun about getting a custom alert on your phone saying:&lt;br&gt;
“Hey, that gadget you wanted all year just dropped to your target price.”&lt;/p&gt;

&lt;p&gt;Happy building, happy holidays, and happy deal-hunting! 🎄🎁&lt;br&gt;
Let me know what you end up tracking.&lt;/p&gt;

&lt;p&gt;Join the Zyte Discord to share what you're building or get support: &lt;br&gt;
&lt;a href="https://discord.com/invite/DwTnbrm83s" rel="noopener noreferrer"&gt;https://discord.com/invite/DwTnbrm83s&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a class="mentioned-user" href="https://dev.to/iayanpahwa"&gt;@iayanpahwa&lt;/a&gt; &lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>n8n Web Scraping || Part 2: Pagination, Infinite Scroll, Network Capture &amp; More</title>
      <dc:creator>Lakshay Nasa</dc:creator>
      <pubDate>Mon, 17 Nov 2025 18:15:13 +0000</pubDate>
      <link>https://forem.com/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365</link>
      <guid>https://forem.com/extractdata/n8n-web-scraping-part-2-pagination-infinite-scroll-network-capture-more-365</guid>
      <description>&lt;p&gt;This is Part 2 of our n8n web scraping series with &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt;. If you’re new here, check out &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf"&gt;Part 1&lt;/a&gt; first, it covers the basics: &lt;code&gt;fetching pages&lt;/code&gt;, &lt;code&gt;extracting HTML&lt;/code&gt; with the &lt;em&gt;HTML node&lt;/em&gt;, &lt;code&gt;cleaning&lt;/code&gt; + &lt;code&gt;normalizing results&lt;/code&gt;, &amp;amp; exporting &lt;code&gt;CSV/JSON&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Table of Contents
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Pagination&lt;/li&gt;
&lt;li&gt;Infinite Scroll&lt;/li&gt;
&lt;li&gt;Geolocation support&lt;/li&gt;
&lt;li&gt;Screenshots from browser rendering&lt;/li&gt;
&lt;li&gt;Capturing network requests&lt;/li&gt;
&lt;li&gt;Handling cookies, sessions, headers &amp;amp; IP type&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Let’s Begin!
&lt;/h2&gt;

&lt;p&gt;In this part, we’ll explore some important scraping practices and nodes, along with a few hands-on tricks that make your web scraping journey smoother.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Everything you learn here will also lay the foundation for our 3rd &amp;amp; final part, where we will build a universal scraper capable of scraping any website with minimal configuration.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s start by taking the same workflow we built in Part 1 and extending it, beginning with pagination and infinite scroll.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8874utbkb31wxjubie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9o8874utbkb31wxjubie.png" alt="N8N Scraping Workflow" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Pagination across pages
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A website can navigate in multiple ways &amp;amp; our scraper needs to adapt accordingly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;N8N gives us a default &lt;a href="https://docs.n8n.io/code/cookbook/http-node/pagination/" rel="noopener noreferrer"&gt;Pagination Mode&lt;/a&gt; inside the HTTP Request node under Options, and while it sounds convenient, it didn’t behave reliably in my experience for typical web scraping use cases. &lt;/p&gt;

&lt;p&gt;After testing several patterns, &lt;em&gt;the approach below is the one that has worked most consistently in my workflows.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💬 If you’re stuck or want to share your own approach, let’s discuss it in the &lt;a href="https://discord.com/invite/eN83rMWqAt" rel="noopener noreferrer"&gt;&lt;strong&gt;Extract Data Discord&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 1: Page Manager Node
&lt;/h4&gt;

&lt;p&gt;Before calling the HTTP Request node, we introduce a small function called &lt;strong&gt;Page Manager&lt;/strong&gt;, exactly what the name suggests: a node that controls the page number.&lt;/p&gt;

&lt;p&gt;Add a Code node (JavaScript) and paste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Page Manager Function ( We use this node as both starter and incrementer)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MAX_PAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// n8n provides `items` array. If no items =&amp;gt; first run&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

  &lt;span class="c1"&gt;// Check for .json.page (from this node's first run)&lt;/span&gt;
  &lt;span class="c1"&gt;// OR .json.Page (from the Normalizer node's output)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;undefined&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nc"&gt;Number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// first run (still 1)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_PAGE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="c1"&gt;// safety stop&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On the first run, it starts with &lt;code&gt;page = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Every time the loop returns here, it increments to the next page.&lt;/li&gt;
&lt;li&gt;There’s a built-in safety limit, &lt;code&gt;MAX_PAGE&lt;/code&gt;, so you don’t accidentally loop forever. (Adjust accordingly.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F719ovxjuzauj6u0o1d1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F719ovxjuzauj6u0o1d1x.png" alt="Scraping Function N8N" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now update the URL in the old &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf#:~:text=Step%202%3A%20Add%20an%20HTTP%20Request%20Node"&gt;HTTP Request node&lt;/a&gt; to use the page variable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;code&gt;https://books.toscrape.com/catalogue/page-{{ $json.page }}.html&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9j84u81ohzikcb7iiag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9j84u81ohzikcb7iiag.png" alt="Pagination Scraping URL" width="800" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes the node fetch the correct page each time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The rest of the workflow remains the same, till the second HTML Extract node (where we parsed the book name, URL, price, rating etc. in Part 1).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 2: Modify the Normalizer Function Node to Save Results Across Pages
&lt;/h4&gt;

&lt;p&gt;In Part 1, our &lt;a href="https://dev.to/extractdata/web-scraping-with-n8n-part-1-build-your-first-web-scraper-37cf#:~:text=Step%207%3A%20Clean%20and%20normalize%20the%20data"&gt;Step 7&lt;/a&gt; code simply cleaned and normalized items for one page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now we need it to do two things:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Normalize the results (same as before)&lt;/li&gt;
&lt;li&gt;Store the results from every page inside n8n’s global static data bucket. Think of it like temporary workflow memory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Update the node’s code with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// --- Normalizer (Code node) ---&lt;/span&gt;
&lt;span class="c1"&gt;// Get the global workflow static data bucket&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$getWorkflowStaticData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;global&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// initialize storage if needed&lt;/span&gt;
&lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="c1"&gt;// normalization logic (kept minimal version)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://books.toscrape.com/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;urlRel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imgRel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratingClass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ratingClass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ratingParts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;urlRel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(\.\.\/)&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;imgRel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;(\.\.\/)&lt;/span&gt;&lt;span class="sr"&gt;+/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;availability&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nx"&gt;rating&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// append to global storage&lt;/span&gt;
&lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// return control info for IF node (not the items)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;currentPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Page Manager&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;itemsFound&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;nextHref&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nextHref&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentPage&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7fs6bgzf794rja4wy0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7fs6bgzf794rja4wy0t.png" alt="Save Data N8N Function" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We normalize the data exactly as in Part 1.&lt;/li&gt;
&lt;li&gt;Then we push all normalized items into &lt;code&gt;workflowStaticData.workBooks&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Instead of returning the items themselves, we return only a small control object.&lt;/li&gt;
&lt;li&gt;This object is used by the &lt;code&gt;IF node&lt;/code&gt; to decide whether we continue scraping or stop.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Step 3: IF Node (Stop Scraping or Continue)
&lt;/h4&gt;

&lt;p&gt;Add an &lt;a href="https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.if/" rel="noopener noreferrer"&gt;&lt;code&gt;IF node&lt;/code&gt;&lt;/a&gt; with two conditions, combined with &lt;strong&gt;OR&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Condition 1:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;{{ $json.itemsFound }}&lt;/code&gt; is equal to &lt;code&gt;0&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Meaning → The current page returned no items → we’ve reached the end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Condition 2:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;{{ $json.Page }}&lt;/code&gt; is greater than or equal to &lt;code&gt;YOUR_MAX_PAGE&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Meaning → Stop when you reach the max page number you set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5c3w4xg3660vc40djo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7y5c3w4xg3660vc40djo.png" alt="If loop node n8n" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Together these conditions help the workflow decide:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IF → True&lt;/strong&gt;&lt;br&gt;
Stop scraping and move to the export step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IF → False&lt;/strong&gt;&lt;br&gt;
Go back to the &lt;code&gt;Page Manager&lt;/code&gt;, increment the page number, and keep scraping.&lt;/p&gt;

&lt;p&gt;This creates a complete and safe pagination loop.&lt;/p&gt;
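&lt;p&gt;For reference, the two IF conditions reduce to a single boolean check. Here’s a tiny sketch (the &lt;code&gt;shouldStop&lt;/code&gt; helper and the &lt;code&gt;MAX_PAGE&lt;/code&gt; value are hypothetical, just for illustration):&lt;/p&gt;

```javascript
// Mirrors the IF node: stop when the page is empty OR the max page is reached.
const MAX_PAGE = 5; // stand-in for YOUR_MAX_PAGE

function shouldStop(itemsFound, page, maxPage) {
  return itemsFound === 0 || page >= maxPage;
}

console.log(shouldStop(20, 2, MAX_PAGE)); // false: keep scraping
console.log(shouldStop(0, 3, MAX_PAGE));  // true: empty page, stop
console.log(shouldStop(20, 5, MAX_PAGE)); // true: max page reached, stop
```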
&lt;h4&gt;
  
  
  Step 4: Collect All Results and Export
&lt;/h4&gt;

&lt;p&gt;When the IF node returns True, add one more small Code node before the Convert To File node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Get the global data&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$getWorkflowStaticData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;global&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Get the array of books, or an empty array if it doesn't exist&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allBooks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;workflowStaticData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;workBooks&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="c1"&gt;// Return all the books as standard n8n items&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;allBooks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;book&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynjomyhxecffj4klmpab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynjomyhxecffj4klmpab.png" alt="Data Scraping N8N" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this one does:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pulls everything we stored in the temporary memory.&lt;/li&gt;
&lt;li&gt;Returns it as normal n8n items.&lt;/li&gt;
&lt;li&gt;These go straight into Convert To File → CSV.&lt;/li&gt;
&lt;/ul&gt;
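&lt;p&gt;The &lt;strong&gt;Convert To File&lt;/strong&gt; node does the CSV work for you, but if you’re curious what that conversion amounts to, here’s a rough Code-node sketch (the &lt;code&gt;toCsv&lt;/code&gt; helper is illustrative only; the built-in node handles quoting edge cases and is the safer choice):&lt;/p&gt;

```javascript
// Rough equivalent of what Convert To File does for CSV export.
// Minimal quoting only; prefer the built-in node in real workflows.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const headers = Object.keys(rows[0]);
  const escapeCell = v => '"' + String(v).replace(/"/g, '""') + '"';
  const lines = [headers.join(',')];
  for (const row of rows) {
    lines.push(headers.map(h => escapeCell(row[h])).join(','));
  }
  return lines.join('\n');
}

console.log(toCsv([
  { name: 'A Light in the Attic', price: '£51.77' },
  { name: 'Tipping the Velvet', price: '£53.74' }
]));
```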

&lt;blockquote&gt;
&lt;p&gt;And that’s the entire pagination workflow.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjdz86w1ynumt4xj1a3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjdz86w1ynumt4xj1a3x.png" alt="Pagination in N8N" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Infinite Scroll
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;This one is much simpler.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some websites load content as you scroll; there are no traditional page numbers.&lt;br&gt;
The &lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; supports browser actions, which makes this easy.&lt;/p&gt;

&lt;p&gt;Just add one line to our original &lt;strong&gt;cURL command&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://quotes.toscrape.com/scroll", "browserHtml": true, "actions": [{ "action": "scrollBottom" }]}' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6028qtzxo3b3klmo3llu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6028qtzxo3b3klmo3llu.png" alt="Infinite Scroll in N8N" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zyte API loads the page in a real browser session.&lt;/li&gt;
&lt;li&gt;It scrolls to the bottom, triggering all JavaScript that loads additional items.&lt;/li&gt;
&lt;li&gt;Then it returns the final, fully loaded &lt;code&gt;browserHtml&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;You can parse this HTML normally using the same nodes from Part 1.&lt;/li&gt;
&lt;/ul&gt;
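&lt;p&gt;If you’d rather set the HTTP Request node body with an expression instead of importing cURL, the same request body can be built as a plain object. A minimal sketch (the &lt;code&gt;buildScrollPayload&lt;/code&gt; helper is hypothetical; the field names match the cURL command above):&lt;/p&gt;

```javascript
// Same request body as the cURL command, built as a plain object.
function buildScrollPayload(url) {
  return {
    url,
    browserHtml: true,
    actions: [{ action: 'scrollBottom' }]
  };
}

console.log(JSON.stringify(buildScrollPayload('https://quotes.toscrape.com/scroll')));
```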




&lt;h2&gt;
  
  
  Geolocation
&lt;/h2&gt;

&lt;p&gt;Some websites return different data depending on your region.&lt;br&gt;
Zyte API makes this super simple by allowing you to specify a geolocation.&lt;/p&gt;

&lt;p&gt;Use this inside an HTTP Request node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "http://ip-api.com/json", "browserHtml": true, "geolocation": "AU" }' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F638etacjkchl2641595w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F638etacjkchl2641595w.png" alt="Geolocation Scraping" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting &lt;code&gt;"geolocation": "AU"&lt;/code&gt; makes Zyte perform the browser request from that region. Check the list of all available &lt;a href="https://docs.zyte.com/zyte-api/usage/reference.html#operation/extract/request/geolocation/?utm_campaign=Discord_geo&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;country codes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Many websites serve region-based content (pricing, currencies, language, product availability), so this is extremely helpful.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Screenshots
&lt;/h2&gt;

&lt;p&gt;If you’d like to grab a screenshot of what the browser rendered, you can do that too.&lt;/p&gt;

&lt;p&gt;cURL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://toscrape.com", "screenshot": true }' \
   https://api.zyte.com/v1/extract

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;It will return the screenshot as &lt;code&gt;Base64&lt;/code&gt; data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft288ynokh6slmc89w6g8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft288ynokh6slmc89w6g8.png" alt="Base64 Scraping" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To convert it into a proper image (PNG, JPEG, etc.), use the &lt;strong&gt;Convert To File&lt;/strong&gt; node in n8n.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1ffw7a2ljtl3dd7e4yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1ffw7a2ljtl3dd7e4yu.png" alt="Scraping Screenshot" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;
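&lt;p&gt;Under the hood, that conversion is just a Base64 decode. A Code-node style sketch, assuming the screenshot string is already in hand (the PNG check is only a sanity test):&lt;/p&gt;

```javascript
// Decode the Base64 screenshot string into raw bytes.
// In n8n, the Convert To File node does this step for you.
function decodeScreenshot(base64String) {
  return Buffer.from(base64String, 'base64');
}

// PNG data starts with the bytes 0x89 'P' 'N' 'G'.
const bytes = decodeScreenshot('iVBORw0KGgo=');
console.log(bytes.toString('latin1').slice(1, 4)); // "PNG"
```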

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;n8n often converts boolean values like &lt;code&gt;true&lt;/code&gt; into &lt;code&gt;"true"&lt;/code&gt; when importing via cURL.&lt;br&gt;
Fix it by clicking the gear icon → &lt;strong&gt;Add Expression&lt;/strong&gt; → &lt;code&gt;{{true}}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3b4gelimpj8u34ydx2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3b4gelimpj8u34ydx2g.png" alt="Field Scraping" width="800" height="1273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or switch body mode to &lt;strong&gt;Using JSON&lt;/strong&gt; and paste:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://toscrape.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"screenshot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futluvudf5o2ja7fhfojm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futluvudf5o2ja7fhfojm.png" alt="JSON Scraping" width="800" height="1261"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Network Capture
&lt;/h2&gt;

&lt;p&gt;Many modern websites load content through background API calls rather than raw HTML.&lt;br&gt;
You can capture that network activity during rendering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{"url": "https://quotes.toscrape.com/scroll", "browserHtml": true,  "networkCapture": [
        {
            "filterType": "url",
            "httpResponseBody": true,
            "value": "/api/",
            "matchType": "contains"
        }]}' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns a &lt;code&gt;networkCapture&lt;/code&gt; array with all responses whose URL contains &lt;code&gt;/api/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5r4gbnbcfgrteaqz9r5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5r4gbnbcfgrteaqz9r5.png" alt="Network Capture Scraping" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the Parameters Above&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;filterType: "url"&lt;/code&gt; ⟶ filter network requests by URL&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;value: "/api/"&lt;/code&gt; ⟶ look for URLs containing &lt;code&gt;/api/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;matchType: "contains"&lt;/code&gt; ⟶ how the URL pattern is matched&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;httpResponseBody: true&lt;/code&gt; ⟶ include the response body (Base64)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Extracting data from the captured network response&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can decode the Base64 response in two easy ways:&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;1. Using a Function node (Python)&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;&lt;em&gt;(You can also use JS if you prefer)&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get the network capture data
&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkCapture&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Decode base64 and parse JSON
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;decoded_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;httpResponseBody&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decoded_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Return the result
&lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;firstAuthor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg36qwxllt7j0rh3r82ez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg36qwxllt7j0rh3r82ez.png" alt="Decode Base64" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ This method decodes the &lt;code&gt;Base64&lt;/code&gt;-encoded HTTP response, parses it as JSON, and gives you structured data directly. It’s reliable and readable.&lt;/p&gt;
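&lt;p&gt;As noted above, JavaScript works just as well. An equivalent sketch, assuming the same input shape with the Zyte response under &lt;code&gt;json.networkCapture&lt;/code&gt;, and that &lt;code&gt;Buffer&lt;/code&gt; is available in your Code node environment:&lt;/p&gt;

```javascript
// JavaScript equivalent of the Python decode above.
function decodeCapture(item) {
  const capture = item.json.networkCapture[0];
  // Decode Base64 and parse JSON
  const decoded = Buffer.from(capture.httpResponseBody, 'base64').toString('utf-8');
  const data = JSON.parse(decoded);
  return [{
    json: {
      quotes: data.quotes,
      firstAuthor: data.quotes[0].author.name
    }
  }];
}
```

&lt;p&gt;In a Code node you’d call it as &lt;code&gt;return decodeCapture($input.first());&lt;/code&gt;.&lt;/p&gt;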

&lt;h5&gt;
  
  
  &lt;strong&gt;2. Using Edit Field Node (No code)&lt;/strong&gt;
&lt;/h5&gt;

&lt;blockquote&gt;
&lt;p&gt;In this method you still need to parse the data, just without writing code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Add an &lt;strong&gt;Edit Fields&lt;/strong&gt; node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mode:&lt;/strong&gt; Add Field&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; decodedData&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type:&lt;/strong&gt; String&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ $json.networkCapture[0].httpResponseBody.base64Decode().parseJson() }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54z28rdgbmagj6ex253v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54z28rdgbmagj6ex253v.png" alt="Decode Base64 in N8N" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;→ This takes the Base64 content, decodes it, parses JSON, and puts the result under &lt;code&gt;decodedData&lt;/code&gt; automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cookies, sessions, headers &amp;amp; IP type (quick guide)
&lt;/h2&gt;

&lt;p&gt;When you move from toy sites to real sites, a few extra controls matter a lot: which IP type you use, whether you keep a session, and what cookies or headers you send.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://www.zyte.com/zyte-api/?utm_campaign=Discord_dev_to&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Zyte API&lt;/a&gt; exposes all these as request fields and you can use them the same way we used &lt;code&gt;browserHtml&lt;/code&gt;, &lt;code&gt;networkCapture&lt;/code&gt; or &lt;code&gt;actions&lt;/code&gt; above (via &lt;code&gt;curl&lt;/code&gt; → &lt;code&gt;Import in n8n HTTP Request node&lt;/code&gt; → Adjust Fields as needed → Extract).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To keep this guide focused, we won’t dive into all the code examples here, but here’s one small one, &lt;em&gt;setting a cookie and getting it back&lt;/em&gt; (&lt;code&gt;requestCookies&lt;/code&gt;), just to show how it integrates.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cookies&lt;/strong&gt; (&lt;em&gt;via&lt;/em&gt; &lt;code&gt;requestCookies&lt;/code&gt; / &lt;code&gt;responseCookies&lt;/code&gt;)
➜ Useful when a website relies on cookies for preferences, language, or maintaining continuity between requests.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl \
   --user YOUR_ZYTE_API_KEY_GOES_HERE: \
   --header 'Content-Type: application/json' \
   --data '{ "url": "https://httpbin.org/cookies", "browserHtml": true,
    "requestCookies": [{ "name": "foo",
            "value": "bar",
            "domain": "httpbin.org"
        }]
}' \
   https://api.zyte.com/v1/extract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7x0xeg6y7gvd4n01r6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7x0xeg6y7gvd4n01r6f.png" alt="Manage Scraping Cookies" width="800" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⟶ This example uses &lt;code&gt;requestCookies&lt;/code&gt;, but &lt;code&gt;responseCookies&lt;/code&gt; works the same way: you simply read cookies from one request and pass them into the next.&lt;/p&gt;
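&lt;p&gt;A minimal sketch of that hand-off in a Code node (the &lt;code&gt;cookiesForNextRequest&lt;/code&gt; helper is hypothetical; the cookie fields match the example above, and the exact &lt;code&gt;responseCookies&lt;/code&gt; shape is documented in the Zyte API reference):&lt;/p&gt;

```javascript
// Read cookies from one response and shape them for the next request.
function cookiesForNextRequest(responseCookies) {
  return responseCookies.map(c => ({
    name: c.name,
    value: c.value,
    domain: c.domain
  }));
}

const next = cookiesForNextRequest([
  { name: 'foo', value: 'bar', domain: 'httpbin.org', path: '/' }
]);
console.log(JSON.stringify(next));
```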

&lt;p&gt;Learn more on &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#cookies/?utm_campaign=Discord_cookies&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;cookies&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Everything else below (sessions, ipType, custom headers) plugs in the same way.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sessions&lt;/strong&gt;&lt;br&gt;
➜ Sessions bundle IP address, cookie jar, and network settings so multiple requests look consistently related. &lt;em&gt;Helpful for multi-step interactions, region-based content, or sites that hate stateless scraping.&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#sessions/?utm_campaign=Discord_sessions&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Sessions&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Headers&lt;/strong&gt;&lt;br&gt;
➜ Add a &lt;code&gt;User-Agent&lt;/code&gt;, &lt;code&gt;Referer&lt;/code&gt;, or any custom header the target site expects: simply define them inside the &lt;code&gt;HTTP Request node&lt;/code&gt; headers.&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#response-headers/?utm_campaign=Discord_headers&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;Headers&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IP Type&lt;/strong&gt; (&lt;code&gt;datacenter&lt;/code&gt; vs &lt;code&gt;residential&lt;/code&gt;)&lt;br&gt;
➜ Some sites vary content based on IP type. Zyte API automatically selects the best option, but you can override it with &lt;code&gt;ipType&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://docs.zyte.com/zyte-api/usage/features.html#ip-type/?utm_campaign=Discord_Iptype&amp;amp;utm_activity=Community&amp;amp;utm_medium=social&amp;amp;utm_source=devto" rel="noopener noreferrer"&gt;IP Types&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;All of these follow the same pattern we’ve already used above.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Where This Takes Us Next
&lt;/h2&gt;

&lt;p&gt;And that’s it for Part 2! 🎉&lt;/p&gt;

&lt;p&gt;We covered a lot more than just pagination, from infinite scroll &amp;amp; geolocation to screenshots, network capture, and the key request fields you’ll use while scraping sites. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What we learned isn’t a complete workflow on its own, but it builds the foundation you’ll use again and again in your scraping workflows.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Part 3, we’ll take everything one step further and combine these patterns into a universal scraper: a reusable, configurable template that can adapt to almost any site with minimal changes.&lt;/p&gt;

&lt;p&gt;Thanks for following along, and feel free to share your workflow, questions, or improvements in the &lt;a href="https://discord.com/invite/eN83rMWqAt" rel="noopener noreferrer"&gt;Extract Data Community&lt;/a&gt;. &lt;br&gt;
Happy scraping! 🕸️✨&lt;/p&gt;

</description>
      <category>programming</category>
      <category>webscraping</category>
      <category>ai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
