<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Saurav Jain</title>
    <description>The latest articles on Forem by Saurav Jain (@sauain).</description>
    <link>https://forem.com/sauain</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F269955%2Fe9c19f8a-fbff-4e31-a547-8a4c178e03f9.jpg</url>
      <title>Forem: Saurav Jain</title>
      <link>https://forem.com/sauain</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sauain"/>
    <language>en</language>
    <item>
      <title>Firecrawl vs. Apify: 2025 guide for AI and data teams</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Mon, 11 Aug 2025 08:42:57 +0000</pubDate>
      <link>https://forem.com/apify/firecrawl-vs-apify-2025-guide-for-ai-and-data-teams-42e3</link>
      <guid>https://forem.com/apify/firecrawl-vs-apify-2025-guide-for-ai-and-data-teams-42e3</guid>
      <description>&lt;p&gt;&lt;em&gt;A detailed comparison of Firecrawl's unified AI-driven scraping and Apify's comprehensive, flexible ecosystem. We explain what each does best&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Fresh, structured &lt;a href="https://apify.com/use-cases/data-for-ai-agents" rel="noopener noreferrer"&gt;web data is the fuel for AI&lt;/a&gt;, including agents, RAG pipelines, competitive‑intelligence dashboards, and change‑monitoring services. Two platforms dominate that space in 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Firecrawl&lt;/strong&gt; – an API‑first crawler that turns any URL into LLM‑ready Markdown/JSON in seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apify&lt;/strong&gt; – a full‑stack scraping platform with thousands of reusable data collection tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll go deep into the differences and what they do best, so you can choose (or combine) wisely.&lt;/p&gt;

&lt;p&gt;Let's kick off with a table of each platform’s features, benefits, and trade‑offs so you can see where each one excels.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core value prop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, semantic scraping API geared for AI&lt;/td&gt;
&lt;td&gt;End-to-end scraping platform (6,000+ data collection tools, proxies, compliance)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stand‑out features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-warmed browsers&lt;br&gt;NL extraction&lt;br&gt;AGPL OSS &amp;amp; self-host&lt;br&gt;Stealth proxies&lt;/td&gt;
&lt;td&gt;Scraper marketplace&lt;br&gt;JS/TS &amp;amp; Py SDKs&lt;br&gt;Global proxy pool + CAPTCHA&lt;br&gt;Cron-scheduler, retries, webhooks&lt;br&gt;SOC 2 Type II and GDPR compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary benefits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sub-second latency on cached pages&lt;br&gt;Prompt-based (no selectors)&lt;br&gt;Predictable credit pricing&lt;/td&gt;
&lt;td&gt;Fine‑grained session &amp;amp; proxy control&lt;br&gt;No‑code operation for analysts; devs extend via code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✔ Fast single‑page fetches&lt;br&gt;✔ Self‑host to avoid vendor lock‑in&lt;br&gt;✔ Low entry cost (free + $16 Hobby tier)&lt;/td&gt;
&lt;td&gt;✔ Breadth: 6,000 off-the-shelf scrapers&lt;br&gt;✔ Effective anti‑blocking technology&lt;br&gt;✔ Monetize your own scrapers; earn rev‑share&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✖ Credits can disappear fast on large crawls&lt;br&gt;✖ Limited built-in scheduling&lt;br&gt;✖ AGPL copyleft for forks&lt;/td&gt;
&lt;td&gt;✖ Actors / CU concepts add a learning curve&lt;br&gt;✖ Consumption costs can spike with inefficient code&lt;br&gt;✖ Cold-start ≈1.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Get data for AI with Apify&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl’s pricing model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl uses a simple credit-based model: 1 page = 1 credit (under standard conditions). That makes it very easy to predict costs. The free tier lets you scrape up to 500 pages before committing financially (no credit card required). Paid plans range from $16 to $333 per month for standard scraping.&lt;/p&gt;

&lt;p&gt;Extraction tasks that go beyond simple scraping consume additional credits or tokens, and Firecrawl offers dedicated “Extract” plans ranging from $89 to $719/month, depending on token volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff614pdy4bwhfc10bhzcm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff614pdy4bwhfc10bhzcm.jpeg" alt="Firecrawl pricing" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify’s pricing model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify combines a subscription and consumption model. You get a base amount of platform credit with each pricing plan, but the consumption rate depends on the resources used. 1 compute unit = 1 gigabyte-hour of RAM.&lt;/p&gt;

&lt;p&gt;It’s harder to predict costs this way because scraping a JavaScript-rendered site that requires browser automation consumes far more than a simple HTML scraper. For example, a browser-based Actor holding 4 GB of RAM for half an hour consumes 2 CUs, while a plain HTTP scraper might cover the same pages with a fraction of a CU.&lt;/p&gt;

&lt;p&gt;That’s why Apify has introduced pay-per-event pricing. Scrapers with this pricing model charge for specific actions rather than just results. For example, &lt;strong&gt;a scraper that charges $5 per run start and $2 per 1,000 results would cost $15 for 5,000 results&lt;/strong&gt;. This can make large-scale scraping jobs cheaper in the long run.&lt;/p&gt;
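
&lt;p&gt;As a quick sanity check, here’s the same math in a few lines of JavaScript (the fees are the illustrative ones from the example above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// Pay-per-event cost from the example above (illustrative fees).
const startFee = 5;      // $ per run start
const perThousand = 2;   // $ per 1,000 results
const results = 5000;

const cost = startFee + (results / 1000) * perThousand;
console.log(cost);       // 15
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;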

&lt;p&gt;Apify's forever-free tier gives you $5 of credit that renews automatically every month, so you can test any scraper on Apify Store without financial commitment (no credit card required). Paid plans range from $39 to $999 per month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3a2v1j16kjq8a0dindk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3a2v1j16kjq8a0dindk.png" alt="Apify pricing" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl and Apify pricing compared&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl (flat credits)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify (pre-paid credit + CU)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hobby: $16 ✅&lt;/td&gt;
&lt;td&gt;Starter: $39 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;100,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standard: $83 ✅&lt;/td&gt;
&lt;td&gt;Scale: $199 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Growth: $333 or Enterprise&lt;/td&gt;
&lt;td&gt;Business: $999 or Enterprise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 500k pages/month? &lt;strong&gt;Firecrawl&lt;/strong&gt; is usually cheaper.&lt;br&gt;Millions of lightweight pages or heavy anti‑bot workflows? A well‑optimized &lt;strong&gt;Apify&lt;/strong&gt; scraper can win on total cost.&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl: Unified AI-driven scraping&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxj0swzvdr3vkuzfzotx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxj0swzvdr3vkuzfzotx.png" alt="Firecrawl - Unified AI-driven scraping" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Firecrawl offers a single, consistent API that handles scraping, crawling, and AI‑driven site navigation, so developers never have to juggle multiple endpoints or bespoke parameters.&lt;/p&gt;

&lt;p&gt;When you request a page, the service decides on‑the‑fly whether it needs a headless browser, waits for all dynamic elements to render, and then applies extraction models that automatically ignore ads, menus, and other noise.&lt;/p&gt;

&lt;p&gt;Instead of writing brittle CSS or XPath rules, you ask for the data in plain language — “product prices and availability,” for example — and Firecrawl returns a clean JSON block that stays stable even when the site’s markup changes.&lt;/p&gt;
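
&lt;p&gt;As a rough sketch of what that looks like with Firecrawl’s Node SDK (package, method, and option names vary between SDK versions, so treat this as illustrative rather than definitive):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// Hedged sketch: prompt-based extraction with Firecrawl’s Node SDK.
import FirecrawlApp from "@mendable/firecrawl-js";

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

// Describe the data in plain language instead of writing selectors.
const result = await app.scrapeUrl("https://example.com/products", {
  formats: ["markdown", "json"],
  jsonOptions: { prompt: "product prices and availability" },
});

console.log(result.json);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;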

&lt;p&gt;The same intelligence powers the crawler: it goes through internal links without sitemaps, skips duplicate content, and can infer which pages matter most from their position in the site hierarchy and linking patterns, all while respecting any boundaries you set.&lt;/p&gt;

&lt;p&gt;For sites that hide information behind clicks, forms, or paginated views, you can enable the FIRE‑1 agent. It mimics human behavior by clicking “Load More,” filling search fields, and even solving simple CAPTCHAs, eliminating special‑case code.&lt;/p&gt;

&lt;p&gt;Performance optimizations run throughout the platform. Recently scraped pages come from cache in milliseconds, hundreds of URLs can be batched in a single call for parallel processing, and converting HTML to lightweight Markdown cuts the token count for downstream LLMs by roughly two‑thirds. The net effect is faster, lower‑maintenance data collection that remains reliable as websites evolve.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify: A comprehensive, flexible ecosystem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w9320o95aph7gfz8g8u.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w9320o95aph7gfz8g8u.jpeg" alt="Apify: A comprehensive, flexible ecosystem, not just a web scraping API" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apify approaches web scraping as an ecosystem problem rather than a single‑tool exercise. Its foundation is the Actor system — self‑contained programs that run in Apify’s cloud with uniform input/output, shared storage, and common scheduling and monitoring. Because every Actor behaves the same way, you can link them into multistep workflows just by passing data from one to the next.&lt;/p&gt;

&lt;p&gt;This standardization fuels &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;, a marketplace of more than 6,000 pre‑built scrapers and automation tools maintained by domain specialists. If you need Amazon product data, Instagram follower stats, or a one‑off government registry crawl, chances are an Actor already exists and is kept up to date as site layouts change, so you rarely start from a blank page.&lt;/p&gt;
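
&lt;p&gt;Running a Store Actor from code takes only a few lines with the &lt;code&gt;apify-client&lt;/code&gt; package. A minimal sketch (the Actor ID and input fields are illustrative; every Actor defines its own input schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start an Actor run and wait for it to finish.
const run = await client.actor("apify/web-scraper").call({
  startUrls: [{ url: "https://example.com" }],
});

// Each run writes its results to a default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items.length, "items scraped");
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;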

&lt;p&gt;When an off‑the‑shelf Actor doesn’t fit, you can build your own with the Apify SDK (also released as the open‑source &lt;a href="https://crawlee.dev/?__hstc=160404322.3b3599158ad751038b19c751c187f023.1696600257709.1753948779266.1753948924434.739&amp;amp;__hssc=160404322.1.1753948924434&amp;amp;__hsfp=3246212229" rel="noopener noreferrer"&gt;Crawlee library&lt;/a&gt;). The SDK offers high‑level helpers — request queues, automatic retries, error handling, parallel processing — while still letting you drop down to raw Puppeteer, Playwright, or HTTP calls when necessary. It supports both JavaScript/TypeScript and Python, making it &lt;strong&gt;easy to slot into existing codebases&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Infrastructure management is largely hands‑off. Apify autoscales compute instances, rotates datacenter or residential proxies by geography, and applies anti‑detection tactics such as browser‑fingerprint randomization, human‑like delays, and outsourced CAPTCHA solving. Enterprise‑grade features cover the operational side: detailed run statistics, alerting, conditional scheduling (e.g., “scrape 100 sites at 06:00 only if yesterday’s run succeeded”), real‑time webhooks for downstream systems, and configurable result retention to satisfy compliance audits or historical analyses.&lt;/p&gt;

&lt;p&gt;In practice, the platform lets teams mix and match ready‑made Actors, custom code, and reliable infrastructure to &lt;strong&gt;solve diverse scraping tasks without rebuilding each time&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Technical feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API surface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single base URL with multiple HTTP/JSON endpoints under one uniform API&lt;/td&gt;
&lt;td&gt;Each Actor exposes a standard REST interface: you POST to the “run” endpoint with a JSON input and then GET results from key‑value stores or datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynamic‑content handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatically spins up a headless browser when JS rendering is detected, with no extra configuration&lt;/td&gt;
&lt;td&gt;You choose per Actor: Puppeteer, Playwright, Cheerio, raw HTTP, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extraction approach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;“Zero‑selector” extraction: ML/NLP models parse pages into JSON or Markdown out‑of‑the‑box&lt;/td&gt;
&lt;td&gt;Code‑based extraction inside each Actor using Crawlee's page handlers or raw selectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workflow composition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In a single request you can switch between scrape, crawl, or the FIRE‑1 “agent” mode using flags&lt;/td&gt;
&lt;td&gt;Chain Actors/tasks asynchronously using shared storage, datasets, or webhooks to build multistep pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browser automation / navigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FIRE‑1 agent clicks buttons, paginates, fills forms, solves simple CAPTCHAs&lt;/td&gt;
&lt;td&gt;Automation coded per Actor; CAPTCHA solving baked in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anti‑detection &amp;amp; proxy support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully managed proxies, rotating sessions, fingerprint spoofing and geotargeting are all built‑in&lt;/td&gt;
&lt;td&gt;Apify Proxy (datacenter &amp;amp; residential) with session rotation, geo‑targeting and spoofing; you enable it per Actor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caching &amp;amp; batching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global intelligent cache for recently fetched pages; supports batching hundreds of URLs in one API call&lt;/td&gt;
&lt;td&gt;Request queues with autoscaled parallelism; per‑Actor caching logic is up to you (e.g. dataset dedupe, custom cache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Firecrawl auto‑scales browser instances for concurrent requests&lt;/td&gt;
&lt;td&gt;Platform scales Actors and browser instances elastically according to queue depth and configured concurrency limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring &amp;amp; scheduling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metrics and scheduling built into the single API dashboard&lt;/td&gt;
&lt;td&gt;Per‑Actor run metrics, alerts, conditional scheduling, and webhook triggers via Apify Console&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data format optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Converts HTML → Markdown to cut LLM tokens by ≈ 67%&lt;/td&gt;
&lt;td&gt;Returns whatever the Actor emits (HTML, JSON, CSV, etc.); no built‑in token reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language / SDK support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any language that can call HTTP/JSON (official clients in Python, Node, Go, Rust, C#, etc.)&lt;/td&gt;
&lt;td&gt;Official SDKs in JavaScript/TypeScript and Python (Apify SDK / Crawlee); other languages via raw REST calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How each platform scales
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl scales like a classic SaaS API. Your subscription tier defines a fixed pool of headless‑browser workers, and the service queues requests behind those workers. Because the queueing and resource allocation are handled centrally, you get stable latency and never touch infrastructure. Even the mid‑tier plans can move thousands of pages per minute thanks to caching and smart routing, so the raw worker counts rarely become a choke point.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify takes a looser, cloud‑native approach. You launch as many Actors as your budget allows, and the platform spins up containers to run them in parallel. That “elastic infinity” is perfect for bursty jobs — say, crawling an entire retail catalog overnight or tracking tens of thousands of social‑media accounts — because you can flood the platform with work and pay only for the compute you burn. Automatic retries and detailed run logs keep large Actor swarms reliable.&lt;/p&gt;

&lt;p&gt;The marketplace magnifies &lt;strong&gt;Apify’s scale advantage&lt;/strong&gt;: if your project hits 50 different sites, chances are someone has already published an Actor for most of them. You trade weeks of scraper authoring for coordination logic — managing many Actors’ versions, schedules, and output formats — but you &lt;strong&gt;gain speed and capacity on day one&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Integrating Firecrawl into AI pipelines&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl treats integration the same way it treats scraping: make the routine path effortless and keep the escape hatches open.&lt;/p&gt;

&lt;p&gt;Official SDKs for Python, JavaScript, Rust, and Go expose idiomatic methods, but the real strength lies in direct hooks to AI tooling. A native LangChain loader, for example, can be wired up in just a few lines; it fetches pages, paginates automatically, preserves attribution metadata, and delivers chunks that are ready for embeddings or other RAG workflows. LlamaIndex receives similar first‑class support with retrievers that fetch, de‑duplicate, and format content for chatbots or summarization agents while keeping token counts under control.&lt;/p&gt;
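
&lt;p&gt;A minimal LangChain.js sketch, assuming the community &lt;code&gt;FireCrawlLoader&lt;/code&gt; (import paths shift between LangChain releases):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";

const loader = new FireCrawlLoader({
  url: "https://example.com/docs",
  apiKey: process.env.FIRECRAWL_API_KEY,
  mode: "scrape", // "crawl" walks the whole site instead
});

// Returns LangChain Document objects ready for chunking and embedding.
const docs = await loader.load();
console.log(docs[0].pageContent.slice(0, 200));
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;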

&lt;p&gt;Outside pure code, Firecrawl blocks appear in Make.com and Zapier, so non‑developers can drag‑and‑drop flows — say, watch a competitor’s site, extract new product data in plain English, and update a shared spreadsheet — without touching a script. The result is a scraper that plugs into AI stacks and no‑code tools with equal ease.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify integrations for production workflows&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify’s connectors reflect a more mature, enterprise‑centric agenda: it's designed to sit inside CI/CD pipelines and data‑engineering stacks, not just trigger one‑off jobs.&lt;/p&gt;

&lt;p&gt;The Zapier app goes beyond “run an Actor” by adding triggers for job completion, built‑in data filters, and error‑handling branches, so a sales team can scrape LinkedIn, enrich leads with a second Actor, and post qualified prospects to a CRM in a single Zap.&lt;/p&gt;

&lt;p&gt;The GitHub integration lets teams treat scraper code like any other service — commit, test, review, and deploy via GitHub Actions — bringing familiar DevOps discipline to crawling logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foubbz5i3hiyhctiz1706.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foubbz5i3hiyhctiz1706.jpeg" alt="LangChain and LlamaIndex integration - Apify" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For AI tooling, Apify has loaders for LangChain and LlamaIndex, integrations with Hugging Face and Haystack, and connectors for vector databases such as Pinecone and Qdrant.&lt;/p&gt;
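
&lt;p&gt;A hedged LangChain.js sketch with the community &lt;code&gt;ApifyDatasetLoader&lt;/code&gt; (the dataset ID and item fields below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { Document } from "@langchain/core/documents";

// Turn items from a finished Actor run’s dataset into LangChain Documents.
const loader = new ApifyDatasetLoader("your-dataset-id", {
  datasetMappingFunction: (item) =&gt;
    new Document({
      pageContent: item.text || "",
      metadata: { source: item.url },
    }),
});

const docs = await loader.load();
console.log(docs.length, "documents loaded");
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;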

&lt;p&gt;For data pipelines, Apify ships connectors that push results straight into S3, GCS, or Azure Blob, or stream them through webhooks to Kafka or Pub/Sub, &lt;strong&gt;turning the platform into a managed data‑ingestion layer&lt;/strong&gt; that feeds downstream analytics with no extra glue code.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Integration feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Official SDKs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python, JavaScript/TypeScript (official), Rust, Go (community)&lt;/td&gt;
&lt;td&gt;JavaScript/TypeScript &amp;amp; Python (via Apify SDK / Crawlee)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI‑framework hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native loaders/retrievers for LangChain and LlamaIndex (handles pagination, chunking, metadata, token‑cost control)&lt;/td&gt;
&lt;td&gt;Official loaders for LangChain and LlamaIndex; individual Actors may embed extra AI logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No‑code / automation tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Make and Zapier blocks for drag‑and‑drop flows&lt;/td&gt;
&lt;td&gt;Zapier app with branching, filters, and error handling for complex automations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DevOps / CI/CD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;GitHub/Bitbucket integration for version control, tests, and CI‑based deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data‑pipeline connectors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;– (pull via API or SDK)&lt;/td&gt;
&lt;td&gt;Export Actors to S3, GCS, Azure; real‑time streaming to Kafka/Pub/Sub/webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Webhook support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crawl/scrape lifecycle webhooks for async callbacks&lt;/td&gt;
&lt;td&gt;Webhooks on Actor events; plus metamorph &amp;amp; transform steps for streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token‑usage optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTML→Markdown plus loader‑level chunk sizing &amp;amp; token‑budget helpers&lt;/td&gt;
&lt;td&gt;Up to individual Actor / downstream loader; no platform‑wide token controls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Picking the right platform for your workload
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;When Firecrawl is the better fit&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Choose Firecrawl when &lt;strong&gt;low‑latency&lt;/strong&gt; access to web data and &lt;strong&gt;tight coupling with AI pipelines&lt;/strong&gt; are top priorities. Its uniform API, natural‑language extraction, and sub‑second response times let you build chatbots, RAG systems, or research agents without wrestling with scraping logic. Pricing is credit‑based and predictable. If you know you’ll process about 50k pages a month, you can budget the spend to the dollar and count on the same performance every day. The open‑source core offers a self‑host option for teams with data‑residency mandates or those who simply want an exit ramp from the hosted service.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;When Apify delivers more value&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify is the best option when &lt;strong&gt;breadth&lt;/strong&gt;, &lt;strong&gt;flexibility&lt;/strong&gt;, and &lt;strong&gt;enterprise&lt;/strong&gt; guarantees matter. Its marketplace of 6,000‑plus maintained Actors means you can &lt;strong&gt;cover dozens of sites in hours&lt;/strong&gt; instead of building each scraper yourself. Non‑developers can launch and schedule those Actors through a web UI, while engineering teams still have full SDK control when needed. SOC 2 compliance, GDPR alignment, and a history of &lt;strong&gt;large‑scale deployments&lt;/strong&gt; satisfy procurement checklists, and the platform’s elastic Actor model handles everything from tiny HTML scrapes to overnight catalog crawls without manual capacity planning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Get started with Apify&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>5 best JavaScript web scraping libraries in 2025</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 24 Jul 2025 18:30:00 +0000</pubDate>
      <link>https://forem.com/apify/5-best-javascript-web-scraping-libraries-in-2025-4mf2</link>
      <guid>https://forem.com/apify/5-best-javascript-web-scraping-libraries-in-2025-4mf2</guid>
      <description>&lt;h3&gt;
  
  
  Even if you're not a JavaScript developer, it's worth knowing these libraries for parsing HTML, interacting with pages, and dealing with dynamic content.
&lt;/h3&gt;

&lt;p&gt;Websites are becoming increasingly complex and dynamic. The modern web is full of JavaScript-rendered apps that load content asynchronously, use auth systems involving multiple steps and JavaScript-based token handling, and block scraping bots.&lt;/p&gt;

&lt;p&gt;That's why JavaScript is still a great choice for collecting web data in 2025. But if you're a developer new to web scraping or unfamiliar with the JavaScript language, you're probably wondering which libraries and frameworks you should try.&lt;/p&gt;

&lt;p&gt;At Apify, we've been scraping the web with JavaScript and Node.js for a decade. This selection of 5 libraries is informed by our experience of using them for data extraction, from parsing HTML to navigating web pages and scraping dynamic content.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Crawlee
&lt;/h2&gt;

&lt;p&gt;Tackling complexity, stealth, and scalability in one package&lt;/p&gt;

&lt;p&gt;Juggling multiple libraries for requests, parsing, browser automation, and crawling logic quickly becomes a maintenance headache. You end up writing glue code to handle queues, rotate proxies, and merge results, only to find you still get blocked or stuck once you try to scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crawlee.dev/js?__hstc=160404322.210f49410a426b754a1e9d8c7a300289.1753096943609.1753284250052.1753339598956.12&amp;amp;__hssc=160404322.11.1753339598956&amp;amp;__hsfp=2557066438" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;, developed by the Apify team, unifies everything under a single interface. Out of the box, it mimics real browsers (headers, TLS fingerprints, and even stealth plugins) so you avoid common anti-bot defenses without manual header or fingerprint tweaking. Instead of wiring together Cheerio + Playwright/Puppeteer + queue managers, Crawlee provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Switchable crawler classes&lt;/strong&gt;: &lt;code&gt;CheerioCrawler&lt;/code&gt; for static HTML, &lt;code&gt;PlaywrightCrawler&lt;/code&gt; or &lt;code&gt;PuppeteerCrawler&lt;/code&gt; for dynamic pages, all sharing a common configuration style.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in queue management&lt;/strong&gt;: Breadth-first or depth-first crawling with concurrency settings, retry logic, and automatic backoff. You define start URLs; Crawlee handles enqueuing, prioritization, and scheduling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic proxy rotation and session handling&lt;/strong&gt;: Effectively rotate proxies or manage cookies and browser contexts, so you stay under rate limits and maintain logins across multiple pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pluggable data storages&lt;/strong&gt;: Datasets (JSON, CSV, or key-value stores) appear in a local “datasets” directory, making it trivial to persist results or resume failed crawls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle hooks and customizability&lt;/strong&gt;: Logging, error handling, and custom request handlers via routers, so you can insert your own logic at enqueue, request success, or failure without rewriting core code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native integration with the Apify platform&lt;/strong&gt;: Once your crawler is ready, running &lt;code&gt;apify push&lt;/code&gt; deploys it, and Apify handles autoscaling, proxy billing, and data exports. No extra configuration needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starter templates and file structure&lt;/strong&gt;: When you run &lt;code&gt;npx crawlee create my-crawler&lt;/code&gt;, you get a &lt;code&gt;main.js&lt;/code&gt; and &lt;code&gt;routes.js&lt;/code&gt; setup. Boilerplate code means you can focus on selectors rather than instantiating browser instances, setting headers, or wiring queues. The default file structure looks like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/my-crawler
├── main.js      &lt;span class="c"&gt;# entry point: initializes crawler class and starts run()&lt;/span&gt;
├── routes.js    &lt;span class="c"&gt;# defines request handlers via createCheerioRouter/createPlaywrightRouter&lt;/span&gt;
├── storages/
│   ├── datasets/          &lt;span class="c"&gt;# where results are stored as JSON files per page&lt;/span&gt;
│   └── key-value-stores/  &lt;span class="c"&gt;# storage for arbitrary binary data (images, videos, JSON files…)&lt;/span&gt;
└── package.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (Cheerio crawler)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// routes.js&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;createCheerioRouter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createCheerioRouter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addDefaultHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`enqueueing new URLs`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Finds “next” pages and enqueues them&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;globs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/?p=*&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Extract post URL, title, rank&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;postUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toArray&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Push to dataset for automatic file output&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
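
&lt;p&gt;The &lt;code&gt;main.js&lt;/code&gt; counterpart is only a few lines: it wires the router into a crawler class and starts the run. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// main.js
import { CheerioCrawler } from "crawlee";
import { router } from "./routes.js";

const crawler = new CheerioCrawler({
  requestHandler: router,
  maxRequestsPerCrawl: 50, // illustrative cap for local testing
});

await crawler.run(["https://news.ycombinator.com/"]);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;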



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity:&lt;/strong&gt; No more separate proxy rotation libraries, queue managers, or manual header generators. Crawlee handles it all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling&lt;/strong&gt;: You can start locally and then deploy to the Apify platform, where it auto‐scales, monitors memory/CPU, and logs failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance&lt;/strong&gt;: Switching from CheerioCrawler to PlaywrightCrawler only requires changing one import and maybe tweaking selectors. The core logic stays the same.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Impit
&lt;/h2&gt;

&lt;p&gt;Making browser impersonation simple&lt;/p&gt;

&lt;p&gt;Sending vanilla HTTP requests often gets you blocked by modern anti-scraping systems. You might spend hours rotating user-agents, randomizing delays, or solving CAPTCHAs manually, only to still find your IP banned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/apify/impit" rel="noopener noreferrer"&gt;Impit&lt;/a&gt; is an HTTP client for Node.js and Python, based on Rust’s &lt;code&gt;reqwest&lt;/code&gt;, specifically tailored for scraping. Instead of wrestling with header spoofing or TLS fingerprinting yourself, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic fingerprint spoofing:&lt;/strong&gt; Pick from a library of existing browser fingerprints, and impit builds a full set of realistic HTTP headers and matching TLS settings. This makes your requests indistinguishable from browser requests and reduces detection risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated &lt;code&gt;tough-cookie&lt;/code&gt; support&lt;/strong&gt;: Handle session cookies out of the box, so you can maintain login sessions or track redirects using the most popular JS cookie library.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;fetch&lt;/code&gt; API:&lt;/strong&gt; Impit implements a subset of the well-known &lt;code&gt;fetch&lt;/code&gt; API (&lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API" rel="noopener noreferrer"&gt;MDN&lt;/a&gt;), so you can write your scrapers without having to read lengthy docs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy integration&lt;/strong&gt;: Support for HTTP and HTTPS proxies via a single option, so you can rotate IPs with minimal code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (impersonating Firefox)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Impit&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;impit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchHtml&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;impit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Impit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;firefox&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;http3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;impit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; &lt;span class="c1"&gt;// raw HTML&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;fetchHtml&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
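
&lt;p&gt;Routing the same request through a proxy is the “single option” mentioned above. A sketch (check impit’s README for the exact option name in your version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { Impit } from "impit";

// Same fetch, routed through a proxy (the URL is illustrative).
const impit = new Impit({
  browser: "firefox",
  proxyUrl: "http://user:pass@proxy.example.com:8000",
});

const response = await impit.fetch("https://news.ycombinator.com/");
console.log(response.status);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;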



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stealth:&lt;/strong&gt; You no longer manually assemble user-agent strings or randomize headers; impit handles the most common anti-bot checks for you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling:&lt;/strong&gt; Configurable retries and timeouts mean fewer surprises when a request fails.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Cheerio
&lt;/h2&gt;

&lt;p&gt;Converting unruly HTML into structured data&lt;/p&gt;

&lt;p&gt;Plain HTML is cluttered: nested tags, inconsistent class names, and no programmatic way to navigate the DOM on the server. If you’ve written custom regex or string-based parsers, you know how brittle that can be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cheerio.js.org/" rel="noopener noreferrer"&gt;Cheerio&lt;/a&gt; loads raw HTML into a fast, jQuery-like API on the server. You can query for elements, attributes, and text using familiar CSS selectors, then extract exactly what you need without worrying about manual string manipulation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (parsing Hacker News)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;gotScraping&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;got-scraping&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cheerio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchTitles&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;gotScraping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;fetchTitles&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Robustness:&lt;/strong&gt; No more fragile regex. With Cheerio, you use &lt;code&gt;.find()&lt;/code&gt;, &lt;code&gt;.text()&lt;/code&gt;, and &lt;code&gt;.attr()&lt;/code&gt; just like jQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Cheerio is lightweight and fast, so parsing large HTML documents doesn’t become your scraper’s bottleneck, especially compared to full headless browsers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Familiar syntax:&lt;/strong&gt; If you’ve used jQuery on the front end, there’s almost zero onboarding time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Playwright
&lt;/h2&gt;

&lt;p&gt;Handling JavaScript‐driven, dynamic content reliably&lt;/p&gt;

&lt;p&gt;Many modern websites rely on client-side JavaScript to populate the DOM: lazy loading, infinite scrolling, or data fetched via XHR/AJAX. Cheerio has no power here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://playwright.dev/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;, on the other hand, spins up a real browser (Chromium, Firefox, or WebKit), navigates pages as a human would, waits for selectors or network to idle, and then gives you a fully rendered DOM snapshot. You can even intercept requests to block ads or unwanted resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (Amazon product page)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;playwright&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeAmazon&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#productTitle&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;span.author a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;kindlePrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#formats span.ebook-price-value&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;paperbackPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#tmm-grid-swatch-PAPERBACK .slot-price span&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;hardcoverPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#tmm-grid-swatch-HARDCOVER .slot-price span&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;book&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;scrapeAmazon&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; If the data isn’t in the initial HTML, you need a browser to run the page’s JS. Playwright ensures you get exactly what a real user sees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible waits:&lt;/strong&gt; You can &lt;code&gt;await page.waitForSelector()&lt;/code&gt; or pass &lt;code&gt;waitUntil: "networkidle"&lt;/code&gt; to &lt;code&gt;page.goto()&lt;/code&gt; so you only scrape once the network settles, reducing flaky results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intercepting resources:&lt;/strong&gt; Block images, CSS, or analytics endpoints to speed up scrapes and reduce noise in your logs (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
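
&lt;p&gt;Resource blocking is a one‑liner with &lt;code&gt;page.route()&lt;/code&gt;. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { firefox } from "playwright";

const browser = await firefox.launch({ headless: true });
const page = await browser.newPage();

// Abort requests for heavy or noisy resources before navigating.
await page.route("**/*.{png,jpg,jpeg,gif,css}", (route) =&gt; route.abort());

await page.goto("https://news.ycombinator.com/");
console.log(await page.title());
await browser.close();
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;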

&lt;h2&gt;
  
  
  5. Puppeteer
&lt;/h2&gt;

&lt;p&gt;A Chrome-centric approach to browser automation&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pptr.dev/" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt; and Playwright are pretty much the same thing, except for some minor differences in API, and unlike Playwright, it's limited to JavaScript and Node.js. Puppeteer is older, but only recently did Firefox &lt;a href="https://hacks.mozilla.org/2024/08/puppeteer-support-for-firefox/" rel="noopener noreferrer"&gt;add official support&lt;/a&gt; for Puppeteer.&lt;/p&gt;

&lt;p&gt;Switching to Playwright isn’t difficult, but if you prefer Chromium’s engine, Puppeteer is a good option.&lt;/p&gt;

&lt;p&gt;Puppeteer gives you a headless (or headed) Chrome instance with an easy API for navigation, selection, and evaluation. It supports intercepting requests, generating PDFs, and capturing screenshots. While it doesn’t include the same cross-browser support as Playwright, it’s been around for longer and integrates well with Chrome DevTools Protocol.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (basic Puppeteer scraper)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;puppeteer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeSite&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;networkidle2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="nf"&gt;$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;posts&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;scrapeSite&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Existing Puppeteer code:&lt;/strong&gt; Migrate incrementally or reuse libraries that depend on Puppeteer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome-only features:&lt;/strong&gt; Use the DevTools Protocol to capture screenshots, trace performance, or emulate network conditions without additional dependencies (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight automation needs:&lt;/strong&gt; If you only need a headless Chrome for a few pages and already have an IP rotation or session management solution, Puppeteer might be the simplest choice.&lt;/li&gt;
&lt;/ul&gt;
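
&lt;p&gt;Here’s a short sketch of those Chrome-only capabilities (the throughput and latency numbers are arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import puppeteer from "puppeteer";

async function auditPage() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Emulate a slow connection via the Chrome DevTools Protocol
  const client = await page.createCDPSession();
  await client.send("Network.emulateNetworkConditions", {
    offline: false,
    latency: 400, // extra round-trip time in ms
    downloadThroughput: (500 * 1024) / 8, // ~500 kbit/s
    uploadThroughput: (256 * 1024) / 8,
  });

  await page.goto("https://example.com", { waitUntil: "networkidle2" });
  await page.screenshot({ path: "example.png", fullPage: true });

  await browser.close();
}
auditPage();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;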

&lt;h2&gt;
  
  
  In summary
&lt;/h2&gt;

&lt;p&gt;JavaScript remains a top choice for web scraping in 2025, thanks to its solid ecosystem of open-source libraries, which make it easier to parse HTML, interact with web pages, and deal with dynamic content. In our opinion, these five are the best:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Crawlee&lt;/strong&gt; - A comprehensive, all-in-one scraping framework that handles browser automation, proxy rotation, session management, queuing, and data storage. It simplifies scaling and maintenance by unifying multiple tools under one interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impit&lt;/strong&gt; - A stealthy HTTP client tailored for scraping, with automatic realistic header generation, cookie jar support, and proxy integration - ideal for scraping without a full browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cheerio&lt;/strong&gt; - A fast and lightweight HTML parser that mimics jQuery, perfect for extracting structured data from static HTML without using a browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; - A full browser automation library for scraping JavaScript-rendered sites. It supports multiple browsers, waits for content to load, and intercepts resources, making it highly reliable for dynamic pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Puppeteer&lt;/strong&gt; - A headless Chrome automation tool with strong DevTools support. It’s suitable for existing Puppeteer codebases or lightweight scraping needs focused on Chromium.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whether you’re parsing static HTML or navigating complex, JavaScript-rendered pages, this toolkit helps you choose and combine the best options for performance, stealth, and scalability.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Build and deploy MCP servers in minutes with a TypeScript template</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 24 Jul 2025 07:33:07 +0000</pubDate>
      <link>https://forem.com/apify/build-and-deploy-mcp-servers-in-minutes-with-a-typescript-template-4mi5</link>
      <guid>https://forem.com/apify/build-and-deploy-mcp-servers-in-minutes-with-a-typescript-template-4mi5</guid>
      <description>&lt;h3&gt;
  
  
  Transform any stdio MCP server into a scalable, cloud-hosted service.
&lt;/h3&gt;

&lt;p&gt;Model Context Protocol (&lt;a href="https://blog.apify.com/what-is-model-context-protocol/" rel="noopener noreferrer"&gt;MCP&lt;/a&gt;) is transforming and simplifying how AI applications connect with external tools. While &lt;a href="https://blog.apify.com/how-to-use-mcp/" rel="noopener noreferrer"&gt;we’ve covered how to use MCP with tools that give agents context from the web&lt;/a&gt;, this guide digs deeper into the developer side: how to build and deploy your own MCP servers on the Apify platform.&lt;/p&gt;

&lt;p&gt;With Apify's MCP templates, you can transform any stdio or remote MCP server into a scalable, cloud-hosted service in minutes. There are currently two templates available: &lt;a href="https://apify.com/templates/python-mcp-server" rel="noopener noreferrer"&gt;for Python&lt;/a&gt; and &lt;a href="https://apify.com/templates/ts-mcp-server" rel="noopener noreferrer"&gt;for TypeScript&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll show you how to build an MCP server on Apify with TypeScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why deploy MCP servers on Apify?
&lt;/h2&gt;

&lt;p&gt;Before we get into implementation, here’s why the Apify platform is ideal for hosting MCP servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instant scalability&lt;/strong&gt;: Apify's infrastructure automatically scales based on demand, from a single request to thousands of concurrent connections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in monetization&lt;/strong&gt;: With the &lt;a href="https://help.apify.com/en/articles/10700066-what-is-pay-per-event" rel="noopener noreferrer"&gt;pay-per-event&lt;/a&gt; (PPE) model, you can charge users for each tool request, API call, or custom event, turning your MCP server into a revenue stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent URLs&lt;/strong&gt;: &lt;a href="https://docs.apify.com/platform/actors/development/programming-interface/standby" rel="noopener noreferrer"&gt;Standby mode&lt;/a&gt; provides stable endpoints like &lt;code&gt;https://your-username--your-mcp-server.apify.actor/sse&lt;/code&gt;, perfect for MCP client configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero infrastructure management&lt;/strong&gt;: No servers to maintain, no Docker orchestration, no SSL certificates. Just deploy and run.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding the architecture
&lt;/h2&gt;

&lt;p&gt;When you deploy an MCP server on Apify, you can work with two types of MCP servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio MCP servers&lt;/strong&gt;: Local servers that communicate via standard input/output, which Apify converts to SSE (Server-Sent Events) for remote access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE MCP servers&lt;/strong&gt;: Remote servers that already communicate via HTTP/SSE, which Apify can proxy and enhance with monetization features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexible architecture allows you to turn any type of MCP server into an &lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;Apify Actor&lt;/a&gt;, whether it's a local stdio-based tool or a remote SSE endpoint, and expose it through a unified SSE interface with built-in scaling and monetization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Actors are lightweight, containerized programs that take JSON inputs, execute tasks, and return structured outputs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step-by-step implementation
&lt;/h2&gt;

&lt;p&gt;Let’s walk through building an MCP server on Apify using the TypeScript template.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Choose and create your Actor from the template&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the TypeScript MCP server from template
apify create my-mcp-server --template ts-mcp-server
cd my-mcp-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Configure your MCP server&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Open &lt;code&gt;src/main.ts&lt;/code&gt; and set the &lt;code&gt;MCP_COMMAND&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// For stdio servers:&lt;/span&gt;
&lt;span class="c1"&gt;// Example: Everything MCP server&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MCP_COMMAND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;npx @modelcontextprotocol/server-everything&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// For SSE servers (requires mcp-remote package):&lt;/span&gt;
&lt;span class="c1"&gt;// Custom SSE endpoint&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MCP_COMMAND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;npx mcp-remote https://your-domain.com/mcp-endpoint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Install your MCP server dependencies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Update &lt;code&gt;package.json&lt;/code&gt; with the MCP server dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@modelcontextprotocol/server-everything"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^2025.5.12"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcp-remote"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^0.1.16"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you’re developing the Actor locally, you can use &lt;code&gt;npm install&lt;/code&gt; instead of editing &lt;code&gt;package.json&lt;/code&gt; directly.&lt;/p&gt;
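
&lt;p&gt;For example, from the project directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install @modelcontextprotocol/server-everything mcp-remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;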

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Set up monetization (optional)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;.actor/pay_per_event.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool-request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventTitle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tool Request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventDescription"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Charge for each tool execution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventPriceUsd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then trigger charges in your TypeScript code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;charge&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;eventName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool-request&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Configure Actor settings&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;.actor/actor.json&lt;/code&gt; (the &lt;code&gt;webServerMcpPath&lt;/code&gt; field is what makes the platform recognize the Actor as an MCP server):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"actorSpecification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usesStandbyMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"minMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"webServerMcpPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/sse"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;So&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Actor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;recognized&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;MCP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;server&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 6: Deploy to Apify&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apify login
apify push

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 7: Configure standby mode&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In Apify Console:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your Actor’s settings&lt;/li&gt;
&lt;li&gt;Enable “Standby mode”&lt;/li&gt;
&lt;li&gt;Set idle timeout (e.g., 300 seconds)&lt;/li&gt;
&lt;li&gt;Adjust memory allocation as needed&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 8: Connect your MCP client&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use this URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://your-username--my-mcp-server.apify.actor/sse

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point your MCP client at this URL, and set an &lt;code&gt;Authorization&lt;/code&gt; header carrying your Apify API token: &lt;code&gt;Authorization: Bearer your-token&lt;/code&gt;.&lt;/p&gt;
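
&lt;p&gt;For example, a stdio-based MCP client can reach the server through &lt;code&gt;mcp-remote&lt;/code&gt;. Here’s a sketch of a &lt;code&gt;claude_desktop_config.json&lt;/code&gt;-style entry (the server name and token are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "my-mcp-server": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://your-username--my-mcp-server.apify.actor/sse",
        "--header",
        "Authorization: Bearer your-token"
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;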

&lt;h2&gt;
  
  
  Advanced configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Environment variables&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To set non-sensitive environment variables, use &lt;code&gt;actor.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"environmentVariables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"RATE_LIMIT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Do not put API tokens or other sensitive values in the &lt;code&gt;actor.json&lt;/code&gt; file. Instead, set sensitive environment variables in the Apify Console UI under your Actor's settings when doing the build. For more information on setting custom environment variables, see the &lt;a href="https://docs.apify.com/platform/actors/development/programming-interface/environment-variables#custom-environment-variables" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
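
&lt;p&gt;Inside the Actor, both kinds of variables arrive through &lt;code&gt;process.env&lt;/code&gt; as usual (a minimal sketch; &lt;code&gt;MY_API_TOKEN&lt;/code&gt; is a hypothetical name):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Non-sensitive variable defined in actor.json above
const rateLimit = Number(process.env.RATE_LIMIT ?? '100');

// Sensitive variable set in the Apify Console UI (hypothetical name)
const apiToken = process.env.MY_API_TOKEN;

console.log('Rate limit:', rateLimit);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;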

&lt;h3&gt;
  
  
  &lt;strong&gt;TypeScript template capabilities&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The TypeScript template supports both stdio and SSE server types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio servers&lt;/strong&gt;: Simple command string configuration for local MCP servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE servers&lt;/strong&gt;: Remote server proxying using mcp-remote for connecting to external SSE endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Debugging and monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Local development&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To run the MCP server locally, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;APIFY_META_ORIGIN="STANDBY" ACTOR_WEB_SERVER_PORT=8080 apify run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use the open-source &lt;a href="https://github.com/modelcontextprotocol/inspector" rel="noopener noreferrer"&gt;MCP inspector&lt;/a&gt; for debugging and testing locally.&lt;/p&gt;
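
&lt;p&gt;For example, you can launch the inspector with &lt;code&gt;npx&lt;/code&gt; and point it at the local SSE endpoint (the port matches &lt;code&gt;ACTOR_WEB_SERVER_PORT&lt;/code&gt; above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx @modelcontextprotocol/inspector
# then connect to http://localhost:8080/sse in the inspector UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;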

&lt;h3&gt;
  
  
  &lt;strong&gt;Production monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;View Actor logs in Apify Console&lt;/li&gt;
&lt;li&gt;Set up error/usage alerts&lt;/li&gt;
&lt;li&gt;Track monetization in Analytics&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Troubleshooting common issues&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory errors&lt;/strong&gt;: Increase memory or optimize code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth failures&lt;/strong&gt;: Check tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add more tools&lt;/li&gt;
&lt;li&gt;Integrate with Apify Actors&lt;/li&gt;
&lt;li&gt;Build custom clients&lt;/li&gt;
&lt;li&gt;Share your server on &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Start building&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Deploying MCP servers on the Apify platform turns local tools into scalable, monetizable cloud services. With our TypeScript template and standby mode, you can get a production-ready server running in minutes, whether it's a local stdio server or an existing SSE server.&lt;/p&gt;

&lt;p&gt;Explore what’s possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://apify.com/templates/python-mcp-server" rel="noopener noreferrer"&gt;Python MCP server template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://apify.com/templates/ts-mcp-server" rel="noopener noreferrer"&gt;TypeScript MCP server template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.apify.com/platform/integrations/mcp" rel="noopener noreferrer"&gt;Apify MCP documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;Join the Apify Discord&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>webdev</category>
      <category>javascript</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Fri, 21 Mar 2025 13:51:28 +0000</pubDate>
      <link>https://forem.com/sauain/-4gno</link>
      <guid>https://forem.com/sauain/-4gno</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/crawlee" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8417%2F0d74ff82-4b50-4d84-ba0b-8546be19aa84.png" alt="Crawlee" width="192" height="192"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1964149%2F9ec7b4f2-3e35-4b8f-976c-35b4cb0b5af2.jpg" alt="" width="620" height="620"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/crawlee/how-to-scrape-bluesky-with-python-2ooo" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How to scrape Bluesky with Python&lt;/h2&gt;
      &lt;h3&gt;Max Bohomolov for Crawlee ・ Mar 21&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#webscraping&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#developers&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>developers</category>
    </item>
    <item>
      <title>Inside implementing SuperScraper with Crawlee.</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 06 Mar 2025 07:23:07 +0000</pubDate>
      <link>https://forem.com/crawlee/inside-implementing-superscraper-with-crawlee-1e6o</link>
      <guid>https://forem.com/crawlee/inside-implementing-superscraper-with-crawlee-1e6o</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/apify/super-scraper" rel="noopener noreferrer"&gt;SuperScraper&lt;/a&gt; is an open-source &lt;a href="https://docs.apify.com/platform/actors" rel="noopener noreferrer"&gt;Actor&lt;/a&gt; that combines features from various web scraping services, including &lt;a href="https://www.scrapingbee.com/" rel="noopener noreferrer"&gt;ScrapingBee&lt;/a&gt;, &lt;a href="https://scrapingant.com/" rel="noopener noreferrer"&gt;ScrapingAnt&lt;/a&gt;, and &lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;ScraperAPI&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;A key capability is its standby mode, which runs the Actor as a persistent API server. This removes the usual start-up times - a common pain point in many systems - and lets users make direct API calls to interact with the system immediately.&lt;/p&gt;

&lt;p&gt;This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/apify" rel="noopener noreferrer"&gt;
        apify
      &lt;/a&gt; / &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;
        crawlee
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;
    &lt;a href="https://crawlee.dev" rel="nofollow noopener noreferrer"&gt;
        
          
          &lt;img alt="Crawlee" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fapify%2Fcrawlee%2Fmaster%2Fwebsite%2Fstatic%2Fimg%2Fcrawlee-light.svg%3Fsanitize%3Dtrue" width="500"&gt;
        
    &lt;/a&gt;
    &lt;br&gt;
    A web scraping and browser automation library
&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;
    &lt;a href="https://www.npmjs.com/package/@crawlee/core" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/0254a8a134a98b27e3c3a22da5bfb90fae31709d0cd76d0370b06426dcaeb116/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f762f40637261776c65652f636f72652e737667" alt="NPM latest version"&gt;&lt;/a&gt;
    &lt;a href="https://www.npmjs.com/package/@crawlee/core" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/a9a876e2cefe6ad81ded3d089f0a0506366658d75c28fe98e944327943b39521/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f646d2f40637261776c65652f636f72652e737667" alt="Downloads"&gt;&lt;/a&gt;
    &lt;a href="https://discord.gg/jyEM2PRvMU" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/afa7e1b87d07c3e1531206ff182c487373079a12f504d7a95b371871ca09d3b0/68747470733a2f2f696d672e736869656c64732e696f2f646973636f72642f3830313136333731373931353537343332333f6c6162656c3d646973636f7264" alt="Chat on discord"&gt;&lt;/a&gt;
    &lt;a href="https://github.com/apify/crawlee/actions/workflows/test-ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/apify/crawlee/actions/workflows/test-ci.yml/badge.svg?branch=master" alt="Build Status"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Crawlee covers your crawling and scraping end-to-end and &lt;strong&gt;helps you build reliable scrapers. Fast.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs.&lt;/p&gt;

&lt;p&gt;Crawlee is available as the &lt;a href="https://www.npmjs.com/package/crawlee" rel="nofollow noopener noreferrer"&gt;&lt;code&gt;crawlee&lt;/code&gt;&lt;/a&gt; NPM package.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;👉 &lt;strong&gt;View full documentation, guides and examples on the &lt;a href="https://crawlee.dev" rel="nofollow noopener noreferrer"&gt;Crawlee project website&lt;/a&gt;&lt;/strong&gt; 👈&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
&lt;p&gt;Crawlee for Python is open for early adopters. 🐍  &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;👉 Checkout the source code 👈&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;We recommend visiting the &lt;a href="https://crawlee.dev/docs/introduction" rel="nofollow noopener noreferrer"&gt;Introduction tutorial&lt;/a&gt; in Crawlee documentation for more information.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Crawlee requires &lt;strong&gt;Node.js 16 or higher&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;With Crawlee CLI&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;The fastest way to try Crawlee out is to use the &lt;strong&gt;Crawlee CLI&lt;/strong&gt; and choose the &lt;strong&gt;Getting started example&lt;/strong&gt;. The CLI…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;h3&gt;
  
  
  What is SuperScraper?
&lt;/h3&gt;

&lt;p&gt;SuperScraper transforms a traditional scraper into an API server. Instead of running with static inputs and waiting for completion, it starts only once, stays active, and listens for incoming requests. &lt;/p&gt;
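
&lt;p&gt;For example, once deployed, scraping a page is a single HTTP call. A sketch (the Actor URL is a placeholder, and the &lt;code&gt;url&lt;/code&gt; parameter follows the ScrapingBee-style API that SuperScraper emulates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl "https://your-username--super-scraper-api.apify.actor/?url=https://example.com" \
  -H "Authorization: Bearer your-apify-token"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;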

&lt;h3&gt;
  
  
  How to enable standby mode
&lt;/h3&gt;

&lt;p&gt;To activate standby mode, enable it in the Actor’s settings so the Actor keeps running and listens for incoming requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegtktjnnah2nz6z8j5eb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegtktjnnah2nz6z8j5eb.png" alt="Activating Actor standby mode" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Server setup
&lt;/h3&gt;

&lt;p&gt;The project uses the Node.js &lt;code&gt;http&lt;/code&gt; module to create a server that listens on the desired port. After the server starts, a check ensures users interact with it by sending HTTP requests rather than launching it as a traditional one-off Actor run. This keeps SuperScraper operating as a persistent server.&lt;/p&gt;
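
&lt;p&gt;A minimal sketch of that setup (simplified; &lt;code&gt;handleUserRequest&lt;/code&gt; is a placeholder for SuperScraper’s actual routing logic):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createServer } from 'http';
import { Actor } from 'apify';

await Actor.init();

// Refuse to run as a one-off job: this Actor only makes sense in standby mode
if (process.env.APIFY_META_ORIGIN !== 'STANDBY') {
    await Actor.exit('This Actor runs in standby mode. Send HTTP requests to it instead.');
}

const port = Number(process.env.ACTOR_WEB_SERVER_PORT ?? 8080);

createServer(async (req, res) =&amp;gt; {
    // Each incoming HTTP request is turned into a scraping job
    await handleUserRequest(req, res);
}).listen(port, () =&amp;gt; console.log(`SuperScraper is listening on port ${port}`));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;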

&lt;h3&gt;
  
  
  Handling multiple crawlers
&lt;/h3&gt;

&lt;p&gt;SuperScraper processes user requests using multiple instances of Crawlee’s &lt;a href="https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler" rel="noopener noreferrer"&gt;&lt;code&gt;PlaywrightCrawler&lt;/code&gt;&lt;/a&gt;. Since each &lt;code&gt;PlaywrightCrawler&lt;/code&gt; instance can only handle one proxy configuration, a separate crawler is created for each unique proxy setting. &lt;/p&gt;

&lt;p&gt;For example, if the user sends one request for “normal” proxies and one with residential US proxies, a separate crawler is needed for each configuration. To handle this, we store the crawlers in a key-value map, where the key is the stringified proxy configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawlers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s a part of the code that gets executed when a new request from the user arrives; if the crawler for this proxy configuration exists in the map, it will be used. Otherwise, a new crawler gets created. Then, we add the request to the crawler’s queue so it can be processed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crawlers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;crawlers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createAndStartCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function below initializes new crawlers with predefined settings and behaviors. Each crawler utilizes its own in-memory queue created with the &lt;code&gt;MemoryStorage&lt;/code&gt; client. This approach is used for two key reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: In-memory queues are faster, and there's no need to persist them when SuperScraper migrates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt;: Using a separate queue prevents interference with the shared default queue of the SuperScraper Actor, avoiding potential bugs when multiple crawlers use it simultaneously.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;createAndStartCrawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CrawlerOptions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;DEFAULT_CRAWLER_OPTIONS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MemoryStorage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;persistStorage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;RequestQueue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;storageClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createProxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;proxyConfigurationOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;keepAlive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;proxyConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;maxRequestRetries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;requestQueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the end of the function, we start the crawler and log a message if it terminates for any reason. Next, we add the newly created crawler to the key-value map containing all crawlers, and finally, we return the crawler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Crawler ended`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;crawlers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Crawler ready 🚀&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mapping standby HTTP requests to Crawlee requests
&lt;/h3&gt;

&lt;p&gt;When creating the server, it accepts a request listener function that takes two arguments: the user’s request and a response object. The response object is used to send scraped data back to the user. These response objects are stored in a key-value map so they can be accessed later in the code. The key is a randomly generated string shared between the request and its corresponding response object; it is also used as the &lt;code&gt;request.uniqueKey&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ServerResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Saving response objects&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following function stores a response object in the key-value map:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;addResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ServerResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Updating crawler logic to store responses&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s the updated logic for fetching/creating the corresponding crawler for a given proxy configuration, with a call to store the response object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crawlers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;crawlers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createAndStartCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nf"&gt;addResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;uniqueKey&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requestQueue&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sending scraped data back&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once a crawler finishes processing a request, it retrieves the corresponding response object using the key and sends the scraped data back to the user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sendSuccResponseById&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Response for request &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; not found`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeHead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;contentType&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Error handling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There is similar logic to send a response back if an error occurs during scraping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sendErrorResponseById&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Response for request &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; not found`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeHead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Adding timeouts during migrations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During migration, SuperScraper adds timeouts to pending responses to handle termination cleanly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;addTimeoutToAllResponses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;timeoutInSeconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;migrationErrorMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;errorMessage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Actor had to migrate to another server. Please, retry your request.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responseKeys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;responseKeys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;sendErrorResponseById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;migrationErrorMessage&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;timeoutInSeconds&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Managing migrations
&lt;/h3&gt;

&lt;p&gt;SuperScraper handles migrations by timing out active responses to prevent lingering requests during server transitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;migrating&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;addTimeoutToAllResponses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users receive clear feedback during server migrations, maintaining stable operation.&lt;/p&gt;
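
&lt;p&gt;For illustration, here is a minimal client-side retry sketch. The URL and helper name are hypothetical; it only assumes the migration error shape returned by &lt;code&gt;sendErrorResponseById&lt;/code&gt; above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical client-side helper: retries when the standby server reports a migration.
const scrapeWithRetry = async (url: string, attempts: number = 3) =&amp;gt; {
    for (let i = 0; i &amp;lt; attempts; i++) {
        const response = await fetch(url);
        const body = await response.json();
        // The migration handler above responds with this errorMessage and a 500 status.
        if (body?.errorMessage?.includes('retry your request')) continue;
        return body;
    }
    throw new Error('Scraping failed after repeated migrations');
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;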

&lt;h3&gt;
  
  
  Build your own
&lt;/h3&gt;

&lt;p&gt;This guide showed how to build and manage a standby web scraper using Apify’s platform and Crawlee. The implementation handles multiple proxy configurations through &lt;code&gt;PlaywrightCrawler&lt;/code&gt; instances while managing request-response cycles efficiently to support diverse scraping needs.&lt;/p&gt;

&lt;p&gt;Standby mode transforms SuperScraper into a persistent API server, eliminating start-up delays. The migration handling system keeps operations stable during server transitions. You can build on this foundation to create web scraping tools tailored to your requirements.&lt;/p&gt;

&lt;p&gt;To get started, explore the project on &lt;a href="https://github.com/apify/super-scraper" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or learn more about &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt; to build your own scalable web scraping tools.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 13 Feb 2025 15:54:12 +0000</pubDate>
      <link>https://forem.com/sauain/-3n8</link>
      <guid>https://forem.com/sauain/-3n8</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/arindam_1729" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F965723%2Fe0982512-4de1-4154-b3c3-1869d19e9ecc.png" alt="arindam_1729"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/arindam_1729/11-exciting-github-repositories-you-should-check-right-now-37cg" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;🤯 11 Exciting GitHub Repositories You Should Check Right Now⚡️&lt;/h2&gt;
      &lt;h3&gt;Arindam Majumder  ・ Feb 13&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#opensource&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#javascript&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>opensource</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>python</category>
    </item>
    <item>
      <title>Crawlee GiveAway for our dev.to community! :)</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Fri, 13 Dec 2024 05:02:55 +0000</pubDate>
      <link>https://forem.com/sauain/crawlee-giveaway-for-our-devto-community--2nkl</link>
      <guid>https://forem.com/sauain/crawlee-giveaway-for-our-devto-community--2nkl</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/arindam_1729" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F965723%2Fe0982512-4de1-4154-b3c3-1869d19e9ecc.png" alt="arindam_1729"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/arindam_1729/7-must-try-open-source-tools-for-python-and-javascript-developers-4c56" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;7 Must-Try Open-Source Tools for Python and JavaScript Developers 🚀&lt;/h2&gt;
      &lt;h3&gt;Arindam Majumder  ・ Dec 12&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#javascript&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#beginners&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Worth a read! :)</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Mon, 02 Dec 2024 05:32:14 +0000</pubDate>
      <link>https://forem.com/sauain/worth-a-read--3naf</link>
      <guid>https://forem.com/sauain/worth-a-read--3naf</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/crawlee" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8417%2F0d74ff82-4b50-4d84-ba0b-8546be19aa84.png" alt="Crawlee" width="192" height="192"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1964149%2F9ec7b4f2-3e35-4b8f-976c-35b4cb0b5af2.jpg" alt="" width="620" height="620"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/crawlee/how-to-scrape-google-search-results-with-python-4ce2" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How to scrape Google search results with Python&lt;/h2&gt;
      &lt;h3&gt;Max Bohomolov for Crawlee ・ Dec 2&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#webscraping&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#developers&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Three years of working remotely! 🥂 ( A summary)</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Fri, 22 Nov 2024 09:21:19 +0000</pubDate>
      <link>https://forem.com/sauain/three-years-of-working-remotely-a-summary-ng2</link>
      <guid>https://forem.com/sauain/three-years-of-working-remotely-a-summary-ng2</guid>
      <description>&lt;h2&gt;
  
  
  Three years of working remotely! 🥂  ( A summary)
&lt;/h2&gt;

&lt;p&gt;Today marks the completion of my three years of working remotely, and I must say it's not been a bed of roses. 😅 &lt;/p&gt;

&lt;p&gt;Yeah, you no longer have to give fake smiles to everyone in the office, but there is more to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Positives:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: You can set your own schedule around the "personal tasks" you have.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Saves time&lt;/strong&gt;: You don't need to travel to the office just to sit the whole day at a desk; your home is your office.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Work with anyone&lt;/strong&gt;: You get a chance to work with bright minds from different parts of the world.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;More opportunities&lt;/strong&gt;: You are not bound to opportunities in a specific region.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Negatives
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Remote work can be depressing&lt;/strong&gt;: Nobody talks about it, but remote work can be insanely depressing at times. You don't interact with anyone in person for days, and things start getting piled up. Even I ended up depressed for months.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Health&lt;/strong&gt;: If you don't spend time in the gym or on other physical activity, then before you know it, you will start facing problems like back pain or weight gain from prolonged sitting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blurred boundaries&lt;/strong&gt;: While people usually see remote work as a work-life balance tool, I reckon it's the opposite. You have blurred boundaries between your normal life and work life. You will find yourself doing work at unusual times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blockers&lt;/strong&gt;: There can be communication and collaboration blockers at times because you are not in the office.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  So, should I opt for remote work?
&lt;/h2&gt;

&lt;p&gt;The answer depends on factors like opportunity, the level of your career, and the role you are going to work in.&lt;/p&gt;

&lt;p&gt;If you are early in your career and your role requires a lot of collaboration with your colleagues, I recommend that you work for at least a few months in an office to learn collaboration, communication, and other skills.&lt;/p&gt;

&lt;h2&gt;
  
  
  How can I excel in remote work?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Over-communicate&lt;/strong&gt;: This is your key to success in remote work. You HAVE to over-communicate on tasks, requirements, and everything else with your team to make sure you are well aligned with everyone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Health is the first priority&lt;/strong&gt;: Make sure you are taking care of your health and sign up for a gym or any physical activity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set boundaries&lt;/strong&gt;: Create a work routine, make hard stops after a particular time, and have blockers in your calendar.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Go to the office&lt;/strong&gt;: Yes, visit your office from time to time, maybe once every few months or whenever you get the chance: to say hi, for a product launch, for a project, or for team building. Making a personal connection helps a lot in collaboration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Take breaks&lt;/strong&gt;: Take breaks, go outside, take time off, travel. It helps your mental health a lot.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I started working remotely during Covid for an opportunity that was not available where I live. I had to go through its ups and downs, but over time I learned how to survive and excel. ✌ &lt;/p&gt;

&lt;p&gt;I hope it helped you in some way; feel free to hit me up with any questions. :) &lt;/p&gt;

&lt;p&gt;(Pic: Throwback to the day when I started my remote job at Apify)&lt;/p&gt;

</description>
      <category>workplace</category>
      <category>programming</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Reverse engineering GraphQL persistedQuery extension</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Fri, 22 Nov 2024 08:03:47 +0000</pubDate>
      <link>https://forem.com/crawlee/reverse-engineering-graphql-persistedquery-extension-4o9j</link>
      <guid>https://forem.com/crawlee/reverse-engineering-graphql-persistedquery-extension-4o9j</guid>
      <description>&lt;p&gt;GraphQL is a query language for getting deeply nested structured data from a website's backend, similar to MongoDB queries.&lt;/p&gt;

&lt;p&gt;The request is usually a POST to some general &lt;code&gt;/graphql&lt;/code&gt; endpoint with a body like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fgraphql-a3962ed441b2a078e43c8158ad64336a.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fgraphql-a3962ed441b2a078e43c8158ad64336a.webp" alt="image.png" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, with large data structures, this becomes inefficient - you are sending a large query in a POST request body, which is (almost always) the same and only changes on website updates; POST requests can’t be cached, etc. Therefore, an extension called “persisted queries” was developed. This isn’t an anti-scraping secret; you can read the public documentation about it &lt;a href="https://www.apollographql.com/docs/apollo-server/performance/apq/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;TL;DR: the client computes the sha256 hash of the &lt;code&gt;query&lt;/code&gt; text and sends only that hash. In addition, all of this can often fit into the query string of a GET request, making it easily cacheable. Below is an example request from Zillow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwnh5itsnqavjrdy3z41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwnh5itsnqavjrdy3z41.png" alt="Image description" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, it’s just some metadata about the persistedQuery extension, the hash of the query, and variables to be embedded in the query.&lt;/p&gt;
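
&lt;p&gt;Reconstructed as code, a persisted-query GET request has roughly this shape. The hash and variables below are placeholders; the &lt;code&gt;persistedQuery&lt;/code&gt; extension format itself is the one documented by Apollo (linked above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Persisted query as a GET request: only the sha256 hash of the query text is sent.
// All values here are placeholders.
const params = new URLSearchParams({
    operationName: 'SearchResults',
    variables: JSON.stringify({ limit: 10 }),
    extensions: JSON.stringify({
        persistedQuery: { version: 1, sha256Hash: '0d9e...' },
    }),
});
const response = await fetch(`https://example.com/graphql?${params}`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;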

&lt;p&gt;Here’s another request from expedia.com, sent as a POST, but with the same extension:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3v3zsylb5azcnu9di1cd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3v3zsylb5azcnu9di1cd.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This primarily optimizes website performance, but it creates several challenges for web scraping:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GET requests are usually more prone to being blocked.&lt;/li&gt;
&lt;li&gt;Hidden query text: we don’t know the full query, so if the website responds with a “Persisted query not found” error (asking us to send the query in full, not just the hash), we can’t comply.&lt;/li&gt;
&lt;li&gt;Once the website changes even a little and clients start asking for a new query, the server will soon forget the old ID/hash - even though the old query might still work - and your request with that hash will never work again, since you can’t “remind” the server of the full query text.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these reasons, you might find yourself needing to extract the whole query text. You could dig through the website’s JavaScript and, if you’re lucky, find the query text there in full, but often it is dynamically constructed from multiple fragments.&lt;/p&gt;

&lt;p&gt;So we figured out a better way: don’t touch the client-side JavaScript at all. Instead, simulate the situation where the client uses a hash that the server does not know, by intercepting the (valid) request sent by the browser in-flight and modifying the hash to a bogus one before passing it on to the server.&lt;/p&gt;

&lt;p&gt;For exactly this use case, a perfect tool exists: &lt;a href="https://mitmproxy.org/" rel="noopener noreferrer"&gt;mitmproxy&lt;/a&gt;, an open-source Python library that intercepts requests made by your own devices, websites, or apps and allows you to modify them with simple Python scripts.&lt;/p&gt;

&lt;p&gt;Download &lt;code&gt;mitmproxy&lt;/code&gt;, and prepare a Python script like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;dat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dat&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;persistedQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha256Hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0d9e&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# any bogus hex string here
&lt;/span&gt;        &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This defines a hook that &lt;code&gt;mitmproxy&lt;/code&gt; will run on every request: it tries to load the request's JSON body, modifies the hash to an arbitrary value, and writes the updated JSON as a new body of the request.&lt;/p&gt;

&lt;p&gt;We also need to make sure we reroute our browser requests to &lt;code&gt;mitmproxy&lt;/code&gt;. For this purpose, we are going to use a browser extension called &lt;a href="https://chromewebstore.google.com/detail/foxyproxy/gcknhkkoolaabfmlnjonogaaifnjlfnp?hl=en" rel="noopener noreferrer"&gt;FoxyProxy&lt;/a&gt;. It is available in both Firefox and Chrome.&lt;/p&gt;

&lt;p&gt;Just add a route with these settings:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2dy8eu6jltlpt8b2wf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2dy8eu6jltlpt8b2wf.png" alt="Image description" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can run &lt;code&gt;mitmproxy&lt;/code&gt; with this script: &lt;code&gt;mitmweb -s script.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will open a browser tab where you can watch all the intercepted requests in real-time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi01445nm11jeuy5hhsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi01445nm11jeuy5hhsq.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you open that particular request and look at the query in the request section, you will see that the bogus value has replaced the hash.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwz8oezzqsufvcvxbj47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwz8oezzqsufvcvxbj47.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, if you visit Zillow, open that same request, and check the response section, you will see that the client receives the &lt;code&gt;PersistedQueryNotFound&lt;/code&gt; error.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F805icok85iit1egchc1p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F805icok85iit1egchc1p.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Zillow front end reacts by sending the whole query as a POST request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mnb2akea8osjxadwdyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mnb2akea8osjxadwdyy.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We extract the query and hash directly from this POST request. To make sure the Zillow server does not forget this hash, we periodically re-send the exact same query and hash in a POST request. This keeps the scraper working even when the server’s cache is cleared or reset or the website changes.&lt;/p&gt;
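
&lt;p&gt;A minimal sketch of such a keep-alive job, assuming you have already captured the full query text and its hash from the intercepted request (all values below are illustrative placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Placeholder values captured from the intercepted POST request.
const endpoint = 'https://example.com/graphql';
const fullQueryText = 'query SearchResults { ... }'; // the full query text we extracted
const queryHash = '...'; // sha256 hash of fullQueryText

// Re-registers the persisted query so the server keeps its hash in cache.
const keepPersistedQueryAlive = async () =&amp;gt; {
    await fetch(endpoint, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            query: fullQueryText,
            variables: {},
            extensions: { persistedQuery: { version: 1, sha256Hash: queryHash } },
        }),
    });
};

// For example, re-send it once a day.
setInterval(keepPersistedQueryAlive, 24 * 60 * 60 * 1000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;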

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Persisted queries are a powerful optimization tool for GraphQL APIs, enhancing website performance by minimizing payload sizes and enabling GET request caching. However, they also pose significant challenges for web scraping, primarily due to the reliance on server-stored hashes and the potential for those hashes to become invalid.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;mitmproxy&lt;/code&gt; to intercept and manipulate GraphQL requests offers an efficient way to reveal the full query text without delving into complex client-side JavaScript. By forcing the server to respond with a &lt;code&gt;PersistedQueryNotFound&lt;/code&gt; error, we can capture the full query payload and use it for scraping. Periodically re-running the extracted query keeps the scraper functional even when server-side cache resets occur or the website evolves.&lt;/p&gt;

</description>
      <category>graphql</category>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Optimizing web scraping: Scraping auth data using JSDOM</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Wed, 09 Oct 2024 05:26:49 +0000</pubDate>
      <link>https://forem.com/crawlee/optimizing-web-scraping-scraping-auth-data-using-jsdom-3cji</link>
      <guid>https://forem.com/crawlee/optimizing-web-scraping-scraping-auth-data-using-jsdom-3cji</guid>
      <description>&lt;p&gt;As scraping developers, we sometimes need to extract authentication data like temporary keys to perform our tasks. However, it is not as simple as that. Usually, it is in HTML or XHR network requests, but sometimes, the auth data is computed. In that case, we can either reverse-engineer the computation, which takes a lot of time to deobfuscate scripts or run the JavaScript that computes it. Normally, we use a browser, but that is expensive. Crawlee provides support for running browser scraper and Cheerio Scraper in parallel, but that is very complex and expensive in terms of compute resource usage. JSDOM helps us run page JavaScript with fewer resources than a browser and slightly higher than Cheerio.&lt;/p&gt;

&lt;p&gt;This article discusses a new approach that we use in one of our Actors to obtain the authentication data that the TikTok Ads Creative Center web application generates, without actually running a browser, using JSDOM instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyzing the website
&lt;/h2&gt;

&lt;p&gt;When you visit this URL:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pc/en&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You will see a list of hashtags with their live ranking, the number of posts they have, a trend chart, creators, and analytics. You can also filter by industry, set the time period, and use a checkbox to show only trends that are new to the top 100.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1h1lkrn46ac65ovtcl75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1h1lkrn46ac65ovtcl75.png" alt="tiktok-trends" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our goal here is to extract the top 100 hashtags from the list with the given filters.&lt;/p&gt;

&lt;p&gt;There are two possible approaches: &lt;a href="https://crawlee.dev/docs/guides/cheerio-crawler-guide" rel="noopener noreferrer"&gt;&lt;code&gt;CheerioCrawler&lt;/code&gt;&lt;/a&gt; or browser-based scraping. Cheerio gives results faster but does not work with JavaScript-rendered websites.&lt;/p&gt;

&lt;p&gt;Cheerio is not the best option here, as the &lt;a href="https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en" rel="noopener noreferrer"&gt;Creative Center&lt;/a&gt; is a web application whose data source is an API, so we could only get the hashtags initially present in the HTML structure, not all 100 that we need.&lt;/p&gt;

&lt;p&gt;The second approach is to use libraries like Puppeteer or Playwright for browser-based scraping and automate the extraction of all the hashtags, but previous experience shows that this takes a lot of time for such a small task.&lt;/p&gt;

&lt;p&gt;Now comes the new approach that we developed, which makes this process much lighter than browser-based scraping and very close to &lt;code&gt;CheerioCrawler&lt;/code&gt;-based crawling.&lt;/p&gt;

&lt;h2&gt;
  
  
  JSDOM Approach
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Before diving deep into this approach, I would like to give credit to &lt;a href="https://apify.com/alexey" rel="noopener noreferrer"&gt;Alexey Udovydchenko&lt;/a&gt;, Web Automation Engineer at Apify, for developing this approach. Kudos to him!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this approach, we are going to make API calls to &lt;code&gt;https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list&lt;/code&gt; to get the required data.&lt;/p&gt;

&lt;p&gt;Before making calls to this API, we will need a few required headers (auth data), so we will first make a call to &lt;code&gt;https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We will start by creating a function that builds the URL for the API call, makes the call, and gets the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;createStartUrls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;resultsLimit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;industry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;isNewToTop100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filterBy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;isNewToTop100&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;new_on_board&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list?page=1&amp;amp;limit=50&amp;amp;period=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;days&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;country_code=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;country&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;filter_by=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;filterBy&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;sort_by=popular&amp;amp;industry_id=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;industry&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c1"&gt;// required headers&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="na"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;resultsLimit&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the function above, we create the start URL for the API call, including the various parameters we talked about earlier. With a URL built according to those parameters, the crawler will call &lt;code&gt;creative_radar_api&lt;/code&gt; and fetch all the results. A usage example follows below.&lt;/p&gt;
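
&lt;p&gt;For example, calling it with illustrative input values might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative usage of createStartUrls; the input values are made up.
const startUrls = createStartUrls({
    days: '30',
    country: 'US',
    resultsLimit: 100,
    industry: '',
    isNewToTop100: true,
});
// startUrls[0].url now targets the creative_radar_api endpoint with these filters.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;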

&lt;p&gt;But it won’t work until we get the headers. So, let’s create a function that first creates a session using &lt;code&gt;sessionPool&lt;/code&gt; and &lt;code&gt;proxyConfiguration&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;createSessionFunction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;sessionPool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// need url with data to generate token&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;gotScraping&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;proxyUrl&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getApiUrlWithVerificationToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Token generation blocked`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Generated API verification headers`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="nx"&gt;sessionPool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this function, the main goal is to call &lt;code&gt;https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en&lt;/code&gt; and get the headers in return. To get the headers, we use the &lt;code&gt;getApiUrlWithVerificationToken&lt;/code&gt; function.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before going ahead, I want to mention that Crawlee natively supports JSDOM via the &lt;a href="https://crawlee.dev/api/jsdom-crawler" rel="noopener noreferrer"&gt;JSDOM Crawler&lt;/a&gt;. It provides a framework for the parallel crawling of web pages using plain HTTP requests and the jsdom DOM implementation. Because it uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth.&lt;/p&gt;
&lt;/blockquote&gt;
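
&lt;p&gt;As a minimal sketch of that crawler (separate from the approach in this article, and assuming Crawlee's documented JSDOM Crawler API), using it looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { JSDOMCrawler } from 'crawlee';

// Downloads pages over plain HTTP and parses them with jsdom.
const crawler = new JSDOMCrawler({
    runScripts: true, // execute page scripts inside jsdom
    requestHandler: async ({ window, request }) =&amp;gt; {
        console.log(`${request.url}: ${window.document.title}`);
    },
});

await crawler.run(['https://crawlee.dev']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;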

&lt;p&gt;Let’s see how we are going to create the &lt;code&gt;getApiUrlWithVerificationToken&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getApiUrlWithVerificationToken&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Getting API session`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;virtualConsole&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;VirtualConsole&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;JSDOM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text/html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;runScripts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dangerously&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;usable&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CustomResourceLoader&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="c1"&gt;// ^ 'usable' faster than custom and works without canvas&lt;/span&gt;
        &lt;span class="na"&gt;pretendToBeVisual&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;virtualConsole&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;virtualConsole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// ignore errors cause by fake XMLHttpRequest&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiHeaderKeys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anonymous-user-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user-sign&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiValues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// api calls made outside of fetch, hack below is to get URL without actual call&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;XMLHttpRequest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;setRequestHeader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;apiHeaderKeys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;apiValues&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;apiValues&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;apiHeaderKeys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;XMLHttpRequest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;open&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;urlToOpen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;static&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;scontent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
                &lt;span class="nx"&gt;urlToOpen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`https://&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;urlToOpen&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;urlToOpen&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;apiValues&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this function, we create a virtual console to capture page errors and a JSDOM window that loads the page’s resources and runs its scripts, standing in for the browser.&lt;/p&gt;

&lt;p&gt;For this particular example, we need three mandatory headers to make the API call: &lt;code&gt;anonymous-user-id&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;, and &lt;code&gt;user-sign&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;XMLHttpRequest.prototype.setRequestHeader&lt;/code&gt;, we check whether the mentioned headers are being set by the page; if so, we record their values and keep retrying until we have all of them.&lt;/p&gt;

&lt;p&gt;Then, the most important part: we override &lt;code&gt;XMLHttpRequest.prototype.open&lt;/code&gt; so we can extract the auth data and make calls without actually using a browser or exposing the bot activity.&lt;/p&gt;

&lt;p&gt;At the end of &lt;code&gt;createSessionFunction&lt;/code&gt;, it returns a session with the required headers.&lt;/p&gt;

&lt;p&gt;Now, coming to our main code, we will use &lt;code&gt;CheerioCrawler&lt;/code&gt; with &lt;code&gt;preNavigationHooks&lt;/code&gt; to inject the headers we got from the earlier function into the &lt;code&gt;requestHandler&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;sessionPoolOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;maxPoolSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;createSessionFunction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sessionPool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
            &lt;span class="nf"&gt;createSessionFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sessionPool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;preNavigationHooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crawlingContext&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, in the request handler, we make the call using the headers and work out how many calls are needed to fetch all the data, handling pagination.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;json&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;userData&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;itemsCounter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resultsLimit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;BLOCKED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;list&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;itemsCounter&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dataItems&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;resultsLimit&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;resultsLimit&lt;/span&gt;
            &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;resultsLimit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;itemsCounter&lt;/span&gt;
            &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dataItems&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;pagination&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;total&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;`Scraped &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;dataItems&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; results out of &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; from search page &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isResultsLimitNotReached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="nx"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resultsLimit&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isResultsLimitNotReached&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pagination&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;has_more&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;nextUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;nextUrl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;searchParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;page&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;nextUrl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="na"&gt;itemsCounter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;itemsCounter&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;dataItems&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One important thing to note here is that the code is written so that it can make any number of API calls.&lt;/p&gt;

&lt;p&gt;In this particular example, we made just one request with a single session, but you can create more if you need to. When the first API call completes, it enqueues the second one, and the pattern repeats for as many pages as required; here we stopped at two.&lt;/p&gt;

&lt;p&gt;To make things clearer, here is how the code flow looks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5t11xbj12jrwzoml51v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5t11xbj12jrwzoml51v.png" alt="code-flow" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This approach gives us a third way to extract the authentication data, without actually using a browser, and pass it to CheerioCrawler. It significantly improves performance and reduces the RAM requirement by 50%: while browser-based scraping is about ten times slower than pure Cheerio, JSDOM is only 3-4 times slower, which makes it 2-3 times faster than browser-based scraping.&lt;/p&gt;

&lt;p&gt;The project's codebase is &lt;a href="https://github.com/souravjain540/tiktok-trends" rel="noopener noreferrer"&gt;uploaded here&lt;/a&gt;. The code is written as an Apify Actor; you can learn more about Actors &lt;a href="https://docs.apify.com/academy/getting-started/creating-actors" rel="noopener noreferrer"&gt;here&lt;/a&gt;, but you can also run the code without using the Apify SDK.&lt;/p&gt;

&lt;p&gt;If you have any doubts or questions about this approach, reach out to us on our &lt;a href="https://apify.com/discord" rel="noopener noreferrer"&gt;Discord server&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>webscraping</category>
      <category>jsdom</category>
    </item>
    <item>
      <title>How to scrape infinite scrolling webpages with Python</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Wed, 28 Aug 2024 05:38:51 +0000</pubDate>
      <link>https://forem.com/crawlee/how-to-scrape-infinite-scrolling-webpages-with-python-3jhn</link>
      <guid>https://forem.com/crawlee/how-to-scrape-infinite-scrolling-webpages-with-python-3jhn</guid>
      <description>&lt;h1&gt;
  
  
  How to scrape infinite scrolling webpages with Python
&lt;/h1&gt;

&lt;p&gt;Hello, Crawlee Devs, and welcome back to another tutorial on the Crawlee Blog. This tutorial will teach you how to scrape infinite-scrolling websites using Crawlee for Python. &lt;/p&gt;

&lt;p&gt;For context, infinite-scrolling pages are a modern alternative to classic pagination: instead of making users choose the next page, the page automatically loads more data when they scroll to the bottom, so they can keep scrolling.&lt;/p&gt;

&lt;p&gt;As a big sneakerhead, I'll take the Nike shoes infinite-scrolling &lt;a href="https://www.nike.com/" rel="noopener noreferrer"&gt;website&lt;/a&gt; as an example, and we'll scrape thousands of sneakers from it. &lt;/p&gt;

&lt;p&gt;Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites and bootstrapping the project
&lt;/h2&gt;

&lt;p&gt;Let’s start the tutorial by installing Crawlee for Python with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipx run crawlee create nike-crawler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Before going ahead, if you're enjoying this blog, we would be really happy if you gave Crawlee for Python a star on GitHub!&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/apify" rel="noopener noreferrer"&gt;
        apify
      &lt;/a&gt; / &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;
        crawlee-python
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;
    &lt;a href="https://crawlee.dev" rel="nofollow noopener noreferrer"&gt;
        
          
          &lt;img alt="Crawlee" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fapify%2Fcrawlee-python%2Fmaster%2Fwebsite%2Fstatic%2Fimg%2Fcrawlee-light.svg%3Fsanitize%3Dtrue" width="500"&gt;
        
    &lt;/a&gt;
    &lt;br&gt;
    A web scraping and browser automation library
&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;
    &lt;a href="https://badge.fury.io/py/crawlee" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/15da2423bb8cab10ccfcebcfb55789b5f898bd2374ae3504b9157693be92f3ab/68747470733a2f2f62616467652e667572792e696f2f70792f637261776c65652e737667" alt="PyPI version"&gt;
    &lt;/a&gt;
    &lt;a href="https://pypi.org/project/crawlee/" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/3d2b832d76d0fb0fc283d197c6cd685e5cfcfb46ebba485b520dfd434c3c6237/68747470733a2f2f696d672e736869656c64732e696f2f707970692f646d2f637261776c6565" alt="PyPI - Downloads"&gt;
    &lt;/a&gt;
    &lt;a href="https://pypi.org/project/crawlee/" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/8466f1b43e225ad363016bde2b6e445f843fbe2161a61d6803ae80c01a5dae7c/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f637261776c6565" alt="PyPI - Python Version"&gt;
    &lt;/a&gt;
    &lt;a href="https://discord.gg/jyEM2PRvMU" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/afa7e1b87d07c3e1531206ff182c487373079a12f504d7a95b371871ca09d3b0/68747470733a2f2f696d672e736869656c64732e696f2f646973636f72642f3830313136333731373931353537343332333f6c6162656c3d646973636f7264" alt="Chat on discord"&gt;
    &lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Crawlee covers your crawling and scraping end-to-end and &lt;strong&gt;helps you build reliable scrapers. Fast.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🚀 Crawlee for Python is open to early adopters!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;👉 &lt;strong&gt;View full documentation, guides and examples on the &lt;a href="https://crawlee.dev/python/" rel="nofollow noopener noreferrer"&gt;Crawlee project website&lt;/a&gt;&lt;/strong&gt; 👈&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We also have a TypeScript implementation of the Crawlee, which you can explore and utilize for your projects. Visit our GitHub repository for more information &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;Crawlee for JS/TS on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;We…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;We will scrape using headless browsers. Select &lt;code&gt;PlaywrightCrawler&lt;/code&gt; in the terminal when Crawlee for Python asks for it.&lt;/p&gt;

&lt;p&gt;After installation, Crawlee for Python will create boilerplate code for you. Move into the project folder and then run this command to install all the dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  How to scrape infinite scrolling webpages
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Handling the accept cookie dialog&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adding requests for all the shoe links&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extracting data from product details&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creating an accept cookies context manager&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling infinite scroll on the listing page&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exporting data to CSV format&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Handling the accept cookie dialog
&lt;/h3&gt;

&lt;p&gt;After all the necessary installations, we'll start looking into the files and configuring them accordingly.&lt;/p&gt;

&lt;p&gt;When you look into the folder, you'll see many files, but for now, let’s focus on &lt;code&gt;__main__.py&lt;/code&gt; and &lt;code&gt;routes.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;__main__.py&lt;/code&gt;, let's change the target location to the Nike website. Then, just to see how scraping will happen, we'll add &lt;code&gt;headless = False&lt;/code&gt; to the &lt;code&gt;PlaywrightCrawler&lt;/code&gt; parameters. Let's also increase the maximum requests per crawl option to 100 to see the power of parallel scraping in Crawlee for Python.&lt;/p&gt;

&lt;p&gt;The final code will look like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.playwright_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawler&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.routes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;router&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;request_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://nike.com/,
        ]
    )


if __name__ == &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:
    asyncio.run(main())
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now, moving on to &lt;code&gt;routes.py&lt;/code&gt;, let’s remove this line:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We remove this call because we don’t want to scrape the whole website.&lt;/p&gt;

&lt;p&gt;Now, if you run the crawler using the command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry run python &lt;span class="nt"&gt;-m&lt;/span&gt; nike-crawler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;you'll notice that the cookie dialog blocks us from crawling more than one page's worth of shoes. Let’s get it out of our way.&lt;/p&gt;

&lt;p&gt;We can handle the cookie dialog by going to Chrome dev tools and looking at the &lt;code&gt;test_id&lt;/code&gt; of the "accept cookies" button, which is &lt;code&gt;dialog-accept-button&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, let’s remove the &lt;code&gt;context.push_data&lt;/code&gt; call that was left there from the project template and add the code to accept the dialog in routes.py. The updated code will look like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.router&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.playwright_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt;

&lt;span class="nd"&gt;@router.default_handler&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait for the popup to be visible to ensure it has loaded on the page.
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dialog-accept-button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Adding requests for all the shoe links
&lt;/h3&gt;

&lt;p&gt;Now, if you hover over the top bar, you'll see sections such as Men, Women, and Kids, and you'll notice the “All shoes” link in each. As we want to scrape all the sneakers, this section interests us. Let’s use &lt;code&gt;get_by_test_id&lt;/code&gt; with the filter &lt;code&gt;has_text='All shoes'&lt;/code&gt; and add all the links with the text “All shoes” to the request queue. Let’s add this code to the existing &lt;code&gt;routes.py&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;shoe_listing_links&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;All shoes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;shoe_listing_links&lt;/span&gt;
            &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;listing_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handler for shoe listings.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Extracting data from product details
&lt;/h3&gt;

&lt;p&gt;Now that we have all the links to the pages with the title “All Shoes,” the next step is to scrape all the products on each page and the information provided on them.&lt;/p&gt;

&lt;p&gt;We'll extract each shoe's URL, title, price, and description. Again, let's go to dev tools and find the relevant &lt;code&gt;test_id&lt;/code&gt; for each field. After scraping each field, we'll use the &lt;code&gt;context.push_data&lt;/code&gt; function to add it to the local dataset. Now let's update the &lt;code&gt;listing_handler&lt;/code&gt; to enqueue the product detail pages and add a new &lt;code&gt;detail&lt;/code&gt; handler in the &lt;code&gt;routes.py&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;listing_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handler for shoe listings.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;        

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a.product-card__link-overlay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detail_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handler for shoe details.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;currentPrice-container&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product-description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loaded_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Creating an accept cookies context manager
&lt;/h3&gt;

&lt;p&gt;Since we're dealing with multiple browser pages with multiple links and we want to do infinite scrolling, we may encounter an accept cookie dialog on each page. This will prevent loading more shoes via infinite scroll.&lt;/p&gt;

&lt;p&gt;We'll need to check for cookies on every page, as each one may be opened with a fresh session (no stored cookies) and we'll get the accept cookie dialog even though we already accepted it in another browser window. However, if we don't get the dialog, we want the request handler to work as usual.&lt;/p&gt;

&lt;p&gt;To solve this problem, we'll try to deal with the dialog in a parallel task that will run in the background. A context manager is a nice abstraction that will allow us to reuse this logic in all the router handlers. So, let's build a context manager:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nb"&gt;TimeoutError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;PlaywrightTimeoutError&lt;/span&gt;

&lt;span class="nd"&gt;@asynccontextmanager&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;accept_cookies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dialog-accept-button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;done&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;suppress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CancelledError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PlaywrightTimeoutError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This context manager makes sure we accept the cookie dialog, if it appears, before scrolling and scraping the page. Let’s implement it in the &lt;code&gt;routes.py&lt;/code&gt; file; the updated code is &lt;a href="https://github.com/janbuchar/crawlee-python-demo/blob/6ca6f7f1d1bbbf789a3b86f14bec492cf756251e/crawlee-python-webinar/routes.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
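
&lt;p&gt;For reference, here is roughly what the default handler looks like once it's wrapped in the context manager. This is a sketch assembled from the snippets above (imports as in the linked repository), so treat the linked file as the authoritative version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -&amp;gt; None:
    """Handler for the Nike homepage."""
    # Keep the cookie-dialog watcher running in the background
    # while we collect the shoe listing links.
    async with accept_cookies(context.page):
        shoe_listing_links = (
            await context.page.get_by_test_id('link').filter(has_text='All shoes').all()
        )
        await context.add_requests(
            [
                Request.from_url(url, label='listing')
                for link in shoe_listing_links
                if (url := await link.get_attribute('href'))
            ]
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
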
&lt;h3&gt;
  
  
  Handling infinite scroll on the listing page
&lt;/h3&gt;

&lt;p&gt;Now for the last and most interesting part of the tutorial: handling the infinite scroll on each shoe listing page and making sure our crawler keeps scrolling and scraping data continuously.&lt;/p&gt;

&lt;p&gt;To handle infinite scrolling in Crawlee for Python, we just need to make sure the page is loaded, which is done by waiting for the &lt;code&gt;networkidle&lt;/code&gt; load state, and then use the &lt;code&gt;infinite_scroll&lt;/code&gt; helper function, which keeps scrolling to the bottom of the page as long as additional items keep appearing.&lt;/p&gt;
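
&lt;p&gt;Under the hood, an infinite-scroll helper boils down to a loop like the one below. This is an illustrative sketch, not Crawlee's actual implementation; the &lt;code&gt;scroll_until_done&lt;/code&gt; name and the one-second wait are made up for the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from playwright.async_api import Page


async def scroll_until_done(page: Page) -&amp;gt; None:
    """Keep scrolling to the bottom while new product cards keep appearing."""
    previous_count = -1
    while True:
        count = await page.locator('a.product-card__link-overlay').count()
        if count == previous_count:
            break  # nothing new loaded, so we've reached the end
        previous_count = count
        # Jump to the bottom and give the page a moment to fetch more items.
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
        await page.wait_for_timeout(1_000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;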

&lt;p&gt;Let’s add these two calls to the &lt;code&gt;listing&lt;/code&gt; handler, wrapped in the &lt;code&gt;accept_cookies&lt;/code&gt; context manager:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;listing_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Handler for shoe listings
&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;accept_cookies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_load_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;infinite_scroll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a.product-card__link-overlay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Exporting data to CSV format
&lt;/h2&gt;

&lt;p&gt;As we want to store all the shoe data into a CSV file, we can just add a call to the &lt;code&gt;export_data&lt;/code&gt; helper into the &lt;code&gt;__main__.py&lt;/code&gt; file just after the crawler run:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;shoes.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
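
&lt;p&gt;In context, the end of the &lt;code&gt;main&lt;/code&gt; function in &lt;code&gt;__main__.py&lt;/code&gt; would then look something like this (a sketch based on the code shown earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    await crawler.run(
        [
            'https://nike.com/',
        ]
    )

    # Write every item collected via context.push_data() into a CSV file.
    await crawler.export_data('shoes.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;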

&lt;h2&gt;
  
  
  Working crawler and its code
&lt;/h2&gt;

&lt;p&gt;Now we have a crawler ready that can scrape all the shoes from the Nike website while handling infinite scrolling and many other problems, like the cookie dialog.&lt;/p&gt;

&lt;p&gt;You can find the complete working crawler code here on the &lt;a href="https://github.com/janbuchar/crawlee-python-demo" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;If you have any doubts regarding this tutorial or using Crawlee for Python, feel free to &lt;a href="https://apify.com/discord/" rel="noopener noreferrer"&gt;join our Discord community&lt;/a&gt; and ask fellow developers or the Crawlee team.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer" class="c-link"&gt;
          Crawlee &amp;amp; Apify
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          This is the official developer community of Apify and Crawlee. | 8785 members
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdiscord.com%2Fassets%2Ffavicon.ico"&gt;
        discord.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;






&lt;p&gt;This tutorial is taken from the webinar held on August 5th, where Jan Buchar, Senior Python Engineer at Apify, gave a live demo of this use case. Watch the whole webinar &lt;a href="https://www.youtube.com/live/ip8Ii0eLfRY?si=XoiwdRM6eldKgw-Z" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
