<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Saurav Jain</title>
    <description>The latest articles on Forem by Saurav Jain (@sauain).</description>
    <link>https://forem.com/sauain</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F269955%2Fe9c19f8a-fbff-4e31-a547-8a4c178e03f9.jpg</url>
      <title>Forem: Saurav Jain</title>
      <link>https://forem.com/sauain</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sauain"/>
    <language>en</language>
    <item>
      <title>Firecrawl vs. Apify: 2025 guide for AI and data teams</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Mon, 11 Aug 2025 08:42:57 +0000</pubDate>
      <link>https://forem.com/apify/firecrawl-vs-apify-2025-guide-for-ai-and-data-teams-42e3</link>
      <guid>https://forem.com/apify/firecrawl-vs-apify-2025-guide-for-ai-and-data-teams-42e3</guid>
      <description>&lt;p&gt;&lt;em&gt;A detailed comparison of Firecrawl's unified AI-driven scraping and Apify's comprehensive, flexible ecosystem. We explain what each does best&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Fresh, structured &lt;a href="https://apify.com/use-cases/data-for-ai-agents" rel="noopener noreferrer"&gt;web data is the fuel for AI&lt;/a&gt;, including agents, RAG pipelines, competitive‑intelligence dashboards, and change‑monitoring services. Two platforms dominate that space in 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Firecrawl&lt;/strong&gt; – an API‑first crawler that turns any URL into LLM‑ready Markdown/JSON in seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apify&lt;/strong&gt; – a full‑stack scraping platform with thousands of reusable data collection tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll go deep into the differences and what they do best, so you can choose (or combine) wisely.&lt;/p&gt;

&lt;p&gt;Let's kick off with a table of each platform’s features, benefits, and trade‑offs so you can see where each one excels.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core value prop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, semantic scraping API geared for AI&lt;/td&gt;
&lt;td&gt;End-to-end scraping platform (6,000+ data collection tools, proxies, compliance)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stand‑out features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-warmed browsers&lt;br&gt;NL extraction&lt;br&gt;AGPL OSS &amp;amp; self-host&lt;br&gt;Stealth proxies&lt;/td&gt;
&lt;td&gt;Scraper marketplace&lt;br&gt;JS/TS &amp;amp; Py SDKs&lt;br&gt;Global proxy pool + CAPTCHA&lt;br&gt;Cron-scheduler, retries, webhooks&lt;br&gt;SOC 2 Type II and GDPR compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary benefits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sub-second latency on cached pages&lt;br&gt;Prompt-based (no selectors)&lt;br&gt;Predictable credit pricing&lt;/td&gt;
&lt;td&gt;Fine‑grained session &amp;amp; proxy control&lt;br&gt;No‑code operation for analysts; devs extend via code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✔ Fast single‑page fetches&lt;br&gt;✔ Self‑host to avoid vendor lock‑in&lt;br&gt;✔ Low entry cost (free + $16 Hobby tier)&lt;/td&gt;
&lt;td&gt;✔ Breadth: 6,000 off-the-shelf scrapers&lt;br&gt;✔ Effective anti‑blocking technology&lt;br&gt;✔ Monetize your own scrapers; earn rev‑share&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✖ Credits can disappear fast on large crawls&lt;br&gt;✖ Limited built-in scheduling&lt;br&gt;✖ AGPL copyleft for forks&lt;/td&gt;
&lt;td&gt;✖ Actors / CU concepts add a learning curve&lt;br&gt;✖ Consumption costs can spike with inefficient code&lt;br&gt;✖ Cold-start ≈1.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Get data for AI with Apify&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl’s pricing model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl uses a simple credit-based model: 1 page = 1 credit (under standard conditions). That makes it very easy to predict costs. The free tier lets you scrape up to 500 pages before committing financially (no credit card required). Paid plans range from $16 to $333 per month for standard scraping.&lt;/p&gt;

&lt;p&gt;Extraction tasks that go beyond simple scraping consume additional credits or tokens, and Firecrawl offers dedicated “Extract” plans ranging from $89 to $719/month, depending on token volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff614pdy4bwhfc10bhzcm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff614pdy4bwhfc10bhzcm.jpeg" alt="Firecrawl pricing" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify’s pricing model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify combines a subscription and consumption model. You get a base amount of platform credit with each pricing plan, but the consumption rate depends on the resources used. 1 compute unit = 1 gigabyte-hour of RAM.&lt;/p&gt;

&lt;p&gt;It’s harder to predict costs this way because scraping a JavaScript-rendered site that requires browser automation consumes far more than a simple HTML scraper. For example, a browser-based Actor holding 4 GB of RAM for half an hour consumes 2 CUs, while a plain HTTP scraper might cover the same pages with a fraction of a CU.&lt;/p&gt;

&lt;p&gt;That’s why Apify has introduced pay-per-event pricing. Scrapers with this pricing model charge for specific actions rather than just results. For example, &lt;strong&gt;a scraper that charges $5 per run start and $2 per 1,000 results would cost $15 for 5,000 results&lt;/strong&gt;. This can make large-scale scraping jobs cheaper in the long run.&lt;/p&gt;
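
&lt;p&gt;As a quick sanity check, here’s the same math in a few lines of JavaScript (the fees are the illustrative ones from the example above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// Pay-per-event cost from the example above (illustrative fees).
const startFee = 5;      // $ per run start
const perThousand = 2;   // $ per 1,000 results
const results = 5000;

const cost = startFee + (results / 1000) * perThousand;
console.log(cost);       // 15
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;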

&lt;p&gt;Apify's forever-free tier gives you $5 of credit that renews automatically every month, so you can test any scraper on Apify Store without financial commitment (no credit card required). Paid plans range from $39 to $999 per month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3a2v1j16kjq8a0dindk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3a2v1j16kjq8a0dindk.png" alt="Apify pricing" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl and Apify pricing compared&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl (flat credits)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify (pre-paid credit + CU)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hobby: $16 ✅&lt;/td&gt;
&lt;td&gt;Starter: $39 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;100,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standard: $83 ✅&lt;/td&gt;
&lt;td&gt;Scale: $199 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Growth: $333 or Enterprise&lt;/td&gt;
&lt;td&gt;Business: $999 or Enterprise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 500k pages/month? &lt;strong&gt;Firecrawl&lt;/strong&gt; is usually cheaper.&lt;br&gt;Millions of lightweight pages or heavy anti‑bot workflows? A well‑optimized &lt;strong&gt;Apify&lt;/strong&gt; scraper can win on total cost.&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl: Unified AI-driven scraping&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxj0swzvdr3vkuzfzotx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxj0swzvdr3vkuzfzotx.png" alt="Firecrawl - Unified AI-driven scraping" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Firecrawl offers a single, consistent API that handles scraping, crawling, and AI‑driven site navigation, so developers never have to juggle multiple endpoints or bespoke parameters.&lt;/p&gt;

&lt;p&gt;When you request a page, the service decides on‑the‑fly whether it needs a headless browser, waits for all dynamic elements to render, and then applies extraction models that automatically ignore ads, menus, and other noise.&lt;/p&gt;

&lt;p&gt;Instead of writing brittle CSS or XPath rules, you ask for the data in plain language — “product prices and availability,” for example — and Firecrawl returns a clean JSON block that stays stable even when the site’s markup changes.&lt;/p&gt;
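
&lt;p&gt;As a rough sketch of what that looks like with Firecrawl’s Node SDK (package, method, and option names vary between SDK versions, so treat this as illustrative rather than definitive):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// Hedged sketch: prompt-based extraction with Firecrawl’s Node SDK.
import FirecrawlApp from "@mendable/firecrawl-js";

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

// Describe the data in plain language instead of writing selectors.
const result = await app.scrapeUrl("https://example.com/products", {
  formats: ["markdown", "json"],
  jsonOptions: { prompt: "product prices and availability" },
});

console.log(result.json);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;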

&lt;p&gt;The same intelligence powers the crawler: it goes through internal links without sitemaps, skips duplicate content, and can infer which pages matter most from their position in the site hierarchy and linking patterns, all while respecting any boundaries you set.&lt;/p&gt;

&lt;p&gt;For sites that hide information behind clicks, forms, or paginated views, you can enable the FIRE‑1 agent. It mimics human behavior by clicking “Load More,” filling search fields, and even solving simple CAPTCHAs, eliminating special‑case code.&lt;/p&gt;

&lt;p&gt;Performance optimizations run throughout the platform. Recently scraped pages come from cache in milliseconds, hundreds of URLs can be batched in a single call for parallel processing, and converting HTML to lightweight Markdown cuts the token count for downstream LLMs by roughly two‑thirds. The net effect is faster, lower‑maintenance data collection that remains reliable as websites evolve.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify: A comprehensive, flexible ecosystem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w9320o95aph7gfz8g8u.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w9320o95aph7gfz8g8u.jpeg" alt="Apify: A comprehensive, flexible ecosystem, not just a web scraping API" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apify approaches web scraping as an ecosystem problem rather than a single‑tool exercise. Its foundation is the Actor system — self‑contained programs that run in Apify’s cloud with uniform input/output, shared storage, and common scheduling and monitoring. Because every Actor behaves the same way, you can link them into multistep workflows just by passing data from one to the next.&lt;/p&gt;

&lt;p&gt;This standardization fuels &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;, a marketplace of more than 6,000 pre‑built scrapers and automation tools maintained by domain specialists. If you need Amazon product data, Instagram follower stats, or a one‑off government registry crawl, chances are an Actor already exists and is kept up to date as site layouts change, so you rarely start from a blank page.&lt;/p&gt;
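
&lt;p&gt;Running a Store Actor from code takes only a few lines with the &lt;code&gt;apify-client&lt;/code&gt; package. A minimal sketch (the Actor ID and input fields are illustrative; every Actor defines its own input schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start an Actor run and wait for it to finish.
const run = await client.actor("apify/web-scraper").call({
  startUrls: [{ url: "https://example.com" }],
});

// Each run writes its results to a default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items.length, "items scraped");
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;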

&lt;p&gt;When an off‑the‑shelf Actor doesn’t fit, you can build your own with the Apify SDK (also released as the open‑source &lt;a href="https://crawlee.dev/?__hstc=160404322.3b3599158ad751038b19c751c187f023.1696600257709.1753948779266.1753948924434.739&amp;amp;__hssc=160404322.1.1753948924434&amp;amp;__hsfp=3246212229" rel="noopener noreferrer"&gt;Crawlee library&lt;/a&gt;). The SDK offers high‑level helpers — request queues, automatic retries, error handling, parallel processing — while still letting you drop down to raw Puppeteer, Playwright, or HTTP calls when necessary. It supports both JavaScript/TypeScript and Python, making it &lt;strong&gt;easy to slot into existing codebases&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Infrastructure management is largely hands‑off. Apify autoscales compute instances, rotates datacenter or residential proxies by geography, and applies anti‑detection tactics such as browser‑fingerprint randomization, human‑like delays, and outsourced CAPTCHA solving. Enterprise‑grade features cover the operational side: detailed run statistics, alerting, conditional scheduling (e.g., “scrape 100 sites at 06:00 only if yesterday’s run succeeded”), real‑time webhooks for downstream systems, and configurable result retention to satisfy compliance audits or historical analyses.&lt;/p&gt;

&lt;p&gt;In practice, the platform lets teams mix and match ready‑made Actors, custom code, and reliable infrastructure to &lt;strong&gt;solve diverse scraping tasks without rebuilding each time&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Technical feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API surface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single base URL with multiple HTTP/JSON endpoints under one uniform API&lt;/td&gt;
&lt;td&gt;Each Actor exposes a standard REST interface: you POST to the “run” endpoint with a JSON input and then GET results from key‑value stores or datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynamic‑content handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatically spins up a headless browser when JS rendering is detected, with no extra configuration&lt;/td&gt;
&lt;td&gt;You choose per Actor: Puppeteer, Playwright, Cheerio, raw HTTP, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extraction approach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;“Zero‑selector” extraction: ML/NLP models parse pages into JSON or Markdown out‑of‑the‑box&lt;/td&gt;
&lt;td&gt;Code‑based extraction inside each Actor using Crawlee's page handlers or raw selectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workflow composition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In a single request you can switch between scrape, crawl, or the FIRE‑1 “agent” mode using flags&lt;/td&gt;
&lt;td&gt;Chain Actors/tasks asynchronously using shared storage, datasets, or webhooks to build multistep pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browser automation / navigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FIRE‑1 agent clicks buttons, paginates, fills forms, solves simple CAPTCHAs&lt;/td&gt;
&lt;td&gt;Automation coded per Actor; CAPTCHA solving baked in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anti‑detection &amp;amp; proxy support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully managed proxies, rotating sessions, fingerprint spoofing and geotargeting are all built‑in&lt;/td&gt;
&lt;td&gt;Apify Proxy (datacenter &amp;amp; residential) with session rotation, geo‑targeting and spoofing; you enable it per Actor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caching &amp;amp; batching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global intelligent cache for recently fetched pages; supports batching hundreds of URLs in one API call&lt;/td&gt;
&lt;td&gt;Request queues with autoscaled parallelism; per‑Actor caching logic is up to you (e.g. dataset dedupe, custom cache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Firecrawl auto‑scales browser instances for concurrent requests&lt;/td&gt;
&lt;td&gt;Platform scales Actors and browser instances elastically according to queue depth and configured concurrency limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring &amp;amp; scheduling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metrics and scheduling built into the single API dashboard&lt;/td&gt;
&lt;td&gt;Per‑Actor run metrics, alerts, conditional scheduling, and webhook triggers via Apify Console&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data format optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Converts HTML → Markdown to cut LLM tokens by ≈ 67%&lt;/td&gt;
&lt;td&gt;Returns whatever the Actor emits (HTML, JSON, CSV, etc.); no built‑in token reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language / SDK support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any language that can call HTTP/JSON (official clients in Python, Node, Go, Rust, C#, etc.)&lt;/td&gt;
&lt;td&gt;Official SDKs in JavaScript/TypeScript and Python (Apify SDK / Crawlee); other languages via raw REST calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How each platform scales
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl scales like a classic SaaS API. Your subscription tier defines a fixed pool of headless‑browser workers, and the service queues requests behind those workers. Because the queueing and resource allocation are handled centrally, you get stable latency and never touch infrastructure. Even the mid‑tier plans can move thousands of pages per minute thanks to caching and smart routing, so the raw worker counts rarely become a choke point.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify takes a looser, cloud‑native approach. You launch as many Actors as your budget allows, and the platform spins up containers to run them in parallel. That “elastic infinity” is perfect for bursty jobs — say, crawling an entire retail catalog overnight or tracking tens of thousands of social‑media accounts — because you can flood the platform with work and pay only for the compute you burn. Automatic retries and detailed run logs keep large Actor swarms reliable.&lt;/p&gt;

&lt;p&gt;The marketplace magnifies &lt;strong&gt;Apify’s scale advantage&lt;/strong&gt;: if your project hits 50 different sites, chances are someone has already published an Actor for most of them. You trade weeks of scraper authoring for coordination logic — managing many Actors’ versions, schedules, and output formats — but you &lt;strong&gt;gain speed and capacity on day one&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Integrating Firecrawl into AI pipelines&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl treats integration the same way it treats scraping: make the routine path effortless and keep the escape hatches open.&lt;/p&gt;

&lt;p&gt;Official SDKs for Python, JavaScript, Rust, and Go expose idiomatic methods, but the real strength lies in direct hooks to AI tooling. A native LangChain loader, for example, can be wired up in just a few lines; it fetches pages, paginates automatically, preserves attribution metadata, and delivers chunks that are ready for embeddings or other RAG workflows. LlamaIndex receives similar first‑class support with retrievers that fetch, de‑duplicate, and format content for chatbots or summarization agents while keeping token counts under control.&lt;/p&gt;
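
&lt;p&gt;A minimal LangChain.js sketch, assuming the community &lt;code&gt;FireCrawlLoader&lt;/code&gt; (import paths shift between LangChain releases):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";

const loader = new FireCrawlLoader({
  url: "https://example.com/docs",
  apiKey: process.env.FIRECRAWL_API_KEY,
  mode: "scrape", // "crawl" walks the whole site instead
});

// Returns LangChain Document objects ready for chunking and embedding.
const docs = await loader.load();
console.log(docs[0].pageContent.slice(0, 200));
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;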

&lt;p&gt;Outside pure code, Firecrawl blocks appear in Make.com and Zapier, so non‑developers can drag‑and‑drop flows — say, watch a competitor’s site, extract new product data in plain English, and update a shared spreadsheet — without touching a script. The result is a scraper that plugs into AI stacks and no‑code tools with equal ease.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify integrations for production workflows&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify’s connectors reflect a more mature, enterprise‑centric agenda: it's designed to sit inside CI/CD pipelines and data‑engineering stacks, not just trigger one‑off jobs.&lt;/p&gt;

&lt;p&gt;The Zapier app goes beyond “run an Actor” by adding triggers for job completion, built‑in data filters, and error‑handling branches, so a sales team can scrape LinkedIn, enrich leads with a second Actor, and post qualified prospects to a CRM in a single Zap.&lt;/p&gt;

&lt;p&gt;The GitHub integration lets teams treat scraper code like any other service — commit, test, review, and deploy via GitHub Actions — bringing familiar DevOps discipline to crawling logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foubbz5i3hiyhctiz1706.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foubbz5i3hiyhctiz1706.jpeg" alt="LangChain and LlamaIndex integration - Apify" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For AI tooling, Apify has loaders for LangChain and LlamaIndex, integrations with Hugging Face and Haystack, and connectors for vector databases such as Pinecone and Qdrant.&lt;/p&gt;
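
&lt;p&gt;A hedged LangChain.js sketch with the community &lt;code&gt;ApifyDatasetLoader&lt;/code&gt; (the dataset ID and item fields below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { ApifyDatasetLoader } from "@langchain/community/document_loaders/web/apify_dataset";
import { Document } from "@langchain/core/documents";

// Turn items from a finished Actor run’s dataset into LangChain Documents.
const loader = new ApifyDatasetLoader("your-dataset-id", {
  datasetMappingFunction: (item) =&gt;
    new Document({
      pageContent: item.text || "",
      metadata: { source: item.url },
    }),
});

const docs = await loader.load();
console.log(docs.length, "documents loaded");
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;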

&lt;p&gt;For data pipelines, Apify ships connectors that push results straight into S3, GCS, or Azure Blob, or stream them through webhooks to Kafka or Pub/Sub, &lt;strong&gt;turning the platform into a managed data‑ingestion layer&lt;/strong&gt; that feeds downstream analytics with no extra glue code.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Integration feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Official SDKs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python, JavaScript/TypeScript (official), Rust, Go (community)&lt;/td&gt;
&lt;td&gt;JavaScript/TypeScript &amp;amp; Python (via Apify SDK / Crawlee)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI‑framework hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native loaders/retrievers for LangChain and LlamaIndex (handles pagination, chunking, metadata, token‑cost control)&lt;/td&gt;
&lt;td&gt;Official loaders for LangChain and LlamaIndex; individual Actors may embed extra AI logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No‑code / automation tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Make and Zapier blocks for drag‑and‑drop flows&lt;/td&gt;
&lt;td&gt;Zapier app with branching, filters, and error handling for complex automations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DevOps / CI/CD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;GitHub/Bitbucket integration for version control, tests, and CI‑based deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data‑pipeline connectors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;– (pull via API or SDK)&lt;/td&gt;
&lt;td&gt;Export Actors to S3, GCS, Azure; real‑time streaming to Kafka/Pub/Sub/webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Webhook support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crawl/scrape lifecycle webhooks for async callbacks&lt;/td&gt;
&lt;td&gt;Webhooks on Actor events; plus metamorph &amp;amp; transform steps for streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token‑usage optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTML→Markdown plus loader‑level chunk sizing &amp;amp; token‑budget helpers&lt;/td&gt;
&lt;td&gt;Up to individual Actor / downstream loader; no platform‑wide token controls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Picking the right platform for your workload
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;When Firecrawl is the better fit&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Choose Firecrawl when &lt;strong&gt;low‑latency&lt;/strong&gt; access to web data and &lt;strong&gt;tight coupling with AI pipelines&lt;/strong&gt; are top priorities. Its uniform API, natural‑language extraction, and sub‑second response times let you build chatbots, RAG systems, or research agents without wrestling with scraping logic. Pricing is credit‑based and predictable. If you know you’ll process about 50k pages a month, you can budget the spend to the dollar and count on the same performance every day. The open‑source core offers a self‑host option for teams with data‑residency mandates or those who simply want an exit ramp from the hosted service.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;When Apify delivers more value&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify is the best option when &lt;strong&gt;breadth&lt;/strong&gt;, &lt;strong&gt;flexibility&lt;/strong&gt;, and &lt;strong&gt;enterprise&lt;/strong&gt; guarantees matter. Its marketplace of 6,000‑plus maintained Actors means you can &lt;strong&gt;cover dozens of sites in hours&lt;/strong&gt; instead of building each scraper yourself. Non‑developers can launch and schedule those Actors through a web UI, while engineering teams still have full SDK control when needed. SOC 2 compliance, GDPR alignment, and a history of &lt;strong&gt;large‑scale deployments&lt;/strong&gt; satisfy procurement checklists, and the platform’s elastic Actor model handles everything from tiny HTML scrapes to overnight catalog crawls without manual capacity planning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Get started with Apify&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>5 best JavaScript web scraping libraries in 2025</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 24 Jul 2025 18:30:00 +0000</pubDate>
      <link>https://forem.com/apify/5-best-javascript-web-scraping-libraries-in-2025-4mf2</link>
      <guid>https://forem.com/apify/5-best-javascript-web-scraping-libraries-in-2025-4mf2</guid>
      <description>&lt;h3&gt;
  
  
  Even if you're not a JavaScript developer, it's worth knowing these libraries for parsing HTML, interacting with pages, and dealing with dynamic content.
&lt;/h3&gt;

&lt;p&gt;Websites are becoming increasingly complex and dynamic. The modern web is full of JavaScript-rendered apps that load content asynchronously, use auth systems involving multiple steps and JavaScript-based token handling, and block scraping bots.&lt;/p&gt;

&lt;p&gt;That's why JavaScript is still a great choice for collecting web data in 2025. But if you're a developer new to web scraping or unfamiliar with the JavaScript language, you're probably wondering which libraries and frameworks you should try.&lt;/p&gt;

&lt;p&gt;At Apify, we've been scraping the web with JavaScript and Node.js for a decade. This selection of 5 libraries is informed by our experience of using them for data extraction, from parsing HTML to navigating web pages and scraping dynamic content.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Crawlee
&lt;/h2&gt;

&lt;p&gt;Tackling complexity, stealth, and scalability in one package&lt;/p&gt;

&lt;p&gt;Juggling multiple libraries for requests, parsing, browser automation, and crawling logic quickly becomes a maintenance headache. You end up writing glue code to handle queues, rotate proxies, and merge results, only to find you still get blocked or stuck once you try to scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crawlee.dev/js?__hstc=160404322.210f49410a426b754a1e9d8c7a300289.1753096943609.1753284250052.1753339598956.12&amp;amp;__hssc=160404322.11.1753339598956&amp;amp;__hsfp=2557066438" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;, developed by the Apify team, unifies everything under a single interface. Out of the box, it mimics real browsers (headers, TLS fingerprints, and even stealth plugins) so you avoid common anti-bot defenses without manual header or fingerprint tweaking. Instead of wiring together Cheerio + Playwright/Puppeteer + queue managers, Crawlee provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Switchable crawler classes&lt;/strong&gt;: &lt;code&gt;CheerioCrawler&lt;/code&gt; for static HTML, &lt;code&gt;PlaywrightCrawler&lt;/code&gt; or &lt;code&gt;PuppeteerCrawler&lt;/code&gt; for dynamic pages, all sharing a common configuration style.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in queue management&lt;/strong&gt;: Breadth-first or depth-first crawling with concurrency settings, retry logic, and automatic backoff. You define start URLs; Crawlee handles enqueuing, prioritization, and scheduling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic proxy rotation and session handling&lt;/strong&gt;: Effectively rotate proxies or manage cookies and browser contexts, so you stay under rate limits and maintain logins across multiple pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pluggable data storages&lt;/strong&gt;: Datasets (JSON, CSV, or key-value stores) appear in a local “datasets” directory, making it trivial to persist results or resume failed crawls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle hooks and customizability&lt;/strong&gt;: Logging, error handling, and custom request handlers via routers, so you can insert your own logic at enqueue, request success, or failure without rewriting core code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native integration with the Apify platform&lt;/strong&gt;: Once your crawler is ready, running &lt;code&gt;apify push&lt;/code&gt; deploys it, and Apify handles autoscaling, proxy billing, and data exports. No extra configuration needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starter templates and file structure&lt;/strong&gt;: When you run &lt;code&gt;npx crawlee create my-crawler&lt;/code&gt;, you get a &lt;code&gt;main.js&lt;/code&gt; and &lt;code&gt;routes.js&lt;/code&gt; setup. Boilerplate code means you can focus on selectors rather than instantiating browser instances, setting headers, or wiring queues. The default file structure looks like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/my-crawler
├── main.js      &lt;span class="c"&gt;# entry point: initializes crawler class and starts run()&lt;/span&gt;
├── routes.js    &lt;span class="c"&gt;# defines request handlers via createCheerioRouter/createPlaywrightRouter&lt;/span&gt;
├── storages/
│   ├── datasets/          &lt;span class="c"&gt;# where results are stored as JSON files per page&lt;/span&gt;
│   └── key-value-stores/  &lt;span class="c"&gt;# storage for arbitrary binary data (images, videos, JSON files…)&lt;/span&gt;
└── package.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (Cheerio crawler)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// routes.js&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;createCheerioRouter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createCheerioRouter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addDefaultHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`enqueueing new URLs`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Finds “next” pages and enqueues them&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;globs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/?p=*&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Extract post URL, title, rank&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;postUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toArray&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Push to dataset for automatic file output&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
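
&lt;p&gt;The &lt;code&gt;main.js&lt;/code&gt; counterpart is only a few lines: it wires the router into a crawler class and starts the run. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// main.js
import { CheerioCrawler } from "crawlee";
import { router } from "./routes.js";

const crawler = new CheerioCrawler({
  requestHandler: router,
  maxRequestsPerCrawl: 50, // illustrative cap for local testing
});

await crawler.run(["https://news.ycombinator.com/"]);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;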



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity:&lt;/strong&gt; No more separate proxy rotation libraries, queue managers, or manual header generators. Crawlee handles it all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling&lt;/strong&gt;: You can start locally and then deploy to the Apify platform, where it auto‐scales, monitors memory/CPU, and logs failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance&lt;/strong&gt;: Switching from CheerioCrawler to PlaywrightCrawler only requires changing one import and maybe tweaking selectors. The core logic stays the same.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Impit
&lt;/h2&gt;

&lt;p&gt;Making browser impersonation simple&lt;/p&gt;

&lt;p&gt;Sending vanilla HTTP requests often gets you blocked by modern anti-scraping systems. You might spend hours rotating user-agents, randomizing delays, or solving CAPTCHAs manually, only to still find your IP banned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/apify/impit" rel="noopener noreferrer"&gt;Impit&lt;/a&gt; is an HTTP client for Node.js and Python, based on Rust’s &lt;code&gt;reqwest&lt;/code&gt;, specifically tailored for scraping. Instead of wrestling with header spoofing or TLS fingerprinting yourself, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic fingerprint spoofing:&lt;/strong&gt; Pick from a library of existing browser fingerprints, and impit builds a full set of realistic HTTP headers and matching TLS settings. This makes your requests indistinguishable from browser requests and reduces detection risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated &lt;code&gt;tough-cookie&lt;/code&gt; support&lt;/strong&gt;: Handle session cookies out of the box, so you can maintain login sessions or track redirects using the most popular JS cookie library.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;fetch&lt;/code&gt; API:&lt;/strong&gt; Impit implements a subset of the well-known &lt;code&gt;fetch&lt;/code&gt; API (&lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API" rel="noopener noreferrer"&gt;MDN&lt;/a&gt;), so you can write your scrapers without having to read lengthy docs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy integration&lt;/strong&gt;: Support for HTTP and HTTPS proxies via a single option, so you can rotate IPs with minimal code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (impersonating Firefox)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Impit&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;impit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchHtml&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;impit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Impit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;firefox&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;http3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;impit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; &lt;span class="c1"&gt;// raw HTML&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;fetchHtml&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
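
&lt;p&gt;Routing the same request through a proxy is the “single option” mentioned above. A sketch (check impit’s README for the exact option name in your version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { Impit } from "impit";

// Same fetch, routed through a proxy (the URL is illustrative).
const impit = new Impit({
  browser: "firefox",
  proxyUrl: "http://user:pass@proxy.example.com:8000",
});

const response = await impit.fetch("https://news.ycombinator.com/");
console.log(response.status);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;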



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stealth:&lt;/strong&gt; You no longer manually assemble user-agent strings or randomize headers; impit handles the most common anti-bot checks for you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling:&lt;/strong&gt; Configurable retries and timeouts mean fewer surprises when a request fails.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Cheerio
&lt;/h2&gt;

&lt;p&gt;Converting unruly HTML into structured data&lt;/p&gt;

&lt;p&gt;Plain HTML is cluttered: nested tags, inconsistent class names, and no programmatic way to navigate the DOM on the server. If you’ve written custom regex or string-based parsers, you know how brittle that can be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cheerio.js.org/" rel="noopener noreferrer"&gt;Cheerio&lt;/a&gt; loads raw HTML into a fast, jQuery-like API on the server. You can query for elements, attributes, and text using familiar CSS selectors, then extract exactly what you need without worrying about manual string manipulation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (parsing Hacker News)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;gotScraping&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;got-scraping&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cheerio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchTitles&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;gotScraping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;fetchTitles&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Robustness:&lt;/strong&gt; No more fragile regex. With Cheerio, you use &lt;code&gt;.find()&lt;/code&gt;, &lt;code&gt;.text()&lt;/code&gt;, and &lt;code&gt;.attr()&lt;/code&gt; just like jQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Cheerio is lightweight and fast, so parsing large HTML documents doesn’t become your scraper’s bottleneck, especially compared to full headless browsers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Familiar syntax:&lt;/strong&gt; If you’ve used jQuery on the front end, there’s almost zero onboarding time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Playwright
&lt;/h2&gt;

&lt;p&gt;Handling JavaScript‐driven, dynamic content reliably&lt;/p&gt;

&lt;p&gt;Many modern websites rely on client-side JavaScript to populate the DOM: lazy loading, infinite scrolling, or data fetched via XHR/AJAX. Cheerio has no power here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://playwright.dev/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;, on the other hand, spins up a real browser (Chromium, Firefox, or WebKit), navigates pages as a human would, waits for selectors or network to idle, and then gives you a fully rendered DOM snapshot. You can even intercept requests to block ads or unwanted resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (Amazon product page)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;playwright&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeAmazon&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#productTitle&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;span.author a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;kindlePrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#formats span.ebook-price-value&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;paperbackPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#tmm-grid-swatch-PAPERBACK .slot-price span&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;hardcoverPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#tmm-grid-swatch-HARDCOVER .slot-price span&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;book&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;scrapeAmazon&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; If the data isn’t in the initial HTML, you need a browser to run the page’s JS. Playwright ensures you get exactly what a real user sees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible waits:&lt;/strong&gt; You can &lt;code&gt;await page.waitForSelector()&lt;/code&gt; or pass &lt;code&gt;waitUntil: "networkidle"&lt;/code&gt; to &lt;code&gt;page.goto()&lt;/code&gt; so you only scrape once the network settles, reducing flaky results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intercepting resources:&lt;/strong&gt; Block images, CSS, or analytics endpoints to speed up scrapes and reduce noise in your logs (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
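
&lt;p&gt;Resource blocking is a one‑liner with &lt;code&gt;page.route()&lt;/code&gt;. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { firefox } from "playwright";

const browser = await firefox.launch({ headless: true });
const page = await browser.newPage();

// Abort requests for heavy or noisy resources before navigating.
await page.route("**/*.{png,jpg,jpeg,gif,css}", (route) =&gt; route.abort());

await page.goto("https://news.ycombinator.com/");
console.log(await page.title());
await browser.close();
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;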

&lt;h2&gt;
  
  
  5. Puppeteer
&lt;/h2&gt;

&lt;p&gt;A Chrome-centric approach to browser automation&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pptr.dev/" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt; and Playwright are pretty much the same thing, except for some minor differences in API, and unlike Playwright, it's limited to JavaScript and Node.js. Puppeteer is older, but only recently did Firefox &lt;a href="https://hacks.mozilla.org/2024/08/puppeteer-support-for-firefox/" rel="noopener noreferrer"&gt;add official support&lt;/a&gt; for Puppeteer.&lt;/p&gt;

&lt;p&gt;Switching to Playwright isn’t difficult, but if you prefer Chromium’s engine, Puppeteer is a good option.&lt;/p&gt;

&lt;p&gt;Puppeteer gives you a headless (or headed) Chrome instance with an easy API for navigation, selection, and evaluation. It supports intercepting requests, generating PDFs, and capturing screenshots. While it doesn’t include the same cross-browser support as Playwright, it’s been around for longer and integrates well with Chrome DevTools Protocol.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (basic Puppeteer scraper)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;puppeteer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeSite&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;networkidle2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="nf"&gt;$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;posts&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;scrapeSite&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Existing Puppeteer code:&lt;/strong&gt; Migrate incrementally or reuse libraries that depend on Puppeteer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome-only features:&lt;/strong&gt; Use the DevTools Protocol to capture screenshots, trace performance, or emulate network conditions without additional dependencies (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight automation needs:&lt;/strong&gt; If you only need a headless Chrome for a few pages and already have an IP rotation or session management solution, Puppeteer might be the simplest choice.&lt;/li&gt;
&lt;/ul&gt;
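
&lt;p&gt;Here’s a short sketch of those Chrome-only capabilities (the throughput and latency numbers are arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import puppeteer from "puppeteer";

async function auditPage() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Emulate a slow connection via the Chrome DevTools Protocol
  const client = await page.createCDPSession();
  await client.send("Network.emulateNetworkConditions", {
    offline: false,
    latency: 400, // extra round-trip time in ms
    downloadThroughput: (500 * 1024) / 8, // ~500 kbit/s
    uploadThroughput: (256 * 1024) / 8,
  });

  await page.goto("https://example.com", { waitUntil: "networkidle2" });
  await page.screenshot({ path: "example.png", fullPage: true });

  await browser.close();
}
auditPage();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;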

&lt;h2&gt;
  
  
  In summary
&lt;/h2&gt;

&lt;p&gt;JavaScript remains a top choice for web scraping in 2025, thanks to its solid ecosystem of open-source libraries, which make it easier to parse HTML, interact with web pages, and deal with dynamic content. In our opinion, these five are the best:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Crawlee&lt;/strong&gt; - A comprehensive, all-in-one scraping framework that handles browser automation, proxy rotation, session management, queuing, and data storage. It simplifies scaling and maintenance by unifying multiple tools under one interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impit&lt;/strong&gt; - A stealthy HTTP client tailored for scraping, with automatic realistic header generation, cookie jar support, and proxy integration - ideal for scraping without a full browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cheerio&lt;/strong&gt; - A fast and lightweight HTML parser that mimics jQuery, perfect for extracting structured data from static HTML without using a browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; - A full browser automation library for scraping JavaScript-rendered sites. It supports multiple browsers, waits for content to load, and intercepts resources, making it highly reliable for dynamic pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Puppeteer&lt;/strong&gt; - A headless Chrome automation tool with strong DevTools support. It’s suitable for existing Puppeteer codebases or lightweight scraping needs focused on Chromium.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whether you’re parsing static HTML or navigating complex, JavaScript-rendered pages, this toolkit helps you choose and combine the best options for performance, stealth, and scalability.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Build and deploy MCP servers in minutes with a TypeScript template</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 24 Jul 2025 07:33:07 +0000</pubDate>
      <link>https://forem.com/apify/build-and-deploy-mcp-servers-in-minutes-with-a-typescript-template-4mi5</link>
      <guid>https://forem.com/apify/build-and-deploy-mcp-servers-in-minutes-with-a-typescript-template-4mi5</guid>
      <description>&lt;h3&gt;
  
  
  Transform any stdio MCP server into a scalable, cloud-hosted service.
&lt;/h3&gt;

&lt;p&gt;Model Context Protocol (&lt;a href="https://blog.apify.com/what-is-model-context-protocol/" rel="noopener noreferrer"&gt;MCP&lt;/a&gt;) is transforming and simplifying how AI applications connect with external tools. While &lt;a href="https://blog.apify.com/how-to-use-mcp/" rel="noopener noreferrer"&gt;we’ve covered how to use MCP with tools that give agents context from the web&lt;/a&gt;, this guide digs deeper into the developer side: how to build and deploy your own MCP servers on the Apify platform.&lt;/p&gt;

&lt;p&gt;With Apify's MCP templates, you can transform any stdio or remote MCP server into a scalable, cloud-hosted service in minutes. There are currently two templates available: &lt;a href="https://apify.com/templates/python-mcp-server" rel="noopener noreferrer"&gt;for Python&lt;/a&gt; and &lt;a href="https://apify.com/templates/ts-mcp-server" rel="noopener noreferrer"&gt;for TypeScript&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll show you how to build an MCP server on Apify with TypeScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why deploy MCP servers on Apify?
&lt;/h2&gt;

&lt;p&gt;Before we get into implementation, here’s why the Apify platform is ideal for hosting MCP servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instant scalability&lt;/strong&gt;: Apify's infrastructure automatically scales based on demand, from a single request to thousands of concurrent connections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in monetization&lt;/strong&gt;: With the &lt;a href="https://help.apify.com/en/articles/10700066-what-is-pay-per-event" rel="noopener noreferrer"&gt;pay-per-event&lt;/a&gt; (PPE) model, you can charge users for each tool request, API call, or custom event, turning your MCP server into a revenue stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent URLs&lt;/strong&gt;: &lt;a href="https://docs.apify.com/platform/actors/development/programming-interface/standby" rel="noopener noreferrer"&gt;Standby mode&lt;/a&gt; provides stable endpoints like &lt;code&gt;https://your-username--your-mcp-server.apify.actor/sse&lt;/code&gt;, perfect for MCP client configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero infrastructure management&lt;/strong&gt;: No servers to maintain, no Docker orchestration, no SSL certificates. Just deploy and run.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding the architecture
&lt;/h2&gt;

&lt;p&gt;When you deploy an MCP server on Apify, you can work with two types of MCP servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio MCP servers&lt;/strong&gt;: Local servers that communicate via standard input/output, which Apify converts to SSE (Server-Sent Events) for remote access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE MCP servers&lt;/strong&gt;: Remote servers that already communicate via HTTP/SSE, which Apify can proxy and enhance with monetization features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexible architecture allows you to turn any type of MCP server into an &lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;Apify Actor&lt;/a&gt;, whether it's a local stdio-based tool or a remote SSE endpoint, and expose it through a unified SSE interface with built-in scaling and monetization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Actors are lightweight, containerized programs that take JSON inputs, execute tasks, and return structured outputs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step-by-step implementation
&lt;/h2&gt;

&lt;p&gt;Let’s walk through building an MCP server on Apify using the TypeScript template.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Choose and create your Actor from the template&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the TypeScript MCP server from template
apify create my-mcp-server --template ts-mcp-server
cd my-mcp-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Configure your MCP server&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Open &lt;code&gt;src/main.ts&lt;/code&gt; and set the &lt;code&gt;MCP_COMMAND&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// For stdio servers:&lt;/span&gt;
&lt;span class="c1"&gt;// Example: Everything MCP server&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MCP_COMMAND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;npx @modelcontextprotocol/server-everything&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// For SSE servers (requires mcp-remote package):&lt;/span&gt;
&lt;span class="c1"&gt;// Custom SSE endpoint&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MCP_COMMAND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;npx mcp-remote https://your-domain.com/mcp-endpoint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Install your MCP server dependencies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Update &lt;code&gt;package.json&lt;/code&gt; with the MCP server dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@modelcontextprotocol/server-everything"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^2025.5.12"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcp-remote"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^0.1.16"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you’re developing the Actor locally, you can use &lt;code&gt;npm install&lt;/code&gt; instead of editing &lt;code&gt;package.json&lt;/code&gt; directly.&lt;/p&gt;
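
&lt;p&gt;For example, from the project directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install @modelcontextprotocol/server-everything mcp-remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;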

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Set up monetization (optional)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;.actor/pay_per_event.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool-request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventTitle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tool Request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventDescription"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Charge for each tool execution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventPriceUsd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then trigger charges in your TypeScript code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;charge&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;eventName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool-request&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Configure Actor settings&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;.actor/actor.json&lt;/code&gt; (the &lt;code&gt;webServerMcpPath&lt;/code&gt; field is what makes the platform recognize the Actor as an MCP server):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"actorSpecification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usesStandbyMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"minMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"webServerMcpPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/sse"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;So&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Actor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;recognized&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;MCP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;server&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 6: Deploy to Apify&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apify login
apify push

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 7: Configure standby mode&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In Apify Console:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your Actor’s settings&lt;/li&gt;
&lt;li&gt;Enable “Standby mode”&lt;/li&gt;
&lt;li&gt;Set idle timeout (e.g., 300 seconds)&lt;/li&gt;
&lt;li&gt;Adjust memory allocation as needed&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 8: Connect your MCP client&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use this URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://your-username--my-mcp-server.apify.actor/sse

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point your MCP client at this URL, and set an &lt;code&gt;Authorization&lt;/code&gt; header carrying your Apify API token: &lt;code&gt;Authorization: Bearer your-token&lt;/code&gt;.&lt;/p&gt;
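
&lt;p&gt;For example, a stdio-based MCP client can reach the server through &lt;code&gt;mcp-remote&lt;/code&gt;. Here’s a sketch of a &lt;code&gt;claude_desktop_config.json&lt;/code&gt;-style entry (the server name and token are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "my-mcp-server": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://your-username--my-mcp-server.apify.actor/sse",
        "--header",
        "Authorization: Bearer your-token"
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;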

&lt;h2&gt;
  
  
  Advanced configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Environment variables&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To set non-sensitive environment variables, use &lt;code&gt;actor.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"environmentVariables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"RATE_LIMIT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Do not put API tokens or other sensitive values in the &lt;code&gt;actor.json&lt;/code&gt; file. Instead, set sensitive environment variables in the Apify Console UI under your Actor's settings when doing the build. For more information on setting custom environment variables, see the &lt;a href="https://docs.apify.com/platform/actors/development/programming-interface/environment-variables#custom-environment-variables" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
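
&lt;p&gt;Inside the Actor, both kinds of variables arrive through &lt;code&gt;process.env&lt;/code&gt; as usual (a minimal sketch; &lt;code&gt;MY_API_TOKEN&lt;/code&gt; is a hypothetical name):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Non-sensitive variable defined in actor.json above
const rateLimit = Number(process.env.RATE_LIMIT ?? '100');

// Sensitive variable set in the Apify Console UI (hypothetical name)
const apiToken = process.env.MY_API_TOKEN;

console.log('Rate limit:', rateLimit);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;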

&lt;h3&gt;
  
  
  &lt;strong&gt;TypeScript template capabilities&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The TypeScript template supports both stdio and SSE server types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio servers&lt;/strong&gt;: Simple command string configuration for local MCP servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE servers&lt;/strong&gt;: Remote server proxying using mcp-remote for connecting to external SSE endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Debugging and monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Local development&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To run the MCP server locally, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;APIFY_META_ORIGIN="STANDBY" ACTOR_WEB_SERVER_PORT=8080 apify run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use the open-source &lt;a href="https://github.com/modelcontextprotocol/inspector" rel="noopener noreferrer"&gt;MCP inspector&lt;/a&gt; for debugging and testing locally.&lt;/p&gt;
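
&lt;p&gt;For example, you can launch the inspector with &lt;code&gt;npx&lt;/code&gt; and point it at the local SSE endpoint (the port matches &lt;code&gt;ACTOR_WEB_SERVER_PORT&lt;/code&gt; above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx @modelcontextprotocol/inspector
# then connect to http://localhost:8080/sse in the inspector UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;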

&lt;h3&gt;
  
  
  &lt;strong&gt;Production monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;View Actor logs in Apify Console&lt;/li&gt;
&lt;li&gt;Set up error/usage alerts&lt;/li&gt;
&lt;li&gt;Track monetization in Analytics&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Troubleshooting common issues&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory errors&lt;/strong&gt;: Increase memory or optimize code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth failures&lt;/strong&gt;: Check tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add more tools&lt;/li&gt;
&lt;li&gt;Integrate with Apify Actors&lt;/li&gt;
&lt;li&gt;Build custom clients&lt;/li&gt;
&lt;li&gt;Share your server on &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Start building&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Deploying MCP servers on the Apify platform turns local tools into scalable, monetizable cloud services. With our TypeScript template and standby mode, you can get a production-ready server running in minutes, whether it's a local stdio server or an existing SSE server.&lt;/p&gt;

&lt;p&gt;Explore what’s possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://apify.com/templates/python-mcp-server" rel="noopener noreferrer"&gt;Python MCP server template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://apify.com/templates/ts-mcp-server" rel="noopener noreferrer"&gt;TypeScript MCP server template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.apify.com/platform/integrations/mcp" rel="noopener noreferrer"&gt;Apify MCP documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;Join the Apify Discord&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>webdev</category>
      <category>javascript</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Fri, 21 Mar 2025 13:51:28 +0000</pubDate>
      <link>https://forem.com/sauain/-4gno</link>
      <guid>https://forem.com/sauain/-4gno</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/crawlee" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8417%2F0d74ff82-4b50-4d84-ba0b-8546be19aa84.png" alt="Crawlee" width="192" height="192"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1964149%2F9ec7b4f2-3e35-4b8f-976c-35b4cb0b5af2.jpg" alt="" width="620" height="620"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/crawlee/how-to-scrape-bluesky-with-python-2ooo" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How to scrape Bluesky with Python&lt;/h2&gt;
      &lt;h3&gt;Max Bohomolov for Crawlee ・ Mar 21&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#webscraping&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#developers&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>developers</category>
    </item>
    <item>
      <title>Inside implementing SuperScraper with Crawlee.</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 06 Mar 2025 07:23:07 +0000</pubDate>
      <link>https://forem.com/crawlee/inside-implementing-superscraper-with-crawlee-1e6o</link>
      <guid>https://forem.com/crawlee/inside-implementing-superscraper-with-crawlee-1e6o</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/apify/super-scraper" rel="noopener noreferrer"&gt;SuperScraper&lt;/a&gt; is an open-source &lt;a href="https://docs.apify.com/platform/actors" rel="noopener noreferrer"&gt;Actor&lt;/a&gt; that combines features from various web scraping services, including &lt;a href="https://www.scrapingbee.com/" rel="noopener noreferrer"&gt;ScrapingBee&lt;/a&gt;, &lt;a href="https://scrapingant.com/" rel="noopener noreferrer"&gt;ScrapingAnt&lt;/a&gt;, and &lt;a href="https://www.scraperapi.com/" rel="noopener noreferrer"&gt;ScraperAPI&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;A key capability is its standby mode, which runs the Actor as a persistent API server. This removes the usual start-up times - a common pain point in many systems - and lets users make direct API calls to interact with the system immediately.&lt;/p&gt;

&lt;p&gt;This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/apify" rel="noopener noreferrer"&gt;
        apify
      &lt;/a&gt; / &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;
        crawlee
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;
    &lt;a href="https://crawlee.dev" rel="nofollow noopener noreferrer"&gt;
        
          
          &lt;img alt="Crawlee" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fapify%2Fcrawlee%2Fmaster%2Fwebsite%2Fstatic%2Fimg%2Fcrawlee-light.svg%3Fsanitize%3Dtrue" width="500"&gt;
        
    &lt;/a&gt;
    &lt;br&gt;
    A web scraping and browser automation library
&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;
    &lt;a href="https://www.npmjs.com/package/@crawlee/core" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/0254a8a134a98b27e3c3a22da5bfb90fae31709d0cd76d0370b06426dcaeb116/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f762f40637261776c65652f636f72652e737667" alt="NPM latest version"&gt;&lt;/a&gt;
    &lt;a href="https://www.npmjs.com/package/@crawlee/core" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/a9a876e2cefe6ad81ded3d089f0a0506366658d75c28fe98e944327943b39521/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f646d2f40637261776c65652f636f72652e737667" alt="Downloads"&gt;&lt;/a&gt;
    &lt;a href="https://discord.gg/jyEM2PRvMU" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/afa7e1b87d07c3e1531206ff182c487373079a12f504d7a95b371871ca09d3b0/68747470733a2f2f696d672e736869656c64732e696f2f646973636f72642f3830313136333731373931353537343332333f6c6162656c3d646973636f7264" alt="Chat on discord"&gt;&lt;/a&gt;
    &lt;a href="https://github.com/apify/crawlee/actions/workflows/test-ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/apify/crawlee/actions/workflows/test-ci.yml/badge.svg?branch=master" alt="Build Status"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Crawlee covers your crawling and scraping end-to-end and &lt;strong&gt;helps you build reliable scrapers. Fast.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs.&lt;/p&gt;

&lt;p&gt;Crawlee is available as the &lt;a href="https://www.npmjs.com/package/crawlee" rel="nofollow noopener noreferrer"&gt;&lt;code&gt;crawlee&lt;/code&gt;&lt;/a&gt; NPM package.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;👉 &lt;strong&gt;View full documentation, guides and examples on the &lt;a href="https://crawlee.dev" rel="nofollow noopener noreferrer"&gt;Crawlee project website&lt;/a&gt;&lt;/strong&gt; 👈&lt;/p&gt;
&lt;/blockquote&gt;

&lt;blockquote&gt;
&lt;p&gt;Crawlee for Python is open for early adopters. 🐍  &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;👉 Checkout the source code 👈&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;We recommend visiting the &lt;a href="https://crawlee.dev/docs/introduction" rel="nofollow noopener noreferrer"&gt;Introduction tutorial&lt;/a&gt; in Crawlee documentation for more information.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Crawlee requires &lt;strong&gt;Node.js 16 or higher&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;With Crawlee CLI&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;The fastest way to try Crawlee out is to use the &lt;strong&gt;Crawlee CLI&lt;/strong&gt; and choose the &lt;strong&gt;Getting started example&lt;/strong&gt;. The CLI…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;h3&gt;
  
  
  What is SuperScraper?
&lt;/h3&gt;

&lt;p&gt;SuperScraper transforms a traditional scraper into an API server. Instead of running with static inputs and waiting for completion, it starts only once, stays active, and listens for incoming requests. &lt;/p&gt;
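
&lt;p&gt;For example, once deployed, scraping a page is a single HTTP call. A sketch (the Actor URL is a placeholder, and the &lt;code&gt;url&lt;/code&gt; parameter follows the ScrapingBee-style API that SuperScraper emulates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl "https://your-username--super-scraper-api.apify.actor/?url=https://example.com" \
  -H "Authorization: Bearer your-apify-token"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;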

&lt;h3&gt;
  
  
  How to enable standby mode
&lt;/h3&gt;

&lt;p&gt;To activate standby mode, enable it in the Actor’s settings so the Actor keeps running and listens for incoming requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegtktjnnah2nz6z8j5eb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegtktjnnah2nz6z8j5eb.png" alt="Activating Actor standby mode" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Server setup
&lt;/h3&gt;

&lt;p&gt;The project uses the Node.js &lt;code&gt;http&lt;/code&gt; module to create a server that listens on the desired port. After the server starts, a check ensures users interact with it by sending HTTP requests rather than launching it as a traditional one-off Actor run. This keeps SuperScraper operating as a persistent server.&lt;/p&gt;
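
&lt;p&gt;A minimal sketch of that setup (simplified; &lt;code&gt;handleUserRequest&lt;/code&gt; is a placeholder for SuperScraper’s actual routing logic):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createServer } from 'http';
import { Actor } from 'apify';

await Actor.init();

// Refuse to run as a one-off job: this Actor only makes sense in standby mode
if (process.env.APIFY_META_ORIGIN !== 'STANDBY') {
    await Actor.exit('This Actor runs in standby mode. Send HTTP requests to it instead.');
}

const port = Number(process.env.ACTOR_WEB_SERVER_PORT ?? 8080);

createServer(async (req, res) =&amp;gt; {
    // Each incoming HTTP request is turned into a scraping job
    await handleUserRequest(req, res);
}).listen(port, () =&amp;gt; console.log(`SuperScraper is listening on port ${port}`));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;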

&lt;h3&gt;
  
  
  Handling multiple crawlers
&lt;/h3&gt;

&lt;p&gt;SuperScraper processes user requests using multiple instances of Crawlee’s &lt;a href="https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler" rel="noopener noreferrer"&gt;&lt;code&gt;PlaywrightCrawler&lt;/code&gt;&lt;/a&gt;. Since each &lt;code&gt;PlaywrightCrawler&lt;/code&gt; instance can only handle one proxy configuration, a separate crawler is created for each unique proxy setting. &lt;/p&gt;

&lt;p&gt;For example, if the user sends one request for “normal” proxies and one with residential US proxies, a separate crawler is needed for each configuration. To handle this, we store the crawlers in a key-value map, where the key is the stringified proxy configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawlers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s a part of the code that gets executed when a new request from the user arrives; if the crawler for this proxy configuration exists in the map, it will be used. Otherwise, a new crawler gets created. Then, we add the request to the crawler’s queue so it can be processed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crawlers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;crawlers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createAndStartCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function below initializes new crawlers with predefined settings and behaviors. Each crawler utilizes its own in-memory queue created with the &lt;code&gt;MemoryStorage&lt;/code&gt; client. This approach is used for two key reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: In-memory queues are faster, and there's no need to persist them when SuperScraper migrates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt;: Using a separate queue prevents interference with the shared default queue of the SuperScraper Actor, avoiding potential bugs when multiple crawlers use it simultaneously.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;createAndStartCrawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CrawlerOptions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;DEFAULT_CRAWLER_OPTIONS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MemoryStorage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;persistStorage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;RequestQueue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;storageClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createProxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;proxyConfigurationOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;keepAlive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;proxyConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;maxRequestRetries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;requestQueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the end of the function, we start the crawler and log a message if it terminates for any reason. Next, we add the newly created crawler to the key-value map containing all crawlers, and finally, we return the crawler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Crawler ended`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;crawlers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Crawler ready 🚀&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mapping standby HTTP requests to Crawlee requests
&lt;/h3&gt;

&lt;p&gt;When creating the server, it accepts a request listener function that takes two arguments: the user’s request and a response object. The response object is used to send scraped data back to the user. These response objects are stored in a key-value map so they can be accessed later in the code. The key is a randomly generated string shared between the request and its corresponding response object; it is also used as the &lt;code&gt;request.uniqueKey&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ServerResponse&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Saving response objects&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following function stores a response object in the key-value map:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;addResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ServerResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Updating crawler logic to store responses&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s the updated logic for fetching/creating the corresponding crawler for a given proxy configuration, with a call to store the response object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crawlers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;crawlers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createAndStartCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlerOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nf"&gt;addResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;uniqueKey&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;requestQueue&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sending scraped data back&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once a crawler finishes processing a request, it retrieves the corresponding response object using the key and sends the scraped data back to the user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sendSuccResponseById&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Response for request &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; not found`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeHead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;contentType&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Error handling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There is similar logic to send a response back if an error occurs during scraping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sendErrorResponseById&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Response for request &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; not found`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeHead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responseId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Adding timeouts during migrations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During migration, SuperScraper adds timeouts to pending responses to handle termination cleanly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;addTimeoutToAllResponses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;timeoutInSeconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;migrationErrorMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;errorMessage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Actor had to migrate to another server. Please, retry your request.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;responseKeys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;responseKeys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;sendErrorResponseById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;migrationErrorMessage&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;timeoutInSeconds&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Managing migrations
&lt;/h3&gt;

&lt;p&gt;SuperScraper handles migrations by timing out active responses to prevent lingering requests during server transitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;migrating&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;addTimeoutToAllResponses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users receive clear feedback during server migrations, maintaining stable operation.&lt;/p&gt;
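
&lt;p&gt;For illustration, here is a minimal client-side retry sketch. The URL and helper name are hypothetical; it only assumes the migration error shape returned by &lt;code&gt;sendErrorResponseById&lt;/code&gt; above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical client-side helper: retries when the standby server reports a migration.
const scrapeWithRetry = async (url: string, attempts: number = 3) =&amp;gt; {
    for (let i = 0; i &amp;lt; attempts; i++) {
        const response = await fetch(url);
        const body = await response.json();
        // The migration handler above responds with this errorMessage and a 500 status.
        if (body?.errorMessage?.includes('retry your request')) continue;
        return body;
    }
    throw new Error('Scraping failed after repeated migrations');
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;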

&lt;h3&gt;
  
  
  Build your own
&lt;/h3&gt;

&lt;p&gt;This guide showed how to build and manage a standby web scraper using Apify’s platform and Crawlee. The implementation handles multiple proxy configurations through &lt;code&gt;PlaywrightCrawler&lt;/code&gt; instances while managing request-response cycles efficiently to support diverse scraping needs.&lt;/p&gt;

&lt;p&gt;Standby mode transforms SuperScraper into a persistent API server, eliminating start-up delays. The migration handling system keeps operations stable during server transitions. You can build on this foundation to create web scraping tools tailored to your requirements.&lt;/p&gt;

&lt;p&gt;To get started, explore the project on &lt;a href="https://github.com/apify/super-scraper" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or learn more about &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt; to build your own scalable web scraping tools.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 13 Feb 2025 15:54:12 +0000</pubDate>
      <link>https://forem.com/sauain/-3n8</link>
      <guid>https://forem.com/sauain/-3n8</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/arindam_1729" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F965723%2Fe0982512-4de1-4154-b3c3-1869d19e9ecc.png" alt="arindam_1729"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/arindam_1729/11-exciting-github-repositories-you-should-check-right-now-37cg" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;🤯 11 Exciting GitHub Repositories You Should Check Right Now⚡️&lt;/h2&gt;
      &lt;h3&gt;Arindam Majumder  ・ Feb 13&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#opensource&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#javascript&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>opensource</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>python</category>
    </item>
    <item>
      <title>Crawlee GiveAway for our dev.to community! :)</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Fri, 13 Dec 2024 05:02:55 +0000</pubDate>
      <link>https://forem.com/sauain/crawlee-giveaway-for-our-devto-community--2nkl</link>
      <guid>https://forem.com/sauain/crawlee-giveaway-for-our-devto-community--2nkl</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/arindam_1729" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F965723%2Fe0982512-4de1-4154-b3c3-1869d19e9ecc.png" alt="arindam_1729"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/arindam_1729/7-must-try-open-source-tools-for-python-and-javascript-developers-4c56" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;7 Must-Try Open-Source Tools for Python and JavaScript Developers 🚀&lt;/h2&gt;
      &lt;h3&gt;Arindam Majumder  ・ Dec 12&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#javascript&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#beginners&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Worth a read! :)</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Mon, 02 Dec 2024 05:32:14 +0000</pubDate>
      <link>https://forem.com/sauain/worth-a-read--3naf</link>
      <guid>https://forem.com/sauain/worth-a-read--3naf</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/crawlee" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__org__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8417%2F0d74ff82-4b50-4d84-ba0b-8546be19aa84.png" alt="Crawlee" width="192" height="192"&gt;
      &lt;div class="ltag__link__user__pic"&gt;
        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1964149%2F9ec7b4f2-3e35-4b8f-976c-35b4cb0b5af2.jpg" alt="" width="620" height="620"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/crawlee/how-to-scrape-google-search-results-with-python-4ce2" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How to scrape Google search results with Python&lt;/h2&gt;
      &lt;h3&gt;Max Bohomolov for Crawlee ・ Dec 2&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#webscraping&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#webdev&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#developers&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Three years of working remotely! 🥂 ( A summary)</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Fri, 22 Nov 2024 09:21:19 +0000</pubDate>
      <link>https://forem.com/sauain/three-years-of-working-remotely-a-summary-ng2</link>
      <guid>https://forem.com/sauain/three-years-of-working-remotely-a-summary-ng2</guid>
      <description>&lt;h2&gt;
  
  
  Three years of working remotely! 🥂  ( A summary)
&lt;/h2&gt;

&lt;p&gt;Today marks the completion of my three years of working remotely, and I must say it's not been a bed of roses. 😅 &lt;/p&gt;

&lt;p&gt;Yeah, you no longer have to give fake smiles to everyone in the office, but there is more to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Positives:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: You can set your own schedule around the "personal tasks" you have.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Saves time&lt;/strong&gt;: You don't need to travel to the office just to sit the whole day at a desk; your home is your office.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Work with anyone&lt;/strong&gt;: You get a chance to work with bright minds from different parts of the world.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;More opportunities&lt;/strong&gt;: You are not bound to opportunities in a specific region.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Negatives
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Remote work can be depressing&lt;/strong&gt;: Nobody talks about it, but remote work can be insanely depressing at times. You don't interact with anyone in person for days, and things start getting piled up. Even I ended up depressed for months.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Health&lt;/strong&gt;: If you don't spend time in the gym or on other physical activity, then before you know it, you will start facing problems like back pain or weight gain from prolonged sitting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blurred boundaries&lt;/strong&gt;: While people usually see remote work as a work-life balance tool, I reckon it's the opposite. You have blurred boundaries between your normal life and work life. You will find yourself doing work at unusual times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blockers&lt;/strong&gt;: There can be communication and collaboration blockers at times because you are not in the office.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  So, should I opt for remote work?
&lt;/h2&gt;

&lt;p&gt;The answer depends on factors like opportunity, the level of your career, and the role you are going to work in.&lt;/p&gt;

&lt;p&gt;If you are early in your career and your role requires a lot of collaboration with your colleagues, I recommend that you work for at least a few months in an office to learn collaboration, communication, and other skills.&lt;/p&gt;

&lt;h2&gt;
  
  
  How can I excel in remote work?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Over-communicate&lt;/strong&gt;: This is your key to success in remote work. You HAVE to over-communicate on tasks, requirements, and everything else with your team to make sure you are well aligned with everyone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Health is the first priority&lt;/strong&gt;: Make sure you are taking care of your health and sign up for a gym or any physical activity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set boundaries&lt;/strong&gt;: Create a work routine, make hard stops after a particular time, and have blockers in your calendar.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Go to the office&lt;/strong&gt;: Yes, visit your office from time to time, maybe once every few months or whenever you get the chance: to say hi, for a product launch, for a project, or for team building. Making a personal connection helps a lot in collaboration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Take breaks&lt;/strong&gt;: Take breaks, go outside, take time off, travel. It helps your mental health a lot.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I started working remotely during Covid for an opportunity that was not available where I live. I had to go through its ups and downs, but over time I learned how to survive and excel. ✌ &lt;/p&gt;

&lt;p&gt;I hope it helped you in some way; feel free to hit me up with any questions. :) &lt;/p&gt;

&lt;p&gt;(Pic: Throwback to the day when I started my remote job at Apify)&lt;/p&gt;

</description>
      <category>workplace</category>
      <category>programming</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Reverse engineering GraphQL persistedQuery extension</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Fri, 22 Nov 2024 08:03:47 +0000</pubDate>
      <link>https://forem.com/crawlee/reverse-engineering-graphql-persistedquery-extension-4o9j</link>
      <guid>https://forem.com/crawlee/reverse-engineering-graphql-persistedquery-extension-4o9j</guid>
      <description>&lt;p&gt;GraphQL is a query language for getting deeply nested structured data from a website's backend, similar to MongoDB queries.&lt;/p&gt;

&lt;p&gt;The request is usually a POST to some general &lt;code&gt;/graphql&lt;/code&gt; endpoint with a body like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fgraphql-a3962ed441b2a078e43c8158ad64336a.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fgraphql-a3962ed441b2a078e43c8158ad64336a.webp" alt="image.png" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, with large data structures, this becomes inefficient - you are sending a large query in a POST request body, which is (almost always) the same and only changes on website updates; POST requests can’t be cached, etc. Therefore, an extension called “persisted queries” was developed. This isn’t an anti-scraping secret; you can read the public documentation about it &lt;a href="https://www.apollographql.com/docs/apollo-server/performance/apq/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;TL;DR: the client computes the sha256 hash of the &lt;code&gt;query&lt;/code&gt; text and sends only that hash. In addition, all of this can often fit into the query string of a GET request, making it easily cacheable. Below is an example request from Zillow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwnh5itsnqavjrdy3z41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwnh5itsnqavjrdy3z41.png" alt="Image description" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, it’s just some metadata about the persistedQuery extension, the hash of the query, and variables to be embedded in the query.&lt;/p&gt;
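
&lt;p&gt;Reconstructed as code, a persisted-query GET request has roughly this shape. The hash and variables below are placeholders; the &lt;code&gt;persistedQuery&lt;/code&gt; extension format itself is the one documented by Apollo (linked above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Persisted query as a GET request: only the sha256 hash of the query text is sent.
// All values here are placeholders.
const params = new URLSearchParams({
    operationName: 'SearchResults',
    variables: JSON.stringify({ limit: 10 }),
    extensions: JSON.stringify({
        persistedQuery: { version: 1, sha256Hash: '0d9e...' },
    }),
});
const response = await fetch(`https://example.com/graphql?${params}`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;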

&lt;p&gt;Here’s another request from expedia.com, sent as a POST, but with the same extension:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3v3zsylb5azcnu9di1cd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3v3zsylb5azcnu9di1cd.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This primarily optimizes website performance, but it creates several challenges for web scraping:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GET requests are usually more prone to being blocked.&lt;/li&gt;
&lt;li&gt;Hidden query text: we don’t know the full query, so if the website responds with a “Persisted query not found” error (asking us to send the query in full, not just the hash), we can’t comply.&lt;/li&gt;
&lt;li&gt;Once the website changes even a little and clients start asking for a new query, the server will soon forget the old ID/hash - even though the old query might still work - and your request with that hash will never work again, since you can’t “remind” the server of the full query text.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these reasons, you might find yourself needing to extract the whole query text. You could dig through the website’s JavaScript and, if you’re lucky, find the query text there in full, but often it is dynamically constructed from multiple fragments.&lt;/p&gt;

&lt;p&gt;So we figured out a better way: don’t touch the client-side JavaScript at all. Instead, simulate the situation where the client uses a hash that the server does not know, by intercepting the (valid) request sent by the browser in-flight and modifying the hash to a bogus one before passing it on to the server.&lt;/p&gt;

&lt;p&gt;For exactly this use case, a perfect tool exists: &lt;a href="https://mitmproxy.org/" rel="noopener noreferrer"&gt;mitmproxy&lt;/a&gt;, an open-source Python library that intercepts requests made by your own devices, websites, or apps and allows you to modify them with simple Python scripts.&lt;/p&gt;

&lt;p&gt;Download &lt;code&gt;mitmproxy&lt;/code&gt;, and prepare a Python script like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;dat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;dat&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;persistedQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sha256Hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0d9e&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# any bogus hex string here
&lt;/span&gt;        &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This defines a hook that &lt;code&gt;mitmproxy&lt;/code&gt; will run on every request: it tries to load the request's JSON body, modifies the hash to an arbitrary value, and writes the updated JSON as a new body of the request.&lt;/p&gt;

&lt;p&gt;We also need to make sure we reroute our browser requests to &lt;code&gt;mitmproxy&lt;/code&gt;. For this purpose, we are going to use a browser extension called &lt;a href="https://chromewebstore.google.com/detail/foxyproxy/gcknhkkoolaabfmlnjonogaaifnjlfnp?hl=en" rel="noopener noreferrer"&gt;FoxyProxy&lt;/a&gt;. It is available in both Firefox and Chrome.&lt;/p&gt;

&lt;p&gt;Just add a route with these settings:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2dy8eu6jltlpt8b2wf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic2dy8eu6jltlpt8b2wf.png" alt="Image description" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can run &lt;code&gt;mitmproxy&lt;/code&gt; with this script: &lt;code&gt;mitmweb -s script.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This will open a browser tab where you can watch all the intercepted requests in real-time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi01445nm11jeuy5hhsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi01445nm11jeuy5hhsq.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you open that particular request and look at the query in the request section, you will see that the bogus value has replaced the hash.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwz8oezzqsufvcvxbj47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwz8oezzqsufvcvxbj47.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, if you visit Zillow, open that same request, and check the response section, you will see that the client receives the &lt;code&gt;PersistedQueryNotFound&lt;/code&gt; error.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F805icok85iit1egchc1p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F805icok85iit1egchc1p.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Zillow front end reacts by sending the whole query as a POST request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mnb2akea8osjxadwdyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mnb2akea8osjxadwdyy.png" alt="Image description" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We extract the query and hash directly from this POST request. To make sure the Zillow server does not forget this hash, we periodically re-send the exact same query and hash in a POST request. This keeps the scraper working even when the server’s cache is cleared or reset or the website changes.&lt;/p&gt;
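
&lt;p&gt;A minimal sketch of such a keep-alive job, assuming you have already captured the full query text and its hash from the intercepted request (all values below are illustrative placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Placeholder values captured from the intercepted POST request.
const endpoint = 'https://example.com/graphql';
const fullQueryText = 'query SearchResults { ... }'; // the full query text we extracted
const queryHash = '...'; // sha256 hash of fullQueryText

// Re-registers the persisted query so the server keeps its hash in cache.
const keepPersistedQueryAlive = async () =&amp;gt; {
    await fetch(endpoint, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            query: fullQueryText,
            variables: {},
            extensions: { persistedQuery: { version: 1, sha256Hash: queryHash } },
        }),
    });
};

// For example, re-send it once a day.
setInterval(keepPersistedQueryAlive, 24 * 60 * 60 * 1000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;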

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Persisted queries are a powerful optimization tool for GraphQL APIs, enhancing website performance by minimizing payload sizes and enabling GET request caching. However, they also pose significant challenges for web scraping, primarily due to the reliance on server-stored hashes and the potential for those hashes to become invalid.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;mitmproxy&lt;/code&gt; to intercept and manipulate GraphQL requests offers an efficient way to reveal the full query text without delving into complex client-side JavaScript. By forcing the server to respond with a &lt;code&gt;PersistedQueryNotFound&lt;/code&gt; error, we can capture the full query payload and use it for scraping. Periodically re-running the extracted query keeps the scraper functional even when server-side cache resets occur or the website evolves.&lt;/p&gt;

</description>
      <category>graphql</category>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Optimizing web scraping: Scraping auth data using JSDOM</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Wed, 09 Oct 2024 05:26:49 +0000</pubDate>
      <link>https://forem.com/crawlee/optimizing-web-scraping-scraping-auth-data-using-jsdom-3cji</link>
      <guid>https://forem.com/crawlee/optimizing-web-scraping-scraping-auth-data-using-jsdom-3cji</guid>
      <description>&lt;p&gt;As scraping developers, we sometimes need to extract authentication data like temporary keys to perform our tasks. However, it is not as simple as that. Usually, it is in HTML or XHR network requests, but sometimes, the auth data is computed. In that case, we can either reverse-engineer the computation, which takes a lot of time to deobfuscate scripts or run the JavaScript that computes it. Normally, we use a browser, but that is expensive. Crawlee provides support for running browser scraper and Cheerio Scraper in parallel, but that is very complex and expensive in terms of compute resource usage. JSDOM helps us run page JavaScript with fewer resources than a browser and slightly higher than Cheerio.&lt;/p&gt;

&lt;p&gt;This article discusses a new approach that we use in one of our Actors to obtain the authentication data that the TikTok Ads Creative Center web application generates, without actually running a browser, using JSDOM instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analyzing the website
&lt;/h2&gt;

&lt;p&gt;When you visit this URL:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pc/en&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You will see a list of hashtags with their live ranking, the number of posts they have, a trend chart, creators, and analytics. You can also filter by industry, set the time period, and use a checkbox to show only trends that are new to the top 100.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1h1lkrn46ac65ovtcl75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1h1lkrn46ac65ovtcl75.png" alt="tiktok-trends" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our goal here is to extract the top 100 hashtags from the list with the given filters.&lt;/p&gt;

&lt;p&gt;There are two possible approaches: &lt;a href="https://crawlee.dev/docs/guides/cheerio-crawler-guide" rel="noopener noreferrer"&gt;&lt;code&gt;CheerioCrawler&lt;/code&gt;&lt;/a&gt; or browser-based scraping. Cheerio gives results faster but does not work with JavaScript-rendered websites.&lt;/p&gt;

&lt;p&gt;Cheerio is not the best option here, as the &lt;a href="https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en" rel="noopener noreferrer"&gt;Creative Center&lt;/a&gt; is a web application whose data source is an API, so we could only get the hashtags initially present in the HTML structure, not all 100 that we need.&lt;/p&gt;

&lt;p&gt;The second approach is to use libraries like Puppeteer or Playwright for browser-based scraping and automate the extraction of all the hashtags, but previous experience shows that this takes a lot of time for such a small task.&lt;/p&gt;

&lt;p&gt;Now comes the new approach that we developed, which makes this process much lighter than browser-based scraping and very close to &lt;code&gt;CheerioCrawler&lt;/code&gt;-based crawling.&lt;/p&gt;

&lt;h2&gt;
  
  
  JSDOM Approach
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Before diving deep into this approach, I would like to give credit to &lt;a href="https://apify.com/alexey" rel="noopener noreferrer"&gt;Alexey Udovydchenko&lt;/a&gt;, Web Automation Engineer at Apify, for developing this approach. Kudos to him!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this approach, we are going to make API calls to &lt;code&gt;https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list&lt;/code&gt; to get the required data.&lt;/p&gt;

&lt;p&gt;Before making calls to this API, we will need a few required headers (auth data), so we will first make a call to &lt;code&gt;https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We will start by creating a function that builds the URL for the API call, makes the call, and gets the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;createStartUrls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;resultsLimit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;industry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;isNewToTop100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filterBy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;isNewToTop100&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;new_on_board&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list?page=1&amp;amp;limit=50&amp;amp;period=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;days&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;country_code=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;country&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;filter_by=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;filterBy&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;sort_by=popular&amp;amp;industry_id=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;industry&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c1"&gt;// required headers&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="na"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;resultsLimit&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the function above, we create the start URL for the API call, including the various parameters we talked about earlier. With a URL built according to those parameters, the crawler will call &lt;code&gt;creative_radar_api&lt;/code&gt; and fetch all the results. A usage example follows below.&lt;/p&gt;
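
&lt;p&gt;For example, calling it with illustrative input values might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative usage of createStartUrls; the input values are made up.
const startUrls = createStartUrls({
    days: '30',
    country: 'US',
    resultsLimit: 100,
    industry: '',
    isNewToTop100: true,
});
// startUrls[0].url now targets the creative_radar_api endpoint with these filters.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;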

&lt;p&gt;But it won’t work until we get the headers. So, let’s create a function that first creates a session using &lt;code&gt;sessionPool&lt;/code&gt; and &lt;code&gt;proxyConfiguration&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;createSessionFunction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;sessionPool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// need url with data to generate token&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;gotScraping&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;proxyUrl&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getApiUrlWithVerificationToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Token generation blocked`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Generated API verification headers`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="nx"&gt;sessionPool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this function, the main goal is to call &lt;code&gt;https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en&lt;/code&gt; and get the headers in return. To get the headers, we use the &lt;code&gt;getApiUrlWithVerificationToken&lt;/code&gt; function.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before going ahead, I want to mention that Crawlee natively supports JSDOM via the &lt;a href="https://crawlee.dev/api/jsdom-crawler" rel="noopener noreferrer"&gt;JSDOM Crawler&lt;/a&gt;. It provides a framework for the parallel crawling of web pages using plain HTTP requests and the jsdom DOM implementation. Because it uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth.&lt;/p&gt;
&lt;/blockquote&gt;
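
&lt;p&gt;As a minimal sketch of that crawler (separate from the approach in this article, and assuming Crawlee's documented JSDOM Crawler API), using it looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { JSDOMCrawler } from 'crawlee';

// Downloads pages over plain HTTP and parses them with jsdom.
const crawler = new JSDOMCrawler({
    runScripts: true, // execute page scripts inside jsdom
    requestHandler: async ({ window, request }) =&amp;gt; {
        console.log(`${request.url}: ${window.document.title}`);
    },
});

await crawler.run(['https://crawlee.dev']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;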

&lt;p&gt;Let’s see how we are going to create the &lt;code&gt;getApiUrlWithVerificationToken&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getApiUrlWithVerificationToken&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Getting API session`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;virtualConsole&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;VirtualConsole&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;JSDOM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;contentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text/html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;runScripts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dangerously&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;usable&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CustomResourceLoader&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="c1"&gt;// ^ 'usable' faster than custom and works without canvas&lt;/span&gt;
        &lt;span class="na"&gt;pretendToBeVisual&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;virtualConsole&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;virtualConsole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// ignore errors cause by fake XMLHttpRequest&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiHeaderKeys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anonymous-user-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user-sign&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiValues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// api calls made outside of fetch, hack below is to get URL without actual call&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;XMLHttpRequest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;setRequestHeader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;apiHeaderKeys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;apiValues&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;apiValues&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;apiHeaderKeys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;XMLHttpRequest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prototype&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;open&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;urlToOpen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;static&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;scontent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
                &lt;span class="nx"&gt;urlToOpen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`https://&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;urlToOpen&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;urlToOpen&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;apiValues&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this function, we create a virtual console to capture page errors and a JSDOM window that loads the page’s resources and runs its scripts, standing in for the browser.&lt;/p&gt;

&lt;p&gt;For this particular example, we need three mandatory headers to make the API call: &lt;code&gt;anonymous-user-id&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;, and &lt;code&gt;user-sign&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;XMLHttpRequest.prototype.setRequestHeader&lt;/code&gt;, we check whether the mentioned headers are being set by the page; if so, we record their values and keep retrying until we have all of them.&lt;/p&gt;

&lt;p&gt;Then, the most important part: we override &lt;code&gt;XMLHttpRequest.prototype.open&lt;/code&gt; so we can extract the auth data and make calls without actually using a browser or exposing the bot activity.&lt;/p&gt;

&lt;p&gt;At the end of &lt;code&gt;createSessionFunction&lt;/code&gt;, it returns a session with the required headers.&lt;/p&gt;

&lt;p&gt;Now, coming to our main code, we will use &lt;code&gt;CheerioCrawler&lt;/code&gt; with &lt;code&gt;preNavigationHooks&lt;/code&gt; to inject the headers we got from the earlier function into the &lt;code&gt;requestHandler&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;sessionPoolOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;maxPoolSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;createSessionFunction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sessionPool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
            &lt;span class="nf"&gt;createSessionFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sessionPool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;preNavigationHooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crawlingContext&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, in the request handler, we make the call using the headers and work out how many calls are needed to fetch all the data, handling pagination.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;json&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;userData&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;itemsCounter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resultsLimit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;BLOCKED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;list&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;itemsCounter&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dataItems&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;resultsLimit&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;resultsLimit&lt;/span&gt;
            &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;resultsLimit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;itemsCounter&lt;/span&gt;
            &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dataItems&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;pagination&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;total&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;`Scraped &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;dataItems&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; results out of &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; from search page &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isResultsLimitNotReached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="nx"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resultsLimit&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isResultsLimitNotReached&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pagination&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;has_more&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;nextUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;nextUrl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;searchParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;page&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;nextUrl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="na"&gt;itemsCounter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;itemsCounter&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;dataItems&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One important thing to note here is that the code is written so that it can make any number of API calls.&lt;/p&gt;

&lt;p&gt;In this particular example, we made just one request with a single session, but you can create more if you need to. When the first API call completes, it enqueues the second one, and the pattern repeats for as many pages as required; here we stopped at two.&lt;/p&gt;

&lt;p&gt;To make things clearer, here is how the code flow looks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5t11xbj12jrwzoml51v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5t11xbj12jrwzoml51v.png" alt="code-flow" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This approach gives us a third way to extract the authentication data, without actually using a browser, and pass it to CheerioCrawler. It significantly improves performance and reduces the RAM requirement by 50%: while browser-based scraping is about ten times slower than pure Cheerio, JSDOM is only 3-4 times slower, which makes it 2-3 times faster than browser-based scraping.&lt;/p&gt;

&lt;p&gt;The project's codebase is &lt;a href="https://github.com/souravjain540/tiktok-trends" rel="noopener noreferrer"&gt;uploaded here&lt;/a&gt;. The code is written as an Apify Actor; you can learn more about Actors &lt;a href="https://docs.apify.com/academy/getting-started/creating-actors" rel="noopener noreferrer"&gt;here&lt;/a&gt;, but you can also run the code without using the Apify SDK.&lt;/p&gt;

&lt;p&gt;If you have any doubts or questions about this approach, reach out to us on our &lt;a href="https://apify.com/discord" rel="noopener noreferrer"&gt;Discord server&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>webscraping</category>
      <category>jsdom</category>
    </item>
    <item>
      <title>How to scrape infinite scrolling webpages with Python</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Wed, 28 Aug 2024 05:38:51 +0000</pubDate>
      <link>https://forem.com/crawlee/how-to-scrape-infinite-scrolling-webpages-with-python-3jhn</link>
      <guid>https://forem.com/crawlee/how-to-scrape-infinite-scrolling-webpages-with-python-3jhn</guid>
      <description>&lt;h1&gt;
  
  
  How to scrape infinite scrolling webpages with Python
&lt;/h1&gt;

&lt;p&gt;Hello, Crawlee Devs, and welcome back to another tutorial on the Crawlee Blog. This tutorial will teach you how to scrape infinite-scrolling websites using Crawlee for Python. &lt;/p&gt;

&lt;p&gt;For context, infinite-scrolling pages are a modern alternative to classic pagination: instead of making users choose the next page, the page automatically loads more data when they scroll to the bottom, so they can keep scrolling.&lt;/p&gt;

&lt;p&gt;As a big sneakerhead, I'll take the Nike shoes infinite-scrolling &lt;a href="https://www.nike.com/" rel="noopener noreferrer"&gt;website&lt;/a&gt; as an example, and we'll scrape thousands of sneakers from it. &lt;/p&gt;

&lt;p&gt;Crawlee for Python has some amazing initial features, such as a unified interface for HTTP and headless browser crawling, automatic retries, and much more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites and bootstrapping the project
&lt;/h2&gt;

&lt;p&gt;Let’s start the tutorial by installing Crawlee for Python with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipx run crawlee create nike-crawler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Before going ahead, if you're enjoying this blog, we would be really happy if you gave Crawlee for Python a star on GitHub!&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/apify" rel="noopener noreferrer"&gt;
        apify
      &lt;/a&gt; / &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;
        crawlee-python
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;
    &lt;a href="https://crawlee.dev" rel="nofollow noopener noreferrer"&gt;
        
          
          &lt;img alt="Crawlee" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fapify%2Fcrawlee-python%2Fmaster%2Fwebsite%2Fstatic%2Fimg%2Fcrawlee-light.svg%3Fsanitize%3Dtrue" width="500"&gt;
        
    &lt;/a&gt;
    &lt;br&gt;
    A web scraping and browser automation library
&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;
    &lt;a href="https://badge.fury.io/py/crawlee" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/15da2423bb8cab10ccfcebcfb55789b5f898bd2374ae3504b9157693be92f3ab/68747470733a2f2f62616467652e667572792e696f2f70792f637261776c65652e737667" alt="PyPI version"&gt;
    &lt;/a&gt;
    &lt;a href="https://pypi.org/project/crawlee/" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/3d2b832d76d0fb0fc283d197c6cd685e5cfcfb46ebba485b520dfd434c3c6237/68747470733a2f2f696d672e736869656c64732e696f2f707970692f646d2f637261776c6565" alt="PyPI - Downloads"&gt;
    &lt;/a&gt;
    &lt;a href="https://pypi.org/project/crawlee/" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/8466f1b43e225ad363016bde2b6e445f843fbe2161a61d6803ae80c01a5dae7c/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f637261776c6565" alt="PyPI - Python Version"&gt;
    &lt;/a&gt;
    &lt;a href="https://discord.gg/jyEM2PRvMU" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/afa7e1b87d07c3e1531206ff182c487373079a12f504d7a95b371871ca09d3b0/68747470733a2f2f696d672e736869656c64732e696f2f646973636f72642f3830313136333731373931353537343332333f6c6162656c3d646973636f7264" alt="Chat on discord"&gt;
    &lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Crawlee covers your crawling and scraping end-to-end and &lt;strong&gt;helps you build reliable scrapers. Fast.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🚀 Crawlee for Python is open to early adopters!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;👉 &lt;strong&gt;View full documentation, guides and examples on the &lt;a href="https://crawlee.dev/python/" rel="nofollow noopener noreferrer"&gt;Crawlee project website&lt;/a&gt;&lt;/strong&gt; 👈&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We also have a TypeScript implementation of the Crawlee, which you can explore and utilize for your projects. Visit our GitHub repository for more information &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;Crawlee for JS/TS on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;We…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;We will scrape using headless browsers. Select &lt;code&gt;PlaywrightCrawler&lt;/code&gt; in the terminal when Crawlee for Python asks for it.&lt;/p&gt;

&lt;p&gt;After installation, Crawlee for Python will create boilerplate code for you. Move into the project folder and then run this command to install all the dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  How to scrape infinite scrolling webpages
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Handling the accept cookie dialog&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adding requests for all the shoe links&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extracting data from product details&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creating an accept cookies context manager&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Handling infinite scroll on the listing page&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exporting data to CSV format&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Handling the accept cookie dialog
&lt;/h3&gt;

&lt;p&gt;After all the necessary installations, we'll start looking into the files and configuring them accordingly.&lt;/p&gt;

&lt;p&gt;When you look into the folder, you'll see many files, but for now, let’s focus on &lt;code&gt;__main__.py&lt;/code&gt; and &lt;code&gt;routes.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;__main__.py&lt;/code&gt;, let's change the target location to the Nike website. Then, just to see how scraping will happen, we'll add &lt;code&gt;headless = False&lt;/code&gt; to the &lt;code&gt;PlaywrightCrawler&lt;/code&gt; parameters. Let's also increase the maximum requests per crawl option to 100 to see the power of parallel scraping in Crawlee for Python.&lt;/p&gt;

&lt;p&gt;The final code will look like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.playwright_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawler&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.routes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;router&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;request_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://nike.com/,
        ]
    )


if __name__ == &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:
    asyncio.run(main())
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now, moving on to &lt;code&gt;routes.py&lt;/code&gt;, let’s remove this line:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We remove this call because we don’t want to scrape the whole website.&lt;/p&gt;

&lt;p&gt;Now, if you run the crawler using the command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry run python &lt;span class="nt"&gt;-m&lt;/span&gt; nike-crawler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;you'll notice that the cookie dialog blocks us from crawling more than one page's worth of shoes. Let’s get it out of our way.&lt;/p&gt;

&lt;p&gt;We can handle the cookie dialog by going to Chrome dev tools and looking at the &lt;code&gt;test_id&lt;/code&gt; of the "accept cookies" button, which is &lt;code&gt;dialog-accept-button&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, let’s remove the &lt;code&gt;context.push_data&lt;/code&gt; call that was left there from the project template and add the code to accept the dialog in routes.py. The updated code will look like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.router&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.playwright_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt;

&lt;span class="nd"&gt;@router.default_handler&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait for the popup to be visible to ensure it has loaded on the page.
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dialog-accept-button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Adding requests for all the shoe links
&lt;/h3&gt;

&lt;p&gt;Now, if you hover over the top bar, you'll see sections such as Men, Women, and Kids, and you'll notice the “All shoes” link in each. As we want to scrape all the sneakers, this section interests us. Let’s use &lt;code&gt;get_by_test_id&lt;/code&gt; with the filter &lt;code&gt;has_text='All shoes'&lt;/code&gt; and add all the links with the text “All shoes” to the request queue. Let’s add this code to the existing &lt;code&gt;routes.py&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="n"&gt;shoe_listing_links&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;All shoes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;shoe_listing_links&lt;/span&gt;
            &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;listing_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handler for shoe listings.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Extracting data from product details
&lt;/h3&gt;

&lt;p&gt;Now that we have all the links to the pages with the title “All Shoes,” the next step is to scrape all the products on each page and the information provided on them.&lt;/p&gt;

&lt;p&gt;We'll extract each shoe's URL, title, price, and description. Again, let's go to dev tools and find the relevant &lt;code&gt;test_id&lt;/code&gt; for each field. After scraping each field, we'll use the &lt;code&gt;context.push_data&lt;/code&gt; function to add it to the local dataset. Now let's update the &lt;code&gt;listing_handler&lt;/code&gt; to enqueue the product detail pages and add a new &lt;code&gt;detail&lt;/code&gt; handler in the &lt;code&gt;routes.py&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;listing_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handler for shoe listings.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;        

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a.product-card__link-overlay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detail_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handler for shoe details.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;currentPrice-container&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product-description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loaded_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Creating an accept cookies context manager
&lt;/h3&gt;

&lt;p&gt;Since we're dealing with multiple browser pages with multiple links and we want to do infinite scrolling, we may encounter an accept cookie dialog on each page. This will prevent loading more shoes via infinite scroll.&lt;/p&gt;

&lt;p&gt;We'll need to check for cookies on every page, as each one may be opened with a fresh session (no stored cookies) and we'll get the accept cookie dialog even though we already accepted it in another browser window. However, if we don't get the dialog, we want the request handler to work as usual.&lt;/p&gt;

&lt;p&gt;To solve this problem, we'll try to deal with the dialog in a parallel task that will run in the background. A context manager is a nice abstraction that will allow us to reuse this logic in all the router handlers. So, let's build a context manager:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nb"&gt;TimeoutError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;PlaywrightTimeoutError&lt;/span&gt;

&lt;span class="nd"&gt;@asynccontextmanager&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;accept_cookies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_test_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dialog-accept-button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;done&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;suppress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CancelledError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PlaywrightTimeoutError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This context manager makes sure we accept the cookie dialog, if it appears, before scrolling and scraping the page. Let’s implement it in the &lt;code&gt;routes.py&lt;/code&gt; file; the updated code is &lt;a href="https://github.com/janbuchar/crawlee-python-demo/blob/6ca6f7f1d1bbbf789a3b86f14bec492cf756251e/crawlee-python-webinar/routes.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
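
&lt;p&gt;For reference, here is roughly what the default handler looks like once it's wrapped in the context manager. This is a sketch assembled from the snippets above (imports as in the linked repository), so treat the linked file as the authoritative version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -&amp;gt; None:
    """Handler for the Nike homepage."""
    # Keep the cookie-dialog watcher running in the background
    # while we collect the shoe listing links.
    async with accept_cookies(context.page):
        shoe_listing_links = (
            await context.page.get_by_test_id('link').filter(has_text='All shoes').all()
        )
        await context.add_requests(
            [
                Request.from_url(url, label='listing')
                for link in shoe_listing_links
                if (url := await link.get_attribute('href'))
            ]
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
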
&lt;h3&gt;
  
  
  Handling infinite scroll on the listing page
&lt;/h3&gt;

&lt;p&gt;Now for the last and most interesting part of the tutorial: handling the infinite scroll on each shoe listing page and making sure our crawler keeps scrolling and scraping data continuously.&lt;/p&gt;

&lt;p&gt;To handle infinite scrolling in Crawlee for Python, we just need to make sure the page is loaded, which is done by waiting for the &lt;code&gt;networkidle&lt;/code&gt; load state, and then use the &lt;code&gt;infinite_scroll&lt;/code&gt; helper function, which keeps scrolling to the bottom of the page as long as additional items keep appearing.&lt;/p&gt;
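
&lt;p&gt;Under the hood, an infinite-scroll helper boils down to a loop like the one below. This is an illustrative sketch, not Crawlee's actual implementation; the &lt;code&gt;scroll_until_done&lt;/code&gt; name and the one-second wait are made up for the example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from playwright.async_api import Page


async def scroll_until_done(page: Page) -&amp;gt; None:
    """Keep scrolling to the bottom while new product cards keep appearing."""
    previous_count = -1
    while True:
        count = await page.locator('a.product-card__link-overlay').count()
        if count == previous_count:
            break  # nothing new loaded, so we've reached the end
        previous_count = count
        # Jump to the bottom and give the page a moment to fetch more items.
        await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
        await page.wait_for_timeout(1_000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;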

&lt;p&gt;Let’s add these two calls to the &lt;code&gt;listing&lt;/code&gt; handler, wrapped in the &lt;code&gt;accept_cookies&lt;/code&gt; context manager:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;listing_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Handler for shoe listings
&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;accept_cookies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_load_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;infinite_scroll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a.product-card__link-overlay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Exporting data to CSV format
&lt;/h2&gt;

&lt;p&gt;As we want to store all the shoe data into a CSV file, we can just add a call to the &lt;code&gt;export_data&lt;/code&gt; helper into the &lt;code&gt;__main__.py&lt;/code&gt; file just after the crawler run:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;shoes.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
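
&lt;p&gt;In context, the end of the &lt;code&gt;main&lt;/code&gt; function in &lt;code&gt;__main__.py&lt;/code&gt; would then look something like this (a sketch based on the code shown earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    await crawler.run(
        [
            'https://nike.com/',
        ]
    )

    # Write every item collected via context.push_data() into a CSV file.
    await crawler.export_data('shoes.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;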

&lt;h2&gt;
  
  
  Working crawler and its code
&lt;/h2&gt;

&lt;p&gt;Now we have a crawler ready that can scrape all the shoes from the Nike website while handling infinite scrolling and many other problems, like the cookie dialog.&lt;/p&gt;

&lt;p&gt;You can find the complete working crawler code here on the &lt;a href="https://github.com/janbuchar/crawlee-python-demo" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;If you have any doubts regarding this tutorial or using Crawlee for Python, feel free to &lt;a href="https://apify.com/discord/" rel="noopener noreferrer"&gt;join our Discord community&lt;/a&gt; and ask fellow developers or the Crawlee team.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer" class="c-link"&gt;
          Crawlee &amp;amp; Apify
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          This is the official developer community of Apify and Crawlee. | 8785 members
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdiscord.com%2Fassets%2Ffavicon.ico"&gt;
        discord.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;






&lt;p&gt;This tutorial is taken from the webinar held on August 5th, where Jan Buchar, Senior Python Engineer at Apify, gave a live demo of this use case. Watch the whole webinar &lt;a href="https://www.youtube.com/live/ip8Ii0eLfRY?si=XoiwdRM6eldKgw-Z" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
