Forem: Charles

5 Things Every Developer Should Know About Web Scraping API Keys

Charles — Tue, 19 May 2026 04:04:13 +0000

Don't leak your keys. Lessons from experience.

1. Never Hardcode

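// BAD: the key is committed along with the source
const hardcodedKey = "xc-...";
// GOOD: the key is read from the environment at runtime
const key = process.env.XCRAWL_API_KEY;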

2. Use .env Files

Store keys in a .env file and add that file to .gitignore immediately.
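
A minimal sketch of what a .env loader does, assuming the common KEY=value line format. The parseEnv helper and the sample value are illustrative only; in practice the dotenv package or Node's --env-file flag would normally do this for you.

```javascript
// Hypothetical helper: parse the text of a .env file into a plain object.
function parseEnv(text) {
  const vars = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blank lines and comments.
    if (line === '' || line.startsWith('#')) continue;
    const eq = line.indexOf('=');
    if (eq === -1) continue; // not a KEY=value line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const first = value[0];
    if (value.length >= 2 && (first === '"' || first === "'") && value[value.length - 1] === first) {
      value = value.slice(1, -1);
    }
    vars[key] = value;
  }
  return vars;
}

// Sample file contents; the key value is a placeholder, not a real key.
const sampleEnv = '# local secrets, never committed\nXCRAWL_API_KEY="xc-example"\n';
console.log(parseEnv(sampleEnv).XCRAWL_API_KEY); // prints xc-example
```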

3. Rotate Monthly

Generate new key, revoke old one. Keep a lifecycle log.
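
One way to make the monthly rotation hard to forget is to check the lifecycle log in code. This is a sketch only: the log format, field names, and dates below are assumptions, not something the XCrawl dashboard exports.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical lifecycle log: one entry per key, newest last.
const lifecycleLog = [
  { label: 'prod-key', createdAt: '2026-03-01T00:00:00Z', revokedAt: '2026-04-01T00:00:00Z' },
  { label: 'prod-key', createdAt: '2026-04-01T00:00:00Z', revokedAt: null },
];

// Returns the still-active keys that are at least maxAgeDays old.
function keysDueForRotation(log, now, maxAgeDays = 30) {
  return log.filter((entry) => {
    if (entry.revokedAt !== null) return false; // already rotated out
    const ageDays = (now.getTime() - new Date(entry.createdAt).getTime()) / MS_PER_DAY;
    return ageDays >= maxAgeDays;
  });
}

// On 19 May the key created on 1 April is 48 days old, so it is reported.
const due = keysDueForRotation(lifecycleLog, new Date('2026-05-19T00:00:00Z'));
console.log(due.map((entry) => entry.label));
```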

4. Separate per Environment

dev-key, prod-key, ci-key. If one leaks, limited damage.
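
A small sketch of picking the right key per environment and failing fast when it is missing. The variable names (XCRAWL_API_KEY_DEV and so on) are assumptions made for this example; the post only names the keys dev-key, prod-key, and ci-key.

```javascript
// Assumed mapping from environment name to environment variable.
const KEY_VARS = {
  development: 'XCRAWL_API_KEY_DEV',
  production: 'XCRAWL_API_KEY_PROD',
  ci: 'XCRAWL_API_KEY_CI',
};

// Returns the key for the given environment, or throws if it is not configured.
function apiKeyFor(envName, env = process.env) {
  if (!Object.hasOwn(KEY_VARS, envName)) {
    throw new Error(`Unknown environment: ${envName}`);
  }
  const varName = KEY_VARS[envName];
  const key = env[varName];
  if (!key) throw new Error(`Missing ${varName}`);
  return key;
}
```

Called as apiKeyFor('production'), it never falls back to another environment's key, so a misconfigured deploy stops instead of quietly using the wrong one.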

5. Monitor Usage

Check daily consumption. Spikes = possible leak.
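
A rough sketch of a spike check, assuming you can get daily request counts from somewhere (the numbers below are made up). The threshold of three times the recent average is an arbitrary starting point, not a recommendation from XCrawl.

```javascript
// Flags today's usage when it exceeds `factor` times the average of the previous days.
function isUsageSpike(previousDays, today, factor = 3) {
  if (previousDays.length === 0) return false; // no baseline yet
  const mean = previousDays.reduce((sum, n) => sum + n, 0) / previousDays.length;
  return today > mean * factor;
}

// Illustrative daily request counts.
console.log(isUsageSpike([120, 95, 110, 130], 118)); // false
console.log(isUsageSpike([120, 95, 110, 130], 900)); // true
```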

Manage keys at dash.xcrawl.com

How to Auto-Fill Google Sheets with Web Scraped Data (No Coding Required)

Charles — Tue, 19 May 2026 04:03:40 +0000

How to Auto-Fill Google Sheets with Web Scraped Data

Want scraped data flowing into Google Sheets automatically?

Method 1: Google Apps Script

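function scrapeAndFill() {
  const sheet = SpreadsheetApp.getActiveSheet();
  // Read the key from Script Properties (set XCRAWL_API_KEY there) instead of hardcoding it
  const apiKey = PropertiesService.getScriptProperties().getProperty('XCRAWL_API_KEY');
  const options = {
    method: 'post',
    contentType: 'application/json',
    headers: { 'X-API-Key': apiKey },
    payload: JSON.stringify({ url: 'https://example.com/products' }),
    muteHttpExceptions: true // inspect failures instead of throwing inside fetch
  };
  const response = UrlFetchApp.fetch('https://run.xcrawl.com/v1/scrape', options);
  if (response.getResponseCode() !== 200) {
    throw new Error('Scrape failed with HTTP ' + response.getResponseCode());
  }
  // fetch returns an HTTPResponse, so parse its text body
  const data = JSON.parse(response.getContentText());
  sheet.appendRow([data.name, data.price]);
}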

Set it to run on a time-driven trigger.

Method 2: n8n + Google Sheets

Build the workflow visually. HTTP node → transform → Google Sheets node.

Method 3: CLI → CSV → Import

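npx xcrawl scrape "https://example.com" --output data.json

The command writes JSON, so convert data.json to CSV first, then bring it in with File → Import in Google Sheets.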
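
A minimal sketch of the JSON-to-CSV step, assuming data.json holds an array of flat objects. The real CLI output may be shaped differently, so treat the sample rows as placeholders.

```javascript
// Quote a value for CSV: wrap it in double quotes when it contains a comma, quote, or newline.
function csvCell(value) {
  const text = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Convert an array of flat objects to CSV text, using the first row's keys as the header.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const columns = Object.keys(rows[0]);
  const lines = [columns.map(csvCell).join(',')];
  for (const row of rows) {
    lines.push(columns.map((col) => csvCell(row[col])).join(','));
  }
  return lines.join('\r\n');
}

// Assumed shape of the scraped data.
const sampleRows = [
  { name: 'Desk lamp', price: '19.99' },
  { name: 'Chair, oak', price: '120.00' },
];
console.log(toCsv(sampleRows));
```

To run it against the file, read data.json with fs.readFileSync, pass the parsed array to toCsv, and write the result to data.csv.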

Pro Tips

Schedule runs for off-peak hours (2-6 AM)
Always error-check before writing
Keep raw + processed sheets separate

Works with any automation tool: dash.xcrawl.com

Puppeteer vs Playwright vs Cheerio vs XCrawl: Which Web Scraping Tool to Use in 2026

Charles — Tue, 19 May 2026 04:03:07 +0000

Four tools, four different philosophies.

Comparison

Feature	Cheerio	Puppeteer	Playwright	XCrawl
Setup	5 min	15 min	15 min	2 min
JS rendering	❌	✅	✅	✅
Anti-bot	❌	❌ Basic	❌ Basic	✅ Built-in
CAPTCHA	❌	❌	❌	✅ Auto
Proxies	❌ DIY	❌ DIY	❌ DIY	✅ Auto
Price	Free	Free	Free	From $49/mo

When to Use What

Cheerio: Static HTML, simple parsing, you have proxy infra.

Puppeteer: Need CDP access, screenshots, already in Node.js.

Playwright: Cross-browser support, mobile emulation, E2E tests.

XCrawl: Production scraping, anti-bot bypass, want API simplicity.

Real-World Decision

One page once → Cheerio
Screenshots → Puppeteer
Testing → Playwright
1000 pages/day → XCrawl

XCrawl API — web scraping with built-in proxies, CAPTCHA solving, and JS rendering.

The Complete Guide to Real Estate Web Scraping in 2026

Charles — Tue, 19 May 2026 04:02:41 +0000

Real estate is one of the most competitive data markets. Here's how to do it right.

Why Scrape Real Estate?

Price monitoring — Track listing prices, drops, trends
Market analysis — Compare neighborhoods, cities, regions
Investment research — Identify undervalued properties
Competitive intelligence — Watch what other agents/brokers list

Major Platforms

Platform	Difficulty	Anti-Bot	Data Quality
Zillow	Hard	Cloudflare	Excellent
Realtor.com	Medium	Rate limits	Excellent
Redfin	Hard	JS rendering	Very Good
Rightmove (UK)	Easy	Basic	Good

Key Data Points

Price, address, bedrooms/bathrooms, sqft
Listing date, days on market
Price history (if available)
Tax assessment, HOA fees
Nearby schools, crime stats

Avoiding Common Pitfalls

IP Blocking: Real estate platforms are aggressive. Use residential proxies.

Data Accuracy: Some platforms hide data behind lazy loading. Ensure JS rendering.

Legal: Public listings are public information. Respect robots.txt and rate limits.

Sample Extraction

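const response = await fetch('https://run.xcrawl.com/v1/ai-extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json', // the body below is JSON
    'X-API-Key': 'your-key'
  },
  body: JSON.stringify({
    url: 'https://www.zillow.com/homedetails/...',
    schema: {
      price: 'Current listing price',
      address: 'Street address',
      beds: 'Number of bedrooms',
      baths: 'Number of bathrooms',
      sqft: 'Square footage'
    }
  })
});
if (!response.ok) throw new Error(`Extraction failed: HTTP ${response.status}`);
// fetch resolves to a Response, so parse the body to get the extracted fields
const result = await response.json();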

Built with XCrawl proxy API: dash.xcrawl.com

How to Scrape LinkedIn Company Data Legally and Efficiently

Charles — Tue, 19 May 2026 03:54:46 +0000

LinkedIn is a goldmine for B2B data. But scraping it is famously difficult.

The Legal Framework

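First, let's be clear: in hiQ Labs v. LinkedIn (9th Cir. 2022), the court held that scraping publicly accessible LinkedIn data likely does not violate the Computer Fraud and Abuse Act. That is narrower than "legal": the district court later found that hiQ had breached LinkedIn's User Agreement. Respecting robots.txt and rate limits is still required.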

What You Can Scrape

Public company pages — About, size, industry, website
Public profiles — Name, headline, location (if not logged in)
Company employee lists — Aggregated data (not personal)

What You Cannot Scrape

Private profiles (only visible when logged in)
Messages, endorsements, connections (LinkedIn's ToS explicitly forbids scraping these)
Anything behind rate limits you would have to evade (LinkedIn blocks aggressively)

The Tech Approach

Off-the-shelf headless browsers rarely work here: LinkedIn is quick to detect default headless Chrome.

XCrawl's approach:

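const response = await fetch('https://run.xcrawl.com/v1/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json', // the body below is JSON
    'X-API-Key': 'your-key'
  },
  body: JSON.stringify({
    url: 'https://www.linkedin.com/company/microsoft/',
    proxy: { country: 'US' },
    js_rendering: true
  })
});
if (!response.ok) throw new Error(`Scrape failed: HTTP ${response.status}`);
const data = await response.json(); // parsed response body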

Anti-Detection Features Required

Real residential IPs — Datacenter IPs get blocked instantly
Browser fingerprint spoofing — Headers, WebGL, canvas
Human-like interaction patterns — Random delays, scroll behavior
CAPTCHA solving — LinkedIn throws captchas at high volume

Best Practices

Start small: 50-100 pages/day, not 10,000
Use a dedicated proxy pool per account
Store results in structured format (JSON/CSV)
Monitor 429 rates and back off immediately

XCrawl handles LinkedIn anti-detection automatically: dash.xcrawl.com

How to Extract Structured Data from Any Website Using AI Extraction

Charles — Tue, 19 May 2026 03:27:20 +0000

Traditional web scraping means writing selectors. One CSS class change and everything breaks.

AI extraction changes this.

The Old Way

// Fragile: depends on HTML structure
const title = document.querySelector(".product-title h1 span").innerText;
const price = document.querySelector(".price-amount .current").innerText;

The New Way

// Robust: describe what you want
const result = await client.scrape({
  url: "https://example.com/product",
  extraction: {
    mode: "llm",
    schema: { title: "Product name", price: "Current price in USD", rating: "Average rating out of 5" }
  }
});

Benefits

Selector-free: No CSS selectors to maintain
Structure-proof: Keeps working through most site redesigns
Flexible: Change what to extract without rewrites
Accurate: LLMs read values in context, though results still need a sanity check
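
LLM output can still come back empty or out of range, so a quick check before storing it is cheap insurance. The sketch below assumes the result is a flat object keyed by the schema field names; the actual response shape is not shown in this post, and the sample values are invented.

```javascript
// Returns a list of problems found in an extraction result; an empty list means it passed.
function checkExtraction(result, requiredFields) {
  const problems = [];
  for (const field of requiredFields) {
    const value = result[field];
    if (value === undefined || value === null || String(value).trim() === '') {
      problems.push(`missing ${field}`);
    }
  }
  // The schema asks for a rating out of 5, so anything outside 0-5 is suspect.
  if (result.rating !== undefined && result.rating !== null) {
    const rating = Number(result.rating);
    if (Number.isNaN(rating) || rating < 0 || rating > 5) {
      problems.push('rating out of range');
    }
  }
  return problems;
}

// Hypothetical extraction result with two problems.
const sampleResult = { title: 'Widget', price: '', rating: '7.2' };
console.log(checkExtraction(sampleResult, ['title', 'price', 'rating']));
// [ 'missing price', 'rating out of range' ]
```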

Try AI extraction with XCrawl API

Web Scraping vs APIs: When to Use Which (And Why)

Charles — Tue, 19 May 2026 03:26:48 +0000

Web Scraping vs APIs: When to Use Which

Every developer faces this choice. Here's my framework.

Use an API when:

The site offers a public API with good documentation
You need structured data (JSON, not HTML)
Rate limits are reasonable (>100 req/hour)
You don't need real-time data

Use Web Scraping when:

The site has no public API
The API is rate-limited or costly
You need data not exposed through the API
The site's data is rendered client-side (SPA)
You need historical/diff data over time

The Hybrid Approach

Many projects need both:

Use the API when possible (faster, more reliable)
Fall back to scraping when the API doesn't have what you need
Use scraping tools that look like APIs
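
The hybrid approach can be sketched as a single function that tries the API and only scrapes when it has to. Both fetchers are hypothetical async functions you would supply; nothing here is tied to a specific API.

```javascript
// Try the official API first; fall back to scraping if it fails or returns nothing.
async function getWithFallback(id, fetchFromApi, fetchByScraping) {
  try {
    const data = await fetchFromApi(id);
    if (data !== null && data !== undefined) return { source: 'api', data };
  } catch (err) {
    console.warn(`API lookup failed for ${id}: ${err.message}`);
  }
  // Reached when the API threw or had no data for this id.
  return { source: 'scrape', data: await fetchByScraping(id) };
}
```

Returning the source alongside the data makes it easy to measure how often the fallback is actually used.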

Real Example: E-Commerce Price Monitoring

API approach: Amazon's Product Advertising API — limited data, requires approval, request-based pricing.

Scraping approach: Directly scrape product pages — get every data point, no approval needed, pay per page.

Best approach: A scraping API that abstracts the complexity while giving you API-like simplicity.

XCrawl gives you API-like simplicity with web scraping power: dash.xcrawl.com

5 Web Scraping Mistakes That Cost You Time and Money

Charles — Tue, 19 May 2026 03:26:48 +0000

After building hundreds of scrapers, these are the most expensive mistakes I see developers make.

Mistake 1: Building Your Own Proxy Infrastructure

You think: "I'll buy some proxies and rotate them myself."
Reality: You spend 2 weeks building, 2 hours/week maintaining, and $200/month on proxy services.

Cost: $200/month + 8-9 hours/month
Better: Use a scraping API ($49-99/month, zero maintenance)

Mistake 2: No Error Handling

Your scraper works on 80% of pages. The other 20% fail silently. You don't notice until your dataset has holes.

Fix: Always wrap in try/catch. Log every failure. Alert on >10% error rate.
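
A minimal sketch of that fix, assuming `task` is whatever async function scrapes one URL. The 10% threshold comes from the rule above; the function and field names are illustrative.

```javascript
// Runs `task` for every URL, logs each failure, and reports whether the error rate crossed the threshold.
async function runWithErrorTracking(urls, task, maxErrorRate = 0.10) {
  const failures = [];
  for (const url of urls) {
    try {
      await task(url);
    } catch (err) {
      // Record the failure instead of letting it pass silently.
      failures.push({ url, message: err.message });
      console.error(`Failed: ${url} (${err.message})`);
    }
  }
  const errorRate = urls.length === 0 ? 0 : failures.length / urls.length;
  return { failures, errorRate, alert: errorRate > maxErrorRate };
}
```

When `alert` comes back true, send the notification through whatever channel you already use; the failures array tells you which pages to retry.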

Mistake 3: Ignoring Robots.txt

Keep scraping a site that has told you not to? They update their CDN rules. Now your IP is banned, possibly for good.

Fix: Check robots.txt first. Respect crawl-delay directives.
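
A deliberately small sketch of that check. It reads only the rules under "User-agent: *", ignores Allow lines and wildcards, and treats Crawl-delay as seconds; a production crawler should use a full robots.txt parser instead.

```javascript
// Minimal sketch: collect Disallow prefixes and Crawl-delay for "User-agent: *".
function parseRobots(text) {
  const rules = { disallow: [], crawlDelay: null };
  let inStarGroup = false;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const colon = line.indexOf(':');
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      inStarGroup = (lastWasAgent && inStarGroup) || value === '*';
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;
    if (!inStarGroup || value === '') continue;
    if (field === 'disallow') rules.disallow.push(value);
    if (field === 'crawl-delay' && !Number.isNaN(Number(value))) {
      rules.crawlDelay = Number(value);
    }
  }
  return rules;
}

// Prefix match only; * and $ patterns are not handled in this sketch.
function isAllowed(rules, path) {
  return !rules.disallow.some((prefix) => path.startsWith(prefix));
}
```

The text to parse is whatever the site serves at /robots.txt; fetch it once per host and reuse the parsed rules.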

Mistake 4: Writing One Big Script

A 500-line scraper with no functions. Good luck debugging when it breaks.

Fix: Modular design. Separate: fetcher, parser, storage, notification.
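
One possible shape for that split, with each stage passed in as a function so it can be tested or swapped on its own. The stage implementations below are stand-ins, not real fetch or storage code.

```javascript
// Wires four independent stages into one run function.
function createPipeline({ fetcher, parser, storage, notifier }) {
  return async function run(url) {
    try {
      const html = await fetcher(url);
      const record = parser(html);
      await storage(record);
      return record;
    } catch (err) {
      // Any stage failure is reported, then rethrown for the caller.
      await notifier(`Scrape failed for ${url}: ${err.message}`);
      throw err;
    }
  };
}

// Placeholder stages for illustration.
const saved = [];
const run = createPipeline({
  fetcher: async (url) => `<h1>Page for ${url}</h1>`, // stand-in for a real HTTP call
  parser: (html) => ({ title: html.replace(/<[^>]+>/g, '') }),
  storage: async (record) => { saved.push(record); },
  notifier: async (message) => { console.error(message); },
});
```

When the parser breaks after a site redesign, you replace one function and leave the other three alone.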

Mistake 5: No Rate Limiting

You send 100 requests/second. The site blocks you after 10 seconds.

Fix: Add delays. 1-3 seconds between requests. Use exponential backoff on 429s.
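
A minimal sketch of exponential backoff on 429s. It assumes `request` is any async function that resolves to an object with a numeric `status`; the base delay, cap, and retry count are illustrative defaults.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped, plus up to 1 s of jitter.
function backoffDelay(attempt, baseMs = 1000, capMs = 60000, random = Math.random) {
  return Math.min(capMs, baseMs * 2 ** attempt) + Math.floor(random() * 1000);
}

// Calls `request` until it returns something other than 429 or the retries run out.
async function withBackoff(request, maxRetries = 5, sleep = defaultSleep) {
  for (let attempt = 0; ; attempt += 1) {
    const response = await request();
    if (response.status !== 429 || attempt >= maxRetries) return response;
    await sleep(backoffDelay(attempt));
  }
}
```

With the defaults, the waits grow roughly 1 s, 2 s, 4 s, 8 s, 16 s before the last response is returned as-is. The `sleep` parameter exists so tests can skip the real delays.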

Avoid these mistakes: XCrawl API

How to Scrape 1000 Pages Per Day Without Getting Banned

Charles — Tue, 19 May 2026 02:53:13 +0000

Scaling from 10 pages to 1000 pages per day is where most scrapers fail. Here's how to do it right.

The Golden Rule

Look like a human, not a bot.

Bots are detected by patterns, not volume. A human browsing 1000 pages per day would:

Click on things
Scroll at varied speeds
Spend random time on each page
Come from different IPs
Use different user agents

Proxy Pool

You need at least 10-20 IPs for 1000 pages/day. DIY costs $50-200/month. APIs include it built-in.

Request Patterns

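// pages and scrape() stand in for your own URL list and fetch function
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Bad: Mechanical timing (fixed 2 s pause)
for (const page of pages) { await scrape(page); await sleep(2000); }

// Good: Human-like timing (1.5 s base plus 1-3 s of random jitter)
for (const page of pages) { await scrape(page); await sleep(1500 + 1000 + Math.random() * 2000); }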

Concurrency

Run 3-5 parallel requests. More than that tends to trigger rate limiting.

Error Handling

429: Back off 30-60s
403: Rotate IP
503: Try later
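
The list above maps naturally onto a small decision function. This is a sketch: the action names are made up for the example, and what "rotate IP" or "try later" means in practice depends on your proxy setup and job queue.

```javascript
// Maps an HTTP status to the next action, following the list above.
function nextAction(status) {
  switch (status) {
    case 429:
      // Back off for a random 30-60 s.
      return { action: 'backoff', waitMs: 30000 + Math.floor(Math.random() * 30001) };
    case 403:
      return { action: 'rotate-ip' };
    case 503:
      return { action: 'retry-later' };
    default:
      return status >= 200 && status < 300 ? { action: 'ok' } : { action: 'fail' };
  }
}
```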

Sample Pipeline

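const client = new XcrawlScraper({ apiKey: "YOUR_KEY" });
const urls = [/* ... 1000 URLs ... */];
for (let i = 0; i < urls.length; i += 5) {
  const batch = urls.slice(i, i + 5);
  // allSettled: one failed page does not reject the whole batch
  const results = await Promise.allSettled(
    batch.map(url => client.scrape({ url, js_render: true }))
  );
  // Log failures so they are not silently dropped
  results.forEach((r, j) => {
    if (r.status === "rejected") console.error(`Failed: ${batch[j]}`, r.reason);
  });
  // Random 3-8 s pause between batches
  await new Promise(r => setTimeout(r, 3000 + Math.random() * 5000));
}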

Scale your scraping with XCrawl API

The Complete Guide to Web Scraping E-Commerce Sites in 2026

Charles — Tue, 19 May 2026 02:52:26 +0000

E-commerce scraping is one of the most common, and most difficult, scraping tasks. Here's the complete playbook.

Why E-Commerce is Hard

Anti-bot protection: Amazon, Walmart, Target all use aggressive bot detection
Dynamic content: Products are rendered by JavaScript, not served in the initial HTML
Rate limits: Aggressive throttling after N requests
Session tracking: Behavioral analysis tracks mouse movements and scroll patterns

Step-by-Step Strategy

Step 1: Choose Your Approach

Approach	Best For	Difficulty
API	Simple sites, small scale	Easy
Headless Browser	JS-rendered, moderate scale	Medium
Scraping API	Any site, any scale	Easy (just configure)

Step 2: Handle Product Pages

Key data to extract:

Title, price, availability
Reviews and ratings
Specifications
Images (URLs)
SKU/ASIN

Step 3: Handle Pagination

Most e-commerce sites paginate. Solutions:

URL parameter cycling (?page=1, ?page=2)
"Show More" button clicking (requires headless browser)
Infinite scroll (requires headless browser)
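
For the first option, URL parameter cycling, a minimal sketch using the built-in URL class. The parameter name `page` and the example URL are assumptions; check what the target site actually uses.

```javascript
// Builds page URLs by setting a query parameter on the base URL.
function pageUrls(baseUrl, lastPage, param = 'page') {
  const urls = [];
  for (let page = 1; page <= lastPage; page += 1) {
    const url = new URL(baseUrl);
    url.searchParams.set(param, String(page)); // keeps any existing query parameters
    urls.push(url.toString());
  }
  return urls;
}

console.log(pageUrls('https://example.com/products?sort=price', 3));
```

The last page number usually has to be read from the first page's pagination controls, or discovered by requesting pages until one comes back empty.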

Step 4: Handle Variants

Products come in colors, sizes, models. Each variant has a different SKU and often a different URL.

Step 5: Scale

Use concurrent requests (5-10 parallel), rotate proxies, add random delays.

Quick Start with XCrawl

const { XcrawlScraper } = require('xcrawl-scraper');
const client = new XcrawlScraper({ apiKey: 'YOUR_KEY' });

// CommonJS has no top-level await, so the call runs inside an async function
async function main() {
  const product = await client.scrape({
    url: 'https://amazon.com/dp/EXAMPLE',
    js_render: true,
    proxy: { country: 'US' },
    extraction: {
      mode: 'llm',
      schema: { title: 'string', price: 'string', rating: 'number' }
    }
  });
  console.log(product);
}

main().catch(console.error);

Scrape e-commerce sites reliably: XCrawl API

Why Your Production Web Scraper Keeps Breaking (And How to Fix It)

Charles — Tue, 19 May 2026 02:52:15 +0000

Why Your Production Web Scraper Keeps Breaking

You built a scraper. It worked for a week. Then it broke. You fixed it. It broke again.

This is the lifecycle of every DIY web scraper in production.

The Top 5 Failure Modes

1. HTML Structure Changes

A dev on the target site changes a class name. Your .product-price selector breaks.

Fix: Use semantic selectors (data attributes, text content) instead of CSS classes.

2. IP Blocks

Your scraper sends too many requests from one IP. The CDN blocks you.

Fix: Proxy rotation. Every request from a different IP.

3. Rate Limiting

You hit 429 Too Many Requests. Backoff logic is mandatory.

Fix: Implement exponential backoff. Most sites need 1-5s between requests.

4. JavaScript Rendered Content

The site switched from SSR to CSR. Suddenly requests.get() returns an empty shell.

Fix: Use js_render: true in your scraping API (like XCrawl).

5. CAPTCHA Walls

After N requests, Google reCAPTCHA appears. Game over for simple scrapers.

Fix: CAPTCHA solving services or — better — use an API that handles this.

The Reliable Stack

JS rendering — Always-on headless browser
Proxy rotation — Residential IP pool
Retry logic — Automatic retry on failure
Alert monitoring — Know when things break

Building all this yourself? Expect 2-4 hours/week of maintenance.

Using a scraping API? Set it and forget it.

Try a production-ready scraping API: XCrawl

Web Scraping 101: What Every Developer Should Know Before Writing Their First Scraper

Charles — Tue, 19 May 2026 02:51:33 +0000

Web Scraping 101: What Every Developer Should Know

Before you write your first scraper, here's what you need to know.

The Three Hard Problems

1. JavaScript Rendering

Many modern websites are SPAs. curl and requests alone won't get you the rendered content.

Solution: Use a headless browser or an API that handles JS rendering automatically.

2. Anti-Bot Protection

Cloudflare, DataDome, PerimeterX — these actively block scrapers. You need:

Residential proxy rotation
Browser fingerprint spoofing
CAPTCHA solving

3. Rate Limiting

Scrape too fast? You get blocked. Too slow? Takes forever.

Tools Compared

Tool	JS Rendering	Proxies	Cost	Learning Curve
Puppeteer	✅ Built-in	❌ Manual	Free	Medium
Playwright	✅ Built-in	❌ Manual	Free	Medium
Scrapy	❌ (needs Splash)	❌ Manual	Free	High
XCrawl API	✅ Auto	✅ Auto	$$	Low

My Advice

Start simple. If a page gives you the HTML, use Cheerio. If it blocks you, upgrade to an API that handles the hard parts. Don't build your own proxy infrastructure. It's not worth the time.

Built with XCrawl API