<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Charles</title>
    <description>The latest articles on Forem by Charles (@charles_90891cea4a1800830).</description>
    <link>https://forem.com/charles_90891cea4a1800830</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3930561%2Ffe7dbf6f-a808-436a-ad76-e12cbbe330af.png</url>
      <title>Forem: Charles</title>
      <link>https://forem.com/charles_90891cea4a1800830</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/charles_90891cea4a1800830"/>
    <language>en</language>
    <item>
      <title>5 Things Every Developer Should Know About Web Scraping API Keys</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 04:04:13 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/5-things-every-developer-should-know-about-web-scraping-api-keys-9bo</link>
      <guid>https://forem.com/charles_90891cea4a1800830/5-things-every-developer-should-know-about-web-scraping-api-keys-9bo</guid>
      <description>&lt;h1&gt;
  
  
  5 Things Every Developer Should Know About Web Scraping API Keys
&lt;/h1&gt;

&lt;p&gt;Don't leak your keys. Lessons from experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Never Hardcode
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BAD&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;xc-...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;// GOOD&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;XCRAWL_API_KEY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
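&lt;p&gt;A minimal sketch of reading the key at startup and failing fast when it is missing, so a misconfigured deploy stops before it sends requests with an undefined key. The helper name requireEnv is an assumption for illustration and not part of any library.&lt;/p&gt;

```javascript
// Read a required secret from the environment and fail fast if it is absent.
// `requireEnv` is a hypothetical helper; XCRAWL_API_KEY is the variable name used in this post.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage (throws at startup instead of failing later with a confusing auth error):
// const key = requireEnv('XCRAWL_API_KEY');
```

&lt;p&gt;Passing the environment object as a second argument keeps the helper easy to test without touching the real process environment.&lt;/p&gt;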



&lt;h2&gt;
  
  
  2. Use .env Files
&lt;/h2&gt;

&lt;p&gt;Keep keys in a .env file and add that file to .gitignore immediately, before the first commit.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Rotate Monthly
&lt;/h2&gt;

&lt;p&gt;Generate a new key, switch your services over to it, then revoke the old one. Keep a log of when each key was issued and revoked.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Separate per Environment
&lt;/h2&gt;

&lt;p&gt;Use a separate key for each environment: dev-key, prod-key, ci-key. If one leaks, the damage is limited to that environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Monitor Usage
&lt;/h2&gt;

&lt;p&gt;Check daily consumption. An unexplained spike can mean a leaked key.&lt;/p&gt;
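&lt;p&gt;One way to automate that check, sketched under assumptions: you already export daily request counts from your dashboard or logs, and a day counts as a spike when it exceeds three times the average of the preceding days. The function name and the factor of 3 are illustrative choices.&lt;/p&gt;

```javascript
// Flag the latest day if it exceeds `factor` times the average of the earlier days.
// `dailyCounts` lists request counts from oldest to newest; the default factor is arbitrary.
function isUsageSpike(dailyCounts, factor = 3) {
  if (2 > dailyCounts.length) return false; // need at least one earlier day to compare
  const history = dailyCounts.slice(0, -1);
  const latest = dailyCounts[dailyCounts.length - 1];
  const average = history.reduce((sum, n) => sum + n, 0) / history.length;
  return latest > average * factor;
}

// Usage:
// if (isUsageSpike([100, 120, 110, 900])) { /* rotate the key and investigate */ }
```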




&lt;p&gt;&lt;em&gt;Manage keys at &lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;dash.xcrawl.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to Auto-Fill Google Sheets with Web Scraped Data (No Coding Required)</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 04:03:40 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/how-to-auto-fill-google-sheets-with-web-scraped-data-no-coding-required-2edn</link>
      <guid>https://forem.com/charles_90891cea4a1800830/how-to-auto-fill-google-sheets-with-web-scraped-data-no-coding-required-2edn</guid>
      <description>&lt;h1&gt;
  
  
  How to Auto-Fill Google Sheets with Web Scraped Data
&lt;/h1&gt;

&lt;p&gt;Want scraped data flowing into Google Sheets automatically?&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 1: Google Apps Script
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeAndFill&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sheet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;SpreadsheetApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getActiveSheet&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;X-API-Key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your-key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://example.com/products&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;UrlFetchApp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://run.xcrawl.com/v1/scrape&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;sheet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendRow&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set the function to run on a time-driven trigger.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 2: n8n + Google Sheets
&lt;/h2&gt;

&lt;p&gt;Build the workflow visually. HTTP node → transform → Google Sheets node.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 3: CLI → CSV → Import
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx xcrawl scrape &lt;span class="s2"&gt;"https://example.com"&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; data.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
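&lt;p&gt;The command above writes JSON, while the Google Sheets file import expects CSV, so a conversion step sits in between. Below is a rough sketch of that step. It assumes data.json holds an array of flat objects that share the same keys, which may not match the actual output shape, and toCsv is a hypothetical helper name.&lt;/p&gt;

```javascript
// Convert an array of flat objects into CSV text with RFC 4180-style quoting.
// Assumption: every row has the same keys as the first row.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const columns = Object.keys(rows[0]);
  const escapeCell = (value) => {
    const text = value === null || value === undefined ? '' : String(value);
    // Quote cells containing a comma, quote, or line break; double any embedded quotes.
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const lines = [columns.map(escapeCell).join(',')];
  for (const row of rows) {
    lines.push(columns.map((column) => escapeCell(row[column])).join(','));
  }
  return lines.join('\n');
}

// Usage in Node.js (file names are placeholders):
// const fs = require('node:fs');
// const rows = JSON.parse(fs.readFileSync('data.json', 'utf8'));
// fs.writeFileSync('data.csv', toCsv(rows));
```

&lt;p&gt;The resulting data.csv can then be brought in through File, Import in Google Sheets.&lt;/p&gt;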



&lt;h2&gt;
  
  
  Pro Tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Schedule runs during off-peak hours (2-6 AM)&lt;/li&gt;
&lt;li&gt;Always check the response for errors before writing to the sheet&lt;/li&gt;
&lt;li&gt;Keep raw and processed data in separate sheets&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Works with any automation tool: &lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;dash.xcrawl.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Puppeteer vs Playwright vs Cheerio vs XCrawl: Which Web Scraping Tool to Use in 2026</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 04:03:07 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/puppeteer-vs-playwright-vs-cheerio-vs-xcrawl-which-web-scraping-tool-to-use-in-2026-2fn</link>
      <guid>https://forem.com/charles_90891cea4a1800830/puppeteer-vs-playwright-vs-cheerio-vs-xcrawl-which-web-scraping-tool-to-use-in-2026-2fn</guid>
      <description>&lt;h1&gt;
  
  
  Puppeteer vs Playwright vs Cheerio vs XCrawl: Which Web Scraping Tool to Use in 2026
&lt;/h1&gt;

&lt;p&gt;Four tools, four different philosophies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Cheerio&lt;/th&gt;
&lt;th&gt;Puppeteer&lt;/th&gt;
&lt;th&gt;Playwright&lt;/th&gt;
&lt;th&gt;XCrawl&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 min&lt;/td&gt;
&lt;td&gt;15 min&lt;/td&gt;
&lt;td&gt;15 min&lt;/td&gt;
&lt;td&gt;2 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JS rendering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anti-bot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌ Basic&lt;/td&gt;
&lt;td&gt;❌ Basic&lt;/td&gt;
&lt;td&gt;✅ Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CAPTCHA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ Auto&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Proxies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ DIY&lt;/td&gt;
&lt;td&gt;❌ DIY&lt;/td&gt;
&lt;td&gt;❌ DIY&lt;/td&gt;
&lt;td&gt;✅ Auto&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Price&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;From $49/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cheerio:&lt;/strong&gt; Static HTML, simple parsing, you have proxy infra.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Puppeteer:&lt;/strong&gt; Need CDP access, screenshots, already in Node.js.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playwright:&lt;/strong&gt; Cross-browser support, mobile emulation, E2E tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XCrawl:&lt;/strong&gt; Production scraping, anti-bot bypass, want API simplicity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Decision
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One page once → Cheerio
Screenshots → Puppeteer
Testing → Playwright
1000 pages/day → XCrawl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;&lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;XCrawl API&lt;/a&gt; — web scraping with built-in proxies, CAPTCHA solving, and JS rendering.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The Complete Guide to Real Estate Web Scraping in 2026</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 04:02:41 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/the-complete-guide-to-real-estate-web-scraping-in-2026-496d</link>
      <guid>https://forem.com/charles_90891cea4a1800830/the-complete-guide-to-real-estate-web-scraping-in-2026-496d</guid>
      <description>&lt;h1&gt;
  
  
  The Complete Guide to Real Estate Web Scraping in 2026
&lt;/h1&gt;

&lt;p&gt;Real estate is one of the most competitive data markets. Here's how to do it right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Scrape Real Estate?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Price monitoring&lt;/strong&gt; — Track listing prices, drops, trends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Market analysis&lt;/strong&gt; — Compare neighborhoods, cities, regions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Investment research&lt;/strong&gt; — Identify undervalued properties&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive intelligence&lt;/strong&gt; — Watch what other agents/brokers list&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Major Platforms
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Difficulty&lt;/th&gt;
&lt;th&gt;Anti-Bot&lt;/th&gt;
&lt;th&gt;Data Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Zillow&lt;/td&gt;
&lt;td&gt;Hard&lt;/td&gt;
&lt;td&gt;Cloudflare&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Realtor.com&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Rate limits&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redfin&lt;/td&gt;
&lt;td&gt;Hard&lt;/td&gt;
&lt;td&gt;JS rendering&lt;/td&gt;
&lt;td&gt;Very Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rightmove (UK)&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Data Points
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Price, address, bedrooms/bathrooms, sqft&lt;/li&gt;
&lt;li&gt;Listing date, days on market&lt;/li&gt;
&lt;li&gt;Price history (if available)&lt;/li&gt;
&lt;li&gt;Tax assessment, HOA fees&lt;/li&gt;
&lt;li&gt;Nearby schools, crime stats&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Avoiding Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;IP Blocking:&lt;/strong&gt; Real estate platforms are aggressive. Use residential proxies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Accuracy:&lt;/strong&gt; Some platforms hide data behind lazy loading. Ensure JS rendering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal:&lt;/strong&gt; Public listings are public information. Respect robots.txt and rate limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sample Extraction
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://run.xcrawl.com/v1/ai-extract&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;X-API-Key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your-key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.zillow.com/homedetails/...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Current listing price&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Street address&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;beds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Number of bedrooms&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;baths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Number of bathrooms&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;sqft&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Square footage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Built with XCrawl proxy API: &lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;dash.xcrawl.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>realestate</category>
    </item>
    <item>
      <title>How to Scrape LinkedIn Company Data Legally and Efficiently</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 03:54:46 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/how-to-scrape-linkedin-company-data-legally-and-efficiently-40kb</link>
      <guid>https://forem.com/charles_90891cea4a1800830/how-to-scrape-linkedin-company-data-legally-and-efficiently-40kb</guid>
      <description>&lt;h1&gt;
  
  
  How to Scrape LinkedIn Company Data Legally and Efficiently
&lt;/h1&gt;

&lt;p&gt;LinkedIn is a goldmine for B2B data. But scraping it is famously difficult.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Legal Framework
&lt;/h2&gt;

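&lt;p&gt;First, let's be clear about what the courts have said. In &lt;strong&gt;hiQ Labs v. LinkedIn&lt;/strong&gt; (Ninth Circuit, 2022), the court held that scraping publicly accessible LinkedIn data likely does not violate the Computer Fraud and Abuse Act. That ruling does not cover contract claims under LinkedIn's User Agreement, so respecting robots.txt and rate limits still matters.&lt;/p&gt;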

&lt;h2&gt;
  
  
  What You Can Scrape
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public company pages&lt;/strong&gt; — About, size, industry, website&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public profiles&lt;/strong&gt; — Name, headline, location (if not logged in)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Company employee lists&lt;/strong&gt; — Aggregated data (not personal)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What You Cannot Scrape
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Private profiles&lt;/strong&gt; (only visible when logged in)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Messages, endorsements, connections&lt;/strong&gt; — LinkedIn's ToS explicitly forbids&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate-limited data&lt;/strong&gt; — LinkedIn blocks aggressively&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Tech Approach
&lt;/h2&gt;

&lt;p&gt;Stock headless browsers won't work: LinkedIn detects headless Chrome almost immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;XCrawl's approach:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://run.xcrawl.com/v1/scrape&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;X-API-Key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your-key&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.linkedin.com/company/microsoft/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;US&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;js_rendering&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Anti-Detection Features Required
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Real residential IPs&lt;/strong&gt; — Datacenter IPs get blocked instantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser fingerprint spoofing&lt;/strong&gt; — Headers, WebGL, canvas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-like interaction patterns&lt;/strong&gt; — Random delays, scroll behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CAPTCHA solving&lt;/strong&gt; — LinkedIn throws captchas at high volume&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Start small: 50-100 pages/day, not 10,000&lt;/li&gt;
&lt;li&gt;Use a dedicated proxy pool per account&lt;/li&gt;
&lt;li&gt;Store results in structured format (JSON/CSV)&lt;/li&gt;
&lt;li&gt;Monitor 429 rates and back off immediately&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;XCrawl handles LinkedIn anti-detection automatically: &lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;dash.xcrawl.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>linkedin</category>
      <category>tutorial</category>
      <category>javascript</category>
    </item>
    <item>
      <title>How to Extract Structured Data from Any Website Using AI Extraction</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 03:27:20 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/how-to-extract-structured-data-from-any-website-using-ai-extraction-1j5h</link>
      <guid>https://forem.com/charles_90891cea4a1800830/how-to-extract-structured-data-from-any-website-using-ai-extraction-1j5h</guid>
      <description>&lt;h1&gt;
  
  
  How to Extract Structured Data from Any Website Using AI Extraction
&lt;/h1&gt;

&lt;p&gt;Traditional web scraping means writing selectors. One CSS class change and everything breaks.&lt;/p&gt;

&lt;p&gt;AI extraction changes this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Old Way
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Fragile: depends on HTML structure&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.product-title h1 span&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.price-amount .current&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The New Way
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Robust: describe what you want&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://example.com/product&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;extraction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;llm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Product name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Current price in USD&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Average rating out of 5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Benefits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Selector-free:&lt;/strong&gt; No CSS selectors to maintain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure-proof:&lt;/strong&gt; Works even if the site redesigns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible:&lt;/strong&gt; Change what to extract without rewrites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accurate:&lt;/strong&gt; LLMs understand context&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Try AI extraction with &lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;XCrawl API&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>ai</category>
      <category>tutorial</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Web Scraping vs APIs: When to Use Which (And Why)</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 03:26:48 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/web-scraping-vs-apis-when-to-use-which-and-why-3gee</link>
      <guid>https://forem.com/charles_90891cea4a1800830/web-scraping-vs-apis-when-to-use-which-and-why-3gee</guid>
      <description>&lt;h1&gt;
  
  
  Web Scraping vs APIs: When to Use Which
&lt;/h1&gt;

&lt;p&gt;Every developer faces this choice. Here's my framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use an API when:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The site offers a public API with good documentation&lt;/li&gt;
&lt;li&gt;You need structured data (JSON, not HTML)&lt;/li&gt;
&lt;li&gt;Rate limits are reasonable (&amp;gt;100 req/hour)&lt;/li&gt;
&lt;li&gt;You don't need real-time data&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use Web Scraping when:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The site has no public API&lt;/li&gt;
&lt;li&gt;The API is rate-limited or costly&lt;/li&gt;
&lt;li&gt;You need data not exposed through the API&lt;/li&gt;
&lt;li&gt;The site's data is rendered client-side (SPA)&lt;/li&gt;
&lt;li&gt;You need historical/diff data over time&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hybrid Approach
&lt;/h2&gt;

&lt;p&gt;Many projects need both:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the API when possible (faster, more reliable)&lt;/li&gt;
&lt;li&gt;Fall back to scraping when the API doesn't have what you need&lt;/li&gt;
&lt;li&gt;Use scraping tools that look like APIs&lt;/li&gt;
&lt;/ol&gt;
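&lt;p&gt;The API-first, scrape-as-fallback order can be sketched roughly as follows. Both fetchFromApi and fetchByScraping are hypothetical functions you would supply yourself; nothing here is tied to a specific provider.&lt;/p&gt;

```javascript
// Try the official API first; fall back to scraping when it throws or returns nothing.
// `fetchFromApi` and `fetchByScraping` are hypothetical async functions passed in by the caller.
async function fetchWithFallback(id, fetchFromApi, fetchByScraping) {
  try {
    const record = await fetchFromApi(id);
    if (record != null) { // loose check covers both null and undefined
      return { source: 'api', record };
    }
  } catch (error) {
    console.warn(`API lookup failed for ${id}: ${error.message}`);
  }
  const record = await fetchByScraping(id);
  return { source: 'scrape', record };
}
```

&lt;p&gt;Returning the source alongside the record makes it easy to track how often the fallback is actually used.&lt;/p&gt;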

&lt;h2&gt;
  
  
  Real Example: E-Commerce Price Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;API approach:&lt;/strong&gt; Amazon's Product Advertising API — limited data, requires approval, request-based pricing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scraping approach:&lt;/strong&gt; Directly scrape product pages — get every data point, no approval needed, pay per page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best approach:&lt;/strong&gt; A scraping API that abstracts the complexity while giving you API-like simplicity.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;XCrawl gives you API-like simplicity with web scraping power: &lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;dash.xcrawl.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>api</category>
      <category>tutorial</category>
      <category>javascript</category>
    </item>
    <item>
      <title>5 Web Scraping Mistakes That Cost You Time and Money</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 03:26:48 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/5-web-scraping-mistakes-that-cost-you-time-and-money-1246</link>
      <guid>https://forem.com/charles_90891cea4a1800830/5-web-scraping-mistakes-that-cost-you-time-and-money-1246</guid>
      <description>&lt;h1&gt;
  
  
  5 Web Scraping Mistakes That Cost You Time and Money
&lt;/h1&gt;

&lt;p&gt;After building hundreds of scrapers, these are the most expensive mistakes I see developers make.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 1: Building Your Own Proxy Infrastructure
&lt;/h2&gt;

&lt;p&gt;You think: "I'll buy some proxies and rotate them myself."&lt;br&gt;
Reality: You spend 2 weeks building, 2 hours/week maintaining, and $200/month on proxy services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $200/month + 8+ hours/month of maintenance&lt;br&gt;
&lt;strong&gt;Better:&lt;/strong&gt; Use a scraping API ($49-99/month, zero maintenance)&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 2: No Error Handling
&lt;/h2&gt;

&lt;p&gt;Your scraper works on 80% of pages. The other 20% fail silently. You don't notice until your dataset has holes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Always wrap in try/catch. Log every failure. Alert on &amp;gt;10% error rate.&lt;/p&gt;
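
&lt;p&gt;Here is a minimal sketch of that fix. The names &lt;code&gt;createFailureTracker&lt;/code&gt; and &lt;code&gt;scrapeAll&lt;/code&gt; are made up for illustration, and &lt;code&gt;scrapePage&lt;/code&gt; stands for whatever function fetches one page in your project. The 10% threshold comes from the rule above.&lt;/p&gt;

```javascript
// Sketch: count successes and failures, and flag when the error rate passes a threshold.
// createFailureTracker and its method names are hypothetical, not part of any library.
function createFailureTracker(threshold = 0.1) {
  let total = 0;
  const failures = [];
  return {
    recordSuccess() {
      total += 1;
    },
    recordFailure(url, err) {
      total += 1;
      failures.push({ url, message: String(err) });
      console.error(`scrape failed: ${url}: ${err}`); // log every failure
    },
    errorRate() {
      return total === 0 ? 0 : failures.length / total;
    },
    shouldAlert() {
      return this.errorRate() > threshold;
    },
  };
}

// Wrap each page in try/catch so one bad page is recorded instead of failing silently.
// scrapePage is an assumed async function supplied by the caller.
async function scrapeAll(urls, scrapePage, tracker) {
  for (const url of urls) {
    try {
      await scrapePage(url);
      tracker.recordSuccess();
    } catch (err) {
      tracker.recordFailure(url, err);
    }
  }
  return tracker.shouldAlert();
}
```

&lt;p&gt;When &lt;code&gt;scrapeAll&lt;/code&gt; resolves to &lt;code&gt;true&lt;/code&gt;, send the alert through whatever channel you already use.&lt;/p&gt;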

&lt;h2&gt;
  
  
  Mistake 3: Ignoring Robots.txt
&lt;/h2&gt;

&lt;p&gt;Keep scraping paths a site has disallowed and it will notice. They update their CDN rules, and your IP can end up banned for good.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Check robots.txt first. Respect crawl-delay directives.&lt;/p&gt;
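
&lt;p&gt;A rough sketch of that check, under loud assumptions: it only reads the &lt;code&gt;User-agent: *&lt;/code&gt; group, treats each &lt;code&gt;Disallow&lt;/code&gt; value as a plain path prefix, and ignores &lt;code&gt;Allow&lt;/code&gt; rules and wildcards. &lt;code&gt;parseRobots&lt;/code&gt; and &lt;code&gt;isAllowed&lt;/code&gt; are illustrative names, not a library API.&lt;/p&gt;

```javascript
// Sketch: parse the "User-agent: *" group of a robots.txt body.
// Only plain Disallow prefixes and Crawl-delay are handled; Allow rules,
// wildcards and groups that list several user agents are not, so treat
// this as a starting point rather than a complete parser.
function parseRobots(text) {
  const rules = { disallow: [], crawlDelay: null };
  let inStarGroup = false;
  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === "user-agent") {
      inStarGroup = value === "*";
    } else if (inStarGroup) {
      if (field === "disallow") {
        if (value !== "") rules.disallow.push(value);
      } else if (field === "crawl-delay") {
        rules.crawlDelay = Number(value); // seconds between requests
      }
    }
  }
  return rules;
}

// True when no Disallow prefix matches the start of the path.
function isAllowed(rules, path) {
  return !rules.disallow.some((prefix) => path.startsWith(prefix));
}
```

&lt;p&gt;Fetch &lt;code&gt;/robots.txt&lt;/code&gt; once per host, cache the parsed rules, and use &lt;code&gt;crawlDelay&lt;/code&gt; as the minimum pause when it is set.&lt;/p&gt;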

&lt;h2&gt;
  
  
  Mistake 4: Writing One Big Script
&lt;/h2&gt;

&lt;p&gt;A 500-line scraper with no functions. Good luck debugging when it breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Modular design. Separate the fetcher, parser, storage, and notification steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mistake 5: No Rate Limiting
&lt;/h2&gt;

&lt;p&gt;You send 100 requests/second. The site blocks you after 10 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Add delays. 1-3 seconds between requests. Use exponential backoff on 429s.&lt;/p&gt;
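
&lt;p&gt;One way to sketch the backoff part, assuming Node 18+ for the global &lt;code&gt;fetch&lt;/code&gt;. The 1 second base, 60 second cap, and retry count are illustrative defaults, and the function names are made up for this example.&lt;/p&gt;

```javascript
// Sketch: exponential backoff with full jitter. baseMs, capMs and the rand
// parameter are illustrative choices, not values required by any API.
function backoffDelay(attempt, baseMs = 1000, capMs = 60000, rand = Math.random) {
  // The ceiling doubles on every attempt until it reaches the cap.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  // Pick a random wait below the ceiling so retries do not line up.
  return Math.floor(rand() * ceiling);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry on HTTP 429 only; any other response is returned to the caller as-is.
// Assumes the global fetch available in Node 18 and later.
async function fetchWithBackoff(url, maxRetries = 5) {
  for (let attempt = 0; ; attempt += 1) {
    const res = await fetch(url);
    if (res.status !== 429 || attempt >= maxRetries) return res;
    await sleep(backoffDelay(attempt));
  }
}
```

&lt;p&gt;The jitter matters as much as the doubling: without it, parallel workers that were throttled together retry together.&lt;/p&gt;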




&lt;p&gt;&lt;em&gt;Avoid these mistakes: &lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;XCrawl API&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to Scrape 1000 Pages Per Day Without Getting Banned</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 02:53:13 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/how-to-scrape-1000-pages-per-day-without-getting-banned-5bmf</link>
      <guid>https://forem.com/charles_90891cea4a1800830/how-to-scrape-1000-pages-per-day-without-getting-banned-5bmf</guid>
      <description>&lt;h1&gt;
  
  
  How to Scrape 1000 Pages Per Day Without Getting Banned
&lt;/h1&gt;

&lt;p&gt;Scaling from 10 pages to 1000 pages per day is where most scrapers fail. Here's how to do it right.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Golden Rule
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Look like a human, not a bot.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bots are detected by patterns, not volume. A human browsing 1000 pages per day would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click on things&lt;/li&gt;
&lt;li&gt;Scroll at varied speeds&lt;/li&gt;
&lt;li&gt;Spend random time on each page&lt;/li&gt;
&lt;li&gt;Come from different IPs&lt;/li&gt;
&lt;li&gt;Use different user agents&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Proxy Pool
&lt;/h2&gt;

&lt;p&gt;You need at least 10-20 IPs for 1000 pages/day. DIY costs $50-200/month. Scraping APIs include a pool in the price.&lt;/p&gt;

&lt;h2&gt;
  
  
  Request Patterns
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
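&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Helper: pause for the given number of milliseconds&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sleep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// Bad: mechanical timing, exactly 2 seconds after every page&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Good: human-like timing, a random 2.5-4.5 second pause&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2500&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;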
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Concurrency
&lt;/h2&gt;

&lt;p&gt;Run 3-5 parallel requests. More triggers rate limiting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Error Handling
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;429&lt;/strong&gt;: Back off 30-60s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;403&lt;/strong&gt;: Rotate IP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;503&lt;/strong&gt;: Try later&lt;/li&gt;
&lt;/ul&gt;
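
&lt;p&gt;The list above can be sketched as a small lookup. The action names and the 5 minute wait for 503 are assumptions for illustration; only the 30-60 second backoff comes from the list.&lt;/p&gt;

```javascript
// Sketch: map a status code to the next step. The action names and the
// 5-minute retry for 503 are assumptions made for illustration.
function actionForStatus(status, rand = Math.random) {
  switch (status) {
    case 429: // rate limited: back off 30-60 seconds
      return { action: "backoff", waitMs: 30000 + Math.floor(rand() * 30000) };
    case 403: // blocked: retry through a different IP
      return { action: "rotate-ip", waitMs: 0 };
    case 503: // temporarily unavailable: requeue for later
      return { action: "retry-later", waitMs: 5 * 60 * 1000 };
    default: // anything else needs no special handling here
      return { action: "none", waitMs: 0 };
  }
}
```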

&lt;h2&gt;
  
  
  Sample Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
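&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;XcrawlScraper&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;xcrawl-scraper&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;XcrawlScraper&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;XCRAWL_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;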
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...];&lt;/span&gt; &lt;span class="c1"&gt;// 1000 URLs&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;allSettled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;js_render&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;
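  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Log rejected requests so failures are not silent&lt;/span&gt;
  &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rejected&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;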
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Scale your scraping with &lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;XCrawl API&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scalability</category>
      <category>tutorial</category>
      <category>javascript</category>
    </item>
    <item>
      <title>The Complete Guide to Web Scraping E-Commerce Sites in 2026</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 02:52:26 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/the-complete-guide-to-web-scraping-e-commerce-sites-in-2026-1hgm</link>
      <guid>https://forem.com/charles_90891cea4a1800830/the-complete-guide-to-web-scraping-e-commerce-sites-in-2026-1hgm</guid>
      <description>&lt;h1&gt;
  
  
  The Complete Guide to Web Scraping E-Commerce Sites in 2026
&lt;/h1&gt;

&lt;p&gt;E-commerce scraping is one of the most common, and most difficult, scraping tasks. Here's the complete playbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why E-Commerce is Hard
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anti-bot protection&lt;/strong&gt;: Amazon, Walmart, Target all use aggressive bot detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic content&lt;/strong&gt;: Product data is often rendered client-side by JavaScript, so it is missing from the initial HTML&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limits&lt;/strong&gt;: Aggressive throttling after N requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session tracking&lt;/strong&gt;: Behavioral analysis tracks mouse movements and scroll patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-Step Strategy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Choose Your Approach
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Difficulty&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Simple sites, small scale&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Headless Browser&lt;/td&gt;
&lt;td&gt;JS-rendered, moderate scale&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scraping API&lt;/td&gt;
&lt;td&gt;Any site, any scale&lt;/td&gt;
&lt;td&gt;Easy (just configure)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 2: Handle Product Pages
&lt;/h3&gt;

&lt;p&gt;Key data to extract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Title, price, availability&lt;/li&gt;
&lt;li&gt;Reviews and ratings&lt;/li&gt;
&lt;li&gt;Specifications&lt;/li&gt;
&lt;li&gt;Images (URLs)&lt;/li&gt;
&lt;li&gt;SKU/ASIN&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Handle Pagination
&lt;/h3&gt;

&lt;p&gt;Most e-commerce sites paginate. Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;URL parameter cycling (?page=1, ?page=2)&lt;/li&gt;
&lt;li&gt;"Show More" button clicking (requires headless browser)&lt;/li&gt;
&lt;li&gt;Infinite scroll (requires headless browser)&lt;/li&gt;
&lt;/ul&gt;
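
&lt;p&gt;The URL parameter case is the easiest to automate. A small sketch using Node's built-in &lt;code&gt;URL&lt;/code&gt; class follows; the helper name &lt;code&gt;pageUrls&lt;/code&gt; and the example host are made up, and real sites may use a different parameter name or an offset instead of a page number.&lt;/p&gt;

```javascript
// Sketch: build ?page=N URLs for a listing. The parameter name "page" matches
// the example above; real sites may name it differently or use an offset.
function pageUrls(baseUrl, firstPage, lastPage, param = "page") {
  const count = Math.max(0, lastPage - firstPage + 1);
  return Array.from({ length: count }, (_, i) => {
    const url = new URL(baseUrl); // keeps any existing query parameters
    url.searchParams.set(param, String(firstPage + i));
    return url.toString();
  });
}

// Example: the first three result pages of a hypothetical search listing.
const listingPages = pageUrls("https://shop.example.com/search?q=shoes", 1, 3);
console.log(listingPages.length); // 3
```

&lt;p&gt;In practice you rarely know the last page up front, so stop as soon as a page comes back with no products.&lt;/p&gt;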

&lt;h3&gt;
  
  
  Step 4: Handle Variants
&lt;/h3&gt;

&lt;p&gt;Products come in colors, sizes, models. Each variant has a different SKU and often a different URL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Scale
&lt;/h3&gt;

&lt;p&gt;Use concurrent requests (5-10 parallel), rotate proxies, add random delays.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start with XCrawl
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;XcrawlScraper&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;xcrawl-scraper&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;XcrawlScraper&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;XCRAWL_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// require() means CommonJS, where top-level await is unavailable, so wrap the call&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://amazon.com/dp/EXAMPLE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;js_render&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;US&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;extraction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;number&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;})();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Scrape e-commerce sites reliably: &lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;XCrawl API&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>ecommerce</category>
      <category>tutorial</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Why Your Production Web Scraper Keeps Breaking (And How to Fix It)</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 02:52:15 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/why-your-production-web-scraper-keeps-breaking-and-how-to-fix-it-1ije</link>
      <guid>https://forem.com/charles_90891cea4a1800830/why-your-production-web-scraper-keeps-breaking-and-how-to-fix-it-1ije</guid>
      <description>&lt;h1&gt;
  
  
  Why Your Production Web Scraper Keeps Breaking
&lt;/h1&gt;

&lt;p&gt;You built a scraper. It worked for a week. Then it broke. You fixed it. It broke again.&lt;/p&gt;

&lt;p&gt;This is the lifecycle of every DIY web scraper in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Top 5 Failure Modes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. HTML Structure Changes
&lt;/h3&gt;

&lt;p&gt;A dev on the target site changes a class name. Your &lt;code&gt;.product-price&lt;/code&gt; selector breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use semantic selectors (data attributes, text content) instead of CSS classes.&lt;/p&gt;
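
&lt;p&gt;A sketch of one way to apply this: keep an ordered list of selectors, most stable first, and fall back down the list. The selectors below are invented examples, and &lt;code&gt;query&lt;/code&gt; is an assumed function you supply (for instance a thin wrapper around cheerio or &lt;code&gt;document.querySelector&lt;/code&gt;) that returns the matched text or &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;

```javascript
// Sketch: try selectors from most to least stable and return the first hit.
// query is any caller-supplied function that returns the matched text or null.
function firstMatch(selectors, query) {
  for (const selector of selectors) {
    const text = query(selector);
    if (text) return { selector, text: text.trim() };
  }
  return null; // nothing matched: log it, the page layout probably changed
}

// Hypothetical selectors: data attributes first, the CSS class only as a last resort.
const PRICE_SELECTORS = ['[data-testid="price"]', '[itemprop="price"]', ".product-price"];
```

&lt;p&gt;Recording which selector matched tells you when the preferred one has stopped working, before the last fallback breaks too.&lt;/p&gt;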

&lt;h3&gt;
  
  
  2. IP Blocks
&lt;/h3&gt;

&lt;p&gt;Your scraper sends too many requests from one IP. The CDN blocks you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Proxy rotation. Every request from a different IP.&lt;/p&gt;
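
&lt;p&gt;The simplest form of rotation is round-robin over a list. This is a sketch only: &lt;code&gt;createProxyRotator&lt;/code&gt; is a made-up helper and the proxy URLs are placeholders, and how you hand the chosen proxy to your HTTP client depends on the client.&lt;/p&gt;

```javascript
// Sketch: round-robin rotation over a proxy list.
function createProxyRotator(proxies) {
  if (proxies.length === 0) throw new Error("proxy list is empty");
  let index = 0;
  return function nextProxy() {
    const proxy = proxies[index];
    index = (index + 1) % proxies.length; // wrap around at the end of the list
    return proxy;
  };
}

// Placeholder addresses; substitute the proxies from your provider.
const nextProxy = createProxyRotator([
  "http://proxy-a.example.com:8080",
  "http://proxy-b.example.com:8080",
]);
```

&lt;p&gt;Call &lt;code&gt;nextProxy()&lt;/code&gt; once per request. Consecutive requests only leave from different IPs if the pool holds at least two working proxies.&lt;/p&gt;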

&lt;h3&gt;
  
  
  3. Rate Limiting
&lt;/h3&gt;

&lt;p&gt;You hit 429 Too Many Requests. Backoff logic is mandatory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Implement exponential backoff. As a starting point, leave 1-5s between requests to the same site.&lt;/p&gt;
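
&lt;p&gt;Many servers send a &lt;code&gt;Retry-After&lt;/code&gt; header with a 429, either as a number of seconds or as an HTTP date. When it is present, it is a better guide than a guessed delay. A sketch, where &lt;code&gt;retryAfterMs&lt;/code&gt; and the 5 second fallback are assumptions for illustration:&lt;/p&gt;

```javascript
// Sketch: turn a Retry-After header value into a wait in milliseconds.
// The header is either a number of seconds or an HTTP date.
// fallbackMs is an assumed default for a missing or unreadable header.
function retryAfterMs(headerValue, now = Date.now(), fallbackMs = 5000) {
  if (headerValue == null || headerValue.trim() === "") return fallbackMs;
  const seconds = Number(headerValue);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const date = Date.parse(headerValue);
  return Number.isNaN(date) ? fallbackMs : Math.max(0, date - now);
}
```

&lt;p&gt;With &lt;code&gt;fetch&lt;/code&gt;, the value comes from &lt;code&gt;res.headers.get("retry-after")&lt;/code&gt;, which returns &lt;code&gt;null&lt;/code&gt; when the header is absent.&lt;/p&gt;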

&lt;h3&gt;
  
  
  4. JavaScript Rendered Content
&lt;/h3&gt;

&lt;p&gt;The site switched from SSR to CSR. Suddenly &lt;code&gt;requests.get()&lt;/code&gt; returns an empty shell.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use &lt;code&gt;js_render: true&lt;/code&gt; in your scraping API (like XCrawl).&lt;/p&gt;

&lt;h3&gt;
  
  
  5. CAPTCHA Walls
&lt;/h3&gt;

&lt;p&gt;After N requests, Google reCAPTCHA appears. Game over for simple scrapers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; CAPTCHA solving services or — better — use an API that handles this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reliable Stack
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;JS rendering&lt;/strong&gt; — Always-on headless browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy rotation&lt;/strong&gt; — Residential IP pool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry logic&lt;/strong&gt; — Automatic retry on failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert monitoring&lt;/strong&gt; — Know when things break&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Building all this yourself? Expect 2-4 hours/week of maintenance.&lt;/p&gt;

&lt;p&gt;Using a scraping API? Set it and forget it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Try a production-ready scraping API: &lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;XCrawl&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>production</category>
      <category>tutorial</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Web Scraping 101: What Every Developer Should Know Before Writing Their First Scraper</title>
      <dc:creator>Charles</dc:creator>
      <pubDate>Tue, 19 May 2026 02:51:33 +0000</pubDate>
      <link>https://forem.com/charles_90891cea4a1800830/web-scraping-101-what-every-developer-should-know-before-writing-their-first-scraper-429a</link>
      <guid>https://forem.com/charles_90891cea4a1800830/web-scraping-101-what-every-developer-should-know-before-writing-their-first-scraper-429a</guid>
      <description>&lt;h1&gt;
  
  
  Web Scraping 101: What Every Developer Should Know
&lt;/h1&gt;

&lt;p&gt;Before you write your first scraper, here's what you need to know.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Hard Problems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. JavaScript Rendering
&lt;/h3&gt;

&lt;p&gt;Many modern websites are SPAs. &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;requests&lt;/code&gt; often won't get you the real content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use a headless browser or an API that handles JS rendering automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Anti-Bot Protection
&lt;/h3&gt;

&lt;p&gt;Cloudflare, DataDome, PerimeterX — these actively block scrapers. You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Residential proxy rotation&lt;/li&gt;
&lt;li&gt;Browser fingerprint spoofing&lt;/li&gt;
&lt;li&gt;CAPTCHA solving&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Rate Limiting
&lt;/h3&gt;

&lt;p&gt;Scrape too fast? You get blocked. Too slow? Takes forever.&lt;/p&gt;
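
&lt;p&gt;A quick back-of-the-envelope calculation makes the trade-off concrete. This sketch ignores the time each request itself takes, and the numbers in the example are illustrative, not recommendations.&lt;/p&gt;

```javascript
// Sketch: rough run time for a crawl, counting only the pauses between batches.
function estimateMinutes(pages, parallel, avgDelaySeconds) {
  const batches = Math.ceil(pages / parallel);
  return (batches * avgDelaySeconds) / 60;
}

// 1000 pages with a 3 second average pause:
console.log(estimateMinutes(1000, 1, 3)); // one request at a time: 50 minutes
console.log(estimateMinutes(1000, 5, 3)); // five in parallel: 10 minutes
```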

&lt;h2&gt;
  
  
  Tools Compared
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;JS Rendering&lt;/th&gt;
&lt;th&gt;Proxies&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Learning Curve&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Puppeteer&lt;/td&gt;
&lt;td&gt;✅ Built-in&lt;/td&gt;
&lt;td&gt;❌ Manual&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Playwright&lt;/td&gt;
&lt;td&gt;✅ Built-in&lt;/td&gt;
&lt;td&gt;❌ Manual&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scrapy&lt;/td&gt;
&lt;td&gt;❌ (needs Splash)&lt;/td&gt;
&lt;td&gt;❌ Manual&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XCrawl API&lt;/td&gt;
&lt;td&gt;✅ Auto&lt;/td&gt;
&lt;td&gt;✅ Auto&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  My Advice
&lt;/h2&gt;

&lt;p&gt;Start simple. If a plain HTTP request gives you the HTML, parse it with &lt;code&gt;cheerio&lt;/code&gt;. If the site blocks you, upgrade to an API that handles the hard parts. Don't build your own proxy infrastructure. It's not worth the time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with &lt;a href="https://dash.xcrawl.com" rel="noopener noreferrer"&gt;XCrawl API&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>tutorial</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
