<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Anders Myrmel</title>
    <description>The latest articles on Forem by Anders Myrmel (@anders_myrmel_2bc87f4df06).</description>
    <link>https://forem.com/anders_myrmel_2bc87f4df06</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3648125%2F6c6e4fe8-e39b-483b-8ef6-fc148b3c4e7f.png</url>
      <title>Forem: Anders Myrmel</title>
      <link>https://forem.com/anders_myrmel_2bc87f4df06</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/anders_myrmel_2bc87f4df06"/>
    <language>en</language>
    <item>
      <title>How I Scrape 250,000 Shopify Stores Without Getting Blocked</title>
      <dc:creator>Anders Myrmel</dc:creator>
      <pubDate>Thu, 12 Mar 2026 11:06:43 +0000</pubDate>
      <link>https://forem.com/anders_myrmel_2bc87f4df06/how-i-scrape-250000-shopify-stores-without-getting-blocked-29f9</link>
      <guid>https://forem.com/anders_myrmel_2bc87f4df06/how-i-scrape-250000-shopify-stores-without-getting-blocked-29f9</guid>
      <description>&lt;p&gt;Right-click any Shopify store. View Source. You'll see &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tags from every app they've installed, a &lt;code&gt;Shopify.theme&lt;/code&gt; object with their exact theme, and tracking pixels from every ad platform they use. None of this is hidden.&lt;/p&gt;

&lt;p&gt;I wanted to scrape all of it across 250K stores. That's the detection engine behind &lt;a href="https://storeinspect.com" rel="noopener noreferrer"&gt;StoreInspect&lt;/a&gt;, where I map the Shopify ecosystem by scanning what stores run.&lt;/p&gt;

&lt;p&gt;This post is the technical walkthrough. What worked, what didn't, and the parts that took way longer than expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Rebrowser-Puppeteer (not regular Puppeteer)&lt;/li&gt;
&lt;li&gt;PostgreSQL with JSONB snapshots&lt;/li&gt;
&lt;li&gt;Webshare proxies + Tailscale SOCKS5 tunnels&lt;/li&gt;
&lt;li&gt;Detection logic bundled as a string so it runs in both Puppeteer and a Chrome Extension&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why not regular Puppeteer
&lt;/h2&gt;

&lt;p&gt;Standard Puppeteer gets flagged immediately. Shopify itself doesn't block scrapers aggressively, but Cloudflare and bot detection on individual stores will. Rebrowser is a drop-in replacement that patches the &lt;code&gt;webdriver&lt;/code&gt; property and fixes the obvious fingerprinting leaks.&lt;/p&gt;

&lt;p&gt;I also set a real viewport (1920x1080), proper Chrome 131 user agent strings, and route through residential proxies. Nothing exotic. The bar for scraping Shopify stores is low because the data is public. You just need to not look like a bot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Page loading strategy
&lt;/h2&gt;

&lt;p&gt;First mistake I made: using &lt;code&gt;networkidle2&lt;/code&gt; as the wait condition. Shopify stores have analytics scripts, chat widgets, and ad pixels that fire continuously. &lt;code&gt;networkidle2&lt;/code&gt; resolves only after 500ms with no more than two in-flight network requests, and on busy storefronts that window sometimes never arrives.&lt;/p&gt;

&lt;p&gt;Switched to &lt;code&gt;domcontentloaded&lt;/code&gt; plus a flat 5-second delay:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;domcontentloaded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 5 seconds lets lazy-loaded scripts, GTM tags, and deferred pixels fire. Not elegant, but it catches 95%+ of what I need.&lt;/p&gt;

&lt;p&gt;I also block images, fonts, and media via request interception. I only need the HTML and scripts. This cut page load times by about 60%.&lt;/p&gt;
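
&lt;p&gt;The interception rule itself is tiny. A sketch of the filter (the helper name is mine; the post only says images, fonts, and media are blocked):&lt;/p&gt;

```typescript
// Resource types detection never needs: only the HTML and scripts matter.
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);

function shouldBlockRequest(resourceType: string): boolean {
  return BLOCKED_TYPES.has(resourceType);
}

// Wired into Puppeteer it would look roughly like:
//   await page.setRequestInterception(true);
//   page.on('request', req =>
//     shouldBlockRequest(req.resourceType()) ? req.abort() : req.continue());
```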

&lt;h2&gt;
  
  
  The four detection layers
&lt;/h2&gt;

&lt;p&gt;One method doesn't catch everything, so I use four:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Script URL matching
&lt;/h3&gt;

&lt;p&gt;Most Shopify apps inject a script tag with a recognizable domain. Klaviyo loads from &lt;code&gt;static.klaviyo.com&lt;/code&gt;. Judge.me loads from &lt;code&gt;judge.me&lt;/code&gt;. Yotpo loads from &lt;code&gt;staticw2.yotpo.com&lt;/code&gt;. Match the domain and you've identified the app.&lt;/p&gt;

&lt;p&gt;This is the most reliable method. About 70% of detections come from script URLs.&lt;/p&gt;
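
&lt;p&gt;In code, this layer is just a domain-to-app map. A minimal sketch using the three signatures above (the real map covers far more apps, and the helper name is mine):&lt;/p&gt;

```typescript
// A few example signatures from the post; the production map covers 180+ apps.
const SCRIPT_SIGNATURES: { [domain: string]: string } = {
  'static.klaviyo.com': 'Klaviyo',
  'judge.me': 'Judge.me',
  'staticw2.yotpo.com': 'Yotpo',
};

function detectAppsFromScripts(scriptUrls: string[]): string[] {
  const found: string[] = [];
  for (const url of scriptUrls) {
    let host = '';
    try {
      host = new URL(url).hostname;
    } catch {
      continue; // ignore malformed src attributes
    }
    for (const domain in SCRIPT_SIGNATURES) {
      // match the domain itself or any subdomain of it
      const hit = host === domain || host.endsWith('.' + domain);
      if (hit) {
        const app = SCRIPT_SIGNATURES[domain];
        if (!found.includes(app)) found.push(app);
      }
    }
  }
  return found;
}
```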

&lt;h3&gt;
  
  
  2. JavaScript globals
&lt;/h3&gt;

&lt;p&gt;Apps set window variables when they initialize. Klaviyo sets &lt;code&gt;window.klaviyo&lt;/code&gt; and &lt;code&gt;window._learnq&lt;/code&gt;. Gorgias sets &lt;code&gt;window.GorgiasChat&lt;/code&gt;. TikTok's pixel sets &lt;code&gt;window.ttq&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I check these with &lt;code&gt;page.evaluate()&lt;/code&gt; after the page loads. Useful as a second signal when script URLs are obfuscated or loaded through a tag manager.&lt;/p&gt;
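
&lt;p&gt;A sketch of that check as a pure function, so the same logic can be tested against a stub instead of a real &lt;code&gt;window&lt;/code&gt; (signature map abbreviated; names are mine):&lt;/p&gt;

```typescript
// Example global-variable signatures from the post; the real map is larger.
const GLOBAL_SIGNATURES: { [app: string]: string[] } = {
  Klaviyo: ['klaviyo', '_learnq'],
  Gorgias: ['GorgiasChat'],
  'TikTok Pixel': ['ttq'],
};

// `win` is whatever object plays the role of `window`: the real one inside
// page.evaluate(), a plain stub object in tests.
function detectAppsFromGlobals(win: object): string[] {
  const found: string[] = [];
  for (const app in GLOBAL_SIGNATURES) {
    const names = GLOBAL_SIGNATURES[app];
    if (names.some(n => n in win)) found.push(app);
  }
  return found;
}
```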

&lt;h3&gt;
  
  
  3. DOM elements
&lt;/h3&gt;

&lt;p&gt;Some apps only inject UI. A chat bubble, a reviews widget, a popup form. CSS selectors like &lt;code&gt;.jdgm-widget&lt;/code&gt; (Judge.me) or &lt;code&gt;[data-yotpo-instance-id]&lt;/code&gt; (Yotpo) catch these.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Theme App Extension blocks
&lt;/h3&gt;

&lt;p&gt;This one took me a while to find. Shopify's Online Store 2.0 lets apps inject server-rendered blocks, and Shopify wraps them in HTML comments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- BEGIN app block: shopify://apps/judge-me-reviews/blocks/preview_badge/... --&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are gold. They map directly to the app's Shopify App Store slug. A store can have zero client-side scripts from an app but still have its Theme App Extension block in the HTML. I maintain a map from Shopify slugs to app IDs to catch these.&lt;/p&gt;
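
&lt;p&gt;Extracting the slug is a one-regex job. A sketch, assuming slugs are lowercase letters, digits, and hyphens (the function name and that character class are my assumptions):&lt;/p&gt;

```typescript
// Theme App Extension blocks are wrapped in HTML comments of the form:
//   BEGIN app block: shopify://apps/judge-me-reviews/blocks/...
// The App Store slug is the first path segment after /apps/.
const APP_BLOCK_RE = /BEGIN app block: shopify:\/\/apps\/([a-z0-9-]+)/g;

function extractAppBlockSlugs(html: string): string[] {
  const slugs: string[] = [];
  for (const m of html.matchAll(APP_BLOCK_RE)) {
    if (!slugs.includes(m[1])) slugs.push(m[1]); // dedupe repeated blocks
  }
  return slugs;
}
```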

&lt;h2&gt;
  
  
  The cookie consent problem
&lt;/h2&gt;

&lt;p&gt;This one cost me a week. I was getting pixel detection rates way below what I expected. Meta Pixel was showing up on maybe 40% of stores when industry benchmarks say 50%+.&lt;/p&gt;

&lt;p&gt;The problem: cookie consent managers. OneTrust, Cookiebot, and similar tools block ad pixels from loading until the user clicks "Accept." My scraper never clicks accept, so the pixels never fire.&lt;/p&gt;

&lt;p&gt;Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;onetrust-accept-btn-handler&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button[id*="accept"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Click accept, wait 8 seconds for GTM to process and load the blocked tags. Pixel detection accuracy went from roughly 60% to 95%.&lt;/p&gt;

&lt;p&gt;The 8-second wait feels long but it's necessary. GTM doesn't fire tags synchronously after consent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detection bundle architecture
&lt;/h2&gt;

&lt;p&gt;I wanted the same detection code in both the Puppeteer scraper (server-side) and a Chrome Extension (client-side). The signatures for 180+ apps, 40+ pixels, and theme detection logic need to be identical.&lt;/p&gt;

&lt;p&gt;The solution: bundle everything into a single self-executing function string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;detectionScript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`(function() {
  const APP_SIGNATURES = { /* 180 apps */ };
  const PIXEL_SIGNATURES = { /* 40 pixels */ };
  // ... detection logic
  return { isShopify, theme, apps, pixels };
})()`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Puppeteer&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;detectionScript&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Chrome Extension (content script)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;detectionScript&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One source of truth. When I add a new app signature, both the scraper and extension pick it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storing results in JSONB
&lt;/h2&gt;

&lt;p&gt;Each scrape produces a snapshot stored as JSONB in PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;store_snapshots&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;store_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;apps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// [{name, category, detected_via}]&lt;/span&gt;
  &lt;span class="nx"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// [{name, type, pixel_id}]&lt;/span&gt;
  &lt;span class="nx"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// {name, type, author}&lt;/span&gt;
  &lt;span class="nx"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// {product_count, traffic_tier, ...}&lt;/span&gt;
  &lt;span class="nx"&gt;snapshot_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;timestamp&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why JSONB instead of normalized tables? The schema changes constantly. Every time I add a new detection field or metric, I don't want to run a migration. JSONB lets me evolve the structure without downtime.&lt;/p&gt;

&lt;p&gt;I also keep denormalized counts on the main &lt;code&gt;stores&lt;/code&gt; table (&lt;code&gt;app_count&lt;/code&gt;, &lt;code&gt;pixel_count&lt;/code&gt;, &lt;code&gt;theme_name&lt;/code&gt;) for fast filtering and sorting. The snapshots are for historical comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  Error handling at scale
&lt;/h2&gt;

&lt;p&gt;At 250K stores, every edge case happens. Password-protected stores return a &lt;code&gt;/password&lt;/code&gt; redirect. Stores with lapsed billing return 402. Dead stores return 404. Some stores infinite-redirect between &lt;code&gt;www&lt;/code&gt; and non-&lt;code&gt;www&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I classify errors into retryable and non-retryable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Don't retry these&lt;/span&gt;
&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dns_not_found&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;   &lt;span class="c1"&gt;// Domain doesn't resolve&lt;/span&gt;
&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ssl_error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;       &lt;span class="c1"&gt;// Cert problems&lt;/span&gt;
&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;store_closed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;    &lt;span class="c1"&gt;// 402 Payment Required&lt;/span&gt;

&lt;span class="c1"&gt;// Retry with backoff&lt;/span&gt;
&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;timeout&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;         &lt;span class="c1"&gt;// Might be temporary&lt;/span&gt;
&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;blocked&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;         &lt;span class="c1"&gt;// Try different proxy&lt;/span&gt;
&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;network_error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;   &lt;span class="c1"&gt;// Transient&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DNS failures and SSL errors get marked and skipped permanently. Timeouts and blocks get retried with proxy rotation. This keeps the scraper from wasting cycles on stores that will never respond.&lt;/p&gt;
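
&lt;p&gt;A sketch of the retry decision (the backoff base, cap, and attempt limit are my illustrative numbers, not from the post):&lt;/p&gt;

```typescript
// Error classes from the post; anything unknown is treated as retryable.
const PERMANENT = new Set(['dns_not_found', 'ssl_error', 'store_closed']);

function nextAction(errorCode: string, attempt: number) {
  if (PERMANENT.has(errorCode)) {
    return { retry: false, delayMs: 0 }; // mark and skip permanently
  }
  // Retryable errors back off exponentially: 1s, 2s, 4s, ... capped at 60s.
  const delayMs = Math.min(1000 * 2 ** attempt, 60000);
  const maxAttempts = 4;
  return { retry: maxAttempts > attempt, delayMs };
}
```

<p>For the `blocked` class, the retry would also rotate to a fresh proxy before the next attempt.</p>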

&lt;h2&gt;
  
  
  Regional domain deduplication
&lt;/h2&gt;

&lt;p&gt;Brands like &lt;code&gt;gymshark.com&lt;/code&gt; and &lt;code&gt;gymshark.co.uk&lt;/code&gt; are the same store. Without deduplication, they'd appear as separate entries with identical tech stacks.&lt;/p&gt;

&lt;p&gt;I check regional TLDs (&lt;code&gt;.co.uk&lt;/code&gt;, &lt;code&gt;.com.au&lt;/code&gt;, &lt;code&gt;.de&lt;/code&gt;, &lt;code&gt;.fr&lt;/code&gt;, etc.) against the &lt;code&gt;.com&lt;/code&gt; version. If both exist, I skip the regional variant. Simple, but it prevented thousands of duplicates.&lt;/p&gt;
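
&lt;p&gt;A sketch of the canonicalization step (TLD list abbreviated; the helper name is mine). It maps a regional domain to its would-be &lt;code&gt;.com&lt;/code&gt; twin, which the pipeline then checks for existence before skipping the variant:&lt;/p&gt;

```typescript
// Regional TLDs checked against the .com version; subset for illustration.
const REGIONAL_TLDS = ['.co.uk', '.com.au', '.de', '.fr'];

// Returns the .com domain this regional domain would duplicate, or null
// if the domain is not a recognized regional variant.
function comEquivalent(domain: string): string | null {
  for (const tld of REGIONAL_TLDS) {
    if (domain.endsWith(tld)) {
      return domain.slice(0, domain.length - tld.length) + '.com';
    }
  }
  return null;
}
```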

&lt;h2&gt;
  
  
  What I learned about Shopify's ecosystem
&lt;/h2&gt;

&lt;p&gt;After scanning 250K stores: the median store runs 1 app (usually Shop Pay) and 4 pixels (usually Shopify's own pixel plus GA4). 59% have no email marketing tool. 78% have no reviews app. For most stores, the app ecosystem goes almost untouched.&lt;/p&gt;

&lt;p&gt;The detection engine is available as a &lt;a href="https://storeinspect.com/extension" rel="noopener noreferrer"&gt;free Chrome extension&lt;/a&gt; if you want to try it on individual stores.&lt;/p&gt;




&lt;p&gt;The full dataset updates daily at &lt;a href="https://storeinspect.com/report/state-of-shopify" rel="noopener noreferrer"&gt;storeinspect.com/report/state-of-shopify&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have questions about the scraping setup or detection patterns, drop them in the comments.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>webscraping</category>
      <category>shopify</category>
      <category>javascript</category>
    </item>
    <item>
      <title>How I Built Two Data Tools for Ecommerce: Store Inspector and ProductLair</title>
      <dc:creator>Anders Myrmel</dc:creator>
      <pubDate>Fri, 05 Dec 2025 14:43:56 +0000</pubDate>
      <link>https://forem.com/anders_myrmel_2bc87f4df06/how-i-built-two-data-tools-for-ecommerce-store-inspector-and-productlair-2l04</link>
      <guid>https://forem.com/anders_myrmel_2bc87f4df06/how-i-built-two-data-tools-for-ecommerce-store-inspector-and-productlair-2l04</guid>
      <description>&lt;p&gt;Over the last year I have been building two tools for ecommerce researchers and founders who want faster insights without spending hours digging through apps, themes, ads, or product signals. The tools are Store Inspector and ProductLair, and both came from my own frustration trying to understand what actually makes a store or product perform well.&lt;/p&gt;

&lt;p&gt;This post explains the technical thinking behind both projects and some of the challenges I ran into while building them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Store Inspector: Browser-First Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://storeinspect.com" rel="noopener noreferrer"&gt;Store Inspector&lt;/a&gt; is a free Chrome extension that analyzes any Shopify store. It detects themes, apps, pixels, and a basic lead score. Most tools rely on servers to scrape stores, which is slow and often blocked. I wanted something that runs fully in the browser to avoid rate limits and speed issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How detection works&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load the page HTML directly from the active tab&lt;/li&gt;
&lt;li&gt;Parse scripts, linked files, and HTML identifiers&lt;/li&gt;
&lt;li&gt;Match patterns for themes, apps, and pixels&lt;/li&gt;
&lt;li&gt;Score based on tech stack, tracking depth, and store structure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything happens locally. Nothing is sent to my servers unless the user opens the detailed view. This makes detection extremely fast and keeps privacy simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenges&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some apps hide their scripts behind dynamic loaders&lt;/li&gt;
&lt;li&gt;Themes often change naming conventions&lt;/li&gt;
&lt;li&gt;Pixels can appear under custom wrappers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I solved these problems by building a pattern matcher that is constantly updated and by using weighted scoring instead of binary detection.&lt;/p&gt;
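
&lt;p&gt;A sketch of what weighted scoring means here (weights, signal names, and threshold are invented for illustration; the real matcher is more involved):&lt;/p&gt;

```typescript
// Each detection signal contributes a weight instead of a hard yes/no, so a
// single weak signal (a DOM class a theme might reuse) is not enough alone.
const SIGNALS = [
  { id: 'script_url', weight: 0.6 },   // strongest signal
  { id: 'global_var', weight: 0.3 },
  { id: 'dom_element', weight: 0.2 },  // weakest, easily a false positive
];

function detectionConfidence(matched: string[]): number {
  let score = 0;
  for (const s of SIGNALS) {
    if (matched.includes(s.id)) score += s.weight;
  }
  return Math.min(score, 1);
}

// An app counts as "detected" above some threshold, e.g. 0.5.
```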

&lt;h2&gt;
  
  
  ProductLair: Data Consolidation for Product Research
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://productlair.com" rel="noopener noreferrer"&gt;ProductLair&lt;/a&gt; focuses on the product side rather than the store side. The goal is to collect signals from social platforms, ads, stores, and search trends and combine them into a single product profile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech stack behind ProductLair&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Next.js for the frontend&lt;/li&gt;
&lt;li&gt;Supabase as the database and auth layer&lt;/li&gt;
&lt;li&gt;Serverless functions for scraping and enrichment&lt;/li&gt;
&lt;li&gt;A lightweight scoring engine for market assessment&lt;/li&gt;
&lt;li&gt;Recharts for performance summaries&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why build it&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are many product finder tools, but almost all of them are either outdated, low quality, or only work on TikTok. I wanted something that gives real analysis, not random product dumps. That meant manually curating products, building comparison tools, and designing an interface that feels fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interesting issues I ran into&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TikTok creative links expire faster than expected&lt;/li&gt;
&lt;li&gt;Store data needed normalization because everyone structures their titles differently&lt;/li&gt;
&lt;li&gt;Search trend scoring required smoothing to avoid spikes from small regions&lt;/li&gt;
&lt;/ul&gt;
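
&lt;p&gt;The trend-smoothing step can be as simple as a trailing moving average, so a one-day spike from a small region doesn't dominate the score (window size and method are my assumptions; the post doesn't specify them):&lt;/p&gt;

```typescript
// Trailing moving average over a daily search-trend series. Each point
// becomes the mean of itself and up to (window - 1) preceding points.
function smooth(series: number[], window: number): number[] {
  return series.map((_, i) => {
    const start = Math.max(0, i - window + 1);
    const slice = series.slice(start, i + 1);
    const sum = slice.reduce((a, b) => a + b, 0);
    return sum / slice.length;
  });
}
```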

&lt;h2&gt;
  
  
  What I learned building both tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Browser-based detection is far faster than remote scraping for anything involving Shopify&lt;/li&gt;
&lt;li&gt;Data is useless without normalization and scoring&lt;/li&gt;
&lt;li&gt;Users prefer one-click insights over dashboards with too many metrics&lt;/li&gt;
&lt;li&gt;Speed matters more than features for research tools&lt;/li&gt;
&lt;li&gt;Clear visual hierarchy makes or breaks a research page&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is next
&lt;/h2&gt;

&lt;p&gt;I am currently working on automated tracking for Store Inspector and deeper funnel metrics for ProductLair. Both tools are early but they already help thousands of users research faster.&lt;/p&gt;

&lt;p&gt;If you are working on ecommerce related data tools, browser extensions, or store analysis systems, I would love to hear what you are building.&lt;/p&gt;

</description>
      <category>saas</category>
      <category>webdev</category>
      <category>ecommerce</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
