Building a High-Performance Web Scraper with Python

Damilare Ogundele

Introduction

This article explores the architecture and implementation of a high-performance web scraper built to extract product data from e-commerce platforms. The scraper uses multiple Python libraries and techniques to efficiently process thousands of products while maintaining resilience against common scraping challenges.

Technical Architecture

The scraper is built on a fully asynchronous foundation using Python's asyncio ecosystem, with these key components (a consolidated import sketch follows the list):

  • Network Layer: aiohttp for async HTTP requests with connection pooling
  • DOM Processing: BeautifulSoup4 for HTML parsing
  • Dynamic Content: Playwright for JavaScript-rendered content extraction
  • Data Processing: pandas for data manipulation and export
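
For reference, the snippets in this article assume roughly the following set of imports. This is a minimal sketch; the original project's module layout may differ, and fake-useragent is an assumption inferred from the self.user_agent.random calls further down.

import asyncio
import concurrent.futures
import csv
import logging
import os
import queue
from functools import partial

import aiohttp
import pandas as pd
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright
from fake_useragent import UserAgent  # assumed: matches the self.user_agent.random usage below

logger = logging.getLogger(__name__)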

Implementation Highlights

Concurrency Management

The scraper implements a worker pool pattern with configurable concurrency limits:

# Concurrency settings
# (MAX_WORKERS and MAX_CONNECTIONS must be set in the environment;
#  int() raises if either variable is missing)
self.max_workers = int(os.getenv('MAX_WORKERS'))
self.max_connections = int(os.getenv('MAX_CONNECTIONS'))

# TCP connection pooling
connector = aiohttp.TCPConnector(
    limit=self.max_connections,
    resolver=resolver  # Custom DNS resolver (see DNS Resilience below)
)

This prevents overwhelming the target server while maximizing throughput.
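
The worker pool itself is not shown above, so here is a minimal, illustrative sketch of how workers might consume a shared URL queue under these limits. The names scrape_worker, run_workers, and url_queue are hypothetical, not the project's actual API.

async def scrape_worker(self, session, url_queue, results):
    # Each worker pulls URLs until the shared queue is drained
    while True:
        try:
            url = url_queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        content = await self.fetch_url(session, url)
        if content is not None:
            results.append(content)  # parsing is omitted in this sketch

async def run_workers(self, urls):
    url_queue = asyncio.Queue()
    for url in urls:
        url_queue.put_nowait(url)

    results = []
    connector = aiohttp.TCPConnector(limit=self.max_connections)
    async with aiohttp.ClientSession(connector=connector) as session:
        workers = [
            asyncio.create_task(self.scrape_worker(session, url_queue, results))
            for _ in range(self.max_workers)
        ]
        await asyncio.gather(*workers)
    return results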

Resilient Network Requests

The network layer implements retry logic with exponential backoff and explicit handling of HTTP 429 rate limiting:

async def fetch_url(self, session, url):
    retries = 0
    while retries < self.max_retries:
        try:
            headers = {
                'User-Agent': self.user_agent.random,
                # Additional headers omitted for brevity
            }

            async with session.get(url, headers=headers, timeout=self.request_timeout) as response:
                if response.status == 200:
                    return await response.read()
                elif response.status == 429:
                    # Rate limited: honor Retry-After if provided,
                    # otherwise fall back to exponential backoff
                    retry_after = int(response.headers.get('Retry-After',
                                    self.retry_backoff ** (retries + 2)))
                    await asyncio.sleep(retry_after)
                    retries += 1
                    continue

            # Any other status: retry with exponential backoff
            retries += 1
            wait_time = self.retry_backoff ** (retries + 1)
            await asyncio.sleep(wait_time)

        except (asyncio.TimeoutError, aiohttp.ClientError) as e:
            logger.warning(f"Network error fetching {url}: {e}")
            retries += 1
            await asyncio.sleep(self.retry_backoff ** (retries + 1))

    logger.error(f"Giving up on {url} after {self.max_retries} attempts")
    return None

Hybrid Content Extraction

The scraper employs a two-phase extraction approach:

  1. Static HTML Parsing: Uses BeautifulSoup to extract readily available content
  2. Dynamic Content Extraction: Uses Playwright to handle JavaScript-rendered elements

async def fetch_product(self, session, url, page):
    # Fetch the raw HTML first (fetch_url returns None if all retries fail)
    content = await self.fetch_url(session, url)
    if content is None:
        return None

    # Static content extraction: BeautifulSoup parsing is CPU-bound,
    # so it is offloaded to a thread pool to keep the event loop responsive
    with concurrent.futures.ThreadPoolExecutor() as executor:
        loop = asyncio.get_event_loop()
        product = await loop.run_in_executor(
            executor,
            partial(self.scrape_product_html, content, url)
        )

    # Dynamic content extraction via Playwright
    image_url, description = await self.scrape_dynamic_content_playwright(page, url)
    product.image_url = image_url
    product.description = description
    return product

This approach optimizes for both speed and completeness.
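
The dynamic phase is only referenced above, so here is a rough sketch of what scrape_dynamic_content_playwright could look like. The selectors and wait strategy are assumptions and would depend entirely on the target site's markup.

async def scrape_dynamic_content_playwright(self, page, url):
    # Navigate and wait for JavaScript-rendered content to settle
    await page.goto(url, wait_until="networkidle")

    # Hypothetical selectors; the real ones depend on the target site
    image_url = await page.get_attribute("img.product-main-image", "src")
    description = await page.inner_text("div.product-description")
    return image_url, description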

DNS Resilience

To guard against DNS resolution issues, the scraper uses an async resolver pointed at public nameservers (Google and Cloudflare) when aiodns is available, and falls back to aiohttp's default resolver otherwise:

try:
    # aiohttp's AsyncResolver is backed by aiodns, so confirm it is installed first
    import aiodns
    resolver = aiohttp.AsyncResolver(nameservers=["8.8.8.8", "1.1.1.1"])
except ImportError:
    logger.warning("aiodns library not found. Falling back to default resolver.")
    resolver = None

Data Processing Pipeline

The scraper implements a thread-safe queue for handling scraped data:

# Thread-safe queue for results
self.results_queue = queue.Queue()

# Data processing
def save_results_from_queue(self):
    products = []
    # Drain the queue without blocking
    while not self.results_queue.empty():
        try:
            products.append(self.results_queue.get_nowait())
        except queue.Empty:
            break

    if products:
        df = pd.DataFrame(products)
        # Save to CSV with proper encoding and escaping
        # (filename is configured elsewhere on the class)
        df.to_csv(
            filename,
            index=False,
            encoding='utf-8-sig',
            escapechar='\\',
            quoting=csv.QUOTE_ALL
        )
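
On the producer side, each worker pushes its parsed product onto the queue. A minimal illustration is below; the Product dataclass and queue_product helper are hypothetical, and converting to a dict at enqueue time simply keeps pd.DataFrame(products) straightforward.

from dataclasses import dataclass, asdict

@dataclass
class Product:
    name: str = ""
    price: str = ""
    image_url: str = ""
    description: str = ""

def queue_product(self, product: Product):
    # Called by workers after scraping; storing dicts keeps the DataFrame step simple
    self.results_queue.put(asdict(product))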

Performance Optimizations

Several techniques are employed to maximize throughput; the first two are sketched in code after the list:

  1. Batch Processing: Products are processed in configurable batches
  2. Random Delays: Randomized delays between requests make the traffic pattern less predictable, reducing the chance of detection and rate limiting
  3. Connection Pooling: TCP connection reuse reduces overhead
  4. ThreadPoolExecutor: CPU-bound tasks are offloaded to prevent blocking the event loop
  5. Sampling: For large datasets, statistical sampling is used to estimate total counts
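
A minimal sketch of the batching and randomized-delay ideas from the list above; process_in_batches, the batch size, and the delay bounds are illustrative rather than the project's actual values.

import random

async def process_in_batches(self, urls, batch_size=50):
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        # run_workers is the hypothetical worker-pool entry point sketched earlier
        await self.run_workers(batch)
        # Randomized pause between batches so requests don't follow a fixed rhythm
        await asyncio.sleep(random.uniform(1.0, 3.0))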

Error Handling and Reliability

The scraper implements comprehensive error handling:

try:
    ...  # Scraping logic (omitted)
except Exception as e:
    logger.error(f"Error in scrape_all_products: {e}")
    # Save any results in queue before exiting
    self.save_results_from_queue()
    raise

This ensures that even if the scraper crashes, partial results are saved.

Conclusion

The architecture outlined here demonstrates how to build a high-performance web scraper that balances speed, reliability, and target server courtesy. By leveraging asynchronous programming, connection pooling, and hybrid content extraction techniques, the scraper can efficiently process thousands of products while maintaining resilience against common scraping challenges.

Key takeaways:

  • Asynchronous programming is essential for high-performance web scraping
  • Hybrid static/dynamic extraction maximizes data completeness
  • Proper error handling and resilience mechanisms are crucial for production use
  • Configurable parameters allow for fine-tuning based on target site characteristics
