Jonathan Geiger

Posted on Mar 21 • Originally published at capturekit.dev

How to Extract HTML from Web Pages with Puppeteer

#webdev #puppeteer #webscraping

Extracting HTML content from websites is a fundamental task for web scrapers, data scientists, and developers building automation tools. Puppeteer, a Node.js library developed by Google, provides a robust way to interact with web pages programmatically. In this guide, we'll explore how to extract HTML content effectively with Puppeteer and address common challenges.

What is Puppeteer?

Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium browsers. It enables developers to:

Scrape web content and extract data
Automate form submissions and user interactions
Generate screenshots and PDFs
Run automated testing
Monitor website performance
Crawl single-page applications (SPAs)

Let's dive into using Puppeteer for HTML extraction.

Setting Up Puppeteer

First, install Puppeteer via npm:

npm install puppeteer

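This command installs Puppeteer and downloads a compatible browser build alongside it (Chromium in older releases, Chrome for Testing in newer ones). If you'd prefer to use your existing Chrome installation, install puppeteer-core instead, which skips the download and expects you to point it at a browser yourself: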

npm install puppeteer-core
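
Because puppeteer-core ships without a browser, the launch call needs an explicit executable path. Here is a minimal sketch, assuming the path is supplied through an environment variable called CHROME_PATH (a hypothetical name chosen for this example; adjust it to your setup):

```javascript
// Build launch options for puppeteer-core from an environment-like object.
// CHROME_PATH is a hypothetical variable name used only in this sketch.
function buildLaunchOptions(env) {
  const executablePath = env.CHROME_PATH;
  if (!executablePath) {
    throw new Error('Set CHROME_PATH to the path of your Chrome binary');
  }
  return { executablePath, headless: true };
}

async function launchSystemChrome() {
  // Required lazily so the helper above can be used without the package installed.
  const puppeteer = require('puppeteer-core');
  return puppeteer.launch(buildLaunchOptions(process.env));
}
```

Calling `launchSystemChrome()` returns the same kind of browser object as `puppeteer.launch()`, so the rest of the examples in this guide should work unchanged.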

Basic HTML Extraction

Here's a simple script to extract the entire HTML from a webpage:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Get the page's HTML content
  const html = await page.content();
  console.log(html);

  await browser.close();
})();

This script:

Launches a headless browser
Opens a new page
Navigates to https://example.com
Extracts the full HTML content
Closes the browser
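
One thing the steps above do not cover is what happens when navigation fails: if page.goto throws, browser.close() is never reached and the browser process can linger. A possible way to guard against that is a try/finally wrapper. The helper names below (withPage, extractHtml) are made up for this sketch, and the Puppeteer module is passed in as an argument so the caller decides which package to use:

```javascript
// Runs `task(page)` in a fresh browser and always closes the browser afterwards.
// `puppeteerLib` is injected so the caller picks the Puppeteer package
// (and tests can pass a stub).
async function withPage(puppeteerLib, task) {
  const browser = await puppeteerLib.launch();
  try {
    const page = await browser.newPage();
    return await task(page);
  } finally {
    // Runs whether the task succeeded or threw, so no browser process is leaked.
    await browser.close();
  }
}

// Same behaviour as the basic script, expressed through the wrapper.
async function extractHtml(puppeteerLib, url) {
  return withPage(puppeteerLib, async (page) => {
    await page.goto(url);
    return page.content();
  });
}
```

Usage would look like `extractHtml(require('puppeteer'), 'https://example.com').then(console.log)`.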

Extracting HTML from Specific Elements

To extract HTML from a specific element on the page:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Extract HTML from a specific element
  const elementHtml = await page.evaluate(() => {
    const element = document.querySelector('.main-content');
    return element ? element.outerHTML : null;
  });

  console.log(elementHtml);
  await browser.close();
})();

Waiting for Dynamic Content

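Modern websites often load content dynamically. To give the page time to finish loading before extraction, wait until network activity settles. The networkidle2 option resolves once there have been no more than two open network connections for at least 500 ms:

await page.goto('https://example.com', {
  waitUntil: 'networkidle2'
});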

For pages with specific elements that load asynchronously:

await page.waitForSelector('.dynamic-content', { visible: true });
const html = await page.content();
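
If the selector never shows up, waitForSelector rejects once its timeout expires. A small sketch of handling that case follows; the helper name htmlWhenVisible, the 10-second default, and the choice to return null are all assumptions made for illustration:

```javascript
// Waits for `selector` to become visible, then returns the page HTML.
// If the selector does not appear within `timeoutMs`, returns null instead of throwing.
async function htmlWhenVisible(page, selector, timeoutMs = 10000) {
  try {
    await page.waitForSelector(selector, { visible: true, timeout: timeoutMs });
  } catch (err) {
    // Puppeteer rejects with a TimeoutError when the wait expires (matched here by name).
    if (err.name === 'TimeoutError') return null;
    throw err; // anything else is unexpected, so surface it
  }
  return page.content();
}
```

Returning null lets the caller decide whether to retry, skip the page, or fall back to whatever HTML has loaded so far.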

Extracting Text Content

If you only need the text content without HTML tags:

const textContent = await page.evaluate(() => {
  return document.body.innerText;
});

For a specific element:

const elementText = await page.$eval('.article', el => el.textContent);
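
Text pulled from a page often carries stray indentation and runs of blank lines. If that matters for your use case, a clean-up step along these lines may help. The normalizeText helper is a made-up name, and the exact rules are a matter of taste:

```javascript
// Tidies text extracted from a page: collapses whitespace and limits blank lines.
function normalizeText(raw) {
  return raw
    .replace(/\u00a0/g, ' ')    // non-breaking spaces -> regular spaces
    .replace(/[ \t]+/g, ' ')    // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, '\n')   // trim spaces around line breaks
    .replace(/\n{3,}/g, '\n\n') // keep at most one blank line in a row
    .trim();
}
```

It runs in Node, after the value comes back from page.evaluate or page.$eval, for example `normalizeText(elementText)`.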

Extracting Metadata

To extract a webpage's metadata like title, description, and Open Graph data:

const metadata = await page.evaluate(() => {
  return {
    title: document.title,
    description: document.querySelector('meta[name="description"]')?.content || null,
    ogTitle: document.querySelector('meta[property="og:title"]')?.content || null,
    ogDescription: document.querySelector('meta[property="og:description"]')?.content || null,
    ogImage: document.querySelector('meta[property="og:image"]')?.content || null
  };
});

console.log(metadata);
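
Pages do not always provide every tag, so several fields in the object above may come back as null. One way to deal with that, sketched here with a hypothetical summarizeMetadata helper, is to prefer the Open Graph values and fall back to the standard ones:

```javascript
// Picks one title/description/image from the `metadata` object shown above,
// preferring Open Graph values and falling back to the standard tags.
function summarizeMetadata(metadata) {
  return {
    title: metadata.ogTitle || metadata.title || null,
    description: metadata.ogDescription || metadata.description || null,
    image: metadata.ogImage || null,
  };
}
```

Whether Open Graph or the plain tags should win is a design choice; swap the order if the standard tags are more reliable on the sites you target.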

Extracting Links

To extract all links from a webpage:

const links = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('a')).map(a => {
    return {
      text: a.textContent.trim(),
      href: a.href
    };
  });
});

console.log(links);
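
The raw list usually contains duplicates, in-page anchors, and non-web schemes such as mailto:. As a rough sketch, the links can be post-processed in Node with the built-in URL class. The classifyLinks name and the internal/external split by hostname are assumptions of this example:

```javascript
// Splits extracted links into internal and external sets relative to `pageUrl`.
// Expects the `{ text, href }` objects produced by the snippet above.
function classifyLinks(links, pageUrl) {
  const pageHost = new URL(pageUrl).hostname;
  const internal = new Set();
  const external = new Set();
  for (const { href } of links) {
    let url;
    try {
      url = new URL(href);
    } catch {
      continue; // empty or malformed href
    }
    // Skip mailto:, javascript:, tel: and similar schemes.
    if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
    url.hash = ''; // treat #fragment variants as the same page
    (url.hostname === pageHost ? internal : external).add(url.href);
  }
  return { internal: [...internal], external: [...external] };
}
```

Using sets removes duplicates for free, and comparing hostnames keeps the rule simple, though it treats subdomains as external.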

Handling Authentication

For websites that require authentication:

await page.goto('https://example.com/login');
await page.type('#username', 'your_username');
await page.type('#password', 'your_password');

// Start waiting for the navigation before clicking, so a fast redirect is not missed
await Promise.all([
  page.waitForNavigation(),
  page.click('#login-button')
]);

// Now that we're logged in, extract the protected content
const html = await page.content();
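
Hardcoding credentials in a script is easy to get wrong once the file ends up in version control. A possible alternative, assuming two hypothetical environment variables named SCRAPER_USERNAME and SCRAPER_PASSWORD, and the same form selectors as the example above:

```javascript
// Reads login credentials from an environment-like object.
// SCRAPER_USERNAME and SCRAPER_PASSWORD are hypothetical names for this sketch.
function readCredentials(env) {
  const username = env.SCRAPER_USERNAME;
  const password = env.SCRAPER_PASSWORD;
  if (!username || !password) {
    throw new Error('SCRAPER_USERNAME and SCRAPER_PASSWORD must both be set');
  }
  return { username, password };
}

// Fills in the login form from the example above and waits for the redirect.
async function logIn(page, { username, password }) {
  await page.type('#username', username);
  await page.type('#password', password);
  // Start waiting before clicking so a fast navigation is not missed.
  await Promise.all([page.waitForNavigation(), page.click('#login-button')]);
}
```

After navigating to the login page, call `await logIn(page, readCredentials(process.env))` and then extract content as before.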

Avoiding Detection

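Many websites implement anti-bot measures. The stealth plugin for puppeteer-extra patches common headless-browser fingerprints, which lowers the chance of being flagged, though it cannot rule detection out: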

npm install puppeteer-extra puppeteer-extra-plugin-stealth

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

// Now use puppeteer as usual
const browser = await puppeteer.launch();

Saving Extracted HTML to a File

To save the extracted HTML to a file:

const fs = require('fs');

// Extract HTML
const html = await page.content();

// Write to file
fs.writeFileSync('extracted-page.html', html);
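
When saving many pages, a fixed file name gets overwritten on every run. The sketch below derives the file name from the page URL and uses the promise-based fs API. The helper names and the default pages directory are assumptions of this example:

```javascript
const fs = require('fs/promises');
const path = require('path');

// Turns a page URL into a filesystem-safe file name (one file per page).
function fileNameForUrl(pageUrl) {
  const { hostname, pathname } = new URL(pageUrl);
  const slug = (hostname + pathname)
    .replace(/[^a-zA-Z0-9.-]+/g, '_') // replace slashes and other unsafe characters
    .replace(/^_+|_+$/g, '');         // drop leading and trailing underscores
  return `${slug}.html`;
}

// Writes `html` into `dir` (created if missing) and returns the file path.
async function saveHtml(html, pageUrl, dir = 'pages') {
  await fs.mkdir(dir, { recursive: true });
  const filePath = path.join(dir, fileNameForUrl(pageUrl));
  await fs.writeFile(filePath, html, 'utf8');
  return filePath;
}
```

Note that the query string is ignored here, so two URLs that differ only in their parameters would map to the same file.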

Working with iframes

To extract HTML from an iframe:

// page.frames()[0] is the main frame, so index 1 is the first child frame
const frameContent = await page.frames()[1].content();

// Or find a frame by its name (find() returns undefined if nothing matches)
const namedFrame = page.frames().find(frame => frame.name() === 'frameName');
const namedFrameContent = namedFrame ? await namedFrame.content() : null;
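
Frame indexes and names are not always stable, so matching on the frame's URL can be more dependable. A rough sketch, with findChildFrame and extractFrameHtml as made-up helper names:

```javascript
// Returns the first child frame whose URL contains `urlPart`, or null.
// The main frame is also part of page.frames(), so it is skipped explicitly.
function findChildFrame(page, urlPart) {
  const mainFrame = page.mainFrame();
  const match = page
    .frames()
    .find((frame) => frame !== mainFrame && frame.url().includes(urlPart));
  return match || null;
}

// Returns the frame's HTML, or null when no matching frame exists.
async function extractFrameHtml(page, urlPart) {
  const frame = findChildFrame(page, urlPart);
  return frame ? frame.content() : null;
}
```

For example, `await extractFrameHtml(page, 'youtube.com/embed')` would return the HTML of the first embedded frame whose URL contains that string, if there is one.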

Alternative to Puppeteer: CaptureKit API

Setting up and maintaining Puppeteer for HTML extraction can be challenging. If you need a reliable, scalable solution without infrastructure headaches, consider using CaptureKit API:

curl "https://api.capturekit.dev/content?url=https://example.com&access_key=YOUR_ACCESS_KEY&include_html=true"
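
The same request can be made from Node. This sketch reuses only the endpoint and query parameters from the curl command above. The function names are hypothetical, a JSON response is assumed, and the global fetch requires Node 18 or later:

```javascript
// Builds the request URL for the endpoint shown in the curl example above.
// URLSearchParams takes care of percent-encoding the target URL.
function buildCaptureKitUrl(targetUrl, accessKey) {
  const params = new URLSearchParams({
    url: targetUrl,
    access_key: accessKey,
    include_html: 'true',
  });
  return `https://api.capturekit.dev/content?${params}`;
}

// Sends the request and returns the parsed body (assumed to be JSON).
async function fetchViaCaptureKit(targetUrl, accessKey) {
  const response = await fetch(buildCaptureKitUrl(targetUrl, accessKey));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

Encoding matters as soon as the target URL has its own query string; without it, the target's parameters would be read as parameters of the API call.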

Benefits of CaptureKit API

Complete Solution: Extract not just HTML, but also metadata, links, and structured content
No Browser Management: No need to maintain browser instances
Scale Effortlessly: Handle high-volume extraction without infrastructure concerns

Example Response from CaptureKit API:

{
  "success": true,
  "data": {
    "metadata": {
      "title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
      "description": "Tailwind CSS is a utility-first CSS framework.",
      "favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
      "ogImage": "https://tailwindcss.com/opengraph-image.jpg"
    },
    "links": {
      "internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
      "external": ["https://tailwindui.com", "https://shopify.com"],
      "social": [
        "https://github.com/tailwindlabs/tailwindcss",
        "https://x.com/tailwindcss"
      ]
    },
    "html": "<html><body><h1>Hello, world!</h1></body></html>"
  }
}

Conclusion

Puppeteer offers powerful capabilities for extracting HTML from websites, but it can be complex to set up and maintain. For developers who need a reliable, maintenance-free solution that provides more than just raw HTML, CaptureKit API offers a compelling alternative with comprehensive data extraction capabilities.
