Jonathan Geiger

Originally published at capturekit.dev

How to Extract HTML from Web Pages with Puppeteer

Extracting HTML content from websites is a fundamental task for web scrapers, data scientists, and developers building automation tools. Puppeteer, a Node.js library developed by Google, provides a robust way to interact with web pages programmatically. In this guide, we'll explore how to extract HTML content effectively with Puppeteer and address common challenges.

What is Puppeteer?

Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium browsers. It enables developers to:

  • Scrape web content and extract data
  • Automate form submissions and user interactions
  • Generate screenshots and PDFs
  • Run automated testing
  • Monitor website performance
  • Crawl single-page applications (SPAs)

Let's dive into using Puppeteer for HTML extraction.

Setting Up Puppeteer

First, install Puppeteer via npm:

npm install puppeteer

This command installs Puppeteer and downloads a compatible browser build. To drive a Chrome installation you already have, install puppeteer-core instead. It skips the download, so you point it at your browser with the executablePath launch option:

npm install puppeteer-core

Basic HTML Extraction

Here's a simple script to extract the entire HTML from a webpage:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Get the page's HTML content
  const html = await page.content();
  console.log(html);

  await browser.close();
})();

This script:

  1. Launches a headless browser
  2. Opens a new page
  3. Navigates to https://example.com
  4. Extracts the full HTML content
  5. Closes the browser
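
One gap in the steps above: if navigation or extraction throws, the browser is never closed. Here is a minimal sketch of the same steps with cleanup in a finally block. The helper name extractHtml and the choice to pass the Puppeteer module in as a parameter are illustrative, not part of the Puppeteer API.

```javascript
// Sketch: the same five steps, but the browser is always closed.
// `extractHtml` is a hypothetical helper name; `puppeteer` is passed in
// so the function can also be exercised with a test double.
async function extractHtml(puppeteer, url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.content();
  } finally {
    // Runs whether the steps above succeeded or threw
    await browser.close();
  }
}

// Usage (requires `npm install puppeteer`):
// extractHtml(require('puppeteer'), 'https://example.com').then(console.log);
```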

Extracting HTML from Specific Elements

To extract HTML from a specific element on the page:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Extract HTML from a specific element
  const elementHtml = await page.evaluate(() => {
    const element = document.querySelector('.main-content');
    return element ? element.outerHTML : null;
  });

  console.log(elementHtml);
  await browser.close();
})();
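
When several elements match a selector, page.$$eval can run one callback over all of them. A small sketch of that idea follows. The helper name getAllOuterHtml and the .card selector in the usage comment are made up for illustration.

```javascript
// Sketch: collect the outerHTML of every element matching a selector.
// `getAllOuterHtml` is a hypothetical helper; `page` is a Puppeteer Page.
async function getAllOuterHtml(page, selector) {
  // $$eval runs the callback in the browser with an array of the matched
  // elements, so no match simply yields an empty array.
  return page.$$eval(selector, (elements) => elements.map((el) => el.outerHTML));
}

// Usage:
// const cards = await getAllOuterHtml(page, '.card');
```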

Waiting for Dynamic Content

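Modern websites often load content dynamically. The networkidle2 option makes navigation wait until there have been no more than two network connections for at least 500 ms, which gives most pages time to finish loading before extraction: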

await page.goto('https://example.com', { 
  waitUntil: 'networkidle2' 
});

For pages with specific elements that load asynchronously:

await page.waitForSelector('.dynamic-content', { visible: true });
const html = await page.content();
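
If the awaited element never appears, waitForSelector rejects once its timeout elapses. One possible way to handle that, sketched below, is to fall back to whatever has rendered so far. The helper name contentWhenReady and the 10-second default are assumptions. The check on the error name relies on Puppeteer naming its timeout errors TimeoutError.

```javascript
// Sketch: wait for a selector, but fall back to the current HTML if it
// never shows up. `contentWhenReady` is a hypothetical helper name.
async function contentWhenReady(page, selector, timeoutMs = 10000) {
  let found = true;
  try {
    // Rejects if the element is not visible within `timeout` milliseconds
    await page.waitForSelector(selector, { visible: true, timeout: timeoutMs });
  } catch (err) {
    // Only swallow timeouts; anything else is a real failure
    if (err.name !== 'TimeoutError') throw err;
    found = false;
  }
  return { found, html: await page.content() };
}

// Usage:
// const { found, html } = await contentWhenReady(page, '.dynamic-content', 5000);
```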

Extracting Text Content

If you only need the text content without HTML tags:

const textContent = await page.evaluate(() => {
  return document.body.innerText;
});

For a specific element (note that page.$eval throws if no element matches the selector):

const elementText = await page.$eval('.article', el => el.textContent);
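
Text pulled this way often carries stray whitespace: innerText follows the rendered layout, while textContent returns the raw text of every node, indentation included. Below is a minimal cleanup sketch. normalizeText is a hypothetical helper, not a Puppeteer function, and the exact rules are a matter of taste.

```javascript
// Sketch: tidy text returned by innerText/textContent before storing it.
// `normalizeText` is a hypothetical helper, not part of Puppeteer.
function normalizeText(text) {
  return text
    .replace(/\r\n?/g, '\n')       // unify line endings
    .replace(/[ \t\u00a0]+/g, ' ') // collapse spaces, tabs, and non-breaking spaces
    .replace(/ *\n */g, '\n')      // drop spaces around line breaks
    .replace(/\n{3,}/g, '\n\n')    // keep at most one blank line in a row
    .trim();
}

// Usage:
// const clean = normalizeText(await page.$eval('.article', el => el.textContent));
```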

Extracting Metadata

To extract a webpage's metadata like title, description, and Open Graph data:

const metadata = await page.evaluate(() => {
  return {
    title: document.title,
    description: document.querySelector('meta[name="description"]')?.content || null,
    ogTitle: document.querySelector('meta[property="og:title"]')?.content || null,
    ogDescription: document.querySelector('meta[property="og:description"]')?.content || null,
    ogImage: document.querySelector('meta[property="og:image"]')?.content || null
  };
});

console.log(metadata);
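
Naming each Open Graph tag works for a fixed set. To capture whichever og: tags a page happens to define, a sketch along these lines may help. The helper name getOpenGraph is an assumption, and when a property appears more than once the last value wins.

```javascript
// Sketch: gather every Open Graph tag instead of naming each one.
// `getOpenGraph` is a hypothetical helper; `page` is a Puppeteer Page.
async function getOpenGraph(page) {
  // Runs in the browser: one [property, content] pair per matching <meta>
  const pairs = await page.$$eval('meta[property^="og:"]', (metas) =>
    metas.map((m) => [m.getAttribute('property'), m.getAttribute('content')])
  );
  // Turn the pairs into a plain object keyed by property name
  return Object.fromEntries(pairs);
}

// Usage:
// const og = await getOpenGraph(page);
// console.log(og['og:title']);
```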

Extracting Links

To extract all links from a webpage:

const links = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('a')).map(a => {
    return {
      text: a.textContent.trim(),
      href: a.href
    };
  });
});

console.log(links);
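
The raw list usually contains duplicates, fragment links, and non-HTTP schemes such as mailto:. The sketch below shows one way to post-process the links array from the snippet above, using Node's built-in URL class. classifyLinks is a hypothetical helper, and treating "same hostname" as internal is a simplifying assumption.

```javascript
// Sketch: de-duplicate extracted links and split them by hostname.
// `classifyLinks` is a hypothetical helper for the `links` array above.
function classifyLinks(links, pageUrl) {
  const pageHost = new URL(pageUrl).hostname;
  const seen = new Set();
  const result = { internal: [], external: [] };
  for (const { href } of links) {
    let url;
    try {
      url = new URL(href);
    } catch {
      continue; // skip empty or malformed hrefs
    }
    if (!/^https?:$/.test(url.protocol)) continue; // skip mailto:, javascript:, etc.
    url.hash = ''; // links that differ only by #fragment count as one
    const key = url.href;
    if (seen.has(key)) continue;
    seen.add(key);
    (url.hostname === pageHost ? result.internal : result.external).push(key);
  }
  return result;
}

// Usage:
// const { internal, external } = classifyLinks(links, page.url());
```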

Handling Authentication

For websites that require authentication:

await page.goto('https://example.com/login');
await page.type('#username', 'your_username');
await page.type('#password', 'your_password');
// Start waiting before the click so a fast redirect is not missed
await Promise.all([
  page.waitForNavigation(),
  page.click('#login-button'),
]);

// Now that we're logged in, extract the protected content
const html = await page.content();
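
Hard-coded credentials tend to end up in version control. A common alternative is to read them from environment variables, sketched below. The variable names SCRAPER_USER and SCRAPER_PASS and the helper getCredentials are made up for this example.

```javascript
// Sketch: read credentials from the environment instead of the source file.
// SCRAPER_USER and SCRAPER_PASS are made-up variable names.
function getCredentials(env = process.env) {
  const { SCRAPER_USER: username, SCRAPER_PASS: password } = env;
  if (!username || !password) {
    throw new Error('Set SCRAPER_USER and SCRAPER_PASS before running the script');
  }
  return { username, password };
}

// Usage with the login steps above:
// const { username, password } = getCredentials();
// await page.type('#username', username);
// await page.type('#password', password);
```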

Avoiding Detection

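Many websites implement anti-bot measures. The stealth plugin for puppeteer-extra patches common headless-browser fingerprints, which reduces the chance of detection but does not guarantee it:

npm install puppeteer-extra puppeteer-extra-plugin-stealth

Register the plugin before launching the browser:

const puppeteer = require('puppeteer-extra');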
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

// Now use puppeteer as usual
const browser = await puppeteer.launch();

Saving Extracted HTML to a File

To save the extracted HTML to a file:

const fs = require('fs');

// Extract HTML
const html = await page.content();

// Write to file
fs.writeFileSync('extracted-page.html', html);
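
When saving many pages, a fixed file name gets overwritten on every run. The sketch below derives a name from the page URL and uses the promise-based fs API so the write does not block the event loop. Both helper names and the naming scheme are assumptions.

```javascript
const fs = require('fs/promises');
const path = require('path');

// Sketch: derive a safe file name from the page URL.
// `fileNameForUrl` and `saveHtml` are hypothetical helpers.
function fileNameForUrl(url) {
  const { hostname, pathname } = new URL(url);
  const slug = (hostname + pathname)
    .replace(/[^a-z0-9]+/gi, '-') // replace anything unsafe in a file name
    .replace(/^-+|-+$/g, '');     // trim leading and trailing dashes
  return `${slug}.html`;
}

// Writes the HTML into `dir` (created if missing) and returns the file path.
async function saveHtml(dir, url, html) {
  await fs.mkdir(dir, { recursive: true });
  const filePath = path.join(dir, fileNameForUrl(url));
  await fs.writeFile(filePath, html, 'utf8');
  return filePath;
}

// Usage:
// const savedTo = await saveHtml('output', page.url(), await page.content());
```

Note that the query string is ignored here, so two URLs that differ only in their parameters map to the same file.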

Working with iframes

To extract HTML from an iframe:

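// page.frames() normally lists the main frame first, so index 1 is the first child frame
const childFrame = page.frames()[1];
const frameContent = childFrame ? await childFrame.content() : null;

// Or find a frame by its name (find() returns undefined if there is no match)
const namedFrame = page.frames().find(frame => frame.name() === 'frameName');
const namedFrameContent = namedFrame ? await namedFrame.content() : null;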
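
Frame indexes and names vary from page to page, so it can be simpler to walk every frame and keep what can be read. Below is a sketch of that approach. The helper name allFrameContents is an assumption, and frames that fail are recorded with a null html value instead of aborting the loop.

```javascript
// Sketch: collect HTML from every frame, skipping frames that fail.
// `allFrameContents` is a hypothetical helper; `page` is a Puppeteer Page.
async function allFrameContents(page) {
  const results = [];
  for (const frame of page.frames()) {
    try {
      results.push({ url: frame.url(), html: await frame.content() });
    } catch (err) {
      // A frame can detach or navigate away while it is being read
      results.push({ url: frame.url(), html: null, error: err.message });
    }
  }
  return results;
}

// Usage:
// const frames = await allFrameContents(page);
// frames.forEach(f => console.log(f.url, f.html ? f.html.length : 'unreadable'));
```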

Alternative to Puppeteer: CaptureKit API

Setting up and maintaining Puppeteer for HTML extraction can be challenging. If you need a reliable, scalable solution without infrastructure headaches, consider using CaptureKit API:

curl "https://api.capturekit.dev/content?url=https://example.com&access_key=YOUR_ACCESS_KEY&include_html=true"

Benefits of CaptureKit API

  • Complete Solution: Extract not just HTML, but also metadata, links, and structured content
  • No Browser Management: No need to maintain browser instances
  • Scale Effortlessly: Handle high-volume extraction without infrastructure concerns

Example Response from CaptureKit API:

{
  "success": true,
  "data": {
    "metadata": {
      "title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
      "description": "Tailwind CSS is a utility-first CSS framework.",
      "favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
      "ogImage": "https://tailwindcss.com/opengraph-image.jpg"
    },
    "links": {
      "internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
      "external": ["https://tailwindui.com", "https://shopify.com"],
      "social": [
        "https://github.com/tailwindlabs/tailwindcss",
        "https://x.com/tailwindcss"
      ]
    },
    "html": "<html><body><h1>Hello, world!</h1></body></html>"
  }
}
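
For completeness, here is a rough sketch of the same request made from Node.js rather than curl. It assumes Node 18 or later (for the global fetch) and takes the endpoint, query parameters, and response fields from the examples above. The helper names are hypothetical, and the error handling is a guess at reasonable behavior rather than documented API semantics.

```javascript
// Sketch: build the request URL shown in the curl example.
// `buildCaptureKitUrl` and `fetchPageHtml` are hypothetical helper names.
function buildCaptureKitUrl(targetUrl, accessKey) {
  // URLSearchParams percent-encodes the target URL for use in a query string
  const params = new URLSearchParams({
    url: targetUrl,
    access_key: accessKey,
    include_html: 'true',
  });
  return `https://api.capturekit.dev/content?${params}`;
}

// Fetches the page and returns the `data.html` field from the response.
async function fetchPageHtml(targetUrl, accessKey) {
  const response = await fetch(buildCaptureKitUrl(targetUrl, accessKey));
  if (!response.ok) throw new Error(`Request failed with status ${response.status}`);
  const body = await response.json();
  if (!body.success) throw new Error('CaptureKit reported an unsuccessful request');
  return body.data.html;
}

// Usage:
// const html = await fetchPageHtml('https://example.com', process.env.CAPTUREKIT_KEY);
```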

Conclusion

Puppeteer offers powerful capabilities for extracting HTML from websites, but it can be complex to set up and maintain. For developers who need a reliable, maintenance-free solution that provides more than just raw HTML, CaptureKit API offers a compelling alternative with comprehensive data extraction capabilities.
