Forem: Nikolas Serafini

A practical and gentle introduction to web scraping with Puppeteer

Nikolas Serafini — Fri, 13 Mar 2020 20:34:16 +0000

If you are wondering what that is, Puppeteer is a Google-maintained Node library that provides an API over the DevTools protocol, offering us the ability to take control over Chrome or Chromium and do very nice automation and scraping related things.

It's very resourceful, widely used, and probably what you should take a look today if you need to develop something of the like. It's use even extends to performing e2e tests with front-end web frameworks such as Angular, it's a very powerful tool.

In this article we aim to show some of the essential Puppeteer operations along with a very simple example of extracting Google's first page results for a keyword, as a way of wrapping things up.
Oh, and a full and working repository example with all the code shown in this post can be found here if you need!

TL;DR

We'll learn how to make Puppeteer's basic configuration
Also how to access Google's website and scrap the results page
All of this getting into detail about a couple of commonly used API functions

First step, launching a Browser instance

Before we can attempt to do anything, we need to launch a Browser instance as a matter to actually access a specific website. As the name suggests, we are actually going to launch a full-fledged Chromium browser (or not, we can run in headless mode), capable of opening multiple tabs and as feature-rich as the browser you may be using right now.

Launching a Browser can be simple as typing await puppeteer.launch(), but we should be aware that there is a huge amount of launching options available, whose use depends on your needs. Since we will be using Docker in the example, some additional tinkering is done here so we can run it inside a container without problems, but still serves as a good example:

async function initializePuppeteer() {
  const launchArgs = [
  // Required for Docker version of Puppeteer
  "--no-sandbox",
  "--disable-setuid-sandbox",
  // Disable GPU
  "--disable-gpu",
  // This will write shared memory files into /tmp instead of /dev/shm,
  // because Docker’s default for /dev/shm is 64MB
  "--disable-dev-shm-usage"
  ];

  return puppeteer.launch({
    executablePath: "/usr/bin/chromium-browser",
    args: launchArgs,
    defaultViewport: {
      width: 1024,
      height: 768
    }
  });
}

Working with tabs

Since we have already initialized our Browser, we need to create tabs (or pages) to be able to access our very first website. Using the function we defined above, we can simply do something of the like:

const browser = await initializePuppeteer()
const page = await browser.newPage()
await scrapSomeSite(page)

Accessing a website

Now that we have a proper page opened, we can manage to access a website and do something nice. By default, the newly created page always open blank so we must manually navigate to somewhere specific. Again, a very simple operation:

await page.goto("https://www.google.com/?gl=us&hl=en", {
    timeout: 30000,
    waitUntil: ["load"],
  });

There are a couple of options in this operation that requires extra attention and can heavily impact your implementation if misused:

timeout: while the default is 30s, if we are dealing with a somewhat slow website or even running behind proxies, we need to set a proper value to avoid undesired execution errors.
waitUntil: this guy is really important as different sites have completely different behaviors. It defines the page events that are going to be waited before considering that the page actually loaded, not waiting for the right events can break your scraping code. We can use one or all of them, defaulting to load . You can find all the available options here.

Page shenanigans

Google's first page

So, we finally opened a web page! That's nice. We now have arrived at the actually fun part.
Let's follow the idea of scraping Google's first result page, shall we? Since we have already navigated to the main page we need to do two different things:

Fill the form field with a keyword
Press the search button

Before we can interact with any element inside a page, we need to find it by code first, so then we can replicate all the necessary steps to accomplish our goals. This is a little detective work, and it may take some time to figure out.

We are using the US Google page so we all see the same page, the link is in the code example above. If we take a look at Google's HTML code you'll see that a lot of element properties are properly obfuscated with different hashes that change over time, so we have lesser options to always get the same element we desire.

But, lucky us, if we inspect the input field, one can find easy-to-spot properties such as title="Search" on the element. If we check it with a document.querySelectorAll("[title=Search]") on the browser we'll verify that it is a unique element for this query. One down.

We could apply the same logic to the submit button, but I'll take a different approach here on purpose. Since everything is inside a form , and we only have one in the page, we can forcefully submit it to instantly navigate to the result screen, by simply calling a form.submit(). Two down.

And how we can "find" these elements and do these awesome operations by code? Easy-peasy:

// Filling the form
const inputField = await page.$("[title=Search]");
await inputField.type("puppeteer", { delay: 100 });

// Forces form submission
await page.$eval("form", form => form.submit());
await page.waitForNavigation({ waitUntil: ["load"] });

So we first grab the input field by executing a page.$(selectorGoesHere) , function that actually runs document.querySelector on the browser's context, returning the first element that matches our selector. Being that said you have to make sure that you're fetching the right element with a correct and unique selector, otherwise things may not go the way they should. On a side note, to fetch all the elements that match a specific selector, you may want to run a page.$$(selectorGoesHere) , that runs a document.querySelectorAll inside the browser's context.

As for actually typing the keyword into the element, we can simply use the page.type function with the content we want to search for. Keep in mind that, depending on the website, you may want to add a typing delay (as we did in the example) to simulate a human-like behavior. Not adding a delay may lead to weird things like input drop downs not showing or a plethora of different strange things that we don't really want to face.

Want to check if we filled everything correctly? Taking a screenshot and the page's full HTML for inspecting is also very easy:

await page.screenshot({
  path: "./firstpage",
  fullPage: true,
  type: "jpeg"
});

const html = await page.content();

To submit the form, we are introduced to a very useful function: page.$eval(selector, pageFunction) . It actually runs a document.querySelector for it's first argument, and passes the element result as the first argument of the provided page function. This is really useful if you have to run code that needs to be inside the browser's context to work, as our form.submit() . As the previous function we mentioned, we also have the alternate page.$$eval(selector, pageFunction) that works the same way but differs by running a document.querySelectorAll for the selector provided instead.

As forcing the form submission causes a page navigation, we need to be explicit in what conditions we should wait for it before we continue with the scraping process. In this case, waiting until the navigated page launches a load event is sufficient.

The result page

With the result page loaded we can finally extract some data from it! We are looking only for the textual results, so we need to scope them down first.
If we take a very careful look, the entire results container can be found with the [id=search] > div > [data-async-context] selector. There are probably different ways to reach the same element, so that's not a definitive answer. If you find a easier path, let me know.

And, lucky us, every text entry here has the weird .g class! So, if we query this container element we found for every sub-element that has this specific class (yes, this is also supported) we can have direct access to all the results! And we can do all that with stuff we already mentioned:

const rawResults = await page.$("[id=search] > div > [data-async-context]");

const filteredResults = await rawResults.$$eval(".g", results =>
    Array.from(results)
      .map(r => r.innerText)
      .filter(r => r !== "")
);

console.log(filteredResults)

So we use the page.$ function to take a hold on that beautiful container we just saw, so then a .$$eval function can be used on this container to fetch all the sub-elements that have the .g class, applying a custom function for these entries. As for the function, we just retrieved the innerText for every element and removed the empty strings on the end, to tidy up our results.

One thing that should not be overlooked here is that we had to use Array.from() on the returning results so we could actually make use of functions like map , filter and reduce . The returning element from a .$$eval call is a NodeList , not an Array, and it does not offer support for some of the functions that we otherwise would find on the last.

If we check on the filtered results, we'll find something of the like:

[
  '\n' +
    'puppeteer/puppeteer: Headless Chrome Node.js API - GitHub\n' +
    'github.com › puppeteer › puppeteer\n' +
    'Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium. What can I do? Most things that you can do manually ...\n' +
    '‎Puppeteer API · ‎37 releases · ‎Puppeteer for Firefox · ‎How do I get puppeteer to ...',
  '\n' +
    'Puppeteer | Tools for Web Developers | Google Developers\n' +
    'developers.google.com › web › tools › puppeteer\n' +
    'Jan 28, 2020 - Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.\n' +
    '‎Quick start · ‎Examples · ‎Headless Chrome: an answer · ‎Debugging tips',
  'People also ask\n' +
    'What is puppeteer used for?\n' +
    'How does a puppeteer work?\n' +
    'What is puppeteer JS?\n' +
    'Does puppeteer need Chrome installed?\n' +
    'Feedback',
...
]

And we have all the data we want right here! We could parse every entry here on several different ways, and create full-fledged objects for further processing, but I'll leave this up to you.

Our objective was to get our hands into the text data, and we managed just that. Congratulations to us, we finished!

Wrapping things up

Our scope here was to present Puppeteer itself along with a series of operations that could be considered basic for almost every web scraping context. This is most probably a mere start for more complex and deeper operations one may find during a page's scraping process.

We barely managed to scratch the surface of Puppeteer's extensive API, one that you should really consider taking a serious look into. It's pretty well-written and loaded with easy-to-understand examples for almost everything.

This is just the first of a series of posts regarding Web scraping with Puppeteer that will (probably) come to fruition in the future. Stay tuned!

Running your first Docker Image on ECS

Nikolas Serafini — Wed, 04 Mar 2020 17:28:40 +0000

Working with containers has become a hefty trend in Software Engineering in this past couple of years. Containers can offer several advantages for software development and application deployment, possibly taking away a lot of problems faced by development and devops teams. We'll peek a little bit on the basics and hows of running your first Docker image on Amazon Web Services.

Of the gruesome volume of different services offered by Amazon, the Elastic Container Service (ECS) is the one to go when we commonly need to run and deploy dockerized applications.

For one to fully deploy a application in said service, at least three little things are needed: an configured ECS Cluster, at least one image on ECR, and a Task Definition.

A Cluster is a grouping of tasks and services, whose creation is free of charge by itself. Tasks must be placed inside a cluster to run, so their creation is mandatory. You'll probably want to create different clusters for different applications or execution contexts to keep everything tied up and organized.

A Task Definition is an AWS resource used to describe containers and volume definitions of ECR tasks. There one can define what docker images to use, environment and network configuration, instance requirements for task placement among other configurations. Which lead us to the definition of Task, a Task Definition instance, the running entity executing all the software in the images included, in the way we defined.

Prerequisites

Before we can jump into the Task Definition configuration shenanigans we must first have at least one container image (since a Task Definition can make use of multiple containers at once) on one ECR repository and a already-configured (or empty) Cluster.

Docker images must be built and pushed to another AWS Service, the Elastic Container Registry (ECR). In the pretty straightforward service, each different Docker image must be tagged and pushed to a separate repository so they can be referenced in our task definitions. It's a must.
Amazon is friendly enough to even give us all the necessary commands to do all the basic operations: all you have to do is push that beautiful "View push commands" button, as every needed step is explained in detail over there.

Clusters can be created by clicking on the intuitive "Create Cluster" button on the ECS main page. Make sure to select the correct template for your needs, depending on the OS you'll want to use, but you'll probably want to stick with the "Linux + Networking" one. Do not choose the "Networking only" unless you're really sure on about what you're doing, as you will be dealing with yet another completely different service, AWS Fargate.

We have the option of creating an empty cluster, thus not allocating any kind of on-demand or spot instances, or not doing that. Since creating an empty cluster gives us the hassle of providing running machines in some way so our tasks can actually run (and that's completely out of our scope here), I suggest you to just pick a t2.micro as a ECS Instance Type as it's almost free. Keep in mind that if you want to meddle with Fargate no instances need to be picked as Amazon will take care of machine allocation for you, so you can just create an empty cluster.

The Task Definition

With all the needed configuration at hand, we can finally safely access the "Create new Task Definition" section found no the ECS Task Definition main page.

At first, Amazon will ask you to choose your Task Definition launch type based on where you want run your tasks, giving you two choices: Fargate and ECS. Fargate is an AWS alternate service that allows us to execute tasks without the hassle of explicitly allocating machines to run your containers, whereas ECS is a launch type that requires you to manually configure several execution aspects, a lower level operation. Both launch types work and are billed differently, you can find more detailed information here.

In the following page we will find the more meaningful and deeper configuration:

Task Definition Name: the name of your Task Definition. Obligatory.
Requires Compatibilities: is set following the Launch Type chosen in the first configuration page.
Network Mode: set network mode ECS will use to start our containers. If Fargate is used, we have no other option as to set the awsvpc mode. If you are not really sure what this means, take a look here.
Task Execution Role: necessary role to be given to each task execution as to allow it to have access and sufficient permissions to run correctly inside the AWS ecosystem. An Execution Role has access to several different permission policies to give specific access to the applications running under it. For instance, a common use case for ECS-executed tasks are to have access to ECR (so we can pull our images) and to have permission to use Cloudwatch (Amazon logging and monitoring service), meaning we would have to attach to our Role policies like: "ecr:BatchGetImage" and "logs:PutLogEvents". Do not confuse this with the Task Role, the Task Execution Role wraps the permissions to be conceded to the ECS agent responsible for placing the tasks, not to the tasks themselves. Detailed information of Task Execution Roles can be found here.
Task Role: necessary role to be given to the task itself. It follows the same principles listed above, so if your application needs to send messages to a SQS queue, communicate to Redis cluster or save data into a RDS database, you'll need to set all the specific policies into your custom Task Role configuration.
Task size: a pretty straightforward section, allows you to set specific memory and CPU requirements for running your task. Keep in mind that if you try to run your task on a machine with lower specs than specified here, it won't start at all.
Container Definitions: the fun part. You'll have to pass through all the steps listed below for each container you want to include inside the same Task Definition. You can set up any number of containers you need.
- Container name: self-explanatory.
- Image: the ECR repository image URL from which you'll pull the application.
- Memory Limits: the definition hard or soft memory limits (Mib) for the container. Link to universal knowledge if you do not know what this means.
- Port mappings: if your application will make use of specific ports for any kind of outgoing communication, you'll probably have to bind the host ports to your container ones. Otherwise, nothing is gonna actually work.
- Healthcheck: sets up a container health check routine in specific timed intervals. Useful to continuously monitor your container health and used by ECS to know when your application actually started running. If you defined a specific route for this in your application, you'll probably have something of the like as your command: CMD-SHELL,curl -f http://localhost:port/healthcheck || exit 1
- Environment: Lets you set up the amount of CPU units your container is going to use, configure all the environment variables you need so your application runs correctly among other configuration. Also allows you to set the container as essential, meaning in the scenario that if it dies for some reason, the entire task is going to be killed shortly after.
- Startup Dependency Ordering: allows you to control the order the containers are going to start, and when they are going to. Not mandatory.
- Container Timeouts: self-explanatory and not mandatory.
- Network Settings: also self-explanatory. Not really mandatory, unless you have some advanced network shenanigans going on.
- Storage and Logging: Couple of advanced setups. At a basic level you'll probably want to configure the Cloudwatch Log configuration, or simply let Amazon handle everything with the beautiful "Auto-configure CloudWatch Logs" button.
- Security, Resource Limits, and Docker Labels: advanced an context-specific configurations. Not mandatory. We'll not cover them here.

And that's it. There a couple of other options listed in the task definition main page, as Mesh and FireLens integration, but they are very specific and not really needed for your everyday task definitions. You can skip the rest and press the "Create" button.

Running your task

After all this explanation, we want to see if the containerized application will actually run right? So click on your created ECS cluster, hit the bottom "Tasks" tab and click on the "Run new Task" button.

On the new screen, select "EC2" as a Launch Type, fill in your task definition name and the name of the cluster where we'll run it and click on "Run Task". If you had set the t2.micro as we suggested, your application will boot in no time.

You can check all the running information of your task in the "Task" tab of your cluster.

Since we did not enter into the merits of what kind of application you are trying to run, it's up to yourself to check if your application is running as it should. You can check the logs of your application by clicking on the task (inside the aforementioned "Task" tab) and searching for the "View logs in CloudWatch" under the desired configured container.

Wrapping up

Our main objective here was to show the simplest (and somewhat lengthy) path of how one can deploy a containerized application on Amazon Web Services without any kind of context what one could actually run over there.

A lot of deeper points were omitted since they serve no purpose in a introduction, a couple of other complementary ones were also not mentioned now (such as configuring services and making spot fleet requests) but are going to be featured in future following articles. Those additional content will be complementary and crucial for a more consistent understanding of the multitude of ECS services and overall environment.

I invite you all to stay tuned for the next articles. Feel free to use the comments section below to post any questions or commentaries, we'll take a look on them all, promise.
Until next time, happy deploying.