<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: ScraperAPI Zoltan Bettenbuk</title>
    <description>The latest articles on Forem by ScraperAPI Zoltan Bettenbuk (@scraperapi).</description>
    <link>https://forem.com/scraperapi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F894092%2F2157c58b-f33a-4b64-848d-3f8574f7f75a.png</url>
      <title>Forem: ScraperAPI Zoltan Bettenbuk</title>
      <link>https://forem.com/scraperapi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/scraperapi"/>
    <language>en</language>
    <item>
      <title>Building a Go Web Scraper Step-by-Step: The Beginners Guide to Colly</title>
      <dc:creator>ScraperAPI Zoltan Bettenbuk</dc:creator>
      <pubDate>Thu, 01 Sep 2022 00:24:59 +0000</pubDate>
      <link>https://forem.com/scraperapi/building-a-go-web-scraper-step-by-step-the-beginners-guide-to-colly-206j</link>
      <guid>https://forem.com/scraperapi/building-a-go-web-scraper-step-by-step-the-beginners-guide-to-colly-206j</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l-F4hdXX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l-F4hdXX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper.jpg" alt="" width="880" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go allows developers to create complicated projects with a simpler syntax than C, but with almost the same efficiency and control.&lt;/p&gt;

&lt;p&gt;Its simplicity and efficiency are why we decided to add Golang to our web scraping beginners series and show you how to use it to extract data at scale.&lt;/p&gt;

&lt;h2&gt;Why Use Go Over Python or JavaScript for Web Scraping?&lt;/h2&gt;

&lt;p&gt;If you’ve followed our series, you’ve seen how easy web scraping is with languages like Python and JavaScript, so why should you give Go a shot?&lt;/p&gt;

&lt;p&gt;There are three main reasons for choosing Go over other languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go is a &lt;a href="https://www.geeksforgeeks.org/what-is-a-typed-language/" rel="noreferrer noopener"&gt;statically typed language&lt;/a&gt;, making it easier to find errors without running your program. &lt;a href="https://www.codecademy.com/article/what-is-an-ide" rel="noreferrer noopener"&gt;Integrated Development Environments&lt;/a&gt; (IDEs) will highlight errors immediately and even show you suggestions to fix them.&lt;/li&gt;
&lt;li&gt;IDEs can be more helpful the more they understand the code, and because we declare the data types in Go, there’s less ambiguity in the code. Thus, IDEs can provide better auto-complete features and suggestions than in other languages.&lt;/li&gt;
&lt;li&gt;Unlike Python or JavaScript, Go is a compiled language that outputs machine code directly, making it faster than Python. In an &lt;a href="https://medium.com/@arnesh07/how-golang-can-save-you-days-of-web-scraping-72f019a6de87#:~:text=From%20this%20graph%2C%20it%20is,the%20same%20number%20of%20URLs." rel="noreferrer noopener"&gt;experiment run by Arnesh Agrawal&lt;/a&gt;, Python (using the Beautiful Soup library) took almost 40 minutes to scrape 2000 URLs, while Go (using the Goquery package) took less than 20 minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary, Go is an excellent option if you need to optimize scraping speed or if you’re looking for a statically typed language to transition to.&lt;/p&gt;

&lt;h2&gt;Web Scraping with Go&lt;/h2&gt;

&lt;p&gt;For this project, we’ll scrape the &lt;a href="https://www.jackjones.com/nl/en/jj/shoes/" rel="noreferrer noopener"&gt;&lt;em&gt;Jack and Jones&lt;/em&gt; shoe category&lt;/a&gt; page to extract product names, prices, and URLs, then export the data into a JSON file using Colly, Go’s web scraping library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A consideration:&lt;/strong&gt; We’ll try to explain every step of the way in as much detail as possible, so even if you don’t have experience with Go, you’ll be able to follow along. However, if you still feel lost while reading, here’s a great &lt;a href="https://www.youtube.com/watch?v=yyUHQIec83I" rel="noreferrer noopener"&gt;introduction to Go&lt;/a&gt; to watch beforehand.&lt;/p&gt;

&lt;h3&gt;1. Setting Up Our Project&lt;/h3&gt;

&lt;p&gt;We first need to head to &lt;a href="https://go.dev/dl/" rel="noreferrer noopener"&gt;https://go.dev/dl/&lt;/a&gt; and download the right version of Go based on our operating system. In our case, we’ll pick the ARM64 version as we’re using a MacBook Air M1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KToBEE9F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img1-1024x242.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KToBEE9F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img1-1024x242.png" alt="" width="880" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can also use &lt;a href="https://formulae.brew.sh/formula/go#default" rel="noreferrer noopener"&gt;Homebrew&lt;/a&gt; on macOS or &lt;a href="https://chocolatey.org/" rel="noreferrer noopener"&gt;Chocolatey&lt;/a&gt; on Windows to install Go.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vMc2arja--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vMc2arja--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img2.png" alt="" width="370" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the download is complete, follow the instructions to install it on your machine. With Go installed, let’s create a new directory named &lt;em&gt;go-scraper&lt;/em&gt; and open it in VS Code or your preferred IDE.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Open the terminal and enter the &lt;code&gt;go version&lt;/code&gt; command. If everything goes well, it will log the version like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BNnvSSZ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BNnvSSZ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img3.png" alt="" width="572" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this tutorial, we’ll be using VS Code to write our code, so we’ll also install Go’s VS Code extension for a better experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fOm0EfJD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fOm0EfJD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img4.png" alt="" width="819" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, open the terminal and type the following command to create or initialize the project: &lt;code&gt;go mod init go-scraper&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U-Q4U6EG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U-Q4U6EG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img5.png" alt="" width="629" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Go, “a &lt;a href="https://go.dev/blog/using-go-modules" rel="noreferrer noopener"&gt;module&lt;/a&gt; is a collection of &lt;a href="https://go.dev/ref/spec#Packages" rel="noreferrer noopener"&gt;Go packages&lt;/a&gt; stored in a file tree with a &lt;em&gt;go.mod&lt;/em&gt; file at its root.” The command above – &lt;code&gt;go mod init&lt;/code&gt; – tells Go that the module we’re naming (&lt;em&gt;go-scraper&lt;/em&gt;) is rooted in the current directory.&lt;/p&gt;

&lt;p&gt;Without leaving the terminal, let’s create a new &lt;em&gt;jack-scraper.go&lt;/em&gt; file with &lt;code&gt;touch jack-scraper.go&lt;/code&gt;, and then install Colly using the command found in Colly’s documentation: &lt;code&gt;go get -u github.com/gocolly/colly/...&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wix4juXI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img6-1024x589.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wix4juXI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img6-1024x589.png" alt="" width="880" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the downloaded dependencies were added to the &lt;em&gt;go.mod&lt;/em&gt; file, and a new &lt;em&gt;go.sum&lt;/em&gt; file – which may contain hashes for multiple versions of a module – was created.&lt;/p&gt;
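&lt;p&gt;After these two commands, the &lt;em&gt;go.mod&lt;/em&gt; file might look something like this (the Go and Colly versions shown here are illustrative – yours will depend on when you run the commands):&lt;/p&gt;

```
module go-scraper

go 1.19

require github.com/gocolly/colly v1.2.0
```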

&lt;p&gt;Now we can consider our environment set!&lt;/p&gt;

&lt;h3&gt;2. Sending HTTP Requests with Colly&lt;/h3&gt;

&lt;p&gt;At the top of our &lt;em&gt;jack-scraper.go&lt;/em&gt; file, we’ll add our package’s name, which as a convention, we’ll name main:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And then import the dependencies to the project:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import (
   "github.com/gocolly/colly"
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Due to Go’s typed nature, VS Code will tell us there’s an error with our import. Basically, it’s telling us that we imported a dependency we’re not using yet. This is one of the advantages of using Go: we don’t need to run our code to find errors.&lt;/p&gt;

&lt;p&gt;Something particular to Go is that we need to provide a starting point for the code to run. So we’ll create the main function, and all our logic will be inside it. For now, we’ll print “Function is working.”&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;func main() {
  
   fmt.Println("Function is working")
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If everything went well it should return this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aENMVDDi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aENMVDDi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img7.png" alt="" width="626" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might have also noticed that a new package was imported. That’s because Go can tell we’re trying to use the &lt;code&gt;.Println()&lt;/code&gt; function from the &lt;code&gt;fmt&lt;/code&gt; package, so it automatically imported it. Handy, right?&lt;/p&gt;

&lt;p&gt;To handle our request and deal with all callbacks necessary to extract the data, Colly uses a &lt;a href="http://go-colly.org/docs/introduction/start/" rel="noreferrer noopener"&gt;collector&lt;/a&gt; object. Initializing the collector is as simple as calling the .NewCollector() method from Colly and passing the domains we want to allow Colly to visit:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c := colly.NewCollector(
    colly.AllowedDomains("www.jackjones.com"),
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On our new &lt;code&gt;c&lt;/code&gt; instance, we can now call the &lt;code&gt;.Visit()&lt;/code&gt; method to make Colly send our request.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.Visit("https://www.jackjones.com/nl/en/jj/shoes/")&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;However, there are a few more things we need to add to our code before sending our collector out into the world. Let’s see what’s happening behind the scenes.&lt;/p&gt;

&lt;p&gt;First, we’ll create a callback to print out the URL Colly is navigating to – this will become more useful as we scale our scraper from one page to multiple pages.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnRequest(func(r *colly.Request) {
    fmt.Println("Scraping:", r.URL)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And then a callback to print out the status of the request.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnResponse(func(r *colly.Response) {
    fmt.Println("Status:", r.StatusCode)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As we said before, the &lt;code&gt;collector&lt;/code&gt; object is responsible for the callbacks attached to a &lt;code&gt;collector&lt;/code&gt; job. So each of these functions will be called at various stages in the &lt;code&gt;collector&lt;/code&gt; job cycle.&lt;/p&gt;

&lt;p&gt;In our example, the &lt;code&gt;.OnRequest()&lt;/code&gt; function is called right before the collector makes the HTTP request, while the &lt;code&gt;.OnResponse()&lt;/code&gt; function will be called after a response is received.&lt;/p&gt;

&lt;p&gt;Just for good measure, let’s also create an error handler before running the scraper.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnError(func(r *colly.Response, err error) {
    fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To run our code, open a terminal and use the &lt;code&gt;go run jack-scraper.go&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CdagFDMC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CdagFDMC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img8.png" alt="" width="635" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Awesome, we got a beautiful 200 (successful) status code! We’re ready for phase 2, the extraction.&lt;/p&gt;

&lt;h3&gt;3. Inspecting the Target Website&lt;/h3&gt;

&lt;p&gt;To extract data from an HTML file, we need more than just access to the site. Each website has its own structure and we must understand it in order to scrape specific elements without adding noise to our dataset.&lt;/p&gt;

&lt;p&gt;The good news is that we can quickly look at the site’s HTML structure by right-clicking on the page and selecting &lt;em&gt;Inspect&lt;/em&gt; from the menu. It’ll open the Inspector tools, and we can start hovering on elements to see their positions in the HTML and attributes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4leYFdHL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img9-1024x490.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4leYFdHL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img9-1024x490.png" alt="" width="880" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use &lt;a href="https://www.scraperapi.com/blog/css-selectors-cheat-sheet/"&gt;CSS selectors in web scraping&lt;/a&gt; (classes, IDs, etc.) to tell our scrapers where to locate the elements we want them to extract for us. Fortunately for us, Colly is based on the &lt;a href="https://pkg.go.dev/github.com/PuerkitoBio/goquery" rel="noreferrer noopener"&gt;Goquery package&lt;/a&gt;, which provides Colly with a JQuery-like syntax to target these selectors.&lt;/p&gt;

&lt;p&gt;Let’s try to grab the first shoe’s product name using the DevTools console to test the ground.&lt;/p&gt;

&lt;h3&gt;4. Using DevTools to Test Our CSS Selectors&lt;/h3&gt;

&lt;p&gt;If we inspect the element further, we can see that the name of the product is in a &lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt; tag with the class “&lt;code&gt;product-tile__name__link&lt;/code&gt;”, wrapped between &lt;code&gt;&amp;lt;header&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OF_4nUeZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OF_4nUeZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img10.png" alt="" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On our &lt;a href="https://developer.chrome.com/docs/devtools/console/" rel="noreferrer noopener"&gt;browser’s console&lt;/a&gt;, let’s use the &lt;code&gt;document.querySelectorAll()&lt;/code&gt; method to find that element.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;document.querySelectorAll("a.product-tile__name__link.js-product-tile-link")&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QqWP93S7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QqWP93S7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img11.png" alt="" width="442" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes! It returns 44 elements, which matches the number of products on the page.&lt;/p&gt;

&lt;p&gt;The product’s URL is inside the same element, so we can use the same selector for it. On the other hand, after some testing, we can use “&lt;code&gt;em.value__price&lt;/code&gt;” to pick the price.&lt;/p&gt;

&lt;h3&gt;5. Scraping All Product Names on the Page&lt;/h3&gt;

&lt;p&gt;If you’ve read some of our previous articles, you know we love testing. So, before extracting all our target elements, we’re first going to test our scraper by extracting all product names from the page.&lt;/p&gt;

&lt;p&gt;To do so, Colly has a &lt;code&gt;.OnHTML()&lt;/code&gt; function to handle the HTML from the response (there’s also an &lt;code&gt;.OnXML()&lt;/code&gt; function you can use if you’re not sure if the response will be HTML or XML content).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnHTML("a.product-tile__name__link.js-product-tile-link", func(h *colly.HTMLElement) {
    fmt.Println(h.Text)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here’s a breakdown of what’s happening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We pass the main element we want to work with &lt;code&gt;a.product-tile__name__link.js-product-tile-link&lt;/code&gt; as the first argument of the &lt;code&gt;OnHTML()&lt;/code&gt; function.&lt;/li&gt;
&lt;li&gt;The second argument is a function that will run when Colly finds the element we specified.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inside that function, we told Colly what to do with the &lt;code&gt;h&lt;/code&gt; object. In our case, it’s printing the text of the element.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--THPa3CFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--THPa3CFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img12.png" alt="" width="619" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And yes, it’s working great so far. However, there’s a lot of empty space around our text, adding noise to the data.&lt;/p&gt;

&lt;p&gt;To solve this problem, let’s get even more specific by adding the &lt;code&gt;&amp;lt;header&amp;gt;&lt;/code&gt; element as our main selector and then looking for the text using the &lt;code&gt;.ChildText()&lt;/code&gt; function.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnHTML("header.product-tile__name", func(h *colly.HTMLElement) {
    fmt.Println(h.ChildText("a.product-tile__name__link.js-product-tile-link"))
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hDs2zfEB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hDs2zfEB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img13.png" alt="" width="642" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;6. Extracting All HTML Elements&lt;/h3&gt;

&lt;p&gt;By now, we have a good grasp of the logic behind the &lt;code&gt;OnHTML()&lt;/code&gt; function, so let’s scale it a little bit more and extract the rest of the data.&lt;/p&gt;

&lt;p&gt;We begin by changing the main selector to the one that contains all the data we want:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6CEZ01lP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img14-1024x200.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6CEZ01lP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img14-1024x200.png" alt="" width="880" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, we go down the hierarchy to extract the name, price, and URL of each product:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aNUpMG8Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aNUpMG8Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img15.png" alt="" width="866" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s how it translates to code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnHTML("div.product-tile__content-wrapper", func(h *colly.HTMLElement) {
    name := h.ChildText("a.product-tile__name__link.js-product-tile-link")
    price := h.ChildText("em.value__price")
    url := h.ChildAttr("a.product-tile__name__link.js-product-tile-link", "href")
    fmt.Println(name, price, url)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For this function, we’re storing each element inside a variable to make it easier to work with later. For example, printing all the elements scraped to the console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; When using the &lt;code&gt;.ChildAttr()&lt;/code&gt; callback function, we need to pass the selector of the element as the first argument (&lt;code&gt;a.product-tile__name__link.js-product-tile-link&lt;/code&gt;) and the name of the attribute as the second (&lt;code&gt;href&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xNWeVhHu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img16-1024x127.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xNWeVhHu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img16-1024x127.png" alt="" width="880" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We got a beautiful list of elements printed to the console, but it’s not quite the best format for analysis. Let’s solve that!&lt;/p&gt;

&lt;h3&gt;7. Export Data to a JSON File in Colly&lt;/h3&gt;

&lt;p&gt;Saving our data to a JSON file will be more useful than having everything on the terminal, and it’s not that hard to do, thanks to Go’s built-in JSON modules.&lt;/p&gt;

&lt;h4&gt;Creating a Structure&lt;/h4&gt;

&lt;p&gt;Outside the main function, we need to create a structure (struct) to group every data set (name, price, URL) into a single type.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;type products struct {
   Name  string
   Price string
   URL   string
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Structures let you combine different data types, so we have to define the data type for each element. We can also define the JSON field names through tags; this is optional, but without them the keys in the output file will default to the capitalized field names.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;type products struct {
   Name  string `json:"name"`
   Price string `json:"price"`
   URL   string `json:"url"`
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;TIP:&lt;/strong&gt; for a cleaner look, we can group all fields sharing the same type in a single line of code like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;type products struct {
   Name, Price, URL string
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Of course, there’s going to be an “unused” error message on our code because we haven’t, well, used the &lt;code&gt;products struct&lt;/code&gt; anywhere. So, we’ll assign each scraped element to one of our fields in the &lt;code&gt;struct&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnHTML("div.product-tile__content-wrapper", func(h *colly.HTMLElement) {
    products := products{
        Name:  h.ChildText("a.product-tile__name__link.js-product-tile-link"),
        Price: h.ChildText("em.value__price"),
        URL:   h.ChildAttr("a.product-tile__name__link.js-product-tile-link", "href"),
    }
    
    fmt.Println(products)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If we run our application, our data is now grouped, making it possible to pass each element as an individual “item” to the empty list we’ll later turn into a JSON file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8JHEcskh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img17-1024x128.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8JHEcskh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img17-1024x128.png" alt="" width="880" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Adding Scraped Elements to a Slice&lt;/h4&gt;

&lt;p&gt;With our structure holding the product data, we’ll next send all of them to an empty &lt;a href="https://go.dev/tour/moretypes/7" rel="noreferrer noopener"&gt;Slice&lt;/a&gt; (instead of an Array like we would in other languages) to create a list of items we’ll export to the JSON file.&lt;/p&gt;

&lt;p&gt;To declare the empty slice, add this line after the collector initialization code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;var allProducts []products&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Inside the &lt;code&gt;.OnHTML()&lt;/code&gt; function, instead of printing our structure, let’s append our &lt;code&gt;products&lt;/code&gt; struct to the Slice.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;allProducts = append(allProducts, products)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If we print the Slice out now, here’s the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nObsQh5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img18-1024x152.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nObsQh5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img18-1024x152.png" alt="" width="880" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that each product information set is inside curly braces ({…}), and the entire Slice is inside brackets ([…]).&lt;/p&gt;

&lt;h4&gt;Writing the Slice into a JSON File&lt;/h4&gt;

&lt;p&gt;We already did all the heavy lifting we needed to do. From here on, Go has a very easy-to-use JSON module that’ll handle the writing for us:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;   //We pass the data to Marshal
   content, err := json.Marshal(allProducts)
   //We need to handle the potential error
   if err != nil {
       fmt.Println(err.Error())
   }
   //We write a new file passing the file name, data, and permissions
   os.WriteFile("jack-shoes.json", content, 0644)
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Save it, and the necessary dependencies will update:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import (
   "encoding/json"
   "fmt"
   "os"
  
   "github.com/gocolly/colly"
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let’s run our code and see what it returns:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yf--2mC2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img19-1024x269.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yf--2mC2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img19-1024x269.png" alt="" width="880" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; When you open your file, it has all the data in a single line. To display your document as in the image above, right-click in the editor window and select &lt;em&gt;Format Document&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We could also print the length of &lt;code&gt;allProducts&lt;/code&gt; after creating the JSON file for testing purposes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;fmt.Println(len(allProducts))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But we should be getting everything we want.&lt;/p&gt;

&lt;p&gt;That said, a single page isn’t enough for most projects. In fact, in almost all projects like this, we’ll need to scrape multiple pages to gather as much information as possible. Luckily, we can scale our project with just a few lines of code.&lt;/p&gt;

&lt;h3&gt;8. Scraping Multiple Pages&lt;/h3&gt;

&lt;p&gt;If we scroll down to the bottom of the product list, we can see that J&amp;amp;J is using numbered pagination on their category page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i7b3QlI7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i7b3QlI7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img20.png" alt="" width="880" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could try to figure out how they construct their URLs and see if we can mimic it with a loop, but Colly has a more elegant solution, similar to &lt;a href="https://www.scraperapi.com/blog/how-to-deal-with-pagination-in-python-step-by-step-guide-full-code/"&gt;how Scrapy navigates paginations&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnHTML("a.paging-controls__next.js-page-control", func(p *colly.HTMLElement) {
    nextPage := p.Request.AbsoluteURL(p.Attr("data-href"))
    c.Visit(nextPage)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In a new &lt;code&gt;OnHTML()&lt;/code&gt; callback, we’re targeting the next button in the pagination menu.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aeDx7fjV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img21-1024x182.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aeDx7fjV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img21-1024x182.png" alt="" width="880" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the callback function, we’re grabbing the value inside &lt;code&gt;data-href&lt;/code&gt; (which contains the URL), storing it in a new variable (&lt;code&gt;nextPage&lt;/code&gt;), and then telling our scraper to visit that page.&lt;/p&gt;

&lt;p&gt;Running the code now will bring back all the product data from every page in the pagination.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0MypEdoL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0MypEdoL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img22.png" alt="" width="629" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, our slice now contains 102 items.&lt;/p&gt;

&lt;h3&gt;9. Avoid Getting Your Colly Scraper Blocked&lt;/h3&gt;

&lt;p&gt;One thing we always have to consider when scraping the web is that most sites don’t welcome web scrapers, because many developers have no regard for the websites they extract data from.&lt;/p&gt;

&lt;p&gt;Imagine that you want to scrape thousands of pages from a site. Every request you send takes resources away from the real users, creating more expenses for the site’s owner and possibly hurting the user experience with slow load times.&lt;/p&gt;

&lt;p&gt;That’s why we should always use &lt;a href="https://www.scraperapi.com/blog/web-scraping-best-practices/"&gt;web scraping best practices&lt;/a&gt; to ensure we’re not hurting our target sites.&lt;/p&gt;

&lt;p&gt;However, there’s another side of the story. To prevent scrapers from accessing the site, more and more websites implement anti-scraping systems designed to identify and block scrapers.&lt;/p&gt;

&lt;p&gt;Although scraping a few pages won’t raise any flags in most cases, scraping many pages will definitely put your IP address and your scraper at risk.&lt;/p&gt;

&lt;p&gt;To get around these measures, we would have to write a function that changes our IP address, gain access to a pool of IP addresses for our script to rotate through, find some way to deal with CAPTCHAs, and handle JavaScript-rendered pages – which are becoming more and more common.&lt;/p&gt;

&lt;p&gt;Or we could just send our HTTP request through ScraperAPI’s server and let them handle everything automatically:&lt;/p&gt;

&lt;p&gt;1. First, we’ll only need to &lt;a href="https://www.scraperapi.com/signup"&gt;create a free ScraperAPI account to redeem 5000 free API credits&lt;/a&gt; and get access to our API key from the dashboard.&lt;br&gt;2. For simplicity, we’ll delete the &lt;code&gt;colly.AllowedDomains("www.jackjones.com")&lt;/code&gt; setting from the &lt;code&gt;collector&lt;/code&gt;.&lt;br&gt;3. We’ll add the ScraperAPI endpoint to our initial &lt;code&gt;.Visit()&lt;/code&gt; function like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.Visit("http://api.scraperapi.com?api_key={yourApiKey}&amp;amp;url=https://www.jackjones.com/nl/en/jj/shoes/")&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;4. And make a similar change for how we visit the next page:&lt;br&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.Visit("http://api.scraperapi.com?api_key={yourApiKey}&amp;amp;url=" + nextPage)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;5. For it to work properly and avoid errors, we’ll need to raise Colly’s default timeout to at least 60 seconds, giving our scraper enough time to handle any headers, CAPTCHAs, etc. We’ll use the sample transport code from Colly’s documentation, changing the dialer timeout from 30 to 60 seconds:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.WithTransport(&amp;amp;http.Transport{
    DialContext: (&amp;amp;net.Dialer{
        Timeout:   60 * time.Second,
        KeepAlive: 30 * time.Second,
        DualStack: true,
    }).DialContext,
    MaxIdleConns:          100,
    IdleConnTimeout:       90 * time.Second,
    TLSHandshakeTimeout:   10 * time.Second,
    ExpectContinueTimeout: 1 * time.Second,
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Add this code after creating the collector.&lt;/p&gt;

&lt;p&gt;Running our code will bring back the same data as before, the difference being that ScraperAPI will rotate our IP address for each request sent, look for the best proxy-and-headers combination to ensure a successful request, and handle any other complexities our scraper could encounter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6Yvmm6Mw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img23-1024x168.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6Yvmm6Mw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img23-1024x168.png" alt="" width="880" height="144"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;Colly will save all the data into a JSON file that we can use in other applications or projects.&lt;/p&gt;
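&lt;p&gt;The &lt;code&gt;json.Marshal()&lt;/code&gt; call in the code above writes everything on a single line, which is why we had to format the document in the editor earlier. If we’d rather have Go pretty-print the file for us, the standard library also offers &lt;code&gt;json.MarshalIndent()&lt;/code&gt;. Here’s a minimal, self-contained sketch – the &lt;code&gt;product&lt;/code&gt; type, the &lt;code&gt;toPrettyJSON&lt;/code&gt; helper, and the sample data are illustrative, mirroring our &lt;code&gt;products&lt;/code&gt; struct:&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// product mirrors the struct we use in the scraper.
type product struct {
	Name  string `json:"name"`
	Price string `json:"price"`
	URL   string `json:"url"`
}

// toPrettyJSON marshals the slice with a two-space indent, so the
// resulting file is readable without reformatting it by hand.
func toPrettyJSON(items []product) ([]byte, error) {
	return json.MarshalIndent(items, "", "  ")
}

func main() {
	allProducts := []product{
		{Name: "Sneaker", Price: "39.99", URL: "/nl/en/jj/shoes/sneaker"},
	}
	content, err := toPrettyJSON(allProducts)
	if err != nil {
		fmt.Println(err.Error())
	}
	fmt.Println(string(content))
}
```

&lt;p&gt;Swapping this into the scraper only means replacing the &lt;code&gt;json.Marshal(allProducts)&lt;/code&gt; call before &lt;code&gt;os.WriteFile()&lt;/code&gt;.&lt;/p&gt;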

&lt;p&gt;With just a few small changes to the code, you can scrape any website you need, as long as the information is live in the HTML file.&lt;/p&gt;

&lt;p&gt;However, we can also configure our ScraperAPI endpoint to render JavaScript content before returning the HTML document. So unless the content is behind an event (like clicking a button), you should also be able to grab dynamic content without a problem.&lt;/p&gt;
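&lt;p&gt;Rendering is enabled through ScraperAPI’s &lt;code&gt;render=true&lt;/code&gt; query parameter. Rather than concatenating the endpoint by hand, we can build it with Go’s &lt;code&gt;net/url&lt;/code&gt; package, which also escapes the target URL safely. The &lt;code&gt;scraperAPIURL&lt;/code&gt; helper below is our own convenience function, not part of any library:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net/url"
)

// scraperAPIURL builds the ScraperAPI endpoint for a target page.
// Setting render to true asks the API to execute JavaScript before
// returning the HTML document.
func scraperAPIURL(apiKey, target string, render bool) string {
	q := url.Values{}
	q.Set("api_key", apiKey)
	if render {
		q.Set("render", "true")
	}
	q.Set("url", target)
	return "http://api.scraperapi.com/?" + q.Encode()
}

func main() {
	fmt.Println(scraperAPIURL("yourApiKey", "https://www.jackjones.com/nl/en/jj/shoes/", true))
}
```

&lt;p&gt;The returned string can then be passed straight to &lt;code&gt;c.Visit()&lt;/code&gt;.&lt;/p&gt;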

&lt;h2&gt;Wrapping Up: Full Colly Web Scraper Code&lt;/h2&gt;

&lt;p&gt;Congratulations, you created your first Colly web scraper! If you’ve followed along, here’s how your codebase should look:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main
  
import (
   "encoding/json"
   "fmt"
   "net"
   "net/http"
   "os"
   "time"
  
   "github.com/gocolly/colly"
)
  
type products struct {
   Name  string `json:"name"`
   Price string `json:"price"`
   URL   string `json:"url"`
}
  
func main() {
   c := colly.NewCollector()
   c.WithTransport(&amp;amp;http.Transport{
       DialContext: (&amp;amp;net.Dialer{
           Timeout:   60 * time.Second,
           KeepAlive: 30 * time.Second,
           DualStack: true,
       }).DialContext,
       MaxIdleConns:          100,
       IdleConnTimeout:       90 * time.Second,
       TLSHandshakeTimeout:   10 * time.Second,
       ExpectContinueTimeout: 1 * time.Second,
   })
  
   var allProducts []products
  
   c.OnRequest(func(r *colly.Request) {
       fmt.Println("Scraping:", r.URL)
   })
  
   c.OnResponse(func(r *colly.Response) {
       fmt.Println("Status:", r.StatusCode)
   })
  
   c.OnHTML("div.product-tile__content-wrapper", func(h *colly.HTMLElement) {
       product := products{
           Name:  h.ChildText("a.product-tile__name__link.js-product-tile-link"),
           Price: h.ChildText("em.value__price"),
           URL:   h.ChildAttr("a.product-tile__name__link.js-product-tile-link", "href"),
       }

       allProducts = append(allProducts, product)
   })
  
   c.OnHTML("a.paging-controls__next.js-page-control", func(p *colly.HTMLElement) {
       nextPage := p.Request.AbsoluteURL(p.Attr("data-href"))
       c.Visit("http://api.scraperapi.com?api_key={yourApiKey}&amp;amp;url=" + nextPage)
   })
  
   c.OnError(func(r *colly.Response, err error) {
       fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
   })
  
   c.Visit("http://api.scraperapi.com?api_key={yourApiKey}&amp;amp;url=https://www.jackjones.com/nl/en/jj/shoes/")
  
   content, err := json.Marshal(allProducts)
   if err != nil {
       fmt.Println(err.Error())
   }
   os.WriteFile("jack-shoes.json", content, 0644)
   fmt.Println("Total products: ", len(allProducts))
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Web scraping is one of the most powerful tools to have in your data collection arsenal. However, remember that every website is built somewhat differently. Focus on the fundamentals of website structure, and you’ll be able to solve any problem that comes your way.&lt;/p&gt;

&lt;p&gt;Until next time, happy scraping!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>go</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Use Python to Loop Through HTML Tables and Scrape Tabular Data</title>
      <dc:creator>ScraperAPI Zoltan Bettenbuk</dc:creator>
      <pubDate>Wed, 31 Aug 2022 04:40:07 +0000</pubDate>
      <link>https://forem.com/scraperapi/how-to-use-python-to-loop-through-html-tables-and-scrape-tabular-data-df9</link>
      <guid>https://forem.com/scraperapi/how-to-use-python-to-loop-through-html-tables-and-scrape-tabular-data-df9</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PiEeMJRn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/python-scrape-htmltables-tabular-data.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PiEeMJRn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/python-scrape-htmltables-tabular-data.jpg" alt="" width="880" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tabular data is one of the best sources of data on the web. Tables can store a massive amount of useful information without losing their easy-to-read format, making them gold mines for data-related projects.&lt;/p&gt;

&lt;p&gt;Whether it is to &lt;a href="https://www.scraperapi.com/blog/how-to-scrape-football-data/"&gt;scrape football data&lt;/a&gt; or &lt;a href="https://www.scraperapi.com/blog/how-to-scrape-stock-market-data-with-python/"&gt;extract stock market data&lt;/a&gt;, we can use Python to quickly access, parse and extract data from HTML tables, thanks to Requests and Beautiful Soup.&lt;/p&gt;

&lt;p&gt;Also, we have a little black and white surprise for you at the end, so keep reading!&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding HTML Table’s Structure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SpS4mHGn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img1-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SpS4mHGn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img1-python-loop-thru-html-tabular-data.png" alt="" width="877" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Visually, an &lt;a href="https://www.w3schools.com/tags/tag_table.asp"&gt;HTML table&lt;/a&gt; is a set of rows and columns displaying information in a tabular format. For this tutorial, we’ll be scraping the table above:&lt;/p&gt;

&lt;p&gt;To be able to scrape the data contained within this table, we’ll need to go a little deeper into its coding.&lt;/p&gt;

&lt;p&gt;Generally speaking, HTML tables are actually built using the following HTML tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt;: marks the start of an HTML table&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;th&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;thead&amp;gt;&lt;/code&gt;: defines a row as the heading of the table&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;tbody&amp;gt;&lt;/code&gt;: indicates the section where the data is&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt;: indicates a row in the table&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt;: defines a cell in the table&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;However, as we’ll see in real-life scenarios, not all developers respect these conventions when building their tables, making some projects harder than others. Still, understanding how they work is crucial for finding the right approach.&lt;/p&gt;
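&lt;p&gt;To see how these tags fit together before touching the real page, here’s a quick sketch that parses a tiny, made-up table with Beautiful Soup – the same approach we’ll apply below:&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# A minimal, made-up table that follows the conventions above
html = """
<table>
  <thead><tr><th>Name</th><th>Office</th></tr></thead>
  <tbody>
    <tr><td>Airi Satou</td><td>Tokyo</td></tr>
    <tr><td>Angelica Ramos</td><td>London</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Loop through the rows inside tbody and pull the text of every cell
table_data = []
for row in soup.find('tbody').find_all('tr'):
    table_data.append([td.text for td in row.find_all('td')])

print(table_data)  # [['Airi Satou', 'Tokyo'], ['Angelica Ramos', 'London']]
```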

&lt;p&gt;Let’s enter the table’s URL (&lt;a href="https://datatables.net/examples/styling/stripe.html"&gt;https://datatables.net/examples/styling/stripe.html&lt;/a&gt;) in our browser and inspect the page to see what’s happening under the hood.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SwcTJWp5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img2-python-loop-thru-html-tabular-data-1024x712.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SwcTJWp5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img2-python-loop-thru-html-tabular-data-1024x712.png" alt="" width="880" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why this is a great page to practice scraping tabular data with Python. There’s a clear &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt; tag pair opening and closing the table, and all the relevant data is inside the &lt;code&gt;&amp;lt;tbody&amp;gt;&lt;/code&gt; tag. It only shows ten rows, which matches the number of entries selected on the front end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C9LwQBDP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img3-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C9LwQBDP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img3-python-loop-thru-html-tabular-data.png" alt="" width="426" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few more things to know about this table: it has a total of 57 entries we’ll want to scrape, and there seem to be two ways to access the data. The first is clicking the drop-down menu and selecting “100” to show all entries:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GeQWShuq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img-4-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GeQWShuq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img-4-python-loop-thru-html-tabular-data.png" alt="" width="180" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or clicking on the next button to move through the pagination.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QstPyUxw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img5-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QstPyUxw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img5-python-loop-thru-html-tabular-data.png" alt="" width="404" height="57"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So which one should we pick? Either of these solutions would add extra complexity to our script, so instead, let’s first check where the data is being pulled from.&lt;/p&gt;

&lt;p&gt;Of course, because this is an HTML table, all the data should be in the HTML file itself, with no need for an AJAX injection. To verify this, &lt;em&gt;Right Click&lt;/em&gt; &amp;gt; &lt;em&gt;View Page Source&lt;/em&gt;, then copy a few cells and search for them in the source code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_bD61A7G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img6-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_bD61A7G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img6-python-loop-thru-html-tabular-data.png" alt="" width="858" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We did the same thing for a couple more entries from different paginated pages, and yes, it seems like all our target data is there even though the front end doesn’t display it.&lt;/p&gt;

&lt;p&gt;And with this information, we’re ready to move to the code!&lt;/p&gt;
&lt;h2&gt;
  
  
  Scraping HTML Tables Using Python’s Beautiful Soup
&lt;/h2&gt;

&lt;p&gt;Because all the employee data we’re looking to scrape is in the HTML file, we can use the &lt;a href="https://github.com/psf/requests"&gt;Requests&lt;/a&gt; library to send the HTTP request and &lt;a href="https://www.scraperapi.com/blog/what-is-data-parsing/"&gt;parse the response&lt;/a&gt; using &lt;a href="https://github.com/wention/BeautifulSoup4"&gt;Beautiful Soup&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you’re new to web scraping, we’ve created a &lt;a href="https://www.scraperapi.com/blog/web-scraping-python/"&gt;web scraping in Python tutorial for beginners&lt;/a&gt;. Although you’ll be able to follow along without experience, it’s always a good idea to start from the basics.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Sending Our Main Request
&lt;/h3&gt;

&lt;p&gt;Let’s create a new directory for the project named &lt;em&gt;python-html-table&lt;/em&gt;, then a new folder named &lt;em&gt;bs4-table-scraper&lt;/em&gt;, and finally create a new &lt;em&gt;python_table_scraper.py&lt;/em&gt; file.&lt;/p&gt;

&lt;p&gt;From the terminal, let’s run &lt;code&gt;pip3 install requests beautifulsoup4&lt;/code&gt; and import both packages into our project as follows:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;To send an HTTP request with Requests, all we need to do is set a URL, pass it to &lt;code&gt;requests.get()&lt;/code&gt;, store the returned HTML inside a &lt;code&gt;response&lt;/code&gt; variable, and print &lt;code&gt;response.status_code&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you’re totally new to Python, you can run your code from the terminal with the command &lt;code&gt;python3 python_table_scraper.py&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;url = 'https://datatables.net/examples/styling/stripe.html'

response = requests.get(url)

print(response.status_code)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;If it’s working, it will return a 200 status code. Anything else means your IP is being rejected by the anti-scraping systems the website has in place. A potential solution is &lt;a href="https://www.scraperapi.com/blog/headers-and-cookies-for-web-scraping/"&gt;adding custom headers to your script&lt;/a&gt; to make it look more human – but that might not be enough. Another solution is using a web scraping API to handle all these complexities for you.&lt;/p&gt;
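&lt;p&gt;As a quick illustration of the custom-headers approach, here’s a minimal sketch – the header values are examples copied from a desktop browser, not a guaranteed fix:&lt;/p&gt;

```python
import requests

# Illustrative headers that make the request look more like a regular
# browser; real values can be copied from your browser's Network tab
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

url = 'https://datatables.net/examples/styling/stripe.html'

# Preparing the request without sending it lets us inspect what would go out
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])

# To actually send it: response = requests.get(url, headers=headers)
```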
&lt;h3&gt;
  
  
  2. Integrating ScraperAPI to Avoid Anti-Scraping systems
&lt;/h3&gt;

&lt;p&gt;ScraperAPI is an elegant solution to avoid almost any type of anti-scraping technique. It uses machine learning and years of statistical analysis to determine the best headers and IP combinations to access the data, handle CAPTCHAs and rotate your IP between each request.&lt;/p&gt;

&lt;p&gt;To start, let’s &lt;a href="https://www.scraperapi.com/signup"&gt;create a new ScraperAPI free account&lt;/a&gt; to redeem 5000 free API credits and get our API key. From our account’s dashboard, we can copy the key value to build the URL of the request.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://api.scraperapi.com?api_key={Your_API_KEY}&amp;amp;amp;amp;amp;amp;amp;url={TARGET_URL}
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Following this structure, we replace the placeholders with our data and send our request again:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup

url = 'http://api.scraperapi.com?api_key=51e43be283e4db2a5afbxxxxxxxxxxxx&amp;amp;url=https://datatables.net/examples/styling/stripe.html'

response = requests.get(url)

print(response.status_code)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QPUzXOHs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img7-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QPUzXOHs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img7-python-loop-thru-html-tabular-data.png" alt="" width="455" height="39"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Awesome, it’s working without any hiccup!&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Building the Parser Using Beautiful Soup
&lt;/h3&gt;

&lt;p&gt;Before we can extract the data, we need to turn the raw HTML into formatted or parsed data. We’ll store this parsed HTML into a soup object like this:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;soup = BeautifulSoup(response.text, 'html.parser')
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;From here, we can traverse the parse tree using the HTML tags and their attributes.&lt;/p&gt;

&lt;p&gt;If we go back to the table on the page, we’ve already seen that the table is enclosed between &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt; tags with the class &lt;code&gt;stripe dataTable&lt;/code&gt;, which we can use to select the table.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table = soup.find('table', class_ = 'stripe')
print(table)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; After testing, adding the second class (&lt;code&gt;dataTable&lt;/code&gt;) didn’t return the element; in the returned HTML, the table’s class is only &lt;code&gt;stripe&lt;/code&gt;. You can also select the table using &lt;code&gt;id = 'example'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here’s what it returns:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7qjc7iZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img8-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7qjc7iZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img8-python-loop-thru-html-tabular-data.png" alt="" width="450" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we grabbed the table, we can loop through the rows and grab the data we want.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Looping Through the HTML Table
&lt;/h3&gt;

&lt;p&gt;Thinking back to the table’s structure, every row is represented by a &lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt; element, and within each of them there are &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt; elements containing the data; all of this is wrapped between a &lt;code&gt;&amp;lt;tbody&amp;gt;&lt;/code&gt; tag pair.&lt;/p&gt;

&lt;p&gt;To extract the data, we’ll create two for loops: one to grab the &lt;code&gt;&amp;lt;tbody&amp;gt;&lt;/code&gt; section of the table (where all the rows are) and another to store every row in a variable we can use:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for employee_data in table.find_all('tbody'):
   rows = employee_data.find_all('tr')
   print(rows)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;In rows we’ll store all the &lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt; elements found within the body section of the table. If you’re following our logic, the next step is to store each individual row into a single object and loop through them to find the desired data.&lt;/p&gt;

&lt;p&gt;For starters, let’s try to pick the first employee’s name in our browser’s console using the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll"&gt;.querySelectorAll()&lt;/a&gt; method. A really useful feature of this method is that we can go deeper and deeper into the hierarchy using the greater-than (&amp;gt;) symbol to define the parent element (on the left) and the child we want to grab (on the right).&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;document.querySelectorAll('table.stripe &amp;amp;amp;amp;amp;amp;gt; tbody &amp;amp;amp;amp;amp;amp;gt; tr &amp;amp;amp;amp;amp;amp;gt; td')[0]
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z0p4lbeI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/browser-console-test1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z0p4lbeI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/browser-console-test1.gif" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That couldn’t work any better. As you can see, once we grab all the &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt; elements, they become a NodeList. Because we can’t rely on a class to grab each cell, all we need to know is their position in the index; the first one, name, is 0.&lt;/p&gt;

&lt;p&gt;From there, we can write our code like this:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for row in rows:
    name = row.find_all('td')[0].text
    print(name)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;In simple terms, we’re taking each row, one by one, and finding all the cells inside. Once we have the list, we grab only the first one in the index (position 0) and finish with the &lt;code&gt;.text&lt;/code&gt; method to grab only the element’s text, ignoring the HTML we don’t need.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SUKVTLUH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img9-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SUKVTLUH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img9-python-loop-thru-html-tabular-data.png" alt="" width="700" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There they are: a list with all the employees’ names! For the rest, we just follow the same logic:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;position = row.find_all('td')[1].text
office = row.find_all('td')[2].text
age = row.find_all('td')[3].text
start_date = row.find_all('td')[4].text
salary = row.find_all('td')[5].text
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;However, having all this data printed on our console isn’t super helpful. Instead, let’s store this data into a new, more useful format.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Storing Tabular Data Into a JSON File
&lt;/h3&gt;

&lt;p&gt;Although we could easily create a CSV file and send our data there, that wouldn’t be the most manageable format if we want to create something new using the scraped data.&lt;/p&gt;

&lt;p&gt;Still, here’s a project we did a few months ago explaining how to &lt;a href="https://www.scraperapi.com/blog/linkedin-scraper-python/"&gt;create a CSV file to store scraped data&lt;/a&gt;.&lt;/p&gt;
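&lt;p&gt;For reference, the CSV route only takes a few lines with Python’s built-in &lt;code&gt;csv&lt;/code&gt; module. Here’s a sketch using a sample row shaped like the data we’re scraping:&lt;/p&gt;

```python
import csv

# A sample row shaped like our scraped employee data
employee_list = [
    {'Name': 'Airi Satou', 'Position': 'Accountant', 'Office': 'Tokyo',
     'Age': '33', 'Start date': '2008/11/28', 'salary': '$162,700'},
]

# DictWriter maps each dictionary onto a CSV row; the dictionary keys
# become the header row
with open('employee_data.csv', 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=employee_list[0].keys())
    writer.writeheader()
    writer.writerows(employee_list)
```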

&lt;p&gt;The good news is that Python has its own JSON module for working with JSON objects, so we don’t need to install anything, just import it.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;But before we can go ahead and create our JSON file, we’ll need to turn all this scraped data into a list. To do so, we’ll create an empty array outside of our loop.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee_list = []
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;And then append the data to it, with each loop appending a new object to the array.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee_list.append({
    'Name': name,
    'Position': position,
    'Office': office,
    'Age': age,
    'Start date': start_date,
    'salary': salary
})
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;If we &lt;code&gt;print(employee_list)&lt;/code&gt;, here’s the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ipEWb6Xz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img10-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ipEWb6Xz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img10-python-loop-thru-html-tabular-data.png" alt="" width="880" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Still a little messy, but we have a set of objects ready to be transformed into JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; As a test, we printed the length of &lt;code&gt;employee_list&lt;/code&gt; and it returned 57, which is the correct number of rows we scraped (rows now being objects within the array).&lt;/p&gt;

&lt;p&gt;Importing a list to JSON just requires two lines of code:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open('json_data', 'w') as json_file:
   json.dump(employee_list, json_file, indent=2)
&lt;/code&gt;&lt;/pre&gt;


&lt;ul&gt;
&lt;li&gt;  First, we open a new file, passing in the name we want for the file (&lt;code&gt;json_data&lt;/code&gt;) and &lt;code&gt;'w'&lt;/code&gt;, as we want to write data to it.&lt;/li&gt;
&lt;li&gt;  Next, we use the &lt;code&gt;.dump()&lt;/code&gt; function to, well, dump the data from the array (&lt;code&gt;employee_list&lt;/code&gt;), with &lt;code&gt;indent=2&lt;/code&gt; so every object gets its own line instead of everything sitting on one unreadable line.&lt;/li&gt;
&lt;/ul&gt;
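&lt;p&gt;To double-check the export, we can read the file back with &lt;code&gt;json.load()&lt;/code&gt; and compare it against the original list. A small sketch with sample data:&lt;/p&gt;

```python
import json

# Sample data standing in for our scraped list
employee_list = [{'Name': 'Airi Satou', 'Position': 'Accountant'}]

with open('json_data', 'w') as json_file:
    json.dump(employee_list, json_file, indent=2)

# Reading the file back returns the same list of dictionaries
with open('json_data') as json_file:
    loaded = json.load(json_file)

print(loaded == employee_list)  # True
```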

&lt;h3&gt;
  
  
  6. Running the Script and Full Code
&lt;/h3&gt;

&lt;p&gt;If you’ve been following along, &lt;strong&gt;your codebase should look like this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#dependencies
import requests
from bs4 import BeautifulSoup
import json

url = 'http://api.scraperapi.com?api_key=51e43be283e4db2a5afbxxxxxxxxxxx&amp;amp;url=https://datatables.net/examples/styling/stripe.html'

#empty array
employee_list = []

#requesting and parsing the HTML file
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

#selecting the table
table = soup.find('table', class_ = 'stripe')
#storing all rows into one variable
for employee_data in table.find_all('tbody'):
   rows = employee_data.find_all('tr')
   #looping through the HTML table to scrape the data
   for row in rows:
       name = row.find_all('td')[0].text
       position = row.find_all('td')[1].text
       office = row.find_all('td')[2].text
       age = row.find_all('td')[3].text
       start_date = row.find_all('td')[4].text
       salary = row.find_all('td')[5].text
       #sending scraped data to the empty array
       employee_list.append({
           'Name': name,
           'Position': position,
           'Office': office,
           'Age': age,
           'Start date': start_date,
           'salary': salary
       })
#importing the array to a JSON file
with open('employee_data', 'w') as json_file:
   json.dump(employee_list, json_file, indent=2)
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; We added some comments for context.&lt;/p&gt;

&lt;p&gt;And here’s a look at the first three objects from the JSON file:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0dzmuE0v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img11-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0dzmuE0v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img11-python-loop-thru-html-tabular-data.png" alt="" width="427" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Storing scraped data in JSON format allows us to repurpose the information for new applications or further analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping HTML Tables Using Pandas
&lt;/h2&gt;

&lt;p&gt;Before you leave the page, we want to explore a second approach for scraping HTML tables. In a few lines of code, we can grab all tabular data from an HTML document and store it in a DataFrame using Pandas.&lt;/p&gt;

&lt;p&gt;Create a new folder inside the project’s directory (we named it &lt;em&gt;pandas-html-table-scraper&lt;/em&gt;) and create a new file named &lt;em&gt;pandas_table_scraper.py&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Let’s open a new terminal, navigate to the folder we just created (&lt;code&gt;cd pandas-html-table-scraper&lt;/code&gt;), and from there install Pandas:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pandas
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;And we import it at the top of the file.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Pandas has a function called &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.read_html.html"&gt;read_html()&lt;/a&gt; which, essentially, scrapes the target URL for us and returns all HTML tables as a list of DataFrame objects.&lt;/p&gt;

&lt;p&gt;However, for this to work, the HTML table needs to be at least somewhat decently structured, as the function looks for elements like &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt; to identify the tables in the file.&lt;/p&gt;

&lt;p&gt;To use the function, let’s create a new variable and pass the URL we used previously to it:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee_data = pd.read_html('http://api.scraperapi.com?api_key=51e43be283e4db2a5afbxxxxxxxxxxxx&amp;amp;url=https://datatables.net/examples/styling/stripe.html')
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Printing the variable returns a list of the HTML tables found within the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qojrCZwg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img12-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qojrCZwg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img12-python-loop-thru-html-tabular-data.png" alt="" width="622" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we compare the first three rows in the DataFrame, they’re a perfect match for what we scraped with Beautiful Soup.&lt;/p&gt;

&lt;p&gt;To work with JSON, Pandas has a built-in &lt;a href="https://appdividend.com/2022/03/15/pandas-to_json/#:~:text=To%20convert%20the%20object%20to,use%20the%20to_json()%20function."&gt;.to_json()&lt;/a&gt; function. It converts a DataFrame object into a JSON string.&lt;/p&gt;

&lt;p&gt;All we need to do is call the method on our DataFrame, pass in the path, set the format (&lt;code&gt;split&lt;/code&gt;, &lt;code&gt;records&lt;/code&gt;, &lt;code&gt;index&lt;/code&gt;, etc.), and add an indent to make it more readable:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee_data[0].to_json('./employee_list.json', orient='index', indent=2)
&lt;/code&gt;&lt;/pre&gt;
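&lt;p&gt;If you want to inspect the output before writing a file, &lt;code&gt;.to_json()&lt;/code&gt; returns the JSON as a string when no path is passed. Here’s a quick sketch on a tiny hand-made DataFrame (sample data, not the scraped table):&lt;/p&gt;

```python
import pandas as pd

# A tiny stand-in for the scraped employee table
df = pd.DataFrame([
    {"Name": "Airi Satou", "Position": "Accountant", "Office": "Tokyo"},
    {"Name": "Angelica Ramos", "Position": "CEO", "Office": "London"},
])

# With no path argument, to_json() returns the string instead of writing a file;
# orient='index' keys each record by its row index, indent=2 pretty-prints it
json_string = df.to_json(orient="index", indent=2)
print(json_string)
```

&lt;p&gt;The same &lt;code&gt;orient&lt;/code&gt; and &lt;code&gt;indent&lt;/code&gt; arguments work whether Pandas writes to a file or returns the string.&lt;/p&gt;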


&lt;p&gt;If we run our code now, here’s the resulting file:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zFEf-NC7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img13-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zFEf-NC7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img13-python-loop-thru-html-tabular-data.png" alt="" width="360" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that we needed to select our table at index &lt;code&gt;[0]&lt;/code&gt; because &lt;code&gt;.read_html()&lt;/code&gt; returns a list, not a single object.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s the full code for your reference:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

employee_data = pd.read_html('http://api.scraperapi.com?api_key=51e43be283e4db2a5afbxxxxxxxxxxxx&amp;amp;url=https://datatables.net/examples/styling/stripe.html')

employee_data[0].to_json('./employee_list.json', orient='index', indent=2)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Armed with this new knowledge, you’re ready to start scraping virtually any HTML table on the web. Just remember that if you understand how the website is structured and the logic behind it, there’s nothing you can’t scrape.&lt;/p&gt;

&lt;p&gt;That said, these methods only work as long as the data is inside the HTML file. If you encounter a dynamically generated table, you’ll need to find a new approach. For these types of tables, we’ve created a &lt;a href="https://www.scraperapi.com/blog/scrape-javascript-tables-python/"&gt;step-by-step guide to scraping JavaScript tables with Python&lt;/a&gt; without the need for headless browsers.&lt;/p&gt;

&lt;p&gt;Until next time, happy scraping!&lt;/p&gt;

&lt;p&gt;Originally published on Scraper API: &lt;a href="https://www.scraperapi.com/blog/python-loop-through-html-table/"&gt;How to Use Python to Loop Through HTML Tables and Scrape Tabular Data&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Scrape HTML Tables in JavaScript [Export Table Data to a CSV]</title>
      <dc:creator>ScraperAPI Zoltan Bettenbuk</dc:creator>
      <pubDate>Mon, 18 Jul 2022 18:56:29 +0000</pubDate>
      <link>https://forem.com/scraperapi/how-to-scrape-html-tables-in-javascript-export-table-data-to-a-csv-239b</link>
      <guid>https://forem.com/scraperapi/how-to-scrape-html-tables-in-javascript-export-table-data-to-a-csv-239b</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m571PPpe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/htmltable-to-csv-feat.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m571PPpe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/htmltable-to-csv-feat.jpg" alt="" width="880" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published on &lt;a href="https://www.scraperapi.com/blog/scrape-html-table-to-csv/"&gt;ScraperAPI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;HTML tables are the best data sources on the web. They are easy to understand and can hold an immense amount of data in a simple-to-read and understand format. Being able to scrape HTML tables is a crucial skill to develop for any developer interested in data science or in data analysis in general.&lt;/p&gt;

&lt;p&gt;In this tutorial, we’re going to go deeper into HTML tables and build a simple, yet powerful, script to extract tabular data and export it to a CSV file.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an HTML Web Table?
&lt;/h2&gt;

&lt;p&gt;An &lt;a href="https://www.w3schools.com/tags/tag_table.asp"&gt;HTML table&lt;/a&gt; is a set of rows and columns that are used to display information in a grid format directly on a web page. They are commonly used to display tabular data, such as spreadsheets or databases, and are a great source of data for our projects.&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://www.scraperapi.com/blog/how-to-scrape-football-data/"&gt;sports data&lt;/a&gt; and weather data to books and authors’ data, most big datasets on the web are accessible through HTML tables because of how great they are to display information in a structured and easy-to-navigate format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U3pXTPh4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/wiki-table1-1024x412.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U3pXTPh4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/wiki-table1-1024x412.png" alt="" width="880" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The great news for us is that, unlike dynamically generated content, the HTML table’s data lives directly inside of the table element in the HTML file, meaning that we can scrape all the information we need exactly as we would with other elements of the web – as long as we understand their structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding HTML Table’s Structure
&lt;/h2&gt;

&lt;p&gt;Though you can only see the columns and rows in the front end, these tables are actually created using a few different HTML tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt;: It marks the start of an HTML table&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt;: Indicates a row in the table&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt;: Defines a cell in the table&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The content goes inside the &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt; tags, while &lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt; tags group cells into a row. In other words, an HTML table follows a &lt;em&gt;Table &amp;gt; Row &amp;gt; Cell&lt;/em&gt; (&lt;em&gt;table &amp;gt; tr &amp;gt; td&lt;/em&gt;) hierarchy.&lt;/p&gt;

&lt;p&gt;A special cell can be created using the &lt;code&gt;&amp;lt;th&amp;gt;&lt;/code&gt; (&lt;em&gt;table header&lt;/em&gt;) tag. Using it for the cells of the first row indicates that the row is the heading of the table.&lt;/p&gt;

&lt;p&gt;Here is an example to create a simple two-row and two-column based HTML table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gist.github.com/scraperapikerins/ada86e5a5c5398599b77d4399be543ed"&gt;https://gist.github.com/scraperapikerins/ada86e5a5c5398599b77d4399be543ed&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;table&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;th&amp;gt;&lt;/span&gt;Pet 1&lt;span class="nt"&gt;&amp;lt;/th&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;th&amp;gt;&lt;/span&gt;Pet 2&lt;span class="nt"&gt;&amp;lt;/th&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Dog&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Cat&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/table&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;There’s one major difference when scraping HTML tables though. Unlike other elements on a web page, CSS selectors target the overall cells and rows – or even the entire table – because all of these elements are actually components of the &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt; element.&lt;/p&gt;

&lt;p&gt;Instead of targeting a CSS selector for each data point we want to scrape, we’ll need to create a list of all the rows in the table and loop through them to grab the data from their cells.&lt;/p&gt;

&lt;p&gt;If we understand this logic, creating our script is actually pretty straightforward.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scraping HTML Tables to CSV with Node.JS
&lt;/h2&gt;

&lt;p&gt;If this is your first time using Node.JS for web scraping, it might be useful to go through some of our previous tutorials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.scraperapi.com/blog/web-scraping-javascript-tutorial/"&gt;Web Scraping with JavaScript and Node.js&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.scraperapi.com/uncategorized/how-to-build-a-linkedin-scraper/"&gt;How to Build a LinkedIn Scraper For Free&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.scraperapi.com/blog/how-to-scrape-football-data/"&gt;How to Build a Football Data Scraper Step-by-Step&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, we’ll keep this tutorial as beginner-friendly as possible so you can use it even as a starting point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; For Node.JS installation instructions, please refer to the first article on the list.&lt;/p&gt;

&lt;p&gt;For today’s project, we’ll build a web scraper using &lt;a href="https://www.npmjs.com/package/axios"&gt;Axios&lt;/a&gt; and &lt;a href="https://cheerio.js.org/"&gt;Cheerio&lt;/a&gt; to scrape the employee data displayed on &lt;a href="https://datatables.net/examples/styling/display.html"&gt;https://datatables.net/examples/styling/display.html&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SixVyxw2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/data-tables2-1024x632.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SixVyxw2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/data-tables2-1024x632.jpg" alt="" width="880" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ll be extracting the name, position, office, age, start date, and salary for each employee, and then send the data to a CSV using the &lt;a href="https://www.npmjs.com/package/objects-to-csv"&gt;ObjectsToCsv&lt;/a&gt; package.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Getting Our Files Ready
&lt;/h3&gt;

&lt;p&gt;To kickstart our project, let’s create a new directory named &lt;em&gt;html-table-scraper&lt;/em&gt;, open the new folder on VScode (or your code editor of preference) and open a new terminal.&lt;/p&gt;

&lt;p&gt;In the terminal, we’ll run &lt;code&gt;npm init -y&lt;/code&gt; to start a new Node.JS project. You’ll now have a &lt;em&gt;package.json&lt;/em&gt; file in your folder.&lt;/p&gt;

&lt;p&gt;Next, we’ll install our dependencies using the following commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Axios: &lt;code&gt;npm install axios&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Cheerio: &lt;code&gt;npm install cheerio&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  ObjectsToCsv: &lt;code&gt;npm install objects-to-csv&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or, for a one-command installation: &lt;code&gt;npm i axios cheerio objects-to-csv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now we can create a new file named &lt;em&gt;tablescraper.js&lt;/em&gt; and import our dependencies at the top.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cheerio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ObjectsToCsv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;objects-to-csv&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Your project should now look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W2uO9Mis--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/html-table-scraper3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W2uO9Mis--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/html-table-scraper3.png" alt="" width="630" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Testing the Target Site Using DevTools
&lt;/h3&gt;

&lt;p&gt;Before writing the code, we need to understand how the website is structured. Yes, all tables use the same basic structure, but that doesn’t mean they’re all created equally.&lt;/p&gt;

&lt;p&gt;The first thing we need to determine is whether or not this is, in fact, an HTML table. It’s very common for sites to use JavaScript to inject data into their tables, especially if there’s any real-time data involved. In those cases, we’d have to use a totally different approach, like a headless browser.&lt;/p&gt;

&lt;p&gt;To test if the data is inside the HTML file, all we need to do is copy some data points – let’s say the name – and look for it in the source code of the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K-quFp97--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/view-source-code4-1024x483.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K-quFp97--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/view-source-code4-1024x483.jpg" alt="" width="880" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We did the same for other names and data points just to make sure, and yes, all the data is right there at our disposal. Another interesting surprise is that all the rows of the table are inside the raw HTML, even though there seems to be some kind of pagination on the front end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hORe5v92--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/pagination5-1024x101.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hORe5v92--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/pagination5-1024x101.png" alt="" width="880" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Plus, we also now know that there are a total of 57 rows to scrape. This is important because we can know whether or not we’re actually grabbing all the data available.&lt;/p&gt;

&lt;p&gt;The second thing we want to test directly in the browser is our selectors. Instead of sending a bunch of unnecessary requests, we can use the browser’s console to grab elements with the &lt;code&gt;document.querySelectorAll()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;If we go to the console and type &lt;code&gt;document.querySelectorAll('table')&lt;/code&gt;, it returns four different tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--prkEzirH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/query-selector6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--prkEzirH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/query-selector6.png" alt="" width="866" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mousing over the tables, we quickly realized that the first table (number 0) is the right one. So let’s do it again, but specifying the class – which in the console’s list appears after the dots (.).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6rKGnBDj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/class-selector7-1024x470.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6rKGnBDj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/class-selector7-1024x470.jpg" alt="" width="880" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great, we’re one step closer to our data!&lt;/p&gt;

&lt;p&gt;Taking a closer look, the table’s data is wrapped in a &lt;code&gt;&amp;lt;tbody&amp;gt;&lt;/code&gt; tag, so let’s add it to our selector to make sure we’re only grabbing the rows containing the data we want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KmB57t7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/body-tag8-1024x416.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KmB57t7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/body-tag8-1024x416.png" alt="" width="880" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lastly, we’ll want to grab all the rows and verify that our selector is picking up all 57 of them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OQXegSIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/grab-rows9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OQXegSIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/grab-rows9.jpg" alt="" width="880" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Because we’re using the console to select elements on the rendered HTML, we needed to set the number of displayed items to 100. Otherwise, our selector in the console would only show 10 node items.&lt;/p&gt;

&lt;p&gt;With all this information, we can now start writing our code!&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Sending Our HTTP Request and Parsing the Raw HTML
&lt;/h3&gt;

&lt;p&gt;Axios makes it super easy to send HTTP requests inside an &lt;code&gt;async&lt;/code&gt; function. All we need to do is create the function, await the Axios call with our URL, and store the result in a constant named &lt;code&gt;response&lt;/code&gt;. We’ll also log the status code of the response (which should be 200 for a successful request).&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;html_scraper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://datatables.net/examples/styling/display.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--skKA6V3G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/async-function10-1024x149.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--skKA6V3G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/async-function10-1024x149.png" alt="" width="880" height="128"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can name these variables as you’d like, but keep them as descriptive as possible.&lt;/p&gt;

&lt;p&gt;Next, we’ll store the data from the response (the raw HTML) in a new constant named &lt;code&gt;html&lt;/code&gt; so we can pass it to Cheerio for parsing using &lt;code&gt;cheerio.load()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;h3&gt;
  
  
  4. Iterating Through the HTML Table Rows
&lt;/h3&gt;

&lt;p&gt;Using the selector we’ve tested before, let’s select all the rows inside the HTML table.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allRows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;table.display &amp;amp;amp;gt; tbody &amp;amp;amp;gt; tr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allRows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;For testing purposes, let’s &lt;code&gt;console.log()&lt;/code&gt; the length of &lt;code&gt;allRows&lt;/code&gt; to verify that, indeed, we’ve picked up all our target rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3ZoZqCjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/console-log-allrows11-1024x164.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3ZoZqCjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/console-log-allrows11-1024x164.png" alt="" width="880" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;57 is exactly what we were aiming for!&lt;/p&gt;

&lt;p&gt;Of course, to go through the list of rows, we’ll be using the &lt;code&gt;.each()&lt;/code&gt; method, but there’s one more thing we need to figure out: the order of the cells.&lt;/p&gt;

&lt;p&gt;Unlike common HTML elements, cells don’t have a unique class assigned to them, so trying to scrape each data point by CSS class could get messy. Instead, we’re going to target each &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt;’s position within its row.&lt;/p&gt;

&lt;p&gt;In other words, we’ll tell our script to go to each row, select all cells inside the row, and then store each data point in a variable based on its position within the row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In Node.JS, arrays are zero-indexed. So the first cell would be at position &lt;code&gt;[0]&lt;/code&gt; and the second cell at &lt;code&gt;[1]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But how do we know which position is which? We go back to our browser’s console and test it out:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nUEZhOdl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/cell-position-testing.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nUEZhOdl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/cell-position-testing.gif" alt="" width="880" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we know where each element is in relation to the rest, here’s the finished parser:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;allRows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;amp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;td&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;office&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;startDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;h3&gt;
  
  
  5. Pushing the Scraped Data Into an Empty Array
&lt;/h3&gt;

&lt;p&gt;If we &lt;code&gt;console.log()&lt;/code&gt; the scraped data, we’ll see that we’re extracting the text out of each cell, but the results are disorganized, which in turn makes it harder to build our CSV file.&lt;/p&gt;

&lt;p&gt;So before we export our data, let’s give it some order by pushing the data to an empty array to create a simple node list.&lt;/p&gt;

&lt;p&gt;First, declare an empty array outside of the main function; if you declare it inside the &lt;code&gt;.each()&lt;/code&gt; loop, it will be reset with every iteration, which is not something we want.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;employeeData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Then, as part of our parser, let’s use the &lt;code&gt;.push()&lt;/code&gt; method to store our data in the array we’ve just created.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;employeeData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Position&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Office&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;office&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Age&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Start Date&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;startDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Salary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;As always, let’s &lt;code&gt;console.log()&lt;/code&gt; the length of &lt;code&gt;employeeData&lt;/code&gt; to make sure that we now have 57 items in it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--quoDSFcR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/employeesdata12-1024x725.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--quoDSFcR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/employeesdata12-1024x725.jpg" alt="" width="880" height="623"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For visual context, we can also log the array to see what’s stored inside.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m32yJ_tb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/log-array13.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m32yJ_tb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/log-array13.jpg" alt="" width="632" height="900"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, each row is now stored as an object that holds every piece of data in a structured format.&lt;/p&gt;
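&lt;p&gt;One caveat: &lt;code&gt;.text()&lt;/code&gt; returns raw strings, so fields like Age and Salary come out as text and may carry stray whitespace. As a small, hypothetical addition to the tutorial (the helper names below are our own, not part of the original script), here’s one way you might normalize the values before pushing them:&lt;/p&gt;

```javascript
// Hypothetical normalizers for scraped cell text (not part of the tutorial).
// cleanText() strips stray whitespace; parseSalary() turns "$162,700" into 162700.
function cleanText(value) {
  return value.trim();
}

function parseSalary(value) {
  // Drop the currency symbol and thousands separators, then parse as a number.
  return Number(value.replace(/[^0-9.]/g, ""));
}

console.log(cleanText("  Airi Satou  ")); // "Airi Satou"
console.log(parseSalary("$162,700"));     // 162700
```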

&lt;h3&gt;
  
  
  6. Sending Scraped Data to a CSV File
&lt;/h3&gt;

&lt;p&gt;With our data organized, we can pass our list to  &lt;code&gt;ObjectsToCsv&lt;/code&gt;  and it’ll create the file for us with no extra work:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;csv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;ObjectsToCsv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;employeeData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toDisk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./employeeData.csv&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;All we need to do is create a new csv object by passing the array to &lt;code&gt;ObjectsToCsv&lt;/code&gt;, then tell it to save the file on our machine by providing the path.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. HTML Table Scraper [Full Code]
&lt;/h3&gt;

&lt;p&gt;Congratulations, you’ve officially created your first HTML table scraper! Compare your code to the finished codebase of this tutorial to ensure you haven’t missed anything:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cheerio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ObjectsToCsv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;objects-to-csv&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;employeeData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;html_scraper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://datatables.net/examples/styling/display.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

   &lt;span class="c1"&gt;//Selecting all rows inside our target table&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allRows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;table.display &amp;amp;gt; tbody &amp;amp;gt; tr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Going through rows&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="c1"&gt;//Looping through the rows&lt;/span&gt;
   &lt;span class="nx"&gt;allRows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="c1"&gt;//Selecting all cells within the row&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;td&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
       &lt;span class="c1"&gt;//Extracting the text out of each cell&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;office&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;startDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

       &lt;span class="c1"&gt;//Pushing scraped data to our empty array&lt;/span&gt;
       &lt;span class="nx"&gt;employeeData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Position&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Office&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;office&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Age&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Start Date&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;startDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Salary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;})&lt;/span&gt;
   &lt;span class="p"&gt;})&lt;/span&gt;
   &lt;span class="c1"&gt;//Exporting scraped data to a CSV file&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Saving data to CSV&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;csv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;ObjectsToCsv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;employeeData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toDisk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./employeeData.csv&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Saved to CSV&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;After running our script, a new CSV file gets created inside our project’s folder:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hcEBq-0z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/csv-output14-1024x536.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hcEBq-0z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/csv-output14-1024x536.jpg" alt="" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you can use this data for further analysis, like comparing salaries by job title or start date, or looking for trends across bigger job datasets.&lt;/p&gt;
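&lt;p&gt;As a quick sketch of that kind of analysis (using made-up rows in place of the scraped array), here’s how you might average salaries per office once the salary strings are parsed into numbers:&lt;/p&gt;

```javascript
// Hypothetical sample standing in for the scraped employeeData array.
const employeeData = [
  { Office: "Tokyo", Salary: "$162,700" },
  { Office: "Tokyo", Salary: "$137,300" },
  { Office: "London", Salary: "$320,800" },
];

// Group salaries by office, accumulating a sum and a count per office.
const totals = {};
for (const row of employeeData) {
  const salary = Number(row.Salary.replace(/[^0-9.]/g, ""));
  const entry = totals[row.Office] || { sum: 0, count: 0 };
  entry.sum += salary;
  entry.count += 1;
  totals[row.Office] = entry;
}

// Print the average salary for each office.
for (const [office, { sum, count }] of Object.entries(totals)) {
  console.log(office, Math.round(sum / count));
}
// prints: Tokyo 150000, then London 320800
```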

&lt;p&gt;Of course, this script can be adapted to handle almost any HTML table you’ll find, so keep your mind open to new possibilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Avoid Getting Blocked: Integrating ScraperAPI in a Single Line of Code
&lt;/h2&gt;

&lt;p&gt;Before you go, there’s one more thing we need to do to make our scraper more resilient: handle anti-scraping techniques and systems. A lot of websites don’t like being scraped because, sadly, many scrapers are badly optimized and end up hurting the sites they target.&lt;/p&gt;

&lt;p&gt;For that reason, you need to follow some &lt;a href="https://www.scraperapi.com/blog/web-scraping-best-practices/"&gt;web scraping best practices&lt;/a&gt; to ensure that you’re handling your projects correctly, without putting too much pressure on your target website or putting your script and IP at risk of getting banned or blacklisted, which would make it impossible to access the needed data from your machine again.&lt;/p&gt;

&lt;p&gt;To handle IP rotation, JavaScript rendering, &lt;a href="https://www.scraperapi.com/blog/headers-and-cookies-for-web-scraping/"&gt;HTTP headers&lt;/a&gt;, CAPTCHAs, and more, all we need to do is send our initial request through ScraperAPI’s servers. The API draws on years of statistical analysis and machine learning to pick the best combination of headers and proxies, retries any unsuccessful requests, and times our requests so they don’t overload the target server.&lt;/p&gt;

&lt;p&gt;Adding it to our script is as simple as adding this string to the URL passed to Axios:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://api.scraperapi.com?api_key={Your_API_Key}&amp;amp;url=https://datatables.net/examples/styling/display.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Remember to substitute &lt;code&gt;{Your_API_Key}&lt;/code&gt; with your own API key – which you can generate by creating a &lt;a href="https://www.scraperapi.com/signup"&gt;free ScraperAPI account&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Your initial request will take a little longer while ScraperAPI handles these complexities for you, and you’ll only consume API credits for successful requests.&lt;/p&gt;
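&lt;p&gt;One detail worth flagging: if your target URL carries query parameters of its own, it should be URL-encoded before being appended. A safe way to build the request string (assuming the same &lt;code&gt;api.scraperapi.com&lt;/code&gt; endpoint shown above; the helper name is our own) is:&lt;/p&gt;

```javascript
// Build the ScraperAPI request URL, encoding the target URL so its own
// query string can't be misread as parameters of the API call itself.
function buildScraperApiUrl(apiKey, targetUrl) {
  const params = new URLSearchParams({ api_key: apiKey, url: targetUrl });
  return "http://api.scraperapi.com?" + params.toString();
}

console.log(
  buildScraperApiUrl("{Your_API_Key}", "https://datatables.net/examples/styling/display.html")
);
```

&lt;p&gt;Pass the returned string to Axios exactly as before.&lt;/p&gt;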

&lt;p&gt;Now it’s your turn. Web scraping is all about practice. Every website is a different puzzle so there’s no one way to do things. Instead, focus on using the foundations to take on more complex challenges.&lt;/p&gt;

&lt;p&gt;If you want to keep practicing, a few websites we recommend are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://quotes.toscrape.com/"&gt;https://quotes.toscrape.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://books.toscrape.com/"&gt;https://books.toscrape.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://datatables.net/examples/index"&gt;https://datatables.net/examples/index&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Until next time, happy scraping!&lt;/p&gt;



</description>
      <category>javascript</category>
      <category>html</category>
      <category>webdev</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
