<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: ScraperAPI Zoltan Bettenbuk</title>
    <description>The latest articles on Forem by ScraperAPI Zoltan Bettenbuk (@scraperapi).</description>
    <link>https://forem.com/scraperapi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F894092%2F2157c58b-f33a-4b64-848d-3f8574f7f75a.png</url>
      <title>Forem: ScraperAPI Zoltan Bettenbuk</title>
      <link>https://forem.com/scraperapi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/scraperapi"/>
    <language>en</language>
    <item>
      <title>Building a Go Web Scraper Step-by-Step: The Beginners Guide to Colly</title>
      <dc:creator>ScraperAPI Zoltan Bettenbuk</dc:creator>
      <pubDate>Thu, 01 Sep 2022 00:24:59 +0000</pubDate>
      <link>https://forem.com/scraperapi/building-a-go-web-scraper-step-by-step-the-beginners-guide-to-colly-206j</link>
      <guid>https://forem.com/scraperapi/building-a-go-web-scraper-step-by-step-the-beginners-guide-to-colly-206j</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l-F4hdXX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l-F4hdXX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper.jpg" alt="" width="880" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go allows developers to create complicated projects with a simpler syntax than C, but with almost the same efficiency and control.&lt;/p&gt;

&lt;p&gt;Its simplicity and efficiency are why we decided to add Golang to our web scraping beginners series and show you how to use it to extract data at scale.&lt;/p&gt;

&lt;h2&gt;Why Use Go Over Python or JavaScript for Web Scraping?&lt;/h2&gt;

&lt;p&gt;If you’ve followed our series, you’ve seen how easy web scraping is with languages like Python and JavaScript, so why should you give Go a shot?&lt;/p&gt;

&lt;p&gt;There are three main reasons for choosing Go over other languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go is a &lt;a href="https://www.geeksforgeeks.org/what-is-a-typed-language/" rel="noreferrer noopener"&gt;statically typed language&lt;/a&gt;, making it easier to find errors without running your program. &lt;a href="https://www.codecademy.com/article/what-is-an-ide" rel="noreferrer noopener"&gt;Integrated Development Environments&lt;/a&gt; (IDEs) will highlight errors immediately and even show you suggestions to fix them.&lt;/li&gt;
&lt;li&gt;IDEs can be more helpful the more they understand the code, and because we declare the data types in Go, there’s less ambiguity in the code. Thus, IDEs can provide better auto-complete features and suggestions than in other languages.&lt;/li&gt;
&lt;li&gt;Unlike Python or JavaScript, Go is a compiled language that outputs machine code directly, making it faster than Python. In an &lt;a href="https://medium.com/@arnesh07/how-golang-can-save-you-days-of-web-scraping-72f019a6de87#:~:text=From%20this%20graph%2C%20it%20is,the%20same%20number%20of%20URLs." rel="noreferrer noopener"&gt;experiment run by Arnesh Agrawal&lt;/a&gt;, Python (using the Beautiful Soup library) took almost 40 minutes to scrape 2000 URLs, while Go (using the Goquery package) took less than 20 minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary, Go is an excellent option if you need to optimize scraping speed or if you’re looking for a statically typed language to transition to.&lt;/p&gt;

&lt;h2&gt;Web Scraping with Go&lt;/h2&gt;

&lt;p&gt;For this project, we’ll scrape the &lt;a href="https://www.jackjones.com/nl/en/jj/shoes/" rel="noreferrer noopener"&gt;&lt;em&gt;Jack and Jones&lt;/em&gt; shoe category&lt;/a&gt; page to extract product names, prices, and URLs, then export the data into a JSON file using Colly, Go’s web scraping library.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A consideration:&lt;/strong&gt; We’ll try to explain every step of the way in as much detail as possible, so even if you don’t have experience with Go, you’ll be able to follow along. However, if you still feel lost while reading, here’s a great &lt;a href="https://www.youtube.com/watch?v=yyUHQIec83I" rel="noreferrer noopener"&gt;introduction to Go&lt;/a&gt; to watch beforehand.&lt;/p&gt;

&lt;h3&gt;1. Setting Up Our Project&lt;/h3&gt;

&lt;p&gt;We first need to head to &lt;a href="https://go.dev/dl/" rel="noreferrer noopener"&gt;https://go.dev/dl/&lt;/a&gt; and download the right version of Go based on our operating system. In our case, we’ll pick the ARM64 version as we’re using a MacBook Air M1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KToBEE9F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img1-1024x242.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KToBEE9F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img1-1024x242.png" alt="" width="880" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can also use &lt;a href="https://formulae.brew.sh/formula/go#default" rel="noreferrer noopener"&gt;Homebrew&lt;/a&gt; on macOS or &lt;a href="https://chocolatey.org/" rel="noreferrer noopener"&gt;Chocolatey&lt;/a&gt; on Windows to install Go.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vMc2arja--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vMc2arja--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img2.png" alt="" width="370" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the download is complete, follow the instructions to install it on your machine. With Go installed, let’s create a new directory named &lt;em&gt;go-scraper&lt;/em&gt; and open it in VS Code or your preferred IDE.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Open the terminal and enter the &lt;code&gt;go version&lt;/code&gt; command. If everything goes well, it will log the version like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BNnvSSZ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BNnvSSZ1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img3.png" alt="" width="572" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this tutorial, we’ll be using VS Code to write our code, so we’ll also install Go’s VS Code extension for a better experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fOm0EfJD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fOm0EfJD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img4.png" alt="" width="819" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, open the terminal and type the following command to create or initialize the project: &lt;code&gt;go mod init go-scraper&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U-Q4U6EG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U-Q4U6EG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img5.png" alt="" width="629" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Go, “a &lt;a href="https://go.dev/blog/using-go-modules" rel="noreferrer noopener"&gt;module&lt;/a&gt; is a collection of &lt;a href="https://go.dev/ref/spec#Packages" rel="noreferrer noopener"&gt;Go packages&lt;/a&gt; stored in a file tree with a &lt;em&gt;go.mod&lt;/em&gt; file at its root.” The command above – &lt;code&gt;go mod init&lt;/code&gt; – tells Go that the module we’re naming (&lt;em&gt;go-scraper&lt;/em&gt;) is rooted in the current directory.&lt;/p&gt;

&lt;p&gt;Without leaving the terminal, let’s create a new &lt;em&gt;jack-scraper.go&lt;/em&gt; file with &lt;code&gt;touch jack-scraper.go&lt;/code&gt;, and then install Colly using the command found in Colly’s documentation: &lt;code&gt;go get -u github.com/gocolly/colly/...&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wix4juXI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img6-1024x589.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wix4juXI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img6-1024x589.png" alt="" width="880" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All the downloaded dependencies were added to the &lt;em&gt;go.mod&lt;/em&gt; file, and a new &lt;em&gt;go.sum&lt;/em&gt; file – which may contain hashes for multiple versions of a module – was created.&lt;/p&gt;
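&lt;p&gt;After these two commands, the &lt;em&gt;go.mod&lt;/em&gt; file might look something like this (the Go and Colly versions shown here are illustrative – yours will depend on when you run the commands):&lt;/p&gt;

```
module go-scraper

go 1.19

require github.com/gocolly/colly v1.2.0
```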

&lt;p&gt;Now we can consider our environment set!&lt;/p&gt;

&lt;h3&gt;2. Sending HTTP Requests with Colly&lt;/h3&gt;

&lt;p&gt;At the top of our &lt;em&gt;jack-scraper.go&lt;/em&gt; file, we’ll add our package’s name, which as a convention, we’ll name main:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And then import the dependencies to the project:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import (
   "github.com/gocolly/colly"
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Due to Go’s typed nature, VS Code will tell us there’s an error with our import. Basically, it’s telling us that we imported a dependency we’re not using yet. This is one of the advantages of using Go: we don’t need to run our code to find errors.&lt;/p&gt;

&lt;p&gt;Something particular to Go is that we need to provide a starting point for the code to run. So we’ll create the main function, and all our logic will be inside it. For now, we’ll print “Function is working.”&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;func main() {
  
   fmt.Println("Function is working")
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If everything went well it should return this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aENMVDDi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aENMVDDi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img7.png" alt="" width="626" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might have also noticed that a new package was imported. That’s because Go can tell we’re trying to use the &lt;code&gt;.Println()&lt;/code&gt; function from the &lt;code&gt;fmt&lt;/code&gt; package, so it automatically imported it. Handy, right?&lt;/p&gt;

&lt;p&gt;To handle our request and deal with all callbacks necessary to extract the data, Colly uses a &lt;a href="http://go-colly.org/docs/introduction/start/" rel="noreferrer noopener"&gt;collector&lt;/a&gt; object. Initializing the collector is as simple as calling the .NewCollector() method from Colly and passing the domains we want to allow Colly to visit:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c := colly.NewCollector(
    colly.AllowedDomains("www.jackjones.com"),
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On our new &lt;code&gt;c&lt;/code&gt; instance, we can now call the &lt;code&gt;.Visit()&lt;/code&gt; method to make Colly send our request.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.Visit("https://www.jackjones.com/nl/en/jj/shoes/")&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;However, there are a few more things we need to add to our code before sending our collector out into the world. Let’s see what’s happening behind the scenes.&lt;/p&gt;

&lt;p&gt;First, we’ll create a callback to print out the URL Colly is navigating to – this will become more useful as we scale our scraper from one page to multiple pages.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnRequest(func(r *colly.Request) {
    fmt.Println("Scraping:", r.URL)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And then a callback to print out the status of the request.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnResponse(func(r *colly.Response) {
    fmt.Println("Status:", r.StatusCode)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As we said before, the &lt;code&gt;collector&lt;/code&gt; object is responsible for the callbacks attached to a &lt;code&gt;collector&lt;/code&gt; job. So each of these functions will be called at various stages in the &lt;code&gt;collector&lt;/code&gt; job cycle.&lt;/p&gt;

&lt;p&gt;In our example, the &lt;code&gt;.OnRequest()&lt;/code&gt; function is called right before the collector makes the HTTP request, while the &lt;code&gt;.OnResponse()&lt;/code&gt; function will be called after a response is received.&lt;/p&gt;

&lt;p&gt;Just for good measure, let’s also create an error handler before running the scraper.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnError(func(r *colly.Response, err error) {
    fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To run our code, open a terminal and use the &lt;code&gt;go run jack-scraper.go&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CdagFDMC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CdagFDMC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img8.png" alt="" width="635" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Awesome, we got a beautiful 200 (successful) status code! We’re ready for phase 2, the extraction.&lt;/p&gt;

&lt;h3&gt;3. Inspecting the Target Website&lt;/h3&gt;

&lt;p&gt;To extract data from an HTML file, we need more than just access to the site. Each website has its own structure and we must understand it in order to scrape specific elements without adding noise to our dataset.&lt;/p&gt;

&lt;p&gt;The good news is that we can quickly look at the site’s HTML structure by right-clicking on the page and selecting &lt;em&gt;Inspect&lt;/em&gt; from the menu. It’ll open the Inspector tools, and we can start hovering on elements to see their positions in the HTML and attributes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4leYFdHL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img9-1024x490.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4leYFdHL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img9-1024x490.png" alt="" width="880" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use &lt;a href="https://www.scraperapi.com/blog/css-selectors-cheat-sheet/"&gt;CSS selectors in web scraping&lt;/a&gt; (classes, IDs, etc.) to tell our scrapers where to locate the elements we want them to extract for us. Fortunately for us, Colly is based on the &lt;a href="https://pkg.go.dev/github.com/PuerkitoBio/goquery" rel="noreferrer noopener"&gt;Goquery package&lt;/a&gt;, which provides Colly with a JQuery-like syntax to target these selectors.&lt;/p&gt;

&lt;p&gt;Let’s try to grab the first shoe’s product name using the DevTools console to test the ground.&lt;/p&gt;

&lt;h3&gt;4. Using DevTools to Test Our CSS Selectors&lt;/h3&gt;

&lt;p&gt;If we inspect the element further, we can see that the name of the product is in a &lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt; tag with the class “&lt;code&gt;product-tile__name__link&lt;/code&gt;”, wrapped between &lt;code&gt;&amp;lt;header&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OF_4nUeZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OF_4nUeZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img10.png" alt="" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On our &lt;a href="https://developer.chrome.com/docs/devtools/console/" rel="noreferrer noopener"&gt;browser’s console&lt;/a&gt;, let’s use the &lt;code&gt;document.querySelectorAll()&lt;/code&gt; method to find that element.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;document.querySelectorAll("a.product-tile__name__link.js-product-tile-link")&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QqWP93S7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QqWP93S7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img11.png" alt="" width="442" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes! It returns 44 elements, which matches the number of products on the page.&lt;/p&gt;

&lt;p&gt;The product’s URL is inside the same element, so we can use the same selector for it. On the other hand, after some testing, we can use “&lt;code&gt;em.value__price&lt;/code&gt;” to pick the price.&lt;/p&gt;

&lt;h3&gt;5. Scraping All Product Names on the Page&lt;/h3&gt;

&lt;p&gt;If you’ve read some of our previous articles, you know we love testing. So, before extracting all our target elements, we’re first going to test our scraper by extracting all product names from the page.&lt;/p&gt;

&lt;p&gt;To do so, Colly has a &lt;code&gt;.OnHTML()&lt;/code&gt; function to handle the HTML from the response (there’s also an &lt;code&gt;.OnXML()&lt;/code&gt; function you can use if you’re not sure if the response will be HTML or XML content).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnHTML("a.product-tile__name__link.js-product-tile-link", func(h *colly.HTMLElement) {
    fmt.Println(h.Text)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here’s a breakdown of what’s happening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We pass the main element we want to work with &lt;code&gt;a.product-tile__name__link.js-product-tile-link&lt;/code&gt; as the first argument of the &lt;code&gt;OnHTML()&lt;/code&gt; function.&lt;/li&gt;
&lt;li&gt;The second argument is a function that will run when Colly finds the element we specified.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inside that function, we told Colly what to do with the &lt;code&gt;h&lt;/code&gt; object. In our case, it’s printing the text of the element.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--THPa3CFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--THPa3CFt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img12.png" alt="" width="619" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And yes, it’s working great so far. However, there’s a lot of empty space around our text, adding noise to the data.&lt;/p&gt;

&lt;p&gt;To solve this problem, let’s get even more specific by adding the &lt;code&gt;&amp;lt;header&amp;gt;&lt;/code&gt; element as our main selector and then looking for the text using the &lt;code&gt;.ChildText()&lt;/code&gt; function.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnHTML("header.product-tile__name", func(h *colly.HTMLElement) {
    fmt.Println(h.ChildText("a.product-tile__name__link.js-product-tile-link"))
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hDs2zfEB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hDs2zfEB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img13.png" alt="" width="642" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;6. Extracting All HTML Elements&lt;/h3&gt;

&lt;p&gt;By now, we have a good grasp of the logic behind the &lt;code&gt;OnHTML()&lt;/code&gt; function, so let’s scale it a little bit more and extract the rest of the data.&lt;/p&gt;

&lt;p&gt;We begin by changing the main selector to the one that contains all the data we want:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6CEZ01lP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img14-1024x200.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6CEZ01lP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img14-1024x200.png" alt="" width="880" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, we go down the hierarchy to extract the name, price, and URL of each product:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aNUpMG8Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aNUpMG8Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img15.png" alt="" width="866" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s how it translates to code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnHTML("div.product-tile__content-wrapper", func(h *colly.HTMLElement) {
    name := h.ChildText("a.product-tile__name__link.js-product-tile-link")
    price := h.ChildText("em.value__price")
    url := h.ChildAttr("a.product-tile__name__link.js-product-tile-link", "href")
    fmt.Println(name, price, url)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For this function, we’re storing each element inside a variable to make it easier to work with later. For example, printing all the elements scraped to the console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; When using the &lt;code&gt;.ChildAttr()&lt;/code&gt; callback function, we need to pass the selector of the element as the first argument (&lt;code&gt;a.product-tile__name__link.js-product-tile-link&lt;/code&gt;) and the name of the attribute as the second (&lt;code&gt;href&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xNWeVhHu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img16-1024x127.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xNWeVhHu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img16-1024x127.png" alt="" width="880" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We got a beautiful list of elements printed to the console, but it’s not quite the best format for analysis. Let’s solve that!&lt;/p&gt;

&lt;h3&gt;7. Export Data to a JSON File in Colly&lt;/h3&gt;

&lt;p&gt;Saving our data to a JSON file will be more useful than having everything on the terminal, and it’s not that hard to do, thanks to Go’s built-in JSON modules.&lt;/p&gt;

&lt;h4&gt;Creating a Structure&lt;/h4&gt;

&lt;p&gt;Outside the main function, we need to create a structure (struct) to group every data set (name, price, URL) into a single type.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;type products struct {
   Name  string
   Price string
   URL   string
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Structures let you combine different data types, so we have to define the data type for each element. We can also define the JSON field names through tags; this is optional, but without them the keys in the output file will default to the capitalized field names.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;type products struct {
   Name  string `json:"name"`
   Price string `json:"price"`
   URL   string `json:"url"`
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;TIP:&lt;/strong&gt; for a cleaner look, we can group all fields sharing the same type in a single line of code like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;type products struct {
   Name, Price, URL string
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Of course, there’s going to be an “unused” error message on our code because we haven’t, well, used the &lt;code&gt;products struct&lt;/code&gt; anywhere. So, we’ll assign each scraped element to one of our fields in the &lt;code&gt;struct&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnHTML("div.product-tile__content-wrapper", func(h *colly.HTMLElement) {
    products := products{
        Name:  h.ChildText("a.product-tile__name__link.js-product-tile-link"),
        Price: h.ChildText("em.value__price"),
        URL:   h.ChildAttr("a.product-tile__name__link.js-product-tile-link", "href"),
    }
    
    fmt.Println(products)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If we run our application, our data is now grouped, making it possible to pass each element as an individual “item” to the empty list we’ll later turn into a JSON file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8JHEcskh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img17-1024x128.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8JHEcskh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img17-1024x128.png" alt="" width="880" height="110"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Adding Scraped Elements to a Slice&lt;/h4&gt;

&lt;p&gt;With our structure holding the product data, we’ll next send all of them to an empty &lt;a href="https://go.dev/tour/moretypes/7" rel="noreferrer noopener"&gt;Slice&lt;/a&gt; (instead of an Array like we would in other languages) to create a list of items we’ll export to the JSON file.&lt;/p&gt;

&lt;p&gt;To declare the empty slice, add this line after the collector initialization code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;var allProducts []products&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Inside the &lt;code&gt;.OnHTML()&lt;/code&gt; function, instead of printing our structure, let’s append our &lt;code&gt;products&lt;/code&gt; struct to the Slice.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;allProducts = append(allProducts, products)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If we print the Slice out now, here’s the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nObsQh5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img18-1024x152.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nObsQh5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img18-1024x152.png" alt="" width="880" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that each product information set is inside curly braces ({…}), and the entire Slice is inside brackets ([…]).&lt;/p&gt;

&lt;h4&gt;Writing the Slice into a JSON File&lt;/h4&gt;

&lt;p&gt;We already did all the heavy lifting we needed to do. From here on, Go has a very easy-to-use JSON module that’ll handle the writing for us:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;   //We pass the data to Marshal
   content, err := json.Marshal(allProducts)
   //We need to handle the potential error
   if err != nil {
       fmt.Println(err.Error())
   }
   //We write a new file passing the file name, data, and permissions
   os.WriteFile("jack-shoes.json", content, 0644)
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Save it, and the necessary dependencies will update:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import (
   "encoding/json"
   "fmt"
   "os"
  
   "github.com/gocolly/colly"
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let’s run our code and see what it returns:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yf--2mC2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img19-1024x269.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yf--2mC2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img19-1024x269.png" alt="" width="880" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; When you open your file, it has all the data in a single line. To display your document as in the image above, right-click in the editor window and select &lt;em&gt;Format Document&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We could also print the length of &lt;code&gt;allProducts&lt;/code&gt; after creating the JSON file for testing purposes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;fmt.Println(len(allProducts))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But we should be getting everything we want.&lt;/p&gt;

&lt;p&gt;That said, a single page isn’t enough for most projects. In fact, in almost all projects like this, we’ll need to scrape multiple pages to gather as much information as possible. Luckily, we can scale our project with just a few lines of code.&lt;/p&gt;

&lt;h3&gt;8. Scraping Multiple Pages&lt;/h3&gt;

&lt;p&gt;If we scroll down to the bottom of the product list, we can see that J&amp;amp;J is using numbered pagination on their category page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i7b3QlI7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i7b3QlI7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img20.png" alt="" width="880" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could try to figure out how they construct their URLs and see if we can mimic it with a loop, but Colly has a more elegant solution, similar to &lt;a href="https://www.scraperapi.com/blog/how-to-deal-with-pagination-in-python-step-by-step-guide-full-code/"&gt;how Scrapy navigates paginations&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.OnHTML("a.paging-controls__next.js-page-control", func(p *colly.HTMLElement) {
    nextPage := p.Request.AbsoluteURL(p.Attr("data-href"))
    c.Visit(nextPage)
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In a new &lt;code&gt;OnHTML()&lt;/code&gt; callback, we’re targeting the next button in the pagination menu.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aeDx7fjV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img21-1024x182.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aeDx7fjV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img21-1024x182.png" alt="" width="880" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the callback function, we’re grabbing the value inside &lt;code&gt;data-href&lt;/code&gt; (which contains the URL), storing it in a new variable (&lt;code&gt;nextPage&lt;/code&gt;), and then telling our scraper to visit that page.&lt;/p&gt;

&lt;p&gt;Running the code now will bring back all the product data from every page in the pagination.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0MypEdoL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0MypEdoL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img22.png" alt="" width="629" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, our slice now contains 102 items.&lt;/p&gt;

&lt;h3&gt;9. Avoid Getting Your Colly Scraper Blocked&lt;/h3&gt;

&lt;p&gt;One thing we always have to consider when scraping the web is that most sites don’t welcome web scrapers, because many developers have no regard for the websites they extract data from.&lt;/p&gt;

&lt;p&gt;Imagine that you want to scrape thousands of pages from a site. Every request you send takes resources away from the real users, creating more expenses for the site’s owner and possibly hurting the user experience with slow load times.&lt;/p&gt;

&lt;p&gt;That’s why we should always use &lt;a href="https://www.scraperapi.com/blog/web-scraping-best-practices/"&gt;web scraping best practices&lt;/a&gt; to ensure we’re not hurting our target sites.&lt;/p&gt;

&lt;p&gt;However, there’s another side of the story. To prevent scrapers from accessing the site, more and more websites implement anti-scraping systems designed to identify and block scrapers.&lt;/p&gt;

&lt;p&gt;Although scraping a few pages won’t raise any flags in most cases, scraping many pages will definitely put your IP address and your scraper at risk.&lt;/p&gt;

&lt;p&gt;To get around these measures, we would have to write a function that changes our IP address, gain access to a pool of IP addresses for our script to rotate through, find some way to deal with CAPTCHAs, and handle JavaScript-rendered pages – which are becoming more and more common.&lt;/p&gt;

&lt;p&gt;Or we could just send our HTTP request through ScraperAPI’s server and let them handle everything automatically:&lt;/p&gt;

&lt;p&gt;1. First, we’ll only need to &lt;a href="https://www.scraperapi.com/signup"&gt;create a free ScraperAPI account to redeem 5000 free API credits&lt;/a&gt; and get access to our API key from the dashboard.&lt;br&gt;2. For simplicity, we’ll delete the &lt;code&gt;colly.AllowedDomains("www.jackjones.com")&lt;/code&gt; setting from the &lt;code&gt;collector&lt;/code&gt;.&lt;br&gt;3. We’ll add the ScraperAPI endpoint to our initial &lt;code&gt;.Visit()&lt;/code&gt; function like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.Visit("http://api.scraperapi.com?api_key={yourApiKey}&amp;amp;url=https://www.jackjones.com/nl/en/jj/shoes/")&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;4. And make a similar change for how we visit the next page:&lt;br&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.Visit("http://api.scraperapi.com?api_key={yourApiKey}&amp;amp;url=" + nextPage)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;5. For it to work properly and avoid errors, we’ll need to raise Colly’s default timeout to at least 60 seconds, giving our scraper enough time to handle any headers, CAPTCHAs, etc. We’ll use the sample transport code from Colly’s documentation, changing the dialer timeout from 30 to 60 seconds:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;c.WithTransport(&amp;amp;http.Transport{
    DialContext: (&amp;amp;net.Dialer{
        Timeout:   60 * time.Second,
        KeepAlive: 30 * time.Second,
        DualStack: true,
    }).DialContext,
    MaxIdleConns:          100,
    IdleConnTimeout:       90 * time.Second,
    TLSHandshakeTimeout:   10 * time.Second,
    ExpectContinueTimeout: 1 * time.Second,
})&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Add this code after creating the collector.&lt;/p&gt;

&lt;p&gt;Running our code will bring back the same data as before, the difference being that ScraperAPI will rotate our IP address for each request sent, look for the best proxy-and-headers combination to ensure a successful request, and handle any other complexities our scraper could encounter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6Yvmm6Mw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img23-1024x168.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6Yvmm6Mw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/golang-scraper-img23-1024x168.png" alt="" width="880" height="144"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;Colly will save all the data into a JSON file that we can use in other applications or projects.&lt;/p&gt;
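&lt;p&gt;The &lt;code&gt;json.Marshal()&lt;/code&gt; call in the code above writes everything on a single line, which is why we had to format the document in the editor earlier. If we’d rather have Go pretty-print the file for us, the standard library also offers &lt;code&gt;json.MarshalIndent()&lt;/code&gt;. Here’s a minimal, self-contained sketch – the &lt;code&gt;product&lt;/code&gt; type, the &lt;code&gt;toPrettyJSON&lt;/code&gt; helper, and the sample data are illustrative, mirroring our &lt;code&gt;products&lt;/code&gt; struct:&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// product mirrors the struct we use in the scraper.
type product struct {
	Name  string `json:"name"`
	Price string `json:"price"`
	URL   string `json:"url"`
}

// toPrettyJSON marshals the slice with a two-space indent, so the
// resulting file is readable without reformatting it by hand.
func toPrettyJSON(items []product) ([]byte, error) {
	return json.MarshalIndent(items, "", "  ")
}

func main() {
	allProducts := []product{
		{Name: "Sneaker", Price: "39.99", URL: "/nl/en/jj/shoes/sneaker"},
	}
	content, err := toPrettyJSON(allProducts)
	if err != nil {
		fmt.Println(err.Error())
	}
	fmt.Println(string(content))
}
```

&lt;p&gt;Swapping this into the scraper only means replacing the &lt;code&gt;json.Marshal(allProducts)&lt;/code&gt; call before &lt;code&gt;os.WriteFile()&lt;/code&gt;.&lt;/p&gt;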

&lt;p&gt;With just a few small changes to the code, you can scrape any website you need, as long as the information is live in the HTML file.&lt;/p&gt;

&lt;p&gt;However, we can also configure our ScraperAPI endpoint to render JavaScript content before returning the HTML document. So unless the content is behind an event (like clicking a button), you should also be able to grab dynamic content without a problem.&lt;/p&gt;
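&lt;p&gt;Rendering is enabled through ScraperAPI’s &lt;code&gt;render=true&lt;/code&gt; query parameter. Rather than concatenating the endpoint by hand, we can build it with Go’s &lt;code&gt;net/url&lt;/code&gt; package, which also escapes the target URL safely. The &lt;code&gt;scraperAPIURL&lt;/code&gt; helper below is our own convenience function, not part of any library:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net/url"
)

// scraperAPIURL builds the ScraperAPI endpoint for a target page.
// Setting render to true asks the API to execute JavaScript before
// returning the HTML document.
func scraperAPIURL(apiKey, target string, render bool) string {
	q := url.Values{}
	q.Set("api_key", apiKey)
	if render {
		q.Set("render", "true")
	}
	q.Set("url", target)
	return "http://api.scraperapi.com/?" + q.Encode()
}

func main() {
	fmt.Println(scraperAPIURL("yourApiKey", "https://www.jackjones.com/nl/en/jj/shoes/", true))
}
```

&lt;p&gt;The returned string can then be passed straight to &lt;code&gt;c.Visit()&lt;/code&gt;.&lt;/p&gt;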

&lt;h2&gt;Wrapping Up: Full Colly Web Scraper Code&lt;/h2&gt;

&lt;p&gt;Congratulations, you created your first Colly web scraper! If you’ve followed along, here’s how your codebase should look:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main
  
import (
   "encoding/json"
   "fmt"
   "net"
   "net/http"
   "os"
   "time"
  
   "github.com/gocolly/colly"
)
  
type products struct {
   Name  string `json:"name"`
   Price string `json:"price"`
   URL   string `json:"url"`
}
  
func main() {
   c := colly.NewCollector()
   c.WithTransport(&amp;amp;http.Transport{
       DialContext: (&amp;amp;net.Dialer{
           Timeout:   60 * time.Second,
           KeepAlive: 30 * time.Second,
           DualStack: true,
       }).DialContext,
       MaxIdleConns:          100,
       IdleConnTimeout:       90 * time.Second,
       TLSHandshakeTimeout:   10 * time.Second,
       ExpectContinueTimeout: 1 * time.Second,
   })
  
   var allProducts []products
  
   c.OnRequest(func(r *colly.Request) {
       fmt.Println("Scraping:", r.URL)
   })
  
   c.OnResponse(func(r *colly.Response) {
       fmt.Println("Status:", r.StatusCode)
   })
  
   c.OnHTML("div.product-tile__content-wrapper", func(h *colly.HTMLElement) {
       product := products{
           Name:  h.ChildText("a.product-tile__name__link.js-product-tile-link"),
           Price: h.ChildText("em.value__price"),
           URL:   h.ChildAttr("a.product-tile__name__link.js-product-tile-link", "href"),
       }

       allProducts = append(allProducts, product)
   })
  
   c.OnHTML("a.paging-controls__next.js-page-control", func(p *colly.HTMLElement) {
       nextPage := p.Request.AbsoluteURL(p.Attr("data-href"))
       c.Visit("http://api.scraperapi.com?api_key={yourApiKey}&amp;amp;url=" + nextPage)
   })
  
   c.OnError(func(r *colly.Response, err error) {
       fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
   })
  
   c.Visit("http://api.scraperapi.com?api_key={yourApiKey}&amp;amp;url=https://www.jackjones.com/nl/en/jj/shoes/")
  
   content, err := json.Marshal(allProducts)
   if err != nil {
       fmt.Println(err.Error())
   }
   os.WriteFile("jack-shoes.json", content, 0644)
   fmt.Println("Total products: ", len(allProducts))
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Web scraping is one of the most powerful tools to have in your data collection arsenal. However, remember that every website is built somewhat differently. Focus on the fundamentals of website structure, and you’ll be able to solve any problem that comes your way.&lt;/p&gt;

&lt;p&gt;Until next time, happy scraping!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>go</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Use Python to Loop Through HTML Tables and Scrape Tabular Data</title>
      <dc:creator>ScraperAPI Zoltan Bettenbuk</dc:creator>
      <pubDate>Wed, 31 Aug 2022 04:40:07 +0000</pubDate>
      <link>https://forem.com/scraperapi/how-to-use-python-to-loop-through-html-tables-and-scrape-tabular-data-df9</link>
      <guid>https://forem.com/scraperapi/how-to-use-python-to-loop-through-html-tables-and-scrape-tabular-data-df9</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PiEeMJRn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/python-scrape-htmltables-tabular-data.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PiEeMJRn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/python-scrape-htmltables-tabular-data.jpg" alt="" width="880" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tabular data is one of the best sources of data on the web. Tables can store a massive amount of useful information without losing their easy-to-read format, making them gold mines for data-related projects.&lt;/p&gt;

&lt;p&gt;Whether it is to &lt;a href="https://www.scraperapi.com/blog/how-to-scrape-football-data/"&gt;scrape football data&lt;/a&gt; or &lt;a href="https://www.scraperapi.com/blog/how-to-scrape-stock-market-data-with-python/"&gt;extract stock market data&lt;/a&gt;, we can use Python to quickly access, parse and extract data from HTML tables, thanks to Requests and Beautiful Soup.&lt;/p&gt;

&lt;p&gt;Also, we have a little black and white surprise for you at the end, so keep reading!&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding HTML Table’s Structure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SpS4mHGn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img1-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SpS4mHGn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img1-python-loop-thru-html-tabular-data.png" alt="" width="877" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Visually, an &lt;a href="https://www.w3schools.com/tags/tag_table.asp"&gt;HTML table&lt;/a&gt; is a set of rows and columns displaying information in a tabular format. For this tutorial, we’ll be scraping the table above:&lt;/p&gt;

&lt;p&gt;To be able to scrape the data contained within this table, we’ll need to go a little deeper into its coding.&lt;/p&gt;

&lt;p&gt;Generally speaking, HTML tables are actually built using the following HTML tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt;: marks the start of an HTML table&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;th&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;thead&amp;gt;&lt;/code&gt;: defines a row as the heading of the table&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;tbody&amp;gt;&lt;/code&gt;: indicates the section where the data is&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt;: indicates a row in the table&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt;: defines a cell in the table&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;However, as we’ll see in real-life scenarios, not all developers respect these conventions when building their tables, making some projects harder than others. Still, understanding how they work is crucial for finding the right approach.&lt;/p&gt;
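&lt;p&gt;To see how these tags fit together before touching the real page, here’s a quick sketch that parses a tiny, made-up table with Beautiful Soup – the same approach we’ll apply below:&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# A minimal, made-up table that follows the conventions above
html = """
<table>
  <thead><tr><th>Name</th><th>Office</th></tr></thead>
  <tbody>
    <tr><td>Airi Satou</td><td>Tokyo</td></tr>
    <tr><td>Angelica Ramos</td><td>London</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Loop through the rows inside tbody and pull the text of every cell
table_data = []
for row in soup.find('tbody').find_all('tr'):
    table_data.append([td.text for td in row.find_all('td')])

print(table_data)  # [['Airi Satou', 'Tokyo'], ['Angelica Ramos', 'London']]
```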

&lt;p&gt;Let’s enter the table’s URL (&lt;a href="https://datatables.net/examples/styling/stripe.html"&gt;https://datatables.net/examples/styling/stripe.html&lt;/a&gt;) in our browser and inspect the page to see what’s happening under the hood.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SwcTJWp5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img2-python-loop-thru-html-tabular-data-1024x712.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SwcTJWp5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img2-python-loop-thru-html-tabular-data-1024x712.png" alt="" width="880" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why this is a great page to practice scraping tabular data with Python. There’s a clear &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt; tag pair opening and closing the table, and all the relevant data is inside the &lt;code&gt;&amp;lt;tbody&amp;gt;&lt;/code&gt; tag. It only shows ten rows, which matches the number of entries selected on the front end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C9LwQBDP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img3-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C9LwQBDP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img3-python-loop-thru-html-tabular-data.png" alt="" width="426" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few more things to know about this table: it has a total of 57 entries we’ll want to scrape, and there seem to be two ways to access the data. The first is clicking the drop-down menu and selecting “100” to show all entries:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GeQWShuq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img-4-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GeQWShuq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img-4-python-loop-thru-html-tabular-data.png" alt="" width="180" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or clicking on the next button to move through the pagination.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QstPyUxw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img5-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QstPyUxw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img5-python-loop-thru-html-tabular-data.png" alt="" width="404" height="57"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So which one should we pick? Either of these solutions would add extra complexity to our script, so instead, let’s first check where the data is being pulled from.&lt;/p&gt;

&lt;p&gt;Of course, because this is an HTML table, all the data should be in the HTML file itself, with no need for an AJAX injection. To verify this, &lt;em&gt;Right Click&lt;/em&gt; &amp;gt; &lt;em&gt;View Page Source&lt;/em&gt;, then copy a few cells and search for them in the source code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_bD61A7G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img6-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_bD61A7G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img6-python-loop-thru-html-tabular-data.png" alt="" width="858" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We did the same thing for a couple more entries from different paginated pages, and yes, it seems like all our target data is there even though the front end doesn’t display it.&lt;/p&gt;

&lt;p&gt;And with this information, we’re ready to move to the code!&lt;/p&gt;
&lt;h2&gt;
  
  
  Scraping HTML Tables Using Python’s Beautiful Soup
&lt;/h2&gt;

&lt;p&gt;Because all the employee data we’re looking to scrape is in the HTML file, we can use the &lt;a href="https://github.com/psf/requests"&gt;Requests&lt;/a&gt; library to send the HTTP request and &lt;a href="https://www.scraperapi.com/blog/what-is-data-parsing/"&gt;parse the response&lt;/a&gt; using &lt;a href="https://github.com/wention/BeautifulSoup4"&gt;Beautiful Soup&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you’re new to web scraping, we’ve created a &lt;a href="https://www.scraperapi.com/blog/web-scraping-python/"&gt;web scraping in Python tutorial for beginners&lt;/a&gt;. Although you’ll be able to follow along without experience, it’s always a good idea to start from the basics.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Sending Our Main Request
&lt;/h3&gt;

&lt;p&gt;Let’s create a new directory for the project named &lt;em&gt;python-html-table&lt;/em&gt;, then a new folder named &lt;em&gt;bs4-table-scraper&lt;/em&gt;, and finally create a new &lt;em&gt;python_table_scraper.py&lt;/em&gt; file.&lt;/p&gt;

&lt;p&gt;From the terminal, let’s run &lt;code&gt;pip3 install requests beautifulsoup4&lt;/code&gt; and import both packages into our project as follows:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;To send an HTTP request with Requests, all we need to do is set a URL, pass it to &lt;code&gt;requests.get()&lt;/code&gt;, store the returned HTML inside a &lt;code&gt;response&lt;/code&gt; variable, and print &lt;code&gt;response.status_code&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you’re totally new to Python, you can run your code from the terminal with the command &lt;code&gt;python3 python_table_scraper.py&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;url = 'https://datatables.net/examples/styling/stripe.html'

response = requests.get(url)

print(response.status_code)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;If it’s working, it will return a 200 status code. Anything else means your IP is being rejected by the anti-scraping systems the website has in place. A potential solution is &lt;a href="https://www.scraperapi.com/blog/headers-and-cookies-for-web-scraping/"&gt;adding custom headers to your script&lt;/a&gt; to make it look more human – but that might not be enough. Another solution is using a web scraping API to handle all these complexities for you.&lt;/p&gt;
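&lt;p&gt;As a quick illustration of the custom-headers approach, here’s a minimal sketch – the header values are examples copied from a desktop browser, not a guaranteed fix:&lt;/p&gt;

```python
import requests

# Illustrative headers that make the request look more like a regular
# browser; real values can be copied from your browser's Network tab
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

url = 'https://datatables.net/examples/styling/stripe.html'

# Preparing the request without sending it lets us inspect what would go out
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])

# To actually send it: response = requests.get(url, headers=headers)
```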
&lt;h3&gt;
  
  
  2. Integrating ScraperAPI to Avoid Anti-Scraping systems
&lt;/h3&gt;

&lt;p&gt;ScraperAPI is an elegant solution to avoid almost any type of anti-scraping technique. It uses machine learning and years of statistical analysis to determine the best headers and IP combinations to access the data, handle CAPTCHAs and rotate your IP between each request.&lt;/p&gt;

&lt;p&gt;To start, let’s &lt;a href="https://www.scraperapi.com/signup"&gt;create a new ScraperAPI free account&lt;/a&gt; to redeem 5000 free API credits and get our API key. From our account’s dashboard, we can copy the key value to build the URL of the request.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://api.scraperapi.com?api_key={Your_API_KEY}&amp;amp;amp;amp;amp;amp;amp;url={TARGET_URL}
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Following this structure, we replace the placeholders with our data and send our request again:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup

url = 'http://api.scraperapi.com?api_key=51e43be283e4db2a5afbxxxxxxxxxxxx&amp;amp;url=https://datatables.net/examples/styling/stripe.html'

response = requests.get(url)

print(response.status_code)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QPUzXOHs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img7-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QPUzXOHs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img7-python-loop-thru-html-tabular-data.png" alt="" width="455" height="39"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Awesome, it’s working without any hiccup!&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Building the Parser Using Beautiful Soup
&lt;/h3&gt;

&lt;p&gt;Before we can extract the data, we need to turn the raw HTML into formatted or parsed data. We’ll store this parsed HTML into a soup object like this:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;soup = BeautifulSoup(response.text, 'html.parser')
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;From here, we can traverse the parse tree using the HTML tags and their attributes.&lt;/p&gt;

&lt;p&gt;If we go back to the table on the page, we’ve already seen that the table is enclosed between &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt; tags with the class &lt;code&gt;stripe dataTable&lt;/code&gt;, which we can use to select the table.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table = soup.find('table', class_ = 'stripe')
print(table)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; After testing, adding the second class (&lt;code&gt;dataTable&lt;/code&gt;) didn’t return the element; in the returned HTML, the table’s class is only &lt;code&gt;stripe&lt;/code&gt;. You can also select the table using &lt;code&gt;id = 'example'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here’s what it returns:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7qjc7iZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img8-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7qjc7iZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img8-python-loop-thru-html-tabular-data.png" alt="" width="450" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we grabbed the table, we can loop through the rows and grab the data we want.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Looping Through the HTML Table
&lt;/h3&gt;

&lt;p&gt;Thinking back to the table’s structure, every row is represented by a &lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt; element, and within each of them there are &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt; elements containing the data; all of this is wrapped between a &lt;code&gt;&amp;lt;tbody&amp;gt;&lt;/code&gt; tag pair.&lt;/p&gt;

&lt;p&gt;To extract the data, we’ll create two for loops: one to grab the &lt;code&gt;&amp;lt;tbody&amp;gt;&lt;/code&gt; section of the table (where all the rows are) and another to store every row in a variable we can use:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for employee_data in table.find_all('tbody'):
   rows = employee_data.find_all('tr')
   print(rows)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;In rows we’ll store all the &lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt; elements found within the body section of the table. If you’re following our logic, the next step is to store each individual row into a single object and loop through them to find the desired data.&lt;/p&gt;

&lt;p&gt;For starters, let’s try to pick the first employee’s name in our browser’s console using the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll"&gt;.querySelectorAll()&lt;/a&gt; method. A really useful feature of this method is that we can go deeper and deeper into the hierarchy using the greater-than (&amp;gt;) symbol to define the parent element (on the left) and the child we want to grab (on the right).&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;document.querySelectorAll('table.stripe &amp;amp;amp;amp;amp;amp;gt; tbody &amp;amp;amp;amp;amp;amp;gt; tr &amp;amp;amp;amp;amp;amp;gt; td')[0]
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z0p4lbeI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/browser-console-test1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z0p4lbeI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/browser-console-test1.gif" alt="" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That couldn’t work any better. As you can see, once we grab all the &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt; elements, they become a NodeList. Because we can’t rely on a class to grab each cell, all we need to know is their position in the index; the first one, name, is 0.&lt;/p&gt;

&lt;p&gt;From there, we can write our code like this:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for row in rows:
    name = row.find_all('td')[0].text
    print(name)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;In simple terms, we’re taking each row, one by one, and finding all the cells inside. Once we have the list, we grab only the first one in the index (position 0) and finish with the &lt;code&gt;.text&lt;/code&gt; method to grab only the element’s text, ignoring the HTML we don’t need.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SUKVTLUH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img9-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SUKVTLUH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img9-python-loop-thru-html-tabular-data.png" alt="" width="700" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There they are: a list with all the employees’ names! For the rest, we just follow the same logic:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;position = row.find_all('td')[1].text
office = row.find_all('td')[2].text
age = row.find_all('td')[3].text
start_date = row.find_all('td')[4].text
salary = row.find_all('td')[5].text
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;However, having all this data printed on our console isn’t super helpful. Instead, let’s store this data into a new, more useful format.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Storing Tabular Data Into a JSON File
&lt;/h3&gt;

&lt;p&gt;Although we could easily create a CSV file and send our data there, that wouldn’t be the most manageable format if we want to create something new using the scraped data.&lt;/p&gt;

&lt;p&gt;Still, here’s a project we did a few months ago explaining how to &lt;a href="https://www.scraperapi.com/blog/linkedin-scraper-python/"&gt;create a CSV file to store scraped data&lt;/a&gt;.&lt;/p&gt;
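&lt;p&gt;For reference, the CSV route only takes a few lines with Python’s built-in &lt;code&gt;csv&lt;/code&gt; module. Here’s a sketch using a sample row shaped like the data we’re scraping:&lt;/p&gt;

```python
import csv

# A sample row shaped like our scraped employee data
employee_list = [
    {'Name': 'Airi Satou', 'Position': 'Accountant', 'Office': 'Tokyo',
     'Age': '33', 'Start date': '2008/11/28', 'salary': '$162,700'},
]

# DictWriter maps each dictionary onto a CSV row; the dictionary keys
# become the header row
with open('employee_data.csv', 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=employee_list[0].keys())
    writer.writeheader()
    writer.writerows(employee_list)
```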

&lt;p&gt;The good news is that Python has its own JSON module for working with JSON objects, so we don’t need to install anything, just import it.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;But before we can go ahead and create our JSON file, we’ll need to turn all this scraped data into a list. To do so, we’ll create an empty array outside of our loop.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee_list = []
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;And then append the data to it, with each loop appending a new object to the array.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee_list.append({
    'Name': name,
    'Position': position,
    'Office': office,
    'Age': age,
    'Start date': start_date,
    'salary': salary
})
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;If we &lt;code&gt;print(employee_list)&lt;/code&gt;, here’s the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ipEWb6Xz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img10-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ipEWb6Xz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img10-python-loop-thru-html-tabular-data.png" alt="" width="880" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Still a little messy, but we have a set of objects ready to be transformed into JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; As a test, we printed the length of &lt;code&gt;employee_list&lt;/code&gt; and it returned 57, which is the correct number of rows we scraped (rows now being objects within the array).&lt;/p&gt;

&lt;p&gt;Importing a list to JSON just requires two lines of code:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with open('json_data', 'w') as json_file:
   json.dump(employee_list, json_file, indent=2)
&lt;/code&gt;&lt;/pre&gt;


&lt;ul&gt;
&lt;li&gt;  First, we open a new file, passing in the name we want for the file (&lt;code&gt;json_data&lt;/code&gt;) and &lt;code&gt;'w'&lt;/code&gt;, as we want to write data to it.&lt;/li&gt;
&lt;li&gt;  Next, we use the &lt;code&gt;.dump()&lt;/code&gt; function to, well, dump the data from the array (&lt;code&gt;employee_list&lt;/code&gt;), with &lt;code&gt;indent=2&lt;/code&gt; so every object gets its own line instead of everything sitting on one unreadable line.&lt;/li&gt;
&lt;/ul&gt;
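&lt;p&gt;To double-check the export, we can read the file back with &lt;code&gt;json.load()&lt;/code&gt; and compare it against the original list. A small sketch with sample data:&lt;/p&gt;

```python
import json

# Sample data standing in for our scraped list
employee_list = [{'Name': 'Airi Satou', 'Position': 'Accountant'}]

with open('json_data', 'w') as json_file:
    json.dump(employee_list, json_file, indent=2)

# Reading the file back returns the same list of dictionaries
with open('json_data') as json_file:
    loaded = json.load(json_file)

print(loaded == employee_list)  # True
```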

&lt;h3&gt;
  
  
  6. Running the Script and Full Code
&lt;/h3&gt;

&lt;p&gt;If you’ve been following along, &lt;strong&gt;your codebase should look like this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#dependencies
import requests
from bs4 import BeautifulSoup
import json

url = 'http://api.scraperapi.com?api_key=51e43be283e4db2a5afbxxxxxxxxxxx&amp;amp;url=https://datatables.net/examples/styling/stripe.html'

#empty array
employee_list = []

#requesting and parsing the HTML file
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

#selecting the table
table = soup.find('table', class_ = 'stripe')
#storing all rows into one variable
for employee_data in table.find_all('tbody'):
   rows = employee_data.find_all('tr')
   #looping through the HTML table to scrape the data
   for row in rows:
       name = row.find_all('td')[0].text
       position = row.find_all('td')[1].text
       office = row.find_all('td')[2].text
       age = row.find_all('td')[3].text
       start_date = row.find_all('td')[4].text
       salary = row.find_all('td')[5].text
       #sending scraped data to the empty array
       employee_list.append({
           'Name': name,
           'Position': position,
           'Office': office,
           'Age': age,
           'Start date': start_date,
           'salary': salary
       })
#importing the array to a JSON file
with open('employee_data', 'w') as json_file:
   json.dump(employee_list, json_file, indent=2)
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; We added some comments for context.&lt;/p&gt;

&lt;p&gt;And here’s a look at the first three objects from the JSON file:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0dzmuE0v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img11-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0dzmuE0v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img11-python-loop-thru-html-tabular-data.png" alt="" width="427" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Storing scraped data in JSON format allows us to repurpose the information for new applications or further analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping HTML Tables Using Pandas
&lt;/h2&gt;

&lt;p&gt;Before you leave the page, we want to explore a second approach for scraping HTML tables. In a few lines of code, we can grab all tabular data from an HTML document and store it in a DataFrame using Pandas.&lt;/p&gt;

&lt;p&gt;Create a new folder inside the project’s directory (we named it &lt;em&gt;pandas-html-table-scraper&lt;/em&gt;) and create a new file named &lt;em&gt;pandas_table_scraper.py&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Let’s open a new terminal, navigate to the folder we just created (&lt;code&gt;cd pandas-html-table-scraper&lt;/code&gt;), and from there install Pandas:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pandas
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;And we import it at the top of the file.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Pandas has a function called &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.read_html.html"&gt;read_html()&lt;/a&gt; which, essentially, scrapes the target URL for us and returns all HTML tables as a list of DataFrame objects.&lt;/p&gt;

&lt;p&gt;However, for this to work, the HTML table needs to be at least somewhat decently structured, as the function looks for elements like &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt; to identify the tables in the file.&lt;/p&gt;

&lt;p&gt;To use the function, let’s create a new variable and pass the URL we used previously to it:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee_data = pd.read_html('http://api.scraperapi.com?api_key=51e43be283e4db2a5afbxxxxxxxxxxxx&amp;amp;url=https://datatables.net/examples/styling/stripe.html')
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Printing the variable returns a list of the HTML tables found within the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qojrCZwg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img12-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qojrCZwg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img12-python-loop-thru-html-tabular-data.png" alt="" width="622" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we compare the first three rows in the DataFrame, they’re a perfect match for what we scraped with Beautiful Soup.&lt;/p&gt;

&lt;p&gt;To work with JSON, Pandas has a built-in &lt;a href="https://appdividend.com/2022/03/15/pandas-to_json/#:~:text=To%20convert%20the%20object%20to,use%20the%20to_json()%20function."&gt;.to_json()&lt;/a&gt; function. It converts a DataFrame object into a JSON string.&lt;/p&gt;

&lt;p&gt;All we need to do is call the method on our DataFrame, pass in the path, set the format (&lt;code&gt;split&lt;/code&gt;, &lt;code&gt;records&lt;/code&gt;, &lt;code&gt;index&lt;/code&gt;, etc.), and add an indent to make it more readable:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;employee_data[0].to_json('./employee_list.json', orient='index', indent=2)
&lt;/code&gt;&lt;/pre&gt;
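&lt;p&gt;If you want to inspect the output before writing a file, &lt;code&gt;.to_json()&lt;/code&gt; returns the JSON as a string when no path is passed. Here’s a quick sketch on a tiny hand-made DataFrame (sample data, not the scraped table):&lt;/p&gt;

```python
import pandas as pd

# A tiny stand-in for the scraped employee table
df = pd.DataFrame([
    {"Name": "Airi Satou", "Position": "Accountant", "Office": "Tokyo"},
    {"Name": "Angelica Ramos", "Position": "CEO", "Office": "London"},
])

# With no path argument, to_json() returns the string instead of writing a file;
# orient='index' keys each record by its row index, indent=2 pretty-prints it
json_string = df.to_json(orient="index", indent=2)
print(json_string)
```

&lt;p&gt;The same &lt;code&gt;orient&lt;/code&gt; and &lt;code&gt;indent&lt;/code&gt; arguments work whether Pandas writes to a file or returns the string.&lt;/p&gt;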


&lt;p&gt;If we run our code now, here’s the resulting file:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zFEf-NC7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img13-python-loop-thru-html-tabular-data.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zFEf-NC7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/08/img13-python-loop-thru-html-tabular-data.png" alt="" width="360" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that we needed to select our table at index &lt;code&gt;[0]&lt;/code&gt; because &lt;code&gt;.read_html()&lt;/code&gt; returns a list, not a single object.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s the full code for your reference:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

employee_data = pd.read_html('http://api.scraperapi.com?api_key=51e43be283e4db2a5afbxxxxxxxxxxxx&amp;amp;url=https://datatables.net/examples/styling/stripe.html')

employee_data[0].to_json('./employee_list.json', orient='index', indent=2)
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Armed with this new knowledge, you’re ready to start scraping virtually any HTML table on the web. Just remember that if you understand how the website is structured and the logic behind it, there’s nothing you can’t scrape.&lt;/p&gt;

&lt;p&gt;That said, these methods only work as long as the data is inside the HTML file. If you encounter a dynamically generated table, you’ll need to find a new approach. For these types of tables, we’ve created a &lt;a href="https://www.scraperapi.com/blog/scrape-javascript-tables-python/"&gt;step-by-step guide to scraping JavaScript tables with Python&lt;/a&gt; without the need for headless browsers.&lt;/p&gt;

&lt;p&gt;Until next time, happy scraping!&lt;/p&gt;

&lt;p&gt;Originally published on Scraper API: &lt;a href="https://www.scraperapi.com/blog/python-loop-through-html-table/"&gt;How to Use Python to Loop Through HTML Tables and Scrape Tabular Data&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Scrape HTML Tables in JavaScript [Export Table Data to a CSV]</title>
      <dc:creator>ScraperAPI Zoltan Bettenbuk</dc:creator>
      <pubDate>Mon, 18 Jul 2022 18:56:29 +0000</pubDate>
      <link>https://forem.com/scraperapi/how-to-scrape-html-tables-in-javascript-export-table-data-to-a-csv-239b</link>
      <guid>https://forem.com/scraperapi/how-to-scrape-html-tables-in-javascript-export-table-data-to-a-csv-239b</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m571PPpe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/htmltable-to-csv-feat.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m571PPpe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/htmltable-to-csv-feat.jpg" alt="" width="880" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published on &lt;a href="https://www.scraperapi.com/blog/scrape-html-table-to-csv/"&gt;ScraperAPI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;HTML tables are the best data sources on the web. They are easy to understand and can hold an immense amount of data in a simple-to-read and understand format. Being able to scrape HTML tables is a crucial skill to develop for any developer interested in data science or in data analysis in general.&lt;/p&gt;

&lt;p&gt;In this tutorial, we’re going to go deeper into HTML tables and build a simple, yet powerful, script to extract tabular data and export it to a CSV file.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an HTML Web Table?
&lt;/h2&gt;

&lt;p&gt;An &lt;a href="https://www.w3schools.com/tags/tag_table.asp"&gt;HTML table&lt;/a&gt; is a set of rows and columns that are used to display information in a grid format directly on a web page. They are commonly used to display tabular data, such as spreadsheets or databases, and are a great source of data for our projects.&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://www.scraperapi.com/blog/how-to-scrape-football-data/"&gt;sports data&lt;/a&gt; and weather data to books and authors’ data, most big datasets on the web are accessible through HTML tables because of how great they are to display information in a structured and easy-to-navigate format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U3pXTPh4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/wiki-table1-1024x412.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U3pXTPh4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/wiki-table1-1024x412.png" alt="" width="880" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The great news for us is that, unlike dynamically generated content, the HTML table’s data lives directly inside of the table element in the HTML file, meaning that we can scrape all the information we need exactly as we would with other elements of the web – as long as we understand their structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding HTML Table’s Structure
&lt;/h2&gt;

&lt;p&gt;Though you can only see the columns and rows in the front end, these tables are actually created using a few different HTML tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt;: It marks the start of an HTML table&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt;: Indicates a row in the table&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt;: Defines a cell in the table&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The content goes inside the &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt; tags, while &lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt; tags group cells into a row. In other words, an HTML table follows a &lt;em&gt;Table &amp;gt; Row &amp;gt; Cell&lt;/em&gt; (&lt;em&gt;table &amp;gt; tr &amp;gt; td&lt;/em&gt;) hierarchy.&lt;/p&gt;

&lt;p&gt;A special cell can be created using the &lt;code&gt;&amp;lt;th&amp;gt;&lt;/code&gt; (&lt;em&gt;table header&lt;/em&gt;) tag. Using it for the cells of the first row indicates that the row is the heading of the table.&lt;/p&gt;

&lt;p&gt;Here is an example to create a simple two-row and two-column based HTML table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gist.github.com/scraperapikerins/ada86e5a5c5398599b77d4399be543ed"&gt;https://gist.github.com/scraperapikerins/ada86e5a5c5398599b77d4399be543ed&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;table&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;th&amp;gt;&lt;/span&gt;Pet 1&lt;span class="nt"&gt;&amp;lt;/th&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;th&amp;gt;&lt;/span&gt;Pet 2&lt;span class="nt"&gt;&amp;lt;/th&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Dog&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;Cat&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/table&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;There’s one major difference when scraping HTML tables though. Unlike other elements on a web page, CSS selectors target the overall cells and rows – or even the entire table – because all of these elements are actually components of the &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt; element.&lt;/p&gt;

&lt;p&gt;Instead of targeting a CSS selector for each data point we want to scrape, we’ll need to create a list of all the rows in the table and loop through them to grab the data from their cells.&lt;/p&gt;

&lt;p&gt;If we understand this logic, creating our script is actually pretty straightforward.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scraping HTML Tables to CSV with Node.JS
&lt;/h2&gt;

&lt;p&gt;If this is your first time using Node.JS for web scraping, it might be useful to go through some of our previous tutorials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.scraperapi.com/blog/web-scraping-javascript-tutorial/"&gt;Web Scraping with JavaScript and Node.js&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.scraperapi.com/uncategorized/how-to-build-a-linkedin-scraper/"&gt;How to Build a LinkedIn Scraper For Free&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.scraperapi.com/blog/how-to-scrape-football-data/"&gt;How to Build a Football Data Scraper Step-by-Step&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, we’ll keep this tutorial as beginner-friendly as possible so you can use it even as a starting point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; For Node.JS installation instructions, please refer to the first article on the list.&lt;/p&gt;

&lt;p&gt;For today’s project, we’ll build a web scraper using &lt;a href="https://www.npmjs.com/package/axios"&gt;Axios&lt;/a&gt; and &lt;a href="https://cheerio.js.org/"&gt;Cheerio&lt;/a&gt; to scrape the employee data displayed on &lt;a href="https://datatables.net/examples/styling/display.html"&gt;https://datatables.net/examples/styling/display.html&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SixVyxw2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/data-tables2-1024x632.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SixVyxw2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/data-tables2-1024x632.jpg" alt="" width="880" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ll be extracting the name, position, office, age, start date, and salary for each employee, and then send the data to a CSV using the &lt;a href="https://www.npmjs.com/package/objects-to-csv"&gt;ObjectsToCsv&lt;/a&gt; package.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Getting Our Files Ready
&lt;/h3&gt;

&lt;p&gt;To kickstart our project, let’s create a new directory named &lt;em&gt;html-table-scraper&lt;/em&gt;, open the new folder on VScode (or your code editor of preference) and open a new terminal.&lt;/p&gt;

&lt;p&gt;In the terminal, we’ll run &lt;code&gt;npm init -y&lt;/code&gt; to start a new Node.JS project. You’ll now have a &lt;em&gt;package.json&lt;/em&gt; file in your folder.&lt;/p&gt;

&lt;p&gt;Next, we’ll install our dependencies using the following commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Axios: &lt;code&gt;npm install axios&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Cheerio: &lt;code&gt;npm install cheerio&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  ObjectsToCsv: &lt;code&gt;npm install objects-to-csv&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or, for a one-command installation: &lt;code&gt;npm i axios cheerio objects-to-csv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now we can create a new file named &lt;em&gt;tablescraper.js&lt;/em&gt; and import our dependencies at the top.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cheerio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ObjectsToCsv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;objects-to-csv&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Your project should now look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W2uO9Mis--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/html-table-scraper3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W2uO9Mis--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/html-table-scraper3.png" alt="" width="630" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Testing the Target Site Using DevTools
&lt;/h3&gt;

&lt;p&gt;Before writing the code, we need to understand how the website is structured. Yes, all tables use the same basic structure, but that doesn’t mean they’re all created equally.&lt;/p&gt;

&lt;p&gt;The first thing we need to determine is whether or not this is, in fact, an HTML table. It’s very common for sites to use JavaScript to inject data into their tables, especially if there’s any real-time data involved. In those cases, we’d have to use a totally different approach, like a headless browser.&lt;/p&gt;

&lt;p&gt;To test if the data is inside the HTML file, all we need to do is copy some data points – let’s say the name – and look for it in the source code of the page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K-quFp97--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/view-source-code4-1024x483.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K-quFp97--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/view-source-code4-1024x483.jpg" alt="" width="880" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We did the same for other names and data points just to make sure, and yes, all the data is right there at our disposal. Another interesting surprise is that all the rows of the table are inside the raw HTML, even though there seems to be some kind of pagination on the front end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hORe5v92--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/pagination5-1024x101.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hORe5v92--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/pagination5-1024x101.png" alt="" width="880" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Plus, we also now know that there are a total of 57 rows to scrape. This is important because we can know whether or not we’re actually grabbing all the data available.&lt;/p&gt;

&lt;p&gt;The second thing we want to test directly in the browser is our selectors. Instead of sending a bunch of unnecessary requests, we can use the browser’s console to grab elements with the &lt;code&gt;document.querySelectorAll()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;If we go to the console and type &lt;code&gt;document.querySelectorAll('table')&lt;/code&gt;, it returns four different tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--prkEzirH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/query-selector6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--prkEzirH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/query-selector6.png" alt="" width="866" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mousing over the tables, we quickly realized that the first table (number 0) is the right one. So let’s do it again, but specifying the class – which in the console’s list appears after the dots (.).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6rKGnBDj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/class-selector7-1024x470.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6rKGnBDj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/class-selector7-1024x470.jpg" alt="" width="880" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great, we’re one step closer to our data!&lt;/p&gt;

&lt;p&gt;Taking a closer look, the table’s data is wrapped in a &lt;code&gt;&amp;lt;tbody&amp;gt;&lt;/code&gt; tag, so let’s add it to our selector to make sure we’re only grabbing the rows containing the data we want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KmB57t7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/body-tag8-1024x416.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KmB57t7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/body-tag8-1024x416.png" alt="" width="880" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lastly, we’ll want to grab all the rows and verify that our selector is picking up all 57 of them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OQXegSIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/grab-rows9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OQXegSIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/grab-rows9.jpg" alt="" width="880" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Because we’re using the console to select elements on the rendered HTML, we needed to set the number of displayed items to 100. Otherwise, our selector in the console would only show 10 node items.&lt;/p&gt;

&lt;p&gt;With all this information, we can now start writing our code!&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Sending Our HTTP Request and Parsing the Raw HTML
&lt;/h3&gt;

&lt;p&gt;Axios makes it super easy to send HTTP requests inside an &lt;code&gt;async&lt;/code&gt; function. All we need to do is create the function, await the Axios call with our URL, and store the result in a constant named &lt;code&gt;response&lt;/code&gt;. We’ll also log the status code of the response (which should be 200 for a successful request).&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;html_scraper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://datatables.net/examples/styling/display.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--skKA6V3G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/async-function10-1024x149.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--skKA6V3G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/async-function10-1024x149.png" alt="" width="880" height="128"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can name these variables as you’d like, but keep them as descriptive as possible.&lt;/p&gt;

&lt;p&gt;Next, we’ll store the data from the response (the raw HTML) in a new constant named &lt;code&gt;html&lt;/code&gt; so we can pass it to Cheerio for parsing using &lt;code&gt;cheerio.load()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;h3&gt;
  
  
  4. Iterating Through the HTML Table Rows
&lt;/h3&gt;

&lt;p&gt;Using the selector we’ve tested before, let’s select all the rows inside the HTML table.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allRows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;table.display &amp;amp;amp;gt; tbody &amp;amp;amp;gt; tr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allRows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;For testing purposes, let’s &lt;code&gt;console.log()&lt;/code&gt; the length of &lt;code&gt;allRows&lt;/code&gt; to verify that, indeed, we’ve picked up all our target rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3ZoZqCjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/console-log-allrows11-1024x164.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3ZoZqCjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/console-log-allrows11-1024x164.png" alt="" width="880" height="141"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;57 is exactly what we were aiming for!&lt;/p&gt;

&lt;p&gt;Of course, to go through the list of rows, we’ll be using the &lt;code&gt;.each()&lt;/code&gt; method, but there’s one more thing we need to figure out: the order of the cells.&lt;/p&gt;

&lt;p&gt;Unlike common HTML elements, cells don’t have a unique class assigned to them, so trying to scrape each data point by CSS class could get messy. Instead, we’re going to target each &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt;’s position within its row.&lt;/p&gt;

&lt;p&gt;In other words, we’ll tell our script to go to each row, select all cells inside the row, and then store each data point in a variable based on its position within the row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In Node.JS, arrays are zero-indexed. So the first cell would be at position &lt;code&gt;[0]&lt;/code&gt; and the second cell at &lt;code&gt;[1]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But how do we know which position is which? We go back to our browser’s console and test it out:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nUEZhOdl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/cell-position-testing.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nUEZhOdl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/cell-position-testing.gif" alt="" width="880" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we know where each element is in relation to the rest, here’s the finished parser:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;allRows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;amp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;td&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;office&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;startDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;h3&gt;
  
  
  5. Pushing the Scraped Data Into an Empty Array
&lt;/h3&gt;

&lt;p&gt;If we &lt;code&gt;console.log()&lt;/code&gt; the scraped data, we’ll see that we’re extracting the text out of each cell, but the results are disorganized, which in turn makes it harder to build our CSV file.&lt;/p&gt;

&lt;p&gt;So before we export our data, let’s give it some order by pushing the data to an empty array to create a simple node list.&lt;/p&gt;

&lt;p&gt;First, declare an empty array outside of the main function; if you declare it inside the &lt;code&gt;.each()&lt;/code&gt; loop, it will be reset with every iteration, which is not something we want.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;employeeData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Then, as part of our parser, let’s use the &lt;code&gt;.push()&lt;/code&gt; method to store our data in the array we’ve just created.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;employeeData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Position&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Office&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;office&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Age&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Start Date&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;startDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Salary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;As always, let’s &lt;code&gt;console.log()&lt;/code&gt; the length of &lt;code&gt;employeeData&lt;/code&gt; to make sure that we now have 57 items in it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--quoDSFcR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/employeesdata12-1024x725.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--quoDSFcR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/employeesdata12-1024x725.jpg" alt="" width="880" height="623"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For visual context, we can also log the array to see what’s stored inside.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m32yJ_tb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/log-array13.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m32yJ_tb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/log-array13.jpg" alt="" width="632" height="900"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, each row is now stored as an object that holds every piece of data in a structured format.&lt;/p&gt;
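&lt;p&gt;One caveat: &lt;code&gt;.text()&lt;/code&gt; returns raw strings, so fields like Age and Salary come out as text and may carry stray whitespace. As a small, hypothetical addition to the tutorial (the helper names below are our own, not part of the original script), here’s one way you might normalize the values before pushing them:&lt;/p&gt;

```javascript
// Hypothetical normalizers for scraped cell text (not part of the tutorial).
// cleanText() strips stray whitespace; parseSalary() turns "$162,700" into 162700.
function cleanText(value) {
  return value.trim();
}

function parseSalary(value) {
  // Drop the currency symbol and thousands separators, then parse as a number.
  return Number(value.replace(/[^0-9.]/g, ""));
}

console.log(cleanText("  Airi Satou  ")); // "Airi Satou"
console.log(parseSalary("$162,700"));     // 162700
```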

&lt;h3&gt;
  
  
  6. Sending Scraped Data to a CSV File
&lt;/h3&gt;

&lt;p&gt;With our data organized, we can pass our list to  &lt;code&gt;ObjectsToCsv&lt;/code&gt;  and it’ll create the file for us with no extra work:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;csv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;ObjectsToCsv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;employeeData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toDisk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./employeeData.csv&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;All we need to do is create a new csv object by passing the array to &lt;code&gt;ObjectsToCsv&lt;/code&gt;, then tell it to save the file on our machine by providing the path.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. HTML Table Scraper [Full Code]
&lt;/h3&gt;

&lt;p&gt;Congratulations, you’ve officially created your first HTML table scraper! Compare your code to the finished codebase of this tutorial to ensure you haven’t missed anything:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cheerio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ObjectsToCsv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;objects-to-csv&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;employeeData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;html_scraper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://datatables.net/examples/styling/display.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

   &lt;span class="c1"&gt;//Selecting all rows inside our target table&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allRows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;table.display &amp;amp;gt; tbody &amp;amp;gt; tr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Going through rows&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="c1"&gt;//Looping through the rows&lt;/span&gt;
   &lt;span class="nx"&gt;allRows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="c1"&gt;//Selecting all cells within the row&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;td&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
       &lt;span class="c1"&gt;//Extracting the text out of each cell&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;office&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;startDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
       &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

       &lt;span class="c1"&gt;//Pushing scraped data to our empty array&lt;/span&gt;
       &lt;span class="nx"&gt;employeeData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Position&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;position&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Office&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;office&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Age&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Start Date&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;startDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Salary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;})&lt;/span&gt;
   &lt;span class="p"&gt;})&lt;/span&gt;
   &lt;span class="c1"&gt;//Exporting scraped data to a CSV file&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Saving data to CSV&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;csv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;ObjectsToCsv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;employeeData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;toDisk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./employeeData.csv&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Saved to CSV&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;After running our script, a new CSV file gets created inside our project’s folder:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hcEBq-0z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/csv-output14-1024x536.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hcEBq-0z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraperapi.com/wp-content/uploads/2022/06/csv-output14-1024x536.jpg" alt="" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you can use this data for further analysis, like comparing salaries by job title or start date, or looking for trends across bigger job datasets.&lt;/p&gt;
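&lt;p&gt;As a quick sketch of that kind of analysis (using made-up rows in place of the scraped array), here’s how you might average salaries per office once the salary strings are parsed into numbers:&lt;/p&gt;

```javascript
// Hypothetical sample standing in for the scraped employeeData array.
const employeeData = [
  { Office: "Tokyo", Salary: "$162,700" },
  { Office: "Tokyo", Salary: "$137,300" },
  { Office: "London", Salary: "$320,800" },
];

// Group salaries by office, accumulating a sum and a count per office.
const totals = {};
for (const row of employeeData) {
  const salary = Number(row.Salary.replace(/[^0-9.]/g, ""));
  const entry = totals[row.Office] || { sum: 0, count: 0 };
  entry.sum += salary;
  entry.count += 1;
  totals[row.Office] = entry;
}

// Print the average salary for each office.
for (const [office, { sum, count }] of Object.entries(totals)) {
  console.log(office, Math.round(sum / count));
}
// prints: Tokyo 150000, then London 320800
```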

&lt;p&gt;Of course, this script can be adapted to handle almost any HTML table you’ll find, so keep your mind open to new possibilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Avoid Getting Blocked: Integrating ScraperAPI in a Single Line of Code
&lt;/h2&gt;

&lt;p&gt;Before you go, there’s one more thing we need to do to make our scraper more resilient: handle anti-scraping techniques and systems. A lot of websites don’t like being scraped because, sadly, many scrapers are badly optimized and end up hurting the sites they target.&lt;/p&gt;

&lt;p&gt;For that reason, you need to follow some &lt;a href="https://www.scraperapi.com/blog/web-scraping-best-practices/"&gt;web scraping best practices&lt;/a&gt; to ensure that you’re handling your projects correctly, without putting too much pressure on your target website or putting your script and IP at risk of getting banned or blacklisted, which would make it impossible to access the needed data from your machine again.&lt;/p&gt;

&lt;p&gt;To handle IP rotation, JavaScript rendering, &lt;a href="https://www.scraperapi.com/blog/headers-and-cookies-for-web-scraping/"&gt;HTTP headers&lt;/a&gt;, CAPTCHAs, and more, all we need to do is send our initial request through ScraperAPI’s servers. The API draws on years of statistical analysis and machine learning to pick the best combination of headers and proxies, retries any unsuccessful requests, and times our requests so they don’t overload the target server.&lt;/p&gt;

&lt;p&gt;Adding it to our script is as simple as adding this string to the URL passed to Axios:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://api.scraperapi.com?api_key={Your_API_Key}&amp;amp;url=https://datatables.net/examples/styling/display.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;Remember to substitute &lt;code&gt;{Your_API_Key}&lt;/code&gt; with your own API key – which you can generate by creating a &lt;a href="https://www.scraperapi.com/signup"&gt;free ScraperAPI account&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Your initial request will take a little longer while ScraperAPI handles these complexities for you, and you’ll only consume API credits for successful requests.&lt;/p&gt;
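&lt;p&gt;One detail worth flagging: if your target URL carries query parameters of its own, it should be URL-encoded before being appended. A safe way to build the request string (assuming the same &lt;code&gt;api.scraperapi.com&lt;/code&gt; endpoint shown above; the helper name is our own) is:&lt;/p&gt;

```javascript
// Build the ScraperAPI request URL, encoding the target URL so its own
// query string can't be misread as parameters of the API call itself.
function buildScraperApiUrl(apiKey, targetUrl) {
  const params = new URLSearchParams({ api_key: apiKey, url: targetUrl });
  return "http://api.scraperapi.com?" + params.toString();
}

console.log(
  buildScraperApiUrl("{Your_API_Key}", "https://datatables.net/examples/styling/display.html")
);
```

&lt;p&gt;Pass the returned string to Axios exactly as before.&lt;/p&gt;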

&lt;p&gt;Now it’s your turn. Web scraping is all about practice. Every website is a different puzzle so there’s no one way to do things. Instead, focus on using the foundations to take on more complex challenges.&lt;/p&gt;

&lt;p&gt;If you want to keep practicing, a few websites we recommend are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://quotes.toscrape.com/"&gt;https://quotes.toscrape.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://books.toscrape.com/"&gt;https://books.toscrape.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://datatables.net/examples/index"&gt;https://datatables.net/examples/index&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Until next time, happy scraping!&lt;/p&gt;



</description>
      <category>javascript</category>
      <category>html</category>
      <category>webdev</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
