
Valentina Skakun for HasData


Scraping All Site URLs

This guide is for anyone who wants to find all URLs on a website, go through them, and extract some data without spending time writing CSS selectors. Instead, I’ll let AI do the heavy lifting. The example is in Python because that’s the language I’m most comfortable with.

Step 1. Install and import libraries

You can install the required library with:

pip install requests

The json and xml.etree.ElementTree modules are built into Python, so you don’t need to install them.
Now import everything:

import requests
import json
import xml.etree.ElementTree as ET

Step 2. Set your variables

You’ll need HasData’s API key, as we use this API to send requests and process content with an LLM, plus the API endpoint you’ll be calling:

api_key = "YOUR-API-KEY"
url = "https://api.hasdata.com/scrape/web"

Then set the sitemap URL:

sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"

If the site you want to scrape doesn’t have a sitemap, you can use HasData’s web crawler instead; it also supports AI-based parsing and will automatically find the right pages for you.
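Another quick check before reaching for a crawler: many sites list their sitemaps in robots.txt via Sitemap: directives. Here’s a minimal sketch of that lookup; the find_sitemaps helper, the 10-second timeout, and the /sitemap.xml fallback are my own choices, not part of the script below:

def find_sitemaps(base_url):
    # Look for "Sitemap:" directives in robots.txt, falling back to /sitemap.xml
    robots_url = base_url.rstrip("/") + "/robots.txt"
    try:
        resp = requests.get(robots_url, timeout=10)
        resp.raise_for_status()
        sitemaps = [
            line.split(":", 1)[1].strip()
            for line in resp.text.splitlines()
            if line.lower().startswith("sitemap:")
        ]
        if sitemaps:
            return sitemaps
    except requests.RequestException:
        pass
    return [base_url.rstrip("/") + "/sitemap.xml"]

print(find_sitemaps("https://demo.nopcommerce.com"))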

Step 3. Get all page URLs

Let’s make an API request to fetch the sitemap. First, define the request body:

payload = json.dumps({
  "url": sitemap_url,
  "proxyType": "datacenter",
  "proxyCountry": "US"
})

Then the headers:

headers = {
  'Content-Type': 'application/json',
  'x-api-key': api_key
}

Now send the request:

response = requests.post(url, headers=headers, data=payload)
response.raise_for_status()

Parse the sitemap:

root = ET.fromstring(response.json().get("content"))
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

And extract all the page URLs:

links = [loc.text for loc in root.findall('.//ns:loc', namespace)]
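One thing to watch for: some sites serve a sitemap index (a root <sitemapindex> element pointing to child sitemaps) rather than a flat page list, in which case the <loc> values above are sitemap URLs and need one more round of fetching. A sketch of that case, reusing the same endpoint, headers, and namespace as above (the extract_locs helper is my own):

def extract_locs(xml_text):
    # Return every <loc> value from a sitemap or sitemap index document
    return [loc.text for loc in ET.fromstring(xml_text).findall('.//ns:loc', namespace)]

if root.tag.endswith('sitemapindex'):
    page_links = []
    for child_sitemap in links:  # "links" currently holds child sitemap URLs
        child_payload = json.dumps({
            "url": child_sitemap,
            "proxyType": "datacenter",
            "proxyCountry": "US"
        })
        child_resp = requests.post(url, headers=headers, data=child_payload)
        child_resp.raise_for_status()
        page_links.extend(extract_locs(child_resp.json().get("content")))
    links = page_links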

You’ll also need a variable to store parsed results:

parsed = []
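Optionally, before looping, you can deduplicate the list and cap it while testing so the first run doesn’t hit every page of the site; the cap of 10 is arbitrary and only meant for a trial run:

links = list(dict.fromkeys(links))  # dedupe while preserving order
links = links[:10]                  # optional: limit pages during a test run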

Step 4. Loop through links and extract data

Now loop through all the links:

for link in links:

Inside the loop, set up the payload with AI extraction rules:

    payload = json.dumps({
        "url": link,
        "proxyType": "datacenter",
        "proxyCountry": "US",
        "aiExtractRules": {
            "products": {
                "type": "list",
                "output": {
                    "title": {
                        "description": "title of product",
                        "type": "string"
                    },
                    "description": {
                        "description": "information about the product",
                        "type": "string"
                    },
                    "price": {
                        "description": "price of the product",
                        "type": "string"
                    }
                }
            }
        }
    })
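Since the rules don’t change from page to page, you can also define them once before the loop and reference that variable in the payload; ai_extract_rules is a name I’m introducing here, not something from the original script:

ai_extract_rules = {
    "products": {
        "type": "list",
        "output": {
            "title": {"description": "title of product", "type": "string"},
            "description": {"description": "information about the product", "type": "string"},
            "price": {"description": "price of the product", "type": "string"}
        }
    }
}

With that in place, the payload inside the loop becomes json.dumps({"url": link, "proxyType": "datacenter", "proxyCountry": "US", "aiExtractRules": ai_extract_rules}).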

You can find the full list of supported parameters in the official docs.
Now send the request:

    response = requests.post(url, headers=headers, data=payload)
    response.raise_for_status()

Parse the response:

    data = response.json()
    ai_raw = data.get("aiResponse")
    parsed.append({
        "url": link,
        "products": ai_raw.get("products")
    })
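On a real site, one failing page (a timeout, a missing aiResponse field) shouldn’t abort the whole run, and it’s worth pacing requests. Here’s a hedged variant of the loop that assumes the ai_extract_rules dictionary from the earlier sketch; the 120-second timeout, the one-second delay, and the skip-on-error behavior are my own choices:

import time

for link in links:
    payload = json.dumps({
        "url": link,
        "proxyType": "datacenter",
        "proxyCountry": "US",
        "aiExtractRules": ai_extract_rules
    })
    try:
        response = requests.post(url, headers=headers, data=payload, timeout=120)
        response.raise_for_status()
        ai_raw = response.json().get("aiResponse") or {}
        parsed.append({"url": link, "products": ai_raw.get("products", [])})
    except requests.RequestException as exc:
        # Skip a failing page instead of stopping the whole loop
        print(f"Skipping {link}: {exc}")
    time.sleep(1)  # small pause between requests; tune to your plan's rate limits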

After processing all pages, save the results to a file:

with open("result_ai.json", "w", encoding="utf-8") as f:
    json.dump(parsed, f, indent=2, ensure_ascii=False)
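If a spreadsheet is easier to review than nested JSON, the parsed list flattens into one row per product with the standard csv module; the result_ai.csv filename and the column set are my choices:

import csv

with open("result_ai.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title", "description", "price"])
    writer.writeheader()
    for page in parsed:
        for product in page.get("products") or []:
            writer.writerow({
                "url": page["url"],
                "title": product.get("title"),
                "description": product.get("description"),
                "price": product.get("price")
            })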

The Result (TL;DR)

Here’s the full script:

import requests
import json
import xml.etree.ElementTree as ET


api_key = "YOUR-API-KEY"
sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"


url = "https://api.hasdata.com/scrape/web"


payload = json.dumps({
  "url": sitemap_url,
  "proxyType": "datacenter",
  "proxyCountry": "US"
})
headers = {
  'Content-Type': 'application/json',
  'x-api-key': api_key
}


response = requests.post(url, headers=headers, data=payload)
response.raise_for_status()


root = ET.fromstring(response.json().get("content"))
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
links = [loc.text for loc in root.findall('.//ns:loc', namespace)]


parsed = []


for link in links:
    payload = json.dumps({
        "url": link,
        "proxyType": "datacenter",
        "proxyCountry": "US",
        "aiExtractRules": {
            "products": {
                "type": "list",
                "output": {
                    "title": {
                        "description": "title of product",
                        "type": "string"
                    },
                    "description": {
                        "description": "information about the product",
                        "type": "string"
                    },
                    "price": {
                        "description": "price of the product",
                        "type": "string"
                    }
                }
            }
        }
    })


    response = requests.post(url, headers=headers, data=payload)
    response.raise_for_status()


    data = response.json()
    ai_raw = data.get("aiResponse")
    parsed.append({
        "url": link,
        "products": ai_raw.get("products")
    })


with open("result_ai.json", "w", encoding="utf-8") as f:
    json.dump(parsed, f, indent=2, ensure_ascii=False)

You can also check out our other article on finding URLs on a website; it includes more examples and options for those who prefer to write less code.
