
Valentina Skakun for HasData


Scraping All Site URLs

This guide is for anyone who wants to find all URLs on a website, go through them, and extract some data without spending time writing CSS selectors. Instead, I’ll let AI do the heavy lifting. The example is in Python because that’s the language I’m most comfortable with.

Step 1. Install and import libraries

You can install the required library with:

pip install requests

The json and xml.etree.ElementTree modules are built into Python, so you don’t need to install them.
Now import everything:

import requests
import json
import xml.etree.ElementTree as ET

Step 2. Set your variables

You’ll need HasData’s API key, as we use this API to send requests and process content with an LLM, plus the API endpoint you’ll be calling:

api_key = "YOUR-API-KEY"
url = "https://api.hasdata.com/scrape/web"

Then set the sitemap URL:

sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"

If the site you want to scrape doesn’t have a sitemap, you can use HasData’s web crawler instead; it also supports AI-based parsing and will automatically find the right pages for you.
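Another quick check before reaching for a crawler: many sites list their sitemaps in robots.txt via Sitemap: directives. Here’s a minimal sketch of that lookup; the find_sitemaps helper, the 10-second timeout, and the /sitemap.xml fallback are my own choices, not part of the script below:

def find_sitemaps(base_url):
    # Look for "Sitemap:" directives in robots.txt, falling back to /sitemap.xml
    robots_url = base_url.rstrip("/") + "/robots.txt"
    try:
        resp = requests.get(robots_url, timeout=10)
        resp.raise_for_status()
        sitemaps = [
            line.split(":", 1)[1].strip()
            for line in resp.text.splitlines()
            if line.lower().startswith("sitemap:")
        ]
        if sitemaps:
            return sitemaps
    except requests.RequestException:
        pass
    return [base_url.rstrip("/") + "/sitemap.xml"]

print(find_sitemaps("https://demo.nopcommerce.com"))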

Step 3. Get all page URLs

Let’s make an API request to fetch the sitemap. First, define the request body:

payload = json.dumps({
  "url": sitemap_url,
  "proxyType": "datacenter",
  "proxyCountry": "US"
})

Then the headers:

headers = {
  'Content-Type': 'application/json',
  'x-api-key': api_key
}

Now send the request:

response = requests.post(url, headers=headers, data=payload)
response.raise_for_status()

Parse the sitemap:

root = ET.fromstring(response.json().get("content"))
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

And extract all the page URLs:

links = [loc.text for loc in root.findall('.//ns:loc', namespace)]
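One thing to watch for: some sites serve a sitemap index (a root <sitemapindex> element pointing to child sitemaps) rather than a flat page list, in which case the <loc> values above are sitemap URLs and need one more round of fetching. A sketch of that case, reusing the same endpoint, headers, and namespace as above (the extract_locs helper is my own):

def extract_locs(xml_text):
    # Return every <loc> value from a sitemap or sitemap index document
    return [loc.text for loc in ET.fromstring(xml_text).findall('.//ns:loc', namespace)]

if root.tag.endswith('sitemapindex'):
    page_links = []
    for child_sitemap in links:  # "links" currently holds child sitemap URLs
        child_payload = json.dumps({
            "url": child_sitemap,
            "proxyType": "datacenter",
            "proxyCountry": "US"
        })
        child_resp = requests.post(url, headers=headers, data=child_payload)
        child_resp.raise_for_status()
        page_links.extend(extract_locs(child_resp.json().get("content")))
    links = page_links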

You’ll also need a variable to store parsed results:

parsed = []
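Optionally, before looping, you can deduplicate the list and cap it while testing so the first run doesn’t hit every page of the site; the cap of 10 is arbitrary and only meant for a trial run:

links = list(dict.fromkeys(links))  # dedupe while preserving order
links = links[:10]                  # optional: limit pages during a test run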

Step 4. Loop through links and extract data

Now loop through all the links:

for link in links:

Inside the loop, set up the payload with AI extraction rules:

    payload = json.dumps({
        "url": link,
        "proxyType": "datacenter",
        "proxyCountry": "US",
        "aiExtractRules": {
            "products": {
                "type": "list",
                "output": {
                    "title": {
                        "description": "title of product",
                        "type": "string"
                    },
                    "description": {
                        "description": "information about the product",
                        "type": "string"
                    },
                    "price": {
                        "description": "price of the product",
                        "type": "string"
                    }
                }
            }
        }
    })
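Since the rules don’t change from page to page, you can also define them once before the loop and reference that variable in the payload; ai_extract_rules is a name I’m introducing here, not something from the original script:

ai_extract_rules = {
    "products": {
        "type": "list",
        "output": {
            "title": {"description": "title of product", "type": "string"},
            "description": {"description": "information about the product", "type": "string"},
            "price": {"description": "price of the product", "type": "string"}
        }
    }
}

With that in place, the payload inside the loop becomes json.dumps({"url": link, "proxyType": "datacenter", "proxyCountry": "US", "aiExtractRules": ai_extract_rules}).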

You can find the full list of supported parameters in the official docs.
Now send the request:

    response = requests.post(url, headers=headers, data=payload)
    response.raise_for_status()

Parse the response:

    data = response.json()
    ai_raw = data.get("aiResponse")
    parsed.append({
        "url": link,
        "products": ai_raw.get("products")
    })
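On a real site, one failing page (a timeout, a missing aiResponse field) shouldn’t abort the whole run, and it’s worth pacing requests. Here’s a hedged variant of the loop that assumes the ai_extract_rules dictionary from the earlier sketch; the 120-second timeout, the one-second delay, and the skip-on-error behavior are my own choices:

import time

for link in links:
    payload = json.dumps({
        "url": link,
        "proxyType": "datacenter",
        "proxyCountry": "US",
        "aiExtractRules": ai_extract_rules
    })
    try:
        response = requests.post(url, headers=headers, data=payload, timeout=120)
        response.raise_for_status()
        ai_raw = response.json().get("aiResponse") or {}
        parsed.append({"url": link, "products": ai_raw.get("products", [])})
    except requests.RequestException as exc:
        # Skip a failing page instead of stopping the whole loop
        print(f"Skipping {link}: {exc}")
    time.sleep(1)  # small pause between requests; tune to your plan's rate limits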

After processing all pages, save the results to a file:

with open("result_ai.json", "w", encoding="utf-8") as f:
    json.dump(parsed, f, indent=2, ensure_ascii=False)
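If a spreadsheet is easier to review than nested JSON, the parsed list flattens into one row per product with the standard csv module; the result_ai.csv filename and the column set are my choices:

import csv

with open("result_ai.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title", "description", "price"])
    writer.writeheader()
    for page in parsed:
        for product in page.get("products") or []:
            writer.writerow({
                "url": page["url"],
                "title": product.get("title"),
                "description": product.get("description"),
                "price": product.get("price")
            })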

The Result (TL;DR)

Here’s the full script:

import requests
import json
import xml.etree.ElementTree as ET


api_key = "YOUR-API-KEY"
sitemap_url = "https://demo.nopcommerce.com/sitemap.xml"


url = "https://api.hasdata.com/scrape/web"


payload = json.dumps({
  "url": sitemap_url,
  "proxyType": "datacenter",
  "proxyCountry": "US"
})
headers = {
  'Content-Type': 'application/json',
  'x-api-key': api_key
}


response = requests.post(url, headers=headers, data=payload)
response.raise_for_status()


root = ET.fromstring(response.json().get("content"))
namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
links = [loc.text for loc in root.findall('.//ns:loc', namespace)]


parsed = []


for link in links:
    payload = json.dumps({
        "url": link,
        "proxyType": "datacenter",
        "proxyCountry": "US",
        "aiExtractRules": {
            "products": {
                "type": "list",
                "output": {
                    "title": {
                        "description": "title of product",
                        "type": "string"
                    },
                    "description": {
                        "description": "information about the product",
                        "type": "string"
                    },
                    "price": {
                        "description": "price of the product",
                        "type": "string"
                    }
                }
            }
        }
    })


    response = requests.post(url, headers=headers, data=payload)
    response.raise_for_status()


    data = response.json()
    ai_raw = data.get("aiResponse")
    parsed.append({
        "url": link,
        "products": ai_raw.get("products")
    })


with open("result_ai.json", "w", encoding="utf-8") as f:
    json.dump(parsed, f, indent=2, ensure_ascii=False)

You can also check out our other article on finding URLs on a website; it includes more examples and options for those who prefer to write less code.
