<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Max Bohomolov</title>
    <description>The latest articles on Forem by Max Bohomolov (@mantisus).</description>
    <link>https://forem.com/mantisus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1964149%2F9ec7b4f2-3e35-4b8f-976c-35b4cb0b5af2.jpg</url>
      <title>Forem: Max Bohomolov</title>
      <link>https://forem.com/mantisus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mantisus"/>
    <language>en</language>
    <item>
      <title>How to scrape YouTube using Python [2025 guide]</title>
      <dc:creator>Max Bohomolov</dc:creator>
      <pubDate>Wed, 16 Jul 2025 12:10:32 +0000</pubDate>
      <link>https://forem.com/crawlee/how-to-scrape-youtube-using-python-2025-guide-255o</link>
      <guid>https://forem.com/crawlee/how-to-scrape-youtube-using-python-2025-guide-255o</guid>
      <description>&lt;p&gt;In this guide, we'll explore how to efficiently collect data from YouTube using &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;Crawlee for Python&lt;/a&gt;. The scraper will extract video metadata, video statistics, and transcripts - giving you structured YouTube data perfect for content analysis, ML training, or trend monitoring.&lt;/p&gt;

&lt;p&gt;Note: One of our community members wrote this guide as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on Apify’s &lt;a href="https://apify.com/discord" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Key steps we'll cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Project setup&lt;/li&gt;
&lt;li&gt;Analyzing YouTube and determining a scraping strategy&lt;/li&gt;
&lt;li&gt;Configuring Crawlee&lt;/li&gt;
&lt;li&gt;Extracting YouTube data&lt;/li&gt;
&lt;li&gt;Enhancing the scraper capabilities&lt;/li&gt;
&lt;li&gt;Creating a YouTube Actor on the Apify platform&lt;/li&gt;
&lt;li&gt;Deploying to Apify&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What you’ll need to get started
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10 or higher&lt;/li&gt;
&lt;li&gt;Familiarity with web scraping concepts&lt;/li&gt;
&lt;li&gt;Crawlee for Python &lt;code&gt;v0.6.0&lt;/code&gt; or higher&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt; &lt;code&gt;v0.7&lt;/code&gt; or higher&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Project setup
&lt;/h2&gt;

&lt;p&gt;Note: Before starting the project, I'd like to ask you to star Crawlee for Python on &lt;a href="https://github.com/apify/crawlee-python/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. This will help us spread the word to fellow scraper developers.&lt;/p&gt;

&lt;p&gt;In this project, we'll use uv for package management, and it will also install the required Python version. If you don't have uv installed yet, just follow the installation &lt;a href="https://docs.astral.sh/uv/getting-started/installation/" rel="noopener noreferrer"&gt;guide&lt;/a&gt; or use this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To create the project, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx crawlee[&lt;span class="s1"&gt;'cli'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; create youtube-crawlee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the &lt;code&gt;cli&lt;/code&gt; menu that opens, select:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;Playwright&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Httpx&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;uv&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Leave the default value - &lt;code&gt;https://crawlee.dev&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;y&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Or, just run the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx crawlee[&lt;span class="s1"&gt;'cli'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; create youtube-crawlee &lt;span class="nt"&gt;--crawler-type&lt;/span&gt; playwright &lt;span class="nt"&gt;--http-client&lt;/span&gt; httpx &lt;span class="nt"&gt;--package-manager&lt;/span&gt; uv &lt;span class="nt"&gt;--apify&lt;/span&gt; &lt;span class="nt"&gt;--start-url&lt;/span&gt; &lt;span class="s1"&gt;'https://crawlee.dev'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, if you prefer to use &lt;code&gt;pipx&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipx run crawlee[&lt;span class="s1"&gt;'cli'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; create youtube-crawlee &lt;span class="nt"&gt;--crawler-type&lt;/span&gt; playwright &lt;span class="nt"&gt;--http-client&lt;/span&gt; httpx &lt;span class="nt"&gt;--package-manager&lt;/span&gt; uv &lt;span class="nt"&gt;--apify&lt;/span&gt; &lt;span class="nt"&gt;--start-url&lt;/span&gt; &lt;span class="s1"&gt;'https://crawlee.dev'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creating the project may take a few minutes. After installation is complete, navigate to the project folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;youtube-crawlee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Analyzing YouTube and determining a scraping strategy
&lt;/h2&gt;

&lt;p&gt;If you're working on a small project to extract data from YouTube, you should use the &lt;a href="https://developers.google.com/youtube/v3/docs/search/list" rel="noopener noreferrer"&gt;YouTube API&lt;/a&gt; to get your data. However, the API has very strict quotas, with no more than &lt;a href="https://developers.google.com/youtube/v3/determine_quota_cost" rel="noopener noreferrer"&gt;10,000 units per day&lt;/a&gt;. This allows you to get just 100 search pages, and you can't increase this limit.&lt;/p&gt;
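
&lt;p&gt;You can sanity-check that limit with quick back-of-the-envelope arithmetic. A minimal sketch, assuming the 100-unit cost of a single &lt;code&gt;search.list&lt;/code&gt; call from the quota table linked above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough estimate of how far the default YouTube API quota stretches
DAILY_QUOTA_UNITS = 10_000  # default daily quota
SEARCH_LIST_COST = 100  # cost of one search.list call, per the quota table

searches_per_day = DAILY_QUOTA_UNITS // SEARCH_LIST_COST
print(searches_per_day)  # 100 search pages per day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;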

&lt;p&gt;If your project requires more data than the API allows, you'll need to use crawling. Let's examine the site to develop an optimal crawling strategy.&lt;/p&gt;

&lt;p&gt;Let's study YouTube navigation using &lt;a href="https://www.youtube.com/@Apify" rel="noopener noreferrer"&gt;Apify's YouTube channel&lt;/a&gt; as an example to better understand the features and data extraction points.&lt;/p&gt;

&lt;p&gt;YouTube uses infinite scrolling to load new elements on the page, similar to what we discussed in the corresponding &lt;a href="https://www.crawlee.dev/blog/infinite-scroll-using-python" rel="noopener noreferrer"&gt;article&lt;/a&gt; from the &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; team. Let's look at how this works using &lt;a href="https://developer.chrome.com/docs/devtools" rel="noopener noreferrer"&gt;DevTools&lt;/a&gt; and the &lt;a href="https://developer.chrome.com/docs/devtools/network/" rel="noopener noreferrer"&gt;Network&lt;/a&gt; tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtyc1nlfmckiyrpef4s2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtyc1nlfmckiyrpef4s2.webp" alt="Load Request" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we look at the response structure, we can see that YouTube uses &lt;a href="https://www.json.org" rel="noopener noreferrer"&gt;JSON&lt;/a&gt; to transmit data, but its structure is quite complex to navigate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn5hokstn2xosyjger8x.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn5hokstn2xosyjger8x.webp" alt="Load Response" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Therefore, we'll use &lt;a href="https://playwright.dev/python/docs/intro" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; for crawling, which will help us avoid parsing complex JSON responses. But if you want to practice crawling complex websites, try implementing a crawler based on an HTTP client, like in this &lt;a href="https://www.crawlee.dev/blog/scraping-dynamic-websites-using-python" rel="noopener noreferrer"&gt;article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's analyze the selectors for getting video links using the &lt;a href="https://developer.chrome.com/docs/devtools/elements/" rel="noopener noreferrer"&gt;Elements&lt;/a&gt; tab:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet6tqg20gjxgwlpm2phx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet6tqg20gjxgwlpm2phx.webp" alt="Selectors" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like we're interested in &lt;code&gt;a&lt;/code&gt; tags with the attribute &lt;code&gt;id="video-title-link"&lt;/code&gt;!&lt;/p&gt;
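
&lt;p&gt;Before wiring this into the crawler, you can sanity-check the selector with a short standalone Playwright script. This is just a minimal sketch for verification, not part of the final project - it assumes Playwright browsers are installed, and the GDPR pop-up described below may get in the way in some regions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Standalone check: how many video links does the selector match on the channel page?
import asyncio

from playwright.async_api import async_playwright


async def check_selector() -&amp;gt; None:
    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://www.youtube.com/@Apify/videos')
        # Wait until at least one video link is attached, then count them
        await page.locator('a#video-title-link').first.wait_for(state='attached')
        print(await page.locator('a#video-title-link').count())
        await browser.close()


asyncio.run(check_selector())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;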

&lt;p&gt;Let's look at the video page to understand better how YouTube transmits data. As expected, we see data in JSON format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbv4vzzild2svc0froqol.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbv4vzzild2svc0froqol.webp" alt="Video Response" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's get the transcript link. Click on the subtitles button in the player to trigger the transcript request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom31n2vxlot4ejf02hm8.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom31n2vxlot4ejf02hm8.webp" alt="Transcript Request" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's verify that we can access the transcript via this link. Remove the &lt;code&gt;fmt=json3&lt;/code&gt; parameter from the URL and open it in your browser. Removing the &lt;code&gt;fmt&lt;/code&gt; parameter is necessary to get the data in a convenient XML format instead of the complex JSON3 format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fso7d8u0iljme7aksrmfi.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fso7d8u0iljme7aksrmfi.webp" alt="Transcript Response" width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you live in a country where &lt;a href="https://gdpr-info.eu/" rel="noopener noreferrer"&gt;GDPR&lt;/a&gt; applies, you'll need to handle the following pop-up before you can access the data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiwuxr4o6pwl93jb9eim.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiwuxr4o6pwl93jb9eim.webp" alt="GDPR" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After our analysis, we now understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Navigation strategy&lt;/strong&gt;: How to navigate the channel page to retrieve all videos using infinite scroll.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video metadata extraction&lt;/strong&gt;: How to extract video statistics, title, description, publish date, and other metadata from video pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transcript access&lt;/strong&gt;: How to obtain the correct transcript link.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data formats&lt;/strong&gt;: Transcript data is available in XML format, which is easier to parse than JSON3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regional considerations&lt;/strong&gt;: Special handling is required for GDPR consent in European countries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this knowledge, we're ready to implement the YouTube scraper using Crawlee for Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Configuring Crawlee
&lt;/h2&gt;

&lt;p&gt;Configuring Crawlee for YouTube is very similar to configuring it for &lt;a href="https://www.crawlee.dev/blog/scrape-tiktok-python" rel="noopener noreferrer"&gt;TikTok&lt;/a&gt;, but with some key differences.&lt;/p&gt;

&lt;p&gt;Since pages have infinite scrolling, we need to limit the number of elements we want to get. For this, we'll add a &lt;code&gt;max_items&lt;/code&gt; parameter that will limit the maximum number of elements for each channel, and pass it in &lt;code&gt;user_data&lt;/code&gt; when forming a &lt;a href="https://www.crawlee.dev/python/api/class/Request" rel="noopener noreferrer"&gt;Request&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We'll limit the intensity of scraping by setting &lt;code&gt;max_tasks_per_minute&lt;/code&gt; in &lt;a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" rel="noopener noreferrer"&gt;&lt;code&gt;ConcurrencySettings&lt;/code&gt;&lt;/a&gt;. This will help us reduce the likelihood of being blocked by YouTube.&lt;/p&gt;

&lt;p&gt;Scrolling pages can take a long time, so we’ll increase the time limit for processing a single request using &lt;code&gt;request_handler_timeout&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Since we won't be saving images, videos, and similar media content during crawling, we can block requests to them using &lt;a href="https://www.crawlee.dev/python/api/class/BlockRequestsFunction" rel="noopener noreferrer"&gt;&lt;code&gt;block_requests&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://www.crawlee.dev/python/api/class/PlaywrightCrawler#pre_navigation_hook" rel="noopener noreferrer"&gt;&lt;code&gt;pre_navigation_hook&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, to handle the &lt;code&gt;GDPR&lt;/code&gt; page only once, we'll use &lt;a href="https://www.crawlee.dev/python/api/class/UseStateFunction" rel="noopener noreferrer"&gt;&lt;code&gt;use_state&lt;/code&gt;&lt;/a&gt; to pass the appropriate cookies between sessions, ensuring all requests have the necessary cookies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawler&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pre_hook&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.routes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;router&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The crawler entry point.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Create a crawler instance with the router
&lt;/span&gt;        &lt;span class="n"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;# Limit scraping intensity by setting a limit on requests per minute
&lt;/span&gt;            &lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tasks_per_minute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="c1"&gt;# We'll configure the `router` in the next step
&lt;/span&gt;            &lt;span class="n"&gt;request_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Increase the timeout for the request handling pipeline
&lt;/span&gt;            &lt;span class="n"&gt;request_handler_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="c1"&gt;# Runs browser without visual interface
&lt;/span&gt;            &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Limit requests per crawl for testing purposes
&lt;/span&gt;            &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Set the maximum number of items to scrape per youtube channel
&lt;/span&gt;        &lt;span class="n"&gt;max_items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="c1"&gt;# Set the list of channels to scrape
&lt;/span&gt;        &lt;span class="n"&gt;channels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Apify&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Set hook for prepare context before navigation on each request
&lt;/span&gt;        &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pre_navigation_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pre_hook&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.youtube.com/@&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/videos&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_items&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;channels&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's prepare the &lt;code&gt;pre_hook&lt;/code&gt; function to block requests and set cookies (the cookie collection process will be explained in the extraction section):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# hooks.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlaywrightPreNavCrawlingContext&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;pre_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightPreNavCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Prepare context before navigation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;crawler_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use_state&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Check if there are previously collected cookies in the crawler state and set them for the session
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cookies&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;crawler_state&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crawler_state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cookies&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# Set cookies for the session
&lt;/span&gt;        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_cookies_from_playwright_format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Block requests to resources that aren't needed for parsing
&lt;/span&gt;    &lt;span class="c1"&gt;# This is similar to the default value, but we don't block `css` as it is needed for Player loading
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;block_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url_patterns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.webp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.jpg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.jpeg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.svg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.gif&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.woff&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.pdf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.zip&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Extracting YouTube data
&lt;/h2&gt;

&lt;p&gt;After configuration, let's move on to navigation and data extraction.&lt;/p&gt;

&lt;p&gt;For infinite scrolling, we'll use the built-in helper function &lt;a href="https://www.crawlee.dev/python/api/class/PlaywrightCrawlingContext#infinite_scroll" rel="noopener noreferrer"&gt;&lt;code&gt;infinite_scroll&lt;/code&gt;&lt;/a&gt;. But instead of waiting for scrolling to complete, which in some cases can take a really long time, we'll use Python's &lt;code&gt;asyncio&lt;/code&gt; capabilities to run it as a background task.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;GDPR&lt;/code&gt; page requiring consent for cookie usage is on the domain &lt;code&gt;consent.youtube.com&lt;/code&gt;, which might cause an error when forming a &lt;a href="https://www.crawlee.dev/python/api/class/Request" rel="noopener noreferrer"&gt;Request&lt;/a&gt; for a video page. Therefore, we need to use a helper function for the &lt;code&gt;transform_request_function&lt;/code&gt; parameter in &lt;a href="https://www.crawlee.dev/python/api/class/ExtractLinksFunction" rel="noopener noreferrer"&gt;&lt;code&gt;extract_links&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This function will check each extracted URL. If it contains &lt;code&gt;consent.youtube&lt;/code&gt;, we'll replace it with &lt;code&gt;www.youtube&lt;/code&gt;. This will allow us to get the correct URL for the video page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# routes.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xml.etree.ElementTree&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TYPE_CHECKING&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;yarl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RequestOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RequestTransformAction&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.router&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;TYPE_CHECKING&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;PlaywrightRequest&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Route&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;PlaywrightRoute&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;request_domain_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_param&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RequestOptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;RequestOptions&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;RequestTransformAction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transform request before adding it to the queue.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;consent.youtube&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;request_param&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;request_param&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request_param&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;consent.youtube&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;www.youtube&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;request_param&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;unchanged&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's implement a function that will intercept transcript requests for later modification and processing in the crawler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# routes.py
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_transcript_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract the transcript URL from request intercepted by Playwright.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Create a Future to store the transcript URL
&lt;/span&gt;    &lt;span class="n"&gt;transcript_future&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Define a handler for the transcript request
&lt;/span&gt;    &lt;span class="c1"&gt;# This will be called when the page requests the transcript
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_transcript_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightRoute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Set the result of the future with the transcript URL
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;transcript_future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;done&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;transcript_future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fulfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Set up a route to intercept requests to the transcript API
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;**/api/timedtext**&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_transcript_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Click the subtitles button to trigger the transcript request
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.ytp-subtitles-button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait for the transcript URL to be captured
&lt;/span&gt;    &lt;span class="c1"&gt;# The future will resolve when handle_transcript_request is called
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;transcript_future&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's create the main handler that will navigate to the channel page, perform infinite scrolling, and extract links to videos.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# routes.py
&lt;/span&gt;
&lt;span class="nd"&gt;@router.default_handler&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle requests that do not match any specific handler.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Get the limit from user_data, default to 10 if not set
&lt;/span&gt;    &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Limit must be an integer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait for the page to load
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;h1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;attached&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if there's a GDPR popup on the page requiring consent for cookie usage
&lt;/span&gt;    &lt;span class="n"&gt;cookies_button&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;button[aria-label*=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cookies_button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_visible&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cookies_button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Save cookies for later use with other sessions
&lt;/span&gt;        &lt;span class="c1"&gt;# You can learn more about `SOCS` cookies from - https://policies.google.com/technologies/cookies?hl=en-US
&lt;/span&gt;        &lt;span class="n"&gt;cookies_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SOCS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;crawler_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use_state&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;crawler_state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cookies&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cookies_state&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait until at least one video loads
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a[href*=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Create a background task for infinite scrolling
&lt;/span&gt;    &lt;span class="n"&gt;scroll_task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;infinite_scroll&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="c1"&gt;# Scroll the page to the end until we reach the limit or finish scrolling
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;scroll_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;done&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract links to videos
&lt;/span&gt;        &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a[href*=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;video&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;transform_request_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request_domain_transform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;same-domain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Create a dictionary to avoid duplicates
&lt;/span&gt;        &lt;span class="n"&gt;requests_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="c1"&gt;# If the limit is reached, cancel the scrolling task
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;scroll_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="c1"&gt;# Switch the asynchronous context to allow other tasks to execute
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# If the scroll task is done, we can safely assume that we have reached the end of the page
&lt;/span&gt;        &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a[href*=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;video&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;transform_request_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request_domain_transform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;same-domain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;requests_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests_map&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Add the requests to the queue
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take a closer look at the parameters used in &lt;a href="https://www.crawlee.dev/python/api/class/ExtractLinksFunction#Methods" rel="noopener noreferrer"&gt;&lt;code&gt;extract_links&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;selector&lt;/code&gt; - selector for extracting links to videos. We expected that we could use &lt;code&gt;id="video-title-link"&lt;/code&gt;, but YouTube uses different page formats with different selectors, so the selector &lt;code&gt;a[href*="watch"]&lt;/code&gt; will be more universal.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;label&lt;/code&gt; - pointer for the router that will be used to handle the video page.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;transform_request_function&lt;/code&gt; - function to transform the request before adding it to the queue. We use it to replace the domain &lt;code&gt;consent.youtube&lt;/code&gt; with &lt;code&gt;www.youtube&lt;/code&gt;, which helps avoid errors when processing the video page.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;strategy&lt;/code&gt; - strategy for extracting links. We use &lt;code&gt;same-domain&lt;/code&gt; to extract links to any subdomain of &lt;code&gt;youtube.com&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's move on to the handler for video pages. In it, we'll extract video data and also look at how to get and process the video transcript link.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# routes.py
&lt;/span&gt;
&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;video&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;video_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle video requests.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing video &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# extract video data from the page
&lt;/span&gt;    &lt;span class="n"&gt;video_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;window.ytInitialPlayerResponse&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;main_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;videoDetails&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;videoDetails&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;shortDescription&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;channel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;videoDetails&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;channel_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;videoDetails&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;channelId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;video_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;videoDetails&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;videoId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;duration&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;videoDetails&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lengthSeconds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;keywords&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;videoDetails&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;keywords&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;view_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;videoDetails&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;viewCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;like_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;microformat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;playerMicroformatRenderer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;likeCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_shorts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;microformat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;playerMicroformatRenderer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;isShortsEligible&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;publish_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;microformat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;playerMicroformatRenderer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;publishDate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Try to extract the transcript URL
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;transcript_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract_transcript_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;transcript_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;transcript_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;transcript_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript_url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;without_query_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fmt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Found transcript URL: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;transcript_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;video_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;main_data&lt;/span&gt;&lt;span class="p"&gt;})]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that when we want to extract the video transcript, we don't save the video data right away: we first get the link to the transcript file and pass the video data along to the next handler via &lt;code&gt;user_data&lt;/code&gt;, and only then is everything saved to the &lt;a href="https://www.crawlee.dev/python/api/class/Dataset" rel="noopener noreferrer"&gt;&lt;code&gt;Dataset&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The final stage is processing the transcript. YouTube uses &lt;a href="https://www.w3schools.com/xml/" rel="noopener noreferrer"&gt;XML&lt;/a&gt; to transmit transcript data, so we need to use a library to parse XML, such as &lt;a href="https://docs.python.org/3/library/xml.etree.elementtree.html" rel="noopener noreferrer"&gt;&lt;code&gt;xml.etree.ElementTree&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# routes.py
&lt;/span&gt;
&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcript_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle transcript requests.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing transcript &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Get the main video data extracted in `video_handler`
&lt;/span&gt;    &lt;span class="n"&gt;video_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;video_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Get XML data from the response
&lt;/span&gt;        &lt;span class="n"&gt;root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromstring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Extract text elements from XML
&lt;/span&gt;        &lt;span class="n"&gt;transcript_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;text_element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text_element&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.//text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text_element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Enrich video data by adding the transcript
&lt;/span&gt;        &lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Save the data to Dataset
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ET&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ParseError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Incorrect XML response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Save the video data without the transcript
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
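


&lt;p&gt;To make the XML step easier to picture, here's a standalone sketch of the same parsing logic used in &lt;code&gt;transcript_handler&lt;/code&gt;. The sample payload is only an assumption about the shape of YouTube's transcript XML, included for illustration - the handler above simply works with whatever the real response contains.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# transcript_parsing_sketch.py - standalone illustration, not part of the crawler
import xml.etree.ElementTree as ET

# Assumed shape of the transcript XML; the real payload may differ slightly
sample_xml = (
    '&amp;lt;transcript&amp;gt;'
    '&amp;lt;text start="0.0" dur="2.1"&amp;gt;Hi, Theo here.&amp;lt;/text&amp;gt;'
    '&amp;lt;text start="2.1" dur="3.4"&amp;gt;In this video, I will show you...&amp;lt;/text&amp;gt;'
    '&amp;lt;/transcript&amp;gt;'
)

root = ET.fromstring(sample_xml)

# Collect every non-empty text node, exactly as transcript_handler does
lines = [el.text.strip() for el in root.findall('.//text') if el.text]

print('\n'.join(lines))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;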



&lt;p&gt;After collecting the data, we need to save the results to a file. Just add the following code to the end of the &lt;code&gt;main&lt;/code&gt; function in &lt;code&gt;main.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;
&lt;span class="c1"&gt;# Export the data from Dataset to JSON format
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_data_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;youtube.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
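


&lt;p&gt;JSON isn't the only option. If you'd rather have a CSV, recent versions of Crawlee for Python also expose a generic &lt;code&gt;export_data&lt;/code&gt; helper, which picks the output format from the file extension - a minimal sketch, assuming the same &lt;code&gt;crawler&lt;/code&gt; object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# main.py

# Alternative export: CSV instead of JSON (assumes the generic `export_data`
# helper, which infers the format from the file extension)
await crawler.export_data('youtube.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;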



&lt;p&gt;To run the crawler, use the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; youtube_crawlee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example result record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.youtube.com/watch?v=r-1J94tk5Fo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Facebook Marketplace API - Scrape Data Based on LOCATION, CATEGORY and SEARCH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"See how you can export Facebook Marketplace listings to Excel, CSV or JSON with the Facebook Marketplace API 🛍️ Input one or more URLs to scrape price, description, images, delivery info, seller data, location, listing status, and much more 📊&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;With the Facebook Marketplace Downloader, you can:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;🛒 **Extract listings and seller details** from any public Marketplace category or search query.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;📷 **Scrape product details**, including images, prices, descriptions, locations, and timestamps.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;💰 **Get thousands of marketplace listings** quickly and efficiently.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;📦 **Export results** via API or in JSON, CSV, or Excel with all listing details.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;🛍️ Facebook Marketplace Search API 👉 https://apify.it/3E5NLz4&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;📱 Explore other Facebook Scrapers 👉 https://apify.it/43Bae1f&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;*Why scrape Facebook Marketplace data?* 🤔&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;💰 Price &amp;amp; Demand Analysis – Track product pricing trends and demand fluctuations.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;📊 Competitor Insights – Monitor listings from competitors to adjust pricing and strategy.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;📍 Location-Based Market Trends – Identify popular products in specific regions.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;🔎 Product Availability Monitoring – Detect shortages or oversupply in certain categories.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;📈 Reselling Opportunities – Find underpriced items for profitable flips.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;🛍 Consumer Behavior Insights – Understand what products and features attract buyers.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;💡 Trend Spotting – Discover emerging products before they go mainstream.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;📝 Market Research – Gather data for academic, business, or personal research.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;*How to* scrape *facebook marketplace? 🧑‍🏫* &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Step 1. 
Find the Facebook Marketplace dataset tool on Apify Store&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Step 2: Click ‘Try for free’&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Step 3: Input a URL&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Step 4: Fine tune the input&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Step 5: Start the Actor and get your data!&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;*Useful links 🧑‍💻*&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;📚 Read more about Scraping Facebook data: https://apify.it/43wyth9&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;🧑‍💻 Sign up for Apify: https://apify.it/42e8nNu&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;🧩 Integrate the Actor with other tools: https://apify.it/43Ustiz&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;📱 Browse other Social Media Scrapers on Apify Store: https://apify.it/4jhq7i8&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;*Follow us 🤳*&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;https://www.linkedin.com/company/apifytech&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;https://twitter.com/apify&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;https://www.tiktok.com/@apifytech&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;https://discord.com/invite/jyEM2PRvMU&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;*Timestamps ⌛️*&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;00:00 Introduction&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;01:27 Input&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;02:17 Run&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;02:26 Export&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;02:41 Scheduling&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;02:54 Integrations&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;03:00 API&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;03:13 Other Meta Scrapers&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;03:26 Like and subscribe!&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;#webscraping #instagram"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"channel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Apify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"channel_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UCTgwcoeGGKmZ3zzCXN2qo_A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"video_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"r-1J94tk5Fo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"226"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"keywords"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"web scraping platform"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"web automation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"scrapers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Apify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"web crawling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"web scraping"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"data extraction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"best web scraping tool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"API"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"how to extract data from any website"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"web scraping tutorial"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"web scrape"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"data collection tool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"RPA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"web integration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"how to turn website into API"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"JSON"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"python web scraping"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"web scraping python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"web api integration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"how to turn website into api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"scraping"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"apify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"data extraction tools"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"how to web scrape"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"web scraping javascript"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"web scraping tool"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"view_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"765"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"like_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"is_shorts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"publish_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-04-03T05:33:18-07:00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transcript"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hi, Theo here. In this video, I’ll &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;show you how to scrape structured&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;data from Facebook Marketplace by location, &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;category, or specific search query. You’ll&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;be able to extract listing details like price, &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;description, images, delivery info, seller data,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;location, and listing status — using a &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;tool called Facebook Marketplace Scraper.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Here’s what you can do with it. &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;If you&amp;amp;#39;re reselling, flipping,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;or deal hunting, scraping helps you track &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;prices, spot trends, and catch underpriced&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;or free items early. Looking for a rental &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;or house? Compare listings across cities,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;check historical prices, and avoid wasting &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;time on overpriced options. Selling on&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Marketplace? 
Analyze top-performing listings, &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;optimize keywords, and price competitively.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;For businesses, scraping &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;enables competitor tracking,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;dynamic pricing, real estate &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;research, fraud detection,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;and brand protection — like spotting counterfeit &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;or unauthorized listings before they do damage.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;The best part is you don’t need to &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;jump through hoops to get this data:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Facebook Marketplace Scraper makes things simple: &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;no login, no cookies, no browser extension.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;It runs in the cloud, and you can export &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;results in JSON, CSV, Excel — or use the API.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Let’s see how it works.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;First, head to the link in the description, &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;which’ll take you to Facebook Marketplace&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Scraper’s README. Click on `try &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;for free`, which will send you to&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;the `Login page` and you can get started &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;with a free Apify account - don’t worry,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;there’s no limit on the free plan and &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;no credit card will ever be required.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;After logging in, you’ll land on the Actor’s &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;input page. While you can configure this through&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;either the intuitive UI or JSON, we’ll &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;stick with the UI option to keep it easy.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;For scraping Facebook Marketplace, you’re gonna &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;need the URL from Facebook. You can use a URL of&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;a search term, location or an item category. For &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;this tutorial, we’re gonna go with an iPhone. 
So&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;let’s open up Facebook Marketplace, input a search &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;term and then copy the URL from the toolbar and&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;paste it in the input. You can add more via the &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;add button, edit them in bulk or import the URLs&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;as a text file. Next, you can limit how many &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;posts you want to scrape. And that’s it.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Before running your Actor, it’s a great idea &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;to save your configuration and create a task.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;This will come in handy for scheduling or &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;integrating your Actor with other tools,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;or if you plan to work with &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;multiple configurations.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Now that we have the `input`, let’s run &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;the Actor by hitting START. You can watch&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;your results appear in Overview or switch to &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;the Log tab to see more details about run.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Now that your run is finished, we can get the &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;data via the Export button. You can choose your&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;preffered format, and select which fields you want &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;to include or exclude in your dataset. Then just&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;hit Download and you have your dataset file. Let &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;me show you what this looks like in JSON format.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;If you want to automate your workflow &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;even more, you can schedule your Facebook&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Marketplace Scraper to run at regular intervals. &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Choose your task and hit schedule. You can set&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;the frequency of how often you want to run &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;the Actor. 
You can even connect your Actor&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;to other cloud services, such as Google &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Drive, Make, or any other Apify Actor.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;You can also run this scraper locally via &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;API. You can find the code in Node.js,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Python, or curl in the API &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;drop down menu in the top-right&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;corner. To learn more about retrieving data &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;programatically, check out our video on it.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Need more Facebook or Instagram data? &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Check out our other scrapers in Apify&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Store. We have got dozens of meta &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;scrapers, links are in the description.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;If you prefer video tutorials, we have a playlist &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;covering different Instagram scraping use cases.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;And that’s all for today! Let us know what you &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;think about the Facebok Marketplace Scraper.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Remember, if you come across any issues, make &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;sure to report them to our team in Apify Console.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;If you found this helpful, give us a thumbs &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;up and subscribe. Don&amp;amp;#39;t forget to hit the&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;bell to stay updated on new tutorials. Thanks for &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;watching! So long, and thanks for all the likes"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Enhancing the scraper capabilities
&lt;/h2&gt;

&lt;p&gt;As with any project that works with a large site like YouTube, you may run into various issues that need to be resolved.&lt;br&gt;
The Crawlee for Python documentation contains many guides and examples to help you with this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;a href="https://camoufox.com/" rel="noopener noreferrer"&gt;&lt;code&gt;Camoufox&lt;/code&gt;&lt;/a&gt;, a project compatible with Playwright, which allows you to get a browser configuration that's more resistant to blocking, and you can easily &lt;a href="https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-camoufox" rel="noopener noreferrer"&gt;integrate it with Crawlee for Python&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Improve error handling and logging for unusual cases so you can easily debug and maintain the project; the guide on &lt;a href="https://www.crawlee.dev/python/docs/guides/error-handling" rel="noopener noreferrer"&gt;error handling&lt;/a&gt; is a good place to start.&lt;/li&gt;
&lt;li&gt;Add proxy support to avoid blocks from YouTube. You can use &lt;a href="https://apify.com/proxy" rel="noopener noreferrer"&gt;Apify Proxy&lt;/a&gt; and &lt;a href="https://www.crawlee.dev/python/api/class/ProxyConfiguration" rel="noopener noreferrer"&gt;&lt;code&gt;ProxyConfiguration&lt;/code&gt;&lt;/a&gt;; the proxy management guide in the &lt;a href="https://www.crawlee.dev/python/docs/guides/proxy-management#proxy-configuration" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; covers the details, and a minimal configuration sketch follows this list.&lt;/li&gt;
&lt;li&gt;Make your crawler a web service that crawls pages by user request, using &lt;a href="https://fastapi.tiangolo.com/" rel="noopener noreferrer"&gt;FastAPI&lt;/a&gt; and following this &lt;a href="https://www.crawlee.dev/python/docs/guides/running-in-web-server" rel="noopener noreferrer"&gt;guide&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
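
&lt;p&gt;To make the proxy point concrete, here's a minimal sketch of wiring a proxy into the crawler for local runs. The proxy URLs are placeholders, not real endpoints; on the Apify platform you'll instead build the configuration from the Actor input, as shown in the next section.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# main.py (sketch) - proxy wiring for local runs; the URLs are placeholders
from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://username:password@proxy-1.example.com:8000',
        'http://username:password@proxy-2.example.com:8000',
    ]
)

crawler = PlaywrightCrawler(
    proxy_configuration=proxy_configuration,
    # ... keep the rest of the options used earlier in the guide ...
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


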
&lt;h2&gt;
  
  
  6. Creating a YouTube Actor on the Apify platform
&lt;/h2&gt;

&lt;p&gt;For deployment, we'll use the &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify platform&lt;/a&gt;. It's a simple and effective environment for cloud deployment, allowing efficient interaction with your crawler. Call it via &lt;a href="https://docs.apify.com/api/v2/" rel="noopener noreferrer"&gt;API&lt;/a&gt;, &lt;a href="https://docs.apify.com/platform/schedules" rel="noopener noreferrer"&gt;schedule tasks&lt;/a&gt;, &lt;a href="https://docs.apify.com/platform/integrations" rel="noopener noreferrer"&gt;integrate&lt;/a&gt; with various services, and much more.&lt;/p&gt;
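
&lt;p&gt;To give you a feel for the API part, here's a hedged sketch of calling the finished Actor from Python with the &lt;code&gt;apify-client&lt;/code&gt; package once it's deployed. The Actor ID and token below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# call_actor.py - sketch of calling the deployed Actor via the Apify API client
# 'your-username/youtube-crawlee' and the token are placeholders
from apify_client import ApifyClient

client = ApifyClient('YOUR_APIFY_TOKEN')

# Start the Actor and wait for the run to finish
run = client.actor('your-username/youtube-crawlee').call(
    run_input={'channelNames': ['Apify'], 'maxItems': 10},
)

# Iterate over the scraped items stored in the run's default dataset
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item['title'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

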

&lt;p&gt;To deploy to the Apify platform, we need to adapt our project for the &lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;Apify Actor&lt;/a&gt; structure.&lt;/p&gt;

&lt;p&gt;Create an &lt;code&gt;.actor&lt;/code&gt; folder with the necessary files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; .actor &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;touch&lt;/span&gt; .actor/&lt;span class="o"&gt;{&lt;/span&gt;actor.json,input_schema.json&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Move the &lt;code&gt;Dockerfile&lt;/code&gt; from the root folder to &lt;code&gt;.actor&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mv &lt;/span&gt;Dockerfile .actor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's fill in the empty files:&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;actor.json&lt;/code&gt; file contains project metadata for the Apify platform. Follow the &lt;a href="https://docs.apify.com/platform/actors/development/actor-definition/actor-json" rel="noopener noreferrer"&gt;documentation for proper configuration&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"actorSpecification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YouTube-Crawlee"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YouTube - Crawlee"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"minMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Scrape video stats, metadata and transcripts from videos in YouTube channels"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"templateId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"youtube-crawlee"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./input_schema.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dockerfile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./Dockerfile"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actor input parameters are defined using &lt;code&gt;input_schema.json&lt;/code&gt;, which is specified &lt;a href="https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's define input parameters for our crawler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;maxItems&lt;/code&gt; - the maximum number of videos to scrape per channel.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;channelNames&lt;/code&gt; - a list of YouTube channel names to scrape.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;proxySettings&lt;/code&gt; - proxy settings; without a proxy, your requests will go out from the datacenter IP addresses that Apify uses.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YouTube Crawlee"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schemaVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"channelNames"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"List Channel Names"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"array"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Channel names for extracting video stats, metadata, and transcripts."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"editor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stringList"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"prefill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Apify"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"maxItems"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"editor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Limit search results"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Limits the maximum number of results, applies to each search separately."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"proxySettings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Proxy configuration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Select proxies to be used by your scraper."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"prefill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"useApifyProxy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"editor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"proxy"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"channelNames"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's update the code to accept input parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The crawler entry point.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# highlight-start
&lt;/span&gt;        &lt;span class="c1"&gt;# Get the input parameters from the Actor
&lt;/span&gt;        &lt;span class="n"&gt;actor_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_input&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;max_items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;maxItems&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;channels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;channelNames&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
        &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_proxy_configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actor_proxy_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;proxySettings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="c1"&gt;# highlight-end
&lt;/span&gt;
        &lt;span class="n"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tasks_per_minute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;request_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;request_handler_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;proxy_configuration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, remove the JSON export from the &lt;code&gt;main&lt;/code&gt; function, as the Apify platform will handle data storage in the &lt;a href="https://docs.apify.com/platform/storage/dataset" rel="noopener noreferrer"&gt;Dataset&lt;/a&gt;.&lt;/p&gt;
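
&lt;p&gt;For clarity, here's a minimal sketch of that change, assuming the local version ended with an explicit export call such as &lt;code&gt;export_data&lt;/code&gt; (the file name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# main.py (sketch)

        # Before (local run): the crawl ended with an explicit export
        await crawler.export_data('results.json')

        # After (Apify platform): delete the line above - every
        # `context.push_data(...)` call already stores items in the run's Dataset.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
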

&lt;p&gt;That's it, the project is ready for deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Deploying to Apify
&lt;/h2&gt;

&lt;p&gt;Use the official &lt;a href="https://docs.apify.com/cli/" rel="noopener noreferrer"&gt;Apify CLI&lt;/a&gt; to upload your code.&lt;/p&gt;

&lt;p&gt;Authenticate using your API token from &lt;a href="https://console.apify.com/settings/integrations" rel="noopener noreferrer"&gt;Apify Console&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apify login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Choose "Enter API token manually" and paste your token.&lt;/p&gt;

&lt;p&gt;Push the project to the platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apify push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can configure runs on the Apify platform.&lt;/p&gt;
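
&lt;p&gt;If you prefer the terminal, you can also start a run of the pushed Actor with the Apify CLI (the Console UI works just as well):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apify call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
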

&lt;p&gt;Let's perform a test run:&lt;/p&gt;

&lt;p&gt;Fill in the input parameters:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoo0yl4wropz16jb4yow.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoo0yl4wropz16jb4yow.webp" alt="Actor Input" width="800" height="733"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View results in the dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnuo0myousp0rooaaw2gx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnuo0myousp0rooaaw2gx.webp" alt="Dataset Results" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this &lt;a href="https://docs.apify.com/platform/actors/publishing" rel="noopener noreferrer"&gt;publishing guide&lt;/a&gt; for &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We've created a good foundation for crawling YouTube using Crawlee for Python and Playwright. If you're just starting your journey in crawling, this will be an excellent project for learning and practice. You can use it as a basis for creating more complex crawlers that will collect data from YouTube. If this is your first project using Crawlee for Python, check out all the documentation links provided in this article; it will help you better understand how Crawlee for Python works and how you can use it for your projects.&lt;/p&gt;

&lt;p&gt;You can find the complete code in the &lt;a href="https://github.com/Mantisus/youtube-crawlee" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you enjoyed this blog, feel free to support Crawlee for Python by starring the &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;repository&lt;/a&gt; or joining the maintainer team.&lt;/p&gt;

&lt;p&gt;Do you have questions or want to discuss the details of the implementation? Join our &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;—our community of 11,000+ developers is there to help.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>developers</category>
    </item>
    <item>
      <title>How to scrape TikTok using Python</title>
      <dc:creator>Max Bohomolov</dc:creator>
      <pubDate>Wed, 30 Apr 2025 15:25:08 +0000</pubDate>
      <link>https://forem.com/crawlee/how-to-scrape-tiktok-using-python-67j</link>
      <guid>https://forem.com/crawlee/how-to-scrape-tiktok-using-python-67j</guid>
      <description>&lt;p&gt;&lt;a href="https://www.tiktok.com/" rel="noopener noreferrer"&gt;TikTok&lt;/a&gt; users generate tons of data that are valuable for analysis.&lt;/p&gt;

&lt;p&gt;Which hashtags are trending now? What is an influencer's engagement rate? What topics are important for a content creator? You can find answers to these and many other questions by analyzing TikTok data. However, for analysis, you need to extract the data in a convenient format. In this blog, we'll explore how to scrape TikTok using &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;Crawlee for Python&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note: One of our community members wrote this blog as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on our &lt;a href="https://apify.com/discord" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Key steps we'll cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Project setup&lt;/li&gt;
&lt;li&gt;Analyzing TikTok and determining a scraping strategy&lt;/li&gt;
&lt;li&gt;Configuring Crawlee&lt;/li&gt;
&lt;li&gt;Extracting TikTok data&lt;/li&gt;
&lt;li&gt;Creating TikTok Actor on the Apify platform&lt;/li&gt;
&lt;li&gt;Deploying to Apify&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.9 or higher&lt;/li&gt;
&lt;li&gt;Familiarity with web scraping concepts&lt;/li&gt;
&lt;li&gt;Crawlee for Python &lt;code&gt;v0.6.0&lt;/code&gt; or higher&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt; &lt;code&gt;v0.6&lt;/code&gt; or higher&lt;/li&gt;
&lt;li&gt;An Apify account&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Project setup
&lt;/h2&gt;

&lt;p&gt;Note: Before going ahead with the project, I'd like to ask you to star Crawlee for Python on &lt;a href="https://github.com/apify/crawlee-python/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, it helps us to spread the word to fellow scraper developers.&lt;/p&gt;

&lt;p&gt;In this project, we'll use uv for package management, and a specific Python version will be installed through uv. uv is a fast, modern package manager written in Rust.&lt;/p&gt;

&lt;p&gt;If you don't have uv installed yet, just follow the &lt;a href="https://docs.astral.sh/uv/getting-started/installation/" rel="noopener noreferrer"&gt;guide&lt;/a&gt; or use this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To create the project, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx crawlee[&lt;span class="s1"&gt;'cli'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; create tiktok-crawlee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the &lt;code&gt;cli&lt;/code&gt; menu that opens, select:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;Playwright&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Httpx&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;uv&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Leave the default value - &lt;code&gt;https://crawlee.dev&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;y&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Or, just run the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx crawlee[&lt;span class="s1"&gt;'cli'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; create tiktok-crawlee &lt;span class="nt"&gt;--crawler-type&lt;/span&gt; playwright &lt;span class="nt"&gt;--http-client&lt;/span&gt; httpx &lt;span class="nt"&gt;--package-manager&lt;/span&gt; uv &lt;span class="nt"&gt;--apify&lt;/span&gt; &lt;span class="nt"&gt;--start-url&lt;/span&gt; &lt;span class="s1"&gt;'https://crawlee.dev'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creating the project may take a few minutes. After installation is complete, navigate to the project folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;tiktok-crawlee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Analyzing TikTok and determining a scraping strategy
&lt;/h2&gt;

&lt;p&gt;TikTok uses quite a lot of JavaScript on its site, both for displaying content and for analyzing user behavior, including detecting and blocking crawlers. Therefore, for crawling TikTok, we'll use a headless browser with &lt;a href="https://playwright.dev/python/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To load new elements on a user's page, TikTok uses infinite scrolling. You may already be familiar with this method from this &lt;a href="https://www.crawlee.dev/blog/infinite-scroll-using-python" rel="noopener noreferrer"&gt;article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's look at what happens under the hood when we scroll a TikTok page. I recommend studying network activity in &lt;a href="https://developer.chrome.com/docs/devtools" rel="noopener noreferrer"&gt;DevTools&lt;/a&gt; to understand what requests are going to the server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobkxdfvgxpe65qkimt75.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobkxdfvgxpe65qkimt75.webp" alt="Backend Network" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's examine the HTML structure to understand if navigating to elements will be difficult.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc54gd52t6akydal5ac0w.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc54gd52t6akydal5ac0w.webp" alt="Selectors" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well, this looks quite simple. If using &lt;a href="https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors" rel="noopener noreferrer"&gt;CSS selectors&lt;/a&gt;, &lt;code&gt;[data-e2e="user-post-item"] a&lt;/code&gt; is sufficient.&lt;/p&gt;

&lt;p&gt;Let's look at what a video page response looks like to see what data we can extract.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0we8g9th47j9cntrnyx.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0we8g9th47j9cntrnyx.webp" alt="Video Response" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It seems that the HTML code contains JSON with all the data we're interested in. Great!&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Configuring Crawlee
&lt;/h2&gt;

&lt;p&gt;Now that we understand our scraping strategy, let's set up Crawlee for scraping TikTok.&lt;/p&gt;

&lt;p&gt;Since pages have infinite scrolling, we need to limit the number of elements we want to get. For this, we'll add a &lt;code&gt;max_items&lt;/code&gt; parameter that will limit the maximum number of elements for each search and pass it in &lt;code&gt;user_data&lt;/code&gt; when forming a &lt;a href="https://www.crawlee.dev/python/api/class/Request" rel="noopener noreferrer"&gt;Request&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We'll limit the intensity of scraping by setting &lt;code&gt;max_tasks_per_minute&lt;/code&gt; in &lt;a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" rel="noopener noreferrer"&gt;&lt;code&gt;ConcurrencySettings&lt;/code&gt;&lt;/a&gt;. This will help us reduce the likelihood of being blocked by TikTok.&lt;/p&gt;

&lt;p&gt;We'll set &lt;code&gt;browser_type&lt;/code&gt; to &lt;code&gt;firefox&lt;/code&gt;, as it performed better for TikTok in my tests.&lt;br&gt;
TikTok may request permissions to access device data, so we'll explicitly limit all &lt;a href="https://playwright.dev/python/docs/api/class-browser#browser-new-context-option-permissions" rel="noopener noreferrer"&gt;permissions&lt;/a&gt; by passing the appropriate parameter to &lt;code&gt;browser_new_context_options&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Scrolling pages can take a long time, so we should increase the time limit for processing a single request using &lt;code&gt;request_handler_timeout&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawler&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.routes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;router&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The crawler entry point.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# When creating the template, we confirmed Apify integration.
&lt;/span&gt;    &lt;span class="c1"&gt;# However, this isn't important for us at this stage.
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="n"&gt;max_items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;

        &lt;span class="c1"&gt;# Create a crawler with the necessary settings
&lt;/span&gt;        &lt;span class="n"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;# Limit scraping intensity by setting a limit on requests per minute
&lt;/span&gt;            &lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tasks_per_minute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="c1"&gt;# We'll configure the `router` in the next step
&lt;/span&gt;            &lt;span class="n"&gt;request_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# You can use `False` during development. But for production, it's always `True`
&lt;/span&gt;            &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Increase the timeout for the request handling pipeline
&lt;/span&gt;            &lt;span class="n"&gt;request_handler_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;browser_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;firefox&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Limit any permissions to device data
&lt;/span&gt;            &lt;span class="n"&gt;browser_new_context_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;permissions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Run the crawler to collect data from several user pages
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.tiktok.com/@apifyoffice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_items&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.tiktok.com/@authorbrandonsanderson&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_items&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Someone might ask, "What about configurations to avoid fingerprint blocking?!!!" My answer is, "Crawlee for Python has already done that for you."&lt;/p&gt;

&lt;p&gt;Depending on your deployment environment, you may need to add a proxy. We'll come back to this in the last section.&lt;/p&gt;
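
&lt;p&gt;If you need one sooner, here's a minimal sketch of plugging a proxy into the crawler using Crawlee's &lt;code&gt;ProxyConfiguration&lt;/code&gt;; the proxy URL below is just a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# main.py (sketch) - optional proxy setup
from crawlee.proxy_configuration import ProxyConfiguration

# Placeholder credentials - replace with your own proxy
proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://user:password@proxy.example.com:8000'],
)

crawler = PlaywrightCrawler(
    proxy_configuration=proxy_configuration,
    # ... the rest of the settings shown above
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
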

&lt;h2&gt;
  
  
  4. Extracting TikTok data
&lt;/h2&gt;

&lt;p&gt;After configuration, let's move on to navigation and data extraction.&lt;/p&gt;

&lt;p&gt;For infinite scrolling, we'll use the built-in helper function &lt;a href="https://www.crawlee.dev/python/api/class/PlaywrightCrawlingContext#infinite_scroll" rel="noopener noreferrer"&gt;&lt;code&gt;infinite_scroll&lt;/code&gt;&lt;/a&gt;. But instead of waiting for scrolling to complete, which can sometimes take a very long time, we'll use Python's &lt;code&gt;asyncio&lt;/code&gt; capabilities to run it as a background task.&lt;/p&gt;
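
&lt;p&gt;The pattern looks roughly like this (a minimal, hypothetical sketch; the actual handler below does the same thing inline):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of running a long scroll as a background task
import asyncio

async def scroll_until(scroll_coro, reached_limit):
    """Run a long scroll coroutine in the background and stop it early if needed."""
    scroll_task = asyncio.create_task(scroll_coro)  # start scrolling without awaiting it
    while not scroll_task.done():
        if reached_limit():          # our own stop condition, e.g. enough links collected
            scroll_task.cancel()     # interrupt the background scroll
            break
        await asyncio.sleep(0.2)     # yield control so the scroll task can progress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
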

&lt;p&gt;On deeper investigation, you may also encounter a TikTok page that doesn't load user videos but only shows a button and an error message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudoik5lxmizy5d1ztwm9.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudoik5lxmizy5d1ztwm9.webp" alt="Error Page" width="800" height="862"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's very important to handle this case.&lt;/p&gt;

&lt;p&gt;During testing, I also discovered that you need to interact with the page before scrolling; otherwise, &lt;code&gt;infinite_scroll&lt;/code&gt; doesn't load new elements. I think this is a TikTok bug.&lt;/p&gt;

&lt;p&gt;Let's start with a simple function to extract video links. It will help avoid code duplication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# routes.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Page&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.router&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt;


&lt;span class="c1"&gt;# Helper function that extracts all loaded video links
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_video_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract all loaded video links from the page.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;links&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[data-e2e=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-post-item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;] a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;post_link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;post_link&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/video/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;post_link&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_link&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;video&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can move on to the main handler that will process TikTok user pages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# routes.py
&lt;/span&gt;
&lt;span class="c1"&gt;# Main handler used for TikTok user pages
&lt;/span&gt;&lt;span class="nd"&gt;@router.default_handler&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle request without specific label.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Get the limit for video elements from `user_data`
&lt;/span&gt;    &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Limit must be an integer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait until the button or at least a video loads, if the connection is slow
&lt;/span&gt;    &lt;span class="n"&gt;check_locator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[data-e2e=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-post-item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;], main button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;check_locator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# If the button loaded, click it to initiate video loading
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;button&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;main button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Perform interaction with scrolling
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;press&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PageDown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Start `infinite_scroll` as a background task
&lt;/span&gt;    &lt;span class="n"&gt;scroll_task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;infinite_scroll&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Wait until scrolling is completed or until the limit is reached
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;scroll_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;done&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;extract_video_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# If we've already reached the limit, interrupt scrolling and exit the loop
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;scroll_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="c1"&gt;# Switch the asynchronous context to allow other tasks to execute
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;extract_video_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Limit the number of requests to the limit value
&lt;/span&gt;    &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# If the page wasn't properly processed for some reason and didn't find any links,
&lt;/span&gt;    &lt;span class="c1"&gt;# then I want to raise an error for retry
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No video links found&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final stage is handling the video page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# routes.py
&lt;/span&gt;
&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;video&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;video_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle request with the label &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;video&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing video &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract the element containing JSON with data
&lt;/span&gt;    &lt;span class="n"&gt;json_element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#__UNIVERSAL_DATA_FOR_REHYDRATION__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;json_element&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract JSON and convert it to a dictionary
&lt;/span&gt;        &lt;span class="n"&gt;text_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;json_element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;json_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__DEFAULT_SCOPE__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;webapp.video-detail&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;itemInfo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;itemStruct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Create result item
&lt;/span&gt;        &lt;span class="n"&gt;result_item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;nickname&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;nickname&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;handle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uniqueId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;signature&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;signature&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;followers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorStats&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;followerCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;following&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorStats&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;followingCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hearts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorStats&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;heart&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;videos&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorStats&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;videoCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;desc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hashtagName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;textExtra&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hashtagName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hearts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stats&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;diggCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;shares&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stats&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;shareCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stats&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;commentCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;plays&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stats&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;playCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# Save the result to the dataset
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result_item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# If the data wasn't received, we raise an error for retry
&lt;/span&gt;        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No JSON data found&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The crawler is ready for local launch. To run it, execute the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python &lt;span class="nt"&gt;-m&lt;/span&gt; tiktok_crawlee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can view the saved results in the &lt;code&gt;dataset&lt;/code&gt; folder, path &lt;code&gt;./storage/datasets/default/&lt;/code&gt;.&lt;/p&gt;
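
&lt;p&gt;If you want to inspect them programmatically, a quick sketch like this works (assuming the default local storage layout with one JSON file per item):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# inspect_results.py (sketch) - read locally stored dataset items
import json
from pathlib import Path

dataset_dir = Path('./storage/datasets/default')

for item_file in sorted(dataset_dir.glob('*.json')):
    # Skip any internal metadata files
    if item_file.name.startswith('__'):
        continue
    item = json.loads(item_file.read_text(encoding='utf-8'))
    print(item['author']['handle'], item['plays'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
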

&lt;p&gt;Example record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"author"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nickname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"apifyoffice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"7095709566285480965"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"handle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"apifyoffice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"signature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"🤖 web scraping and AI 🤖&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;check out our open positions at ✨apify.it/jobs✨"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"followers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;118&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"following"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hearts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1975&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"videos"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="err"&gt;Fun&lt;/span&gt;&lt;span class="s2"&gt;" is the top word Apifiers used to describe our culture. Here's what else came to their minds 🎤  #workculture #teambuilding #interview #czech #ilovemyjob "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"workculture"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"teambuilding"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"interview"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"czech"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"ilovemyjob"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hearts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"shares"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"comments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"plays"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;448&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Creating a TikTok Actor on the &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify platform&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;For deployment, we'll use the &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify platform&lt;/a&gt;. It's a simple and effective environment for cloud deployment, allowing efficient interaction with your crawler. Call it via &lt;a href="https://docs.apify.com/api/v2/" rel="noopener noreferrer"&gt;API&lt;/a&gt;, &lt;a href="https://docs.apify.com/platform/schedules" rel="noopener noreferrer"&gt;schedule tasks&lt;/a&gt;, &lt;a href="https://docs.apify.com/platform/integrations" rel="noopener noreferrer"&gt;integrate&lt;/a&gt; with various services, and much more.&lt;/p&gt;

&lt;p&gt;To deploy to the Apify platform, we need to adapt our project for the &lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;Apify Actor&lt;/a&gt; structure.&lt;/p&gt;

&lt;p&gt;Create an &lt;code&gt;.actor&lt;/code&gt; folder with the necessary files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; .actor &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;touch&lt;/span&gt; .actor/&lt;span class="o"&gt;{&lt;/span&gt;actor.json,input_schema.json&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Move the &lt;code&gt;Dockerfile&lt;/code&gt; from the root folder to &lt;code&gt;.actor&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mv &lt;/span&gt;Dockerfile .actor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's fill in the empty files:&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;actor.json&lt;/code&gt; file contains project metadata for the Apify platform. Follow the &lt;a href="https://docs.apify.com/platform/actors/development/actor-definition/actor-json" rel="noopener noreferrer"&gt;documentation for proper configuration&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"actorSpecification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TikTok-Crawlee"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TikTok - Crawlee"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"minMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Scrape video elements from TikTok user pages"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"templateId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tiktok-crawlee"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./input_schema.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dockerfile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./Dockerfile"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actor input parameters are defined in &lt;code&gt;input_schema.json&lt;/code&gt;; its format is specified &lt;a href="https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's define input parameters for our crawler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;maxItems&lt;/code&gt; - the maximum number of results to collect, exposed as an externally configurable parameter.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;urls&lt;/code&gt; - links to TikTok user pages that serve as the starting points for the crawler.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;proxySettings&lt;/code&gt; - proxy configuration; without a proxy, requests would come from the datacenter IP addresses that Apify uses.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TikTok Crawlee"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schemaVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"List URLs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"array"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Direct URLs to pages TikTok profiles."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"editor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stringList"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"prefill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"https://www.tiktok.com/@apifyoffice"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"maxItems"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"editor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Limit search results"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Limits the maximum number of results, applies to each search separately."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"proxySettings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Proxy configuration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Select proxies to be used by your scraper."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"prefill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"useApifyProxy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"editor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"proxy"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's update the code to accept input parameters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlaywrightCrawler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConcurrencySettings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.routes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;router&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The crawler entry point.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# highlight-start
&lt;/span&gt;        &lt;span class="c1"&gt;# Accept input parameters passed when starting the Actor
&lt;/span&gt;        &lt;span class="n"&gt;actor_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_input&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;max_items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;maxItems&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_items&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;urls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])]&lt;/span&gt;
        &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_proxy_configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actor_proxy_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;proxySettings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="c1"&gt;# highlight-end
&lt;/span&gt;
        &lt;span class="n"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tasks_per_minute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;proxy_configuration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;request_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;request_handler_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;browser_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;firefox&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;browser_new_context_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;permissions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it, the project is ready for deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Deploying to Apify
&lt;/h2&gt;

&lt;p&gt;Use the official &lt;a href="https://docs.apify.com/cli/" rel="noopener noreferrer"&gt;Apify CLI&lt;/a&gt; to upload your code:&lt;/p&gt;

&lt;p&gt;Authenticate using your API token from &lt;a href="https://console.apify.com/settings/integrations" rel="noopener noreferrer"&gt;Apify Console&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apify login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Choose "Enter API token manually" and paste your token.&lt;/p&gt;

&lt;p&gt;Push the project to the platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apify push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can configure runs on the Apify platform.&lt;/p&gt;

&lt;p&gt;Let's perform a test run:&lt;/p&gt;

&lt;p&gt;Fill in the input parameters:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09yn2ce8zcei8lytdrje.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09yn2ce8zcei8lytdrje.webp" alt="Actor Input" width="800" height="778"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check that logging works correctly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8ox9h61i51m8mdf9d55.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8ox9h61i51m8mdf9d55.webp" alt="Actor Log" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View results in the dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddst3jzsw71eww2rmwec.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddst3jzsw71eww2rmwec.webp" alt="Dataset Results" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this &lt;a href="https://docs.apify.com/platform/actors/publishing" rel="noopener noreferrer"&gt;publishing guide&lt;/a&gt; for &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We've created a solid foundation for crawling TikTok using Crawlee for Python and Playwright. If you want to improve the project, I'd recommend adding error handling and CAPTCHA handling to reduce the likelihood of being blocked by TikTok. Even so, this foundation is enough to start working with TikTok and collect data right away.&lt;/p&gt;
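
&lt;p&gt;As one possible starting point for error handling, you could register a handler for requests that have exhausted all their retries, for example pages that keep hitting a CAPTCHA. This is only a rough sketch, assuming the &lt;code&gt;crawler&lt;/code&gt; instance from &lt;code&gt;main.py&lt;/code&gt; and Crawlee's failed-request hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from crawlee.crawlers import PlaywrightCrawlingContext

@crawler.failed_request_handler
async def failed_handler(context: PlaywrightCrawlingContext, error: Exception) -&gt; None:
    # Called once a request has used up all of its retries.
    # Log the URL so failing profiles can be inspected or re-queued later.
    context.log.warning(f'Request {context.request.url} failed: {error}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;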

&lt;p&gt;You can find the complete code in the &lt;a href="https://github.com/Mantisus/tiktok-crawlee" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you enjoyed this blog, feel free to support Crawlee for Python by starring the &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;repository&lt;/a&gt; or joining the maintainer team.&lt;/p&gt;

&lt;p&gt;Have questions or want to discuss implementation details? Join our &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; - our community of 10,000+ developers is there to help.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>developers</category>
    </item>
    <item>
      <title>How to scrape Bluesky with Python</title>
      <dc:creator>Max Bohomolov</dc:creator>
      <pubDate>Fri, 21 Mar 2025 12:58:22 +0000</pubDate>
      <link>https://forem.com/crawlee/how-to-scrape-bluesky-with-python-2ooo</link>
      <guid>https://forem.com/crawlee/how-to-scrape-bluesky-with-python-2ooo</guid>
      <description>&lt;p&gt;&lt;a href="https://bsky.app/" rel="noopener noreferrer"&gt;Bluesky&lt;/a&gt; is an emerging social network developed by former members of the &lt;a href="https://x.com/" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;(now X) development team. The platform has been showing significant growth recently, reaching 140.3 million visits according to &lt;a href="https://www.similarweb.com/website/bsky.app/#traffic" rel="noopener noreferrer"&gt;SimilarWeb&lt;/a&gt;. Like X, luesky generates a vast amount of data that can be used for analysis. In this article, we’ll  explore how to collect this data using &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;Crawlee for Python&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note: One of our community members wrote this blog as a contribution to the Crawlee Blog. If you’d like to contribute articles like these, please reach out to us on our &lt;a href="https://apify.com/discord" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Key steps we will cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Project setup&lt;/li&gt;
&lt;li&gt;Development of the Bluesky crawler in Python&lt;/li&gt;
&lt;li&gt;Creating an Apify Actor for the Bluesky crawler&lt;/li&gt;
&lt;li&gt;Conclusion and repository access&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Basic understanding of web scraping concepts&lt;/li&gt;
&lt;li&gt;Python 3.9 or higher&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;UV&lt;/a&gt; version 0.6.0 or higher&lt;/li&gt;
&lt;li&gt;Crawlee for Python v0.6.5 or higher&lt;/li&gt;
&lt;li&gt;Bluesky account for API access&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project setup
&lt;/h3&gt;

&lt;p&gt;In this project, we’ll use UV for package management and a specific Python version installed through UV. UV is a fast and modern package manager written in Rust.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;If you don’t have UV installed yet, follow the &lt;a href="https://docs.astral.sh/uv/getting-started/installation/" rel="noopener noreferrer"&gt;guide&lt;/a&gt; or use this command:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install standalone Python using UV:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv &lt;span class="nb"&gt;install &lt;/span&gt;python 3.13
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create a new project and install Crawlee for Python:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv init bluesky-crawlee &lt;span class="nt"&gt;--package&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;bluesky-crawlee
uv add crawlee
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We’ve created a new isolated Python project with all the necessary dependencies for Crawlee.&lt;/p&gt;

&lt;h2&gt;
  
  
  Development of the Bluesky crawler in Python
&lt;/h2&gt;

&lt;p&gt;Note: Before going ahead with the project, I'd like to ask you to star Crawlee for Python on &lt;a href="https://github.com/apify/crawlee-python/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, it helps us to spread the word to fellow scraper developers.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Identifying the data source
&lt;/h3&gt;

&lt;p&gt;When accessing the &lt;a href="https://bsky.app/search?q=apify" rel="noopener noreferrer"&gt;search page&lt;/a&gt;, you'll see data displayed, but be aware of a key limitation: the site only allows viewing the first page of results, preventing access to any additional pages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fsearch_limit-c8ee1da0dc9b48fdb6fb125600519ee3.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fsearch_limit-c8ee1da0dc9b48fdb6fb125600519ee3.webp" alt="Search Limit" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fortunately, Bluesky provides a well-documented &lt;a href="https://docs.bsky.app/docs/get-started" rel="noopener noreferrer"&gt;API&lt;/a&gt; that is accessible to any registered user without additional permissions. This is what we’ll use for data collection.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Creating a session for API interaction
&lt;/h3&gt;

&lt;p&gt;For secure API interaction, you need to create a dedicated app password instead of using your main account password.&lt;/p&gt;

&lt;p&gt;Go to Settings -&amp;gt; Privacy and Security -&amp;gt; &lt;a href="https://bsky.app/settings/app-passwords" rel="noopener noreferrer"&gt;App Passwords&lt;/a&gt; and click &lt;em&gt;Add App Password&lt;/em&gt;.&lt;br&gt;
Important: Save the generated password, as it won’t be visible after creation.&lt;/p&gt;

&lt;p&gt;Next, create environment variables to store your credentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your application password&lt;/li&gt;
&lt;li&gt;Your user identifier (found in your profile and Bluesky URL, for example: &lt;a href="https://bsky.app/profile/mantisus.bsky.social" rel="noopener noreferrer"&gt;&lt;code&gt;mantisus.bsky.social&lt;/code&gt;&lt;/a&gt;)
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BLUESKY_APP_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_app_password
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;BLUESKY_IDENTIFIER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_identifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Using the &lt;a href="https://docs.bsky.app/docs/api/com-atproto-server-create-session" rel="noopener noreferrer"&gt;createSession&lt;/a&gt;, &lt;a href="https://docs.bsky.app/docs/api/com-atproto-server-delete-session" rel="noopener noreferrer"&gt;deleteSession&lt;/a&gt; endpoints and &lt;a href="https://www.python-httpx.org/" rel="noopener noreferrer"&gt;&lt;code&gt;httpx&lt;/code&gt;&lt;/a&gt;, we can create a session for API interaction.&lt;/p&gt;

&lt;p&gt;Let’s create a class with the necessary methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;yarl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.configuration&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Configuration&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.http_clients&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpxHttpClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.storages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;

&lt;span class="c1"&gt;# Environment variables for authentication
# BLUESKY_APP_PASSWORD: App-specific password generated from Bluesky settings
# BLUESKY_IDENTIFIER: Your Bluesky handle (e.g., username.bsky.social)
&lt;/span&gt;&lt;span class="n"&gt;BLUESKY_APP_PASSWORD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BLUESKY_APP_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;BLUESKY_IDENTIFIER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BLUESKY_IDENTIFIER&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BlueskyApiScraper&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A scraper class for extracting data from Bluesky social network using their official API.

    This scraper manages authentication, concurrent requests, and data collection for both
    posts and user profiles. It uses separate datasets for storing post and user information.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_crawler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HttpCrawler&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_users&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_posts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="c1"&gt;# Variables for storing session data
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_service_endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_user_did&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_access_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_refresh_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_handle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create credentials for the session.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://bsky.social/xrpc/com.atproto.server.createSession&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;identifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BLUESKY_IDENTIFIER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BLUESKY_APP_PASSWORD&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_service_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;didDoc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;serviceEndpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_user_did&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;didDoc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_access_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accessJwt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_refresh_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;refreshJwt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_handle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;handle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;delete_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Delete the current session.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_service_endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/xrpc/com.atproto.server.deleteSession&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_refresh_token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The session expires after 2 hours, so if you plan for your crawler to run longer, you should also add a method that refreshes it via the &lt;a href="https://docs.bsky.app/docs/api/com-atproto-server-refresh-session" rel="noopener noreferrer"&gt;refreshSession&lt;/a&gt; endpoint.&lt;/p&gt;
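
&lt;p&gt;A rough sketch of what such a method could look like, following the same &lt;code&gt;httpx&lt;/code&gt; pattern as &lt;code&gt;create_session&lt;/code&gt; (the &lt;code&gt;refreshSession&lt;/code&gt; endpoint exchanges the refresh token for a new pair of tokens):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def refresh_session(self) -&gt; None:
    """Refresh the session tokens (sketch, not part of the original class)."""
    url = f'{self._service_endpoint}/xrpc/com.atproto.server.refreshSession'
    headers = {'Content-Type': 'application/json', 'authorization': f'Bearer {self._refresh_token}'}

    response = httpx.post(url, headers=headers)
    response.raise_for_status()

    data = response.json()
    self._access_token = data['accessJwt']
    self._refresh_token = data['refreshJwt']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keep in mind that if the crawler is already running, you would also need to update the &lt;code&gt;Authorization&lt;/code&gt; header used by the HTTP client with the new access token.&lt;/p&gt;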

&lt;h3&gt;
  
  
  3. Configuring Crawlee for Python for data collection
&lt;/h3&gt;

&lt;p&gt;Since we’ll be using the official API, we do not need to worry about being blocked by Bluesky. However, we should be careful with the number of requests to avoid overloading Bluesky's servers, so we will configure &lt;a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" rel="noopener noreferrer"&gt;&lt;code&gt;ConcurrencySettings&lt;/code&gt;&lt;/a&gt;. We’ll also configure &lt;a href="https://www.crawlee.dev/python/api/class/HttpxHttpClient" rel="noopener noreferrer"&gt;&lt;code&gt;HttpxHttpClient&lt;/code&gt;&lt;/a&gt; to use custom headers with the current session's &lt;code&gt;Authorization&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We’ll use two endpoints for data collection: &lt;a href="https://docs.bsky.app/docs/api/app-bsky-feed-search-posts" rel="noopener noreferrer"&gt;searchPosts&lt;/a&gt; for posts and &lt;a href="https://docs.bsky.app/docs/api/app-bsky-actor-get-profile" rel="noopener noreferrer"&gt;getProfile&lt;/a&gt; for user profiles. If you plan to scale the crawler, you can use &lt;a href="https://docs.bsky.app/docs/api/app-bsky-actor-get-profiles" rel="noopener noreferrer"&gt;getProfiles&lt;/a&gt; to fetch user data in batches, but in that case you’ll need to implement deduplication logic yourself; as long as each request URL is unique, Crawlee for Python handles deduplication for you.&lt;/p&gt;
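
&lt;p&gt;If you do go the &lt;code&gt;getProfiles&lt;/code&gt; route, deduplication can be as simple as remembering which DIDs you have already requested and batching the rest (the endpoint accepts up to 25 actors per call). A minimal sketch; the helper and the &lt;code&gt;users&lt;/code&gt; label are hypothetical, not part of the final crawler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from urllib.parse import urlencode

# Hypothetical helper for the getProfiles approach: deduplicate author DIDs
# and batch them into requests of at most 25 actors each.
seen_dids: set[str] = set()

def build_profile_requests(service_endpoint: str, dids: list[str]) -&gt; list[Request]:
    new_dids = [did for did in dids if did not in seen_dids]
    seen_dids.update(new_dids)

    requests = []
    for i in range(0, len(new_dids), 25):
        batch = new_dids[i:i + 25]
        query = urlencode([('actors', did) for did in batch])
        url = f'{service_endpoint}/xrpc/app.bsky.actor.getProfiles?{query}'
        requests.append(Request.from_url(url, label='users'))
    return requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;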

&lt;p&gt;I’d like to keep user data and post data separate, so we’ll use different &lt;a href="https://www.crawlee.dev/python/api/class/Dataset" rel="noopener noreferrer"&gt;&lt;code&gt;Dataset&lt;/code&gt;&lt;/a&gt; instances for storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init_crawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Initialize the crawler.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_user_did&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Session not created.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Initialize the datasets purge the data if it is not empty
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;purge_on_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_posts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;posts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;purge_on_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Initialize the crawler
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HttpCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;http_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HttpxHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;# Set headers for API requests
&lt;/span&gt;            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_access_token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Connection&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Keep-Alive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accept-encoding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gzip, deflate, br, zstd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# Configuring concurrency of crawling requests
&lt;/span&gt;        &lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;min_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;desired_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tasks_per_minute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_search_handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Handler for search requests
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_user_handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Handler for user requests
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Implementing handlers for data collection
&lt;/h3&gt;

&lt;p&gt;Now we can implement the handler for searching posts. We’ll save the retrieved posts in &lt;code&gt;self._posts&lt;/code&gt; and create requests for user data, placing them in the crawler's queue. We also need to handle pagination by forming the link to the next search page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_search_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing search &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;posts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No posts found in response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="n"&gt;user_requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;posts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="n"&gt;profile_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_service_endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/xrpc/app.bsky.actor.getProfile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;posts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Add user request if not already added in current context
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;user_requests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;user_requests&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profile_url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author_did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;createdAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indexed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indexedAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reply_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replyCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;repost_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;repostCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;like_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;likeCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quote_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quoteCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;langs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;langs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])),&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reply_parent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reply&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reply_root&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reply&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Push a batch of posts to the dataset
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cursor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;next_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;update_query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cursor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  &lt;span class="c1"&gt;# Use yarl for update the query string
&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_url&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we receive user data, we'll store it in the corresponding &lt;code&gt;Dataset&lt;/code&gt;, &lt;code&gt;self._users&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_user_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing user &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;user_item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;createdAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avatar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avatar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;displayName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;handle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;handle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indexed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indexedAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;posts_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postsCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;followers_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;followersCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;follows_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;followsCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Saving data to files
&lt;/h3&gt;

&lt;p&gt;To save the results, we'll use the &lt;a href="https://www.crawlee.dev/python/api/class/Dataset#write_to_json" rel="noopener noreferrer"&gt;&lt;code&gt;write_to_json&lt;/code&gt;&lt;/a&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Save the data.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_users&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_posts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Datasets not initialized.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;users.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;posts.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Running the crawler
&lt;/h3&gt;

&lt;p&gt;We now have everything we need to complete the crawler. All that's left is a method that runs the crawl - let's call it &lt;code&gt;crawl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Crawl the given URL.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_crawler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Crawler not initialized.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;search_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_service_endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/xrpc/app.bsky.feed.searchPosts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's finalize the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Main execution function that orchestrates the crawling process.

    Creates a scraper instance, manages the session, and handles the complete
    crawling lifecycle including proper cleanup on completion or error.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;scraper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlueskyApiScraper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init_crawler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;apify&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crawlee&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Entry point for the crawler application.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you check your &lt;code&gt;pyproject.toml&lt;/code&gt;, you'll see that uv created an entrypoint, &lt;code&gt;bluesky-crawlee = "bluesky_crawlee:main"&lt;/code&gt;, so we can run our crawler simply by executing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run bluesky-crawlee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's look at sample results:&lt;/p&gt;

&lt;p&gt;Posts&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fposts-9156686b24a69b73efbc3915f1c8d18e.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fposts-9156686b24a69b73efbc3915f1c8d18e.webp" alt="Posts Example" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Users&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fusers-d896c9f24165a0e970d2b26c54def9eb.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fusers-d896c9f24165a0e970d2b26c54def9eb.webp" alt="Users Example" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create Apify Actor for Bluesky crawler
&lt;/h2&gt;

&lt;p&gt;We already have a fully functional implementation for local execution. Let's explore how to adapt it to run on the &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify Platform&lt;/a&gt; and turn it into an &lt;a href="https://docs.apify.com/platform/actors" rel="noopener noreferrer"&gt;Apify Actor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;An Actor is a simple and efficient way to deploy your code to the Apify Platform's cloud infrastructure. You can interact with the Actor flexibly, &lt;a href="https://docs.apify.com/platform/schedules" rel="noopener noreferrer"&gt;schedule regular runs&lt;/a&gt; to monitor data, or &lt;a href="https://docs.apify.com/platform/integrations" rel="noopener noreferrer"&gt;integrate&lt;/a&gt; it with other tools to build data processing flows.&lt;/p&gt;

&lt;p&gt;First, create an &lt;code&gt;.actor&lt;/code&gt; directory with platform configuration files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; .actor &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;touch&lt;/span&gt; .actor/&lt;span class="o"&gt;{&lt;/span&gt;actor.json,Dockerfile,input_schema.json&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add &lt;a href="https://docs.apify.com/sdk/python/" rel="noopener noreferrer"&gt;Apify SDK for Python&lt;/a&gt; as a project dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add apify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configure Dockerfile
&lt;/h3&gt;

&lt;p&gt;We’ll use the official &lt;a href="https://docs.apify.com/academy/deploying-your-code/docker-file" rel="noopener noreferrer"&gt;Apify Docker image&lt;/a&gt; along with the recommended &lt;a href="https://docs.astral.sh/uv/guides/integration/docker/" rel="noopener noreferrer"&gt;uv practices for Docker&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; apify/actor-python:3.13&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH='/app/.venv/bin:$PATH'&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; pyproject.toml uv.lock ./&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--frozen&lt;/span&gt; &lt;span class="nt"&gt;--no-install-project&lt;/span&gt; &lt;span class="nt"&gt;--no-editable&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="nt"&gt;--no-dev&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--frozen&lt;/span&gt; &lt;span class="nt"&gt;--no-editable&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="nt"&gt;--no-dev&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["bluesky-crawlee"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;bluesky-crawlee&lt;/code&gt; refers to the entrypoint specified in &lt;code&gt;pyproject.toml&lt;/code&gt;.&lt;/p&gt;
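
&lt;p&gt;For reference, the entrypoint lives in the &lt;code&gt;[project.scripts]&lt;/code&gt; table of &lt;code&gt;pyproject.toml&lt;/code&gt;, mapping the command name to the package's &lt;code&gt;main&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;[project.scripts]
bluesky-crawlee = "bluesky_crawlee:main"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
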

&lt;h3&gt;
  
  
  Define project metadata in actor.json
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;actor.json&lt;/code&gt; file contains project metadata for Apify Platform. Follow the &lt;a href="https://docs.apify.com/platform/actors/development/actor-definition/actor-json" rel="noopener noreferrer"&gt;documentation for proper configuration&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"actorSpecification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bluesky-Crawlee"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bluesky - Crawlee"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"minMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Scrape data products from bluesky"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"templateId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bluesky-crawlee"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./input_schema.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dockerfile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./Dockerfile"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Define Actor input parameters
&lt;/h3&gt;

&lt;p&gt;Our crawler requires several external parameters. Let’s define them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identifier: User's Bluesky identifier (encrypted for security)&lt;/li&gt;
&lt;li&gt;appPassword: Bluesky app password (encrypted)&lt;/li&gt;
&lt;li&gt;queries: List of search queries for crawling&lt;/li&gt;
&lt;li&gt;maxRequestsPerCrawl: Optional limit on the number of requests, useful for testing&lt;/li&gt;
&lt;li&gt;mode: Choose between collecting posts or data about the users who post on specific topics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Configure the input schema following the &lt;a href="https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1" rel="noopener noreferrer"&gt;specification&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bluesky - Crawlee"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schemaVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"identifier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bluesky identifier"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bluesky identifier for API login"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"editor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"textfield"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"isSecret"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"appPassword"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bluesky app password"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bluesky app password for API"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"editor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"textfield"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"isSecret"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"maxRequestsPerCrawl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Max requests per crawl"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Maximum number of requests for crawling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"queries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Queries"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"array"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Search queries"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"editor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stringList"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prefill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"apify"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"example"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"apify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"crawlee"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Collect posts or users who post on a topic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"enum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"posts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"users"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"posts"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"identifier"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"appPassword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"queries"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"mode"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
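
&lt;p&gt;For illustration, an Actor input matching this schema could look like this (all values below are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "identifier": "your-handle.bsky.social",
  "appPassword": "xxxx-xxxx-xxxx-xxxx",
  "queries": ["apify", "crawlee"],
  "mode": "posts",
  "maxRequestsPerCrawl": 100
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
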



&lt;h3&gt;
  
  
  Update project code
&lt;/h3&gt;

&lt;p&gt;Remove the environment variables and parameterize the code according to the Actor input parameters. Also replace the named datasets with the default dataset, since each Actor run on the platform gets its own default dataset.&lt;/p&gt;
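
&lt;p&gt;As a minimal sketch of what this means in practice (the full updated code, including the &lt;code&gt;ActorInput&lt;/code&gt; dataclass, follows below), we read the Actor input inside an &lt;code&gt;async with Actor&lt;/code&gt; block and push results to the run's default dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch only - the full updated code follows below.
from apify import Actor


async def run() -&amp;gt; None:
    async with Actor:
        # Read the Actor input defined by input_schema.json
        actor_input = await Actor.get_input() or {}
        queries = actor_input.get('queries', [])
        mode = actor_input.get('mode', 'posts')
        max_requests = actor_input.get('maxRequestsPerCrawl')  # passed to HttpCrawler as max_requests_per_crawl

        Actor.log.info(f'Crawling {queries} in {mode} mode, limit: {max_requests}')

        # Instead of the named datasets used locally, push results
        # to the run's default dataset:
        # await Actor.push_data(items)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;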

&lt;p&gt;Add Actor logging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# __init__.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify.log&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ActorLogFormatter&lt;/span&gt;

&lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setFormatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ActorLogFormatter&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;apify_client_logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;apify_client&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;apify_client_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;apify_client_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;apify_logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;apify&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;apify_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEBUG&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;apify_logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Update imports and entry point code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;yarl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.http_clients&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpxHttpClient&lt;/span&gt;


&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ActorInput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Actor input schema.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;identifier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;app_password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Main execution function that orchestrates the crawling process.

    Creates a scraper instance, manages the session, and handles the complete
    crawling lifecycle including proper cleanup on completion or error.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;raw_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_input&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;actor_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ActorInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;identifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;raw_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;app_password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;raw_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;appPassword&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;raw_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;queries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]),&lt;/span&gt;
            &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;raw_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;posts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;raw_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;maxRequestsPerCrawl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scraper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlueskyApiScraper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;identifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app_password&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init_crawler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTPError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HTTP error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unexpected error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_exc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;scraper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Entry point for the scraper application.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Update methods with Actor input parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BlueskyApiScraper&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A scraper class for extracting data from Bluesky social network using their official API.

    This scraper manages authentication, concurrent requests, and data collection for both
    posts and user profiles. It uses separate datasets for storing post and user information.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_crawler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HttpCrawler&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_request&lt;/span&gt;

        &lt;span class="c1"&gt;# Variables for storing session data
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_service_endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_user_did&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_access_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_refresh_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_handle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;identifier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create credentials for the session.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://bsky.social/xrpc/com.atproto.server.createSession&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;identifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;identifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_service_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;didDoc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;serviceEndpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_user_did&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;didDoc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_access_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accessJwt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_refresh_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;refreshJwt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_handle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;handle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Implement mode-aware data collection logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_search_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle search requests based on mode.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing search &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;posts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No posts found in response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="n"&gt;user_requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;posts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="n"&gt;profile_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_service_endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/xrpc/app.bsky.actor.getProfile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;posts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;user_requests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;user_requests&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profile_url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;posts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author_did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;createdAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indexed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indexedAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reply_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replyCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;repost_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;repostCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;like_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;likeCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quote_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quoteCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;langs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;langs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])),&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reply_parent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reply&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reply_root&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reply&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;posts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cursor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;next_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;update_query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cursor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_url&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Update the user handler for the default dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_user_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle user profile requests.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing user &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;user_item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;did&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;createdAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avatar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avatar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;displayName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;handle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;handle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indexed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indexedAt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;posts_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postsCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;followers_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;followersCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;follows_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;followsCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploy
&lt;/h3&gt;

&lt;p&gt;Use the official &lt;a href="https://docs.apify.com/cli/" rel="noopener noreferrer"&gt;Apify CLI&lt;/a&gt; to upload your code:&lt;/p&gt;
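
&lt;p&gt;If you don't have the CLI yet, it can be installed globally via npm (see the Apify CLI docs for other installation options):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm install -g apify-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;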

&lt;p&gt;Authenticate using your API token from &lt;a href="https://console.apify.com/settings/integrations" rel="noopener noreferrer"&gt;Apify Console&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apify login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Choose "Enter API token manually" and paste your token.&lt;/p&gt;

&lt;p&gt;Push the project to the platform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apify push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can configure runs on Apify Platform.&lt;/p&gt;
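
&lt;p&gt;If you prefer the terminal, you can also start a run of the pushed Actor on the platform with the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apify call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;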

&lt;p&gt;Let’s perform a test run:&lt;/p&gt;

&lt;p&gt;Fill in the input parameters:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Finput_actor-20bb99df05dea1b2e799d92d6e3750f5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Finput_actor-20bb99df05dea1b2e799d92d6e3750f5.webp" alt="Actor Input" width="800" height="783"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check that logging works correctly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Factor_log-c74fa12a02ea0ff9ec3f77cfcb02bc52.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Factor_log-c74fa12a02ea0ff9ec3f77cfcb02bc52.webp" alt="Actor Log" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View results in the dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Factor_results-dca44d296e6897737ef338a19b7b2177.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Factor_results-dca44d296e6897737ef338a19b7b2177.webp" alt="Dataset Results" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this &lt;a href="https://docs.apify.com/platform/actors/publishing" rel="noopener noreferrer"&gt;publishing guide&lt;/a&gt; for &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and repository access
&lt;/h2&gt;

&lt;p&gt;We’ve created an efficient crawler for Bluesky using the official API. If you want to take this further for regular data extraction from Bluesky, I recommend exploring &lt;a href="https://docs.bsky.app/docs/starter-templates/custom-feeds" rel="noopener noreferrer"&gt;custom feed generation&lt;/a&gt; - I think it opens up some interesting possibilities.&lt;/p&gt;

&lt;p&gt;And if you need to quickly create a crawler that can retrieve data for various queries, you now have everything you need.&lt;/p&gt;

&lt;p&gt;You can find the complete code in the &lt;a href="https://github.com/Mantisus/bluesky-crawlee" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you enjoyed this blog, feel free to support Crawlee for Python by starring the &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;repository&lt;/a&gt; or joining the maintainer team.&lt;/p&gt;

&lt;p&gt;Have questions or want to discuss implementation details? Join our &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; - our community of 10,000+ developers is there to help.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>developers</category>
    </item>
    <item>
      <title>How to scrape Crunchbase using Python in 2024 (Easy Guide)</title>
      <dc:creator>Max Bohomolov</dc:creator>
      <pubDate>Wed, 15 Jan 2025 14:38:03 +0000</pubDate>
      <link>https://forem.com/crawlee/how-to-scrape-crunchbase-using-python-in-2024-easy-guide-45on</link>
      <guid>https://forem.com/crawlee/how-to-scrape-crunchbase-using-python-in-2024-easy-guide-45on</guid>
      <description>&lt;p&gt;Python developers know the drill: you need reliable company data, and Crunchbase has it. This guide shows you how to build an effective &lt;a href="https://www.crunchbase.com/" rel="noopener noreferrer"&gt;Crunchbase&lt;/a&gt; scraper in Python that gets you the data you need.&lt;/p&gt;

&lt;p&gt;Crunchbase tracks details that matter: locations, business focus, founders, and investment histories. Manual extraction from such a large dataset isn't practical - automation is essential for transforming this information into an analyzable format.&lt;/p&gt;

&lt;p&gt;In this blog, we'll explore three different ways to extract data from Crunchbase using &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;&lt;code&gt;Crawlee for Python&lt;/code&gt;&lt;/a&gt;. We'll fully implement two of them and discuss the specifics and challenges of the third. This will help us better understand how important it is to properly &lt;a href="https://www.crawlee.dev/blog/web-scraping-tips#1-choosing-a-data-source-for-the-project" rel="noopener noreferrer"&gt;choose the right data source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note: This guide comes from a developer in our growing community. Have you built interesting projects with Crawlee? Join us on &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; to share your experiences and blog ideas - we value these contributions from developers like you.&lt;/p&gt;

&lt;p&gt;Key steps we'll cover:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Project setup&lt;/li&gt;
&lt;li&gt;Choosing the data source&lt;/li&gt;
&lt;li&gt;Implementing sitemap-based crawler&lt;/li&gt;
&lt;li&gt;Analysis of search-based approach and its limitations&lt;/li&gt;
&lt;li&gt;Implementing the official API crawler&lt;/li&gt;
&lt;li&gt;Conclusion and repository access&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.9 or higher&lt;/li&gt;
&lt;li&gt;Familiarity with web scraping concepts&lt;/li&gt;
&lt;li&gt;Crawlee for Python &lt;code&gt;v0.5.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;poetry &lt;code&gt;v2.0&lt;/code&gt; or higher&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project setup
&lt;/h3&gt;

&lt;p&gt;Before we start scraping, we need to set up our project. In this guide, we won't be using crawler templates (&lt;code&gt;Playwright&lt;/code&gt; and &lt;code&gt;Beautifulsoup&lt;/code&gt;), so we'll set up the project manually.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Install &lt;a href="https://python-poetry.org/" rel="noopener noreferrer"&gt;&lt;code&gt;Poetry&lt;/code&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipx &lt;span class="nb"&gt;install &lt;/span&gt;poetry
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create and navigate to the project folder.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;crunchbase-crawlee &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;crunchbase-crawlee
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Initialize the project using Poetry, leaving all fields empty.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry init
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;When prompted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For "Compatible Python versions", enter: &lt;code&gt;&amp;gt;={your Python version},&amp;lt;4.0&lt;/code&gt;
(For example, if you're using Python 3.10, enter: &lt;code&gt;&amp;gt;=3.10,&amp;lt;4.0&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Leave all other fields empty by pressing Enter&lt;/li&gt;
&lt;li&gt;Confirm the generation by typing "yes"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add and install Crawlee with the necessary dependencies to your project using &lt;code&gt;Poetry&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry add crawlee[parsel,curl-impersonate]
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Complete the project setup by creating the standard file structure for &lt;code&gt;Crawlee for Python&lt;/code&gt; projects.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;crunchbase-crawlee &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;touch &lt;/span&gt;crunchbase-crawlee/&lt;span class="o"&gt;{&lt;/span&gt;__init__.py,__main__.py,main.py,routes.py&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After setting up the basic project structure, we can explore different methods of obtaining data from Crunchbase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing the data source
&lt;/h3&gt;

&lt;p&gt;While we can extract target data directly from the &lt;a href="https://www.crunchbase.com/organization/apify" rel="noopener noreferrer"&gt;company page&lt;/a&gt;, we need to choose the best way to navigate the site.&lt;/p&gt;

&lt;p&gt;A careful examination of Crunchbase's structure shows that we have three main options for obtaining data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.crunchbase.com/www-sitemaps/sitemap-index.xml" rel="noopener noreferrer"&gt;Sitemap&lt;/a&gt; - for complete site traversal.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.crunchbase.com/discover/organization.companies" rel="noopener noreferrer"&gt;Search&lt;/a&gt; - for targeted data collection.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://data.crunchbase.com/v4-legacy/docs/crunchbase-basic-getting-started" rel="noopener noreferrer"&gt;Official API&lt;/a&gt; - recommended method.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's examine each of these approaches in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping Crunchbase using sitemap and Crawlee for Python
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Sitemap&lt;/code&gt; is a standard way of site navigation used by crawlers like &lt;a href="https://google.com/" rel="noopener noreferrer"&gt;&lt;code&gt;Google&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://ahrefs.com/" rel="noopener noreferrer"&gt;&lt;code&gt;Ahrefs&lt;/code&gt;&lt;/a&gt;, and other search engines. All crawlers must follow the rules described in &lt;a href="https://www.crunchbase.com/robots.txt" rel="noopener noreferrer"&gt;&lt;code&gt;robots.txt&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's look at the structure of Crunchbase's Sitemap:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fsitemap_lvl_one-553a6b9df5c5d3c35a8987878456fe7b.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fsitemap_lvl_one-553a6b9df5c5d3c35a8987878456fe7b.webp" alt="Sitemap first lvl" width="800" height="595"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, links to organization pages are located inside second-level &lt;code&gt;Sitemap&lt;/code&gt; files, which are compressed using &lt;code&gt;gzip&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The structure of one of these files looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fsitemap_lvl_two-8f3213f305713ebf8bf91b32febfa234.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fsitemap_lvl_two-8f3213f305713ebf8bf91b32febfa234.webp" alt="Sitemap second lvl" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;lastmod&lt;/code&gt; field is particularly important here. It allows tracking which companies have updated their information since the previous data collection. This is especially useful for regular data updates.&lt;/p&gt;
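
&lt;p&gt;For example, if you re-run the crawler on a schedule, you can compare &lt;code&gt;lastmod&lt;/code&gt; against the date of your previous run and only queue companies that have changed. Here's a minimal sketch of that idea (the &lt;code&gt;last_run&lt;/code&gt; cutoff is purely illustrative, and it assumes the &lt;code&gt;lastmod&lt;/code&gt; values carry an explicit UTC offset):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A sketch of incremental crawling based on lastmod (illustrative only).
from datetime import datetime, timezone

from parsel import Selector

# Hypothetical cutoff: when our previous crawl ran.
last_run = datetime(2024, 12, 1, tzinfo=timezone.utc)


def updated_company_urls(sitemap_xml: str) -&amp;gt; list[str]:
    """Return only the organization URLs whose lastmod is newer than last_run."""
    selector = Selector(sitemap_xml)
    urls = []
    for entry in selector.xpath('//url'):
        loc = entry.xpath('./loc/text()').get()
        lastmod = entry.xpath('./lastmod/text()').get()
        if not loc or not lastmod:
            continue
        # Assumes lastmod looks like 2024-12-15T10:00:00+00:00
        if datetime.fromisoformat(lastmod) &amp;gt; last_run:
            urls.append(loc)
    return urls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
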

&lt;h3&gt;
  
  
  1. Configuring the crawler for scraping
&lt;/h3&gt;

&lt;p&gt;To work with the site, we'll use &lt;a href="https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient" rel="noopener noreferrer"&gt;&lt;code&gt;CurlImpersonateHttpClient&lt;/code&gt;&lt;/a&gt;, which impersonates a &lt;code&gt;Safari&lt;/code&gt; browser. While this choice might seem unexpected for working with a sitemap, it's necessitated by Crunchbase's protection features.&lt;/p&gt;

&lt;p&gt;The reason is that Crunchbase uses &lt;a href="https://www.cloudflare.com/" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt; to protect against automated access. This is clearly visible when analyzing traffic on a company page:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fcloudflare_link-bf8b6ba2c873ccb31463258e5964e39b.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fcloudflare_link-bf8b6ba2c873ccb31463258e5964e39b.webp" alt="Cloudflare Link" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An interesting feature is that &lt;code&gt;challenges.cloudflare&lt;/code&gt; is executed after loading the document with data. This means we receive the data first, and only then JavaScript checks if we're a bot. If our HTTP client's fingerprint is sufficiently similar to a real browser, we'll successfully receive the data.&lt;/p&gt;

&lt;p&gt;Cloudflare &lt;a href="https://developers.cloudflare.com/waf/custom-rules/use-cases/allow-traffic-from-verified-bots/" rel="noopener noreferrer"&gt;also analyzes traffic at the sitemap level&lt;/a&gt;. If our crawler doesn't look legitimate, access will be blocked. That's why we impersonate a real browser.&lt;/p&gt;

&lt;p&gt;To prevent blocks due to overly aggressive crawling, we'll configure &lt;a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" rel="noopener noreferrer"&gt;&lt;code&gt;ConcurrencySettings&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When scaling this approach, you'll likely need proxies. Detailed information about proxy setup can be found in the &lt;a href="https://www.crawlee.dev/python/docs/guides/proxy-management" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
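
&lt;p&gt;If you do add proxies, the sketch below shows the general shape: create a &lt;code&gt;ProxyConfiguration&lt;/code&gt; and pass it to the crawler through the &lt;code&gt;proxy_configuration&lt;/code&gt; argument. The proxy URLs here are placeholders, not real endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A sketch of proxy usage - the URLs below are placeholders.
from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://user:password@proxy-1.example.com:8000',
        'http://user:password@proxy-2.example.com:8000',
    ]
)

# Then pass it to the crawler alongside the other settings:
# crawler = ParselCrawler(..., proxy_configuration=proxy_configuration)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
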

&lt;p&gt;We'll save our scraping results in &lt;code&gt;JSON&lt;/code&gt; format. Here's how the basic crawler configuration looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HttpHeaders&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ParselCrawler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.http_clients&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CurlImpersonateHttpClient&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.routes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;router&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The crawler entry point.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;concurrency_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tasks_per_minute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;http_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CurlImpersonateHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;safari17_0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HttpHeaders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accept-language&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accept-encoding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gzip, deflate, br, zstd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ParselCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;request_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_request_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;http_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;http_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.crunchbase.com/www-sitemaps/sitemap-index.xml&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_data_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crunchbase_data.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Implementing sitemap navigation
&lt;/h3&gt;

&lt;p&gt;Sitemap navigation happens in two stages. In the first stage, we need to get a list of all files containing organization information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# routes.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ParselCrawlingContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.router&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ParselCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt;


&lt;span class="nd"&gt;@router.default_handler&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ParselCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Default request handler.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default_handler processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sitemap&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//loc[contains(., &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sitemap-organizations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)]/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Since this is a tutorial, I don't want to upload more than one sitemap link
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the second stage, we process second-level sitemap files stored in &lt;code&gt;gzip&lt;/code&gt; format. This requires a special approach as the data needs to be decompressed first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# routes.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;decompress&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;parsel&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Selector&lt;/span&gt;


&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sitemap&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sitemap_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ParselCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Sitemap gzip request handler.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sitemap_handler processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decompress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;selector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//loc/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Extracting and saving data
&lt;/h3&gt;

&lt;p&gt;Each company page contains a large amount of information. For demonstration purposes, we'll focus on the main fields: &lt;code&gt;Company Name&lt;/code&gt;, &lt;code&gt;Short Description&lt;/code&gt;, &lt;code&gt;Website&lt;/code&gt;, and &lt;code&gt;Location&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One of Crunchbase's advantages is that all data is stored in &lt;code&gt;JSON&lt;/code&gt; format within the page:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fdata_json-7c79a7387510a995f29ba5ce157f0845.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fdata_json-7c79a7387510a995f29ba5ce157f0845.webp" alt="Company Data" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This significantly simplifies data extraction - we only need to use one &lt;code&gt;XPath&lt;/code&gt; selector to get the &lt;code&gt;JSON&lt;/code&gt;, and then apply &lt;a href="https://jmespath.org/" rel="noopener noreferrer"&gt;&lt;code&gt;jmespath&lt;/code&gt;&lt;/a&gt; to extract the needed fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# routes.py
&lt;/span&gt;
&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;company_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ParselCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Company request handler.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;company_handler processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;json_selector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//*[@id=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ng-state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Company Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json_selector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jmespath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HttpState.*.data[].properties.identifier.value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Short Description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json_selector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jmespath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HttpState.*.data[].properties.short_description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Website&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json_selector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jmespath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HttpState.*.data[].cards.company_about_fields2.website.value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;json_selector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jmespath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HttpState.*.data[].cards.company_about_fields2.location_identifiers[].value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
                &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The collected data is saved in &lt;code&gt;Crawlee for Python&lt;/code&gt;'s internal storage using the &lt;code&gt;context.push_data&lt;/code&gt; method. When the crawler finishes, we export all collected data to a JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_data_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crunchbase_data.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Running the project
&lt;/h3&gt;

&lt;p&gt;With all components in place, we need to create an entry point for our crawler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# __main__.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.main&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Execute the crawler using Poetry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry run python &lt;span class="nt"&gt;-m&lt;/span&gt; crunchbase-crawlee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Characteristics of the sitemap crawler approach
&lt;/h3&gt;

&lt;p&gt;The sitemap approach has its distinct advantages and limitations. It's ideal in the following cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When you need to collect data about all companies on the platform&lt;/li&gt;
&lt;li&gt;When there are no specific company selection criteria&lt;/li&gt;
&lt;li&gt;If you have sufficient time and computational resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, there are significant limitations to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Almost no ability to filter data during collection&lt;/li&gt;
&lt;li&gt;Requires constant monitoring of Cloudflare blocks&lt;/li&gt;
&lt;li&gt;Scaling the solution requires proxy servers, which increases project costs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Using search for scraping Crunchbase
&lt;/h2&gt;

&lt;p&gt;The limitations of the sitemap approach might point to search as the next solution. However, Crunchbase applies tighter security measures to its search functionality compared to its public pages.&lt;/p&gt;

&lt;p&gt;The key difference lies in how Cloudflare protection works. While we receive data before the &lt;code&gt;challenges.cloudflare&lt;/code&gt; check when accessing a company page, the search API requires valid &lt;code&gt;cookies&lt;/code&gt; that have passed this check.&lt;/p&gt;

&lt;p&gt;Let's verify this in practice. Open the following link in Incognito mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;https://www.crunchbase.com/v4/data/autocompletes?query=Ap&amp;amp;collection_ids=organizations&amp;amp;limit=25&amp;amp;source=topSearch&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When analyzing the traffic, we'll see the following pattern:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fsearch_protect-3b4a1a1934d54c12ac210217919b8b88.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fsearch_protect-3b4a1a1934d54c12ac210217919b8b88.webp" alt="Search Protect" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sequence of events here is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, the page is blocked with code &lt;code&gt;403&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Then the &lt;code&gt;challenges.cloudflare&lt;/code&gt; check is performed&lt;/li&gt;
&lt;li&gt;Only after successfully passing the check do we receive data with code &lt;code&gt;200&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Automating this process would require a &lt;code&gt;headless&lt;/code&gt; browser capable of bypassing &lt;a href="https://www.cloudflare.com/application-services/products/turnstile/" rel="noopener noreferrer"&gt;&lt;code&gt;Cloudflare Turnstile&lt;/code&gt;&lt;/a&gt;. The current version of &lt;code&gt;Crawlee for Python&lt;/code&gt; (v0.5.0) doesn't provide this functionality, although it's planned for future development.&lt;/p&gt;

&lt;p&gt;You can extend the capabilities of Crawlee for Python by integrating &lt;a href="https://camoufox.com/" rel="noopener noreferrer"&gt;&lt;code&gt;Camoufox&lt;/code&gt;&lt;/a&gt; following this &lt;a href="https://www.crawlee.dev/python/docs/examples/playwright-crawler-with-camoufox" rel="noopener noreferrer"&gt;example.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Working with the official Crunchbase API
&lt;/h2&gt;

&lt;p&gt;Crunchbase provides a &lt;a href="https://data.crunchbase.com/v4-legacy/docs/crunchbase-basic-using-api" rel="noopener noreferrer"&gt;free API&lt;/a&gt; with basic functionality. Paid subscription users get expanded data access. Complete documentation for available endpoints can be found in the &lt;a href="https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api" rel="noopener noreferrer"&gt;official API specification&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Setting up API access
&lt;/h3&gt;

&lt;p&gt;To start working with the API, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.crunchbase.com/register" rel="noopener noreferrer"&gt;Create a Crunchbase account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Go to the Integrations section&lt;/li&gt;
&lt;li&gt;Create a Crunchbase Basic API key&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Although the documentation states that key activation may take up to an hour, it usually starts working immediately after creation.&lt;/p&gt;
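
&lt;p&gt;To confirm the key is active, you can fire a single request at the autocompletes endpoint we'll use below. Here's a minimal sketch using &lt;code&gt;httpx&lt;/code&gt; (the library behind Crawlee's &lt;code&gt;HttpxHttpClient&lt;/code&gt;); a &lt;code&gt;200&lt;/code&gt; response with a small JSON payload means the key works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# check_key.py - a quick sanity check for the Crunchbase API key (sketch).
import os

import httpx

token = os.getenv('CRUNCHBASE_TOKEN', '')

response = httpx.get(
    'https://api.crunchbase.com/api/v4/autocompletes',
    params={'query': 'apify', 'collection_ids': 'organizations', 'limit': 1},
    headers={'X-cb-user-key': token},
)

print(response.status_code)  # 200 means the key is active
print(response.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
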

&lt;h3&gt;
  
  
  2. Configuring the crawler for API work
&lt;/h3&gt;

&lt;p&gt;An important API feature is the rate limit - no more than 200 requests per minute, and in the free version this number is significantly lower. Taking this into account, let's configure &lt;a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" rel="noopener noreferrer"&gt;&lt;code&gt;ConcurrencySettings&lt;/code&gt;&lt;/a&gt;. Since we're working with the official API, we don't need to mask our HTTP client. We'll use the standard &lt;a href="https://www.crawlee.dev/python/api/class/HttpxHttpClient" rel="noopener noreferrer"&gt;&lt;code&gt;HttpxHttpClient&lt;/code&gt;&lt;/a&gt; with preset headers.&lt;/p&gt;

&lt;p&gt;First, let's save the API key in an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CRUNCHBASE_TOKEN&lt;/span&gt;&lt;span class="o"&gt;={&lt;/span&gt;YOUR KEY&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's how the crawler configuration for working with the API looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpCrawler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.http_clients&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpxHttpClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HttpHeaders&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.routes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;router&lt;/span&gt;

&lt;span class="n"&gt;CRUNCHBASE_TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CRUNCHBASE_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The crawler entry point.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;concurrency_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tasks_per_minute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;http_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HttpxHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HttpHeaders&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accept-encoding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gzip, deflate, br, zstd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X-cb-user-key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CRUNCHBASE_TOKEN&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HttpCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;request_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;http_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;http_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.crunchbase.com/api/v4/autocompletes?query=apify&amp;amp;collection_ids=organizations&amp;amp;limit=25&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_data_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crunchbase_data.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Processing search results
&lt;/h3&gt;

&lt;p&gt;For working with the API, we'll need two main endpoints:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Autocomplete/get_autocompletes" rel="noopener noreferrer"&gt;get_autocompletes&lt;/a&gt; - for searching&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Entity/get_entities_organizations__entity_id_" rel="noopener noreferrer"&gt;get_entities_organizations__entity_id&lt;/a&gt; - for getting data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;First, let's implement search results processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.crawlers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpCrawler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.router&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt;


&lt;span class="nd"&gt;@router.default_handler&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Default request handler.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default_handler processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entities&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;permalink&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;identifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;permalink&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.crunchbase.com/api/v4/entities/organizations/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;permalink&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;?field_ids=short_description%2Clocation_identifiers%2Cwebsite_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Extracting company data
&lt;/h3&gt;

&lt;p&gt;After getting the list of companies, we extract detailed information about each one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;company_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Company request handler.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;company_handler processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Company Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;identifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Short Description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;short_description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Website&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;website_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_identifiers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])]),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Advanced location-based search
&lt;/h3&gt;

&lt;p&gt;If you need more flexible search capabilities, the API provides a special &lt;a href="https://app.swaggerhub.com/apis-docs/Crunchbase/crunchbase-enterprise_api/1.0.3#/Search/post_searches_organizations" rel="noopener noreferrer"&gt;&lt;code&gt;search&lt;/code&gt;&lt;/a&gt; endpoint. Here's an example of searching for all companies in Prague:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;field_ids&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;identifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_identifiers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;short_description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;website_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;field_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rank_org&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sort&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;asc&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;field_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_identifiers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;operator_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;includes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;predicate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;values&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;e0b951dc-f710-8754-ddde-5ef04dddd9f8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;field_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;facet_ids&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;operator_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;includes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;predicate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;values&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;serialiazed_payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.crunchbase.com/api/v4/searches/organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;serialiazed_payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;use_extended_unique_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HttpHeaders&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
            &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For processing search results and pagination, we use the following handler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search results handler with pagination support.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search_handler processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;last_entity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;entities&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;last_entity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uuid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Company Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;identifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Short Description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;short_description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Website&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;website_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;entity&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_identifiers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])]),&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last_entity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;after_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last_entity&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.crunchbase.com/api/v4/searches/organizations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;use_extended_unique_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HttpHeaders&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
                    &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Finally, free API limitations
&lt;/h3&gt;

&lt;p&gt;The free version of the API has significant limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited set of available endpoints&lt;/li&gt;
&lt;li&gt;The autocomplete function only works for company searches&lt;/li&gt;
&lt;li&gt;Not all data fields are accessible&lt;/li&gt;
&lt;li&gt;Limited search filtering capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider a paid subscription for production-level work. Even with these limitations, the API remains the most reliable way to access Crunchbase data.&lt;/p&gt;
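
&lt;p&gt;To illustrate what the free tier does allow, here is a minimal standalone sketch that resolves a company name to its UUID through the autocomplete endpoint. It is not part of the crawler above: the endpoint path and parameters come from the v4 API docs, &lt;code&gt;httpx&lt;/code&gt; is used only for this quick check, and the &lt;code&gt;CRUNCHBASE_API_KEY&lt;/code&gt; environment variable is an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os

import httpx

API_KEY = os.environ['CRUNCHBASE_API_KEY']  # assumed to be set in your environment


def find_company_uuid(name):
    """Resolve a company name to its UUID via the free autocomplete endpoint."""
    response = httpx.get(
        'https://api.crunchbase.com/api/v4/autocompletes',
        params={'query': name, 'collection_ids': 'organizations'},
        headers={'X-cb-user-key': API_KEY},
    )
    response.raise_for_status()
    entities = response.json().get('entities', [])
    # Each hit carries an identifier block with the UUID we can reuse in search payloads
    return entities[0]['identifier']['uuid'] if entities else None


print(find_company_uuid('OpenAI'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;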

&lt;h2&gt;
  
  
  What’s your best path forward?
&lt;/h2&gt;

&lt;p&gt;We've explored three different approaches to obtaining data from Crunchbase:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sitemap&lt;/strong&gt; - for large-scale data collection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search&lt;/strong&gt; - difficult to automate due to Cloudflare protection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Official API&lt;/strong&gt; - the most reliable solution for commercial projects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each method has its advantages, but for most projects, I recommend using the official API despite its limitations in the free version.&lt;/p&gt;

&lt;p&gt;The complete source code is available in my &lt;a href="https://github.com/Mantisus/crunchbase-crawlee" rel="noopener noreferrer"&gt;repository&lt;/a&gt;. Have questions or want to discuss implementation details? Join our &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; - our community of developers is there to help.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>developers</category>
    </item>
    <item>
      <title>How to scrape Google search results with Python</title>
      <dc:creator>Max Bohomolov</dc:creator>
      <pubDate>Mon, 02 Dec 2024 05:31:31 +0000</pubDate>
      <link>https://forem.com/crawlee/how-to-scrape-google-search-results-with-python-4ce2</link>
      <guid>https://forem.com/crawlee/how-to-scrape-google-search-results-with-python-4ce2</guid>
      <description>&lt;p&gt;Scraping &lt;code&gt;Google Search&lt;/code&gt; delivers essential &lt;code&gt;SERP analysis&lt;/code&gt;, SEO optimization, and data collection capabilities. Modern scraping tools make this process faster and more reliable.&lt;/p&gt;

&lt;p&gt;One of our community members wrote this blog as a contribution to the Crawlee Blog. If you would like to contribute blogs like these to Crawlee Blog, please reach out to us on our &lt;a href="https://apify.com/discord" rel="noopener noreferrer"&gt;discord channel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this guide, we'll create a Google Search scraper using &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;&lt;code&gt;Crawlee for Python&lt;/code&gt;&lt;/a&gt; that can handle result ranking and pagination.&lt;/p&gt;

&lt;p&gt;We'll create a scraper that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracts titles, URLs, and descriptions from search results&lt;/li&gt;
&lt;li&gt;Handles multiple search queries&lt;/li&gt;
&lt;li&gt;Tracks ranking positions&lt;/li&gt;
&lt;li&gt;Processes multiple result pages&lt;/li&gt;
&lt;li&gt;Saves data in a structured format&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.7 or higher&lt;/li&gt;
&lt;li&gt;Basic understanding of HTML and CSS selectors&lt;/li&gt;
&lt;li&gt;Familiarity with web scraping concepts&lt;/li&gt;
&lt;li&gt;Crawlee for Python v0.4.2 or higher&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project setup
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Install Crawlee with required dependencies:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipx &lt;span class="nb"&gt;install &lt;/span&gt;crawlee[beautifulsoup,curl-impersonate]
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create a new project using Crawlee CLI:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipx run crawlee create crawlee-google-search
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When prompted, select &lt;code&gt;Beautifulsoup&lt;/code&gt; as your template type.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Navigate to the project directory and complete installation:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;crawlee-google-search
poetry &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Development of the Google Search scraper in Python
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Defining data for extraction
&lt;/h3&gt;

&lt;p&gt;First, let's define our extraction scope. Google's search results now include maps, notable people, company details, videos, common questions, and many other elements. We'll focus on analyzing standard search results with rankings.&lt;/p&gt;

&lt;p&gt;Here's what we'll be extracting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fsearch_example-53f4fdf556178b9478a8d4f3e3816669.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fsearch_example-53f4fdf556178b9478a8d4f3e3816669.webp" alt="Search Example" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's verify whether we can extract the necessary data from the page's HTML code, or if we need deeper analysis or &lt;code&gt;JS&lt;/code&gt; rendering. Note that this check depends on the exact HTML markup returned for your request:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fcheck_html-e243b1a0eff6d4404b9034863969bedc.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fcheck_html-e243b1a0eff6d4404b9034863969bedc.webp" alt="Check Html" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on the data obtained from the page, all necessary information is present in the HTML code. Therefore, we can use &lt;a href="https://www.crawlee.dev/python/docs/examples/beautifulsoup-crawler" rel="noopener noreferrer"&gt;&lt;code&gt;beautifulsoup_crawler&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
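
&lt;p&gt;If you want to reproduce this check outside of browser developer tools, a quick standalone probe works as well. This is only a sketch: &lt;code&gt;curl_cffi&lt;/code&gt; is the library that &lt;code&gt;CurlImpersonateHttpClient&lt;/code&gt; builds on, and the query and the marker string are just examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from curl_cffi import requests  # browser-impersonating HTTP client

response = requests.get(
    'https://www.google.com/search?q=Apify',
    impersonate='chrome124',
    headers={'accept-language': 'en'},
)

# If a known result title appears in the raw HTML, the results are
# server-rendered and we don't need JS rendering to extract them.
print('Apify' in response.text, len(response.text))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;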

&lt;p&gt;The fields we'll extract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Search result titles&lt;/li&gt;
&lt;li&gt;URLs&lt;/li&gt;
&lt;li&gt;Description text&lt;/li&gt;
&lt;li&gt;Ranking positions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Configure the crawler
&lt;/h3&gt;

&lt;p&gt;First, let's create the crawler configuration.&lt;/p&gt;

&lt;p&gt;We'll use &lt;a href="https://www.crawlee.dev/python/api/class/CurlImpersonateHttpClient" rel="noopener noreferrer"&gt;&lt;code&gt;CurlImpersonateHttpClient&lt;/code&gt;&lt;/a&gt; as our &lt;code&gt;http_client&lt;/code&gt; with preset &lt;code&gt;headers&lt;/code&gt; and &lt;code&gt;impersonate&lt;/code&gt; relevant to the &lt;a href="https://www.google.com/intl/en/chrome/" rel="noopener noreferrer"&gt;&lt;code&gt;Chrome&lt;/code&gt;&lt;/a&gt; browser.&lt;/p&gt;

&lt;p&gt;We'll also configure &lt;a href="https://www.crawlee.dev/python/api/class/ConcurrencySettings" rel="noopener noreferrer"&gt;&lt;code&gt;ConcurrencySettings&lt;/code&gt;&lt;/a&gt; to control scraping aggressiveness. This is crucial to avoid getting blocked by Google.&lt;/p&gt;

&lt;p&gt;If you need to extract data more intensively, consider setting up &lt;a href="https://www.crawlee.dev/python/api/class/ProxyConfiguration" rel="noopener noreferrer"&gt;&lt;code&gt;ProxyConfiguration&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.beautifulsoup_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoupCrawler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.http_clients.curl_impersonate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CurlImpersonateHttpClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HttpHeaders&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;concurrency_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tasks_per_minute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;http_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CurlImpersonateHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chrome124&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HttpHeaders&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;referer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accept-language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accept-encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gzip, deflate, br, zstd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                                            &lt;span class="p"&gt;}))&lt;/span&gt;

    &lt;span class="n"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoupCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_request_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;http_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;http_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_crawl_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/search?q=Apify&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
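
&lt;p&gt;If you do decide to route traffic through proxies, the change is small. Here is a minimal sketch, assuming you already have proxy URLs from your provider (the one below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Placeholder credentials - swap in your provider's endpoints
proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://username:password@proxy.example.com:8000'],
)

crawler = BeautifulSoupCrawler(
    proxy_configuration=proxy_configuration,
    # ...the rest of the configuration stays the same as above
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;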



&lt;h3&gt;
  
  
  3. Implementing data extraction
&lt;/h3&gt;

&lt;p&gt;First, let's analyze the HTML code of the elements we need to extract:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fhtml_example-ccefa4ed63c38812ac5b8ca7b5122c8c.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fhtml_example-ccefa4ed63c38812ac5b8ca7b5122c8c.webp" alt="Check Html" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's an obvious distinction between &lt;em&gt;readable&lt;/em&gt; ID attributes and &lt;em&gt;generated&lt;/em&gt; class names and other attributes. When creating selectors for data extraction, ignore any generated attributes. Even if you've read that Google has kept a particular generated class for N years, you shouldn't rely on it - avoiding such attributes is what makes your extraction code robust.&lt;/p&gt;
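
&lt;p&gt;To make that concrete, here is a hypothetical side-by-side. The generated class names below are placeholders for whatever auto-generated classes you might see in the markup, while the structural selector is the one we'll actually use; the saved file name is an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from bs4 import BeautifulSoup

# Any saved SERP page works for experimenting with selectors
soup = BeautifulSoup(open('google_search.html').read(), 'html.parser')

# Fragile: generated class names (placeholders here) rotate without notice
fragile = soup.select('div.AbC123 div.XyZ789')

# Robust: readable, structural attributes that identify organic results
robust = soup.select('div#search div#rso div[data-hveid][lang]')

print(len(fragile), len(robust))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;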

&lt;p&gt;Now that we understand the HTML structure, let's implement the extraction. As our crawler deals with only one type of page, we can use &lt;code&gt;router.default_handler&lt;/code&gt; for processing it. Within the handler, we'll use &lt;code&gt;BeautifulSoup&lt;/code&gt; to iterate through each search result, extracting data such as &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;url&lt;/code&gt;, and &lt;code&gt;text_widget&lt;/code&gt; while saving the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@crawler.router.default_handler&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoupCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Default request handler.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div#search div#rso div[data-hveid][lang]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_widget&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div[style*=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;line&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Handling pagination
&lt;/h3&gt;

&lt;p&gt;Since Google results depend on the IP geolocation of the search request, we can't rely on link text for pagination. We need to create a more sophisticated CSS selector that works regardless of geolocation and language settings.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;max_crawl_depth&lt;/code&gt; parameter controls how many pages our crawler should scan. Once we have our robust selector, we simply need to get the next page link and add it to the crawler's queue.&lt;/p&gt;

&lt;p&gt;To write more efficient selectors, learn the basics of &lt;a href="https://www.w3schools.com/cssref/css_selectors.php" rel="noopener noreferrer"&gt;CSS&lt;/a&gt; and &lt;a href="https://www.w3schools.com/xml/xpath_syntax.asp" rel="noopener noreferrer"&gt;XPath&lt;/a&gt; syntax.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div[role=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;navigation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;] td[role=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;heading&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]:last-of-type &amp;gt; a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Exporting data to CSV format
&lt;/h3&gt;

&lt;p&gt;Since we want to save all search result data in a convenient tabular format like CSV, we can simply add the &lt;code&gt;export_data_csv&lt;/code&gt; method call right after running the crawler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_data_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google_search.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Finalizing the Google Search scraper
&lt;/h3&gt;

&lt;p&gt;While our core crawler logic works, you might have noticed that our results currently lack ranking position information. To complete our scraper, we need to implement proper ranking position tracking by passing data between requests using &lt;code&gt;user_data&lt;/code&gt; in &lt;a href="https://www.crawlee.dev/python/api/class/Request" rel="noopener noreferrer"&gt;&lt;code&gt;Request&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's modify the script to handle multiple queries and track ranking positions for search results analysis. We'll also set the crawling depth as a top-level variable. Let's move the &lt;code&gt;router.default_handler&lt;/code&gt; to &lt;code&gt;routes.py&lt;/code&gt; to match the project structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# crawlee-google-search.main
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.beautifulsoup_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoupCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoupCrawlingContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.http_clients.curl_impersonate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CurlImpersonateHttpClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HttpHeaders&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.routes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;router&lt;/span&gt;

&lt;span class="n"&gt;QUERIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Apify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Crawlee&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;CRAWL_DEPTH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The crawler entry point.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;concurrency_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConcurrencySettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tasks_per_minute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;http_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CurlImpersonateHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chrome124&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HttpHeaders&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;referer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accept-language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accept-encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gzip, deflate, br, zstd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                                            &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="n"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoupCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;request_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_request_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;concurrency_settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;http_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;http_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_crawl_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CRAWL_DEPTH&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;requests_lists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com/search?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;QUERIES&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests_lists&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_data_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google_ranked.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's also modify the handler to add &lt;code&gt;query&lt;/code&gt; and &lt;code&gt;order_no&lt;/code&gt; fields and basic error handling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# crawlee-google-search.routes
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.beautifulsoup_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoupCrawlingContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.router&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BeautifulSoupCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt;


&lt;span class="nd"&gt;@router.default_handler&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoupCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Default request handler.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div#search div#rso div[data-hveid][lang]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_no&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_widget&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div[style*=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;line&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;AttributeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Attribute error for query &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unexpected error for query &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div[role=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;navigation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;] td[role=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;heading&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]:last-of-type &amp;gt; a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we're done!&lt;/p&gt;

&lt;p&gt;Our Google Search crawler is ready. Let's look at the results in the &lt;code&gt;google_ranked.csv&lt;/code&gt; file:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fresults-03c51354b4347837a24ec6977a442ce8.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fresults-03c51354b4347837a24ec6977a442ce8.webp" alt="Results CSV" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code repository is available on &lt;a href="https://github.com/Mantisus/crawlee-google-search" rel="noopener noreferrer"&gt;&lt;code&gt;GitHub&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scrape Google Search results with Apify
&lt;/h2&gt;

&lt;p&gt;If you're working on a large-scale project requiring millions of data points, like the project featured in this &lt;a href="https://backlinko.com/search-engine-ranking" rel="noopener noreferrer"&gt;article about Google ranking analysis&lt;/a&gt; - you might need a ready-made solution.&lt;/p&gt;

&lt;p&gt;Consider using &lt;a href="https://www.apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;&lt;code&gt;Google Search Results Scraper&lt;/code&gt;&lt;/a&gt; by the Apify team.&lt;/p&gt;

&lt;p&gt;It offers important features such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proxy support&lt;/li&gt;
&lt;li&gt;Scalability for large-scale data extraction&lt;/li&gt;
&lt;li&gt;Geolocation control&lt;/li&gt;
&lt;li&gt;Integration with external services like &lt;a href="https://zapier.com/" rel="noopener noreferrer"&gt;&lt;code&gt;Zapier&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://www.make.com/" rel="noopener noreferrer"&gt;&lt;code&gt;Make&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://airbyte.com/" rel="noopener noreferrer"&gt;&lt;code&gt;Airbyte&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;&lt;code&gt;LangChain&lt;/code&gt;&lt;/a&gt; and others&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can learn more in the Apify &lt;a href="https://blog.apify.com/unofficial-google-search-api-from-apify-22a20537a951/" rel="noopener noreferrer"&gt;blog&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What will you scrape?
&lt;/h2&gt;

&lt;p&gt;In this blog, we've explored step-by-step how to create a Google Search crawler that collects ranking data. How you analyze this dataset is up to you!&lt;/p&gt;

&lt;p&gt;As a reminder, you can find the full project code on &lt;a href="https://github.com/Mantisus/crawlee-google-search" rel="noopener noreferrer"&gt;&lt;code&gt;GitHub&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'd like to think that in 5 years I'll need to write an article on "How to extract data from the best search engine for LLMs", but I suspect that in 5 years this article will still be relevant.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>developers</category>
    </item>
    <item>
      <title>12 tips on how to think like a web scraping expert</title>
      <dc:creator>Max Bohomolov</dc:creator>
      <pubDate>Mon, 04 Nov 2024 10:15:24 +0000</pubDate>
      <link>https://forem.com/crawlee/12-tips-on-how-to-think-like-a-web-scraping-expert-3cjc</link>
      <guid>https://forem.com/crawlee/12-tips-on-how-to-think-like-a-web-scraping-expert-3cjc</guid>
      <description>&lt;p&gt;Typically, tutorials focus on the technical aspects, on what you can replicate: "Start here, follow this path, and you'll end up here." This is great for learning a particular technology, but it's sometimes difficult to understand why the author decided to do things a certain way or what guides their development process.&lt;/p&gt;

&lt;p&gt;One of our community members wrote this blog as a contribution to the Crawlee Blog. If you'd like to contribute articles like this, please reach out to us on our &lt;a href="https://apify.com/discord" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer" class="c-link"&gt;
          Crawlee &amp;amp; Apify
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          This is the official developer community of Apify and Crawlee. | 8987 members
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdiscord.com%2Fassets%2Ffavicon.ico" width="800" height="400"&gt;
        discord.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;In this blog, I'll discuss the general rules and principles that guide me when I work on web scraping projects and allow me to achieve great results.&lt;/p&gt;

&lt;p&gt;So, let's explore the mindset of a web scraping developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Choosing a data source for the project
&lt;/h2&gt;

&lt;p&gt;When you start working on a project, you likely have a target site from which you need to extract specific data. Check what possibilities this site or application provides for data extraction. Here are some possible options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Official API&lt;/code&gt; - the site may provide a free official API through which you can get all the necessary data. This is the best option for you. For example, you can consider this approach if you need to extract data from &lt;a href="https://docs.developer.yelp.com/docs/fusion-intro" rel="noopener noreferrer"&gt;&lt;code&gt;Yelp&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Website&lt;/code&gt; - in this case, we study the website, its structure, as well as the ways the frontend and backend interact&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Mobile Application&lt;/code&gt; - in some cases, there's no website or API at all, or the mobile application provides more data, in which case, don't forget about the &lt;a href="https://blog.apify.com/using-a-man-in-the-middle-proxy-to-scrape-data-from-a-mobile-app-api-e954915f979d/" rel="noopener noreferrer"&gt;&lt;code&gt;man-in-the-middle&lt;/code&gt;&lt;/a&gt; approach&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If one data source fails, try accessing another available source.&lt;/p&gt;

&lt;p&gt;For example, for &lt;code&gt;Yelp&lt;/code&gt;, all three options are available, and if the &lt;code&gt;Official API&lt;/code&gt; doesn't suit you for some reason, you can try the other two.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Check &lt;a href="https://developers.google.com/search/docs/crawling-indexing/robots/intro" rel="noopener noreferrer"&gt;&lt;code&gt;robots.txt&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap" rel="noopener noreferrer"&gt;&lt;code&gt;sitemap&lt;/code&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;I think everyone knows about &lt;code&gt;robots.txt&lt;/code&gt; and &lt;code&gt;sitemap&lt;/code&gt; one way or another, but I regularly see people simply forgetting about them. If you're hearing about these for the first time, here's a quick explanation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;robots&lt;/code&gt; is the established name for crawlers in SEO. Usually, this refers to crawlers of major search engines like Google and Bing, or services like Ahrefs and ChatGPT.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;robots.txt&lt;/code&gt; is a file describing the allowed behavior for robots. It includes permitted crawler user-agents, wait time between page scans, patterns of pages forbidden for scanning, and more. These rules are typically based on which pages should be indexed by search engines and which should not.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sitemap&lt;/code&gt; describes the site structure to make it easier for robots to navigate. It also helps in scanning only the content that needs updating, without creating unnecessary load on the site&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since you're not &lt;a href="http://google.com/" rel="noopener noreferrer"&gt;&lt;code&gt;Google&lt;/code&gt;&lt;/a&gt; or any other popular search engine, the robot rules in &lt;code&gt;robots.txt&lt;/code&gt; will likely be against you. But combined with the &lt;code&gt;sitemap&lt;/code&gt;, this is a good place to study the site structure, expected interaction with robots, and non-browser user-agents. In some situations, it simplifies data extraction from the site.&lt;/p&gt;

&lt;p&gt;For example, using the &lt;a href="https://www.crawlee.dev/sitemap.xml" rel="noopener noreferrer"&gt;&lt;code&gt;sitemap&lt;/code&gt;&lt;/a&gt; for &lt;a href="http://www.crawlee.dev/" rel="noopener noreferrer"&gt;Crawlee website&lt;/a&gt;, you can easily get direct links to posts both for the entire lifespan of the blog and for a specific period. One simple check, and you don't need to implement pagination logic.&lt;/p&gt;
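
&lt;p&gt;To illustrate, here's a minimal sketch using only the Python standard library that downloads the sitemap and pulls out the blog post links. The &lt;code&gt;/blog/&lt;/code&gt; filter is just an assumption for this example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# sitemap_links.py - minimal sketch: fetch a sitemap and list blog post URLs
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.crawlee.dev/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as response:
    root = ET.fromstring(response.read())

# every url/loc entry in the sitemap
all_links = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

# keep only blog posts - the "/blog/" pattern is an assumption for this sketch
blog_links = [link for link in all_links if "/blog/" in link]
print(len(blog_links), "blog posts found")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;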

&lt;h2&gt;
  
  
  3. Don't neglect site analysis
&lt;/h2&gt;

&lt;p&gt;Thorough site analysis is an important prerequisite for creating an effective web scraper, especially if you're not planning to use browser automation. However, such analysis takes time, sometimes a lot of it. &lt;/p&gt;

&lt;p&gt;It's also worth noting that the time spent on analysis and searching for a more optimal crawling solution doesn't always pay off - you might spend hours only to discover that the most obvious approach was the best all along.&lt;/p&gt;

&lt;p&gt;Therefore, it's wise to set limits on your initial site analysis. If you don't see a better path within the allocated time, revert to simpler approaches. As you gain more experience, you'll more often be able to tell early on, based on the technologies used on the site, whether it's worth dedicating more time to analysis or not.&lt;/p&gt;

&lt;p&gt;Also, in projects where you need to extract data from a site just once, thorough site analysis can sometimes eliminate the need to write scraper code altogether. Here's an example of such a site - &lt;code&gt;https://ricebyrice.com/nl/pages/find-store&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fricebyrice_base-433dcb67f3debf8855b0043fb87a63c3.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fricebyrice_base-433dcb67f3debf8855b0043fb87a63c3.webp" alt="Ricebyrice" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By analyzing it, you'll easily discover that all the data can be obtained with a single request. You simply need to copy this data from your browser into a JSON file, and your task is complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fricebyrice_response-77221911846c701f7abd865673867d60.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fricebyrice_response-77221911846c701f7abd865673867d60.webp" alt="Ricebyrice Response" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Maximum interactivity
&lt;/h2&gt;

&lt;p&gt;When analyzing a site, switch sorts, pages, interact with various elements of the site, while watching the &lt;code&gt;Network&lt;/code&gt; tab in your browser's &lt;a href="https://developer.chrome.com/docs/devtools" rel="noopener noreferrer"&gt;Dev Tools&lt;/a&gt;. This will allow you to better understand how the site interacts with the backend, what framework it's built on, and what behavior can be expected from it.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Data doesn't appear out of thin air
&lt;/h2&gt;

&lt;p&gt;This is obvious, but it's important to keep in mind while working on a project. If you see some data or request parameters, it means they were obtained somewhere earlier: possibly in another request, possibly they were already present on the page, or possibly they were formed using JS from other parameters. But they always come from somewhere.&lt;/p&gt;

&lt;p&gt;If you don't understand where the data on the page comes from, or the data used in a request, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sequentially, check all requests the site made before this point.&lt;/li&gt;
&lt;li&gt;Examine their responses, headers, and cookies.&lt;/li&gt;
&lt;li&gt;Use your intuition: Could this parameter be a timestamp? Could it be another parameter in a modified form?&lt;/li&gt;
&lt;li&gt;Does it resemble any standard hashes or encodings?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Practice makes perfect here. As you become familiar with different technologies, various frameworks, and their expected behaviors, and as you encounter a wide range of technologies, you'll find it easier to understand how things work and how data is transferred. This accumulated knowledge will significantly improve your ability to trace and understand data flow in web applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Data is cached
&lt;/h2&gt;

&lt;p&gt;You may notice that when opening the same page several times, the requests transmitted to the server differ: possibly something was cached and is already stored on your computer. Therefore, it's recommended to analyze the site in incognito mode, as well as switch browsers.&lt;/p&gt;

&lt;p&gt;This situation is especially relevant for mobile applications, which may store some data in storage on the device. Therefore, when analyzing mobile applications, you may need to clear the cache and storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Learn more about the framework
&lt;/h2&gt;

&lt;p&gt;If during the analysis you discover that the site uses a framework you haven't encountered before, take some time to learn about it and its features. For example, if you notice a site is built with Next.js, understanding how it handles routing and data fetching could be crucial for your scraping strategy.&lt;/p&gt;

&lt;p&gt;You can learn about these frameworks through official documentation or by using LLMs like &lt;a href="https://openai.com/chatgpt/" rel="noopener noreferrer"&gt;&lt;code&gt;ChatGPT&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://claude.ai/" rel="noopener noreferrer"&gt;&lt;code&gt;Claude&lt;/code&gt;&lt;/a&gt;. These AI assistants are excellent at explaining framework-specific concepts. Here's an example of how you might query an LLM about Next.js:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I am in the process of optimizing my website using Next.js. Are there any files passed to the browser that describe all internal routing and how links are formed?

Restrictions:
- Accompany your answers with code samples
- Use this message as the main message for all subsequent responses
- Reference only those elements that are available on the client side, without access to the project code base

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can create similar queries for backend frameworks as well. For instance, with GraphQL, you might ask about available fields and query structures. These insights can help you understand how to better interact with the site's API and what data is potentially available.&lt;/p&gt;

&lt;p&gt;To work effectively with LLMs, I recommend at least learning the basics of &lt;a href="https://parlance-labs.com/education/prompt_eng/berryman.html" rel="noopener noreferrer"&gt;&lt;code&gt;prompt engineering&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Reverse engineering
&lt;/h2&gt;

&lt;p&gt;Web scraping goes hand in hand with reverse engineering. You study the interactions of the frontend and backend, you may need to study the code to better understand how certain parameters are formed.&lt;/p&gt;

&lt;p&gt;But in some cases, reverse engineering requires considerably more knowledge, effort, and time, or the task simply has a high degree of complexity. At that point, you need to decide whether to dig into it or whether it's better to change the data source or the technology you use. Most likely, this will be the moment when you decide to abandon HTTP web scraping and switch to a headless browser.&lt;/p&gt;

&lt;p&gt;The main principle of most web scraping protections is not to make web scraping impossible, but to make it expensive.&lt;/p&gt;

&lt;p&gt;Let's just look at what the response to a search on &lt;a href="https://www.zoopla.co.uk/" rel="noopener noreferrer"&gt;&lt;code&gt;zoopla&lt;/code&gt;&lt;/a&gt; looks like&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fzoopla_response-c6997e953965244f6293d44d2562f2dd.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fzoopla_response-c6997e953965244f6293d44d2562f2dd.webp" alt="Zoopla Search Response" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Testing requests to endpoints
&lt;/h2&gt;

&lt;p&gt;After identifying the endpoints you need to extract the target data, make sure you get a correct response when making a request. If you get a response from the server other than 200, or data different from expected, then you need to figure out why. Here are some possible reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to pass some parameters, for example cookies or specific technical headers&lt;/li&gt;
&lt;li&gt;The site requires a matching &lt;code&gt;Referer&lt;/code&gt; header when accessing this endpoint&lt;/li&gt;
&lt;li&gt;The site expects the headers to follow a certain order. I've only encountered this a couple of times, but it does happen&lt;/li&gt;
&lt;li&gt;The site uses protection against web scraping, for example with &lt;code&gt;TLS fingerprint&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And many other possible reasons, each of which requires separate analysis.&lt;/p&gt;
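
&lt;p&gt;When debugging this, I find it helpful to reproduce the request in a few lines of code and add the suspected parameters one at a time. A minimal sketch - the URL, header, and cookie values below are placeholders, not a real API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# endpoint_check.py - minimal sketch for testing a candidate endpoint
# The URL, headers, and cookie values are placeholders, not a real API.
import requests

url = "https://example.com/api/search"

headers = {
    "Referer": "https://example.com/search",  # some endpoints check the Referer
    "X-Requested-With": "XMLHttpRequest",     # a common technical header for XHR endpoints
}
cookies = {"session_id": "placeholder-value"}

response = requests.get(url, headers=headers, cookies=cookies)

# Anything other than 200, or an unexpected body, means one of the checks above failed
print(response.status_code)
print(response.text[:500])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;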

&lt;h2&gt;
  
  
  10. Experiment with request parameters
&lt;/h2&gt;

&lt;p&gt;Explore what results you get when changing request parameters, if any. Some parameters may be missing but supported on the server side. For example, &lt;code&gt;order&lt;/code&gt;, &lt;code&gt;sort&lt;/code&gt;, &lt;code&gt;per_page&lt;/code&gt;, &lt;code&gt;limit&lt;/code&gt;, and others. Try adding them and see if the behavior changes.&lt;/p&gt;

&lt;p&gt;This is especially relevant for sites using &lt;a href="https://graphql.org/" rel="noopener noreferrer"&gt;&lt;code&gt;graphql&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's consider this &lt;a href="https://restoran.ua/en/posts?subsection=0" rel="noopener noreferrer"&gt;&lt;code&gt;example&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you analyze the site, you'll see a request that can be reproduced with the following code, I've formatted it a bit to improve readability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://restoran.ua/graphql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operationName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Posts_PostsForView&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;variables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sortBy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startAt_DESC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;query Posts_PostsForView(
    $where: PostForViewWhereInput,
    $sort: PostForViewSortInput,
    $pagination: PaginationInput,
    $search: String,
    $token: String,
    $coordinates_slice: SliceInput)
    {
        PostsForView(
                where: $where
                sort: $sort
                pagination: $pagination
                search: $search
                token: $token
                ) {
                        id
                        title: ukTitle
                        summary: ukSummary
                        slug
                        startAt
                        endAt
                        newsFeed
                        events
                        journal
                        toProfessionals
                        photoHeader {
                            address: mobile
                            __typename
                            }
                        coordinates(slice: $coordinates_slice) {
                            lng
                            lat
                            __typename
                            }
                        __typename
                    }
    }&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I'll update it to get results in 2 languages at once, and most importantly, along with the internal text of the publications:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://restoran.ua/graphql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operationName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Posts_PostsForView&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;variables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sortBy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startAt_DESC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;query Posts_PostsForView(
    $where: PostForViewWhereInput,
    $sort: PostForViewSortInput,
    $pagination: PaginationInput,
    $search: String,
    $token: String,
    $coordinates_slice: SliceInput)
    {
        PostsForView(
                where: $where
                sort: $sort
                pagination: $pagination
                search: $search
                token: $token
                ) {  
                        id
                        # highlight-start
                        uk_title: ukTitle
                        en_title: enTitle
                        # highlight-end
                        summary: ukSummary
                        slug
                        startAt
                        endAt
                        newsFeed
                        events
                        journal
                        toProfessionals
                        photoHeader {
                            address: mobile
                            __typename
                            }
                        # highlight-start
                        mixedBlocks {
                            index
                            en_text: enText
                            uk_text: ukText
                            __typename
                            }
                        # highlight-end
                        coordinates(slice: $coordinates_slice) {
                            lng
                            lat
                            __typename
                            }
                        __typename
                    }
    }&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, a small update of the request parameters allows me not to worry about visiting the internal page of each publication. You have no idea how many times this trick has saved me.&lt;/p&gt;

&lt;p&gt;If you see &lt;code&gt;graphql&lt;/code&gt; in front of you and don't know where to start, then my advice about documentation and LLM works here too.&lt;/p&gt;
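
&lt;p&gt;For example, unless introspection is disabled on the server, you can ask a GraphQL endpoint to describe its own schema. A minimal sketch, reusing the endpoint from the example above (many production servers turn introspection off, so this may return an error):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# graphql_introspection.py - minimal sketch: ask a GraphQL endpoint to describe its schema
import requests

url = "https://restoran.ua/graphql"

# a standard introspection query listing every type and its fields
introspection_query = """
{
  __schema {
    types {
      name
      fields {
        name
      }
    }
  }
}
"""

response = requests.post(url, json={"query": introspection_query})
print(response.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;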

&lt;h2&gt;
  
  
  11. Don't be afraid of new technologies
&lt;/h2&gt;

&lt;p&gt;I know how easy it is to master a few tools and just use them because it works. I've fallen into this trap more than once myself.&lt;/p&gt;

&lt;p&gt;But modern sites use modern technologies that have a significant impact on web scraping, and in response, new tools for web scraping are emerging. Learning these may greatly simplify your next project, and may even solve some problems that were insurmountable for you. I wrote about some tools &lt;a href="https://www.crawlee.dev/blog/common-problems-in-web-scraping" rel="noopener noreferrer"&gt;&lt;code&gt;earlier&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I especially recommend paying attention to &lt;a href="https://curl-cffi.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;curl_cffi&lt;/code&gt;&lt;/a&gt; and frameworks&lt;br&gt;
&lt;a href="https://www.omkar.cloud/botasaurus/" rel="noopener noreferrer"&gt;&lt;code&gt;botasaurus&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://www.crawlee.dev/python/" rel="noopener noreferrer"&gt;&lt;code&gt;Crawlee for Python&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. Help open-source libraries
&lt;/h2&gt;

&lt;p&gt;Personally, I only recently came to realize the importance of this. All the tools I use for my work are either open-source projects or built on top of open-source. Web scraping literally lives thanks to open-source. This is especially noticeable if you're a &lt;code&gt;Python&lt;/code&gt; developer: pure &lt;code&gt;Python&lt;/code&gt; is in a pretty sad state when you need to deal with &lt;code&gt;TLS fingerprint&lt;/code&gt;, and once again, open-source saved us here.&lt;/p&gt;

&lt;p&gt;And it seems to me that the least we could do is invest a little of our knowledge and skills in supporting open-source.&lt;/p&gt;

&lt;p&gt;I chose to support &lt;a href="https://www.crawlee.dev/python/" rel="noopener noreferrer"&gt;&lt;code&gt;Crawlee for Python&lt;/code&gt;&lt;/a&gt;, and no, not because they allowed me to write in their blog, but because it shows excellent development dynamics and is aimed at making life easier for web crawler developers. It speeds up crawler development by taking care of critical aspects under the hood: session management, session rotation when blocked, managing the concurrency of asynchronous tasks (if you write asynchronous code, you know what a pain this can be), and much more.&lt;/p&gt;

&lt;p&gt;Tip: If you like the blog so far, please consider &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;giving Crawlee a star on GitHub&lt;/a&gt;; it helps us to reach and help more developers.&lt;/p&gt;

&lt;p&gt;And what choice will you make?&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I think some of the points in this article were obvious to you, and some you already follow yourself, but I hope you learned something new as well. If most of them were new to you, try using these rules as a checklist in your next project.&lt;/p&gt;

&lt;p&gt;I would be happy to discuss the article. Feel free to comment here on the article, or contact me in the &lt;a href="https://apify.com/discord" rel="noopener noreferrer"&gt;Crawlee developer community&lt;/a&gt; on Discord.&lt;/p&gt;

&lt;p&gt;You can also find me on the following platforms: &lt;a href="https://github.com/Mantisus" rel="noopener noreferrer"&gt;Github&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/max-bohomolov/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt;, &lt;a href="https://apify.com/mantisus" rel="noopener noreferrer"&gt;Apify&lt;/a&gt;, &lt;a href="https://www.upwork.com/freelancers/mantisus" rel="noopener noreferrer"&gt;Upwork&lt;/a&gt;, &lt;a href="https://contra.com/mantisus" rel="noopener noreferrer"&gt;Contra&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thank you for your attention :)&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>developers</category>
    </item>
    <item>
      <title>Web scraping of a dynamic website using Python with HTTP Client</title>
      <dc:creator>Max Bohomolov</dc:creator>
      <pubDate>Sat, 28 Sep 2024 11:30:02 +0000</pubDate>
      <link>https://forem.com/crawlee/web-scraping-of-a-dynamic-website-using-python-with-http-client-j9k</link>
      <guid>https://forem.com/crawlee/web-scraping-of-a-dynamic-website-using-python-with-http-client-j9k</guid>
      <description>&lt;p&gt;Dynamic websites that use JavaScript for content rendering and backend interaction often create challenges for web scraping. The traditional approach to solving this problem is browser emulation, but it's not very efficient in terms of resource consumption.&lt;/p&gt;

&lt;p&gt;In this article, we'll explore an alternative method based on in-depth site analysis and the use of an HTTP client. We'll go through the entire process from analyzing a dynamic website to implementing an efficient web crawler using the &lt;a href="https://www.crawlee.dev/python/" rel="noopener noreferrer"&gt;&lt;code&gt;Crawlee for Python&lt;/code&gt;&lt;/a&gt; framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you'll learn in this tutorial
&lt;/h2&gt;

&lt;p&gt;Our subject of study is the  &lt;a href="https://www.accommodationforstudents.com" rel="noopener noreferrer"&gt;Accommodation for Students&lt;/a&gt;  website. Using this example, we'll examine the specifics of analyzing sites built with the Next.js framework and implement a crawler capable of efficiently extracting data without using browser emulation.&lt;/p&gt;

&lt;p&gt;By the end of this article, you will have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A clear understanding of how to analyze sites with dynamic content rendered using JavaScript.&lt;/li&gt;
&lt;li&gt;Knowledge of how to implement a crawler based on Crawlee for Python.&lt;/li&gt;
&lt;li&gt;Insight into some of the details of working with sites that use &lt;a href="https://nextjs.org/" rel="noopener noreferrer"&gt;&lt;code&gt;Next.js&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;A link to a GitHub repository with the full crawler implementation code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Website analysis
&lt;/h2&gt;

&lt;p&gt;To track all requests, open your Dev Tools on the &lt;code&gt;network&lt;/code&gt; tab before entering the site. Some data may be transmitted only when the site is first opened.&lt;/p&gt;

&lt;p&gt;As the site is intended for students in the UK, let's go to London. We'll start the analysis from the &lt;a href="https://www.accommodationforstudents.com/search-results?location=London&amp;amp;beds=0&amp;amp;occupancy=min&amp;amp;minPrice=0&amp;amp;maxPrice=500&amp;amp;latitude=51.509865&amp;amp;longitude=-0.118092&amp;amp;geo=false&amp;amp;page=1" rel="noopener noreferrer"&gt;search page&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interacting with elements on the site page, you'll quickly notice a request of this type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.accommodationforstudents.com/search?limit=22&amp;amp;skip=0&amp;amp;random=false&amp;amp;mode=text&amp;amp;numberOfBedrooms=0&amp;amp;occupancy=min&amp;amp;countryCode=gb&amp;amp;location=London&amp;amp;sortBy=price&amp;amp;order=asc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Frequest-185e9cf4845c0b0f07c004d155563ea7.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Frequest-185e9cf4845c0b0f07c004d155563ea7.webp" alt="Request type"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we look at the format of the received response, we'll immediately notice that it comes in &lt;a href="https://www.json.org/json-en.html" rel="noopener noreferrer"&gt;&lt;code&gt;JSON&lt;/code&gt;&lt;/a&gt; format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fjson-a85571ceba8b80c314af9a159db15511.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fjson-a85571ceba8b80c314af9a159db15511.webp" alt="JSON reposonse"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great, we're getting data in a structured format that's very convenient to work with. We see the total number of results, and the links to listings are in the &lt;code&gt;url&lt;/code&gt; attribute of each &lt;code&gt;properties&lt;/code&gt; element.&lt;/p&gt;

&lt;p&gt;Let's also take a look at the server response headers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjlsl9a51m8qcg5o3w8c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjlsl9a51m8qcg5o3w8c.png" alt="server response"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;content-type: application/json; charset=utf-8&lt;/code&gt; - It tells us that the server response comes in JSON format, which we've already confirmed visually.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;content-encoding: gzip&lt;/code&gt; - It tells us that the response was compressed using &lt;a href="https://www.gnu.org/software/gzip/" rel="noopener noreferrer"&gt;&lt;code&gt;gzip&lt;/code&gt;&lt;/a&gt;, and therefore we should use appropriate decompression in our crawler.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;server: cloudflare&lt;/code&gt; - The site is hosted on &lt;a href="https://www.cloudflare.com/" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt; servers and uses their protection. We should consider this when creating our crawler.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Great, let's also look at the parameters used in the search API request and make hypotheses about what they're responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;limit: 22&lt;/code&gt; - The number of elements we get per request.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;skip: 0&lt;/code&gt; - The element from which we'll start getting results; this is important for pagination.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;random: false&lt;/code&gt; - We don't change the random sorting as we benefit from strict sorting.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mode: text&lt;/code&gt; - An unusual parameter. If you run a few experiments, you'll find that it can take the following values: text, fallback, geo. Interestingly, the geo value completely changes the output, returning about 5400 options. I assume it's meant for searching by coordinates, and if we don't pass any coordinates, we get all the available results.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;numberOfBedrooms: 0&lt;/code&gt; - filter by bedrooms.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;occupancy: min&lt;/code&gt; - filter by occupancy.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;countryCode: gb&lt;/code&gt; - country code, in our case it's Great Britain&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;location: London&lt;/code&gt; - search location&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sortBy: price&lt;/code&gt; - the field by which sorting is performed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;order: asc&lt;/code&gt; - type of sorting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there's another important point to pay attention to. Let's look at our link in the browser bar, which looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.accommodationforstudents.com/search-results?location=London&amp;amp;beds=0&amp;amp;occupancy=min&amp;amp;minPrice=0&amp;amp;maxPrice=500&amp;amp;latitude=51.509865&amp;amp;longitude=-0.118092&amp;amp;geo=false&amp;amp;page=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In it, we see the coordinate parameters &lt;code&gt;latitude&lt;/code&gt; and &lt;code&gt;longitude&lt;/code&gt;, which don't participate in any way when interacting with the backend, and the &lt;code&gt;geo&lt;/code&gt; parameter with a false value. This also confirms our hypothesis regarding the mode parameter. This is quite useful if you want to extract all data from the site.&lt;/p&gt;

&lt;p&gt;Great. We can get the site's search data in a convenient JSON format, and we have flexible parameters that let us extract either everything available on the site or only the data for a specific city.&lt;/p&gt;
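
&lt;p&gt;To double-check these hypotheses, we can reproduce the search request directly. A minimal sketch using &lt;code&gt;curl_cffi&lt;/code&gt; (because of the Cloudflare protection we noticed in the headers) - the parameter values simply mirror what the browser sent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# search_check.py - minimal sketch: reproduce the search request seen in Dev Tools
from curl_cffi import requests

url = "https://www.accommodationforstudents.com/search"

# the same query parameters the browser sent
params = {
    "limit": "22",
    "skip": "0",
    "random": "false",
    "mode": "text",
    "numberOfBedrooms": "0",
    "occupancy": "min",
    "countryCode": "gb",
    "location": "London",
    "sortBy": "price",
    "order": "asc",
}

# impersonate a real browser TLS fingerprint to get past Cloudflare
response = requests.get(url, params=params, impersonate="chrome124")
data = response.json()

# "properties" is the field holding the listings in the response we saw
print(len(data.get("properties", [])), "listings in this page of results")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;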

&lt;p&gt;Let's move on to analyzing the property page.&lt;/p&gt;

&lt;p&gt;Since clicking on a listing opens it in a new window, make sure the &lt;code&gt;Auto-open DevTools for popups&lt;/code&gt; option is enabled in Dev Tools.&lt;/p&gt;

&lt;p&gt;Unfortunately, we don't see any interesting interaction with the backend after analyzing all requests. All listing data is obtained in one request containing HTML code and JSON elements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Flisting-d32f6a4dabca952c5150d3e4705028fb.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Flisting-d32f6a4dabca952c5150d3e4705028fb.webp" alt="Listing data contained in HTML code and JSON elements"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After carefully studying the page's source code, we can say that all the data we're interested in is in the JSON located in the &lt;code&gt;script&lt;/code&gt; tag, which has an &lt;code&gt;id&lt;/code&gt; attribute with the value &lt;code&gt;__NEXT_DATA__&lt;/code&gt;. We can easily extract this JSON using a regular expression or HTML parser.&lt;/p&gt;
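
&lt;p&gt;For instance, here's a minimal sketch of the regular-expression approach (the listing URL is a placeholder - substitute any property page from the search results):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# next_data.py - minimal sketch: pull the __NEXT_DATA__ JSON out of a listing page
import json
import re

from curl_cffi import requests

# placeholder URL - substitute a real property page from the search results
url = "https://www.accommodationforstudents.com/property/12345"

html = requests.get(url, impersonate="chrome124").text

# the page embeds its data as JSON inside a script tag with id="__NEXT_DATA__"
match = re.search(r'id="__NEXT_DATA__"[^&amp;gt;]*&amp;gt;(.*?)&amp;lt;/script&amp;gt;', html, re.DOTALL)
if match:
    next_data = json.loads(match.group(1))
    print(next_data["buildId"])  # the current Next.js build id lives in this JSON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;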

&lt;p&gt;We already have everything necessary to build the crawler at this analysis stage. We know how to get data from the search, how pagination works, how to go from the search to the listing page, and where to extract the data we're interested in on the listing page.&lt;/p&gt;

&lt;p&gt;But there's one obvious inconvenience: we get search data in JSON, while listing data comes as an HTML page with the JSON embedded inside it. This isn't a blocker, but it is an inconvenience and means higher traffic consumption, as such an HTML page weighs much more than plain JSON.&lt;/p&gt;

&lt;p&gt;Let's continue our analysis.&lt;/p&gt;

&lt;p&gt;The data in &lt;code&gt;__NEXT_DATA__&lt;/code&gt; signals that the site uses the Next.js framework. Each framework has its own established internal patterns, parameters, and features.&lt;/p&gt;

&lt;p&gt;Let's look at the listing page again by refreshing it and examining the &lt;code&gt;.js&lt;/code&gt; files we receive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fjavascript-52ed58b7cb5fca440f94193ff7687de3.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fjavascript-52ed58b7cb5fca440f94193ff7687de3.webp" alt="Javascript files"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We're interested in the file with &lt;code&gt;_buildManifest.js&lt;/code&gt; in its name. The link to it changes regularly, so I'll provide an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.accommodationforstudents.com/_next/static/B5yLvSqNOvFysuIu10hQ5/_buildManifest.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file contains all possible routes available on the site. After careful study, we can see a link format like &lt;code&gt;/property/[id]&lt;/code&gt;, which is clearly related to the property page. After reading more about Next.js, we can get the final link—&lt;code&gt;https://www.accommodationforstudents.com/_next/data/[build_id]/property/[id].json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This link has two variables:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;build_id&lt;/code&gt; - the current build of the &lt;code&gt;Next.js&lt;/code&gt; application, it can be obtained from &lt;code&gt;__NEXT_DATA__&lt;/code&gt; on any application page. In the example link for &lt;code&gt;_buildManifest.js&lt;/code&gt;, its value is &lt;code&gt;B5yLvSqNOvFysuIu10hQ5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;id&lt;/code&gt; - the identifier for the property object whose data we're interested in.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's form a link and study the result in the browser.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fresult-64a7188999fd127f0b6e26bf94a4a7e5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Fresult-64a7188999fd127f0b6e26bf94a4a7e5.webp" alt="Study the result in browser"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, we now get the listing results in JSON format. But &lt;code&gt;Next.js&lt;/code&gt; serves the search as well, so let's build a link for it too, so that our future crawler interacts with only one API. It is derived from the link you see in the browser bar and will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.accommodationforstudents.com/_next/data/[build_id]/search-results.json?location=[location]&amp;amp;page=[page]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I think you immediately noticed that I dropped some of the search parameters; I did this because we simply don't need them. Coordinates aren't used in basic interaction with the backend. Since I plan for the crawler to search by location, I keep only the location and pagination parameters.&lt;/p&gt;

&lt;p&gt;Let's summarize our analysis.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For search pages, we'll use links of the format - &lt;code&gt;https://www.accommodationforstudents.com/_next/data/[build_id]/search-results.json?location=[location]&amp;amp;page=[page]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;For listing pages, we'll use links of the format - &lt;code&gt;https://www.accommodationforstudents.com/_next/data/[build_id]/property/[id].json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;We need to get the &lt;code&gt;build_id&lt;/code&gt;; for this, we'll use the site's main page and a simple regular expression.&lt;/li&gt;
&lt;li&gt;We need an HTTP client that allows bypassing Cloudflare, and we don't need any HTML parsers, as we'll get all target data from JSON.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Crawler implementation
&lt;/h2&gt;

&lt;p&gt;I'm using Crawlee for Python version &lt;code&gt;0.3.5&lt;/code&gt;. This is important, as the library is developing actively and will have more capabilities in later versions. But this is an ideal moment to show how we can work with it on complex projects.&lt;/p&gt;

&lt;p&gt;The library already has support for an HTTP client that allows bypassing Cloudflare - &lt;a href="https://github.com/apify/crawlee-python/blob/v0.3.6/src/crawlee/http_clients/curl_impersonate.py" rel="noopener noreferrer"&gt;&lt;code&gt;CurlImpersonateHttpClient&lt;/code&gt;&lt;/a&gt;. Since we have to work with JSON responses, we could use the &lt;a href="https://github.com/apify/crawlee-python/tree/v0.3.5/src/crawlee/parsel_crawler" rel="noopener noreferrer"&gt;&lt;code&gt;parsel_crawler&lt;/code&gt;&lt;/a&gt; added in version &lt;code&gt;0.3.0&lt;/code&gt;, but I think that's excessive for such tasks; besides, I like the high speed of &lt;a href="https://github.com/ijl/orjson" rel="noopener noreferrer"&gt;&lt;code&gt;orjson&lt;/code&gt;&lt;/a&gt;. Therefore, we'll implement our own crawler rather than using one of the ready-made ones.&lt;/p&gt;

&lt;p&gt;As a sample crawler, we'll use &lt;a href="https://github.com/apify/crawlee-python/tree/v0.3.5/src/crawlee/beautifulsoup_crawler" rel="noopener noreferrer"&gt;beautifulsoup_crawler&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's install the necessary dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;crawlee[curl-impersonate]&lt;span class="o"&gt;==&lt;/span&gt;0.3.5
pip &lt;span class="nb"&gt;install &lt;/span&gt;orjson&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;3.10.7,&amp;lt;4.0.0&lt;span class="s2"&gt;"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'm using &lt;a href="https://pypi.org/project/orjson/" rel="noopener noreferrer"&gt;&lt;code&gt;orjson&lt;/code&gt;&lt;/a&gt; instead of the standard &lt;a href="https://docs.python.org/3/library/json.html" rel="noopener noreferrer"&gt;&lt;code&gt;json&lt;/code&gt;&lt;/a&gt; module due to its high performance, which is especially noticeable in asynchronous applications.&lt;/p&gt;

&lt;p&gt;Well, let's implement our custom_crawler.&lt;br&gt;
Let's define the &lt;code&gt;CustomContext&lt;/code&gt; class with the necessary attributes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# custom_context.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TYPE_CHECKING&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.basic_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BasicCrawlingContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.http_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingResult&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;TYPE_CHECKING&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections.abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;


&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frozen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HttpCrawlingResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BasicCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Crawling context used by CustomCrawler.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;page_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="c1"&gt;# not `EnqueueLinksFunction`` because we are breaking protocol since we are not working with HTML
&lt;/span&gt;    &lt;span class="c1"&gt;# and we are not using selectors
&lt;/span&gt;    &lt;span class="n"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that in my context, &lt;code&gt;enqueue_links&lt;/code&gt; is just &lt;code&gt;Callable&lt;/code&gt;, not &lt;a href="https://github.com/apify/crawlee-python/blob/v0.3.5/src/crawlee/_types.py#L162" rel="noopener noreferrer"&gt;&lt;code&gt;EnqueueLinksFunction&lt;/code&gt;&lt;/a&gt;. This is because we won't be using selectors or extracting links from HTML, which would violate the agreed protocol. Still, I want the syntax in my crawler to be as close to the standard as possible.&lt;/p&gt;

&lt;p&gt;Let's move on to the crawler functionality in the &lt;code&gt;CustomCrawler&lt;/code&gt; class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# custom_crawler.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;search&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TYPE_CHECKING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Unpack&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.basic_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;BasicCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;BasicCrawlerOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;BasicCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ContextPipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.errors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SessionError&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.http_clients.curl_impersonate&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CurlImpersonateHttpClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.http_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orjson&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;loads&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;afs_crawlee.constants&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BASE_TEMPLATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HEADERS&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.custom_context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CustomContext&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;TYPE_CHECKING&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections.abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncGenerator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BasicCrawler&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CustomContext&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A crawler that fetches the request URL using `curl_impersonate` and parses the result with `orjson` and `re`.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chrome124&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;additional_http_error_status_codes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;ignore_http_error_status_codes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Iterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Unpack&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BasicCrawlerOptions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CustomContext&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_build_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BASE_TEMPLATE&lt;/span&gt;

        &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_context_pipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;ContextPipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_make_http_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_handle_blocked_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_parse_http_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Initialize curl_impersonate http client using TLS preset and necessary headers
&lt;/span&gt;        &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http_client&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nc"&gt;CurlImpersonateHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;additional_http_error_status_codes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;additional_http_error_status_codes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;ignore_http_error_status_codes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ignore_http_error_status_codes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;HEADERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_logger&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;__init__&lt;/code&gt;, we define that we'll use &lt;code&gt;CurlImpersonateHttpClient&lt;/code&gt; as the &lt;code&gt;http_client&lt;/code&gt;. Another important element is &lt;code&gt;_context_pipeline&lt;/code&gt;, which defines the sequence of methods through which our context passes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;_make_http_request&lt;/code&gt; - identical to the implementation in &lt;code&gt;BeautifulSoupCrawler&lt;/code&gt;.&lt;br&gt;
&lt;code&gt;_handle_blocked_request&lt;/code&gt; - since we get all data through the API, only the server response status can signal that we've been blocked.&lt;br&gt;
&lt;/p&gt;
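&lt;p&gt;Since &lt;code&gt;_make_http_request&lt;/code&gt; isn't repeated here, the sketch below shows roughly what that implementation does - it performs the request through the configured HTTP client and yields an &lt;code&gt;HttpCrawlingContext&lt;/code&gt;. Treat it as an approximation: the exact set of context fields and signatures varies between Crawlee versions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    # Rough sketch of _make_http_request (an approximation, not the exact library source)
    async def _make_http_request(self, context: BasicCrawlingContext) -&gt; AsyncGenerator[HttpCrawlingContext, None]:
        # Perform the request through the configured HTTP client (CurlImpersonateHttpClient here)
        result = await self._http_client.crawl(
            request=context.request,
            session=context.session,
            proxy_info=context.proxy_info,
        )

        # Wrap the result in an HTTP crawling context for the next step of the pipeline
        yield HttpCrawlingContext(
            request=context.request,
            session=context.session,
            proxy_info=context.proxy_info,
            add_requests=context.add_requests,
            send_request=context.send_request,
            push_data=context.push_data,
            log=context.log,
            http_response=result.http_response,
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is &lt;code&gt;_handle_blocked_request&lt;/code&gt; itself:&lt;br&gt;
&lt;/p&gt;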

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle_blocked_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;crawling_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CustomContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncGenerator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CustomContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_retry_on_blocked&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crawling_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;crawling_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;crawling_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_blocked_status_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SessionError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Assuming the session is blocked based on HTTP status code &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;crawling_context&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;_parse_http_response&lt;/code&gt; - the method that encapsulates the main logic for parsing responses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_http_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HttpCrawlingContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncGenerator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CustomContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;

        &lt;span class="n"&gt;page_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content-type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text/html; charset=utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Get Build ID for Next js from the start page of the site, form a link to next.js endpoints
&lt;/span&gt;            &lt;span class="n"&gt;build_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;buildId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(.{21})&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_build_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;build_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UTF-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_base_url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;build_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_build_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Convert json to python dictionary
&lt;/span&gt;            &lt;span class="n"&gt;page_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;page_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ISO-8859-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;page_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path_template&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

            &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt;
            &lt;span class="n"&gt;user_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;link_user_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;link_user_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;link_user_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SEARCH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;link_user_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;

                &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_base_url&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;path_template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;link_user_data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;CustomContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;proxy_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proxy_info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;enqueue_links&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;add_requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;send_request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send_request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;push_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;page_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;page_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, if the server responds with HTML, we extract the &lt;code&gt;build_id&lt;/code&gt; with a simple regular expression. This branch runs only once, for the very first request, and the resulting &lt;code&gt;build_id&lt;/code&gt; is what lets us call the Next.js API afterwards. In all other cases, we simply convert the JSON response to a Python &lt;code&gt;dict&lt;/code&gt; and store it in the context.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;enqueue_links&lt;/code&gt;, I implement the logic for generating links from string templates and input parameters.&lt;/p&gt;

&lt;p&gt;That's it: our custom crawler class for Crawlee for Python is ready. It's built on the &lt;code&gt;CurlImpersonateHttpClient&lt;/code&gt; client, works with JSON responses instead of HTML, and implements the link generation logic we need.&lt;/p&gt;

&lt;p&gt;Let's finalize it by defining public classes for import.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# init.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.custom_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CustomCrawler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CustomContext&lt;/span&gt;

&lt;span class="n"&gt;__all__&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CustomCrawler&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CustomContext&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
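&lt;p&gt;With these re-exports in place, code outside the package can import both classes directly from the package root (a small usage sketch, assuming the package is named &lt;code&gt;afs_crawlee&lt;/code&gt; as in the imports above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Import the public classes from the package root rather than its internal modules
from afs_crawlee import CustomContext, CustomCrawler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;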



&lt;p&gt;Now that we have the crawler functionality, let's implement routing and data extraction from the site. We'll use the &lt;a href="https://www.crawlee.dev/python/docs/introduction/refactoring" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; as a template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# router.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crawlee.router&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.constants&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LISTING_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SEARCH_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TARGET_LOCATIONS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.custom_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CustomContext&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CustomContext&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt;


&lt;span class="nd"&gt;@router.default_handler&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CustomContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle the start URL to get the Build ID and create search links.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default_handler is processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;path_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEARCH_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TARGET_LOCATIONS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SEARCH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SEARCH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CustomContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle the SEARCH URL generates links to listings and to the next search page.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search_handler is processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;max_pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pageProps&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;initialPageCount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;current_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_page&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;path_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEARCH_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
            &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SEARCH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_page&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Last page for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;listing_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;property&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pageProps&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;initialListings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;groups&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;listing&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;property&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LISTING_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;listing_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LISTING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@router.handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LISTING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;listing_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CustomContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle the LISTING URL extracts data from the listings and saving it to a dataset.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing_handler is processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;listing_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pageProps&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;viewModel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;propertyDetails&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;exists&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing_handler, data is not available for url &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="n"&gt;property_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;property_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;property_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;propertyType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_latitude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coordinates&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lat&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location_longitude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;coordinates&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lng&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;address1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;address1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;address2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;address2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postcode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postcode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bills_included&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;terms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;billsIncluded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bathrooms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;numberOfBathrooms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;number_rooms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rooms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rooms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rent_ppw&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;listing_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;terms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rentPpw&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;property_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's define our &lt;code&gt;main&lt;/code&gt; function, which will launch the crawler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.custom_crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CustomCrawler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.router&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;router&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The main function that starts crawling.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CustomCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_requests_per_crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run the crawler with the initial list of URLs.
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.accommodationforstudents.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;results.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
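&lt;p&gt;Since &lt;code&gt;main&lt;/code&gt; is a coroutine, it still needs an asyncio entry point. A minimal sketch of one (assuming the package is named &lt;code&gt;afs_crawlee&lt;/code&gt;, so the crawler can be launched with &lt;code&gt;python -m afs_crawlee&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# __main__.py - hypothetical entry point for running the package as a module

import asyncio

from .main import main

if __name__ == '__main__':
    # Start the event loop and run the crawler
    asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;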



&lt;p&gt;Let's look at the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Ffinal-results-f14b378f9aa0cbd5d1185301c49a222e.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcrawlee.dev%2Fassets%2Fimages%2Ffinal-results-f14b378f9aa0cbd5d1185301c49a222e.webp" alt="Final results file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since I prefer to manage my projects as packages and use &lt;code&gt;pyproject.toml&lt;/code&gt; according to &lt;a href="https://peps.python.org/pep-0518/" rel="noopener noreferrer"&gt;PEP 518&lt;/a&gt;, the final structure of the project looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3j4147u8ovtuzpsn0hh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3j4147u8ovtuzpsn0hh.png" alt="PEP 518 file structure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this project, we went through the entire cycle of crawler development: from analyzing a rather interesting dynamic site to a full crawler implementation using &lt;code&gt;Crawlee for Python&lt;/code&gt;. You can view the full project code on &lt;a href="https://github.com/Mantisus/crawlee_python_example" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'd also love to hear your comments and thoughts on which web scraping topics you'd like to see covered in the next article. Feel free to comment here on the article or contact me in the &lt;a href="https://apify.com/discord" rel="noopener noreferrer"&gt;Crawlee developer community&lt;/a&gt; on Discord.&lt;/p&gt;

&lt;p&gt;If you're looking to get started with scraping using Crawlee for Python, check out our &lt;a href="https://blog.apify.com/crawlee-for-python-tutorial/" rel="noopener noreferrer"&gt;latest tutorial here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can find me on the following platforms: &lt;a href="https://github.com/Mantisus" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/max-bohomolov/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://apify.com/mantisus" rel="noopener noreferrer"&gt;Apify&lt;/a&gt;, &lt;a href="https://www.upwork.com/freelancers/mantisus" rel="noopener noreferrer"&gt;Upwork&lt;/a&gt;, &lt;a href="https://contra.com/mantisus" rel="noopener noreferrer"&gt;Contra&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thank you for your attention. I hope you found this information useful.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>developers</category>
    </item>
    <item>
      <title>Current problems and mistakes of web scraping in Python and tricks to solve them!</title>
      <dc:creator>Max Bohomolov</dc:creator>
      <pubDate>Thu, 22 Aug 2024 11:33:31 +0000</pubDate>
      <link>https://forem.com/crawlee/current-problems-and-mistakes-of-web-scraping-in-python-and-tricks-to-solve-them-1ogf</link>
      <guid>https://forem.com/crawlee/current-problems-and-mistakes-of-web-scraping-in-python-and-tricks-to-solve-them-1ogf</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Greetings! I'm &lt;a href="https://apify.com/mantisus" rel="noopener noreferrer"&gt;Max&lt;/a&gt;, a Python developer from Ukraine with expertise in web scraping, data analysis, and data processing.&lt;/p&gt;

&lt;p&gt;My journey in web scraping started in 2016 when I was solving lead generation challenges for a small company. Initially, I used off-the-shelf solutions such as &lt;a href="https://www.import.io/" rel="noopener noreferrer"&gt;Import.io&lt;/a&gt; and Kimono Labs. However, I quickly encountered limitations such as blocking, inaccurate data extraction, and performance issues. This led me to learn Python. Those were the glory days when &lt;a href="https://requests.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;requests&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://lxml.de/" rel="noopener noreferrer"&gt;&lt;code&gt;lxml&lt;/code&gt;&lt;/a&gt;/&lt;a href="https://beautiful-soup-4.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;beautifulsoup&lt;/code&gt;&lt;/a&gt; were enough to extract data from most websites. And if you knew how to work with threads, you were already a respected expert :)&lt;/p&gt;

&lt;p&gt;One of our community members wrote this blog as a contribution to the Crawlee Blog. If you'd like to contribute blogs like these to the Crawlee Blog, please reach out to us on our Discord channel.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer" class="c-link"&gt;
          Crawlee &amp;amp; Apify
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          This is the official developer community of Apify and Crawlee. | 8756 members
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdiscord.com%2Fassets%2Ffavicon.ico"&gt;
        discord.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;As a freelancer, I've built small solutions and large, complex data mining systems for products over the years.&lt;/p&gt;

&lt;p&gt;Today, I want to discuss the realities of &lt;a href="https://blog.apify.com/web-scraping-python/" rel="noopener noreferrer"&gt;web scraping with Python in 2024&lt;/a&gt;. We'll look at the mistakes I sometimes see and the problems you'll encounter, and offer solutions to some of them.&lt;/p&gt;

&lt;p&gt;Let's get started.&lt;/p&gt;

&lt;p&gt;Just take &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;beautifulsoup&lt;/code&gt; and start making a lot of money...&lt;/p&gt;

&lt;p&gt;No, this is not that kind of article.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. "I got a 200 response from the server, but it's an unreadable character set."
&lt;/h2&gt;

&lt;p&gt;Yes, it can be surprising. But I've seen this message from customers and developers six years ago, four years ago, and again in 2024. Just a few months ago, I read a Reddit post about this very issue.&lt;/p&gt;

&lt;p&gt;Let's look at a simple code example. This will work for &lt;code&gt;requests&lt;/code&gt;, &lt;a href="https://www.python-httpx.org/" rel="noopener noreferrer"&gt;&lt;code&gt;httpx&lt;/code&gt;&lt;/a&gt;, and &lt;a href="https://docs.aiohttp.org/en/stable/client.html#aiohttp-client" rel="noopener noreferrer"&gt;&lt;code&gt;aiohttp&lt;/code&gt;&lt;/a&gt; with a clean installation and no extensions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.wayfair.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US,en;q=0.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gzip, deflate, br, zstd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keep-alive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The print result will be similar to:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;b&lt;span class="s1"&gt;'\x83\x0c\x00\x00\xc4\r\x8e4\x82\x8a'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It's not an error - it's a perfectly valid server response. It's just compressed.&lt;/p&gt;

&lt;p&gt;The answer lies in the &lt;code&gt;Accept-Encoding&lt;/code&gt; header. In the example above, I just copied it from my browser, so it lists all the compression methods my browser supports: "gzip, deflate, br, zstd". The Wayfair backend supports compression with "br", which is &lt;a href="https://github.com/google/brotli" rel="noopener noreferrer"&gt;Brotli&lt;/a&gt;, and uses it as the most efficient method.&lt;/p&gt;

&lt;p&gt;This happens because none of the libraries listed above include &lt;code&gt;Brotli&lt;/code&gt; among their standard dependencies. However, they all support decompressing this format if &lt;code&gt;Brotli&lt;/code&gt; is already installed.&lt;/p&gt;
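
&lt;p&gt;You can confirm that compression is the culprit by checking the response headers. A minimal sketch with &lt;code&gt;httpx&lt;/code&gt; (the same idea works for &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;aiohttp&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import httpx

# Advertise the same compression methods as a browser.
headers = {"Accept-Encoding": "gzip, deflate, br, zstd"}

response = httpx.get("https://www.wayfair.com/", headers=headers)

# The server reports which compression it actually applied, e.g. 'br'.
print(response.headers.get("content-encoding"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;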

&lt;p&gt;Therefore, it's sufficient to install the appropriate library:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;Brotli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will allow you to get the result of the print:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;b&lt;span class="s1"&gt;'&amp;lt;!DOCTYPE '&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You can obtain the same result for &lt;code&gt;aiohttp&lt;/code&gt; and &lt;code&gt;httpx&lt;/code&gt; by doing the installation with extensions:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aiohttp[speedups]
pip &lt;span class="nb"&gt;install &lt;/span&gt;httpx[brotli]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;By the way, adding the &lt;code&gt;brotli&lt;/code&gt; dependency was my first contribution to &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;&lt;code&gt;crawlee-python&lt;/code&gt;&lt;/a&gt;. They use &lt;code&gt;httpx&lt;/code&gt; as the base HTTP client.&lt;/p&gt;

&lt;p&gt;You may have also noticed that a new supported data compression format &lt;a href="https://github.com/facebook/zstd" rel="noopener noreferrer"&gt;&lt;code&gt;zstd&lt;/code&gt;&lt;/a&gt; appeared some time ago. I haven't seen any backends that use it yet, but &lt;code&gt;httpx&lt;/code&gt; will support decompression in versions above 0.28.0. I already use it to compress server response dumps in my projects; it shows incredible efficiency in asynchronous solutions with &lt;a href="https://github.com/Tinche/aiofiles" rel="noopener noreferrer"&gt;&lt;code&gt;aiofiles&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
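
&lt;p&gt;For illustration, here is roughly how such a dump can be written - a sketch assuming &lt;code&gt;pip install zstandard aiofiles&lt;/code&gt;; the path and compression level are arbitrary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

import aiofiles
import httpx
import zstandard


async def dump_response(url: str, path: str) -&gt; None:
    # Fetch the page and store a zstd-compressed dump of the raw body.
    async with httpx.AsyncClient() as client:
        response = await client.get(url)

    compressed = zstandard.ZstdCompressor(level=3).compress(response.content)

    async with aiofiles.open(path, "wb") as file:
        await file.write(compressed)


asyncio.run(dump_response("https://example.com/", "dump.html.zst"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;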

&lt;p&gt;The most common solution to this situation that I've seen is for developers to simply stop using the &lt;code&gt;Accept-Encoding&lt;/code&gt; header, thus getting an uncompressed response from the server. Why is that bad? The &lt;a href="https://www.wayfair.com/" rel="noopener noreferrer"&gt;main page of Wayfair&lt;/a&gt; takes about 1 megabyte uncompressed and about 0.165 megabytes compressed.&lt;/p&gt;

&lt;p&gt;Therefore, in the absence of this header:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You increase the load on your internet bandwidth.&lt;/li&gt;
&lt;li&gt;If you use a proxy billed by traffic, you increase the cost of each request.&lt;/li&gt;
&lt;li&gt;You increase the load on the server's internet bandwidth.&lt;/li&gt;
&lt;li&gt;You're revealing yourself as a scraper, since any browser uses compression.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I think the problem is a bit deeper than that. Many web scraping developers simply don't understand what the headers they use do. So if this applies to you, when you're working on your next project, read up on these things; they may surprise you.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. "I use headers as in an incognito browser, but I get a 403 response". Here's Johnn-... I mean, Cloudflare
&lt;/h2&gt;

&lt;p&gt;Yes, that's right. 2023 brought us not only Large Language Models like ChatGPT but also improved &lt;a href="https://www.cloudflare.com/" rel="noopener noreferrer"&gt;Cloudflare&lt;/a&gt; protection.&lt;/p&gt;

&lt;p&gt;Those who have been scraping the web for a long time might say, "Well, we've already dealt with DataDome, PerimeterX, InCapsula, and the like."&lt;/p&gt;

&lt;p&gt;But Cloudflare has changed the rules of the game. It is one of the largest CDN providers in the world, serving a huge number of sites. As a result, its protection is available to many sites with a fairly low entry barrier. This makes it radically different from the technologies mentioned earlier, which site owners implemented purposefully when they wanted to protect a site from scraping.&lt;/p&gt;

&lt;p&gt;Cloudflare is the reason why, when you start reading yet another course on "How to do web scraping using &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;beautifulsoup&lt;/code&gt;", you can close it immediately: there's a big chance that what you learn simply won't work on any "decent" website.&lt;/p&gt;

&lt;p&gt;Let's look at another simple code example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.g2.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US,en;q=0.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gzip, deflate, br, zstd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keep-alive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Of course, the response would be &lt;a href="https://blog.apify.com/web-scraping-how-to-solve-403-errors/" rel="noopener noreferrer"&gt;403&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What if we use &lt;a href="https://curl.se/docs/manpage.html" rel="noopener noreferrer"&gt;&lt;code&gt;curl&lt;/code&gt;&lt;/a&gt;?&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-XGET&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0"'&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8'&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Accept-Language: en-US,en;q=0.5'&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Connection: keep-alive'&lt;/span&gt; &lt;span class="s1"&gt;'https://www.g2.com/'&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"%{http_code}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Also 403.&lt;/p&gt;

&lt;p&gt;Why is this happening?&lt;/p&gt;

&lt;p&gt;Because Cloudflare uses the TLS fingerprints of many HTTP clients popular among developers to detect them, and site administrators can also configure how aggressively Cloudflare blocks clients based on these fingerprints.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;curl&lt;/code&gt;, we can solve it like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-XGET&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0"'&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8'&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Accept-Language: en-US,en;q=0.5'&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Connection: keep-alive'&lt;/span&gt; &lt;span class="s1"&gt;'https://www.g2.com/'&lt;/span&gt; &lt;span class="nt"&gt;--tlsv1&lt;/span&gt;.3 &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"%{http_code}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You might expect me to write here an equally elegant solution for &lt;code&gt;httpx&lt;/code&gt;, but no. About six months ago, you could do the "dirty trick" and change the basic &lt;a href="https://www.encode.io/httpcore/" rel="noopener noreferrer"&gt;&lt;code&gt;httpcore&lt;/code&gt;&lt;/a&gt; parameters that it passes to &lt;a href="https://github.com/python-hyper/h2" rel="noopener noreferrer"&gt;&lt;code&gt;h2&lt;/code&gt;&lt;/a&gt;, which are responsible for the HTTP2 handshake. But now, as I'm writing this article, that doesn't work anymore.&lt;/p&gt;

&lt;p&gt;There are different approaches to getting around this. But let's solve it by manipulating TLS.&lt;/p&gt;

&lt;p&gt;The bad news is that all the Python clients I know of use the &lt;a href="https://docs.python.org/3/library/ssl.html" rel="noopener noreferrer"&gt;&lt;code&gt;ssl&lt;/code&gt;&lt;/a&gt; library to handle TLS. And it doesn't give you the ability to manipulate TLS subtly.&lt;/p&gt;
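
&lt;p&gt;For example, the standard &lt;code&gt;ssl&lt;/code&gt; module only exposes coarse knobs - nothing that would let you reproduce a browser's ClientHello. A sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ssl

context = ssl.create_default_context()

# You can pin the protocol version and restrict cipher suites...
context.minimum_version = ssl.TLSVersion.TLSv1_2
context.set_ciphers("ECDHE+AESGCM")

# ...but there is no API to reorder TLS extensions or otherwise mimic
# a browser's ClientHello, which is exactly what TLS fingerprinting looks at.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;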

&lt;p&gt;The good news is that the Python community is great and implements solutions that exist in other programming languages.&lt;/p&gt;
&lt;h3&gt;
  
  
  The first way to solve this problem is to use &lt;a href="https://github.com/FlorianREGAZ/Python-Tls-Client" rel="noopener noreferrer"&gt;tls-client&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;This Python wrapper around the &lt;a href="https://github.com/bogdanfinn/tls-client" rel="noopener noreferrer"&gt;Golang library&lt;/a&gt; provides an API similar to &lt;code&gt;requests&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;tls-client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tls_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Session&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_identifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;firefox_120&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.g2.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US,en;q=0.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gzip, deflate, br, zstd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keep-alive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;tls_client&lt;/code&gt; supports TLS presets for popular browsers, the relevance of which is maintained by developers. To use this, you must pass the necessary &lt;code&gt;client_identifier&lt;/code&gt;. However, the library also allows for subtle manual manipulation of TLS.&lt;/p&gt;
&lt;h3&gt;
  
  
  The second way to solve this problem is to use &lt;a href="https://github.com/yifeikong/curl_cffi" rel="noopener noreferrer"&gt;curl_cffi&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;This is a wrapper around a patched build of the curl C library, and it provides an API similar to &lt;code&gt;requests&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;curl_cffi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;curl_cffi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.g2.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US,en;q=0.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gzip, deflate, br, zstd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keep-alive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;impersonate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chrome124&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;curl_cffi also provides &lt;a href="https://curl-cffi.readthedocs.io/en/latest/impersonate.html#supported-browser-versions" rel="noopener noreferrer"&gt;TLS presets&lt;/a&gt; for some browsers, which are specified via the &lt;code&gt;impersonate&lt;/code&gt; parameter. It also provides options for &lt;a href="https://curl-cffi.readthedocs.io/en/latest/impersonate.html#how-to-use-my-own-fingerprints-other-than-the-builtin-ones-e-g-okhttp" rel="noopener noreferrer"&gt;subtle manual manipulation of TLS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I think someone just said, "They're literally doing the same thing." That's right, and they're both still very raw.&lt;/p&gt;

&lt;p&gt;Let's do some simple comparisons:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;tls_client&lt;/th&gt;
&lt;th&gt;curl_cffi&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TLS preset&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLS manual&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;async support&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;big company support&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;number of contributors&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Obviously, &lt;code&gt;curl_cffi&lt;/code&gt; wins in this comparison. But as an active user, I have to say that sometimes there are some pretty strange errors that I'm just unsure how to deal with. And let's be honest, so far, they are both pretty raw.&lt;/p&gt;
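
&lt;p&gt;Since async support is one of the deciding rows in that table, here is roughly what the asynchronous &lt;code&gt;curl_cffi&lt;/code&gt; API looks like - a sketch based on its &lt;code&gt;AsyncSession&lt;/code&gt;; check the library's documentation for the exact interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

from curl_cffi.requests import AsyncSession


async def main() -&gt; None:
    async with AsyncSession() as session:
        # The same impersonate parameter works in the async client.
        response = await session.get("https://www.g2.com/", impersonate="chrome124")
        print(response.status_code)


asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;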

&lt;p&gt;I think we will soon see other libraries that solve this problem.&lt;/p&gt;

&lt;p&gt;One might ask, what about &lt;a href="https://scrapy.org/" rel="noopener noreferrer"&gt;&lt;code&gt;Scrapy&lt;/code&gt;&lt;/a&gt;? I'll be honest: I don't really keep up with their updates. But I haven't heard about &lt;a href="https://www.zyte.com/" rel="noopener noreferrer"&gt;Zyte&lt;/a&gt; doing anything to bypass TLS fingerprinting. So out of the box &lt;code&gt;Scrapy&lt;/code&gt; will also be blocked, but nothing is stopping you from using &lt;code&gt;curl_cffi&lt;/code&gt; in your Scrapy Spider.&lt;/p&gt;
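
&lt;p&gt;A minimal (and admittedly hacky) sketch of what that might look like - calling &lt;code&gt;curl_cffi&lt;/code&gt; directly from a spider callback rather than through an official integration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import scrapy
from curl_cffi import requests as curl_requests


class G2Spider(scrapy.Spider):
    name = "g2"
    # Any reachable page works as an entry point for the callback.
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Fetch the Cloudflare-protected page with a browser TLS fingerprint.
        protected = curl_requests.get("https://www.g2.com/", impersonate="chrome124")
        selector = scrapy.Selector(text=protected.text)
        yield {"title": selector.css("title::text").get()}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;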
&lt;h2&gt;
  
  
  3. What about headless browsers and Cloudflare Turnstile?
&lt;/h2&gt;

&lt;p&gt;Yes, sometimes we need to use headless browsers. Although I'll be honest, from my point of view, they are used too often even when clearly not necessary.&lt;/p&gt;

&lt;p&gt;Even in a headless situation, the folks at Cloudflare have managed to make life difficult for the average web scraper by creating a monster called Cloudflare Turnstile.&lt;/p&gt;

&lt;p&gt;To test different tools, you can use this demo &lt;a href="https://2captcha.com/demo/cloudflare-turnstile" rel="noopener noreferrer"&gt;page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To quickly test whether a library can get past the protection, start by checking it in the usual non-headless mode. You don't even need automation; just open the site with the library in question and interact with it manually.&lt;/p&gt;
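
&lt;p&gt;For example, with &lt;code&gt;undetected_chromedriver&lt;/code&gt; such a manual check can be as simple as this (a sketch, assuming &lt;code&gt;pip install undetected-chromedriver&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import undetected_chromedriver as uc

# Open a visible (non-headless) browser and try to pass the Turnstile demo
# manually before automating anything.
driver = uc.Chrome(headless=False)
driver.get("https://2captcha.com/demo/cloudflare-turnstile")

input("Interact with the page, then press Enter to close the browser...")
driver.quit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;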

&lt;p&gt;What libraries are worth checking out for this?&lt;/p&gt;
&lt;h3&gt;
  
  
  Candidate #1 &lt;a href="https://playwright.dev/python/docs/intro" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; + &lt;a href="https://github.com/AtuboDad/playwright_stealth" rel="noopener noreferrer"&gt;playwright-stealth&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;It'll be blocked and won't let you solve the captcha.&lt;/p&gt;

&lt;p&gt;Playwright is a great library for browser automation. However, the developers explicitly state that they don't plan to develop it as a web scraping tool.&lt;/p&gt;

&lt;p&gt;And I haven't heard of any Python projects that effectively solve this problem.&lt;/p&gt;
&lt;h3&gt;
  
  
  Candidate #2 &lt;a href="https://github.com/ultrafunkamsterdam/undetected-chromedriver" rel="noopener noreferrer"&gt;undetected_chromedriver&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;It'll be blocked and won't let you solve the captcha.&lt;/p&gt;

&lt;p&gt;This is a fairly common library for working with headless browsers in Python, and in some cases, it allows bypassing Cloudflare Turnstile. But on the target website, it is blocked. Also, in my projects, I've encountered at least two other cases where Cloudflare blocked undetected_chromedriver.&lt;/p&gt;

&lt;p&gt;In general, undetected_chromedriver is a good library for your projects, especially since it uses good old Selenium under the hood.&lt;/p&gt;
&lt;h3&gt;
  
  
  Candidate #3 &lt;a href="https://github.com/omkarcloud/botasaurus-driver" rel="noopener noreferrer"&gt;botasaurus-driver&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;It allows you to go past the captcha after clicking.&lt;/p&gt;

&lt;p&gt;I don't know how its developers pulled this off, but it works. Its main feature is that it was developed specifically for web scraping. It also has a higher-level library to work with - &lt;a href="https://github.com/omkarcloud/botasaurus" rel="noopener noreferrer"&gt;botasaurus&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;On the downside, it's still pretty raw: &lt;code&gt;botasaurus-driver&lt;/code&gt; has no documentation and a rather challenging API to work with.&lt;/p&gt;

&lt;p&gt;To summarize, most likely, your main library for headless browsing will be &lt;code&gt;undetected_chromedriver&lt;/code&gt;. But in some particularly challenging cases, you might need to use &lt;code&gt;botasaurus&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. What about frameworks?
&lt;/h2&gt;

&lt;p&gt;High-level frameworks are designed to speed up and ease development by allowing us to focus on business logic, although we often pay the price in flexibility and control.&lt;/p&gt;

&lt;p&gt;So, what are the frameworks for web scraping in 2024?&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://docs.scrapy.org/en/latest/" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;It's impossible to talk about Python web scraping frameworks without mentioning Scrapy. Scrapinghub (now Zyte) first released it in 2008. For 16 years, it has been developed as an open-source library upon which development companies built their business solutions.&lt;/p&gt;

&lt;p&gt;You could write a separate article about the advantages of &lt;code&gt;Scrapy&lt;/code&gt;, but I'll emphasize two of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The huge amount of tutorials that have been released over the years&lt;/li&gt;
&lt;li&gt;Middleware libraries written by the community that extend its functionality - for example, &lt;a href="https://github.com/scrapy-plugins/scrapy-playwright" rel="noopener noreferrer"&gt;&lt;code&gt;scrapy-playwright&lt;/code&gt;&lt;/a&gt; (see the configuration sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
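
&lt;p&gt;As an example of such an extension, here is roughly what wiring &lt;code&gt;scrapy-playwright&lt;/code&gt; into a project looks like - a sketch based on the library's README:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# settings.py - a minimal sketch of enabling scrapy-playwright

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Individual requests then opt in by setting &lt;code&gt;meta={"playwright": True}&lt;/code&gt;.&lt;/p&gt;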

&lt;p&gt;But what are the downsides?&lt;/p&gt;

&lt;p&gt;In recent years, Zyte has been focusing more on developing its own platform, and &lt;code&gt;Scrapy&lt;/code&gt; itself mostly receives only fixes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lack of development towards bypassing anti-scraping systems. You have to implement them yourself, but then, why do you need a framework?&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Scrapy&lt;/code&gt; was originally developed with the asynchronous framework &lt;code&gt;Twisted&lt;/code&gt;. Partial support for &lt;code&gt;asyncio&lt;/code&gt; was added only in &lt;a href="https://docs.scrapy.org/en/latest/topics/asyncio.html" rel="noopener noreferrer"&gt;&lt;code&gt;version 2.0&lt;/code&gt;&lt;/a&gt;. Looking through the source code, you may notice some workarounds that were added for this purpose.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, &lt;code&gt;Scrapy&lt;/code&gt; is a good and proven solution for sites that are not protected against web scraping. You will need to develop and add the necessary solutions to the framework in order to bypass anti-scraping measures.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://www.omkar.cloud/botasaurus/" rel="noopener noreferrer"&gt;Botasaurus&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;A new framework for web scraping using browser automation, built on &lt;a href="https://github.com/omkarcloud/botasaurus-driver" rel="noopener noreferrer"&gt;&lt;code&gt;botasaurus-driver&lt;/code&gt;&lt;/a&gt;. The initial commit was made on May 9, 2023.&lt;/p&gt;

&lt;p&gt;Let's start with its advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows you to bypass any Cloudflare protection, as well as many others, using &lt;code&gt;botasaurus-driver&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Good documentation for a quick start&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Downsides include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser automation only, not intended for HTTP clients.&lt;/li&gt;
&lt;li&gt;Tight coupling with &lt;code&gt;botasaurus-driver&lt;/code&gt;; you can't easily replace it with something better if it comes out in the future.&lt;/li&gt;
&lt;li&gt;No asynchrony, only multithreading.&lt;/li&gt;
&lt;li&gt;At the moment, it's quite raw and still requires fixes for stable operation.&lt;/li&gt;
&lt;li&gt;There are very few training materials available at the moment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a good framework for quickly building a web scraper based on browser automation. However, it lacks flexibility and support for HTTP clients, which is crucial for users like me.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://www.crawlee.dev/python/" rel="noopener noreferrer"&gt;Crawlee for Python&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;A new framework for web scraping in the Python ecosystem. The initial commit was made on Jan 10, 2024, with the public announcement following on July 5, 2024.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/apify" rel="noopener noreferrer"&gt;
        apify
      &lt;/a&gt; / &lt;a href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;
        crawlee-python
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;
    &lt;a href="https://crawlee.dev" rel="nofollow noopener noreferrer"&gt;
        
          
          &lt;img alt="Crawlee" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fapify%2Fcrawlee-python%2Fmaster%2Fwebsite%2Fstatic%2Fimg%2Fcrawlee-light.svg%3Fsanitize%3Dtrue" width="500"&gt;
        
    &lt;/a&gt;
    &lt;br&gt;
    A web scraping and browser automation library
&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;
    &lt;a href="https://badge.fury.io/py/crawlee" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/15da2423bb8cab10ccfcebcfb55789b5f898bd2374ae3504b9157693be92f3ab/68747470733a2f2f62616467652e667572792e696f2f70792f637261776c65652e737667" alt="PyPI version"&gt;
    &lt;/a&gt;
    &lt;a href="https://pypi.org/project/crawlee/" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/3d2b832d76d0fb0fc283d197c6cd685e5cfcfb46ebba485b520dfd434c3c6237/68747470733a2f2f696d672e736869656c64732e696f2f707970692f646d2f637261776c6565" alt="PyPI - Downloads"&gt;
    &lt;/a&gt;
    &lt;a href="https://pypi.org/project/crawlee/" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/8466f1b43e225ad363016bde2b6e445f843fbe2161a61d6803ae80c01a5dae7c/68747470733a2f2f696d672e736869656c64732e696f2f707970692f707976657273696f6e732f637261776c6565" alt="PyPI - Python Version"&gt;
    &lt;/a&gt;
    &lt;a href="https://discord.gg/jyEM2PRvMU" rel="nofollow noopener noreferrer"&gt;
        &lt;img src="https://camo.githubusercontent.com/afa7e1b87d07c3e1531206ff182c487373079a12f504d7a95b371871ca09d3b0/68747470733a2f2f696d672e736869656c64732e696f2f646973636f72642f3830313136333731373931353537343332333f6c6162656c3d646973636f7264" alt="Chat on discord"&gt;
    &lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Crawlee covers your crawling and scraping end-to-end and &lt;strong&gt;helps you build reliable scrapers. Fast.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🚀 Crawlee for Python is open to early adopters!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;👉 &lt;strong&gt;View full documentation, guides and examples on the &lt;a href="https://crawlee.dev/python/" rel="nofollow noopener noreferrer"&gt;Crawlee project website&lt;/a&gt;&lt;/strong&gt; 👈&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We also have a TypeScript implementation of the Crawlee, which you can explore and utilize for your projects. Visit our GitHub repository for more information &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;Crawlee for JS/TS on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;We…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/apify/crawlee-python" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;Developed by &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt;, it is a Python adaptation of their famous JS framework &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;&lt;code&gt;crawlee&lt;/code&gt;&lt;/a&gt;, first released on Jul 9, 2019.&lt;/p&gt;

&lt;p&gt;As this is a completely new solution on the market, it is now in an active design and development stage, and the community is actively involved. For example, the use of &lt;a href="https://github.com/apify/crawlee-python/issues/292" rel="noopener noreferrer"&gt;curl_cffi&lt;/a&gt; is already being discussed, and the possibility of creating their own Rust-based client was &lt;a href="https://github.com/apify/crawlee-python/issues/80" rel="noopener noreferrer"&gt;previously discussed&lt;/a&gt;. I hope the company doesn't abandon the idea.&lt;/p&gt;

&lt;p&gt;From the Crawlee team:&lt;br&gt;
"Yeah, for sure we will keep improving Crawlee for Python for years to come."&lt;/p&gt;

&lt;p&gt;I personally would like to see an HTTP client for Python developed and maintained by a major company. And Rust has proven itself very well as a language for Python libraries - just look at &lt;a href="https://docs.astral.sh/ruff/" rel="noopener noreferrer"&gt;&lt;code&gt;Ruff&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.pydantic.dev/latest/" rel="noopener noreferrer"&gt;&lt;code&gt;Pydantic&lt;/code&gt;&lt;/a&gt; v2.&lt;/p&gt;

&lt;p&gt;Advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The framework was developed by an established company in the web scraping market, with well-developed expertise in this sphere.&lt;/li&gt;
&lt;li&gt;Support for both browser automation and HTTP clients.&lt;/li&gt;
&lt;li&gt;Fully asynchronous, based on &lt;code&gt;asyncio&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Active development and media presence. The developers listen to the community, which is quite important at this stage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On a separate note, it has a pretty good modular architecture. If the developers introduce the ability to switch between several HTTP clients, we will get a rather flexible framework that lets us easily change the technologies we use.&lt;/p&gt;
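
&lt;p&gt;To give a sense of the developer experience, here is a minimal HTTP-based crawler - a sketch only, and exact import paths vary between Crawlee for Python versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -&gt; None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -&gt; None:
        # Extract the page title and store it in the default dataset.
        title = context.soup.title.string if context.soup.title else None
        await context.push_data({"url": context.request.url, "title": title})

    await crawler.run(["https://crawlee.dev/"])


asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;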

&lt;p&gt;Downsides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The framework is new. There are very few training materials available at the moment.&lt;/li&gt;
&lt;li&gt;At the moment, it's quite raw and still requires fixes for stable operation, as well as convenient configuration interfaces.&lt;/li&gt;
&lt;li&gt;For now, there is no implementation of any means of bypassing anti-scraping systems other than rotating sessions and proxies, but these are being discussed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I believe that how successful &lt;code&gt;crawlee-python&lt;/code&gt; turns out to be depends primarily on the community. Due to the small number of tutorials, it is not suitable for beginners. However, experienced developers may decide to try it instead of &lt;code&gt;Scrapy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the long run, it may turn out to be a better solution than Scrapy and Botasaurus. It already provides flexible tools for working with HTTP clients, automating browsers out of the box, and quickly switching between them. However, it lacks tools to bypass scraping protections, and their implementation in the future may be the deciding factor in choosing a framework for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you have read all the way to here, I assume you found it interesting and maybe even helpful :)&lt;/p&gt;

&lt;p&gt;The industry is changing and offering new challenges, and if you are professionally involved in web scraping, you will have to keep a close eye on the situation. In some other field, ignoring them would leave you a developer who makes products using outdated technologies. In modern web scraping, it would leave you a developer who makes web scrapers that simply don't work.&lt;/p&gt;

&lt;p&gt;Also, don't forget that you are part of the larger Python community, and your knowledge can be useful in developing tools that make things happen for all of us. As you can see, many of the tools you need are being built literally right now.&lt;/p&gt;

&lt;p&gt;I'll be glad to read your comments. Also, if you need a web scraping expert, or just want to discuss the article, you can find me on the following platforms: &lt;a href="https://github.com/Mantisus" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/max-bohomolov/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://apify.com/mantisus" rel="noopener noreferrer"&gt;Apify&lt;/a&gt;, &lt;a href="https://www.upwork.com/freelancers/mantisus" rel="noopener noreferrer"&gt;Upwork&lt;/a&gt;, &lt;a href="https://contra.com/mantisus" rel="noopener noreferrer"&gt;Contra&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thank you for your attention :)&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>webdev</category>
      <category>python</category>
      <category>developers</category>
    </item>
  </channel>
</rss>
