Forem: Crawlee

How we moved Crawlee for Python out of Beta

Ella (she/her/elle) — Tue, 30 Sep 2025 08:59:42 +0000

Today is the day we move Crawlee for Python out of beta 🥳 In this post I will summarize the decisions and development that went into this process. I hope it will inform — and even validate — your product decisions.

What is Crawlee for Python?

To begin with, a quick intro: Crawlee for Python is an open-source web-scraping library that provides you with a full toolkit to build, run, and scale your own scrapers. You can deploy Crawlee for Python to extract financial data to identify cases of corporate wage theft, or to track major global polluters, or compare the prices of competing e-commerce retailers. The only real limit is your creativity.

Why was it in beta?

Crawlee for Python was released in beta in July 2024. A couple of years earlier, we had released Crawlee for JS, which had better capabilities from the outset as a result of the tooling that was available for us to build with.

We knew we ultimately wanted both versions of Crawlee to have equivalent capabilities, but the Python tooling available to us at the time was too young and untested in real scenarios. We believed we could reach the point where Crawlee for Python was equivalent to (or better than!) its JS counterpart but weren’t ready to include untested tooling in the build. After working hard to launch, we made the decision to release Crawlee for Python in beta.

This decision gave us breathing room to collect real-world feedback from the community and resolve the issues that come with any new code. Getting the product out in beta allowed devs to engage with it, and gave us the chance to fix reported bugs, implement requested features, and improve performance. It also surfaced new issues with the library that we could address in a more complete build.

What did we do to prepare for v1?

We used the time Crawlee for Python was in beta to improve more than just the library. We created several benchmarks to track the performance of a wide range of typical crawlers, allowing us to compare Crawlee for Python with other options available to developers, including Crawlee for JS.

We re-implemented most of the capabilities of Crawlee for JS in our v1 of Crawlee for Python. Key features we added are:

Unified storage client system, resulting in less duplication, better extensibility, and a cleaner developer experience. It also opens the door for the community to build and share their own storage client implementations tailored to specific databases or cloud storage providers.
Adaptive Playwright crawler, which makes your crawls faster and cheaper, while still allowing you to reliably handle complex, dynamic websites. In practice, you get the best of both worlds: speed on simple pages and robustness on modern, JavaScript-heavy sites.
New default HTTP client (ImpitHttpClient, powered by the Impit library). This delivers fewer false positives, more resilient crawls, and less need for complicated workarounds. Impit is also developed as an open-source project by Apify, so you can dive into the internals or contribute improvements yourself. You can also create your own instance, configure it to your needs (e.g. enable HTTP/3 or choose a specific browser profile), and pass it into your crawler.
Sitemap request loader, making it easier to start large-scale crawls where sitemaps already provide full coverage of the site.
Robots exclusion standard support, which not only helps you build ethical and responsible crawlers, but can also save time and bandwidth by skipping disallowed or irrelevant pages.
Fingerprinting so that each crawler run looks like a real browser on a real device, making it less likely that you’ll get blocked. Using fingerprinting in Crawlee is straightforward: create a fingerprint generator with your desired options and pass it to the crawler.
OpenTelemetry, allowing you to monitor real-time dashboards or analyze traces to understand crawler performance, and making it easier to integrate Crawlee into existing monitoring pipelines.
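
To make the robots exclusion point concrete, here is a minimal sketch of the underlying check. It does not use Crawlee at all: it relies on Python's standard urllib.robotparser, and the robots.txt content, the my-crawler user agent, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that the given robots.txt allows for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidate_urls = [
    'https://example.com/blog/post',
    'https://example.com/private/report',
]
print(allowed_urls(ROBOTS_TXT, 'my-crawler', candidate_urls))
```

Crawlee's own robots support handles this inside the crawler; the sketch only shows why skipping disallowed pages saves requests.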

How did we know it was ready to move out of beta?

Once we had implemented the features above and tested them against as many real-world cases as we could, we had a version of Crawlee for Python that lived up to our initial vision for it. We feel now is the time: Crawlee for Python is ready to sit alongside its JS counterpart as a full web-scraping and automation library.

Are you ready to check it out? Visit our repo and give us a star ⭐

What happens next?

We will continue to build upon Crawlee for Python v1 as a rolling release. Although we are happy to be launching the full version, we know we can make further improvements. This is where you come in!

We’d love your feedback. Try it out, and don’t hesitate to open an issue to:

Report any bugs or weird behaviour you encounter;
Ask questions if something isn’t clear - so we can improve the docs;
Or request new features you’d like to see.

Thanks for reading!

We hope this summary gave you some insights into the process and decisions that take place to move a product out of beta. We would love to hear from you if you have questions about Crawlee for Python (or JS!), the tooling we have implemented, or anything else we have discussed in this article.

Give it a run and share your issues with us so we can continue to refine this tool and adapt it to suit all your web-crawling and automation needs.

How to scrape YouTube using Python [2025 guide]

Max Bohomolov — Wed, 16 Jul 2025 12:10:32 +0000

In this guide, we'll explore how to efficiently collect data from YouTube using Crawlee for Python. The scraper will extract video metadata, video statistics, and transcripts - giving you structured YouTube data perfect for content analysis, ML training, or trend monitoring.

Note: One of our community members wrote this guide as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on Apify’s Discord channel.

Key steps we'll cover:

Project setup
Analyzing YouTube and determining a scraping strategy
Configuring Crawlee
Extracting YouTube data
Enhancing the scraper capabilities
Creating a YouTube Actor on the Apify platform
Deploying to Apify

What you’ll need to get started

Python 3.10 or higher
Familiarity with web scraping concepts
Crawlee for Python v0.6.0 or higher
uv v0.7 or higher

1. Project setup

Note: Before starting the project, I'd like to ask you to star Crawlee for Python on GitHub. This will help us spread the word to fellow scraper developers.

In this project, we'll use uv for package management and a specific Python version will be installed through uv. If you don't have uv installed yet, just follow the guide or use this command:

curl -LsSf https://astral.sh/uv/install.sh | sh

To create the project, run:

uvx 'crawlee[cli]' create youtube-crawlee

In the CLI menu that opens, select:

Playwright
Httpx
uv
Leave the default value - https://crawlee.dev
y

Or, just run the command:

uvx 'crawlee[cli]' create youtube-crawlee --crawler-type playwright --http-client httpx --package-manager uv --apify --start-url 'https://crawlee.dev'

Or, if you prefer to use pipx:

pipx run 'crawlee[cli]' create youtube-crawlee --crawler-type playwright --http-client httpx --package-manager uv --apify --start-url 'https://crawlee.dev'

Creating the project may take a few minutes. After installation is complete, navigate to the project folder:

cd youtube-crawlee

2. Analyzing YouTube and determining a scraping strategy

If you're working on a small project to extract data from YouTube, you should use the YouTube API to get your data. However, the API has very strict quotas, with no more than 10,000 units per day. This allows you to get just 100 search pages, and you can't increase this limit.
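
The arithmetic behind that figure can be sketched in a few lines. The cost of 100 quota units per search request and the cap of 50 results per search page are assumptions taken from the YouTube Data API documentation, so check them against the current docs before relying on them.

```python
DAILY_QUOTA_UNITS = 10_000  # daily quota mentioned above
SEARCH_COST_UNITS = 100     # assumed cost of one search request
MAX_RESULTS_PER_PAGE = 50   # assumed maximum page size for a search request


def max_search_pages(daily_quota: int = DAILY_QUOTA_UNITS, cost: int = SEARCH_COST_UNITS) -> int:
    """Number of search pages that fit into the daily quota."""
    return daily_quota // cost


pages = max_search_pages()
print(f'{pages} search pages per day, at most {pages * MAX_RESULTS_PER_PAGE} results')
```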

If your project requires more data than the API allows, you'll need to use crawling. Let's examine the site to develop an optimal crawling strategy.

Let's study YouTube navigation using Apify's YouTube channel as an example to better understand the features and data extraction points.

YouTube uses infinite scrolling to load new elements on the page, similar to what we discussed in the corresponding article from the Apify team. Let's look at how this works using DevTools and the Network tab.

If we look at the response structure, we can see that YouTube uses JSON to transmit data, but its structure is quite complex to navigate.

Therefore, we'll use Playwright for crawling, which will help us avoid parsing complex JSON responses. But if you want to practice crawling complex websites, try implementing a crawler based on an HTTP client, like in this article.

Let's analyze the selectors for getting video links using the Elements tab:

It looks like we're interested in a tags with the attribute id="video-title-link"!

Let's look at the video page to understand better how YouTube transmits data. As expected, we see data in JSON format.

Now let's get the transcript link. Click on the subtitles button in the player to trigger the transcript request.

Let's verify that we can access the transcript via this link. Remove the fmt=json3 parameter from the URL and open it in your browser. Removing the fmt parameter is necessary to get the data in a convenient XML format instead of the complex JSON3 format.
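
Removing the parameter can also be done in code. The sketch below uses only the standard library (the scraper later in this guide uses yarl for the same job); drop_query_param is a hypothetical helper, and the video ID and query values in the example URL are made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url: str, name: str) -> str:
    """Return the URL without the given query parameter, keeping all others in order."""
    parts = urlsplit(url)
    query = [(key, value) for key, value in parse_qsl(parts.query, keep_blank_values=True) if key != name]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical transcript URL, shaped like the request seen in DevTools
json3_url = 'https://www.youtube.com/api/timedtext?v=abc123&lang=en&fmt=json3'
xml_url = drop_query_param(json3_url, 'fmt')
print(xml_url)
```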

If you live in a country where GDPR applies, you'll need to handle the following pop-up before you can access the data:

After our analysis, we now understand:

Navigation strategy: How to navigate the channel page to retrieve all videos using infinite scroll.
Video metadata extraction: How to extract video statistics, title, description, publish date, and other metadata from video pages.
Transcript access: How to obtain the correct transcript link.
Data formats: Transcript data is available in XML format, which is easier to parse than JSON3.
Regional considerations: Special handling is required for GDPR consent in European countries.

With this knowledge, we're ready to implement the YouTube scraper using Crawlee for Python.

3. Configuring Crawlee

Configuring Crawlee for YouTube is very similar to configuring it for TikTok, but with some key differences.

Since pages have infinite scrolling, we need to limit the number of elements we want to get. For this, we'll add a max_items parameter that will limit the maximum number of elements for each channel, and pass it in user_data when forming a Request.

We'll limit the intensity of scraping by setting max_tasks_per_minute in ConcurrencySettings. This will help us reduce the likelihood of being blocked by YouTube.

Scrolling pages can take a long time, so we’ll increase the time limit for processing a single request using request_handler_timeout.

Since we won't be saving images, videos, and similar media content during crawling, we can block requests to them using block_requests and pre_navigation_hook.

Also, to handle the GDPR page only once, we'll use use_state to pass the appropriate cookies between sessions, ensuring all requests have the necessary cookies.

# main.py

from datetime import timedelta

from apify import Actor

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import PlaywrightCrawler

from .hooks import pre_hook
from .routes import router


async def main() -> None:
    """The crawler entry point."""
    async with Actor:
        # Create a crawler instance with the router
        crawler = PlaywrightCrawler(
            # Limit scraping intensity by setting a limit on requests per minute
            concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50),
            # We'll configure the `router` in the next step
            request_handler=router,
            # Increase the timeout for the request handling pipeline
            request_handler_timeout=timedelta(seconds=120),
            # Runs browser without visual interface
            headless=True,
            # Limit requests per crawl for testing purposes
            max_requests_per_crawl=100,
        )

        # Set the maximum number of items to scrape per youtube channel
        max_items = 1
        # Set the list of channels to scrape
        channels = ['Apify']

        # Register a hook that prepares the context before navigation on each request
        crawler.pre_navigation_hook(pre_hook)

        await crawler.run(
            [
                Request.from_url(f'https://www.youtube.com/@{channel}/videos', user_data={'limit': max_items})
                for channel in channels
            ]
        )

Let's prepare the pre_hook function to block requests and set cookies (the cookie collection process will be explained in the extraction section):

# hooks.py

from crawlee.crawlers import PlaywrightPreNavCrawlingContext


async def pre_hook(context: PlaywrightPreNavCrawlingContext) -> None:
    """Prepare context before navigation."""
    crawler_state = await context.use_state()
    # Check if there are previously collected cookies in the crawler state and set them for the session
    if 'cookies' in crawler_state and context.session:
        cookies = crawler_state['cookies']
        # Set cookies for the session
        context.session.cookies.set_cookies_from_playwright_format(cookies)
    # Block requests to resources that aren't needed for parsing
    # This is similar to the default value, but we don't block `css` as it is needed for Player loading
    await context.block_requests(
        url_patterns=['.webp', '.jpg', '.jpeg', '.png', '.svg', '.gif', '.woff', '.pdf', '.zip']
    )

4. Extracting YouTube data

After configuration, let's move on to navigation and data extraction.

For infinite scrolling, we'll use the built-in helper function 'infinite_scroll'. But instead of waiting for scrolling to complete, which in some cases can take a really long time, we'll use Python's asyncio capabilities to make it a background task.

The GDPR page requiring consent for cookie usage is on the domain consent.youtube.com, which might cause an error when forming a Request for a video page. Therefore, we need to use a helper function for the transform_request_function parameter in extract_links.

This function will check each extracted URL. If it contains 'consent.youtube', we'll replace it with 'www.youtube'. This will allow us to get the correct URL for the video page.

# routes.py

from __future__ import annotations

import asyncio
import xml.etree.ElementTree as ET
from typing import TYPE_CHECKING

from yarl import URL

from crawlee import Request, RequestOptions, RequestTransformAction
from crawlee.crawlers import PlaywrightCrawlingContext
from crawlee.router import Router

if TYPE_CHECKING:
    from playwright.async_api import Request as PlaywrightRequest
    from playwright.async_api import Route as PlaywrightRoute

router = Router[PlaywrightCrawlingContext]()


def request_domain_transform(request_param: RequestOptions) -> RequestOptions | RequestTransformAction:
    """Transform request before adding it to the queue."""
    if 'consent.youtube' in request_param['url']:
        request_param['url'] = request_param['url'].replace('consent.youtube', 'www.youtube')
        return request_param
    return 'unchanged'
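
To see what the transform does in isolation, here is a self-contained sketch that runs without Crawlee. FakeRequestOptions is a hypothetical stand-in for RequestOptions, reduced to the single url key, and the video ID is made up.

```python
from typing import Literal, TypedDict, Union


class FakeRequestOptions(TypedDict):
    """Hypothetical stand-in for Crawlee's RequestOptions, reduced to the one key used here."""

    url: str


def request_domain_transform(options: FakeRequestOptions) -> Union[FakeRequestOptions, Literal['unchanged']]:
    """Rewrite consent-domain URLs to the main domain, leave everything else alone."""
    if 'consent.youtube' in options['url']:
        options['url'] = options['url'].replace('consent.youtube', 'www.youtube')
        return options
    return 'unchanged'


# A link resolved against the consent domain gets rewritten
redirected = request_domain_transform({'url': 'https://consent.youtube.com/watch?v=abc123'})
# A link that already points at the main domain is reported as unchanged
untouched = request_domain_transform({'url': 'https://www.youtube.com/watch?v=abc123'})
print(redirected, untouched)
```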

Let's implement a function that will intercept transcript requests for later modification and processing in the crawler:

# routes.py

async def extract_transcript_url(context: PlaywrightCrawlingContext) -> str | None:
    """Extract the transcript URL from request intercepted by Playwright."""
    # Create a Future to store the transcript URL
    transcript_future: asyncio.Future[str] = asyncio.Future()

    # Define a handler for the transcript request
    # This will be called when the page requests the transcript
    async def handle_transcript_request(route: PlaywrightRoute, request: PlaywrightRequest) -> None:
        # Set the result of the future with the transcript URL
        if not transcript_future.done():
            transcript_future.set_result(request.url)

        await route.fulfill(status=200)

    # Set up a route to intercept requests to the transcript API
    await context.page.route('**/api/timedtext**', handle_transcript_request)

    # Click the subtitles button to trigger the transcript request
    await context.page.click('.ytp-subtitles-button')

    # Wait for the transcript URL to be captured
    # The future will resolve when handle_transcript_request is called
    return await transcript_future
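
The Future-based capture above can be tried out without a browser. In this sketch, loop.call_later stands in for the requests that the page would fire after the click. The capture_first helper, the delays, and the example.com URLs are hypothetical, and the timeout shows how a caller can give up when no transcript request appears.

```python
from __future__ import annotations

import asyncio


async def capture_first(urls: list[str], timeout: float) -> str | None:
    """Return the first URL that looks like a transcript request, or None on timeout."""
    loop = asyncio.get_running_loop()
    future: asyncio.Future[str] = loop.create_future()

    def on_request(url: str) -> None:
        # Only the first matching request resolves the future
        if 'timedtext' in url and not future.done():
            future.set_result(url)

    # Simulate the page firing one request every 10 ms
    for index, url in enumerate(urls, start=1):
        loop.call_later(index * 0.01, on_request, url)

    try:
        return await asyncio.wait_for(future, timeout=timeout)
    except asyncio.TimeoutError:
        return None


simulated_requests = [
    'https://example.com/player.js',
    'https://example.com/api/timedtext?v=abc123&fmt=json3',
    'https://example.com/api/timedtext?v=other',
]
print(asyncio.run(capture_first(simulated_requests, timeout=1)))
```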

Now, let's create the main handler that will navigate to the channel page, perform infinite scrolling, and extract links to videos.

# routes.py

@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    """Handle requests that do not match any specific handler."""
    context.log.info(f'Processing {context.request.url} ...')

    # Get the limit from user_data, default to 10 if not set
    limit = context.request.user_data.get('limit', 10)
    if not isinstance(limit, int):
        raise TypeError('Limit must be an integer')

    # Wait for the page to load
    await context.page.locator('h1').first.wait_for(state='attached')

    # Check if there's a GDPR popup on the page requiring consent for cookie usage
    cookies_button = context.page.locator('button[aria-label*="Accept"]').first
    if await cookies_button.is_visible():
        await cookies_button.click()

        # Save cookies for later use with other sessions
        # You can learn more about `SOCS` cookies from - https://policies.google.com/technologies/cookies?hl=en-US
        cookies_state = [cookie for cookie in await context.page.context.cookies() if cookie['name'] == 'SOCS']
        crawler_state = await context.use_state()
        crawler_state['cookies'] = cookies_state

    # Wait until at least one video loads
    await context.page.locator('a[href*="watch"]').first.wait_for()

    # Create a background task for infinite scrolling
    scroll_task: asyncio.Task[None] = asyncio.create_task(context.infinite_scroll())
    # Scroll the page to the end until we reach the limit or finish scrolling
    while not scroll_task.done():
        # Extract links to videos
        requests = await context.extract_links(
            selector='a[href*="watch"]',
            label='video',
            transform_request_function=request_domain_transform,
            strategy='same-domain',
        )
        # Key the requests by `unique_key` to drop duplicates
        requests_map = {request.unique_key: request for request in requests}
        # If the limit is reached, cancel the scrolling task
        if len(requests_map) >= limit:
            scroll_task.cancel()
            break
        # Switch the asynchronous context to allow other tasks to execute
        await asyncio.sleep(0.2)
    else:
        # If the scroll task is done, we can safely assume that we have reached the end of the page
        requests = await context.extract_links(
            selector='a[href*="watch"]',
            label='video',
            transform_request_function=request_domain_transform,
            strategy='same-domain',
        )
        requests_map = {request.unique_key: request for request in requests}

    requests = list(requests_map.values())
    requests = requests[:limit]

    # Add the requests to the queue
    await context.enqueue_links(requests=requests)

Let's take a closer look at the parameters used in extract_links:

selector - selector for extracting links to videos. We expected that we could use id="video-title-link", but YouTube uses different page formats with different selectors, so the selector a[href*="watch"] will be more universal.
label - pointer for the router that will be used to handle the video page.
transform_request_function - function to transform the request before adding it to the queue. We use it to replace the domain consent.youtube with www.youtube, which helps avoid errors when processing the video page.
strategy - strategy for extracting links. We use same-domain to extract links to any subdomain of youtube.com.
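
The polling pattern in default_handler (run the scroll in the background, check progress, cancel once enough links are collected) can be sketched with plain asyncio. Here fake_infinite_scroll is a hypothetical stand-in for context.infinite_scroll that "loads" one item every 10 ms, and integers stand in for video links.

```python
import asyncio
import contextlib


async def fake_infinite_scroll(found: list[int], total: int) -> None:
    """Hypothetical stand-in for a long scroll: appends one item every 10 ms."""
    for item in range(total):
        await asyncio.sleep(0.01)
        found.append(item)


async def collect(limit: int, total: int) -> list[int]:
    """Poll the background task and stop it as soon as `limit` items are available."""
    found: list[int] = []
    scroll_task = asyncio.create_task(fake_infinite_scroll(found, total))
    while not scroll_task.done():
        if len(found) >= limit:
            scroll_task.cancel()
            break
        # Yield control so the background task can make progress
        await asyncio.sleep(0.005)
    # Let the cancellation (if any) finish before returning
    with contextlib.suppress(asyncio.CancelledError):
        await scroll_task
    return found[:limit]


print(asyncio.run(collect(limit=3, total=50)))
```

When the page has fewer items than the limit, the task simply finishes and everything found is returned, which corresponds to the else branch of the handler.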

Let's move on to the handler for video pages. In it, we'll extract video data and also look at how to get and process the video transcript link.

# routes.py

@router.handler('video')
async def video_handler(context: PlaywrightCrawlingContext) -> None:
    """Handle video requests."""
    context.log.info(f'Processing video {context.request.url} ...')
    # extract video data from the page
    video_data = await context.page.evaluate('window.ytInitialPlayerResponse')
    main_data = {
        'url': context.request.url,
        'title': video_data['videoDetails']['title'],
        'description': video_data['videoDetails']['shortDescription'],
        'channel': video_data['videoDetails']['author'],
        'channel_id': video_data['videoDetails']['channelId'],
        'video_id': video_data['videoDetails']['videoId'],
        'duration': video_data['videoDetails']['lengthSeconds'],
        'keywords': video_data['videoDetails']['keywords'],
        'view_count': video_data['videoDetails']['viewCount'],
        'like_count': video_data['microformat']['playerMicroformatRenderer']['likeCount'],
        'is_shorts': video_data['microformat']['playerMicroformatRenderer']['isShortsEligible'],
        'publish_date': video_data['microformat']['playerMicroformatRenderer']['publishDate'],
    }

    # Try to extract the transcript URL
    try:
        transcript_url = await asyncio.wait_for(extract_transcript_url(context), timeout=20)
    except asyncio.TimeoutError:
        transcript_url = None

    if transcript_url:
        transcript_url = str(URL(transcript_url).without_query_params('fmt'))
        context.log.info(f'Found transcript URL: {transcript_url}')
        await context.add_requests(
            [Request.from_url(transcript_url, label='transcript', user_data={'video_data': main_data})]
        )
    else:
        await context.push_data(main_data)

Note that if we want to extract the video transcript, we need to get the link to the transcript file and pass the video data to the next handler before it's saved to the Dataset.

The final stage is processing the transcript. YouTube uses XML to transmit transcript data, so we need to use a library to parse XML, such as xml.etree.ElementTree.

# routes.py

@router.handler('transcript')
async def transcript_handler(context: PlaywrightCrawlingContext) -> None:
    """Handle transcript requests."""
    context.log.info(f'Processing transcript {context.request.url} ...')

    # Get the main video data extracted in `video_handler`
    video_data = context.request.user_data.get('video_data', {})

    try:
        # Get XML data from the response
        root = ET.fromstring(await context.response.text())

        # Extract text elements from XML
        transcript_data = [text_element.text.strip() for text_element in root.findall('.//text') if text_element.text]

        # Enrich video data by adding the transcript
        video_data['transcript'] = '\n'.join(transcript_data)

        # Save the data to Dataset
        await context.push_data(video_data)
    except ET.ParseError:
        context.log.warning('Incorrect XML response')
        # Save the video data without the transcript
        await context.push_data(video_data)
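
Here is a standalone sketch of the parsing step. The sample XML is a hand-written approximation of the transcript format (text elements with start and dur attributes), not a captured response. It also applies html.unescape, an optional extra step that is not part of the handler above, to turn leftover entities such as &#39; into plain characters.

```python
import html
import xml.etree.ElementTree as ET

# Hand-written sample that approximates the transcript XML; not real YouTube output
SAMPLE_XML = """<transcript>
  <text start="0.0" dur="1.5">Hi, Theo here.</text>
  <text start="1.5" dur="2.0">If you&amp;#39;re reselling,</text>
  <text start="3.5" dur="1.0"></text>
</transcript>"""


def parse_transcript(xml_text: str) -> str:
    """Join the non-empty text elements into one newline-separated transcript."""
    root = ET.fromstring(xml_text)
    lines = [html.unescape(element.text.strip()) for element in root.findall('.//text') if element.text]
    return '\n'.join(lines)


print(parse_transcript(SAMPLE_XML))
```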

After collecting the data, we need to save the results to a file. Just add the following code to the end of the main function in main.py:

# main.py

# Export the data from Dataset to JSON format
await crawler.export_data_json('youtube.json')

To run the crawler, use the command:

uv run python -m youtube_crawlee

Example result record:

{
  "url": "https://www.youtube.com/watch?v=r-1J94tk5Fo",
  "title": "Facebook Marketplace API - Scrape Data Based on LOCATION, CATEGORY and SEARCH",
  "description": "See how you can export Facebook Marketplace listings to Excel, CSV or JSON with the Facebook Marketplace API 🛍️ Input one or more URLs to scrape price, description, images, delivery info, seller data, location, listing status, and much more 📊\n\nWith the Facebook Marketplace Downloader, you can:\n🛒 **Extract listings and seller details** from any public Marketplace category or search query.\n📷 **Scrape product details**, including images, prices, descriptions, locations, and timestamps.\n💰 **Get thousands of marketplace listings** quickly and efficiently.\n📦 **Export results** via API or in JSON, CSV, or Excel with all listing details.\n\n🛍️ Facebook Marketplace Search API 👉 https://apify.it/3E5NLz4\n📱 Explore other Facebook Scrapers 👉 https://apify.it/43Bae1f\n\n*Why scrape Facebook Marketplace data?* 🤔\n💰 Price & Demand Analysis – Track product pricing trends and demand fluctuations.\n📊 Competitor Insights – Monitor listings from competitors to adjust pricing and strategy.\n📍 Location-Based Market Trends – Identify popular products in specific regions.\n🔎 Product Availability Monitoring – Detect shortages or oversupply in certain categories.\n📈 Reselling Opportunities – Find underpriced items for profitable flips.\n🛍 Consumer Behavior Insights – Understand what products and features attract buyers.\n💡 Trend Spotting – Discover emerging products before they go mainstream.\n📝 Market Research – Gather data for academic, business, or personal research.\n\n*How to* scrape *facebook marketplace? 🧑‍🏫* \nStep 1. Find the Facebook Marketplace dataset tool on Apify Store\nStep 2: Click ‘Try for free’\nStep 3: Input a URL\nStep 4: Fine tune the input\nStep 5: Start the Actor and get your data!\n\n*Useful links 🧑‍💻*\n📚 Read more about Scraping Facebook data: https://apify.it/43wyth9\n🧑‍💻 Sign up for Apify: https://apify.it/42e8nNu\n🧩 Integrate the Actor with other tools: https://apify.it/43Ustiz\n📱 Browse other Social Media Scrapers on Apify Store: https://apify.it/4jhq7i8\n\n*Follow us 🤳*\nhttps://www.linkedin.com/company/apifytech\nhttps://twitter.com/apify\nhttps://www.tiktok.com/@apifytech\nhttps://discord.com/invite/jyEM2PRvMU\n\n*Timestamps ⌛️*\n00:00 Introduction\n01:27 Input\n02:17 Run\n02:26 Export\n02:41 Scheduling\n02:54 Integrations\n03:00 API\n03:13 Other Meta Scrapers\n03:26 Like and subscribe!\n\n#webscraping #instagram",
  "channel": "Apify",
  "channel_id": "UCTgwcoeGGKmZ3zzCXN2qo_A",
  "video_id": "r-1J94tk5Fo",
  "duration": "226",
  "keywords": [
    "web scraping platform",
    "web automation",
    "scrapers",
    "Apify",
    "web crawling",
    "web scraping",
    "data extraction",
    "best web scraping tool",
    "API",
    "how to extract data from any website",
    "web scraping tutorial",
    "web scrape",
    "data collection tool",
    "RPA",
    "web integration",
    "how to turn website into API",
    "JSON",
    "python web scraping",
    "web scraping python",
    "web api integration",
    "how to turn website into api",
    "scraping",
    "apify",
    "data extraction tools",
    "how to web scrape",
    "web scraping javascript",
    "web scraping tool"
  ],
  "view_count": "765",
  "like_count": "8",
  "is_shorts": false,
  "publish_date": "2025-04-03T05:33:18-07:00",
  "transcript": "Hi, Theo here. In this video, I’ll \nshow you how to scrape structured\ndata from Facebook Marketplace by location, \ncategory, or specific search query. You’ll\nbe able to extract listing details like price, \ndescription, images, delivery info, seller data,\nlocation, and listing status — using a \ntool called Facebook Marketplace Scraper.\nHere’s what you can do with it. \nIf you&#39;re reselling, flipping,\nor deal hunting, scraping helps you track \nprices, spot trends, and catch underpriced\nor free items early. Looking for a rental \nor house? Compare listings across cities,\ncheck historical prices, and avoid wasting \ntime on overpriced options. Selling on\nMarketplace? Analyze top-performing listings, \noptimize keywords, and price competitively.\nFor businesses, scraping \nenables competitor tracking,\ndynamic pricing, real estate \nresearch, fraud detection,\nand brand protection — like spotting counterfeit \nor unauthorized listings before they do damage.\nThe best part is you don’t need to \njump through hoops to get this data:\nFacebook Marketplace Scraper makes things simple: \nno login, no cookies, no browser extension.\nIt runs in the cloud, and you can export \nresults in JSON, CSV, Excel — or use the API.\nLet’s see how it works.\nFirst, head to the link in the description, \nwhich’ll take you to Facebook Marketplace\nScraper’s README. Click on `try \nfor free`, which will send you to\nthe `Login page` and you can get started \nwith a free Apify account - don’t worry,\nthere’s no limit on the free plan and \nno credit card will ever be required.\nAfter logging in, you’ll land on the Actor’s \ninput page. While you can configure this through\neither the intuitive UI or JSON, we’ll \nstick with the UI option to keep it easy.\nFor scraping Facebook Marketplace, you’re gonna \nneed the URL from Facebook. You can use a URL of\na search term, location or an item category. For \nthis tutorial, we’re gonna go with an iPhone. So\nlet’s open up Facebook Marketplace, input a search \nterm and then copy the URL from the toolbar and\npaste it in the input. You can add more via the \nadd button, edit them in bulk or import the URLs\nas a text file. Next, you can limit how many \nposts you want to scrape. And that’s it.\nBefore running your Actor, it’s a great idea \nto save your configuration and create a task.\nThis will come in handy for scheduling or \nintegrating your Actor with other tools,\nor if you plan to work with \nmultiple configurations.\nNow that we have the `input`, let’s run \nthe Actor by hitting START. You can watch\nyour results appear in Overview or switch to \nthe Log tab to see more details about run.\nNow that your run is finished, we can get the \ndata via the Export button. You can choose your\npreffered format, and select which fields you want \nto include or exclude in your dataset. Then just\nhit Download and you have your dataset file. Let \nme show you what this looks like in JSON format.\nIf you want to automate your workflow \neven more, you can schedule your Facebook\nMarketplace Scraper to run at regular intervals. \nChoose your task and hit schedule. You can set\nthe frequency of how often you want to run \nthe Actor. You can even connect your Actor\nto other cloud services, such as Google \nDrive, Make, or any other Apify Actor.\nYou can also run this scraper locally via \nAPI. You can find the code in Node.js,\nPython, or curl in the API \ndrop down menu in the top-right\ncorner. To learn more about retrieving data \nprogramatically, check out our video on it.\nNeed more Facebook or Instagram data? \nCheck out our other scrapers in Apify\nStore. We have got dozens of meta \nscrapers, links are in the description.\nIf you prefer video tutorials, we have a playlist \ncovering different Instagram scraping use cases.\nAnd that’s all for today! Let us know what you \nthink about the Facebok Marketplace Scraper.\nRemember, if you come across any issues, make \nsure to report them to our team in Apify Console.\nIf you found this helpful, give us a thumbs \nup and subscribe. Don&#39;t forget to hit the\nbell to stay updated on new tutorials. Thanks for \nwatching! So long, and thanks for all the likes"
}

5. Enhancing the scraper capabilities

As with any project working with a large site like YouTube, you may encounter various issues that need to be resolved.
Currently, the Crawlee for Python documentation contains many guides and examples to help you with this.

Use Camoufox, a project compatible with Playwright, which allows you to get a browser configuration that's more resistant to blocking, and you can easily integrate it with Crawlee for Python.
Improve error handling and logging for unusual cases so you can easily debug and maintain the project; the guide on error handling is a good place to start.
Add proxy support to avoid blocks from YouTube. You can use Apify Proxy and ProxyConfiguration; you can learn more in this guide in the documentation.
Make your crawler a web service that crawls pages by user request, using FastAPI and following this guide.

6. Creating YouTube Actor on the Apify platform

For deployment, we'll use the Apify platform. It's a simple and effective environment for cloud deployment, allowing efficient interaction with your crawler. Call it via API, schedule tasks, integrate with various services, and much more.

To deploy to the Apify platform, we need to adapt our project for the Apify Actor structure.

Create an .actor folder with the necessary files.

mkdir .actor && touch .actor/{actor.json,input_schema.json}

Move the Dockerfile from the root folder to .actor.

mv Dockerfile .actor

Let's fill in the empty files:

The actor.json file contains project metadata for the Apify platform. Follow the documentation for proper configuration:

{
  "actorSpecification": 1,
  "name": "YouTube-Crawlee",
  "title": "YouTube - Crawlee",
  "minMemoryMbytes": 2048,
  "description": "Scrape video stats, metadata and transcripts from videos in YouTube channels",
  "version": "0.1",
  "meta": {
    "templateId": "youtube-crawlee"
  },
  "input": "./input_schema.json",
  "dockerfile": "./Dockerfile"
}

Actor input parameters are defined using input_schema.json, which is specified here.

Let's define input parameters for our crawler:

maxItems - maximum number of videos per channel for scraping.
channelNames - these are the YouTube channel names to scrape.
proxySettings - proxy settings, since without a proxy, you'll be using the datacenter IP that Apify uses.

{
    "title": "YouTube Crawlee",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "channelNames": {
            "title": "List Channel Names",
            "type": "array",
            "description": "Channel names for extraction video stats, metadata and transcripts.",
            "editor": "stringList",
            "prefill": ["Apify"]
        },
        "maxItems": {
            "type": "integer",
            "editor": "number",
            "title": "Limit search results",
            "description": "Limits the maximum number of results, applies to each search separately.",
            "default": 10
        },
        "proxySettings": {
            "title": "Proxy configuration",
            "type": "object",
            "description": "Select proxies to be used by your scraper.",
            "prefill": { "useApifyProxy": true },
            "editor": "proxy"
        }
    },
    "required": ["channelNames"]
}

Let's update the code to accept input parameters.

# main.py

async def main() -> None:
    """The crawler entry point."""
    async with Actor:
        # highlight-start
        # Get the input parameters from the Actor
        actor_input = await Actor.get_input()

        max_items = actor_input.get('maxItems', 0)
        channels = actor_input.get('channelNames', [])
        proxy = await Actor.create_proxy_configuration(actor_proxy_input=actor_input.get('proxySettings'))
        # highlight-end

        crawler = PlaywrightCrawler(
            concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50),
            request_handler=router,
            request_handler_timeout=timedelta(seconds=120),
            headless=True,
            max_requests_per_crawl=100,
            proxy_configuration=proxy
        )

Also remove the JSON export from the main function, since the Apify platform stores the data in the Dataset for you.

That's it, the project is ready for deployment.

7. Deploying to Apify

Use the official Apify CLI to upload your code:

Authenticate using your API token from Apify Console:

apify login

Choose "Enter API token manually" and paste your token.

Push the project to the platform:

apify push

Now you can configure runs on the Apify platform.

Let's perform a test run:

Fill in the input parameters:

View results in the dataset:

If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this publishing guide for Apify Store.

Conclusion

We've created a good foundation for crawling YouTube using Crawlee for Python and Playwright. If you're just starting your journey in crawling, this will be an excellent project for learning and practice. You can use it as a basis for creating more complex crawlers that will collect data from YouTube. If this is your first project using Crawlee for Python, check out all the documentation links provided in this article; it will help you better understand how Crawlee for Python works and how you can use it for your projects.

You can find the complete code in the repository.

If you enjoyed this blog, feel free to support Crawlee for Python by starring the repository or joining the maintainer team.

Do you have questions or want to discuss the details of the implementation? Join our Discord—our community of 11,000+ developers is there to help.

How to scrape TikTok using Python

Max Bohomolov — Wed, 30 Apr 2025 15:25:08 +0000

TikTok users generate tons of data that are valuable for analysis.

Which hashtags are trending now? What is an influencer's engagement rate? What topics are important for a content creator? You can find answers to these and many other questions by analyzing TikTok data. However, for analysis, you need to extract the data in a convenient format. In this blog, we'll explore how to scrape TikTok using Crawlee for Python.

Note: One of our community members wrote this blog as a contribution to the Crawlee Blog. If you'd like to contribute articles like these, please reach out to us on our Discord channel.

Key steps we'll cover:

Project setup
Analyzing TikTok and determining a scraping strategy
Configuring Crawlee
Extracting TikTok data
Creating TikTok Actor on the Apify platform
Deploying to Apify

Prerequisites

Python 3.9 or higher
Familiarity with web scraping concepts
Crawlee for Python v0.6.0 or higher
uv v0.6 or higher
An Apify account

1. Project setup

Note: Before going ahead with the project, I'd like to ask you to star Crawlee for Python on GitHub; it helps us spread the word to fellow scraper developers.

In this project, we'll use uv for package management, and a specific Python version will be installed through uv. uv is a fast and modern package manager written in Rust.

If you don't have uv installed yet, just follow the guide or use this command:

curl -LsSf https://astral.sh/uv/install.sh | sh

To create the project, run:

uvx 'crawlee[cli]' create tiktok-crawlee

In the CLI menu that opens, select:

Playwright
Httpx
uv
Leave the default value - https://crawlee.dev
y

Or, just run the command:

uvx 'crawlee[cli]' create tiktok-crawlee --crawler-type playwright --http-client httpx --package-manager uv --apify --start-url 'https://crawlee.dev'

Creating the project may take a few minutes. After installation is complete, navigate to the project folder:

cd tiktok-crawlee

2. Analyzing TikTok and determining a scraping strategy

TikTok uses quite a lot of JavaScript on its site, both for displaying content and for analyzing user behavior, including detecting and blocking crawlers. Therefore, for crawling TikTok, we'll use a headless browser with Playwright.

To load new elements on a user's page, TikTok uses infinite scrolling. You may already be familiar with this method from this article.

Let's look at what happens under the hood when we scroll a TikTok page. I recommend studying network activity in DevTools to understand what requests are going to the server.

Let's examine the HTML structure to understand if navigating to elements will be difficult.

Well, this looks quite simple. If using CSS selectors, [data-e2e="user-post-item"] a is sufficient.

Let's look at what a video page response looks like to see what data we can extract.

It seems that the HTML code contains JSON with all the data we're interested in. Great!
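As a rough illustration of this idea outside the browser, the sketch below pulls a JSON payload out of a script tag using only the standard library. The sample HTML is made up for the example; the element id and key path match the ones the video handler uses later, and `extract_embedded_json` is a hypothetical helper, not part of Crawlee.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class EmbeddedJsonParser(HTMLParser):
    """Collect the text inside the <script> tag that has the given id."""

    def __init__(self, script_id: str) -> None:
        super().__init__()
        self._script_id = script_id
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('id') == self._script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    @property
    def payload(self) -> str:
        return ''.join(self._chunks).strip()


def extract_embedded_json(
    html: str, script_id: str = '__UNIVERSAL_DATA_FOR_REHYDRATION__'
) -> Optional[dict]:
    """Return the parsed JSON from the script tag, or None if it is absent."""
    parser = EmbeddedJsonParser(script_id)
    parser.feed(html)
    parser.close()
    return json.loads(parser.payload) if parser.payload else None


# Made-up page fragment, for illustration only
SAMPLE_HTML = """
<html><body>
<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__" type="application/json">
{"__DEFAULT_SCOPE__": {"webapp.video-detail": {"itemInfo": {"itemStruct": {"desc": "demo"}}}}}
</script>
</body></html>
"""

data = extract_embedded_json(SAMPLE_HTML)
item = data['__DEFAULT_SCOPE__']['webapp.video-detail']['itemInfo']['itemStruct']
print(item['desc'])
```

In the crawler itself, Playwright's selector does this lookup for us, so the parser class is only needed if you process saved HTML offline.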

3. Configuring Crawlee

Now that we understand our scraping strategy, let's set up Crawlee for scraping TikTok.

Since pages have infinite scrolling, we need to limit the number of elements we want to get. For this, we'll add a max_items parameter that will limit the maximum number of elements for each search and pass it in user_data when forming a Request.

We'll limit the intensity of scraping by setting max_tasks_per_minute in ConcurrencySettings. This will help us reduce the likelihood of being blocked by TikTok.

We'll set browser_type to firefox, as it performed better for TikTok in my tests.
TikTok may request permissions to access device data, so we'll explicitly limit all permissions by passing the appropriate parameter to browser_new_context_options.

Scrolling pages can take a long time, so we should increase the time limit for processing a single request using request_handler_timeout.

# main.py

from datetime import timedelta

from apify import Actor

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import PlaywrightCrawler

from .routes import router

async def main() -> None:
    """The crawler entry point."""
    # When creating the template, we confirmed Apify integration.
    # However, this isn't important for us at this stage.
    async with Actor:

        max_items = 20

        # Create a crawler with the necessary settings
        crawler = PlaywrightCrawler(
            # Limit scraping intensity by setting a limit on requests per minute
            concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50),
            # We'll configure the `router` in the next step
            request_handler=router,
            # You can use `False` during development. But for production, it's always `True`
            headless=True,
            max_requests_per_crawl=100,
            # Increase the timeout for the request handling pipeline
            request_handler_timeout=timedelta(seconds=120),
            browser_type='firefox',
            # Limit any permissions to device data
            browser_new_context_options={'permissions': []},
        )

        # Run the crawler to collect data from several user pages
        await crawler.run(
            [
                Request.from_url('https://www.tiktok.com/@apifyoffice', user_data={'limit': max_items}),
                Request.from_url('https://www.tiktok.com/@authorbrandonsanderson', user_data={'limit': max_items}),
            ]
        )

Someone might ask, "What about configurations to avoid fingerprint blocking?!!!" My answer is, "Crawlee for Python has already done that for you."

Depending on your deployment environment, you may need to add a proxy. We'll come back to this in the last section.

4. Extracting TikTok data

After configuration, let's move on to navigation and data extraction.

On closer investigation, you may also encounter a TikTok page that doesn't load user videos and shows only a button and an error message.

It's very important to handle this case.

During testing, I also discovered that you need to interact with the page before scrolling; otherwise, new elements don't load when using infinite_scroll. I think this is a TikTok bug.

Let's start with a simple function to extract video links. It will help avoid code duplication.

# routes.py

import asyncio
import json

from playwright.async_api import Page

from crawlee import Request
from crawlee.crawlers import PlaywrightCrawlingContext
from crawlee.router import Router

router = Router[PlaywrightCrawlingContext]()


# Helper function that extracts all loaded video links
async def extract_video_links(page: Page) -> list[Request]:
    """Extract all loaded video links from the page."""
    links = []
    for post in await page.query_selector_all('[data-e2e="user-post-item"] a'):
        post_link = await post.get_attribute('href')
        if post_link and '/video/' in post_link:
            links.append(Request.from_url(post_link, label='video'))
    return links

Now we can move on to the main handler that will process TikTok user pages.

# routes.py

# Main handler used for TikTok user pages
@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    """Handle request without specific label."""
    context.log.info(f'Processing {context.request.url} ...')

    # Get the limit for video elements from `user_data`
    limit = context.request.user_data.get('limit', 10)
    if not isinstance(limit, int):
        raise TypeError('Limit must be an integer')

    # Wait until the button or at least a video loads, if the connection is slow
    check_locator = context.page.locator('[data-e2e="user-post-item"], main button').first
    await check_locator.wait_for()

    # If the button loaded, click it to initiate video loading
    if button := await context.page.query_selector('main button'):
        await button.click()

    # Perform interaction with scrolling
    await context.page.press('body', 'PageDown')

    # Start `infinite_scroll` as a background task
    scroll_task: asyncio.Task[None] = asyncio.create_task(context.infinite_scroll())

    # Wait until scrolling is completed or until the limit is reached
    while not scroll_task.done():
        requests = await extract_video_links(context.page)
        # If we've already reached the limit, interrupt scrolling and exit the loop
        if len(requests) >= limit:
            scroll_task.cancel()
            break
        # Switch the asynchronous context to allow other tasks to execute
        await asyncio.sleep(0.2)
    else:
        requests = await extract_video_links(context.page)

    # Limit the number of requests to the limit value
    requests = requests[:limit]

    # If the page wasn't properly processed for some reason and didn't find any links,
    # then I want to raise an error for retry
    if not requests:
        raise RuntimeError('No video links found')

    await context.add_requests(requests)

The final stage is handling the video page.

# routes.py

@router.handler(label='video')
async def video_handler(context: PlaywrightCrawlingContext) -> None:
    """Handle request with the label 'video'."""
    context.log.info(f'Processing video {context.request.url} ...')

    # Extract the element containing JSON with data
    json_element = await context.page.query_selector('#__UNIVERSAL_DATA_FOR_REHYDRATION__')
    if json_element:
        # Extract JSON and convert it to a dictionary
        text_data = await json_element.text_content()
        json_data = json.loads(text_data)

        data = json_data['__DEFAULT_SCOPE__']['webapp.video-detail']['itemInfo']['itemStruct']

        # Create result item
        result_item = {
            'author': {
                'nickname': data['author']['nickname'],
                'id': data['author']['id'],
                'handle': data['author']['uniqueId'],
                'signature': data['author']['signature'],
                'followers': data['authorStats']['followerCount'],
                'following': data['authorStats']['followingCount'],
                'hearts': data['authorStats']['heart'],
                'videos': data['authorStats']['videoCount'],
            },
            'description': data['desc'],
            'tags': [item['hashtagName'] for item in data.get('textExtra', []) if item.get('hashtagName')],
            'hearts': data['stats']['diggCount'],
            'shares': data['stats']['shareCount'],
            'comments': data['stats']['commentCount'],
            'plays': data['stats']['playCount'],
        }

        # Save the result to the dataset
        await context.push_data(result_item)
    else:
        # If the data wasn't received, we raise an error for retry
        raise RuntimeError('No JSON data found')

The crawler is ready for local launch. To run it, execute the command:

uv run python -m tiktok_crawlee

You can view the saved results in the dataset folder, path ./storage/datasets/default/.

Example record:

{
  "author": {
    "nickname": "apifyoffice",
    "id": "7095709566285480965",
    "handle": "apifyoffice",
    "signature": "🤖 web scraping and AI 🤖\n\ncheck out our open positions at ✨apify.it/jobs✨",
    "followers": 118,
    "following": 3,
    "hearts": 1975,
    "videos": 33
  },
  "description": "\"Fun\" is the top word Apifiers used to describe our culture. Here's what else came to their minds 🎤  #workculture #teambuilding #interview #czech #ilovemyjob ",
  "tags": [
    "workculture",
    "teambuilding",
    "interview",
    "czech",
    "ilovemyjob"
  ],
  "hearts": 7,
  "shares": 1,
  "comments": 1,
  "plays": 448
}

5. Creating TikTok Actor on the Apify platform

To deploy to the Apify platform, we need to adapt our project for the Apify Actor structure.

Create an .actor folder with the necessary files.

mkdir .actor && touch .actor/{actor.json,input_schema.json}

Move the Dockerfile from the root folder to .actor.

mv Dockerfile .actor

Let's fill in the empty files:

The actor.json file contains project metadata for the Apify platform. Follow the documentation for proper configuration:

{
  "actorSpecification": 1,
  "name": "TikTok-Crawlee",
  "title": "TikTok - Crawlee",
  "minMemoryMbytes": 2048,
  "description": "Scrape video elements from TikTok user pages",
  "version": "0.1",
  "meta": {
    "templateId": "tiktok-crawlee"
  },
  "input": "./input_schema.json",
  "dockerfile": "./Dockerfile"
}

Actor input parameters are defined using input_schema.json, which is specified here.

Let's define input parameters for our crawler:

maxItems - the maximum number of videos to collect from each user page.
urls - links to TikTok user pages, the starting points for our crawler.
proxySettings - proxy settings, since without a proxy you'll be using the datacenter IP that Apify uses.

{
    "title": "TikTok Crawlee",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "urls": {
            "title": "List URLs",
            "type": "array",
            "description": "Direct URLs to pages TikTok profiles.",
            "editor": "stringList",
            "prefill": ["https://www.tiktok.com/@apifyoffice"]
        },
        "maxItems": {
            "type": "integer",
            "editor": "number",
            "title": "Limit search results",
            "description": "Limits the maximum number of results, applies to each search separately.",
            "default": 10
        },
        "proxySettings": {
            "title": "Proxy configuration",
            "type": "object",
            "description": "Select proxies to be used by your scraper.",
            "prefill": { "useApifyProxy": true },
            "editor": "proxy"
        }
    },
    "required": ["urls"]
}

Let's update the code to accept input parameters.

# main.py

from datetime import timedelta

from apify import Actor
from crawlee.crawlers import PlaywrightCrawler
from crawlee import ConcurrencySettings
from crawlee import Request

from .routes import router

async def main() -> None:
    """The crawler entry point."""
    async with Actor:
        # highlight-start
        # Accept input parameters passed when starting the Actor
        actor_input = await Actor.get_input() or {}

        # Fall back to the schema default; a limit of 0 would leave the handler with no links
        max_items = actor_input.get('maxItems', 10)
        requests = [Request.from_url(url, user_data={'limit': max_items}) for url in actor_input.get('urls', [])]
        proxy = await Actor.create_proxy_configuration(actor_proxy_input=actor_input.get('proxySettings'))
        # highlight-end

        crawler = PlaywrightCrawler(
            concurrency_settings=ConcurrencySettings(max_tasks_per_minute=50),
            proxy_configuration=proxy,
            request_handler=router,
            headless=True,
            request_handler_timeout=timedelta(seconds=120),
            browser_type='firefox',
            browser_new_context_options={'permissions': []}
        )

        await crawler.run(requests)
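Input arriving from outside is worth checking before it reaches the crawler. The sketch below shows one way to normalize the raw input dictionary, assuming the field names from the input schema above. `normalize_input` and `DEFAULT_MAX_ITEMS` are hypothetical names introduced for illustration; they are not part of the Apify SDK.

```python
from typing import Optional

# Assumption: mirrors the "default" value declared in input_schema.json
DEFAULT_MAX_ITEMS = 10


def normalize_input(raw: Optional[dict]) -> dict:
    """Validate raw Actor input and return a dictionary with safe values."""
    data = raw or {}

    max_items = data.get('maxItems', DEFAULT_MAX_ITEMS)
    # `bool` is a subclass of `int`, so exclude it explicitly
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 1:
        raise ValueError('maxItems must be a positive integer')

    # Keep only non-empty string URLs, without surrounding whitespace
    urls = [u.strip() for u in data.get('urls', []) if isinstance(u, str) and u.strip()]
    if not urls:
        raise ValueError('At least one URL is required')

    return {
        'max_items': max_items,
        'urls': urls,
        'proxy_settings': data.get('proxySettings'),
    }


print(normalize_input({'urls': ['https://www.tiktok.com/@apifyoffice'], 'maxItems': 5}))
```

Failing early with a clear message is usually easier to debug than a crawler that starts and then finds no links.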

That's it, the project is ready for deployment.

6. Deploying to Apify

Use the official Apify CLI to upload your code:

Authenticate using your API token from Apify Console:

apify login

Choose "Enter API token manually" and paste your token.

Push the project to the platform:

apify push

Now you can configure runs on the Apify platform.

Let's perform a test run:

Fill in the input parameters:

Check that logging works correctly:

View results in the dataset:

If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this publishing guide for Apify Store.

Conclusion

We've created a good foundation for crawling TikTok using Crawlee for Python and Playwright. If you want to improve the project, I would recommend adding error handling and handling cases when you get a CAPTCHA to reduce the likelihood of being blocked by TikTok. However, this is a good foundation to start working with TikTok. It allows you to get data right now.

You can find the complete code in the repository.

If you enjoyed this blog, feel free to support Crawlee for Python by starring the repository or joining the maintainer team.

Have questions or want to discuss implementation details? Join our Discord - our community of 10,000+ developers is there to help.

How to scrape Bluesky with Python

Max Bohomolov — Fri, 21 Mar 2025 12:58:22 +0000

Bluesky is an emerging social network developed by former members of the Twitter (now X) development team. The platform has been showing significant growth recently, reaching 140.3 million visits according to SimilarWeb. Like X, Bluesky generates a vast amount of data that can be used for analysis. In this article, we’ll explore how to collect this data using Crawlee for Python.

Note: One of our community members wrote this blog as a contribution to the Crawlee Blog. If you’d like to contribute articles like these, please reach out to us on our Discord channel.

Key steps we will cover:

Project setup
Development of the Bluesky crawler in Python
Create Apify Actor for Bluesky crawler
Conclusion and repository access

Prerequisites

Basic understanding of web scraping concepts
Python 3.9 or higher
UV version 0.6.0 or higher
Crawlee for Python v0.6.5 or higher
Bluesky account for API access

Project setup

In this project, we’ll use UV for package management and a specific Python version installed through UV. UV is a fast and modern package manager written in Rust.

If you don’t have UV installed yet, follow the guide or use this command:
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Install standalone Python using UV:
```
uv python install 3.13
```

Create a new project and install Crawlee for Python:

uv init bluesky-crawlee --package
cd bluesky-crawlee
uv add crawlee

We’ve created a new isolated Python project with all the necessary dependencies for Crawlee.

Development of the Bluesky crawler in Python

Note: Before going ahead with the project, I'd like to ask you to star Crawlee for Python on GitHub; it helps us spread the word to fellow scraper developers.

1. Identifying the data source

When accessing the search page, you'll see data displayed, but be aware of a key limitation: the site only allows viewing the first page of results, preventing access to any additional pages.

Fortunately, Bluesky provides a well-documented API that is accessible to any registered user without additional permissions. This is what we’ll use for data collection.

2. Creating a session for API interaction

For secure API interaction, you need to create a dedicated app password instead of using your main account password.

Go to Settings -> Privacy and Security -> App Passwords and click Add App Password.
Important: Save the generated password, as it won’t be visible after creation.

Next, create environment variables to store your credentials:

Your application password
Your user identifier (found in your profile and Bluesky URL, for example: mantisus.bsky.social)

export BLUESKY_APP_PASSWORD=your_app_password
export BLUESKY_IDENTIFIER=your_identifier

Using the createSession and deleteSession endpoints together with httpx, we can create a session for API interaction.

Let us create a class with the necessary methods:

import asyncio
import json
import os
import traceback

import httpx
from yarl import URL

from crawlee import ConcurrencySettings, Request
from crawlee.configuration import Configuration
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storages import Dataset

# Environment variables for authentication
# BLUESKY_APP_PASSWORD: App-specific password generated from Bluesky settings
# BLUESKY_IDENTIFIER: Your Bluesky handle (e.g., username.bsky.social)
BLUESKY_APP_PASSWORD = os.getenv('BLUESKY_APP_PASSWORD')
BLUESKY_IDENTIFIER = os.getenv('BLUESKY_IDENTIFIER')


class BlueskyApiScraper:
    """A scraper class for extracting data from Bluesky social network using their official API.

    This scraper manages authentication, concurrent requests, and data collection for both
    posts and user profiles. It uses separate datasets for storing post and user information.
    """

    def __init__(self) -> None:
        self._crawler: HttpCrawler | None = None

        self._users: Dataset | None = None
        self._posts: Dataset | None = None

        # Variables for storing session data
        self._service_endpoint: str | None = None
        self._user_did: str | None = None
        self._access_token: str | None = None
        self._refresh_token: str | None = None
        self._handle: str | None = None

    def create_session(self) -> None:
        """Create credentials for the session."""
        url = 'https://bsky.social/xrpc/com.atproto.server.createSession'
        headers = {
            'Content-Type': 'application/json',
        }
        data = {'identifier': BLUESKY_IDENTIFIER, 'password': BLUESKY_APP_PASSWORD}

        response = httpx.post(url, headers=headers, json=data)
        response.raise_for_status()

        data = response.json()

        self._service_endpoint = data['didDoc']['service'][0]['serviceEndpoint']
        self._user_did = data['didDoc']['id']
        self._access_token = data['accessJwt']
        self._refresh_token = data['refreshJwt']
        self._handle = data['handle']

    def delete_session(self) -> None:
        """Delete the current session."""
        url = f'{self._service_endpoint}/xrpc/com.atproto.server.deleteSession'
        headers = {'Content-Type': 'application/json', 'authorization': f'Bearer {self._refresh_token}'}

        response = httpx.post(url, headers=headers)
        response.raise_for_status()

The session expires after 2 hours, so if you plan for your crawler to run longer, you should also add a method for refresh.
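One way to decide when a refresh is due is to read the expiry time from the access token itself. The sketch below assumes the access token is a standard JWT carrying an `exp` claim; `jwt_expiry` and `needs_refresh` are hypothetical helpers, and the token is only decoded, not verified. The refresh call itself is not shown.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> Optional[int]:
    """Return the `exp` claim (Unix seconds) of a JWT, or None if unreadable."""
    try:
        payload_b64 = token.split('.')[1]
        # Restore the base64 padding that JWTs strip
        padded = payload_b64 + '=' * (-len(payload_b64) % 4)
        payload = json.loads(base64.urlsafe_b64decode(padded))
    except (IndexError, ValueError):
        return None
    if not isinstance(payload, dict):
        return None
    exp = payload.get('exp')
    return exp if isinstance(exp, int) else None


def needs_refresh(token: str, now: Optional[float] = None, margin: int = 300) -> bool:
    """Return True if the token expires within `margin` seconds or cannot be read."""
    exp = jwt_expiry(token)
    if exp is None:
        return True
    current = time.time() if now is None else now
    return current >= exp - margin


# Build a fake token for demonstration; a real one comes from createSession
body = base64.urlsafe_b64encode(json.dumps({'exp': 1000}).encode()).decode().rstrip('=')
demo_token = f'header.{body}.signature'
print(needs_refresh(demo_token, now=600), needs_refresh(demo_token, now=800))
```

A long-running crawler could call `needs_refresh(self._access_token)` before each batch of requests and renew the session when it returns True.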

3. Configuring Crawlee for Python for data collection

Since we’ll be using the official API, we do not need to worry about being blocked by Bluesky. However, we should be careful with the number of requests to avoid overloading Bluesky's servers, so we will configure ConcurrencySettings. We’ll also configure HttpxHttpClient to use custom headers with the current session's Authorization.

We’ll use two endpoints for data collection: searchPosts for posts and getProfile for user data. If you plan to scale the crawler, you can use getProfiles, which accepts several actors per request, for user data, but in this case you’ll need to implement deduplication logic yourself. When each link is unique, Crawlee for Python handles this for you.
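If you go the getProfiles route, the deduplication could look roughly like the sketch below. It keeps a set of already requested DIDs and groups new ones into batches. The batch size of 25 and the repeated `actors` query parameter are assumptions to check against the Bluesky API documentation; `new_dids` and `profile_batch_urls` are hypothetical helpers.

```python
from typing import Iterable, List, Set
from urllib.parse import urlencode

# Assumption: upper bound on `actors` per getProfiles call
MAX_ACTORS_PER_CALL = 25


def new_dids(dids: Iterable[str], seen: Set[str]) -> List[str]:
    """Return DIDs not seen before, in first-seen order, and record them in `seen`."""
    fresh = []
    for did in dids:
        if did not in seen:
            seen.add(did)
            fresh.append(did)
    return fresh


def profile_batch_urls(
    service_endpoint: str, dids: List[str], batch_size: int = MAX_ACTORS_PER_CALL
) -> List[str]:
    """Build one getProfiles URL per batch of DIDs."""
    base = f'{service_endpoint}/xrpc/app.bsky.actor.getProfiles'
    urls = []
    for start in range(0, len(dids), batch_size):
        batch = dids[start:start + batch_size]
        query = urlencode([('actors', did) for did in batch])
        urls.append(f'{base}?{query}')
    return urls


seen: Set[str] = set()
first_page = new_dids(['a', 'b', 'a'], seen)
second_page = new_dids(['b', 'c'], seen)
print(first_page, second_page)
print(profile_batch_urls('https://example.com', first_page + second_page, batch_size=2))
```

The `seen` set has to live for the whole crawl, for example as an attribute on the scraper class, so that users found on different search pages are requested only once.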

When collecting data, I’d like to separately collect user and post data, so we’ll use different Dataset instances for storage.

async def init_crawler(self) -> None:
    """Initialize the crawler."""
    if not self._user_did:
        raise ValueError('Session not created.')

    # Initialize the datasets, purging any existing data on start
    self._users = await Dataset.open(name='users', configuration=Configuration(purge_on_start=True))
    self._posts = await Dataset.open(name='posts', configuration=Configuration(purge_on_start=True))

    # Initialize the crawler
    self._crawler = HttpCrawler(
        max_requests_per_crawl=100,
        http_client=HttpxHttpClient(
            # Set headers for API requests
            headers={
                'Content-Type': 'application/json',
                'Authorization': f'Bearer {self._access_token}',
                'Connection': 'Keep-Alive',
                'accept-encoding': 'gzip, deflate, br, zstd',
            }
        ),
        # Configuring concurrency of crawling requests
        concurrency_settings=ConcurrencySettings(
            min_concurrency=10,
            desired_concurrency=10,
            max_concurrency=30,
            max_tasks_per_minute=200,
        ),
    )

    self._crawler.router.default_handler(self._search_handler)  # Handler for search requests
    self._crawler.router.handler(label='user')(self._user_handler)  # Handler for user requests

4. Implementing handlers for data collection

Now we can implement the handler for searching posts. We’ll save the retrieved posts in self._posts and create requests for user data, placing them in the crawler's queue. We also need to handle pagination by forming the link to the next search page.

async def _search_handler(self, context: HttpCrawlingContext) -> None:
    context.log.info(f'Processing search {context.request.url} ...')

    data = json.loads(context.http_response.read())

    if 'posts' not in data:
        context.log.warning(f'No posts found in response: {context.request.url}')
        return

    user_requests = {}
    posts = []

    profile_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.actor.getProfile')

    for post in data['posts']:
        # Add user request if not already added in current context
        if post['author']['did'] not in user_requests:
            user_requests[post['author']['did']] = Request.from_url(
                url=str(profile_url.with_query(actor=post['author']['did'])),
                user_data={'label': 'user'},
            )

        posts.append(
            {
                'uri': post['uri'],
                'cid': post['cid'],
                'author_did': post['author']['did'],
                'created': post['record']['createdAt'],
                'indexed': post['indexedAt'],
                'reply_count': post['replyCount'],
                'repost_count': post['repostCount'],
                'like_count': post['likeCount'],
                'quote_count': post['quoteCount'],
                'text': post['record']['text'],
                'langs': '; '.join(post['record'].get('langs', [])),
                'reply_parent': post['record'].get('reply', {}).get('parent', {}).get('uri'),
                'reply_root': post['record'].get('reply', {}).get('root', {}).get('uri'),
            }
        )

    await self._posts.push_data(posts)  # Push a batch of posts to the dataset
    await context.add_requests(list(user_requests.values()))

    if cursor := data.get('cursor'):
        next_url = URL(context.request.url).update_query({'cursor': cursor})  # Use yarl to update the query string

        await context.add_requests([str(next_url)])
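The handler uses yarl to attach the cursor. For readers who want to see what that step does, here is a standard-library sketch of the same idea: replace any existing `cursor` parameter and keep the rest of the query intact. `with_cursor` is a hypothetical helper and the host name is a placeholder.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_cursor(url: str, cursor: str) -> str:
    """Return `url` with its `cursor` query parameter set to `cursor`."""
    parts = urlsplit(url)
    # Drop a previous cursor, keep every other parameter in order
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != 'cursor']
    query.append(('cursor', cursor))
    return urlunsplit(parts._replace(query=urlencode(query)))


first = 'https://example.com/xrpc/app.bsky.feed.searchPosts?q=python'
second = with_cursor(first, '25')
third = with_cursor(second, '50')
print(second)
print(third)
```

Because every page gets a distinct URL, the crawler's request queue treats each one as a new request, so pagination continues until the API stops returning a cursor or the `max_requests_per_crawl` limit is hit.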

When receiving user data, we'll store it in the corresponding Dataset, self._users.

async def _user_handler(self, context: HttpCrawlingContext) -> None:
    context.log.info(f'Processing user {context.request.url} ...')

    data = json.loads(context.http_response.read())

    user_item = {
        'did': data['did'],
        'created': data['createdAt'],
        'avatar': data.get('avatar'),
        'description': data.get('description'),
        'display_name': data.get('displayName'),
        'handle': data['handle'],
        'indexed': data.get('indexedAt'),
        'posts_count': data['postsCount'],
        'followers_count': data['followersCount'],
        'follows_count': data['followsCount'],
    }

    await self._users.push_data(user_item)

5. Saving data to files

For saving results, we will use the write_to_json method.

async def save_data(self) -> None:
    """Save the data."""
    if not self._users or not self._posts:
        raise ValueError('Datasets not initialized.')

    with open('users.json', 'w') as f:
        await self._users.write_to_json(f, indent=4)

    with open('posts.json', 'w') as f:
        await self._posts.write_to_json(f, indent=4)

6. Running the crawler

We now have everything we need to complete the crawler. All that's left is a method that starts the crawl. Let's call it crawl:

async def crawl(self, queries: list[str]) -> None:
    """Run the crawler for the given search queries."""
    if not self._crawler:
        raise ValueError('Crawler not initialized.')

    search_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.feed.searchPosts')

    await self._crawler.run([str(search_url.with_query(q=query)) for query in queries])

Let's finalize the code:

async def run() -> None:
    """Main execution function that orchestrates the crawling process.

    Creates a scraper instance, manages the session, and handles the complete
    crawling lifecycle including proper cleanup on completion or error.
    """
    scraper = BlueskyApiScraper()
    scraper.create_session()
    try:
        await scraper.init_crawler()
        await scraper.crawl(['python', 'apify', 'crawlee'])
        await scraper.save_data()
    except Exception:
        traceback.print_exc()
    finally:
        scraper.delete_session()


def main() -> None:
    """Entry point for the crawler application."""
    asyncio.run(run())

If you check your pyproject.toml, you will see that UV created an entrypoint for running bluesky-crawlee = "bluesky_crawlee:main", so we can run our crawler simply by executing:

uv run bluesky-crawlee

Let's look at sample results:

Posts

Users

Create Apify Actor for Bluesky crawler

We already have a fully functional implementation for local execution. Let's explore how to adapt it to run on the Apify Platform and turn it into an Apify Actor.

An Actor is a simple and efficient way to deploy your code in the cloud infrastructure on the Apify Platform. You can flexibly interact with the Actor, schedule regular runs for monitoring data, or integrate with other tools to build data processing flows.

First, create an .actor directory with platform configuration files:

mkdir .actor && touch .actor/{actor.json,Dockerfile,input_schema.json}

Then add Apify SDK for Python as a project dependency:

uv add apify

Configure Dockerfile

We’ll use the official Apify Docker image along with recommended UV practices for Docker:

FROM apify/actor-python:3.13

ENV PATH="/app/.venv/bin:$PATH"

WORKDIR /app

COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

COPY pyproject.toml uv.lock ./

RUN uv sync --frozen --no-install-project --no-editable -q --no-dev

COPY . .

RUN uv sync --frozen --no-editable -q --no-dev

CMD ["bluesky-crawlee"]

Here, bluesky-crawlee refers to the entrypoint specified in pyproject.toml.

Define project metadata in actor.json

The actor.json file contains project metadata for Apify Platform. Follow the documentation for proper configuration:

{
  "actorSpecification": 1,
  "name": "Bluesky-Crawlee",
  "title": "Bluesky - Crawlee",
  "minMemoryMbytes": 128,
  "maxMemoryMbytes": 2048,
  "description": "Scrape data products from bluesky",
  "version": "0.1",
  "meta": {
    "templateId": "bluesky-crawlee"
  },
  "input": "./input_schema.json",
  "dockerfile": "./Dockerfile"
}

Define Actor input parameters

Our crawler requires several external parameters. Let’s define them:

identifier: User's Bluesky identifier (encrypted for security)
appPassword: Bluesky app password (encrypted)
queries: List of search queries for crawling
maxRequestsPerCrawl: Optional limit for testing
mode: Choose between collecting posts or collecting data about the users who post on specific topics

Configure the input schema following the specification:

{
  "title": "Bluesky - Crawlee",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "identifier": {
      "title": "Bluesky identifier",
      "description": "Bluesky identifier for API login",
      "type": "string",
      "editor": "textfield",
      "isSecret": true
    },
    "appPassword": {
      "title": "Bluesky app password",
      "description": "Bluesky app password for API",
      "type": "string",
      "editor": "textfield",
      "isSecret": true
    },
    "maxRequestsPerCrawl": {
      "title": "Max requests per crawl",
      "description": "Maximum number of requests for crawling",
      "type": "integer"
    },
    "queries": {
      "title": "Queries",
      "type": "array",
      "description": "Search queries",
      "editor": "stringList",
      "prefill": [
        "apify"
      ],
      "example": [
        "apify",
        "crawlee"
      ]
    },
    "mode": {
      "title": "Mode",
      "type": "string",
      "description": "Collect posts or users who post on a topic",
      "enum": [
        "posts",
        "users"
      ],
      "default": "posts"
    }
  },
  "required": [
    "identifier",
    "appPassword",
    "queries",
    "mode"
  ]
}
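
To make the schema's constraints concrete, here is a minimal sketch of the same checks in plain Python. It mirrors only the `required` list, the `mode` enum, and the integer type of `maxRequestsPerCrawl`. The `validate_input` helper and the sample values are hypothetical and exist only for illustration.

```python
from typing import Any

REQUIRED_FIELDS = ('identifier', 'appPassword', 'queries', 'mode')
ALLOWED_MODES = ('posts', 'users')


def validate_input(raw_input: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the input is valid."""
    # Empty strings and empty lists count as missing.
    problems = [f'missing required field: {name}' for name in REQUIRED_FIELDS if not raw_input.get(name)]

    mode = raw_input.get('mode')
    if mode is not None and mode not in ALLOWED_MODES:
        problems.append(f'mode must be one of {ALLOWED_MODES}, got {mode!r}')

    limit = raw_input.get('maxRequestsPerCrawl')
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if limit is not None and (isinstance(limit, bool) or not isinstance(limit, int)):
        problems.append('maxRequestsPerCrawl must be an integer')

    return problems


good = validate_input(
    {'identifier': 'me.bsky.social', 'appPassword': 'secret', 'queries': ['apify'], 'mode': 'posts'}
)
bad = validate_input({'identifier': 'me.bsky.social', 'queries': [], 'mode': 'feeds'})
```

Here `good` is an empty list, while `bad` reports the missing app password, the empty query list, and the unsupported mode.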

Update project code

Remove environment variables and parameterize the code according to the Actor input parameters. Replace named datasets with the default dataset.

Add Actor logging:

# __init__.py

import logging

from apify.log import ActorLogFormatter

handler = logging.StreamHandler()
handler.setFormatter(ActorLogFormatter())

apify_client_logger = logging.getLogger('apify_client')
apify_client_logger.setLevel(logging.INFO)
apify_client_logger.addHandler(handler)

apify_logger = logging.getLogger('apify')
apify_logger.setLevel(logging.DEBUG)
apify_logger.addHandler(handler)

Update imports and entry point code:

import asyncio
import json
import traceback
from dataclasses import dataclass

import httpx
from apify import Actor
from yarl import URL

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.http_clients import HttpxHttpClient


@dataclass
class ActorInput:
    """Actor input schema."""
    identifier: str
    app_password: str
    queries: list[str]
    mode: str
    max_requests_per_crawl: int | None = None


async def run() -> None:
    """Main execution function that orchestrates the crawling process.

    Creates a scraper instance, manages the session, and handles the complete
    crawling lifecycle including proper cleanup on completion or error.
    """
    async with Actor:
        raw_input = await Actor.get_input() or {}  # get_input() returns None when no input was provided
        actor_input = ActorInput(
            identifier=raw_input.get('identifier', ''),
            app_password=raw_input.get('appPassword', ''),
            queries=raw_input.get('queries', []),
            mode=raw_input.get('mode', 'posts'),
            max_requests_per_crawl=raw_input.get('maxRequestsPerCrawl')
        )
        scraper = BlueskyApiScraper(actor_input.mode, actor_input.max_requests_per_crawl)
        try:
            scraper.create_session(actor_input.identifier, actor_input.app_password)

            await scraper.init_crawler()
            await scraper.crawl(actor_input.queries)
        except httpx.HTTPError as e:
            Actor.log.error(f'HTTP error occurred: {e}')
            raise
        except Exception as e:
            Actor.log.error(f'Unexpected error: {e}')
            traceback.print_exc()
        finally:
            scraper.delete_session()

def main() -> None:
    """Entry point for the scraper application."""
    asyncio.run(run())

Update methods with Actor input parameters:

class BlueskyApiScraper:
    """A scraper class for extracting data from Bluesky social network using their official API.

    This scraper manages authentication, concurrent requests, and data collection for both
    posts and user profiles. Depending on the mode, results are pushed to the Actor's default dataset.
    """

    def __init__(self, mode: str, max_request: int | None) -> None:
        self._crawler: HttpCrawler | None = None

        self.mode = mode
        self.max_request = max_request

        # Variables for storing session data
        self._service_endpoint: str | None = None
        self._user_did: str | None = None
        self._access_token: str | None = None
        self._refresh_token: str | None = None
        self._handle: str | None = None

    def create_session(self, identifier: str, password: str) -> None:
        """Create credentials for the session."""
        url = 'https://bsky.social/xrpc/com.atproto.server.createSession'
        headers = {
            'Content-Type': 'application/json',
        }
        data = {'identifier': identifier, 'password': password}

        response = httpx.post(url, headers=headers, json=data)
        response.raise_for_status()

        data = response.json()

        self._service_endpoint = data['didDoc']['service'][0]['serviceEndpoint']
        self._user_did = data['didDoc']['id']
        self._access_token = data['accessJwt']
        self._refresh_token = data['refreshJwt']
        self._handle = data['handle']

Implement mode-aware data collection logic:

async def _search_handler(self, context: HttpCrawlingContext) -> None:
    """Handle search requests based on mode."""
    context.log.info(f'Processing search {context.request.url} ...')

    data = json.loads(context.http_response.read())

    if 'posts' not in data:
        context.log.warning(f'No posts found in response: {context.request.url}')
        return

    user_requests = {}
    posts = []

    profile_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.actor.getProfile')

    for post in data['posts']:
        if self.mode == 'users' and post['author']['did'] not in user_requests:
            user_requests[post['author']['did']] = Request.from_url(
                url=str(profile_url.with_query(actor=post['author']['did'])),
                user_data={'label': 'user'},
            )
        elif self.mode == 'posts':
            posts.append(
                {
                    'uri': post['uri'],
                    'cid': post['cid'],
                    'author_did': post['author']['did'],
                    'created': post['record']['createdAt'],
                    'indexed': post['indexedAt'],
                    'reply_count': post['replyCount'],
                    'repost_count': post['repostCount'],
                    'like_count': post['likeCount'],
                    'quote_count': post['quoteCount'],
                    'text': post['record']['text'],
                    'langs': '; '.join(post['record'].get('langs', [])),
                    'reply_parent': post['record'].get('reply', {}).get('parent', {}).get('uri'),
                    'reply_root': post['record'].get('reply', {}).get('root', {}).get('uri'),
                }
            )

    if self.mode == 'posts':
        await context.push_data(posts)
    else:
        await context.add_requests(list(user_requests.values()))

    if cursor := data.get('cursor'):
        next_url = URL(context.request.url).update_query({'cursor': cursor})
        await context.add_requests([str(next_url)])
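
In users mode, the `user_requests` dictionary is what keeps the handler from enqueuing the same profile twice for one page of results. A minimal sketch of that idea with made-up post data follows. The DIDs and the `example.com` endpoint are placeholders, and plain URL strings stand in for `Request` objects.

```python
from urllib.parse import urlencode

# Made-up search results: two of the three posts share an author.
posts = [
    {'uri': 'at://post/1', 'author': {'did': 'did:plc:alice'}},
    {'uri': 'at://post/2', 'author': {'did': 'did:plc:bob'}},
    {'uri': 'at://post/3', 'author': {'did': 'did:plc:alice'}},
]

profile_endpoint = 'https://example.com/xrpc/app.bsky.actor.getProfile'

profile_urls: dict[str, str] = {}
for post in posts:
    did = post['author']['did']
    # Keying by DID means a repeated author is skipped.
    if did not in profile_urls:
        query = urlencode({'actor': did})
        profile_urls[did] = f'{profile_endpoint}?{query}'
```

Three posts produce only two profile URLs, one per distinct author.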

Update the user handler for the default dataset:

async def _user_handler(self, context: HttpCrawlingContext) -> None:
    """Handle user profile requests."""
    context.log.info(f'Processing user {context.request.url} ...')

    data = json.loads(context.http_response.read())

    user_item = {
        'did': data['did'],
        'created': data['createdAt'],
        'avatar': data.get('avatar'),
        'description': data.get('description'),
        'display_name': data.get('displayName'),
        'handle': data['handle'],
        'indexed': data.get('indexedAt'),
        'posts_count': data['postsCount'],
        'followers_count': data['followersCount'],
        'follows_count': data['followsCount'],
    }

    await context.push_data(user_item)

Deploy

Use the official Apify CLI to upload your code:

Authenticate using your API token from Apify Console:

apify login

Choose "Enter API token manually" and paste your token.

Push the project to the platform:

apify push

Now you can configure runs on Apify Platform.

Let’s perform a test run:

Fill in the input parameters:

Check that logging works correctly:

View results in the dataset:

If you want to make your Actor public and provide access to other users, potentially to earn income from it, follow this publishing guide for Apify Store.

Conclusion and repository access

We’ve created an efficient crawler for Bluesky using the official API. If you want to go deeper into regular data extraction from Bluesky, I recommend exploring custom feed generation. I think it opens up some interesting possibilities.

And if you need to quickly create a crawler that can retrieve data for various queries, you now have everything you need.

You can find the complete code in the repository.

If you enjoyed this blog, feel free to support Crawlee for Python by starring the repository or joining the maintainer team.

Have questions or want to discuss implementation details? Join our Discord - our community of 10,000+ developers is there to help.

Inside implementing SuperScraper with Crawlee.

Saurav Jain — Thu, 06 Mar 2025 07:23:07 +0000

SuperScraper is an open-source Actor that combines features from various web scraping services, including ScrapingBee, ScrapingAnt, and ScraperAPI.

A key capability is its standby mode, which runs the Actor as a persistent API server. This removes the usual start-up times - a common pain point in many systems - and lets users make direct API calls to interact with the system immediately.

This blog explains how SuperScraper works, highlights its implementation details, and provides code snippets to demonstrate its core functionality.

What is SuperScraper?

SuperScraper transforms a traditional scraper into an API server. Instead of running with static inputs and waiting for completion, it starts only once, stays active, and listens for incoming requests.

How to enable standby mode

To activate standby mode, you must configure the settings so it listens for incoming requests.

Server setup

The project uses the Node.js http module to create a server that listens on the desired port. After the server starts, a check ensures users are interacting with it correctly by sending requests instead of running it traditionally. This keeps SuperScraper operating as a persistent server.

Handling multiple crawlers

SuperScraper processes user requests using multiple instances of Crawlee’s PlaywrightCrawler. Since each PlaywrightCrawler instance can only handle one proxy configuration, a separate crawler is created for each unique proxy setting.

For example, if the user sends one request with “normal” proxies and one request with residential US proxies, a separate crawler needs to be created for each proxy configuration. To handle this, we store the crawlers in a key-value map, where the key is a stringified proxy configuration.

const crawlers = new Map<string, PlaywrightCrawler>();

Here’s a part of the code that gets executed when a new request from the user arrives; if the crawler for this proxy configuration exists in the map, it will be used. Otherwise, a new crawler gets created. Then, we add the request to the crawler’s queue so it can be processed.

const key = JSON.stringify(crawlerOptions); 
const crawler = crawlers.has(key) ? crawlers.get(key)! : await createAndStartCrawler(crawlerOptions);

await crawler.addRequests([request]);
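
One detail worth keeping in mind when a stringified configuration is the map key: two configurations with the same settings but a different property order serialize to different strings, so they would get separate crawlers. The sketch below illustrates the caching pattern in Python rather than TypeScript and sorts the keys so that equivalent configurations share one entry. All names are hypothetical, and `object()` stands in for a real crawler.

```python
import json
from typing import Any

crawlers: dict[str, object] = {}


def config_key(options: dict[str, Any]) -> str:
    """Serialize options deterministically so equal configurations produce equal keys."""
    return json.dumps(options, sort_keys=True)


def get_or_create_crawler(options: dict[str, Any]) -> object:
    """Return the cached crawler for this configuration, creating it on first use."""
    key = config_key(options)
    if key not in crawlers:
        crawlers[key] = object()  # Placeholder for a real crawler instance.
    return crawlers[key]


us_residential = get_or_create_crawler({'groups': ['RESIDENTIAL'], 'countryCode': 'US'})
same_settings = get_or_create_crawler({'countryCode': 'US', 'groups': ['RESIDENTIAL']})
default_proxy = get_or_create_crawler({})
```

The first two calls return the same object even though the dictionaries list their keys in a different order, while the empty configuration gets its own entry.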

The function below initializes new crawlers with predefined settings and behaviors. Each crawler utilizes its own in-memory queue created with the MemoryStorage client. This approach is used for two key reasons:

Performance: In-memory queues are faster, and there's no need to persist them when SuperScraper migrates.
Isolation: Using a separate queue prevents interference with the shared default queue of the SuperScraper Actor, avoiding potential bugs when multiple crawlers use it simultaneously.

export const createAndStartCrawler = async (crawlerOptions: CrawlerOptions = DEFAULT_CRAWLER_OPTIONS) => {
    const client = new MemoryStorage({ persistStorage: false });
    const queue = await RequestQueue.open(undefined, { storageClient: client });

    const proxyConfig = await Actor.createProxyConfiguration(crawlerOptions.proxyConfigurationOptions);

    const crawler = new PlaywrightCrawler({
        keepAlive: true,
        proxyConfiguration: proxyConfig,
        maxRequestRetries: 4,
        requestQueue: queue,
    });
};

At the end of the function, we start the crawler and log a message if it terminates for any reason. Next, we add the newly created crawler to the key-value map containing all crawlers, and finally, we return the crawler.

crawler.run().then(
    () => log.warning(`Crawler ended`, crawlerOptions),
    () => { }
);

crawlers.set(JSON.stringify(crawlerOptions), crawler);

log.info('Crawler ready 🚀', crawlerOptions);

return crawler;

Mapping standby HTTP requests to Crawlee requests

When creating the server, it accepts a request listener function that takes two arguments: the user’s request and a response object. The response object is used to send scraped data back to the user. These response objects are stored in a key-value map so they can be accessed later in the code. The key is a randomly generated string shared between the request and its corresponding response object, and it is used as request.uniqueKey.

const responses = new Map<string, ServerResponse>();

Saving response objects

The following function stores a response object in the key-value map:

export function addResponse(responseId: string, response: ServerResponse) {
    responses.set(responseId, response);
}

Updating crawler logic to store responses

Here’s the updated logic for fetching/creating the corresponding crawler for a given proxy configuration, with a call to store the response object:

const key = JSON.stringify(crawlerOptions); 
const crawler = crawlers.has(key) ? crawlers.get(key)! : await createAndStartCrawler(crawlerOptions);

addResponse(request.uniqueKey!, res);

await crawler.requestQueue!.addRequest(request);

Sending scraped data back

Once a crawler finishes processing a request, it retrieves the corresponding response object using the key and sends the scraped data back to the user:

export const sendSuccResponseById = (responseId: string, result: unknown, contentType: string) => {
    const res = responses.get(responseId);
    if (!res) {
        log.info(`Response for request ${responseId} not found`);
        return;
    }

    res.writeHead(200, { 'Content-Type': contentType });
    res.end(result);
    responses.delete(responseId);
};

Error handling

There is similar logic to send a response back if an error occurs during scraping:

export const sendErrorResponseById = (responseId: string, result: string, statusCode: number = 500) => {
    const res = responses.get(responseId);
    if (!res) {
        log.info(`Response for request ${responseId} not found`);
        return;
    }

    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    res.end(result);
    responses.delete(responseId);
};

Adding timeouts during migrations

During migration, SuperScraper adds timeouts to pending responses to handle termination cleanly.

export const addTimeoutToAllResponses = (timeoutInSeconds: number = 60) => {
    const migrationErrorMessage = {
        errorMessage: 'Actor had to migrate to another server. Please, retry your request.',
    };

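    // `responses` is a Map, so Object.keys() would return an empty array; read the keys from the Map itself.
    const responseKeys = [...responses.keys()];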

    for (const key of responseKeys) {
        setTimeout(() => {
            sendErrorResponseById(key, JSON.stringify(migrationErrorMessage));
        }, timeoutInSeconds * 1000);
    }
};

Managing migrations

SuperScraper handles migrations by timing out active responses to prevent lingering requests during server transitions.

Actor.on('migrating', () => {
    addTimeoutToAllResponses(60);
});

Users receive clear feedback during server migrations, maintaining stable operation.

Build your own

This guide showed how to build and manage a standby web scraper using Apify’s platform and Crawlee. The implementation handles multiple proxy configurations through PlaywrightCrawler instances while managing request-response cycles efficiently to support diverse scraping needs.

Standby mode transforms SuperScraper into a persistent API server, eliminating start-up delays. The migration handling system keeps operations stable during server transitions. You can build on this foundation to create web scraping tools tailored to your requirements.

To get started, explore the project on GitHub or learn more about Crawlee to build your own scalable web scraping tools.

How to scrape Crunchbase using Python in 2024 (Easy Guide)

Max Bohomolov — Wed, 15 Jan 2025 14:38:03 +0000

Python developers know the drill: you need reliable company data, and Crunchbase has it. This guide shows you how to build an effective Crunchbase scraper in Python that gets you the data you need.

Crunchbase tracks details that matter: locations, business focus, founders, and investment histories. Manual extraction from such a large dataset isn't practical, so automation is essential for transforming this information into an analyzable format.

In this blog, we'll explore three different ways to extract data from Crunchbase using Crawlee for Python. We'll fully implement two of them and discuss the specifics and challenges of the third. Along the way, we'll see how much the choice of data source matters.

Note: This guide comes from a developer in our growing community. Have you built interesting projects with Crawlee? Join us on Discord to share your experiences and blog ideas - we value these contributions from developers like you.

Key steps we'll cover:

Project setup
Choosing the data source
Implementing sitemap-based crawler
Analysis of search-based approach and its limitations
Implementing the official API crawler
Conclusion and repository access

Prerequisites

Python 3.9 or higher
Familiarity with web scraping concepts
Crawlee for Python v0.5.0
poetry v2.0 or higher

Project setup

Before we start scraping, we need to set up our project. In this guide, we won't be using crawler templates (Playwright and BeautifulSoup), so we'll set up the project manually.

Install Poetry
```
pipx install poetry
```

Create and navigate to the project folder.

mkdir crunchbase-crawlee && cd crunchbase-crawlee

Initialize the project using Poetry, leaving all fields empty.
```
poetry init
```

When prompted:

- For "Compatible Python versions", enter: >={your Python version},<4.0 (for example, if you're using Python 3.10, enter: >=3.10,<4.0)
- Leave all other fields empty by pressing Enter
- Confirm the generation by typing "yes"

Add and install Crawlee with the necessary dependencies using Poetry.
```
poetry add 'crawlee[parsel,curl-impersonate]'
```

Complete the project setup by creating the standard file structure for Crawlee for Python projects.

mkdir crunchbase-crawlee && touch crunchbase-crawlee/{__init__.py,__main__.py,main.py,routes.py}

After setting up the basic project structure, we can explore different methods of obtaining data from Crunchbase.

Choosing the data source

While we can extract target data directly from the company page, we need to choose the best way to navigate the site.

A careful examination of Crunchbase's structure shows that we have three main options for obtaining data:

Sitemap - for complete site traversal.
Search - for targeted data collection.
Official API - recommended method.

Let's examine each of these approaches in detail.

Scraping Crunchbase using sitemap and Crawlee for Python

A sitemap is a standard aid to site navigation used by crawlers such as those run by Google, Ahrefs, and other search engines. All crawlers must follow the rules described in robots.txt.

Let's look at the structure of Crunchbase's Sitemap:

As you can see, links to organization pages are located inside second-level Sitemap files, which are compressed using gzip.

The structure of one of these files looks like this:

The lastmod field is particularly important here. It allows tracking which companies have updated their information since the previous data collection. This is especially useful for regular data updates.
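
As a minimal sketch of how lastmod could drive incremental updates, the snippet below parses a small, made-up sitemap fragment with the standard library and keeps only the entries modified after the previous run. The URLs, dates, and the `updated_since` helper are illustrative assumptions, not Crunchbase data.

```python
from datetime import datetime, timezone
from xml.etree import ElementTree

SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Made-up sitemap fragment in the standard sitemap format.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/organization/alpha</loc><lastmod>2024-12-01T10:00:00+00:00</lastmod></url>
  <url><loc>https://example.com/organization/beta</loc><lastmod>2025-01-10T08:30:00+00:00</lastmod></url>
</urlset>"""


def updated_since(sitemap_xml: str, last_run: datetime) -> list[str]:
    """Return the URLs whose lastmod is later than the previous collection time."""
    root = ElementTree.fromstring(sitemap_xml.encode())
    urls = []
    for entry in root.findall('sm:url', SITEMAP_NS):
        loc = entry.findtext('sm:loc', namespaces=SITEMAP_NS)
        lastmod = entry.findtext('sm:lastmod', namespaces=SITEMAP_NS)
        # Entries without a lastmod are skipped in this sketch.
        if loc and lastmod and datetime.fromisoformat(lastmod) > last_run:
            urls.append(loc)
    return urls


changed = updated_since(SAMPLE_SITEMAP, datetime(2025, 1, 1, tzinfo=timezone.utc))
```

With a previous run on 1 January 2025, only the second entry is returned, so a recurring crawl would revisit one page instead of two.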

1. Configuring the crawler for scraping

To work with the site, we'll use CurlImpersonateHttpClient, which impersonates a Safari browser. While this choice might seem unexpected for working with a sitemap, it's necessitated by Crunchbase's protection features.

The reason is that Crunchbase uses Cloudflare to protect against automated access. This is clearly visible when analyzing traffic on a company page:

An interesting feature is that challenges.cloudflare is executed after loading the document with data. This means we receive the data first, and only then JavaScript checks if we're a bot. If our HTTP client's fingerprint is sufficiently similar to a real browser, we'll successfully receive the data.

Cloudflare also analyzes traffic at the sitemap level. If our crawler doesn't look legitimate, access will be blocked. That's why we impersonate a real browser.

To prevent blocks due to overly aggressive crawling, we'll configure ConcurrencySettings.

When scaling this approach, you'll likely need proxies. Detailed information about proxy setup can be found in the documentation.

We'll save our scraping results in JSON format. Here's how the basic crawler configuration looks:

# main.py

from crawlee import ConcurrencySettings, HttpHeaders
from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import CurlImpersonateHttpClient

from .routes import router


async def main() -> None:
    """The crawler entry point."""
    concurrency_settings = ConcurrencySettings(max_concurrency=1, max_tasks_per_minute=50)

    http_client = CurlImpersonateHttpClient(
        impersonate='safari17_0',
        headers=HttpHeaders(
            {
                'accept-language': 'en',
                'accept-encoding': 'gzip, deflate, br, zstd',
            }
        ),
    )
    crawler = ParselCrawler(
        request_handler=router,
        max_request_retries=1,
        concurrency_settings=concurrency_settings,
        http_client=http_client,
        max_requests_per_crawl=30,
    )

    await crawler.run(['https://www.crunchbase.com/www-sitemaps/sitemap-index.xml'])

    await crawler.export_data_json('crunchbase_data.json')

2. Implementing sitemap navigation

Sitemap navigation happens in two stages. In the first stage, we need to get a list of all files containing organization information:

# routes.py

from crawlee.crawlers import ParselCrawlingContext
from crawlee.router import Router
from crawlee import Request

router = Router[ParselCrawlingContext]()


@router.default_handler
async def default_handler(context: ParselCrawlingContext) -> None:
    """Default request handler."""
    context.log.info(f'default_handler processing {context.request} ...')

    requests = [
        Request.from_url(url, label='sitemap')
        for url in context.selector.xpath('//loc[contains(., "sitemap-organizations")]/text()').getall()
    ]

    # Since this is a tutorial, we enqueue only one sitemap link
    await context.add_requests(requests, limit=1)

In the second stage, we process second-level sitemap files stored in gzip format. This requires a special approach as the data needs to be decompressed first:

# routes.py

from gzip import decompress
from parsel import Selector


@router.handler('sitemap')
async def sitemap_handler(context: ParselCrawlingContext) -> None:
    """Sitemap gzip request handler."""
    context.log.info(f'sitemap_handler processing {context.request.url} ...')

    data = context.http_response.read()
    data = decompress(data)

    selector = Selector(data.decode())

    requests = [Request.from_url(url, label='company') for url in selector.xpath('//loc/text()').getall()]

    await context.add_requests(requests)
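
If you are unsure whether the body you receive is still compressed (some HTTP stacks decompress responses transparently), a defensive variant is to check for the gzip magic bytes before decompressing. This is a sketch under that assumption; the `maybe_decompress` helper and the sample XML are hypothetical and not part of the project code.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b'\x1f\x8b'


def maybe_decompress(data: bytes) -> bytes:
    """Decompress gzip data, passing through bytes that are not gzip-compressed."""
    return gzip.decompress(data) if data.startswith(GZIP_MAGIC) else data


xml = b'<urlset><url><loc>https://example.com/organization/alpha</loc></url></urlset>'
compressed = gzip.compress(xml)

from_compressed = maybe_decompress(compressed)
from_plain = maybe_decompress(xml)
```

Both calls yield the same XML bytes, so the handler logic that follows does not need to know which form it received.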

3. Extracting and saving data

Each company page contains a large amount of information. For demonstration purposes, we'll focus on the main fields: Company Name, Short Description, Website, and Location.

One of Crunchbase's advantages is that all data is stored in JSON format within the page:

This significantly simplifies data extraction. We only need one XPath selector to get the JSON, and then we apply jmespath to extract the needed fields:

# routes.py

@router.handler('company')
async def company_handler(context: ParselCrawlingContext) -> None:
    """Company request handler."""
    context.log.info(f'company_handler processing {context.request.url} ...')

    json_selector = context.selector.xpath('//*[@id="ng-state"]/text()')

    await context.push_data(
        {
            'Company Name': json_selector.jmespath('HttpState.*.data[].properties.identifier.value').get(),
            'Short Description': json_selector.jmespath('HttpState.*.data[].properties.short_description').get(),
            'Website': json_selector.jmespath('HttpState.*.data[].cards.company_about_fields2.website.value').get(),
            'Location': '; '.join(
                json_selector.jmespath(
                    'HttpState.*.data[].cards.company_about_fields2.location_identifiers[].value'
                ).getall()
            ),
        }
    )
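
To make the JMESPath expressions easier to follow, here is a minimal sketch of the same lookups done with plain dictionaries. The payload is a made-up, heavily simplified stand-in whose nesting is inferred from the expressions above. The real ng-state content is much larger, and the key under HttpState is invented here.

```python
import json

# Made-up stand-in for the JSON embedded in the page's ng-state element.
NG_STATE = json.dumps(
    {
        'HttpState': {
            'GET/example-cache-key': {
                'data': {
                    'properties': {
                        'identifier': {'value': 'Example Inc.'},
                        'short_description': 'A placeholder company.',
                    },
                    'cards': {
                        'company_about_fields2': {
                            'website': {'value': 'https://example.com'},
                            'location_identifiers': [{'value': 'Prague'}, {'value': 'Czech Republic'}],
                        }
                    },
                }
            }
        }
    }
)

state = json.loads(NG_STATE)

# `HttpState.*` in JMESPath means "every value under HttpState"; take the first one that has data.
entry = next(value['data'] for value in state['HttpState'].values() if 'data' in value)
about = entry['cards']['company_about_fields2']

company = {
    'Company Name': entry['properties']['identifier']['value'],
    'Short Description': entry['properties']['short_description'],
    'Website': about['website']['value'],
    'Location': '; '.join(item['value'] for item in about['location_identifiers']),
}
```

The JMESPath version in the handler is more forgiving: a missing field yields None instead of raising a KeyError, which is usually what you want when crawling many pages.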

The collected data is saved in Crawlee for Python's internal storage using the context.push_data method. When the crawler finishes, we export all collected data to a JSON file:

# main.py

await crawler.export_data_json('crunchbase_data.json')

4. Running the project

With all components in place, we need to create an entry point for our crawler:

# __main__.py
import asyncio

from .main import main

if __name__ == '__main__':
    asyncio.run(main())

Execute the crawler using Poetry:

poetry run python -m crunchbase-crawlee

5. Characteristics of the sitemap crawler

The sitemap approach has its distinct advantages and limitations. It's ideal in the following cases:

When you need to collect data about all companies on the platform
When there are no specific company selection criteria
If you have sufficient time and computational resources

However, there are significant limitations to consider:

Almost no ability to filter data during collection
Requires constant monitoring of Cloudflare blocks
Scaling the solution requires proxy servers, which increases project costs

Using search for scraping Crunchbase

The limitations of the sitemap approach might point to search as the next solution. However, Crunchbase applies tighter security measures to its search functionality compared to its public pages.

The key difference lies in how Cloudflare protection works. While we receive data before the challenges.cloudflare check when accessing a company page, the search API requires valid cookies that have passed this check.

Let's verify this in practice. Open the following link in Incognito mode:

<https://www.crunchbase.com/v4/data/autocompletes?query=Ap&collection_ids=organizations&limit=25&source=topSearch>

When analyzing the traffic, we'll see the following sequence of events:

First, the page is blocked with code 403
Then the challenges.cloudflare check is performed
Only after successfully passing the check do we receive data with code 200

Automating this process would require a headless browser capable of bypassing Cloudflare Turnstile. The current version of Crawlee for Python (v0.5.0) doesn't provide this functionality, although it's planned for future development.

You can extend the capabilities of Crawlee for Python by integrating Camoufox following this example.

Working with the official Crunchbase API

Crunchbase provides a free API with basic functionality. Paid subscription users get expanded data access. Complete documentation for available endpoints can be found in the official API specification.

1. Setting up API access

To start working with the API, follow these steps:

Create a Crunchbase account
Go to the Integrations section
Create a Crunchbase Basic API key

Although the documentation states that key activation may take up to an hour, it usually starts working immediately after creation.

2. Configuring the crawler for API work

An important constraint is the API's rate limit: no more than 200 requests per minute, and significantly fewer in the free version. With this in mind, let's configure ConcurrencySettings. Since we're working with the official API, we don't need to mask our HTTP client, so we'll use the standard HttpxHttpClient with preset headers.
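
To make the per-minute cap concrete, here is a minimal sketch, independent of Crawlee, that converts a requests-per-minute budget into the spacing between requests. The helper names min_interval_seconds and estimate_duration_seconds are hypothetical and only illustrate the arithmetic behind a setting such as max_tasks_per_minute.

```python
def min_interval_seconds(max_per_minute: int) -> float:
    """Smallest average gap between requests that respects the per-minute cap."""
    if max_per_minute <= 0:
        raise ValueError('max_per_minute must be positive')
    return 60.0 / max_per_minute


def estimate_duration_seconds(total_requests: int, max_per_minute: int) -> float:
    """Rough lower bound on run time: the first request starts immediately."""
    gaps = max(total_requests - 1, 0)
    return gaps * min_interval_seconds(max_per_minute)


if __name__ == '__main__':
    print(min_interval_seconds(60))
    print(estimate_duration_seconds(30, 60))
```

With a budget of 60 requests per minute, evenly spaced requests are one second apart, so a 30-request crawl cannot finish in much less than half a minute.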

First, let's save the API key in an environment variable:

export CRUNCHBASE_TOKEN={YOUR KEY}

Here's how the crawler configuration for working with the API looks:

# main.py

import os

from crawlee.crawlers import HttpCrawler
from crawlee.http_clients import HttpxHttpClient
from crawlee import ConcurrencySettings, HttpHeaders

from .routes import router

CRUNCHBASE_TOKEN = os.getenv('CRUNCHBASE_TOKEN', '')


async def main() -> None:
    """The crawler entry point."""

    concurrency_settings = ConcurrencySettings(max_tasks_per_minute=60)

    http_client = HttpxHttpClient(
        headers=HttpHeaders({'accept-encoding': 'gzip, deflate, br, zstd', 'X-cb-user-key': CRUNCHBASE_TOKEN})
    )
    crawler = HttpCrawler(
        request_handler=router,
        concurrency_settings=concurrency_settings,
        http_client=http_client,
        max_requests_per_crawl=30,
    )

    await crawler.run(
        ['https://api.crunchbase.com/api/v4/autocompletes?query=apify&collection_ids=organizations&limit=25']
    )

    await crawler.export_data_json('crunchbase_data.json')
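
The configuration above falls back to an empty string when CRUNCHBASE_TOKEN is unset, which would only surface later as failed API requests. As a hedged alternative, the sketch below fails fast instead. The function name load_token is an assumption made for illustration and is not part of Crawlee or the Crunchbase API.

```python
import os
from typing import Mapping


def load_token(env: Mapping[str, str] = os.environ, name: str = 'CRUNCHBASE_TOKEN') -> str:
    """Return the API token from the environment or raise a clear error."""
    token = env.get(name, '').strip()
    if not token:
        raise RuntimeError(f'{name} is not set; export it before running the crawler')
    return token
```

Calling load_token() at the top of main() would stop the run before any request is sent with an empty X-cb-user-key header.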

3. Processing search results

For working with the API, we'll need two main endpoints:

get_autocompletes - for searching
get_entities_organizations__entity_id - for getting data

First, let's implement search results processing:

import json

from crawlee.crawlers import HttpCrawlingContext
from crawlee.router import Router
# HttpHeaders is used by the search handler later in this file
from crawlee import HttpHeaders, Request

router = Router[HttpCrawlingContext]()


@router.default_handler
async def default_handler(context: HttpCrawlingContext) -> None:
    """Default request handler."""
    context.log.info(f'default_handler processing {context.request.url} ...')

    data = json.loads(context.http_response.read())

    requests = []

    for entity in data['entities']:
        permalink = entity['identifier']['permalink']
        requests.append(
            Request.from_url(
                url=f'https://api.crunchbase.com/api/v4/entities/organizations/{permalink}?field_ids=short_description%2Clocation_identifiers%2Cwebsite_url',
                label='company',
            )
        )

    await context.add_requests(requests)
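
The URL-building step in this handler can be isolated and tested without the crawler. The following is a minimal sketch using only the standard library; build_entity_urls is a hypothetical helper, and the sample response shape is assumed from the fields the handler reads (entities, identifier, permalink).

```python
from urllib.parse import quote, urlencode

API_BASE = 'https://api.crunchbase.com/api/v4/entities/organizations'
FIELD_IDS = ['short_description', 'location_identifiers', 'website_url']


def build_entity_urls(data: dict) -> list:
    """Turn an autocompletes response into entity lookup URLs."""
    # urlencode percent-encodes the commas, matching the %2C in the handler's URL
    query = urlencode({'field_ids': ','.join(FIELD_IDS)})
    urls = []
    for entity in data.get('entities', []):
        permalink = entity.get('identifier', {}).get('permalink')
        if not permalink:
            continue  # skip malformed entries instead of raising KeyError
        urls.append(f'{API_BASE}/{quote(permalink, safe="")}?{query}')
    return urls
```

Skipping entries without a permalink keeps one malformed result from failing the whole page of search results.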

4. Extracting company data

After getting the list of companies, we extract detailed information about each one:

@router.handler('company')
async def company_handler(context: HttpCrawlingContext) -> None:
    """Company request handler."""
    context.log.info(f'company_handler processing {context.request.url} ...')

    data = json.loads(context.http_response.read())

    await context.push_data(
        {
            'Company Name': data['properties']['identifier']['value'],
            'Short Description': data['properties'].get('short_description'),
            'Website': data['properties'].get('website_url'),
            'Location': '; '.join([item['value'] for item in data['properties'].get('location_identifiers', [])]),
        }
    )
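
Because the company handler and the search handler build the same record, the mapping can live in one small function. This is a sketch under the assumption that any of the requested fields may be missing from a response; flatten_company is a hypothetical name, not part of the original project.

```python
def flatten_company(properties: dict) -> dict:
    """Map a Crunchbase 'properties' object to the flat record we store."""
    locations = properties.get('location_identifiers') or []
    return {
        'Company Name': properties.get('identifier', {}).get('value'),
        'Short Description': properties.get('short_description'),
        'Website': properties.get('website_url'),
        # Join only the entries that actually carry a value
        'Location': '; '.join(item['value'] for item in locations if item.get('value')),
    }
```

A handler could then call await context.push_data(flatten_company(data['properties'])), and missing fields would come through as None rather than raising KeyError.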

5. Advanced location-based search

If you need more flexible search capabilities, the API provides a special search endpoint. Here's an example of searching for all companies in Prague:

payload = {
    'field_ids': ['identifier', 'location_identifiers', 'short_description', 'website_url'],
    'limit': 200,
    'order': [{'field_id': 'rank_org', 'sort': 'asc'}],
    'query': [
        {
            'field_id': 'location_identifiers',
            'operator_id': 'includes',
            'type': 'predicate',
            'values': ['e0b951dc-f710-8754-ddde-5ef04dddd9f8'],
        },
        {'field_id': 'facet_ids', 'operator_id': 'includes', 'type': 'predicate', 'values': ['company']},
    ],
}

serialized_payload = json.dumps(payload)
await crawler.run(
    [
        Request.from_url(
            url='https://api.crunchbase.com/api/v4/searches/organizations',
            method='POST',
            payload=serialized_payload,
            use_extended_unique_key=True,
            headers=HttpHeaders({'Content-Type': 'application/json'}),
            label='search',
        )
    ]
)

For processing search results and pagination, we use the following handler:

@router.handler('search')
async def search_handler(context: HttpCrawlingContext) -> None:
    """Search results handler with pagination support."""
    context.log.info(f'search_handler processing {context.request.url} ...')

    data = json.loads(context.http_response.read())

    last_entity = None
    results = []

    for entity in data['entities']:
        last_entity = entity['uuid']
        results.append(
            {
                'Company Name': entity['properties']['identifier']['value'],
                'Short Description': entity['properties'].get('short_description'),
                'Website': entity['properties'].get('website_url'),
                'Location': '; '.join([item['value'] for item in entity['properties'].get('location_identifiers', [])]),
            }
        )

    if results:
        await context.push_data(results)

    if last_entity:
        payload = json.loads(context.request.payload)
        payload['after_id'] = last_entity
        payload = json.dumps(payload)

        await context.add_requests(
            [
                Request.from_url(
                    url='https://api.crunchbase.com/api/v4/searches/organizations',
                    method='POST',
                    payload=payload,
                    use_extended_unique_key=True,
                    headers=HttpHeaders({'Content-Type': 'application/json'}),
                    label='search',
                )
            ]
        )
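
The pagination logic in this handler is cursor-based: each follow-up request repeats the same query and adds the uuid of the last entity as after_id. A minimal sketch of just that step, with the hypothetical helper name next_page_payload, might look like this:

```python
import json
from typing import Optional


def next_page_payload(current_payload: str, entities: list) -> Optional[str]:
    """Return the serialized payload for the next page, or None when done."""
    if not entities:
        return None  # an empty page means there is nothing left to fetch
    payload = json.loads(current_payload)
    payload['after_id'] = entities[-1]['uuid']
    return json.dumps(payload)
```

The crawl stops naturally when a page comes back with no entities, which is the same condition the handler relies on when last_entity stays None.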

6. Finally, free API limitations

The free version of the API has significant limitations:

Limited set of available endpoints
Autocompletes function only works for company searches
Not all data fields are accessible
Limited search filtering capabilities

Consider a paid subscription for production-level work. The API provides the most reliable way to access Crunchbase data, even with its rate constraints.

What’s your best path forward?

We've explored three different approaches to obtaining data from Crunchbase:

Sitemap - for large-scale data collection
Search - difficult to automate due to Cloudflare protection
Official API - the most reliable solution for commercial projects

Each method has its advantages, but for most projects, I recommend using the official API despite its limitations in the free version.

The complete source code is available in my repository. Have questions or want to discuss implementation details? Join our Discord - our community of developers is there to help.

How to scrape Google Maps data using Python and Crawlee

Satyam Tripathi — Mon, 30 Dec 2024 08:33:40 +0000

Millions of people use Google Maps daily, leaving behind a goldmine of data just waiting to be analyzed. In this guide, I'll show you how to build a reliable scraper using Crawlee and Python to extract locations, ratings, and reviews from Google Maps, all while handling its dynamic content challenges.

Note: One of our community members wrote this post as a contribution to the Crawlee Blog. If you would like to contribute posts like this, please reach out to us on our Discord channel.

What data will we extract from Google Maps?

We’ll collect information about hotels in a specific city. You can also customize your search to meet your requirements. For example, you might search for "hotels near me", "5-star hotels in Bombay", or other similar queries.

We’ll extract important data, including the hotel name, rating, review count, price, a link to the hotel page on Google Maps, and all available amenities. Here’s an example of what the extracted data will look like:

{
    "name": "Vividus Hotels, Bangalore",
    "rating": "4.3",
    "reviews": "633",
    "price": "₹3,667",
    "amenities": [
        "Pool available",
        "Free breakfast available",
        "Free Wi-Fi available",
        "Free parking available"
    ],
    "link": "https://www.google.com/maps/place/Vividus+Hotels+,+Bangalore/..."
}

Building a Google Maps scraper

Let's build a Google Maps scraper step-by-step.

Note: Crawlee requires Python 3.9 or later.

1. Setting up your environment

First, let's set up everything you’ll need to run the scraper. Open your terminal and run these commands:

# Create and activate a virtual environment
python -m venv google-maps-scraper

# Windows:
.\google-maps-scraper\Scripts\activate

# Mac/Linux:
source google-maps-scraper/bin/activate

# We plan to use Playwright with Crawlee, so we need to install both:
pip install crawlee "crawlee[playwright]"
playwright install

If you're new to Crawlee, check out its easy-to-follow documentation. It’s available for both Node.js and Python.

Note: Before going ahead with the project, please consider starring Crawlee for Python on GitHub. It helps us spread the word to fellow scraping developers.

2. Connecting to Google Maps

Let's see the steps to connect to Google Maps.

Step 1: Setting up the crawler

The first step is to configure the crawler. We're using PlaywrightCrawler from Crawlee, which gives us powerful tools for automated browsing. We set headless=False to make the browser visible during scraping and set request_handler_timeout to 5 minutes, which gives each request handler plenty of time to load and process its page.

from crawlee.playwright_crawler import PlaywrightCrawler
from datetime import timedelta

# Initialize crawler with browser visibility and timeout settings
crawler = PlaywrightCrawler(
    headless=False,  # Shows the browser window while scraping
    request_handler_timeout=timedelta(
        minutes=5
    ),  # Allows plenty of time for page loading
)

Step 2: Handling each page

This function defines how each page is handled when the crawler visits it. It uses context.page to navigate to the target URL.

async def scrape_google_maps(context):
    """
    Establishes connection to Google Maps and handles the initial page load
    """
    page = context.page
    await page.goto(context.request.url)
    context.log.info(f"Processing: {context.request.url}")

Step 3: Launching the crawler

Finally, the main function brings everything together. It creates a search URL, sets up the crawler, and starts the scraping process.

import asyncio

async def main():
    # Prepare the search URL
    search_query = "hotels in bengaluru"
    start_url = f"https://www.google.com/maps/search/{search_query.replace(' ', '+')}"

    # Tell the crawler how to handle each page it visits
    crawler.router.default_handler(scrape_google_maps)

    # Start the scraping process
    await crawler.run([start_url])

if __name__ == "__main__":
    asyncio.run(main())
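
The replace(' ', '+') call above only handles spaces. If you plan to pass arbitrary queries, a slightly safer sketch is to let the standard library encode the whole string. The function name build_search_url is an assumption for illustration.

```python
from urllib.parse import quote_plus

MAPS_SEARCH_BASE = 'https://www.google.com/maps/search/'


def build_search_url(search_query: str) -> str:
    """Encode spaces as '+' and escape characters such as '&' or accents."""
    return MAPS_SEARCH_BASE + quote_plus(search_query.strip())
```

For a plain query like "hotels in bengaluru" this produces the same URL as the original code, while characters such as "&" no longer leak into the URL structure.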

Let’s combine the above code snippets and save them in a file named gmap_scraper.py:

from crawlee.playwright_crawler import PlaywrightCrawler
from datetime import timedelta
import asyncio

async def scrape_google_maps(context):
    """
    Establishes connection to Google Maps and handles the initial page load
    """
    page = context.page
    await page.goto(context.request.url)
    context.log.info(f"Processing: {context.request.url}")

async def main():
    """
    Configures and launches the crawler with custom settings
    """
    # Initialize crawler with browser visibility and timeout settings
    crawler = PlaywrightCrawler(
        headless=False,  # Shows the browser window while scraping
        request_handler_timeout=timedelta(
            minutes=5
        ),  # Allows plenty of time for page loading
    )

    # Tell the crawler how to handle each page it visits
    crawler.router.default_handler(scrape_google_maps)

    # Prepare the search URL
    search_query = "hotels in bengaluru"
    start_url = f"https://www.google.com/maps/search/{search_query.replace(' ', '+')}"

    # Start the scraping process
    await crawler.run([start_url])

if __name__ == "__main__":
    asyncio.run(main())

Run the code using:

$ python3 gmap_scraper.py

When everything works correctly, you'll see the output like this:

3. Importing dependencies and defining the scraper class

Let's start with the basic structure and necessary imports:

import asyncio
from datetime import timedelta
from typing import Dict, Optional, Set
from crawlee.playwright_crawler import PlaywrightCrawler
from playwright.async_api import Page, ElementHandle

The GoogleMapsScraper class serves as the main scraper engine:

class GoogleMapsScraper:
    def __init__(self, headless: bool = True, timeout_minutes: int = 5):
        self.crawler = PlaywrightCrawler(
            headless=headless,
            request_handler_timeout=timedelta(minutes=timeout_minutes),
        )
        self.processed_names: Set[str] = set()

    async def setup_crawler(self) -> None:
        self.crawler.router.default_handler(self._scrape_listings)

This initialization code sets up two crucial components:

A PlaywrightCrawler instance configured to run either headlessly (without a visible browser window) or with a visible browser
A set to track processed business names, preventing duplicate entries

The setup_crawler method configures the crawler to use our main scraping function as the default handler for all requests.

4. Understanding Google Maps internal code structure

Before we dive into scraping, let's understand exactly what elements we need to target. When you search for hotels in Bengaluru, Google Maps organizes hotel information in a specific structure. Here's a detailed breakdown of how to locate each piece of information.

Hotel name: .qBF1Pd

Hotel rating: .MW4etd

Hotel review count: .UY7F9

Hotel URL: a.hfpxzc

Hotel Price: .wcldff

Hotel amenities: .dc6iWb

This returns multiple elements as each hotel has several amenities. We'll need to iterate through these.

Quick tips:

Always verify these selectors before scraping, as Google might update them.
Use Chrome DevTools (F12) to inspect elements and confirm selectors.
Some elements might not be present for all hotels (like prices during the off-season).

5. Scraping Google Maps data using identified selectors

Let's build a scraper to extract detailed hotel information from Google Maps. First, create the core scraping function to handle data extraction.

gmap_scraper.py:

async def _extract_listing_data(self, listing: ElementHandle) -> Optional[Dict]:
    """Extract structured data from a single listing element."""
    try:
        name_el = await listing.query_selector(".qBF1Pd")
        if not name_el:
            return None
        name = await name_el.inner_text()
        if name in self.processed_names:
            return None

        elements = {
            "rating": await listing.query_selector(".MW4etd"),
            "reviews": await listing.query_selector(".UY7F9"),
            "price": await listing.query_selector(".wcldff"),
            "link": await listing.query_selector("a.hfpxzc"),
            "address": await listing.query_selector(".W4Efsd:nth-child(2)"),
            "category": await listing.query_selector(".W4Efsd:nth-child(1)"),
        }

        amenities = []
        amenities_els = await listing.query_selector_all(".dc6iWb")
        for amenity in amenities_els:
            amenity_text = await amenity.get_attribute("aria-label")
            if amenity_text:
                amenities.append(amenity_text)

        place_data = {
            "name": name,
            "rating": await elements["rating"].inner_text() if elements["rating"] else None,
            "reviews": (await elements["reviews"].inner_text()).strip("()") if elements["reviews"] else None,
            "price": await elements["price"].inner_text() if elements["price"] else None,
            "address": await elements["address"].inner_text() if elements["address"] else None,
            "category": await elements["category"].inner_text() if elements["category"] else None,
            "amenities": amenities if amenities else None,
            "link": await elements["link"].get_attribute("href") if elements["link"] else None,
        }

        self.processed_names.add(name)
        return place_data
    except Exception as e:
        # No crawling context is available in this helper, so report the error directly
        print(f"Error extracting listing data: {e}")
        return None

In the code:

query_selector: Returns first DOM element matching CSS selector, useful for single items like a name or rating
query_selector_all: Returns all matching elements, ideal for multiple items like amenities
inner_text(): Extracts text content
Some hotels might not have all the information available, so missing fields are stored as None
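
The extracted values are all strings, for example "4.2" for the rating and "1,171" for the review count. If you want numbers for later analysis, a small post-processing step can convert them. This is a sketch that assumes a comma is used as the thousands separator, as in the sample output; parse_reviews and parse_rating are hypothetical helpers.

```python
from typing import Optional


def parse_reviews(raw: Optional[str]) -> Optional[int]:
    """Convert '(1,171)' or '633' to an int; return None when unusable."""
    if not raw:
        return None
    digits = raw.strip().strip('()').replace(',', '')
    return int(digits) if digits.isdigit() else None


def parse_rating(raw: Optional[str]) -> Optional[float]:
    """Convert '4.2' to a float; return None when the text is not a number."""
    if not raw:
        return None
    try:
        return float(raw.strip())
    except ValueError:
        return None
```

Both helpers pass None through unchanged, so they fit the None-for-missing convention used in _extract_listing_data.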

When you run this script, you'll see output similar to this:

{
    "name": "GRAND KALINGA HOTEL",
    "rating": "4.2",
    "reviews": "1,171",
    "price": "\u20b91,760",
    "link": "https://www.google.com/maps/place/GRAND+KALINGA+HOTEL/data=!4m10!3m9!1s0x3bae160e0ce07789:0xb15bf736f4238e6a!5m2!4m1!1i2!8m2!3d12.9762259!4d77.5786043!16s%2Fg%2F11sp32pz28!19sChIJiXfgDA4WrjsRao4j9Db3W7E?authuser=0&hl=en&rclk=1",
    "amenities": [
        "Pool available",
        "Free breakfast available",
        "Free Wi-Fi available",
        "Free parking available"
    ]
}

6. Managing Infinite Scrolling

Google Maps uses infinite scrolling to load more results as users scroll down. To handle this, we need a method that scrolls the results feed and detects when we've hit the bottom. Add this new method to the GoogleMapsScraper class in gmap_scraper.py:

async def _load_more_items(self, page: Page) -> bool:
    """Scroll down to load more items."""
    try:
        feed = await page.query_selector('div[role="feed"]')
        if not feed:
            return False
        # Remember the current position so we can tell whether the scroll moved the feed
        prev_scroll = await feed.evaluate("(element) => element.scrollTop")
        await feed.evaluate("(element) => element.scrollTop += 800")
        await page.wait_for_timeout(2000)

        new_scroll = await feed.evaluate("(element) => element.scrollTop")
        if new_scroll <= prev_scroll:
            # The position did not change, so we have reached the bottom
            return False
        await page.wait_for_timeout(1000)
        return True
    except Exception as e:
        # No crawling context is available in this helper, so report the error directly
        print(f"Error during scroll: {e}")
        return False

Run this code using:

$ python3 gmap_scraper.py

You should see an output like this:

7. Scrape Listings

The main scraping function ties everything together. It scrapes listings from the page by repeatedly extracting data and scrolling.

async def _scrape_listings(self, context) -> None:
    """Main scraping function to process all listings"""
    try:
        page = context.page
        print(f"\nProcessing URL: {context.request.url}\n")

        await page.wait_for_selector(".Nv2PK", timeout=30000)
        await page.wait_for_timeout(2000)

        while True:
            listings = await page.query_selector_all(".Nv2PK")
            new_items = 0

            for listing in listings:
                place_data = await self._extract_listing_data(listing)
                if place_data:
                    await context.push_data(place_data)
                    new_items += 1
                    print(f"Processed: {place_data['name']}")

            if new_items == 0 and not await self._load_more_items(page):
                break
            if new_items > 0:
                await self._load_more_items(page)

        print(f"\nFinished processing! Total items: {len(self.processed_names)}")
    except Exception as e:
        print(f"Error in scraping: {str(e)}")
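
The while True loop above has no upper bound and stops the first time a scroll fails to move the feed, which can also happen when results load slowly. One possible variation, shown here as a sketch rather than the article's method, is to cap the number of rounds and tolerate a few unsuccessful scrolls in a row. The names scroll_until_stable, max_rounds, and patience are assumptions, and the fake scroll function stands in for _load_more_items.

```python
import asyncio
from typing import Awaitable, Callable


async def scroll_until_stable(
    scroll_once: Callable[[], Awaitable[bool]],
    max_rounds: int = 50,
    patience: int = 2,
) -> int:
    """Scroll until `patience` consecutive attempts report no movement."""
    rounds = 0
    misses = 0
    while rounds < max_rounds and misses < patience:
        moved = await scroll_once()
        rounds += 1
        # Reset the miss counter whenever the feed actually moved
        misses = 0 if moved else misses + 1
    return rounds


async def demo() -> int:
    """Simulate a feed that moves three times and then stops."""
    remaining = [True, True, True]

    async def fake_scroll() -> bool:
        return remaining.pop() if remaining else False

    return await scroll_until_stable(fake_scroll)


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

The cap protects against result feeds that never stop growing, while the patience counter avoids ending the crawl on a single slow load.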

The scraper uses Crawlee's built-in storage system to manage scraped data. When you run the scraper, it creates a storage directory in your project with several key components:

datasets/: Contains the scraped results in JSON format
key_value_stores/: Stores crawler state and metadata
request_queues/: Manages URLs to be processed

The push_data() method we use in our scraper sends the data to Crawlee's dataset storage as you can see below:

8. Running the Scraper

Finally, we need functions to execute our scraper:

async def run(self, search_query: str) -> None:
    """Execute the scraper with a search query"""
    try:
        await self.setup_crawler()
        start_url = f"https://www.google.com/maps/search/{search_query.replace(' ', '+')}"
        await self.crawler.run([start_url])
        await self.crawler.export_data_json('gmap_data.json')
    except Exception as e:
        print(f"Error running scraper: {str(e)}")

async def main():
    """Entry point of the script"""
    scraper = GoogleMapsScraper(headless=True)
    search_query = "hotels in bengaluru"
    await scraper.run(search_query)

if __name__ == "__main__":
    asyncio.run(main())

This data is automatically stored and can later be exported to a JSON file using:

await self.crawler.export_data_json('gmap_data.json')

Here's what your exported JSON file will look like:

[
  {
    "name": "Vividus Hotels, Bangalore",
    "rating": "4.3",
    "reviews": "633",
    "price": "₹3,667",
    "amenities": [
      "Pool available",
      "Free breakfast available",
      "Free Wi-Fi available",
      "Free parking available"
    ],
    "link": "https://www.google.com/maps/place/Vividus+Hotels+,+Bangalore/..."
  }
]

9. Using proxies for Google Maps scraping

When scraping Google Maps at scale, using proxies is very helpful. Here are a few key reasons why:

Avoid IP blocks: Google Maps can detect and block IP addresses that make an excessive number of requests in a short time. Using proxies helps you stay under the radar.
Bypass rate limits: Google implements strict limits on the number of requests per IP address. By rotating through multiple IPs, you can maintain a consistent scraping pace without hitting these limits.
Access location-specific data: Different regions may display different data on Google Maps. Proxies allow you to view listings as if you are browsing from any specific location.

Here's a simple implementation using Crawlee's built-in proxy management. Update your previous code with this to use proxy settings.

from datetime import timedelta

from crawlee.playwright_crawler import PlaywrightCrawler
from crawlee.proxy_configuration import ProxyConfiguration

# Configure your proxy settings
proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        "http://username:password@proxy.provider.com:12345",
        # Add more proxy URLs as needed
    ]
)

# Initialize crawler with proxy support
crawler = PlaywrightCrawler(
    headless=True,
    request_handler_timeout=timedelta(minutes=5),
    proxy_configuration=proxy_configuration,
)
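
Proxy URLs embed the credentials, so a password containing characters such as "@" or ":" has to be percent-encoded or the URL will be parsed incorrectly. The following sketch builds such URLs with the standard library. The helper name build_proxy_url is hypothetical, and the host and port are the placeholder values from the example above.

```python
from urllib.parse import quote


def build_proxy_url(username: str, password: str, host: str, port: int, scheme: str = 'http') -> str:
    """Build a proxy URL with percent-encoded credentials."""
    user = quote(username, safe='')
    secret = quote(password, safe='')
    return f'{scheme}://{user}:{secret}@{host}:{port}'


if __name__ == '__main__':
    print(build_proxy_url('username', 'p@ss:word', 'proxy.provider.com', 12345))
```

The resulting strings can be passed in the proxy_urls list shown above.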

Here's an example of data scraped from New York City hotels using proxies:

{
  "name": "The Manhattan at Times Square Hotel",
  "rating": "3.1",
  "reviews": "8,591",
  "price": "$120",
  "amenities": [
    "Free parking available",
    "Free Wi-Fi available",
    "Air-conditioned available",
    "Breakfast available"
  ],
  "link": "https://www.google.com/maps/place/..."
}

10. Project: Interactive hotel analysis dashboard

After scraping hotel data from Google Maps, you can build an interactive dashboard that helps analyze hotel trends. Here’s a preview of how the dashboard works:

Find the complete info for this dashboard on GitHub: Hotel Analysis Dashboard.

11. Now you’re ready to put everything into action!

Take a look at the complete scripts in my GitHub Gist:

To make it all work:

Run the basic scraper or proxy-integrated scraper: This will collect the hotel data and store it in a JSON file.
Run the dashboard script: Load your JSON data and view it interactively in the dashboard.

Wrapping up and next steps

You've successfully built a comprehensive Google Maps scraper that collects and processes hotel data, presenting it through an interactive dashboard. Now you’ve learned about:

Using Crawlee with Playwright to navigate and extract data from Google Maps
Using proxies to scale up scraping without getting blocked
Storing the extracted data in JSON format
Creating an interactive dashboard to analyze hotel data


How to scrape Google search results with Python

Max Bohomolov — Mon, 02 Dec 2024 05:31:31 +0000

Scraping Google Search delivers essential SERP analysis, SEO optimization, and data collection capabilities. Modern scraping tools make this process faster and more reliable.

One of our community members wrote this post as a contribution to the Crawlee Blog. If you would like to contribute posts like this, please reach out to us on our Discord channel.

In this guide, we'll create a Google Search scraper using Crawlee for Python that can handle result ranking and pagination.

We'll create a scraper that:

Extracts titles, URLs, and descriptions from search results
Handles multiple search queries
Tracks ranking positions
Processes multiple result pages
Saves data in a structured format

Prerequisites

Python 3.9 or higher
Basic understanding of HTML and CSS selectors
Familiarity with web scraping concepts
Crawlee for Python v0.4.2 or higher

Project setup

Install Crawlee with required dependencies:

pipx install 'crawlee[beautifulsoup,curl-impersonate]'

Create a new project using Crawlee CLI:

pipx run crawlee create crawlee-google-search

When prompted, select BeautifulSoup as your template type.
Navigate to the project directory and complete installation:
```
cd crawlee-google-search
poetry install
```

Development of the Google Search scraper in Python

1. Defining data for extraction

First, let's define our extraction scope. Google's search results now include maps, notable people, company details, videos, common questions, and many other elements. We'll focus on analyzing standard search results with rankings.

Here's what we'll be extracting:

Let's verify whether we can extract the necessary data from the page's HTML code, or if we need deeper analysis or JS rendering. Note that this verification is sensitive to HTML tags:

Based on the data obtained from the page, all necessary information is present in the HTML code. Therefore, we can use beautifulsoup_crawler.

The fields we'll extract:

Search result titles
URLs
Description text
Ranking positions

2. Configure the crawler

First, let's create the crawler configuration.

We'll use CurlImpersonateHttpClient as our http_client, with preset headers and an impersonate value that matches the Chrome browser.

We'll also configure ConcurrencySettings to control scraping aggressiveness. This is crucial to avoid getting blocked by Google.

If you need to extract data more intensively, consider setting up ProxyConfiguration.

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient
from crawlee import ConcurrencySettings, HttpHeaders

async def main() -> None:
    concurrency_settings = ConcurrencySettings(max_concurrency=5, max_tasks_per_minute=200)

    http_client = CurlImpersonateHttpClient(impersonate="chrome124",
                                            headers=HttpHeaders({"referer": "https://www.google.com/",
                                                     "accept-language": "en",
                                                     "accept-encoding": "gzip, deflate, br, zstd",
                                                     "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
                                            }))

    crawler = BeautifulSoupCrawler(
        max_request_retries=1,
        concurrency_settings=concurrency_settings,
        http_client=http_client,
        max_requests_per_crawl=10,
        max_crawl_depth=5
    )

    await crawler.run(['https://www.google.com/search?q=Apify'])

3. Implementing data extraction

First, let's analyze the HTML code of the elements we need to extract:

There's an obvious distinction between readable ID attributes and generated class names and other attributes. When creating selectors for data extraction, ignore any generated attributes. Even if you've read that Google has used a particular generated attribute for N years, don't rely on it: generated names can change at any time, and avoiding them is part of writing robust code.

Now that we understand the HTML structure, let's implement the extraction. As our crawler deals with only one type of page, we can use router.default_handler for processing it. Within the handler, we'll use BeautifulSoup to iterate through each search result, extracting data such as title, url, and text_widget while saving the results.

@crawler.router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    """Default request handler."""
    context.log.info(f'Processing {context.request} ...')

    for item in context.soup.select("div#search div#rso div[data-hveid][lang]"):
        data = {
            'title': item.select_one("h3").get_text(),
            "url": item.select_one("a").get("href"),
            "text_widget": item.select_one("div[style*='line']").get_text(),
        }
        await context.push_data(data)

4. Handling pagination

Since Google results depend on the IP geolocation of the search request, we can't rely on link text for pagination. We need to create a more sophisticated CSS selector that works regardless of geolocation and language settings.

The max_crawl_depth parameter controls how many pages our crawler should scan. Once we have our robust selector, we simply need to get the next page link and add it to the crawler's queue.

To write more efficient selectors, learn the basics of CSS and XPath syntax.

    await context.enqueue_links(selector="div[role='navigation'] td[role='heading']:last-of-type > a")

5. Exporting data to CSV format

Since we want to save all search result data in a convenient tabular format like CSV, we can simply add the export_data_csv method call right after running the crawler:

await crawler.export_data_csv("google_search.csv")

6. Finalizing the Google Search scraper

While our core crawler logic works, you might have noticed that our results currently lack ranking position information. To complete our scraper, we need to implement proper ranking position tracking by passing data between requests using user_data in Request.
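
To illustrate the idea of carrying a ranking position from one results page to the next, here is a minimal sketch that works on plain dictionaries instead of Crawlee requests. The function name rank_page and the last_order key are assumptions made for this example and are not necessarily what the final script uses.

```python
def rank_page(items: list, user_data: dict) -> tuple:
    """Number the results of one page and return the state for the next page."""
    start = user_data.get('last_order', 1)
    ranked = [
        {'order_no': start + offset, 'query': user_data['query'], **item}
        for offset, item in enumerate(items)
    ]
    # The next page continues counting where this one stopped
    next_user_data = {**user_data, 'last_order': start + len(items)}
    return ranked, next_user_data
```

In a handler, the second return value would be attached to the request for the next results page so that numbering continues across pages instead of restarting at 1.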

Let's modify the script to handle multiple queries and track ranking positions for search results analysis. We'll also set the crawling depth as a top-level variable. Let's move the router.default_handler to routes.py to match the project structure:

# crawlee-google-search.main

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient
from crawlee import Request, ConcurrencySettings, HttpHeaders

from .routes import router

QUERIES = ["Apify", "Crawlee"]

CRAWL_DEPTH = 2


async def main() -> None:
    """The crawler entry point."""

    concurrency_settings = ConcurrencySettings(max_concurrency=5, max_tasks_per_minute=200)

    http_client = CurlImpersonateHttpClient(
        impersonate="chrome124",
        headers=HttpHeaders({
            "referer": "https://www.google.com/",
            "accept-language": "en",
            "accept-encoding": "gzip, deflate, br, zstd",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        }),
    )
    crawler = BeautifulSoupCrawler(
        request_handler=router,
        max_request_retries=1,
        concurrency_settings=concurrency_settings,
        http_client=http_client,
        max_requests_per_crawl=100,
        max_crawl_depth=CRAWL_DEPTH
    )

    requests_lists = [Request.from_url(f"https://www.google.com/search?q={query}", user_data = {"query": query}) for query in QUERIES]

    await crawler.run(requests_lists)

    await crawler.export_data_csv("google_ranked.csv")

Let's also modify the handler to add query and order_no fields and basic error handling:

# crawlee-google-search.routes

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext
from crawlee.router import Router

router = Router[BeautifulSoupCrawlingContext]()


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    """Default request handler."""
    context.log.info(f'Processing {context.request.url} ...')

    order = context.request.user_data.get("last_order", 1)
    query = context.request.user_data.get("query")
    for item in context.soup.select("div#search div#rso div[data-hveid][lang]"):
        try:
            data = {
                "query": query,
                "order_no": order,
                'title': item.select_one("h3").get_text(),
                "url": item.select_one("a").get("href"),
                "text_widget": item.select_one("div[style*='line']").get_text(),
            }
            await context.push_data(data)
            order += 1
        except AttributeError as e:
            context.log.warning(f'Attribute error for query "{query}": {str(e)}')
        except Exception as e:
            context.log.error(f'Unexpected error for query "{query}": {str(e)}')

    await context.enqueue_links(selector="div[role='navigation'] td[role='heading']:last-of-type > a",
                                user_data={"last_order": order, "query": query})
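
To see how the order_no counter carries over from one results page to the next, here is a minimal plain-Python sketch of the same bookkeeping. It does not use Crawlee; handle_page and the sample titles are made up for illustration, and the returned dict stands in for the user_data passed to enqueue_links.

```python
def handle_page(titles, user_data):
    """Mimic one handler call: number the results, then build user_data for the next page."""
    order = user_data.get("last_order", 1)  # the first page has no last_order yet
    query = user_data.get("query")
    rows = []
    for title in titles:
        rows.append({"query": query, "order_no": order, "title": title})
        order += 1
    # This dict plays the role of the user_data handed to the next request.
    return rows, {"last_order": order, "query": query}


# Two made-up result pages for one query.
page_1, next_data = handle_page(["A", "B", "C"], {"query": "Crawlee"})
page_2, _ = handle_page(["D", "E"], next_data)
print([row["order_no"] for row in page_1 + page_2])
```

Because each request carries its own counter, results for different queries are numbered independently even when the crawler processes them concurrently.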

And we're done!

Our Google Search crawler is ready. Let's look at the results in the google_ranked.csv file:

The code repository is available on GitHub

Scrape Google Search results with Apify

If you're working on a large-scale project requiring millions of data points, like the project featured in this article about Google ranking analysis, you might need a ready-made solution.

Consider using Google Search Results Scraper by the Apify team.

It offers important features such as:

Proxy support
Scalability for large-scale data extraction
Geolocation control
Integration with external services like Zapier, Make, Airbyte, LangChain and others

You can learn more in the Apify blog

What will you scrape?

In this blog, we've explored step-by-step how to create a Google Search crawler that collects ranking data. How you analyze this dataset is up to you!

As a reminder, you can find the full project code on GitHub.

I'd like to think that in 5 years I'll need to write an article on "How to extract data from the best search engine for LLMs", but I suspect that in 5 years this article will still be relevant.

Reverse engineering GraphQL persistedQuery extension

Saurav Jain — Fri, 22 Nov 2024 08:03:47 +0000

GraphQL is a query language for getting deeply nested structured data from a website's backend, similar to MongoDB queries.

The request is usually a POST to some general /graphql endpoint with a body like this:

However, with large data structures, this becomes inefficient: you send a large query in every POST request body, even though it is almost always the same and only changes when the website is updated, and POST requests can’t be cached. Therefore, an extension called “persisted queries” was developed. This isn’t an anti-scraping secret; you can read the public documentation about it here.

TLDR: the client computes the sha256 hash of the query text and only sends that hash. In addition, you can possibly fit all of this into the query string of a GET request, making it easily cacheable. Below is an example request from Zillow.

As you can see, it’s just some metadata about the persistedQuery extension, the hash of the query, and variables to be embedded in the query.
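
As a rough illustration of what the client does, the sketch below hashes a query with sha256 and packs the result into GET parameters. The query text, operation name, and example.com endpoint are placeholders, and the extensions layout follows the publicly documented Apollo-style convention, which a particular site may or may not use exactly.

```python
import hashlib
import json
from urllib.parse import urlencode


def persisted_query_params(query: str, variables: dict, operation_name: str) -> dict:
    """Build GET parameters that carry only the hash of the query, not its text."""
    sha = hashlib.sha256(query.encode("utf-8")).hexdigest()
    extensions = {"persistedQuery": {"version": 1, "sha256Hash": sha}}
    return {
        "operationName": operation_name,
        "variables": json.dumps(variables, separators=(",", ":")),
        "extensions": json.dumps(extensions, separators=(",", ":")),
    }


query = "query Hello { hello }"  # placeholder query text
params = persisted_query_params(query, {}, "Hello")
url = "https://example.com/graphql?" + urlencode(params)  # placeholder endpoint
print(url)
```

The hash is computed over the exact query string, so even a whitespace difference produces a different hash. That is one reason you cannot simply guess the query behind a hash you observed.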

Here’s another request from expedia.com, sent as a POST, but with the same extension:

This primarily optimizes website performance, but it creates several challenges for web scraping:

GET requests are usually more prone to being blocked.
Hidden Query Parameters: We don’t know the full query, so if the website responds with a “Persisted query not found” error (asking us to send the query in full, not just the hash), we can’t send it.
Once the website changes even slightly and clients start requesting a new query, the server will soon forget the old ID/hash (even though the old query might still work), and your request with that hash will never work again, since you can’t “remind” the server of the full query text.

So, for various reasons, you might find yourself needing to extract the whole query text. You could dig through the website's JavaScript, and if you’re lucky, you might find the query text there in full, but often it is dynamically constructed from multiple fragments.

We figured out a better way that doesn't touch the client-side JavaScript at all. Instead, we simulate the situation where the client uses a hash that the server does not know. To do that, we intercept the (valid) request sent by the browser in-flight and change the hash to a bogus one before passing it on to the server.

For exactly this use case, a perfect tool exists: mitmproxy, an open-source Python library that intercepts requests made by your own devices, websites, or apps and allows you to modify them with simple Python scripts.

Download mitmproxy, and prepare a Python script like this:

import json

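def request(flow):
    # mitmproxy calls this hook for every intercepted request
    try:
        dat = json.loads(flow.request.text)
        # dat[0] assumes a batched body (a JSON array of operations)
        dat[0]["extensions"]["persistedQuery"]["sha256Hash"] = "0d9e" # any bogus hex string here
        flow.request.text = json.dumps(dat)
    except Exception:
        # not a JSON body or not a persisted query: leave the request untouched
        pass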

This defines a hook that mitmproxy will run on every request: it tries to load the request's JSON body, modifies the hash to an arbitrary value, and writes the updated JSON as a new body of the request.
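
Some sites send a single JSON object rather than a batch array, so a slightly more defensive variant may be useful. The sketch below keeps the rewriting logic in a plain function that can be tried without mitmproxy; replace_persisted_hash is a hypothetical helper name, and the body layout is assumed to match the one used above.

```python
import json

BOGUS_HASH = "0d9e"  # any bogus hex string


def replace_persisted_hash(body_text, bogus=BOGUS_HASH):
    """Return the body with every persistedQuery hash replaced, or None if nothing applies."""
    try:
        payload = json.loads(body_text)
    except (TypeError, ValueError):
        return None  # empty or non-JSON body
    # Accept both a batch (list of operations) and a single operation (dict).
    operations = payload if isinstance(payload, list) else [payload]
    changed = False
    for op in operations:
        extensions = op.get("extensions") if isinstance(op, dict) else None
        persisted = extensions.get("persistedQuery") if isinstance(extensions, dict) else None
        if isinstance(persisted, dict) and "sha256Hash" in persisted:
            persisted["sha256Hash"] = bogus
            changed = True
    return json.dumps(payload) if changed else None


sample = json.dumps({"extensions": {"persistedQuery": {"version": 1, "sha256Hash": "abc"}}})
print(replace_persisted_hash(sample))
```

Inside the mitmproxy hook you could call this function on flow.request.text and assign the result back only when it is not None.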

We also need to make sure we reroute our browser requests to mitmproxy. For this purpose we are going to use a browser extension called FoxyProxy. It is available in both Firefox and Chrome.

Just add a route with these settings:

Now we can run mitmproxy with this script: mitmweb -s script.py

This will open a browser tab where you can watch all the intercepted requests in real-time.

If you open that request in mitmweb and look at its request section, you will see that a garbage value has replaced the hash.

Now, if you visit Zillow, trigger that request, and look at its response section, you will see that the client receives the PersistedQueryNotFound error.

The Zillow front end reacts by sending the whole query as a POST request.

We extract the query and hash directly from this POST request. To make sure the Zillow server does not forget this hash, we periodically re-send this POST request with the exact same query and hash. This keeps the scraper working even when the server's cache is cleaned or reset, or the website changes.
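
On the scraper side, this boils down to two small pieces: recognizing the error and rebuilding the full POST body from the captured query and hash. The sketch below assumes an Apollo-style error payload (a PersistedQueryNotFound message or a PERSISTED_QUERY_NOT_FOUND code); the function names and sample values are illustrative, not taken from Zillow.

```python
def is_persisted_query_not_found(response_json) -> bool:
    """Check a decoded GraphQL response for the 'hash unknown' error."""
    errors = response_json.get("errors") if isinstance(response_json, dict) else None
    for err in errors or []:
        if not isinstance(err, dict):
            continue
        extensions = err.get("extensions")
        code = extensions.get("code") if isinstance(extensions, dict) else None
        if err.get("message") == "PersistedQueryNotFound" or code == "PERSISTED_QUERY_NOT_FOUND":
            return True
    return False


def full_query_body(query: str, sha256_hash: str, variables: dict) -> dict:
    """Build the POST body that sends the full query text together with its hash."""
    return {
        "query": query,
        "variables": variables,
        "extensions": {"persistedQuery": {"version": 1, "sha256Hash": sha256_hash}},
    }


error_response = {"errors": [{"message": "PersistedQueryNotFound"}]}
print(is_persisted_query_not_found(error_response))
body = full_query_body("query Hello { hello }", "captured-hash-goes-here", {})
```

A scheduled job can send full_query_body(...) as a POST whenever the hash-only request starts returning the error, or simply at a fixed interval.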

Conclusion

Persisted queries are a powerful optimization tool for GraphQL APIs, enhancing website performance by minimizing payload sizes and enabling GET request caching. However, they also pose significant challenges for web scraping, primarily due to the reliance on server-stored hashes and the potential for those hashes to become invalid.

Using mitmproxy to intercept and manipulate GraphQL requests offers an efficient way to reveal the full query text without delving into complex client-side JavaScript. By forcing the server to respond with a PersistedQueryNotFound error, we can capture the full query payload and use it for scraping purposes. Periodically running the extracted query ensures the scraper remains functional, even when server-side cache resets occur or the website evolves.

12 tips on how to think like a web scraping expert

Max Bohomolov — Mon, 04 Nov 2024 10:15:24 +0000

Typically, tutorials focus on the technical aspects, on what you can replicate: "Start here, follow this path, and you'll end up here." This is great for learning a particular technology, but it's sometimes difficult to understand why the author decided to do things a certain way or what guides their development process.

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our discord channel.

In this blog, I'll discuss the general rules and principles that guide me when I work on web scraping projects and allow me to achieve great results.

So, let's explore the mindset of a web scraping developer.

1. Choosing a data source for the project

When you start working on a project, you likely have a target site from which you need to extract specific data. Check what possibilities this site or application provides for data extraction. Here are some possible options:

Official API - the site may provide a free official API through which you can get all the necessary data. This is the best option for you. For example, you can consider this approach if you need to extract data from Yelp
Website - in this case, we study the website, its structure, as well as the ways the frontend and backend interact
Mobile Application - in some cases, there's no website or API at all, or the mobile application provides more data, in which case, don't forget about the man-in-the-middle approach

If one data source fails, try accessing another available source.

For example, for Yelp, all three options are available, and if the Official API doesn't suit you for some reason, you can try the other two.

2. Check `robots.txt` and `sitemap`

I think everyone knows about robots.txt and sitemap one way or another, but I regularly see people simply forgetting about them. If you're hearing about these for the first time, here's a quick explanation:

robots is the established name for crawlers in SEO. Usually, this refers to crawlers of major search engines like Google and Bing, or services like Ahrefs and ChatGPT.
robots.txt is a file describing the allowed behavior for robots. It includes permitted crawler user-agents, wait time between page scans, patterns of pages forbidden for scanning, and more. These rules are typically based on which pages should be indexed by search engines and which should not.
sitemap describes the site structure to make it easier for robots to navigate. It also helps in scanning only the content that needs updating, without creating unnecessary load on the site

Since you're not Google or any other popular search engine, the robot rules in robots.txt will likely be against you. But combined with the sitemap, this is a good place to study the site structure, expected interaction with robots, and non-browser user-agents. In some situations, it simplifies data extraction from the site.
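
If you want to read these rules programmatically, Python's standard library includes a robots.txt parser. The sketch below parses an inline, made-up robots.txt so it runs without network access; the my-scraper user agent and example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blog_allowed = parser.can_fetch("my-scraper", "https://example.com/blog/post-1")
admin_allowed = parser.can_fetch("my-scraper", "https://example.com/admin/users")
delay = parser.crawl_delay("my-scraper")
sitemaps = parser.site_maps()  # available since Python 3.8
print(blog_allowed, admin_allowed, delay, sitemaps)
```

For a real site you would call set_url() and read() instead of parse(), which downloads the file for you.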

For example, using the sitemap for Crawlee website, you can easily get direct links to posts both for the entire lifespan of the blog and for a specific period. One simple check, and you don't need to implement pagination logic.
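
As a sketch of that idea, the snippet below filters sitemap entries by URL prefix and last-modified date using only the standard library. The XML is a small made-up sample in the standard sitemap format, and sitemap_urls is a hypothetical helper; a real sitemap may also be an index that points to further sitemap files.

```python
import xml.etree.ElementTree as ET
from datetime import date

# A small made-up sitemap for illustration.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/a</loc><lastmod>2024-10-01</lastmod></url>
  <url><loc>https://example.com/blog/b</loc><lastmod>2024-11-15</lastmod></url>
  <url><loc>https://example.com/docs/c</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text, prefix="", since=None):
    """Return <loc> values that start with prefix and were modified on or after since."""
    urls = []
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if not loc.startswith(prefix):
            continue
        # Entries without a lastmod are skipped when a date filter is requested.
        if since is not None and (not lastmod or date.fromisoformat(lastmod[:10]) < since):
            continue
        urls.append(loc)
    return urls


blog_posts = sitemap_urls(SITEMAP_XML, prefix="https://example.com/blog/")
recent_posts = sitemap_urls(SITEMAP_XML, prefix="https://example.com/blog/", since=date(2024, 11, 1))
print(blog_posts, recent_posts)
```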

3. Don't neglect site analysis

Thorough site analysis is an important prerequisite for creating an effective web scraper, especially if you're not planning to use browser automation. However, such analysis takes time, sometimes a lot of it.

It's also worth noting that the time spent on analysis and searching for a more optimal crawling solution doesn't always pay off - you might spend hours only to discover that the most obvious approach was the best all along.

Therefore, it's wise to set limits on your initial site analysis. If you don't see a better path within the allocated time, revert to simpler approaches. As you gain more experience, you'll more often be able to tell early on, based on the technologies used on the site, whether it's worth dedicating more time to analysis or not.

Also, in projects where you need to extract data from a site just once, thorough site analysis can sometimes eliminate the need to write scraper code altogether. Here's an example of such a site - https://ricebyrice.com/nl/pages/find-store.

By analyzing it, you'll easily discover that all the data can be obtained with a single request. You simply need to copy this data from your browser into a JSON file, and your task is complete.

4. Maximum interactivity

When analyzing a site, switch sorts, pages, interact with various elements of the site, while watching the Network tab in your browser's Dev Tools. This will allow you to better understand how the site interacts with the backend, what framework it's built on, and what behavior can be expected from it.

5. Data doesn't appear out of thin air

This is obvious, but it's important to keep in mind while working on a project. If you see some data or request parameters, they were obtained somewhere earlier: perhaps in another request, perhaps they were already on the page, or perhaps they were built by JS from other parameters. But they always come from somewhere.

If you don't understand where the data on the page comes from, or the data used in a request, follow these steps:

Sequentially, check all requests the site made before this point.
Examine their responses, headers, and cookies.
Use your intuition: Could this parameter be a timestamp? Could it be another parameter in a modified form?
Does it resemble any standard hashes or encodings?
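
The last two checks can be partly automated. The sketch below is a rough heuristic, not a reliable classifier: it flags values that look like Unix timestamps, common hex digests, or base64. The guess_parameter name and the thresholds are my own choices for illustration.

```python
import base64
import binascii
import re
from datetime import datetime, timezone

HASH_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256"}


def guess_parameter(value: str) -> list[str]:
    """Return rough guesses about what an unknown request parameter might be."""
    guesses = []
    # 10 digits: seconds since the epoch; 13 digits: milliseconds.
    if value.isdigit() and len(value) in (10, 13):
        seconds = int(value) / (1000 if len(value) == 13 else 1)
        moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
        guesses.append(f"unix timestamp ({moment:%Y-%m-%d})")
    if re.fullmatch(r"[0-9a-fA-F]+", value) and len(value) in HASH_LENGTHS:
        guesses.append(f"hex digest, length matches {HASH_LENGTHS[len(value)]}")
    if len(value) >= 8 and len(value) % 4 == 0 and re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", value):
        try:
            decoded = base64.b64decode(value, validate=True)
            guesses.append(f"base64 ({decoded[:20]!r})")
        except (binascii.Error, ValueError):
            pass
    return guesses


print(guess_parameter("1730715324"))
print(guess_parameter(base64.b64encode(b"hello world!").decode()))
```

A value can match more than one guess (a hex digest is also valid base64), so treat the output as hints to verify against the site's other requests.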

Practice makes perfect here. As you become familiar with different technologies and frameworks and their expected behaviors, you'll find it easier to understand how things work and how data is transferred. This accumulated knowledge will significantly improve your ability to trace and understand data flow in web applications.

6. Data is cached

You may notice that when opening the same page several times, the requests transmitted to the server differ: possibly something was cached and is already stored on your computer. Therefore, it's recommended to analyze the site in incognito mode, as well as switch browsers.

This situation is especially relevant for mobile applications, which may store some data in storage on the device. Therefore, when analyzing mobile applications, you may need to clear the cache and storage.

7. Learn more about the framework

If during the analysis you discover that the site uses a framework you haven't encountered before, take some time to learn about it and its features. For example, if you notice a site is built with Next.js, understanding how it handles routing and data fetching could be crucial for your scraping strategy.
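
As one concrete case, many Next.js sites built with the Pages Router embed the page's data as JSON in a script tag with the id __NEXT_DATA__. The sketch below pulls that JSON out of a made-up HTML sample; it assumes this tag is present, which is not true for every Next.js site (for example, sites using the newer App Router).

```python
import json
import re

NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ JSON, or None if the tag is missing."""
    match = NEXT_DATA_RE.search(html)
    return json.loads(match.group(1)) if match else None


# Made-up page source for illustration.
SAMPLE_HTML = """<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"page": "/posts/[slug]", "query": {"slug": "hello"}, "props": {"pageProps": {"title": "Hello"}}}
</script></body></html>"""

data = extract_next_data(SAMPLE_HTML)
print(data["page"], data["props"]["pageProps"])
```

When this tag exists, it often contains the same data the page renders, which can save you from parsing the HTML at all.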

You can learn about these frameworks through official documentation or by using LLMs like ChatGPT or Claude. These AI assistants are excellent at explaining framework-specific concepts. Here's an example of how you might query an LLM about Next.js:

I am in the process of optimizing my website using Next.js. Are there any files passed to the browser that describe all internal routing and how links are formed?

Restrictions:
- Accompany your answers with code samples
- Use this message as the main message for all subsequent responses
- Reference only those elements that are available on the client side, without access to the project code base

You can create similar queries for backend frameworks as well. For instance, with GraphQL, you might ask about available fields and query structures. These insights can help you understand how to better interact with the site's API and what data is potentially available.

To work effectively with LLMs, I recommend learning at least the basics of prompt engineering.

8. Reverse engineering

Web scraping goes hand in hand with reverse engineering. You study the interactions between the frontend and backend, and you may need to study the code to better understand how certain parameters are formed.

But in some cases, reverse engineering may require more knowledge, effort, time, or have a high degree of complexity. At this point, you need to decide whether you need to delve into it or it's better to change the data source, or, for example, technologies. Most likely, this will be the moment when you decide to abandon HTTP web scraping and switch to a headless browser.

The main principle of most web scraping protections is not to make web scraping impossible, but to make it expensive.

Let's just look at what the response to a search on zoopla looks like

9. Testing requests to endpoints

After identifying the endpoints you need to extract the target data, make sure you get a correct response when making a request. If you get a response from the server other than 200, or data different from expected, then you need to figure out why. Here are some possible reasons:

You need to pass some parameters, for example cookies, or specific technical headers
The site requires a corresponding Referer header when this endpoint is accessed
The site expects that the headers will follow a certain order. I've encountered this only a couple of times, but I have encountered it
The site uses protection against web scraping, for example with TLS fingerprint

And many other possible reasons, each of which requires separate analysis.
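
One way to narrow down which header a site actually depends on is to start from the full set copied from your browser and drop one header at a time. The sketch below only generates those variants; sending them is left to whatever HTTP client you use. The header values are placeholders and header_variants is a hypothetical helper.

```python
def header_variants(headers: dict):
    """Yield (dropped_name, remaining_headers) with one header removed at a time."""
    for name in headers:
        yield name, {k: v for k, v in headers.items() if k != name}


# Placeholder values; in practice, copy the real ones from the Network tab.
browser_headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://example.com/search",
    "Cookie": "session=abc",
}

variants = list(header_variants(browser_headers))
for dropped, remaining in variants:
    print(f"without {dropped}: {sorted(remaining)}")
```

If the response changes only when a particular header is missing, that header is required. If no variant succeeds, the cause is probably elsewhere, for example header order or TLS fingerprinting.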

10. Experiment with request parameters

Explore what results you get when changing request parameters, if any. Some parameters may be missing but supported on the server side. For example, order, sort, per_page, limit, and others. Try adding them and see if the behavior changes.
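
A small helper can generate the URLs to try. The sketch below appends each candidate parameter that is not already present; the candidate names and values are guesses for illustration, and the example.com URL is a placeholder.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Guessed parameter names and values; adjust them to the API you are probing.
CANDIDATES = {"per_page": "100", "limit": "100", "sort": "desc", "order": "desc"}


def with_extra_params(url, candidates=CANDIDATES):
    """Return one URL per candidate parameter that the original URL does not already use."""
    parts = urlsplit(url)
    existing = parse_qsl(parts.query, keep_blank_values=True)
    present = {name for name, _ in existing}
    variants = []
    for name, value in candidates.items():
        if name in present:
            continue
        query = urlencode(existing + [(name, value)])
        variants.append(urlunsplit(parts._replace(query=query)))
    return variants


variants = with_extra_params("https://example.com/api/items?page=1&limit=20")
for variant in variants:
    print(variant)
```

Compare each response with the baseline: a different item count or ordering suggests the server honors that parameter.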

This is especially relevant for sites using GraphQL.

Let's consider this example.

If you analyze the site, you'll see a request that can be reproduced with the following code (I've formatted it a bit to improve readability):

import requests

url = "https://restoran.ua/graphql"

data = {
    "operationName": "Posts_PostsForView",
    "variables": {"sort": {"sortBy": ["startAt_DESC"]}},
    "query": """query Posts_PostsForView(
    $where: PostForViewWhereInput,
    $sort: PostForViewSortInput,
    $pagination: PaginationInput,
    $search: String,
    $token: String,
    $coordinates_slice: SliceInput)
    {
        PostsForView(
                where: $where
                sort: $sort
                pagination: $pagination
                search: $search
                token: $token
                ) {
                        id
                        title: ukTitle
                        summary: ukSummary
                        slug
                        startAt
                        endAt
                        newsFeed
                        events
                        journal
                        toProfessionals
                        photoHeader {
                            address: mobile
                            __typename
                            }
                        coordinates(slice: $coordinates_slice) {
                            lng
                            lat
                            __typename
                            }
                        __typename
                    }
    }"""
}

response = requests.post(url, json=data)

print(response.json())

Now I'll update it to get results in 2 languages at once, and most importantly, along with the internal text of the publications:

import requests

url = "https://restoran.ua/graphql"

data = {
    "operationName": "Posts_PostsForView",
    "variables": {"sort": {"sortBy": ["startAt_DESC"]}},
    "query": """query Posts_PostsForView(
    $where: PostForViewWhereInput,
    $sort: PostForViewSortInput,
    $pagination: PaginationInput,
    $search: String,
    $token: String,
    $coordinates_slice: SliceInput)
    {
        PostsForView(
                where: $where
                sort: $sort
                pagination: $pagination
                search: $search
                token: $token
                ) {  
                        id
                        uk_title: ukTitle
                        en_title: enTitle
                        summary: ukSummary
                        slug
                        startAt
                        endAt
                        newsFeed
                        events
                        journal
                        toProfessionals
                        photoHeader {
                            address: mobile
                            __typename
                            }
                        mixedBlocks {
                            index
                            en_text: enText
                            uk_text: ukText
                            __typename
                            }
                        coordinates(slice: $coordinates_slice) {
                            lng
                            lat
                            __typename
                            }
                        __typename
                    }
    }"""
}

response = requests.post(url, json=data)
print(response.json())

As you can see, a small update of the request parameters allows me not to worry about visiting the internal page of each publication. You have no idea how many times this trick has saved me.

If you see GraphQL in front of you and don't know where to start, my advice about documentation and LLMs applies here too.

11. Don't be afraid of new technologies

I know how easy it is to master a few tools and just use them because it works. I've fallen into this trap more than once myself.

But modern sites use modern technologies that have a significant impact on web scraping, and in response, new tools for web scraping are emerging. Learning these may greatly simplify your next project, and may even solve some problems that were insurmountable for you. I wrote about some tools earlier.

I especially recommend paying attention to curl_cffi and the frameworks botasaurus and Crawlee for Python.

12. Help open-source libraries

Personally, I only recently came to realize the importance of this. All the tools I use for my work are either open-source projects or based on open source. Web scraping literally lives thanks to open source. This is especially noticeable if you're a Python developer and have discovered how limited pure Python is when you need to deal with TLS fingerprinting. Here again, open source saved us.

And it seems to me that the least we could do is invest a little of our knowledge and skills in supporting open-source.

I chose to support Crawlee for Python, and no, not because they allowed me to write in their blog, but because it shows excellent development dynamics and is aimed at making life easier for web crawler developers. It allows for faster crawler development by taking care of and hiding under the hood such critical aspects as session management, session rotation when blocked, managing concurrency of asynchronous tasks (if you write asynchronous code, you know what a pain this can be), and much more.

Tip
If you like the blog so far, please consider giving Crawlee a star on GitHub. It helps us reach and help more developers.

And what choice will you make?

Conclusion

I think some things in the article were obvious to you, some things you follow yourself, but I hope you learned something new too. If most of them were new, then try using these rules as a checklist in your next project.

I would be happy to discuss the article. Feel free to comment here, in the article, or contact me in the Crawlee developer community on Discord.

You can also find me on the following platforms: Github, Linkedin, Apify, Upwork, Contra.

Thank you for your attention :)

How to create a LinkedIn job scraper in Python with Crawlee

Arindam Majumder — Thu, 17 Oct 2024 07:07:41 +0000

Introduction

In this article, we will build a web application that scrapes LinkedIn for job postings using Crawlee and Streamlit.

We will create a LinkedIn job scraper in Python using Crawlee for Python to extract the company name, job title, time of posting, and link to the job posting from dynamically received user input through the web application.

Note
One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our discord channel.

By the end of this tutorial, you’ll have a fully functional web application that you can use to scrape job postings from LinkedIn.

Let's begin.

Prerequisites

Let's start by creating a new Crawlee for Python project with this command:

pipx run crawlee create linkedin-scraper

Select PlaywrightCrawler in the terminal when Crawlee asks for it.

After installation, Crawlee for Python will create boilerplate code for you. You can change directory (cd) to the project folder and run this command to install dependencies.

poetry install

We are going to begin editing the files provided to us by Crawlee so we can build our scraper.

Note
Before going ahead, if you like reading this blog, we would be really happy if you gave Crawlee for Python a star on GitHub!

Star us on GitHub ⭐️

Building the LinkedIn job Scraper in Python with Crawlee

In this section, we will be building the scraper using the Crawlee for Python package. To learn more about Crawlee, check out their documentation.

1. Inspecting the LinkedIn job Search Page

Open LinkedIn in your web browser and sign out from the website (if you already have an account logged in). You should see an interface like this.

Navigate to the jobs section, search for a job and location of your choice, and copy the URL.

You should have something like this:

https://www.linkedin.com/jobs/search?keywords=Backend%20Developer&location=Canada&geoId=101174742&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0

We're going to focus on the search parameters, which is the part that goes after '?'. The keywords and location parameters are the most important ones for us.

The job title the user supplies will go into the keywords parameter, while the location the user supplies will go into the location parameter. Lastly, the geoId parameter will be removed while we keep the other parameters constant.

We are going to be making changes to our main.py file. Copy and paste the code below in your main.py file.

from urllib.parse import urlencode, urljoin

from crawlee.playwright_crawler import PlaywrightCrawler
from .routes import router

async def main(title: str, location: str, data_name: str) -> None:
    base_url = "https://www.linkedin.com/jobs/search"

    # URL encode the parameters
    params = {
        "keywords": title,
        "location": location,
        "trk": "public_jobs_jobs-search-bar_search-submit",
        "position": "1",
        "pageNum": "0"
    }

    encoded_params = urlencode(params)

    # Encode parameters into a query string
    query_string = '?' + encoded_params

    # Combine base URL with the encoded query string
    encoded_url = urljoin(base_url, "") + query_string

    # Initialize the crawler
    crawler = PlaywrightCrawler(
        request_handler=router,
    )

    # Run the crawler with the initial list of URLs
    await crawler.run([encoded_url])

    # Save the data in a CSV file
    output_file = f"{data_name}.csv"
    await crawler.export_data(output_file)

Now that we have encoded the URL, the next step for us is to adjust the generated router to handle LinkedIn job postings.

2. Routing your crawler

We will be making use of two handlers for your application:

Default handler

The default_handler handles the start URL

Job listing

The job_listing handler extracts the individual job details.

Playwright crawler is going to crawl through the job posting page and extract the links to all job postings on the page.

When you examine the job postings, you will discover that the job posting links are inside an unordered list (ul) with a class named jobs-search__results-list. We will then extract the links using the Playwright locator object and add them to the job_listing route for processing.

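import re

from crawlee import Request
from crawlee.playwright_crawler import PlaywrightCrawlingContext
from crawlee.router import Router

router = Router[PlaywrightCrawlingContext]()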

@router.default_handler
async def default_handler(context: PlaywrightCrawlingContext) -> None:
    """Default request handler."""

    #select all the links for the job posting on the page
    hrefs = await context.page.locator('ul.jobs-search__results-list a').evaluate_all("links => links.map(link => link.href)")

    #add all the links to the job listing route
    await context.add_requests(
            [Request.from_url(rec, label='job_listing') for rec in hrefs]
        )

Now that we have the job listings, the next step is to scrape their details.

We'll extract each job’s title, company's name, time of posting, and the link to the job post. Open your dev tools to extract each element using its CSS selector.

After scraping each of the listings, we'll remove special characters from the text to make it clean and push the data to local storage using the context.push_data function.

@router.handler('job_listing')
async def listing_handler(context: PlaywrightCrawlingContext) -> None:
    """Handler for job listings."""

    await context.page.wait_for_load_state('load')

    job_title = await context.page.locator('div.top-card-layout__entity-info h1.top-card-layout__title').text_content()

    company_name  = await context.page.locator('span.topcard__flavor a').text_content()   

    time_of_posting= await context.page.locator('div.topcard__flavor-row span.posted-time-ago__text').text_content()


    await context.push_data(
        {
            # collapse runs of whitespace (including newlines) into single spaces

            'title': re.sub(r'\s+', ' ', job_title).strip(),
            'Company name': re.sub(r'\s+', ' ', company_name).strip(),
            'Time of posting': re.sub(r'\s+', ' ', time_of_posting).strip(),
            'url': context.request.loaded_url,
        }
    )
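
Before relying on a whitespace regex, it helps to check what it does to a realistic value. The sketch below compares two substitutions on a made-up title with the kind of padding text_content() tends to return; clean_text is a hypothetical helper name.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    if raw is None:  # Playwright's text_content() may return None
        return ""
    return re.sub(r"\s+", " ", raw).strip()


raw_title = "\n        Backend   Developer\n      "

removed = re.sub(r"[\s\n]+", "", raw_title)  # deletes every whitespace character
collapsed = clean_text(raw_title)            # keeps one space between words
print(repr(removed), repr(collapsed))
```

Substituting with an empty string also removes the spaces between words, so collapsing to a single space is usually what you want for titles and company names.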

3. Creating your application

For this project, we will be using Streamlit for the web application. Before we proceed, we are going to create a new file named app.py in your project directory. In addition, ensure you have Streamlit installed in your global Python environment before proceeding with this section.

import streamlit as st
import subprocess

# Streamlit form for inputs
st.title("LinkedIn Job Scraper")

with st.form("scraper_form"):
    title = st.text_input("Job Title", value="backend developer")
    location = st.text_input("Job Location", value="newyork")
    data_name = st.text_input("Output File Name", value="backend_jobs")

    submit_button = st.form_submit_button("Run Scraper")

if submit_button:

    # Build the command as a list of arguments, so the form inputs are passed
    # to the scraper as-is and are never interpreted by a shell
    command = [
        "poetry", "run", "python", "-m", "linkedin-scraper",
        "--title", title,
        "--location", location,
        "--data_name", data_name,
    ]

    with st.spinner("Crawling in progress..."):
        # Execute the command and display the results
        result = subprocess.run(command, capture_output=True, text=True)

        st.write("Script Output:")
        st.text(result.stdout)

        if result.returncode == 0:
            st.success(f"Data successfully saved in {data_name}.csv")
        else:
            st.error(f"Error: {result.stderr}")

The Streamlit web application takes the user's input and uses Python's subprocess module to run the Crawlee scraping script.

4. Testing your app

Before we test the application, we need to make a small modification to the __main__.py file so that it accepts the command-line arguments.

import asyncio
import argparse

from .main import main

def get_args():
    # ArgumentParser object to capture command-line arguments
    parser = argparse.ArgumentParser(description="Crawl LinkedIn job listings")


    # Define the arguments
    parser.add_argument("--title", type=str, required=True, help="Job title")
    parser.add_argument("--location", type=str, required=True, help="Job location")
    parser.add_argument("--data_name", type=str, required=True, help="Name for the output CSV file")


    # Parse the arguments
    return parser.parse_args()

if __name__ == '__main__':
    args = get_args()
    # Run the main function with the parsed command-line arguments
    asyncio.run(main(args.title, args.location, args.data_name))

We will start the Streamlit application by running this command in the terminal:

streamlit run app.py

This is what the application should look like in the browser:

You will get this interface showing you that the scraping has been completed:

To access the scraped data, go over to your project directory and open the CSV file.

You should have something like this as the output of your CSV file.

Conclusion

In this tutorial, we have learned how to build an application that can scrape job posting data from LinkedIn using Crawlee. Have fun building great scraping applications with Crawlee.

You can find the complete working crawler code here in the GitHub repository.

Follow Crawlee for more content like this.

Thank you!

Optimizing web scraping: Scraping auth data using JSDOM

Saurav Jain — Wed, 09 Oct 2024 05:26:49 +0000

As scraping developers, we sometimes need to extract authentication data, such as temporary keys, to perform our tasks. That is not always simple. Usually the data sits in the HTML or in XHR network requests, but sometimes it is computed. In that case, we can either reverse-engineer the computation, which means spending a lot of time deobfuscating scripts, or run the JavaScript that computes it. Normally we would use a browser for that, but browsers are expensive. Crawlee supports running a browser scraper and a Cheerio scraper in parallel, but that is complex and costly in terms of compute resources. JSDOM lets us run the page's JavaScript with fewer resources than a browser and only slightly more than Cheerio.

This article discusses a new approach that we use in one of our Actors to obtain the authentication data that the TikTok Ads Creative Center web application generates in the browser, without actually running a browser, by using JSDOM instead.

Analyzing the website

When you visit this URL:

https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pc/en

You will see a list of hashtags with their live ranking, number of posts, trend chart, creators, and analytics. You can also filter by industry, set the time period, and use a checkbox to show only trends that are new to the top 100.

Our goal here is to extract the top 100 hashtags from the list with the given filters.

There are two obvious approaches: CheerioCrawler and browser-based scraping. Cheerio gives results faster but does not work with JavaScript-rendered websites.

Cheerio is not the best option here because the Creative Center is a web application whose data comes from an API, so we could only get the hashtags initially present in the HTML, not all 100 that we need.

The second approach is to use a library like Puppeteer or Playwright for browser-based scraping and automate the page to collect all of the hashtags, but in our experience that takes a lot of time for such a small task.

This is where the new approach comes in: we developed it to be much faster than browser-based scraping and very close to CheerioCrawler-based crawling.

JSDOM Approach

Before diving deep into this approach, I would like to give credit to Alexey Udovydchenko, Web Automation Engineer at Apify, for developing this approach. Kudos to him!

In this approach, we are going to make API calls to https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list to get the required data.

Before making calls to this API, we need a few required headers (the auth data), so we will first make a call to https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en.

We will start by creating a function that builds the start URL for the API call.

export const createStartUrls = (input) => {
    const {
        days = '7',
        country = '',
        resultsLimit = 100,
        industry = '',
        isNewToTop100,
    } = input;

    const filterBy = isNewToTop100 ? 'new_on_board' : '';
    return [
        {
            url: `https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list?page=1&limit=50&period=${days}&country_code=${country}&filter_by=${filterBy}&sort_by=popular&industry_id=${industry}`,
            headers: {
                // required headers
            },
            userData: { resultsLimit },
        },
    ];
};

The function above creates the start URL for the API call, including the filter parameters we talked about earlier. A request to this URL calls the creative_radar_api and fetches the first page of results.
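As an illustration, the same URL can also be assembled with the standard URL and URLSearchParams classes, which take care of encoding the parameter values. This is a minimal sketch, not the Actor's code: buildHashtagListUrl and its page parameter are hypothetical names introduced here, while the endpoint and the query parameters are the ones used in createStartUrls.

```javascript
// Hypothetical helper (not part of the original Actor): builds the same API URL
// as createStartUrls, but lets URLSearchParams handle the encoding.
const API_URL =
    'https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list';

const buildHashtagListUrl = ({
    days = '7',
    country = '',
    industry = '',
    isNewToTop100 = false,
    page = 1, // assumed extra parameter, handy for pagination
} = {}) => {
    const url = new URL(API_URL);
    url.search = new URLSearchParams({
        page: String(page),
        limit: '50',
        period: String(days),
        country_code: country,
        filter_by: isNewToTop100 ? 'new_on_board' : '',
        sort_by: 'popular',
        industry_id: industry,
    }).toString();
    return url.toString();
};

console.log(buildHashtagListUrl({ country: 'US', isNewToTop100: true }));
```

Building the query from an object keeps the parameter names in one place and avoids broken URLs if a value ever contains characters such as & or spaces.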

But it won’t work until we get the headers. So, let’s create a function that will first create a session using sessionPool and proxyConfiguration.

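// Assumed imports for this snippet (merge with your existing imports)
import { Session, log } from 'crawlee';
import { gotScraping } from 'got-scraping';

export const createSessionFunction = async (
    sessionPool,
    proxyConfiguration,
) => {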
    const proxyUrl = await proxyConfiguration.newUrl(Math.random().toString());
    const url =
        'https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en';
    // need url with data to generate token
    const response = await gotScraping({ url, proxyUrl });
    const headers = await getApiUrlWithVerificationToken(
        response.body.toString(),
        url,
    );
    if (!headers) {
        throw new Error(`Token generation blocked`);
    }
    log.info(`Generated API verification headers`, Object.values(headers));
    return new Session({
        userData: {
            headers,
        },
        sessionPool,
    });
};

The main goal of this function is to call https://ads.tiktok.com/business/creativecenter/inspiration/popular/hashtag/pad/en and get the headers in return. To get the headers, we use the getApiUrlWithVerificationToken function.

Before going ahead, I want to mention that Crawlee natively supports JSDOM through JSDOMCrawler. It provides a framework for the parallel crawling of web pages using plain HTTP requests and the jsdom DOM implementation. Because it uses raw HTTP requests to download web pages, it is very fast and efficient on data bandwidth.

Let’s see how we are going to create the getApiUrlWithVerificationToken function:

// Assumed imports for this snippet (merge with your existing imports)
import { log, sleep } from 'crawlee';
import { JSDOM, VirtualConsole } from 'jsdom';

const getApiUrlWithVerificationToken = async (body, url) => {
    log.info(`Getting API session`);
    const virtualConsole = new VirtualConsole();
    const { window } = new JSDOM(body, {
        url,
        contentType: 'text/html',
        runScripts: 'dangerously',
        // 'usable' is faster than a custom resource loader and works without canvas
        resources: 'usable',
        pretendToBeVisual: false,
        virtualConsole,
    });
    virtualConsole.on('error', () => {
        // ignore errors caused by the fake XMLHttpRequest
    });

    const apiHeaderKeys = ['anonymous-user-id', 'timestamp', 'user-sign'];
    const apiValues = {};
    let retries = 10;
    // API calls are made outside of fetch; the hack below captures their headers without an actual call
    window.XMLHttpRequest.prototype.setRequestHeader = (name, value) => {
        if (apiHeaderKeys.includes(name)) {
            apiValues[name] = value;
        }
        // stop waiting as soon as every required header has been captured
        if (Object.keys(apiValues).length === apiHeaderKeys.length) {
            retries = 0;
        }
    };
    window.XMLHttpRequest.prototype.open = (method, urlToOpen) => {
        // no request is opened; URLs of static hosts are only logged for debugging
        if (
            ['static', 'scontent'].some((x) =>
                urlToOpen.startsWith(`https://${x}`),
            )
        ) {
            log.debug('urlToOpen', { urlToOpen });
        }
    };
    // poll every 4 seconds, for up to 10 rounds
    do {
        await sleep(4000);
        retries--;
    } while (retries > 0);

    window.close();
    return apiValues;
};

In this function, we create a JSDOM instance from the downloaded HTML, with a virtual console attached, and let it load and run the page's scripts. JSDOM takes the place of the browser here.

For this particular example, we need three mandatory headers to make the API call, and those are anonymous-user-id, timestamp, and user-sign.

By overriding XMLHttpRequest.prototype.setRequestHeader, we check every header that the page's scripts set on their requests. If it is one of the three headers above, we store its value, and we keep waiting until we have all of them or run out of retries.

The most important part is overriding XMLHttpRequest.prototype.open: the page's API calls are never actually sent, so we get the auth data without running a browser or exposing the bot activity.
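To see the interception idea in isolation, here is a minimal sketch that runs without JSDOM. FakeXMLHttpRequest is a stand-in for window.XMLHttpRequest and captureHeaders is a hypothetical helper; neither is part of the Actor, and the header values are made up.

```javascript
// Stand-in for JSDOM's window.XMLHttpRequest, so the sketch runs without jsdom.
class FakeXMLHttpRequest {
    open() {
        this.opened = true;
    }

    setRequestHeader(name, value) {
        this.sentHeaders = { ...this.sentHeaders, [name]: value };
    }
}

// Hypothetical helper: patches the prototype and records the wanted headers.
const captureHeaders = (XhrClass, wantedKeys) => {
    const captured = {};
    XhrClass.prototype.setRequestHeader = (name, value) => {
        if (wantedKeys.includes(name)) captured[name] = value;
    };
    // Swallow open() so that no real request is ever started.
    XhrClass.prototype.open = () => {};
    const isComplete = () => wantedKeys.every((key) => key in captured);
    return { captured, isComplete };
};

const wanted = ['anonymous-user-id', 'timestamp', 'user-sign'];
const { captured, isComplete } = captureHeaders(FakeXMLHttpRequest, wanted);

// Simulates what the page's own script does when it prepares an API call.
const xhr = new FakeXMLHttpRequest();
xhr.open('GET', 'https://example.com/api');
xhr.setRequestHeader('accept', 'application/json'); // not wanted, so ignored
xhr.setRequestHeader('anonymous-user-id', 'abc');
xhr.setRequestHeader('timestamp', '1700000000');
console.log(isComplete()); // false, user-sign is still missing
xhr.setRequestHeader('user-sign', 'xyz');
console.log(isComplete(), captured);
```

Because the methods are replaced on the prototype, every XMLHttpRequest instance that the page creates afterwards goes through the patched versions, which is why the patch has to be in place before the page's scripts prepare their API calls.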

At the end of createSessionFunction, it returns a session with the required headers.

Now, coming to our main code, we will use CheerioCrawler with preNavigationHooks to inject the headers from the session into every request before it is sent.

const crawler = new CheerioCrawler({
    sessionPoolOptions: {
        maxPoolSize: 1,
        createSessionFunction: async (sessionPool) =>
            createSessionFunction(sessionPool, proxyConfiguration),
    },
    preNavigationHooks: [
        (crawlingContext) => {
            const { request, session } = crawlingContext;
            request.headers = {
                ...request.headers,
                ...session.userData?.headers,
            };
        },
    ],
    proxyConfiguration,
});
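The merge inside the hook relies on object spread order: keys from the session are spread last, so they win over headers already present on the request. A small stand-alone sketch of that behavior, where mergeSessionHeaders and the sample objects are hypothetical and only mimic the shape of Crawlee's request and session:

```javascript
// Hypothetical stand-alone version of the hook body.
const mergeSessionHeaders = (request, session) => {
    request.headers = {
        ...request.headers,
        // spread last, so session headers override request headers
        ...session?.userData?.headers,
    };
    return request;
};

const sampleRequest = {
    url: 'https://example.com/api',
    headers: { accept: 'application/json', timestamp: 'stale' },
};
const sampleSession = {
    userData: { headers: { timestamp: 'fresh', 'user-sign': 'xyz' } },
};

console.log(mergeSessionHeaders(sampleRequest, sampleSession).headers);

// A session without stored headers leaves the request headers untouched.
console.log(mergeSessionHeaders({ headers: { accept: '*/*' } }, {}).headers);
```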

Finally, in the request handler, we process the response of the API call and handle pagination: we work out whether more calls are needed to fetch all the data and, if so, enqueue the next page.

async requestHandler(context) {
    const { log, request, json } = context;
    const { userData } = request;
    const { itemsCounter = 0, resultsLimit = 0 } = userData;
    if (!json.data) {
        throw new Error('BLOCKED');
    }
    const { data } = json;
    const items = data.list;
    const counter = itemsCounter + items.length;
    const dataItems = items.slice(
        0,
        resultsLimit && counter > resultsLimit
            ? resultsLimit - itemsCounter
            : undefined,
    );
    await context.pushData(dataItems);
    const {
        pagination: { page, total },
    } = data;
    log.info(
        `Scraped ${dataItems.length} results out of ${total} from search page ${page}`,
    );
    const isResultsLimitNotReached =
        counter < Math.min(total, resultsLimit);
    if (isResultsLimitNotReached && data.pagination.has_more) {
        const nextUrl = new URL(request.url);
        nextUrl.searchParams.set('page', page + 1);
        await crawler.addRequests([
            {
                url: nextUrl.toString(),
                headers: request.headers,
                userData: {
                    ...request.userData,
                    itemsCounter: itemsCounter + dataItems.length,
                },
            },
        ]);
    }
}

One important thing to note is that the code is written so that it can make any number of API calls.

In this particular example, we start with one request and a single session. When the first API call completes, the request handler enqueues the second one. With 50 results per page and a limit of 100 results, we stop after two calls, but you can make more if you need.

To make things clearer, here is how the code flow looks:
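The pagination arithmetic from the request handler can be tried out on its own. The sketch below pulls it into a pure function; planPage is a hypothetical helper that mirrors the handler's logic and is not part of the Actor, and the sample numbers are made up.

```javascript
// Hypothetical pure helper mirroring the arithmetic in requestHandler.
const planPage = ({ items, itemsCounter = 0, resultsLimit = 0, total, hasMore }) => {
    const counter = itemsCounter + items.length;
    // Trim the page if it would take us past resultsLimit
    // (a falsy resultsLimit disables the trimming).
    const dataItems = items.slice(
        0,
        resultsLimit && counter > resultsLimit
            ? resultsLimit - itemsCounter
            : undefined,
    );
    // Another page is requested only while we are below both the total
    // number of results and the limit, and the API reports more pages.
    const fetchNext = Boolean(hasMore) && counter < Math.min(total, resultsLimit);
    return {
        dataItems,
        fetchNext,
        nextItemsCounter: itemsCounter + dataItems.length,
    };
};

const pageOfFifty = Array.from({ length: 50 }, (_, i) => i);

// First page: 50 of 100 wanted results, so a second call is needed.
const first = planPage({ items: pageOfFifty, resultsLimit: 100, total: 300, hasMore: true });
console.log(first.dataItems.length, first.fetchNext); // 50 true

// Second page: the limit of 100 is reached, so the crawl stops.
const second = planPage({
    items: pageOfFifty,
    itemsCounter: first.nextItemsCounter,
    resultsLimit: 100,
    total: 300,
    hasMore: true,
});
console.log(second.dataItems.length, second.fetchNext); // 50 false
```

With a limit of 80 instead of 100, the second page would be trimmed to 30 items before being pushed, which is the case the slice call is there for.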

Conclusion

This approach gives us a third way to extract the authentication data, without actually using a browser, and pass it to CheerioCrawler. It significantly improves performance and reduces the RAM requirement by 50%. While browser-based scraping is ten times slower than pure Cheerio, JSDOM is just 3-4 times slower, which makes it 2-3 times faster than browser-based scraping.

The project's codebase is already uploaded here. The code is written as an Apify Actor; you can find more about it here, but you can also run it without using the Apify SDK.

If you have any doubts or questions about this approach, reach out to us on our Discord server.