Developer Service

Posted on • Originally published at developer-service.blog

How to Scrape Any Website Using Python, Bright Data, and MCP Servers

In this article, I’ll show you how to build a Python-based chat agent that leverages Bright Data’s Model Context Protocol (MCP) server alongside MistralAI’s chat model, orchestrated via LangChain adapters and LangGraph’s ReAct agent framework.

I cover environment setup, MCP server parameters, Python client initialization using STDIO transport, tool loading, and the asynchronous chat loop.

By following this guide, you’ll have a seamless, tool-enabled AI assistant capable of invoking web-scraping, proxy rotation, CAPTCHA solving, and other Bright Data capabilities in a chat interface powered by MistralAI.




What Is MCP?

The Model Context Protocol (MCP) is an open, JSON-RPC 2.0–based standard that lets AI models invoke external tools through a unified interface.

Think of MCP like the “USB-C port for AI applications,” providing a standardized way to plug in capabilities such as web scrapers, proxy rotation, CAPTCHA solving, and headless browsers.

Bright Data’s MCP server (@brightdata/mcp) is a Node.js implementation exposing powerful scraping and unlocking tools over STDIO or SSE transports.

By using MCP, your AI agents gain real-time, reliable access to both static and dynamic web data without the hassle of building complex scraping infrastructure.


Prerequisites

Bright Data Account & API Token

Sign up at Bright Data and create an MCP server token in your dashboard.

Then create a 'Web Unlocker API':

Web Unlocker API

When creating it, make sure to enable 'CAPTCHA Solver' (it should be on by default):

Web Unlocker API - CAPTCHA Settings

Make note of the 'Zone name'; it will be required for the 'WEB_UNLOCKER_ZONE' environment variable.

Additionally, you need to create a 'Browser API' to scrape JavaScript-heavy sites:

Browser API

Again, when creating it, make sure to enable 'CAPTCHA Solver' (it should be on by default):

Browser API - CAPTCHA Settings

After creation, it will display the credentials:

Browser API - Credentials

Make note of this credential; you will need it for the 'BROWSER_AUTH' environment variable later.

For this credential, the actual value used in the environment variable is:

brd-customer-hl_b2705b7f-zone-mcp_scraping_browser:50wpsh0oa734

(You will need to remove the parts before and after this value from the credential string shown in the dashboard.)

Finally, you will need an API token, which you can find in the 'API Keys' section under 'Account Settings'.

More information about the MCP server configuration is available in the Bright Data docs.

Mistral AI

To use Mistral AI chat models, you will need an API key, which you can generate in the 'API Keys' section of the admin panel.

Node.js & MCP Server Package

Ensure node and npx are installed.

The MCP server runs via the @brightdata/mcp npm package, which you invoke through npx (GitHub).

Python Environment

Python 3.8+ is required for this project.

Additionally, the following packages need to be installed:

langchain_mistralai langchain_mcp_adapters langgraph python-dotenv

These will be installed by uv as described in the next section.


Code Walkthrough

All code is available at the GitHub repository mcp-scrape-web-article.

Let's install the requirements with:

uv pip install -r requirements.txt

Then create a .env file:

MISTRALAI_API_KEY=your_mistralai_key
API_TOKEN=your_brightdata_token
BROWSER_AUTH=your_browser_api_key
WEB_UNLOCKER_ZONE=your_zone_name

Replace each placeholder with the corresponding value from the previous section.

Imports and Environment Loading

# Imports
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from langchain_mcp_adapters.tools import load_mcp_tools
from langgraph.prebuilt import create_react_agent
from langchain_mistralai import ChatMistralAI
from dotenv import load_dotenv
import asyncio
import os

# Load environment variables
load_dotenv()

We begin by loading environment variables via python-dotenv, keeping secrets out of source control.
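If any of these variables is missing, the failure only shows up later as an opaque authentication error. As a small sketch (this helper is one I'm introducing here, not part of the original script), you can fail fast right after load_dotenv():

```python
import os

# Variable names taken from the .env file above
REQUIRED_VARS = ["MISTRALAI_API_KEY", "API_TOKEN", "BROWSER_AUTH", "WEB_UNLOCKER_ZONE"]

def missing_env_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_env_vars() right after load_dotenv() and raising if the list is non-empty gives a clear error message up front.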

Model Initialization

# Initialize the model
model = ChatMistralAI(model="mistral-large-latest", api_key=os.getenv("MISTRALAI_API_KEY"))

We instantiate ChatMistralAI with the mistral-large-latest model, which offers state-of-the-art conversational quality and supports streaming and tool calls.

MCP Server Parameters

# Initialize the server parameters
server_params = StdioServerParameters(
    command="C:\\Program Files\\nodejs\\npx.cmd",   # On Windows; on other OSes you can use "npx"
    env={
        "API_TOKEN": os.getenv("API_TOKEN"),
        "BROWSER_AUTH": os.getenv("BROWSER_AUTH"),
        "WEB_UNLOCKER_ZONE": os.getenv("WEB_UNLOCKER_ZONE"),
    },
    args=["@brightdata/mcp"],
)

StdioServerParameters wraps the shell command (npx @brightdata/mcp), environment variables, and arguments needed to launch the MCP server locally on STDIO transport.
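The hard-coded npx.cmd path only matches one particular Windows install. A small, hedged sketch (npx_command is a helper I'm introducing, not part of the original script) that picks the executable name per platform:

```python
import platform
from typing import Optional

def npx_command(system: Optional[str] = None) -> str:
    """Return the npx executable name for the given OS (defaults to the current one)."""
    system = system or platform.system()
    # On Windows, npx is installed as a .cmd shim; elsewhere plain "npx" resolves via PATH
    return "npx.cmd" if system == "Windows" else "npx"
```

You could then pass command=npx_command() to StdioServerParameters. Note this returns the bare name rather than a full path, so npx must be on PATH; shutil.which(npx_command()) can confirm that it is.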

Asynchronous Chat Function

# Define the chat function
async def chat_with_agent():
    # Initialize the client
    async with stdio_client(server_params) as (read, write):
        # Create the MCP client session
        async with ClientSession(read, write) as session:
            # Complete the MCP initialization handshake
            await session.initialize()
            # Load the tools
            tools = await load_mcp_tools(session)
            # Create the agent
            agent = create_react_agent(model, tools)

            # Start the conversation history with an initial system prompt
            messages = [
                {
                    "role": "system",
                    "content": "You can use multiple tools in sequence to answer complex questions. Think step by step.",
                }
            ]

            # Start the conversation
            print("Type 'exit' or 'quit' to end the chat.")
            while True:
                # Get the user's message
                user_input = input("\nYou: ")

                # Check if the user wants to end the conversation
                if user_input.strip().lower() in {"exit", "quit"}:
                    print("Goodbye!")
                    break

                # Add the user's message to the conversation history
                messages.append({"role": "user", "content": user_input})

                # Invoke the agent with the full message history
                agent_response = await agent.ainvoke({"messages": messages})

                # Get the agent's reply
                ai_message = agent_response["messages"][-1].content

                # Add the agent's reply to the conversation history
                messages.append({"role": "assistant", "content": ai_message})

                # Print the agent's reply
                print(f"Agent: {ai_message}")

The heart of the app is chat_with_agent():

  • stdio_client launches the MCP server process and wires up STDIO for JSON-RPC messaging.
  • ClientSession.initialize() negotiates the MCP protocol version and fetches the list of available tools.
  • load_mcp_tools wraps each MCP tool (web unlocker, scraper, browser) into a LangChain tool you can call directly.
  • create_react_agent builds a LangGraph agent that interleaves “thought” and “action,” enabling multi-step workflows like “scrape price → parse JSON → summarize”.
  • The loop reads user input, appends it to history, and calls agent.ainvoke(), printing the result.
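One thing the loop does not do is bound the history, which grows with every turn and eventually inflates token usage. A hedged sketch of a trimming helper (hypothetical, not in the repo) that keeps the system prompt plus the most recent turns:

```python
def trim_history(messages, max_turns=20):
    """Keep the first system prompt plus the most recent non-system messages.

    messages: list of {"role": ..., "content": ...} dicts, system prompt first.
    max_turns: how many non-system messages to retain.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system[:1] + rest[-max_turns:]
```

Calling messages = trim_history(messages) before agent.ainvoke() would cap the prompt size; the right max_turns depends on the model's context window.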

Entry Point

# Run the chat function
if __name__ == "__main__":
    # Run the chat function asynchronously
    asyncio.run(chat_with_agent())

This runs the chat function in an asyncio event loop, ensuring non-blocking I/O with the MCP server and model.

Running the Chat

You can run the script with:

uv run main.py

Let's see some examples, starting with a search for remote AI developer jobs on LinkedIn and Indeed:

(.venv) D:\GitHub\mcp-scrape-web-article>uv run main.py
Type 'exit' or 'quit' to end the chat.

You: search linkedin and indeed for remote ai developer jobs. for the top 2 jobs each job extract: job title, company, location, salary, date and url
Agent: The top 2 remote AI developer jobs from LinkedIn are:

1. **Artificial Intelligence (AI) Engineer/Developer (Remote)**
   - **Company**: Statherós®
   - **Location**: Remote
   - **Date**: Not specified
   - **Job Title**: Artificial Intelligence (AI) Engineer/Developer
   - **Salary**: Not specified
   - **Job Type**: Full-time
   - **Experience Level**: Mid-Senior level
   - **Industries**: Computer Software
   - **Job Function**: Information Technology
   - **URL**: https://www.linkedin.com/jobs/view/artificial-intelligence-ai-engineer-developer-remote-at-stather%C3%B3s%C2%AE-4089407234

2. **AI/ML Developer - REMOTE WORK**
   - **Company**: Primus Global Technologies Pvt Ltd
   - **Location**: Remote
   - **Date**: Not specified
   - **Job Title**: AI/ML Developer
   - **Salary**: Not specified
   - **Job Type**: Contract
   - **Experience Level**: Not specified
   - **Industry**: Information Technology & Services
   - **Job Function**: Engineering
   - **URL**: https://www.linkedin.com/jobs/view/ai-ml-developer-remote-work-at-primus-global-technologies-pvt-ltd-4182993980

The top 2 remote AI developer jobs from Indeed are:

1. **Hackathon AI Curriculum Developer**
   - **Company**: Devpost
   - **Location**: Remote in New York, NY
   - **Date**: Not specified
   - **Salary**: $25 - $100 per hour
   - **Experience Level**: 3+ years of experience teaching or creating educational content for developers, including AI content that reaches a wide audience.
   - **Job Type**: Part-time
   - **Industry**: Not specified
   - **Job Function**: Not specified
   - **URL**: https://www.indeed.com/company/Devpost/jobs?jk=13e7814d20d9b310&from=serp&spa=1&utm_campaign=serp-more

2. **Head of Artificial Intelligence (AI)**
   - **Company**: Divergent Talent
   - **Location**: Remote in Palo Alto, CA
   - **Date**: Not specified
   - **Salary**: $200,000 - $275,000 per year
   - **Experience Level**: Not specified
   - **Job Type**: Full-time
   - **Industry**: Not specified
   - **Job Function**: Not specified
   - **URL**: https://www.indeed.com/company/divergent-talent/jobs?jk=a9d9ec0c913a3f1f&from=serp&spa=2&utm_campaign=serp-more

And an example from searching Reddit for SaaS complaints:

Type 'exit' or 'quit' to end the chat.

You: search reddit and return the top 5 complaints about SaaS products
Agent: [{'type': 'text', 'text': 'Here are the top 5 complaints:\n\n1. **Subscription Traps**: Users often complain about being trapped into subscriptions with no clear way to cancel without contacting the company directly. This is seen as a deceptive practice that locks users into unwanted payments'}, {'type': 'reference', 'reference_ids': [1]}, {'type': 'text', 'text': '.\n\n2. **Scalability Issues**: As the user base expands, many SaaS products face significant challenges in scaling their infrastructure. This can lead to performance degradation and a poor user experience, which is a common frustration among users'}, {'type': 'reference', 'reference_ids': [2]}, {'type': 'text', 'text': '.\n\n3. **Lack of Product-Market Fit**: A frequent complaint is that many SaaS products fail because they do not meet the needs of their target market. This misalignment leads to dissatisfaction and low adoption rates'}, {'type': 'reference', 'reference_ids': [3]}, {'type': 'text', 'text': '.\n\n4. **Customer Support Problems**: Poor customer support is a recurring issue, with users often frustrated by the lack of timely and effective assistance when they encounter problems with the SaaS product'}, {'type': 'reference', 'reference_ids': [4]}, {'type': 'text', 'text': '.\n\n5. **Complex Onboarding Process**: Users often struggle with complicated onboarding processes that hinder their ability to quickly realize the value of the SaaS solution. A non-intuitive setup can lead to frustration and a reluctance to fully adopt the product'}, {'type': 'reference', 'reference_ids': [5]}, {'type': 'text', 'text': '.'}]

The AI's output is not always cleanly formatted; here the reply came back as a list of structured content blocks rather than plain text.
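To get plain text out of replies like this one, a small helper (a hedged sketch assuming the block shapes shown above; not part of the original script) can flatten the content:

```python
def flatten_content(content):
    """Normalize a model reply to plain text.

    The reply may be a plain string, or a list of typed blocks such as
    {"type": "text", "text": ...} and {"type": "reference", ...} as in
    the Reddit example above; reference blocks are dropped.
    """
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, dict) and block.get("type") == "text":
            parts.append(block["text"])
    return "".join(parts)
```

Printing flatten_content(ai_message) instead of the raw structure yields readable output in both cases.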

As you can see, the AI has access to search capabilities even for normally blocked sites, like LinkedIn and Reddit, thanks to the Bright Data MCP server, Web Unlocker, and Browser APIs.


How It Works Under the Hood

This chat application combines a data transport, the Bright Data MCP server, and the ReAct paradigm driven by MistralAI.

The data is transported over:

  • STDIO: simple and local; ideal for spawning tools via shell commands.

MCP defines a small set of JSON-RPC methods:

  • initialize: Negotiate protocol and capabilities.
  • tools/list: Discover available tools with metadata.
  • tools/call: Invoke a tool by name with structured arguments.
  • resources/*: Optional endpoints for streaming or subscribing to data.
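On the wire, these are ordinary JSON-RPC 2.0 messages. As an illustration (the tool name and arguments below are hypothetical, and in practice the adapters build these messages for you), a tools/call request could be assembled like this:

```python
import json

def tools_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 "tools/call" request as a JSON string."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

# Hypothetical tool name and arguments, for illustration only
raw = tools_call_request(1, "scrape_page", {"url": "https://example.com"})
```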

The Reason + Act (ReAct) paradigm alternates between internal reasoning ("thoughts") and external tool calls ("actions").

This makes complex chains, such as "scrape a page → parse the price → translate the currency", explicit and debuggable.
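As a toy illustration of that alternation (mock tools only; the real loop is handled inside LangGraph's agent), a ReAct-style trace might look like:

```python
def react_loop(question, tools, plan):
    """Run a toy ReAct trace: alternate recorded 'thoughts' with tool 'actions'.

    Each action receives the previous observation (the question at first),
    mimicking a chain like "scrape a page -> parse the price".
    """
    transcript = []
    observation = question
    for thought, tool_name in plan:
        transcript.append(f"Thought: {thought}")
        observation = tools[tool_name](observation)
        transcript.append(f"Action: {tool_name} -> Observation: {observation}")
    return transcript, observation

# Mock tools standing in for the real MCP tools
tools = {
    "scrape": lambda url: "<td>$19.99</td>",
    "parse_price": lambda html: float(html[html.index("$") + 1:html.index("</")]),
}
plan = [("fetch the page first", "scrape"), ("now extract the price", "parse_price")]
```

Running react_loop("https://example.com/item", tools, plan) produces a readable Thought/Action transcript and the final parsed value, which is exactly the kind of trace that makes ReAct agents debuggable.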

MistralAI plays a key role here: the model either answers from its internal knowledge or decides to call the tools provided by MCP to query external data, such as web search and HTML parsing.

As for the Bright Data Features:

  • Web Unlocker solves CAPTCHAs and bypasses blocks.
  • Proxy Rotation ensures global reach.
  • Browser API enables headless-browser rendering for JavaScript-heavy sites.

Conclusion

By combining Bright Data’s MCP server, MistralAI’s chat model, and LangGraph’s ReAct agent, you unlock a powerful framework for building autonomous AI assistants capable of fetching and acting on live web data.

I encourage you to:

  • Experiment with advanced tool parameters (geo-targeting, headless browser options).
  • Add memory modules to maintain context across sessions.

Feel free to fork the GitHub repo and start integrating real-time web data into your next AI project!


Follow me on Twitter: https://twitter.com/DevAsService

Follow me on Instagram: https://www.instagram.com/devasservice/

Follow me on TikTok: https://www.tiktok.com/@devasservice

Follow me on YouTube: https://www.youtube.com/@DevAsService
