This is a submission for the Bright Data AI Web Access Hackathon
I built this project for the Bright Data MCP hackathon, both to experiment with MCP and because I like building things. I am currently open to work and have put a lot of effort into this project, so I would be very thankful if you reacted to this article and shared it.
What I Built
Perpendicular AI is an AI agent that generates real-time datasets for AI/ML projects through web scraping. It addresses the challenge of acquiring up-to-date, trustworthy datasets by:
- Interpreting user queries to identify specific data needs
- Locating relevant sources with the search tools provided by Bright Data MCP
- Extracting and structuring data from diverse web pages using Bright Data MCP
- Creating tailored schemas for seamless data integration
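As a rough illustration of the last step, here is a minimal sketch of how a schema could be derived from scraped records and then applied to every row. The function names `infer_schema` and `conform`, and the sample rows, are hypothetical and not taken from the Perpendicular repo; the agent's actual schema logic may work differently.

```python
from typing import Any


def infer_schema(records: list[dict[str, Any]]) -> dict[str, str]:
    """Map each field name to the type name of its first non-null value."""
    schema: dict[str, str] = {}
    for record in records:
        for field, value in record.items():
            if value is not None and field not in schema:
                schema[field] = type(value).__name__
    return schema


def conform(records: list[dict[str, Any]], schema: dict[str, str]) -> list[dict[str, Any]]:
    """Give every row the same columns, filling missing fields with None."""
    return [{field: record.get(field) for field in schema} for record in records]


# Hypothetical rows, as if extracted from two different pages.
rows = [
    {"title": "Laptop A", "price": 999.0},
    {"title": "Laptop B", "rating": 4.5},
]
schema = infer_schema(rows)
dataset = conform(rows, schema)
```

Every row in `dataset` ends up with the same columns, which is what makes the result easy to load into a table or a training pipeline.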
Capabilities
Perpendicular can create real-time datasets from:
- Any specific site when provided with a URL
- General web
- Twitter posts
- LinkedIn data
- Instagram Posts
- Booking.com
- Zillow
- Amazon data
- YouTube
- ZoomInfo
Demo
Demo of dataset generation using the Perpendicular AI agent.
Perpendicular GitHub Repo
Here is the GitHub repo link for the project. Clone it and follow the instructions in README.md to set it up and get it running.
You will need Gemini API keys and Bright Data MCP setup.
Some screenshots of output
How I Used Bright Data's Infrastructure
The system uses Bright Data's infrastructure through its MCP (Model Context Protocol) server:
- Web Content Access: The agent uses Bright Data's tools to:
  - Get past bot protection and CAPTCHAs on protected websites
  - Extract structured data from protected sites (Amazon, LinkedIn, etc.)
  - Navigate complex web pages using remote browser capabilities
- Real-time Search: Bright Data's search engine enables the agent to:
  - Discover up-to-date sources for requested data
  - Verify information freshness
  - Expand search coverage beyond standard search engines
- MCP Integration: The system uses the following Bright Data MCP tools:
  - `search_engine` to perform comprehensive web searches
  - `scraping_browser_get_text` to extract visible content from pages
  - Platform-specific tools such as `web_data_amazon_product_reviews` and `web_data_youtube_videos` whenever a platform like Instagram, LinkedIn, Amazon, Facebook, X, Zillow, Booking.com, or YouTube is detected as a data source
  - General-purpose Bright Data MCP tools to navigate any other site when a discovered source is not among the platforms above
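The routing described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the repo: the `PLATFORM_TOOLS` mapping and the `pick_tool` helper are hypothetical, and only tool names already mentioned in this article are used. A real agent would likely pick among several tools per platform depending on the data requested.

```python
from urllib.parse import urlparse

# Illustrative mapping only: domain -> Bright Data MCP tool named in the article.
PLATFORM_TOOLS = {
    "amazon.com": "web_data_amazon_product_reviews",
    "youtube.com": "web_data_youtube_videos",
}

# Used when the source is not one of the recognized platforms.
FALLBACK_TOOL = "scraping_browser_get_text"


def pick_tool(url: str) -> str:
    """Return the tool name for a discovered source URL."""
    host = (urlparse(url).hostname or "").lower()
    for domain, tool in PLATFORM_TOOLS.items():
        # Match the bare domain or any subdomain such as www.amazon.com.
        if host == domain or host.endswith("." + domain):
            return tool
    return FALLBACK_TOOL
```

For example, `pick_tool("https://www.youtube.com/watch?v=abc")` selects the YouTube tool, while an unrecognized blog URL falls back to plain text extraction.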
Performance Improvements
By leveraging Bright Data's real-time web access, the system achieves significant improvements:
- Data Accuracy: Eliminates hallucinations and fake data by:
  - Accessing primary sources directly
  - Verifying information against multiple sources
  - Using up-to-date web content
- Data Collection Efficiency: Optimizes data collection through:
  - Automated navigation of complex sites
  - Structured data extraction from diverse formats
  - Rapid adaptation to changing web structures
  - Minimizing manual intervention in data gathering
  - Fast gathering of web data
- Reliability: Ensures consistent operation with:
  - Automatic retry mechanisms
  - Bot protection bypass
  - CAPTCHA solving capabilities
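To make the idea of automatic retries concrete, here is a generic retry-with-exponential-backoff sketch. It is not Bright Data's implementation, which runs on their side and is not shown in this article; `with_retries` and its parameters are hypothetical names for a client-side wrapper that could sit around any tool call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on any exception with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real waiting.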
Conclusion
The Bright Data MCP server is good, but Bright Data's underlying capabilities are excellent. It scrapes and navigates web pages well, including pages protected by bot detection and CAPTCHAs. It is fast, and its retry mechanism is reliable.