DEV Community

Prasanjit dutta

(SOTA) AI agent to generate real-time dataset for AI ML projects on demand - Perpendicular AI

This is a submission for the Bright Data AI Web Access Hackathon

I built this project for the Bright Data MCP Hackathon. I took part to experiment with MCP, and because I like building things. I am currently open to work and have put a lot of effort into this project, so I would be very thankful if you could react to this article and share it.


What I Built

Perpendicular AI is an AI agent designed to generate real-time datasets for AI/ML projects by leveraging advanced web scraping. It solves the challenge of acquiring up-to-date, trustworthy datasets by:

  • Interpreting user queries to identify specific data needs
  • Locating relevant sources via the search tools provided by Bright Data MCP
  • Extracting and structuring data from diverse web pages using Bright Data MCP
  • Creating tailored schemas for seamless data integration
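
To make the last step concrete, here is a minimal Python sketch of how a schema could be derived from records that have already been extracted. It is not taken from the Perpendicular repo: `infer_schema` and `conform` are hypothetical helper names, and the sample fields are made up for illustration.

```python
from typing import Any


def infer_schema(records: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Infer a simple column schema from extracted records.

    For each field, collect the Python type names seen and mark the field
    nullable if it is None or missing in at least one record.
    (Hypothetical helper, not the project's actual implementation.)
    """
    schema: dict[str, dict[str, Any]] = {}
    for record in records:
        for key, value in record.items():
            column = schema.setdefault(key, {"types": set(), "nullable": False})
            if value is None:
                column["nullable"] = True
            else:
                column["types"].add(type(value).__name__)
    for key, column in schema.items():
        # A field absent from any record is also treated as nullable.
        if any(key not in record for record in records):
            column["nullable"] = True
        column["types"] = sorted(column["types"])
    return schema


def conform(record: dict[str, Any], schema: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Project a record onto the schema, filling missing fields with None."""
    return {key: record.get(key) for key in schema}


# Made-up sample records standing in for scraped rows.
records = [
    {"title": "A", "rating": 4.5},
    {"title": "B", "rating": None, "verified": True},
]
schema = infer_schema(records)
rows = [conform(r, schema) for r in records]
```

Every row in `rows` ends up with the same set of columns, which is what makes the result usable as a dataset rather than a pile of page-specific dictionaries.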

Capabilities

Perpendicular can create real-time datasets from:

  1. Any specific site when provided with a URL
  2. General web
  3. Twitter posts
  4. LinkedIn data
  5. Instagram posts
  6. Booking.com
  7. Zillow
  8. Amazon data
  9. YouTube
  10. ZoomInfo

Demo

Demo of dataset generation using the Perpendicular AI agent.

Perpendicular GitHub Repo
Here is the GitHub repo link for the project. Clone it and follow the instructions in README.md to get it up and running.

You will need Gemini API keys and a Bright Data MCP setup.

Some screenshots of the output

Creating a dataset from Amazon product reviews


How I Used Bright Data's Infrastructure

The system uses Bright Data's infrastructure through its MCP (Model Context Protocol) server:

  1. Web Content Access: The agent uses Bright Data's tools to:

    • Get past bot protection and CAPTCHAs on protected websites
    • Extract structured data from protected sites (Amazon, LinkedIn, etc.)
    • Navigate complex web pages using remote browser capabilities
  2. Real-time Search: Bright Data's search engine enables the agent to:

    • Discover up-to-date sources for requested data
    • Verify information freshness
    • Expand search coverage beyond standard search engines
  3. MCP Integration: The system uses the following Bright Data MCP tools:

    • search_engine to perform comprehensive web searches
    • scraping_browser_get_text to extract visible content from pages
    • Platform-specific tools such as web_data_amazon_product_reviews and web_data_youtube_videos whenever a platform such as Instagram, LinkedIn, Amazon, Facebook, X, Zillow, Booking.com, or YouTube is detected as a data source
    • General-purpose Bright Data MCP tools to navigate any other site when a discovered source is not one of the platforms above
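
The tool selection described above can be sketched as a small routing function. This is an illustrative Python sketch, not the project's actual code: the three tool names are the ones mentioned in this post, but the host-to-tool mapping and the `pick_tool` helper are assumptions made for the example. Other platforms would need their own entries, which are left out here because their tool names are not listed above.

```python
from urllib.parse import urlparse

# Tool names come from this post; the mapping itself is an assumed,
# deliberately incomplete routing table for illustration only.
PLATFORM_TOOLS = {
    "amazon.com": "web_data_amazon_product_reviews",
    "youtube.com": "web_data_youtube_videos",
}
FALLBACK_TOOL = "scraping_browser_get_text"


def pick_tool(url: str) -> str:
    """Pick a platform-specific tool for known hosts, else the generic one."""
    host = (urlparse(url).hostname or "").lower()
    for domain, tool in PLATFORM_TOOLS.items():
        # Match the domain itself or any subdomain (www., m., ...),
        # but not look-alike hosts such as "notamazon.com".
        if host == domain or host.endswith("." + domain):
            return tool
    return FALLBACK_TOOL
```

For example, `pick_tool("https://www.amazon.com/dp/X")` selects the Amazon reviews tool, while an unrecognized site falls back to plain text extraction through the scraping browser.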

Performance Improvements

By leveraging Bright Data's real-time web access, the system achieves significant improvements:

  1. Data Accuracy: Reduces hallucinated and fake data by:

    • Accessing primary sources directly
    • Verifying information against multiple sources
    • Using up-to-date web content
  2. Data Collection Efficiency: Optimizes data collection through:

    • Automated navigation of complex sites
    • Structured data extraction from diverse formats
    • Rapid adaptation to changing web structures
    • Minimizing manual intervention in data gathering
    • Fast gathering of web data
  3. Reliability: Ensures consistent operation with:

    • Automatic retry mechanisms
    • Bot protection bypass
    • CAPTCHA solving capabilities
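
As one way to picture "verifying information against multiple sources", here is a minimal Python sketch that keeps a field value only when enough sources agree on it. It is an assumption about how such a check could look, not the project's actual verification logic; `verify_field`, `min_agreement`, and the normalization rule (trim and lowercase) are all hypothetical.

```python
from collections import Counter
from typing import Optional


def verify_field(values_by_source: dict[str, str], min_agreement: int = 2) -> Optional[str]:
    """Return the normalized value that at least `min_agreement` sources share.

    Values are compared after trimming whitespace and lowercasing.
    Returns None when no value reaches the agreement threshold.
    """
    counts = Counter(
        value.strip().lower()
        for value in values_by_source.values()
        if value and value.strip()
    )
    if not counts:
        return None
    value, count = counts.most_common(1)[0]
    return value if count >= min_agreement else None
```

A value seen on only one page is dropped instead of being written to the dataset, which trades some coverage for fewer unverified entries.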

Conclusion

The Bright Data MCP server is good, but Bright Data's underlying capabilities are excellent. It scrapes and navigates web pages well, including pages protected by bot detection and CAPTCHAs. It is fast, and its retry mechanism is reliable.


Top comments (2)

Sibasis Padhi

Nice demo.

Prasanjit dutta

Thanks Sibasis! What's your opinion on the project Sibasis?
