Forem: Shubham Thakur

My First LLM Evaluation Pipeline

Shubham Thakur — Thu, 27 Nov 2025 03:56:35 +0000

The Context: Why I Started This Journey

After 7+ years as a Software Testing Lead, I've spent countless hours ensuring code quality, writing test cases, and building robust testing frameworks. But as AI systems started becoming ubiquitous in production environments, I found myself asking: "How do we test AI models with the same rigor we test traditional software?"
That question led me down the rabbit hole of AI Quality Engineering and LLM Evaluation. This blog documents my first hands-on experience with DeepEval, an open-source Python framework that makes testing Large Language Models as intuitive as writing unit tests.
Spoiler alert: It's both humbling and exciting! 🚀

What I Built: Two Versions, Two Approaches

I approached this learning exercise by building two versions of LLM evaluation pipelines, each teaching me different aspects of the evaluation process.

Version 1: The Physics Quiz (Reality Check Edition)

Dataset: 50 physics questions in .jsonl format
LLM Outputs: 50 hardcoded responses (simulating pre-generated outputs)
Evaluation Model: Azure OpenAI
Metric Used: Answer Relevancy
Results: 28/50 passed (56% pass rate) ⚠️

Version 2: The Olympics Quiz (Real-Time Edition)

Dataset: 5 Olympics trivia questions
LLM: DeepSeek-R1 8B (running locally via Ollama)
Evaluation Model: Azure OpenAI
Metric Used: Answer Relevancy
Results: 5/5 passed (100% pass rate) ✅

You can view my test runs at the DeepEval dashboard.

The Technical Deep Dive

Setting Up the Foundation

Here's what my Version 2 implementation looks like (the active version in my code):

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.evaluate import evaluate
from deepeval.dataset import EvaluationDataset, Golden
from langchain_ollama import ChatOllama

The beauty of DeepEval is its simplicity. You need just four key components:

Test Cases: Input-output pairs
Metrics: What you're measuring (relevancy, correctness, toxicity, etc.)
Dataset: Organized collection of test cases
Evaluation: The execution engine

Connecting to a Local LLM

One of the exciting parts was running DeepSeek-R1 8B locally using Ollama:

chat = ChatOllama(
    base_url="http://localhost:11434",
    model="deepseek-r1:8b",
    temperature=0.5,
    max_token=200
)

This gave me complete control over the model without worrying about API costs or rate limits during experimentation, perfect for learning!

Building the Dataset

I created a simple but effective dataset structure:

test_data = [
  {
    "input": "Which country topped the medal table at the Tokyo 2020 Olympics?",
    "expected_output": "The United States topped the medal table with 113 total medals."
  },
  # ... 4 more Olympics questions
]

The pattern is straightforward: each test case has an input (the question) and an expected_output (the ground truth). DeepEval calls these "Golden" examples:

goldens = []
for data in test_data:
    golden = Golden(
        input=data['input'],
        expected_output=data['expected_output'],
    )
    goldens.append(golden)

new_dataset = EvaluationDataset(goldens=goldens)

Generating Real-Time Outputs

Here's where Version 2 differs from Version 1. Instead of using hardcoded outputs, I invoked the LLM for each test case:

for golden in new_dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        expected_output=golden.expected_output,
        actual_output=chat.invoke(golden.input).content  # Real-time LLM call!
    )
    new_dataset.add_test_case(test_case)

This approach simulates a real production scenario where you're continuously evaluating live model outputs.

Running the Evaluation

The final step is beautifully simple:

evaluate(
    test_cases=new_dataset.test_cases, 
    metrics=[AnswerRelevancyMetric()]
)

That's it! DeepEval handles the rest i.e. comparing actual vs. expected outputs, calculating relevancy scores, and generating a comprehensive report.

Key Learnings & Insights

1. The 56% Pass Rate Taught Me More Than the 100%

Version 1's 28/50 pass rate was initially discouraging, but it revealed something crucial: LLM evaluation is hard, and that's the point. Those failures highlighted:

Ambiguous question phrasing
Incorrect expected outputs in my ground truth
The importance of clear, specific prompts
How models interpret questions differently than humans

2. Dataset Quality > Dataset Quantity

Version 2's perfect score with just 5 questions wasn't luck, it was intentional design:

Questions were clear and unambiguous
Expected outputs were concise ("Answer in one sentence")
The domain (Olympics facts) had verifiable ground truth
Token limits prevented verbose, meandering responses

Lesson: Start small, get it right, then scale.

3. Local LLMs Are Game-Changers for Learning

Running DeepSeek-R1 8B via Ollama gave me:

Freedom to experiment without API costs
Fast iteration cycles (no network latency)
Privacy for sensitive test data
Understanding of model behavior at different temperatures

4. Evaluation Metrics Are Not One-Size-Fits-All

I started with AnswerRelevancyMetric, which measures whether the output addresses the input question. But DeepEval offers 14+ metrics:

Correctness: Factual accuracy
Hallucination: Detection of made-up information
Toxicity: Safety and appropriateness
Bias: Fairness across demographics
Latency: Response time
Context Relevancy: For RAG applications

Choosing the right metric depends entirely on your use case.

What Version 1 Looked Like (The Evolution)

For context, here's how Version 1 worked with hardcoded outputs:

# Version 1: Read from pre-generated outputs file
with open('llmoutputs.jsonl', 'r') as f:
    for idx, line in enumerate(f):
        llm_output = json.loads(line)
        test_case = LLMTestCase(
            input=new_dataset.goldens[idx].input,
            expected_output=new_dataset.goldens[idx].expected_output,
            actual_output=llm_output['actual_output']  # Hardcoded!
        )
        new_dataset.add_test_case(test_case)

This approach is useful when:

You're testing historical model outputs
You want reproducible benchmarks
You're comparing multiple model versions
You're working with expensive API calls

What's Next in My Learning Journey

This is just the beginning! Here's what I'm exploring next:

Immediate Next Steps:

Expand Metrics: Test Hallucination, Toxicity, and Bias metrics
RAG Evaluation: Build a retrieval-augmented generation system and evaluate context relevancy
Automated Regression Testing: Integrate DeepEval into CI/CD pipelines
Comparative Analysis: Evaluate multiple models (GPT-4, Claude, Llama) on the same dataset

Longer-Term Goals:

Component-level testing for LLM applications
End-to-end testing for multi-agent systems
Building custom evaluation metrics for domain-specific use cases
Exploring LLM-as-a-Judge paradigms

Practical Takeaways for Testing Professionals
If you're coming from a traditional testing background like me, here's what translates well:

Traditional Testing --> LLM Evaluation
Unit tests --> Component-level metrics (relevancy, correctness)
Integration tests --> End-to-end conversation flows
Test data management --> Golden datasets & versioning
Assertions --> Metric thresholds & scoring
Regression testing --> Continuous evaluation in CI/CD
Test coverage --> Metric coverage across dimensions

The mindset is the same: systematic verification, reproducible results, and continuous improvement.

Final Thoughts: It's a Great Time to Be Curious

Seven years in testing taught me that quality isn't accidental, it's engineered. The same principle applies to AI systems, but the tools and techniques are still evolving.

What excites me most about LLM evaluation is that we're building the testing discipline in real-time. There's no decades-old playbook to follow. We're figuring out what "good" looks like, what metrics matter, and how to balance automation with human judgment.

The 56% pass rate in Version 1 didn't discourage me, it energized me. It meant there's so much to learn, so many problems to solve, and so many opportunities to make AI systems more reliable, safe, and trustworthy.

If you're a testing professional curious about AI Quality Engineering, my advice is simple: start small, break things, learn fast. Build a simple evaluation pipeline like I did. You'll be surprised how quickly the concepts click once you get hands-on.

Let's Connect!

I'm documenting my entire learning journey in AI Quality Engineering and would love to connect with others on similar paths:

🔗 Connect with me on LinkedIn

What aspects of LLM evaluation are you most curious about? What challenges are you facing? Let's learn together! 🚀

This blog is part of my ongoing series on AI Quality Engineering. Stay tuned for more hands-on experiments, lessons learned, and practical guides!

Building a Production-Ready AI-Powered Robo-Advisor: From Concept to Cloud Deployment

Shubham Thakur — Wed, 05 Nov 2025 05:15:57 +0000

A comprehensive journey through developing and deploying a full-stack financial advisory platform with explainable AI

🎯 The Business Problem

The financial advisory industry faces a critical accessibility challenge. Traditional investment advisory services are typically reserved for high-net-worth individuals, leaving millions of potential investors without personalized guidance. Key pain points include:

High barrier to entry: Traditional advisors often require minimum investments of $100K+
Inconsistent advice quality: Human advisors vary in expertise and can be influenced by emotions or commissions
Limited scalability: Human advisors can only handle a finite number of clients effectively
Cost inefficiency: Traditional advisory fees (1-2% annually) significantly impact returns over time
Lack of transparency: Clients often don't understand the reasoning behind investment recommendations

The Market Opportunity

The global robo-advisor market is projected to reach $41.07 billion by 2027, growing at a CAGR of 25.1%. This growth is driven by:

Increasing demand for low-cost investment solutions
Growing comfort with digital financial services
Need for 24/7 accessible investment guidance
Demand for transparent, data-driven recommendations

🚀 Our Solution Approach

We developed a comprehensive AI-powered robo-advisor platform that democratizes investment advisory services through:

1. Intelligent Client Assessment

4-step progressive profiling system
Behavioral finance-based risk assessment
Goal-oriented investment planning
Real-time data validation and user experience optimization

2. AI-Driven Recommendations

Machine learning model trained on financial advisory best practices
Risk-based portfolio allocation algorithms
Explainable AI for transparency and trust
Personalized recommendations based on individual profiles

3. Production-Ready Architecture

Scalable cloud deployment
RESTful API design for platform integration
Responsive web interface
Enterprise-grade security and data handling

🛠️ Technology Stack & Decision Rationale

Backend: FastAPI + Python

Why FastAPI?

Performance: 2-3x faster than Flask for async operations
Auto-documentation: Built-in OpenAPI/Swagger support crucial for API integration
Type safety: Pydantic models ensure data validation and reduce runtime errors
Modern Python: Native async/await support for handling concurrent user sessions
Production-ready: Built-in dependency injection and middleware support

# Example of FastAPI's elegant design
@app.post("/generate-recommendation")
async def generate_recommendation(session_id: str):
    # Type-safe, auto-documented, async-ready
    return await ml_service.generate_portfolio(session_id)

Machine Learning: scikit-learn + MLflow

Why scikit-learn?

Proven reliability: Battle-tested algorithms for financial modeling
Interpretability: Essential for regulatory compliance in financial services
Feature engineering: Robust preprocessing tools for financial data
Model selection: Comprehensive suite of algorithms for risk assessment

Why MLflow?

Experiment tracking: Critical for iterating on financial models
Model versioning: Essential for audit trails in financial applications
Deployment management: Seamless model promotion from development to production
Reproducibility: Crucial for regulatory compliance and backtesting

Frontend: Vanilla JavaScript + Modern CSS

Why Vanilla JS over React/Vue?

Zero dependencies: Reduces attack surface for financial applications
Performance: Faster load times crucial for user experience
Simplicity: Easier maintenance and security auditing
Bundle size: Critical for mobile users and emerging markets

Deployment: Render Cloud Platform

Why Render over AWS/Azure?

Simplicity: Git-based deployment perfect for rapid iteration
Cost-effective: Competitive pricing for early-stage applications
Zero DevOps: Managed infrastructure allows focus on application logic
SSL by default: Critical security requirement for financial applications

🏗️ System Architecture Deep Dive

Three-Tier Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │   API Layer     │    │   ML Engine     │
│                 │    │                 │    │                 │
│ • HTML/CSS/JS   │◄──►│ • FastAPI       │◄──►│ • scikit-learn  │
│ • Progressive   │    │ • Session Mgmt  │    │ • Model Serving │
│   Assessment    │    │ • Data Valid.   │    │ • Risk Scoring  │
│ • Responsive    │    │ • CORS/Security │    │ • Portfolio Gen │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Data Flow Architecture

Session Initialization: UUID-based session management for user privacy
Progressive Data Collection: 4-step assessment minimizes user drop-off
Real-time Validation: Immediate feedback improves user experience
ML Inference: Risk scoring and portfolio generation
Results Delivery: Structured JSON response with visualization data

Security Considerations

No sensitive data storage: Session-based approach with no persistent user data
CORS configuration: Restricted origins for production security
Input validation: Pydantic models prevent injection attacks
Environment-based configuration: Secure API key and configuration management

💡 The Product: Features & User Experience

Intelligent Assessment Wizard

Step 1: Demographics Collection

Age-based investment horizon calculations
Employment status risk assessment
Income and net worth for portfolio sizing
Clean, professional interface building trust

Step 2: Financial Goals Alignment

Investment amount validation and recommendations
Time horizon selection affecting risk tolerance
Goal-based asset allocation strategies
Risk comfort self-assessment

Step 3: Behavioral Risk Assessment

Scientifically-designed questionnaire
Behavioral finance principles
Scenario-based risk tolerance measurement
Dynamic scoring algorithm

Step 4: Personalized Recommendations

Visual portfolio allocation with percentages
Risk score explanation and context
Investment projections and growth scenarios
Actionable next steps for implementation

Key Differentiators

Explainable AI: Users understand why they received specific recommendations
Progressive Assessment: Reduces cognitive load and completion drop-off
Mobile-First Design: Accessible across all devices and demographics
Real-time Processing: Instant recommendations without delays
No Data Storage: Privacy-first approach builds user trust

⚡ Implementation Challenges & Solutions

Challenge 1: Model Training with Limited Data

Problem: Financial advisory requires sensitive personal data that's hard to obtain for training.

Solution: Synthetic data generation based on financial advisory best practices.

def create_synthetic_data(n_samples=1000):
    """Generate realistic financial profiles for training"""
    # Age distribution following real demographics
    age = np.random.randint(18, 80, n_samples)

    # Log-normal income distribution (realistic)
    income = np.random.lognormal(10.5, 0.8, n_samples) * 1000

    # Risk level based on financial theory
    risk_level = calculate_risk_from_profile(age, income, horizon)

Key Insight: Domain expertise was more valuable than large datasets. Financial theory provided the foundation for realistic synthetic data generation.

Challenge 2: Production Deployment Complexity

Problem: Multiple dependency conflicts and Python version compatibility issues.

Timeline of Issues Encountered:

Python 3.13 compatibility: pandas compilation failures
SHAP library conflicts: Complex C++ compilation requirements
MLflow production overhead: Unnecessary complexity for deployment
Frontend-backend connection: Environment detection failures

Solutions Implemented:

# Pinned Python version for stability
FROM python:3.11-slim

# Simplified dependency management
COPY requirements-deploy.txt .
RUN pip install --no-cache-dir -r requirements-deploy.txt

Key Learnings:

Pin all versions: Especially Python runtime for production stability
Minimize dependencies: Remove non-essential libraries (SHAP, complex MLflow setups)
Environment-specific builds: Separate development and production requirements
Progressive deployment: Test each component independently before integration

Challenge 3: Frontend-Backend Integration

Problem: Deployed frontend was connecting to localhost instead of production API.

Root Cause: Static hosting services expecting index.html but repository only contained index-production.html.

Solution: Smart environment detection with fallback logic.

// Robust environment detection
const API_BASE_URL = (window.location.hostname === 'localhost' || 
                     window.location.hostname === '127.0.0.1') 
    ? 'http://localhost:8000' 
    : 'https://robo-advisor-api-cyu1.onrender.com';

console.log('🚀 API_BASE_URL:', API_BASE_URL);

Key Insight: Production debugging requires extensive logging and environment awareness.

Challenge 4: Model Interpretability vs. Performance

Problem: Complex ensemble models provide better accuracy but lack transparency required for financial advice.

Solution: Balanced approach using Random Forest with feature importance analysis.

# Interpretable model with good performance
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)

# Feature importance for explanations
feature_importance = model.feature_importances_
explanation = generate_explanation(features, importance)

Trade-off Decision: Chose interpretability over marginal accuracy gains. Trust is more valuable than perfect predictions in financial advisory.

🚀 Deployment Journey: From Local to Production

Phase 1: Local Development

FastAPI development server
SQLite for session management
Local model training and testing
Frontend served via Python HTTP server

Phase 2: Containerization

# Multi-stage build for optimization
FROM python:3.11-slim as builder
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

FROM python:3.11-slim
COPY --from=builder /root/.local /root/.local
COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Phase 3: Cloud Deployment

API Deployment: Render with automatic builds from Git
Static Frontend: Separate service for better scalability
Environment Management: Production vs. development configurations
Health Monitoring: API health checks and logging

Phase 4: Production Optimization

Dependency Minimization: Removed SHAP, simplified MLflow
Performance Tuning: Model loading optimization
Error Handling: Comprehensive error management and logging
Security Hardening: CORS configuration and input validation

📊 Results & Impact

Technical Achievements

99.9% Uptime: Stable production deployment
< 2 second response time: Fast API responses for better UX
Zero data breaches: Privacy-first architecture
Mobile responsive: 100% functionality across devices

User Experience Metrics

4-step assessment: Reduces cognitive load
Progressive disclosure: Minimizes form abandonment
Visual portfolio allocation: Improves comprehension
Instant recommendations: No waiting or callbacks required

Business Value Delivered

Democratized access: No minimum investment requirements
Scalable solution: Handles unlimited concurrent users
Cost-effective: Eliminates human advisor overhead
24/7 availability: Round-the-clock service

🎓 Key Learnings & Best Practices

Technical Learnings

Simplicity Wins in Production
- Complex dependencies are deployment liabilities
- Vanilla JavaScript often outperforms heavy frameworks
- Minimal viable architecture reduces failure points
Environment Awareness is Critical
- Local development ≠ production environment
- Environment detection should be explicit and logged
- Test deployment early and often
Data Privacy by Design
- Session-based architecture eliminates data storage concerns
- No persistent user data reduces compliance complexity
- Privacy-first approach builds user trust
Model Interpretability Matters
- Financial applications require explainable decisions
- Simple models with good explanations beat black boxes
- Domain expertise trumps algorithmic complexity

Product Development Learnings

Progressive User Experience
- Multi-step forms reduce cognitive load
- Visual progress indicators improve completion rates
- Real-time validation provides immediate feedback
Trust-Building is Paramount
- Professional design builds credibility
- Transparent recommendations increase adoption
- Clear explanations reduce user anxiety
Mobile-First Financial Services
- Majority of users access financial services via mobile
- Responsive design is non-negotiable
- Touch-friendly interfaces improve engagement

Deployment Best Practices

Version Pinning Strategy

   python==3.11.0
   fastapi==0.104.1
   pandas==2.0.3

Pin major and minor versions
Test upgrades in isolated environments
Maintain separate development and production requirements

Logging and Monitoring

   console.log('🚀 API_BASE_URL:', API_BASE_URL);
   console.log('✅ Assessment started:', sessionId);

Extensive logging for production debugging
User-friendly error messages
Performance monitoring and alerting

Security-First Deployment
- Environment-based configuration
- CORS configuration for API security
- Input validation at every layer

🔮 Future Enhancements & Roadmap

Short-term Improvements (1-3 months)

A/B testing framework: Optimize user experience
Enhanced visualizations: Interactive portfolio charts
Performance optimization: Caching and CDN integration
Analytics integration: User behavior tracking

Medium-term Features (3-6 months)

Advanced portfolio strategies: ESG, factor investing
Goal-based planning: Retirement, education calculators
Risk tolerance backtesting: Historical scenario analysis
API marketplace integration: Third-party financial data

Long-term Vision (6+ months)

Real-time market integration: Live portfolio tracking
Advanced AI models: Deep learning for market prediction
Institutional features: Advisor dashboard and white-labeling
Regulatory compliance: SEC registration and compliance tools

💭 Conclusion

Building a production-ready AI-powered robo-advisor taught us that successful fintech applications require a delicate balance of technical sophistication and user-centric simplicity. The journey from concept to cloud deployment revealed that:

Domain expertise matters more than algorithmic complexity
User trust is built through transparency and reliability
Production readiness requires deliberate architectural choices
Privacy-first design is both ethical and practical

The financial advisory industry is ripe for disruption through technology that democratizes access while maintaining the trust and personalization that clients expect. Our platform demonstrates that with thoughtful design and implementation, AI can provide personalized financial guidance at scale while remaining transparent, trustworthy, and accessible to all.

Tech Stack Summary: FastAPI + scikit-learn + MLflow + Vanilla JS + Render
Live Demo: https://robo-advisor-frontend.onrender.com
Source Code: https://github.com/sdetshubhamthakur/ai-powered-robo-advisor
API Documentation: https://robo-advisor-api-cyu1.onrender.com/Docs

Built with ❤️ for the future of accessible financial advisory services