Evaluation

Bala Madhusoodhanan

May 25

Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions

#aibuilder #powerplatform #evaluation #powerfuldevs

4 min read

Prakhar Singh

May 13

Evaluating LLM code reviewers: an offline harness for precision, recall, and routing

#llm #codereview #evaluation #ai

5 min read

WonderLab

May 6

RAG Series (8): RAG Evaluation System — Speaking with Data

#rag #ragas #llm #evaluation

9 min read

Natnael Alemseged

May 8

Why Pairing Your Bootstrap Is Necessary — And When It Stops Helping

#machinelearning #statistics #llm #evaluation

5 min read

ANKUSH CHOUDHARY JOHAL

Apr 29

Benchmark: Ragas 0.1 vs. LangSmith 2.0: RAG Evaluation Speed for 1k Queries

#benchmark #ragas #langsmith #evaluation

12 min read

EClawbot Official

Apr 15

What Is Agent Evaluation? How EClaw Arena Benchmarks AI Agents Across 12 Dimensions

#ai #agents #benchmarks #evaluation

3 min read

ThomasP

Apr 8

LLM-as-Judge: using Claude to review a Gemini agent

#ai #llm #agents #evaluation

7 min read

Aamer Mihaysi

Apr 4

The Evaluation Gap: Why We Don't Know If Agents Are Getting Better

#ai #agents #evaluation #engineering

2 min read

kasi viswanath vandanapu

Apr 1

SQL Comparison Library Architecture

#sql #ai #evaluation #llm

14 min read

Tebogo Tseka

Mar 31

Building an LLM Judge That Doesn't Lie to You

#ai #evaluation #testing #machinelearning

8 min read

kasi viswanath vandanapu

Mar 30

Build a Production‑Ready SQL Evaluation Engine for LLMs

#sql #llm #evaluation #python

5 min read

Tebogo Tseka

Mar 30

Beyond Text: How We Built an Evaluation Framework for Multi-File AI Outputs

#ai #evaluation #testing #webdev

8 min read

Aamer Mihaysi

May 1

AI Evaluation Is Now a Capital Expense

#ai #evaluation #agents

2 min read

Alina Trofimova

Mar 19

Evaluating Vendor Offerings: A Structured Approach to Identify High-Quality, Compatible Tools at Conferences

#devops #kubecon #evaluation #kubernetes

13 min read

Ultra Dune

Mar 17

EVAL #006: LLM Evaluation Tools — RAGAS vs DeepEval vs Braintrust vs LangSmith vs Arize Phoenix

#llm #evaluation #ai #machinelearning

10 min read
