<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: E S</title>
    <description>The latest articles on Forem by E S (@es2026).</description>
    <link>https://forem.com/es2026</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3758398%2Fb578331f-171c-4d37-9575-53e529bf953b.png</url>
      <title>Forem: E S</title>
      <link>https://forem.com/es2026</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/es2026"/>
    <language>en</language>
    <item>
      <title>LLM Hallucination Index 2026: Why Claude 4.6 Sonnet Dominates BullshitBench v2 While Reasoning Models Fail</title>
      <dc:creator>E S</dc:creator>
      <pubDate>Tue, 03 Mar 2026 15:37:41 +0000</pubDate>
      <link>https://forem.com/es2026/llm-hallucination-index-2026-why-claude-46-sonnet-dominates-bullshitbench-v2-while-reasoning-5cp5</link>
      <guid>https://forem.com/es2026/llm-hallucination-index-2026-why-claude-46-sonnet-dominates-bullshitbench-v2-while-reasoning-5cp5</guid>
      <description>&lt;p&gt;In the relentless race toward Artificial General Intelligence, the industry has become obsessed with a dangerous proxy for intelligence: Helpfulness. We have trained LLMs to be the ultimate “yes-men,” optimizing them to provide an answer at any cost.&lt;/p&gt;

&lt;p&gt;The release of BullshitBench v2 is a cold, empirical shower for this narrative. While standard benchmarks like MMLU are hitting their ceilings, this specialized stress test — designed specifically to catch models in a lie — reveals a widening “honesty gap” that separates the pretenders from the truth-tellers.&lt;/p&gt;

&lt;p&gt;The Reasoning Paradox: More Compute, More Delusion&lt;br&gt;
The most significant takeaway from the v2 data is the definitive confirmation of the “Reasoning Paradox.” The prevailing wisdom was that Chain-of-Thought (CoT) and increased inference-time compute would allow models to self-correct. BullshitBench v2 proves the opposite for the vast majority of the field.&lt;/p&gt;

&lt;p&gt;For most models, including the latest iterations of GPT-5.2 and Gemini 3 Pro, deeper reasoning actually lowers the success rate in detecting nonsense. Instead of using logic to debunk a false premise, the models use their increased “brain power” as a rationalization engine.&lt;/p&gt;

&lt;p&gt;If you feed a “smart” model a non-existent legal statute, it won’t flag the error. Instead, it will spend 30 seconds of compute explaining why that fake law is a perfectly logical extension of the current legal system. The more “intelligent” the model, the more convincingly it can justify absolute bullshit.&lt;/p&gt;
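&lt;p&gt;You can reproduce this failure mode with a few lines of Python. The sketch below uses the standard OpenAI SDK; the statute in the prompt is fabricated on purpose, and the model name is only a placeholder. An honest model should flag the premise rather than explain it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# This statute is invented on purpose; a truthful model should say so
# instead of rationalizing it.
probe = (
    "Under Section 1292(f) of the Digital Evidence Act of 2019, "
    "are screenshots self-authenticating? Explain the reasoning."
)

response = client.chat.completions.create(
    model="gpt-5.2",  # swap in whichever model you want to test
    messages=[{"role": "user", "content": probe}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;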

&lt;p&gt;The 2026 Reliability Hierarchy: Anthropic’s Hegemony&lt;br&gt;
The v2 leaderboard reveals a brutal divergence in the market. While most labs are plateauing, Anthropic has managed to build what can only be described as a “Skepticism Layer” into their architecture.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Claude 4.6 Phenomenon: Breaking the 90% Barrier
Anthropic is the only vendor currently showing a consistent upward trajectory in “epistemic humility.”&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Claude Sonnet 4.6 (High Reasoning) sits at the absolute top with a 91.0% Green Rate (successful BS detection).&lt;/p&gt;

&lt;p&gt;Crucially, its Red Rate (the frequency of confidently swallowing a lie) is a mere 3.0%.&lt;/p&gt;

&lt;p&gt;In the 2026 landscape, Sonnet 4.6 is the only model that behaves like a skeptic by default. It doesn’t just know facts; it understands when a premise is fundamentally flawed.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;The Open-Source Challenger: Qwen3.5
Alibaba’s latest flagship has emerged as the only serious threat to the Anthropic monopoly. Qwen3.5 397b (A17b) holds a 78.0% Green Rate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Insight: With a remarkably low 5.0% Red Rate, Qwen3.5 is actually safer and more honest than many Western closed-source models. For developers looking for open-weights reliability, the “Alibaba Moat” is now a reality.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;The Stagnation of the Giants
The most uncomfortable truth in BullshitBench v2 is the performance of OpenAI and Google. Despite their dominance in creative and coding tasks, they are stuck in the 55–65% range. These models have been RLHF’d (Reinforcement Learning from Human Feedback) to be so “helpful” that they have lost the ability to disagree with the user, making them a liability in high-stakes RAG (Retrieval-Augmented Generation) environments.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Quantitative Breakdown: Top Tier Performance&lt;/p&gt;

&lt;p&gt;Based on the latest v2 data, the hierarchy of truthfulness is now clearly defined:&lt;/p&gt;

&lt;p&gt;The Gold Standard: Claude Sonnet 4.6 (High Reasoning)&lt;br&gt;
91.0% Detection Rate | 3.0% Hallucination Rate.&lt;/p&gt;

&lt;p&gt;The Verdict: The only choice for autonomous agents in Law or Medicine.&lt;/p&gt;

&lt;p&gt;The Elite Runner-Up: Claude Opus 4.5 (High Reasoning)&lt;br&gt;
90.0% Detection Rate | 8.0% Hallucination Rate.&lt;/p&gt;

&lt;p&gt;The Verdict: Powerfully intelligent, but slightly more prone to “creative” errors than Sonnet 4.6.&lt;/p&gt;

&lt;p&gt;The Open-Source King: Qwen3.5 397b A17b (High)&lt;br&gt;
78.0% Detection Rate | 5.0% Hallucination Rate.&lt;/p&gt;

&lt;p&gt;The Verdict: The primary alternative to the Anthropic stack.&lt;/p&gt;

&lt;p&gt;The Efficiency Leader: Claude Haiku 4.5 (High)&lt;br&gt;
77.0% Detection Rate | 12.0% Hallucination Rate.&lt;/p&gt;

&lt;p&gt;The Verdict: Proof that “truthfulness” is being baked into smaller, faster models.&lt;/p&gt;

&lt;p&gt;Domain-Blindness: Bullshit is Universal&lt;/p&gt;

&lt;p&gt;BullshitBench v2 introduced 100 new questions across five critical domains: Coding (40), Medical (15), Legal (15), Finance (15), and Physics (15). The data shows that honesty is not a “knowledge” problem; it is an architectural trait. Models that fail to detect a fake Python library in the coding section fail at a nearly identical rate when presented with a fake medical symptom. You cannot “fine-tune” honesty into a model by giving it more textbooks; you have to train it to prioritize factual refusal over user satisfaction.&lt;/p&gt;
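&lt;p&gt;To make the scoring concrete, here is a rough sketch of a Green/Red tally. It is not the benchmark’s actual grader: the Result fields and the loader are hypothetical, but the arithmetic is the same.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass

# Hypothetical record shape; BullshitBench defines its own format.
@dataclass
class Result:
    domain: str             # "coding", "medical", "legal", "finance", "physics"
    flagged_premise: bool   # model called out the false premise (Green)
    endorsed_premise: bool  # model confidently built on the lie (Red)

def rates(results):
    n = len(results)
    green = 100 * sum(r.flagged_premise for r in results) / n
    red = 100 * sum(r.endorsed_premise for r in results) / n
    return green, red

# Usage (load_results is a hypothetical loader):
# green, red = rates(load_results("claude-sonnet-4.6"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;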

&lt;p&gt;Final Verdict for Developers&lt;/p&gt;

&lt;p&gt;BullshitBench v2 is a funeral march for the “Just Add More Parameters” philosophy. In 2026, the delta between a model that looks smart and a model that is reliable is wider than ever.&lt;/p&gt;

&lt;p&gt;For any project where a hallucination is a catastrophic failure — be it a legal researcher, a medical diagnostic aid, or a financial auditor — your choice is no longer between “GPT or Claude.” It is between Claude 4.6 and everything else.&lt;/p&gt;

&lt;p&gt;Want to see the carnage for yourself?&lt;/p&gt;

&lt;p&gt;Interactive Leaderboard: BullshitBench v2 Viewer (&lt;a href="https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html" rel="noopener noreferrer"&gt;https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html&lt;/a&gt;)&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6vdc2vdrjab57l3cc3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6vdc2vdrjab57l3cc3m.png" alt=" " width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Audit the Questions: GitHub Repository (&lt;a href="https://github.com/petergpt/bullshit-benchmark" rel="noopener noreferrer"&gt;https://github.com/petergpt/bullshit-benchmark&lt;/a&gt;)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>testing</category>
    </item>
    <item>
      <title>Python QuickStart: Calling AnyAPI.ai for LLM Requests (2026 Edition)</title>
      <dc:creator>E S</dc:creator>
      <pubDate>Sun, 15 Feb 2026 18:38:15 +0000</pubDate>
      <link>https://forem.com/es2026/python-quickstart-calling-anyapiai-for-llm-requests-2026-edition-5eo7</link>
      <guid>https://forem.com/es2026/python-quickstart-calling-anyapiai-for-llm-requests-2026-edition-5eo7</guid>
      <description>&lt;p&gt;In this guide, we will explore how to use AnyAPI as a unified gateway to access the latest frontier models using the standard OpenAI Python SDK.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Architecture Overview
AnyAPI.ai operates as a transparent proxy. Your code interacts with a single endpoint, while AnyAPI handles the complex routing to various providers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why Use AnyAPI.ai in 2026?&lt;/p&gt;

&lt;p&gt;Instant Model Switching:&lt;br&gt;
Move from OpenAI to Anthropic by changing just the model string.&lt;/p&gt;

&lt;p&gt;Unified Agentic Workflows:&lt;br&gt;
Use openai/gpt-5.2 for reasoning and google/gemini-3-pro for multimodal analysis under one API key.&lt;/p&gt;
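&lt;p&gt;In practice, that switch is a single loop over model strings. A minimal sketch, using the same client configuration described in the setup section below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("ANYAPI_BASE_URL"),
    api_key=os.getenv("ANYAPI_API_KEY")
)

# Same client, same call; only the model string changes.
for model in ("openai/gpt-5.2", "google/gemini-3-pro"):
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "One-line status check."}],
    )
    print(model, reply.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;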


&lt;ol start="2"&gt;
&lt;li&gt;Setup and Configuration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Install the dependencies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install openai python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Configuration&lt;br&gt;
Create a .env file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ANYAPI_BASE_URL=https://api.anyapi.ai/v1
ANYAPI_API_KEY=your_anyapi_token_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Implementation: Calling the Latest Models&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Synchronous Request (GPT-5)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    base_url=os.getenv("ANYAPI_BASE_URL"),
    api_key=os.getenv("ANYAPI_API_KEY")
)

# Calling GPT-5 using the provider/model format
response = client.chat.completions.create(
    model="openai/gpt-5",
    messages=[{"role": "user", "content": "Analyze the legal implications of AI-generated smart contracts."}]
)

print(f"GPT-5 Response: {response.choices[0].message.content}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Asynchronous Streaming (Claude 4.6 Opus)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
from openai import AsyncOpenAI

async def main():
    async_client = AsyncOpenAI(
        base_url="https://api.anyapi.ai/v1",
        api_key="your_anyapi_token"
    )

    stream = await async_client.chat.completions.create(
        model="anthropic/claude-4-6-opus",
        messages=[{"role": "user", "content": "Architect a microservices system in Rust."}],
        stream=True
    )

    async for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

if __name__ == "__main__":
    asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Model Selection Strategy for 2026&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Entry-Level &amp;amp; High Speed:&lt;br&gt;
Use google/gemini-3-flash or meta-llama/llama-3.1-405b-instruct.&lt;/p&gt;

&lt;p&gt;Professional Coding &amp;amp; Agents:&lt;br&gt;
Use openai/gpt-5 or anthropic/claude-4-5-sonnet.&lt;/p&gt;

&lt;p&gt;Frontier Reasoning:&lt;br&gt;
Use anthropic/claude-4-6-opus or openai/gpt-5.&lt;/p&gt;
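&lt;p&gt;One way to encode this strategy is a tiny tier map. The tier labels below are just shorthand for the list above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Shorthand tiers for the strategy above; adjust to your workload.
MODEL_TIERS = {
    "fast": "google/gemini-3-flash",
    "coding": "anthropic/claude-4-5-sonnet",
    "frontier": "anthropic/claude-4-6-opus",
}

def pick_model(tier):
    return MODEL_TIERS.get(tier, MODEL_TIERS["fast"])

# Usage:
# client.chat.completions.create(model=pick_model("coding"), messages=...)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;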


&lt;ol start="5"&gt;
&lt;li&gt;Standardized Error Handling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Authentication Error (401):&lt;br&gt;
Check your AnyAPI key.&lt;/p&gt;

&lt;p&gt;Rate Limits (429):&lt;br&gt;
Occurs if your AnyAPI tier or downstream provider is throttled.&lt;/p&gt;

&lt;p&gt;Model Not Found (404):&lt;br&gt;
Ensure the model name (e.g., openai/gpt-5) is valid in your dashboard.&lt;/p&gt;
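&lt;p&gt;Because the gateway speaks the OpenAI protocol, the SDK’s built-in exception classes map directly onto these status codes. A minimal sketch, reusing the client from the setup section:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import openai

try:
    response = client.chat.completions.create(
        model="openai/gpt-5",
        messages=[{"role": "user", "content": "ping"}],
    )
except openai.AuthenticationError:
    print("401: check your AnyAPI key")
except openai.RateLimitError:
    print("429: your tier or the downstream provider is throttled")
except openai.NotFoundError:
    print("404: model name is not valid in your dashboard")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;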


</description>
      <category>api</category>
      <category>llm</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>OpenClaw meets AnyAPI.ai: How to scrape the web without losing your mind</title>
      <dc:creator>E S</dc:creator>
      <pubDate>Sat, 07 Feb 2026 11:43:28 +0000</pubDate>
      <link>https://forem.com/es2026/openclaw-meets-anyapiai-how-to-scrape-the-web-without-losing-your-mind-2cci</link>
      <guid>https://forem.com/es2026/openclaw-meets-anyapiai-how-to-scrape-the-web-without-losing-your-mind-2cci</guid>
      <description>&lt;p&gt;Let’s be real for a second. Web scraping used to be a nightmare of broken CSS selectors and constant cat-and-mouse games with site updates. If you are tired of your scrapers breaking because a developer changed a div to a section, you are in the right place.&lt;/p&gt;

&lt;p&gt;Today we are combining OpenClaw (the eyes and hands) with AnyAPI.ai (the brain). This combo lets you turn any messy website into clean JSON without writing a single line of fragile selector code.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;What is the deal with OpenClaw?&lt;/strong&gt;&lt;br&gt;
OpenClaw is an open-source tool that uses AI agents to browse the web just like a human would. Instead of telling it "find the third span inside the second div," you just tell it "give me the product price."&lt;/p&gt;

&lt;p&gt;It handles the scrolling, the clicking, and the messy HTML. But to actually understand what it’s looking at, it needs to talk to a Large Language Model (LLM). That is where things usually get annoying with API keys and regional blocks.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Enter AnyAPI.ai: The ultimate LLM shortcut&lt;/strong&gt;&lt;br&gt;
AnyAPI.ai is basically a universal remote for AI models. Instead of managing five different accounts for OpenAI, Anthropic, and Google, you get one key.&lt;/p&gt;

&lt;p&gt;One billing setup:&lt;br&gt;
You pay in one place but get access to GPT-4o, Claude 3.5, and Llama 3.&lt;/p&gt;

&lt;p&gt;OpenAI-compatible:&lt;br&gt;
This is the best part. It uses the exact same format as OpenAI, so you can plug it into almost any AI tool by just changing one URL.&lt;/p&gt;

&lt;p&gt;No borders:&lt;br&gt;
If you are in a region where some AI providers are blocked, AnyAPI acts as your legal bridge.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;The 3-minute setup&lt;/strong&gt;&lt;br&gt;
First, make sure you have your API key from the AnyAPI.ai dashboard. Then, let’s get your environment ready.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The config (The .env way)&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The cleanest way to do this is to set up a .env file. We are going to "trick" OpenClaw into thinking it is talking to OpenAI, while actually routing it through AnyAPI.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Redirect OpenClaw to the AnyAPI gateway
BASE_URL="https://api.anyapi.ai/v1"

# Your AnyAPI Key goes here
ANYAPI_API_KEY="your_actual_anyapi_key"

# Pick your favorite model from the AnyAPI list
MODEL_NAME="gpt-4o"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;The Python code&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is a simple script to get you started. No complex setup, just pure data extraction.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from openclaw import OpenClaw
import asyncio
import os

# We point the base_url to AnyAPI
claw = OpenClaw(
    api_key=os.getenv("ANYAPI_API_KEY"),
    base_url="https://api.anyapi.ai/v1",
    model="gpt-4o"
)

async def scrape_site():
    # Tell OpenClaw exactly what you want
    my_schema = {
        "title": "string",
        "price_usd": "float",
        "availability": "boolean"
    }

    print("Working my magic...")

    result = await claw.scrape(
        url="https://example-shop.com/product",
        schema=my_schema
    )

    print(f"Here is your data: {result}")

if __name__ == "__main__":
    asyncio.run(scrape_site())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pro-tips for a better experience&lt;/strong&gt;&lt;br&gt;
Watch your tokens:&lt;br&gt;
Web pages are full of useless code. OpenClaw tries to clean this up, but choosing a model like gpt-4o-mini on AnyAPI can save you a ton of money if you are scraping thousands of pages.&lt;/p&gt;

&lt;p&gt;Timeouts are your friend:&lt;br&gt;
AI takes a few seconds to "think" about the page content. Make sure your script doesn't time out after 10 seconds. Give it 60 to be safe.&lt;/p&gt;
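&lt;p&gt;If OpenClaw doesn’t expose a timeout setting directly, plain asyncio can enforce one around any awaitable. This is a generic sketch, not an OpenClaw-specific API; it reuses claw and my_schema from the script above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

# Inside scrape_site() (or any async function): give the model a full
# minute. wait_for works on any coroutine, so no OpenClaw-specific
# timeout support is needed.
try:
    result = await asyncio.wait_for(
        claw.scrape(url="https://example-shop.com/product", schema=my_schema),
        timeout=60,
    )
except asyncio.TimeoutError:
    print("Page analysis timed out; retry or simplify the schema.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;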

&lt;p&gt;Model switching:&lt;br&gt;
If GPT-4o is struggling with a specific table, just change your MODEL_NAME to claude-4-5-sonnet in your AnyAPI settings. No code changes required.&lt;/p&gt;

&lt;p&gt;Final thoughts&lt;br&gt;
By pairing OpenClaw with AnyAPI.ai, you have essentially built a scraper that is "future-proof." Even if the website changes its entire design tomorrow, the AI will still find your data.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>opensource</category>
      <category>webscraping</category>
    </item>
  </channel>
</rss>
