<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: CodeRabbit</title>
    <description>The latest articles on Forem by CodeRabbit (@coderabbitai).</description>
    <link>https://forem.com/coderabbitai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7167%2F3c5e8773-7cea-46a9-ae16-841eb6b29b19.png</url>
      <title>Forem: CodeRabbit</title>
      <link>https://forem.com/coderabbitai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/coderabbitai"/>
    <language>en</language>
    <item>
      <title>Why do test coverage metrics keep misleading developers?</title>
      <dc:creator>Obisike Treasure</dc:creator>
      <pubDate>Wed, 22 Apr 2026 12:11:57 +0000</pubDate>
      <link>https://forem.com/coderabbitai/why-do-test-coverage-metrics-keep-misleading-developers-1n6i</link>
      <guid>https://forem.com/coderabbitai/why-do-test-coverage-metrics-keep-misleading-developers-1n6i</guid>
      <description>&lt;p&gt;High test coverage is often seen as a sign of software quality, yet it raises an important question: Why do well-tested applications still have bugs?&lt;/p&gt;

&lt;p&gt;Many assume that high scores on test coverage metrics translate to high-quality software. However, the truth is quite different. While test coverage can tell you how much of your code is executed during testing, it does not indicate whether those tests are effective or comprehensive enough to detect the bugs and edge cases in the software.&lt;/p&gt;

&lt;p&gt;This article explores why test coverage stats can be misleading and highlights scenarios where test coverage metrics create a false sense of security and quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are test coverage metrics misleading?
&lt;/h2&gt;

&lt;p&gt;Test coverage metrics are often seen as a measure of testing effectiveness, but they can indeed be misleading. A high coverage percentage might suggest that a system is thoroughly tested, yet it says nothing about how well the tests validate the code. Coverage alone does not account for the quality, depth, or relevance of the tests being executed.&lt;/p&gt;

&lt;p&gt;This is why it's important to look beyond coverage numbers and consider what they actually represent. &lt;/p&gt;

&lt;h3&gt;
  
  
  Test coverage does not equal meaningful testing
&lt;/h3&gt;

&lt;p&gt;Just because a test executes a line of code doesn't mean it's actually testing anything useful. It's like a car fuel gauge that always shows full, even when the tank is empty: it looks fine, but it tells you nothing.&lt;/p&gt;

&lt;p&gt;Test coverage is also easy to manipulate to achieve a desired percentage. It doesn't take much effort to write useless tests that reach 100%. For example, take a function that calculates a price with or without a discount:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calculatePrice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;discount&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;discount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;discount&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a possible test suite that gives 100% test coverage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vitest&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;calculatePrice&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./calculatePrice&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;calculatePrice&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;it adds properly&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;calculatePrice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;it adds properly with values higher than zero&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;calculatePrice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the result of the test and the coverage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hnsn1lqp03979lz1n5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hnsn1lqp03979lz1n5b.png" alt="Screenshot showing the test coverage" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, the test cases neither handle edge cases nor enforce the business rule that the price of a commodity can't be negative. Even so, you still get 100% test coverage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overemphasis on numbers in test coverage
&lt;/h2&gt;

&lt;p&gt;Many companies treat test coverage as a key performance indicator (KPI), using it as a benchmark for assessing the quality of their testing efforts. In some cases, teams are even offered incentives to achieve a specific percentage of coverage. While this approach may seem logical, since it encourages developers to write more tests, it often leads to behaviors that undermine the true purpose of software testing.&lt;/p&gt;

&lt;p&gt;By placing too much emphasis on the coverage percentage, organizations shift the focus from writing effective, meaningful tests to merely increasing the test coverage number. This results in developers gaming the system to meet targets rather than ensuring software reliability.&lt;/p&gt;

&lt;p&gt;Consider a scenario where a company mandates 90% test coverage as a KPI, with bonuses tied to achieving this goal. Developers, eager to meet the requirement, may resort to writing tests that do the following (a short sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not assert anything – The test runs but doesn't check any output, allowing coverage metrics to increase artificially.&lt;/li&gt;
&lt;li&gt;Cover trivial code paths – Simple getter and setter functions are tested while complex business logic remains untested.&lt;/li&gt;
&lt;li&gt;Artificially inflate coverage – Tests execute code but ignore potential failures or incorrect results, leading to a false sense of security.&lt;/li&gt;
&lt;/ul&gt;
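&lt;p&gt;Here is a minimal sketch of the first two patterns, assuming a hypothetical &lt;code&gt;user&lt;/code&gt; module; both tests raise coverage while proving almost nothing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { describe, test, expect } from "vitest";
import { getDisplayName, applyLoyaltyPricing } from "./user";

describe("coverage-gaming tests", () =&amp;gt; {
  test("executes the complex pricing logic without asserting anything", () =&amp;gt; {
    // Lines are "covered", but no output is ever checked.
    applyLoyaltyPricing({ tier: "gold", cartTotal: 250 });
  });

  test("covers a trivial code path", () =&amp;gt; {
    // A simple getter is verified while the real business logic stays untested.
    expect(getDisplayName({ name: "Ada" })).toBe("Ada");
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;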

&lt;p&gt;While the coverage metrics may look impressive, the software remains vulnerable to undetected bugs. This misalignment between incentives and real testing needs ultimately results in fragile software that risks failing in production despite high coverage numbers.&lt;/p&gt;

&lt;p&gt;This overemphasis on numbers fosters a check-the-box mentality, where the success of testing is judged purely on a percentage rather than its actual effectiveness. It discourages engineers from thinking critically about edge cases, real-world scenarios, and risk-based testing approaches. &lt;/p&gt;

&lt;h2&gt;
  
  
  False sense of security with test coverage
&lt;/h2&gt;

&lt;p&gt;Maxing out test coverage can give you a false sense of security. High coverage breeds overconfidence that leads teams to skip exploratory testing and the important insights gained from peer reviews, because many companies presume their software has attained quality status once coverage is high.&lt;/p&gt;

&lt;p&gt;It's similar to driving a car with a polished dashboard: everything may appear to be in order at first glance, but there's no guarantee that the engine underneath is running properly.&lt;/p&gt;

&lt;p&gt;For instance,&lt;a href="https://arxiv.org/abs/2109.11921" rel="noopener noreferrer"&gt; a study&lt;/a&gt; titled "Can We Trust Tests To Automate Dependency Updates? A Case Study of Java Projects" examined the effectiveness of test suites in detecting faults related to dependency updates. The researchers found that, despite high test coverage, tests detected only 47% of faults in direct dependencies and 35% in transitive dependencies. &lt;/p&gt;

&lt;p&gt;This indicates that even with substantial test coverage, a significant portion of potential issues remained untested, leading to a misplaced confidence in the code's reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Signs your test coverage is misleading you
&lt;/h2&gt;

&lt;p&gt;Test coverage scores can be misleading, especially when they fail to reflect real-world software reliability. If coverage numbers are impressive but critical bugs still slip through, it's a sign that your testing strategy needs reevaluation. Here are key indicators that your test coverage might not be as effective as it appears:&lt;/p&gt;

&lt;h3&gt;
  
  
  Frequent regressions despite high test coverage
&lt;/h3&gt;

&lt;p&gt;A regression occurs when a previously working feature breaks after changes are made to the codebase, such as adding new features, refactoring, or fixing bugs. While high test coverage may suggest strong protection against regressions, it can be misleading if tests only check whether functions execute rather than validate their actual behavior.&lt;/p&gt;

&lt;p&gt;If test cases do not account for edge cases, business logic variations, or interactions between components, regressions can slip through undetected. This often happens when tests are written to maximize coverage metrics instead of ensuring functional correctness. &lt;/p&gt;

&lt;p&gt;As a result, you may find yourself repeatedly fixing the same issues despite having a high coverage score.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tests that pass even when key functionality breaks
&lt;/h3&gt;

&lt;p&gt;Strong test coverage does not prevent superficial tests from producing false positives: tests that pass even when important functionality is broken. Such tests merely confirm that the code executes, without evaluating its output or applying thorough checks to business logic and edge cases.&lt;/p&gt;

&lt;p&gt;Here is an example: &lt;/p&gt;

&lt;p&gt;Let's say you have a function that calculates discounts for an e-commerce checkout system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calculateDiscount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;discountPercentage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;discountPercentage&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A poorly written test might only check if the function runs without errors, but not validate the correctness of the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;calculateDiscount should not return null&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculateDiscount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toBeNull&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is why this test is misleading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This test will pass even if the function is completely wrong because it only verifies that a value is returned.&lt;/li&gt;
&lt;li&gt;If a developer mistakenly changes the function to always return 0, like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calculateDiscount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;discountPercentage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The test will still pass despite the broken discount logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test suites are bloated with low-value tests that don't catch critical bugs
&lt;/h3&gt;

&lt;p&gt;Superficial tests that fail to exercise edge cases in a software's functionality defeat the objective of testing and bloat the test suite. A bloated suite with no obvious justification slows down test runs, drives up maintenance costs, and erodes confidence in the results.&lt;/p&gt;

&lt;p&gt;Here's an example:&lt;/p&gt;

&lt;p&gt;A function is supposed to process an order by applying a discount, but because of a bug, it never subtracts the discount from the total:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Function intended to process an order with a discount.&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;processOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;order&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;discount&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt; &lt;span class="p"&gt;}):&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Bug: The discount is ignored; the total is returned as is.&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the test cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;processOrder&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;returns the total when discount is provided&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;processOrder&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;discount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
   &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;

 &lt;span class="nf"&gt;it&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;returns the total when no discount is provided&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;processOrder&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
   &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
 &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that, you'll still get 100% test coverage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wnelazd76gn6t4iljvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wnelazd76gn6t4iljvr.png" alt="Screenshot of the test coverage" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, the test suite may indicate excellent coverage since it runs every line of the code, but it fails to detect the critical bug in which the discount is never applied. It also does not account for edge case values, like cases where the total or discount is zero, negative numbers for total or discount, and cases where the discount is greater than the total.&lt;/p&gt;
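&lt;p&gt;For contrast, here is a minimal sketch of behavior-focused tests that would expose the bug. Both fail against the buggy &lt;code&gt;processOrder&lt;/code&gt; above, which is exactly what you want; the floor of zero in the second test is an assumed business rule for the sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { describe, it, expect } from "vitest";
import { processOrder } from "./processOrder";

describe("processOrder (behavior-focused)", () =&amp;gt; {
  it("actually subtracts the discount from the total", () =&amp;gt; {
    // Fails against the buggy implementation, which returns 100 instead of 90.
    expect(processOrder({ total: 100, discount: 10 })).toBe(90);
  });

  it("clamps the result at zero when the discount exceeds the total", () =&amp;gt; {
    // Assumes the business rule that an order total never goes negative.
    expect(processOrder({ total: 50, discount: 80 })).toBe(0);
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;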

&lt;h3&gt;
  
  
  Over-reliance on mocks and stubs
&lt;/h3&gt;

&lt;p&gt;One critical issue in testing is the overuse of mocks and stubs. Modern software often relies on numerous external dependencies, and to test components that interact with these dependencies, developers commonly use mocks and stubs. While this approach can make testing more efficient, it comes with significant risks.&lt;/p&gt;

&lt;p&gt;The problem lies in the assumption that developers fully understand the behavior of the mocked dependency, including its edge cases and quirks. In reality, this is nearly impossible. As a result, many unexpected behaviors, failure scenarios, and integration issues may never be tested. This creates a false sense of confidence, where tests pass but fail to reflect real-world conditions.&lt;/p&gt;
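&lt;p&gt;As a minimal sketch of the risk, assume a hypothetical &lt;code&gt;checkout&lt;/code&gt; module that simply forwards its payment gateway's response. With the gateway mocked to always succeed, the test below passes no matter how the real gateway behaves; timeouts, declines, and changed response shapes are never exercised:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { describe, test, expect, vi } from "vitest";
import { checkout } from "./checkout";

// Hypothetical dependency, replaced wholesale by a mock that always succeeds.
vi.mock("./paymentGateway", () =&amp;gt; ({
  charge: vi.fn().mockResolvedValue({ status: "ok" }),
}));

describe("checkout with an over-trusting mock", () =&amp;gt; {
  test("passes regardless of how the real gateway behaves", async () =&amp;gt; {
    // The mock never times out, never declines, and never changes shape,
    // so this test proves nothing about real-world payment failures.
    await expect(checkout({ total: 100 })).resolves.toEqual({ status: "ok" });
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;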

&lt;h2&gt;
  
  
  Don’t chase high test coverage
&lt;/h2&gt;

&lt;p&gt;Instead of chasing high test coverage percentages, follow these practical strategies to improve test effectiveness and real-world reliability. Here's what you should prioritize to build a stronger, more meaningful test suite:&lt;/p&gt;

&lt;h3&gt;
  
  
  Test quality over quantity
&lt;/h3&gt;

&lt;p&gt;In testing, quality matters more than quantity. High test coverage may look impressive, but if tests only confirm that code executes without validating behavior, they provide little real protection. Instead of chasing coverage metrics, focus on writing tests that verify expected outcomes.&lt;/p&gt;

&lt;p&gt;Using the &lt;code&gt;calculateDiscount&lt;/code&gt; function example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calculateDiscount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;discountPercent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;discountPercent&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Discount percent cannot be negative&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;discountPercent&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A low-quality test would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;vitest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;calculateDiscount&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./calculateDiscount&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Low-quality test: only determines if the function runs without error.&lt;/span&gt;
&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Low-Quality Test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;should run without throwing an error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// This test only calls the function without asserting correctness.&lt;/span&gt;
    &lt;span class="nf"&gt;calculateDiscount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A high-quality test that verifies the actual results looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// High-quality tests: verifying the intended behavior.&lt;/span&gt;
&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;High-Quality Tests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;returns correct discount for valid input&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;calculateDiscount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;returns full price when discount is 0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;calculateDiscount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;returns 0 for a 100% discount&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;calculateDiscount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;throws error for negative discount&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;calculateDiscount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toThrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Discount percent cannot be negative&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The high-quality tests cover most of the edge cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Combine test coverage with other test metrics
&lt;/h3&gt;

&lt;p&gt;Relying solely on test coverage can lead to blind spots in your testing strategy. To make sure your tests are effective, supplement coverage with other key quality indicators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peer Reviews and Exploratory Testing – Manual testing and developer reviews help uncover edge cases that automated tests might overlook.&lt;/li&gt;
&lt;li&gt;Code Complexity Analysis – Highly complex code demands more rigorous testing, even if coverage is high.&lt;/li&gt;
&lt;li&gt;Bug Frequency and Production Issues – If frequently covered code still leads to real-world failures, your tests may not be meaningful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, an order processing module might show high test coverage but still allow critical issues like duplicate orders, payment failures, or unexpected behavior under network failures. In such cases, expand your test suite to include edge cases, integration failures, and real-world scenarios rather than just ensuring the code runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contextual testing
&lt;/h3&gt;

&lt;p&gt;Test coverage alone does not guarantee that an application functions correctly under real-world conditions. Contextual testing aligns tests with the business logic and practical usage scenarios, rather than just verifying code execution.&lt;/p&gt;

&lt;p&gt;For example, consider a login function. A superficial test might only check whether the function runs without errors. However, a contextual test should verify that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Valid credentials return a session token.&lt;/li&gt;
&lt;li&gt;Invalid credentials trigger the correct error message.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example: Contextual Testing for a Login Function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A login function should behave as expected under different conditions.&lt;br&gt;
Implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Returns a JWT-like token for valid credentials&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;valid_user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;password&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;valid_pass&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Invalid credentials provided.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;vitest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;login&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./login&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Login Function - Contextual Testing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;successful login returns a token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;valid_user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;valid_pass&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;eyJ&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;failed login throws an error with proper message&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;valid_user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wrong_pass&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;toThrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Invalid credentials provided.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of merely checking if the function executes, these tests mirror real-world authentication behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They validate that correct credentials return a properly structured token, mimicking a real authentication flow.&lt;/li&gt;
&lt;li&gt;They confirm that incorrect credentials result in an error message, just as a real login system would behave.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While these tests add meaningful validation, they still rely on a controlled, isolated function. In real-world applications, authentication involves external dependencies like databases and APIs. Overuse of stubs and mocks in testing can lead to unrealistic test scenarios. To maintain reliability, you should focus on&lt;a href="https://www.techtarget.com/searchsoftwarequality/definition/integration-testing#:~:text=Integration%20testing%20%2D%2D%20also%20known,tested%20as%20a%20combined%20entity." rel="noopener noreferrer"&gt; integration testing&lt;/a&gt;, where components like user authentication services, databases, and external APIs work together as expected.&lt;/p&gt;
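&lt;p&gt;As a minimal sketch of the idea, the test below wires two hypothetical components together, an in-memory &lt;code&gt;UserStore&lt;/code&gt; and an &lt;code&gt;AuthService&lt;/code&gt; built on top of it, instead of mocking the store away. There is no real network or database here, but the pieces are exercised as a combined entity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { describe, test, expect } from "vitest";

// Hypothetical components for the sketch.
class UserStore {
  private users = new Map&amp;lt;string, string&amp;gt;();
  add(username: string, password: string) { this.users.set(username, password); }
  passwordFor(username: string) { return this.users.get(username); }
}

class AuthService {
  constructor(private store: UserStore) {}
  login(username: string, password: string): string {
    if (this.store.passwordFor(username) !== password) {
      throw new Error("Invalid credentials provided.");
    }
    return "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9"; // stand-in token
  }
}

describe("AuthService + UserStore integration", () =&amp;gt; {
  test("login succeeds against the real store", () =&amp;gt; {
    const store = new UserStore();
    store.add("valid_user", "valid_pass");
    const auth = new AuthService(store);
    expect(auth.login("valid_user", "valid_pass").startsWith("eyJ")).toBe(true);
  });

  test("login fails when the store has no matching user", () =&amp;gt; {
    const auth = new AuthService(new UserStore());
    expect(() =&amp;gt; auth.login("ghost", "pass")).toThrow("Invalid credentials provided.");
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;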

&lt;p&gt;Technologies like artificial intelligence (AI) can improve software testing by identifying weak spots, analyzing patterns, and refining the test process. Beyond the usual coverage metrics, AI-driven tools offer smarter ways to detect potential issues before they become costly problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI tools can do for test coverage
&lt;/h2&gt;

&lt;p&gt;AI introduces a smarter way to detect potential issues before they become costly problems. It reduces manual effort and can perform complex analyses, including&lt;a href="https://www.coderabbit.ai/blog/static-code-analyzers-vs-ai-code-reviewers-which-is-best" rel="noopener noreferrer"&gt; static code analysis&lt;/a&gt;, thanks to the vast knowledge it is trained on.&lt;/p&gt;

&lt;p&gt;For example, suppose your team submits a pull request (PR) to add a new discount checker function but forgets to write a test for an edge case. AI can analyze the PR, compare it to your repository's existing test suite, and flag missing test coverage and edge cases that are not accounted for.&lt;/p&gt;

&lt;p&gt;Below is an example of a test coverage summary that provides an overview of which functions are covered and highlights coverage gaps in the codebase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Five6geht4hlm4podlrd1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Five6geht4hlm4podlrd1.png" alt="Screenshot showing the test coverage summary" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, it provides a detailed, function-by-function breakdown, displaying which logic branches and conditions are tested and which are not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbez2eb0dj3o16qaxwoj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbez2eb0dj3o16qaxwoj.png" alt="Screenshot showing AI’s given detailed breakdown" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If AI detects untested functions or missing edge case tests, or even test cases written just to increase the test coverage numbers, it flags them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mv74e1p35aab7obq7py.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mv74e1p35aab7obq7py.png" alt="Screenshot of AI showing the issues with the test" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI doesn't just identify issues; it also suggests how you can improve your testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frma6ch1e6mc9q6qjvd8d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frma6ch1e6mc9q6qjvd8d.png" alt="Screenshot of AI showing recommendations to improve the test process and coverage" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Beyond automating repetitive tasks, AI helps you write better tests by pinpointing gaps and refining existing ones. It can recommend test cases for complex logic, flag potential bugs, and even predict areas of risk based on past failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test coverage: Quantity doesn't equal quality
&lt;/h2&gt;

&lt;p&gt;Although test coverage is a helpful indicator of how much of your code has been tested, it's crucial to keep in mind that it’s just one component of the whole picture. High test coverage may seem fantastic, but it doesn't mean that your tests are truly examining the important sections of your code or identifying possible problems.&lt;/p&gt;

&lt;p&gt;Writing comprehensive and insightful tests that verify functionality, edge cases, and business logic is where the true value lies. The emphasis should be on making sure that each test has a purpose and increases confidence in the accuracy of your application rather than being fixated on reaching a large percentage of coverage. Ultimately, when used carefully, test coverage can contribute significantly to a more robust and dependable development process.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>coderabbit</category>
      <category>testing</category>
    </item>
    <item>
      <title>How to use AI to identify and fix security vulnerabilities in your codebase</title>
      <dc:creator>Damilola Oshungboye</dc:creator>
      <pubDate>Wed, 22 Apr 2026 09:40:52 +0000</pubDate>
      <link>https://forem.com/coderabbitai/how-to-use-ai-to-identify-and-fix-security-vulnerabilities-in-your-codebase-4na2</link>
      <guid>https://forem.com/coderabbitai/how-to-use-ai-to-identify-and-fix-security-vulnerabilities-in-your-codebase-4na2</guid>
      <description>&lt;p&gt;&lt;em&gt;Meta: Understand the common code-based security vulnerabilities, from SQL injection to XSS, and how AI simplifies the detection and resolution of these security vulnerabilities for improved code security.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With the average &lt;a href="https://insurica.com/blog/average-data-breach-cost-hits-all-time-high-of-4-4m/" rel="noopener noreferrer"&gt;data breach now costing companies $4.45 million,&lt;/a&gt; securing your code has never been more urgent. &lt;/p&gt;

&lt;p&gt;As development cycles accelerate, security vulnerabilities like SQL injection or cross-site scripting (XSS) remain common or are discovered too late. AI is changing that. By scanning large codebases, learning from actual attack patterns, and offering targeted fixes, AI tools help you address code security flaws faster than ever.&lt;/p&gt;

&lt;p&gt;This article explores where traditional approaches fall short, how AI can fill those gaps, and practical steps for embedding AI code security into your development workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common security vulnerabilities in your codebase
&lt;/h2&gt;

&lt;p&gt;When you look at the &lt;a href="https://owasp.org/www-project-top-ten/" rel="noopener noreferrer"&gt;OWASP Top 10&lt;/a&gt;, you’ll notice how many serious threats come from routine, everyday coding patterns. Let’s examine a few of the most prevalent vulnerabilities:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. SQL injection
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://owasp.org/www-community/attacks/SQL_Injection" rel="noopener noreferrer"&gt;SQL injection&lt;/a&gt; can be especially dangerous because it exploits the mechanism you rely on to store and retrieve data. An attacker essentially “&lt;strong&gt;injects&lt;/strong&gt;” malicious SQL commands into an application’s inputs, potentially gaining unauthorized access to or altering the underlying data.&lt;/p&gt;

&lt;p&gt;Here’s an example from a simple authentication routine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# auth_service.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;authenticate_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT id, role FROM users 
        WHERE username = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; 
        AND password = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code appears functional but is highly vulnerable to SQL injection. The &lt;code&gt;authenticate_user&lt;/code&gt; function builds its SQL query through string interpolation, which allows an attacker to inject malicious SQL. If an attacker enters &lt;code&gt;' OR '1'='1' --&lt;/code&gt; as the username, the query becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;role&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'anything'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything after &lt;code&gt;--&lt;/code&gt; is treated as an SQL comment, so the password check is discarded and the WHERE clause always matches, allowing the attacker to bypass authentication and access protected data. The consequences of SQL injection attacks can be severe, including loss of confidentiality, tampering with existing data, identity spoofing, and even gaining administrative access to the database server.&lt;/p&gt;

&lt;p&gt;A more secure approach uses parameterized queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# auth_service.py (secure version)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;authenticate_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT id, role FROM users 
        WHERE username = %s AND password = %s
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this secure version, the SQL query uses placeholders (&lt;code&gt;%s&lt;/code&gt;) for the user input, and the actual values are passed as parameters to the execute method. This approach ensures that the input data is properly escaped and cannot be used to manipulate the SQL query structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cross-site scripting (XSS)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://owasp.org/www-community/attacks/xss/" rel="noopener noreferrer"&gt;XSS&lt;/a&gt; happens when attackers inject malicious scripts (often JavaScript) into web pages that other users view. Any spot in your application where user input is rendered onto the page can be a gateway for XSS.&lt;/p&gt;

&lt;p&gt;The following example of a blog post display function is a common pattern in content management systems and is vulnerable to XSS attacks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/blog/&amp;lt;post_id&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;display_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        &amp;lt;article&amp;gt;
            &amp;lt;h1&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/h1&amp;gt;
            &amp;lt;div&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/div&amp;gt;
        &amp;lt;/article&amp;gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, an attacker could inject malicious JavaScript through the &lt;code&gt;post.content&lt;/code&gt; field, affecting every visitor. For instance, if an attacker submits a blog post with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;XSS&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script will be executed by every visitor who views the blog post, allowing the attacker to steal session cookies, manipulate the user's browser, or perform other malicious actions.&lt;/p&gt;

&lt;p&gt;To counter XSS, always sanitize and escape user-generated content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;markupsafe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;escape&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/blog/&amp;lt;post_id&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;display_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;render_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;post.html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;escape&lt;/code&gt; function from the &lt;code&gt;markupsafe&lt;/code&gt; library ensures that any user-provided content is properly escaped, preventing malicious scripts from being executed in the user's browser.&lt;/p&gt;
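
&lt;p&gt;As a quick sanity check of what &lt;code&gt;escape&lt;/code&gt; does, here is a minimal snippet you can run in a REPL (the sample input is our own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from markupsafe import escape

user_input = "&amp;lt;script&amp;gt;alert('XSS')&amp;lt;/script&amp;gt;"
safe = escape(user_input)

# The angle brackets and quotes are converted to HTML entities,
# so the browser displays the payload as text instead of executing it
print(safe)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;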

&lt;h3&gt;
  
  
  3. Hardcoded secrets
&lt;/h3&gt;

&lt;p&gt;Hardcoding sensitive information, like API keys or database credentials, into your code is a common mistake, often done for convenience. However, attackers can use these secrets to gain unauthorized access if this code ends up in a public repository or is leaked elsewhere.&lt;/p&gt;

&lt;p&gt;Here is a simple but risky approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StorageService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AKIA1234567890ABCDEF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws_secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jK8*2nP9$mB4#kL5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aws_secret&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If these credentials appear in a publicly accessible repository, malicious actors can exploit them immediately. Instead, you can store secrets in environment variables or a secure secrets manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;decouple&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StorageService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS_ACCESS_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS_SECRET_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;config&lt;/code&gt; function from the &lt;code&gt;decouple&lt;/code&gt; library is used to load the AWS credentials from environment variables, ensuring that sensitive information is not hardcoded in the application code. This approach significantly reduces the risk of credential exposure and subsequent security breaches.&lt;/p&gt;
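
&lt;p&gt;With &lt;code&gt;decouple&lt;/code&gt;, those values typically live in a &lt;code&gt;.env&lt;/code&gt; file (or in real environment variables) that never enters version control. The key names below simply mirror the snippet above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# .env -- keep this file out of version control (add it to .gitignore)
AWS_ACCESS_KEY=your-access-key-here
AWS_SECRET_KEY=your-secret-key-here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;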

&lt;p&gt;Although these vulnerabilities may appear obvious individually, they become difficult to detect within large codebases, especially when many developers make multiple commits daily across various repositories. Manual reviews alone struggle to consistently catch security issues at scale.&lt;/p&gt;

&lt;p&gt;Keeping these vulnerabilities in mind, let's examine how various types of security testing can help identify and prevent them.&lt;/p&gt;

&lt;h2&gt;
  
  
Static or dynamic code analysis: which should you choose?
&lt;/h2&gt;

&lt;p&gt;It’s critical to identify vulnerabilities early. Two practical approaches to scrutinizing your code are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://circleci.com/blog/static-application-security-testing-sast/" rel="noopener noreferrer"&gt;&lt;strong&gt;Static Application Security Testing&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(SAST)&lt;/strong&gt; – Analyzes your codebase without executing it, pinpointing issues like insecure coding patterns, hardcoded secrets, or known vulnerability signatures.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://circleci.com/blog/dynamic-application-security-testing-dast/" rel="noopener noreferrer"&gt;&lt;strong&gt;Dynamic Application Security Testing&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(DAST)&lt;/strong&gt; – Interacts with your running application to spot real-time issues, such as misconfigurations or runtime injection paths.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Many security teams rely on both, since SAST focuses on code structure, whereas DAST uncovers flaws that are only visible at runtime. Whichever approach you take, the idea is to catch issues well before they affect production users.&lt;/p&gt;
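
&lt;p&gt;If you want a concrete starting point for SAST on a Python codebase, one option (our suggestion; any comparable scanner works) is the open-source tool &lt;a href="https://bandit.readthedocs.io/" rel="noopener noreferrer"&gt;Bandit&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip install bandit
bandit -r .  # recursively scan the project for common Python security issues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;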

&lt;h2&gt;
  
  
  An AI-assisted approach to code security
&lt;/h2&gt;

&lt;p&gt;Even if you’re vigilant about security, modern codebases constantly evolve, and it’s easy for vulnerabilities to slip through. AI code review tools help by quickly scanning large repositories, drawing on known attack patterns and machine learning to spot issues you might otherwise miss.&lt;/p&gt;

&lt;p&gt;They also offer practical suggestions for fixing those issues, reinforcing your overall security posture.&lt;/p&gt;

&lt;p&gt;To see this in action, consider a Python Flask project that handles user profiles and file uploads (two areas where security oversights often hide). AI can highlight these hidden risks and guide you in resolving them before they become real problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up an AI tool for automated code reviews for security vulnerabilities
&lt;/h2&gt;

&lt;p&gt;If you’d like to explore AI-based reviews, you can try any tool that suits your needs, such as &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt;, &lt;a href="https://claude.ai/" rel="noopener noreferrer"&gt;Claude&lt;/a&gt;, or &lt;a href="https://www.perplexity.ai/" rel="noopener noreferrer"&gt;Perplexity&lt;/a&gt;. For this tutorial, we’ll use &lt;a href="https://chatgpt.com/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt; to review a Python Flask application that manages user profiles and handles photo uploads, checking it for common vulnerabilities.&lt;/p&gt;

&lt;p&gt;First, clone the Python project that handles user profile management.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Tabintel/py-photo-lib.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Navigate to the project directory and install dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;py-photo-lib
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
venv&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, create a branch for the new photo upload functionality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; feature/profile-uploads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After cloning the repository and creating the branch, copy the code from this &lt;a href="https://gist.github.com/Tabintel/0afac74f60b27e8b1212baa378e788ab" rel="noopener noreferrer"&gt;GitHub gist&lt;/a&gt; into your app.py file. &lt;/p&gt;

&lt;p&gt;This feature adds photo upload capabilities to the application, allowing users to upload profile images.&lt;/p&gt;

&lt;p&gt;The changes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New route /upload for uploading images.
&lt;/li&gt;
&lt;li&gt;File upload handling using Flask's request.files.
&lt;/li&gt;
&lt;li&gt;Database query in /profile/&amp;lt;username&amp;gt; to fetch user details.
&lt;/li&gt;
&lt;li&gt;Configured API to handle profile image paths.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before pushing the code to the main repository, you’ll first use an AI tool to review it for any vulnerabilities.&lt;/p&gt;

&lt;p&gt;Give this prompt to your AI tool (we’re using ChatGPT in this case):&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Review the following code, which adds a new profile image upload feature to a Python Flask application. Identify any security vulnerabilities or potential risks in the implementation, briefly explain why, and suggest improvements where necessary."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the review process, you would likely receive precise, line-by-line recommendations, such as these examples:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Exception handling enhancement
&lt;/h3&gt;

&lt;p&gt;The review flagged broad exception handling, suggesting you implement more specific error tracking to prevent the exposure of sensitive information. &lt;/p&gt;

&lt;p&gt;This is a critical security concern, as broad exception handling can make identifying and addressing potential issues difficult. To address this, you can implement a robust exception-handling mechanism that includes specific error messages and logging.&lt;/p&gt;

&lt;p&gt;For example, instead of using a broad exception handler like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can implement a more specific exception handler that includes error messages and logging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Log the exception and return an error message
&lt;/span&gt;    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid value: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Log the exception and continue
&lt;/span&gt;    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach ensures that each exception type is handled specifically, providing more detailed error logs and making diagnosing and fixing issues easier.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Undefined model imports
&lt;/h3&gt;

&lt;p&gt;Another vulnerability identified relates to undefined model imports. A model that is referenced but never imported causes runtime errors and makes it difficult to ensure data integrity and prevent unauthorized access. &lt;/p&gt;

&lt;p&gt;To address this, import the model explicitly so that the username lookup resolves against a defined &lt;code&gt;User&lt;/code&gt; model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;
&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the &lt;code&gt;User&lt;/code&gt; model is imported correctly, you prevent the runtime errors that undefined models cause, and lookups go through the ORM rather than hand-built SQL, which helps protect data integrity and guards against attacks like SQL injection.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Additional code vulnerabilities
&lt;/h3&gt;

&lt;p&gt;From this particular review, you can see that the AI tool recognizes problem areas in the app.py pull request and points out the exact lines of code that need to be reviewed for security purposes. Let’s look at each of them:&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Image format validation
&lt;/h3&gt;

&lt;p&gt;The automated review flags image format validation for attention. Currently, the application only allows &lt;code&gt;png&lt;/code&gt;, &lt;code&gt;jpg&lt;/code&gt;, &lt;code&gt;jpeg&lt;/code&gt;, and &lt;code&gt;gif&lt;/code&gt; formats for uploaded photos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ALLOWED_EXTENSIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jpg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jpeg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gif&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;MAX_FILE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;  &lt;span class="c1"&gt;# 5MB
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this approach is sufficient for basic use cases, it has limitations that invite misuse. &lt;/p&gt;

&lt;p&gt;For example, if the project later needs to support other formats (e.g., webp, tiff), adding them without proper validation could result in errors or even security risks when unexpected file types are uploaded. And because an extension check only inspects the filename, a malicious file can simply be renamed to slip past it. &lt;/p&gt;

&lt;p&gt;Proper validation ensures that the system explicitly handles supported formats and verifies file contents, mitigating risks from unverified file uploads.&lt;/p&gt;
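
&lt;p&gt;One way to go beyond extension checks is to validate the decoded file itself. Here is a minimal sketch of ours (not from the review) using Pillow, the same library the app’s &lt;code&gt;process_image&lt;/code&gt; helper relies on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from PIL import Image

# Extend with 'WEBP' or 'TIFF' if the project adds support for them
ALLOWED_FORMATS = {'PNG', 'JPEG', 'GIF'}

def is_valid_image(path):
    """Check the decoded image format instead of trusting the filename."""
    try:
        with Image.open(path) as img:
            img.verify()  # raises if the file is truncated or not an image
            return img.format in ALLOWED_FORMATS
    except (OSError, SyntaxError):
        return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;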

&lt;h3&gt;
  
  
  5. Function usage documentation
&lt;/h3&gt;

&lt;p&gt;Next, there’s a suggestion to clarify function usage in a docstring. A short docstring for &lt;code&gt;allowed_file&lt;/code&gt; would help describe its purpose and the expected parameter and return types.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allowed_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; \
           &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rsplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ALLOWED_EXTENSIONS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without documentation, you may misunderstand the purpose or behavior of this function. Adding a short docstring that specifies the function's role (validating file extensions) and details the expected parameters and return values would prevent miscommunication and misuse.&lt;/p&gt;
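
&lt;p&gt;For example, the suggested docstring might look like this (the wording is ours, not the review’s):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def allowed_file(filename):
    """Return True if the filename's extension is in ALLOWED_EXTENSIONS.

    Args:
        filename (str): Name of the uploaded file, e.g. "avatar.png".

    Returns:
        bool: Whether the extension is allowed (case-insensitive).
    """
    return '.' in filename and \
           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;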

&lt;h3&gt;
  
  
  6. Image processing logging
&lt;/h3&gt;

&lt;p&gt;There's a suggestion to validate and log image processing steps. The image processing logic is good but may benefit from explicit logging when conversions or resizing occur, especially to diagnose user-upload issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Convert to RGB if necessary
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Resize if too large
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;thumbnail&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="c1"&gt;# Save optimized version
&lt;/span&gt;        &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;JPEG&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optimize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without logs, it becomes difficult to diagnose issues when image uploads fail. Logging each step would provide a clear trail for troubleshooting and prevent vulnerabilities caused by incomplete or incorrect processing. Logs also enhance accountability and allow developers to detect unexpected behavior during uploads.&lt;/p&gt;
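
&lt;p&gt;A sketch of what that logging could look like (the logger setup is an assumption; adapt it to your configuration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging

from PIL import Image

logger = logging.getLogger(__name__)

def process_image(image_path):
    with Image.open(image_path) as img:
        if img.mode != 'RGB':
            logger.info("Converting %s from %s to RGB", image_path, img.mode)
            img = img.convert('RGB')
        if max(img.size) &amp;gt; 2000:
            logger.info("Resizing %s down from %s", image_path, img.size)
            img.thumbnail((2000, 2000))
        img.save(image_path, 'JPEG', quality=85, optimize=True)
        logger.info("Saved optimized image to %s", image_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;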

&lt;p&gt;Through this automated code review process, you've seen how AI tools comprehensively analyze code, highlighting vulnerabilities in the photo upload feature. The review surfaces critical security gaps, from unhandled exceptions that could expose system information, to weak file type validation that might allow malicious uploads, to missing model imports that could compromise data integrity. &lt;/p&gt;

&lt;p&gt;Once you've applied the fixes from the review, commit and push the changes to your GitHub repository using the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git add app.py
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"feat: add profile photo upload functionality"&lt;/span&gt;
git push origin feature/profile-uploads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementing code security best practices
&lt;/h2&gt;

&lt;p&gt;Security requires ongoing diligence. Here are some routines you can integrate into daily development:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Regular code reviews
&lt;/h3&gt;

&lt;p&gt;Peer reviews remain indispensable for catching logical errors and sharing knowledge. To maximize coverage, combine them with automated scanning.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Secure coding standards
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Avoid storing secrets in code.&lt;/li&gt;
&lt;li&gt;Enforce the least privilege for database connections.&lt;/li&gt;
&lt;li&gt;Use parameterized queries by default.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These guidelines reduce the risk of common issues slipping through.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Frequent vulnerability assessments
&lt;/h3&gt;

&lt;p&gt;Schedule periodic scans (e.g., monthly or quarterly) using SAST and DAST tools. Consider penetration tests for critical areas of your application.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Automate where possible
&lt;/h3&gt;

&lt;p&gt;CI/CD pipelines can automatically run security tests before deployment, ensuring no commit merges without scrutiny. AI can assist in triaging results, making the process more efficient.&lt;/p&gt;
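
&lt;p&gt;For example, a pipeline step can run the Bandit scan from earlier and block the merge when findings are reported (most SAST tools exit nonzero on findings):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# -ll limits output to medium- and high-severity findings;
# a nonzero exit code fails the CI job and blocks the merge
bandit -r . -ll
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;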

&lt;h3&gt;
  
  
  5. Embrace DevOps culture
&lt;/h3&gt;

&lt;p&gt;Encourage open collaboration between development and operations, ensuring that security is baked in from the start rather than bolted on at the end.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Shift-Left security
&lt;/h3&gt;

&lt;p&gt;Moving security checks to earlier stages of development (DevSecOps) reduces last-minute surprises. Addressing security vulnerabilities earlier in the development process significantly reduces the time and resources required for remediation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up: Securing your codebase
&lt;/h2&gt;

&lt;p&gt;Applying thoughtful, ongoing code security practices and AI code review checks can significantly reduce your application’s vulnerability. Whether it’s preventing SQL injection, mitigating XSS, or ensuring secrets don’t leak, each layer of protection adds up.&lt;/p&gt;

&lt;p&gt;As you refine these habits and explore AI code security solutions, you’ll discover more efficient ways to keep your applications secure, protect sensitive data, and maintain a reliable development pipeline.   &lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>vulnerabilities</category>
      <category>code</category>
    </item>
    <item>
      <title>When should you use canary deployments?</title>
      <dc:creator>Doug Sillars</dc:creator>
      <pubDate>Mon, 20 Apr 2026 18:42:22 +0000</pubDate>
      <link>https://forem.com/coderabbitai/when-should-you-use-canary-deployments-2ch9</link>
      <guid>https://forem.com/coderabbitai/when-should-you-use-canary-deployments-2ch9</guid>
      <description>&lt;p&gt;One solution for a tricky or high-risk deployment is to leverage a canary deployment. In a canary deployment, new features are rolled out to a small subset of your customer base with no fanfare. If things go well: hooray! Canary deployments are ideal for times when a deployment might go sideways or the feature might not work as expected. By utilizing a canary release, you have minimized the risk, as impacting a small fraction of your users is a lot better than affecting all of them.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvx8m8c1v3v3l5p1xt6w2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvx8m8c1v3v3l5p1xt6w2.png" alt="Deploying a Canary Release" width="512" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post will discuss what a canary deployment is, when development teams might choose to use one in their applications, common pitfalls to avoid, and how modern tooling can help ease the rollout of canary deployments.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a canary deployment?
&lt;/h1&gt;

&lt;p&gt;Where does the term canary deployment come from? You may have heard of the term "canary in the coal mine." Early coal miners would keep caged songbirds in the mine with them, because these small birds are highly susceptible to toxic gases. If the canaries suddenly died, the miners knew that there was a dangerous gas release, and they could evacuate the mine before they succumbed to the deadly gas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgejryztof28vo7uoxlmq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgejryztof28vo7uoxlmq.jpg" alt="A canary in a coal mine" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A canary deployment can be performed when there is concern about the risks involved. Perhaps you are rolling out new cloud infrastructure that has not yet been tested with the production environment, or maybe there are performance concerns—the updates might cause the website's performance to drop to a terrible crawl. Perhaps the change is small, but it could not be tested for security vulnerabilities. A failed rollout to your entire customer base could end in disaster in all of these cases.&lt;/p&gt;

&lt;h1&gt;
  
  
  Is your team already using canary deployments?
&lt;/h1&gt;

&lt;p&gt;Many teams practice canary-style releases without explicitly calling them that. Common variations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blue-green deployments&lt;/strong&gt;: Two production environments run in parallel. The blue environment serves the stable version, while green runs the new version. Traffic is gradually shifted from blue to green, limiting risk and enabling fast rollback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experimental deployments&lt;/strong&gt;: New behavior is exposed to a defined group of users, often with explicit awareness or opt-in. These are typically broader than canary deployments and focus on measuring user impact rather than system health alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradual rollouts (progressive delivery)&lt;/strong&gt;: After validating a release with a small subset of users, traffic is incrementally increased (for example, 5% → 25% → 100%) while monitoring key metrics like error rates and latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shadow deployments&lt;/strong&gt;: The new version runs in parallel and receives a copy of real production traffic, but responses are not served to users. This is useful for validation and performance testing without customer impact.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Planning a canary deployment
&lt;/h1&gt;

&lt;p&gt;Imagine this situation: You are about to launch a new feature that is considered high-risk for some reason. The team has opted for a canary deployment, releasing it to a limited group of users first. Achieving a successful deployment in this scenario demands meticulous planning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment audience&lt;/strong&gt;: Who gets the update? This requires a basic knowledge of your audience and understanding of who would benefit the most from the change. If your new version is an overhaul of the mobile dashboard, and the canary release is delivered to desktop users, how can you determine if the deployment was successful? It is certainly useful to have some desktop users in the cohort (in case something breaks on desktop), but in order to best understand the results of the deployment, you want to ensure that users interact with the features of the new version.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Duration&lt;/strong&gt;: How long will the test run? An hour? A week? Depending on the type of deployment, you may know the success/failure quickly. (Did the database correctly sync with the servers?). However, sometimes it may take days to determine success. (Is website performance faster during busy periods?).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt;: What will the team track during the deployment? What metrics will be considered a success? What metrics would be branded a partial success? When does the team admit failure and roll back the deployment to fix bugs?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feature flags&lt;/strong&gt;: For some canary deployments, using feature flags can be helpful. Feature flags are tools in your code that manage the deployment of application features to specific audiences. If your team is planning to use canary deployments extensively, it may be useful to integrate feature flags into your codebase. A minimal sketch of the idea appears after this list.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
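
&lt;p&gt;As a rough illustration of the feature-flag idea, here is a minimal sketch that deterministically buckets users into a canary cohort (&lt;code&gt;render_new_dashboard&lt;/code&gt; and &lt;code&gt;render_old_dashboard&lt;/code&gt; are hypothetical stand-ins for your two code paths):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

CANARY_PERCENT = 5  # serve the new version to 5% of users first

def in_canary(user_id):
    """Hash the user ID so each user lands in a stable cohort."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket &amp;lt; CANARY_PERCENT

def dashboard(user_id):
    if in_canary(user_id):
        return render_new_dashboard()  # hypothetical canary code path
    return render_old_dashboard()      # hypothetical stable code path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;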

&lt;h1&gt;
  
  
  Deployment risks
&lt;/h1&gt;

&lt;p&gt;With any release, there is a risk of failure. Using a canary deployment mitigates the risk by reducing the blast radius of a failure to a small subset of users—however, even canary releases have risks. Among these are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Service outages&lt;/strong&gt;: If the deployment goes bad, the application may slow down or even cease to work for the selected users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security breaches&lt;/strong&gt;: The new release may introduce a security vulnerability that was not discovered in testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollback mechanism&lt;/strong&gt;: Many canary releases do not have a formalized rollback plan, should things go wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data loss&lt;/strong&gt;: If the canary rollout is using a new database or a new interface to the database, any issue with the rollout may introduce data inconsistencies or data loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For any release, it is important to consider the risks and how they will be mitigated. While a canary release lowers those risks, plans for rollback, data-loss handling, and the like should be in place before the release.&lt;/p&gt;

&lt;h1&gt;
  
  
  Key metrics during canary deployments
&lt;/h1&gt;

&lt;p&gt;During any release, it is critical to monitor your systems in case of any trouble during the deployment. Canary releases are no different, although the experimental nature of the release means there is a higher likelihood of issues arising:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Error rates&lt;/strong&gt;: Are your systems throwing more errors than typically seen in production?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System metrics&lt;/strong&gt;: Keep a close eye on memory, CPU, database load, API query volume, and other resource metrics. We've all been a part of a release where the new code introduced a memory leak, or API usage suddenly spiked to 400% of normal. By monitoring these in the canary release, the issues can be quickly remediated for the canary users before releasing to the entire base.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic distribution&lt;/strong&gt;: Is the canary release being properly served to the selected users? Are users accidentally crossing between the two environments? Handling these issues early ensures that the new build is completely isolated from the old environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security metrics&lt;/strong&gt;: Is the login behavior different in the canary build? Are users interacting with this build differently than the previous version? Could this imply a potential threat?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Monitoring these metrics during the deployment and after the release can help your team better understand the wins (and potential improvements to your app) that occur during the canary release. In addition to traditional monitoring, there are now AI tools that can be used to compare your two environments, quickly parsing the data and determining differences from the old version and the new software version that might be difficult to see in traditional logging.&lt;/p&gt;

&lt;h1&gt;
  
  
  Canary deployments: What's next?
&lt;/h1&gt;

&lt;p&gt;When dealing with a high-risk release, a canary deployment to a small subset of users is a great way to mitigate exposure in production. To be successful in your canary release, be sure to discuss how you will choose the cohort of users, monitor the release, and plan a rollback strategy prior to the release. Then, during the deployment process, your team can track the essential metrics to understand if the canary deployment was a success. Should the deployment go well, the team can plan to add users to the new version of the software, slowly deprecating the old version.&lt;/p&gt;

&lt;p&gt;Developers who use canary releases for risky deployments are able to expose potentially problematic code or infrastructure to a small user base, pushing the product ahead while mitigating the risk of a widespread outage.&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>devops</category>
      <category>sre</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Show me the prompt: What to know about prompt requests</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Thu, 29 Jan 2026 05:34:00 +0000</pubDate>
      <link>https://forem.com/coderabbitai/show-me-the-prompt-what-to-know-about-prompt-requests-2fo6</link>
      <guid>https://forem.com/coderabbitai/show-me-the-prompt-what-to-know-about-prompt-requests-2fo6</guid>
      <description>&lt;p&gt;In the 1996 film Jerry Maguire, Tom Cruise’s famous phone call, where he shouts “Show me the money!” cuts through everything else. It’s the moment accountability enters the room.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgky6j4fenyk94pfk88ym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgky6j4fenyk94pfk88ym.png" alt="Image" width="680" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In AI-assisted software development, “show me the prompt” should play a similar role.&lt;/p&gt;

&lt;p&gt;As more code is generated by large language models (LLMs), accountability does not disappear. It moves upstream. The question facing modern engineering teams is not whether AI-generated code can be reviewed, but where and how review should happen when intent is increasingly expressed before code exists at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Twitter debate: Prompts versus pull requests
&lt;/h2&gt;

&lt;p&gt;Earlier this week, Gergely Orosz of &lt;a href="https://www.pragmaticengineer.com/" rel="noopener noreferrer"&gt;Pragmatic Engineer&lt;/a&gt; shared a quote on Twitter (or X, if you prefer) from an upcoming podcast with &lt;a href="https://steipete.me/" rel="noopener noreferrer"&gt;Peter Steinberger&lt;/a&gt;, creator of the self-hosted AI agent &lt;a href="https://github.com/clawdbot/clawdbot" rel="noopener noreferrer"&gt;Clawdbot&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8lctctwsg73rp4orj7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8lctctwsg73rp4orj7h.png" alt="Image" width="673" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Steinberger’s point was straightforward but provocative: as more code is produced with LLMs, traditional pull requests may no longer be the best way to review changes. Instead, he suggested, reviewers should be given the prompt that generated the change.&lt;/p&gt;

&lt;p&gt;That idea quickly triggered a polarized response.&lt;/p&gt;

&lt;p&gt;Supporters argued that reviewing large, AI-generated diffs is becoming increasingly impractical.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg5vrn5i2qomgu6aua3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg5vrn5i2qomgu6aua3i.png" alt="Image" width="668" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From their perspective, the prompt captures intent more directly than the output. It tells reviewers what the developer was trying to accomplish, what constraints they set, and what scope they intended. In addition, a prompt can be re-run or adjusted, which makes it easier to validate the approach without combing through thousands of lines of generated code.&lt;/p&gt;

&lt;p&gt;Critics, however, pointed to issues that prompts alone do not solve: determinism, reproducibility, git blame, and legal accountability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qouawyh0mfgva4gsi7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qouawyh0mfgva4gsi7f.png" alt="Image" width="665" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because LLM outputs can vary across runs, models, and configurations, approving a prompt does not necessarily mean approving the exact code that ultimately ships. For audits, ownership, and downstream liability, that distinction matters. In their view, code review cannot be replaced by “prompt approval” without weakening the guarantees that PR-based workflows were designed to provide.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ikxd0krwg6oqvr28dne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ikxd0krwg6oqvr28dne.png" alt="Image" width="657" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core disagreement, then, is not whether prompts should be part of review. It is where accountability should live in an AI-assisted workflow: primarily in the prompt, primarily in the code, or in a deliberately structured combination of both.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a prompt request?
&lt;/h2&gt;

&lt;p&gt;A prompt request is exactly what it sounds like: a request by a developer for a peer review of their prompt before feeding it into an LLM to generate code. Or, in the case of multi-shot or conversational prompts, a review of the conversation between the developer and the agent.&lt;/p&gt;

&lt;p&gt;Instead of starting review at the diff level, a prompt request asks reviewers to evaluate the instructions given to the LLM so they can sign off on or contribute to the context, intent, constraints, and assumptions that guide the model’s output. A typical prompt request may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system and user prompts&lt;/li&gt;
&lt;li&gt;Relevant repository or architectural context&lt;/li&gt;
&lt;li&gt;Model selection and configuration&lt;/li&gt;
&lt;li&gt;Constraints, invariants, or non-goals&lt;/li&gt;
&lt;li&gt;Examples of expected behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to make explicit what the model was asked to do before evaluating how well it did it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l3ileilk4wxx4hx5m9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l3ileilk4wxx4hx5m9f.png" alt="Image" width="599" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this sense, a prompt request functions more like a design artifact than a code artifact. It captures intent at the moment of generation and helps ensure the prompt is comprehensive and explicit enough to address the requirements. It can help teams better align around how they prompt and ensure that everyone is using the same context to generate code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Good news: Prompt requests and pull requests are not in conflict
&lt;/h2&gt;

&lt;p&gt;Much of the debate this week stemmed from treating prompt requests and pull requests as competitors. Either you do a prompt request or a pull request, some commenters suggested.&lt;/p&gt;

&lt;p&gt;However, they shouldn’t be treated that way.&lt;/p&gt;

&lt;p&gt;After all, they address different failure modes at different stages of the development lifecycle. Just like you’re not going to skip testing because you did a code review, you shouldn’t skip a code review because you did a prompt request.&lt;/p&gt;

&lt;p&gt;Prompt requests are valuable because they ensure alignment and best practices early, before any code is generated or committed. They help teams align on what should be built, define boundaries, and constrain agent behavior. Because large language models are non-deterministic, capturing intent explicitly becomes even more important upstream, where variability is highest.&lt;/p&gt;

&lt;p&gt;A prompt request can also help ensure that a prompt is optimized for the specific model or tool that will be used to generate the code, something essential in ensuring the quality of the output of increasingly divergent models (something we’ve consistently found in our evals).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2m4esj5et1koq9pkml5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2m4esj5et1koq9pkml5.png" alt="https://github.com/clawdbot/clawdbot/pull/763" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pull requests remain essential later, when teams review the exact code that will ship. They preserve determinism, traceability, testing, auditing, and accountability. One captures intent. The other captures execution.&lt;/p&gt;

&lt;p&gt;Treating prompt requests as replacements for pull requests creates a false tension. Used together, they complement each other. Doing a prompt request and then skipping a pull request is reckless, since the actual code produced hasn’t been validated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why teams are drawn to prompt requests
&lt;/h2&gt;

&lt;p&gt;When done as part of a regular software development workflow that includes a thorough code review, prompt requests are a way to shift left and catch issues early. They ensure a team is aligned on the goals of the feature, can help optimize the prompt for the model it’s using, and can ensure that the proper context is being supplied to improve the generated output. This can cut down significantly on review effort and issues later on.&lt;/p&gt;

&lt;p&gt;When used alone without doing a pull request after the code is generated, the primary appeal of prompt requests is cognitive efficiency and speed.&lt;/p&gt;

&lt;p&gt;AI has dramatically increased the speed at which developers can produce code, but the review process has not kept pace. As AI-authored changes grow larger and more frequent, line-by-line review becomes increasingly difficult to complete. Subtle defects slip through not because engineers don’t care, but because reviewing enormous, machine-generated diffs is cognitively taxing.&lt;/p&gt;

&lt;p&gt;Prompts, by contrast, are typically shorter and more declarative. Reviewing a prompt allows engineers to reason directly about scope, intent, and constraints without getting buried in implementation details produced by the model.&lt;/p&gt;

&lt;p&gt;Prompt-first review works particularly well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaffolding and boilerplate generation&lt;/li&gt;
&lt;li&gt;Small changes&lt;/li&gt;
&lt;li&gt;Greenfield prototypes&lt;/li&gt;
&lt;li&gt;Fast-moving teams optimizing for iteration speed&lt;/li&gt;
&lt;li&gt;Hobby projects where defects in prod aren’t that consequential&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these cases, the most important question is often not “is every line correct?” but “is this what we meant to build?”&lt;/p&gt;

&lt;h2&gt;
  
  
  Where prompt requests fall short
&lt;/h2&gt;

&lt;p&gt;When used in concert with pull requests, prompt requests have few downsides, since they simply offer another opportunity to review the proposed change before generation. The biggest one is the time and cognitive effort they take; if reviews are slow to arrive, the prompt request stage can become a new bottleneck for code generation.&lt;/p&gt;

&lt;p&gt;When treated as a replacement for pull requests, the biggest limitation of prompt requests is non-determinism.&lt;/p&gt;

&lt;p&gt;After all, the same prompt can produce different outputs across runs or models. That makes reviewing prompts a weak substitute for reviewing an auditable record of what actually shipped. From the perspective of git blame, compliance, or legal accountability, prompt reviews alone are insufficient.&lt;/p&gt;

&lt;p&gt;There are also real security and correctness risks. You might think you covered everything in your prompt, but it may encode unsafe assumptions, omit edge cases, or fail to account for system-specific constraints that would normally be caught during careful code review. Reviewing intent does not guarantee that the generated output is secure, performant, or compliant.&lt;/p&gt;

&lt;p&gt;Finally, prompts are highly contextual. A prompt that looks reasonable in isolation can still produce problematic implementations if the reviewer lacks deep familiarity with the codebase, infrastructure, or runtime environment. While prompt reviews are designed to limit this by bringing in additional sets of eyes to improve the prompt, human reviewers make mistakes all the time on actual code. Add in the unpredictability of a model, and that’s a recipe for bugs and downtime. These risks increase as prompts are reused or gradually modified over time, or if you change models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt requests work best before pull requests
&lt;/h2&gt;

&lt;p&gt;Used together, prompt requests and pull requests offset each other’s weaknesses.&lt;/p&gt;

&lt;p&gt;A practical workflow might look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A developer proposes a prompt request describing the intended change, constraints, and assumptions. This can involve just one prompt or a series of prompts for different parts of the code being generated. In the case of conversational prompts, the dev might propose a conversational response or share their transcript with the LLM after the fact. In that case, the review could help reprompt the agent to generate a better result.&lt;/li&gt;
&lt;li&gt;The team reviews and aligns on the prompt(s) before code generation.&lt;/li&gt;
&lt;li&gt;The code is generated and committed.&lt;/li&gt;
&lt;li&gt;A traditional pull request reviews the concrete output for correctness, safety, and fit.&lt;/li&gt;
&lt;/ul&gt;
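
&lt;p&gt;To make the first step concrete, here’s a minimal sketch of what a prompt request might look like if it were checked into the repo as a reviewable artifact. The shape and field names below are our own illustration, not an established standard:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// A hypothetical shape for a reviewable prompt request.
interface PromptRequest {
  title: string;          // what the change is meant to accomplish
  intent: string;         // the desired outcome, in plain language
  constraints: string[];  // rules the generated code must respect
  assumptions: string[];  // context the author believes to be true
  prompts: string[];      // the prompt(s) to hand to the model once approved
}

// Purely illustrative example; teams would define their own convention.
const example: PromptRequest = {
  title: "Add rate limiting to the public API",
  intent: "Reject clients that exceed 100 requests per minute with HTTP 429.",
  constraints: [
    "Reuse the existing Redis client; do not add new dependencies.",
    "Limits must be configurable per route.",
  ],
  assumptions: ["All public routes pass through the shared middleware stack."],
  prompts: [
    "Implement a sliding-window rate limiter as middleware, keyed by API token...",
  ],
};
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Reviewers can then comment on intent, constraints, and assumptions exactly as they would on a diff, before any code exists.&lt;/p&gt;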

&lt;p&gt;In this model, prompt requests act as an upstream alignment step for AI-generated work. They reduce ambiguity early, potentially shrink downstream diffs, and make pull requests easier to review.&lt;/p&gt;

&lt;p&gt;Prompt requests do not replace the later rigor needed in pull requests. They just add more rigor earlier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are prompt requests going to replace pull requests?
&lt;/h2&gt;

&lt;p&gt;Let’s be honest: prompt requests are unlikely to fully replace pull requests. No one thinks a large publicly traded company is going to trust AI-generated output so completely that it will bet its revenue (and future) on it without careful review.&lt;/p&gt;

&lt;p&gt;While we are bullish on prompt requests at CodeRabbit, the industry is still in the early stages of their adoption, and today’s LLMs are not capable of fully replacing pull requests.&lt;/p&gt;

&lt;p&gt;Will prompt requests work instead of pull requests for smaller open-source or single-maintainer projects? We are likely heading toward that reality sooner rather than later, but pull requests remain an essential part of the current software development lifecycle. This is especially true for production systems, regulated environments, or large teams with shared ownership and long-lived, complex codebases.&lt;/p&gt;

&lt;p&gt;Pull requests exist because software development ultimately involves shipping specific, deterministic artifacts into production. As long as that remains true, teams will need a concrete mechanism to review, test, audit, and approve the exact code that runs.&lt;/p&gt;

&lt;p&gt;The more realistic future is not prompt requests instead of pull requests. It is prompt requests before pull requests.&lt;/p&gt;

&lt;p&gt;What is becoming clear is that the quality of the prompt increasingly determines the quality of the output. Treating prompts as first-class artifacts acknowledges that reality without abandoning the safeguards that traditional code review provides.&lt;/p&gt;

&lt;p&gt;In that sense, “show me the prompt” does not remove accountability. It shifts some of it earlier, where it can reduce rework, surface intent, and make the pull request stage easier rather than unnecessary.&lt;/p&gt;

&lt;p&gt;Interested in trying CodeRabbit? Get a &lt;a href="https://coderabbit.link/QjWcnUj" rel="noopener noreferrer"&gt;14-day free trial&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>codenewbie</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Why users shouldn’t choose their own LLM models: Choice is not always good</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Thu, 29 Jan 2026 01:29:00 +0000</pubDate>
      <link>https://forem.com/coderabbitai/why-users-shouldnt-choose-their-own-llm-models-choice-is-not-always-good-42nj</link>
      <guid>https://forem.com/coderabbitai/why-users-shouldnt-choose-their-own-llm-models-choice-is-not-always-good-42nj</guid>
      <description>&lt;p&gt;Giving users a dropdown of LLMs to choose from often seems like the right product choice. After all, users might have a favorite model or they might want to try the latest release the moment it drops.&lt;/p&gt;

&lt;p&gt;One problem: unless they’re an ML engineer running regular evals and benchmarks to understand where each model actually performs best, that choice is liable to hurt far more than it helps. You end up giving users what they think they want, while quietly degrading the quality of what they produce with your tool: inconsistent results, wasted tokens, and erratic model behavior.&lt;/p&gt;

&lt;p&gt;For example, developers may unknowingly pick a model that’s slower, less reliable for their specific task, or tuned for a completely different kind of reasoning pattern. Or they might choose a faster model than they need that won’t comprehensively reason through the task.&lt;/p&gt;

&lt;p&gt;Choosing which model to use isn’t a matter of personal taste… It's a systems-level optimization problem. The right model for any task depends on measurable performance across dozens of task dimensions, not just how recently it was released or how smart users perceive it to be. And that decision should belong to engineers armed with eval data, not end users who wrongly believe they’ll get better results with the model they personally prefer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The myth of ‘preference’ in AI model selection
&lt;/h2&gt;

&lt;p&gt;Many AI platforms love to market model choice as a premium feature. “Choose GPT-4o, Claude, or Gemini” sounds empowering and gives users the impression that they will get the best or latest experience. It taps into the same instinct that makes people want to buy the newest phone the week it launches: the feeling that newer and bigger must mean better.&lt;/p&gt;

&lt;p&gt;The reality, though, is that most users have no idea which model actually performs best for their specific use case. And even if they did, that answer would likely shift from one query to another. The “best” model for code generation might not be the “best” for bug detection, documentation, or static analysis. There might also be multiple models that are best at different parts of a code review or other task, depending on what kind of code is being reviewed.&lt;/p&gt;

&lt;p&gt;Some tasks require greater creativity and reasoning depth; others need precision and consistency. A developer who blindly defaults to “the biggest model available” for coding help often ends up with slower, more expensive, and less deterministic results. In some cases, a smaller, domain-tuned model will handily outperform its heavyweight cousin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why model selection is an evaluation problem, not preference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft261iarforbza26sdasb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft261iarforbza26sdasb.png" alt="Why model selection is an evaluation problem, not preference" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Model selection isn’t a matter of taste… it's a data problem. Behind the scenes, engineers run thousands of evaluations across tasks like code correctness, latency, context retention, and tool integration. These aren’t one-time benchmarks; they’re continuous systems designed to measure how models actually perform under specific, reproducible conditions. The results form a kind of performance map that shows which model excels at refactoring versus summarizing code, or which one handles long-context reasoning without drifting off-topic.&lt;/p&gt;

&lt;p&gt;End users never see that map. While some might read benchmarks or articles about a model’s performance, most are making decisions blind, guided mostly by hunches, Reddit posts, or vague impressions of “smartness.”&lt;/p&gt;

&lt;p&gt;Even if they wanted to, users rarely have the time or infrastructure to run their own evals across hundreds of tasks and models. The result is that people often optimize for hype rather than outcomes… choosing the model that feels cleverest or sounds more fluent, not the one that’s objectively better for the job.&lt;/p&gt;

&lt;p&gt;And human perception alone is a terrible way to evaluate model competence. A model that seems chatty and confident can be consistently wrong, while one that feels hesitant might actually deliver the most accurate, reproducible results. Without hard data from evaluations, those distinctions disappear.&lt;/p&gt;

&lt;h2&gt;
  
  
  The prompting paradox
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzukbe2kw524k69prv82v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzukbe2kw524k69prv82v.png" alt="The prompting paradox" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One critical drawback to choosing your own model is that no two LLMs think alike. Each model interprets prompts slightly differently. Some are more literal, others more associative; some favor verbosity, others prefer minimalism. A prompt that works perfectly on GPT-5 might completely derail on Sonnet 4.5, leading to hallucinated code, missing context, or an output that ignores key constraints.&lt;/p&gt;

&lt;p&gt;Temperature, context length, and formatting differences only make the problem worse. A model with a higher temperature parameter might produce creative explanations but rewrite variable names, while another with stricter formatting rules could break markdown or indentation. These small mismatches can quietly poison a workflow, especially in environments where consistent structure matters most, such as code reviews, diff comments, or documentation summaries.&lt;/p&gt;

&lt;p&gt;In systems where the prompts are written for the user, letting users choose their own models unknowingly disrupts the prompt-engineering assumptions that keep those workflows stable. Every prompt is tuned with certain expectations about how the model parses instructions, handles errors, and formats its output. Swap out the model and those assumptions collapse.&lt;/p&gt;

&lt;p&gt;It’s even harder in situations where the user writes the prompt themselves, as with AI coding tools. Users rarely have enough context, knowledge, and experience to write effective prompts for each model. Over time, though, they might find a few prompting methods that get the best out of a particular model. If they later switch to a new model, they often find their old prompts aren’t as effective and have to learn from scratch how to get the best results from the new one.&lt;/p&gt;

&lt;p&gt;That’s why well-designed systems rely on model orchestration, not user preference. In review pipelines or agentic systems, predictability is everything. You need each component to behave consistently so downstream tools and other models can interpret the results. Giving users the freedom to swap models isn’t customization; it’s chaos engineering without the safety net.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hidden costs of model freedom
&lt;/h2&gt;

&lt;p&gt;Once users can switch models at will, all the invisible consistency that makes AI-assisted workflows dependable begins to crumble. The consequences aren’t abstract; they’re measurable and they multiply fast.&lt;/p&gt;

&lt;p&gt;Across teams, the first thing you notice is inconsistency. Two developers can run the same review prompt and get completely different feedback. One gets a precise diff comment, the other might get a philosophical musing on the meaning of clean code. That inconsistency makes it impossible to reproduce results, which is deadly for any process that relies on traceability or QA.&lt;/p&gt;

&lt;p&gt;Then there’s cost. Larger models burn through tokens faster and often respond slower, introducing both financial waste and latency drag. And when users unknowingly pick models with shorter context windows, the result is truncated inputs or missing context. It’s like asking someone to summarize a novel after reading only half of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The better alternative: Dynamic, data-driven routing
&lt;/h2&gt;

&lt;p&gt;The smarter alternative to user-driven chaos is dynamic, data-driven routing. That means systems that automatically choose the right model for the right task. Instead of asking users to guess which LLM might perform best, auto-routing engines make that choice in real-time based on metrics, evals, and historical performance.&lt;/p&gt;

&lt;p&gt;Think of it as orchestration, not selection. A large model might be routed in for creative reasoning, open-ended problem solving, or complex code explanations. A smaller, domain-tuned model might handle deterministic checks, linting, or static analysis where precision and speed matter more than eloquence. The system continuously evaluates the outcomes, tracking correctness, latency, and user feedback, in order to refine its routing logic over time.&lt;/p&gt;

&lt;p&gt;This approach turns what used to be human guesswork into an adaptive, evidence-based process. The routing system learns which models excel at which tasks, under which conditions, and how to balance cost, speed, and quality.&lt;/p&gt;
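
&lt;p&gt;As a sketch of the idea, a minimal router might pick a model from an eval-derived performance map rather than from a user preference. The task names, model names, and scores below are made up for illustration; this is the general pattern, not any particular product’s routing logic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal sketch of eval-driven model routing. Models and scores are
// placeholders, not real benchmark data.
const evalScores = {
  refactor: [
    { model: "large-reasoner", quality: 0.91, cost: 5 },
    { model: "small-coder", quality: 0.78, cost: 1 },
  ],
  summarize: [
    { model: "large-reasoner", quality: 0.88, cost: 5 },
    { model: "small-coder", quality: 0.86, cost: 1 },
  ],
};

type Task = keyof typeof evalScores;

function routeModel(task: Task, costBudget: number): string {
  // Keep only models that fit the budget, then take the highest quality.
  const affordable = evalScores[task].filter(function (m) {
    return m.cost &amp;lt;= costBudget;
  });
  if (affordable.length === 0) return evalScores[task][0].model;
  affordable.sort(function (a, b) { return b.quality - a.quality; });
  return affordable[0].model;
}

// Summarization is nearly as good on the cheap model, so the router picks it:
console.log(routeModel("summarize", 2)); // "small-coder"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;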

&lt;p&gt;Advanced teams already operate this way. In CodeRabbit, for example, the orchestration layer sits between the user and the models, using structured prompts, eval data, and performance histories to dispatch requests intelligently. Developers don’t have to think about which LLM is behind a particular review comment. The system has already chosen the optimal one, validated against internal benchmarks.&lt;/p&gt;

&lt;p&gt;In short, dynamic routing makes model choice invisible. The user gets consistently high-quality results; the engineers get measurable control and efficiency. Everyone wins. Except the dropdown menu.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expertise is in the system, not the slider
&lt;/h2&gt;

&lt;p&gt;The takeaway here is simple: model selection isn’t a feature, it’s a quality control issue. The best results come from systems that make those choices invisibly and are grounded in data, not gut instinct. When model routing is automatic and performance-based, users get consistent, high-quality outputs without needing to think about which model is doing the work.&lt;/p&gt;

&lt;p&gt;Every product that puts a “Choose your LLM” dropdown front and center is outsourcing an engineering decision to the least equipped person to make it.&lt;/p&gt;

&lt;p&gt;Or, put another way: the best AI tool UI is no LLM dropdown at all.&lt;/p&gt;

&lt;p&gt;Curious what it looks like when an AI pipeline optimizes for LLM fit? Try CodeRabbit for free today!&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>productivity</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>An (actually useful) framework for evaluating AI code review tools</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Wed, 28 Jan 2026 12:24:44 +0000</pubDate>
      <link>https://forem.com/coderabbitai/an-actually-useful-framework-for-evaluating-ai-code-review-tools-1g2p</link>
      <guid>https://forem.com/coderabbitai/an-actually-useful-framework-for-evaluating-ai-code-review-tools-1g2p</guid>
      <description>&lt;p&gt;Benchmarks promise clarity. They’re supposed to reduce a complex system to a score, compare competitors side by side, and let the numbers speak for themselves. But, in practice, they rarely do.&lt;/p&gt;

&lt;p&gt;Benchmarks don’t measure “quality” in the abstract. They measure whatever the benchmark designer chose to emphasize, under the specific constraints, assumptions, and incentives of the evaluation.&lt;/p&gt;

&lt;p&gt;Change the dataset, the scoring rubric, the prompts, or the evaluation harness, and the results can shift dramatically. That doesn’t make benchmarks useless, but it does make them fragile, manipulable, and easy to misinterpret. Case in point: database benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Database benchmarks: A cautionary tale
&lt;/h2&gt;

&lt;p&gt;The history of database performance benchmarks is a useful example. As benchmarks became standardized, vendors learned how to optimize specifically for the test rather than for real workloads. Query plans were hand-tuned, caching behavior was engineered to exploit assumptions, and systems were configured in ways no production team would realistically deploy.&lt;/p&gt;

&lt;p&gt;Over time, many engineers stopped trusting benchmark results, treating them as marketing signals rather than reliable indicators of system behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI code review benchmarks are on the same trajectory
&lt;/h2&gt;

&lt;p&gt;We’re currently seeing AI code review benchmarks go down a similar path. As models are evaluated on curated PR sets, synthetic issues, or narrowly defined correctness criteria, tools increasingly optimize for benchmark performance rather than for the messy, contextual, high‑stakes reality of real code review.&lt;/p&gt;

&lt;p&gt;The deeper problem is not just that benchmarks can be misleading, it’s that many “ideal” evaluation designs are difficult to execute correctly in real engineering environments. When an evaluation framework is too detached from real workflows, too easy to game by badly configuring your competitor’s tool, or too complex to run well, the results become hard to trust.&lt;/p&gt;

&lt;p&gt;What follows is a practical framework for evaluating AI code review tools that balances rigor with feasibility and produces results that are both meaningful and interpretable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start from your objectives and make them explicit
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3q4qi11rxkv1yzml2uvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3q4qi11rxkv1yzml2uvm.png" alt="Image1" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before assembling datasets or choosing metrics, it’s critical to define what you actually care about. “Better code review” means different things to different teams, and an evaluation that doesn’t encode those differences will inevitably optimize for the wrong outcome.&lt;/p&gt;

&lt;p&gt;Common objectives include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Catching real defects and risks before merge&lt;/li&gt;
&lt;li&gt;Improving long‑term maintainability and reducing technical debt&lt;/li&gt;
&lt;li&gt;Avoiding low‑value noise that degrades review quality&lt;/li&gt;
&lt;li&gt;Maintaining developer trust and adoption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s also important to distinguish between leading indicators and lagging indicators. Outcomes like fewer production incidents or higher long‑term throughput are real and important, but they often emerge over months, not weeks. Shorter evaluations should focus on signals that correlate strongly with those outcomes, such as the quality of issues caught, whether they are acted on, and how developers respond to the tool.&lt;/p&gt;

&lt;p&gt;Explicitly ranking your objectives, such as quality impact, precision, developer experience, and throughput, helps ensure that your evaluation answers the questions that actually matter to your organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Determine what kind of evaluation is needed
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feiarq8bw6kgr24siskjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feiarq8bw6kgr24siskjo.png" alt="Image2" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most reliable evaluation of any tool is a real-world pilot rather than a controlled offline benchmark. A pilot shows how the tool works in day-to-day situations, instead of judging it against criteria defined by a third-party vendor.&lt;/p&gt;

&lt;h3&gt;
  
  
  In-the-wild pilot
&lt;/h3&gt;

&lt;p&gt;The most reliable signals come from observing how a tool behaves in real, day‑to‑day development.&lt;/p&gt;

&lt;p&gt;Real‑time evaluation reflects actual constraints: deadlines, partial context, competing priorities, and human judgment. It shows not just what a tool can detect in theory, but what it surfaces in practice, and whether those issues matter enough for developers to act on them.&lt;/p&gt;

&lt;p&gt;For this, select a few teams or projects for each tool and run each tool for a period of time under normal usage.&lt;/p&gt;

&lt;p&gt;Measure things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-world detection of issues.&lt;/li&gt;
&lt;li&gt;Severity of issues caught.&lt;/li&gt;
&lt;li&gt;Developer satisfaction and perceived utility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If possible, design A/B-style experiments so you can measure tool vs. no tool on comparable teams or repos, or Tool A vs. Tool B on similar workloads, perhaps alternating weeks or branches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Offline benchmark
&lt;/h3&gt;

&lt;p&gt;For teams that want additional confidence, controlled detection comparisons can provide useful insight if you design them yourself, using your own pull requests and criteria, so they give you the data you actually need. However, they’re not required in most cases, since they provide less useful data than a pilot and can be time-intensive to set up.&lt;/p&gt;

&lt;p&gt;One practical approach is to use a private evaluation or mirror repository. A small, representative set of pull requests can be replayed, allowing multiple tools to be run on the same diffs without disrupting real workflows.&lt;/p&gt;

&lt;p&gt;These comparisons are best used to understand coverage differences by severity and category, and to identify systematic strengths and blind spots across tools.&lt;/p&gt;

&lt;p&gt;After that, you just need to compute the metrics you’re looking to track. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Precision/recall by severity and issue type.&lt;/li&gt;
&lt;li&gt;Comment volume and distribution.&lt;/li&gt;
&lt;/ul&gt;
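
&lt;p&gt;As a rough sketch of the bookkeeping (the data shape here is an assumption; your triage process defines what counts as accepted, and recall additionally requires a labeled set of known issues), per-severity precision from triaged comments might be computed like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch: acceptance-based precision per severity from triaged review
// comments. The data shape is illustrative.
interface TriagedComment {
  severity: "critical" | "major" | "minor" | "nitpick";
  accepted: boolean; // fixed in a follow-up commit or acknowledged as resolved
}

function precisionBySeverity(comments: TriagedComment[]) {
  const buckets: { [sev: string]: { total: number; accepted: number } } = {};
  for (const c of comments) {
    const b = buckets[c.severity] || (buckets[c.severity] = { total: 0, accepted: 0 });
    b.total += 1;
    if (c.accepted) b.accepted += 1;
  }
  // Precision proxy: accepted / surfaced, reported per severity bucket.
  const precision: { [sev: string]: number } = {};
  for (const sev of Object.keys(buckets)) {
    precision[sev] = buckets[sev].accepted / buckets[sev].total;
  }
  return precision;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;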

&lt;h2&gt;
  
  
  Why evaluating multiple tools on the same pull request is usually misleading
&lt;/h2&gt;

&lt;p&gt;If you want to do a head-to-head comparison via either a benchmark or a pilot, a common instinct is to run all the tools on the exact same pull requests, rather than mirroring each PR and running the tools on it separately, or running them on different but comparable PRs. On the surface, running them all simultaneously feels fair and efficient. In practice, it introduces serious problems.&lt;/p&gt;

&lt;p&gt;When multiple AI reviewers comment on the same PR:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Human reviewers are overwhelmed with feedback and cognitive load spikes.&lt;/li&gt;
&lt;li&gt;No single tool can be experienced as it was designed to be used. For example, some tools skip comments when another tool has already made the same comment, creating the perception that the tool missed the issue.&lt;/li&gt;
&lt;li&gt;Review behavior changes: comments are skimmed, bulk-dismissed, or ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates interference effects. Tools influence each other’s perceived usefulness, and attention, not correctness, becomes the limiting factor. Precision metrics degrade because even high‑quality comments may be ignored simply due to volume. That makes it harder to know the percentage of comments your team would accept from each individual tool under normal usage.&lt;/p&gt;

&lt;p&gt;The result is that you lose the ability to evaluate usability, trust, workflow fit, and real‑world usefulness. You are no longer measuring how a tool performs in practice, but how reviewers cope with noise.&lt;/p&gt;

&lt;p&gt;Running multiple tools on the same exact PR can be useful in narrow, controlled contexts, such as offline detection comparisons, but it is a poor way to evaluate the actual experience and value of a code review tool.&lt;/p&gt;

&lt;p&gt;To understand whether a tool helps your team, it is often best to experience it in isolation within a normal review workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Structuring fair comparisons without complex infrastructure
&lt;/h2&gt;

&lt;p&gt;There are practical ways to compare tools without building elaborate experimentation harnesses.&lt;/p&gt;

&lt;p&gt;Parallel evaluation across repos or teams is often the simplest approach. Select repos or teams that are broadly comparable in language, domain, and PR volume, and run different tools in parallel. Keep configuration effort symmetric and analyze results using normalization techniques (discussed below).&lt;/p&gt;

&lt;p&gt;Alternatively, time‑sliced evaluation within the same repo or team can work when parallel groups are not available. Run one tool for a defined period, then switch. This approach requires acknowledging temporal effects—release cycles, workload changes, learning effects—but can still produce useful, directional insights when interpreted carefully.&lt;/p&gt;

&lt;p&gt;Finally, simply mirroring PRs and running each tool on the mirrors separately also works well if you want to compare comments on the same PRs.&lt;/p&gt;

&lt;p&gt;In all these cases, the goal is to preserve a clean developer experience while collecting comparable data.&lt;/p&gt;

&lt;p&gt;In practice, these approaches can also be combined when that gives a team a better picture of how a tool works. Teams may start with parallel evaluation across different repositories or teams, then swap tools after a fixed period. This helps balance differences in codebase complexity or workload over time, while still avoiding the disruption and interference that comes from running multiple tools on the same pull request. As with any time-based comparison, results should be normalized and interpreted with awareness of temporal effects, but this hybrid approach often provides a good balance of fairness, practicality, and interpretability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics that produce interpretable results
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynm7z2hfre0jsakkpkc5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynm7z2hfre0jsakkpkc5.png" alt="Image4" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on successful deployments across thousands of repositories, we've identified a framework of seven metric categories that together provide a complete picture of your integration. These are the metrics we suggest our customers measure.&lt;/p&gt;

&lt;p&gt;Each category answers a specific question about your AI implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architectural Metrics – Is the tool appropriately integrated? Track how many of an org’s repos are connected and how many extensions are in use (git, IDE, CLI).&lt;/li&gt;
&lt;li&gt;Adoption Metrics – Are developers actually using it? These metrics include monthly active users (MAU), the percentage of total repositories covered and week-over-week growth.&lt;/li&gt;
&lt;li&gt;Engagement Metrics – Are they just ignoring it or actively collaborating with it? These metrics include PRs reviewed versus Chat Sessions initiated. Also track “Learnings used,” how often the AI applies context from previous reviews to new ones.&lt;/li&gt;
&lt;li&gt;Impact Metrics – Is it catching bugs that matter to the team? These metrics include number of issues detected, actionable suggestions, and the “acceptance rate” (percentage of AI comments that result in a code change).&lt;/li&gt;
&lt;li&gt;Quality &amp;amp; Security Metrics – Is it preventing expensive bugs and security vulnerabilities? These metrics include Linter/SAST findings, security vulnerabilities caught (e.g., Gitleaks), and reduction in pipeline failures.&lt;/li&gt;
&lt;li&gt;Governance Metrics – Is it enforcing standards across the team? These metrics include usage of pre-merge checks, warnings vs. errors, and implementation of custom governance rules.&lt;/li&gt;
&lt;li&gt;Developer Sentiment – Are the developers happy with their experience and product? These metrics include survey results, qualitative feedback, and “aha” moments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Accepted issues as a primary quality signal
&lt;/h3&gt;

&lt;p&gt;Not all metrics are equally informative and some are far easier to misread than others. A practical evaluation should focus more attention on signals that are both meaningful and feasible to measure. One of the strongest indicators of value is whether a tool’s feedback leads to real action.&lt;/p&gt;

&lt;p&gt;An issue can reasonably be considered accepted when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A subsequent commit addresses the comment or thread&lt;/li&gt;
&lt;li&gt;A reviewer explicitly acknowledges that the issue has been resolved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This behavioral signal captures correctness, relevance, and usefulness in a way that pure scoring metrics cannot.&lt;/p&gt;

&lt;p&gt;Accepted issues should be reported by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Severity (e.g., critical, major, minor, low, nitpick)&lt;/li&gt;
&lt;li&gt;Category (security, logic, performance, maintainability, testing, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both absolute counts and rates are informative, especially when interpreted together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Precision and signal‑to‑noise
&lt;/h3&gt;

&lt;p&gt;Acceptance rate (accepted issues relative to total surfaced) is a practical proxy for precision. On its own, it is insufficient; paired with comment volume, it becomes far more meaningful.&lt;/p&gt;

&lt;p&gt;High comment volume with low acceptance is a clear signal of noise. Patterns of systematically ignored categories or directories often reveal where configuration or tuning is needed.&lt;/p&gt;
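
&lt;p&gt;A small sketch of that pairing, with arbitrary example thresholds (20 comments, 25% acceptance) rather than recommended values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch: pair acceptance rate with volume to flag noisy categories
// worth tuning or disabling. Thresholds are arbitrary examples.
interface CategoryStats {
  category: string;
  surfaced: number;
  accepted: number;
}

function noisyCategories(stats: CategoryStats[]): string[] {
  return stats
    .filter(function (s) {
      const acceptance = s.accepted / s.surfaced;
      // High volume with low acceptance is the classic noise signature.
      return s.surfaced &amp;gt;= 20 &amp;amp;&amp;amp; acceptance &amp;lt; 0.25;
    })
    .map(function (s) { return s.category; });
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;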

&lt;p&gt;It’s also important to avoid the “LGTM trap”: a tool that leaves very few comments, all of them correct, may appear precise while missing large classes of issues. In many cases, broad coverage combined with configurability is preferable to narrow precision that cannot be expanded.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coverage and issue discovery in real review flows
&lt;/h3&gt;

&lt;p&gt;In typical workflows, the sequence is:&lt;/p&gt;

&lt;p&gt;PR opens → AI review → issues fixed → human review&lt;/p&gt;

&lt;p&gt;Because humans review after the tool, it is often impossible to say with certainty which issues humans would have caught independently. Instead of trying to infer counterfactuals precisely, focus on practical signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepted issues that led to substantive code changes&lt;/li&gt;
&lt;li&gt;Accepted issues in categories humans historically miss (subtle logic, edge cases, maintainability)&lt;/li&gt;
&lt;li&gt;Consistent patterns of issues surfaced across PRs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sampling can help here. Reviewing a subset of PRs and asking, “Would this issue likely have been caught without the tool?” is often more informative than attempting exhaustive labeling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Normalization: Making comparisons fair
&lt;/h3&gt;

&lt;p&gt;Raw counts are misleading when pull requests vary widely in size and complexity. Normalization is essential for fair comparison.&lt;/p&gt;

&lt;p&gt;Useful normalization dimensions include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PR size (lines changed, files touched)&lt;/li&gt;
&lt;li&gt;PR type (bug fix, feature, refactor, infra/config, test‑only)&lt;/li&gt;
&lt;li&gt;Domain or risk area (frontend/backend, high‑risk components)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Comparisons should be made within similar buckets, and distributions are often more informative than averages. Small samples at extremes should be interpreted cautiously.&lt;/p&gt;
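
&lt;p&gt;A minimal sketch of size normalization, assuming you’ve exported per-PR stats (the bucket boundaries are arbitrary choices, not recommendations):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch: bucket PRs by size so acceptance rates are compared like-for-like.
interface PrStats {
  linesChanged: number;
  commentsSurfaced: number;
  commentsAccepted: number;
}

function sizeBucket(linesChanged: number): string {
  if (linesChanged &amp;lt; 50) return "small";
  if (linesChanged &amp;lt; 300) return "medium";
  return "large";
}

function acceptanceByBucket(prs: PrStats[]) {
  const totals: { [bucket: string]: { surfaced: number; accepted: number } } = {};
  for (const pr of prs) {
    const key = sizeBucket(pr.linesChanged);
    const t = totals[key] || (totals[key] = { surfaced: 0, accepted: 0 });
    t.surfaced += pr.commentsSurfaced;
    t.accepted += pr.commentsAccepted;
  }
  // Compare tools within the same bucket, never across the whole dataset.
  return totals;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;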

&lt;h3&gt;
  
  
  Interpreting throughput and velocity
&lt;/h3&gt;

&lt;p&gt;Throughput metrics like time‑to‑merge are easy to misread. When a tool begins catching real issues that were previously missed, merge times may initially increase. This often reflects improved rigor rather than reduced productivity.&lt;/p&gt;

&lt;p&gt;Throughput should therefore be treated as a secondary metric, normalized by PR complexity and evaluated over time alongside quality indicators. Short‑term slowdowns can be a leading indicator of long‑term gains in code health.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bringing it all together
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4ms6ekv9g2ofxqnzg6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4ms6ekv9g2ofxqnzg6h.png" alt="Image4" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A reliable evaluation does not require perfect benchmarks or elaborate experimental design. It requires clarity about objectives, careful interpretation of metrics, and an emphasis on real‑world behavior.&lt;/p&gt;

&lt;p&gt;Start with normal workflows and behavioral signals. Normalize to make comparisons fair. Use controlled comparisons selectively to deepen understanding. Combine quantitative metrics with concrete examples of impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final takeaway
&lt;/h2&gt;

&lt;p&gt;Benchmarks are useful starting points, not verdicts.&lt;/p&gt;

&lt;p&gt;The most trustworthy evaluations of AI code review tools are grounded in real workflows and behavior‑based signals, and they balance rigor with practicality. When done well, they provide confidence not just that a tool performs well on paper, but that it meaningfully improves both the immediate quality of code changes and the long‑term health of the codebase.&lt;/p&gt;

&lt;p&gt;Curious how CodeRabbit performs on your codebase? Get a &lt;a href="https://coderabbit.link/QjWcnUj" rel="noopener noreferrer"&gt;free trial today&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>codereview</category>
      <category>benchmark</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>CodeRabbit's AI Code Reviews now support NVIDIA Nemotron</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Wed, 28 Jan 2026 12:17:13 +0000</pubDate>
      <link>https://forem.com/coderabbitai/coderabbits-ai-code-reviews-now-support-nvidia-nemotron-27pn</link>
      <guid>https://forem.com/coderabbitai/coderabbits-ai-code-reviews-now-support-nvidia-nemotron-27pn</guid>
      <description>&lt;p&gt;TL;DR: Blend of frontier &amp;amp; open models is more cost efficient and reviews faster. NVIDIA Nemotron is supported for CodeRabbit self-hosted customers.&lt;/p&gt;

&lt;p&gt;We are delighted to share that CodeRabbit now supports the NVIDIA Nemotron family of open models among its blend of Large Language Models (LLMs) used for AI code reviews. Support for Nemotron 3 Nano has initially been enabled for CodeRabbit’s self-hosted customers running its container image on their infrastructure. Nemotron is used to power the context gathering and summarization stage of the code review workflow before the frontier models from OpenAI and Anthropic are used for deep reasoning and generating review comments for bug fixes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Nemotron helps: Context gathering at scale
&lt;/h2&gt;

&lt;p&gt;This new blend of open and frontier models improves both the overall speed of context gathering and cost efficiency by routing different parts of the review workflow to the appropriate model family, while delivering review accuracy on par with running frontier models alone.&lt;/p&gt;

&lt;p&gt;High-quality AI code reviews that can find deep, hidden bugs require extensive context gathering around the code being analyzed. The most frequent (and most token-hungry) work is summarizing and refreshing that context: what changed in the code and whether it matches developer intent, how those changes connect with the rest of the codebase, what the repo conventions or custom rules are, what external data sources are available to aid the review, and so on.&lt;/p&gt;

&lt;p&gt;This context-building stage is the workhorse of the overall AI code review process, and it runs several times iteratively throughout the review workflow. &lt;a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" rel="noopener noreferrer"&gt;NVIDIA Nemotron 3 Nano&lt;/a&gt; was built for high-efficiency tasks; its large context window (1 million tokens) and high throughput make it possible to gather a lot of data and run several iterations of context summarization and retrieval.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1767647146546%2F906c3149-a404-471e-ae4e-1c146ccedb7f.png%3Fauto%3Dcompress%2Cformat%26format%3Dwebp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1767647146546%2F906c3149-a404-471e-ae4e-1c146ccedb7f.png%3Fauto%3Dcompress%2Cformat%26format%3Dwebp" alt="CodeRabbit architecture with Nemotron support" width="720" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Blend of frontier and open models
&lt;/h2&gt;

&lt;p&gt;When you open a Pull Request (PR), CodeRabbit’s code review workflow is triggered, starting with an isolated and secure sandbox environment where CodeRabbit analyzes code from a clone of the repo. In parallel, CodeRabbit pulls in context signals from several sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code and PR index&lt;/li&gt;
&lt;li&gt;Linter / Static App Security Tests (SAST)&lt;/li&gt;
&lt;li&gt;Code graph&lt;/li&gt;
&lt;li&gt;Coding agent rules files&lt;/li&gt;
&lt;li&gt;Custom review rules and Learnings&lt;/li&gt;
&lt;li&gt;Issue tickets (Jira, Linear, GitHub issues)&lt;/li&gt;
&lt;li&gt;Public MCP servers&lt;/li&gt;
&lt;li&gt;Web search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To dive deeper into our context engineering approach you can check out our blog: &lt;a href="https://www.coderabbit.ai/blog/the-art-and-science-of-context-engineering" rel="noopener noreferrer"&gt;The art and science of context engineering for AI code reviews&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A lot of this context, along with the code diff being analyzed, is used to generate a PR Summary before any review comments are generated. This is where open models come in. Instead of sending all of the context to frontier models, CodeRabbit now uses Nemotron 3 Nano to gather and summarize the relevant context. Summarization is at the heart of every code review and is the key to delivering a high signal-to-noise ratio in the review comments.&lt;/p&gt;

&lt;p&gt;After the summarization stage is completed the frontier models (e.g., OpenAI GPT-5.2-Codex and Anthropic Claude-Opus/Sonnet 4.5) perform deep reasoning to generate review comments for bug fixes, and execute agentic steps like review verification, pre-merge checks, and “finishing touches” (including docstrings and unit test suggestions).&lt;/p&gt;
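
&lt;p&gt;The general shape of such a blended pipeline looks roughly like the sketch below. The client objects are hypothetical stand-ins for real model APIs; this illustrates the pattern, not CodeRabbit’s implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical client stubs standing in for real model APIs.
const openModel = {
  async summarize(input: string) {
    return "condensed context (stub)"; // fast, large-context summarization
  },
};
const frontierModel = {
  async review(diff: string, context: string) {
    return ["review comment (stub)"]; // deep reasoning on a small input
  },
};

// Two-stage review: the open model condenses raw context so the frontier
// model only reasons over a compact summary.
async function reviewPullRequest(diff: string, contextSources: string[]) {
  const contextSummary = await openModel.summarize(contextSources.join("\n"));
  return frontierModel.review(diff, contextSummary);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;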

&lt;h2&gt;
  
  
  What this means for our customers
&lt;/h2&gt;

&lt;p&gt;CodeRabbit is now enabling Nemotron-3-Nano-30B support (initially for its self-hosted customers) for the context summarization part of the review workflow along with the frontier models from OpenAI and Anthropic. This results in faster code reviews without compromising quality.&lt;/p&gt;

&lt;p&gt;We are also delighted to support the &lt;a href="https://blogs.nvidia.com/blog/open-models-data-tools-accelerate-ai" rel="noopener noreferrer"&gt;announcement from NVIDIA&lt;/a&gt; today about the expansion of its Nemotron family of open models and are excited to work with the company to help accelerate AI coding adoption across every industry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.coderabbit.ai/contact-us/sales" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt; with our team to access CodeRabbit’s container image if you would like to run AI code reviews on your self-hosted infrastructure.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>What's New in CodeRabbit: January 2026 Edition</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Wed, 28 Jan 2026 08:17:14 +0000</pubDate>
      <link>https://forem.com/coderabbitai/whats-new-in-coderabbit-january-2026-edition-13dk</link>
      <guid>https://forem.com/coderabbitai/whats-new-in-coderabbit-january-2026-edition-13dk</guid>
      <description>&lt;p&gt;January kicked off strong with powerful new APIs for user and metrics management, plus streamlined data export capabilities.&lt;/p&gt;

&lt;p&gt;Alongside these product updates, CodeRabbit just won a 2026 DEVIES Award at &lt;a href="https://x.com/DeveloperWeek" rel="noopener noreferrer"&gt;DeveloperWeek&lt;/a&gt;! We're honored to be recognized alongside the best in dev tools for helping teams ship cleaner code, faster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k88xnx81f3gh6pzpl1k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k88xnx81f3gh6pzpl1k.png" alt="Coderabbit DEVIES Award" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We're also excited to announce that CodeRabbit's AI code reviews now support NVIDIA Nemotron open models! Self-hosted teams get faster, more cost-efficient reviews without sacrificing quality.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-2008303896863928826-670" src="https://platform.twitter.com/embed/Tweet.html?id=2008303896863928826"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-2008303896863928826-670');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=2008303896863928826&amp;amp;theme=dark"
  }



&lt;/p&gt;

&lt;p&gt;With that context, here's everything we shipped in January 2026:&lt;/p&gt;

&lt;h2&gt;
  
  
  User Management API (Jan 21)
&lt;/h2&gt;

&lt;p&gt;Managing user seats and roles at scale is now possible through our REST API. Whether you're onboarding a new team or adjusting access across departments, you can now automate the entire process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oatncr7ksazv5j7ate4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oatncr7ksazv5j7ate4.png" alt="coderabbit unser manageement api" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you can do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;List all users in your organization with filters for seat status and role&lt;/li&gt;
&lt;li&gt;Bulk assign or unassign up to 500 seats per request&lt;/li&gt;
&lt;li&gt;Promote or demote users between admin and member roles in bulk&lt;/li&gt;
&lt;li&gt;Get detailed success/failure feedback for each operation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All endpoints support partial success; if one user operation fails, the rest still complete.&lt;/p&gt;
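
&lt;p&gt;A bulk seat assignment might look roughly like the sketch below. The endpoint URL and payload shape are placeholders we made up for illustration; the linked API reference has the real contract:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch of a bulk seat-assignment call. The URL, headers,
// and body shape are placeholders; consult the official API reference.
async function assignSeats(userIds: string[], apiKey: string) {
  const response = await fetch("https://api.coderabbit.example/v1/users/seats", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + apiKey,
    },
    body: JSON.stringify({ userIds: userIds, assign: true }),
  });
  // Endpoints support partial success, so inspect per-user results.
  return response.json();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;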

&lt;p&gt;&lt;a href="https://docs.coderabbit.ai/api-reference/users-list" rel="noopener noreferrer"&gt;See the User Management API documentation&lt;/a&gt; for authentication and usage details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Export (Jan 14)
&lt;/h2&gt;

&lt;p&gt;Export your pull request review metrics directly from the CodeRabbit Dashboard. Pick your date range, and download a CSV with complexity scores, review times, and comment breakdowns by severity and category.&lt;/p&gt;

&lt;p&gt;Perfect for quarterly reviews, team retrospectives, or building custom analytics dashboards. Access it from the Data Export tab in your dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.coderabbit.ai/guides/data-export" rel="noopener noreferrer"&gt;See the Data Export documentation&lt;/a&gt; for the complete list of fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  Review Metrics API (Jan 14)
&lt;/h2&gt;

&lt;p&gt;Need programmatic access to your review metrics? The new REST API gives you the same data as the dashboard export, but with full query flexibility. Filter by repository, user, or custom date ranges, and get responses in JSON or CSV format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqddjfefffrk3vn5jdgd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqddjfefffrk3vn5jdgd.png" alt="CodeRabbit metrics api" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you can do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query metrics for any date range programmatically&lt;/li&gt;
&lt;li&gt;Filter results by specific repositories or users&lt;/li&gt;
&lt;li&gt;Choose JSON for integration or CSV for analysis&lt;/li&gt;
&lt;li&gt;Build custom reporting dashboards and automation workflows&lt;/li&gt;
&lt;/ul&gt;
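
&lt;p&gt;A query might look roughly like this sketch; again, the URL and parameter names are placeholders, not the documented contract:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch of querying review metrics for a date range.
async function fetchMetrics(from: string, to: string, apiKey: string) {
  const params = new URLSearchParams({ from: from, to: to, format: "json" });
  const response = await fetch(
    "https://api.coderabbit.example/v1/metrics?" + params.toString(),
    { headers: { Authorization: "Bearer " + apiKey } }
  );
  return response.json();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;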

&lt;p&gt;&lt;a href="https://docs.coderabbit.ai/api-reference/metrics-data-api" rel="noopener noreferrer"&gt;See the Metrics API documentation&lt;/a&gt; for authentication and usage details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stay Tuned
&lt;/h2&gt;

&lt;p&gt;We’re continually working to make CodeRabbit smarter, faster, and more collaborative. More updates are on the way, so stay tuned!&lt;/p&gt;

&lt;p&gt;Got feedback or want early access to what’s next?&lt;/p&gt;

&lt;p&gt;Join us on &lt;a href="https://discord.gg/coderabbit" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; or follow &lt;a href="https://x.com/coderabbitai" rel="noopener noreferrer"&gt;@coderabbitai on X&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Our new report: AI code creates 1.7x more problems</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Tue, 06 Jan 2026 09:53:09 +0000</pubDate>
      <link>https://forem.com/coderabbitai/our-new-report-ai-code-creates-17x-more-problems-7lh</link>
      <guid>https://forem.com/coderabbitai/our-new-report-ai-code-creates-17x-more-problems-7lh</guid>
      <description>&lt;p&gt;What we learned from analyzing hundreds of open-source pull requests.&lt;/p&gt;

&lt;p&gt;Over the past year, AI coding assistants have gone from emerging tools to everyday fixtures in the development workflow. At many organizations, part of every code change is now machine-generated or machine-assisted.&lt;/p&gt;

&lt;p&gt;But while this has been accelerating the speed of development, questions have been quietly circulating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why are more defects slipping through into staging?&lt;/li&gt;
&lt;li&gt;Why do certain logic or configuration issues keep appearing?&lt;/li&gt;
&lt;li&gt;And are these patterns tied to AI-generated code?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It would appear that AI is playing a significant role. A &lt;a href="https://go.cortex.io/rs/563-WJM-722/images/2026-Benchmark-Report.pdf?version=0" rel="noopener noreferrer"&gt;recent report&lt;/a&gt; found that while pull requests per author increased by 20% year-over-year (thanks to help from AI), incidents per pull request increased by 23.5%.&lt;/p&gt;

&lt;p&gt;This year also brought several high-visibility incidents, postmortems, and anecdotal stories pointing to AI-written changes as a contributing factor. These weren’t fringe cases or misuses. They involved otherwise normal pull requests that simply embedded subtle mistakes. And yet, despite rapid adoption of AI coding tools, there has been surprisingly little concrete data about how AI-authored PRs differ in quality from human-written ones.&lt;/p&gt;

&lt;p&gt;So, CodeRabbit set out to answer that question empirically in our &lt;a href="http://www.coderabbit.ai/whitepapers/state-of-AI-vs-human-code-generation-report?" rel="noopener noreferrer"&gt;State of AI vs Human Code Generation Report&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our State of AI vs Human Code Generation Report
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqz37icfk4j2p28u2ku4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqz37icfk4j2p28u2ku4.png" alt=" " width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We analyzed 470 open-source GitHub pull requests, including 320 AI-co-authored PRs and 150 human-only PRs, using CodeRabbit’s structured issue taxonomy. Every finding was normalized to issues per 100 PRs and we used statistical rate ratios to compare how often different types of problems appeared in each group.&lt;/p&gt;

&lt;p&gt;The results? Clear, measurable, and consistent with what many developers have been feeling intuitively: AI accelerates output, but it also amplifies certain categories of mistakes.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.coderabbit.ai/whitepapers/state-of-AI-vs-human-code-generation-report?" rel="noopener noreferrer"&gt;READ THE FULL REPORT&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of our study
&lt;/h2&gt;

&lt;p&gt;Getting data on the issues that are more prevalent in AI-authored PRs is critical for engineering teams, but the challenge was determining which PRs were AI-authored versus human-authored. Since it was impossible to directly confirm authorship for each PR in a large enough OSS dataset, we checked for signals that a PR was co-authored by AI and, for the purposes of the study, assumed that PRs without those signals were human-authored.&lt;/p&gt;

&lt;p&gt;This resulted in statistically significant differences in issue patterns between the two datasets, which we are sharing in this study so teams can better know what to look for. However, we cannot guarantee all the PRs we labelled as human authored were actually authored only by humans. Our full methodology is shared at the end of the report.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 10 findings from the report
&lt;/h2&gt;

&lt;p&gt;No issue category was uniquely AI, but most categories saw significantly more errors in AI-authored PRs. In other words, humans and AI make the same kinds of mistakes; AI just makes many of them more often and at a larger scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. AI-generated PRs contained ~1.7× more issues overall.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5vzzm4c705q825c9rz8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5vzzm4c705q825c9rz8.png" alt="Image" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Across 470 PRs, AI-authored changes produced 10.83 issues per PR, compared to 6.45 for human-only PRs. Even more striking: high-issue outliers were much more common in AI PRs, creating heavy review workloads.&lt;/p&gt;
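
&lt;p&gt;Those two numbers are exactly where the headline figure comes from: the rate ratio is simply one group’s per-PR issue rate divided by the other’s.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Rate ratio behind the headline: issues per AI-co-authored PR divided
// by issues per human-only PR (numbers from the report).
const aiRate = 10.83;
const humanRate = 6.45;
console.log((aiRate / humanRate).toFixed(2)); // "1.68", i.e. the ~1.7x figure
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;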

&lt;h3&gt;
  
  
  2. Severity escalates with AI: More critical and major issues.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ocaswl32qtmc5xbpar6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ocaswl32qtmc5xbpar6.png" alt="Image" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI PRs show ~1.4–1.7× more critical and major findings.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Logic and correctness issues were 75% more common in AI PRs.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futg7615yyexr10zvid62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futg7615yyexr10zvid62.png" alt="Image" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These include business logic mistakes, incorrect dependencies, flawed control flow, and misconfigurations. Logic errors are among the most expensive to fix and most likely to cause downstream incidents.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Readability issues spiked more than 3× in AI contributions.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsebmz9extdwxjogc2be.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsebmz9extdwxjogc2be.png" alt="Image" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The single biggest difference across the entire dataset was in readability. AI-produced code often looks consistent but violates local patterns around naming, clarity, and structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Error handling and exception-path gaps were nearly 2× more common.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc41n57udzve5x2hjrty8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc41n57udzve5x2hjrty8.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI-generated code often omits null checks, early returns, guardrails, and comprehensive exception logic, issues tightly tied to real-world outages.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Security issues were up to 2.74× higher
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwahodeu88fplg2z9maa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwahodeu88fplg2z9maa.png" alt="Image" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most prominent pattern involved improper password handling and insecure object references. While no vulnerability type was unique to AI, nearly all were amplified.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Performance regressions, though small in number, skewed heavily toward AI.
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidvzd2ek8m8wju03re5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidvzd2ek8m8wju03re5n.png" alt="Image" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Excessive I/O operations were ~8× more common in AI-authored PRs. This reflects AI’s tendency to favor clarity and simple patterns over resource efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Concurrency and dependency correctness saw ~2× increases.
&lt;/h3&gt;

&lt;p&gt;Incorrect ordering, faulty dependency flow, or misuse of concurrency primitives appeared far more frequently in AI PRs. These were small mistakes with big implications.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Formatting problems were 2.66× more common in AI PRs.
&lt;/h3&gt;

&lt;p&gt;Even teams with formatters and linters saw elevated noise: spacing, indentation, structural inconsistencies, and style drift were all more prevalent in AI-generated code.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. AI introduced nearly 2× more naming inconsistencies.
&lt;/h3&gt;

&lt;p&gt;Unclear naming, mismatched terminology, and generic identifiers appeared frequently in AI-generated changes, increasing cognitive load for reviewers.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.coderabbit.ai/whitepapers/state-of-AI-vs-human-code-generation-report?" rel="noopener noreferrer"&gt;READ THE FULL REPORT&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why these patterns appear
&lt;/h2&gt;

&lt;p&gt;Why are teams seeing so many issues with AI-generated code? Here’s our analysis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI lacks local business logic: Models infer code patterns statistically, not semantically. Without strict constraints, they miss the rules of the system that senior engineers internalize.&lt;/li&gt;
&lt;li&gt;AI generates surface-level correctness: It produces code that looks right but may skip control-flow protections or misuse dependency ordering.&lt;/li&gt;
&lt;li&gt;AI doesn’t adhere perfectly to repo idioms: Naming patterns, architectural norms, and formatting conventions often drift toward generic defaults.&lt;/li&gt;
&lt;li&gt;Security patterns degrade without explicit prompts: Unless guarded, models recreate legacy patterns or outdated practices found in older training data.&lt;/li&gt;
&lt;li&gt;AI favors clarity over efficiency: Models often default to simple loops, repeated I/O, or unoptimized data structures.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What engineering teams can do about it
&lt;/h2&gt;

&lt;p&gt;Adopting AI coding tools isn’t simply about speeding up development. It requires rethinking the guardrails that ensure all code entering production is safe, maintainable, and correct.&lt;/p&gt;

&lt;p&gt;Based on the patterns in the data, here are the most important takeaways for teams:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Give AI the context it needs
&lt;/h3&gt;

&lt;p&gt;AI makes more mistakes when it lacks business rules, configuration patterns, or architectural constraints. Provide prompt snippets, repo-specific instruction capsules, and configuration schemas to reduce misconfigurations and logic drift.&lt;/p&gt;
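
&lt;p&gt;What that context looks like will vary by team, but as a hypothetical sketch, an instruction capsule can be as simple as a short rules file checked into the repo and fed to the assistant alongside each request (the file name and rules below are invented for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical repo instruction capsule (e.g., docs/ai-guidelines.md)
- All currency amounts are integer cents; never use floats for money.
- Read configuration only through src/config/schema.ts, never process.env directly.
- Any error crossing a service boundary must be wrapped in AppError.
- New endpoints require an authorization check via requireRole().
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Rules like these encode exactly the local knowledge the model can’t infer statistically.&lt;/p&gt;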

&lt;h3&gt;
  
  
  2. Use policy-as-code to enforce style
&lt;/h3&gt;

&lt;p&gt;Readability and formatting were some of the biggest gaps. CI-enforced formatters, linters, and style guides eliminate entire categories of AI-driven issues before review.&lt;/p&gt;
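
&lt;p&gt;As one minimal sketch, assuming a Node/TypeScript repo that already uses Prettier and ESLint, a single script run in CI on every PR turns style policy into a hard gate:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Minimal sketch of a policy gate (scripts/check-style.ts), assuming a
// Node/TypeScript repo with Prettier and ESLint already configured.
import { execSync } from "node:child_process";

function run(cmd: string) {
  // execSync throws on a nonzero exit code, failing the whole script.
  execSync(cmd, { stdio: "inherit" });
}

run("npx prettier --check .");
run("npx eslint --max-warnings 0 .");
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running the same script locally as a pre-push hook means style noise never reaches the PR at all.&lt;/p&gt;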

&lt;h3&gt;
  
  
  3. Add correctness safety rails
&lt;/h3&gt;

&lt;p&gt;Given the rise in logic and error-handling issues (a minimal sketch follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Require tests for non-trivial control flow&lt;/li&gt;
&lt;li&gt;Mandate nullability/type assertions&lt;/li&gt;
&lt;li&gt;Standardize exception-handling rules&lt;/li&gt;
&lt;li&gt;Explicitly prompt for guardrails where needed&lt;/li&gt;
&lt;/ul&gt;
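
&lt;p&gt;Here’s a minimal TypeScript sketch of one such rail, exhaustive handling of a union type with a compiler-enforced failure path (the domain and names are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical sketch: exhaustive handling plus a standardized failure path.
type PaymentStatus = "pending" | "settled" | "refunded";

function assertNever(value: never): never {
  throw new Error("Unhandled case: " + String(value));
}

function statusLabel(status: PaymentStatus): string {
  switch (status) {
    case "pending":
      return "Awaiting settlement";
    case "settled":
      return "Paid";
    case "refunded":
      return "Refunded";
    default:
      // Compile-time rail: adding a new status forces an update here.
      return assertNever(status);
  }
}
&lt;/code&gt;&lt;/pre&gt;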

&lt;h3&gt;
  
  
  4. Strengthen security defaults
&lt;/h3&gt;

&lt;p&gt;Mitigate elevated vulnerability rates by centralizing credential handling, blocking ad-hoc password usage, and running SAST and security linters automatically.&lt;/p&gt;
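
&lt;p&gt;Centralizing credential handling can be as simple as one approved module that everything else must import. A hypothetical sketch, assuming the bcryptjs package:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical centralized credential helper (auth/password.ts), assuming
// the bcryptjs package. Ad-hoc hashing elsewhere is banned by lint rule.
import bcrypt from "bcryptjs";

const COST = 12; // bcrypt work factor

// The only approved way to store a password.
export async function hashPassword(plain: string) {
  return bcrypt.hash(plain, COST);
}

// The only approved way to check one.
export async function verifyPassword(plain: string, hash: string) {
  return bcrypt.compare(plain, hash);
}
&lt;/code&gt;&lt;/pre&gt;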

&lt;h3&gt;
  
  
  5. Nudge the model toward efficient patterns
&lt;/h3&gt;

&lt;p&gt;Offer guidelines for batching I/O, choosing appropriate data structures, and using performance hints in prompts.&lt;/p&gt;
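
&lt;p&gt;For example, here’s a hypothetical sketch of the I/O pattern worth nudging the model away from (&lt;code&gt;findUserById&lt;/code&gt; and &lt;code&gt;findUsersByIds&lt;/code&gt; are invented repository methods):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// The pattern AI often emits: one round trip per user.
async function loadEmailsSlow(db: any, userIds: string[]) {
  const emails: string[] = [];
  for (const id of userIds) {
    const user = await db.findUserById(id); // N sequential queries
    emails.push(user.email);
  }
  return emails;
}

// Batched version: one round trip for the whole set.
async function loadEmailsBatched(db: any, userIds: string[]) {
  const users = await db.findUsersByIds(userIds); // single bulk query
  return users.map(function (u: any) { return u.email; });
}
&lt;/code&gt;&lt;/pre&gt;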

&lt;h3&gt;
  
  
  6. Adopt AI-aware PR checklists
&lt;/h3&gt;

&lt;p&gt;Reviewers should explicitly ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are error paths covered?&lt;/li&gt;
&lt;li&gt;Are concurrency primitives correct?&lt;/li&gt;
&lt;li&gt;Are configuration values validated?&lt;/li&gt;
&lt;li&gt;Are passwords handled via the approved helper?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These questions target the areas where AI is most error-prone.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Get help reviewing and testing AI code
&lt;/h3&gt;

&lt;p&gt;Code review pipelines weren’t created to handle the higher volume of code teams are currently shipping with the help of AI. Reviewer fatigue has been &lt;a href="https://smartbear.com/resources/case-studies/cisco-systems-collaborator/" rel="noopener noreferrer"&gt;found to lead to more issues&lt;/a&gt; and missed bugs. An AI code review tool like CodeRabbit helps by acting as a third-party source of truth: it standardizes quality across the different AI tools a team might use while reducing the time and cognitive labor reviews demand. That lets developers concentrate on the more complex parts of each change and reduces the number of bugs and issues that end up in production.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.coderabbit.ai/whitepapers/state-of-AI-vs-human-code-generation-report?" rel="noopener noreferrer"&gt;READ THE FULL REPORT&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;AI coding tools are powerful accelerators, but acceleration without guardrails increases risk. Our analysis shows that AI-generated code is consistently more variable, more error-prone, and more likely to introduce high-severity issues without the right protections in place.&lt;/p&gt;

&lt;p&gt;The future of AI-assisted development isn’t about replacing developers. It’s about building systems, workflows, and safety layers that amplify what AI does well while compensating for what it tends to miss.&lt;/p&gt;

&lt;p&gt;For the teams that want the speed of AI without the surprises, the data is clear: quality isn’t automatic. It requires deliberate engineering, even when using AI tools.&lt;/p&gt;

&lt;p&gt;An AI code review tool could also help. &lt;a href="https://app.coderabbit.ai/login?????free-trial" rel="noopener noreferrer"&gt;Try CodeRabbit today&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>webdev</category>
      <category>ai</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>It's harder to read code than to write it (especially when AI writes it)</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Tue, 06 Jan 2026 09:43:32 +0000</pubDate>
      <link>https://forem.com/coderabbitai/its-harder-to-read-code-than-to-write-it-especially-when-ai-writes-it-13ag</link>
      <guid>https://forem.com/coderabbitai/its-harder-to-read-code-than-to-write-it-especially-when-ai-writes-it-13ag</guid>
      <description>&lt;p&gt;"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."&lt;/p&gt;

&lt;p&gt;Brian Kernighan (co-creator of Unix and co-author of The C Programming Language)&lt;br&gt;
I've been programming since I was ten. When it became a career, I got obsessed with code quality: clean code, design patterns, all that good stuff. My pull requests were polished like nobody's business: well-thought-out logic, proper error handling, comments, tests, documentation. Everything that makes reviewers nod approvingly.&lt;/p&gt;

&lt;p&gt;Then, LLMs came along and changed everything. I don't write that much code anymore since AI does it faster. A developer’s work now mainly consists of two parts: explaining to a model what you need, then verifying what it wrote, right? I’ve become more of a code architect and quality inspector rolled into one.&lt;/p&gt;

&lt;p&gt;And here came a problem I knew all too well from my years as a tech lead:&lt;/p&gt;

&lt;h2&gt;
  
  
  READING CODE IS ACTUALLY HARDER THAN WRITING IT.
&lt;/h2&gt;

&lt;p&gt;As an open-source maintainer and senior developer, I had to review tons of other people's code, and I learned what Kernighan said the hard way. Reading unfamiliar code is exhausting. You have to reverse-engineer someone else's thought process, figure out why they made certain decisions, and consider edge cases they might have missed.&lt;/p&gt;

&lt;p&gt;With my own code, reviewing and adjusting were a no-brainer. I designed it, I wrote it, and the whole mental model was still fresh in my head. Now the code is coming from an LLM and suddenly reviewing "my own code" has become reviewing someone else's code. Except this "someone else" writes faster than I can think and doesn't take lunch breaks.&lt;/p&gt;

&lt;p&gt;AI is supposed to help, but if I want to ship production-grade software now, I actually have more hard work to do than before. The irony!&lt;/p&gt;

&lt;p&gt;And that’s why, for my first blog post since joining CodeRabbit, I wanted to focus on that fact. This is also, incidentally, why I decided to join CodeRabbit. But we’ll get to that part later.&lt;/p&gt;

&lt;h3&gt;
  
  
  We’re human (unfortunately for code quality)
&lt;/h3&gt;

&lt;p&gt;Here's where things get uncomfortable: we're human beings, not code-reviewing machines. And human brains don't want to do the hard work of thoroughly reviewing something that a) already runs fine, b) passes all the tests, and c) someone else will review anyway. It's so much easier to just git commit &amp;amp;&amp;amp; git push and go grab that well-deserved coffee. Job done!&lt;/p&gt;

&lt;p&gt;I went from “writing manually and shipping quality code” to “generating code fast but shipping… bad code!” The quality dropped not because I had less time; I actually had MORE time, since I wasn't typing everything myself. I just tended to “shorten” the verification phase, telling myself "it works, the tests pass, the team will catch anything major."&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem with "Catching it in review"
&lt;/h3&gt;

&lt;p&gt;At this point, I was already using CodeRabbit to review my team's pull requests (as an OSS-focused dev, I was an early adopter), and those reviews were genuinely helpful! CodeRabbit would catch things that slipped through. Security issues, edge cases, some logic bugs. Those problems that are easy to miss when you're moving fast.&lt;/p&gt;

&lt;p&gt;But here's the thing: those reviews were coming too late. The code was already pushed. Already in the repository, visible to the entire team. Sure, CodeRabbit would flag the issues and I'd fix them but not before my teammates had seen my AI-generated code with obvious problems that I didn't bother to review properly.&lt;/p&gt;

&lt;p&gt;That's not a great look when you've spent decades building a reputation for quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter: CodeRabbit in an IDE
&lt;/h2&gt;

&lt;p&gt;Then, I discovered CodeRabbit had an IDE extension. The AI code reviewer I was already using for PRs could also review my code locally, before anything hits the repo. This was exactly what I needed.&lt;/p&gt;

&lt;p&gt;When I ask CodeRabbit for a review, or simply stage my changes, it reviews them right in VS Code, catching issues before git push. Now, my team sees only the polished version, just like the old days. Except now, I'm shipping AI-generated code at AI speed, and I’m doing it with actual quality control. Automatic reviews mean no willpower required: I don't have to remember to run it, and I don't have to open a separate tool. It just happens at commit time. Reviewing doesn't feel like plowing in the rain anymore.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1i94vhg42uzs6bi8qo90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1i94vhg42uzs6bi8qo90.png" alt="Image" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This gets critical when you're looking at potential security headaches, like the one in the screenshot. CodeRabbit caught an access token leak that could've been a total disaster! Issues like this need to be addressed before that code gets pushed to a repository.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn903csot18bmqvtfspv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzn903csot18bmqvtfspv.png" alt="Image" width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More than that, when it finds something, the fixes are committable. The tool doesn’t tell me to "go figure it out" but gives actual suggestions I can apply immediately, in one click.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje2b04jyiwycn1nobso6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje2b04jyiwycn1nobso6.png" alt="Image" width="800" height="690"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more advanced cases that can’t be resolved with a simple fix, the CodeRabbit IDE extension writes a prompt that it sends to an AI agent of your choice. Fun fact: CodeRabbit is so good at writing prompts that I’ve learned a lot from it and improved my own prompt engineering skills!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxllbsls798ie3k4f5x6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxllbsls798ie3k4f5x6.png" alt="Image" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even the free CodeRabbit IDE Review plan offers incredibly helpful feedback and catches numerous issues. The Pro plan, however, unlocks its true power, providing the same comprehensive coverage you expect from regular CodeRabbit pull request reviews: tool runs, Code Graph analysis, and much more. There is a huge infrastructure behind every check!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mxwryatnfczbgx6duin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mxwryatnfczbgx6duin.png" alt="Image" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Brian Kernighan was right: reading code is harder than writing it. That was true in 1974 and it's even more true now when AI can generate 300 lines while you're still thinking about a variable name.&lt;/p&gt;

&lt;p&gt;We thought AI would make our jobs easier. And it does… if you only count the writing. But the reading part: verifying, reviewing, and understanding what the AI agent actually built? That got harder.&lt;/p&gt;

&lt;p&gt;Many of us are doing 10x the volume at 10x the speed, which means 10x more code to read with the same human brain that gets lazy and wants coffee breaks. The solution isn't to slow down or go back to typing everything manually. The solution is to automate the code review process as thoroughly as we automated the code writing process. If your AI writes the code, another AI should be reading it before you get to it.&lt;/p&gt;

&lt;p&gt;The quality of the reviews is why I recently transitioned from being a CodeRabbit user to joining the team. And that’s why you should also try CodeRabbit in your IDE. The free tier means there's basically no excuse not to try it. Your reputation will thank you.&lt;/p&gt;

&lt;p&gt;Get started today with a 14-day free trial!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Behind the curtain: What it really takes to bring a new model online at CodeRabbit</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Tue, 06 Jan 2026 09:38:11 +0000</pubDate>
      <link>https://forem.com/coderabbitai/behind-the-curtain-what-it-really-takes-to-bring-a-new-model-online-at-coderabbit-5a8f</link>
      <guid>https://forem.com/coderabbitai/behind-the-curtain-what-it-really-takes-to-bring-a-new-model-online-at-coderabbit-5a8f</guid>
      <description>&lt;p&gt;When we published &lt;a href="https://www.coderabbit.ai/blog/the-end-of-one-sized-fits-all-prompts-why-llm-models-are-no-longer-interchangeable?" rel="noopener noreferrer"&gt;our earlier article&lt;/a&gt; on why users shouldn't choose their own models, we argued that model selection isn't a matter of preference, it's a systems problem. This post explains exactly why.&lt;/p&gt;

&lt;p&gt;Bringing a new model online at CodeRabbit isn't a matter of flipping a switch; it's a multi-phase, high-effort operation that demands precision, experimentation, and constant vigilance.&lt;/p&gt;

&lt;p&gt;Every few months, a new large language model drops with headlines promising “next-level reasoning,” “longer context,” or “faster throughput.” For most developers, the temptation is simple: plug it in, flip the switch, and ride the wave of progress.&lt;/p&gt;

&lt;p&gt;We know that impulse. But for us, adopting a new model isn’t an act of curiosity, it’s a multi-week engineering campaign.&lt;/p&gt;

&lt;p&gt;Our customers don’t see that campaign, and ideally, they never should. The reason CodeRabbit feels seamless is precisely because we do the hard work behind the scenes: evaluating, tuning, and validating every model before it touches a single production review. This is what it really looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The curiosity phase: Understanding the model’s DNA
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymzsgeonijlbxqy8d232.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymzsgeonijlbxqy8d232.png" alt="Image" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every new model starts with a hypothesis. We begin by digging into what it claims to do differently: is it a reasoning model, a coding model, or something in between? What’s its architectural bias, its supposed improvements, and how might those capabilities map to our existing review system?&lt;/p&gt;

&lt;p&gt;We compare those traits against the many model types that power different layers of our context-engineering and review pipeline. The question we ask isn’t, “is this new model better?” but, “where might it fit?” Sometimes it’s a candidate for high-reasoning diff analysis; other times, for summarization or explanation work. Each of those domains has its own expectations for quality, consistency, and tone.&lt;/p&gt;

&lt;p&gt;From there, we start generating experiments. Not one or two, but dozens of evaluation configurations across parameters like temperature, context packing, and instruction phrasing. Each experiment feeds into our evaluation harness, which measures both quantitative and qualitative dimensions of review quality.&lt;/p&gt;
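
&lt;p&gt;The harness itself is internal, but mechanically an experiment grid is just a cross product of parameters. A minimal illustrative sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Illustrative sketch only; the real evaluation harness is internal.
const temperatures = [0, 0.2, 0.7];
const contextPacking = ["tight", "padded"];
const instructionStyles = ["terse", "step-by-step"];

const experiments = [];
for (const temperature of temperatures) {
  for (const packing of contextPacking) {
    for (const style of instructionStyles) {
      experiments.push({ temperature, packing, style });
    }
  }
}
// 3 x 2 x 2 = 12 configurations, each scored on the same evaluation set.
&lt;/code&gt;&lt;/pre&gt;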

&lt;h2&gt;
  
  
  2. The evaluation phase: Data over impressions
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frh7uopmqak5kahhsnxdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frh7uopmqak5kahhsnxdp.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This phase takes time. We run models across our internal evaluation set, collecting hard metrics that span coverage, precision, signal-to-noise, and latency. These are the same metrics that underpin the benchmarks we’ve discussed in earlier posts like &lt;a href="https://www.coderabbit.ai/blog/benchmarking-gpt-5-why-its-a-generational-leap-in-reasoning?" rel="noopener noreferrer"&gt;Benchmarking GPT-5&lt;/a&gt;, &lt;a href="https://www.coderabbit.ai/blog/claude-sonnet-45-better-performance-but-a-paradox?" rel="noopener noreferrer"&gt;Claude Sonnet 4.5: Better Performance, but a Paradox&lt;/a&gt;, &lt;a href="https://www.coderabbit.ai/blog/gpt-51-for-code-related-tasks-higher-signal-at-lower-volume?" rel="noopener noreferrer"&gt;GPT-5.1: Higher signal at lower volume&lt;/a&gt;, and &lt;a href="https://www.coderabbit.ai/blog/opus-45-for-code-related-tasks-performs-like-the-systems-architect?" rel="noopener noreferrer"&gt;Opus 4.5: Performs like the systems architect&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But numbers only tell part of the story. We also review the generated comments themselves by looking at reasoning traces, accuracy, and stylistic consistency against our current best-in-class reviewers. We use multiple LLM-judge recipes to analyze tone, clarity, and helpfulness, giving us an extra lens on subtle shifts that raw metrics can’t capture.&lt;/p&gt;

&lt;p&gt;If you’ve read our earlier blogs, you already know why this is necessary: models aren’t interchangeable. A prompt that performs beautifully on GPT-5 may completely derail on Sonnet 4.5. Each has its own “prompt physics.” Our job is to learn it quickly and then shape it to behave predictably inside our system.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The adaptation phase: Taming the differences
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8i3bishj754ijgdv5kk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8i3bishj754ijgdv5kk.png" alt="Image" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we understand where a model shines and where it struggles, we begin tuning. Sometimes that means straightforward prompt adjustments such as fixing formatting drift or recalibrating verbosity. Other times, the work is more nuanced: identifying how the model’s internal voice has changed and nudging it back toward the concise, pragmatic tone our users expect.&lt;/p&gt;

&lt;p&gt;We don’t do this by guesswork. We’ll often use LLMs themselves to critique their own outputs. For example: “This comment came out too apologetic. Given the original prompt and reasoning trace, what would you change to achieve a more direct result?” This meta-loop helps us generate candidate prompt tweaks far faster than trial and error alone.&lt;/p&gt;
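
&lt;p&gt;In code, that meta-loop can be sketched roughly like this (&lt;code&gt;llm.complete&lt;/code&gt; is a stand-in for whatever client the judge model is called through, not our actual API):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical sketch of the critique meta-loop.
async function proposePromptTweak(llm: any, originalPrompt: string, badOutput: string) {
  const judgePrompt =
    "This review comment came out too apologetic.\n\n" +
    "Original prompt:\n" + originalPrompt + "\n\n" +
    "Reasoning trace and output:\n" + badOutput + "\n\n" +
    "What would you change in the prompt to get a more direct result?";
  return llm.complete(judgePrompt);
}
&lt;/code&gt;&lt;/pre&gt;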

&lt;p&gt;During this period, we’re also in constant contact with model providers, sharing detailed feedback about edge-case behavior, bugs, or inconsistencies we uncover. Sometimes those conversations lead to model-level adjustments; other times they inform how we adapt our prompts around a model’s quirks.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The rollout phase: From lab to live traffic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplu14qrd5e41h3v0849f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplu14qrd5e41h3v0849f.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a model starts to perform reliably in offline tests, we move into phased rollout.&lt;/p&gt;

&lt;p&gt;First, we test internally. Our own teams see the comments in live environments and provide qualitative feedback. Then, we open an early-access phase with a small cohort of external users. Finally, we expand gradually using a randomized gating mechanism so that traffic is distributed evenly across organization types, repo sizes, and PR complexity.&lt;/p&gt;
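
&lt;p&gt;One common way to implement that kind of gating (a sketch of the general technique, not our exact mechanism) is deterministic hash bucketing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import { createHash } from "node:crypto";

// Hypothetical sketch: deterministic bucketing keeps each org's exposure
// stable across requests while traffic still splits evenly overall.
function inRollout(orgId: string, rolloutPercent: number) {
  const digest = createHash("sha256").update(orgId).digest();
  const bucket = digest.readUInt16BE(0) % 100; // 0..99, roughly uniform
  return bucket &amp;lt; rolloutPercent;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the bucket is derived from the org ID rather than a random draw, an organization stays in or out of the cohort consistently as the percentage ramps up.&lt;/p&gt;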

&lt;p&gt;Throughout this process, we monitor everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comment quality and acceptance rates&lt;/li&gt;
&lt;li&gt;Latency, error rates, and timeouts&lt;/li&gt;
&lt;li&gt;Changes in developer sentiment or negative reactions to CodeRabbit comments&lt;/li&gt;
&lt;li&gt;Precision shifts in suggestion acceptance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we see degradation in any of these signals, we roll back immediately or limit exposure while we triage. Sometimes it’s a small prompt-level regression; other times, it’s a subtle style drift that affects readability. Either way, we treat rollout as a living experiment, not a switch-flip.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The steady-state phase: Continuous vigilance
&lt;/h2&gt;

&lt;p&gt;Once a model is stable, the work doesn’t stop. We monitor it constantly through automated alerts and daily evaluation runs that detect regressions long before users do. We also listen, both to our own experience (we use CodeRabbit internally) and to customer feedback.&lt;/p&gt;

&lt;p&gt;That feedback loop keeps us grounded. If users report confusion, verbosity, or tonal mismatch, we investigate immediately. Every day, we manually review random comment samples from public repos that use us to ensure that quality hasn’t quietly slipped as the model evolves or traffic scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Why we do all this &amp;amp; why you shouldn’t have to
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwox7veok4eoxz6d9mjr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwox7veok4eoxz6d9mjr4.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each new model we test forces us to rediscover what “good” means under new constraints. Every one comes with its own learning curve, its own failure modes, its own surprises. That’s the reality behind the promise of progress.&lt;/p&gt;

&lt;p&gt;Could an engineering team replicate this process themselves? Technically, yes. But it would mean building a full evaluation harness, collecting diverse PR datasets, writing and maintaining LLM-judge systems, defining a style rubric, tuning prompts, managing rollouts, and maintaining continuous regression checks. All of this before your first production review!&lt;/p&gt;

&lt;p&gt;That’s weeks of work just to reach baseline reliability. And you’d need to do it again every time a new model launches.&lt;/p&gt;

&lt;p&gt;We do this work so you don’t have to. Our goal isn’t to let you pick a model; it’s to make sure you never have to think about it. When you use CodeRabbit, you’re already getting the best available model for each task, tuned, tested, and proven under production conditions.&lt;/p&gt;

&lt;p&gt;Because “choosing your own model” sounds empowering until you realize it means inheriting all this complexity yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Model adoption at CodeRabbit isn’t glamorous. It’s slow, meticulous, and deeply technical. But it’s also what makes our reviews consistent, trustworthy, and quietly invisible. Every diff you open, every comment you read, is backed by this machinery. Weeks of evaluation, thousands of metrics, and countless prompt refinements, all in service of one thing:&lt;/p&gt;

&lt;p&gt;Delivering the best possible review, every time, without you needing to think about which model is behind it.&lt;/p&gt;

&lt;p&gt;Try out CodeRabbit today. &lt;a href="https://app.coderabbit.ai/login?????free-trial" rel="noopener noreferrer"&gt;Get a free 14-day trial&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>webdev</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>CodeRabbit's AI Code Reviews now support NVIDIA Nemotron</title>
      <dc:creator>Arindam Majumder </dc:creator>
      <pubDate>Tue, 06 Jan 2026 05:14:16 +0000</pubDate>
      <link>https://forem.com/coderabbitai/coderabbits-ai-code-reviews-now-support-nvidia-nemotron-4a65</link>
      <guid>https://forem.com/coderabbitai/coderabbits-ai-code-reviews-now-support-nvidia-nemotron-4a65</guid>
      <description>&lt;p&gt;TL;DR: Blend of frontier &amp;amp; open models is more cost efficient and reviews faster. NVIDIA Nemotron is supported for CodeRabbit self-hosted customers.&lt;/p&gt;

&lt;p&gt;We are delighted to share that CodeRabbit now supports the NVIDIA Nemotron family of open models among its blend of Large Language Models (LLMs) used for AI code reviews. Support for Nemotron 3 Nano has initially been enabled for CodeRabbit’s self-hosted customers running its container image on their infrastructure. Nemotron is used to power the context gathering and summarization stage of the code review workflow before the frontier models from OpenAI and Anthropic are used for deep reasoning and generating review comments for bug fixes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Nemotron helps: Context gathering at scale
&lt;/h2&gt;

&lt;p&gt;This new blend of open and frontier models improves the overall speed of context gathering and the cost efficiency of reviews by routing different parts of the review workflow to the appropriate model family, while delivering review accuracy on par with running frontier models alone.&lt;/p&gt;

&lt;p&gt;High-quality AI code reviews that can find deep, hidden bugs require extensive context gathering around the code being analyzed. The most frequent (and most token-hungry) work is summarizing and refreshing that context: what changed in the code and does it match developer intent, how do those changes connect with the rest of the codebase, what are the repo conventions or custom rules, what external data sources are available to aid the review, and so on.&lt;/p&gt;

&lt;p&gt;This context-building stage is the workhorse of the overall AI code review process, and it runs several times iteratively throughout the review workflow. NVIDIA Nemotron 3 Nano was built for high-efficiency tasks, and its large context window (1 million tokens) along with its speed helps gather a lot of data and run several iterations of context summarization and retrieval.&lt;/p&gt;
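
&lt;p&gt;At a high level, that iterative loop looks something like the following sketch (&lt;code&gt;llm.summarize&lt;/code&gt; is a hypothetical stand-in for a Nemotron-backed call, not our actual interface):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical sketch of the iterative summarize-and-refresh loop.
async function buildReviewContext(llm: any, sources: string[]) {
  let summary = "";
  for (const source of sources) {
    // Each pass folds one more context source (diff, conventions, tickets...)
    // into a running summary, keeping the working set small for later stages.
    summary = await llm.summarize(summary + "\n\n" + source);
  }
  return summary;
}
&lt;/code&gt;&lt;/pre&gt;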

&lt;p&gt;CodeRabbit architecture with Nemotron support&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1767647146546%2F906c3149-a404-471e-ae4e-1c146ccedb7f.png%3Fauto%3Dcompress%2Cformat%26format%3Dwebp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1767647146546%2F906c3149-a404-471e-ae4e-1c146ccedb7f.png%3Fauto%3Dcompress%2Cformat%26format%3Dwebp" alt="img" width="720" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Blend of frontier and open models
&lt;/h2&gt;

&lt;p&gt;When you open a Pull Request (PR), CodeRabbit’s code review workflow is triggered starting with an isolated and secure sandbox environment where CodeRabbit analyzes code from a clone of the repo. In parallel, CodeRabbit pulls in context signals from several sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code and PR index&lt;/li&gt;
&lt;li&gt;Linter / Static App Security Tests (SAST)&lt;/li&gt;
&lt;li&gt;Code graph&lt;/li&gt;
&lt;li&gt;Coding agent rules files&lt;/li&gt;
&lt;li&gt;Custom review rules and Learnings&lt;/li&gt;
&lt;li&gt;Issue tickets (Jira, Linear, GitHub issues)&lt;/li&gt;
&lt;li&gt;Public MCP servers&lt;/li&gt;
&lt;li&gt;Web search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To dive deeper into our context engineering approach you can check out our blog: &lt;a href="https://www.coderabbit.ai/blog/the-art-and-science-of-context-engineering" rel="noopener noreferrer"&gt;The art and science of context engineering for AI code reviews&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A lot of this context, along with the code diff being analyzed, is used to generate a PR Summary before any review comments are generated. This is where open models come in. Instead of sending all of the context to frontier models, CodeRabbit now uses Nemotron Nano v3 to gather and summarize the relevant context. Summarization is at the heart of every code review and is the key to delivering high signal-to-noise in the review comments.&lt;/p&gt;

&lt;p&gt;After the summarization stage is completed the frontier models (e.g., OpenAI GPT-5.2-Codex and Anthropic Claude-Opus/Sonnet 4.5) perform deep reasoning to generate review comments for bug fixes, and execute agentic steps like review verification, pre-merge checks, and “finishing touches” (including docstrings and unit test suggestions).&lt;/p&gt;
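
&lt;p&gt;Conceptually, the routing is a simple stage-to-model map; the sketch below is illustrative, with invented names rather than our production configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical sketch of stage-to-model routing; names are illustrative.
const stageModels = {
  contextGathering: "nemotron-3-nano",  // fast, 1M-token context window
  summarization: "nemotron-3-nano",
  reviewComments: "frontier-reasoning", // OpenAI / Anthropic class models
  finishingTouches: "frontier-reasoning",
};

function modelFor(stage: keyof typeof stageModels) {
  return stageModels[stage];
}
&lt;/code&gt;&lt;/pre&gt;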

&lt;h2&gt;
  
  
  What this means for our customers
&lt;/h2&gt;

&lt;p&gt;CodeRabbit is now enabling Nemotron-3-Nano-30B support (initially for its self-hosted customers) for the context summarization part of the review workflow along with the frontier models from OpenAI and Anthropic. This results in faster code reviews without compromising quality.&lt;/p&gt;

&lt;p&gt;We are also delighted to support the &lt;a href="https://blogs.nvidia.com/blog/open-models-data-tools-accelerate-ai" rel="noopener noreferrer"&gt;announcement from NVIDIA&lt;/a&gt; today about the expansion of its Nemotron family of open models and are excited to work with the company to help accelerate AI coding adoption across every industry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.coderabbit.ai/contact-us/sales" rel="noopener noreferrer"&gt;Get in touch&lt;/a&gt; with our team to access CodeRabbit’s container image if you would like to run AI code reviews on your self-hosted infrastructure.&lt;/p&gt;

</description>
      <category>nvidia</category>
      <category>webdev</category>
      <category>ai</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
