<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Andrew</title>
    <description>The latest articles on Forem by Andrew (@koalr).</description>
    <link>https://forem.com/koalr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875674%2F54bba247-5f0c-4700-82f5-af2737af0797.png</url>
      <title>Forem: Andrew</title>
      <link>https://forem.com/koalr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/koalr"/>
    <language>en</language>
    <item>
      <title>Why We Built Proactive Briefings Instead of Another Dashboard</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Thu, 16 Apr 2026 00:49:44 +0000</pubDate>
      <link>https://forem.com/koalr/why-we-built-proactive-briefings-instead-of-another-dashboard-3dh6</link>
      <guid>https://forem.com/koalr/why-we-built-proactive-briefings-instead-of-another-dashboard-3dh6</guid>
      <description>&lt;p&gt;Dashboards are a pull medium. You have to remember to check them, find time to open them, and then interpret what you see. For engineering leaders who are already managing incident queues, planning meetings, and code reviews, that pull rarely happens until something is already wrong. We built AI briefings because we wanted risk visibility to be push.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The dashboard problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The engineering metrics dashboard has become the default answer to a real problem: how do you give engineering leaders visibility into risk without adding meetings to their calendar? The dashboard promises visibility on demand. The practical reality is that demand rarely materializes until after an incident.&lt;/p&gt;

&lt;p&gt;We have talked to dozens of engineering managers who have Koalr, LinearB, or Jellyfish dashboards open in a pinned tab. Most of them check it reactively — after a bad deploy, during a retrospective, when a VP asks why MTTR spiked last week. The dashboard is excellent for those conversations. It is not where risk gets caught before it becomes an incident.&lt;/p&gt;

&lt;p&gt;The pattern we kept seeing was this: the information was in the system. The high-risk PR had been scored. The CODEOWNERS gap had been flagged. The SLO burn rate was elevated. But nobody was looking at the dashboard that Monday morning when the deploy queue was filling up.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The pull vs. push problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pull (Dashboard)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires intent to check&lt;/li&gt;
&lt;li&gt;Competes with every other tab&lt;/li&gt;
&lt;li&gt;Raw data requires interpretation &lt;/li&gt;
&lt;li&gt;No context about what changed since yesterday&lt;/li&gt;
&lt;li&gt;Gets checked reactively after incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Push (Briefing)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Arrives where the team already is (Slack)&lt;/li&gt;
&lt;li&gt;Narrative summary, not raw metrics&lt;/li&gt;
&lt;li&gt;Delta-focused — what changed this week&lt;/li&gt;
&lt;li&gt;Actionable recommendations, not alerts&lt;/li&gt;
&lt;li&gt;Gets read before the deploy queue fills&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;The design constraint: no alert fatigue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The obvious answer to "the dashboard doesn't get checked" is more alerts. Add a PagerDuty rule for high-risk PRs. Slack-notify on every score above 70. This is the wrong answer. Alert fatigue is already endemic in engineering teams, and adding more low-signal notifications makes engineers trust the channel less, not more.&lt;/p&gt;

&lt;p&gt;The design constraint for the briefing was: one message per week, per engineering manager, surfacing only the signals that changed materially. Not every high-risk PR — only the pattern shift. Not every CODEOWNERS gap — only when coverage has dropped enough to matter. Not raw scores — a narrative that tells you what to do with them.&lt;/p&gt;

&lt;p&gt;This forced a different architecture than a notification system. A notification system fires on threshold breaches. A briefing synthesizes a week of data into a coherent picture of what the risk landscape looks like now versus what it looked like last week.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What goes into the briefing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The weekly risk briefing is generated by Claude from a structured data payload containing the week's deploy activity. The inputs to the synthesis are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ Risk score distribution.&lt;/strong&gt; How many deploys scored in the safe, moderate, high, and critical ranges this week versus last week. The absolute numbers matter less than the direction.&lt;br&gt;
&lt;strong&gt;→ High-risk concentrations.&lt;/strong&gt; Which services are contributing disproportionately to high-risk scores. A spike in payments-service risk is more actionable than a diffuse increase across 20 services.&lt;br&gt;
&lt;strong&gt;→ Signal-level drivers.&lt;/strong&gt; Which of the 36 signals are contributing most to elevated scores this week. Change entropy up? CODEOWNERS coverage down? Coverage delta deteriorating? Each has a different remediation path.&lt;br&gt;
&lt;strong&gt;→ MTTR and incident context.&lt;/strong&gt; Whether MTTR improved or deteriorated this week, and whether any incidents co-occurred with high-risk deploys — which feeds the model's accuracy signal.&lt;br&gt;
&lt;strong&gt;→ Positive signals.&lt;/strong&gt; Teams or services that had notably low risk scores this week. Surfacing what is working creates a reinforcement mechanism, not just a problem log.&lt;/p&gt;
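
&lt;p&gt;As an illustration, the structured payload handed to the synthesis step might look like the sketch below. Every field name and value here is hypothetical — this is not Koalr's actual schema, just a picture of the five input categories listed above.&lt;/p&gt;

```python
# Hypothetical weekly briefing payload. Field names and values are
# illustrative only; they are not Koalr's actual schema.
briefing_payload = {
    # Risk score distribution: (this_week, last_week) counts per band.
    "risk_distribution": {
        "safe": (41, 38), "moderate": (12, 14),
        "high": (6, 3), "critical": (1, 0),
    },
    # High-risk concentrations: which services dominate the high band.
    "high_risk_concentrations": [
        {"service": "payments-service", "share_of_high_risk": 0.5},
    ],
    # Signal-level drivers behind this week's elevated scores.
    "top_signal_drivers": ["change_entropy", "codeowners_coverage"],
    # MTTR and incident context for the accuracy feedback loop.
    "mttr_minutes": {"this_week": 62, "last_week": 48},
    "incidents_with_high_risk_deploys": 1,
    # Positive signals worth reinforcing, not just a problem log.
    "positive_signals": [
        {"team": "platform", "weeks_at_low_risk": 3},
    ],
}

print(briefing_payload["top_signal_drivers"])
```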




&lt;p&gt;&lt;strong&gt;Why LLM synthesis, not templates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The briefing could have been a templated report. Pull the top 3 highest-risk services, list the most common signal contributors, format as bullet points. This would have been faster to build and easier to predict.&lt;/p&gt;

&lt;p&gt;We chose LLM synthesis because the value of the briefing comes from narrative coherence — the ability to say "payments-service and auth-service are both elevated this week, and both have CODEOWNERS gaps as the primary driver, which suggests a governance issue rather than a change volume issue." A template cannot make that connection. It can surface the two data points separately, but it cannot synthesize the pattern.&lt;/p&gt;

&lt;p&gt;The synthesis also allows the briefing to be appropriately calibrated to context. A week where MTTR improved and risk scores are down is a different briefing than a week where three high-risk deploys shipped on a Friday before a bank holiday weekend. The LLM generates the right emphasis for the actual situation.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Severity classification: critical, warning, info&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each briefing card is classified as critical, warning, or info. This is not determined by the LLM — it is a deterministic classification based on the underlying metrics before synthesis:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-risk score concentration above threshold, or incident co-occurrence with high-risk deploys in the same week. Requires action before the next release window.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A signal trending in the wrong direction that has not yet produced incidents but warrants monitoring. Coverage drift, emerging CODEOWNERS gaps, MTTR regression.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Info&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A clean week, or a positive signal worth reinforcing — a team that has maintained low risk scores for three consecutive weeks, or model accuracy trending above target.&lt;/li&gt;
&lt;/ul&gt;
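
&lt;p&gt;Because the classification is deterministic rather than LLM-decided, it can be expressed as a plain rule check. The sketch below is a minimal illustration of that idea — the metric names and thresholds are hypothetical, not Koalr's actual implementation.&lt;/p&gt;

```python
# Hypothetical sketch of the deterministic severity classification.
# Metric names and thresholds are illustrative, not Koalr's actual logic.
def classify_week(metrics: dict) -> str:
    """Return 'critical', 'warning', or 'info' for one week of metrics."""
    # Critical: high-risk concentration above threshold, or incidents
    # co-occurring with high-risk deploys in the same week.
    if metrics.get("high_risk_share", 0.0) > 0.25:
        return "critical"
    if metrics.get("incidents_with_high_risk_deploys", 0) > 0:
        return "critical"
    # Warning: a signal trending the wrong direction, no incidents yet
    # (coverage drift, MTTR regression).
    if (metrics.get("coverage_drop_pct", 0.0) > 5.0
            or metrics.get("mttr_regression_pct", 0.0) > 10.0):
        return "warning"
    # Info: a clean week, or a positive signal worth reinforcing.
    return "info"

print(classify_week({"high_risk_share": 0.3}))   # critical
print(classify_week({"coverage_drop_pct": 8.0})) # warning
print(classify_week({}))                         # info
```

&lt;p&gt;Running the rules before synthesis is what keeps the LLM from inflating or downplaying severity: the label is fixed by the data, and the model only writes the narrative around it.&lt;/p&gt;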

&lt;p&gt;The classification is shown first in the briefing so the reader can triage at a glance. An engineering manager receiving a Slack digest at 9am on Monday should be able to determine within 10 seconds whether this week requires immediate action or a quick scan.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What we learned from use&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most consistent feedback we have received from engineering managers using the briefing is that it changed how they start their Monday. Not dramatically — it takes 90 seconds to read — but it means they arrive at the first standup already knowing whether there is a risk concentration to address.&lt;/p&gt;

&lt;p&gt;The second most common feedback is about specificity. The briefing names services, names signals, and names the engineers whose PRs are driving elevated scores. Vague reporting ("risk is elevated this week") does not produce action. Specific reporting ("payments-service has the highest change entropy in 90 days, and three of the five contributors this week had no prior file-level expertise in the modified paths") does.&lt;/p&gt;

&lt;p&gt;The briefing does not replace the dashboard. For deep investigation, for quarterly review, for explaining a trend to a VP, the dashboard is still the right tool. What the briefing does is ensure that the information in the system gets to the right people at the right time — before the deploy queue fills up, not after the incident report.&lt;/p&gt;

&lt;p&gt;The weekly briefing described here is live in Koalr — it runs every Monday and lands in Slack and email. The risk scoring that powers it is free for teams up to 5 contributors. If you want to see what a score looks like on a real PR before committing to anything: &lt;a href="https://koalr.com/live-risk-demo"&gt;koalr.com/live-risk-demo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>development</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>We Scored 28 Famous Open Source PRs for Deploy Risk</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Mon, 13 Apr 2026 02:18:54 +0000</pubDate>
      <link>https://forem.com/koalrapp/we-scored-28-famous-open-source-prs-for-deploy-risk-55bj</link>
      <guid>https://forem.com/koalrapp/we-scored-28-famous-open-source-prs-for-deploy-risk-55bj</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;
The React Hooks PR that changed every React application on earth? Three words in the commit message. One feature flag removed. It scored 91 out of 100 for deploy risk. The Svelte 5 release scored 99. A 65-line TypeScript change scored 79 and silently broke type inference in codebases worldwide. We ran 28 landmark open source pull requests through Koalr's deploy risk model. Here is what we found — and why it matters for the PRs your team ships every week.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;The problem with code review&lt;/strong&gt;&lt;br&gt;
Modern code review answers one question well: is this code correct?&lt;/p&gt;

&lt;p&gt;It answers a different question poorly: how likely is this to cause a production incident?&lt;/p&gt;

&lt;p&gt;Those are not the same question. A PR can be clean, well-written, and thoroughly reviewed — and still wreck production because it touches a critical path nobody flagged, because the reviewer had twelve other PRs open, or because it is the fourth consecutive revert of a feature that never landed cleanly.&lt;/p&gt;

&lt;p&gt;Most teams have no objective signal for the second question. They have green checkmarks.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;What deploy risk scoring is&lt;/strong&gt;&lt;br&gt;
Koalr scores every pull request from 0 to 100 before it merges. The score is built from 36 signals:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blast radius signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many files changed&lt;/li&gt;
&lt;li&gt;What services those files belong to&lt;/li&gt;
&lt;li&gt;Whether shared libraries or interfaces were modified&lt;/li&gt;
&lt;li&gt;CODEOWNERS compliance — did the right people review the right files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Change quality signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File churn — how recently and how often these files have been modified&lt;/li&gt;
&lt;li&gt;Change entropy — how spread across the codebase the diff is&lt;/li&gt;
&lt;li&gt;Lines added vs deleted ratio&lt;/li&gt;
&lt;li&gt;Test coverage of changed files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Context signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reviewer load — how many open PRs each reviewer currently has&lt;/li&gt;
&lt;li&gt;Author's recent incident rate&lt;/li&gt;
&lt;li&gt;Time since last deploy to the same service&lt;/li&gt;
&lt;li&gt;Revert history on the changed file set&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;History signals&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consecutive reverts of the same feature&lt;/li&gt;
&lt;li&gt;Recent incident correlation with this file set&lt;/li&gt;
&lt;li&gt;PR age — how long the branch has been open&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A score of 0–39 is Low. 40–69 is Medium. 70–89 is High. 90–100 is Critical.&lt;/p&gt;
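
&lt;p&gt;The band mapping itself is a simple threshold lookup. A minimal sketch, using the bands listed above:&lt;/p&gt;

```python
def risk_level(score: int) -> str:
    """Map a 0-100 deploy risk score to its band."""
    if score >= 90:
        return "Critical"
    if score >= 70:
        return "High"
    if score >= 40:
        return "Medium"
    return "Low"

print(risk_level(91))  # Critical
print(risk_level(67))  # Medium
```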

&lt;p&gt;The score does not replace review. It gives reviewers a number to orient around before they start reading.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;The experiment&lt;/strong&gt;&lt;br&gt;
We pulled 28 of the most consequential pull requests in open source history and ran them through the model. These are PRs the industry knows by name — the ones that shipped features used by millions of developers, or broke them.&lt;/p&gt;

&lt;p&gt;Here is what the model said.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;The obvious ones scored as expected&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Svelte 5 release &lt;a href="https://github.com/sveltejs/svelte/pull/13701" rel="noopener noreferrer"&gt;https://github.com/sveltejs/svelte/pull/13701&lt;/a&gt; — score 99&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full runes rewrite merged to main. Thousands of files changed, the entire reactivity model replaced, years of migration work consolidated into one merge. Of course it scored critical. High blast radius, enormous file count, fundamental architecture change. The model does what you would expect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TypeScript modules conversion &lt;a href="https://github.com/microsoft/TypeScript/pull/51387" rel="noopener noreferrer"&gt;https://github.com/microsoft/TypeScript/pull/51387&lt;/a&gt; — score 98&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Microsoft's conversion of the entire TypeScript compiler codebase from namespaces to ES modules. It touched every source file in the compiler, changed the build system, and dropped dependencies. If any PR in history deserved a mandatory all-hands review before merge, it was this one.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;The surprising ones — small diffs, enormous blast radius&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where it gets interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;React PR #14679 "Enable hooks!" &lt;a href="https://github.com/facebook/react/pull/14679" rel="noopener noreferrer"&gt;https://github.com/facebook/react/pull/14679&lt;/a&gt; — score 91&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The commit message is three words. The diff is the removal of a single feature flag. You could read the entire change in thirty seconds.&lt;/p&gt;

&lt;p&gt;It scored 91.&lt;/p&gt;

&lt;p&gt;Why? Because the model does not count lines — it looks at what the changed code controls. A feature flag in a framework used by tens of millions of applications is not a small change. It is a detonation switch. The blast radius is every React application on earth. The model flagged it correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Signals fired&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;blast_radius_score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.97&lt;/span&gt;
  &lt;span class="na"&gt;feature_flag_detected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;downstream_consumers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
  &lt;span class="na"&gt;reviewer_load&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.2 (core team — low load)&lt;/span&gt;

&lt;span class="na"&gt;Final score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;91 / Critical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Node.js PR #41749 "lib: add fetch" &lt;a href="https://github.com/nodejs/node/pull/41749" rel="noopener noreferrer"&gt;https://github.com/nodejs/node/pull/41749&lt;/a&gt; — score 82&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One file changed: the bootstrap script that runs inside every Node.js process. Adding the global fetch API touched the most critical execution path in the runtime.&lt;/p&gt;

&lt;p&gt;Single-file PR. High score. The file changed is what matters, not how many files changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TypeScript PR #57465 "Infer type predicates from function bodies" &lt;a href="https://github.com/microsoft/TypeScript/pull/57465" rel="noopener noreferrer"&gt;https://github.com/microsoft/TypeScript/pull/57465&lt;/a&gt; — score 79&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;65 lines of new code. One function modified.&lt;/p&gt;

&lt;p&gt;Those 65 lines changed type inference behavior across the entire checker, producing new type errors in codebases that had compiled cleanly for years. A reviewer looks at 65 lines, sees clean code, approves it. The model sees that those 65 lines live inside the type checker core and have cross-cutting effects on every downstream consumer.&lt;/p&gt;

&lt;p&gt;This is the failure mode standard review misses every time.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The revert pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next.js PR #45196 &lt;a href="https://github.com/vercel/next.js/pull/45196" rel="noopener noreferrer"&gt;https://github.com/vercel/next.js/pull/45196&lt;/a&gt; — score 88&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Title: "Revert 'Revert 'Revert 'Revert 'Initial metadata support''"''&lt;/p&gt;

&lt;p&gt;PR body: "Hopefully last time."&lt;/p&gt;

&lt;p&gt;Four consecutive reverts of the same feature. The model has a specific signal for this: repeated churn on the same file set with revert commits in recent history. It is one of the strongest predictors of another rollback. The PR scored 88 before anyone read a single line of the diff.&lt;/p&gt;
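
&lt;p&gt;A revert-streak signal like this can be approximated from the PR title alone, before reading any of the diff. The sketch below is a rough illustration of the idea, not Koalr's actual implementation — the real signal also looks at revert commits on the file set.&lt;/p&gt;

```python
import re

# Count the nesting depth of Revert "Revert "..."" in a PR title - a
# rough proxy for consecutive reverts of the same feature. Illustrative
# only; a production signal would also inspect commit history.
def revert_depth(title: str) -> int:
    depth = 0
    # Strip one leading Revert "..." / Revert '...' marker per pass.
    pattern = re.compile(r'^\s*Revert\s+["\']', re.IGNORECASE)
    while pattern.match(title):
        depth += 1
        title = pattern.sub("", title, count=1)
    return depth

print(revert_depth(
    "Revert 'Revert 'Revert 'Revert 'Initial metadata support''''"))  # 4
print(revert_depth("Fix flaky test"))                                 # 0
```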




&lt;p&gt;&lt;strong&gt;The one that surprised us most&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Jest-to-Vitest migration in tRPC — PR #3688 &lt;a href="https://github.com/trpc/trpc/pull/3688" rel="noopener noreferrer"&gt;https://github.com/trpc/trpc/pull/3688&lt;/a&gt; — scored 67. Medium risk.&lt;/p&gt;

&lt;p&gt;At first glance, that sounds about right for a test runner swap. But look at what actually changed: every single test file in the repository, plus the root configuration, plus the CI pipeline. The surface area was enormous.&lt;/p&gt;

&lt;p&gt;The score was “only” 67 because the risk model correctly identified that none of the changed files were production code paths — only test infrastructure. A test runner change cannot break a production deployment directly. What it can do is make future regressions invisible, which is a subtler and harder-to-measure risk.&lt;/p&gt;

&lt;p&gt;The model is honest about what it can and cannot see. Broken test infrastructure does not score as a deploy risk — it scores as a coverage risk. Different signal, different response.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The score table&lt;/strong&gt;&lt;br&gt;
Here are eight of the 28 PRs we scored, with the risk level and the primary reason for the score:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpojhx60y8cdtdcv6646m.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpojhx60y8cdtdcv6646m.JPG" alt="Here are eight of the 28 PRs we scored, with the risk level and the primary reason for the score" width="800" height="701"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What this means for your PRs&lt;/strong&gt;&lt;br&gt;
The open source examples are useful because they are public and well-documented. But none of those teams needed a risk model — the React core team was reviewing the hooks PR. It still would have scored 91.&lt;/p&gt;

&lt;p&gt;The real value is the ordinary PR your team ships on a Thursday afternoon, reviewed by one person in fifteen minutes, that quietly introduces a breaking change nobody caught. That team does not have the React core team. They have two engineers, a Monday morning deadline, and a PR that looks fine.&lt;/p&gt;

&lt;p&gt;That is who Koalr is built for.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Try it&lt;/strong&gt;&lt;br&gt;
The live risk demo at &lt;a href="https://koalr.com/live-risk-demo" class="crayons-btn crayons-btn--primary"&gt;koalr.com/live-risk-demo&lt;/a&gt; scores any public GitHub PR in seconds. No account, no install. Paste a URL, get a score.&lt;/p&gt;

&lt;p&gt;If you want to score your own team's PRs — every PR, automatically, as part of your GitHub workflow — there is a free trial at &lt;a href="https://app.koalr.com/signup" class="crayons-btn crayons-btn--primary"&gt;app.koalr.com/signup&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>github</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>We Scored 28 Famous Open Source PRs for Deploy Risk — Here's What We Found</title>
      <dc:creator>Andrew</dc:creator>
      <pubDate>Mon, 13 Apr 2026 01:35:54 +0000</pubDate>
      <link>https://forem.com/koalrapp/we-scored-28-famous-open-source-prs-for-deploy-risk-heres-what-we-found-2333</link>
      <guid>https://forem.com/koalrapp/we-scored-28-famous-open-source-prs-for-deploy-risk-heres-what-we-found-2333</guid>
      <description>&lt;p&gt;The React Hooks PR that changed every React application on earth? Three words in the commit message. One feature flag removed. It scored &lt;strong&gt;91 out of 100&lt;/strong&gt; for deploy risk. The Svelte 5 release scored &lt;strong&gt;99&lt;/strong&gt;. A 65-line TypeScript change scored &lt;strong&gt;79&lt;/strong&gt; and silently broke type inference in codebases worldwide.&lt;/p&gt;

&lt;p&gt;We ran 28 landmark open source pull requests through Koalr's deploy risk model. Here is what we found — and why it matters for the PRs your team ships every week.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem with code review
&lt;/h2&gt;

&lt;p&gt;Modern code review answers one question well: is this code correct?&lt;/p&gt;

&lt;p&gt;It answers a different question poorly: how likely is this to cause a production incident?&lt;/p&gt;

&lt;p&gt;Those are not the same question. A PR can be clean, well-written, and thoroughly reviewed — and still wreck production because it touches a critical path nobody flagged, because the reviewer had twelve other PRs open, or because it is the fourth consecutive revert of a feature that never landed cleanly.&lt;/p&gt;

&lt;p&gt;Most teams have no objective signal for the second question. They have green checkmarks.&lt;/p&gt;




&lt;h2&gt;
  
  
  What deploy risk scoring is
&lt;/h2&gt;

&lt;p&gt;Koalr scores every pull request from 0 to 100 before it merges, built from 36 signals across four categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blast radius&lt;/strong&gt; — files changed, services affected, CODEOWNERS compliance, shared library modifications&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change quality&lt;/strong&gt; — file churn, change entropy, lines added vs deleted, test coverage of changed files&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context&lt;/strong&gt; — reviewer load, author's recent incident rate, time since last deploy, revert history on the changed file set&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;History&lt;/strong&gt; — consecutive reverts of the same feature, recent incident correlation, PR age&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0–39 → Low&lt;/li&gt;
&lt;li&gt;40–69 → Medium&lt;/li&gt;
&lt;li&gt;70–89 → High&lt;/li&gt;
&lt;li&gt;90–100 → Critical&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The score does not replace review. It gives reviewers a number to orient around before they start reading.&lt;/p&gt;




&lt;h2&gt;
  
  
  The obvious ones scored as expected
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Svelte 5 release — score 99&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full runes rewrite merged to main. Thousands of files changed, the entire reactivity model replaced, years of migration work consolidated into one merge. High blast radius, enormous file count, fundamental architecture change. The model does what you would expect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TypeScript modules conversion — score 98&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Microsoft's conversion of the entire TypeScript compiler from namespaces to ES modules. Touched every source file in the compiler, changed the build system, dropped dependencies. If any PR in history deserved a mandatory all-hands review before merge, it was this one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The surprising ones — small diffs, enormous blast radius
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;React PR #14679 "Enable hooks!" — score 91&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The commit message is three words. The diff is the removal of a single feature flag. You could read the entire change in thirty seconds.&lt;/p&gt;

&lt;p&gt;It scored 91.&lt;/p&gt;

&lt;p&gt;The model does not count lines — it looks at what the changed code &lt;em&gt;controls&lt;/em&gt;. A feature flag in a framework used by tens of millions of applications is not a small change. It is a detonation switch. The blast radius is every React application on earth.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
json
{
  "blast_radius_score": 0.97,
  "feature_flag_detected": true,
  "downstream_consumers": "critical",
  "reviewer_load": 0.2
}

Score: 91 / Critical
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>devops</category>
      <category>github</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
