Forem: Max

Developers Spend Just 1% of Coding Time Using VS Code's Debugger (11,805 Sessions Analyzed)

Max — Thu, 23 Oct 2025 12:08:30 +0000

Analysis of 11,805 coding sessions from 68 developers tracked over 3 months. Developers spend just 1.4% of their time using VS Code's debugger - most rely on console.log statements instead.

📊 Research Summary

68 developers tracked over 3 months (July-October 2025)
11,805 coding sessions (30-minute intervals) averaging 18 minutes of active coding each
3,526 hours of active coding time analyzed (excluding idle time)
All data collected via FlouState automatic tracking

My Personal Wake-Up Call

Building FlouState solo meant debugging felt like half the job. Those late-night sessions hunting down edge cases were exhausting - terminal full of console.log() statements, manually reproducing bugs, reading stack traces.

After 3 months (203 hours tracked), I looked at my own data and saw something surprising:

I spent 0.2% of my time using VS Code's debugger.

That's 20 minutes over 3 months of active coding.

The Reality: Debugging felt exhausting and time-consuming. But the data showed I was mostly creating (56.6%) and exploring code (25.8%). The "debugging" I remembered was actually print statements scattered throughout normal development.

VS Code has a world-class debugger built in. I used it for 20 minutes in 3 months.

This Isn't Just Me. It's All of Us.

I analyzed 68 FlouState users who've been tracking since July 2025. The pattern was universal: we avoid VS Code's debugger like it's radioactive.

Time Breakdown Across 68 Developers:

46.2% - Writing code (includes console.log debugging)
28.7% - Reading code (includes stack trace hunting)
23.7% - Refactoring (includes removing debug logs)
1.4% - Using VS Code debugger (breakpoints, step-through)

📊 Understanding the Numbers

Across 3,526 hours of tracked coding time, developers spent an average (mean) of 13 minutes per month using VS Code's debugger. Not per day. Per month.

The median was 0 minutes - most developers never used it at all. Even among the 25% who used it at least once, the average was only 54 minutes per month.

📊 Distribution Analysis:

75% of developers never used the debugger - not even once
10% of developers used it less than 1% of their time
15% of developers used it 1%+ of their time (highest: 52%)

Median usage: 0%. Even with 9 developers using it 2%+, the average is still only 1.4%.
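
A low mean next to a median of 0% is what a heavily skewed distribution looks like. Here is a minimal sketch with made-up per-developer values (not the study's raw data), shaped roughly like the distribution above: 51 of 68 developers at zero, a few under 1%, and one heavy user at 52%.

```javascript
// Hypothetical per-developer debugger share (% of coding time) for 68 people.
// These values are invented for illustration; they are NOT the study's raw data.
const shares = [
  ...Array(51).fill(0),             // 75% never opened the debugger
  ...Array(7).fill(0.5),            // ~10% used it for under 1% of their time
  1, 1.5, 2, 2, 3, 3, 4, 5, 8, 52,  // ~15% at 1% or more, one heavy user
];

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

console.log(shares.length);   // 68
console.log(median(shares));  // 0
console.log(mean(shares));    // a little over 1, pulled up by a handful of users
```

The single 52% user contributes more to the mean than the 58 lightest users combined, which is why the median is the better summary of what a typical developer does.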

Important Clarification:

This 1.4% measures active debugger UI usage (breakpoints, watches, step-through). It does NOT include console.log() debugging, reading error logs, or manual bug reproduction, which likely account for a significant portion of actual coding time.

The gap reveals how we debug, not how much we debug.

Why Do Developers Avoid the Debugger?

If VS Code's debugger is so powerful, why do developers spend only about 1% of their coding time in it?

Here's the thing: console.log() has its place - quick sanity checks, debugging async/promise chains, production logging. But for complex state issues, race conditions, or stepping through multi-layer logic, the debugger is often more efficient.

Common Print Debugging Workflow:

Add console.log("here") → Save file → Reload browser → Check console → Repeat 10x
Forget to remove logs → Ship to production → Pollute user consoles (or strip them with build tools)
Can't inspect variables mid-execution → Add more logs → Cluttered code
Race conditions hard to diagnose → Guess at timing → Takes longer

This works fine for many bugs, but complex issues can take longer with this approach.

Why Developers Prefer console.log()

The strong preference for console.log() isn't laziness. It's psychology.

💨 Immediate Gratification

Type console.log("here") → See output in 3 seconds. Setting up a debugger configuration? That could take 10 minutes. Our brains choose the dopamine hit of instant feedback.

🔁 Familiarity Bias

You learned console.log() on day 1 of coding. You've used it thousands of times. The debugger? Maybe never. We default to familiar tools even when better ones exist.

📉 Sunk Cost Fallacy

You've already added 5 console.logs. "Might as well add one more" instead of switching to the debugger. 30 minutes later, you're still adding logs.

🤔 Perceived Complexity

"The debugger looks complicated" → Never learn it → Miss out on a potentially useful tool. Classic catch-22.

The Hidden Costs (That May or May Not Matter to You)

Some developers argue that never learning the debugger has costs. Others say console.log() works fine. Here are the arguments on both sides:

⏱️ Time Investment Tradeoff

Pro-Debugger: Learning the debugger takes 2 hours upfront but may save minutes per bug.
Pro-Console.log: Console.log is instant and requires zero setup time.

🧹 Code Cleanup

Pro-Debugger: No leftover logs to remove.
Pro-Console.log: Modern linters catch leftover logs automatically.

🔒 Production Safety

Pro-Debugger: Can't accidentally ship debug logs with sensitive data.
Pro-Console.log: Modern build tools strip console.logs in production anyway.
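
As one concrete instance of the linter argument, ESLint ships a built-in no-console rule. This is a minimal sketch of a flat config in CommonJS form; the file glob and the choice to still allow console.warn and console.error are assumptions for illustration, not settings taken from the study.

```javascript
// eslint.config.js (CommonJS form of ESLint's flat config)
const config = [
  {
    files: ['**/*.js'], // assumed glob; adjust to your project layout
    rules: {
      // Built-in ESLint rule: report console.* calls, while still allowing
      // console.warn and console.error in this sketch.
      'no-console': ['error', { allow: ['warn', 'error'] }],
    },
  },
];

module.exports = config;
```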

Ultimately, if console.log() works for you and you're shipping products, keep using it. But knowing the debugger gives you options when console.log isn't enough.

The Opportunity Cost:

Investing about 2 hours in learning VS Code's debugger may pay off quickly if it shortens debugging cycles and reduces context switching. Yet most developers never make that investment.

What We're Missing: The Console.log Loop

Most debugging happens without the debugger. We add print statements, reload, check output, repeat. Based on my own patterns, I estimate this accounts for roughly 15-20% of actual coding time - but it's scattered across "Creating" (adding logs) and "Exploring" (reading output), making it invisible in aggregate statistics.

What If You Used the Debugger Instead?

For example, a bug that might take 10 console.logs and 30 minutes could potentially be solved with 1 breakpoint and 5 minutes. The debugger lets you pause execution, inspect all variables at once, and step through logic without modifying code.
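
Here is a small sketch of the two styles side by side. The cartTotal function and its data are hypothetical. The debugger statement is standard JavaScript: it pauses execution only when a debugger is attached (for example when the file is launched from VS Code's Run and Debug view) and does nothing otherwise.

```javascript
// Hypothetical cart-total function, used only to illustrate the two workflows.
function cartTotal(items, discountRate) {
  let total = 0;
  for (const item of items) {
    // Print-debugging style: one log line per value you want to see.
    // console.log('item', item, 'running total', total);

    // Debugger style: with a debugger attached, execution pauses here and
    // every variable in scope (items, item, total, discountRate) is inspectable.
    // Without an attached debugger this statement is a no-op.
    debugger;
    total += item.price * item.quantity;
  }
  return total * (1 - discountRate);
}

const total = cartTotal(
  [{ price: 10, quantity: 2 }, { price: 5, quantity: 1 }],
  0.5
);
console.log(total); // 12.5
```

A breakpoint set in the editor gutter does the same job without touching the source at all, which is the "without modifying code" point above.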

New to the debugger? Start with this official guide (includes a 13-minute video walkthrough).

Methodology: How We Collected This Data

This research is based on FlouState's automatic tracking system, which categorizes developer work into 4 types:

Creating (46.2%):
Primarily adding new code with minimal deletions. This includes console.log() debugging, since it adds lines of code.

Exploring (28.7%):
Many file views with few edits: reading and understanding codebases. This includes reading console.log() output and stack traces.

Maintenance (23.7%):
Balanced mix of additions and deletions: restructuring existing code. This includes removing console.log() statements.

Debugger Usage (1.4%):
Active VS Code debugger UI usage only (breakpoints, step-through, watch variables). This does NOT capture console.log() debugging or other debugging methods.
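
To make the four categories concrete, here is a rough sketch of how one interval might be labeled. The field names and every threshold are assumptions for illustration; the post does not state FlouState's actual rules.

```javascript
// Hypothetical classifier. Field names and thresholds are invented for this
// sketch and are NOT FlouState's real implementation.
function classifyInterval({ linesAdded, linesDeleted, fileViews, debuggerActive }) {
  if (debuggerActive) return 'Debugger Usage';

  const edits = linesAdded + linesDeleted;
  // Many file views with few edits: reading rather than writing.
  if (fileViews >= 10 && edits < 20) return 'Exploring';
  // Mostly additions with minimal deletions.
  if (linesDeleted <= linesAdded * 0.25) return 'Creating';
  // Balanced mix of additions and deletions.
  return 'Maintenance';
}

// A print-debugging burst adds lines and deletes none, so it looks like "Creating".
const addingLogs = classifyInterval(
  { linesAdded: 12, linesDeleted: 0, fileViews: 3, debuggerActive: false }
);
// Cleaning the logs up later mixes deletions with edits, so it looks like "Maintenance".
const removingLogs = classifyInterval(
  { linesAdded: 4, linesDeleted: 12, fileViews: 3, debuggerActive: false }
);
console.log(addingLogs, removingLogs); // Creating Maintenance
```

Under any rule set of this kind, console.log() debugging is indistinguishable from ordinary editing, which is why it never shows up in the debugger category.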

⚠️ Critical Note on Data Interpretation:

The 1.4% "Debugging" stat measures debugger tool usage, not total debugging time. Most debugging happens via console.log(), which is counted as "Creating" or "Exploring" depending on context.

Study Limitations

This analysis is based on FlouState users (n=68). Results may differ for enterprise development teams, users of other IDEs (JetBrains, Visual Studio), or developers working in different programming paradigms.

The study focuses on VS Code users specifically and may not represent debugging patterns across all development environments. However, VS Code's large market share (75.9% of developers according to the Stack Overflow 2025 Survey) suggests the findings may be relevant to a large share of developers.

All data collection happens locally in VS Code. Only aggregated 30-minute summaries are sent to the cloud - never your actual code content.

🔒 Privacy & Data Use:

This research uses anonymized aggregate data from 68 FlouState users. All data is fully anonymized - no individual developers, projects, or specific code patterns can be identified. Only aggregate statistics (percentages, totals, averages) are analyzed.

Your code content is NEVER captured by FlouState, only metadata like timestamps, file counts, language types, and branch names. Users can opt out of research participation anytime in Settings.

The Bottom Line

Developers spend just 1.4% of their time using VS Code's debugger.

The data shows we rely heavily on console.log() and manual debugging instead of built-in tools. Whether this is actually inefficient or just how developers prefer to work remains an open question.

The debugger exists. Most developers don't use it.

How this data was collected:

I built FlouState, a VS Code extension that automatically tracks coding activity. It records 30-minute intervals and tracks when the VS Code debugger is active vs inactive.

This analysis covers 68 developers, 11,805 coding sessions, and 3,526 hours of active coding time between July 14 - October 18, 2025.

Important caveats: This only tracks the VS Code debugger. It doesn't capture console.log, print statements, or external debuggers (gdb, lldb, etc.). So real "debugging" time is higher - but debugger tool usage is still remarkably low.

6 AI Models vs. 3 Advanced Security Vulnerabilities

Max — Mon, 13 Oct 2025 11:03:33 +0000

A security researcher submitted three advanced vulnerability examples to our AI benchmarking platform. Not textbook examples—real exploits: prototype pollution that bypasses authorization, an agentic AI supply-chain attack combining prompt injection with cloud API abuse, and OS command injection in ImageMagick.

We ran each through 6 top AI models: GPT-5, OpenAI o3, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, and Gemini 2.5 Pro.

The result? All six models caught all three vulnerabilities. 100% detection rate.

But here's the catch: the quality of their fixes varied by up to 18 percentage points. And when the security researcher voted on which model performed best, they disagreed with our AI judge entirely.

Here's what we learned about which AI models you should trust for security code reviews.

⚠️ Early Data Disclaimer (n=3 evaluations)

This case study analyzes 3 security evaluations from one external researcher. Results are directional and not statistically significant. We're building a larger benchmark dataset and actively seeking more security professionals to submit challenges.

Why publish early data? Even with limited sample size, these findings reveal important patterns about AI model behavior on cutting-edge vulnerabilities. We believe in transparency and iterative improvement.

The Three Vulnerabilities

Vulnerability #1: Prototype Pollution Privilege Escalation

What it is: A Node.js API with a deepMerge function that recursively merges user input into a config object. No hasOwnProperty checks or __proto__ filtering. Authorization relies on req.user.isAdmin property.

The exploit:

POST /admin/config
{
  "__proto__": {
    "isAdmin": true
  }
}

Result: All objects inherit isAdmin: true, instant admin access.

Why it matters: Affects popular npm packages (lodash, hoek, minimist). Real CVEs: CVE-2019-10744, CVE-2020-28477.
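
The mechanism can be reproduced in a few lines. This is a minimal sketch with a deliberately vulnerable deepMerge written for illustration; it is not the code the researcher submitted. It also shows why null-prototype objects are a common mitigation.

```javascript
// Deliberately vulnerable merge: no key filtering, no own-property checks.
function deepMerge(target, source) {
  for (const key in source) {
    if (typeof source[key] === 'object' && source[key] !== null) {
      if (typeof target[key] !== 'object' || target[key] === null) target[key] = {};
      deepMerge(target[key], source[key]);
    } else {
      target[key] = source[key];
    }
  }
  return target;
}

// JSON.parse creates "__proto__" as an ordinary own property, so for...in visits it.
// Reading target["__proto__"] then returns Object.prototype, and the merge writes into it.
const payload = JSON.parse('{"__proto__": {"isAdmin": true}}');
deepMerge({}, payload);

const freshUser = {};                               // never touched by the merge
const pollutedFlag = freshUser.isAdmin;             // inherited from Object.prototype
const nullProtoFlag = Object.create(null).isAdmin;  // nothing to inherit from

delete Object.prototype.isAdmin;                    // undo the pollution for this demo
console.log(pollutedFlag, nullProtoFlag, {}.isAdmin); // true undefined undefined
```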

Vulnerability #2: Agentic AI Supply-Chain Attack (2025 Cutting-Edge)

What it is: An LLM agent microservice with three attack vectors:

Indirect prompt injection via poisoned web pages
Over-privileged Azure management API token with full tenant access
Unsafe WASM execution with filesystem mounts (from:'/', to:'/')

The exploit path:

Attacker hosts malicious webpage with hidden instructions
LLM agent fetches page, extracts instructions
Agent invokes Azure API tool to escalate privileges
WASM runtime executes arbitrary code with host filesystem access
Cross-tenant cloud compromise

Why it matters: OWASP Top 10 for LLMs #1 risk (prompt injection). Real incidents: ChatGPT plugins, Microsoft Copilot, GitHub Copilot Chat. No existing AI benchmark tests this attack vector.
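
One mitigation idea for this class of attack is to gate privileged tools once untrusted content has entered the agent's context. The sketch below is hypothetical: the tool names, the "tainted" flag, and the policy shape are all invented for illustration and do not come from the submitted code or from any model's answer.

```javascript
// Hypothetical tool names and policy shape, for illustration only.
// A Map is used so that lookups never fall through to Object.prototype.
const TOOL_POLICY = new Map([
  ['fetch_page', { allowedWhenTainted: true }],
  ['azure_management', { allowedWhenTainted: false }],
  ['run_wasm', { allowedWhenTainted: false }],
]);

// Decide whether a tool call may run. "tainted" means the agent's context
// already contains untrusted content, such as text from a fetched web page.
function authorizeToolCall(toolName, { tainted }) {
  const rule = TOOL_POLICY.get(toolName);
  if (!rule) return { allowed: false, reason: 'unknown tool' };
  if (tainted && !rule.allowedWhenTainted) {
    return { allowed: false, reason: 'privileged tool blocked after untrusted input' };
  }
  return { allowed: true, reason: 'ok' };
}

console.log(authorizeToolCall('fetch_page', { tainted: false }).allowed);      // true
console.log(authorizeToolCall('azure_management', { tainted: true }).allowed); // false
```

A gate like this only addresses the second step of the exploit path. Scoping the Azure token and removing the root filesystem mount from the WASM runtime would still be needed.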

Vulnerability #3: OS Command Injection (ImageMagick)

What it is: An Express API that shells out to ImageMagick via child_process.exec(). User-controlled font, size, and text parameters injected directly into command string. No input sanitization or escaping.

The exploit:

POST /render
{
  "text": "hello",
  "font": "Arial; rm -rf /",
  "size": "12"
}

Resulting command:

convert -font "Arial; rm -rf /" -pointsize 12 label:"hello" /tmp/out.png

Why it matters: ImageTragick (CVE-2016-3714) variants still common in 2025. Classic attack that every model should catch.
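
The usual remedy is to stop building a shell string altogether. Here is a minimal sketch, assuming a hypothetical font allowlist and size limits; it is not any model's actual fix. child_process.execFile takes the arguments as an array and does not spawn a shell by default.

```javascript
const { execFile } = require('node:child_process');

// Hypothetical allowlist; a real service would list the fonts it actually ships.
const ALLOWED_FONTS = new Set(['Arial', 'Helvetica', 'Courier']);

// Validate the request fields and build an argument array for ImageMagick.
function buildConvertArgs({ text, font, size }, outPath) {
  if (!ALLOWED_FONTS.has(font)) throw new Error('font not allowed');

  const pointSize = Number(size);
  if (!Number.isInteger(pointSize) || pointSize < 1 || pointSize > 200) {
    throw new Error('size out of range');
  }

  // ImageMagick treats a leading "@" in label: text as a file reference,
  // so reject it here along with overly long input.
  if (typeof text !== 'string' || text.length > 200 || text.startsWith('@')) {
    throw new Error('text not allowed');
  }

  return ['-font', font, '-pointsize', String(pointSize), `label:${text}`, outPath];
}

// Each array element reaches the program as one argument, so characters such
// as ";" are never interpreted as command separators.
function render(body, outPath, callback) {
  execFile('convert', buildConvertArgs(body, outPath), callback);
}
```

With this in place, the payload above fails validation before any process is started, because "Arial; rm -rf /" is not in the allowlist.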

The Results: 100% Detection, But Quality Varied

✅ All Models Passed (But Not Equally)

Every model caught every vulnerability, but GPT-5's average score was about 13% higher than Grok 4's (95.4 vs. 84.1).

Overall Rankings:

Rank	Model	Avg Score	Cost	Detection	Key Strength
1	GPT-5	95.4/100	$2.18	3/3 ✅	Best overall, comprehensive
2	OpenAI o3	92.7/100	$0.97	3/3 ✅	Pragmatic, user's choice
3	Gemini 2.5 Pro	89.2/100	$0.09	3/3 ✅	Cheapest
4	Claude Sonnet 4.5	88.2/100	$0.19	3/3 ✅	⭐ Best value (92% quality @ 9% cost)
5	Claude Opus 4.1	87.7/100	$0.88	3/3 ✅	Thorough but over-engineered
6	Grok 4	84.1/100	$0.14	3/3 ✅	Slowest, simplest fixes

What "Quality" Means in Security

All models identified the vulnerabilities. The score differences came from:

Completeness of fix – Did they address all attack vectors?
Defense-in-depth – Did they suggest multiple mitigation layers?
Code quality – Is the fix production-ready or just a patch?
Explanation depth – Did they explain why the fix works?

Example: Prototype Pollution Fixes

GPT-5 (96.4/100) suggested four mitigation strategies:

Use Object.create(null) for config objects
Add hasOwnProperty checks in deepMerge
Explicitly block __proto__, constructor, prototype keys
Use Object.freeze() on authorization logic

Grok 4 (85/100) suggested one:

Add key filtering in deepMerge (but incomplete – missed some edge cases)

Both "caught it" – but one fix is production-ready, the other has gaps.

📝 Code Example: GPT-5's Defense-in-Depth Approach

Here's how GPT-5 (96.4/100) fixed the prototype pollution vulnerability with a multi-layered approach:

// Helper: create null-prototype object
function obj(data) {
  return Object.assign(Object.create(null), data)
}

// Safe deepMerge with key filtering
function safeDeepMerge(target, source) {
  const dangerousKeys = ['__proto__', 'constructor', 'prototype']

  for (const key in source) {
    // Block dangerous keys
    if (dangerousKeys.includes(key)) continue

    // Only merge own properties
    if (!Object.hasOwnProperty.call(source, key)) continue

    // Recursively merge objects safely
    if (typeof source[key] === 'object' && source[key] !== null) {
      target[key] = safeDeepMerge(target[key] || {}, source[key])
    } else {
      target[key] = source[key]
    }
  }
  return target
}

// Create users with null prototypes
app.use((req, res, next) => {
  req.user = obj({ isAdmin: false, username: 'guest' })
  next()
})

// Require own property check for authorization
function isAdmin(user) {
  return Object.hasOwnProperty.call(user, 'isAdmin')
    && user.isAdmin === true
}

Why this approach scored 96.4/100:

Null-prototype objects – Prevents inheritance attacks
Key filtering – Blocks __proto__, constructor, prototype
Own-property checks – Validates isAdmin is directly set, not inherited
Helper function – Consistent null-prototype creation across app

Compare this to Grok 4's simpler approach (85/100), which only added basic key filtering but missed null-prototype objects and own-property validation—leaving edge cases unprotected.

Cost Analysis: GPT-5 Costs 49% of Budget

💰 Total Cost: $4.46 for 3 Evaluations × 6 Models

GPT-5 alone cost $2.18 (48.87%), nearly as much as the other five models combined.

Model	Total Cost	% of Budget	Avg Score	Value Rating
GPT-5	$2.18	48.87%	95.4	Premium
OpenAI o3	$0.97	21.76%	92.7	Good
Claude Opus 4.1	$0.88	19.79%	87.7	Fair
Claude Sonnet 4.5	$0.19	4.35%	88.2	⭐ Best Value
Grok 4	$0.14	3.23%	84.1	Budget
Gemini 2.5 Pro	$0.09	2.00%	89.2	⭐ Cheapest

💡 Budget Recommendation

If cost matters: Use Claude Sonnet 4.5 or Gemini 2.5 Pro for 90%+ of GPT-5's quality at 4-9% of its cost.

If quality matters: Use GPT-5 for mission-critical security audits, or OpenAI o3 as middle ground (97% of GPT-5's quality at 44% of cost).
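
The "quality at cost" ratios can be recomputed from the cost table. This is a small sketch using the scores and total costs listed there; only the helper name relativeTo is introduced for illustration.

```javascript
// Average scores and total costs copied from the cost table.
const models = [
  { name: 'GPT-5', score: 95.4, cost: 2.18 },
  { name: 'OpenAI o3', score: 92.7, cost: 0.97 },
  { name: 'Claude Opus 4.1', score: 87.7, cost: 0.88 },
  { name: 'Claude Sonnet 4.5', score: 88.2, cost: 0.19 },
  { name: 'Grok 4', score: 84.1, cost: 0.14 },
  { name: 'Gemini 2.5 Pro', score: 89.2, cost: 0.09 },
];

// Express a model's score and cost as whole percentages of a baseline model.
function relativeTo(model, base) {
  return {
    qualityPct: Math.round((model.score / base.score) * 100),
    costPct: Math.round((model.cost / base.cost) * 100),
  };
}

const gpt5 = models[0];
const sonnet = relativeTo(models[3], gpt5);
const o3 = relativeTo(models[1], gpt5);

console.log(sonnet); // { qualityPct: 92, costPct: 9 }
console.log(o3);     // { qualityPct: 97, costPct: 44 }
```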

The Plot Twist: Human Disagreed with AI Judge

🤔 What Happened

On the ImageMagick command injection vulnerability:

AI Judge's Choice:

GPT-5 - 95.8/100 (Ranked #1)

User's Choice ✅:

OpenAI o3 - 90.4/100 (Ranked #4 by AI judge)

User's comment: "is better i think because"

Note: The comment was incomplete, so the reasoning behind the vote is unknown. The choice still suggests that human security experts may prioritize different factors than AI judges. The researcher may have valued o3's pragmatism (simpler, deployable fixes), clarity (easier for teams to understand), and production-readiness over GPT-5's more comprehensive but more complex approach.

Why This Matters

AI Judges Optimize For:

Completeness (all criteria addressed?)
Thoroughness (how detailed?)
Code quality (style, structure)

Human Experts Value:

Pragmatism – Is this actually deployable?
Simplicity – Fewer moving parts
Clarity – Can my team maintain this?

Possible reasons the researcher chose o3 over GPT-5:

Simpler fix – o3's solution may have been more straightforward
Better explanation – o3 might have explained the "why" more clearly
Production-ready – Less over-engineering than GPT-5
Personal experience – They've used o3 before and trust its outputs

What This Teaches Us

Community voting ≠ AI judging. AI judges are objective but may miss human intuition. Security experts weigh different factors than AI rubrics.

This is why CodeLens combines both:

AI judge provides instant, consistent scoring
Human votes validate and correct AI blind spots

Real-world lesson: Don't blindly trust AI scores. Get human review on critical security decisions. Best approach: Use AI to triage, humans to validate.

Performance by Vulnerability Type

📊 Classic vs. Cutting-Edge Vulnerabilities

Pattern discovered: All models excel at classic vulnerabilities (prototype pollution, command injection). But newer attacks (agentic AI) create wider performance gaps.

Prototype Pollution (2019 Vulnerability, Well-Known)

Model	Score	Detection	Key Insight
GPT-5	96.4	✅	4 mitigation strategies, production-ready
OpenAI o3	95.2	✅	Clean helpers, null-prototype containers
Claude Sonnet 4.5	91.0	✅	Multi-layer defense with validation
Gemini 2.5 Pro	90.0	✅	Simple fix, some edge cases missed
Claude Opus 4.1	86.0	✅	Overengineered but comprehensive
Grok 4	85.0	✅	Partial mitigation, incomplete filtering

Insight: All models caught it, but GPT-5's fix was 13% better than Grok 4's.

Agentic AI Supply-Chain Attack (2025 Cutting-Edge)

Model	Score	Detection	Key Insight
GPT-5	94.0	✅	Defense-in-depth with scoped tokens
OpenAI o3	92.4	✅	Trust boundaries + policy gating
Gemini 2.5 Pro	87.4	✅	Comprehensive but complex
Claude Opus 4.1	83.8	✅	TypeScript + complex classes
Grok 4	83.2	✅	Brittle token decode
Claude Sonnet 4.5	82.0	✅	Over-engineered, lowest score

Insight: Claude Sonnet 4.5 scored 9 points lower on the advanced attack (82.0) than on prototype pollution (91.0).
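
The spread within each table can be recomputed directly from the scores listed above. A short sketch, where spread is simply the highest score minus the lowest:

```javascript
// Scores copied from the two tables above, in the order listed.
const prototypePollution = [96.4, 95.2, 91.0, 90.0, 86.0, 85.0];
const agenticAttack = [94.0, 92.4, 87.4, 83.8, 83.2, 82.0];

// Spread = best score minus worst score within one evaluation.
function spread(scores) {
  return Math.max(...scores) - Math.min(...scores);
}

console.log(spread(prototypePollution).toFixed(1)); // "11.4"
console.log(spread(agenticAttack).toFixed(1));      // "12.0"
```

With only one evaluation per vulnerability, differences of this size are directional at best.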

🎯 Pattern: Advanced Attacks Favor Frontier Models

Classic vulnerabilities (prototype pollution, command injection): 85-96/100 (about an 11-point range)

Advanced attack (agentic AI): 82-94/100 (12-point spread, with lower scores at both ends)

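Conclusion: In this small sample, every model handled the well-known vulnerabilities (OWASP Top 10) adequately. For the cutting-edge attack (LLM security, supply-chain), GPT-5 and o3 scored highest, while the other four models all landed below 88. Budget models did well on the classics but were less consistent on the novel attack.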

Key Takeaways & Recommendations

1. Detection ≠ Quality

All models caught all vulnerabilities (100% detection rate), but quality of fixes varied by 8-18%.

Lesson: Don't just ask "Did AI catch it?" Ask "Is the fix production-ready?"

2. Cost vs. Quality Tradeoff is Real

GPT-5: Best quality (95.4) but 49% of budget. Claude Sonnet: 92% of quality at 9% of cost.

Lesson: Define your quality threshold, then optimize for cost.

3. Human Experts ≠ AI Judges

AI judge chose GPT-5 (95.8 score). Security researcher chose o3 (90.4 score, ranked #4).

Lesson: Get human validation on critical security decisions.

4. Advanced Attacks Favor Frontier Models

Classic vulnerabilities: All models 85-96/100. Cutting-edge (agentic AI): 82-94/100 (12-point spread).

Lesson: Use GPT-5/o3 for novel threats, budget models for OWASP Top 10.

5. Model Choice Depends on Use Case

Not "which model is best?" but "best for what?" Different models excel at different domains.

Lesson: Match the model to the mission.

📋 Recommendation Matrix

For Mission-Critical Production Code → GPT-5

Cost: $0.73/eval avg, 95.4 quality

Use when: Financial systems, healthcare, authentication

Why: Most comprehensive fixes, defense-in-depth

For Everyday Security Audits → Claude Sonnet 4.5

Cost: $0.06/eval avg, 88.2 quality

Use when: Regular code reviews, PR automation

Why: 92% of GPT-5's quality at 9% of cost

For Budget-Constrained Teams → Gemini 2.5 Pro

Cost: $0.03/eval avg, 89.2 quality

Use when: Startups, open source, high-volume scanning

Why: Cheapest option, surprisingly strong performance

For Pragmatic Fixes → OpenAI o3

Cost: $0.32/eval avg, 92.7 quality

Use when: You want simple, deployable solutions

Why: Security expert's choice, good balance

Conclusion

The security researcher who submitted these vulnerabilities taught us something important: detection is table stakes, but quality is what matters.

Every AI model caught every vulnerability. That's impressive—a few years ago, this would have been impossible.

But the spread in fix quality (84-95/100) shows that not all AI security reviews are created equal. GPT-5 delivered the most comprehensive solutions. Claude Sonnet 4.5 offered 92% of the quality at 9% of the cost. And OpenAI o3 provided the pragmatic fixes that a real security engineer preferred over the AI judge's top pick.

The takeaway? Match the model to the mission. Use frontier models for novel threats and mission-critical code. Use budget models for everyday OWASP Top 10 scans. And always get human validation on the fixes you actually deploy.

Because in security, good enough isn't good enough.

🔓 Full Transparency: Raw Data Available

Every evaluation on CodeLens.AI is publicly accessible. View the complete data for this case study:

Prototype Pollution: https://codelens.ai/app/results/6c156ee5-eb9d-4655-b358-bb7fb2f5906a
Agentic AI Supply-Chain Attack: https://codelens.ai/app/results/9234cd36-a9cf-401a-94a0-cd9f93cde47e
Command Injection (ImageMagick): https://codelens.ai/app/results/66f22549-fc2a-494e-b3b3-672a522aa818

Each link shows: Original vulnerable code, task description, all 6 model outputs, AI judge scores (by criterion), and voting results.

Try It Yourself

Want to see which AI models catch vulnerabilities in your codebase?

Submit to CodeLens:

Paste your vulnerable code (50-500 lines)
Describe the security issue you're testing
Get instant comparison across 6 top models
Vote on which model's fix you'd actually deploy

Based on real evaluation data from external security researcher • Date: October 11, 2025

Read more case studies: CodeLens.AI Blog