Study of 1,430 AI-built apps finds 73% have critical security flaws


A VibeEval scan of 1,430 applications built with AI coding tools found 5,711 security vulnerabilities, with 73% of apps containing at least one critical flaw. The analysis revealed 89% of scanned apps were missing basic security headers, 67% exposed API endpoints or secrets in client-side code, and 23% had JWT authentication bypasses. Apps generated via Replit had roughly twice the vulnerability count compared to those deployed on Vercel. The findings provide large-scale empirical evidence that vibe-coded applications routinely ship with fundamental security gaps.

Incident Details

Severity: Facepalm
Company: AI coding ecosystem (industry-wide study)
Perpetrator: Developer
Incident Date:
Blast Radius: Industry-wide data point covering 1,430 AI-built apps; exposes systemic security gaps in vibe-coded software affecting end users and businesses relying on AI-generated application code

The Vibe Check No One Asked For

In February 2026, a security researcher behind VibeEval published what would become one of the most cited pieces of evidence in the emerging debate over AI-generated code safety. The study was straightforward in its approach but damning in its conclusions: run a battery of security checks against publicly deployed applications built with AI coding tools, and see what falls out. What fell out was an avalanche of vulnerabilities so pervasive that they painted a portrait of an entire development model shipping broken security by default.

The researcher identified target applications through multiple signals - deployment on platforms popular with vibe coders (Vercel, Netlify, Replit), characteristic AI-generated code patterns visible in client-side bundles, package.json signatures suggesting rapid generation, and public repositories with commit messages from Claude, Cursor, or Copilot. Each application then received the same battery of 247 security checks, covering OWASP Top 10 vulnerabilities, misconfigurations, and AI-specific issues like hallucinated dependencies.

The headline number was eye-catching enough: 73% of the vibe-coded applications scanned contained at least one critical security vulnerability. But the breakdown beneath that headline told a far more troubling story about systemic, predictable failures baked into the way AI tools generate code.

The Numbers Behind the Headlines

The vulnerability categories read like a security team's nightmare catalog. Missing security headers topped the list at 89% prevalence - nearly nine out of every ten applications scanned lacked the basic HTTP headers (Content-Security-Policy, X-Frame-Options, Strict-Transport-Security) that form the first line of defense against cross-site scripting, clickjacking, and protocol downgrade attacks. These are headers that any security-conscious deployment pipeline would add automatically, but AI coding tools apparently don't consider them part of "making the app work."
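Adding those headers takes only a few lines. Here is a minimal, framework-agnostic sketch of the fix: a helper that layers the baseline headers onto a response's header map (the specific policy values are illustrative defaults, not the study's recommendations; in Express or Next.js middleware you would apply these to the outgoing response object).

```javascript
// Baseline security headers the study found missing in 89% of scanned apps.
// The policy values below are illustrative defaults; tune them per app.
const SECURITY_HEADERS = {
  "Content-Security-Policy": "default-src 'self'",
  "X-Frame-Options": "DENY",
  "Strict-Transport-Security": "max-age=63072000; includeSubDomains",
  "X-Content-Type-Options": "nosniff",
};

function withSecurityHeaders(headers = {}) {
  // Spread order means the security defaults are applied first and any
  // header the caller set explicitly wins over the default.
  return { ...SECURITY_HEADERS, ...headers };
}

// Example: a handler's response headers after the helper runs.
const res = withSecurityHeaders({ "Content-Type": "text/html" });
console.log(res["X-Frame-Options"]); // "DENY"
```

In a real pipeline this belongs in one shared middleware, so every route gets the headers without any per-handler effort.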

Exposed API endpoints came in at 67%, meaning two-thirds of applications had routes or secrets visible in client-side code that should have been server-side only. Client-side secrets - API keys, database credentials, and service tokens embedded directly in frontend JavaScript bundles - appeared in 38% of applications. CORS misconfigurations showed up in 56%, essentially leaving the front door of cross-origin security wide open.
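The CORS failure mode the study describes usually looks the same in code: the server reflects whatever `Origin` header the request carries, instead of checking it against a known list. A minimal sketch of the vulnerable and safe patterns (the allowlist entries are hypothetical):

```javascript
// Hypothetical allowlist of origins permitted to make credentialed requests.
const ALLOWED_ORIGINS = new Set([
  "https://app.example.com",
  "https://admin.example.com",
]);

// Vulnerable pattern: echo back whatever origin the request sent,
// which lets any site make credentialed cross-origin requests.
function corsOriginUnsafe(requestOrigin) {
  return requestOrigin;
}

// Safer pattern: only reflect known origins; return null (omit the
// Access-Control-Allow-Origin header entirely) for everything else.
function corsOriginSafe(requestOrigin) {
  return ALLOWED_ORIGINS.has(requestOrigin) ? requestOrigin : null;
}

console.log(corsOriginSafe("https://evil.example")); // null
```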

The authentication findings were particularly alarming. A full 45% of applications exhibited insecure authentication patterns, and 23% specifically had JWT (JSON Web Token) authentication bypasses. The characteristic error was using a decode function instead of a verify function when processing tokens - meaning the application would happily accept any token that had the right structure, regardless of whether it had actually been signed by the server. This is the security equivalent of checking that someone's driver's license is the right shape and color without reading the name on it.

SQL and NoSQL injection vulnerabilities appeared in 23% of scanned applications, and cross-site scripting (XSS) vulnerabilities in 31%. Meanwhile, 71% of applications ran outdated dependencies with known vulnerabilities - the kind of thing that an automated dependency update tool could fix in minutes, but that AI-generated code apparently never bothers to address.
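The injection pattern behind that 23% figure is almost always string interpolation where a parameterized query belongs. A sketch (the query shapes are illustrative; any real driver such as pg, mysql2, or better-sqlite3 supports placeholders natively):

```javascript
// Vulnerable pattern: user input is interpolated straight into the SQL,
// so an input like "' OR '1'='1" rewrites the WHERE clause.
function findUserUnsafe(email) {
  return `SELECT * FROM users WHERE email = '${email}'`;
}

// Safe pattern: the SQL text and the values travel separately, so the
// driver can never parse the input as SQL.
function findUserSafe(email) {
  return { text: "SELECT * FROM users WHERE email = $1", values: [email] };
}

console.log(findUserUnsafe("a@b.com' OR '1'='1"));
// SELECT * FROM users WHERE email = 'a@b.com' OR '1'='1'
```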

Clean Code, Broken Security

Perhaps the most insidious finding in the study was what the researcher dubbed the "clean code, broken security" phenomenon. Unlike legacy code where security problems often accompany obviously messy implementations, AI-generated code looked professional. The user interfaces were polished. The component architectures were clean. The features worked exactly as specified. The code was, by conventional readability metrics, quite good.

The security was simply absent.

This matters because it undermines one of the traditional heuristics that experienced developers use to assess code quality. Messy code triggers suspicion; clean code triggers trust. When an AI produces a login form with elegant React components, proper TypeScript types, and smooth animations but implements token validation using decode instead of verify, the visual quality of the output actively works against the reviewer's ability to spot the vulnerability. The code passes the vibe check precisely because it was optimized for vibes rather than security.

Framework Matters More Than Tool

The study's framework-specific findings challenged a common assumption. Many observers expected the biggest differences to appear between AI tools - that Claude-generated code would be systematically more or less secure than Cursor-generated or Replit-generated code. Instead, the data showed that framework choice was a stronger predictor of security outcomes than tool choice.

Next.js applications, of which 412 were scanned, showed a 68% critical vulnerability rate. Remix applications (156 scanned) performed somewhat better. Vanilla React and Vite applications (289 scanned) had the worst showing at 81% - thirteen percentage points worse than Next.js. Astro applications (143 scanned) fell somewhere in between.

The pattern suggests that frameworks with more opinionated security defaults - server-side rendering that naturally keeps secrets off the client, built-in API routes that encourage proper separation of concerns, middleware systems that make it easy to add authentication checks - produce more secure AI-generated code simply because the framework's architecture makes it harder to do the wrong thing. Vanilla React, by contrast, gives AI tools maximum freedom to make maximum security mistakes.

The tool-specific patterns that did emerge were revealing in their own way. Cursor-generated code tended to produce architecturally sound applications that nonetheless missed security fundamentals. Claude-generated code often included security-relevant comments (suggesting the model was "aware" of security concepts) but didn't always implement them correctly. Replit Agent code showed the highest raw vulnerability density, possibly because Replit's integrated deployment pipeline makes it trivially easy to push code to production without any intermediate review step.

The Platform Effect

Deployment platform comparison yielded one of the study's starkest findings: Replit deployments showed roughly twice the vulnerability count of Vercel deployments. This wasn't necessarily because Replit's infrastructure was less secure - it was more likely a reflection of the developer profiles and workflows each platform attracts.

Vercel deployments tend to involve at least a basic CI/CD pipeline, a GitHub repository, and some degree of build configuration. That minimal friction introduces just enough process to catch some issues. Replit's value proposition, by contrast, is radical simplicity: write code and click deploy, with no intermediate steps. For prototyping and learning, that simplicity is a feature. For production applications handling real user data, it's a vulnerability amplifier.

The study found that applications with automated security scanning in their development pipeline had 91% fewer critical vulnerabilities - a finding so dramatic it almost reads as an advertisement for security tooling, except that it's supported by the data. The most secure applications in the dataset consistently used established auth libraries (like NextAuth or Clerk) instead of custom authentication implementations, further reinforcing the principle that AI tools produce better results when working within opinionated, security-conscious frameworks.

What the Study Means for Vibe Coding

The VibeEval findings landed at a moment when "vibe coding" - the practice of using AI tools to generate code through natural-language prompts with minimal manual review - was rapidly moving from early-adopter experimentation to mainstream development workflow. The study provided the first large-scale empirical evidence for what security professionals had been warning about: that the gap between "it works" and "it's secure" was not a theoretical concern but a measurable, widespread reality.

The numbers aligned with earlier, smaller-scale findings. A 2021 NYU study had found roughly 40% of GitHub Copilot outputs contained security vulnerabilities. By 2026, AppSec Santa's own AI code security testing across 534 code samples from six major LLMs found a 25.1% vulnerability rate - a meaningful improvement in raw code generation but still far from acceptable for production deployment. The VibeEval study's 73% rate for deployed applications suggests that the gap between "AI generates somewhat vulnerable code" and "developers ship that code directly to production" is where the real damage accumulates.

The study also surfaced a more subtle problem: the security issues it found were not exotic or advanced. Missing security headers, exposed API keys, JWT verification failures, outdated dependencies - these are well-understood, well-documented vulnerabilities with well-known fixes. The AI tools generating this code have been trained on millions of examples showing both the wrong way and the right way to handle authentication, authorization, and data protection. They consistently choose patterns that prioritize functionality over security, apparently because their training and reward signals optimize for "does the code run and do what the user asked" rather than "is the code safe to deploy."

The Uncomfortable Implication

The VibeEval study's most uncomfortable finding may have been its most hopeful one: the fixes for nearly every vulnerability category it identified are straightforward. Add security headers through middleware. Move secrets to environment variables and server-side code. Use established auth libraries instead of rolling custom JWT handling. Run automated dependency updates. Integrate a security scanner into the deployment pipeline.
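The "move secrets to environment variables" fix, for instance, amounts to a few lines of server-side code. A minimal sketch, with a hypothetical variable name; the key is set in the deployment platform's settings and never appears in the client bundle or the repository:

```javascript
// Server-side only: the secret lives in process.env, configured in the
// deploy platform. PAYMENTS_API_KEY is a hypothetical name.
function getApiKey() {
  const key = process.env.PAYMENTS_API_KEY;
  if (!key) {
    // Fail loudly at startup rather than silently shipping a broken or
    // hardcoded fallback key.
    throw new Error("PAYMENTS_API_KEY is not configured");
  }
  return key;
}
```

The client then calls your own API route, which uses the key server-side, instead of ever holding the key itself.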

None of these are difficult. None require deep security expertise. Most can be added to a project in under an hour. The 91% reduction in critical vulnerabilities for apps with automated scanning proves the point - the tooling exists and it works.

The problem is that 73% of vibe-coded applications hadn't implemented any of it. The speed that makes vibe coding attractive is the same speed that makes security review feel like unnecessary friction. The AI tools themselves don't flag the gaps because they're optimized to deliver what was asked for, and nobody asks for security headers. The result is a rapidly growing population of deployed applications where the code looks clean, the features work, and the security posture is essentially nonexistent - a phenomenon that scales exactly as fast as the AI tools enabling it.
