<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Manasa Nandelli</title>
    <description>The latest articles on Forem by Manasa Nandelli (@manasa_nandelli_707b6d6f2).</description>
    <link>https://forem.com/manasa_nandelli_707b6d6f2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3666020%2Fc0b62426-feed-4cfa-8e9f-247d079679e7.jpg</url>
      <title>Forem: Manasa Nandelli</title>
      <link>https://forem.com/manasa_nandelli_707b6d6f2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/manasa_nandelli_707b6d6f2"/>
    <language>en</language>
    <item>
      <title>RAG-Augmented Agile Story Generation: An Architectural Framework for LLM-Powered Backlog Automation</title>
      <dc:creator>Manasa Nandelli</dc:creator>
      <pubDate>Wed, 17 Dec 2025 03:49:27 +0000</pubDate>
      <link>https://forem.com/manasa_nandelli_707b6d6f2/rag-augmented-agile-story-generation-an-architectural-framework-for-llm-powered-backlog-automation-2md0</link>
      <guid>https://forem.com/manasa_nandelli_707b6d6f2/rag-augmented-agile-story-generation-an-architectural-framework-for-llm-powered-backlog-automation-2md0</guid>
      <description>&lt;p&gt;&lt;strong&gt;# RAG-Augmented Agile Story Generation: How I Built an AI System to Auto-Generate User Stories from Epics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Author: Manasa Nandelli&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As a software engineer working on enterprise applications, I've spent countless hours translating high-level epics into actionable user stories. The process was repetitive, time-consuming, and—let's be honest—inconsistent depending on my caffeine levels that day.&lt;/p&gt;

&lt;p&gt;So I decided to do something about it.&lt;/p&gt;

&lt;p&gt;This article shares the architectural framework I developed for &lt;strong&gt;automated user story generation&lt;/strong&gt; using Retrieval-Augmented Generation (RAG). I'll walk you through the design decisions, the pitfalls I encountered, and what I learned along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I built a RAG pipeline that generates user stories from project epics&lt;/li&gt;
&lt;li&gt;It retrieves organizational knowledge (story rules, product docs) from a vector database&lt;/li&gt;
&lt;li&gt;The LLM receives this context and produces format-compliant, domain-accurate stories&lt;/li&gt;
&lt;li&gt;Key insight: Combining human Agile expertise with AI capabilities beats either approach alone&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Problem I Was Trying to Solve
&lt;/h2&gt;

&lt;p&gt;Every sprint planning, I faced the same challenge:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Product owner creates an epic&lt;/strong&gt;: "Implement user notification preferences"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I need to break it into stories&lt;/strong&gt;: But how many? What format? What acceptance criteria?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I dig through documentation&lt;/strong&gt;: What does our system currently support? What's the terminology?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I write stories&lt;/strong&gt;: Trying to remember our team's format standards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review reveals issues&lt;/strong&gt;: "This story is too big," "Missing acceptance criteria," "Wrong labels"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pain points were clear:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inconsistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;My stories looked different from my teammates' stories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge silos&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Domain expertise lived in my head, not accessible to AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time sink&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30+ minutes per epic just for initial story drafts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context switching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Constantly jumping between docs, JIRA, and my notes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Insight: What If AI Had Access to Our Organizational Knowledge?
&lt;/h2&gt;

&lt;p&gt;I'd experimented with asking ChatGPT to generate user stories. The results were... okay. Generic. They followed a reasonable format but lacked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our specific terminology&lt;/li&gt;
&lt;li&gt;Our sizing guidelines&lt;/li&gt;
&lt;li&gt;Our acceptance criteria format&lt;/li&gt;
&lt;li&gt;Knowledge of what our product actually does&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then it hit me: &lt;strong&gt;The AI isn't bad at generating stories—it just doesn't know what I know.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What if I could give it access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our story creation rules and guidelines&lt;/li&gt;
&lt;li&gt;Our story splitting techniques&lt;/li&gt;
&lt;li&gt;Our product documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's when I discovered Retrieval-Augmented Generation (RAG).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture I Built
&lt;/h2&gt;

&lt;p&gt;Here's the high-level system I designed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────────────┐
│                    KNOWLEDGE INGESTION (One-Time Setup)                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  📄 Source Documents    →    📝 Text Extraction    →    🔢 Embeddings       │
│  (Rules, Guidelines,         (Chunking with            (Vector              │
│   Product Docs)               Overlap)                  Representations)    │
│                                                              │              │
│                                                              ▼              │
│                                                         🗄️ Vector DB       │
│                                                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                    STORY GENERATION (Runtime)                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                              │              │
│  📋 Epic Intake    →    🔍 Module Detection    →    🎯 Semantic Query ──────┘
│  (From Issue                (Classify epic            (Retrieve relevant     
│   Tracker)                   by domain)                context)              
│                                                              │              │
│                                                              ▼              │
│                         📝 Prompt Assembly    →    🤖 LLM Generation        │
│                         (Inject retrieved          (Produce stories)        │
│                          context)                                           │
│                                                              │              │
│                                                              ▼              │
│                                                         ✅ Output Stories   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me break down each component.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: Building the Knowledge Base
&lt;/h2&gt;

&lt;p&gt;Before the AI could generate good stories, I needed to give it the right knowledge. I identified three categories of documents to ingest:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Story Creation Rules
&lt;/h3&gt;

&lt;p&gt;These are the organizational standards I'd been carrying in my head:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Format requirements&lt;/strong&gt;: "As a [role], I want [feature], so that [benefit]"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acceptance criteria template&lt;/strong&gt;: Given-When-Then format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sizing guidelines&lt;/strong&gt;: What constitutes a 1-point vs 5-point story&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Required metadata&lt;/strong&gt;: Labels, components, definition of done&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-patterns&lt;/strong&gt;: Things to avoid (stories without acceptance criteria, compound features, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Story Splitting Techniques
&lt;/h3&gt;

&lt;p&gt;Over time, I'd developed mental models for breaking down large stories. I formalized these into nine distinct techniques:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;When to Apply&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Split by Role&lt;/td&gt;
&lt;td&gt;Multiple user types involved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split by Workflow&lt;/td&gt;
&lt;td&gt;Multiple user journeys embedded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split by Data Variation&lt;/td&gt;
&lt;td&gt;Multiple data types/categories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split by Data Entry&lt;/td&gt;
&lt;td&gt;Multiple fields listed with "and/or"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split by Complexity&lt;/td&gt;
&lt;td&gt;Multiple integrations required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split by Platform&lt;/td&gt;
&lt;td&gt;Multiple devices/browsers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split by Business Rules&lt;/td&gt;
&lt;td&gt;Conditional logic present&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split by CRUD&lt;/td&gt;
&lt;td&gt;Data lifecycle management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Split by BDD Scenarios&lt;/td&gt;
&lt;td&gt;Multiple acceptance test scenarios&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;I wrote a companion article diving deep into these nine techniques: Encoding Agile Expertise: Nine Heuristics for Story Decomposition&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3. Product Documentation
&lt;/h3&gt;

&lt;p&gt;I ingested relevant product documentation so the AI would understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What features actually exist&lt;/li&gt;
&lt;li&gt;Correct terminology&lt;/li&gt;
&lt;li&gt;Domain-specific concepts&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔧 Part 2: The Ingestion Pipeline
&lt;/h2&gt;

&lt;p&gt;Here's how I processed documents into the vector database:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Text Extraction
&lt;/h3&gt;

&lt;p&gt;I extracted text from various formats (markdown, PDF). The key challenge with PDFs was ensuring text was selectable, not scanned images.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Chunking Strategy
&lt;/h3&gt;

&lt;p&gt;This was one of the most important decisions. I experimented with different approaches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Chunk Size&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;200 chars&lt;/td&gt;
&lt;td&gt;Too fragmented—lost context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2000 chars&lt;/td&gt;
&lt;td&gt;Too large—included irrelevant content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;800 chars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sweet spot—1-2 paragraphs of coherent content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I also added &lt;strong&gt;150-character overlap&lt;/strong&gt; between chunks. This prevents sentences from being cut off mid-thought and ensures important context isn't lost at boundaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document: "The calendar module lets you schedule events. Events can be recurring..."
                                                    ↓
Chunk 1: "The calendar module lets you schedule events. Events can be recurring or..."
Chunk 2: "Events can be recurring or one-time. You can invite attendees..."
         ↑ Overlap ensures continuity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
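&lt;p&gt;The sliding-window approach above can be sketched in a few lines of Python. This is a minimal, character-based version; a production pipeline might also snap chunk boundaries to sentence or paragraph breaks:&lt;/p&gt;

```python
def chunk_text(text, chunk_size=800, overlap=150):
    """Character-based sliding-window chunking with overlap between chunks."""
    # Each new chunk starts (chunk_size - overlap) characters after the last,
    # so the final `overlap` characters of one chunk repeat at the start of the next.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults, consecutive chunks share a 150-character window, which is what preserves sentences that would otherwise be cut at a boundary.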



&lt;h3&gt;
  
  
  Step 3: Embedding Generation
&lt;/h3&gt;

&lt;p&gt;Each chunk gets converted into a vector (a list of 1536 numbers) that represents its semantic meaning. Similar concepts produce similar vectors—even without matching keywords.&lt;/p&gt;
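&lt;p&gt;To make "similar vectors" concrete: vector databases typically rank results by cosine similarity. The embeddings themselves come from a model (which I'm not reproducing here); this sketch shows only the similarity measure:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Two chunks about "scheduling meetings" and "booking appointments" would score high on this measure even though they share no keywords, because the embedding model maps them to nearby points in the vector space.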

&lt;h3&gt;
  
  
  Step 4: Vector Storage with Metadata
&lt;/h3&gt;

&lt;p&gt;Each vector is stored with metadata enabling filtered queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vector Entry:
├── ID: "rules_chunk_0"
├── Values: [0.023, -0.156, 0.892, ... 1536 numbers]
└── Metadata:
    ├── documentType: "rules"
    ├── category: "splitting_techniques"
    └── title: "Story Splitting Guide"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This metadata is crucial—it lets me query specifically for rules vs. documentation.&lt;/p&gt;
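&lt;p&gt;As a toy illustration of metadata-filtered queries, here is a tiny in-memory stand-in for the vector store. Real stores such as Pinecone expose a similar upsert/query-with-filter shape, though the exact client API differs:&lt;/p&gt;

```python
def dot(a, b):
    # Similarity score; equivalent to cosine similarity for unit-length vectors.
    return sum(x * y for x, y in zip(a, b))

class VectorStore:
    """Tiny in-memory stand-in for a vector database with metadata filtering."""

    def __init__(self):
        self.entries = []

    def upsert(self, entry_id, values, metadata):
        self.entries.append({"id": entry_id, "values": values, "metadata": metadata})

    def query(self, vector, top_k=5, filter=None):
        filter = filter or {}
        # Keep only entries whose metadata matches every filter key...
        matches = [
            e for e in self.entries
            if all(e["metadata"].get(k) == v for k, v in filter.items())
        ]
        # ...then rank by similarity and return the best top_k.
        matches.sort(key=lambda e: dot(e["values"], vector), reverse=True)
        return matches[:top_k]
```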




&lt;h2&gt;
  
  
  Part 3: The Query Strategy
&lt;/h2&gt;

&lt;p&gt;When an epic comes in, I don't just do one vector search. I developed a &lt;strong&gt;dual-query strategy&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  Query 1: Always Retrieve Rules
&lt;/h3&gt;

&lt;p&gt;No matter what the epic is about, I always fetch story creation rules and splitting techniques. This ensures consistent format compliance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Filter: documentType = "rules"
Top K: 5 chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Query 2: Retrieve Domain-Specific Documentation
&lt;/h3&gt;

&lt;p&gt;Based on the epic content, I detect the relevant product domain and fetch documentation specific to that area.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Filter: documentType = "documentation" AND domain = [detected_domain]
Top K: 5 chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
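&lt;p&gt;Domain detection can start as simple keyword scoring. This is an illustrative sketch, not my exact classifier, and the keyword lists are hypothetical; in practice they would come from your product's module names:&lt;/p&gt;

```python
# Hypothetical keyword lists; in practice these come from your product's modules.
DOMAIN_KEYWORDS = {
    "calendar": ["calendar", "event", "schedule", "recurring"],
    "notifications": ["notification", "alert", "email", "push"],
}

def detect_domain(epic_text, default="general"):
    """Classify an epic by counting domain keyword hits; fall back to 'general'."""
    text = epic_text.lower()
    scores = {
        domain: sum(1 for kw in keywords if kw in text)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

The `"general"` fallback matters for epics that don't fit any category cleanly, which I discuss under challenges below.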



&lt;h3&gt;
  
  
  Why Two Queries?
&lt;/h3&gt;

&lt;p&gt;Initially, I tried a single query combining everything. The problem: documentation chunks often ranked higher than rules (they're longer and more detailed), pushing critical formatting rules out of the top results.&lt;/p&gt;

&lt;p&gt;Separating the queries ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Rules are always present (format compliance)&lt;/li&gt;
&lt;li&gt;✅ Documentation is contextually relevant (domain accuracy)&lt;/li&gt;
&lt;/ul&gt;
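&lt;p&gt;Putting the two queries together, here is a hedged sketch of the dual-query retrieval. The `query_fn` callable is an assumed interface standing in for your vector DB client's query call:&lt;/p&gt;

```python
def retrieve_context(query_fn, epic_vector, domain, top_k=5):
    """Dual-query retrieval: rules are always fetched; docs are domain-scoped.

    `query_fn(vector, top_k, filter)` is a stand-in for a vector DB client's
    query method; the filter dicts mirror the metadata schema described above.
    """
    # Query 1: always retrieve rules (format compliance).
    rules = query_fn(epic_vector, top_k, {"documentType": "rules"})
    # Query 2: retrieve documentation scoped to the detected domain (accuracy).
    docs = query_fn(epic_vector, top_k,
                    {"documentType": "documentation", "domain": domain})
    return {"rules": rules, "documentation": docs}
```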




&lt;h2&gt;
  
  
  Part 4: Prompt Engineering
&lt;/h2&gt;

&lt;p&gt;With retrieved context in hand, I assemble the prompt. Here's the structure I developed:&lt;/p&gt;

&lt;h3&gt;
  
  
  System Prompt (Static)
&lt;/h3&gt;

&lt;p&gt;This establishes the AI's role and injects the rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are an expert Agile story creator.

## STORY CREATION RULES (You MUST follow these):
[Retrieved rules chunks]

## STORY SPLITTING TECHNIQUES:
[Retrieved splitting techniques]

## DOMAIN CONTEXT:
[Retrieved documentation chunks]

## OUTPUT FORMAT:
For each story, provide:
1. Story Title: "As a [role], I want [feature], so that [benefit]"
2. Description: Brief explanation
3. Acceptance Criteria: Given-When-Then format
4. Story Points: 1, 2, 3, 5, or 8
5. Labels: Appropriate categorization

## IMPORTANT:
- Each story should be independently deliverable
- Stories should be 1-8 points (split larger ones)
- Don't duplicate existing linked issues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  User Prompt (Dynamic)
&lt;/h3&gt;

&lt;p&gt;This contains the specific epic to process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create user stories for this epic:

## EPIC SUMMARY:
[Epic title]

## EPIC DESCRIPTION:
[Epic details]

## EXISTING RELATED ISSUES (DO NOT DUPLICATE):
[List of already-created stories]

Please generate 3-7 well-structured user stories.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
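&lt;p&gt;Assembling the two prompts is plain string templating. A simplified sketch, with the templates abbreviated relative to the full versions above:&lt;/p&gt;

```python
SYSTEM_TEMPLATE = """You are an expert Agile story creator.

## STORY CREATION RULES (You MUST follow these):
{rules}

## DOMAIN CONTEXT:
{context}
"""

def build_prompts(rules_chunks, doc_chunks, epic_title, epic_description, existing_issues):
    """Inject retrieved chunks into the static system prompt; build the dynamic user prompt."""
    system = SYSTEM_TEMPLATE.format(
        rules="\n\n".join(rules_chunks),
        context="\n\n".join(doc_chunks),
    )
    user = (
        "Create user stories for this epic:\n\n"
        f"## EPIC SUMMARY:\n{epic_title}\n\n"
        f"## EPIC DESCRIPTION:\n{epic_description}\n\n"
        "## EXISTING RELATED ISSUES (DO NOT DUPLICATE):\n"
        + "\n".join(f"- {issue}" for issue in existing_issues)
    )
    return system, user
```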



&lt;h3&gt;
  
  
  Key Prompt Engineering Lessons
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Explicit format specification&lt;/strong&gt;: The AI follows formats better when they're clearly defined&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Negative examples&lt;/strong&gt;: Telling the AI what NOT to do is as important as what to do&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context labeling&lt;/strong&gt;: Clear section headers ("RULES", "CONTEXT") help the AI understand what's what&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emphasis markers&lt;/strong&gt;: "You MUST follow these" increases adherence&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Results and Observations
&lt;/h2&gt;

&lt;p&gt;After deploying this system, here's what I observed:&lt;/p&gt;

&lt;h3&gt;
  
  
  What Worked Well
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time to first draft&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;From ~30 min to &amp;lt;2 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Format consistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stories consistently followed our template&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Domain terminology&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Correct product terms used&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Splitting suggestions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large epics got appropriate decomposition recommendations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Challenges I Encountered
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Over-splitting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes the AI would split too aggressively, creating 10+ tiny stories from a simple epic. I addressed this by adding guidance about minimum story scope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Chunk boundary issues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Occasionally, a technique description would get split awkwardly across chunks, leading to incomplete retrieval. The overlap helps, but it's not perfect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Domain edge cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some epics touched multiple domains or didn't fit neatly into categories. I added a "general" fallback that retrieves broadly applicable documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Human review still necessary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The generated stories are good starting points, but still require human review and refinement. I think of them as "80% drafts" that save significant time but aren't fire-and-forget.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;After building and iterating on this system, here's what I learned:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. RAG Beats Pure Prompting
&lt;/h3&gt;

&lt;p&gt;Trying to put all organizational knowledge directly into prompts doesn't scale. RAG lets you maintain a living knowledge base that grows with your organization.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Chunking Strategy Matters More Than You Think
&lt;/h3&gt;

&lt;p&gt;My initial chunks were too small. The sweet spot for my use case was 800 characters with overlap. Your mileage may vary—experiment.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Dual-Query Ensures Consistency
&lt;/h3&gt;

&lt;p&gt;Separating "always needed" content (rules) from "contextually needed" content (documentation) prevents important guidance from being displaced.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Encode Tacit Knowledge
&lt;/h3&gt;

&lt;p&gt;The most valuable part of this project was forcing myself to document what I "just knew" about story creation. That documentation now helps humans AND AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. AI Augments, Doesn't Replace
&lt;/h3&gt;

&lt;p&gt;The system produces quality first drafts that save significant time, but human judgment remains essential for final acceptance.&lt;/p&gt;




&lt;h2&gt;
  
  
  If You Want to Build Something Similar
&lt;/h2&gt;

&lt;p&gt;Here's a high-level roadmap:&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Document Your Knowledge
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Write down your story format requirements&lt;/li&gt;
&lt;li&gt;Document your splitting techniques (or use/adapt mine)&lt;/li&gt;
&lt;li&gt;Gather relevant product documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Set Up Vector Storage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Choose a vector database (Pinecone, Weaviate, Supabase pgvector)&lt;/li&gt;
&lt;li&gt;Design your metadata schema&lt;/li&gt;
&lt;li&gt;Implement the ingestion pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 3: Build the Query Layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Implement domain detection&lt;/li&gt;
&lt;li&gt;Set up dual-query retrieval&lt;/li&gt;
&lt;li&gt;Test retrieval quality&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 4: Develop the Generation Pipeline
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Design your prompt template&lt;/li&gt;
&lt;li&gt;Connect to an LLM API&lt;/li&gt;
&lt;li&gt;Implement output parsing&lt;/li&gt;
&lt;/ul&gt;
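&lt;p&gt;For output parsing, a minimal regex-based sketch that pulls story titles out of a model response. It assumes the model followed the "As a [role], I want [feature], so that [benefit]" format; a real parser needs fallbacks for format drift:&lt;/p&gt;

```python
import re

def parse_stories(llm_output):
    """Extract 'As a [role], I want [feature], so that [benefit]' titles."""
    pattern = re.compile(r'As an? (.+?), I want (.+?), so that (.+?)[."\n]')
    return [
        {"role": role, "feature": feature, "benefit": benefit}
        for role, feature, benefit in pattern.findall(llm_output)
    ]
```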

&lt;h3&gt;
  
  
  Phase 5: Integrate and Iterate
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Connect to your issue tracker&lt;/li&gt;
&lt;li&gt;Gather feedback&lt;/li&gt;
&lt;li&gt;Refine prompts and rules based on output quality&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm continuing to iterate on this system. Areas I'm exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic story creation&lt;/strong&gt;: Moving from "draft as comment" to direct issue creation with approval workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback learning&lt;/strong&gt;: Using acceptance/rejection signals to improve generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-epic awareness&lt;/strong&gt;: Considering relationships between epics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Estimation calibration&lt;/strong&gt;: Learning team-specific sizing patterns from historical data&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Building this system taught me that the future of AI in software engineering isn't about replacing human judgment—it's about augmenting it. By encoding organizational knowledge into a retrievable format and combining it with LLM generation capabilities, I created a tool that saves significant time while maintaining quality.&lt;/p&gt;

&lt;p&gt;The architecture patterns I've shared here are transferable across domains. Whether you're generating stories, documentation, test cases, or other structured content, the RAG approach of "retrieve relevant context, then generate" is powerful.&lt;/p&gt;

&lt;p&gt;If you found this useful, check out my companion article on the nine story splitting techniques—it's the "human expertise" half of this human+AI equation.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Have questions or built something similar? Drop a comment below—I'd love to hear about your experience!&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow me for more articles on AI-augmented software engineering workflows.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  📚 Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
Encoding Agile Expertise: Nine Heuristics for Story Decomposition - My deep dive into the story splitting techniques&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2005.11401" rel="noopener noreferrer"&gt;Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&lt;/a&gt; - The foundational RAG paper&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.mountaingoatsoftware.com/books/user-stories-applied" rel="noopener noreferrer"&gt;User Stories Applied&lt;/a&gt; - Mike Cohn's classic on user stories&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>manasanandelli</category>
      <category>rag</category>
      <category>ai</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
