Forem: Jun Bae

Graphs for RAG: Knowledge Graph and GraphRAG (GraphDB)

Jun Bae — Sat, 09 May 2026 14:55:45 +0000

Introduction

I introduced RAG for LLM inference in the previous post in this series. As I mentioned, RAG has some limitations, so there are several advanced methods to overcome them. One of the most well-known approaches is to utilize graphs. This post will cover what a Knowledge Graph is, how RAG utilizes KGs, and how to build one.

Knowledge Graph

Back in the early days of the Internet, search engines such as Yahoo, Netscape, and Lycos relied mainly on full-text search. This worked well enough until the volume of information grew too large. As online content accumulated, a search engine had to scan thousands of pages for each query, check whether each page contained relevant content, and often returned too many results.

Google and PageRank

To solve this problem, Google built its search engine around PageRank, an algorithm that Larry Page and Sergey Brin developed at Stanford in the late 1990s.

Google introducing PageRank. (https://googlepress.blogspot.com/2000/06/google-launches-worlds-largest-search.html)

A visualization of the PageRank algorithm. The percentages show perceived importance, and the arrows represent hyperlinks.

It modeled the web as a graph of pages connected by hyperlinks (a Webgraph) and ranked the most relevant documents at the top of the search results. Under the hood, it used a graph algorithm based on a probability distribution: the likelihood that a person clicking links at random ends up on a given page. It made Google one of the fastest-growing companies in the world at that time, and it beat the pants off companies still relying on traditional text search. Google dominated internet search with this algorithm until 2012.
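To make the idea concrete, here is a minimal power-iteration sketch of PageRank in plain Python. It is a simplified textbook version, not Google's production system. The toy web, the damping factor of 0.85, and the iteration count are my own assumptions for illustration.

```python
def pagerank(
    links: dict[str, list[str]],
    damping: float = 0.85,
    iterations: int = 50,
) -> dict[str, float]:
    """Power-iteration PageRank over a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page first receives the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: D links out, but nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(page, round(score, 3))
```

Page C collects links from three pages, so it ends up on top, while D, which nobody links to, keeps only the random-jump share.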

Google and Knowledge Graph

What beat Google was Google. Google published a famous blog post in 2012: Introducing the Knowledge Graph: things, not strings

What does this mean? During the PageRank era, the search engine mostly showed relevant documents that frequently referenced the keywords in your query or were statistically related to them. But the Knowledge Graph helps search engines understand your queries. It captures the concepts behind words. That is what Google meant by "things, not strings."

Relevant information about Marie Curie

These days, when you search for something on Google, it doesn't only return information about the exact thing you searched for. It also shows you its related information, such as family members, coworkers, accomplishments, books, related locations, and so on.

So, what is a Knowledge Graph and how do we build it?

Knowledge Graph

A Knowledge Graph is a structured representation of information where real-world objects or concepts are stored as entities, and the connections between them are stored as relationships.

For example:

(Tom Hanks) -[:ACTED_IN]-> (Forrest Gump)
(Forrest Gump) -[:DIRECTED_BY]-> (Robert Zemeckis)

When you enter a query into Google, instead of simply asking "Which documents contain the words in your query?", Google might ask "Which entities are related to this entity, and how are they connected?"

Core Knowledge Graph Concepts

Entity

An entity is a real-world object or concept.

Examples:

Person: "Marie Curie"
Organization: "OpenAI"
Product: "iPhone"
Disease: "Diabetes"
Concept: "Retrieval-Augmented Generation"

In popular graph database systems such as Neo4j, entities are usually represented as nodes. Neo4j’s property graph model represents domain objects as nodes and relationships as directed connections, with additional information stored as properties on both.

Relationship

A relationship describes how two entities are connected.

Examples:

Marie Curie DISCOVERED Radium
OpenAI DEVELOPED ChatGPT
Employee WORKS_AT Company
Paper CITES Paper

Triple

A triple is a simple way to represent a fact:

(subject, predicate, object)

Example:

(Marie Curie) -[:DISCOVERED]-> (Radium)
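As a rough sketch of why triples are useful, the snippet below stores a few of them as plain Python tuples and follows two relationships in a row. The `build_index` and `two_hop` helpers are names I made up for this example. A real graph database does the same thing with indexes and a query language.

```python
from collections import defaultdict

Triple = tuple[str, str, str]  # (subject, predicate, object)

triples: list[Triple] = [
    ("Tom Hanks", "ACTED_IN", "Forrest Gump"),
    ("Forrest Gump", "DIRECTED_BY", "Robert Zemeckis"),
    ("Marie Curie", "DISCOVERED", "Radium"),
]


def build_index(facts: list[Triple]) -> dict[str, list[tuple[str, str]]]:
    """Map each subject to its outgoing (predicate, object) pairs."""
    index: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for subject, predicate, obj in facts:
        index[subject].append((predicate, obj))
    return index


def two_hop(
    index: dict[str, list[tuple[str, str]]], start: str
) -> list[tuple[str, str, str, str]]:
    """Follow two outgoing edges from start: (pred1, middle, pred2, end)."""
    paths = []
    for pred1, middle in index.get(start, []):
        for pred2, end in index.get(middle, []):
            paths.append((pred1, middle, pred2, end))
    return paths


index = build_index(triples)
print(two_hop(index, "Tom Hanks"))
```

Starting from Tom Hanks, the two-hop walk reaches Robert Zemeckis through Forrest Gump, which is the kind of connected answer a keyword match alone cannot give.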

Ontology and Schema

An ontology defines the types of entities and relationships allowed in your graph.

For example, in a medical KG:

Entity types:
- Patient
- Disease
- Medication
- Symptom

Relationship types:
- HAS_SYMPTOM
- DIAGNOSED_WITH
- PRESCRIBED
- INTERACTS_WITH

A schema is the implementation-level structure of the graph: labels, relationship types, property names, constraints, and indexes. Neo4j describes schema as the prescribed property existence and data types for nodes and relationships.
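One practical use of an ontology is rejecting triples that do not fit it. The sketch below checks a candidate triple against the medical example. The pairing of each relationship with specific head and tail types is my assumption; the lists above only name the types.

```python
ENTITY_TYPES = {"Patient", "Disease", "Medication", "Symptom"}

# Allowed (head type, relationship, tail type) combinations.
# These pairings are assumptions for illustration only.
ALLOWED_TRIPLES = {
    ("Patient", "HAS_SYMPTOM", "Symptom"),
    ("Patient", "DIAGNOSED_WITH", "Disease"),
    ("Patient", "PRESCRIBED", "Medication"),
    ("Medication", "INTERACTS_WITH", "Medication"),
}


def is_valid_triple(head_type: str, relation: str, tail_type: str) -> bool:
    """Return True only if the triple fits the ontology."""
    return (
        head_type in ENTITY_TYPES
        and tail_type in ENTITY_TYPES
        and (head_type, relation, tail_type) in ALLOWED_TRIPLES
    )


print(is_valid_triple("Patient", "DIAGNOSED_WITH", "Disease"))  # fits
print(is_valid_triple("Symptom", "PRESCRIBED", "Patient"))      # wrong direction
```

Running this kind of check before writing to the graph keeps extraction mistakes from turning into permanent edges.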

Entity and Relation Extraction Methods

So, how can we extract entities and relations from text?

A good KG extraction result should contain not only (head, relation, tail), but also entity types, evidence text, source document/chunk ID, confidence score, and sometimes properties.

Example:

{
  "head": {"id": "Brian Chesky", "type": "Person"},
  "relation": "FOUNDED",
  "tail": {"id": "Airbnb", "type": "Company"},
  "evidence": "Brian Chesky founded Airbnb in San Francisco in 2008.",
  "source_chunk_id": "doc1_chunk3",
  "confidence": 0.91
}
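The extra fields pay off when the same fact shows up in several chunks. Here is a small sketch that drops low-confidence records and merges duplicates while keeping every source chunk ID. The second and third records and the 0.5 threshold are invented for the example.

```python
def consolidate(records: list[dict], min_confidence: float = 0.5) -> list[dict]:
    """Drop weak records and merge duplicates of the same (head, relation, tail)."""
    merged: dict[tuple[str, str, str], dict] = {}
    for record in records:
        if record["confidence"] < min_confidence:
            continue
        key = (record["head"]["id"], record["relation"], record["tail"]["id"])
        if key not in merged:
            merged[key] = {
                "head": record["head"],
                "relation": record["relation"],
                "tail": record["tail"],
                "evidence": record["evidence"],
                "confidence": record["confidence"],
                "source_chunk_ids": [record["source_chunk_id"]],
            }
            continue
        kept = merged[key]
        kept["source_chunk_ids"].append(record["source_chunk_id"])
        # Keep the evidence text from the most confident mention.
        if record["confidence"] > kept["confidence"]:
            kept["confidence"] = record["confidence"]
            kept["evidence"] = record["evidence"]
    return list(merged.values())


chesky = {"id": "Brian Chesky", "type": "Person"}
airbnb = {"id": "Airbnb", "type": "Company"}
records = [
    {"head": chesky, "relation": "FOUNDED", "tail": airbnb,
     "evidence": "Brian Chesky founded Airbnb in San Francisco in 2008.",
     "source_chunk_id": "doc1_chunk3", "confidence": 0.91},
    # Hypothetical duplicate from another document.
    {"head": chesky, "relation": "FOUNDED", "tail": airbnb,
     "evidence": "Airbnb was founded by Brian Chesky.",
     "source_chunk_id": "doc2_chunk1", "confidence": 0.95},
    # Hypothetical weak guess that should be filtered out.
    {"head": airbnb, "relation": "FOUNDED", "tail": chesky,
     "evidence": "Airbnb and Brian Chesky.",
     "source_chunk_id": "doc2_chunk4", "confidence": 0.2},
]

facts = consolidate(records)
print(facts)
```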

There are several methods for extracting entities and relations.

Extraction Method 1: spaCy

spaCy is best when you want a fast, local, deterministic, production-friendly NLP pipeline. Compared with an LLM-based approach, it is simpler and cheaper.

spaCy processes raw text into structured linguistic annotations through the following pipeline:

Raw text
  -> tokenizer
  -> tok2vec / transformer
  -> tagger / morphologizer
  -> dependency parser
  -> NER
  -> custom components, if any

How does it recognize entities and relations?

Entity Recognition: The simplest way is to use a neural model. spaCy uses a lightweight deep learning model, such as the one included in en_core_web_sm. The official docs describe EntityRecognizer as a transition-based NER component that identifies non-overlapping labeled spans and stores them in Doc.ents. Therefore, its entity types are restricted to the pretrained label set.

You can easily extract entities from text with this NER pipeline.

import spacy
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start_char: int
    end_char: int


def extract_entities_spacy(text: str) -> list[Entity]:
    """Extract named entities with spaCy's pretrained NER pipeline."""
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)

    return [
        Entity(
            text=ent.text,
            label=ent.label_,
            start_char=ent.start_char,
            end_char=ent.end_char,
        )
        for ent in doc.ents
    ]


if __name__ == "__main__":
    text = "Brian Chesky founded Airbnb in San Francisco in 2008."
    for ent in extract_entities_spacy(text):
        print(ent)

Output:

Entity(text='Brian Chesky', label='PERSON', start_char=0, end_char=12)
Entity(text='Airbnb', label='ORG', start_char=21, end_char=27)
Entity(text='San Francisco', label='GPE', start_char=31, end_char=44)
Entity(text='2008', label='DATE', start_char=48, end_char=52)

Another option is to use a non-neural, rule-based component such as EntityRuler, which relies on explicit patterns.

Relation Recognition: For relations, spaCy usually relies on dependency parsing and rule-based patterns. You can also use a custom-trained relation extraction model.

As you can see in the example, it can easily handle a sentence like "Brian Chesky founded Airbnb in San Francisco in 2008."

But it may struggle with sentences like these: "Brian Chesky is one of the people behind Airbnb.", "The company was started in 2008 by Brian Chesky and others."
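To show why fixed patterns are brittle, here is a deliberately naive stand-in that uses a plain regular expression instead of spaCy's dependency parser. It is not how spaCy works internally; the `FOUNDED_PATTERN` rule is my own toy example of a surface-level pattern.

```python
import re

# Toy rule: "<Capitalized Name> founded <Capitalized Name>".
FOUNDED_PATTERN = re.compile(
    r"(?P<head>[A-Z]\w+(?: [A-Z]\w+)*) founded (?P<tail>[A-Z]\w+(?: [A-Z]\w+)*)"
)


def extract_founded(sentence: str) -> list[tuple[str, str, str]]:
    """Return (head, FOUNDED, tail) triples matched by the toy rule."""
    return [
        (match.group("head"), "FOUNDED", match.group("tail"))
        for match in FOUNDED_PATTERN.finditer(sentence)
    ]


sentences = [
    "Brian Chesky founded Airbnb in San Francisco in 2008.",
    "Brian Chesky is one of the people behind Airbnb.",
    "The company was started in 2008 by Brian Chesky and others.",
]
for sentence in sentences:
    print(extract_founded(sentence))
```

Only the first sentence produces a triple. The other two express the same fact with different wording, so each one would need its own rule, and that is where rule-based relation extraction gets expensive to maintain.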

Extraction Method 2: GLiNER / GLiNER2

GLiNER is a lightweight NER framework designed to identify arbitrary entity types from label descriptions, rather than being limited to a fixed NER label set, which is one limitation of spaCy's pretrained pipelines. Its documentation describes it as a practical middle ground between traditional NER and expensive LLM-based extraction.

GLiNER2 extends this idea. Its GitHub page describes it as a unified schema-based information extraction (IE) model that handles entity extraction, text classification, structured extraction, and relation extraction in one efficient model.

So you can define your own entity types and relation types with GLiNER2.

Simple Example:

from gliner2 import GLiNER2


def extract_graph_gliner2(text: str) -> dict:
    """Extract entities and relations with a GLiNER2 schema."""
    extractor = GLiNER2.from_pretrained("fastino/gliner2-base-v1")

    schema = (
        extractor.create_schema()
        .entities(
            {
                "person": "Names of people, founders, executives, or researchers",
                "company": "Organizations, companies, startups, or labs",
                "location": "Cities, regions, or countries",
                "date": "Years or specific dates",
            }
        )
        .relations(
            {
                "founded": "Relationship where a person created or co-created a company",
                "located_in": "Relationship where an organization is based in a location",
                "works_for": "Employment or affiliation relationship",
            }
        )
    )

    return extractor.extract(text, schema, include_confidence=True)


if __name__ == "__main__":
    text = "Brian Chesky founded Airbnb in San Francisco in 2008."
    result = extract_graph_gliner2(text)
    print(result)

Output:

{'entities':
{'person': [{'text': 'Brian Chesky', 'confidence': 0.9999997615814209}], 'company': [{'text': 'Airbnb', 'confidence': 0.9999885559082031}], 'location': [{'text': 'San Francisco', 'confidence': 0.9999994039535522}], 'date': [{'text': '2008', 'confidence': 0.9999909400939941}]},
'relation_extraction':
{'founded': [{'head': {'text': 'Brian Chesky', 'confidence': 0.999990701675415}, 'tail': {'text': 'Airbnb', 'confidence': 0.9999719858169556}}], 'located_in': [{'head': {'text': 'Airbnb', 'confidence': 0.9986342787742615}, 'tail': {'text': 'San Francisco', 'confidence': 0.9998704195022583}}], 'works_for': []}}

Extraction Method 3: LLM-based Extraction

LLM-based extraction is currently the most flexible and widely used method for KG construction, especially when documents are complex, relationships are implicit, or the ontology is still evolving. Neo4j’s docs say modern LLMs can be instructed with prompts, examples, schemas, existing entities, and output formatting to extract and deduplicate entities and relationships from unstructured text.

If you define a structured output with Pydantic, the LLM can return data in a structure that fits KG triples.

Example of structured output:

from typing import Literal
from pydantic import BaseModel, Field


EntityType = Literal["Person", "Company", "Location", "Date"]
RelationType = Literal["FOUNDED", "LOCATED_IN", "WORKS_AT"]


class Entity(BaseModel):
    id: str = Field(description="Canonical entity name")
    type: EntityType
    evidence: str = Field(description="Text span supporting the entity")


class Relation(BaseModel):
    head: str = Field(description="Canonical head entity id")
    relation: RelationType
    tail: str = Field(description="Canonical tail entity id")
    evidence: str = Field(description="Exact sentence or phrase supporting the relation")


class KGExtraction(BaseModel):
    entities: list[Entity]
    relations: list[Relation]

Note: Even if you define a structured output schema, you should validate and deserialize the output. LLMs can still ignore the structured format and return invalid output. This happens more often with smaller models.
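As a sketch of that validation step, the snippet below uses only the standard library as a stand-in for Pydantic. It parses the raw model output, rejects anything that is not JSON, and silently drops entries that do not fit the schema. Whether to drop or raise on bad entries is a design choice; the `parse_extraction` name and the sample output are made up for this example.

```python
import json

ENTITY_TYPES = {"Person", "Company", "Location", "Date"}
RELATION_TYPES = {"FOUNDED", "LOCATED_IN", "WORKS_AT"}


def _as_list(value: object) -> list:
    """Treat anything that is not a list as an empty list."""
    return value if isinstance(value, list) else []


def parse_extraction(raw: str) -> dict:
    """Parse model output and keep only entries that fit the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")

    entities = [
        e for e in _as_list(data.get("entities"))
        if isinstance(e, dict)
        and isinstance(e.get("id"), str)
        and isinstance(e.get("type"), str)
        and e["type"] in ENTITY_TYPES
    ]
    known_ids = {e["id"] for e in entities}
    # A relation survives only if both endpoints are valid entities.
    relations = [
        r for r in _as_list(data.get("relations"))
        if isinstance(r, dict)
        and isinstance(r.get("relation"), str)
        and r["relation"] in RELATION_TYPES
        and isinstance(r.get("head"), str)
        and isinstance(r.get("tail"), str)
        and r["head"] in known_ids
        and r["tail"] in known_ids
    ]
    return {"entities": entities, "relations": relations}


# Hypothetical model output with one off-schema entity and one dangling relation.
raw_output = json.dumps({
    "entities": [
        {"id": "Brian Chesky", "type": "Person"},
        {"id": "Airbnb", "type": "Company"},
        {"id": "startup", "type": "Thing"},
    ],
    "relations": [
        {"head": "Brian Chesky", "relation": "FOUNDED", "tail": "Airbnb"},
        {"head": "Brian Chesky", "relation": "FOUNDED", "tail": "startup"},
    ],
})
parsed = parse_extraction(raw_output)
print(parsed)
```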

The LangChain framework provides an LLMGraphTransformer class in langchain_experimental. You can use it like this:

from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain.chat_models import init_chat_model
from dotenv import load_dotenv

load_dotenv()

llm = init_chat_model("openai:gpt-5-nano")

documents = [
    Document(
        page_content="Brian Chesky founded Airbnb in San Francisco in 2008.",
        metadata={"source": "example_doc_1"},
    )
]

llm_transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Company", "Location", "Date"],
    allowed_relationships=[
        ("Person", "FOUNDED", "Company"),
        ("Company", "LOCATED_IN", "Location"),
    ],
    strict_mode=False,
    node_properties=["name"],
    relationship_properties=["evidence"],
    additional_instructions=(
        "Extract only relationships explicitly stated in the text. "
        "Use canonical entity names. "
        "Do not create vague nodes such as 'startup' or 'company'."
    ),
)

graph_documents = llm_transformer.convert_to_graph_documents(documents)

for graph_doc in graph_documents:
    print("Nodes:", graph_doc.nodes)
    print("Relationships:", graph_doc.relationships)

Output:

Nodes: [Node(id='Brian Chesky', type='Person', properties={'name': 'Brian Chesky'}), Node(id='Airbnb', type='Company', properties={'name': 'Airbnb'}), Node(id='San Francisco', type='Location', properties={'name': 'San Francisco'})]

Relationships: [Relationship(source=Node(id='Brian Chesky', type='Person', properties={}), target=Node(id='Airbnb', type='Company', properties={}), type='FOUNDED', properties={'evidence': '2008-01-01'}), Relationship(source=Node(id='Airbnb', type='Company', properties={}), target=Node(id='San Francisco', type='Location', properties={}), type='LOCATED_IN', properties={})]

Neo4j’s current guide shows essentially this pattern: use LLMGraphTransformer, pass allowed_nodes, allowed_relationships, node_properties, convert documents to graph documents, then add them to Neo4j with include_source=True.

Comparison: spaCy vs. GLiNER2 vs. LLM-based Extraction

Method	Strengths	Weaknesses	Best use case
spaCy	Fast, mature, low-cost, production-friendly	Fixed labels unless custom-trained; limited relation extraction	Standard NER, preprocessing, high-throughput pipelines
GLiNER2	Flexible schema-based extraction; lighter than LLMs	Newer ecosystem; may need evaluation for your domain	Custom entity extraction and lightweight IE
LLM-based extraction	Strong semantic understanding; flexible schema and relation extraction	Higher cost, higher latency, hallucination risk	Complex KG construction from messy documents
Hybrid	Balances speed, cost, and quality	More engineering complexity	Production GraphRAG systems

GraphRAG and GraphDB

What is GraphRAG? In February 2024, Microsoft published a blog post about GraphRAG: GraphRAG Blog Post; arXiv Paper

This approach is often described as using LLM-Derived Knowledge Graphs. It utilizes a Knowledge Graph and lets LLMs traverse the graph to retrieve information. So, the concept is simple: Connect the LLM to a Knowledge Graph. But how?

One approach is to have the model generate queries in Cypher, a graph query language. If you want to build your own graph database, you need to learn Cypher. (Actually, I'm not that proficient at this.😒 But it's not so different from other DB query languages.) Then, where should you store the graph? There are several graph DB systems, and one of the most popular is Neo4j. Others include Amazon Neptune, Azure Cosmos DB, Memgraph, and NebulaGraph. Only Neo4j's Community Edition is open source, so keep that in mind if you need an enterprise graph DB service.
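If you have never seen Cypher, this sketch shows what writing one triple looks like. It only builds the query text and the parameter dictionary, so it runs without a database. The `merge_statement` helper is my own naming, and the `id` and `evidence` property names are assumptions. Values travel as parameters, while labels and relationship types are written into the query text, so the helper checks them first.

```python
import re

# Labels and relationship types end up inside the query text,
# so only plain identifiers are accepted.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def merge_statement(head_label: str, relation: str, tail_label: str) -> str:
    """Build a Cypher statement that upserts two nodes and one relationship."""
    for name in (head_label, relation, tail_label):
        if not IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unsafe identifier: {name!r}")
    return (
        f"MERGE (h:{head_label} {{id: $head_id}}) "
        f"MERGE (t:{tail_label} {{id: $tail_id}}) "
        f"MERGE (h)-[r:{relation}]->(t) "
        "SET r.evidence = $evidence"
    )


statement = merge_statement("Person", "FOUNDED", "Company")
params = {
    "head_id": "Brian Chesky",
    "tail_id": "Airbnb",
    "evidence": "Brian Chesky founded Airbnb in San Francisco in 2008.",
}
print(statement)
print(params)
```

MERGE creates the node or relationship only if it does not exist yet, so running the same statement twice does not duplicate the fact.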

How to build GraphDB

I will demonstrate GraphDB construction using Neo4j AuraDB. You can build a Neo4j graph locally, but AuraDB is convenient for a demo. (Of course, it is not free for enterprise production use.)

As in the first post, I will use the "Demon Slayer" series for the demonstration.

First, when you create an AuraDB instance, it will show you the connection details for your graph, such as the username, password, database name, and URI.

DB Connection Functions

import os
from dataclasses import dataclass

from dotenv import load_dotenv
from langchain_neo4j import Neo4jGraph

@dataclass(frozen=True)
class Neo4jSettings:
    uri: str
    username: str
    password: str
    database: str = "neo4j"

def get_neo4j_settings() -> Neo4jSettings:
    uri = os.getenv("NEO4J_URI") or os.getenv("NEO4J_URL")
    username = os.getenv("NEO4J_USERNAME") or os.getenv("NEO4J_USER")
    password = os.getenv("NEO4J_PASSWORD")
    database = os.getenv("NEO4J_DATABASE", "neo4j")

    missing = [
        name
        for name, value in {
            "NEO4J_URI": uri,
            "NEO4J_USERNAME": username,
            "NEO4J_PASSWORD": password,
        }.items()
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing required Neo4j environment variables: {', '.join(missing)}")

    return Neo4jSettings(uri=uri, username=username, password=password, database=database)


def get_graph(refresh_schema: bool = False, enhanced_schema: bool = False) -> Neo4jGraph:
    settings = get_neo4j_settings()
    return Neo4jGraph(
        url=settings.uri,
        username=settings.username,
        password=settings.password,
        database=settings.database,
        refresh_schema=refresh_schema,
        enhanced_schema=enhanced_schema,
        sanitize=True,
    )

LangChain's LLMGraphTransformer, which I used in the extraction example earlier, lets you transform documents into graph-compatible triples very easily.

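from langchain_experimental.graph_transformers import LLMGraphTransformer

# get_llm() is a helper not shown in this post; it returns a chat model.
# batch is a list of LangChain Document chunks to convert.
graph = get_graph()

llm_transformer = LLMGraphTransformer(
    llm=get_llm(),
    node_properties=["description"],
    relationship_properties=["evidence"],
)

graph_documents = llm_transformer.convert_to_graph_documents(batch)
# include_source links each entity to the chunk that mentions it.
graph.add_graph_documents(graph_documents, include_source=True, baseEntityLabel=True)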

Nodes and Relationships Related to the Main Character, Tanjirou

You can search the graph with a query like this:

MATCH (n1)-[r]-(n2)
WHERE n1.id =~ ".*Tanjirou.*" AND NOT n2:Document
RETURN n1, r, n2

The result can look a bit cluttered because it includes so many minor objects such as doors and stone. You can specify the types of nodes and relationships. I set allowed_nodes like this:

llm_transformer = LLMGraphTransformer(
    llm=get_llm(model="xai:grok-4.3"),
    node_properties=["description"],
    relationship_properties=["evidence"],
    allowed_nodes=[
        "Person",
        "Organization",
        "Location",
        "Breathing",
        "Weapon",
        "Demon",
        "Object",
    ],
)

Then I fed the Season 1, EP 1 document. Let's see how the LLM generated the graph structure and built a graph. I'm tracking all my LLM calls on LangSmith, so let's look at some of them.

I used a total of 56.1k tokens, and the process took 367 seconds for one episode with the Grok-4.3 model. Unsolicited tip: the API cost for this brand-new frontier model is so low that it might be the best model for this kind of demo or personal project. Thanks, Elon! 😊

The graph I built from Demon Slayer Season 1, Episode 1

The input prompt
The output for creating nodes and relationships

You can see the default instructions in LLMGraphTransformer, and you can override them with your own prompt. The prompt is very simple, but I think it is enough to create a suitable graph. For production-grade projects, however, you are better off writing your own prompt carefully. If you read the prompt closely, you may notice that it doesn't directly mention the allowed nodes. It just checks and trims the output after generation. As we all know: garbage in, garbage out.

Anyway, LLMs are powerful for this kind of job, but if you want to reduce costs, a hybrid method combining spaCy, GLiNER2, and LLMs might be the best fit for you.
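One way to picture the hybrid idea: run the cheap extractor first and call the LLM only when the cheap pass finds nothing or looks unsure. The sketch below uses stub functions in place of real spaCy, GLiNER2, or LLM calls. The stubs, the `hybrid_extract` name, and the 0.8 threshold are all assumptions for illustration.

```python
from typing import Callable

Extractor = Callable[[str], list[dict]]


def hybrid_extract(
    text: str,
    cheap: Extractor,
    expensive: Extractor,
    min_confidence: float = 0.8,
) -> tuple[str, list[dict]]:
    """Return (route, triples), escalating when the cheap pass looks unsure."""
    results = cheap(text)
    if results and all(r["confidence"] >= min_confidence for r in results):
        return "cheap", results
    return "expensive", expensive(text)


def cheap_stub(text: str) -> list[dict]:
    # Stand-in for a spaCy or GLiNER2 pass: it only handles the literal verb.
    if " founded " in text:
        return [{"head": "Brian Chesky", "relation": "FOUNDED",
                 "tail": "Airbnb", "confidence": 0.95}]
    return []


def llm_stub(text: str) -> list[dict]:
    # Stand-in for an LLM call; a real one would cost tokens and time.
    return [{"head": "Brian Chesky", "relation": "FOUNDED",
             "tail": "Airbnb", "confidence": 0.7}]


easy = "Brian Chesky founded Airbnb in San Francisco in 2008."
hard = "Brian Chesky is one of the people behind Airbnb."
print(hybrid_extract(easy, cheap_stub, llm_stub)[0])
print(hybrid_extract(hard, cheap_stub, llm_stub)[0])
```

With routing like this, the LLM bill scales with the number of hard sentences rather than with the size of the whole corpus.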

How to Traverse and Retrieve from a Graph

Last but not least, you need to traverse the graph and retrieve information relevant to your query. Whenever I try GraphRAG projects, I always realize that this part is the most difficult. You can build your graph very easily thanks to LLMs, but traversal is another story.

Of course, LangChain provides a retrieval class called GraphCypherQAChain, but it is not always smart or reliable enough out of the box. Let me show you.

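from langchain_neo4j import GraphCypherQAChain


def query_llm():
    # get_llm is a helper not shown in this post; it returns a chat model.
    return get_llm("xai:grok-4.3")

# The chain reads the graph schema to write Cypher, so refresh it first.
graph = get_graph(refresh_schema=True)
chain = GraphCypherQAChain.from_llm(
    llm=query_llm(),
    graph=graph,
    verbose=True,
    validate_cypher=True,
    return_intermediate_steps=True,
    top_k=top_k,  # maximum number of query results passed to the LLM
    allow_dangerous_requests=True,  # explicit opt-in: the chain runs generated Cypher
)
# question and top_k come from the CLI arguments.
result = chain.invoke({"query": question})
answer = result["result"]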

When you retrieve from the graph and invoke the LLM in this way, the result may not be very satisfactory. If anything, it can be almost unusable.

When I run this query: uv run python -m graph_rag.query --question "Why did Giyu send Tanjirou to Mt.Sagiri?", it returns the following output:

=== Generated Cypher ===
MATCH (g:Person)-[r:INSTRUCTS|REFERS]->(t:Person)-[:TRAVELS_TO|GOES_TO|VISITS]->(l:Location)
WHERE (g.id CONTAINS "Giyu" OR g.description CONTAINS "Giyu") AND (t.id CONTAINS "Tanjirou" OR t.description CONTAINS "Tanjirou") AND (l.id CONTAINS "Sagiri" OR l.description CONTAINS "Sagiri")
RETURN r.evidence AS reason
=== Cypher Context ===
[]
=== Final Answer ===
I don't know the answer to that.
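An empty Cypher context means the generated pattern matched nothing. One quick diagnostic, sketched here under assumptions, is to compare the relationship types the model wrote against the types that exist in the graph. In a real setup you could fetch those with CALL db.relationshipTypes(); the `known_types` set below is hypothetical, and the regex is a rough parser that only handles simple patterns like the one above.

```python
import re

# Captures the type list inside a relationship pattern such as [r:A|B] or [:A].
REL_BLOCK = re.compile(r"\[\s*\w*\s*:([^\]\*\{]+)")


def hop_types(cypher: str) -> list[set[str]]:
    """Return the set of alternative relationship types for each hop."""
    hops = []
    for block in REL_BLOCK.findall(cypher):
        names = {name.strip().lstrip(":") for name in block.split("|")}
        hops.append({name for name in names if name})
    return hops


def dead_hops(cypher: str, known: set[str]) -> list[set[str]]:
    """Return hops where none of the alternatives exists in the graph."""
    return [hop for hop in hop_types(cypher) if not hop & known]


generated = (
    "MATCH (g:Person)-[r:INSTRUCTS|REFERS]->(t:Person)"
    "-[:TRAVELS_TO|GOES_TO|VISITS]->(l:Location) "
    "RETURN r.evidence AS reason"
)
# Hypothetical relationship types; replace with the ones in your graph.
known_types = {"SENDS_TO", "MEETS", "VISITS"}

print(dead_hops(generated, known_types))
```

If any hop comes back dead, the whole path can never match, no matter how well the WHERE clause is written.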

As you can see, it fails to retrieve anything: the Cypher context is empty. The chain mainly generates a Cypher query and sends it to AuraDB, so if you use it as-is, it doesn’t fully understand your graph structure or domain. Here, the relationship pattern it guessed matched nothing in the graph. To solve this, you need to build your own pipeline and write targeted queries. I tweaked the pipeline like this.

Pipeline Code

ENTITY_QUERY = """
CALL db.index.fulltext.queryNodes("graph_entity_fulltext", $lucene_query, {limit: $limit})
YIELD node, score
RETURN elementId(node) AS element_id, node.id AS id, labels(node) AS labels, score
ORDER BY score DESC
"""

GRAPH_CONTEXT_QUERY = """
MATCH (seed)
WHERE elementId(seed) IN $seed_ids
CALL {
    WITH seed
    MATCH p = (seed)-[*1..2]-(neighbor)
    WHERE all(rel IN relationships(p) WHERE type(rel) <> "MENTIONS")
    UNWIND relationships(p) AS rel
    RETURN DISTINCT
        coalesce(startNode(rel).id, startNode(rel).name) AS source,
        labels(startNode(rel)) AS source_labels,
        type(rel) AS relationship,
        coalesce(endNode(rel).id, endNode(rel).name) AS target,
        labels(endNode(rel)) AS target_labels,
        properties(rel) AS properties
    LIMIT $relationship_limit
}
RETURN source, source_labels, relationship, target, target_labels, properties
"""

SOURCE_CONTEXT_QUERY = """
MATCH (seed)
WHERE elementId(seed) IN $seed_ids
MATCH (doc:Document)-[:MENTIONS]->(seed)
WITH doc, collect(DISTINCT seed.id) AS matched_entities, count(DISTINCT seed) AS entity_hits
RETURN DISTINCT
    doc.id AS id,
    doc.source AS source,
    doc.chunk_index AS chunk_index,
    coalesce(doc.text, doc.page_content, doc.content, "") AS text,
    matched_entities,
    entity_hits
ORDER BY entity_hits DESC
LIMIT $chunk_limit
"""

def rerank_chunks_by_similarity(
    question: str,
    chunks: list[dict],
    chunk_limit: int,
    embedding_model: str,
) -> list[dict]:
    text_chunks = [chunk for chunk in chunks if chunk.get("text")]
    if not text_chunks:
        return []

    embeddings = chunk_embeddings(embedding_model)
    query_vector = embeddings.embed_query(question)
    chunk_vectors = embeddings.embed_documents([chunk["text"] for chunk in text_chunks])

    scored_chunks = []
    for chunk, chunk_vector in zip(text_chunks, chunk_vectors):
        scored_chunk = dict(chunk)
        scored_chunk["similarity"] = cosine_similarity(query_vector, chunk_vector)
        scored_chunks.append(scored_chunk)

    return sorted(scored_chunks, key=lambda chunk: chunk["similarity"], reverse=True)[:chunk_limit]

def retrieve_context(
    graph,
    question: str,
    entity_limit: int,
    relationship_limit: int,
    chunk_candidate_limit: int,
    chunk_limit: int,
    embedding_model: str,
) -> GraphContext:
    entity_names = extract_question_entities(question)
    lexical = lucene_query(" ".join(entity_names) if entity_names else question)
    entities = graph.query(ENTITY_QUERY, {"lucene_query": lexical, "limit": entity_limit})

    if entity_names:
        exact_entities = graph.query(
            """
            MATCH (n:__Entity__)
            WHERE toLower(n.id) IN $entity_names
            RETURN elementId(n) AS element_id, n.id AS id, labels(n) AS labels, 100.0 AS score
            LIMIT $limit
            """,
            {"entity_names": [name.lower() for name in entity_names], "limit": entity_limit},
        )
        seen = set()
        entities = [
            entity
            for entity in exact_entities + entities
            if not (entity["element_id"] in seen or seen.add(entity["element_id"]))
        ][:entity_limit]

    seed_ids = [entity["element_id"] for entity in entities]
    if not seed_ids:
        return GraphContext(entities=[], relationships=[], chunks=[])

    relationships = graph.query(
        GRAPH_CONTEXT_QUERY,
        {"seed_ids": seed_ids, "relationship_limit": relationship_limit},
    )
    chunks = graph.query(
        SOURCE_CONTEXT_QUERY,
        {"seed_ids": seed_ids, "chunk_limit": chunk_candidate_limit},
    )
    chunks = rerank_chunks_by_similarity(question, chunks, chunk_limit, embedding_model)
    return GraphContext(entities=entities, relationships=relationships, chunks=chunks)

def format_context(context: GraphContext) -> str:
    entity_lines = [
        f"- {entity['id']} labels={entity['labels']} score={entity['score']:.2f}"
        for entity in context.entities
    ]
    relationship_lines = [
        (
            f"- ({rel['source']})-[:{rel['relationship']}]->({rel['target']}) "
            f"properties={rel['properties']}"
        )
        for rel in context.relationships
    ]
    chunk_lines = [
        (
            f"- {chunk.get('source')} chunk={chunk.get('chunk_index')} "
            f"similarity={chunk.get('similarity', 0.0):.4f}\n"
            f"  {chunk.get('text', '')}"
        )
        for chunk in context.chunks
        if chunk.get("text")
    ]

    return "\n".join(
        [
            "Matched entities:",
            *(entity_lines or ["- None"]),
            "",
            "Graph relationships:",
            *(relationship_lines or ["- None"]),
            "",
            "Source chunks:",
            *(chunk_lines or ["- None"]),
        ]
    )

def answer_question_manual(
    question: str,
    entity_limit: int,
    relationship_limit: int,
    chunk_candidate_limit: int,
    chunk_limit: int,
    embedding_model: str,
) -> None:
    trace_metadata = {
        "mode": "manual",
        "entity_limit": entity_limit,
        "relationship_limit": relationship_limit,
        "chunk_candidate_limit": chunk_candidate_limit,
        "chunk_limit": chunk_limit,
        "embedding_model": embedding_model,
    }
    with trace(
        "graph_rag.query",
        run_type="chain",
        inputs={"question": question},
        project_name=langsmith_project(),
        tags=["graph-rag", "query"],
        metadata=trace_metadata,
    ) as query_run:
        graph = get_graph(refresh_schema=False)
        context = retrieve_context(
            graph,
            question,
            entity_limit,
            relationship_limit,
            chunk_candidate_limit,
            chunk_limit,
            embedding_model,
        )
        formatted_context = format_context(context)

        prompt = ChatPromptTemplate.from_messages(
            [
                (
                    "system",
                    """You will get some parts of the certain series story. Considering this context, answer the user's question.
                    Don't hallucinate or guess with insufficient information.""",
                ),
                (
                    "human",
                    "Question:\n{question}\n\nGraph context:\n{context}\n\nAnswer:",
                ),
            ]
        )
        chain = prompt | query_llm() | StrOutputParser()
        answer_config: RunnableConfig = {
            "run_name": "graph_rag.query.answer",
            "tags": ["graph-rag", "query", "answer-generation"],
            "metadata": {
                "matched_entity_count": len(context.entities),
                "relationship_count": len(context.relationships),
                "chunk_count": len(context.chunks),
            },
        }
        answer = chain.invoke(
            {"question": question, "context": formatted_context},
            config=answer_config,
        )
        query_run.end(
            outputs={
                "answer": answer,
                "matched_entities": [entity["id"] for entity in context.entities],
                "relationship_count": len(context.relationships),
                "chunk_count": len(context.chunks),
            }
        )

        print("\n=== Entity Search Cypher ===")
        print(ENTITY_QUERY.strip())
        print("\n=== Graph Traversal Cypher ===")
        print(GRAPH_CONTEXT_QUERY.strip())
        print("\n=== Graph Context ===")
        print(formatted_context)
        print("\n=== Final Answer ===")
        print(answer)

Long code short, the process is like this:

It uses an LLM graph transformer to extract entity names from the question.
It searches Neo4j’s graph_entity_fulltext index for matching graph entities.
The matched entities become seed nodes for retrieval.
From those seed nodes, it traverses graph relationships up to two hops away, excluding MENTIONS relationships.
It also finds source document chunks that mention the seed entities.
Finally, it embeds the question and candidate chunks, scores them with cosine similarity, and keeps the most relevant chunks.
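The last step is the easiest to try in isolation. The pipeline calls a cosine_similarity helper that is not shown, so here is a minimal pure-Python sketch of one, plus a reranker over toy two-dimensional vectors. The vectors and chunk IDs are made up; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query_vector: list[float], chunks: list[dict], limit: int) -> list[dict]:
    """Score each chunk against the query and keep the top `limit`."""
    scored = [
        {**chunk, "similarity": cosine_similarity(query_vector, chunk["vector"])}
        for chunk in chunks
    ]
    return sorted(scored, key=lambda chunk: chunk["similarity"], reverse=True)[:limit]


query_vector = [1.0, 0.0]
candidate_chunks = [
    {"id": "a", "vector": [0.0, 1.0]},  # unrelated direction
    {"id": "b", "vector": [1.0, 1.0]},  # partly related
    {"id": "c", "vector": [2.0, 0.0]},  # same direction, different length
]
top = rerank(query_vector, candidate_chunks, limit=2)
print([chunk["id"] for chunk in top])
```

Chunk c wins even though its vector is longer than the query, because cosine similarity compares direction only.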

This is the process I built. I combined GraphRAG with dense RAG. Let's see if it works. I ran the same query uv run python -m graph_rag.query --question "Why did Giyu send Tanjirou to Mt.Sagiri?" and this time, it returned:

=== Final Answer ===
Answer:
Giyu Tomioka (Tomioka) sends Tanjirou to Mt. Sagiri specifically so he can meet the old man Sakonji Urokodaki who lives at its foot. Tomioka instructs Tanjirou to tell Urokodaki that “Giyuu Tomioka sent you,” after Nezuko has been turned into a demon. He also warns Tanjirou never to let her be exposed to sunlight (though she is currently safe because it is cloudy). This direction is given right after Tomioka knocks Nezuko out, places a bamboo gag on her, and decides not to kill her, implying Urokodaki is meant to provide guidance, training, or protection for both Tanjirou and the demon Nezuko.
(The provided chunks contain no further details about Urokodaki’s exact role or what happens after they reach him.)

It answered very well.

You can try your own approaches. You can write your own Cypher queries, tweak the prompt, combine other methods, use a specific model for generating queries, try an agentic approach, and whatnot.

This might be the most stressful part, and even coding tools can't give you a definitive answer. You should take into account your domain, data, information, budget, and other constraints, and understand the whole process. This is the most important part, and it is what humans still have to deal with firsthand.

You can outsource your thinking, but you can't outsource your understanding — Andrej Karpathy

Note: There are some advanced, relatively plug-and-play GraphRAG frameworks, such as LightRAG and PathRAG. They can be useful for simple GraphRAG demos, prototypes, or experiments. However, I would not consider them production-grade yet: they are still closer to research-driven frameworks and have too many limitations for customer-facing or enterprise production systems.

Conclusion

In this post, I introduced Knowledge Graphs and GraphRAG. The basic algorithms were invented decades ago, but as LLMs have started using them, they have resurfaced. Sparse RAG and dense RAG are more straightforward and closer to plug-and-play, but GraphRAG requires more groundwork and scaffolding, as well as extra, sometimes frustrating steps. Still, it can be worthwhile in certain cases.

Introduction to RAG for LLMs: Sparse (Lexical) RAG and Dense RAG (Semantic Vector Search)

Jun Bae — Sat, 25 Apr 2026 17:06:52 +0000

Introduction

LLMs store information within their own parameters. By being trained on massive datasets, the models learn this data. But what if they are asked about information they don't know? These queries will likely result in hallucinations or entirely wrong answers.

As we know, updating the models with current data is very difficult and resource-intensive. Therefore, most AI service providers do not update their models frequently. Instead, they usually leave the models as they are after release because retraining is highly inefficient. That's why all models have their knowledge cutoff dates.

How, then, can they answer questions about up-to-date information? For example, "Who is the president of the U.S. right now?" or "Tell me today's news regarding the U.S.-Iran conflict." Without external tools, they simply can't.

Qwen3.5, which was released in March 2026, doesn't know about events from the previous year.

Knowledge Integration Strategies

Fundamentally, LLMs hold intrinsic knowledge within their own parameters. Additionally, users can inject specific information through their prompts. Therefore, there are three main methods to provide models with the proper information: Fine-tuning, Prompt Engineering, and RAG.

Three Techniques to Optimize LLMs

Method	Required Resources (Cost)	Inference Time (Latency)	Training / Data Prep Time
Fine-Tuning	High	Low	High
Prompt Engineering	Very Low	Low to Medium	Zero
RAG	Medium	High	Low (Ingestion/Indexing)

Each of these three methods has its own pros and cons.

1) Fine-tuning

Fine-tuning adjusts the actual weights of the model by training it on a dataset.

Pros
- Deep Customization: Bakes your specific domain knowledge, stylistic tone, or formatting rules directly into the model's "brain."
- Shorter Prompts & Lower Latency: Because the knowledge is embedded in the weights, you no longer need to stuff massive context into your prompts. This drastically reduces the time-to-first-token and bypasses context window limits.
Cons
- High Upfront Cost: Requires GPUs to train.
- Data Hungry: You need a high-quality, perfectly curated dataset (often hundreds or thousands of examples).
- Knowledge Stagnation: The model's knowledge is frozen at the exact time of training. Retraining and deploying are very inconvenient and take a long time.

2) Prompt engineering

This is the simplest way to inject information. You're tweaking the input text to guide the model's output without changing its underlying neural network weights.

Pros
- Fastest Iteration: You can test, tweak, and deploy changes in seconds.
- Zero Training Cost: No heavy GPU compute is required.
- Highly Flexible: Switch tasks instantly just by altering the prompt instructions.
Cons
- Context Limits: You are strictly restricted by the model's maximum context window.
- Token Costs: Stuffing prompts with massive context gets expensive at scale.
- Inconsistent Reliability: Highly complex, multi-step instructions can confuse the model or trigger hallucinations.

3) RAG (Retrieval-Augmented Generation)

RAG connects your LLM to an external knowledge base (like a vector database). When a user submits a query, the system retrieves relevant data and feeds it to the LLM as context.

Pros
- Reduces Hallucinations: Answers are explicitly grounded in your specific, verifiable data.
- Up-to-Date Knowledge: You don't need to retrain the model when your data changes; simply update the vector database.
- Source Citations: You can trace exactly which document the model used to generate its response, adding trustworthiness.
Cons
- Higher Latency: Fetching embeddings, querying the database, and processing a bloated context window slows down inference time.
- Additional Infrastructure: Requires maintaining extra infrastructure (embedding models, vector DBs, retrieval pipelines).
- Garbage In, Garbage Out: If your retrieval step fails to find the right chunk, the LLM will fail to answer correctly regardless of its size.

Those methods are not inherently superior or inferior to one another. When you build an LLM system, you might use a combination of them.

For example, let's assume that you are building an AI-powered pet service that provides medical diagnoses for pets and locates nearby veterinarians. In this case, you need to provide the model with a veterinary knowledge base. Because this information doesn't change frequently, if you train the model on it once, you won't need to retrain it often. You could also input the information directly into the prompt, but that makes the prompt excessively long. Therefore, it is better to fine-tune the model on this knowledge.

Next, you have to write a basic instruction prompt outlining what services it should provide and what kind of persona or tone it should adopt.

Finally, how do we provide information about the vets' locations? If you simply train the model on this data, there is no guarantee it will retrieve the information accurately. Furthermore, you would have to retrain it whenever new clinics open or existing ones close across the state. This requires frequent updates, but it is also impossible to fit all the state's veterinary data into a single prompt. That is why you need to build a RAG system for this.

Now, let's dive deep into RAG.

RAG (Retrieval-Augmented Generation)

RAG, as the name implies, retrieves data or information from a database. As mentioned above, the model's parametric memory is static and obscures data provenance. And even if we train the model on our data, we can't be sure that it can reference it properly. Prompt engineering is relatively surefire, but more often than not, we can't input an entire dataset and change the prompt for every inference.

So, how exactly does RAG retrieve the data? There are several methods it uses.

Lexical (Sparse) RAG

When people talk about RAG today, they usually mean Dense RAG (converting text into dense vector embeddings), but before that, there was a simpler way called Sparse (Lexical) RAG.

Lexical retrieval looks for exact keyword matches. It doesn't require high-dimensional embedding models. Instead, each document is represented as a vector over the vocabulary in which almost every entry is zero. That's why it is called "sparse."

The undisputed king of traditional lexical retrieval is the Okapi BM25 algorithm.

Here is the master equation used to calculate the relevance score of a document $D$ given a user query $Q$ :

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

Let's break down these variables:

$q_i$ : The $i$ -th keyword in the user's query.
$f(q_i, D)$ : The term frequency (how many times the keyword $q_i$ appears in document $D$ ).
$|D|$ : The total word count (length) of the document.
$\text{avgdl}$ : The average document length across your entire knowledge base.
$k_1$ and $b$ : Tunable constants. $k_1$ (usually between 1.2 and 2.0) controls how quickly the term frequency score saturates. $b$ (usually around 0.75) controls how much the document length penalizes the score, preventing massive, wordy documents from automatically dominating the top results.

The IDF (Inverse Document Frequency) portion ensures that rare words carry significantly more weight than common words like "the" or "animal." It is calculated as:

$$\text{IDF}(q_i) = \ln \left( \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1 \right)$$

$N$ : The total number of documents in the database.
$n(q_i)$ : The number of documents that contain the keyword $q_i$ .

Essentially, BM25 says: "Reward documents where the query terms appear frequently, but only if those terms are rare across the whole database, and penalize documents that are just incredibly long."
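
To make the equation concrete, here is a minimal pure-Python sketch of the BM25 score. It is only an illustration under my own assumptions: whitespace tokenization, no stop-word removal or stemming, and a made-up three-document corpus. The function name `bm25_scores` is hypothetical and not part of any library.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)                                   # N
    avgdl = sum(len(toks) for toks in tokenized) / n_docs     # average document length
    # n(q_i): number of documents that contain each term
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            f = tf[term]                                      # f(q_i, D)
            if f == 0:
                continue
            n_q = doc_freq[term]
            idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1)
            length_norm = 1 - b + b * len(toks) / avgdl
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus (invented for this sketch)
docs = [
    "the wolfhounds are an indie pop band",
    "the band released the album",
    "the dog chased the cat",
]
scores = bm25_scores("wolfhounds band", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(scores, best)
```

The first document wins because it contains both query terms, and the rare term "wolfhounds" contributes more than the common term "band." The third document shares no term with the query, so its score stays at zero.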

Therefore, if you enter a specific query, it will calculate a score for each document and retrieve the top-$k$ documents. Let's try this with a real dataset and some code.

Sparse RAG Example

There are several frameworks that support BM25, such as LangChain. In this example, I'm going to use the Qdrant/bm25 model from fastembed.

from fastembed import SparseTextEmbedding

sparse_model = SparseTextEmbedding(model_name="Qdrant/bm25")

documents = ["Hello. Who are you?", "Hello World, who the hell are you?"]
print(list(sparse_model.embed(documents)))

[
  SparseEmbedding(values=array([1.6877]), indices=array([613153351])), 
  SparseEmbedding(values=array([1.6786, 1.6786, 1.6786]), indices=array([613153351, 74040069, 1587029005]))
]

I just embedded two sentences with the BM25 model. The indices array identifies the words in each sentence: each index is a numeric ID for one token. But you might notice that the first sentence has only one index. That's because this model automatically filters out stop words (common, high-frequency words like "the", "is"), so only "hello" survives. Then what are the values?

The values are the pre-calculated term weights, that is, the term-frequency part of the BM25 equation including the length normalization: $\frac{TF \cdot (k_1 + 1)}{TF + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$

But this is pretty weird. Why does it pre-calculate these values? For the entire equation of the score, you still need the $\text{IDF}$ and $\text{avgdl}$, and both depend on the whole corpus. In fact, this is a trick that fastembed uses.

The secret lies in how Qdrant/bm25 is built. This is not a dynamic BM25 algorithm calculating stats from your specific pet clinic database at embedding time. Instead, it handles the two corpus-level values separately:

The $\text{IDF}$ : It is left out of the document embedding. The model is meant to be used with the IDF modifier on Qdrant's sparse vector index, which applies the IDF from the collection's own statistics at query time.
The $\text{avgdl}$ : A static constant passed as a model parameter (avg_len, 256 by default) instead of being measured from your corpus.

Given that $\text{avgdl}$ is treated as a fixed constant and the $\text{IDF}$ is deferred to the index, the term weights can be computed the moment you embed a document, without looking at any other document. That is also why the weights differ slightly between the two sentences above: the second one is longer. This is the efficient trick that the framework leverages.
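
As a sanity check, the printed values can be roughly reproduced by hand. This is a sketch under my own assumptions: $k_1 = 1.2$, $b = 0.75$, and an average length of 256 are what I understand fastembed's defaults to be, and I assume stop-word removal leaves 1 token in the first sentence and 3 in the second. The helper `bm25_term_weight` is hypothetical, not a fastembed function.

```python
def bm25_term_weight(tf, doc_len, k=1.2, b=0.75, avg_len=256.0):
    """Term-frequency part of BM25 with length normalization (no IDF)."""
    return tf * (k + 1) / (tf + k * (1 - b + b * doc_len / avg_len))

# "Hello. Who are you?" -> assumed to keep 1 token after stop-word removal
w_one_token = bm25_term_weight(tf=1, doc_len=1)
# "Hello World, who the hell are you?" -> assumed to keep 3 tokens
w_three_tokens = bm25_term_weight(tf=1, doc_len=3)

print(w_one_token, w_three_tokens)
```

Both numbers land very close to the 1.6877 and 1.6786 printed above, which suggests the stored values contain the length-normalized term frequency only, with no IDF baked in.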

To test this, I downloaded the hotpot_qa dataset and embedded 1,000 rows of it using a BM25 scorer.

{'id': '5ac2a912554299218029dae8', 'question': 'Which band was founded first, Hole, the rock band that Courtney Love was a frontwoman of, or The Wolfhounds?', 'answer': 'The Wolfhounds', 'type': 'comparison', 'level': 'medium', 'supporting_facts': {'title': ['Courtney Love', 'Courtney Love', 'The Wolfhounds'], 'sent_id': [0, 2, 0]}, 'context': {'title': ["Nobody's Daughter", 'Courtney Love filmography', 'Patty Schemel', 'Beautiful Son', 'The Wolfhounds', 'Live Through This', 'Turpentine (song)', 'Miss World (song)', 'Softer, Softest', 'Courtney Love'], 'sentences': [["Nobody's Daughter is the fourth and final studio album by American alternative rock band Hole, released worldwide on April 27, 2010, through Mercury Records.", ' The album was originally conceived by Hole frontwoman Courtney Love as a solo project titled "How Dirty Girls Get Clean", following her poorly received solo debut "America's Sweetheart" (2004).', ' Much of the material featured on "Nobody's Daughter" originated from studio sessions for "How Dirty Girls Get Clean", which had been conceived in 2006 after a multitude of legal issues, drug addiction, and rehabilitation sentences had left Love "suicidal".', ' Love financed the making of the record herself, which cost nearly two million dollars.'], ['Courtney Love is an American ...

The dataset looks something like this. Now, I'm going to try retrieving data using the sample question.

query = "Which band was founded first, Hole, the rock band that Courtney Love was a frontwoman of, or The Wolfhounds?"
retrieved_results = store.search(query, top_k=3)

for i, r in enumerate(retrieved_results):
    print(f"Retrieved #{i+1}: ", r, "\n")

output:

Retrieved #1:  {'score': 37.0975227355957, 'text': 'Miss World (song). "Miss World" is a song by American alternative rock band Hole, written by frontwoman Courtney Love and lead guitarist Eric Erlandson.  The song was released as the band\'s fifth single and the first from their second studio album, "Live Through This", in March 1994.', 'metadata': {'title': 'Miss World (song)', 'source': 'wikipedia_extract', 'original_doc_id': '5ac2a912554299218029dae8_Miss_World_(song)'}}

Retrieved #2:  {'score': 35.72223663330078, 'text': 'Courtney Love. Courtney Michelle Love (born Courtney Michelle Harrison; July 9, 1964) is an American singer, songwriter, actress, and visual artist.  Prolific in the punk and grunge scenes of the 1990s, Love has enjoyed a career that spans four decades.  She rose to prominence as the frontwoman of the alternative rock band Hole, which she formed in 1989.  Love has drawn public attention for her uninhibited live performances and confrontational lyrics, as well as her highly publicized personal life following her marriage to Kurt Cobain.', 'metadata': {'title': 'Courtney Love', 'source': 'wikipedia_extract', 'original_doc_id': '5ac2a912554299218029dae8_Courtney_Love'}}

Retrieved #3:  {'score': 34.49491882324219, 'text': 'Beautiful Son. "Beautiful Son" is a song by American alternative rock band Hole, written collectively by frontwoman Courtney Love, lead guitarist Eric Erlandson and drummer Patty Schemel.  The song was released as the band\'s fourth single in April 1993 on the European label City Slang.  To coincide with the song\'s lyrics, Love used a photograph of her husband, Kurt Cobain, at age 7 as the single\'s artwork.', 'metadata': {'title': 'Beautiful Son', 'source': 'wikipedia_extract', 'original_doc_id': '5ac2a912554299218029dae8_Beautiful_Son'}}

As you can see, the RAG system failed to retrieve the relevant data: none of the top results is about The Wolfhounds. Actually, this type of query is poorly suited for sparse RAG. As mentioned above, sparse RAG is a method that finds documents containing the exact words from a query. So, if you mention the word "dog" in a query, it can't match it with "pet" or "puppy."

Therefore, it is better to use sparse RAG like this:

retrieved_results = store.search("Wolfhounds", top_k=3)
for i, r in enumerate(retrieved_results):
    print(f"Retrieved #{i+1}: ", r, "\n")

output:

Retrieved #1:  {'score': 11.775125503540039, 'text': 'The Wolfhounds. The Wolfhounds are an indie pop/noise pop band formed in Romford, UK in 1985 by Dave Callahan, Paul Clark, Andy Golding, Andy Bolton and Frank Stebbing, and originally active until 1990.  The band reformed in 2005 and continues to write, record and play live, releasing new albums in 2014 and 2016.', 'metadata': {'title': 'The Wolfhounds', 'source': 'wikipedia_extract', 'original_doc_id': '5ac2a912554299218029dae8_The_Wolfhounds'}}

If you want the correct answer to the question—"Which band was founded first, Hole, the rock band that Courtney Love was a frontwoman of, or The Wolfhounds?"—it is better to search for individual keywords: "Hole", "Courtney Love", "Wolfhounds" and combine them in a query.

Cons of Sparse RAG

Sparse RAG has some significant weaknesses.

First, it is strictly an exact-match algorithm. It can't correct minor typos or recognize conceptually relevant words like LLMs do.

retrieved_results = store.search("dog", top_k=1)
print(retrieved_results)

>>> [{'score': 10.09609603881836, 'text': 'Salty dog (cocktail). A salty dog is a cocktail of tequila, or gin, and grapefruit juice, served in a highball glass with a salted rim.  The salt is the only difference between a salty dog and a greyhound.  Vodka may be used as a substitute for tequila; nevertheless, it is historically a tequila drink.', 'metadata': {'title': 'Salty dog (cocktail)', 'source': 'wikipedia_extract', 'original_doc_id': '5ac3ad225542995ef918c1da_Salty_dog_(cocktail)'}}]

retrieved_results = store.search("puppy", top_k=1)
print(retrieved_results)

>>> []

Second, some languages like Korean or Japanese don't separate a word from its postposition. If this were applied to English, "A boy is" would look like "A boyis". Therefore, these languages need a specific process called morphological analysis or morpheme separation before searching with sparse RAG.

There are other algorithms such as BM42 or SPLADE to make up for sparse RAG's limitations. However, they are not widely used due to their complexity. BM25 still remains the industry standard, and if you need a more precise and complex search tool, it is better to utilize the other RAG methods that I will explain below.

Dense RAG (Semantic Vector Search)

Due to the limitations of sparse RAG, dense RAG is the most widely used RAG method. When people refer to RAG, they usually mean dense RAG.

While sparse RAG is great for exact matches, Dense RAG is where the magic of "understanding" happens. By converting text into dense vectors (typically 768 or 1024 dimensions), we can find documents that are conceptually related, even if they share zero common words.

The method is simple: compute the distance between the query vector and the document vectors, and retrieve the closest top-$k$ documents. The relevance score (distance score) for each pair is cheap to compute, usually using the cosine similarity:

$$\text{sim}(q, d) = \frac{E_q \cdot E_d}{|E_q| |E_d|}$$

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain_ollama import OllamaEmbeddings

embedding_model = OllamaEmbeddings(model="qwen3-embedding:4b")
vector_dog = np.array(embedding_model.embed_query("dog")).reshape(1, -1)
vector_puppy = np.array(embedding_model.embed_query("puppy")).reshape(1, -1)
vector_cat = np.array(embedding_model.embed_query("cat")).reshape(1, -1)
vector_missile = np.array(embedding_model.embed_query("patriot missile")).reshape(1, -1)

cosine_similarity(vector_dog, vector_puppy)
>>> Out[15]: array([[0.84683586]])

cosine_similarity(vector_dog, vector_cat)
>>> Out[16]: array([[0.79599878]])

cosine_similarity(vector_dog, vector_missile)
>>> Out[17]: array([[0.54258399]])

Should we then compute all of the relevance scores and retrieve the exact K-Nearest Neighbors (KNN)? This exhaustive computation becomes brutal when there are many documents and the vector dimension is high. The computational complexity is $O(N \cdot d)$ per query, where $N$ is the total number of documents and $d$ is the vector dimensionality.
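
For reference, this is roughly what the exhaustive search looks like in NumPy. It is a minimal sketch on random data, not how any vector DB implements it; `brute_force_knn` is a name I made up. The single matrix-vector product is where the $N \cdot d$ multiplications happen.

```python
import numpy as np

def brute_force_knn(query, doc_vectors, k=3):
    """Exact top-k by cosine similarity: one dot product per document."""
    q = query / np.linalg.norm(query)
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    sims = d @ q                       # N dot products of length d
    top = np.argsort(-sims)[:k]        # indices of the k highest similarities
    return top, sims[top]

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 64))               # 1,000 fake document embeddings
query = doc_vectors[42] + 0.01 * rng.normal(size=64)    # a query very close to document 42

top_ids, top_sims = brute_force_knn(query, doc_vectors, k=3)
print(top_ids, top_sims)
```

This is perfectly fine for a few thousand vectors. The cost only becomes a problem when $N$ reaches millions and every query has to touch every vector.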

So, there are some alternative methods to the full KNN. This approach is called ANN (Approximate Nearest Neighbor) and these two algorithms are the most representative ANN methods: HNSW and IVF-PQ.

Hierarchical Navigable Small World (HNSW)

HNSW is currently the "gold standard" for most vector databases. It is a graph-based algorithm that builds a multi-layered structure of vectors.

How it works

Think of it like a "skip list" for graphs. The top layers have fewer points and long-distance links (for fast traversal across the data "map"), while the bottom layers have all the points and short-distance links (for precise local searching). You start at the top, zoom in to the right neighborhood, and move down a layer to refine your search.

HNSW Image from pinecone.io

How it builds layers

The process of building and deploying nodes is a bit complicated. I will explain this step by step.

Step 1) Layer 0 (the lowest layer) contains all inserted vectors. As you move to higher layers (Layer 1, Layer 2, etc.), the number of nodes decreases exponentially. How does it choose which nodes will remain? First, it rolls the dice for each node.

The maximum layer $l$ for a new node is determined by an exponentially decaying probability distribution:

$$l = \lfloor -\ln(u) \cdot m_L \rfloor$$

Where:

$u \sim U(0,1)$ is a uniformly distributed random number between 0 and 1.
$m_L$ is the Level Generation Multiplier.

The Role of $m_L$ and $M$
$m_L$ is mathematically tied to the hyperparameter $M$ (the maximum number of connections per node). The theoretical value for $m_L$ to minimize search complexity is:

$$m_L = \frac{1}{\ln(M)}$$

With this choice, the probability that a node reaches layer $k$ or higher is $M^{-k}$, so the number of nodes in each subsequent layer decreases by a factor of $M$ on average.
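
You can check this decay with a quick simulation. The sketch below only samples layers with the formula above; it does not build a graph. `sample_level` is a hypothetical helper, and $M = 16$ is just the value I use later in this post.

```python
import math
import random
from collections import Counter

def sample_level(M, rng):
    """Draw the maximum layer for a new node: floor(-ln(u) * m_L)."""
    m_L = 1.0 / math.log(M)
    u = 1.0 - rng.random()             # shift to (0, 1] so that log(0) never happens
    return math.floor(-math.log(u) * m_L)

rng = random.Random(0)
M = 16
n = 200_000
levels = Counter(sample_level(M, rng) for _ in range(n))

# Fraction of nodes that appear on Layer 1 or higher; should be close to 1/M
frac_layer1_plus = sum(count for lvl, count in levels.items() if lvl >= 1) / n
print(sorted(levels.items()))
print(frac_layer1_plus, 1 / M)
```

Almost every node stays on Layer 0 only, and roughly one in $M$ is promoted to Layer 1, one in $M^2$ to Layer 2, and so on.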

Step 2) Let's assume our new vector $\mathbf{v}$ was assigned a maximum layer $l = 2$. The graph currently has a maximum layer of $L = 4$.

The algorithm starts at the top (Layer 4) at a predefined entry point node. It evaluates the distance between $\mathbf{v}$ and the entry point's neighbors. It greedily jumps to the neighbor closest to $\mathbf{v}$, repeating this process until it reaches Layer 2.

Step 3) Now that the routing has brought us spatially close to where $\mathbf{v}$ belongs in Layer 2, the actual graph building begins.

From Layer 2 to Layer 0, the algorithm performs a local search to find the nearest neighbors to connect $\mathbf{v}$ to.

The algorithm maintains a dynamic list of the closest nodes it has found so far, capped at the size of $efConstruction$.
It continually explores the neighbors of the nodes in this queue. If it finds a closer node, it adds it to the queue.
Once the local area is fully explored, it selects up to $M$ nodes from the $efConstruction$ queue to create bidirectional edges with $\mathbf{v}$.
It prunes the worst edge if adding $\mathbf{v}$ causes an existing node to exceed its maximum allowed connection (usually $M$ for upper layers, $2M$ for Layer 0). But it doesn't just prune the furthest edge; it drops nodes that are clustered together, ensuring connections are spread out in different directions to maintain the "small world" navigability.
After connecting in Layer 2, it goes down to Layer 1, uses the best nodes found in Layer 2 as the new entry points, fills a new $efConstruction$ queue, and connects up to $M$ nodes again. This process repeats until Layer 0 is fully connected.
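
The greedy routing from Step 2 can be sketched on a single layer like this. It is a simplification: it keeps only one candidate at a time, whereas the real algorithm keeps a queue of $efConstruction$ candidates and repeats the search per layer. The toy graph (ten points on a line) and the name `greedy_search` are mine.

```python
import numpy as np

def greedy_search(vectors, neighbors, entry, query):
    """Hop to whichever neighbor is closest to `query` until no neighbor is closer."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        best, best_dist = current, current_dist
        for nb in neighbors[current]:
            d = np.linalg.norm(vectors[nb] - query)
            if d < best_dist:
                best, best_dist = nb, d
        if best == current:            # local minimum: no neighbor improves
            return current
        current, current_dist = best, best_dist

# Toy layer: ten points on a line, each linked to its left and right neighbor
vectors = np.array([[float(i), 0.0] for i in range(10)])
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

nearest = greedy_search(vectors, neighbors, entry=0, query=np.array([6.2, 0.0]))
print(nearest)  # prints 6
```

On this toy graph the search has to walk node by node. The long-distance links in the upper layers exist precisely to skip most of those hops.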

Configuring HNSW (Hyperparameter tuning)

$M$ (Typical range: 16 to 64): Controls the number of bidirectional links per node and dictates the layer density.
- Trade-off: A higher $M$ yields better accuracy, but drastically increases RAM usage and insertion time.
$efConstruction$ (Typical range: 100 to 500): Controls the depth of the search during insertion.
- Trade-off: A higher $efConstruction$ builds a significantly higher-quality graph, but the penalty is a linear increase in index build time. It does not affect query latency.
$efSearch$ (Typical range: 50 to 200): The equivalent of $efConstruction$, but used purely at query time.
- Trade-off: Controls the speed vs. recall trade-off for your users. A higher $efSearch$ improves recall but makes each query slower.

Note: When my company first introduced a RAG system, our Cloud Service Provider presented their tips and technical know-how of HNSW. In that presentation, they recommended setting $M$ to 16 and both $efConstruction$ and $efSearch$ to 128. They told us that these values are the optimal balance considering all factors including memory usage, latency, and recall. According to our internal evaluations, that turned out to be true, but I don't think this is the absolute standard. Just consider this as a useful tip; you need to test it with your own data.

The query complexity of HNSW scales logarithmically, $O(\log n)$, facilitating exceptionally fast retrieval. However, this demands a massive memory footprint, as the system must continuously store complex adjacency lists and bidirectional edge pointers in RAM.

Inverted File Index (IVF) & Product Quantization (PQ)

IVF is a clustering-based approach, often paired with Product Quantization (PQ) to save memory. It consumes less memory compared to HNSW, but requires a training process.

IVF (Inverted File Index) — The Macro Partition

IVF partitions the high-dimensional space into distinct regions (Voronoi cells) and only searches the regions closest to the query.

How it works

Training (K-Means): During index initialization, a clustering algorithm (typically K-means) is run across a representative sample of your dataset to find $C$ cluster centers (centroids).
Assignment: Every vector in your database is assigned to its nearest centroid. The index builds an Inverted List: a dictionary mapping each centroid to the IDs of the vectors assigned to it.
Querying: When a query vector $\mathbf{q}$ arrives, it calculates the distance from $\mathbf{q}$ to each of the $C$ centroids. Then, it picks only the $nprobe$ nearest centroids and computes distances to only the vectors residing in those specific cells.
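
The three steps above can be sketched in NumPy as follows. This is a toy version under my own assumptions (random data, a tiny hand-rolled K-means, Euclidean distance, 20 cells), not how Faiss or any vector DB implements IVF. All function names are hypothetical.

```python
import numpy as np

def kmeans(data, n_clusters, n_iter, rng):
    """Training: a very small K-means to find the centroids."""
    centroids = data[rng.choice(len(data), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = data[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

def build_ivf(data, centroids):
    """Assignment: map each centroid ID to the IDs of the vectors in its cell."""
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    return {c: np.flatnonzero(assign == c) for c in range(len(centroids))}

def ivf_search(query, data, centroids, inverted_lists, nprobe, k):
    """Querying: scan only the vectors in the nprobe nearest cells."""
    cells = np.linalg.norm(centroids - query, axis=1).argsort()[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in cells])
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[dists.argsort()[:k]]

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 16)).astype(np.float32)
centroids = kmeans(data, n_clusters=20, n_iter=10, rng=rng)
inverted_lists = build_ivf(data, centroids)

query = data[7] + 0.001                # a query almost identical to vector 7
result = ivf_search(query, data, centroids, inverted_lists, nprobe=3, k=5)
print(result)
```

With $nprobe = 3$ out of 20 cells, only a fraction of the 2,000 vectors is scanned. Raising $nprobe$ scans more cells, which improves recall at the cost of speed.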

PQ (Product Quantization) — The Micro Compression

While IVF reduces the number of vectors it searches, PQ reduces the size of the vectors themselves.

How it works

Splitting: A high-dimensional vector $\mathbf{x} \in \mathbb{R}^D$ is split into $m$ equal-sized sub-vectors. For example, if $m = 8$ and the vector dimension is 1,024, PQ splits each vector into 8 sub-vectors of 128 dimensions each.
Sub-Clustering: For each of the $m$ sub-spaces, it runs K-Means to find sub-centroids. (Usually, the number of sub-centroids for each sub-space is set to 256, so that the centroid ID fits in a single 8-bit byte.)
Encoding: Every sub-vector is replaced by the ID (0-255) of its nearest sub-centroid.
- Memory Saving: A 1024-dim float32 vector (4096 bytes) is compressed into $m$ bytes. If $m = 8$, it is 8 bytes, achieving a 512x compression ratio.
Querying: When a query vector $\mathbf{q}$ arrives, it is split into $m$ sub-vectors. For each query sub-vector $\mathbf{q}_i$, it calculates the distance to all 256 sub-centroids in that sub-space and stores them in a lookup table. Therefore, $256 \times m$ distances are calculated for each query.
To calculate the distance between $\mathbf{q}$ and any compressed vector $\mathbf{x}$, we simply sum the pre-computed distances from the lookup table using the stored centroid IDs:

$$d(\mathbf{q}, \mathbf{x}) \approx \sum_{i=1}^m d(\mathbf{q}_{i}, c_{i, j_{i}})$$

(Where $c_{i, j_{i}}$ is the sub-centroid assigned to the $i$-th sub-vector of $\mathbf{x}$.)

Image from pinecone.io

For example, let's assume $m = 8$ , the vector dimensionality is 1024, and we have a dataset containing 100,000 vectors.

Without PQ (Flat Search): You must run 100,000 distance calculations between 1024-dimension vectors. That requires 102,400,000 multiplications.
With PQ: You would run 2,048 (256 * 8) distance computations to build the lookup table. Then, without any multiplications, you merely reference the lookup table 100,000 $\times$ 8 times. Not only are fewer computations required, but the file size of the vector DB is also reduced significantly.
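
Here is a small NumPy sketch of the whole PQ flow: train sub-centroids, encode, and search with the lookup table. To keep it fast, I shrink the numbers (32 dimensions, $m = 4$, 16 sub-centroids instead of 256) and use squared Euclidean distance, because it splits cleanly into a sum over sub-spaces. These are my choices for the demo, and the function names are hypothetical.

```python
import numpy as np

def train_codebooks(data, m, n_centroids, n_iter, rng):
    """Sub-Clustering: run a small K-means in each of the m sub-spaces."""
    sub_dim = data.shape[1] // m
    codebooks = []
    for i in range(m):
        sub = data[:, i * sub_dim:(i + 1) * sub_dim]
        cents = sub[rng.choice(len(sub), n_centroids, replace=False)].copy()
        for _ in range(n_iter):
            assign = ((sub[:, None, :] - cents[None]) ** 2).sum(-1).argmin(1)
            for c in range(n_centroids):
                if np.any(assign == c):
                    cents[c] = sub[assign == c].mean(0)
        codebooks.append(cents)
    return np.stack(codebooks)         # shape: (m, n_centroids, sub_dim)

def encode(data, codebooks):
    """Encoding: replace every sub-vector with the ID of its nearest sub-centroid."""
    m, _, sub_dim = codebooks.shape
    codes = np.empty((len(data), m), dtype=np.uint8)
    for i in range(m):
        sub = data[:, i * sub_dim:(i + 1) * sub_dim]
        codes[:, i] = ((sub[:, None, :] - codebooks[i][None]) ** 2).sum(-1).argmin(1)
    return codes

def adc_distances(query, codes, codebooks):
    """Querying: approximate squared distances from `query` to all encoded vectors."""
    m, _, sub_dim = codebooks.shape
    q_sub = query.reshape(m, sub_dim)
    # Lookup table: distance from each query sub-vector to each sub-centroid
    table = ((codebooks - q_sub[:, None, :]) ** 2).sum(-1)    # shape: (m, n_centroids)
    # Per stored vector: m table lookups and a sum, no further multiplications
    return table[np.arange(m), codes].sum(axis=1)

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 32)).astype(np.float32)
codebooks = train_codebooks(data, m=4, n_centroids=16, n_iter=5, rng=rng)
codes = encode(data, codebooks)

query = rng.normal(size=32).astype(np.float32)
approx = adc_distances(query, codes, codebooks)

print(codes.shape, codes.dtype)
print(data.nbytes // codes.nbytes)     # compression ratio for this toy setup
```

In this toy setup each vector shrinks from 128 bytes to 4 bytes. The price is that `approx` is the distance to the reconstructed vector, not to the original one, which is where the recall loss of PQ comes from.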

HNSW vs. IVF-PQ

Method	Inference Time (Latency)	RAM consumption	Index Build Time
HNSW	Very Low	High	High
IVF-PQ	Medium	Low	Medium (Requires Training)

Comparison image between HNSW and IVF-PQ from NVIDIA Technical Blog

In most cases, people tend to prefer HNSW. It is essentially plug-and-play; it doesn't require a training process. Also, its search is faster even when using only a CPU. So, it appears superior to IVF in almost every aspect.

But if the database becomes too large (over 100M vectors), HNSW will eat up too much RAM. Therefore, to save on massive RAM costs, IVF-PQ could be a better choice in this case.
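
To put the 100M figure in perspective, here is a back-of-the-envelope estimate. The assumptions are mine and deliberately rough: 1,024-dim float32 vectors, $M = 16$ with up to $2M$ links per node on Layer 0 stored as 4-byte IDs, and PQ codes with $m = 8$. Upper-layer links, centroid tables, and any other overhead are ignored, so treat the result as an order of magnitude only.

```python
n_vectors = 100_000_000
dim = 1024
M = 16        # HNSW links parameter
m_pq = 8      # number of PQ sub-vectors

raw_vector_bytes = n_vectors * dim * 4      # float32 = 4 bytes per dimension
hnsw_link_bytes = n_vectors * 2 * M * 4     # Layer 0 only: up to 2M neighbors, 4-byte IDs
pq_code_bytes = n_vectors * m_pq            # one byte per sub-vector

GB = 1e9
print(f"HNSW (raw vectors + Layer 0 links): {(raw_vector_bytes + hnsw_link_bytes) / GB:.1f} GB")
print(f"PQ codes only:                      {pq_code_bytes / GB:.1f} GB")
```

Under these assumptions the full-precision HNSW index needs hundreds of gigabytes of RAM, while the PQ codes fit in under one gigabyte. That gap is the reason IVF-PQ becomes attractive at this scale.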

A Real Example of Using Dense RAG and Vector Search

I will demonstrate dense RAG and a vector search DB using HNSW. I'm going to input the entire transcript of the series "Demon Slayer" Seasons 1 through 4. (I think it would be boring to just use the typical data scattered across the internet!)

First, you need to build the client and DB. I will use the Qdrant framework. There are a bunch of RAG frameworks, and they all have their own pros and cons. So, you should find out which framework best fits your project.

import os

import docx
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

# Text splitter: It splits text into several chunks.
# You can also set the overlap window to maintain context across chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150, # 10% overlap
    length_function=len,
)

embedding_dimension = 1024

# Accumulators for the chunk texts and their metadata
docs_texts = []
docs_metadata = []

# `files` is the list of .docx paths collected from the input directory (not shown here)

for filepath in files:
    try:
        document = docx.Document(filepath)
        doc_content = []

        # Use user provided logic to extract tables
        for table in document.tables:
            for row in table.rows:
                for idx, cell in enumerate(row.cells):
                    if (idx == 0 or cell.text != row.cells[0].text) and cell.text:
                        doc_content.append(cell.text)

        content = "\n".join(doc_content).strip()

        if content:
            # Split content into smaller chunks with overlap
            chunks = text_splitter.split_text(content)
            for chunk_idx, chunk in enumerate(chunks):
                docs_texts.append(chunk)
                docs_metadata.append({
                    "source": filepath,
                    "filename": os.path.basename(filepath),
                    "chunk_index": chunk_idx,
                    "page_content": chunk # Store chunk text here
                })
    except Exception as e:
        print(f"Warning: Could not read file {filepath} - {e}")

embeddings = OllamaEmbeddings(model=model_name, dimensions=embedding_dimension) # I'm using the "qwen3-embedding:4b" model

vectors = embeddings.embed_documents(docs_texts)

Next, I build the Qdrant vector DB using HNSW. I set $M = 16$ and $efConstruction$ to 128.

client = QdrantClient(path=output_dir)
collection_name = "semantic_rag_demon_slayer_collection"

client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=embedding_dimension,
        distance=models.Distance.COSINE,
        hnsw_config=models.HnswConfigDiff(
            m=16,
            ef_construct=128, 
            full_scan_threshold=0,  # Always use HNSW search, never fall back to brute force
        ),
    ),
    optimizers_config=models.OptimizersConfigDiff(
        indexing_threshold=100,  # Low threshold to force HNSW index build on small data
    ),
)

Next, I need to combine the vectors and metadata using PointStruct and then insert them into the DB.

import uuid

points = []
for vector, meta in zip(vectors, docs_metadata):
    point_id = str(uuid.uuid4())
    points.append(
        models.PointStruct(
            id=point_id,
            vector=vector,
            payload=meta
        )
    )

client.upsert(
    collection_name=collection_name,
    points=points
)

uv run semantic_rag/embed_docs.py --input_dir database/demon_slayer --output_dir qdrant_db/vector_db/demon_slayer

---

Created new collection: semantic_rag_demon_slayer_collection
Processing document: database/demon_slayer\Demon Slayer _ S.1 E.01 (ENG sub).docx
Loaded 13 documents. Initializing the embedding model 'qwen3-embedding:4b'...
Generating embeddings... this might take a while.
Successfully saved the document: database/demon_slayer\Demon Slayer _ S.1 E.01 (ENG sub).docx
Processing document: database/demon_slayer\Demon Slayer _ S.1 E.02 (ENG sub).docx
Loaded 12 documents. Initializing the embedding model 'qwen3-embedding:4b'...
Generating embeddings... this might take a while.
Successfully saved the document: database/demon_slayer\Demon Slayer _ S.1 E.02 (ENG sub).docx

...

Successfully saved the document: database/demon_slayer\Demon Slayer _ S.4 E.07 (ENG sub).docx
Processing document: database/demon_slayer\Demon Slayer _ S.4 E.08 (ENG sub).docx
Loaded 13 documents. Initializing the embedding model 'qwen3-embedding:4b'...
Generating embeddings... this might take a while.
Successfully saved the document: database/demon_slayer\Demon Slayer _ S.4 E.08 (ENG sub).docx
Successfully vectorized and saved all documents locally.

I successfully saved all the documents in the vector DB.

Now, we need to test this with an LLM to see if it retrieves the relevant information accurately.

You can retrieve the vectors in Qdrant this way:

embeddings = OllamaEmbeddings(model=model_name, dimensions=1024)

query_vector = embeddings.embed_query(query)

client = QdrantClient(path=db_dir)
collection_name = "semantic_rag_demon_slayer_collection"

results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    limit=top_k,
    search_params=models.SearchParams(hnsw_ef=128),
).points

retrieved_context = [res.payload.get("page_content", "") for res in results if res.payload]

Now, let's see whether it retrieves the Demon Slayer knowledge accurately.

top_k = 5 # retrieve 5 vectors
query = "Why did the Master ask Himejima to kill Muzan?"
query_vector = embeddings.embed_query(query)

results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    limit=top_k,
    search_params=models.SearchParams(hnsw_ef=128),
).points

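# Rebuild the context from the new results before printing it
retrieved_context = [res.payload.get("page_content", "") for res in results if res.payload]

for r in retrieved_context:
    print(r)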

...
(Flashback) Use me as bait… and cut off… Muzan’s head.
Himejima
(Flashback) What makes you think that?
Kagaya
(Flashback) Fufu… Just my intuition. That’s all. No reason.
Himejima
(Thoughts) Along with his special voice, what he called “intuition” was prodigious among the Ubuyashiki clan.
(Thoughts) It’s also known as “foresight”. The power to see into the future. Using this, they built up their fortune and avoided crises many times over.
Kagaya
(Flashback) The other children… won’t agree… to using me as bait.
(Flashback) You’re the only one that I can ask… Gyoumei.
Himejima
(Flashback) Understood. If that is your wish, Master.
Kagaya
(Flashback) Thank you.
...

It successfully retrieved the relevant documents from the DB. The chunk shown above was actually ranked second. The top-ranked chunk is somewhat relevant, but it doesn't include the right information to answer the question.

Next, before trying this with an LLM, I want to check if the GPT-5-nano model can answer questions about the series without any external information. Some models have the information about the series baked into their own weights.

I will ask these four questions:

"Why did Akaza want Kyojuro to become a demon?"
"Name all the Hashiras who entered the Infinity Castle"
"What advanced versions of Thunder Breathing the First Form Thunderclap and Flash can Zenitsu use?"
"What is the meaning of 'Musical Score' of Tengen?"

import os

from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

load_dotenv()

llm = ChatOpenAI(model="gpt-5-nano", api_key=os.getenv("OPENAI_API_KEY"))

template = """
Answer the query about the Japanese anime series Demon Slayer.
If you cannot find the answer, say that you do not know. Do not hallucinate.
Query:
{question}
"""

prompt = PromptTemplate(input_variables=["question"], template=template)
chain = prompt | llm

query_list = [
    "Why did Akaza want Kyojuro to become a demon?",
    "Name all the Hashiras who entered the Infinity Castle",
    "What advanced versions of Thunder Breathing the First Form Thunderclap and Flash can Zenitsu use?",
    "What is the meaning of 'Musical Score' of Tengen?"
    ]

answer_list = []
for query in query_list:
    response = chain.invoke({
        "question": query
    })

    answer_list.append(response.content)

for idx, answer in enumerate(answer_list):
    print(f"Answer #{idx+1}: ", answer)

Answer #1:  Akaza wanted Kyojuro Rengoku to become a demon because he was looking for a powerful human to recruit into the Twelve Demon Moons. He admired Rengoku’s exceptional fighting ability and believed that, as a demon, Rengoku would be stronger and capable of fighting forever, giving Akaza the opportunity to face and test the strongest opponent.

Answer #2:  I don’t know.

Answer #3: 'I don’t know. If you’d like, I can look up or summarize what’s officially stated about Thunder Breathing forms for Zenitsu.'

Answer #4:  I don’t know the meaning of a term called “Musical Score” related to Tengen. In official Demon Slayer material, Tengen Uzui’s fighting style is “Sound Breathing” (音の呼吸), and there’s no canon term or technique known as “Musical Score.” If you saw that phrase somewhere, it’s likely a fan translation or a metaphor/metonym for his sound/music motif rather than an official term. If you can share the source, I can help interpret it.

None of the answers is correct. The model declined three of the four questions, and the first answer is somewhat close, but Akaza didn't try to recruit anyone to the Twelve Demon Moons.

But what if the model has access to the RAG DB? Let's try it with the vector search.

answer_list_with_rag = []
for query in query_list:
    query_vector = embeddings.embed_query(query)

    results = client.query_points(
        collection_name=collection_name,
        query=query_vector,
        limit=top_k,
        search_params=models.SearchParams(hnsw_ef=128),
    ).points
    retrieved_context = [res.payload.get("page_content", "") for res in results if res.payload]
    answer = llm_inference.generate_answer(query=query, retrieved_context=retrieved_context)

    answer_list_with_rag.append(answer)

Answer #1:  Akaza offered to make Kyojurou a demon so that he could keep training for 100–200 years and become much stronger. He pointed out that humans age and die, implying demonhood would grant him the time to grow far stronger.

Answer #2:  - Tokitou (Muichiro Tokito) – Mist Hashira
- Shinazugawa (Sanemi Shinazugawa) – Wind Hashira
- Iguro (Obanai Iguro) – Serpent Hashira
- Kanroji (Mitsuri Kanroji) – Love Hashira
- Shinobu (Shinobu Kocho) – Insect Hashira
- Tomioka (Giyu Tomioka) – Water Hashira
- Himejima (Gyomei Himejima) – Stone Hashira

Answer #3:  He can use Thunderclap and Flash: Godlike Speed, Sixfold, and Eightfold.

Answer #4:  It refers to the name of Tengen Uzui’s technique, a music-based fighting style. “Musical Score” (the Musical Score Technique) uses the idea of a musical score to guide and counter his moves—even turning the opponent’s Blood Demon Art into a “song” to read and deflect it.

Now it answers accurately. This is the power of RAG: even if the backbone model doesn't have enough knowledge about a certain topic, the RAG system can efficiently inject the required knowledge.
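
To make the "inject the required knowledge" step concrete, here is a minimal sketch of how retrieved chunks can be placed into a prompt. The function name build_rag_prompt and the prompt wording are assumptions for illustration; this is not the llm_inference.generate_answer helper used above, whose implementation may differ.

```python
from typing import List


def build_rag_prompt(question: str, retrieved_context: List[str]) -> str:
    """Place the retrieved chunks ahead of the question in a single prompt.

    Hypothetical helper: the wording of the instructions is an assumption.
    """
    # Number each chunk so the chunks stay distinguishable inside the prompt.
    context_block = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_context, start=1)
    )
    return (
        "Answer the query using only the context below.\n"
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Query:\n{question}\n"
    )


if __name__ == "__main__":
    # Placeholder chunks; in the pipeline above these would come from Qdrant.
    chunks = ["(placeholder chunk 1)", "(placeholder chunk 2)"]
    print(build_rag_prompt("Why did Akaza want Kyojuro to become a demon?", chunks))
```

The resulting string can be passed to any chat model. The model itself is unchanged; only the prompt now carries the retrieved knowledge.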

Limitations of Vector Search RAG

However, it is not a silver bullet. Basic RAG only compares semantic vectors; it doesn't understand the text at a deeper level, and it is not good at multi-hop inference.
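
As a rough illustration of what "comparing semantic vectors" means, the sketch below ranks chunks by cosine similarity with a brute-force scan. The three-dimensional vectors and chunk names are made up, and cosine similarity is assumed as the metric; a real vector DB such as Qdrant uses an approximate index (HNSW) rather than scanning every vector.

```python
import math
from typing import Dict, List, Tuple

# Toy embeddings (made up for illustration); real embeddings have ~1000 dimensions.
CHUNK_VECS: Dict[str, List[float]] = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.6, 0.8, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: List[float], chunk_vecs: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every chunk against the query and keep the k best."""
    scored = [
        (name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for name, score in top_k([0.8, 0.6, 0.0], CHUNK_VECS, k=2):
        print(f"{name}: {score:.2f}")
```

Nothing in this ranking reasons about the query. A chunk is returned only because its vector points in a similar direction to the query vector.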

For example, consider this query: "What does the main character do in the second most populous state of the U.S.?"

When the model receives this query, it has to know who the main character is and which state is the second most populous. Then, it must combine this information and search through the database. But as we know, basic RAG is not designed to perform this kind of multi-hop inference.

If I send the query "How did the Hashira who lost his brother to a demon manage to defeat one of the Twelve Demon Moons?" to the RAG, it returns entirely irrelevant documents. Given how vector search works, this is natural. The semantic vector search client only sees the embedding of the query; it can't work out who the Hashira that lost his brother is, or how he defeated the demon. There is only one character who fits this condition: Muichiro Tokito. But the embedding of "Muichiro Tokito" is not necessarily close to the embedding of "the Hashira who lost his brother to a demon". In fact, they might be far away from each other in the vector space.

query = "How did the Hashira who lost his brother to a demon manage to defeat one of the Twelve Demon Moons?"

"""
--- Retrieved Context Summary ---
[1] Source: database/demon_slayer\Demon Slayer _ S.4 E.02 (ENG sub).docx | Preview: I’m leaving now to be trained by the Wind Hashira. Shinobu I see. Kanao Can your training session wait until after the Stone Hashira’s? Shinobu I won’...
[2] Source: database/demon_slayer\Demon Slayer _ S.2 E.08 (ENG sub).docx | Preview: It’s just that… as he suffers from a skin disease, he can’t go outside during the day. Guest Oh dear, the poor thing. Father I was hoping that we coul...
[3] Source: database/demon_slayer\Demon Slayer _ S.1 E.24 (ENG sub).docx | Preview: And also, I’d like to entrust my dream to you. Tanjirou Dream? Shinobu Yes.  My dream that we can become friends with the demons. I’m quite sure you c...
[4] Source: database/demon_slayer\Demon Slayer _ S.1 E.23 (ENG).docx | Preview: Not to mention that the times have changed considerably in this era. Himejima Other than those who’ve had their loved ones brutally massacred and join...
[5] Source: database/demon_slayer\Demon Slayer _ S.1.5 – The Movie_ Mugen Train (ENG sub).docx | Preview: (Thoughts) Even though I’d taken 200 humans hostage, I still struggled! I was held at bay! Is this the power of a Hashira?  (Thoughts) And him… He was...

================ FINAL ANSWER ================

I do not know. The provided context mentions Shinobu Kocho losing her older sister Kanae to a demon and wanting to teach how to kill that demon, but it does not describe how she or any Hashira defeated a Twelve Demon Moon.

==============================================
"""

There are several ways to adapt RAG to overcome its limitations. A prime example is called GraphRAG, which utilizes Knowledge Graphs. Alternatively, you can combine several methods into a hybrid approach. I will cover some of these advanced RAG methods in the next post.
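
A full GraphRAG pipeline does not fit in a short snippet, but the underlying "resolve the entity first, search second" idea can be sketched with a toy lookup table. Everything here is hypothetical: DESCRIPTION_TO_ENTITY stands in for whatever would resolve the description in a real system (a knowledge graph or an LLM call), and rewrite_query is an illustrative name, not part of any library.

```python
from typing import Dict

# Hypothetical hop-1 table: maps an indirect description to a concrete entity.
# In a real system this resolution would come from a knowledge graph or an LLM.
DESCRIPTION_TO_ENTITY: Dict[str, str] = {
    "the Hashira who lost his brother to a demon": "Muichiro Tokito",
}


def rewrite_query(query: str, table: Dict[str, str]) -> str:
    """Hop 1: replace a known description with the entity it refers to.

    The rewritten query can then be embedded and sent to the vector search
    (hop 2). If no description matches, the query is returned unchanged.
    """
    lowered = query.lower()
    for description, entity in table.items():
        start = lowered.find(description.lower())
        if start != -1:
            end = start + len(description)
            return query[:start] + entity + query[end:]
    return query


if __name__ == "__main__":
    original = (
        "How did the Hashira who lost his brother to a demon "
        "manage to defeat one of the Twelve Demon Moons?"
    )
    print(rewrite_query(original, DESCRIPTION_TO_ENTITY))
```

The rewritten query names the character explicitly, so its embedding has a better chance of landing near chunks that mention him. Whether it actually does depends on the embedding model and the data.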

Conclusion

RAG is one of the most popular methods for retrieving information in LLM architectures. Sometimes you might need just simple sparse RAG, but other times you might need dense RAG or even more advanced RAG algorithms. You should consider your database, resources, backbone LLM, and use case to choose the method that best fits your project.

Coroutine series 3) Coroutines for LLM inference

Jun Bae — Wed, 22 Apr 2026 14:09:43 +0000

Introduction

In this post, I will briefly introduce how to utilize coroutines for LLMs. Using asyncio for LLM inference is straightforward because most AI frameworks support asyncio natively nowadays.

Coroutines for LLM

LLM API SDK

LLM API SDKs, such as Google GenAI, OpenAI, and Claude, provide coroutine functions. You can easily leverage coroutines for AI like this:

# Synchronous inference function of OpenAI SDK
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": prompt}],
)

# Coroutine function of OpenAI SDK
client = AsyncOpenAI()
response = await client.chat.completions.create(
    model="gpt-5-mini",
    messages=[{"role": "user", "content": prompt}],
)

# Synchronous inference function of Google GenAI SDK
client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-flash", contents=prompt
)

# Coroutine function of Google GenAI SDK
client = genai.Client()
response = await client.aio.models.generate_content(
    model="gemini-2.5-flash", contents=prompt
)

For the OpenAI SDK, there is a dedicated async client, AsyncOpenAI. For Google GenAI, you reach the async methods through client.aio. Other SDKs, such as Claude and Ollama, also support coroutine functions.

Example comparing async and sync functions

Input

import asyncio
import time
import os
import logging
from dotenv import load_dotenv
from google import genai
from openai import OpenAI, AsyncOpenAI

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%H:%M:%S')
logger = logging.getLogger(__name__)

load_dotenv()

# --- Sync Functions ---

def sync_chat_google(prompt: str):
    logger.info("Starting Sync Google GenAI request...")
    try:
        client = genai.Client()
        response = client.models.generate_content(
            model="gemini-2.5-flash", contents=prompt
        )
        logger.info("Finished Sync Google GenAI request.")
        return f"Google (Sync): {response.text[:50]}..."
    except Exception as e:
        logger.error(f"Sync Google Error: {e}")
        return f"Google (Sync) Error: {e}"

def sync_chat_openai(prompt: str):
    logger.info("Starting Sync OpenAI request...")
    try:
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": prompt}],
            reasoning_effort="low"
        )
        logger.info("Finished Sync OpenAI request.")
        return f"OpenAI (Sync): {response.choices[0].message.content[:50]}..."
    except Exception as e:
        logger.error(f"Sync OpenAI Error: {e}")
        return f"OpenAI (Sync) Error: {e}"

def run_sync():
    logger.info("--- Starting Sync Execution ---")
    start_time = time.time()

    prompt = "Explain asyncio in one sentence."
    res_google = sync_chat_google(prompt)
    res_openai = sync_chat_openai(prompt)

    end_time = time.time()
    total_time = end_time - start_time

    print(f"\n[Sync Results]")
    print(res_google)
    print(res_openai)
    print(f"Total Sync Time: {total_time:.2f} seconds\n")
    return total_time

# --- Async Functions ---

async def async_chat_google(prompt: str):
    logger.info("Starting Async Google GenAI request...")
    try:
        client = genai.Client()
        response = await client.aio.models.generate_content(
            model="gemini-2.5-flash", contents=prompt
        )
        logger.info("Finished Async Google GenAI request.")
        return f"Google (Async): {response.text[:50]}..."
    except Exception as e:
        logger.error(f"Async Google Error: {e}")
        return f"Google (Async) Error: {e}"

async def async_chat_openai(prompt: str):
    logger.info("Starting Async OpenAI request...")
    try:
        client = AsyncOpenAI()
        response = await client.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": prompt}],
            reasoning_effort="low"
        )
        logger.info("Finished Async OpenAI request.")
        return f"OpenAI (Async): {response.choices[0].message.content[:50]}..."
    except Exception as e:
        logger.error(f"Async OpenAI Error: {e}")
        return f"OpenAI (Async) Error: {e}"

async def run_async():
    logger.info("--- Starting Async Execution ---")
    start_time = time.time()

    prompt = "Explain asyncio in one sentence."
    # Schedule both calls concurrently
    async with asyncio.TaskGroup() as tg:
        task1 = tg.create_task(async_chat_google(prompt))
        task2 = tg.create_task(async_chat_openai(prompt))

    results = [task1.result(), task2.result()]

    end_time = time.time()
    total_time = end_time - start_time

    print(f"\n[Async Results]")
    for res in results:
        print(res)
    print(f"Total Async Time: {total_time:.2f} seconds\n")
    return total_time

async def main():
    logger.info("Starting Comparison...")

    sync_time = run_sync()
    async_time = await run_async()

    print("-" * 30)
    print(f"Sync Time:  {sync_time:.2f}s")
    print(f"Async Time: {async_time:.2f}s")
    if async_time < sync_time:
        print(f"Conclusion: Async was {sync_time / async_time:.2f}x faster!")
    else:
        print("Conclusion: Async was not faster (overhead or network variance).")

if __name__ == "__main__":
    asyncio.run(main())

Output

00:16:17 - INFO - Starting Comparison...
00:16:17 - INFO - --- Starting Sync Execution ---
00:16:17 - INFO - Starting Sync Google GenAI request...
00:16:18 - INFO - AFC is enabled with max remote calls: 10.
00:16:24 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent "HTTP/1.1 200 OK"
00:16:24 - INFO - Finished Sync Google GenAI request.
00:16:24 - INFO - Starting Sync OpenAI request...
00:16:28 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
00:16:28 - INFO - Finished Sync OpenAI request.

[Sync Results]
Google (Sync): Asyncio is Python's library for writing concurrent...
OpenAI (Sync): asyncio is Python's library for writing concurrent...
Total Sync Time: 10.36 seconds

00:16:28 - INFO - --- Starting Async Execution ---
00:16:28 - INFO - Starting Async Google GenAI request...
00:16:28 - INFO - AFC is enabled with max remote calls: 10.
00:16:28 - INFO - Starting Async OpenAI request...
00:16:31 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
00:16:31 - INFO - Finished Async OpenAI request.
00:16:35 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent "HTTP/1.1 200 OK"
00:16:35 - INFO - Finished Async Google GenAI request.

[Async Results]
Google (Async): asyncio is Python's library for writing concurrent...
OpenAI (Async): asyncio is Python's standard-library framework for...
Total Async Time: 6.91 seconds

------------------------------
Sync Time:  10.36s
Async Time: 6.91s
Conclusion: Async was 1.50x faster!

As you can see, the async coroutine functions are 1.5x faster because they call the OpenAI API and the Google API concurrently, so the total time is close to that of the slower call. On the other hand, the synchronous functions call the Google GenAI API first, wait for the response, and only then call the OpenAI API, so the two latencies add up.
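
The same effect can be reproduced without API keys. The sketch below replaces the real API calls with asyncio.sleep, so fake_llm_call and its latencies are stand-ins, not real SDK calls. It uses asyncio.gather instead of TaskGroup only so that it also runs on Python versions older than 3.11.

```python
import asyncio
import time


async def fake_llm_call(name: str, latency: float) -> str:
    """Stand-in for an async SDK call; sleeping simulates network wait."""
    await asyncio.sleep(latency)
    return f"{name} done"


async def run_sequential() -> float:
    """Await one call after the other: total is roughly the sum of latencies."""
    start = time.perf_counter()
    await fake_llm_call("A", 0.3)
    await fake_llm_call("B", 0.2)
    return time.perf_counter() - start


async def run_concurrent() -> float:
    """Run both calls at once: total is roughly the slowest latency."""
    start = time.perf_counter()
    await asyncio.gather(fake_llm_call("A", 0.3), fake_llm_call("B", 0.2))
    return time.perf_counter() - start


if __name__ == "__main__":
    sequential = asyncio.run(run_sequential())
    concurrent = asyncio.run(run_concurrent())
    print(f"Sequential: {sequential:.2f}s")
    print(f"Concurrent: {concurrent:.2f}s")
```

Note that awaiting coroutines one after another inside an async function is still sequential. The speedup comes from scheduling them together, not from the async keyword itself.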

Sync vs Async

When you build complex AI inference architectures, such as Agentic AI, you should think carefully about whether to utilize async. It depends on the specific requirements of your architecture.

Sequential Dependencies: When some inferences depend on the previous ones—for example, if a previous step retrieves necessary context (like in RAG), or the result of one inference must be included in the prompt of the subsequent inference—you generally execute them sequentially.
Independent Tasks: When you run independent inferences that do not rely on each other, you can leverage coroutines to run them concurrently.

Example

import asyncio
import time
import os
import logging
from dotenv import load_dotenv
from typing import List

from langchain_google_genai import ChatGoogleGenerativeAI
import ollama

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', datefmt='%H:%M:%S')
logger = logging.getLogger(__name__)

load_dotenv()

# --- 1. Async Retrieval Simulation ---

async def retrieve_docs_db(query: str) -> str:
    """Simulates retrieving documents from a database (IO bound)."""
    logger.info(f"[DB] Searching for: '{query}'...")
    await asyncio.sleep(2)  # Simulate network/DB latency
    result = "Context from DB: Asyncio is single-threaded but concurrent."
    logger.info(f"[DB] Found: {result}")
    return result

async def retrieve_docs_web(query: str) -> str:
    """Simulates retrieving documents from the web (IO bound)."""
    logger.info(f"[Web] Searching for: '{query}'...")
    await asyncio.sleep(2)  # Simulate network latency
    result = "Context from Web: Asyncio uses an event loop to manage tasks."
    logger.info(f"[Web] Found: {result}")
    return result

# --- 2. Sync Ollama Inference ---

def query_ollama(context: List[str], question: str) -> str:

    combined_context = "\n".join(context)
    prompt = f"Context:\n{combined_context}\n\nQuestion: {question}\n\nAnswer:"

    logger.info("Starting Ollama generation...")
    try:
        client = ollama.Client() 
        model = "qwen3:8b" 

        response = client.chat(model=model, messages=[
            {'role': 'user', 'content': prompt},
        ])

        answer = response['message']['content']
        logger.info("Finished Ollama generation.")
        return answer
    except Exception as e:
        logger.error(f"Ollama Error: {e}")
        return f"Ollama Error: {e}"

# --- 3. Sync Google Refinement ---

def refine_with_google(draft_answer: str) -> str:
    logger.info("Starting Google Refinement...")
    try:
        prompt = f"Please refine and polish the following text to be more professional:\n\n{draft_answer}"

        genai_llm = ChatGoogleGenerativeAI(
            model="gemini-3-flash-preview",
            thinking_level="low"
        )
        response = genai_llm.invoke(prompt)
        logger.info("Finished Google Refinement.")
        return response.text
    except Exception as e:
        logger.error(f"Google Refinement Error: {e}")
        return f"Google Error: {e}"

# --- Main Flow ---

async def main():
    start_total = time.time()
    query = "How does asyncio work?"

    print(f"\n=== Starting Concurrent RAG Demo ===")
    print(f"Query: {query}\n")

    # Step 1: Concurrent Retrieval
    logger.info("--- Step 1: Concurrent Retrieval ---")
    start_retrieval = time.time()

    # Launch both retrievals at the same time
    async with asyncio.TaskGroup() as tg:
        task1 = tg.create_task(retrieve_docs_db(query))
        task2 = tg.create_task(retrieve_docs_web(query))
    results = [task1.result(), task2.result()]

    retrieval_time = time.time() - start_retrieval
    logger.info(f"Retrieval complete in {retrieval_time:.2f}s")

    # Step 2: Sync Ollama Inference (Dependent on Step 1)
    logger.info("--- Step 2: Sync Ollama Inference ---")
    start_ollama = time.time()

    draft_answer = query_ollama(results, query)

    ollama_time = time.time() - start_ollama
    print(f"\n[Ollama Draft]:\n{draft_answer}\n")

    # Step 3: Sync Google Refinement (Dependent on Step 2)
    logger.info("--- Step 3: Sync Google Refinement ---")
    start_google = time.time()

    refined_answer = refine_with_google(draft_answer)

    google_time = time.time() - start_google
    print(f"\n[Google Refined]:\n{refined_answer}\n")

    total_time = time.time() - start_total

    print("=" * 40)
    print(f"Retrieval Time: {retrieval_time:.2f}s")
    print(f"Ollama Time:    {ollama_time:.2f}s")
    print(f"Google Time:    {google_time:.2f}s")
    print(f"Total Time:     {total_time:.2f}s")
    print("=" * 40)

if __name__ == "__main__":
    asyncio.run(main())

Output

=== Starting Concurrent RAG Demo ===
Query: How does asyncio work?

20:05:38 - INFO - --- Step 1: Concurrent Retrieval ---
20:05:38 - INFO - [DB] Searching for: 'How does asyncio work?'...
20:05:38 - INFO - [Web] Searching for: 'How does asyncio work?'...
20:05:40 - INFO - [DB] Found: Context from DB: Asyncio is single-threaded but concurrent.
20:05:40 - INFO - [Web] Found: Context from Web: Asyncio uses an event loop to manage tasks.
20:05:40 - INFO - Retrieval complete in 2.01s
20:05:40 - INFO - --- Step 2: Sync Ollama Inference ---
20:05:40 - INFO - Starting Ollama generation...
20:05:57 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"
20:05:57 - INFO - Finished Ollama generation.

[Ollama Draft]:
Asyncio is a Python library that enables **single-threaded concurrency** through **asynchronous I/O** and an **event loop**. Here's how it works:

### 1. **Single-Threaded, Concurrent Execution**
   - Asyncio operates in a **single thread**, avoiding the overhead of multi-threading. Instead of using multiple threads, it leverages **non-blocking I/O** to handle multiple tasks concurrently.
   - Concurrency here means tasks can **overlap** in execution, even though they run in the same thread. This is ideal for **I/O-bound tasks** (e.g., network requests, file reads) where waiting for I/O is common.

### 2. **Event Loop as the Core Mechanism**
   - The **event loop** is the heart of asyncio. It manages and schedules tasks, handling I/O operations and switching between coroutines when needed.
   - When a coroutine (an `async def` function) is started, the event loop schedules it. If the coroutine encounters an I/O operation (e.g., a network call), it **yields control** back to the event loop, allowing other tasks to run in the meantime.

### 3. **Coroutines and `await`**
   - Coroutines are defined using `async def`. They can be paused and resumed, enabling **cooperative multitasking**.
   - The `await` keyword is used to **delegate control** to another coroutine or I/O operation. While waiting, the event loop can process other tasks, ensuring efficient resource utilization.

### 4. **Non-Blocking I/O**
   - Asyncio avoids blocking the thread by using **asynchronous I/O**. For example, when a coroutine makes a network request, it doesn’t wait for the response to complete. Instead, it **registers a callback** with the event loop and continues executing other tasks. Once the I/O operation completes, the event loop resumes the coroutine.

### 5. **Task Scheduling and Collaboration**
   - The event loop manages **tasks** (coroutines) and ensures they run in a coordinated manner. It handles:
     - Scheduling coroutines to run.
     - Handling callbacks for completed I/O operations.
     - Managing exceptions and ensuring orderly execution.

### 6. **Use Cases**
   - Asyncio is best suited for **I/O-bound applications** (e.g., web servers, APIs, data scraping) where waiting for I/O is a bottleneck.
   - It is **not ideal** for CPU-bound tasks, which would benefit more from multiprocessing or threading.

### Summary
Asyncio achieves concurrency by using a **single thread** with an **event loop** to manage **non-blocking I/O** and **coroutines**. This allows multiple tasks to run **simultaneously** without blocking the main thread, making it efficient for handling many I/O operations in applications like web servers or network clients.

20:05:57 - INFO - --- Step 3: Sync Google Refinement ---
20:05:57 - INFO - Starting Google Refinement...
20:05:57 - INFO - AFC is enabled with max remote calls: 10.
20:06:06 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-3-flash-preview:generateContent "HTTP/1.1 200 OK"
20:06:06 - INFO - Finished Google Refinement.

[Google Refined]:
Here are three ways to refine the text, depending on the desired tone and context.

### Option 1: Formal & Technical (Best for Documentation or Whitepapers)
This version uses precise terminology and focuses on the architectural benefits of the library.

**Overview of Asyncio in Python**
Asyncio is a specialized Python library designed to facilitate **single-threaded concurrency** through **asynchronous I/O** and an **event loop** architecture. It is structured around the following core principles:

*   **Single-Threaded Concurrency:** Unlike traditional multi-threading, asyncio operates within a single thread. It leverages **non-blocking I/O** to manage multiple execution streams concurrently. This approach eliminates the overhead associated with context switching and thread management, making it highly efficient for **I/O-bound tasks**.
*   **The Event Loop:** Serving as the central orchestrator, the event loop manages task scheduling and I/O operations. When a coroutine encounters an I/O bottleneck, it **yields control** back to the loop, which then executes other pending tasks, ensuring optimal CPU utilization.
*   **Coroutines and Cooperative Multitasking:** Defined via the `async def` syntax, coroutines are the fundamental units of execution. Through the `await` keyword, these functions practice **cooperative multitasking**, pausing their execution to allow the event loop to process other operations until the awaited result is available.
*   **Non-Blocking Operations:** Asyncio prevents thread blocking by registering callbacks for I/O events. This allows the system to initiate a request—such as a network call—and immediately move to the next task, resuming the original coroutine only once the data is ready.
*   **Optimized Use Cases:** Asyncio is the industry standard for high-performance **I/O-bound applications**, including web servers, distributed systems, and real-time data streaming. Conversely, for CPU-intensive computations, multiprocessing remains the preferred parallelization strategy.

---

### Option 2: Professional & Concise (Best for a Presentation or Summary)
This version is streamlined for readability while maintaining a professional tone.

**Understanding Asyncio**
Asyncio enables Python developers to handle high-concurrency workloads within a single thread. By utilizing an event loop and non-blocking I/O, it provides a scalable alternative to traditional threading.

1.  **Concurrency without Threads:** By overlapping task execution rather than running them in parallel, asyncio avoids the memory overhead of multiple threads while efficiently handling thousands of simultaneous connections.
2.  **The Event Loop & Coroutines:** The event loop acts as a scheduler. Coroutines (`async def`) cooperatively yield control using `await`, allowing the loop to switch between tasks seamlessly whenever a program is waiting for external data.
3.  **Efficiency through Non-Blocking I/O:** Instead of halting execution during a network or file operation, asyncio registers the operation and continues with other work. The event loop resumes the paused task only after the I/O operation signals completion.
4.  **Strategic Application:** Asyncio is ideal for network-heavy applications like APIs and web scrapers. However, for CPU-bound tasks, developers should utilize multiprocessing to bypass the Global Interpreter Lock (GIL).

---

### Option 3: Modern & Direct (Best for a Technical Blog or Internal Memo)
This version is punchy and uses active language to explain the concepts.

**Asyncio: Scaling Python through Asynchronous I/O**
Asyncio is Python's built-in solution for **single-threaded concurrency**. It is designed to maximize resource efficiency by ensuring the CPU never sits idle while waiting for network or disk responses.

*   **How it Works:** At the core is the **Event Loop**, which manages **Coroutines** (asynchronous functions).
*   **Cooperative Scheduling:** Using the `await` keyword, a coroutine voluntarily pauses itself during I/O operations. This "cooperation" allows the event loop to rotate through other tasks, creating a highly responsive system.
*   **Non-Blocking Workflow:** By utilizing asynchronous I/O, the application can trigger multiple network requests simultaneously without blocking the main execution thread.
*   **When to use it:** Use Asyncio for I/O-heavy workloads (Web Servers, APIs, Database-heavy apps). Avoid it for heavy mathematical computations, where Multiprocessing is better suited.

**Summary:** Asyncio delivers high-performance concurrency by combining a single-threaded event loop with cooperative multitasking, making it an essential tool for modern, scalable Python development.

---

### Key Improvements Made:
*   **Vocabulary:** Changed "heart of" to "central orchestrator," and "ideal" to "industry standard" or "preferred strategy."
*   **Clarity:** Clarified the distinction between concurrency (overlapping tasks) and parallelism (simultaneous tasks).
*   **Precision:** Replaced general terms with more technical equivalents like "context switching," "resource utilization," and "cooperative multitasking."

========================================
Retrieval Time: 2.01s
Ollama Time:    17.12s
Google Time:    9.58s
Total Time:     28.71s
========================================

This is just a simple LLM example using LangChain (Google GenAI) and the Ollama (Qwen3:8b) SDK. As you can see, the RAG component is just a mock simulation for convenience.

This process can retrieve context from two sources and these tasks run concurrently. However, the subsequent LLM inference requires that context data, so it must run sequentially after the retrievals. Similarly, the refinement step must also be sequential, as it relies on refining the previous result.

Note: I used synchronous LLM clients for LangChain and Ollama since this is just a demonstration. In production-level development, however, you should implement these parts as coroutines as well. A production server may receive high traffic, and async clients let it handle multiple incoming requests efficiently while waiting for the LLM to respond.

AI Framework

Most AI frameworks, such as LangGraph, the OpenAI Agents SDK, Google AI SDKs, AutoGen, and CrewAI, build agents or workflows to be asynchronous by default.

But you must ensure that you use coroutine functions within your nodes or agents. Even though the frameworks themselves are asynchronous, performance can crawl if you inadvertently include blocking I/O functions in certain nodes.
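To make the cost of a blocking call concrete, here is a minimal sketch. It does not use any real framework API: `blocking_fetch`, `bad_node`, and `good_node` are hypothetical names, and `time.sleep` stands in for a synchronous SDK call. The only difference between the two nodes is whether the blocking call runs on the event loop thread or is offloaded with `asyncio.to_thread`.

```python
import asyncio
import time


def blocking_fetch(delay: float) -> str:
    """Stand-in for a synchronous client call (hypothetical)."""
    time.sleep(delay)
    return "data"


async def bad_node(delay: float) -> str:
    # Looks async, but the blocking call freezes the whole event loop.
    return blocking_fetch(delay)


async def good_node(delay: float) -> str:
    # Offload the blocking call so the event loop stays free.
    return await asyncio.to_thread(blocking_fetch, delay)


async def timed(node, n: int = 3, delay: float = 0.2) -> float:
    """Run n copies of a node concurrently and return the elapsed time."""
    start = time.perf_counter()
    await asyncio.gather(*(node(delay) for _ in range(n)))
    return time.perf_counter() - start


async def main() -> tuple[float, float]:
    bad = await timed(bad_node)    # roughly n * delay: the nodes run one by one
    good = await timed(good_node)  # roughly delay: the nodes overlap
    return bad, good


if __name__ == "__main__":
    bad, good = asyncio.run(main())
    print(f"blocking nodes: {bad:.2f}s, offloaded nodes: {good:.2f}s")
```

The three "bad" nodes are scheduled concurrently, yet they still take about three times as long, because each one holds the event loop until its blocking call returns.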

Conclusion

Implementing coroutines for LLMs is not that difficult. In fact, it is quite straightforward because most SDKs and frameworks now provide native support for coroutines. The most important aspect, however, is utilizing asyncio patterns correctly. Production-level projects can be complex, making it easy to misuse coroutines or introduce bottlenecks. Therefore, when building AI inference projects, you must carefully consider several factors: which parts should be asynchronous versus synchronous, when to offload tasks using to_thread, and what the optimal concurrency limit should be.

Coroutine series 2) Useful Asyncio Functions

Jun Bae — Wed, 22 Apr 2026 14:05:19 +0000

This is the second post in the series Coroutine, IO bound and Asyncio for AI.

Introduction

I explained coroutines and asyncio in the previous post: https://dev.to/jun07/coroutine-series-1-what-are-coroutine-asyncio-io-bound-2gh5

In this post, I’m going to show some functions of the asyncio library that are very useful for production-level development.

Coroutine functions

I briefly introduced the basic coroutine building blocks such as async, await, gather, and create_task. The asyncio library offers many more functions, and you can do a lot with these powerful tools.

`gather` and `TaskGroup`

I demonstrated how to use gather but I didn’t explain much about it. After you make some coroutine functions, you have to make them coordinate well. Often, several coroutine functions should be executed asynchronously. In that case, you can utilize the functions called gather or TaskGroup in asyncio. But are they exactly the same?

Actually, they have a very crucial difference. When one of the tasks in gather fails, the other tasks keep running. On the other hand, TaskGroup cancels the remaining tasks immediately. Look at this example.

import asyncio
import time

async def example_task(name, delay, should_fail=False):
    print(f"Task {name} starting (delay={delay}s)")
    await asyncio.sleep(delay)
    if should_fail:
        print(f"Task {name} failing")
        raise ValueError(f"Task {name} failed")
    print(f"Task {name} finished")
    return f"Result {name}"

async def demonstrate_gather():
    print("\n--- Demonstrating asyncio.gather ---")
    start_time = time.time()

    # Create tasks: B will fail
    tasks = [
        example_task("A", 1),
        example_task("B", 2, should_fail=True),
        example_task("C", 3)
    ]

    # asyncio.gather without return_exceptions=True raises the first exception immediately
    # BUT it does NOT cancel the other tasks. They keep running in the background.
    print("Starting gather with a failing task...")
    try:
        results = await asyncio.gather(*tasks)
    except Exception as e:
        print(f"Gather caught exception: {e}")

    await asyncio.sleep(2)
    print(f"Gather finished (after waiting). Note if other tasks continued printing above.")
    print(f"Gather took: {time.time() - start_time:.2f}s")

async def demonstrate_task_group():
    print("\n--- Demonstrating asyncio.TaskGroup ---")
    start_time = time.time()

    # TaskGroup cancels all other tasks immediately when one fails
    print("Starting TaskGroup with a failing task...")
    try:
        async with asyncio.TaskGroup() as tg:
            tg.create_task(example_task("X", 1))
            tg.create_task(example_task("Y", 2, should_fail=True))
            tg.create_task(example_task("Z", 3))

    except Exception as e:
        print(f"TaskGroup caught exception: {type(e).__name__}: {e}")

    print(f"TaskGroup finished. Other tasks should have been cancelled (no 'finished' print for Z).")
    print(f"TaskGroup took: {time.time() - start_time:.2f}s")

async def main():
    await demonstrate_gather()
    await demonstrate_task_group()

if __name__ == "__main__":
    asyncio.run(main())

Output

--- Demonstrating asyncio.gather ---
Starting gather with a failing task...
Task A starting (delay=1s)
Task B starting (delay=2s)
Task C starting (delay=3s)
Task A finished
Task B failing
Gather caught exception: Task B failed
Task C finished
Gather finished (after waiting). Note if other tasks continued printing above.
Gather took: 4.00s

--- Demonstrating asyncio.TaskGroup ---
Starting TaskGroup with a failing task...
Task X starting (delay=1s)
Task Y starting (delay=2s)
Task Z starting (delay=3s)
Task X finished
Task Y failing
TaskGroup caught exception: ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
TaskGroup finished. Other tasks should have been cancelled (no 'finished' print for Z).
TaskGroup took: 1.99s

As you can see, with gather the remaining tasks keep running even after one of them fails and the exception has been raised to the caller, while TaskGroup raises an exception and cancels the remaining tasks immediately.

Differences: `TaskGroup` vs. `gather`

Cancellation: In asyncio.gather(), if one task fails, the others continue running (unless explicitly cancelled). TaskGroup cancels siblings immediately upon the first failure, conserving resources.
Error Reporting: gather() returns a list of results or raises the first exception. TaskGroup raises an ExceptionGroup containing all exceptions that occurred, providing a complete picture of the failure state.

Why is `TaskGroup` favored over `gather`?

TaskGroup enforces a "parent-child" relationship where the parent cannot exit until all children are finished. This eliminates "orphaned tasks": background tasks that silently keep running (and consuming resources) after your main function has finished or crashed. That is why modern high-reliability Python architecture favors Structured Concurrency (asyncio.TaskGroup) over unstructured create_task + gather.

By design, TaskGroup operates on a "fail-fast" principle. If one task fails, the group immediately sends a cancellation signal to all other running tasks in that group.

But what if I don’t want to cancel the whole task group even if one task fails? Should I just use gather? In that case, you can still utilize TaskGroup by handling exceptions with try and except blocks inside the individual tasks.

Note: asyncio.TaskGroup is available in Python 3.11+. For older versions, you may need a third-party library or have to stick with gather.

Synchronization Primitives: `Queue` and `Event`

What if some tasks need to coordinate with each other at certain points, for example by handing over data or waiting for a signal? To do this, asyncio provides synchronization primitives.

asyncio.Queue (The Conveyor Belt): Used for Data Transfer. It allows you to pass data safely between tasks. It handles "backpressure" (slowing down the producer if the queue is full).
asyncio.Event (The Traffic Light): Used for Signaling. It holds a simple boolean state (True/False). Tasks "wait" for the light to turn green before proceeding.

asyncio.Queue

As you know, a Queue is a First-In-First-Out(FIFO) data structure. But what is asyncio.Queue then? It is designed to decouple tasks that produce work from tasks that perform work.

What is Queue used for?

Load Balancing: If the producer is faster than the consumer, the queue buffers the excess.
Flow Control: If you set a maxsize, the producer will pause (await put()) until the consumer clears space. This prevents memory explosions.

Before getting into the example code, here are some key methods:

Key Methods

await q.put(item): Add an item. Blocks if the queue is full.
await q.get(): Remove and return an item. Blocks if the queue is empty.
q.task_done(): Tells the queue "I finished processing the item I just got."
await q.join(): Blocks until all items in the queue have been processed (marked via task_done).

Example

Input

import asyncio
import logging
import random
import time

logging.basicConfig(
    level=logging.INFO,
    format='%(relativeCreated)d ms - [%(levelname)s] - %(message)s',
)


async def producer(name: str, queue: asyncio.Queue, items_to_produce: int):
    """Generates work items and puts them into the queue."""
    for i in range(items_to_produce):
        item = f"{name}-item-{i}"

        # Simulate varying production time
        await asyncio.sleep(random.uniform(0.2, 0.8))

        await queue.put(item)
        logging.info(f"[{name}] Added: {item} (Q size: {queue.qsize()})")

async def consumer(name: str, queue: asyncio.Queue):
    """Processes items from the queue continuously."""
    while True:
        # Waits here if queue is empty
        item = await queue.get()

        # Simulate processing work
        logging.info(f"[{name}] Processing {item}...")
        await asyncio.sleep(random.uniform(1.0, 2.0))

        # Signal that this specific item is fully processed
        queue.task_done()
        logging.info(f"[{name}] Finished {item}")

async def main():
    # maxsize=5 to easily demonstrate backpressure (producers waiting for consumers)
    q = asyncio.Queue(maxsize=5)

    logging.info("--- Starting Pipeline ---")

    # 1. Start the consumer as a background task
    # It runs forever until cancelled.
    consumers = asyncio.create_task(consumer("Consumer-0", q))

    # 2. Run Producers to completion
    # We use TaskGroup to wait for all producers to finish.
    async with asyncio.TaskGroup() as tg:
        tg.create_task(producer("Producer-A", q, 5))
        tg.create_task(producer("Producer-B", q, 5))
        # Total 10 items will be produced

    logging.info("--- All producers finished. Waiting for processing... ---")

    # 3. Wait for the queue to be fully processed
    # execution blocks here until q.task_done() is called for every item
    await q.join()

    logging.info("--- All work completed. helper tasks cancelled ---")

    # 4. Cancel the consumer tasks since they are permanent loops
    consumers.cancel()

if __name__ == "__main__":
    asyncio.run(main())

Output

48 ms - [INFO] - --- Starting Pipeline ---
608 ms - [INFO] - [Producer-A] Added: Producer-A-item-0 (Q size: 1)
609 ms - [INFO] - [Consumer-0] Processing Producer-A-item-0...
788 ms - [INFO] - [Producer-B] Added: Producer-B-item-0 (Q size: 1)
1079 ms - [INFO] - [Producer-A] Added: Producer-A-item-1 (Q size: 2)
1325 ms - [INFO] - [Producer-A] Added: Producer-A-item-2 (Q size: 3)
1325 ms - [INFO] - [Producer-B] Added: Producer-B-item-1 (Q size: 4)
1587 ms - [INFO] - [Producer-B] Added: Producer-B-item-2 (Q size: 5)
1850 ms - [INFO] - [Consumer-0] Finished Producer-A-item-0
1851 ms - [INFO] - [Consumer-0] Processing Producer-B-item-0...
1851 ms - [INFO] - [Producer-A] Added: Producer-A-item-3 (Q size: 5)
3511 ms - [INFO] - [Consumer-0] Finished Producer-B-item-0
3511 ms - [INFO] - [Consumer-0] Processing Producer-A-item-1...
3512 ms - [INFO] - [Producer-B] Added: Producer-B-item-3 (Q size: 5)
4822 ms - [INFO] - [Consumer-0] Finished Producer-A-item-1
4822 ms - [INFO] - [Consumer-0] Processing Producer-A-item-2...
4822 ms - [INFO] - [Producer-A] Added: Producer-A-item-4 (Q size: 5)
6038 ms - [INFO] - [Consumer-0] Finished Producer-A-item-2
6038 ms - [INFO] - [Consumer-0] Processing Producer-B-item-1...
6038 ms - [INFO] - [Producer-B] Added: Producer-B-item-4 (Q size: 5)
6038 ms - [INFO] - --- All producers finished. Waiting for processing... ---
7538 ms - [INFO] - [Consumer-0] Finished Producer-B-item-1
7538 ms - [INFO] - [Consumer-0] Processing Producer-B-item-2...
9468 ms - [INFO] - [Consumer-0] Finished Producer-B-item-2
9469 ms - [INFO] - [Consumer-0] Processing Producer-A-item-3...
10698 ms - [INFO] - [Consumer-0] Finished Producer-A-item-3
10699 ms - [INFO] - [Consumer-0] Processing Producer-B-item-3...
11836 ms - [INFO] - [Consumer-0] Finished Producer-B-item-3
11836 ms - [INFO] - [Consumer-0] Processing Producer-A-item-4...
12919 ms - [INFO] - [Consumer-0] Finished Producer-A-item-4
12919 ms - [INFO] - [Consumer-0] Processing Producer-B-item-4...
13927 ms - [INFO] - [Consumer-0] Finished Producer-B-item-4
13927 ms - [INFO] - --- All work completed. helper tasks cancelled ---

In the example above, the producers generate items and put them into the queue, and the consumer processes them. Because the maximum queue size is 5, the producers are suspended at await queue.put() whenever the queue is full, and they resume only when the consumer frees up space. Thus, you can utilize this pattern as an in-process Message Queue.

asyncio.Event

An Event is much simpler. It manages an internal flag that can be set to True or False. It is a broadcast mechanism: one event can wake up many waiting tasks. That’s why I called it a ‘Traffic light’.

What is Event used for?

Startup Sequences: "Don't start accepting HTTP requests until the Database connection is ready."
Graceful Shutdown: "Tell all background workers to stop loop processing."

Here are some key methods:

Key Methods

await event.wait(): Blocks the task until the flag becomes True. If it's already True, it proceeds immediately.
event.set(): Sets the flag to True. All tasks waiting on wait() are immediately woken up.
event.clear(): Resets the flag to False. Subsequent wait() calls will block again.

Example

Input

import asyncio
import random
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format='%(relativeCreated)d ms - [%(levelname)s] - %(message)s',
)

async def database_connection_manager(event: asyncio.Event):
    """
    Simulates a database connection manager that handles connection states.
    It manages the event flag to signal when the DB is ready or down.
    """
    logging.info("Manager: Database connection initializing...")
    await asyncio.sleep(2)  # Simulate startup time

    # 1. Event Set: Signal that the resource is ready
    logging.info("Manager: Database CONNECTED. Setting event (Green Light).")
    event.set()

    # Let workers process for a while
    await asyncio.sleep(3)

    # 2. Event Clear: Signal that the resource is unavailable (e.g., maintenance or crash)
    logging.info("Manager: Connection LOST! Clearing event (Red Light). blocking workers...")
    event.clear()

    # Simulate downtime loop
    await asyncio.sleep(3)

    logging.info("Manager: Reconnecting...")
    await asyncio.sleep(2)

    # 3. Event Set again: Recovery
    logging.info("Manager: Database RECONNECTED. Setting event again.")
    event.set()

async def query_worker(worker_id: int, event: asyncio.Event):
    """
    Simulates a worker that needs the database to be ready to process queries.
    """
    logging.info(f"Worker {worker_id}: Ready to start processing.")

    for i in range(5):
        # 4. Event Wait: Block until the event is set
        # If event.is_set() is True, it returns immediately.
        # If False, it waits until some other coroutine calls event.set().
        logging.info(f"Worker {worker_id}: Waiting for DB connection to process request {i+1}...")

        await event.wait()

        # If we reach here, the event is set!
        logging.info(f"Worker {worker_id}: Processing request {i+1} (DB is Online)")

        # Simulate processing time
        process_time = random.uniform(0.5, 1.5)
        await asyncio.sleep(process_time)

async def main():
    # Create the Event object
    # Internal flag is initially False
    db_ready_event = asyncio.Event()

    async with asyncio.TaskGroup() as tg:
        # Create tasks
        tg.create_task(database_connection_manager(db_ready_event))

        workers = [
            tg.create_task(query_worker(i, db_ready_event))
            for i in range(1, 4)
        ]

if __name__ == "__main__":
    asyncio.run(main())

Output

48 ms - [INFO] - Manager: Database connection initializing...
49 ms - [INFO] - Worker 1: Ready to start processing.
49 ms - [INFO] - Worker 1: Waiting for DB connection to process request 1...
49 ms - [INFO] - Worker 2: Ready to start processing.
49 ms - [INFO] - Worker 2: Waiting for DB connection to process request 1...
49 ms - [INFO] - Worker 3: Ready to start processing.
49 ms - [INFO] - Worker 3: Waiting for DB connection to process request 1...
2057 ms - [INFO] - Manager: Database CONNECTED. Setting event (Green Light).
2057 ms - [INFO] - Worker 1: Processing request 1 (DB is Online)
2057 ms - [INFO] - Worker 2: Processing request 1 (DB is Online)
2058 ms - [INFO] - Worker 3: Processing request 1 (DB is Online)
2702 ms - [INFO] - Worker 1: Waiting for DB connection to process request 2...
2702 ms - [INFO] - Worker 1: Processing request 2 (DB is Online)
2959 ms - [INFO] - Worker 3: Waiting for DB connection to process request 2...
2959 ms - [INFO] - Worker 3: Processing request 2 (DB is Online)
3501 ms - [INFO] - Worker 2: Waiting for DB connection to process request 2...
3501 ms - [INFO] - Worker 2: Processing request 2 (DB is Online)
3945 ms - [INFO] - Worker 1: Waiting for DB connection to process request 3...
3945 ms - [INFO] - Worker 1: Processing request 3 (DB is Online)
4315 ms - [INFO] - Worker 3: Waiting for DB connection to process request 3...
4315 ms - [INFO] - Worker 3: Processing request 3 (DB is Online)
4632 ms - [INFO] - Worker 1: Waiting for DB connection to process request 4...
4632 ms - [INFO] - Worker 1: Processing request 4 (DB is Online)
4785 ms - [INFO] - Worker 2: Waiting for DB connection to process request 3...
4785 ms - [INFO] - Worker 2: Processing request 3 (DB is Online)
4836 ms - [INFO] - Worker 3: Waiting for DB connection to process request 4...
4836 ms - [INFO] - Worker 3: Processing request 4 (DB is Online)
5056 ms - [INFO] - Manager: Connection LOST! Clearing event (Red Light). blocking workers...
5834 ms - [INFO] - Worker 1: Waiting for DB connection to process request 5...
5890 ms - [INFO] - Worker 2: Waiting for DB connection to process request 4...
6070 ms - [INFO] - Worker 3: Waiting for DB connection to process request 5...
8056 ms - [INFO] - Manager: Reconnecting...
10063 ms - [INFO] - Manager: Database RECONNECTED. Setting event again.
10063 ms - [INFO] - Worker 1: Processing request 5 (DB is Online)
10064 ms - [INFO] - Worker 2: Processing request 4 (DB is Online)
10064 ms - [INFO] - Worker 3: Processing request 5 (DB is Online)
10848 ms - [INFO] - Worker 2: Waiting for DB connection to process request 5...
10848 ms - [INFO] - Worker 2: Processing request 5 (DB is Online)

asyncio.to_thread

Sometimes, you might encounter a problem where the functions you want to use do not have asynchronous capabilities. These days, many libraries and frameworks support async operations, but some legacy libraries still do not. In this situation, you can use to_thread.

asyncio.to_thread is a bridge. It allows you to take Blocking I/O (like reading a huge file or using a synchronous library like requests) or Light CPU work and run it without "stopping the world" for your other async tasks.

The Mechanics of `asyncio.to_thread`

When you call to_thread, the following sequence occurs:

The Wrapper: asyncio wraps your synchronous function in a coroutine object.
The Context: It captures the current contextvars (so your database sessions or tokens stay available).
The Executor: It submits the function to the event loop's default ThreadPoolExecutor, so it runs in a worker thread instead of the event loop thread.
The Yield: The calling coroutine awaits a Future object. This allows the Event Loop to go do other work while the thread is busy.
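
The four steps above can be approximated in a few lines. This is a simplified sketch, not a replacement for the real function: `to_thread_sketch`, `request_id`, and `blocking_read` are hypothetical names used only for illustration, built from `contextvars.copy_context()` and `loop.run_in_executor()`.

```python
import asyncio
import contextvars
import functools
import threading

# Hypothetical context variable, e.g. a per-request trace ID.
request_id = contextvars.ContextVar("request_id", default="unset")


def blocking_read() -> tuple[str, bool]:
    """Synchronous function: reports the context value and the thread it runs on."""
    in_main_thread = threading.current_thread() is threading.main_thread()
    return request_id.get(), in_main_thread


async def to_thread_sketch(func, /, *args, **kwargs):
    loop = asyncio.get_running_loop()
    # The Context: snapshot the caller's context variables.
    ctx = contextvars.copy_context()
    call = functools.partial(ctx.run, func, *args, **kwargs)
    # The Executor + The Yield: run in the default pool (None) and await the future.
    return await loop.run_in_executor(None, call)


async def main() -> tuple[str, bool]:
    request_id.set("req-42")
    return await to_thread_sketch(blocking_read)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The function runs on a worker thread, yet it still sees the value that was set in the calling coroutine.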

import asyncio
import time


def brew_coffee(n: int) -> None:
    print(f"Start brewing coffee #{n}...")

    time.sleep(5)

    print(f"Coffee #{n} is ready!")


async def main() -> None:
    start = time.time()

    tasks = [asyncio.to_thread(brew_coffee, i + 1) for i in range(3)]

    await asyncio.gather(*tasks)

    end = time.time()
    print("Coffee ready!")
    print(f"Total time: {end - start:.2f} seconds")


if __name__ == "__main__":
    asyncio.run(main())

This code is similar to the example in the previous post. When we ran that code with asyncio.create_task, it executed sequentially because of the blocking function time.sleep.

But in this example, if we run this code, it returns

Start brewing coffee #1...
Start brewing coffee #2...
Start brewing coffee #3...
Coffee #2 is ready!
Coffee #1 is ready!
Coffee #3 is ready!
Coffee ready!
Total time: 5.00 seconds

As you can see, the tasks ran concurrently. You can run legacy blocking I/O functions without blocking the event loop by leveraging to_thread.

Is this the same as Multi-threading? What is the difference?

Feature	with ThreadPoolExecutor()	asyncio.to_thread()
Context Switching	OS-level (Preemptive)	OS-level (Preemptive execution)
Context Variables	Lost (e.g., Trace IDs, Auth tokens)	Preserved (Propagates `contextvars`)
Error Handling	Needs manual `future.exception()` check	Integrated with `try/except` and `TaskGroups`
Lifecycle	Manual (must use `with` or `.shutdown()`)	Automatic (managed by the event loop)
Configuration	Explicit (`max_workers=...`)	Uses the loop's default executor

Actually, to_thread uses multi-threading, but wraps it in a coroutine interface. It doesn’t incur data loss because it automatically propagates context variables, unlike a raw ThreadPool.

Someone might ask “Then, I don’t need to use async coroutine functions, right? Because I can run both sync and async functions asynchronously with to_thread.”

Limitation of `to_thread`

That is absolutely not true. to_thread utilizes threads for I/O bound work, whereas native async tasks are single-threaded. to_thread is a hybrid feature.
It utilizes a global thread pool for processing I/O bound work with legacy blocking I/O functions. However, it can’t process more tasks than the number of threads in the global pool. In contrast, native coroutines can process a much larger number of I/O bound tasks with just one thread.

Note: Default number of threads in the pool

$$\mathbf{N}_{threads}=\mathrm{min}(32,\, \mathrm{os.cpu\_count}() + 4)$$
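
To see the limit described above in action, the following sketch deliberately shrinks the pool to two threads with `loop.set_default_executor`, so the effect shows up with only four tasks. The helper `default_pool_size` mirrors the formula quoted in the note, and `blocking_io` is a hypothetical stand-in for a legacy blocking call. The pool size and the delays are arbitrary values chosen for the demonstration.

```python
import asyncio
import os
import time
from concurrent.futures import ThreadPoolExecutor


def default_pool_size() -> int:
    """Default pool size according to the formula in the note above."""
    return min(32, (os.cpu_count() or 1) + 4)


def blocking_io(delay: float) -> None:
    time.sleep(delay)  # stand-in for a legacy blocking call


async def main(n_tasks: int = 4, delay: float = 0.2) -> tuple[float, float]:
    loop = asyncio.get_running_loop()
    # Assumption for the demo: a tiny pool of 2 threads instead of the default.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=2))

    start = time.perf_counter()
    await asyncio.gather(*(asyncio.to_thread(blocking_io, delay) for _ in range(n_tasks)))
    threaded = time.perf_counter() - start  # two waves of two tasks

    start = time.perf_counter()
    await asyncio.gather(*(asyncio.sleep(delay) for _ in range(n_tasks)))
    native = time.perf_counter() - start    # all tasks wait at once

    return threaded, native


if __name__ == "__main__":
    threaded, native = asyncio.run(main())
    print(f"default pool size on this machine: {default_pool_size()}")
    print(f"to_thread with 2 threads: {threaded:.2f}s, native coroutines: {native:.2f}s")
```

With two threads, the four to_thread calls need two rounds, while the four native coroutines all wait at the same time on a single thread.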

Conclusion

I have introduced some useful functions of asyncio. In the next post, I will demonstrate how to apply these functions to LLM inference.

Coroutine series 1) What are Coroutine, Asyncio, I/O bound?

Jun Bae — Wed, 22 Apr 2026 13:40:59 +0000

Introduction

In many cases, we have to run several jobs concurrently. Most developers are likely familiar with multi-threading or multi-processing, both of which Python supports through ThreadPoolExecutor and ProcessPoolExecutor.

However, there is another powerful tool: the ‘coroutine’. While multi-threading and multi-processing are Hardware/OS-based methods, coroutines provide concurrency at the software level.

Python added the asyncio library in Python 3.4 (2014) and native coroutines with the async/await syntax in Python 3.5 (2015). For a long time, it was just a niche tool primarily for backend developers. However, since the LLM boom of the 2020s, coroutines have become an essential feature for AI engineers.

About this series

I think it would be difficult to cover everything about coroutines in a single post, which is why I created this coroutine series. I will explore the features of coroutines and their specific applications in AI step by step.

What is a Coroutine?

While multi-processing is hardware-based parallelism and multi-threading is OS-level concurrency, coroutines are software-based concurrency.

Method	Multi-Processing	Multi-Threading	Coroutines (Asyncio)
Type	Hardware Parallelism	Kernel Concurrency	Software Concurrency
The Scheduler	The OS Kernel	The OS Kernel	The Python Event Loop
Switching Style	Preemptive (Forceful)	Preemptive (Forceful)	Cooperative (Polite)
Resource Cost	High (Separate Memory)	Medium (Stack + Kernel Objects)	Low (Tiny Object on Heap)
Awareness	The CPU knows.	The OS knows.	Only Python knows.

Multi-processing requires multiple logical CPU cores, while multi-threading ostensibly utilizes multiple threads within a single process. However, as you may know, Python multi-threading is technically not capable of executing Python code simultaneously due to the Global Interpreter Lock (GIL). Consequently, it is better suited for I/O-bound concurrency than for true parallelism.

For CPU-bound tasks: Threads must hold the GIL to execute bytecode. They contend for the lock, causing overhead. A multi-threaded CPU-bound program in Python is often slower than a single-threaded one due to lock contention and context switch overheads.
For I/O-bound tasks: The GIL is released when a thread performs a blocking I/O operation (e.g., socket.recv, time.sleep, file I/O). This allows other threads to run Python code while the first thread waits in the OS kernel. Consequently, standard threading is effective for concurrent I/O in Python.
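
The I/O-bound case can be checked with a small sketch. Here `time.sleep` stands in for a blocking I/O call, and `wait_for_io` and `run_threaded` are hypothetical names. Because the GIL is released while each thread waits, four waits of 0.2 seconds overlap instead of adding up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_for_io(delay: float) -> float:
    """Stand-in for a blocking I/O call; the GIL is released while sleeping."""
    time.sleep(delay)
    return delay


def run_threaded(n: int = 4, delay: float = 0.2) -> tuple[list[float], float]:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(wait_for_io, [delay] * n))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_threaded()
    print(f"{len(results)} blocking waits finished in {elapsed:.2f}s")
```

Running the same four calls one after another would take about 0.8 seconds, so the threads do help for waiting, even though they cannot execute Python bytecode in parallel.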

But what if we don’t even need to allocate multiple threads in an execution? In most I/O-bound scenarios, having several threads is actually unnecessary—especially when you are making API calls and simply waiting for a response. This is precisely why Python introduced support for coroutines.

Why has it become essential?

As mentioned above, when coroutines were first introduced in Python, they were a niche tool used by only a small group of developers. Today, however, we rely heavily on LLMs, most of which are accessed through API calls. Unless you are running a model locally on your own PC, you must request an inference from a server—even in an on-premises environment. (What developer would write code on H100 while it is already serving an LLM model?)

When working with LLM APIs, most of your execution time is spent waiting for a response. Instead of sitting idle during that wait, your process could be handling the next task.

If you have ten concurrent requests, you don’t need to wait for the first user's prompt to finish. While waiting for the first response, you can send the second, third, and eventually all ten prompts almost simultaneously (provided your LLM server has enough throughput).

Furthermore, in an Agentic AI structure, many inferences are independent of one another. In that case, sending prompts to LLM servers concurrently can significantly reduce your total latency.
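
As a rough illustration of the ten-request scenario, the sketch below sends ten simulated prompts at once. No real LLM API is called: `fake_llm_call` is a hypothetical stand-in that only sleeps for an assumed latency of 0.2 seconds.

```python
import asyncio
import time


async def fake_llm_call(prompt: str, latency: float = 0.2) -> str:
    """Hypothetical LLM request: just waits, like a network round trip."""
    await asyncio.sleep(latency)
    return f"response to: {prompt}"


async def main() -> tuple[list[str], float]:
    prompts = [f"prompt {i}" for i in range(10)]

    start = time.perf_counter()
    # All ten requests are in flight at the same time.
    responses = await asyncio.gather(*(fake_llm_call(p) for p in prompts))
    elapsed = time.perf_counter() - start

    return responses, elapsed


if __name__ == "__main__":
    responses, elapsed = asyncio.run(main())
    print(f"{len(responses)} responses in {elapsed:.2f}s")
```

Sent one after another, the same ten calls would take about two seconds. Sent together, the total is close to the latency of a single call, as long as the server can keep up.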

In this context, your Python process is not the Calculator; it is the Traffic Controller.

How can I leverage coroutines?

Python has a built-in library called asyncio that allows you to implement this.

Imagine you are making several cups of coffee and have multiple coffee machines. But you always use only one machine and do nothing while waiting for each cup of coffee because you are too lazy to manage tasks concurrently. In code, it looks like this:

Input

import time

def brew_coffee(n: int) -> None:
    print(f"Start brewing coffee #{n}...")
    # This blocks the entire thread. Nothing else can happen.
    time.sleep(5)
    print(f"Coffee #{n} is ready!")

def main() -> None:
    start = time.time()

    # Sequential execution
    for i in range(3):
        brew_coffee(i + 1)

    end = time.time()
    print(f"Total time: {end - start:.2f} seconds")

if __name__ == "__main__":
    main()

Output

Start brewing coffee #1...
Coffee #1 is ready!
Start brewing coffee #2...
Coffee #2 is ready!
Start brewing coffee #3...
Coffee #3 is ready!
Total time: 15.00 seconds

But as your cafe grows, you have to become more efficient, because you can’t keep customers waiting while you use only one machine at a time. This is where you, the barista, act as a coroutine. You start one machine, and while it's brewing, you immediately start the second and third.

Input

import asyncio
import time
from typing import List, Coroutine, Any

# 1. Define a coroutine using 'async def'
async def brew_coffee(n: int) -> None:
    print(f"Start brewing coffee #{n}...")

    # 2. 'await' hands control back to the Event Loop
    # This is non-blocking sleep.
    await asyncio.sleep(5)

    print(f"Coffee #{n} is ready!")

async def main() -> None:
    start = time.time()

    # 3. Create a list of coroutine objects (they haven't run yet!)
    tasks: List[Coroutine[Any, Any, None]] = [brew_coffee(i + 1) for i in range(3)]

    # 4. Schedule them concurrently and wait for all to finish
    await asyncio.gather(*tasks)

    end = time.time()
    print("Coffee ready!")
    print(f"Total time: {end - start:.2f} seconds")

if __name__ == "__main__":
    # 5. The entry point for the async world
    asyncio.run(main())

Output

Start brewing coffee #1...
Start brewing coffee #2...
Start brewing coffee #3...
Coffee #1 is ready!
Coffee #2 is ready!
Coffee #3 is ready!
Coffee ready!
Total time: 5.00 seconds

Now you can make three cups of coffee in just five seconds. You didn’t hire more workers, you didn’t move faster, and you didn’t buy extra equipment. You just took advantage of the idle time you previously wasted.

So, what exactly are coroutines, asyncio, await, and gather? Let’s dive into them one by one.

Coroutine functions

In Python, a standard block of code starting with def is simply called a function. But if you add async in front of it like async def, it becomes a coroutine function. This tells Python that the function is intended to run asynchronously. ‘Asynchronous’ means that tasks can be handled independently rather than waiting for one to finish before starting the next.

Asynchronous vs Synchronous

The dictionary definitions of these words can sometimes be confusing. Some people might think ‘synchronous jobs? Does it mean the jobs would be run at the same time?’ It helps to think of "synchronous" like "syncing" a device. When you sync two devices, they must be "in step" or connected to share data. Therefore, synchronous programming means tasks are executed in a strict sequence; the previous task must finish before the next one begins. On the other hand, asynchronous means tasks are decoupled, allowing the program to move on to other work without waiting for a specific operation to complete.

Synchronous tasks (Dependent): Washing your clothes and then hanging them out to dry. You cannot hang the laundry until the washer has finished its cycle.
Asynchronous tasks (Independent): Running the dishwasher and microwaving your lunch. You don't need to wait for the dishwasher to finish to start the microwave; both can happen at the same time.

async def and await

As mentioned earlier, the ‘async’ keyword allows you to define a coroutine. Creating one is simple, but running it requires a slightly different approach than a standard function.

import asyncio

async def brew_coffee(n):
    print(f"Start brewing coffee #{n}...")
    await asyncio.sleep(5)
    print(f"Coffee #{n} is ready!")

This is the same coffee example. But how do you run this coroutine? Can you simply call brew_coffee(1), as you would a normal function? If you do, you will see this output:

<coroutine object brew_coffee at 0x000002957D795BE0>

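Unfortunately, it doesn’t start making coffee. Calling a coroutine function only creates a coroutine object; nothing runs yet. To actually start it, you have to await it (inside another coroutine, or in an environment that supports top-level await, such as a Jupyter notebook).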

await brew_coffee(1)

Start brewing coffee #1...
Coffee #1 is ready!

Now you have made one cup of coffee. await means the caller waits until the job finishes. But wait a minute. If await waits until the job is done, what happens when you start two async jobs like this?

await brew_coffee(1)
await brew_coffee(2)

Then it will return this:

Start brewing coffee #1...
Coffee #1 is ready!
Start brewing coffee #2...
Coffee #2 is ready!

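Right, these did not run concurrently. If each await waits for its job to finish, how can you start several jobs concurrently? You either schedule them as tasks with asyncio.create_task or hand them to asyncio.gather: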

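# Option 1: schedule both as tasks first, then wait for them to finish
task1 = asyncio.create_task(brew_coffee(1))
task2 = asyncio.create_task(brew_coffee(2))
await task1
await task2

# Option 2: as in the first example, hand the coroutines to gather
works = [brew_coffee(1), brew_coffee(2)]
await asyncio.gather(*works)

Either option prints: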

Start brewing coffee #1...
Start brewing coffee #2...
Coffee #1 is ready!
Coffee #2 is ready!

Now it works!

What is `await` exactly?

When a coroutine encounters an await expression, it follows a three-step process:

Suspend: When await is encountered, the coroutine saves its current instruction pointer and stack variables.
Yield: Control is returned to the Event Loop.
Resume: When the awaited operation completes, the Event Loop restores the coroutine's state and resumes execution at the instruction following the await.

Long story short, when a task hits await, it says, "I’m waiting for something; go ahead and run someone else." The event loop then runs another task. Only after the awaited operation is complete does the loop return to the original task to finish the job.
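To make the suspend / yield / resume cycle visible, here is a minimal sketch. The worker function and the log list are hypothetical names introduced only for this illustration, and asyncio.sleep(0) stands in for any awaited operation that hands control back to the event loop.

```python
import asyncio

log = []  # records the order in which lines actually run


async def worker(name: str) -> None:
    log.append(f"{name}: before await")
    # Suspend here and hand control back to the event loop.
    await asyncio.sleep(0)
    log.append(f"{name}: after await")


async def main() -> None:
    # gather schedules both workers on the same event loop.
    await asyncio.gather(worker("A"), worker("B"))


if __name__ == "__main__":
    asyncio.run(main())
    print(log)
```

Worker A runs until its await, then worker B gets its turn, and only afterwards does each worker resume. The two "before await" entries therefore appear ahead of the two "after await" entries.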

But a word of caution: async and await are not a silver bullet! You must avoid using "blocking" I/O functions inside a coroutine.

The Danger of Blocking I/O

Consider this version of our coffee example. It looks almost identical to our successful coroutine code, but I have replaced asyncio.sleep(5) with time.sleep(5).

import asyncio
import time
from typing import List, Coroutine, Any


async def brew_coffee(n: int) -> None:
    print(f"Start brewing coffee #{n}...")

    time.sleep(5)

    print(f"Coffee #{n} is ready!")


async def main() -> None:
    start = time.time()

    tasks: List[Coroutine[Any, Any, None]] = [brew_coffee(i + 1) for i in range(3)]

    await asyncio.gather(*tasks)

    end = time.time()
    print("Coffee ready!")
    print(f"Total time: {end - start:.2f} seconds")


if __name__ == "__main__":
    asyncio.run(main())

Output

Start brewing coffee #1...
Coffee #1 is ready!
Start brewing coffee #2...
Coffee #2 is ready!
Start brewing coffee #3...
Coffee #3 is ready!
Coffee ready!
Total time: 15.00 seconds

Yes, this no longer behaves like a coroutine. Why? Because time.sleep(5) is a blocking call: it holds the thread for five seconds and never hands control back to the event loop. If you use blocking functions (like time.sleep, requests.get, or standard file reading) inside a coroutine, you effectively stop the entire "Traffic Controller" from working. For a coroutine to be effective, you must use non-blocking equivalents that can be awaited.

Note: You might think this mistake is too obvious to mention. But in production-level environments, you should be very careful about this issue. Real-world projects often consist of complex codebases with dozens or even hundreds of files. You must carefully audit your coroutines to ensure that no blocking I/O is hidden anywhere in the code they call. One small blocking call can bring your entire high-performance system to a crawl.
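When a blocking function cannot be replaced with an awaitable equivalent, one common workaround is to push it onto a worker thread with asyncio.to_thread (available from Python 3.9). The sketch below assumes a hypothetical brew_coffee_blocking function standing in for any legacy blocking call; it is an illustration, not the only way to handle this.

```python
import asyncio
import time


def brew_coffee_blocking(n: int, seconds: float) -> str:
    # A blocking function we cannot rewrite (stands in for a legacy call).
    time.sleep(seconds)
    return f"Coffee #{n} is ready!"


async def main(seconds: float = 1.0) -> float:
    start = time.perf_counter()
    # Each blocking call runs in a worker thread, so the event loop stays free.
    results = await asyncio.gather(
        *(asyncio.to_thread(brew_coffee_blocking, i + 1, seconds) for i in range(3))
    )
    for line in results:
        print(line)
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = asyncio.run(main())
    print(f"Total time: {elapsed:.2f} seconds")
```

The three cups finish in roughly the time of one, because the sleeps overlap in separate threads. This helps for I/O-bound blocking calls; it does not speed up CPU-bound work.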

Conclusion

In this post, we have explored the fundamentals of coroutines and how to implement them at a basic level. I will cover more advanced features of asyncio and demonstrate how to leverage them for AI inference in the next post of this series.

KV Cache and Prompt Caching: How to Leverage them to Cut Time and Costs

Jun Bae — Wed, 22 Apr 2026 13:35:03 +0000

Introduction

A Problem of LLM Inference

In the transformer structure, the model calculates the $\mathbf{K}, \mathbf{V}$ matrices using weight matrices $\mathbf{W}$. When an input vector $\mathbf{x}_0$ enters the model, it is first multiplied by the $\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v$ matrices. This yields the $\mathbf{q}_0, \mathbf{k}_0, \mathbf{v}_0$ vectors. As you iterate this process and stack the $\mathbf{k}, \mathbf{v}$ vectors, they form the $\mathbf{K}, \mathbf{V}$ matrices.

Assume you have successfully generated an output token after completing the transformer process, and that token is $\mathbf{x}_1$. Here lies the problem: for subsequent inference, the model must calculate not only $\mathbf{k}_1, \mathbf{v}_1$ but also $\mathbf{k}_0, \mathbf{v}_0$ again. Because the attention output is computed (omitting the softmax and scaling) as $\mathbf{q}_1\mathbf{K}^{T}\mathbf{V}$, it requires the $\mathbf{k}, \mathbf{v}$ vectors from all previous inputs. This results in redundant computations every time a new token is generated. As the number of input tokens grows, these recomputations become significantly time-consuming.

What is KV cache exactly?

If you understand the problem, you might have already thought of the solution. Yes, the solution is to store the previous $\mathbf{k}, \mathbf{v}$ vectors in what is known as a “cache”. It is a relatively straightforward concept.

A Simple example of KV cache

Let’s continue with the example from the introduction. When generating a token subsequent to $\mathbf{x}_1$, what needs to be computed? First, we calculate $\mathbf{q}_1=\mathbf{x}_1\mathbf{W}_q$. And then, you need the $\mathbf{K}, \mathbf{V}$ matrices. Let’s look at this step-by-step.

1. We need to form $\mathbf{K}$ using vectors $\mathbf{k}_0$ and $\mathbf{k}_1$.
2. This requires computing $\mathbf{k}_0=\mathbf{x}_0\mathbf{W}_k$ and $\mathbf{k}_1=\mathbf{x}_1\mathbf{W}_k$. Notice that we have just recomputed $\mathbf{k}_0$.
3. Similarly, we need $\mathbf{V}$, composed of vectors $\mathbf{v}_0$ and $\mathbf{v}_1$.
4. Following the same logic as step 2, we calculate $\mathbf{v}_0=\mathbf{x}_0\mathbf{W}_v$ and $\mathbf{v}_1=\mathbf{x}_1\mathbf{W}_v$.
5. Finally, with the $\mathbf{K}, \mathbf{V}$ matrices ready, we can compute $\mathbf{q}_1\mathbf{K}^{T}\mathbf{V}$.

As you can see, the initial computation for $\mathbf{x}_0$ required three vector calculations ($\mathbf{q}_0,\mathbf{k}_0,\mathbf{v}_0$). However, for the next token, that number jumps to five ($\mathbf{k}_0,\mathbf{v}_0,\mathbf{q}_1,\mathbf{k}_1,\mathbf{v}_1$). This grows to seven for the third token and nine for the fourth. If your input is 1,000 tokens long, you would have to perform 2,001 vector computations just for the last token. This is computationally expensive and highly inefficient.

This is where the "cache" comes in. If we store the entire $\mathbf{K}, \mathbf{V}$ matrices after each step, we no longer need to recompute $\mathbf{k}_0,\mathbf{v}_0$. The number of computations remains constant at three per token ($\mathbf{q}_n,\mathbf{k}_n,\mathbf{v}_n$), regardless of how many tokens have already been processed.
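The counting argument above can be checked with a small pure-Python sketch. It is a toy, not how real inference engines are written: it uses a single attention head, random weights, and a fixed sequence of four input vectors instead of generated tokens. All function and variable names here are hypothetical.

```python
import math
import random


def project(x, W):
    """Multiply row vector x by matrix W (one 'vector computation')."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]


def attend(q, K, V):
    """Single-query attention: softmax(q K^T / sqrt(d)) V."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]


def run_without_cache(xs, Wq, Wk, Wv):
    outputs, count = [], 0
    for n in range(len(xs)):
        # Recompute k and v for every token seen so far.
        K = [project(x, Wk) for x in xs[: n + 1]]
        V = [project(x, Wv) for x in xs[: n + 1]]
        q = project(xs[n], Wq)
        count += 2 * (n + 1) + 1
        outputs.append(attend(q, K, V))
    return outputs, count


def run_with_cache(xs, Wq, Wk, Wv):
    outputs, count = [], 0
    K_cache, V_cache = [], []
    for x in xs:
        # Only the newest token is projected; older rows come from the cache.
        q = project(x, Wq)
        K_cache.append(project(x, Wk))
        V_cache.append(project(x, Wv))
        count += 3
        outputs.append(attend(q, K_cache, V_cache))
    return outputs, count


random.seed(0)
d = 4
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(4)]
Wq, Wk, Wv = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(d)] for _ in range(3))

slow_out, slow_count = run_without_cache(xs, Wq, Wk, Wv)
fast_out, fast_count = run_with_cache(xs, Wq, Wk, Wv)
print(slow_count, fast_count)  # 3 + 5 + 7 + 9 = 24 projections vs 4 * 3 = 12
```

Both versions produce the same attention outputs; the cached version simply avoids recomputing projections it has already done.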

Note: This explanation assumes a decoder-only architecture. You might wonder why we don't cache the $\mathbf{Q}$ matrix as well: during generation only the newest token's query $\mathbf{q}_n$ is used, so there is nothing to reuse. Since most modern LLMs are decoder-only, encoder-side caching isn't a concern here. Additionally, "hitting the cache" requires both the token and its position to be identical due to positional embeddings.

How much faster is it?

Various engineering blogs and papers (e.g., from NVIDIA, Hugging Face, and Databricks) have quantified the benefits of KV caching.

Benchmark: Generating 1,000 Tokens (Llama-2-7B)

With KV Cache: ~20–30 seconds (consistent speed per token).
Without KV Cache: > 2–3 minutes (each token takes progressively longer than the last).

As the number of tokens grows, the gap widens significantly. Given that some SOTA models now support context lengths of 1M tokens, this type of cache engineering has become indispensable.

How to leverage KV cache when serving models?

In most cases, you don’t need to worry about complex configurations. Since the KV cache has become the standard way to serve models, libraries like vLLM and Hugging Face Transformers enable it by default, provided you have enough memory.

In Transformers, the use_cache parameter in model.generation_config controls this behavior; its default value is True.

vLLM allocates the memory that remains within the gpu_memory_utilization budget to the KV cache after the model weights are loaded.

vLLM is particularly well-known for its sophisticated cache engineering techniques, such as PagedAttention. Because of this, vLLM offers high throughput with almost no memory fragmentation, making it a top-tier choice for efficient LLM serving.

The Problem with KV Caching

Memory Consumption

The memory required for the KV cache increases linearly with sequence length and batch size.

KV Cache Memory Formula:

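$$\textbf{Memory}_{KV} = 2\,\times\,\textbf{Batch}\,\times\,\textbf{SeqLen}\,\times\,\textbf{Layers}\,\times\,\textbf{HiddenDim}\,\times\,\textbf{Precision}$$

2: One for keys and one for values.
Batch: Number of concurrent users being served.
SeqLen: The length of the context/sequence.
Layers: The number of transformer layers, each of which keeps its own cache.
HiddenDim: The width of the key (or value) vector per token in one layer. With Grouped-Query Attention, this is the number of KV heads times the head dimension, which is smaller than the model's hidden size.
Precision: Bytes per value (e.g., 2 bytes for FP16, 1 byte for FP8).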

Let me give you a simple example of how it increases with a typical LLM model.

Example: Llama-3-70B (FP16)

Model Weights: about 140GB
KV Cache (1 user, 1k tokens): ~0.3GB
KV Cache (1 user, 128k tokens): ~40GB
KV Cache (64 users, 16k tokens): ~310GB
KV Cache (64 users, 128k tokens): ~2,560GB

As you can see, the KV cache footprint explodes as you increase context length or batch size. Even with a 1 GB allocation per user, you might need more than one H100 GPU just to serve 100 concurrent users.
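As a rough check on these figures, the formula can be evaluated directly. The sketch below assumes the commonly cited Llama-3-70B shape (80 layers, 8 KV heads of 128 dimensions each under GQA) and reads "1k" as 1,024 tokens; these values are my assumptions for illustration, not numbers stated in this post.

```python
def kv_cache_bytes(batch, seq_len, layers, kv_dim, bytes_per_value):
    """2 (keys and values) x batch x seq_len x layers x kv_dim x precision."""
    return 2 * batch * seq_len * layers * kv_dim * bytes_per_value


# Assumed Llama-3-70B shape: 80 layers, 8 KV heads x 128 dims per head (GQA).
LAYERS, KV_DIM, FP16 = 80, 8 * 128, 2
GIB = 1024 ** 3

for users, tokens in [(1, 1024), (1, 128 * 1024), (64, 16 * 1024), (64, 128 * 1024)]:
    size = kv_cache_bytes(users, tokens, LAYERS, KV_DIM, FP16)
    print(f"{users:>2} users x {tokens:>6} tokens: {size / GIB:8.2f} GiB")
```

Under these assumptions the four cases come out to about 0.31, 40, 320, and 2,560 GiB, which is close to the list above. The small gap on the third case depends on whether "16k" means 16,000 or 16,384 tokens.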

Beyond memory capacity, this also impacts latency. While GPUs are exceptionally fast at mathematics, they are limited by memory bandwidth. The GPU must constantly fetch the cached data from VRAM to the chip to generate each new token. This is why you may see high latency even when GPU compute utilization is only at 10%—the memory bandwidth has become the bottleneck.

There are emerging solutions to these problems, such as Grouped-Query Attention (GQA), PagedAttention, and quantization. However, I don’t think these are enough. The KV cache is clearly hindering the scalability of LLMs today. If LLMs eventually hit a "dead end," as Yann LeCun has suggested, I believe the inefficiencies of the KV cache may be a contributing factor. It has become so indispensable that developers cannot afford to abandon it. If a new model is too massive to leverage the KV cache with current computing resources, it might be useless in the real world, as consumers will not tolerate prohibitively slow inference speeds.

What was once a solution is now a massive problem. How ironic. We’ll have to see how big tech companies tackle this tricky challenge in the future. Isn’t it exciting?

Prompt (Context) caching

This is another caching method in LLM engineering. While different from standard KV caching, it significantly improves efficiency and reduces computation time. Furthermore, if you use LLM services via an API, such as Gemini or OpenAI, this method can save you both time and money.

What is Prompt Caching?

Standard KV caching typically occurs within a single turn. Once a turn ends, the cache is cleared and the process restarts from scratch for the next turn.

Prompt caching extends this concept across multiple turns. After one turn is completed, the server saves the $\mathbf{K}, \mathbf{V}$ matrices and retains them for a period of time. When the next turn begins, the model can reuse those exact same $\mathbf{K}, \mathbf{V}$ matrices. Let me give you a simple example.

Example: A Multi-Turn Conversation

First turn

System prompt: Answer user’s question.

User prompt: Hi

Assistant prompt: Hi! How can I help you?

Second turn

User prompt: Who are you?

Assistant prompt:

In the first turn, the input tokens consist of $(\textrm{Answer}), (\textrm{user's}), (\textrm{question}), (\textrm{Hi})$ (I deliberately omitted tokens such as ‘\n’, spaces, BOS, EOS, and role markers for convenience). The model computes the $\mathbf{K}, \mathbf{V}$ matrices for all of these tokens and begins generating a response. By the time it finishes generating the entire assistant reply, it has also stacked the $\mathbf{k}, \mathbf{v}$ vectors for $(\textrm{Hi!}), (\textrm{How}), (\textrm{can}), (\textrm{I}), (\textrm{help}), (\textrm{you?})$ onto the matrices.

What happens during the second turn of a multi-turn conversation? Without prompt caching, the model recomputes the $\mathbf{k}, \mathbf{v}$ vectors for the previous context: $(\textrm{Answer}), (\textrm{user's}), (\textrm{question}), (\textrm{Hi})$ and $(\textrm{Hi!}), (\textrm{How}), (\textrm{can}), (\textrm{I}), (\textrm{help}), (\textrm{you?})$. It then computes the vectors for the new user prompt and begins generating an answer.

Of course, the model still uses a KV cache within the turn, but rebuilding that cache from scratch every turn is highly inefficient. It consumes too much time performing the exact same operations. However, if we cache the $\mathbf{K}, \mathbf{V}$ matrices from the previous turn, we can avoid recomputing the $\mathbf{k}, \mathbf{v}$ vectors for the initial instructions and the first exchange. By retrieving these vectors from memory, the model can begin processing immediately from the new prompt, “Who are you?”. Here is an example image from the OpenAI website.

As you can see in the image, the token positions must remain exactly the same. If even a single word is inserted, the cache will be invalidated from that point forward because the positional encodings will shift. Therefore, to leverage prompt caching effectively, you should keep your token sequences as consistent as possible.
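The prefix rule can be illustrated with a tiny sketch. It compares word-level "tokens" from the example conversation and counts how many leading tokens match. Real servers work on actual token IDs and typically match in larger blocks, so treat the function name and the granularity here as simplifying assumptions.

```python
def cached_prefix_length(previous_tokens, new_tokens):
    """Count how many leading tokens of the new request match the cached one."""
    matched = 0
    for old, new in zip(previous_tokens, new_tokens):
        if old != new:
            break
        matched += 1
    return matched


turn_1 = ["Answer", "user's", "question", "Hi", "Hi!", "How", "can", "I", "help", "you?"]
turn_2 = turn_1 + ["Who", "are", "you?"]
edited = ["Please"] + turn_2  # one word inserted at the very start

print(cached_prefix_length(turn_1, turn_2))  # 10: the whole first turn is reusable
print(cached_prefix_length(turn_1, edited))  # 0: the prefix no longer matches
```

This is why stable content such as system prompts and long reference documents is best placed at the start of the prompt, with the parts that change placed at the end.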

Note: Currently, the terminology for this technique has not been standardized. OpenAI calls it ‘Prompt Caching,’ vLLM uses ‘Prefix Caching,’ and Google refers to it as ‘Context Caching’. They all refer to the same concept. Unfortunately, this kind of naming inconsistency happens occasionally in the AI field. :(

With this method, tokens from previous turns don’t even need to be re-processed by the transformer layers. As long as the cache exists, the model can skip the heavy computation for those segments entirely. You simply load the cache and move forward without needing to re-evaluate the earlier parts of the conversation. It’s an incredibly efficient way to handle long-context interactions.

Saving Money with Prompt Caching

When you successfully "hit" the prompt cache during a multi-turn conversation, several LLM API providers—such as OpenAI and Google Gemini—offer significant discounts. Below, I’ve broken down the current discount policies for both platforms.

OpenAI API

OpenAI applies a blanket half-price policy to cached input tokens.

While the chart above uses specific examples, most recent models—including the GPT-5 family—benefit from this pricing. They still haven’t updated this chart, despite it being over half a year since GPT-5 launched.

Activation: The minimum token length to trigger prompt caching is 1,024 tokens.
TTL (Time to Live): The cache typically persists for 5 to 10 minutes during peak hours, and up to one hour during off-peak times.
Extended Caching: Certain models (such as gpt-5.2 and gpt-4.1) support extended prompt caching, allowing the cache to be retained for up to 24 hours.

Google Gemini API

Gemini offers two distinct caching tiers: Implicit and Explicit.

Implicit Caching: This is automatically enabled for most Gemini models. The minimum threshold is 1,024 tokens for Flash models and 4,096 tokens for Pro models. However, Google does not strictly guarantee cost savings with this tier; it is primarily an optimization for latency.
Explicit Caching: This allows for a cost-saving guarantee but must be configured manually. You can define specific parameters like this:

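from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# `model`, `system_prompt`, and `content` are assumed to be defined earlier
cache = client.caches.create(
    model=model,
    config=types.CreateCachedContentConfig(
        display_name='sherlock jr movie',  # used to identify the cache
        system_instruction=system_prompt,
        contents=content,
        ttl="300s",
    ),
)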

The default TTL is 1 hour, and there are no minimum or maximum bounds on it. Billing is based on these factors:

Cache token count: The number of input tokens cached, billed at a reduced rate when included in subsequent prompts.

Storage duration: The amount of time cached tokens are stored (TTL), billed based on the TTL duration of cached token count. There are no minimum or maximum bounds on the TTL.

Other factors: Other charges apply, such as for non-cached input tokens and output tokens.

Google also lists the exact cost of cached tokens explicitly, like this:

This example is for Gemini 3 Pro; prices vary by model.

Important Note: As seen in the Gemini 3 Pro pricing, you are charged a storage fee. Be mindful when setting a long TTL.

Conclusion

In this post, we have explored two pillars of LLM optimization: KV cache and Prompt caching. These techniques are indispensable for reducing latency and lowering operational costs. However, these efficiencies come with a significant trade-off: a massive increase in memory consumption.

This "memory wall" has forced developers and enterprises to secure high-performance GPUs with vast amounts of VRAM, not just for training, but for inference as well. It is no surprise that memory manufacturers like Samsung, SK Hynix, and Micron have seen such a surge in demand; their hardware is the literal foundation upon which these caching methods reside.

Currently, the memory bottleneck is one of the most significant "sticking points" in LLM architecture. Many of the world's leading tech companies are racing to find more efficient ways to handle these tricky difficulties. If the industry can overcome these scaling limits with new, innovative architectures, it will represent the next major breakthrough in the evolution of Artificial Intelligence.

Why LoRA? Understanding the representative PEFT

Jun Bae — Wed, 22 Apr 2026 13:24:04 +0000

Why LoRA?

Low-Rank Adaptation (LoRA) has revolutionized the way we approach Large Language Models (LLMs). As the most prominent Parameter-Efficient Fine-Tuning (PEFT) method, LoRA allows developers to adapt massive models like Llama 3 or GPT-4 to specific tasks without needing a cluster of H100 GPUs.

But how exactly does it work, and why is it so effective? In this post, we’ll dive into the mathematical intuition behind LoRA, the concept of "intrinsic dimension," and why this method is a game-changer for AI engineers.

The Problem: The Cost of Scale

When it comes to traditional statistical models, such as simple or generalized linear models and tree-based models (Random Forest, LightGBM, etc.), we can train our models from scratch using only their blueprints and architectures. Historically, this was the standard approach and was not particularly difficult or cumbersome.

However, the Deep Learning revolution changed the landscape. As model sizes ballooned from millions to billions of parameters, retraining or even fine-tuning a model became logistically impossible for most individuals and even many companies.

The Hypothesis of Intrinsic Dimension

Here is the crucial insight: We don't need to train all the parameters.

Research suggests that over-parameterized models reside on a low intrinsic dimension. Simply put, while a model might have billions of parameters, the "knowledge" required to solve a specific new task can be represented by a much smaller subset of variables.

This premise was validated before LoRA's invention by researchers like Li et al. (2018) and Aghajanyan et al. (2020), who showed that learning happens in a lower-dimensional subspace. LoRA operationalizes this theory.

What is LoRA? The Mechanics

So what exactly is LoRA?

The concept is remarkably straightforward. As mentioned, LoRA is a fine-tuning method that trains the model using only a fraction of the total parameters.

The Mathematical Formulation

Let’s look at the linear algebra. Suppose we have a pre-trained weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d \times k}$.

In full fine-tuning, we update the weights by learning a cumulative gradient update $\Delta \mathbf{W}$: $\mathit{h} = (\mathbf{W}_0 + \Delta \mathbf{W})\mathit{x}$

LoRA constrains this update $\Delta \mathbf{W}$ by representing it as the product of two smaller matrices, $\mathbf{B}$ and $\mathbf{A}$, with a low rank $r$:

$$\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$$

Where:

$\mathbf{B} \in \mathbb{R}^{d \times r}$
$\mathbf{A} \in \mathbb{R}^{r \times k}$
$r \ll \min(d, k)$ (The rank, usually set between 8 and 64)

All you need to do is train the A and B matrices. So how are A and B set up?

Initialization Strategy

To ensure the training starts exactly where the pre-trained model left off (i.e., $\Delta \mathbf{W}$ is zero at the beginning), we initialize:

Matrix A: Random Gaussian distribution $\mathcal{N}(0, \sigma^2)$
Matrix B: Zeros

Therefore, $\mathbf{B}\mathbf{A} = 0$ initially, ensuring no destabilizing noise is added to the model at step zero.

The Scaling Factor $\alpha$

One detail often missed is the scaling factor $\alpha$. The update is actually scaled as:

$$\Delta \mathbf{W} = \frac{\alpha}{r} (\mathbf{B}\mathbf{A})$$

This allows us to tune the learning rate effectively regardless of the rank $r$ we choose.
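The formulation, initialization, and scaling described above fit in a few lines of plain Python. This is a minimal sketch with a hypothetical LoRALinear class and tiny dimensions chosen for readability; it covers the forward pass only, and it is not how the peft library implements it.

```python
import random


def matvec(M, x):
    """Multiply matrix M (rows x cols) by column vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class LoRALinear:
    def __init__(self, W0, r, alpha, sigma=0.01, seed=0):
        rng = random.Random(seed)
        d, k = len(W0), len(W0[0])
        self.W0 = W0  # frozen pre-trained weights, d x k
        # A is r x k with Gaussian entries; B is d x r and starts at zero.
        self.A = [[rng.gauss(0.0, sigma) for _ in range(k)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W0, x)                   # W0 x
        update = matvec(self.B, matvec(self.A, x))  # B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]


W0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # d = 2, k = 3
layer = LoRALinear(W0, r=1, alpha=2)
x = [1.0, 0.0, -1.0]
print(layer.forward(x) == matvec(W0, x))  # True: B is zero, so the update vanishes
```

Because B starts at zero, the layer reproduces the pre-trained output exactly at step zero. Training then changes only A and B, while W0 is never touched.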

Then the question is: is it accurate enough?

The answer is yes. In some cases it is even more accurate.

Why is LoRA Effective?

Let’s talk numbers. Assume $d = 10{,}000$ and $k = 10{,}000$ with a rank $r = 16$.

Full Fine-Tuning: $10{,}000 \times 10{,}000 = 100{,}000{,}000$ parameters.
LoRA: $(10{,}000 \times 16) + (16 \times 10{,}000) = 320{,}000$ parameters.

That is a 99.68% reduction in trainable parameters. But is it accurate?
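The arithmetic is easy to reproduce for any layer shape. Here is a small sketch; the function name is hypothetical.

```python
def lora_parameter_counts(d, k, r):
    full = d * k          # every entry of the d x k weight matrix
    lora = d * r + r * k  # B is d x r, A is r x k
    return full, lora, 1 - lora / full


full, lora, reduction = lora_parameter_counts(10_000, 10_000, 16)
print(full, lora, f"{reduction:.2%}")  # 100000000 320000 99.68%
```

Since the LoRA count grows with $r(d + k)$ rather than $dk$, the savings become larger as the layers get wider.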

Performance vs. Efficiency

Yes, it is surprisingly accurate.

As shown in the original LoRA paper, on benchmarks like WikiSQL, LoRA performs neck-and-neck with full fine-tuning. In some cases, it even outperforms the baseline because it acts as a form of regularization, preventing the model from overfitting to the small training set.

But why? It might seem counterintuitive that tuning well under 1% of the parameters is sufficient, yet in practice it is.

For those with a background in statistics, this concept may feel familiar. There is a famous, long-established method for reducing dimensionality: PCA (Principal Component Analysis).

The Intuition: PCA Analogy

Imagine a dataset containing "Biological Species" and "Number of Legs." These two features are highly correlated; knowing the species often tells you the leg count. You don't need two independent dimensions to represent this variance.

There are other dimensionality reduction methods as well, such as clustering.

Anyway, the important point is that the underlying idea is valid: reducing the dimension can improve not only training efficiency but also performance.

Similarly, in deep neural networks, weight updates often occur in a subspace with high correlations. We can project these high-dimensional updates into a lower-rank space without losing significant information.

LoRA takes full advantage of this idea. Unlike PCA, it has no closed-form procedure for finding the subspace. It just creates low-rank matrices and trains them.

According to the authors, the subspaces of LLM parameters are more similar than previously thought. As you see in the image, the first dimension shares very similar subspaces with all of the other dimensions. Therefore, they reported that a rank as low as one worked quite well for fine-tuning GPT-3.

Mitigating Catastrophic Forgetting

A unique advantage of LoRA is that it freezes the original weights ($\mathbf{W}_0$).

Full Fine-Tuning: Overwrites $\mathbf{W}_0$. If you train heavily on coding tasks, the model might "forget" how to write poetry (Catastrophic Forgetting).
LoRA: Keeps $\mathbf{W}_0$ intact. You can train one adapter for Coding and another for Poetry. You can simply swap the adapters at runtime without reloading the base model.

Conclusion

LoRA has become an indispensable tool in the AI toolkit. Today, most individuals—and even many companies—lack the resources to perform full fine-tuning on models with billions of parameters. Even with access to H100 or A100 GPUs, full fine-tuning is often prohibitively slow and, as demonstrated, rarely provides a better return on investment.

Furthermore, in generic models, an individual query effectively activates only a small portion of the parameters. This is a concept that has been further expanded upon in another famous method, Mixture of Experts (MoE), which I will cover in a future post.

You can leverage LoRA very easily because libraries such as peft and trl are so convenient these days. Just try it. You will see that fine-tuning a model with about 10 billion parameters takes about 3-6 hours on one H100, and the output files are only a few hundred megabytes. (Of course, when you serve the model, you have to load the adapter on top of the original model or merge it into the original weights.)

Reference

Li, C., Farkhoor, H., Liu, R., & Yosinski, J. (2018). Measuring the Intrinsic Dimension of Objective Landscapes. arXiv:1804.08838. http://arxiv.org/abs/1804.08838

Aghajanyan, A., Zettlemoyer, L., & Gupta, S. (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. arXiv:2012.13255. http://arxiv.org/abs/2012.13255

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685. https://arxiv.org/abs/2106.09685

