Forem: Torkian

Add Guardrails So Your AI App Doesn't Lie — A Two-Layer Approach with NVIDIA NIM

Torkian — Sun, 24 May 2026 00:01:22 +0000

In Part 1 we got a USC campus assistant talking. In Part 2 we taught it to retrieve only the relevant context. Both posts ended with the same observation — when someone asked for the wifi password, the assistant refused. That refusal worked because we told it to. It would have just as happily made something up if we'd phrased the prompt differently.

This post is about hardening that refusal so it's not luck. Two guardrail layers, both small enough to read in one sitting, neither requiring a framework. First, tighten the prompt so the assistant knows what it's allowed to talk about. Second, add a second LLM call that re-reads the answer and the context and decides whether to ship the answer or refuse.

I'm B Torkian, NVIDIA Developer Champion at USC. This is the layer where a demo becomes something I'd actually let students use.

What you're adding

User question
  → retrieve top-k context (from Part 2)
  → scoped prompt: model answers OR returns the exact fallback line
  → grounding check: a second NIM call asks "is the answer supported by the context?"
  → ship the answer, or replace it with the fallback line

The chat call and the embedding setup carry over from Parts 1 and 2. Everything new in this post is fewer than 40 lines.

Why guardrails are not optional

The retrieval step from Part 2 narrowed what the model sees. It does nothing to stop the model from being clever with the data it has, or from drifting into topics outside the assistant's job.

Two real failure modes I've seen in student demos:

Out-of-scope creep. Someone asks "can you write my breakup text?" The model is happy to oblige. The retriever pulled three USC chunks (cosine just returns something), the prompt didn't forbid relationship advice, so the model wrote the text.
Confident-sounding hallucinations. The retrieved chunk says "Monday to Friday, 10 AM to 6 PM." The user asks about Saturday hours. The model decides the friendly answer is "Saturday hours are 11 AM to 4 PM" — a fabrication that sounds like a reasonable inference.

The first failure is solved by prompt scope. The second is what the grounding check is for.

Step 1 — Setup (self-contained)

If you already have Workshops 1 + 2 running in the same Colab session, skip this cell. If you're starting fresh, paste this in — it bundles the client, the embedding model, the USC knowledge base, and the retriever from Parts 1 and 2 so the rest of this post stands on its own.

%pip install -q openai numpy

import os, getpass
from openai import OpenAI
import numpy as np

if not os.getenv("NVIDIA_API_KEY"):
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Paste your NVIDIA API key (starts with nvapi-): ")

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

MODEL = "meta/llama-3.1-8b-instruct"
EMBED_MODEL = "nvidia/nv-embedqa-e5-v5"

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

knowledge_base = [
    {"title": "USC AI Club meeting",
     "text": "The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204."},
    {"title": "USC GPU lab hours",
     "text": "The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM."},
    {"title": "NVIDIA Developer Program",
     "text": "USC students can join the NVIDIA Developer Program for free."},
    {"title": "Next USC workshop",
     "text": "The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG)."},
    {"title": "USC AI/ML office hours",
     "text": "Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM."},
    {"title": "USC robotics lab",
     "text": "The USC robotics lab requires safety training before students can use the soldering station."},
    {"title": "USC tutoring",
     "text": "Peer tutoring for introductory Python at USC is available Wednesdays from 1 PM to 3 PM."},
]

def embed_texts(texts, input_type="passage"):
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        extra_body={"input_type": input_type},
    )
    return [np.array(item.embedding, dtype=np.float32) for item in response.data]

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)

def retrieve_context(question, k=3):
    q_emb = embed_texts([question], input_type="query")[0]
    scored = [(cosine_similarity(q_emb, item["embedding"]), item) for item in knowledge_base]
    scored.sort(key=lambda p: p[0], reverse=True)
    return "\n".join(f"- {item['text']}" for _, item in scored[:k])

for item, emb in zip(knowledge_base, embed_texts([i["text"] for i in knowledge_base], "passage")):
    item["embedding"] = emb

print(f"Ready. Embedded {len(knowledge_base)} chunks.")

That cell defines everything Workshops 1 and 2 produced. The Part 3 code below builds on ask, retrieve_context, and the embedded knowledge_base.

Step 2 — Layer 1: prompt scope with a fixed fallback line

FALLBACK = "I don't have that information — check with the USC AI Club."

SCOPED_SYSTEM_PROMPT_TEMPLATE = """You are a USC campus assistant for AI Club,
GPU lab, NVIDIA program, workshop, office hour, robotics lab, and tutoring
questions only.

Rules:
- Answer ONLY using the CONTEXT below.
- If the user asks about anything outside this scope (e.g. weather, jokes,
  personal advice, code generation, general world knowledge), reply with
  exactly: "{fallback}"
- If the answer is not present in the context, reply with exactly: "{fallback}"
- Do not invent names, dates, room numbers, links, passwords, schedules,
  policies, or instructions that are not in the context.

CONTEXT:
{context}
"""

Three things are doing work in this prompt:

A finite topic list. The assistant has a job description. "Anything outside this scope" gives the model a clear opt-out — it doesn't have to guess what's in-bounds.
One exact fallback string. Same wording, every time. This matters in Step 3 — the grounding check returns the same string when it overrides, so downstream code only has to recognize one shape.
An explicit don't-invent list. Models are pliable. Spelling out the dangerous categories (room numbers, passwords, policies) lowers hallucination noticeably with no extra calls.

This layer alone catches most off-topic and most "the context didn't mention it" cases.

Step 3 — Layer 2: a grounding check on every answer

The scoped prompt is a request — the model can still ignore it. Layer 2 is a separate, narrower NIM call whose only job is to look at the context and the answer and decide whether the answer is supported.

def answer_is_grounded(question: str, context: str, answer: str) -> bool:
    verdict = ask(
        system_prompt=(
            "You are a strict grounding verifier. Read the CONTEXT and the "
            "ANSWER. Respond with only 'yes' or 'no'. Say 'yes' if every "
            "factual claim in the ANSWER is directly supported by the CONTEXT. "
            "Say 'no' otherwise — including if the ANSWER adds information not "
            "in the CONTEXT, even if that information sounds plausible."
        ),
        user_message=(
            f"CONTEXT:\n{context}\n\n"
            f"QUESTION:\n{question}\n\n"
            f"ANSWER:\n{answer}\n\n"
            "Is every factual claim in the ANSWER supported by the CONTEXT?"
        ),
    )
    return verdict.strip().lower().startswith("yes")

Three things to notice:

It's just another ask() call — same client, same hosted NIM model, no new infrastructure. Layer 2 costs one extra call per question.
Yes/no only. Constraining the response shape makes the parsing reliable. If the verifier waffles ("yes, but..."), we treat that as a fail by checking the start of the string only.
It can be wrong too. The verifier is itself an LLM. For workshop-grade safety this is fine; for production you'd add deterministic checks (regex for room numbers, exact string match for fallback) on top.

Step 4 — Wire both layers into `ask_guarded()`

def ask_guarded(question: str) -> str:
    context = retrieve_context(question)              # from Part 2
    system_prompt = SCOPED_SYSTEM_PROMPT_TEMPLATE.format(
        fallback=FALLBACK, context=context,
    )
    answer = ask(system_prompt, question)             # Layer 1
    if not answer_is_grounded(question, context, answer):
        return FALLBACK                               # Layer 2 override
    return answer

for question in [
    "When does the USC AI Club meet?",        # in scope, in context
    "Can you write my breakup text?",         # OUT of scope
    "What is the wifi password?",             # in scope, NOT in context
    "What are the USC GPU lab Saturday hours?",   # invites a hallucination
]:
    print(f"Q: {question}")
    print(f"A: {ask_guarded(question)}\n")

Read the output carefully.

The AI Club question returns a real answer from the context. Both layers pass.
The breakup-text question hits Layer 1 — the scope rule catches it.
The wifi question also hits Layer 1 — nothing in the context mentions passwords, the scoped prompt forbids inventing them.
The Saturday-hours question is the one that earns its keep. The context says "Monday to Friday." A friendlier model would guess "closed on Saturday." Layer 2 reads that answer, sees "Saturday" is not in the context, and returns the fallback instead.

Step 5 — What you actually built

You took the retriever from Part 2 and put it inside two cheap, inspectable guardrails. The whole thing is still one Python file, still one hosted NIM endpoint, still no vector database. The mental model is:

Retrieval decides what the model sees.
Scoped prompt decides what the model is allowed to write.
Grounding check decides whether what the model wrote ships.

Real production systems extend each of these — deterministic rule checks, structured output, confidence thresholds, dedicated safety models, human review queues. The shape stays the same. Every additional layer is a yes/no gate between the user's question and the final response.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab for Part 3: Open part3_guardrails.ipynb
Local Python: part3_guardrails.py in the repo (python3 part3_guardrails.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.

Previously / next in this series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 4 (next): Run NIM on Your Own GPU — same OpenAI-compatible API, different endpoint. Useful when you want data locality, predictable latency, or a self-hosted dev loop.

From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM

Torkian — Sat, 23 May 2026 00:33:15 +0000

In Part 1, we built a USC campus assistant by pasting a five-line knowledge base directly into the prompt. That works when "the data" fits in your head. It stops being cute the moment the campus handbook, club docs, and workshop notes all want a seat at the same prompt window.

The fix is retrieval — store the chunks once, and at query time pull only the few that look relevant. That's what RAG (Retrieval-Augmented Generation) actually means once you strip away the marketing.

This post takes the assistant from Part 1 and bolts on a real retriever, using NVIDIA's hosted embedding model. No vector database, no LangChain, no abstraction layer. A Python list and NumPy are enough to understand what's actually happening. Once you've seen the moving parts, swapping in pgvector or Pinecone later is a fifteen-minute job.

I'm B Torkian, NVIDIA Developer Champion at USC. Same workshop series, same campus, one more capability added.

What you're adding

User question → embed query → compare to stored chunks → pick top-k → send only those to the LLM → answer

The model call itself barely changes. The work is in steps 2–4: turn text into vectors, compare vectors, return the closest chunks.

Why the manual approach from Part 1 breaks

In Part 1, the entire knowledge base sat inside the prompt:

campus_info = """
The USC AI Club meets every Thursday at 5 PM...
The USC GPU computing lab is open Monday to Friday...
...
"""

Five lines is fine. But every model has a context window, and every token costs money and latency. You don't want to paste the entire USC student handbook into every question — most of it is irrelevant to "when does the AI Club meet?"

Retrieval is the answer to "which 3 paragraphs out of 3000 are actually about this question?" You compute that before calling the LLM, then send only the winners.

What an embedding actually is

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Two texts that mean similar things land near each other in vector space. Two texts that mean different things land far apart.

NVIDIA's nv-embedqa-e5-v5 is an embedding model tuned specifically for question-answer retrieval. It has a quirk worth knowing about up front — it treats queries and passages differently. You tell it which one you're embedding via an input_type parameter. Getting this wrong is the most common beginner mistake — it still runs, but retrieval quality drops noticeably.

input_type='passage' → use for the documents you store
input_type='query' → use for the user's question at search time

That's it. Same model, two modes.

Step 1: Set up the client and `ask()` from Part 1

If you're continuing from Part 1, you already have these defined and can skip this cell. If you're starting fresh, paste this in first — everything later builds on it.

%pip install -q openai numpy

import os, getpass
from openai import OpenAI

if not os.getenv('NVIDIA_API_KEY'):
    os.environ['NVIDIA_API_KEY'] = getpass.getpass('Paste your NVIDIA API key (starts with nvapi-): ')

client = OpenAI(
    base_url='https://integrate.api.nvidia.com/v1',
    api_key=os.environ['NVIDIA_API_KEY'],
)

MODEL = 'meta/llama-3.1-8b-instruct'

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user',   'content': user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

client calls NVIDIA's API Catalog. ask() is the same chat-completion shape from Part 1. The retriever we're about to build slots in next to these, not instead of them.

Step 2: Build a small knowledge base and embed it as passages

import numpy as np

EMBED_MODEL = 'nvidia/nv-embedqa-e5-v5'

knowledge_base = [
    {'title': 'USC AI Club meeting',
     'text': 'The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.'},
    {'title': 'USC GPU lab hours',
     'text': 'The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM.'},
    {'title': 'NVIDIA Developer Program',
     'text': 'USC students can join the NVIDIA Developer Program for free.'},
    {'title': 'Next USC workshop',
     'text': 'The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG).'},
    {'title': 'USC AI/ML office hours',
     'text': 'Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM.'},
    {'title': 'USC robotics lab',
     'text': 'The USC robotics lab requires safety training before students can use the soldering station.'},
    {'title': 'USC tutoring',
     'text': 'Peer tutoring for introductory Python at USC is available Wednesdays from 1 PM to 3 PM.'},
]

def embed_texts(texts, input_type='passage'):
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        extra_body={'input_type': input_type},
    )
    return [np.array(item.embedding, dtype=np.float32) for item in response.data]

# Embed every chunk once, as a passage. Store the vector alongside the text.
embeddings = embed_texts([item['text'] for item in knowledge_base], input_type='passage')
for item, embedding in zip(knowledge_base, embeddings):
    item['embedding'] = embedding

print(f'Embedded {len(knowledge_base)} chunks. Vector dim:', embeddings[0].shape[0])

Two things to notice:

The OpenAI Python client doesn't have a native field for NVIDIA's input_type, so we pass it through extra_body. That's the right way to send provider-specific arguments without forking the client.
We're storing the embeddings in plain Python dicts. For seven chunks this is fine. For seven thousand, you'd reach for a vector database (and the only thing that changes is where the vectors live; the cosine math is identical).

Step 3: Retrieve the top-k chunks for a question

def cosine_similarity(a, b):
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0:
        return 0.0
    return float(np.dot(a, b) / denominator)

def retrieve_context(question, k=3):
    question_embedding = embed_texts([question], input_type='query')[0]

    scored = []
    for item in knowledge_base:
        score = cosine_similarity(question_embedding, item['embedding'])
        scored.append((score, item))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_items = [item for score, item in scored[:k]]

    return '\n'.join(f"- {item['text']}" for item in top_items)

Three things are happening here:

The question is embedded as a query, not a passage. This is the part beginners trip over. Same model, different mode.
Cosine similarity scores how close the question vector is to each stored chunk vector. Numbers near 1.0 mean very similar; numbers near 0 mean unrelated.
Top-k picks the highest-scoring chunks. Three is a reasonable default for a tiny knowledge base; tune it for yours.

There is no magic in step 3. A vector database would do the same comparison but use indexing tricks to do it fast at scale.

Step 4: Plug retrieval into the same `ask()` from Part 1

def ask_with_retrieval(question):
    context = retrieve_context(question)

    system_prompt = f"""You are a USC campus assistant. Answer ONLY using the
context below. If the answer is not in the context, say
"I don't have that information — check with the USC AI Club."

CONTEXT:
{context}
"""

    return ask(system_prompt, question)

for question in [
    'Where does the USC AI Club meet?',
    'When can I get Python tutoring at USC?',
    'What is the wifi password?',
]:
    print(f'Q: {question}')
    print(f'Context:\n{retrieve_context(question)}')
    print(f'A: {ask_with_retrieval(question)}\n')

Run it. Three things to read carefully:

The first question retrieves the AI Club chunk and answers from it. Good.
The second retrieves the tutoring chunk and answers from it. Notice that "Python tutoring" doesn't appear verbatim in the stored text — the chunk says "introductory Python" — but the embedding model knows those are semantically close. That's the whole point of vector search over keyword search.
The wifi question retrieves three chunks anyway (top-k always returns k items), but none of them contain a password. The assistant falls back to the refusal line because the ONLY using the context rule forces it to. That's the guardrail from Part 1 doing its job — and it's exactly the bridge into Part 3.

Step 5: What you actually did

You replaced the hand-picked campus_info string from Part 1 with a real retrieval step. The model call is identical, and the system prompt follows the same guardrail pattern — answer only from the provided context, otherwise fall back. The only structural change is that {context} now comes from a function instead of a hardcoded constant.

That swap is the entire mental model behind RAG. Real production systems add chunking strategies, hybrid search, re-ranking, and a vector database — but the spine stays the same: embed once, embed query, compare, pass top-k to the LLM.

In your own work, the seven-line knowledge_base becomes hundreds of paragraphs scraped from PDFs, lecture notes, club Slack archives, Notion pages, or a wiki. The retriever code doesn't change. The dict-with-vector storage gets replaced by something like pgvector, Qdrant, or Pinecone the moment you outgrow a Python list.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab for Part 2: Open part2_rag.ipynb
Local Python: part2_rag.py in the repo (python3 part2_rag.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.

Previously / next in this series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 3 (next): Add Guardrails So It Doesn't Lie — a two-layer approach using prompt scope + a tiny verifier call. The fallback line that fired on the wifi question above is the foundation we build on.

Build Your First AI App with NVIDIA NIM in 30 Minutes

Torkian — Thu, 21 May 2026 22:43:28 +0000

Most students I've taught at USC have used ChatGPT. Far fewer have called a model from code.

That is the gap this post is meant to close. In 30 minutes, you'll call an NVIDIA-hosted language model from Python, pass it a small knowledge base, and make it answer only from that data. No GPU setup, no CUDA detour, no pretending a notebook is production. The goal is simple — write a normal Python program that talks to an LLM and gets useful text back.

I'm B Torkian, NVIDIA Developer Champion at USC, and I use this as a starter workshop for university and community groups. I've run a version of it with about 40 USC students. What usually surprises people is how ordinary the app feels. Most of it is normal software; one function call in the middle just happens to be weirdly powerful.

Everything runs in Google Colab because, for a room full of mixed laptops (I have made peace with this), boring setup wins.

This is Part 1 of a 5-part series that goes from one API call all the way to a small tool-using agent. Each post stands on its own, so start here and move forward as far as you want to go.

What you're building

User question → Python app → NVIDIA NIM API → LLM response → App output

A small USC campus assistant. It will call an NVIDIA-hosted Llama model, use the data you provide, and refuse when the answer isn't there.

That refusal part matters. Demos can guess. Useful apps need to know when to say "I don't know."

What NVIDIA NIM is

NIM stands for NVIDIA Inference Microservices. For this post, treat it as hosted model inference from NVIDIA with a clean API in front.

There are two common ways to use it:

Hosted through NVIDIA's API Catalog at build.nvidia.com. That's what we're using here; check the current catalog terms before you teach it, because credits and available models can change.
Self-hosted on your own GPU later, with the same API shape. (That's Part 4 of this series.)

Whoever decided NVIDIA's API should mimic OpenAI's saved everyone a week of onboarding. You use the client most people have already seen, point it at a different endpoint, and move on.

Prerequisites (5 minutes)

A free NVIDIA Developer account — developer.nvidia.com
An API key from build.nvidia.com → pick any model → Get API Key. It starts with nvapi-.
A Google account for Colab.

The first time I taught this, I forgot to say the key starts with nvapi-, and half the room pasted the wrong thing (usually not their fault). Check that before you debug anything else.

Step 1: Open Colab and install the client

NVIDIA's API Catalog is OpenAI-compatible, so we'll use the standard openai Python client and point it at NVIDIA's endpoint.

%pip install -q openai

import os, getpass
from openai import OpenAI

os.environ['NVIDIA_API_KEY'] = getpass.getpass('Paste your NVIDIA API key: ')

client = OpenAI(
    base_url='https://integrate.api.nvidia.com/v1',
    api_key=os.environ['NVIDIA_API_KEY'],
)

MODEL = 'meta/llama-3.1-8b-instruct'

Notice two things:

base_url points at NVIDIA's hosted inference endpoint.
MODEL is just a model name from the API Catalog. Swap it later if you want; the call shape does not change.

Step 2: Make your first model call

def ask(system_prompt: str, user_message: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user',   'content': user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

print(ask(
    system_prompt='You are a helpful, concise assistant.',
    user_message='Explain GPU acceleration to a first-year CS student in 5 sentences.',
))

Run it.

That ask() function is the basic shape of a lot of AI apps — instructions in, user input in, model response out. Real systems add plumbing, but this is the core.

Step 3: Use the system prompt to steer the model

Now keep the model and change its job description:

print(ask(
    system_prompt='You are a sarcastic but accurate professor. Keep it under 5 sentences.',
    user_message='Explain GPU acceleration to a first-year CS student.',
))

The output changes because the system prompt changes the model's job. A little precision buys you a lot here; vague prompts make debugging miserable.

Treat prompts like tiny specs — include constraints, output shape, and what to do when a question goes off-track. Then test with slightly annoying questions, because users will absolutely ask those.

Step 4: Build the USC campus assistant

An LLM doesn't know the USC schedule. It may still sound confident, which is exactly the problem.

So put the USC campus information directly into the prompt:

campus_info = """
The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.
The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM.
USC students can join the NVIDIA Developer Program for free.
The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG).
Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM.
"""

assistant_system_prompt = f"""You are a USC campus assistant. Answer ONLY using the
information in CAMPUS INFO below. If the answer is not in there, say
"I don't have that information — check with the USC AI Club."

CAMPUS INFO:
{campus_info}
"""

for question in [
    'When does the USC AI Club meet?',
    'Is the USC GPU lab open on Saturday?',
    'What is the wifi password?',
]:
    print(f'Q: {question}')
    print(f'A: {ask(assistant_system_prompt, question)}\n')

Run it and read the outputs before moving on. The USC AI Club answer should come straight from the text. For Saturday, the model often refuses with the fallback line instead of inferring closed. That is the behavior I want for this exercise — "Monday to Friday" gives a human enough to reason about Saturday, but the exact Saturday answer is not stated in the provided data.

The wifi question should also get the fallback line, because there is nothing in campus_info about passwords. If your model says "I don't have that information — check with the USC AI Club," do not treat that as a failure. It stayed inside the box we gave it, which is the whole point.

Last USC cohort, one student replaced the campus info with their D&D campaign notes and ended up with the most fun bug-hunting session of the day. The pattern works for silly data and useful data, which is why it sticks.

Step 5: What you actually did

You just built manual RAG — you picked the context by hand, inserted it into the prompt, and asked the model to answer from that context. In a production-ish version, the hand-picked campus_info string becomes whatever your retrieval system finds.

In a real app, the context probably comes from PDFs, docs, tickets, lecture notes, or a wiki. You retrieve a few relevant chunks at query time, usually with embeddings and a vector database, then pass only those along.

The model call barely changes — campus_info becomes the output of retrieval. Most of the engineering work lives in that swap.

That swap is exactly what Part 2 of this series is about.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open the notebook

MIT licensed. I run this at USC — fork it, change campus_info to your school, your club, your project, and run it wherever you are.

Next in this series

Part 2: Give Your AI App Real Knowledge — Embedding-Based RAG with NVIDIA NIM. We replace the hand-picked context string with a real retriever that uses NVIDIA's embedding model, cosine similarity, and a query/passage distinction that most beginners get wrong on the first try.

Forem: Torkian

Add Guardrails So Your AI App Doesn't Lie — A Two-Layer Approach with NVIDIA NIM

What you're adding

Why guardrails are not optional

Step 1 — Setup (self-contained)

Step 2 — Layer 1: prompt scope with a fixed fallback line

Step 3 — Layer 2: a grounding check on every answer

Step 4 — Wire both layers into ask_guarded()

Step 5 — What you actually built

Get the code

Previously / next in this series

From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM

What you're adding

Why the manual approach from Part 1 breaks

What an embedding actually is

Step 1: Set up the client and ask() from Part 1

Step 2: Build a small knowledge base and embed it as passages

Step 3: Retrieve the top-k chunks for a question

Step 4: Plug retrieval into the same ask() from Part 1

Step 5: What you actually did

Get the code

Previously / next in this series

Build Your First AI App with NVIDIA NIM in 30 Minutes

What you're building

What NVIDIA NIM is

Prerequisites (5 minutes)

Step 1: Open Colab and install the client

Step 2: Make your first model call

Step 3: Use the system prompt to steer the model

Step 4: Build the USC campus assistant

Step 5: What you actually did

Get the code

Next in this series

Step 4 — Wire both layers into `ask_guarded()`

Step 1: Set up the client and `ask()` from Part 1

Step 4: Plug retrieval into the same `ask()` from Part 1