<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mihir Joshi</title>
    <description>The latest articles on Forem by Mihir Joshi (@mihirj).</description>
    <link>https://forem.com/mihirj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3065302%2Fd8a1a300-9bd7-4d51-9bd2-8dc5b755cdc6.jpg</url>
      <title>Forem: Mihir Joshi</title>
      <link>https://forem.com/mihirj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mihirj"/>
    <language>en</language>
    <item>
      <title>Building a Smart Café Menu Ordering Agent ☕🤖: Natural Language to Structured JSON with RAG</title>
      <dc:creator>Mihir Joshi</dc:creator>
      <pubDate>Sat, 19 Apr 2025 17:48:29 +0000</pubDate>
      <link>https://forem.com/mihirj/building-a-smart-cafe-menu-ordering-agent-natural-language-to-structured-json-with-rag-287a</link>
      <guid>https://forem.com/mihirj/building-a-smart-cafe-menu-ordering-agent-natural-language-to-structured-json-with-rag-287a</guid>
      <description>&lt;h2&gt;
  
  
  A Technical Deep Dive into Creating an Intelligent Interface
&lt;/h2&gt;

&lt;p&gt;Imagine walking into your favorite café and simply speaking your order: "Can I get a large oat milk latte and one blueberry muffin?" For a human barista, this is easy. But for an automated system, understanding this natural-language request and translating it into a precise, machine-readable order (like a JSON object 🧾) is a complex technical challenge.&lt;/p&gt;

&lt;p&gt;This post dives into how we can build a Smart Café Menu Ordering Agent using modern GenAI techniques, specifically focusing on the Retrieval Augmented Generation (RAG) pattern. We'll explore the technical architecture behind converting free-text customer queries into structured output, making automated order processing a reality. This project, explored in a Kaggle notebook, integrates several key ML and GenAI components to create an intelligent interface.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent's Technical Workflow: From Query to JSON Order
&lt;/h2&gt;

&lt;p&gt;Let's break down the technical components that empower our Smart Café Agent, following the journey of a customer's natural language order from input to a structured JSON output.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data Ingestion and Preparation 📊
&lt;/h3&gt;

&lt;p&gt;The agent needs to know the menu! The process starts with loading the menu data using &lt;strong&gt;pandas&lt;/strong&gt;. A crucial preprocessing step prepares the text for understanding: concatenating the item name, description, and category into a single text column. This consolidated string provides a rich textual representation for each menu item, essential for the next step.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Library:&lt;/strong&gt; pandas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Operation:&lt;/strong&gt; Data loading (&lt;code&gt;pd.read_csv&lt;/code&gt;), Feature Engineering (string concatenation for text column).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;menu_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;YOUR_CSV_FILE_PATH&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;menu_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;menu_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;menu_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; - category: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;menu_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This snippet shows the initial data loading into a pandas DataFrame and the creation of the combined text column used for embedding.&lt;/p&gt;
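&lt;p&gt;The concatenation step can be exercised end-to-end with a tiny in-memory DataFrame (the column names follow the article; the rows and the &lt;code&gt;fillna&lt;/code&gt; guard for missing descriptions are illustrative additions, not part of the original notebook):&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the real menu CSV (column names assumed from the article)
menu_df = pd.DataFrame({
    "item_name": ["Latte", "Blueberry Muffin"],
    "description": ["Espresso with steamed milk", None],
    "category": ["Drinks", "Bakery"],
})

# fillna guards against missing descriptions, which would otherwise
# turn the whole concatenated string into NaN for that row
menu_df["text"] = (
    menu_df["item_name"]
    + " - " + menu_df["description"].fillna("")
    + " - category: " + menu_df["category"]
)

print(menu_df["text"].tolist())
```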

&lt;h3&gt;
  
  
  2. Semantic Understanding with Embeddings ✨
&lt;/h3&gt;

&lt;p&gt;To enable our agent to understand the customer's request and semantically match it to our menu, we convert all text into numerical vectors using &lt;strong&gt;Embeddings&lt;/strong&gt;. A pre-trained SentenceTransformer model, all-MiniLM-L6-v2 (chosen for efficiency in environments like Kaggle), generates these dense vectors for both the menu items and the customer's query. Items or queries with similar meanings will have vectors close to each other in a high-dimensional space.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capability:&lt;/strong&gt; Embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Library:&lt;/strong&gt; sentence-transformers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; all-MiniLM-L6-v2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Operation:&lt;/strong&gt; Encoding text into vectors (&lt;code&gt;embedder.encode&lt;/code&gt;).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="c1"&gt;# Load embedding model (use small one for Kaggle)
&lt;/span&gt;
&lt;span class="n"&gt;embedder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate embeddings
&lt;/span&gt;
&lt;span class="n"&gt;menu_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;menu_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;show_progress_bar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Assuming user_query is defined
&lt;/span&gt;
&lt;span class="c1"&gt;# query_embedding = embedder.encode([user_query])
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code demonstrates loading the specific embedding model and applying it to the menu text data to create searchable vectors.&lt;/p&gt;
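&lt;p&gt;If sentence-transformers is not installed, the mechanics of the encode step can still be sketched with a dependency-free toy bag-of-words encoder. The vocabulary and texts below are invented for illustration; the real all-MiniLM-L6-v2 model instead returns dense 384-dimensional vectors:&lt;/p&gt;

```python
import numpy as np

# Toy vocabulary standing in for the model's learned representation space
VOCAB = ["latte", "muffin", "espresso", "blueberry", "oat", "milk"]

def toy_encode(texts):
    # Mimics embedder.encode: a list of strings in, a 2-D array out,
    # one row per input text
    vecs = np.zeros((len(texts), len(VOCAB)))
    for i, text in enumerate(texts):
        tokens = text.lower().split()
        for j, word in enumerate(VOCAB):
            vecs[i, j] = tokens.count(word)
    return vecs

menu_vecs = toy_encode(["oat milk latte", "blueberry muffin"])
query_vec = toy_encode(["one blueberry muffin please"])
print(menu_vecs.shape)  # one 6-dimensional row per menu item
```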

&lt;h3&gt;
  
  
  3. Finding Relevant Items with Vector Search 🔍
&lt;/h3&gt;

&lt;p&gt;When a customer makes a request, the agent performs a &lt;strong&gt;Vector Search&lt;/strong&gt; to find the most relevant menu items. The embedding of the customer's query is compared to the embeddings of all menu items using cosine similarity. The items with the highest similarity scores are retrieved as potential matches. A simple in-memory index stores the menu embeddings for quick lookup.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capability:&lt;/strong&gt; Vector Search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Libraries:&lt;/strong&gt; numpy, sklearn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Operation:&lt;/strong&gt; Cosine Similarity Calculation (&lt;code&gt;cosine_similarity&lt;/code&gt;), Index Sorting (&lt;code&gt;np.argsort&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural Note:&lt;/strong&gt;  For production, this in-memory index would be replaced by a dedicated vector database for scalability and performance.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics.pairwise&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cosine_similarity&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Assuming query_vector and menu_vectors are prepared (query_vector from user query embedding, menu_vectors from menu_embeddings)
&lt;/span&gt;
&lt;span class="n"&gt;similarities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;menu_vectors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get index of the most similar item
&lt;/span&gt;
&lt;span class="n"&gt;top_match_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Optional: Get indices of top N matches
&lt;/span&gt;
&lt;span class="n"&gt;top_n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;top_n_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarities&lt;/span&gt;&lt;span class="p"&gt;)[::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="n"&gt;top_n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This snippet shows the core calculation of cosine similarity between the query and menu embeddings to find the most semantically similar items.&lt;/p&gt;
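&lt;p&gt;A self-contained sketch of the retrieval step, with hand-made toy vectors in place of real embeddings. Note the &lt;code&gt;[0]&lt;/code&gt; indexing: &lt;code&gt;cosine_similarity&lt;/code&gt; returns a 2-D array of shape (1, n) for a single query, so the row must be extracted before sorting:&lt;/p&gt;

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy 3-dimensional vectors standing in for real sentence embeddings
menu_vectors = np.array([
    [1.0, 0.0, 0.2],   # e.g. "latte"
    [0.0, 1.0, 0.1],   # e.g. "blueberry muffin"
    [0.9, 0.1, 0.3],   # e.g. "cappuccino"
])
query_vector = np.array([[1.0, 0.05, 0.25]])  # query closest to "latte"

# cosine_similarity returns shape (1, n); take row 0 for the single query
similarities = cosine_similarity(query_vector, menu_vectors)[0]

top_n = 2
top_n_indices = np.argsort(similarities)[::-1][:top_n]
print(top_n_indices)  # indices of the two most similar menu items
```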

&lt;h3&gt;
  
  
  4. Contextual Interpretation with RAG ✨🧠
&lt;/h3&gt;

&lt;p&gt;Here's where &lt;strong&gt;Retrieval Augmented Generation (RAG)&lt;/strong&gt; comes into play, providing Grounding for the LLM. Instead of letting the language model guess based on its general training data, we give it the specific details of the top-N relevant menu items found in the vector search. The agent constructs a prompt that includes the customer's original query and the formatted details of these retrieved items. This guides the LLM to interpret the request accurately based on the actual menu options.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capability:&lt;/strong&gt; Retrieval Augmented Generation (RAG), Grounding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Operation:&lt;/strong&gt; Prompt Construction (incorporating retrieved text).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# Extract and format retrieved items to include in prompt
&lt;/span&gt;
&lt;span class="c1"&gt;# retrieved_items comes from selecting rows from menu_df based on top_n_indices
&lt;/span&gt;
&lt;span class="n"&gt;retrieved_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (Price: \$&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;retrieved_items&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Final prompt structure sent to the LLM
&lt;/span&gt;
&lt;span class="n"&gt;rag_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are an AI café assistant. A customer asked: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

Here are some menu items that may be relevant:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retrieved_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Based on this, generate a structured JSON order suggestion with fields:

- item
- quantity
- modifiers (if any)
- price

Respond only with a JSON block. Do not include explanations or extra text.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These snippets illustrate how the retrieved menu details are formatted and then embedded within the structured prompt sent to the LLM, providing essential context.&lt;/p&gt;
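&lt;p&gt;Here is a minimal, runnable sketch of this prompt-assembly step. The two-item menu, the prices, and the &lt;code&gt;top_n_indices&lt;/code&gt; value are invented for illustration; only the formatting pattern follows the post:&lt;/p&gt;

```python
import pandas as pd

# Invented menu rows; column names assumed from the article
menu_df = pd.DataFrame({
    "item_name": ["Latte", "Blueberry Muffin"],
    "description": ["Espresso with steamed milk", "Fresh-baked muffin"],
    "price": [4.50, 3.25],
})
top_n_indices = [1, 0]  # stand-in for the vector-search result
user_query = "one blueberry muffin please"

# Turn the top-N rows into plain dicts for prompt formatting
retrieved_items = menu_df.iloc[top_n_indices].to_dict("records")
retrieved_text = "\n".join(
    f"- {item['item_name']}: {item['description']} (Price: ${item['price']})"
    for item in retrieved_items
)
rag_prompt = (
    f'A customer asked: "{user_query}"\n\n'
    f"Relevant menu items:\n{retrieved_text}"
)
print(rag_prompt)
```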

&lt;h3&gt;
  
  
  5. Generating Structured JSON Output 🧾
&lt;/h3&gt;

&lt;p&gt;The agent sends the RAG-augmented prompt to a powerful Large Language Model, &lt;strong&gt;gemini-2.0-flash&lt;/strong&gt;. A key requirement is getting a machine-readable order, so we configure the LLM for Structured Output (JSON mode). We explicitly request JSON using &lt;code&gt;response_mime_type="application/json"&lt;/code&gt; in the API call configuration and reinforce this in the prompt instructions. The LLM then processes the query and context to generate the order details in the specified JSON format.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capability:&lt;/strong&gt; Structured Output / Controlled Generation, LLM Interaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Library:&lt;/strong&gt; google-genai&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; gemini-2.0-flash&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Operation:&lt;/strong&gt; API Call (&lt;code&gt;client.models.generate_content&lt;/code&gt;), Configuration (&lt;code&gt;response_mime_type&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical Consideration:&lt;/strong&gt; While JSON mode is powerful, robust parsing and validation post-generation are still necessary due to the probabilistic nature of LLM outputs.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="c1"&gt;# Assuming client is initialized and rag_prompt is constructed
&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;response_mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rag_prompt&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The model's JSON response is in response.text
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code shows the API call to the Gemini model, specifying the model, the crucial JSON output configuration, and the RAG prompt as input.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Post-processing and Validation ✅
&lt;/h3&gt;

&lt;p&gt;The final steps involve the agent parsing the LLM's JSON response using &lt;code&gt;json.loads&lt;/code&gt;. A custom &lt;code&gt;validate_order&lt;/code&gt; function then checks if the items suggested by the LLM actually exist on the menu (using a &lt;code&gt;menu_lookup&lt;/code&gt; dictionary created from the menu data) and verifies their availability. This ensures the suggested order is valid based on the current menu state. This entire sequence, from understanding the query to validating the output, constitutes the basic &lt;strong&gt;Agentic workflow&lt;/strong&gt; of our system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capability:&lt;/strong&gt; Basic Agentic Behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Operation:&lt;/strong&gt; JSON Parsing (&lt;code&gt;json.loads&lt;/code&gt;), Data Validation (custom function, dictionary lookup).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Assuming response is received from the LLM
&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;raw_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
\&lt;span class="c1"&gt;# ... parsing and normalization logic to get order_json ...
&lt;/span&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error decoding JSON. Raw response:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;order_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]}&lt;/span&gt; \&lt;span class="c1"&gt;# Handle parsing errors gracefully
&lt;/span&gt;
&lt;span class="c1"&gt;# Assuming menu_lookup is created from menu_df metadata
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;menu_lookup&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
\&lt;span class="c1"&gt;# ... logic to check each item against menu_lookup ...
&lt;/span&gt;&lt;span class="k"&gt;pass&lt;/span&gt; \&lt;span class="c1"&gt;# Function definition snippet
&lt;/span&gt;
&lt;span class="c1"&gt;# validated_order = validate_order(order_json, menu_lookup)
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This snippet highlights the JSON parsing step with error handling and references the validation function, crucial for ensuring the LLM's output is usable and correct.&lt;/p&gt;
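&lt;p&gt;To make the validation step concrete, here is a hypothetical sketch of what &lt;code&gt;validate_order&lt;/code&gt; might look like. The field names (&lt;code&gt;available&lt;/code&gt;, the &lt;code&gt;rejected&lt;/code&gt; bucket) and the sample data are assumptions for illustration, not the notebook's exact implementation:&lt;/p&gt;

```python
import json

# Invented lookup; in the project this is built from the menu DataFrame
menu_lookup = {
    "latte": {"price": 4.50, "available": True},
    "blueberry muffin": {"price": 3.25, "available": False},
}

def validate_order(order_json, menu_lookup):
    # Keep items that exist on the menu and are currently available;
    # collect everything else for graceful rejection
    valid, rejected = [], []
    for entry in order_json.get("order", []):
        info = menu_lookup.get(entry.get("item", "").lower())
        if info and info["available"]:
            valid.append(entry)
        else:
            rejected.append(entry)
    return {"order": valid, "rejected": rejected}

# Simulated LLM response (the real one arrives via response.text)
raw = '{"order": [{"item": "Latte", "quantity": 1}, {"item": "Blueberry Muffin", "quantity": 2}]}'
order_json = json.loads(raw)
print(validate_order(order_json, menu_lookup))
```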

&lt;h2&gt;
  
  
  Conclusion: Building Smarter Interfaces 🚀
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how combining embeddings for semantic search, RAG for contextual grounding, and LLMs with structured output capabilities allows us to build intelligent agents capable of understanding natural language and generating precise, machine-readable responses. The Smart Café Menu Ordering Agent is a practical example of this pattern, directly converting a customer's free-text request into a structured JSON order.&lt;/p&gt;

&lt;p&gt;This RAG pattern is a foundational technique in AI Engineering for creating robust natural language interfaces across diverse domains. Future enhancements could involve adding conversation memory for multi-turn orders, integrating with a production-grade vector database for larger menus, implementing more sophisticated validation rules, or building out the agent workflow using frameworks like LangChain or LangGraph for increased complexity and tool use.&lt;/p&gt;

&lt;p&gt;By mastering these components, you can unlock the potential to build smarter, more intuitive systems that bridge the gap between human language and automated processes.&lt;/p&gt;

&lt;p&gt;If you found this post and the notebook helpful, please consider giving the notebook an upvote on Kaggle!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:💡&lt;/strong&gt; You can find the full code for this project in my &lt;a href="https://www.kaggle.com/code/joshimihir/smart-caf-menu-ordering-agent-capstone-2025q1" rel="noopener noreferrer"&gt;Capstone Project on Kaggle&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kaggle</category>
      <category>ai</category>
      <category>rag</category>
      <category>python</category>
    </item>
  </channel>
</rss>
