Forem: Libardo Ramirez

Building a Fully Local RAG System with Qdrant and Ollama

Libardo Ramirez — Thu, 02 Apr 2026 08:43:24 +0000

Some months ago I was working on a custom solution and I needed to add RAG to it. The requirements were simple but not flexible: everything had to run locally, and it had to be deployable in Docker alongside the rest of the services. After looking at some options, I chose Qdrant, and after doing some experiments with it I can say it was a good decision.

I know there are more complete solutions to add RAG to a local LLM setup. Frameworks like LangChain or LlamaIndex already abstract most of what I will describe here. But my requirements were not complex, and I did not want to add more dependencies and abstractions on top of a stack I already understand. Keeping things explicit made more sense for this project.

This article explains what I learned. It is not a deep technical guide; it is more of a conceptual explanation for developers who want to understand how Qdrant and Ollama work together before they start coding.

Why Run Everything Locally?

My client did not want documents leaving their network, so I did not have much to think about. But even before I started the project, I was already curious about local LLMs. I wanted to understand how far you can go without depending on external services.

The answer is: pretty far. The models available through Ollama are good enough for most practical use cases, and tools like Qdrant make the infrastructure side simple. The cost of "running local" is much lower than I expected, both in setup time and in hardware requirements.

The tradeoff is real though. A local 7B model is not going to perform like GPT-4. For this project that was fine, because the task is retrieval and summarization, not complex reasoning. The model just needs to read some context and write a coherent answer, and for that, smaller models work well.

What is RAG?

RAG is not a new idea; it has been around for a while and is a well-known pattern by now. But it is very useful for this type of use case, and it is worth understanding how it works before you start connecting the tools.

A standard LLM only knows what it learned during training, and it can only answer questions from that knowledge. If you ask it about your internal documents, your company wiki, or a PDF you have, it has no idea about that content.

RAG solves this by adding a retrieval step before the model generates an answer: it searches your documents, finds the relevant parts, and gives them to the model as context. The model then uses that context to write the answer, so the response is grounded in your real data and not just in what the model learned during training. This reduces hallucinations considerably.

The steps are:

1. Index your documents - split them into small pieces, convert each piece into a vector (a numerical representation of its meaning), and store those vectors in a vector database.
2. Receive a question - convert the question into a vector using the same embedding model.
3. Search - find the stored pieces whose vectors are most similar to the question vector.
4. Build the prompt - put the found pieces as context before the question and pass everything to the LLM.
5. Generate the answer - the model reads the context and responds based on it.

User Question
     │
     ▼
[Embedding Model] ──► Question Vector
                              │
                              ▼
                      [Qdrant] ── similarity search ──► Top-k Chunks
                                                               │
                              ┌────────────────────────────────┘
                              ▼
                     Prompt = Context + Question
                              │
                              ▼
                         [Ollama LLM]
                              │
                              ▼
                           Answer

The Stack

Qdrant - The Vector Database

Qdrant is an open source vector database built for storing and searching vectors efficiently. In a RAG pipeline it works as the memory of the system: you push the document pieces into it during indexing, and when a question comes it finds the most relevant ones in milliseconds.

What I liked about it is how little friction there is to start. It runs as a single Docker container with no extra configuration, and its REST API is clean enough to use directly without a framework on top. Each stored item can also carry metadata alongside the vector, so you can filter results by things like document type, date, or source, which is useful when you have documents from different contexts in the same collection.

docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

It also comes with a web dashboard at http://localhost:6333/dashboard where you can browse your collections and inspect stored points, which is very useful when you are debugging why a particular chunk is or is not being retrieved.

Ollama - Local LLM Runtime

Ollama is the runtime that makes running language models locally feel simple. It handles model downloads, quantization, and serving, and you interact with it through a CLI or a local HTTP API. The API also offers an OpenAI-compatible endpoint, so most existing tools work with minimal changes.

For this RAG setup, Ollama does two things: it runs the embedding model that converts text into vectors, and the generation model that synthesizes the final answer. Having both in the same runtime keeps the stack simple: one service, one API, and no separate embedding server to manage.

Install it from ollama.com and pull the models:

ollama pull llama3.2          # generation model
ollama pull nomic-embed-text  # embedding model

How They Work Together

The indexing phase happens once, or when your documents change. You start by loading your documents. In my case this was PDFs, text files, and also some MP4 files whose audio I transcribed to text before indexing. Once you have plain text, Qdrant does not care about the original format. You then split the text into overlapping chunks, typically around 512 tokens with some overlap so context is not lost at the boundaries. For each chunk, you call Ollama's embedding API to get a vector (for example 768 dimensions with nomic-embed-text) and save that vector together with the original text and any metadata into a Qdrant collection.

The query phase runs for every user question. You convert the question to a vector using the same Ollama embedding model, pass that vector to Qdrant's search API, and get back the most similar chunks. You then build a prompt by putting those chunks as context before the question, send it to the Ollama generation model, and return the answer to the user.

One important thing to understand: you must use the same embedding model for indexing and for queries, because the vector space it creates only makes sense if both document chunks and questions are embedded in the same space. If you change the model, you need to re-index everything.

Key Things to Know

Chunking

How you split the documents affects the quality of the results more than most people expect. Chunks that are too big bring too much irrelevant text and reduce retrieval precision. Chunks that are too small lose the context needed to answer the question well.

A good starting point is chunks of 512 tokens with 64 tokens of overlap. The overlap makes sure that a sentence split across a boundary is not lost entirely. For structured documents like FAQs or product specs, splitting by logical section usually works better than splitting by character count.
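A minimal sketch of fixed-size chunking with overlap. It approximates tokens with whitespace-separated words, which is an assumption: the tokenizer that matches your embedding model will count differently. The names `chunk_text`, `chunk_size`, and `overlap` are illustrative and not part of Qdrant or Ollama.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of about chunk_size words, where consecutive
    chunks share `overlap` words. Words stand in for tokens (an approximation)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        # Stop once the window reaches the end, so we do not emit a tail
        # chunk that is fully contained in the previous one
        if start + chunk_size >= len(words):
            break
    return chunks


text = " ".join(f"w{i}" for i in range(10))
print(chunk_text(text, chunk_size=4, overlap=1))
```

With `chunk_size=512` and `overlap=64`, each new chunk starts 448 words after the previous one, so the last 64 words of one chunk are repeated at the start of the next.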

Embedding Model

For a local setup with Ollama, these are the common options:

Model	Dimensions	Notes
`nomic-embed-text`	768	Fast, good for general English
`mxbai-embed-large`	1024	Better quality, needs more resources
`nomic-embed-text-v1.5`	768	Supports flexible dimension reduction

I used nomic-embed-text, not because I did a detailed comparison, but because I had already used it some months earlier while learning RAG from a tutorial. It worked well then, and there was no reason to change. Sometimes the familiar option is good enough.

Collections in Qdrant

A collection in Qdrant is similar to a table in a relational database. When you create one you declare the vector size and the distance metric (cosine similarity is the standard for text embeddings):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient("http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
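To make the distance metric concrete, here is a small sketch of what cosine similarity computes and how a brute-force top-k search would use it. Qdrant does not search this way internally (it uses an approximate HNSW index so it does not compare the query against every stored vector); this only illustrates the scoring. The function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (assumes neither is all zeros).
    1.0 means same direction; vector length does not matter."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[int, list[float]], k: int) -> list[tuple[int, float]]:
    """Score every stored vector against the query and return the k best
    (id, score) pairs, highest score first."""
    scored = [(point_id, cosine_similarity(query, vec)) for point_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


stored = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}
print(top_k([1.0, 0.1], stored, k=2))
```

Because only the direction matters, two chunks with similar meaning score close to 1.0 even if their embeddings have different magnitudes, which is why cosine is the usual choice for text.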

Filtering by Metadata

One of the most useful features of Qdrant for RAG is the ability to filter search results by the metadata you attach to each vector. If you are building a system where different users have their own documents, you can tag each vector with a user_id and filter the search so users only retrieve their own content, without needing a separate collection for each user:

from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.query_points(
    collection_name="docs",
    query=question_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="user_id", match=MatchValue(value="alice"))]
    ),
    limit=5,
).points

A Simple Example

Here is the basic flow in Python, no framework, just the minimum to make it work end to end:

import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

OLLAMA_BASE = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL  = "llama3.2"
qdrant = QdrantClient("http://localhost:6333")

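# 1. Create collection (drop it first if it already exists)
if qdrant.collection_exists("docs"):
    qdrant.delete_collection("docs")
qdrant.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)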

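# 2. Helper to embed text with Ollama's embeddings endpoint
def embed(text: str) -> list[float]:
    resp = requests.post(
        f"{OLLAMA_BASE}/api/embeddings",
        json={"model": EMBED_MODEL, "prompt": text},
    )
    resp.raise_for_status()  # fail loudly if Ollama is down or the model is missing
    return resp.json()["embedding"]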

# 3. Index documents
documents = [
    "Qdrant is a vector database written in Rust, designed for fast nearest-neighbor search.",
    "Ollama lets you run large language models locally with a simple CLI and REST API.",
    "RAG combines information retrieval with text generation to ground LLM answers in real data.",
]

points = [
    PointStruct(id=i, vector=embed(doc), payload={"text": doc})
    for i, doc in enumerate(documents)
]

qdrant.upsert(collection_name="docs", points=points)

# 4. Search
question = "What database should I use for semantic search?"

hits = qdrant.query_points(
    collection_name="docs",
    query=embed(question),
    limit=2,
).points

context = "\n\n".join(hit.payload["text"] for hit in hits)

prompt = f"""Answer the question using only the context below.

Context:
{context}

Question: {question}
"""

# 5. Generate answer
response = requests.post(
    f"{OLLAMA_BASE}/api/generate",
    json={"model": CHAT_MODEL, "prompt": prompt, "stream": False},
)

print(response.json()["response"])

In a real project you would add proper document loading (PyMuPDF for PDFs, python-docx for Word files), better chunking logic, error handling, and a web API layer, but the core logic is exactly this.

Things to Be Careful About

The most important thing is to not change the embedding model after you already indexed your documents. The vectors from different models are not compatible, so if you switch models everything in Qdrant becomes useless and you need to re-index from the beginning. It is a good habit to keep the model name in your configuration and treat it like part of your data schema.
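One way to treat the model name as part of the schema is to record it when you build the index and refuse to query with a different one. A minimal sketch, assuming a small JSON file stored next to your data holds that record; the file layout and both helper names are hypothetical, not a Qdrant or Ollama feature.

```python
import json
from pathlib import Path


def save_index_info(path: Path, model: str, dimensions: int) -> None:
    """Record which embedding model (and vector size) built the index."""
    path.write_text(json.dumps({"embed_model": model, "dimensions": dimensions}))


def check_embed_model(path: Path, model: str, dimensions: int) -> None:
    """Raise if the configured model differs from the one used at indexing time."""
    info = json.loads(path.read_text())
    if info["embed_model"] != model or info["dimensions"] != dimensions:
        raise RuntimeError(
            f"Index was built with {info['embed_model']} ({info['dimensions']} dims), "
            f"but the config uses {model} ({dimensions} dims). Re-index before querying."
        )
```

Call `save_index_info` at the end of indexing and `check_embed_model` at startup of the query service, so a model change fails immediately instead of silently returning poor matches.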

If the answers are not good, the problem is usually in the chunking. Chunks that are too big bring too much irrelevant text and the model gets confused. Chunks that are too small lose context and the answer is incomplete. Try smaller chunks with more overlap, or split by paragraph instead of by character count. This depends a lot on the type of documents you have.

The context window is also something to watch. You are passing retrieved chunks plus the question into the LLM, and if you include too many large chunks you can go over the limit. A safe approach is to retrieve 3 to 5 chunks and keep each one under 400 tokens. llama3.2 itself supports long contexts, but Ollama runs models with a smaller context window by default, which you can raise with the num_ctx option. Check which value you are actually using and size your chunks to fit inside it.
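A sketch of keeping the prompt inside the window: add retrieved chunks in relevance order until a token budget is used up. It estimates tokens as about four characters each, a common rule of thumb for English text and only an approximation; the budget value and function names are illustrative assumptions.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def build_prompt(question: str, chunks: list[str], budget: int = 2000) -> str:
    """Add chunks (assumed sorted best first) until the estimated token
    budget for context is used up, then append the question."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit, keeping rank order
        selected.append(chunk)
        used += cost
    context = "\n\n".join(selected)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n"
    )
```

Stopping at the first chunk that does not fit, rather than skipping ahead to smaller ones, keeps the most relevant chunks and their order intact.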

On the hardware side, a 7B model in 4-bit quantization needs around 5 to 6 GB of RAM. Adding Qdrant, which is very lightweight, and the application, the total is around 8 to 10 GB. On a 16 GB machine this is comfortable. If you have less RAM, a smaller model like phi3.5 at 3.8B parameters is a good alternative that still gives useful results.
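The arithmetic behind that estimate: at 4 bits, each parameter takes half a byte, so the weights of a 7B model alone are about 3.5 GB (common 4-bit formats store slightly more than 4 bits per weight on average). The rest goes to the KV cache for the context window and runtime overhead. A back-of-the-envelope sketch; the 2 GB overhead below is an assumed figure to show how the total reaches the 5 to 6 GB range, not a measured value.

```python
def model_memory_gb(params_billion: float, bits_per_weight: float, overhead_gb: float = 0.0) -> float:
    """Approximate memory for model weights plus a fixed overhead, in GB (1e9 bytes)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb


# Weights only, 7B parameters at 4 bits: 7e9 * 0.5 bytes
print(model_memory_gb(7, 4))  # 3.5
# With an assumed 2 GB for KV cache and runtime
print(model_memory_gb(7, 4, overhead_gb=2.0))  # 5.5
```

The same formula explains why phi3.5 is a lighter option: 3.8B parameters at 4 bits is roughly 1.9 GB of weights.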

What I Found in My Experiments

Qdrant was very simple to start with. Just run the Docker image and it works with no configuration needed. For persistent storage you only need to add a volume mount, and in a docker-compose.yml alongside the rest of the services it integrates cleanly without any special networking configuration:

docker run -d -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant

The embeddings from Ollama worked well from the first test. I did not need to tune anything. nomic-embed-text already gave useful retrieval results for domain-specific documents without any changes.

Chunk size made a real difference in quality. I tested with 256, 512, and 1024 tokens. With 1024 the results had too much irrelevant surrounding text that diluted the retrieval signal, and with 256 some answers were missing important context. 512 was the best balance for the type of documents I was working with.
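If you want to repeat this kind of comparison with numbers instead of impressions, one simple option is a hit rate: for a handful of test questions whose answer you know comes from a particular source document, check how often that document appears in the top-k results. A minimal sketch; `retrieve` stands for whatever search you run against a collection indexed with a given chunk size, and the names and test data are hypothetical.

```python
from typing import Callable


def hit_rate(
    test_cases: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
) -> float:
    """Fraction of (question, expected_source) pairs where the expected source
    appears among the sources returned by retrieve(question)."""
    if not test_cases:
        return 0.0
    hits = sum(1 for question, source in test_cases if source in retrieve(question))
    return hits / len(test_cases)


# Toy retriever standing in for a real Qdrant search that returns source names
fake_results = {"q1": ["a.pdf", "b.pdf"], "q2": ["c.pdf"]}
print(hit_rate([("q1", "b.pdf"), ("q2", "a.pdf")], lambda q: fake_results[q]))
```

Run it once per collection (for example one indexed with 256-token chunks, one with 512, one with 1024) using the same questions, and compare the scores.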

The Qdrant dashboard at http://localhost:6333/dashboard was also more useful than I expected. When a retrieval is not working as expected, you can open it and see exactly what is stored and what is being returned for a query. It saves a lot of time compared to adding print statements to the code.

When to Use This Stack

This setup works well for internal knowledge bases, documentation search, or any project where documents cannot leave the company network. It is also good for simple Q&A over a set of documents, or for prototyping when you do not want to pay for API calls while you are still experimenting.

It is not the best choice when you need complex reasoning. Smaller local models are not as capable as GPT-4-class models for that. If your document collection is very large, with millions of vectors, Qdrant supports distributed mode for that but it is a different and more complex setup. And if your project needs support for many languages, it is worth checking the embedding model benchmarks carefully before choosing one, because quality varies a lot between models.

Conclusion

When I needed to add RAG to my project, I wanted something that runs local, works in Docker, and is not too complex to set up. Qdrant was the right choice for that. Together with Ollama, the stack is straightforward: Ollama handles the models for both embedding and generation, and Qdrant handles the storage and search.

It is not the most powerful setup you can build, and I know there are more complete frameworks available. But for requirements like mine, it works very well, the setup time is short, and the result is a RAG system with no external dependencies, no token costs, and no data leaving the infrastructure.

If you are thinking about adding local RAG to a project, this is a good place to start.

References

Qdrant Documentation
Qdrant GitHub - 27,000+ stars as of 2025
Ollama Official Site
Qdrant Python Client
nomic-embed-text on Ollama
mxbai-embed-large on Ollama
Qdrant 2025 Recap: Powering the Agentic Era

From Python ML to Swift: Translating MicroGPT

Libardo Ramirez — Tue, 17 Feb 2026 14:29:18 +0000


A few months ago I wrote about moving from Pandas in Python to TabularData in Swift for data exploration. That experiment showed me that Swift could handle data work in ways I did not expect. I wanted to take the next step.

When Andrej Karpathy posted his MicroGPT on LinkedIn it got my attention right away. It is a complete GPT implementation in 200 lines of pure Python. No PyTorch, no TensorFlow. Just the raw algorithm.

I work with machine learning and I understand how transformers, attention and backpropagation work. So this was not about learning ML from scratch. It was about continuing my journey of bringing ML into Swift.

I asked myself: can I translate this Python ML code into Swift and keep the same clarity?

The answer is yes. And the Swift version turned out to be 3 to 4x faster.

Why I Did This

I am not trying to replace Python for ML. Python has the best ecosystem for research and prototyping. That is not going to change.

But I want Swift as part of my ML workflow. After the TabularData experiment I saw that Swift can handle data. Now I wanted to see if it could handle models too. If it can, that opens the door to on-device inference, native Apple integration and production use without depending on Python bridges.

What MicroGPT Is

Karpathy's code builds a tiny GPT from scratch. It has a custom autograd engine for backpropagation, an attention mechanism, an Adam optimizer, character-level training and text generation. Everything fits in 200 lines.

The beauty is in what it does not have. No frameworks. No abstractions. Just the core ideas in plain code. That makes it perfect for translation because every line matters and every concept is visible.

Translating the Code

I did not copy and paste. I read the Python, understood what each part does and wrote the Swift version from scratch.

Some parts translated directly. The math is the same in any language. Softmax, cross-entropy, matrix operations. The formulas do not care about syntax.

Other parts needed Swift thinking. Python lets you be loose with types. Swift makes you say exactly what you mean. That felt limiting at first but it catches mistakes before you run the code.

Here is the core Value class in both languages.

Python:

class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

Swift:

class Value {
    var data: Double
    var grad: Double = 0
    var children: [Value]
    var localGrads: [Double]

    init(_ data: Double, children: [Value] = [], localGrads: [Double] = []) {
        self.data = data
        self.children = children
        self.localGrads = localGrads
    }

    static func + (lhs: Value, rhs: Value) -> Value {
        return Value(lhs.data + rhs.data, children: [lhs, rhs], localGrads: [1, 1])
    }
}

The structure is nearly identical. The Swift version adds type annotations like var data: Double instead of just self.data = data. That is not extra work. It is documentation built into the code.

The same applies to softmax.

Python:

def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]

Swift:

func softmax(_ logits: [Value]) -> [Value] {
    let maxVal = logits.map { $0.data }.max()!
    let exps = logits.map { ($0 - maxVal).exp() }
    let total = exps.reduce(Value(0), +)
    return exps.map { $0 / total }
}

Same algorithm. Same structure. Different syntax. The translation feels natural once you get used to Swift's way of expressing things.
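Both versions subtract the maximum logit before exponentiating. That step does not change the result, because softmax gives the same output when the same constant is subtracted from every input, but it keeps exp from overflowing on large logits. A small plain-float Python sketch (not part of MicroGPT) that shows the difference:

```python
import math


def naive_softmax(logits: list[float]) -> list[float]:
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stable_softmax(logits: list[float]) -> list[float]:
    # Shifting by the max makes the largest exponent exp(0) = 1
    max_val = max(logits)
    exps = [math.exp(x - max_val) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(stable_softmax([1000.0, 1001.0]))  # roughly [0.269, 0.731]
try:
    naive_softmax([1000.0, 1001.0])
except OverflowError:
    print("naive softmax overflowed")
```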

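One thing that takes more work in Swift is operator overloading. In Python, __add__ converts a plain number to a Value at runtime, and a reflected __radd__ covers the case where the number is on the left, so mixed arithmetic needs only a couple of methods. Swift makes you define each combination: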

static func + (lhs: Value, rhs: Value) -> Value {
    return Value(lhs.data + rhs.data, children: [lhs, rhs], localGrads: [1, 1])
}

static func + (lhs: Value, rhs: Double) -> Value {
    return lhs + Value(rhs)
}

static func + (lhs: Double, rhs: Value) -> Value {
    return Value(lhs) + rhs
}

More code. But explicit. The compiler will not let you pass the wrong type by accident.

Performance

I ran both versions on my M1 Mac with two datasets.

First the names dataset with 32,033 baby names, training for 1000 steps:

Python:
num docs: 32033, vocab size: 27, num params: 4192
step 1000 / 1000 | loss 2.6497
Time: 75 seconds

Swift:
num docs: 32033, vocab size: 27, num params: 4192
step 1000 / 1000 | loss 1.8899
Time: 25 seconds

Then the Tiny Shakespeare dataset with 1,115,394 characters split into 32,777 lines, also 1000 steps:

Python:
num docs: 32777, vocab size: 39, num params: 4576
step 1000 / 1000 | loss 2.2949
Time: 186.16 seconds

Swift:
num docs: 32777, vocab size: 39, num params: 4576
step 1000 / 1000 | loss 1.9374
Time: 46.58 seconds

Dataset	Python	Swift	Speedup
Names (32K docs)	75.0s	25.0s	3.0x
Shakespeare (1.1M chars)	186.16s	46.58s	4.0x

Swift is 3x faster on the names dataset and 4x faster on Shakespeare. The larger dataset probably shows a bigger advantage because its larger vocabulary (39 characters instead of 27) and longer sequences mean more math per step, and Swift avoids the interpreter overhead that Python pays on every operation.

The reason is straightforward. Python interprets bytecode at runtime. Swift compiles to native machine code. The compiler sees the whole program, specializes types and inlines functions before execution. For tight loops doing millions of math operations, that overhead adds up.

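The loss values are also different. Swift reaches a lower loss in the same number of steps. This probably does not come from floating-point precision, since Python floats and Swift Doubles are both 64-bit IEEE 754 numbers. A more likely cause is that the two versions use different random number generators, so the weight initialization and the order of training documents differ, and the loss reported at a single step is noisy.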

Both versions produce working results. The names model generates plausible names and the Shakespeare model picks up patterns like word boundaries, common words and even colon-delimited speaker labels.

What I Learned

Translating code between languages forces you to understand the algorithm in a way that reading alone does not. I had to think about why the topological sort matters for backpropagation, how attention weights flow through the network and where gradients accumulate.
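Why the topological sort matters: a node's gradient is only complete once every node that uses it has passed its contribution back, so the backward pass must visit nodes in reverse topological order. A minimal Python sketch in the spirit of the Value class shown earlier, with only addition and multiplication (plus __radd__ and __rmul__ so plain numbers work on either side). It is an illustration, not Karpathy's exact code.

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    # Called for 2 + v and 2 * v, when the left operand is a plain number
    __radd__ = __add__
    __rmul__ = __mul__

    def backward(self):
        # Build a topological order: every node appears after its children
        order, visited = [], set()

        def visit(node):
            if node not in visited:
                visited.add(node)
                for child in node._children:
                    visit(child)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs. Each node's grad is final
        # before it is pushed to its children (chain rule), and += lets a node
        # used in several places accumulate all its contributions.
        for node in reversed(order):
            for child, local in zip(node._children, node._local_grads):
                child.grad += local * node.grad


a, b = Value(2.0), Value(3.0)
c = a * b + a  # a is used twice, so its gradient accumulates: b + 1
c.backward()
print(a.grad, b.grad)  # 4.0 2.0
```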

The Swift version is 366 lines compared to 200 in Python. Swift needs more type annotations and more operator overloads. But every line is clear. There is no guessing about what type a variable holds or what operations are valid.

The other thing I noticed is that Swift catches errors at compile time. In Python I would run the code, wait for it to crash and then fix the problem. In Swift the compiler tells me before I waste that time.

Where This Goes Next

Python is still the right choice for ML research and prototyping. The ecosystem is too strong. PyTorch and TensorFlow are not going anywhere.

But for on-device ML on Apple platforms, Swift makes sense. You get native speed, type safety and direct access to Core ML and Metal. If you are already writing Swift for the UI, why add a Python bridge?

I started this journey with TabularData for data exploration. Now I have MicroGPT for model training. The pattern is clear. Swift can handle more of the ML pipeline than I expected.

After this experiment I want to add model caching to save and load trained weights, build a SwiftUI interface around it and try converting to Core ML for production use.

The complete Swift code is available in my Gist. You can download it, compile it and run it:

curl -O https://gist.githubusercontent.com/libardoram/e7168a6513640921dd96148d33fd9302/raw/microgpt.swift
swiftc -O microgpt.swift -o microgpt
./microgpt

After ten years with Python I did not expect Swift to feel this natural for ML work. It is not a replacement. It is an addition. The best tool depends on what you are building.

Building EarnScreen: Turning Screen Time from Waiting Game into Opportunity

Libardo Ramirez — Tue, 10 Feb 2026 05:21:44 +0000

Screens aren't the enemy in our house. My kid texts friends, watches tutorials, stays connected. My wife and I read the news, video-call family, manage our lives. We all have good reasons to pick up our phones. That's exactly what makes it hard to put them down.

So why did I build a screen time app?

Because having good reasons to use screens doesn't mean we're using them well. The average teen spends over 7 hours a day on screens — nearly half their waking life. A 2025 JAMA study found that addictive screen use is linked to anxiety and depression in preteens. Our family isn't there. But screens creep in quietly, and I wanted to change the equation before "normal use" became the only kind of use we knew.

The Idea: Earn It First

Most screen time tools are built for kids who already have a problem. They block, restrict, punish. But what about kids who are doing fine? What about keeping them fine?

I didn't need a digital lockdown. I needed a gentle nudge. So I built EarnScreen around a simple idea: screen time is fine, but earn it by doing something in the real world first.

Apps start blocked by default. To unlock them, you complete a quest — walk 10 minutes, read for 15, stretch, meditate, clean your room. Then the apps open for 15, 30, or 60 minutes. The screen becomes the reward for stepping away from it.

What Actually Changed at Home

I expected resistance. I got curiosity. By the second week, my son started picking quests on his own. "I'm going to walk to the park to earn some credits," he'd say, grabbing his shoes. The language shifted from "Can I use my phone?" to "I'm going to earn my phone."

That shift — from asking permission to taking ownership — was everything. He wasn't fighting me or the app. He was building a routine where real-world activity came before screen time.

After a few weeks, some activities stopped being "just for credits." He started reading before bed because he got into a book during a quest. He walks the dog without being asked. The quests created a door. He walked through it on his own.

Not a Cage. A Structure.

Traditional screen time controls are a cage — set a limit, lock the door, let the kid bang against it. EarnScreen is a structure. Kids choose which apps to manage, which quests to complete, how to spend their credits. When credits run low, it's not because Dad said no. It's because they made choices.

The thing I value most isn't the screen time reduction. It's the conversations. Before, screens were a source of tension. Now we talk about quest strategies and credit budgets. I'm not the enforcer anymore. I'm the cheerleader.

How It Works Under the Hood

EarnScreen uses iOS Family Controls for system-level blocking — you can't swipe past it. Walking quests use CoreMotion to verify actual movement. Timed quests require full completion. Night Mode auto-blocks everything from 10 PM to 7 AM. In a pinch, math puzzles earn bonus credits.

Everything runs on-device. No data collection, no tracking. Parent-child mode syncs quest progress through CloudKit so parents can see without hovering.

Prevention, Not Intervention

Most screen time apps are interventions — designed to fix something already broken. EarnScreen is prevention. It builds habits before they're needed, teaches balance while balance still feels easy.

We don't wait for cavities to teach kids to brush their teeth. So why wait for screen addiction before doing something about screen habits?

The app is available on the App Store: EarnScreen

If you're a parent who isn't panicking about screens but wants to stay ahead of the curve, give it a look. The best time to build healthy habits is before you desperately need them.

From Pandas in Python to TabularData in Swift for Data Exploration

Libardo Ramirez — Mon, 10 Nov 2025 16:48:44 +0000

I have used Python for over ten years. Most of that time I built software and automation tools. In the last few months I started working with data science.

The tools in Python for data work are excellent. Libraries like Pandas, NumPy and Matplotlib let me load, clean and show data in just a few lines of code.

At the same time I have written Swift code for Apple platforms for many years. I created server applications and iOS projects. Swift feels clean, modern and safe. But when I worked with data I missed the fast interactive style that Pandas gives in notebooks.

So I asked myself: can I bring some of that data experience into Swift?

The answer led me to Apple's TabularData framework. It offers a DataFrame type that works much like Pandas. I also used Swift Playgrounds to run code line by line and see results right away.

First Steps with Data in Swift

I wanted to try a simple task. I took a small weather dataset and analyzed it in two ways. First with Python and Pandas. Then with Swift and TabularData.

This was not a speed test or a full comparison. Pandas is much more complete. I only wanted to see how close Swift can come to a normal data workflow.

Here is the file I used, named weather.csv:

date,temp_max,temp_min,precipitation
2024-07-01,30,22,0
2024-07-02,31,21,0
2024-07-03,28,20,5
2024-07-04,29,19,2

How I Work with Pandas

After many years the Pandas code feels natural:

import pandas as pd

df = pd.read_csv("weather.csv")
df["temp_range"] = df["temp_max"] - df["temp_min"]
df.dropna(inplace=True)

print(df.head())
print("Average Max Temp:", df["temp_max"].mean())
print("Total Rainfall:", df["precipitation"].sum())

I load the file, add a new column, remove empty rows and calculate basic numbers. It is quick and easy.

This simple flow is why Pandas is so popular for data work.

This time I did not open Jupyter or Matplotlib. I opened Swift Playgrounds instead.

Moving the Same Steps to Swift

I started a macOS Playground and added two imports:

import TabularData
import Foundation

Loading the CSV file was straightforward:

let fileURL = URL(fileURLWithPath: "/Users/Shared/weather.csv")
var df = try DataFrame(contentsOfCSVFile: fileURL)
df.prefix(5)

The table appeared in the Playground sidebar. It was not as detailed as Pandas output but it gave me the same feeling of exploring data.

Next I added a column for the temperature range. The Swift way needs a few more lines:

df = df.dropMissingRows()
// Typed access gives Column<Int>, whose elements are Int? (nil for missing values)
let tempRange = zip(df["temp_max", Int.self], df["temp_min", Int.self]).map { max, min -> Int? in
    guard let max, let min else { return nil }
    return max - min
}
df.append(column: Column(name: "temp_range", contents: tempRange))

It takes more code than Pandas. But every step is clear. This makes mistakes less likely later.

Getting Results in Swift

Printing the numbers felt familiar and safe:

let avgTemp = df.mean(of: "temp_max")
let totalRain = df.sum(of: "precipitation")

print("Average Max Temp:", avgTemp ?? 0)
print("Total Rainfall:", totalRain ?? 0)

The values matched the Pandas results exactly.
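For reference, these are the numbers both tools should produce for the four rows of weather.csv: the average maximum temperature is (30 + 31 + 28 + 29) / 4 = 29.5, and the total rainfall is 0 + 0 + 5 + 2 = 7. A small standard-library Python sketch that computes the same values without Pandas, useful as an independent check. It reads the CSV from a string so it runs on its own.

```python
import csv
import io

WEATHER_CSV = """date,temp_max,temp_min,precipitation
2024-07-01,30,22,0
2024-07-02,31,21,0
2024-07-03,28,20,5
2024-07-04,29,19,2
"""

rows = list(csv.DictReader(io.StringIO(WEATHER_CSV)))
temp_max = [int(r["temp_max"]) for r in rows]
temp_range = [int(r["temp_max"]) - int(r["temp_min"]) for r in rows]
rainfall = [int(r["precipitation"]) for r in rows]

print("Average Max Temp:", sum(temp_max) / len(temp_max))  # 29.5
print("Total Rainfall:", sum(rainfall))  # 7
print("Temp ranges:", temp_range)  # [8, 10, 8, 10]
```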

Pandas is built for fast exploration. Swift is built for safety and clear types. You write a bit more but you gain confidence.

Swift Playgrounds make the process enjoyable. They are interactive and clean. You write normal Swift code and see results instantly.

Creating Charts in Swift

In Pandas I normally use Matplotlib to draw lines:

import matplotlib.pyplot as plt
plt.plot(df["date"], df["temp_max"], label="Max Temp")
plt.plot(df["date"], df["temp_min"], label="Min Temp")
plt.legend()
plt.show()

In Swift I used the Charts framework with SwiftUI:

import Charts

let chart = Chart(df.rows, id: \.self) {
    LineMark(
        x: .value("Date", $0["date"]!),
        y: .value("Max Temp", $0["temp_max"]!)
    )
    LineMark(
        x: .value("Date", $0["date"]!),
        y: .value("Min Temp", $0["temp_min"]!)
    )
}

chart

The chart appeared directly in the Playground.

For someone used to Python plotting this felt smooth and exciting. It is not as flexible yet but it shows that Swift can handle visualization too.

What I Learned from This Test

This small project taught me a few clear points. Pandas is still the best choice for big data tasks because it is fast, powerful, and has a huge community. TabularData adds useful data tools to Swift, but it does not replace Pandas; instead, it gives you a safe way to work with tables inside Swift code. Swift Playgrounds are great for trying data ideas, since they feel like notebooks but stay inside the normal Swift environment. Swift Charts makes simple graphs easy, and they look good right in the Playground output.

After ten years with Python I did not expect Swift to feel good for data work. TabularData changed my view.

The framework is still young and lacks many Pandas features, but it brings Swift's strengths: clean code, strong types, and tight integration with Apple platforms.

For me, this experiment was less about comparing tools and more about curiosity. It showed me that Swift can handle data in ways I didn’t expect: not as a replacement for Python, but as a new space to explore ideas, visualize results, and keep everything within Apple’s ecosystem.

Building a Bedtime Stories App

Libardo Ramirez — Mon, 22 Sep 2025 13:38:18 +0000

Developer Insights with Apple's Foundation Models and Image Playground

I built a bedtime stories app for kids using Apple's Foundation Models and Image Playground frameworks, and these are the key takeaways for other developers. Both are part of Apple Intelligence (Foundation Models was introduced at WWDC 2025), and together they let you generate text and images on-device, keeping things fast and private. This article walks through the backend logic for creating structured stories with a custom StoryResponse struct and generating an image, sharing what worked and what didn’t.

Step 1: Set Up the Environment

Takeaway: Check compatibility first to avoid crashes. Target iOS 26+ or macOS 26+ (the first releases that include the Foundation Models framework) on Apple Silicon (A17 Pro, M1, or later) with Apple Intelligence enabled. In Xcode, import the frameworks:

import FoundationModels
import ImagePlayground

FoundationModels handles text generation and ImagePlayground handles images. The on-device models are managed by the system and may not be downloaded or ready yet, so verify availability to catch unsupported devices early.

Step 2: Define the Structured Output

Takeaway: Keep your struct simple to avoid confusing the LLM. The @Generable macro enforces structure, but too many fields can break things. Here's the StoryResponse struct I used:

@Generable
struct StoryResponse {
    @Guide(description: "Story title, catchy, relevant, under 5 words.")
    let title: String
    @Guide(description: "Story body, whimsical style for kids.")
    let paragraph: [Paragraph]
    @Guide(description: "Image description, complements story, whimsical style.")
    let imageDescription: String
    @Guide(description: "Category like adventure, fantasy, or general if unsure.")
    let category: String
    @Guide(description: "One-sentence story summary.")
    let summary: String
    @Guide(description: "Main characters, comma-separated.")
    let characters: String

    @Generable
    struct Paragraph {
        @Guide(description: "Paragraph order in story.")
        let order: Int
        @Guide(description: "Paragraph text.")
        let text: String
    }
}

Combined with the session instructions (which ask for 3 to 5 paragraphs), this gives you a short title, the story paragraphs, an image description, a category, a summary, and characters. The @Guide annotations tell the LLM what you want in each field, cutting down on bad outputs.
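The @Guide descriptions steer the model, but a description alone does not guarantee limits like "under 5 words" or "3 to 5 paragraphs". A cheap safeguard is to check the finished story before showing it. The sketch below uses `StoryDraft`, a hypothetical plain struct with the same fields as StoryResponse (so it runs without FoundationModels), and a hypothetical `problems(in:)` function that lists rule violations. The limits come from the @Guide text and the session instructions.

```swift
// Hypothetical stand-in for StoryResponse (same fields, no @Generable),
// so the check runs without FoundationModels.
struct StoryDraft {
    struct Paragraph {
        let order: Int
        let text: String
    }
    let title: String
    let paragraph: [Paragraph]
    let imageDescription: String
    let category: String
    let summary: String
    let characters: String
}

// Counts words by splitting on any whitespace.
func wordCount(_ text: String) -> Int {
    text.split(whereSeparator: { $0.isWhitespace }).count
}

// Lists rule violations; an empty array means the story passes.
func problems(in story: StoryDraft) -> [String] {
    var issues: [String] = []
    // @Guide asks for a title under 5 words
    if wordCount(story.title) >= 5 {
        issues.append("title has \(wordCount(story.title)) words")
    }
    // The session instructions ask for 3 to 5 paragraphs
    if !(3...5).contains(story.paragraph.count) {
        issues.append("expected 3 to 5 paragraphs, got \(story.paragraph.count)")
    }
    // Each paragraph should have its own order value
    if Set(story.paragraph.map(\.order)).count != story.paragraph.count {
        issues.append("paragraph order values repeat")
    }
    // The image step and the summary both need non-empty text
    if story.imageDescription.isEmpty || story.summary.isEmpty {
        issues.append("image description or summary is empty")
    }
    return issues
}

let draft = StoryDraft(
    title: "Ember Learns to Fly",
    paragraph: [
        .init(order: 1, text: "Ember the little dragon watched the birds."),
        .init(order: 2, text: "She tried again and again."),
    ],
    imageDescription: "dragon over a glowing forest",
    category: "fantasy",
    summary: "A little dragon keeps trying until she flies.",
    characters: "Ember, Owl"
)
print(problems(in: draft))
```

In the app, a non-empty result could trigger a retry or a friendlier fallback instead of showing a broken story.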

Step 3: Initialize Model and Session

Takeaway: Clear session instructions save time. Without them, I got random stories. Here's how to set up the model and session:

let model = SystemLanguageModel.default
guard case .available = model.availability else {
    fatalError("Model unavailable: \(model.availability)")
}

let session = LanguageModelSession(
    instructions: "You're a storyteller for kids aged 3 to 7. Use simple, dreamy language and positive themes. Output a title under 5 words, 3 to 5 paragraphs, an image description, a category like fantasy, a one-sentence summary, and comma-separated characters."
)

Checking model.availability early caught issues with disabled Apple Intelligence or old hardware.

Step 4: Craft the Prompt

Takeaway: Specific prompts beat vague ones, especially when you want a clean and safe response from Apple's on-device models. In the final app I had to be careful writing both the session instructions and the prompt to get usable results:

let childName = "Emma"
let theme = "brave little dragon learning to fly"
let prompt = """
Generate a bedtime story for \(childName) about \(theme). Include:
1. Title under 5 words.
2. 3 to 5 paragraphs, 50 to 100 words each, whimsical style for kids 3 to 7.
3. Image description for a key scene, like a dragon over a glowing forest.
4. Category, like fantasy or adventure.
5. One-sentence summary.
6. Characters, comma-separated.
Make it happy, soothing, with a perseverance moral.
"""

Set options for better control:

let options = GenerationOptions(
    temperature: 0.7, // Keeps it creative but not wild
    maximumResponseTokens: 600 // Around 300 to 400 words
)

A temperature of 0.7 worked best; anything above 0.8 got too random.
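The prompt and the token cap should agree with each other. The prompt asks for 3 to 5 paragraphs of 50 to 100 words, which is 150 to 500 words of body text. A rough sketch of that check follows. It assumes about 0.75 English words per token, a rule of thumb from other tokenizers rather than a measured figure for Apple's model, and `requestedWordRange` and `approximateWords` are hypothetical helpers.

```swift
// Back-of-the-envelope check of the prompt's word request against the token cap.
// wordsPerToken = 0.75 is an assumed rule of thumb, not a value for Apple's tokenizer.
func requestedWordRange(paragraphs: ClosedRange<Int>, wordsPerParagraph: ClosedRange<Int>) -> ClosedRange<Int> {
    (paragraphs.lowerBound * wordsPerParagraph.lowerBound)...(paragraphs.upperBound * wordsPerParagraph.upperBound)
}

func approximateWords(forTokens tokens: Int, wordsPerToken: Double = 0.75) -> Int {
    Int(Double(tokens) * wordsPerToken)
}

let requested = requestedWordRange(paragraphs: 3...5, wordsPerParagraph: 50...100)
let budget = approximateWords(forTokens: 600)

print("Requested body length: \(requested.lowerBound) to \(requested.upperBound) words")
print("Approximate budget: \(budget) words")
if requested.upperBound > budget {
    print("The longest stories may be cut off; trim the request or raise maximumResponseTokens.")
}
```

Under that assumption, 600 tokens covers roughly 450 words before the title, summary, image description, and other fields take their share. That fits the 300 to 400 word estimate in the comment above and suggests the 5 × 100 word upper end may not fit.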

Step 5: Stream the Response and Generate Image

Takeaway: Streaming is great but needs careful buffer handling. Adding ImageCreator tied text to visuals cleanly, but before generating the image it's better to get a summarized version of the story and add some restrictions, because Apple Image Playground is very picky about images that include people and needs a persona to be defined. Here is the streaming and image code:

do {
    let stream = session.streamResponse(to: prompt, generating: StoryResponse.self, options: options)
    // Latest snapshot; its fields stay optional until the model has written them
    var storyBuffer: StoryResponse.PartiallyGenerated?
    for try await snapshot in stream {
        storyBuffer = snapshot.content
        // Log for debugging
        print("Title: \(storyBuffer?.title ?? "")")
        print("Paragraphs so far: \(storyBuffer?.paragraph?.count ?? 0)")
        print("Image Description: \(storyBuffer?.imageDescription ?? "")")
        print("Category: \(storyBuffer?.category ?? "")")
        print("Summary: \(storyBuffer?.summary ?? "")")
        print("Characters: \(storyBuffer?.characters ?? "")")
    }
    // Final story: collect() returns the completed, fully typed StoryResponse
    let finalStory = try await stream.collect().content
    print("Complete story: \(finalStory)")

    // Generate image; images(for:style:limit:) returns an async sequence
    let imageCreator = try await ImageCreator()
    let style = ImagePlaygroundStyle.animation
    let images = imageCreator.images(
        for: [.text(finalStory.imageDescription)],
        style: style,
        limit: 1
    )
    for try await image in images {
        // image.cgImage is the animation-style picture to show in the UI
        print("Generated image for: \(finalStory.imageDescription), \(image.cgImage.width) px wide")
    }
} catch {
    print("Error: \(error)")
}

Streaming Tip

Streaming lets you process the story in real time. Keep the latest snapshot in storyBuffer and treat every field as possibly incomplete until the stream finishes. @Generable keeps the output on track.
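To see why assigning each snapshot to storyBuffer (rather than appending) is the right move, here is a plain-Swift simulation with made-up title snapshots. It assumes, as with Foundation Models' snapshot streaming, that every snapshot holds everything generated so far rather than only the newest tokens.

```swift
// Simulated snapshots of one field as a stream fills it in; each holds the text so far.
let titleSnapshots = ["Ember", "Ember Learns", "Ember Learns to", "Ember Learns to Fly"]

// Wrong for snapshot streams: appending treats every snapshot as a new piece.
var appended = ""
for snapshot in titleSnapshots {
    appended += snapshot
}

// Right for snapshot streams: each new snapshot replaces the buffer.
var latest = ""
for snapshot in titleSnapshots {
    latest = snapshot
}

print("Appended:", appended)
print("Latest:", latest)
```

Appending repeats earlier text, while keeping the latest snapshot yields the finished title.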

Image Tip

The imageDescription field feeds ImageCreator. ImagePlaygroundStyle.animation gives kid-friendly visuals, but short, specific descriptions (like "dragon on glowing hill") work better than long ones. Sticking to one image keeps things fast.
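One way to enforce "short and specific" is to trim the text before it reaches ImageCreator. This is a minimal sketch: `imageConcept(from:maxWords:)` is a hypothetical helper, and both the cut at the first comma or period and the eight-word cap are arbitrary choices for illustration, not ImageCreator requirements. The result would go into `.text(...)` in the `images(for:style:limit:)` call.

```swift
// Hypothetical helper: keeps the first clause of a description and caps its length.
func imageConcept(from description: String, maxWords: Int = 8) -> String {
    // Keep only the first clause, up to the first comma or period
    let firstClause = description.split(whereSeparator: { $0 == "," || $0 == "." }).first ?? ""
    // Then cap the number of words so the concept stays short
    return firstClause
        .split(whereSeparator: { $0.isWhitespace })
        .prefix(maxWords)
        .joined(separator: " ")
}

let longDescription = "Ember the dragon flying over a glowing forest, with stars and a smiling moon."
print(imageConcept(from: longDescription))
```

Cutting at the first clause tends to keep the main subject and scene while dropping decorative details.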

Step 6: Error Handling and Optimization

Takeaway: Expect errors and plan for them. I hit these:

Text: context-window errors (.exceededContextWindowSize) from long prompts; capping maximumResponseTokens at 600 fixed it.
Image: Bad imageDescription inputs; tweaking prompts helped.
Streaming: Partial responses; keeping the latest snapshot in storyBuffer solved it.

Optimization notes:

Text: Stick to temperature 0.6 to 0.8, reuse sessions for context.
Image: Short imageDescription, single image for speed.
Performance: Test on real devices; older Apple Silicon lags on image generation.

Step 7: Advanced Enhancements

Takeaway: Chaining prompts adds polish. Generating the title first improved flow:

let titlePrompt = "Generate a title under 5 words for a story about \(theme)."
let title = try await session.respond(to: titlePrompt).content
let fullPrompt = "Using title '\(title)', generate a story for \(childName)..."

Personalizing with user inputs (like favorite animals) and reusing sessions for sequels made stories feel connected.
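Personalization and sequels both come down to feeding more of the user's input into the prompt, and reusing the same session lets the model see the earlier story in its transcript. Here is a minimal sketch of the prompt side in plain Swift. `StoryRequest`, `favoriteAnimal`, and `previousTitle` are hypothetical names for this example, not Apple API; the resulting string would go to `session.respond(to:)` or `streamResponse(to:generating:options:)`.

```swift
// Hypothetical container for the inputs a parent picks in the app.
struct StoryRequest {
    let childName: String
    let theme: String
    var favoriteAnimal: String? = nil
    // Title of the previous story, used to ask for a sequel in the same session
    var previousTitle: String? = nil

    // Builds the prompt, adding optional lines only when the user supplied them.
    func prompt(title: String) -> String {
        var lines = [
            "Using title '\(title)', generate a bedtime story for \(childName) about \(theme)."
        ]
        if let animal = favoriteAnimal {
            lines.append("Include a friendly \(animal) as a side character.")
        }
        if let previous = previousTitle {
            lines.append("Make it a sequel to '\(previous)' with the same main characters.")
        }
        lines.append("Make it happy, soothing, with a perseverance moral.")
        return lines.joined(separator: "\n")
    }
}

let request = StoryRequest(childName: "Emma", theme: "brave little dragon learning to fly", favoriteAnimal: "owl")
print(request.prompt(title: "Ember Learns to Fly"))
```

Keeping optional inputs as optionals means an unused field simply adds no line, so the base prompt stays identical for users who skip personalization.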

Conclusion

Building this bedtime stories app showed how to combine Apple's Foundation Models and Image Playground into a fast, private storytelling engine. The @Generable macro kept StoryResponse outputs structured, streamResponse enabled real-time text generation, and ImageCreator added kid-friendly visuals. Key lessons for developers: verify device compatibility early, craft precise prompts, and keep structs simple to avoid LLM issues. Testing on target devices caught performance hiccups, and tweaking imageDescription to describe only animals, objects, and environments was critical for getting solid images.

There's plenty of room to improve. You can add multi-language support with prompt tweaks or integrate Apple's speech APIs for narration. For images, experimenting with other ImagePlaygroundStyle options or generating multiple images per story could enhance visuals. Chaining inferences, like creating character backstories first, can deepen narratives. Start small, test frequently, and refine prompts to get the most out of the model. Want to see it in action? Check out my app (Bedtime Snuggles) on the App Store to try the stories and images for yourself.