Forem: Lingdas1

👻 Crash #1: The Gateway Ghost — When Your AI Pretends to Work

Lingdas1 — Sun, 24 May 2026 11:51:59 +0000

# 👻 Crash #1: The Gateway Ghost

> *"Did I do something wrong? Let me reinstall everything."*

---

## What Happened

I followed the tutorial step by step. Everything installed perfectly. I was thrilled.

Then the gateway — the bridge connecting my AI assistant to the messaging app — started disconnecting randomly. Sometimes it worked for hours. Sometimes it died after 10 minutes. No pattern, no error message, nothing to Google.

**My response:** I wiped everything and reinstalled. Twice.

**The actual fix:** I just needed to restart the gateway. That's it.

---

## What I Learned

**Before you assume you broke something, try turning it off and on again.**

It's a cliché because it works. I wasted an entire evening reinstalling software that was fine. The problem was a process that needed a kick, not a configuration that needed a rewrite.

---

## 🛡️ Golden Rule Reminder

> **If it works, don't touch it.** I reinstalled a perfectly good setup twice before trying the simplest fix. Always try the 10-second solution before the 2-hour one.

> **Run everything in a VM.** If my gateway was already inside a VM with a snapshot, I could have just rolled back instead of reinstalling from scratch.

---

*← Full story: [I Broke My AI Assistant 7 Times](https://dev.to/lingdas1/i-broke-my-ai-assistant-7-times-heres-what-i-learned-47le)*

💬 Your Turn

Have you run into a similar problem? Or hit a wall I didn't mention?

Drop a comment below — I read every single one. Your experience might help someone else who's stuck on the same thing.

The more we share our screw-ups, the fewer people have to make them. 🤝

🏗️ Day 1: I Almost Bought a Phone for AI (And Other Beginner Mistakes)

Lingdas1 — Sun, 24 May 2026 11:03:15 +0000

# 🏗️ Day 1: I Almost Bought a Phone for AI (And Other Beginner Mistakes)

> The story of how I went from "I want a Jarvis" to actually building one — one crash at a time.

---

## How It Started

I found out about AI the same way most people do: scrolling through videos.

One day, it was the "Doubao Phone" — a smartphone with a built-in AI assistant that could order food, compare prices, and even play games for you. "Finally, my own Jarvis!" I thought. I almost bought one.

Then the app stores blocked it. The hype died. On to the next thing.

Next up: farming crayfish with AI. Yes, that was a real trend. A virtual crayfish farm managed by an AI agent. Fun to watch, but the token costs were insane, and the AI kept forgetting what happened five minutes ago.

I kept watching, kept wanting, kept feeling like AI was something other people did.

Then I found Hermes Agent — an open-source AI assistant you can run on your own machine. Free. Private. No subscription.

I searched for tutorials. Downloaded the files. And started the most frustrating, educational tech journey of my life.

---

## The Big Lesson

Looking back, the problem wasn't that I didn't know enough. It was that I kept chasing the next shiny thing instead of picking one path and sticking with it.

The real lesson: Stop waiting for the perfect AI product. The tools are already free and open source. You just need to pick one and start — even if you break it a few times along the way.

---

## 🛡️ The Golden Rule (Read This Before the Next Article)

> If it works, don't touch it.
>
> You never know which piece of your setup is holding everything together. That random config file you're not sure about? Leave it alone. Every time I thought "I'll just fix this one small thing," I spent 3 hours recovering.
>
> Even a stable system can break for no reason. When it does, fix only that one thing — don't "improve" everything else while you're at it.

My #1 recommendation for beginners: Run everything inside a virtual machine (VM) with Linux. Give it 100-200GB of disk space (not C: drive!). This isolates 90% of problems — host OS breaks? VM still works. VM breaks? Just restore a snapshot.

---

← Read the full story first: I Broke My AI Assistant 7 Times

Next: The Gateway Ghost 👻 →

---


I Broke My AI Assistant 7 Times. Here's What I Learned.

Lingdas1 — Sun, 24 May 2026 10:33:01 +0000


One medical student's journey from "I want a Jarvis" to accidentally becoming a self-taught DevOps engineer.

The Beginning: I Almost Bought a Phone for AI

It started with a video.

I was scrolling through Bilibili (think YouTube, but Chinese) and saw something that blew my mind: the "Doubao Phone." A smartphone with a built-in AI assistant that could do everything — order food, compare prices across stores, play games for you, book appointments. "Finally," I thought, "my own Jarvis."

I almost bought it.

Then the app store drama happened. The big companies blocked Doubao's integrations. The phone stopped being magical. And I moved on to the next viral thing.

Farming crayfish with AI.

Yes, that was a real trend. You could deploy an AI agent that managed a virtual crayfish farm. It was hilarious but also... expensive. The token costs were insane, and the AI kept forgetting what happened five minutes ago.

I watched from the sidelines, feeling that familiar itch: "I want to do this too, but I don't know how."

Then I found Hermes Agent — an open-source AI assistant you can run on your own computer. Free. Private. Controllable.

I searched Bilibili for tutorials. Downloaded the files. And thus began the longest, most frustrating, most educational tech journey of my life.

The Setup: 7 Times I Broke Everything

Here's the honest story of what happened when a medical student with no coding background tried to deploy an AI assistant on his own.

💥 Crash #1: The Gateway Ghost

What happened: I followed the tutorial step by step. Everything installed fine. Then the gateway started disconnecting randomly. Sometimes it worked for hours. Sometimes it died after 10 minutes.

My reaction: "Did I do something wrong? Let me reinstall everything."

What actually fixed it: Restarting the gateway. That's it. Just... restarting it. I had already wiped and reinstalled twice before I figured this out.

Lesson learned: Before assuming you broke something, try turning it off and on again. It's a cliché because it works.

💥 Crash #2: Russia's Internet Hates Me

What happened: I'm studying in Russia, and the internet here is... let's say unstable. VPNs get blocked. DNS fails. The whole building loses its connection for hours at a time.

I thought: "No problem — I'll download some local AI models so my assistant can work offline."

I spent a weekend downloading models. Got everything set up. It was beautiful.

The next morning, Windows gave me a blue screen of death. When it rebooted, all my downloaded models were gone. Corrupted. Unreadable.

My reaction: Staring at my screen in disbelief. 20GB of models, gone.

What actually fixed it: I switched to a different model loader, redownloaded everything, and took a screenshot of the working config this time.

Lesson learned: Back up your configuration before you think you need it. Not after.

💥 Crash #3: The C: Drive Betrayal

What happened: Everything installed to C: drive by default. Models, tools, environments — all happily eating up space on my system drive.

One morning, Windows greeted me with: "Your C: drive is almost full."

Panic.

I decided to move everything to D: drive. I consulted with another AI, got detailed migration instructions, and followed them carefully.

Everything broke.

My assistant couldn't find its files. WSL refused to start. Models were looking for paths that no longer existed.

My reaction: "But... I followed the instructions!"

What actually fixed it: I restored from a backup I thankfully made before starting, and did the migration one piece at a time — move WSL first, confirm it works, then move the model loader, confirm it works, then move the assistant.

Lesson learned: Never migrate everything at once. One step at a time. And always have a rollback plan.

💥 Crash #4: The Emulator War

What happened: Remember that Android emulator I installed months ago to play mobile games? I had uninstalled it. No big deal, right?

Wrong.

After uninstalling the emulator, WSL2 started throwing this error: HCS_E_SERVICE_NOT_AVAILABLE. Virtualization broke. Windows Subsystem for Linux stopped working. My AI couldn't run.

It turned out the emulator and WSL2 were fighting over the same virtualization resources. And when I removed the emulator, it took something with it.

My reaction: "I just deleted a game emulator. How does that break my AI assistant?"

What actually fixed it: Multiple restarts, repairing Windows Hyper-V components, and a lot of swearing at my screen.

Lesson learned: Your computer's virtualization layer is like a house of cards. Remove one component and the whole thing can collapse. Also: Windows 11 Home edition hides virtualization settings, making this 10x harder to debug.

💥 Crash #5: The Great OS Migration

What happened: After the emulator war, I decided enough was enough. I backed up everything, wiped my computer, and installed a fresh Windows. This time, I would run my AI inside a virtual machine with Linux. No more WSL2 headaches.

It worked. For about a day.

My reaction: Relief followed by confusion.

What actually fixed it: Nothing — it worked fine. I just didn't trust it anymore.

💥 Crash #6: The Invisible Network Cable

What happened: My host computer (Windows) had internet. My VM (Linux) didn't. The network adapter was set to NAT, just like every tutorial said. But the VM couldn't reach the outside world.

I spent hours checking settings, reinstalling network drivers, changing adapter types.

My reaction: "The internet works on my laptop. Why doesn't it work INSIDE my laptop?"

What actually fixed it: The VMware NAT Service and DHCP Service weren't running in Windows. They're supposed to start automatically. They didn't. One click to start them, and everything worked.

Lesson learned: When virtualization networking breaks, check the host services first, not the VM settings. And ping and curl are better debugging tools than staring at network icons.
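
Here is a rough sketch of that "ping and curl first" habit as a small script you could run inside the VM. It is only an illustration: the hosts (`1.1.1.1`, `example.com`) and the function names are my own choices, and it uses a plain TCP connection as a stand-in for ping, because real ICMP ping usually needs extra privileges. The idea is to test the layers in order, so the first failure tells you where to look.

```python
import socket
from typing import Callable, List, Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_resolve(name: str) -> bool:
    """Return True if DNS can turn a hostname into an address."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:
        return False


def first_failure(checks: List[Tuple[str, Callable[[], bool]]]) -> str:
    """Run the checks in order and name the first layer that fails."""
    for label, check in checks:
        if not check():
            return label
    return "all checks passed"


# Ordered from the lowest layer to the highest. Hosts are example choices.
CHECKS = [
    ("no route out of the VM (check the host's NAT/DHCP services)",
     lambda: can_connect("1.1.1.1", 443)),
    ("DNS is not resolving names",
     lambda: can_resolve("example.com")),
    ("HTTPS to a named host is blocked",
     lambda: can_connect("example.com", 443)),
]
```

Calling `print(first_failure(CHECKS))` inside the VM prints the first broken layer. If even the raw-IP check fails while the host is online, that points at the host-side services rather than anything inside the VM.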

💥 Crash #7: The Gateway That Lied to Me

What happened: I had set up the gateway to auto-start on boot. I checked the configuration. It said enabled: true. I was confident.

The next morning, my AI was offline again.

The gateway had "started" but hadn't actually connected. It was running as a process, but doing nothing useful.

My reaction: "But I set it to auto-start! Why is it lying to me?"

What actually fixed it: I wrote a simple script that checks every 5 minutes whether the gateway is actually connected, and restarts it if not. Bulletproof.

Lesson learned: "Running" and "working" are two different things. Always add a health check.
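
To make the "running is not the same as working" idea concrete, here is a minimal watchdog sketch. It is not the script from the story, just one way the idea could look. How you probe the gateway and how you restart it depend entirely on your setup, so both are passed in as plain functions (`is_connected` and `restart` are hypothetical names).

```python
import time
from typing import Callable, Optional


def check_once(is_connected: Callable[[], bool],
               restart: Callable[[], None]) -> bool:
    """Probe the gateway once; restart it if the probe fails.

    Returns True when a restart was triggered.
    """
    try:
        healthy = is_connected()
    except Exception:
        # A probe that crashes counts as "not connected".
        healthy = False
    if not healthy:
        restart()
        return True
    return False


def watch(is_connected: Callable[[], bool],
          restart: Callable[[], None],
          interval_seconds: float = 300,       # 5 minutes, as in the story
          max_checks: Optional[int] = None,    # None means "loop forever"
          sleep: Callable[[float], None] = time.sleep) -> int:
    """Run check_once on a fixed interval; return the number of restarts."""
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if check_once(is_connected, restart):
            restarts += 1
        checks += 1
        sleep(interval_seconds)
    return restarts
```

The important design choice is that the probe asks "is it actually connected?" and not "does the process exist?". In practice the probe might send a real test request through the gateway, and `restart` might run whatever command restarts the service on your machine.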

The Golden Rule: Don't Touch It

After weeks of crashes, debugging, and existential crises, my setup finally stabilized. Everything worked. The gateway stayed connected. The models loaded correctly. Messages flowed.

And I learned the most important lesson of all:

If it works, don't touch it.

You never know which piece of your spaghetti-code setup is holding everything together. That random config file? The one you're not sure does anything? Yeah, it probably does something. Leave it alone.

Every time I thought "I'll just fix this one small thing," I ended up spending 3 hours recovering from the consequences.

What I Want You to Know

I'm telling you all this not because I'm an expert — I'm not. I'm a medical student. I study anatomy, not APIs. I chose this career because I wanted to help people, not because I wanted to debug network services at 2 AM.

But I got it working. And if I can, you can too.

Here's what I learned that actually matters:

Before I started	After I broke everything 7 times
"AI is for programmers"	"AI is for anyone stubborn enough to try"
"I'll just follow the tutorial"	"I'll follow the tutorial and back up first"
"It should work perfectly"	"It will break, and that's normal"
"I'm not technical enough"	"Being patient matters more than being technical"

Your Turn

If you're reading this and thinking "That sounds like me" — good. You're exactly who I wrote this for.

Start with something small. Expect it to break. Back up before you change anything. And when it finally works, leave it alone.

I'm still learning. Every day something new confuses me. But I'm not scared of it anymore — because I've already broken everything that could break.

And the AI is still running.

Hi, I'm Ling. I'm a medical student in China who somehow became a self-taught AI deployer. No CS degree, no big tech job — just a laptop, broken internet, and way too much stubbornness.

This is the first of my "Real People, Real AI" series. ⭐ Star the GitHub repo to get notified when the next one drops.

P.S. — If you've broken your own AI setup in a creative way, leave a comment. Misery loves company. 😄

What Is an LLM? (No, It's Not Magic — Here's What's Actually Happening)

Lingdas1 — Sun, 24 May 2026 09:36:12 +0000


The plain-English guide to understanding AI — no jargon, no code, just the stuff that matters.

My grandfather called it "the thinking computer."

I showed him ChatGPT, and he asked: "Does it... think? Like a person?"

It's a good question. And honestly, most explanations of AI are terrible at answering it. Either they're too technical ("a transformer-based neural network with self-attention mechanisms" — whatever that means) or too mystical ("it's like a digital brain!" — no, it's not).

So let me explain what an LLM actually is. No jargon. No magic. Just the truth.

The Analogy: A Chef Who's Tried Every Recipe

Imagine the world's most experienced chef. This chef has read every cookbook ever written. Every recipe from every culture. Every food blog. Every handwritten note from every grandmother.

You ask this chef: "Can you make me something with chicken, lemon, and garlic?"

The chef has never made that exact dish before, but they've read millions of recipes. They know what works. They know chicken + lemon + garlic usually means a Mediterranean-style dish. They know garlic should be minced, not whole. They know lemon juice goes in near the end, not the beginning.

So they create a new recipe, perfectly reasonable, that has never existed before.

That's what an LLM does.

It's not "thinking." It's not "conscious." It has read an unimaginable amount of human text — books, articles, conversations, code — and learned the patterns of how we write and reason.

When you ask it a question, it doesn't "look up" an answer. It generates one, word by word, based on everything it has learned.

What LLM Actually Stands For

Large Language Model.

Let's break that down:

Language — It works with words. Text in, text out. That's its native language (pun intended).
Model — A mathematical representation of patterns. Think of it as a super-complex set of probabilities: "After the word 'I', the next word is usually a verb, and after 'I want to', the next word is often 'go' or 'get' or 'make'..." × a billion.
Large — Really, really large. These models have been trained on a large share of the public web, plus books and code. The biggest ones have learned patterns from trillions of words.
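
Optional aside for the curious (feel free to skip it): the "set of probabilities" idea can be shown with a toy that only counts which word follows which. This is far simpler than a real LLM, which looks at much more context and uses a neural network, and the tiny corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text: str):
    """Count, for every word, which words follow it."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def next_word_probabilities(follows, word: str):
    """Turn the raw follow-counts for one word into probabilities."""
    counts = follows[word.lower()]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


# A made-up three-sentence "training set".
corpus = "i want to go home . i want to make tea . i want to go out ."
model = train_bigrams(corpus)
probs = next_word_probabilities(model, "to")
print(probs)
```

In this tiny corpus, "to" is followed by "go" twice and "make" once, so "go" comes out twice as likely as "make". A real model does the same kind of thing with vastly more text and context.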

What It's NOT

Let me clear up some common confusion:

Myth	Truth
🧠 "It thinks like a human"	❌ No. It predicts words based on patterns. No consciousness, no feelings, no self-awareness.
📚 "It knows everything"	❌ Its knowledge comes only from its training data, which stops at a cutoff date. It doesn't look facts up; it generates plausible text.
🎯 "It's always right"	❌ It can be confidently wrong. It's great at sounding correct even when it's making things up.
📝 "It copies from the internet"	❌ It doesn't store copies of web pages. It learned patterns and generates original text based on those patterns.

Why "Large" Matters

Imagine two chefs:

Chef A has read 10 recipes. They know how to make exactly 10 dishes.
Chef B has read 10 million recipes. They understand cuisine at a deep level.

LLMs work the same way. The "large" in "Large Language Model" refers to:

The amount of training data — billions of web pages, books, and documents
The number of parameters — think of these as adjustable "dials" inside the model that get tuned during training. A 7-billion-parameter model (small) has 7 billion of them. A 70-billion-parameter model (large) has 70 billion.

More parameters = more pattern recognition = better reasoning (usually).

But here's the good news: you don't need the biggest model. A 7-billion-parameter model, running on a laptop, can handle most everyday tasks just fine. It's like having Chef B-lite — still experienced, still useful, much more practical.
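
Optional aside: a bit of back-of-the-envelope arithmetic shows why a 7-billion-parameter model can fit on a laptop. The sketch below assumes each parameter is stored in 2 bytes (16-bit), 1 byte (8-bit), or half a byte (4-bit); the lower-precision versions are what people usually mean by "quantized" models. It counts the weights only, so real memory use will be somewhat higher.

```python
# Approximate storage cost per parameter at common precisions.
BYTES_PER_PARAMETER = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_size_gb(parameters: float, precision: str) -> float:
    """Approximate size of the model weights alone, in GB (10**9 bytes)."""
    return parameters * BYTES_PER_PARAMETER[precision] / 1e9


for name, params in [("7B", 7e9), ("70B", 70e9)]:
    for precision in ("fp16", "int8", "int4"):
        size = weights_size_gb(params, precision)
        print(f"{name} at {precision}: ~{size:.1f} GB")
```

At 4-bit precision the 7B model's weights come to roughly 3.5 GB, which is why it can squeeze into a machine with 8GB of RAM, while the 70B model needs about 35 GB even at 4-bit.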

How It Actually Works (The Simplest Explanation)

When you type a message, here's what happens:

You type: "What is the capital of France?"

Step 1: The model breaks your question into tokens (words and pieces of words).
         ["What", " is", " the", " capital", " of", " France", "?"]

Step 2: The model starts predicting the answer, one word at a time.
         "The" → "capital" → "of" → "France" → "is" → "Paris" → "."

Step 3: Each word is chosen based on probability.
         "The capital of France is..." → P(Paris) = 95%, P(Lyon) = 2%, P(Marseille) = 1%
         → It picks "Paris" (the most probable)

Step 4: Done! "The capital of France is Paris."

It's not magic. It's a very, very sophisticated version of your phone's autocomplete, trained on a huge slice of the internet.
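
Optional aside: Step 3 can be written out in a few lines. The probabilities below are the ones from the example, plus a made-up `<other>` bucket so they add up to 100%. Always taking the top choice is called greedy picking. Many chat tools instead sample from the probabilities, which is one reason the same question can get slightly different answers.

```python
import random

# Probabilities from the example above; "<other>" is an added assumption
# so that the numbers sum to 1.
next_token_probs = {"Paris": 0.95, "Lyon": 0.02, "Marseille": 0.01, "<other>": 0.02}


def pick_greedy(probs):
    """Always take the single most probable token."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


print(pick_greedy(next_token_probs))

rng = random.Random(0)  # fixed seed so the demo is repeatable
samples = [pick_sampled(next_token_probs, rng) for _ in range(1000)]
print(samples.count("Paris"), "of 1000 samples were 'Paris'")
```

Greedy picking returns "Paris" every time. Sampling returns "Paris" in the large majority of draws, with an occasional "Lyon" or "Marseille".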

Why This Matters to You (a Regular Person)

Here's why understanding this matters:

1. You Don't Need to Be a Programmer

If you understand that an LLM predicts words based on patterns, you already understand enough to use it. The tools are designed for everyone now.

2. You Can Run It on Your Laptop

Because LLMs are just math (very complicated math, but still math), the smaller ones can run on most reasonably modern computers. A small model on your laptop is slower and less capable than ChatGPT, but it's private, free, and always available.

3. You Should Be Skeptical

Knowing that LLMs can be confidently wrong helps you use them better. Always fact-check important information. Use AI as a brainstorming partner, not an encyclopedia.

4. You're Not Left Behind

The people who benefit most from AI aren't programmers — they're writers, students, small business owners, artists, and curious people who ask good questions. That's probably you.

The Different Types of AI (In Two Sentences)

Type	What It Does	Example
LLM	Understands and generates text	ChatGPT, Claude, DeepSeek
Image generator	Creates pictures from descriptions	Midjourney, DALL-E, Stable Diffusion
Voice AI	Understands and generates speech	Siri, Whisper
Recommendation	Predicts what you'll like	TikTok, Netflix, YouTube

This series focuses on LLMs — the text-based AI that can write, explain, analyze, and assist. It's the most useful type for everyday tasks.

What You Can Actually DO with This Knowledge

Now that you know what an LLM is:

You can use one right now, for free — Ollama + a small model on your laptop
You know the limits — It's not magic, it's pattern recognition. Use it as a tool, not an oracle.
You can explain it to others — When your friends say "AI is taking over," you can say "Actually, it's just really good autocomplete, trained on a lot of data."

What's Next

Now that you know what an LLM is, the next guide shows you how to actually run one:

👉 Part 3: "Step-by-Step: Run Your First AI Model in 10 Minutes" — (coming next)

No terminal commands you don't understand. No unexplained jargon. Just a simple walkthrough with screenshots.

Hi, I'm Ling. I'm a medical student who got tired of feeling left behind by AI. I started learning, broke things, fixed them, and now I'm sharing what I've learned — in plain English, for regular people.

Found this useful? ⭐ Star the GitHub repo to get notified when new guides drop. Or leave a comment — I'd love to hear what questions you still have.

AI Is Too Expensive? I Run It for Free on My Laptop

Lingdas1 — Sun, 24 May 2026 09:36:11 +0000

AI Is Too Expensive? I Run It for Free on My Laptop (Here's How)

A medical student's guide to using AI without paying a cent in subscription fees.

I remember the exact moment I gave up on AI.

It was January 2026. I was staring at ChatGPT Pro's $200/month price tag, then at my bank account. A medical student in China — my monthly budget for "extras" was about enough for two bubble teas.

"AI is for rich people," I thought. "Or people whose companies pay for it."

I closed the tab and went back to studying.

But I couldn't shake the feeling that I was missing out. Everyone was talking about AI — coding assistants, research tools, writing helpers. And there I was, stuck with Google and a prayer.

Three months later, I'm running GPT-4-class models on my five-year-old laptop. For free. No subscriptions, no API bills, no cloud credits.

This is how I did it — and how you can too, even if you're not a programmer.

The Lie We've Been Told

Here's the thing nobody tells you about AI: you don't need the cloud.

Every AI company wants you to believe you need their $20/month plan. Or their $200/month Pro plan. Or their enterprise plan (ask for pricing!).

Why? Because they make money every time you type a message.

But the technology itself? The actual AI model? It's open source. Free. Public. Available for anyone to download and run.

The only reason we don't is that nobody told us we could.

What I Thought vs What I Learned

Before:

"Running AI locally? You need a $5,000 gaming PC with liquid cooling or something."

After:

My laptop has 8GB RAM and a mid-range GPU from 2021. I run AI models that answer questions, summarize articles, and help me study — all locally, all free.

Before:

"You need to be a programmer to set this up."

After:

I'm a medical student. I know anatomy, not APIs. If I can do it, anyone can.

Before:

"Local AI is worse than ChatGPT."

After:

For most everyday tasks — writing, research, brainstorming — the difference is unnoticeable. And for some things (privacy, no censorship, unlimited use), local AI is actually better.

What You Can Actually Do with Free AI

Let me show you what I do daily, all on my laptop, all free:

1. Study Assistant

I paste textbook chapters and ask questions. The model explains difficult concepts in simpler terms. No more watching expensive YouTube tutorials.

2. Writing Helper

Essays, emails, notes — I draft them faster. The model suggests improvements but doesn't rewrite everything (I'm still learning English, so I need the practice).

3. Research Buddy

I download research papers as PDFs and ask questions about them. "Summarize this in three bullet points." "What's the main limitation of this study?"

4. Brainstorming Partner

When I'm stuck on an idea, I talk it out with the AI. It's like having a friend who never gets tired of your questions.

5. Language Practice

I write something, ask the AI to correct my grammar, and learn from the feedback. It's like a free tutor who's available 24/7.

What You Need (Real Talk)

Let's be honest about what you need. No corporate marketing, just facts.

The Minimum Setup

Any computer (Windows, Mac, Linux — even a $200 used laptop)
At least 8GB of RAM (16GB is better, but 8GB works)
Internet connection for the initial download (takes 10-15 minutes)

That's it. No special GPU required. No expensive hardware.

"Wait, I thought you needed a gaming graphics card?"

You can get better speed with a gaming GPU — but you don't need one. Models that run on CPU are slower (think 5-10 seconds per response instead of 1-2 seconds), but they work perfectly fine for most tasks.

What It Looks Like

The whole setup is basically this:

1. Download a free program (Ollama) — 2 minutes
2. Pick a model (the "brain") — 1 click
3. Start chatting — immediately

That's the entire process. I'll write a step-by-step guide with screenshots soon. For now, just know that it's much simpler than you think.
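
If you're curious what "start chatting" looks like underneath (completely optional), here is a small sketch that talks to Ollama from Python. It assumes Ollama is already running on its default local address and that you have downloaded a model. The model name `llama3.2` is only an example, so swap in whichever one you picked.

```python
import json
import urllib.request

# Ollama's default local address; nothing leaves your computer.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Prepare the HTTP request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.2") -> str:
    """Send the prompt to the local model and return its reply text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

With Ollama running, `print(ask("Explain what a ligament is in one sentence."))` prints the model's answer. Ollama's own chat window does the same job with no code at all.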

The Privacy Bonus Nobody Talks About

Here's something I didn't expect: privacy.

When you use ChatGPT or Claude, everything you type goes to their servers. Your questions, your documents, your private thoughts.

When you run AI locally:

🔒 Everything stays on your computer
🔒 No one sees your conversations
🔒 No data collection
🔒 Works even without internet

For a medical student handling sensitive patient data during rotations, this is huge. But even for everyday use — journal entries, personal projects, private brainstorming — it's nice to know your data is yours.

But Wait, Is It Actually Good?

This is the question I get most. Let me give you an honest answer:

For most everyday tasks? Yes, it's good enough.

Writing emails → ✅ Great
Summarizing articles → ✅ Great
Brainstorming ideas → ✅ Great
Explaining concepts → ✅ Great
Writing code → ✅ Good (with the right model)
Complex math → ✅ Good (with DeepSeek-R1)
Creative writing → 🟡 Decent (varies by model)
Real-time conversation → 🟡 A bit slower on CPU

The only thing you really miss: The absolute top-tier models (GPT-4o, Claude Opus) are still cloud-only. But 90% of what I need AI for, my local models handle just fine.

Why I'm Writing This

I'm not a tech influencer. I don't sell courses or have affiliate links. I'm just a medical student who was frustrated by how expensive AI seemed — and then discovered it didn't have to be.

Every guide I found was written by programmers, for programmers. They assumed I knew what a "terminal" was, what "GGUF" meant, how to "clone a repo."

I didn't know any of that. I still barely do.

But I learned enough to get it working. And if I can do it, you can too.

What's Coming Next

I'm writing a series of plain-English guides for people who feel left behind by AI:

Part 2: "What Is an LLM? (No, It's Not Magic)" — Explaining AI in simple terms
Part 3: "Step-by-Step: Run Your First AI Model in 10 Minutes" — Screenshots included
Part 4: "5 Free Things You Can Do with Local AI Right Now" — Practical use cases
Part 5: "Local AI vs ChatGPT: An Honest Comparison" — No bias, just facts

Star the repo or follow me here to get notified when they drop.

The Bottom Line

AI shouldn't be a luxury. The technology is free, the tools are simple, and the only thing standing between you and free AI is knowing it exists.

I spent months thinking I couldn't afford AI. Turns out, I could afford it all along — I just didn't know where to look.

You can run AI on your laptop right now. For free. And it works.

If a medical student with zero coding background can figure it out, so can you.

Hi, I'm Ling. I'm a medical student in China who fell into AI by accident. No CS degree, no big tech job — just a laptop, a lot of curiosity, and a belief that AI should be for everyone. This is the first of my "AI for the Rest of Us" series.

Found this useful? ⭐ Star the GitHub repo to get notified when new guides drop. Or leave a comment — I read every one.

Local LLM Guide: The Complete Series — Find Your Starting Point 👋

Lingdas1 — Sun, 24 May 2026 09:21:04 +0000

Welcome to Local LLM Guide 👋

Hi, I'm Ling. I'm a medical student who fell into AI by accident. No CS degree, no big tech job — just a laptop, a lot of curiosity, and a belief that AI should be for everyone.

Not sure where to start? Pick your path:

👨‍💻 For Developers

You want to run LLMs locally — on your own hardware, with your own data, without paying API fees.

#	Article	Level	Read Time
01	Getting Started: Run Your First Local LLM in 5 Minutes	🟢 Beginner	5 min
02	Hardware Guide: What You Actually Need	🟢 Beginner	8 min
03	DeepSeek-R1: The $0 o1 Alternative	🟡 Intermediate	10 min
04	Qwen 3.6 & 2.5: The Most Versatile Local Models	🟡 Intermediate	10 min
05	Open WebUI: Your Local ChatGPT	🟡 Intermediate	8 min
06	GGUF & Modelfile: The Power User's Guide	🟡 Intermediate	12 min
07	Local RAG: Chat With Your Documents	🟡 Intermediate	10 min
08	Production-Ready Local LLMs: From Terminal to Team Deployment	🔴 Advanced	15 min
09	Function Calling for Local LLMs: DeepSeek, Qwen, GLM-4 & LangChain	🔴 Advanced	15 min

👉 Full source code & scripts: GitHub: Lingdas1/local-llm-guide

🧑‍🏫 New to AI? Start Here

You've heard about AI but feel overwhelmed. You have a regular laptop and want to understand what's possible — in plain English, no jargon.

#	Guide	Read Time
01	AI Is Too Expensive? I Run It for Free on My Laptop	5 min
02	What Is an LLM? (No, It's Not Magic)	6 min
03	Step-by-Step: Run Your First AI Model in 10 Minutes	Coming soon!
04	5 Free Things You Can Do with Local AI	Coming soon!

💡 Don't know where to start? Begin with Article 01 — it explains why you don't need money or technical skills to use AI.

All guides are written by a medical student who learned this stuff from zero. No assumed knowledge, no skipped steps.

📚 The Complete Guide (All-in-One)

If you prefer one long read, start here:

👉 The Complete Guide to Running LLMs Locally in 2026: From Ollama to Production

Covers everything from installing Ollama to production deployment — all in one article.

📊 What This Series Covers

🟢 Beginner ──────────────────────────────────────┐
    ├── 01. Getting Started (5 min setup)         │
    ├── 02. Hardware Guide (what you need)        │
                                                  │
🟡 Intermediate ──────────────────────────────────┤
    ├── 03. DeepSeek-R1 Guide                     │
    ├── 04. Qwen 3.6 & 2.5 Guide                  │
    ├── 05. Open WebUI Setup                      │
    ├── 06. GGUF & Modelfile Customization        │
    ├── 07. Local RAG with AnythingLLM            │
                                                  │
🔴 Advanced ──────────────────────────────────────┤
    ├── 08. Production Deployment                 │
    ├── 09. Function Calling & Tool Use           │
                                                  │
📦 Bonus: Scripts + Docker Compose + Benchmarks   │
    All on GitHub ⬇️                              │
                                                  │
    GitHub.com/Lingdas1/local-llm-guide ──────────┘

🎯 Why This Series Exists

I started this journey because I was frustrated. Every AI tutorial assumed I had:

Unlimited API budget ($200/month for ChatGPT Pro)
A rack of A100 GPUs
A CS degree from Stanford

Real life is different. I have a laptop, a curious mind, and no budget for API fees. I'm a medical student — not a software engineer.

If I can figure this out, so can you. That's the whole point of this series.

🔗 Quick Links

What	Where
All source code	github.com/Lingdas1/local-llm-guide
Complete guide (one article)	Here
Beginner path	Start at Article #1 above
Developer path	Start at Article #3 above
Found this useful?	⭐ Star the repo

If this guide helped you, consider:

⭐ Starring the repo — it helps others find it and you'll get notified when new chapters drop
💬 Leaving a comment — I read every one
🔁 Sharing with a friend who's curious about running AI locally

Ling — May 2026

I'm a medical student sharing what I learn about local AI. No CS degree, no big tech — just honest guides for real people.

Function Calling for Local LLMs: DeepSeek, Qwen, GLM-4 & LangChain

Lingdas1 — Sun, 24 May 2026 09:17:04 +0000

09 — Function Calling & Tool Use

🔴 Advanced — Give your local LLM superpowers: let it call APIs, run code, search the web, and interact with other software — all autonomously.

What Is Function Calling? (Plain English First)

Imagine you ask an assistant: "What's the weather in Tokyo right now?"

A normal LLM can only guess — it doesn't know today's weather. But with function calling, the LLM can say:

"I don't know the weather, but I know someone who does. Let me call the weather API."

The pattern is simple:

User: "What's the weather in Tokyo?"
  ↓
LLM: "I should call get_weather(city='Tokyo')"
  ↓
Your code: calls the actual weather API → gets result
  ↓
LLM: "The weather in Tokyo is 22°C and sunny."

Function calling = the LLM decides when to use a tool, and your code executes it.

💡 Why this matters without the cloud: On a cloud API (GPT-4, Claude), function calling is a checkbox feature. On local LLMs, it's not automatic — you need to know which models support it, how to format the tool definitions, and how to handle the response correctly. That's what this chapter covers.

How Function Calling Works (The Technical Pattern)

Every function calling flow follows the same 5-step cycle:

Step 1: Define your tools (as JSON schema)
Step 2: Send user message + tool definitions to the LLM
Step 3: LLM responds with either:
         - A normal text reply (no tool needed)
         - A "tool call" request (which tool + what arguments)
Step 4: Your code executes the requested tool
Step 5: Send the tool result back to the LLM
         → LLM produces the final response

Here's what a tool definition looks like in JSON:

{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
      "type": "object",
      "properties": {
        "city": {
          "type": "string",
          "description": "City name, e.g., 'Tokyo'"
        }
      },
      "required": ["city"]
    }
  }
}
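Writing these dictionaries by hand gets repetitive once you have more than a couple of tools. Here's a minimal sketch of a helper that builds the same structure and checks a model's arguments against the required list. The names make_tool and missing_required are my own (hypothetical), not part of Ollama or any library:

```python
def make_tool(name, description, properties, required=None):
    """Build an OpenAI-style tool definition (hypothetical helper)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(required or []),
            },
        },
    }


def missing_required(tool, args):
    """Return the required parameter names that are absent from args."""
    required = tool["function"]["parameters"].get("required", [])
    return [key for key in required if key not in args]


weather_tool = make_tool(
    "get_weather",
    "Get current weather for a city",
    {"city": {"type": "string", "description": "City name, e.g., 'Tokyo'"}},
    required=["city"],
)
```

Calling missing_required(weather_tool, args) before you execute a tool lets you send a clear error back to the model instead of crashing on a KeyError.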

1. DeepSeek-R1: Function Calling

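DeepSeek-R1 can do function calling through Ollama's OpenAI-compatible endpoint, which means you can use the same request code you'd use with GPT-4. Tool support depends on the exact model tag and your Ollama version: if a request comes back with a "does not support tools" error, pull a newer tag or try one of the other models in this chapter.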

Basic Setup

First, make sure DeepSeek-R1 is running locally:

ollama pull deepseek-r1:14b

# Or for smaller setups:
ollama pull deepseek-r1:7b

Single Tool Call Example (Python)

import json
import requests

# Step 1: Define the tools available to the LLM
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Perform a mathematical calculation",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Math expression in Python syntax, e.g., '2 + 2', '2**10', or 'sqrt(144)'"
                    }
                },
                "required": ["expression"]
            }
        }
    }
]

# Step 2: Send message + tools to the model
def chat_with_tools(messages, tools):
    response = requests.post(
        "http://localhost:11434/v1/chat/completions",
        json={
            "model": "deepseek-r1:14b",
            "messages": messages,
            "tools": tools,
            "stream": False
        }
    )
    return response.json()

# Step 3: Execute tool calls and return results
def execute_tool(tool_call):
    """Execute the tool the LLM requested and return the result."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])

    if name == "get_weather":
        # In real code, you'd call a real weather API here
        city = args["city"]
        unit = args.get("unit", "celsius")
        return json.dumps({
            "city": city,
            "temperature": 22 if unit == "celsius" else 72,
            "condition": "Sunny",
            "humidity": "65%"
        })

    elif name == "calculator":
        try:
            result = eval(args["expression"], {"__builtins__": {}}, {
                "sqrt": __import__("math").sqrt,
                "sin": __import__("math").sin,
                "cos": __import__("math").cos,
                "pi": __import__("math").pi
            })
            return json.dumps({"result": result})
        except Exception as e:
            return json.dumps({"error": str(e)})

    return json.dumps({"error": f"Unknown tool: {name}"})

# Step 4: Run the full interaction
def run_with_tools(user_message):
    messages = [
        {"role": "system", "content": "You are a helpful assistant with access to tools."},
        {"role": "user", "content": user_message}
    ]

    # First LLM call
    response = chat_with_tools(messages, tools)
    response_message = response["choices"][0]["message"]
    messages.append(response_message)

    # Check if the LLM wants to call tools
    if response_message.get("tool_calls"):
        for tool_call in response_message["tool_calls"]:
            result = execute_tool(tool_call)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call["id"],
                "content": result
            })

        # Second LLM call — now it has the tool results
        final_response = chat_with_tools(messages, tools)
        return final_response["choices"][0]["message"]["content"]

    return response_message["content"]

# Test it
print(run_with_tools("What's the weather in Tokyo in celsius?"))
# → "The weather in Tokyo is 22°C and sunny."

print(run_with_tools("Calculate 2^10 + 5*3"))
# → "The result is 1024 + 15 = 1039."
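One caution about the calculator tool: eval with emptied builtins is not a real sandbox, and Python reads ^ as XOR rather than "to the power of". Here's a minimal sketch of a stricter evaluator that walks the expression's syntax tree and allows only arithmetic. The name safe_eval and the exponent cap of 1000 are my own assumptions:

```python
import ast
import math
import operator


def _bounded_pow(base, exponent):
    """Power with a cap on the exponent so inputs like 9**9**9 fail fast."""
    if abs(exponent) > 1000:
        raise ValueError("Exponent too large")
    return operator.pow(base, exponent)


# Whitelisted operators, functions, and constants; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Mod: operator.mod, ast.Pow: _bounded_pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
_FUNCS = {"sqrt": math.sqrt, "sin": math.sin, "cos": math.cos}
_CONSTS = {"pi": math.pi, "e": math.e}


def safe_eval(expression):
    """Evaluate a small arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        if isinstance(node, ast.Name) and node.id in _CONSTS:
            return _CONSTS[node.id]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in _FUNCS and not node.keywords):
            return _FUNCS[node.func.id](*[walk(arg) for arg in node.args])
        raise ValueError(f"Unsupported element: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval"))
```

With this in place, an expression like 2^10 raises a ValueError that you can pass back to the model as a tool error, instead of silently returning the XOR result.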

Key Differences from Cloud APIs

Aspect	GPT-4 (Cloud)	DeepSeek-R1 (Local)
`tool_choice`	Supports `"auto"`, `"required"`, `"none"`	Supports `"auto"` and `"none"`
Parallel tool calls	✅ Yes	✅ Yes (multiple tools in one response)
Streaming with tools	✅ Yes	⚠️ Partially (use `stream: false` for reliability)
Response format	OpenAI format	OpenAI-compatible ✅

Tip: If DeepSeek-R1 doesn't call tools when you expect it to, try adding explicit instructions in the system prompt like: "You have access to tools. Use them when the user asks for information you don't know."
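Smaller local models sometimes emit malformed JSON in the arguments field, and the json.loads call in execute_tool raises when that happens. Here's a small defensive parser, assuming arguments arrives either as a JSON string (as in the examples above) or as an already-decoded dict. The name parse_tool_arguments is hypothetical:

```python
import json


def parse_tool_arguments(tool_call):
    """Return (args, error); exactly one of the two is None."""
    raw = tool_call.get("function", {}).get("arguments")
    if isinstance(raw, dict):        # already decoded
        return raw, None
    if raw is None or raw == "":     # tool called with no arguments
        return {}, None
    try:
        args = json.loads(raw)
    except (TypeError, json.JSONDecodeError) as exc:
        return None, f"Could not parse tool arguments: {exc}"
    if not isinstance(args, dict):
        return None, "Tool arguments must be a JSON object"
    return args, None
```

When error is not None, send it back as the tool message content. The model usually retries with corrected JSON on the next turn.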

2. Qwen 3.6 / 2.5: Function Calling

Qwen models have native function calling support and are particularly good at following complex tool schemas.

Setup

# Qwen 3.6 (newer, better function calling)
ollama pull qwen3.6:8b

# Or Qwen 2.5 (more widely tested)
ollama pull qwen2.5:7b

Example: Multi-Tool Chatbot

import json
import requests

def qwen_chat_with_tools(messages, tools):
    """Qwen uses the same OpenAI-compatible format."""
    response = requests.post(
        "http://localhost:11434/v1/chat/completions",
        json={
            "model": "qwen3.6:8b",  # or "qwen2.5:7b"
            "messages": messages,
            "tools": tools,
            "tool_choice": "auto",
            "temperature": 0.3,  # Lower = more deterministic tool selection
            "stream": False
        }
    )
    return response.json()

# Define a web search tool (mock)
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the internet for current information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read contents of a file on the local system",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Absolute file path"}
                },
                "required": ["path"]
            }
        }
    }
]

def execute_qwen_tool(tool_call):
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])

    if name == "search_web":
        # In production, use a real search API
        return json.dumps({
            "query": args["query"],
            "results": [
                {"title": f"Result about {args['query']}", "url": "https://example.com"}
            ]
        })
    elif name == "read_file":
        try:
            with open(args["path"], "r") as f:
                content = f.read()[:2000]  # Limit to 2000 chars
            return json.dumps({"path": args["path"], "content": content})
        except Exception as e:
            return json.dumps({"error": str(e)})

    return json.dumps({"error": "Unknown tool"})

# Full interaction loop
messages = [
    {"role": "system", "content": "You are an AI assistant with access to search and file tools. Use them when needed."}
]

user_input = "Can you read my config file and tell me what model I'm using?"
messages.append({"role": "user", "content": user_input})

# First response
response = qwen_chat_with_tools(messages, tools)
msg = response["choices"][0]["message"]
messages.append(msg)

# Handle tool calls
if msg.get("tool_calls"):
    for tc in msg["tool_calls"]:
        result = execute_qwen_tool(tc)
        messages.append({
            "role": "tool",
            "tool_call_id": tc["id"],
            "content": result
        })

    # Get final response
    final = qwen_chat_with_tools(messages, tools)
    print(final["choices"][0]["message"]["content"])
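The read_file tool above will open any path the model asks for, including SSH keys or password files. Here's a sketch of a sandboxed version that refuses anything outside one directory. The name read_file_sandboxed and the allowed_root parameter are my own additions:

```python
from pathlib import Path


def read_file_sandboxed(path, allowed_root, max_chars=2000):
    """Read a text file only if it resolves to a location inside allowed_root."""
    root = Path(allowed_root).resolve()
    target = Path(path).resolve()    # resolves symlinks and ".." segments
    if target != root and root not in target.parents:
        raise PermissionError(f"{target} is outside {root}")
    with open(target, "r", encoding="utf-8", errors="replace") as f:
        return f.read(max_chars)
```

Inside execute_qwen_tool you could call read_file_sandboxed(args["path"], "/home/user/projects") (the directory is only an example) and return the PermissionError text as the tool's error.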

Qwen-Specific Tips

Tip	Why
Use `temperature: 0.3`	Qwen is more creative by default; lower temp = more reliable tool selection
Describe tools in Chinese + English	Qwen was trained bilingually; descriptions in English work fine, but Chinese descriptions can improve accuracy
Max 5 parallel tools	Qwen 3.6 supports parallel tool calls but performs best with ≤5 at once
Use `tool_choice: "auto"`	Explicitly setting this prevents the model from ignoring tools

3. GLM-4: Tool Use & Agents

GLM-4 (from Zhipu AI / z.ai) is specifically designed for agentic workflows. It has the strongest tool-use capabilities among Chinese local models — it was trained with tool use as a first-class feature, not an afterthought.

Setup

ollama pull glm4:9b

GLM's Unique Tool Format

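GLM's tool definitions follow the same overall shape as the OpenAI format. The example below uses a required_parameters field instead of required. Standard JSON Schema only defines required, so if the model skips mandatory arguments when you go through Ollama's OpenAI-compatible endpoint, switch back to required: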

import json
import requests

# GLM tool definition format
glm_tools = [
    {
        "type": "function",
        "function": {
            "name": "send_email",
            "description": "Send an email to a recipient",
            "parameters": {
                "type": "object",
                "properties": {
                    "to": {"type": "string", "description": "Recipient email address"},
                    "subject": {"type": "string", "description": "Email subject"},
                    "body": {"type": "string", "description": "Email body content"}
                },
                "required_parameters": ["to", "subject", "body"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "list_directory",
            "description": "List files in a directory",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Directory path"}
                },
                "required_parameters": ["path"]
            }
        }
    }
]

def glm_chat(messages, tools):
    response = requests.post(
        "http://localhost:11434/v1/chat/completions",
        json={
            "model": "glm4:9b",
            "messages": messages,
            "tools": tools,
            "stream": False
        }
    )
    return response.json()
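The multi-step example in the next subsection calls an execute_glm_tool helper that the article leaves up to you. Here's one minimal way it could look for the two tools defined above. The send_email branch is a placeholder that sends nothing, and the return shapes are my own choice:

```python
import json
import os


def execute_glm_tool(tool_call):
    """Run the tool requested by the model and return a JSON string."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])

    if name == "list_directory":
        try:
            entries = sorted(os.listdir(args["path"]))
            return json.dumps({"path": args["path"], "entries": entries})
        except OSError as exc:
            return json.dumps({"error": str(exc)})

    if name == "send_email":
        # Placeholder: nothing is sent. Wire up smtplib or a mail API here.
        return json.dumps({"status": "not_sent", "reason": "send_email is a stub"})

    return json.dumps({"error": f"Unknown tool: {name}"})
```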

Multi-Step Agent Example

GLM-4 excels at multi-step reasoning — deciding to call tools in sequence:

messages = [
    {"role": "system", "content": "You are an AI assistant that can use tools. Use them when helpful."},
    {"role": "user", "content": "List the files in /home/user/projects, then tell me which ones are Python files."}
]

# GLM will:
# 1. Call list_directory("/home/user/projects")
# 2. Receive the file list
# 3. Analyze and respond with which are Python files

response = glm_chat(messages, glm_tools)
msg = response["choices"][0]["message"]
messages.append(msg)  # keep the assistant's tool-call message in the history

if msg.get("tool_calls"):
    for tc in msg["tool_calls"]:
        result = execute_glm_tool(tc)  # Your tool execution function
        messages.append({
            "role": "tool",
            "tool_call_id": tc["id"],
            "content": result
        })

    # GLM will now synthesize the results
    final = glm_chat(messages, glm_tools)
    print(final["choices"][0]["message"]["content"])
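The snippet above handles one round of tool calls. For truly sequential use (call a tool, look at the result, call another), the same steps go in a loop with an iteration cap. Here's a sketch with the model call and the tool executor passed in as functions, so it works with any of the models in this chapter. The name run_tool_loop is hypothetical:

```python
def run_tool_loop(chat_fn, execute_fn, messages, max_rounds=5):
    """Call the model repeatedly until it stops requesting tools.

    chat_fn(messages) must return an assistant message dict;
    execute_fn(tool_call) must return the tool result as a string.
    """
    for _ in range(max_rounds):
        msg = chat_fn(messages)
        messages.append(msg)
        tool_calls = msg.get("tool_calls")
        if not tool_calls:
            return msg.get("content") or ""
        for tc in tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": tc.get("id", ""),
                "content": execute_fn(tc),
            })
    return "Stopped: tool-call limit reached."
```

With the functions from this section, usage would look like run_tool_loop(lambda m: glm_chat(m, glm_tools)["choices"][0]["message"], execute_glm_tool, messages).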

GLM vs Others: When to Use Each

Task	Best Model	Why
Simple tool call (1-2 tools)	DeepSeek-R1:7b	Fastest inference, reliable
Complex multi-step (3+ tools)	GLM-4:9b	Best agentic reasoning
Following exact tool schema	Qwen 3.6:8b	Most accurate parameter extraction
Cost-sensitive (low VRAM)	Qwen 2.5:7b	4.5GB at Q4, works on most GPUs

4. LangChain Integration

LangChain is the most popular framework for building LLM-powered applications. Here's how to use your local models with function calling in LangChain.

Installation

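pip install langchain langchain-ollama

Basic LangChain + Ollama Tools

# ChatOllama from the langchain-ollama package supports tool binding;
# the older langchain_community ChatOllama does not.
from langchain_ollama import ChatOllama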
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.tools import tool
from langchain_core.prompts import ChatPromptTemplate

# Step 1: Define tools using the @tool decorator
@tool
def get_weather(city: str) -> str:
    """Get current weather for a city. Input: city name."""
    # Replace with real API call
    return f"The weather in {city} is 22°C and sunny."

@tool
def calculate(expression: str) -> str:
    """Perform a mathematical calculation. Input: math expression string."""
    import math
    safe_dict = {
        "sqrt": math.sqrt, "sin": math.sin, "cos": math.cos,
        "pi": math.pi, "e": math.e, "abs": abs
    }
    try:
        result = eval(expression, {"__builtins__": {}}, safe_dict)
        return f"Result: {result}"
    except Exception as e:
        return f"Error: {e}"

@tool
def search_web(query: str) -> str:
    """Search the web for current information. Input: search query."""
    # In production, use DuckDuckGo or similar
    return f"Top result for '{query}': [Example result]"

# Step 2: Create the LLM
llm = ChatOllama(
    model="qwen2.5:7b",  # or "deepseek-r1:7b", "glm4:9b"
    temperature=0.3,
)

# Step 3: Create the agent
tools = [get_weather, calculate, search_web]
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant with access to tools."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,  # Shows you what tools are being called
    max_iterations=5,  # Safety limit
)

# Step 4: Run it
result = agent_executor.invoke({
    "input": "What's the weather in London and calculate 15% of 200?"
})
print(result["output"])
# → "The weather in London is 22°C and sunny. 15% of 200 is 30."

Running the LangChain Example

# Save the code above as langchain-agent.py
python langchain-agent.py

# You should see:
# > Entering new AgentExecutor chain...
# > Invoking: get_weather with {'city': 'London'}
# > Invoking: calculate with {'expression': '0.15 * 200'}
# > The weather in London is 22°C and sunny. 15% of 200 is 30.

Model-Specific LangChain Tips

Model	LangChain Model Class	Notes
DeepSeek-R1	`ChatOllama(model="deepseek-r1:14b")`	Best for reasoning-heavy agents
Qwen 3.6/2.5	`ChatOllama(model="qwen3.6:8b")`	Most reliable with LangChain's tool format
GLM-4	`ChatOllama(model="glm4:9b")`	May need `stop: ["<

5. Practical: Build a Code Assistant Bot

Let's put it all together — a real tool-using assistant that can:

Read and write files
Run shell commands
Search for packages
Answer questions about your codebase

import json
import requests
import subprocess
import os

# === Tool Definitions ===

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Absolute path to file"}
                },
                "required": ["path"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file (overwrites existing)",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "content": {"type": "string", "description": "File content"}
                },
                "required": ["path", "content"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "run_command",
            "description": "Run a shell command (read-only, safe commands only)",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string", "description": "Shell command to run"}
                },
                "required": ["command"]
            }
        }
    }
]

def execute(name, args):
    if name == "read_file":
        try:
            with open(args["path"], "r") as f:
                return f.read()[:3000]
        except Exception as e:
            return f"Error: {e}"

    elif name == "write_file":
        try:
            with open(args["path"], "w") as f:
                f.write(args["content"])
            return f"Written {len(args['content'])} bytes to {args['path']}"
        except Exception as e:
            return f"Error: {e}"

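    elif name == "run_command":
        # Safety: allow only a few read-only commands and never go through a shell.
        # With shell=True, "ls; rm -rf ~" would pass a first-word check.
        # "find" is left out because -delete and -exec can modify files.
        import shlex
        safe_commands = ["ls", "cat", "grep", "pwd", "echo", "which", "head", "tail"]
        try:
            argv = shlex.split(args["command"])
        except ValueError as e:
            return f"Error: could not parse command: {e}"
        cmd = argv[0] if argv else ""
        if cmd not in safe_commands:
            return f"Blocked: '{cmd}' is not in the allowed command list."
        try:
            # Passing a list (no shell) means pipes, redirects, and ';' are not interpreted
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=10
            )
            output = (result.stdout + result.stderr)[:3000]
            return output if output else "(no output)"
        except subprocess.TimeoutExpired:
            return "Command timed out after 10 seconds"
        except Exception as e:
            return f"Error: {e}"

    return f"Error: unknown tool '{name}'"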

# === Main Loop ===

def chat_tool(ollama_host="http://localhost:11434", model="qwen2.5:7b"):
    messages = [{
        "role": "system",
        "content": "You are a coding assistant. Use your tools to read files, write code, and run commands."
    }]

    print(f"🤖 Code Assistant ({model}) — type 'quit' to exit\n")

    while True:
        user = input("You: ")
        if user.lower() in ("quit", "exit", "q"):
            break

        messages.append({"role": "user", "content": user})

        # Tool-call loop (max 5 iterations to prevent infinite loops)
        for i in range(5):
            resp = requests.post(
                f"{ollama_host}/v1/chat/completions",
                json={
                    "model": model,
                    "messages": messages,
                    "tools": TOOLS,
                    "stream": False
                }
            ).json()

            msg = resp["choices"][0]["message"]
            messages.append(msg)

            if not msg.get("tool_calls"):
                break  # No more tools needed

            # Execute each tool
            for tc in msg["tool_calls"]:
                fn_name = tc["function"]["name"]
                fn_args = json.loads(tc["function"]["arguments"])
                print(f"  🔧 Calling: {fn_name}({json.dumps(fn_args)})")
                result = execute(fn_name, fn_args)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc["id"],
                    "content": str(result)
                })

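        # Print final response (content can be empty if the tool-call limit was hit)
        print(f"🤖 {msg.get('content') or '(no final answer: tool-call limit reached)'}\n")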

if __name__ == "__main__":
    chat_tool()

Save and run:

python3 code-assistant.py

Example interaction:

You: Read my main.py and tell me if there are any bugs
  🔧 Calling: read_file({"path": "./main.py"})
🤖 I can see your main.py. It looks mostly fine, but I notice
   line 42 has a typo: "retrun" should be "return".

You: Fix it
  🔧 Calling: read_file({"path": "./main.py"})
  🔧 Calling: write_file({"path": "./main.py", "content": "..."})
🤖 Fixed! Changed "retrun" to "return" on line 42.

Quick Reference: Model Function Calling Support

Feature	DeepSeek-R1	Qwen 3.6 / 2.5	GLM-4	Notes
OpenAI format	✅	✅	✅	Same `tools` parameter
Parallel calls	✅	✅	✅	Multiple tools at once
`tool_choice: "auto"`	✅	✅	✅	LLM decides when to use tools
`tool_choice: "required"`	❌	⚠️ Partial	❌	Not widely supported locally
Streaming + tools	⚠️ Partial	✅	⚠️ Partial	Use `stream: false` to be safe
Multi-step reasoning	Good	Very Good	Excellent	GLM-4 leads on agentic workflows
Min VRAM (Q4)	~4.5 GB (7b)	~5 GB (8b)	~5.5 GB (9b)	All fit on 8GB GPUs

Common Mistakes & Solutions

Mistake	Symptom	Fix
Wrong model name	"does not support tools" error	Verify: `curl -s http://localhost:11434/api/tags`
Missing system prompt	Model never calls tools	Add: "You have access to tools. Use them when helpful."
Too many tools	Model calls wrong tool	Limit to ≤5 tool definitions per call
No `tool_choice: "auto"`	Model ignores tools	Explicitly set `tool_choice: "auto"`
Infinite tool loop	Model keeps calling tools	Add `max_iterations` guard (e.g., 5)
Temperature too high	Tool calls are random/lazy	Set `temperature: 0.3` or lower
Wrong Ollama port	Connection refused	Check: `ollama serve` is running on 11434
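The first row of the table is easy to automate. Here's a sketch that checks a decoded /api/tags response for a model name before you send any chat request. It assumes the response has the shape {"models": [{"name": "..."}]}; the function names are my own:

```python
def installed_models(tags_response):
    """Extract model names from a decoded /api/tags response."""
    return [m.get("name", "") for m in tags_response.get("models", [])]


def is_installed(tags_response, model):
    """True if `model` is installed; a bare name is treated as ':latest'."""
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in installed_models(tags_response)
```

You'd feed it the JSON from the curl command in the table, for example is_installed(requests.get("http://localhost:11434/api/tags").json(), "qwen2.5:7b").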

What's Next

You now have a local LLM that can see files, run commands, search the web, and execute code. This is the foundation for building:

AI coding assistants that read and modify your codebase
Personal research agents that search the web and summarize
Automation bots that interact with APIs and databases
Your own AutoGPT — a multi-step reasoning agent

The GitHub repo has ready-to-run scripts for all the examples above. Star it to get notified when new chapters drop! ⭐


Production-Ready Local LLMs: From Terminal to Team Deployment

Lingdas1 — Sun, 24 May 2026 09:13:27 +0000

05 — Production: From Personal Setup to Team Deployment

🔴 Advanced — You've got local LLMs running on your machine. Now let's make them available to your team, your apps, and your users — securely and reliably.

What You'll Learn

By the end of this chapter, you'll be able to:

✅ Set up multi-user access with Open WebUI (users, groups, permissions)
✅ Expose Ollama models via REST API with rate limiting and authentication
✅ Deploy the full stack (Ollama + Open WebUI + RAG) using Docker
✅ Monitor usage, performance, and errors
✅ Calculate when local beats the cloud — with real 2026 pricing
✅ Secure your deployment with an actionable checklist

1. Multi-User Management with Open WebUI

Open WebUI isn't just a pretty interface — it comes with a built-in user management system. Here's how to set it up for multiple users.

Enabling Sign-Up

By default, Open WebUI allows anyone to create an account. For team use, you'll want to control this:

# Option A: Allow sign-up (good for small teams)
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -e WEBUI_NAME="Team AI" \
  -e ENABLE_SIGNUP=true \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

The --restart always flag means the container auto-starts when the machine reboots. Without it, your team loses access every time you restart.

# Option B: Invite-only (recommended for production)
# Same as above, but set:
-e ENABLE_SIGNUP=false

With sign-up disabled, you create accounts manually from the Admin Panel (Settings → Users → Add User).

Creating User Roles

Open WebUI supports three roles out of the box:

Role	Permissions	Best For
User	Chat with assigned models, create conversations	Team members who just need to use AI
Admin	Full access: manage users, models, settings	Team leads, IT admins
Pending	Can't chat yet — awaiting approval	New sign-ups waiting for review

How to assign roles:

Log in as admin → click the gear icon ⚙️ (bottom left)
Navigate to Admin Panel → Users
Click the pencil icon on any user → change their role

Model Access Control (Per-User Models)

This is one of Open WebUI's killer features for production use. You can control which models each user can see and use:

Admin Panel → Models
Click a model → Permissions tab
Select which users/groups can access it

Why this matters: You might have a 70B model that only your power users should run (saves VRAM for everyone), or a cheap 7B model for general queries. Per-user model access lets you balance resources intelligently.

2. API Deployment: Exposing Ollama to Applications

Ollama runs an HTTP server on http://localhost:11434 by default. To make it accessible to other applications (or other machines on your network), you need to configure it properly.

Step 1: Allow External Connections

By default, Ollama only listens on 127.0.0.1 (localhost). To allow network access:

Linux (systemd):

# Edit the Ollama service configuration
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null << 'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF

# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart ollama

0.0.0.0 means "listen on all network interfaces." 127.0.0.1 means "listen on localhost only."

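macOS: The menu bar app does not read your shell profile, so set the variable with launchctl:

launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
# Then quit and reopen Ollama from the menu bar icon
# (If you start the server yourself with `ollama serve`, exporting
#  OLLAMA_HOST in ~/.zshrc or ~/.bashrc works too)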

Docker Ollama:

docker run -d \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  --name ollama \
  --restart always \
  ollama/ollama

Step 2: Add Authentication (htpasswd)

⚠️ Never expose Ollama to the public internet without authentication. Anyone who finds your endpoint can run models on your GPU, costing you electricity and compute.

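The simplest auth layer is Basic Auth via a reverse proxy. If nginx runs on the same machine as Ollama, keep Ollama bound to 127.0.0.1 (or firewall port 11434) so clients can't skip the proxy and reach the API directly: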

# Install nginx and create a password file
sudo apt-get install nginx apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd admin
# Enter a strong password when prompted

# Optional: add more users
sudo htpasswd /etc/nginx/.htpasswd teammate1

Step 3: Set Up Nginx Reverse Proxy

# /etc/nginx/sites-available/ollama
server {
    listen 8080;
    server_name _;

    # === Authentication ===
    # Every request must provide a valid username:password
    auth_basic "Ollama API — Authorized Access Only";
    auth_basic_user_file /etc/nginx/.htpasswd;

    # === Rate Limiting ===
    # Max 30 requests per minute per IP address
    limit_req zone=ollama burst=5 nodelay;
    limit_req_status 429;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Increase timeout for long-running model generations
        proxy_read_timeout 300;
        proxy_send_timeout 300;
    }
}

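# Define the rate limit zone. limit_req_zone must live in the http context;
# on Debian/Ubuntu, files in sites-enabled are included inside the http block,
# so it can stay in this file, outside the server block.
limit_req_zone $binary_remote_addr zone=ollama:10m rate=30r/m;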

What each directive does:

auth_basic — Prompts for username/password on every request
limit_req — Prevents a single user from overwhelming your GPU
proxy_read_timeout 300 — Models can take minutes to generate; this keeps the connection open
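To get a feel for what rate=30r/m with burst=5 nodelay means in practice, here's a small simulation. It's a simplified model based on my reading of how limit_req accounts for excess requests, not nginx's actual code, and the function name is hypothetical:

```python
def simulate_limit_req(arrival_times, rate_per_sec, burst):
    """Return one accept/reject flag per request (simplified model).

    arrival_times must be sorted, in seconds. "excess" counts requests
    above the steady rate; it drains at rate_per_sec, and a request is
    rejected when accepting it would push excess past burst.
    """
    decisions = []
    excess = 0.0
    last = None
    for t in arrival_times:
        if last is None:
            candidate = 0.0          # first request from this client
        else:
            candidate = max(excess - rate_per_sec * (t - last) + 1.0, 0.0)
        if candidate > burst:
            decisions.append(False)  # would get limit_req_status (429 here)
            continue                 # rejected requests don't change the state
        decisions.append(True)
        excess, last = candidate, t
    return decisions


# 30 requests per minute = 0.5 per second
flags = simulate_limit_req([0, 0, 0, 0, 0, 0, 0], rate_per_sec=0.5, burst=5)
```

In this model, seven requests fired at the same instant give six accepts (one normal plus five burst) and one rejection. After that, roughly one new request fits every two seconds.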

# Enable the site and restart nginx
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t  # Test config for syntax errors
sudo systemctl restart nginx

Step 4: Test Your API

# Without auth — should get 401 Unauthorized
curl -sk http://your-server-ip:8080/api/tags

# With auth — should work
curl -sk -u admin:yourpassword http://your-server-ip:8080/api/tags

# Send a chat request via API
curl -sk -u admin:yourpassword \
  -X POST http://your-server-ip:8080/api/chat \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Say hello in one word"}],
    "stream": false
  }'

The stream: false parameter returns the full response at once. For production apps, you'll want stream: true for streaming responses.
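The same request from Python, using only the standard library. This is a sketch: the server address, username, and password are placeholders, and build_chat_request / send_chat are names I made up:

```python
import base64
import json
import urllib.request


def build_chat_request(base_url, user, password, model, prompt):
    """Build (but do not send) an authenticated POST to /api/chat."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/chat",
        data=body,
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat(request, timeout=300):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Usage would be send_chat(build_chat_request("http://your-server-ip:8080", "admin", "yourpassword", "qwen2.5:7b", "Say hello in one word")). The 300-second timeout mirrors proxy_read_timeout in the nginx config.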

Step 5: OpenAI-Compatible Endpoint

Ollama also exposes an OpenAI-compatible endpoint, which means any tool that works with OpenAI's API can work with your local models:

# Calling the OpenAI-compatible endpoint with curl
curl -sk -u admin:yourpassword \
  -X POST http://your-server-ip:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

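Compatibility note: The OpenAI-compatible endpoint (/v1/) works with most tools that support OpenAI, including LangChain, LlamaIndex, Continue.dev (VS Code extension), Cursor, and custom scripts. Just change the base URL and skip the API key (or use a dummy one if the tool requires it). One caveat: OpenAI clients send the key as a Bearer token, which the Basic Auth proxy from Step 2 rejects with 401. Either point those tools at an endpoint that isn't behind Basic Auth, or configure them to send the Basic credentials.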

3. Docker Deployment: Full Stack, Containerized

This section has two versions. Start with the quick version to get running fast, then use the deep version when you need a proper production setup.

Quick Version: `docker run`

Run each component individually. Good for testing and single-user setups:

# 1. Ollama (model server)
docker run -d \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  --name ollama \
  --restart always \
  ollama/ollama

# What this does:
# - `-d` → run in background (detached mode)
# - `-p 11434:11434` → map port 11434 from container to your machine
# - `-v ollama:/root/.ollama` → save models to a persistent volume
#    (without this, models disappear when the container is recreated)
# - `--restart always` → auto-start on boot and after crashes

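# 2. Open WebUI (chat interface, connects to Ollama)
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Why host.docker.internal and not http://ollama:11434: container names only
# resolve on a user-defined Docker network. With plain `docker run`, the
# WebUI reaches Ollama through the port published on the host instead.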

# If Ollama and WebUI are on different machines:
# -e OLLAMA_BASE_URL=http://YOUR_SERVER_IP:11434

Deep Version: `docker-compose` (Production-Ready)

Create a file called docker-compose.yml in your project directory:

version: "3.8"

services:
  # === Model Server ===
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: always
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    # 🖥️ GPU support (remove this section if you don't have NVIDIA GPU)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  # === Web Interface ===
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: always
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_NAME=My Local AI
      - ENABLE_SIGNUP=false
    depends_on:
      - ollama

  # === Optional: AnythingLLM (RAG) ===
  anythingllm:
    image: mintplexlabs/anythingllm:latest
    container_name: anythingllm
    restart: always
    ports:
      - "3001:3001"
    volumes:
      - anythingllm_data:/app/server/storage
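    environment:
      - STORAGE_DIR=/app/server/storage
      - LLM_PROVIDER=ollama
      - OLLAMA_BASE_PATH=http://ollama:11434
      - OLLAMA_MODEL_PREF=qwen2.5:7b
      # Embeddings need a dedicated embedding model, not a chat model.
      # Pull it first: docker exec ollama ollama pull nomic-embed-text
      - EMBEDDING_ENGINE=ollama
      - EMBEDDING_BASE_PATH=http://ollama:11434
      - EMBEDDING_MODEL_PREF=nomic-embed-text:latest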
    depends_on:
      - ollama

volumes:
  ollama_data:
  open_webui_data:
  anythingllm_data:

How to use it:

# Start everything
cd /path/to/your/project
docker compose up -d

# Check if containers are running
docker compose ps

# View logs
docker compose logs -f ollama

# Pull a model
docker exec ollama ollama pull qwen2.5:7b

# Stop everything
docker compose down

# Update images (when new versions are available)
docker compose pull
docker compose up -d

Line-by-line explanation of the compose file:

services: Each service is a separate container

ollama: / open-webui: Service names. Open WebUI uses http://ollama:11434 to connect because Docker Compose creates an internal network where service names act as hostnames

volumes: Persistent storage that survives container recreation

depends_on: Start the Ollama container before Open WebUI. This only orders container startup; it does not wait until Ollama is ready to answer requests

deploy.resources: GPU passthrough (only works with nvidia-container-toolkit installed on the host)
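
Because depends_on only orders container startup, a script that talks to Ollama right after `docker compose up -d` may need to wait until the API answers. The sketch below polls Ollama's `/api/tags` endpoint. The function names, the 60-second timeout, and the 2-second interval are illustrative assumptions, not part of Ollama or Compose.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def ollama_is_ready(base_url: str = "http://localhost:11434") -> bool:
    """Return True if GET /api/tags answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() every interval_s seconds until it succeeds or timeout_s passes."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Typical use after starting the stack would be `wait_until_ready(ollama_is_ready)`, followed by pulling or loading a model only if it returns True.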

Docker Tips for Different Platforms

Platform	Path Mounting	GPU Support	Notes
Linux	`/home/user/data:/data`	✅ NVIDIA (`nvidia-container-toolkit`)	Easiest setup
macOS	`~/Documents/data:/data`	❌ (no GPU passthrough to Docker)	Models run slower (CPU only in Docker)
Windows	`C:\data:/data`	✅ NVIDIA (requires WSL2 backend)	Use `\` path separators in Docker

macOS users: Docker Desktop doesn't support GPU passthrough. For GPU-accelerated Ollama on Mac, run Ollama natively (outside Docker) and point your Dockerized Open WebUI to it via OLLAMA_BASE_URL=http://host.docker.internal:11434.

4. Monitoring & Logging

You don't need a full observability stack. A few simple checks will tell you most of what you need.

Quick Health Check

# Check if Ollama is running (/api/tags lists the models installed on disk, not the ones loaded in memory)
curl -s http://localhost:11434/api/tags | python3 -c "import sys,json; data=json.load(sys.stdin); print(f'{len(data[\"models\"])} models installed')"

# Check GPU usage (NVIDIA only)
watch -n 2 nvidia-smi

# Check RAM usage
free -h

# Check Docker container status
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
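
If the one-liner above gets hard to read, the same check can live in a small script. This is a minimal sketch that assumes the `/api/tags` response has a top-level `models` list whose entries carry `name` and `size` (bytes); `summarize_tags` is a hypothetical helper name.

```python
import json


def summarize_tags(raw_json: str) -> str:
    """Turn the JSON body of GET /api/tags into a short human-readable report."""
    data = json.loads(raw_json)
    models = data.get("models", [])
    lines = [f"{len(models)} models installed"]
    for model in models:
        size_gb = model.get("size", 0) / 1e9  # size is reported in bytes
        lines.append(f"  {model.get('name', '?')}: {size_gb:.1f} GB")
    return "\n".join(lines)
```

Feed it the output of `curl -s http://localhost:11434/api/tags` (for example through `sys.stdin.read()`) to get one line per installed model.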

Basic Logging Setup

Ollama logs to stdout. If you're running it in Docker, logs are already captured:

# See real-time Ollama logs
docker logs -f ollama

# See Open WebUI logs
docker logs -f open-webui

# Save logs to a file (searchable later)
docker logs ollama > ~/ollama-logs-$(date +%Y%m%d).txt 2>&1

What to Watch For

Metric	Normal Range	Red Flag	What To Do
GPU VRAM usage	70–95%	100% (OOM)	Use smaller model or lower quantization
GPU temperature	65–80°C	>85°C	Clean fans, reduce ambient temp, lower power limit
Response time (7B model)	1–5 seconds	>15 seconds	Check VRAM, restart Ollama, reduce concurrent users
RAM usage	Within available	Swap usage >0	Add more RAM or reduce model count
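
The red-flag column above can be turned into a simple check. The sketch below hard-codes the thresholds from the table; the `Reading` class and `red_flags` function are hypothetical names, and collecting the readings (from nvidia-smi, free, or your own timing) is left to you.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    vram_used_pct: float     # GPU memory in use, percent
    gpu_temp_c: float        # GPU temperature, degrees Celsius
    response_time_s: float   # time to answer a typical prompt on a 7B model
    swap_used_mb: float      # swap in use, megabytes


def red_flags(r: Reading) -> list[str]:
    """Return the table's suggested actions for every threshold that is exceeded."""
    flags = []
    if r.vram_used_pct >= 100:
        flags.append("VRAM full: use a smaller model or lower quantization")
    if r.gpu_temp_c > 85:
        flags.append("GPU too hot: clean fans, lower ambient temp or power limit")
    if r.response_time_s > 15:
        flags.append("Slow responses: check VRAM, restart Ollama, reduce users")
    if r.swap_used_mb > 0:
        flags.append("Swapping: add RAM or run fewer models")
    return flags
```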

Usage Tracking (Who's Using What)

Open WebUI's admin panel provides basic usage statistics:

Admin Panel → Chat Logs — See conversations (toggle anonymization for privacy)
Admin Panel → Users — See active users
Settings → Models — See which models are most popular

For more detailed tracking, add a simple log parser:

# Count API requests per hour from Ollama logs
# (Ollama logs each HTTP request as a "[GIN] YYYY/MM/DD - HH:MM:SS | status | ..." line)
docker logs ollama 2>&1 | grep "\[GIN\]" | \
  awk '{print $2, substr($4, 1, 2) ":00"}' | sort | uniq -c | sort -rn | head -10
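
For per-endpoint counts, a few lines of Python are easier to extend than awk. This sketch assumes Ollama's request log lines look like `[GIN] 2026/05/20 - 14:03:11 | 200 | 1.2s | 127.0.0.1 | POST "/api/chat"`; check a few lines of your own `docker logs ollama` output first, since the exact layout is an assumption here.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def parse_gin_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (hour_bucket, path) for a request log line, or None for other lines."""
    if "[GIN]" not in line:
        return None
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 5:
        return None
    stamp = parts[0].replace("[GIN]", "").strip()   # "2026/05/20 - 14:03:11"
    date, _, clock = stamp.partition(" - ")
    method_and_path = parts[4].split()               # ['POST', '"/api/chat"']
    if len(method_and_path) < 2:
        return None
    path = method_and_path[1].strip('"')
    return f"{date} {clock[:2]}:00", path


def requests_per_hour(lines: Iterable[str]) -> Counter:
    """Count request log lines per hour bucket."""
    counts: Counter = Counter()
    for line in lines:
        parsed = parse_gin_line(line)
        if parsed:
            counts[parsed[0]] += 1
    return counts
```

Pipe the logs in with `docker logs ollama 2>&1 | python3 your_script.py` and call `requests_per_hour(sys.stdin)`; grouping by the returned path instead of the hour shows which endpoints are busiest.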

5. Cost Analysis: Local vs Cloud (Updated for 2026)

💰 All estimates based on US average electricity rate ($0.15/kWh). Hardware prices as of May 2026. Actual costs vary by region.

The Full Picture

Scenario	Cloud (GPT-4o / Claude)	Local (Your Hardware)	Winner
Solo heavy user ($200/mo API)	$2,400/year	$325/year (electricity)	Local after 14 months
Small team (5 people)	$1,000/month ($200×5)	$380/year (electricity)	Local after ~6 months
Light user (<$50/mo API)	$600/year	$0 (existing hardware)	Local immediately
Enterprise (50 users)	Custom pricing (~$20K/yr)	$3,500 (one-time build) + $600/yr	Local after ~2–3 months

Detailed Breakdown: Solo Heavy User

# === Local Setup — One-Time Cost ===
RTX 4090 (or used RTX 3090)  $1,500–2,500
Rest of PC (if needed)        $500–1,000
Total upfront:                $2,000–3,500

# === Local — Monthly Costs ===
Electricity (~0.5 kW × 8 h/day × 30 days × $0.15/kWh)  ~$18/month
Internet (negligible)                    $0
Total monthly:                           ~$18/month

# === Cloud — Monthly Costs ===
ChatGPT Pro / Claude Pro                 $200/month
API calls (heavy user)                   $50–100/month
Total monthly:                           $250–300/month

Break-Even Calculator

Here's a simple way to calculate your personal break-even point:

Break-even (months) = Hardware Cost / (Cloud monthly cost - Local monthly cost)

Example with RTX 4090 ($2,500) vs ChatGPT Pro ($200):
$2,500 / ($200 - $18) = 13.7 months

After the break-even point, you save about $182/month compared to the cloud. Over 3 years that is roughly $6,550 in avoided subscription costs, or about $4,050 net of the $2,500 hardware (before resale value or depreciation).
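
The formula above is easy to wrap in a function so you can plug in your own numbers. This is a minimal sketch; the function names and the 30-day month are assumptions made for illustration.

```python
from typing import Optional


def monthly_electricity_cost(
    avg_kw: float, hours_per_day: float, price_per_kwh: float, days: int = 30
) -> float:
    """Average power draw (kW) x hours per day x days x price per kWh."""
    return avg_kw * hours_per_day * days * price_per_kwh


def break_even_months(
    hardware_cost: float, cloud_monthly: float, local_monthly: float
) -> Optional[float]:
    """Months until the hardware pays for itself, or None if local never wins."""
    saving = cloud_monthly - local_monthly
    if saving <= 0:
        return None
    return hardware_cost / saving


# The guide's example: $2,500 of hardware vs. a $200/month subscription
# with about $18/month of electricity.
months = break_even_months(2500, 200, 18)
print(f"Break-even after {months:.1f} months")
```

A `None` result means the cloud plan is cheaper than your running costs alone, which is the "light user" case where buying hardware does not pay off.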

When Cloud Still Makes Sense

Let's be honest — local isn't always better:

Situation	Recommendation
You need GPT-4o / Claude Opus quality	Keep a cloud subscription for hard tasks
Your GPU is <8GB VRAM	Use local for simple tasks, cloud for complex ones
You have zero upfront budget	Start with cloud, save for hardware
You need 100% uptime (SLA)	Cloud wins — your home power goes out sometimes
You process huge batches overnight	Local — no API limits, no per-token cost

The hybrid approach is what I personally recommend:

Daily use → Local LLM (Qwen 2.5:7b or DeepSeek-R1:14b)
Hard tasks → Cloud API (pay-per-use, ~$20–50/month)
Automated batch jobs → Local (unlimited, no rate limits)

6. Security Checklist

⚠️ This is the most important section in this chapter. An exposed, unauthenticated Ollama instance is a liability.

Before Going Live

[ ] Ollama is NOT directly exposed to the internet
- Verify: curl -s http://YOUR_PUBLIC_IP:11434/api/tags should fail from outside
- If it succeeds → your Ollama is visible to the entire internet!
[ ] Authentication is enabled (htpasswd / API key / SSO)
- Verify: curl -u test:test http://localhost:8080/api/tags returns 401
[ ] SSH / VPN only for remote access
- Best practice: Don't expose the API at all. Use Tailscale or WireGuard VPN
[ ] Firewall rules are configured

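  # Allow only local network access (192.168.x.x)
  sudo ufw allow from 192.168.0.0/16 to any port 11434
  sudo ufw deny 11434  # Block everything else (ufw checks rules in order)

  # If using nginx reverse proxy (port 8080), allow from VPN only
  sudo ufw allow from 10.0.0.0/8 to any port 8080

  # Caveat: ports published by Docker (-p 11434:11434) bypass ufw, because Docker
  # writes its own iptables rules. For Dockerized Ollama, publish on loopback only:
  #   -p 127.0.0.1:11434:11434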

[ ] Ollama version is up to date

  ollama --version
  # Compare with https://github.com/ollama/ollama/releases

[ ] Docker containers restart on failure
- Verify: docker inspect ollama | grep -A2 RestartPolicy

Quick Security Audit Script

#!/bin/bash
# save as: security-audit.sh && chmod +x security-audit.sh

echo "=== Local LLM Security Audit ==="

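# Check 1: Is Ollama answering on this machine?
# Note: a request to localhost cannot tell you whether the port is reachable from
# outside. Run the curl check from "Before Going Live" from another network for that.
LOCAL_CHECK=$(curl -s -o /dev/null -w "%{http_code}" \
  http://localhost:11434/api/tags 2>/dev/null)
if [ "$LOCAL_CHECK" = "200" ]; then
  echo "⚠️  Ollama API answers without authentication on port 11434"
  echo "   If this port is reachable from the internet, anyone can run models on your GPU!"
else
  echo "ℹ️  Ollama API did not answer on localhost:11434 (is it running?)"
fi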

# Check 2: Is there authentication?
AUTH_CHECK=$(curl -s -o /dev/null -w "%{http_code}" \
  http://localhost:8080/api/tags 2>/dev/null)
if [ "$AUTH_CHECK" == "401" ]; then
  echo "✅ Reverse proxy auth is working"
elif [ "$AUTH_CHECK" == "200" ]; then
  echo "⚠️  No authentication on port 8080"
else
  echo "ℹ️  No reverse proxy detected on port 8080"
fi

# Check 3: Firewall status
if command -v ufw &>/dev/null; then
  echo "--- UFW Status ---"
  sudo ufw status verbose
fi

# Check 4: GPU usage (snapshot)
if command -v nvidia-smi &>/dev/null; then
  echo "--- GPU ---"
  nvidia-smi --query-gpu=name,memory.used,memory.total,temperature.gpu \
    --format=csv,noheader
fi

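Run this script periodically, or schedule it in root's crontab (the ufw check needs root, and cron cannot answer a sudo password prompt) and keep the output: */30 * * * * /path/to/security-audit.sh >> /var/log/llm-audit.log 2>&1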
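
A localhost curl cannot show whether Ollama is listening on every interface. One way to check is to look at the listening sockets. The sketch below parses the text output of `ss -tln` (from iproute2 on Linux); the column layout it expects and the helper names are assumptions, so compare against your own `ss -tln` output before relying on it.

```python
from typing import List

WILDCARD_ADDRESSES = {"0.0.0.0", "*", "[::]", "::"}


def listening_addresses(ss_output: str, port: int) -> List[str]:
    """Return the local addresses that are listening on the given TCP port."""
    found = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        # Fourth column is "address:port"; rpartition copes with IPv6 like [::]:3000
        address, _, port_text = fields[3].rpartition(":")
        if port_text == str(port):
            found.append(address)
    return found


def is_bound_to_all_interfaces(ss_output: str, port: int) -> bool:
    """True if the port is open on a wildcard address, not just on loopback."""
    return any(a in WILDCARD_ADDRESSES for a in listening_addresses(ss_output, port))
```

On the server, `is_bound_to_all_interfaces(subprocess.run(["ss", "-tln"], capture_output=True, text=True).stdout, 11434)` returning True means only the firewall stands between Ollama and the network.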

Your Production Deployment Cheat Sheet

                 ┌──────────────┐
                 │   Internet   │
                 └──────┬───────┘
                        │
                 ┌──────▼───────┐     ┌──────────────┐
                 │  Nginx Proxy │────▶│ htpasswd Auth │
                 │  (port 8080) │     └──────────────┘
                 └──────┬───────┘
                        │ (internal network only)
                 ┌──────▼───────┐
                 │    Ollama    │
                 │  (port 11434)│
                 └──────┬───────┘
                        │
              ┌─────────┼─────────┐
              │         │         │
       ┌──────▼──┐ ┌───▼────┐ ┌──▼──────┐
       │Open WebUI│ │Anything│ │ Custom  │
       │ (3000)  │ │ LLM    │ │  Apps   │
       └─────────┘ └────────┘ └─────────┘

What's Next

You've gone from "running a model in the terminal" to a production-ready AI server that your team can use.

In Chapter 6 → Function Calling & Tool Use, we'll make your local LLM actually do things — call APIs, interact with databases, and control other software.

Found this useful? ⭐ Star the repo — it helps others find it and you'll get notified when new chapters drop.

Getting Started: Run Your First Local LLM in 5 Minutes

Lingdas1 — Sat, 23 May 2026 19:01:20 +0000

01 — Getting Started: Run Your First Local LLM (5 Minutes)

🟢 Beginner — No experience needed. Just a computer and 5 minutes.

What Is a Local LLM? (Plain English)

An LLM (Large Language Model) is the brain behind ChatGPT, Claude, and Gemini.

A local LLM runs that brain on your own computer — not on someone else's server.

Why does that matter?

Cloud AI (ChatGPT, Claude)	Local AI (Ollama + models)
$20–$200/month subscription	$0 subscription (you pay only for hardware and electricity)
Your data is sent to their servers	Private — everything stays on your machine
Requires internet	Works offline
Censored, filtered, rate-limited	No limits — you control everything
One-size-fits-all model	Choose any model for any task

💡 Think of it this way: Cloud AI is like renting a car. Local AI is like owning a bicycle. The bicycle is slower, but it's yours, it's free, and nobody can take it away from you.

What You Need

Minimum requirements:

A computer (Windows, macOS, or Linux)
At least 8 GB of RAM (16 GB recommended)
A few GB of free disk space

Nice to have (but not required):

A GPU with 4+ GB VRAM (models run faster, but CPU is fine to start)

Step 1: Install Ollama

Ollama is the easiest way to run local LLMs. Think of it as the "App Store for AI models."

macOS

# The install.sh script is Linux-only. On macOS, download the app from
# ollama.com/download, or install it with Homebrew:
brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com/download and run it.

Verify Installation

Open a new terminal and type:

ollama --version

You should see something like this (your version number will differ):

ollama version is 0.6.0

🔥 Pro tip: If you get "command not found" on Linux/macOS, restart your terminal or run: export PATH=$PATH:/usr/local/bin

Step 2: Pull Your First Model

Now for the fun part — downloading an actual AI brain to run on your computer.

ollama pull qwen2.5:7b

This downloads a 4.7 GB model. On a fast connection (roughly 150–300 Mbps) it takes 2–5 minutes; slower links take proportionally longer.

While it downloads, here's what's happening:

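Ollama fetches the model weights as a GGUF file (a compact, usually quantized model format)
It also fetches the model's prompt template and default parameters
GPU detection and loading into memory happen later, the first time you run the model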

What if the download is too big? Try a smaller model:

# For 8 GB RAM laptops — works on almost anything
ollama pull qwen2.5:1.5b

# For 4 GB RAM or very old computers
ollama pull qwen2.5:0.5b

Step 3: Chat With Your Model

ollama run qwen2.5:7b

You'll see a prompt like >>>. Type something:

>>> Write a haiku about a cat sitting on a computer

The model will think for a moment and then respond. Congratulations — you just ran an AI on your own hardware! 🎉

Try These First Commands

>>> Write a Python function to calculate fibonacci

>>> Explain quantum computing like I'm 10

>>> What's the meaning of life?

>>> /? -- show all available commands

>>> /exit -- quit the chat

⚠️ Expect it to be slower than ChatGPT. That's normal! Local models run at 15–40 tokens per second on a GPU, or 2–6 tok/s on CPU. On a GPU that is still faster than most people read; on CPU it is closer to a slow reading pace.

Step 4: Choose the Right Model for Your Hardware

Not sure which model to pick? Use this decision tree:

Your GPU VRAM?
├── No GPU (CPU only)
│   ├── 32 GB RAM → qwen2.5:7b (slow but works)
│   ├── 16 GB RAM → qwen2.5:1.5b
│   └── 8 GB RAM  → qwen2.5:0.5b
├── 4–6 GB VRAM   → qwen2.5:7b
├── 8–12 GB VRAM  → deepseek-r1:14b (🟢 BEST for most people)
├── 12–16 GB VRAM → qwen2.5:14b or deepseek-r1:14b with longer context (deepseek-r1:32b needs ~19 GB at Q4)
├── 24 GB VRAM    → qwen3.6:27b or deepseek-r1:32b (Q4)
└── 36+ GB VRAM   → deepseek-r1:70b or qwen2.5:72b

Model Comparison Table

Model	Ollama Command	Size (Disk)	Min RAM	Min VRAM	Quality
Qwen 2.5:0.5B	`ollama pull qwen2.5:0.5b`	0.5 GB	4 GB	None	Basic text
Qwen 2.5:1.5B	`ollama pull qwen2.5:1.5b`	1.1 GB	8 GB	None	Simple tasks
Qwen 2.5:7B	`ollama pull qwen2.5:7b`	4.7 GB	8 GB	4 GB	🟢 Good start
Qwen 2.5:14B	`ollama pull qwen2.5:14b`	9.0 GB	16 GB	8 GB	Excellent
DeepSeek-R1:14B	`ollama pull deepseek-r1:14b`	8.2 GB	16 GB	8 GB	🏆 Best value
DeepSeek-R1:32B	`ollama pull deepseek-r1:32b`	18.7 GB	32 GB	24 GB	Near o1-mini level
Qwen 3.6:27B	`ollama pull qwen3.6:27b`	15 GB	32 GB	16 GB	Cutting-edge
Llama 3.1:8B	`ollama pull llama3.1:8b`	4.9 GB	8 GB	4 GB	Good general

My recommendation for first-timers: Start with qwen2.5:7b. It runs on almost anything, and it's good enough to be genuinely useful.

What to Do After Your First Chat

You've run your first local LLM. Now what?

Next steps in order:

#	Task	Why	Guide
1	Customize your model with a Modelfile	Control temperature, context length, and behavior	GGUF & Modelfile Guide
2	Install Open WebUI	Get a ChatGPT-like web interface instead of the terminal	Open WebUI Setup
3	Benchmark your hardware	See what speeds your setup can achieve	Script: `./scripts/ollama-benchmark.sh`
4	Add document search (RAG)	Let your LLM answer questions about your own files	RAG Guide
5	Try a reasoning model	Switch to DeepSeek-R1 for harder problems	DeepSeek-R1 Guide

Common First-Timer Problems (And Fixes)

Problem	Why	Fix
"ollama: command not found"	Ollama not in PATH	Restart terminal, or run: `export PATH=$PATH:/usr/local/bin`
Download is very slow	Big file on slow internet	Try `ollama pull qwen2.5:1.5b` instead (much smaller)
Model responds very slowly	Running on CPU	This is normal! See the speed note at the end of Step 3
Model responds in Chinese	Qwen is multilingual and sometimes drifts into Chinese	Add `SYSTEM "Always respond in English."` to a Modelfile
"CUDA out of memory"	Model too big for your GPU	Use a smaller model or lower quantization
"Connection refused"	Ollama server not running	Run `ollama serve` in a separate terminal first

Quick Reference: Common Ollama Commands

# List all downloaded models
ollama list

# Show currently running models
ollama ps

# Delete a model to free space
ollama rm qwen2.5:7b

# Update a model to the latest version
ollama pull qwen2.5:7b

# Run a model with a one-shot prompt (non-interactive)
ollama run qwen2.5:7b "Write a Python script to download images from a URL"

# Use the API (OpenAI compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:7b", "messages": [{"role": "user", "content": "Hello!"}]}'
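
The same OpenAI-compatible call can be made from Python with only the standard library. This is a sketch that mirrors the curl command above; `build_chat_request`, `extract_reply`, and `chat` are hypothetical helper names, and the response shape (`choices[0].message.content`) is assumed to follow the OpenAI chat completions format.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "qwen2.5:7b",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build the POST request for the /v1/chat/completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json: str) -> str:
    """Pull the assistant's text out of a chat completions response body."""
    return json.loads(response_json)["choices"][0]["message"]["content"]


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        return extract_reply(resp.read().decode("utf-8"))
```

With Ollama running and the model pulled, `print(chat("Hello!"))` should print the model's answer.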

Your First Week Plan

Day	Task	Time
Day 1	Install Ollama + pull a model + chat with it	5 minutes ✅
Day 2	Try different models (small vs large)	15 minutes
Day 3	Customize with a Modelfile	30 minutes
Day 4	Install Open WebUI	30 minutes
Day 5	Ask your LLM to write code or help with real work	1 hour
Weekend	Try RAG — let your LLM read your documents	1 hour

🎯 You've taken the first step. Running a local LLM is like learning to ride a bike — wobbly at first, but once you get it, you'll wonder why you didn't start sooner.

Found this helpful? ⭐ Star the repo — it helps others find it too.

— Ling, a medical student who accidentally fell into AI and wants to help you do the same.

Hardware Guide: What Do You Actually Need to Run Local LLMs?

Lingdas1 — Sat, 23 May 2026 18:57:27 +0000

02 — Hardware Guide: What Do You Actually Need?

🟢 Beginner — No matter what computer you have, there's a model that will run on it.

The Most Important Thing to Know

VRAM is the bottleneck, not compute.

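The GPU decides how fast a model runs, not how well it answers. A model at Q4 quantization on a 5-year-old RTX 3060 produces the same quality of output as that same Q4 model on an A100, just slower. Quantizing to Q4 is what costs quality (by a rough estimate you keep about 96% of the full-precision model), and that loss is the same on any GPU. And "slower" for most use cases (chat, coding, document analysis) still means 20-40 tokens per second, which is faster than most people read.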

💡 Analogy: Running AI locally is like cooking at home vs. going to a Michelin-star restaurant. The restaurant (cloud AI) is faster and fancier. But your home cooking (local AI) is free, private, and tastes just as good — just takes a bit longer.

The Quick Decision Tree

What computer do you have?
├── Gaming PC / Workstation with NVIDIA GPU
│   ├── 24-32 GB VRAM (RTX 3090/4090, RTX 5090) → deepseek-r1:32b or qwen3.6:27b
│   ├── 12-16 GB VRAM (RTX 4070/5070, RTX 4080) → qwen2.5:14b or deepseek-r1:14b  🟢
│   └── 8-12 GB VRAM (RTX 3060/4060) → qwen2.5:7b or deepseek-r1:7b
├── Mac
│   ├── 36 GB+ unified (M4 Max, M3 Max) → qwen3.6:27b or deepseek-r1:32b
│   └── 16 GB unified (M1/M2/M3) → qwen2.5:7b or phi4:14b
├── AMD GPU
│   ├── 16 GB+ VRAM (RX 7900 XTX) → qwen2.5:14b
│   └── 8-12 GB VRAM (RX 7600/7700) → qwen2.5:7b
├── Intel Arc GPU → qwen2.5:7b (experimental support)
├── CPU only, 32 GB+ RAM → qwen2.5:7b (2-6 tok/s)
├── CPU only, 16 GB RAM → qwen2.5:1.5b (5-10 tok/s)
└── CPU only, 8 GB RAM → qwen2.5:0.5b (10-15 tok/s)

GPU Comparison Table

GPU	VRAM	Architecture	Best Model	Speed (tok/s)	Used Price
RTX 3060 12GB	12 GB	Ampere	Qwen 2.5:14B (Q4)	25-35	~$200
RTX 4060 8GB	8 GB	Ada Lovelace	Qwen 2.5:7B (Q4)	35-50	~$280
RTX 4060 Ti 16GB	16 GB	Ada Lovelace	DeepSeek-R1:14B (Q4)	30-45	~$400
RTX 4070 12GB	12 GB	Ada Lovelace	Qwen 2.5:14B (Q4)	40-55	~$500
RTX 4070 Ti Super 16GB	16 GB	Ada Lovelace	DeepSeek-R1:14B (Q4)	35-50	~$650
RTX 4080 16GB	16 GB	Ada Lovelace	DeepSeek-R1:32B (Q3, tight fit)	20-30	~$800
RTX 4090 24GB	24 GB	Ada Lovelace	DeepSeek-R1:32B (Q3)/Qwen 3.6:27B	25-35	~$1,500
RTX 5090 32GB	32 GB	Blackwell	DeepSeek-R1:70B (Q3)	18-25	~$2,000
RTX 3090 24GB	24 GB	Ampere	DeepSeek-R1:32B (Q3)/Qwen 3.6:27B	15-25	~$700 🟢
RTX 3080 10/12GB	10/12 GB	Ampere	Qwen 2.5:14B (Q4)	20-30	~$350
RX 7900 XTX 24GB	24 GB	RDNA 3	Qwen 3.6:27B	20-30	~$800
RX 7800 XT 16GB	16 GB	RDNA 3	DeepSeek-R1:14B (Q4)	25-35	~$450
Arc A770 16GB	16 GB	Alchemist	Qwen 2.5:14B (Q4)	15-25	~$250

🟢 Best value picks: Used RTX 3090 ($700 for 24 GB VRAM) or used RTX 3060 12GB ($200 for a starter).

Model × VRAM Compatibility Matrix

Model	Q8 (Full Quality)	Q6_K	Q4_K_M 🟢	Q3_K_M	Q2_K
Qwen 2.5:0.5B	0.7 GB	0.5 GB	0.4 GB	0.3 GB	0.2 GB
Qwen 2.5:1.5B	1.9 GB	1.5 GB	1.1 GB	0.9 GB	0.7 GB
Qwen 2.5:7B	8.1 GB	6.3 GB	4.7 GB	3.8 GB	2.9 GB
Qwen 2.5:14B	15.5 GB	12.1 GB	9.0 GB	7.2 GB	5.4 GB
DeepSeek-R1:7B	8.1 GB	6.3 GB	4.7 GB	3.8 GB	2.9 GB
DeepSeek-R1:14B	14.7 GB	11.2 GB	8.2 GB	6.4 GB	4.9 GB
DeepSeek-R1:32B	33.6 GB	25.4 GB	18.7 GB	14.5 GB	10.8 GB
DeepSeek-R1:70B	72.0 GB	55.0 GB	40.0 GB	31.0 GB	23.0 GB
Qwen 3.6:27B	30.0 GB	23.0 GB	15.0 GB	11.5 GB	8.5 GB
Llama 3.1:8B	9.0 GB	7.0 GB	4.9 GB	3.8 GB	2.8 GB
GPT-OSS:20B	22.0 GB	17.0 GB	11.5 GB	8.5 GB	6.5 GB

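How to read this table: Each cell is the approximate size of the model weights at that quantization. Find your model's row, then pick the largest quantization whose size still leaves about 2 GB of your VRAM free. Q4_K_M is the usual starting point. For example, with 12 GB VRAM, Qwen 2.5:14B (Q4_K_M = 9.0 GB) fits comfortably.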
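
The lookup can also be done in code. The sketch below copies a few Q4_K_M sizes from the matrix above and applies the 2 GB headroom rule of thumb used later in this guide; the dictionary and the `models_that_fit` function are illustrative, not part of Ollama.

```python
from typing import Dict, List

# Q4_K_M sizes in GB, copied from the compatibility matrix above.
Q4_K_M_SIZES_GB: Dict[str, float] = {
    "qwen2.5:0.5b": 0.4,
    "qwen2.5:1.5b": 1.1,
    "qwen2.5:7b": 4.7,
    "qwen2.5:14b": 9.0,
    "deepseek-r1:7b": 4.7,
    "deepseek-r1:14b": 8.2,
    "deepseek-r1:32b": 18.7,
    "deepseek-r1:70b": 40.0,
}


def models_that_fit(
    vram_gb: float,
    headroom_gb: float = 2.0,
    sizes: Dict[str, float] = Q4_K_M_SIZES_GB,
) -> List[str]:
    """Return models whose weights fit in VRAM with headroom, largest first."""
    budget = vram_gb - headroom_gb
    fitting = [name for name, size in sizes.items() if size <= budget]
    return sorted(fitting, key=lambda name: sizes[name], reverse=True)


print(models_that_fit(12))
```

The first entry of the result is the largest model that still leaves room for the KV cache on your card.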

Budget Builds

The "Get Started" Build — $0 (Use What You Have)

If you already have a computer, you can probably run something:

Your Current PC	Best Free Option	Can It Be Useful?
Any laptop with 8 GB RAM	Qwen 2.5:1.5B	✅ Basic Q&A, simple tasks
Any laptop with 16 GB RAM	Qwen 2.5:7B (CPU mode)	✅ ✅ Writing, brainstorming
Old gaming PC with GTX 1060	Qwen 2.5:7B (GPU accel)	✅ ✅ ✅ Coding, summarization
MacBook M1 with 8 GB	Qwen 2.5:1.5B	✅ Basic assistance

Cost: $0. You already own it. Just install Ollama.

The "Best Bang for Buck" Build — ~$700

Used RTX 3090 (24 GB VRAM)  → $700
Rest of PC (keep what you have)
Total: ~$700

What you can run with this:

DeepSeek-R1:32B (Q4_K_M), with reasoning near o1-mini level
Qwen 3.6:27B — latest cutting-edge
Any 7B-14B model at full quality

APIs this replaces: ChatGPT Pro ($200/mo) + Claude Pro ($20/mo) = break-even in ~3 months

The "Serious" Build — ~$2,500

New RTX 5090 (32 GB VRAM)    → $2,000
64 GB DDR5 RAM                → $200
1 TB NVMe SSD                 → $80
Rest is your existing PC
Total: ~$2,300 for the listed parts (the ~$2,500 headline figure leaves some margin)

What you can run: DeepSeek-R1:70B or Qwen 2.5:72B at Q3 and below or with partial CPU offload (Q4 needs about 40 GB), or several smaller models at once

CPU-Only Guide

No GPU? No problem. You can still run local LLMs — just slower.

What to Expect

CPU	RAM	Best Model	Expected Speed	Readable?
Modern i5/i7 (2020+)	32 GB	Qwen 2.5:7B	2-6 tok/s	✅ Yes, like reading speed
Modern i5/i7 (2020+)	16 GB	Qwen 2.5:1.5B	5-10 tok/s	✅ Comfortable
Older i5 (2017+)	16 GB	Qwen 2.5:1.5B	3-6 tok/s	✅ Yes
Laptop (any)	8 GB	Qwen 2.5:0.5B	8-15 tok/s	✅ Fast enough

2-6 tok/s means a 100-word paragraph (roughly 130 tokens) takes about 20-65 seconds to generate. It's slow by GPU standards but perfectly usable for getting answers.
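
To translate a tokens-per-second figure into waiting time for your own use case, a rough estimate is enough. This sketch assumes about 1.3 tokens per English word, which is a common rule of thumb and not a measured value for any particular model.

```python
def generation_seconds(
    words: int, tokens_per_second: float, tokens_per_word: float = 1.3
) -> float:
    """Estimate how long it takes to generate a reply of the given length."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return words * tokens_per_word / tokens_per_second


for speed in (2, 6, 30):
    print(f"{speed} tok/s: {generation_seconds(100, speed):.0f} s for 100 words")
```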

Tips for CPU Users

Use smaller models: Qwen 2.5:1.5B is surprisingly capable and runs well on any CPU
Close other apps: Free up RAM for the model
Try Q3_K_M or Q2_K quantization if memory is tight: smaller, though quality drops noticeably at Q2_K
Try llama.cpp directly: Sometimes faster than Ollama on CPU

RAM & VRAM Deep Dive

How much RAM do you need?

Usage	Minimum RAM	Recommended RAM
CPU-only with small models (1.5B)	8 GB	16 GB
CPU-only with medium models (7B)	16 GB	32 GB
GPU offloading + OS + browser	16 GB	32 GB
Running multiple models	32 GB	64 GB
Production server	32 GB	64 GB

How VRAM is actually used

When you run a model, VRAM is consumed by:

Model weights (the biggest chunk — see the matrix above)
KV Cache (very roughly 1 GB per 8K tokens of context; the exact figure depends on the model's layer count and attention layout)
Compute buffers (~0.5 GB)
Other apps (your OS, browser, etc.)

Rule of thumb: Pick a model whose Q4_K_M size is at least 2 GB less than your total VRAM. The extra headroom handles the KV cache.
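
The KV cache figure can be estimated from a model's architecture: every token stores one key and one value vector per layer per KV head. The sketch below uses that standard formula with 16-bit cache values. The example shape (32 layers, 8 KV heads, head dimension 128) is a typical 8B-class configuration with grouped-query attention and is given as an assumption; look up your model's real values in its config.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes = fp16 cache entries
) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    # Factor 2: one key vector and one value vector per token, layer and KV head.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    )
    return total_bytes / 1024**3


# Assumed 8B-class shape with grouped-query attention, 8K context.
print(kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192))
```

Older models without grouped-query attention keep one KV head per attention head, so the same context can cost several times more memory, which is why a fixed headroom is only a rule of thumb.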

Mac Users: Special Considerations

Apple Silicon Macs are surprisingly good for local LLMs because of unified memory — the GPU can access all system RAM.

Mac Model	Unified Memory	Best Model	Notes
M1 MacBook Air	8 GB	Qwen 2.5:1.5B	Surprising quality from small model
M1/M2 MacBook Pro	16 GB	Qwen 2.5:7B	Sweet spot for Mac users
M3 Pro/Max	36 GB	Qwen 3.6:27B	Top-tier performance
M4 Max	48-128 GB	DeepSeek-R1:70B	Ultimate local AI machine
Mac Studio M3 Ultra	96-512 GB	Run anything	Absolute beast

Pro tip: On Apple Silicon, MLX (Apple's ML framework) can sometimes run models faster than Ollama's default backend. To compare:

# Baseline: Ollama
ollama run qwen2.5:7b "Hello"

# MLX-native: mlx-lm loads models from Hugging Face, not from Ollama tags
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit --prompt "Hello"

AMD & Intel GPU Users

AMD (ROCm support)

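# The standard Linux installer detects AMD GPUs and fetches the ROCm build
curl -fsSL https://ollama.com/install.sh | sh

# Verify GPU detection: load a model, then check the PROCESSOR column
ollama run qwen2.5:7b "hi"
ollama ps
# Should report something like "100% GPU"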

Known quirks:

RX 6000 series works well
RX 7000 series needs ROCm 6.0+
Integrated AMD GPUs (like in laptops) are not supported
Performance is about 80-90% of equivalent NVIDIA

Intel Arc

# Intel Arc support was added in Ollama 0.22+
# Check your version first
ollama --version

# If 0.22+, just pull and run
ollama run qwen2.5:7b

Known quirks:

Arc A770 16GB is a surprisingly good budget option (~$250 used)
Arc A580/A750 have limited support
Expect 60-70% of NVIDIA performance
Some models may fail on first load (retry usually works)

The "I Just Want to Buy Something" Recommendation

Budget	Buy This	Why
$200	Used RTX 3060 12GB	Best cheap entry point
$700	Used RTX 3090 24GB	Best value for serious local AI
$2,000	New RTX 5090 32GB	Best new card for AI (2026)
$4,000+	Mac Studio M3 Ultra	If you also do video/audio work

Quick Reference Card

# Check your GPU (Linux)
nvidia-smi
# Look for: "Memory-Usage: 4096MiB / 12288MiB" — the second number is your VRAM

# Check your GPU (macOS; Intel Macs only. Apple Silicon shares system memory, so check RAM below)
system_profiler SPDisplaysDataType | grep VRAM

# Check your RAM (Linux)
free -h

# Check your RAM (macOS)
system_profiler SPHardwareDataType | grep Memory

# See if Ollama is using your GPU: load a model, then check the PROCESSOR column
ollama run qwen2.5:7b "hi" > /dev/null
ollama ps

Bottom line: Don't overthink hardware. Download Ollama, try qwen2.5:1.5b or qwen2.5:7b, and see how it feels. You can always upgrade later.

Part of the Local LLM Guide — the definitive resource for running AI on your own hardware.

Open WebUI: Your Local ChatGPT

Lingdas1 — Sat, 23 May 2026 18:55:33 +0000

Open WebUI: Your Local ChatGPT

Transform your local LLM into a beautiful, full-featured web interface — like ChatGPT, but running entirely on your machine.

What Is Open WebUI?

Open WebUI is a self-hosted web interface for Ollama. It gives you:

🖥️ A ChatGPT-like chat interface in your browser
🔄 Switch between models mid-conversation
📁 Upload documents and chat with them (RAG)
🖼️ Image generation (via Automatic1111 / ComfyUI)
🎤 Voice input / text-to-speech
👥 Multi-user support (share with family or team)
📱 Mobile-friendly (works on phone browsers)
🔌 Plugins for images, web search, and more

Best of all: It connects to your local Ollama instance — no data ever leaves your machine.

Prerequisites

✅ Ollama installed and working (see Getting Started)
✅ At least one model pulled (e.g., qwen2.5:7b)
✅ Docker installed (recommended) OR Python 3.11+

Option A: Install with Docker (Recommended — 2 Minutes)

Docker is the easiest way. One command and you're done:

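# --add-host lets the container resolve host.docker.internal on Linux
# (Docker Desktop on macOS and Windows does this automatically)
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main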

What this does:

-p 3000:8080 — makes it available at http://localhost:3000
-v open-webui:/app/backend/data — keeps your chats saved even if you restart
-e OLLAMA_BASE_URL — tells it where your Ollama is running
--restart always — auto-starts when your computer boots

Verify It's Running

# Check logs — you should see "Application startup complete"
docker logs open-webui --tail 20

Then open http://localhost:3000 in your browser.

First time? Create an account. Don't worry — it's local only. Your data stays on your machine.

Option B: Install with pip (No Docker)

If you don't have Docker:

# Install
pip install open-webui

# Run
open-webui serve

Then open http://localhost:8080.

What You'll See

After logging in, Open WebUI looks and feels like ChatGPT:

Key areas:

Area	What It Does
Chat panel (left)	Your conversation history
Model selector (top)	Switch between all your downloaded models
Chat input (bottom)	Type your message
Paperclip icon	Upload documents
Settings gear	Configure model parameters, RAG, voice

Cool Things to Try

1. Switch Models Mid-Chat

In the top dropdown, you can switch models during a conversation. Each model sees the same chat history.

Start with qwen2.5:7b for general chat
Switch to deepseek-r1:14b when you need hard reasoning
Switch to codellama for code tasks

2. Upload Documents (Built-in RAG)

Click the paperclip icon and upload a PDF, Word doc, or text file. The model can then answer questions about it.

Use cases:

Upload a research paper and ask questions
Upload your company's handbook
Upload a textbook chapter for study help

3. Use Voice Input

Click the microphone icon to speak instead of type. This works in Chrome and Edge.

4. Customize the Model's Behavior

In Settings → Model, you can adjust:

Temperature: 0.2 (precise) to 1.0 (creative)
Context length: How much the model remembers
System prompt: The model's persona
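
The same three knobs exist in Ollama's native API, which is useful when you call the model from a script. The sketch below builds a request body for Ollama's `/api/chat` endpoint, where `temperature` and `num_ctx` go under `options` and the persona is a `system` message. The helper name and default values are assumptions, and how Open WebUI maps its settings onto these fields internally is not shown here.

```python
from typing import Optional


def build_ollama_chat_payload(
    prompt: str,
    model: str = "qwen2.5:7b",
    system: Optional[str] = None,
    temperature: float = 0.7,
    num_ctx: int = 4096,
) -> dict:
    """Build the JSON body for POST /api/chat on an Ollama server."""
    messages = []
    if system:
        # System prompt: sets the model's persona and ground rules.
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {
        "model": model,
        "messages": messages,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "temperature": temperature,  # lower = more precise, higher = more creative
            "num_ctx": num_ctx,          # context length in tokens
        },
    }
```

Send the result as JSON to `http://localhost:11434/api/chat`; a larger `num_ctx` lets the model see more of the conversation at the cost of more memory.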

Advanced: Connecting to Other Services

Image Generation

Open WebUI can integrate with local image generators:

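# Add Automatic1111 (Stable Diffusion)
# Note: AUTOMATIC1111 publishes no official Docker image. Treat the image name
# below as a placeholder for a community image you trust, and make sure the WebUI
# is started with its --api flag, which Open WebUI needs in order to connect.
docker run -d \
  -p 7860:7860 \
  -v sd-models:/models \
  --gpus all \
  asd/stable-diffusion-webui:latest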

Then configure in Open WebUI Settings → Image Generation.

Web Search (Experimental)

Enable web search in Settings → Web Search. Open WebUI will search the internet when answering questions.

Production Setup

With HTTPS

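For secure remote access (behind a VPN or tunnel), put a reverse proxy that terminates HTTPS in front of the Open WebUI container from Option A:

# Using Caddy as a reverse proxy (chat.example.com is a placeholder hostname)
# Caddy requests a certificate automatically for a public hostname.
docker run -d \
  -p 80:80 -p 443:443 \
  -v caddy_data:/data \
  --add-host=host.docker.internal:host-gateway \
  --name caddy \
  --restart always \
  caddy:latest \
  caddy reverse-proxy --from chat.example.com --to host.docker.internal:3000

# Also set -e WEBUI_SECRET_KEY=your-secret-here on the open-webui container
# so it signs login sessions with a fixed key.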

Multi-User Setup

Open WebUI supports multiple users out of the box. Each user:

Gets their own chat history
Can't see other users' chats
Can choose from any model you've pulled

To add users: Go to Settings → Admin Panel → Users → Create User.

Troubleshooting

Problem	Cause	Fix
"Connection refused"	Ollama not running	Start Ollama first: `ollama serve`
Blank page at localhost:3000	Container not started	`docker start open-webui`
"No models available"	No models pulled	`ollama pull qwen2.5:7b`
Slow document Q&A	Embedding model not loaded	First doc upload takes extra time to load embeddings
Port 3000 already in use	Another service using it	Change port: `-p 8080:8080` and use `http://localhost:8080`
Container won't start	Docker not running	Start Docker Desktop or Docker daemon

Resources

Official docs: docs.openwebui.com
GitHub: github.com/open-webui/open-webui
Container image: ghcr.io/open-webui/open-webui

Next step: Now that you have a GUI, try setting up Local RAG — let your LLM answer questions about your own documents.

Part of the Local LLM Guide — the definitive resource for running AI on your own hardware.

Local RAG: Chat With Your Documents (Open Source, Private)

Lingdas1 — Sat, 23 May 2026 18:49:41 +0000

Local RAG: Chat With Your Documents

Upload PDFs, code, research papers, or entire books — then ask your local LLM questions about them. No data ever leaves your machine.

What Is RAG? (Plain English)

RAG (Retrieval-Augmented Generation) means your LLM can look up information from your own documents before answering.

Think of it like this:

Normal LLM: Has a great memory, but only knows what it learned during training
RAG: The LLM gets a "cheat sheet" — your documents — that it can read before answering

💡 Analogy: Without RAG, the LLM is like a student taking a closed-book exam. With RAG, they get an open-book exam — and you get to write the book.

Real-World Uses

Use Case	What You Upload	What You Can Ask
Research	PDF papers, articles	"What were the key findings in this study?"
Studying	Textbooks, lecture notes	"Explain chapter 7 in simpler terms"
Work	Company docs, reports	"What's our Q3 strategy?"
Legal	Contracts, agreements	"What are the termination clauses?"
Coding	Codebase, documentation	"How does the auth module work?"
Personal	Journals, notes, books	"What did I write about in March?"

Option A: Built-in RAG in Open WebUI (Simplest)

If you already have Open WebUI installed, RAG is built-in.

How to Use It

Open http://localhost:3000 in your browser
Click the paperclip icon next to the chat input
Upload a PDF, .txt, .docx, or .md file
Wait for the "embedding" process to finish (usually 10-30 seconds)
Ask questions about the document

That's it. No configuration needed.

Pro Tips

Multiple documents: You can upload several files at once. Open WebUI indexes them all.
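Model choice: Use qwen3.6:27b or deepseek-r1:14b for best RAG quality. Larger models make better use of long retrieved passages, but Ollama's default context window is small, so raise the context length in the model settings if answers ignore parts of the document.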
Document size: Open WebUI handles documents up to hundreds of pages. For very large documents, consider chunking them.
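
Chunking a very large document before upload can be as simple as splitting it into overlapping windows of words. This is a minimal sketch; the 200-word chunk size and 40-word overlap are illustrative defaults, not values used by Open WebUI.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of chunk_size words that share overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, so a question about it can still find a match.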

Option B: AnythingLLM (More Powerful)

AnythingLLM is a dedicated RAG application with more features than Open WebUI's built-in system.

Installation

With Docker (Recommended):

docker run -d \
  -p 3001:3001 \
  -v anythingllm:/app/server/storage \
  -e STORAGE_DIR=/app/server/storage \
  --name anythingllm \
  --restart always \
  mintplexlabs/anythingllm:latest

Then open http://localhost:3001.

Without Docker:

Download from anythingllm.com and run the installer for your OS.

Configuration

1. Open AnythingLLM at http://localhost:3001
2. Create an admin account (local only; no data leaves your machine)
3. Go to Settings → LLM Provider
4. Select Ollama from the dropdown
5. Choose your model (e.g., qwen2.5:7b or deepseek-r1:14b)
6. Click Save

Now set up embeddings:

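1. Go to Settings → Embedding Provider
2. Select Ollama
3. Choose an embedding model. With Ollama as the provider, the list shows models Ollama has already pulled, so pull a small embedding model first (for example, ollama pull nomic-embed-text, a download of a few hundred MB)
4. Click Save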
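
Before picking models in these settings, it can help to confirm what your local Ollama server actually has. The sketch below assumes Ollama is listening on its default address (http://localhost:11434) and uses its `/api/tags` endpoint, which lists pulled models. The helper names and the sample payload are made up for illustration.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address; adjust if yours differs

def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in payload.get("models", [])]

def fetch_model_names(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask a running Ollama server which models it has pulled."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))

# Hand-written payload in the shape /api/tags returns.
# The model names are illustrative, not a claim about your install.
sample = {"models": [{"name": "qwen2.5:7b"}, {"name": "nomic-embed-text:latest"}]}
print(parse_model_names(sample))
```

With Ollama running, `fetch_model_names()` returns the real list. If the model you want is missing, pull it with `ollama pull` before returning to the settings page.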

Uploading Documents

1. Click "New Workspace" and give it a name (e.g., "Research Papers")
2. Click the upload icon (or drag and drop files). Supported formats: PDF, DOCX, TXT, MD, CSV, JSON, code files
3. Click "Save and Embed"
4. Wait for indexing (progress shows in the UI)

Chatting With Your Documents

Once embedded, just type your question:

"What are the three main conclusions from these papers?"

AnythingLLM searches your documents for relevant passages and feeds them to the LLM along with your question. The result is answers grounded in your documents, with sources, rather than guesses from training data alone.

🔥 Pro tip: AnythingLLM shows you which document each answer came from. Hover over the citation to see the exact source passage.
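
The "search for relevant passages" step is usually a nearest-neighbor lookup over embedding vectors. Here is a minimal sketch of that idea with cosine similarity. It is not AnythingLLM's actual code: the file names, passages, and 3-dimensional vectors are invented, and real embeddings have hundreds of dimensions. Keeping the source file next to each passage is what makes citations possible.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k (source, text, score) entries most similar to the query."""
    scored = [(src, text, cosine(query_vec, vec)) for src, text, vec in index]
    return sorted(scored, key=lambda item: item[2], reverse=True)[:k]

# Toy index: (source file, passage, made-up 3-dimensional embedding).
index = [
    ("paper_a.pdf", "Method X improves recall.", [0.9, 0.1, 0.0]),
    ("paper_b.pdf", "Dataset Y has 10k samples.", [0.1, 0.9, 0.0]),
    ("paper_c.pdf", "Method X is slower to train.", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "What do we know about Method X?"

for source, text, score in top_k(query_vec, index):
    print(f"{score:.2f}  {source}: {text}")
```

In this toy index the two "Method X" passages rank highest, and each result carries the file it came from.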

Option C: Manual RAG with LangChain (For Developers)

For maximum control, build RAG with Python and LangChain. This is particularly useful if you want to automate document processing.

Setup

pip install langchain langchain-community langchain-text-splitters langchain-ollama chromadb

Basic RAG Script

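from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
# RetrievalQA is a legacy chain. On LangChain 1.x you may need to install
# langchain-classic and import it from langchain_classic.chains instead.
from langchain.chains import RetrievalQA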

# 1. Load your documents
loader = DirectoryLoader("./my-docs/", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")

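# 3. Create embeddings and vector store
# Use a dedicated embedding model (pull it first: ollama pull nomic-embed-text).
# A chat model such as qwen2.5:7b is not built to produce retrieval embeddings.
embeddings = OllamaEmbeddings(model="nomic-embed-text")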
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 4. Create RAG chain
llm = ChatOllama(model="qwen2.5:7b", temperature=0.3)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# 5. Ask questions
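while True:
    question = input("\nAsk a question (or 'quit'): ").strip()
    if question.lower() == 'quit':
        break
    # RetrievalQA takes its input under "query" and returns the answer under "result"
    answer = qa_chain.invoke({"query": question})
    print(f"\nAnswer: {answer['result']}")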

Run It

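# Save the script above as rag.py, then put your documents in a folder called "my-docs/"
mkdir -p my-docs
# Copy your .txt files there (the script's glob only matches *.txt; PDFs need a PDF loader)

# Pull the chat model the script uses, then run it
ollama pull qwen2.5:7b
python rag.py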
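
Because the script's loader uses the glob `**/*.txt`, any other file in `my-docs/` is silently ignored. A small pre-flight check, sketched here with only the standard library, can show what will and will not be loaded. The function name `find_loadable` is invented for this example.

```python
from pathlib import Path

def find_loadable(docs_dir: str, pattern: str = "**/*.txt") -> tuple[list[Path], list[Path]]:
    """Split files under docs_dir into those the glob matches and those it skips."""
    root = Path(docs_dir)
    matched = sorted(p for p in root.glob(pattern) if p.is_file())
    everything = sorted(p for p in root.rglob("*") if p.is_file())
    skipped = [p for p in everything if p not in matched]
    return matched, skipped

# Only report when the folder from the script above actually exists.
if Path("my-docs").is_dir():
    matched, skipped = find_loadable("my-docs")
    print(f"{len(matched)} file(s) will be loaded")
    for path in skipped:
        print(f"skipped (does not match the glob): {path}")
```

If PDFs show up under "skipped", either convert them to text first or switch the script to a loader that understands PDFs.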

Choosing the Right RAG Setup

Factor	Open WebUI RAG	AnythingLLM	LangChain
Setup time	1 click	5 minutes	30 minutes
Features	Basic	Advanced	Full control
Document types	PDF, TXT, DOCX, MD	PDF, DOCX, TXT, MD, CSV, JSON, code	Anything with a loader
Multi-document	✅	✅	✅
Citations	❌	✅	✅ (manual)
Customization	Low	Medium	High
Best for	Quick personal use	Serious knowledge work	Automation & production

My recommendation:

Start with Open WebUI's built-in RAG (fastest)
Move to AnythingLLM when you need citations and multiple workspaces
Use LangChain when you need to automate document processing

Best Practices for Better RAG Results

1. Use the Right Model

RAG works best with models that have large context windows:

Model	Context	Why It's Good for RAG
Qwen 3.6:27B	262K	Can process entire chapters at once
Qwen 2.5:14B	128K	Excellent balance of quality and speed
DeepSeek-R1:14B	128K	Best for reasoning about documents
DeepSeek-R1:32B	128K	Best overall RAG quality
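
To see why context size matters, it helps to estimate how many tokens the retrieved chunks take up. The sketch below uses the common rule of thumb of about 4 characters per token for English text. That ratio, the 1024-token reserve for the question and answer, and the 4096-token window in the example are all assumptions for illustration, not measured values.

```python
def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; ~4 characters per token is a rule of thumb for English."""
    return round(num_chars / chars_per_token)

def fits_in_context(k: int, chunk_size: int, context_tokens: int,
                    reserved_tokens: int = 1024) -> bool:
    """Check whether k chunks of chunk_size characters leave room for the
    question and the answer (reserved_tokens) inside the context window."""
    return estimate_tokens(k * chunk_size) + reserved_tokens <= context_tokens

# The LangChain script in this guide retrieves k=3 chunks of 1000 characters.
print(estimate_tokens(3 * 1000))        # 750
print(fits_in_context(3, 1000, 4096))   # True
print(fits_in_context(20, 4000, 4096))  # False
```

Small retrieval settings fit almost anywhere. It is large chunks combined with a high k that call for the long-context models in the table.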

2. Write Good Questions

❌ Bad Question	✅ Good Question
"Tell me about it"	"Summarize the methodology used in section 3"
"What's in this?"	"What are the three main arguments presented in chapter 2?"
"Is this useful?"	"What evidence does the author provide for their claim on page 15?"

3. Optimize Chunk Size

The chunk size sets how large each retrieved passage is. With k retrieved chunks, the LLM sees roughly k × chunk_size characters of your documents per question:

Chunk Size	Best For
500 chars	Short lookup questions ("What is X?")
1000 chars	General Q&A 🟢 Default
2000 chars	Summarization tasks
4000+ chars	Long-context analysis (Qwen 3.6 recommended)
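
To get a feel for how chunk size and overlap interact, here is a simplified sliding-window chunker in plain Python. It is not LangChain's RecursiveCharacterTextSplitter, which also tries to break on paragraph and sentence boundaries. It only shows the arithmetic: each window starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size characters; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "x" * 5000  # stand-in for a 5000-character document
for size in (500, 1000, 2000):
    print(size, len(chunk_text(sample, chunk_size=size, chunk_overlap=size // 5)))
```

For this 5000-character stand-in, the loop prints 13, 6, and 3 chunks. Smaller chunks mean more, narrower passages to search; larger chunks mean fewer passages that each carry more context.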

Common Pitfalls

Problem	Cause	Fix
"I don't know" to document questions	Embedding not matching	Re-save documents in workspace
Wrong answers despite having docs	Chunk size too small	Increase chunk_size to 2000+
Very slow document processing	Large files on CPU	Be patient — first embed takes longest
"Model not responding"	Context overflow	Use a model with larger context (Qwen 3.6)
Can't upload PDFs	PDF is scanned/image-based	Use OCR first (tools like marker-pdf)

Next Steps

Set up Open WebUI first (it includes RAG out of the box) → Open WebUI Guide
Try it with Chinese models → Qwen 3.6 is excellent for RAG due to its 262K context
Combine RAG with Function Calling → Chapter 06: Function Calling
Deploy in production → Chapter 05: Production
