<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Bag of words</title>
    <description>The latest articles on Forem by Bag of words (@bagofwords).</description>
    <link>https://forem.com/bagofwords</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2811167%2Fda9bcf3e-990b-4fcc-8a4c-31fc500baa0b.png</url>
      <title>Forem: Bag of words</title>
      <link>https://forem.com/bagofwords</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bagofwords"/>
    <language>en</language>
    <item>
      <title>Set up an open-source AI analyst for PostgreSQL in 2 minutes</title>
      <dc:creator>Bag of words</dc:creator>
      <pubDate>Fri, 24 Oct 2025 07:45:13 +0000</pubDate>
      <link>https://forem.com/bagofwords/set-up-an-open-source-ai-analyst-for-postgresql-in-2-minutes-3b5l</link>
      <guid>https://forem.com/bagofwords/set-up-an-open-source-ai-analyst-for-postgresql-in-2-minutes-3b5l</guid>
      <description>&lt;p&gt;AI is going to be the interface for data - that's clear. But most teams aren't running AI analysts in production yet. They're stuck experimenting because the AI doesn't understand their business context, answers are inconsistent, and there's no way to see what's breaking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bag of words&lt;/strong&gt; is an open-source framework that solves this. Deploy an AI analyst on PostgreSQL with full observability and control. Customize the context by teaching it your business definitions, connect your dbt models, BI, and documentation, and watch it improve over time as it learns from usage patterns and feedback. Your team asks questions in plain language and gets dashboards that actually make sense. Setup takes a few minutes.&lt;/p&gt;

&lt;p&gt;Here's how to do it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hzlxcuc74lt18ld9ug4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hzlxcuc74lt18ld9ug4.png" alt="Bag of words" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you begin, you'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt; installed on your machine (&lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;installation guide&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;PostgreSQL database&lt;/strong&gt; (local or remote) with some data&lt;/li&gt;
&lt;li&gt;Your Postgres &lt;strong&gt;connection string&lt;/strong&gt; ready (e.g., &lt;code&gt;postgresql://user:password@host:5432/database&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;API key&lt;/strong&gt; from your preferred LLM provider (OpenAI, Anthropic, Azure OpenAI, or Google)&lt;/li&gt;
&lt;/ul&gt;
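&lt;p&gt;Before you paste the connection string anywhere, it can save time to sanity-check its shape locally. Here's a small illustrative helper (not part of Bag of words) using only the Python standard library:&lt;/p&gt;

```python
from urllib.parse import urlparse

def parse_conn_string(dsn):
    """Validate the basic shape of a postgresql:// connection string."""
    parts = urlparse(dsn)
    assert parts.scheme in ("postgresql", "postgres"), "scheme should be postgresql://"
    assert parts.hostname, "missing host"
    assert parts.path.lstrip("/"), "missing database name"
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # Postgres default port
        "database": parts.path.lstrip("/"),
        "user": parts.username,
    }

print(parse_conn_string("postgresql://user:password@host:5432/database"))
```

&lt;p&gt;If any assertion fires, fix the string before onboarding—it's easier to debug here than through a UI error message.&lt;/p&gt;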

&lt;h2&gt;
  
  
  Step 1 — Deploy Bag of words
&lt;/h2&gt;

&lt;p&gt;Let's start by deploying &lt;strong&gt;Bag of words&lt;/strong&gt; locally using Docker.&lt;/p&gt;

&lt;p&gt;Run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--pull&lt;/span&gt; always &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:3000 bagofwords/bagofwords
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a few seconds, the service will be running. Open your browser and navigate to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://0.0.0.0:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see the &lt;strong&gt;Bag of words&lt;/strong&gt; onboarding flow. Let's walk through it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz80yxy68j7pzd0g9q4bb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz80yxy68j7pzd0g9q4bb.png" alt="Welcome screen" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Configure Your LLM
&lt;/h2&gt;

&lt;p&gt;The first step in onboarding is connecting to an LLM provider.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose your LLM provider: &lt;strong&gt;OpenAI&lt;/strong&gt;, &lt;strong&gt;Anthropic&lt;/strong&gt;, &lt;strong&gt;Azure OpenAI&lt;/strong&gt;, or &lt;strong&gt;Google&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Enter your API key&lt;/li&gt;
&lt;li&gt;(Optional) Select a specific model from your provider (e.g., a GPT or Claude variant)&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Test Connection&lt;/strong&gt; to verify&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Next&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bag of words is LLM-agnostic. You bring your own key and choose your provider. This gives you control over cost, performance, and data residency. &lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Connect Your Data Source and Select Tables
&lt;/h2&gt;

&lt;p&gt;Now let's connect to your PostgreSQL database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connect the Data Source
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Select &lt;strong&gt;PostgreSQL&lt;/strong&gt; from the list of available data sources&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Enter your connection details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; Something descriptive like "Production Analytics"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host:&lt;/strong&gt; Your database host (e.g., &lt;code&gt;localhost&lt;/code&gt; or &lt;code&gt;db.example.com&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port:&lt;/strong&gt; &lt;code&gt;5432&lt;/code&gt; (default)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database:&lt;/strong&gt; Your database name&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Username&lt;/strong&gt; and &lt;strong&gt;Password&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Or paste your full connection string&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click &lt;strong&gt;Test Connection&lt;/strong&gt; to verify&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click &lt;strong&gt;Next&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
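&lt;p&gt;If &lt;strong&gt;Test Connection&lt;/strong&gt; fails, rule out basic network reachability first. A minimal illustrative check (standard library only, not part of the product):&lt;/p&gt;

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: confirm the database host is reachable before retrying in the UI
# port_open("db.example.com", 5432)
```

&lt;p&gt;If this returns &lt;code&gt;False&lt;/code&gt;, look at firewalls, security groups, or whether Postgres only listens on localhost—the problem is the network, not Bag of words.&lt;/p&gt;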

&lt;h3&gt;
  
  
  Select Tables
&lt;/h3&gt;

&lt;p&gt;After connecting, Bag of words introspects your database schema. You'll see a list of all available tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv3b2i26yb3f8qi6pqq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv3b2i26yb3f8qi6pqq8.png" alt="Select tables for AI context" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose which tables the AI can access:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select &lt;strong&gt;all tables&lt;/strong&gt; if you want the AI to have full visibility&lt;/li&gt;
&lt;li&gt;Or pick &lt;strong&gt;specific tables&lt;/strong&gt; based on what your team needs&lt;/li&gt;
&lt;li&gt;You can always adjust this later in Settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This table selection impacts &lt;strong&gt;AI performance&lt;/strong&gt;. Start with a few relevant tables and gradually add more as needed. Fewer tables means more focused context and better query accuracy.&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;Next&lt;/strong&gt; when ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4 — Ask Your First Question
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jg7hozc9u5dy5n6q47b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jg7hozc9u5dy5n6q47b.png" alt="Main" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The onboarding is complete! Now you can start asking questions.&lt;/p&gt;

&lt;p&gt;The interface will show you some &lt;strong&gt;conversation starters&lt;/strong&gt; to get you going, or you can type your own question in plain English.&lt;/p&gt;

&lt;p&gt;Let's try an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Show me a line chart of daily active users for the last 30 days
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hit &lt;strong&gt;Enter&lt;/strong&gt; and watch what happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behind the scenes&lt;/strong&gt;, the AI agent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sacqekcrx4bqizt8xbn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sacqekcrx4bqizt8xbn.png" alt="AI Analyst agent flow" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads your question&lt;/li&gt;
&lt;li&gt;Examines the database schema and context&lt;/li&gt;
&lt;li&gt;If confidence is low or the available context is insufficient, it stops and asks for clarification; otherwise it continues&lt;/li&gt;
&lt;li&gt;Plans a data model and generates code&lt;/li&gt;
&lt;li&gt;Executes it against your Postgres database&lt;/li&gt;
&lt;li&gt;Returns the result as a table and (if appropriate) a chart&lt;/li&gt;
&lt;/ol&gt;
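&lt;p&gt;The clarification gate in step 3 can be sketched in a few lines. All names below (&lt;code&gt;plan&lt;/code&gt;, &lt;code&gt;execute&lt;/code&gt;, the threshold) are illustrative assumptions, not Bag of words' actual internals:&lt;/p&gt;

```python
CONFIDENCE_THRESHOLD = 0.7  # assumed tunable cutoff

def answer(question, context, plan, execute):
    """Plan a query; ask for clarification instead of guessing when confidence is low."""
    # plan() is a hypothetical planner returning {"sql": ..., "confidence": ...}
    proposal = plan(question, context)
    if proposal["confidence"] >= CONFIDENCE_THRESHOLD:
        return {"status": "answered", "result": execute(proposal["sql"])}
    return {"status": "clarify",
            "question": proposal.get("clarification", "Can you be more specific?")}
```

&lt;p&gt;The important design choice is that low confidence produces a question back to the user, never a plausible-looking guess.&lt;/p&gt;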

&lt;p&gt;You'll see the data model being constructed in real time, followed by code generation and execution to get the data. Then you'll get a &lt;strong&gt;line chart&lt;/strong&gt; showing the trend.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Questions Are Ambiguous
&lt;/h3&gt;

&lt;p&gt;If your question is ambiguous, the AI will &lt;strong&gt;ask for clarification&lt;/strong&gt; before proceeding. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Show me revenue by region
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI might respond: &lt;em&gt;"I found both &lt;code&gt;gross_revenue&lt;/code&gt; and &lt;code&gt;net_revenue&lt;/code&gt; columns. Which one should I use?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You clarify: &lt;em&gt;"Use gross_revenue"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The AI then generates the correct query and returns your chart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the best part:&lt;/strong&gt; After answering your question, the AI will &lt;strong&gt;suggest an instruction&lt;/strong&gt; based on the clarification:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"When user asks for data related to 'revenue', by default use &lt;code&gt;gross_revenue&lt;/code&gt;"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Click &lt;strong&gt;Accept&lt;/strong&gt;, and this rule is saved as an instruction. Next time anyone asks about revenue, the AI will know what you mean—no clarification needed.&lt;/p&gt;

&lt;p&gt;This feedback loop means the system gets smarter over time, learning your business logic with every interaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5 — Add Context to Improve Accuracy
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfxju5zppzqtp5hgt4ni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfxju5zppzqtp5hgt4ni.png" alt="Context" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the key to reliable AI analyst deployments.&lt;/strong&gt; The solution is only as good as the context you build.&lt;/p&gt;

&lt;p&gt;Context comes from two sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Machine-generated&lt;/strong&gt; — Usage patterns, clarifications, and learnings from production queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-provided&lt;/strong&gt; — Instructions, dbt docs, and semantic layer enrichments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Add instructions:&lt;/strong&gt; Click &lt;strong&gt;Instructions&lt;/strong&gt; above the prompt box. Write business rules in plain language like &lt;em&gt;"Revenue means gross_revenue"&lt;/em&gt; or &lt;em&gt;"Active users have last_seen_at within 30 days"&lt;/em&gt;. These apply to every query.&lt;/p&gt;
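&lt;p&gt;Conceptually, instructions act as term rewrites applied to the prompt before query planning. A toy illustration—the data structure is an assumption for the sketch, not how Bag of words stores instructions:&lt;/p&gt;

```python
# Illustrative store of saved instructions (assumed shape, not product internals)
INSTRUCTIONS = {
    "revenue": "gross_revenue",                                # "Revenue means gross_revenue"
    "active users": "users with last_seen_at within 30 days",  # recency rule
}

def apply_instructions(question):
    """Attach saved business definitions to any ambiguous terms in the prompt."""
    expanded = question
    for term, meaning in INSTRUCTIONS.items():
        if term in question.lower():
            expanded += f" [note: '{term}' means {meaning}]"
    return expanded

print(apply_instructions("Show me revenue by region"))
```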

&lt;p&gt;&lt;strong&gt;Connect your semantic layer:&lt;/strong&gt; Go to &lt;strong&gt;Integrations&lt;/strong&gt; → &lt;strong&gt;Context&lt;/strong&gt; and connect your dbt project, LookML files, or markdown documentation. The AI will automatically index your models, descriptions, and relationships—then reference them by name when generating queries.&lt;/p&gt;

&lt;p&gt;Together, these create a knowledge base that makes the AI increasingly accurate and aligned with your business logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6 — Monitor and Track AI Analyst Quality
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwsddpmejxmmd6f5m9rh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwsddpmejxmmd6f5m9rh.png" alt="Monitoring AI Analyst" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything is tracked. Go to &lt;strong&gt;Monitoring&lt;/strong&gt; to see a complete audit trail of all AI interactions—every query, every clarification, every piece of feedback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Metrics to Track
&lt;/h3&gt;

&lt;p&gt;While the product exposes many detailed metrics, here are the high-level indicators you should monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context coverage&lt;/strong&gt; — How often the available context (instructions, enrichments) gives the agent enough confidence to answer a prompt without guessing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy (judge)&lt;/strong&gt; — Automated quality scores from AI judges evaluating correctness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Negative feedback&lt;/strong&gt; — User thumbs down signals that need investigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clarification rate&lt;/strong&gt; — How often the AI needs to ask for clarification&lt;/li&gt;
&lt;/ul&gt;
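&lt;p&gt;If you export the interaction log, these indicators reduce to simple aggregations. A sketch over an assumed per-run record shape (not Bag of words' real export format):&lt;/p&gt;

```python
# Each record is one agent run; the field names are assumptions for the sketch.
runs = [
    {"clarified": False, "judge_score": 0.9, "feedback": "up"},
    {"clarified": True,  "judge_score": 0.7, "feedback": None},
    {"clarified": False, "judge_score": 0.4, "feedback": "down"},
]

def rate(runs, predicate):
    """Fraction of runs matching the predicate."""
    return sum(1 for r in runs if predicate(r)) / len(runs)

clarification_rate = rate(runs, lambda r: r["clarified"])
negative_feedback = rate(runs, lambda r: r["feedback"] == "down")
judge_accuracy = sum(r["judge_score"] for r in runs) / len(runs)

print(f"clarification={clarification_rate:.0%} "
      f"negative={negative_feedback:.0%} judge={judge_accuracy:.2f}")
```

&lt;p&gt;Watching these numbers week over week tells you whether added context is actually moving the needle.&lt;/p&gt;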

&lt;p&gt;These metrics tell you whether your AI analyst is production-ready and where to focus your context-building efforts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In just a few minutes, you've deployed an open-source AI analyst on PostgreSQL, connected your database, asked natural language questions, added business context, and learned how to monitor and evaluate query quality.&lt;/p&gt;

&lt;p&gt;Unlike black-box AI SQL tools, &lt;strong&gt;Bag of words&lt;/strong&gt; gives you the tools to &lt;strong&gt;actually get to production&lt;/strong&gt;. Every query is traceable. Every decision is visible. You can see when your metrics show it's ready, and you control the context, the LLM, and the governance rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You've Built
&lt;/h3&gt;

&lt;p&gt;You now have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A natural language interface to your Postgres database&lt;/li&gt;
&lt;li&gt;Context-aware query generation using your business definitions&lt;/li&gt;
&lt;li&gt;Full observability into how the AI reasons and what SQL it generates&lt;/li&gt;
&lt;li&gt;A foundation for building dashboards, Slack bots, or embedded analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;From here, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Invite members and manage permissions&lt;/strong&gt; to set governance for data sources and reports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrate with Slack&lt;/strong&gt; to let your team ask questions directly in channels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build dashboards&lt;/strong&gt; by saving queries and pinning them to a shared view&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customize the LLM&lt;/strong&gt; (swap OpenAI for Anthropic, Gemini, or a self-hosted model)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy to production&lt;/strong&gt; using Docker Compose or Kubernetes in your own VPC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To learn more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.bagofwords.io" rel="noopener noreferrer"&gt;Full documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bagofwords/bagofwords" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI is becoming the interface for data—but only if it's &lt;strong&gt;trustworthy&lt;/strong&gt;. Now you have the tools to make it so.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>dataengineering</category>
      <category>data</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building Reliable AI Analysts: An Observability Framework for Text-to-SQL Systems</title>
      <dc:creator>Bag of words</dc:creator>
      <pubDate>Wed, 15 Oct 2025 15:23:15 +0000</pubDate>
      <link>https://forem.com/bagofwords/building-reliable-ai-analysts-an-observability-framework-for-text-to-sql-systems-25ln</link>
      <guid>https://forem.com/bagofwords/building-reliable-ai-analysts-an-observability-framework-for-text-to-sql-systems-25ln</guid>
      <description>&lt;p&gt;You're building a text-to-SQL system. The value prop is obvious: natural language over your data warehouse, instant answers for the business, no waiting on BI teams. The demo works. Your stakeholders love it.&lt;/p&gt;

&lt;p&gt;Then you put it in production and the accuracy problems start. "Active users" means three different things across teams. Joins look right but query the wrong grain. Fiscal quarters don't match calendar quarters. The SQL runs, the numbers look plausible, but they're quietly wrong. Trust erodes fast.&lt;/p&gt;

&lt;p&gt;This post covers the tactical pieces: where accuracy actually fails, the few metrics that matter for monitoring, and how to build a feedback loop that turns failures into improvements. These patterns come from building and shipping Bow (short for Bag of words), an open-source AI analyst, but they apply to any text-to-SQL system in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Accuracy Breaks
&lt;/h2&gt;

&lt;p&gt;Assume you have the agentic infrastructure right: &lt;a href="https://docs.bagofwords.com/core/agent" rel="noopener noreferrer"&gt;ReAct loops&lt;/a&gt; that reason and validate, retrieval systems, the ability to say "I don't know" when uncertain, and comprehensive context coverage.&lt;/p&gt;

&lt;p&gt;Even with all that, AI still fails in:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;
        &lt;strong&gt;Ambiguous metrics&lt;/strong&gt; — "Active users" means different things across teams. Product uses recency windows and user type filters. Marketing excludes test accounts. Finance counts paying customers only. The model queries &lt;code&gt;dim_active_users&lt;/code&gt; but misses the filters the business actually uses this quarter.
    &lt;/li&gt;
    &lt;li&gt;
        &lt;strong&gt;Schema traps&lt;/strong&gt; — Table and column names that seem obvious lead the model down wrong join paths: queries get built on the wrong grain or miss crucial filters. The SQL runs, the numbers look plausible, there's no error message—just quietly wrong results.
    &lt;/li&gt;
    &lt;li&gt;
        &lt;strong&gt;Code errors&lt;/strong&gt; — Syntax failures, permission boundaries, query timeouts. The model reaches for tables or patterns it doesn't understand. Small runtime errors compound into retried plans and inconsistent behavior.
    &lt;/li&gt;
    &lt;li&gt;
        &lt;strong&gt;User CSAT&lt;/strong&gt; — Low satisfaction scores, wrong answers flagged by users, eroded trust. When users continue iterating or reject answers, you've lost reliability.
    &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All these failures boil down to one root cause: &lt;strong&gt;prompt-context misalignment&lt;/strong&gt;. When the prompt falls within your encoded context, the model produces reliable results. When it falls outside, the model guesses—and guesses look plausible but are often wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Track
&lt;/h2&gt;

&lt;p&gt;You don't need a complex dashboard filled with vanity metrics. You need signals that tell you where context is missing and what to fix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6pi2ifjd4fh92th3j3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6pi2ifjd4fh92th3j3j.png" alt="AI Analyst Observability Dashboard" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Track these four metrics:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;
        &lt;strong&gt;Answer Quality&lt;/strong&gt;&lt;br&gt;
        &lt;em&gt;What:&lt;/em&gt; Is the answer correct and useful? Would you share it with a stakeholder?&lt;br&gt;
        &lt;em&gt;Detected by:&lt;/em&gt; LLM judges scoring context-prompt match and answer correctness. In practice, combine automated checks (SQL validity, result plausibility) with human review of a rotating sample. Start by labeling 10-20 queries per day across different question types.&lt;br&gt;
        &lt;em&gt;Catches:&lt;/em&gt; Ambiguous metrics, schema traps, business logic mismatches.
    &lt;/li&gt;
    &lt;li&gt;
        &lt;strong&gt;Context Effectiveness&lt;/strong&gt;&lt;br&gt;
        &lt;em&gt;What:&lt;/em&gt; Did the system retrieve and use the right instructions, schema, and metadata?&lt;br&gt;
        &lt;em&gt;Detected by:&lt;/em&gt; Semantic similarity between questions and your definitions, clarification request patterns, agent action traces. For data engineers: spikes in clarifications around specific tables signal documentation gaps.&lt;br&gt;
        &lt;em&gt;Catches:&lt;/em&gt; Missing metric definitions, incomplete documentation, context gaps by domain or table.
    &lt;/li&gt;
    &lt;li&gt;
        &lt;strong&gt;Code Errors&lt;/strong&gt;&lt;br&gt;
        &lt;em&gt;What:&lt;/em&gt; SQL execution failures that indicate the model reached for things it doesn't understand.&lt;br&gt;
        &lt;em&gt;Detected by:&lt;/em&gt; Syntax failures, permission issues, query timeouts. Track which tables and columns consistently trigger errors.&lt;br&gt;
        &lt;em&gt;Catches:&lt;/em&gt; Schema traps, wrong join paths, execution fragility.
    &lt;/li&gt;
    &lt;li&gt;
        &lt;strong&gt;User Feedback&lt;/strong&gt;&lt;br&gt;
        &lt;em&gt;What:&lt;/em&gt; Ground truth of what's actually broken in production.&lt;br&gt;
        &lt;em&gt;Detected by:&lt;/em&gt; Users flagging wrong answers, continued iteration, answer rejections.&lt;br&gt;
        &lt;em&gt;Catches:&lt;/em&gt; All failure modes in production, especially edge cases testing didn't cover.
    &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; Instrument these signals at the agent run level—every query, every user interaction. Store the full trace: what context was retrieved, what actions the agent took, what SQL was generated, what results came back, and how the user reacted.&lt;/p&gt;

&lt;p&gt;As patterns emerge, go deeper. Track negative feedback by the specific table or column that caused the issue. Measure which type of context is most effective (instructions vs. schema vs. dbt models). Analyze clarification clusters by domain. Score feedback from power users differently—they understand the data model and their signals are high-quality.&lt;/p&gt;

&lt;p&gt;Think of this as unit testing for AI outputs. You wouldn't ship code without tests—why ship answers without validation? The difference is that your tests evolve: what fails today becomes tomorrow's regression test, encoded as instructions that prevent the same failure from happening again.&lt;/p&gt;
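&lt;p&gt;The regression-test analogy can be made literal: freeze a past failure as a standing assertion over the generated SQL. Here &lt;code&gt;generate_sql&lt;/code&gt; is a hypothetical stand-in for your text-to-SQL pipeline:&lt;/p&gt;

```python
def check_active_users_regression(generate_sql):
    """Yesterday's failure as a test: 'active users' must apply the 30-day recency filter."""
    sql = generate_sql("How many active users did we have last week?").lower()
    assert "last_seen_at" in sql, "missing recency column"
    assert "30" in sql, "wrong or missing recency window"
    return True
```

&lt;p&gt;Run checks like this on a schedule or after every context change, exactly as you would a CI suite.&lt;/p&gt;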

&lt;h2&gt;
  
  
  How to Turn Observations Into Fixes
&lt;/h2&gt;

&lt;p&gt;Metrics without action are just numbers. The real value comes from closing the loop: using what you observe to systematically improve the system.&lt;/p&gt;

&lt;p&gt;Every failure points to missing context—a metric definition the model doesn't know, a join path it shouldn't take, or business logic that's not codified. The fix isn't rebuilding your model or restructuring your warehouse. It's encoding that missing context as an instruction the system can apply automatically next time a similar question appears.&lt;/p&gt;

&lt;p&gt;This approach complements your data modeling work—instructions handle business logic and edge cases without requiring schema changes or dbt rebuilds. Think of them as runtime metadata that sits alongside your warehouse, capturing the operational context that doesn't belong in table definitions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm26586k4215pxetezrh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm26586k4215pxetezrh0.png" alt="Error Classification and Remediation" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
    &lt;li&gt;
        &lt;strong&gt;Diagnose the root cause&lt;/strong&gt;&lt;br&gt;
        Start by looking at the different failure types: code errors, negative feedback, low-quality answers, and clarification requests. Then dig into the agent traces—the step-by-step reasoning and decision path the system followed. What action did it take: generate SQL, search data, or ask for clarification? If it generated SQL, where did it go wrong? Did it retrieve the wrong schema? Misinterpret a metric? Choose a bad join path? If it asked for clarification, what context or tool was missing? Understanding the "why" is critical before you write a fix.
    &lt;/li&gt;
    &lt;li&gt;
        &lt;strong&gt;Draft and test the instruction&lt;/strong&gt;&lt;br&gt;
        Write a scoped rule that addresses the root cause. Then test it: run through a cycle with recent prompts that failed and verify the instruction actually fixes them. This is your chance to catch edge cases before rolling anything out.
    &lt;/li&gt;
    &lt;li&gt;
        &lt;strong&gt;Review or approve&lt;/strong&gt;&lt;br&gt;
        Decide if the instruction needs human review or can be approved immediately. Treat it like a pull request—some changes are obviously safe (fixing a typo in a metric name), others need domain expertise (redefining "active users"). You wouldn't merge code without review—don't merge business logic without it either. Route accordingly.
    &lt;/li&gt;
    &lt;li&gt;
        &lt;strong&gt;Roll out and track&lt;/strong&gt;&lt;br&gt;
        Once approved, the instruction gets attached to the relevant domains, tables, or metrics. It automatically applies when similar prompts appear. Then track your metrics over time: did answer quality improve? Did code errors drop? Did negative feedback decrease?
    &lt;/li&gt;
    &lt;li&gt;
        &lt;strong&gt;Self-learning mode (highly recommended)&lt;/strong&gt;&lt;br&gt;
        For teams that want to move faster, enable AI auto-generation of instructions. When the system detects low-quality results or recurring errors, it can draft a proposed instruction automatically, test it against recent failed queries, and route it for approval. This works by prompting the model to analyze the failure pattern, propose a fix as a natural language instruction, and validate it against a test set. The human remains in the loop for approval, but the heavy lifting of diagnosis and drafting happens automatically. This dramatically shortens the feedback loop from days to minutes, though you'll want to start with human-in-the-loop mode until you trust the quality of auto-generated instructions.
    &lt;/li&gt;
&lt;/ol&gt;
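&lt;p&gt;The steps above can be sketched end to end. Every name here is an illustrative stub, not the product's API:&lt;/p&gt;

```python
def instruction_fixes_failures(instruction, failed_prompts, rerun):
    """Re-run previously failed prompts with the candidate instruction attached."""
    return all(rerun(prompt, instruction)["ok"] for prompt in failed_prompts)

def roll_out(instruction, failed_prompts, rerun, needs_review, approve):
    """Draft -> test -> review -> active, mirroring the steps above."""
    if not instruction_fixes_failures(instruction, failed_prompts, rerun):
        return "rejected"        # does not fix the observed failures
    if needs_review(instruction) and not approve(instruction):
        return "pending review"  # held for a human, like an unmerged PR
    return "active"              # applied to future matching prompts
```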

&lt;p&gt;This workflow is faster than traditional data modeling cycles, more transparent than black-box model tuning, and safer than letting the model improvise business definitions on the fly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Text-to-SQL will become the interface for data because AI can reason, explore, and surface insights that static dashboards never will. But moving from demo to dependable production requires structure: understanding where accuracy breaks, measuring what actually matters, and closing the loop by encoding failures as instructions with visible impact.&lt;/p&gt;

&lt;p&gt;The promise is real. The path to get there just requires more rigor than most demos let on.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Try it yourself&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
This observability framework is built into Bag of words, an open-source AI analyst designed for production use. Deploy it to your warehouse and start tracking these metrics today.&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;Documentation:&lt;/strong&gt; &lt;a href="https://docs.bagofwords.com" rel="noopener noreferrer"&gt;https://docs.bagofwords.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;
→ &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/bagofwords1/bagofwords" rel="noopener noreferrer"&gt;https://github.com/bagofwords1/bagofwords&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Building in the open—contributions and feedback welcome.&lt;/p&gt;

</description>
      <category>text2sql</category>
      <category>ai</category>
      <category>data</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Build a Product Usage Dashboard with Bag of words (Open Source)</title>
      <dc:creator>Bag of words</dc:creator>
      <pubDate>Mon, 03 Feb 2025 19:18:32 +0000</pubDate>
      <link>https://forem.com/bagofwords/build-a-product-usage-dashboard-with-bag-of-words-open-source-48ml</link>
      <guid>https://forem.com/bagofwords/build-a-product-usage-dashboard-with-bag-of-words-open-source-48ml</guid>
      <description>&lt;p&gt;In this guide, we’ll build a product usage dashboard using &lt;a href="https://github.com/bagofwords1/bagofwords" rel="noopener noreferrer"&gt;Bag of words&lt;/a&gt;, an open-source AI-powered data tool. It connects to your databases, APIs, and even unstructured data like PDFs, allowing you to create dashboards through simple prompts — no manual SQL needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl8hcrgiz90lgwotbakc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl8hcrgiz90lgwotbakc.png" alt="Image description" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://bagofwords.com" rel="noopener noreferrer"&gt;Bag of words&lt;/a&gt; is designed to let you generate reports, charts, and dashboards using natural language prompts. Best of all, it’s open-source and quick to set up using Docker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploy with Docker&lt;/strong&gt;&lt;br&gt;
By default, Bag of words uses SQLite, but you can configure PostgreSQL if needed.&lt;/p&gt;

&lt;p&gt;Run the following command to get started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--pull&lt;/span&gt; always &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:3000 bagofwords/bagofwords
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Connect Your Data &amp;amp; Configure LLM
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Add an LLM provider at &lt;a href="http://localhost:3000/settings/models" rel="noopener noreferrer"&gt;http://localhost:3000/settings/models&lt;/a&gt; to enable report generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Connect your data sources at &lt;a href="http://localhost:3000/integrations" rel="noopener noreferrer"&gt;http://localhost:3000/integrations&lt;/a&gt; and follow the setup instructions for each source.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Build your dashboard
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Understand Data Schema&lt;/strong&gt;&lt;br&gt;
Before diving into visualization, it’s crucial to understand your data schema: how your data is structured, how tables are connected, and what fields are available. Besides browsing tables in the UI, you can explore the schema via prompts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Show me the list of all tables and their relationships.”&lt;/li&gt;
&lt;li&gt;“What columns are available in the users table?”&lt;/li&gt;
&lt;li&gt;“How is the transactions table linked to the users table?”&lt;/li&gt;
&lt;li&gt;“List all fields related to user activity and session data.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This will give you a clear view of your dataset, helping you craft more precise queries and visualizations.&lt;/p&gt;
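&lt;p&gt;Under the hood, schema questions like these resolve to ordinary catalog queries. Here’s a minimal sketch using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module against a toy schema (the &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;transactions&lt;/code&gt; tables are illustrative, not part of Bag of words; a warehouse like PostgreSQL would use &lt;code&gt;information_schema&lt;/code&gt; instead):&lt;/p&gt;

```python
import sqlite3

# Toy schema standing in for your product database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        amount REAL
    );
""")

# "Show me the list of all tables" boils down to querying the catalog:
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['transactions', 'users']

# "What columns are available in the users table?"
columns = [r[1] for r in conn.execute("PRAGMA table_info(users)")]
print(columns)  # ['id', 'email']

# "How is the transactions table linked to the users table?"
# Foreign keys answer that; each row is (id, seq, table, from, to, ...).
fks = conn.execute("PRAGMA foreign_key_list(transactions)").fetchall()
print(fks[0][2], fks[0][3], fks[0][4])  # users user_id id
```

&lt;p&gt;The AI analyst runs this kind of introspection for you; knowing what it sees makes it easier to phrase precise prompts.&lt;/p&gt;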

&lt;p&gt;&lt;strong&gt;Generate Key Metrics with Prompts&lt;/strong&gt;&lt;br&gt;
Once you’re familiar with your schema, you can start generating key metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Show me daily active users over the past month in a line chart.”&lt;/li&gt;
&lt;li&gt;“Create a dashboard with total signups, feature usage, and churn rate.”&lt;/li&gt;
&lt;li&gt;“List the top 10 most used features by active users in a bar chart.”&lt;/li&gt;
&lt;li&gt;“Display the retention rate for users over the last six months.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also combine all of these into a single prompt.&lt;/p&gt;
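&lt;p&gt;To make the first prompt concrete, here is a sketch of the kind of SQL a “daily active users” request typically resolves to, run against a toy &lt;code&gt;events&lt;/code&gt; table with Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module (the table and column names are hypothetical, not Bag of words specifics):&lt;/p&gt;

```python
import sqlite3

# Toy events table: one row per user action, with the date it happened.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2025-01-01"), (2, "2025-01-01"), (1, "2025-01-01"),
     (1, "2025-01-02"), (3, "2025-01-02")],
)

# Daily active users: count distinct users per day.
dau = conn.execute("""
    SELECT event_date, COUNT(DISTINCT user_id) AS dau
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""").fetchall()
print(dau)  # [('2025-01-01', 2), ('2025-01-02', 2)]
```

&lt;p&gt;The advantage of prompting is that you don’t write this by hand, but being able to read the generated query helps you verify the chart is counting what you think it is.&lt;/p&gt;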

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ll4tpkshxn342nedec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ll4tpkshxn342nedec.png" alt="Image description" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Publish and Share
&lt;/h2&gt;

&lt;p&gt;Now that your data is ready, arrange the dashboard layout by dragging the elements into place. You can also ask the AI to do it for you.&lt;/p&gt;

&lt;p&gt;Once everything’s ready, click the Share button in the top right, configure the settings, and copy the shareable URL.&lt;/p&gt;

&lt;p&gt;More links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://bagofwords.com" rel="noopener noreferrer"&gt;https://bagofwords.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.bagofwords.com" rel="noopener noreferrer"&gt;https://docs.bagofwords.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bagofwords1/bagofwords" rel="noopener noreferrer"&gt;https://github.com/bagofwords1/bagofwords&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>data</category>
      <category>datascience</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
