Forem: Prince Raj

Part 3: Turning Text Into Numbers - Bag of Words, Keywords, and Embeddings Without the Magic

Prince Raj — Mon, 27 Apr 2026 06:14:17 +0000

The question every beginner eventually asks

At some point in every AI project, you run into the same confusing idea:

How does a sentence become something a model can actually use?

Humans read:

"refund nahi mila yet"

A model does not "read" that in the human sense. A model receives numbers. So we need a bridge between human language and machine math. That bridge in this project is the feature extraction pipeline.

Start with the least magical explanation

Here is the simple version:

Text
  ->
Normalize it
  ->
Split it into tokens
  ->
Create several kinds of numeric clues
  ->
Join those clues into one vector

That is it. The interesting part is what kinds of clues we create.

Step 1: Normalize the text

Before counting anything, we reduce noise.

Examples:

Refund Nahi Mila!!! -> refund not received
my email is abc@x.com -> my email is <email>
call me at 12345 -> call me at <num>

Why do this?

Because models are sensitive to surface variation.

If one user writes PLZ refund and another writes please refund. We usually want the system to treat those as the same idea.

This project normalizes:

Lowercase
URLs
Emails
Numbers
Hinglish shortcuts
Punctuation and spacing

Plain-English version:

We clean away formatting noise so the model can focus on meaning.

Step 2: Tokenize the text

After normalization, we split the text into tokens. For this project, tokenization is intentionally simple:

whitespace split

Example:

"payment failed and money got deducted"

becomes:

["payment", "failed", "and", "money", "got", "deducted"]

This is not the fanciest tokenizer in the world. That is okay. For a narrow support-ticket domain, simple tokenization can work surprisingly well.

This is a good beginner lesson:

"simple" is not the same as "bad"

The project uses a hybrid feature vector

This is where the system gets interesting. Instead of relying on only one representation of text, it builds three:

Bag-of-words
Keyword flags
Averaged embeddings

Then it concatenates them into one big feature vector. Why hybrid? Because each representation has different strengths.

Feature type 1: Bag-of-words

Bag-of-words is one of the oldest and simplest ideas in NLP. The name sounds odd, but the concept is easy:

Keep a vocabulary of important words and count how many times each one appears.

If the vocabulary contains:

refund
payment
error
pricing

and the ticket is:

"refund for duplicate payment"

then the bag-of-words vector might look like:

[1, 1, 0, 0]

Plain-English version:

It is a checklist of words that showed up.

This project also applies log1p to those counts. Why?

Because raw counts can get too large. log1p compresses the scale so repeated words still matter, but not too aggressively.

You can think of it like this:

Seeing a word 3 times is more important than seeing it once, but not 3x more important.

Feature type 2: Keyword flags

Keyword flags are even simpler. For each important phrase, we ask:

Is this present or not?

Examples of keywords in the project:

refund
cancel
not working
pricing
demo
api
refund chahiye

If a keyword appears, its flag becomes 1. Otherwise, it stays 0.

Why keep keyword flags if we already have bag-of-words? Because business signals often deserve direct emphasis.

For example:

refund
close account
pricing

are not just words. They are operationally meaningful patterns.

Plain-English version:

Keyword flags are the model’s "red flag" and "green flag" indicators.

Feature type 3: Token embeddings

This is the part that usually sounds magical, but it can be explained simply. An embedding is just a learned vector for a token.

Instead of saying:

the word refund is token 127

we say:

the word refund also has a learned numeric representation that captures patterns from training

In this project:

each token gets an ID
the ID looks up a small embedding vector
the ticket’s embedding vectors are averaged

So if a sentence has tokens:

["refund", "money", "not", "received"]

the model looks up four vectors and averages them.

Plain-English version:

Bag-of-words tells us what words were present.
Embeddings help the model learn what kinds of words behave similarly.

Technical term:

This is embedding lookup with average pooling.

Why average the embeddings?

Because this is a tiny model. We are not building a giant sequence model with attention. We are building something cheap, fast, and deployable in pure Go.

Averaging embeddings gives us:

low cost
low complexity
useful semantic signal

without needing a much bigger model.

This is a recurring theme in the whole project:

choose the smallest method that solves the problem well enough

The final feature vector

After building all three parts, we join them together:

[ bag_of_words | keyword_flags | pooled_embedding ]

That becomes the input to the neural network.

Plain-English version:

We combine direct word evidence, domain-specific business hints, and learned semantic context into one numeric summary.

Why not just use one representation?

This is worth pausing on.

If we only used bag-of-words:

we would miss softer semantic patterns

If we only used embeddings:

we might weaken explicit domain phrases like refund or pricing

If we only used keyword flags:

the model would be too brittle and too hand-written

The hybrid setup works because the three feature families complement each other.

That is a very useful design pattern in applied AI:

let simple signals and learned signals work together

A backend analogy

If you come from backend engineering, think of the feature vector like a request context object.

It contains:

raw facts
derived facts
domain-specific hints

No single field tells the whole story. But together, they make downstream decision logic much stronger.

Why the Go inference engine had to mirror this exactly

This part matters more than many beginners expect.

The production Go service cannot do "approximately the same preprocessing."
It has to do the same preprocessing.

If training used:

Hinglish normalization
log1p bag-of-words
<unk> token fallback
max token truncation

then inference has to do those too.

Otherwise you get a mismatch:

the model learned one world, but production serves another

That is why the exported artifact includes preprocessing metadata, vocabularies, keywords, and embedding info.

If you only remember one thing from this article

Remember this:

AI models do not work on "text."
They work on representations of text.

The quality of that representation often matters as much as the model itself.

What comes next

In Part 4, we will take this feature vector and pass it into the actual neural network.

That is where we will cover:

dense layers
ReLU
shared base + multiple heads
loss functions
class weights
validation metrics
early stopping

In plain language first, of course.

Part 2: The Dataset - Labels, Heuristics, Synthetic Data, and Why AI Starts Before the Model

Prince Raj — Fri, 17 Apr 2026 10:34:50 +0000

Before we begin, if you have come directly to this post (Part 2 of 6), here is Part 1 where I explain the basics and set the expectations from this series.

Prince Raj

Apr 16

Part 1: What We Built - A Tiny AI System for Support Ticket Classification

#ai #go #backend #machinelearning

Comments 1

5 min read

The part most people skip

When many developers first approach AI, they jump straight to the model.

They ask:

Which neural network should I use?
Should I use transformers?
How many layers should I add?

Those are fair questions, but not the first questions.

For this project, the first real job was:

define what the model is supposed to mean

That sounds obvious, but it is the foundation of everything else.

If your labels are vague, inconsistent, or impossible to infer from text, the model will struggle no matter how fancy the architecture is.

The five things this model predicts

This classifier does not output one label. It outputs five:

department
sentiment
lead_intent
churn_risk
intent

That means every training example needs a shape like this:

{
  "text": "refund nahi mila yet",
  "department": "billing",
  "sentiment": "negative",
  "lead_intent": "low",
  "churn_risk": "high",
  "intent": "refund"
}

This is the canonical schema of the training set.

Plain-English version:

Every ticket must be translated into one consistent answer sheet.

Why schema design matters

Imagine you have data from three places:

a banking support dataset
a sentiment dataset
a general intent dataset

None of them naturally match your product.

One dataset may have label=payment_issue. Another may only know positive vs negative sentiment. Yet another may say nothing about churn risk at all. So the job is not only "load data."

The job is:

convert different sources into one shared language

That is what the dataset pipeline in this project does.

The label strategy

Let’s go through each output the way a backend engineer would.

1. Department

This is a routing problem. The question is:

Which team should probably handle this?

Examples:

refund -> billing
password reset -> technical
tracking issue -> logistics
pricing request -> sales

This label is operational.
It exists to move work to the right queue.

2. Sentiment

This measures emotional tone:

positive
neutral
negative

This is not the same as intent.

A pricing question can be neutral.
A refund request can be negative.
A thank-you note can be positive.

This label helps downstream prioritization and messaging.

3. Lead intent

This is where business context starts to matter.

The question is:

Does this message look like a buying opportunity?

Examples:

demo request -> high
pricing inquiry -> high
feature request -> medium
complaint -> low

This label is not just language understanding. It is business interpretation.

That matters later, because it is one reason small custom models can beat general-purpose LLMs on narrow tasks.

4. Churn risk

This estimates whether the customer may leave.

Examples:

cancellation request -> high
repeated refund frustration -> high
neutral tracking question -> low

Again, this is partly semantic and partly business logic.

5. Intent

This is the most specific task.

Examples:

refund
cancellation
delivery_issue
pricing_inquiry
technical_issue

Turning messy data into this schema

The training pipeline pulls data from multiple sources:

Hugging Face datasets like banking77
Sentiment data like tweet_eval/sentiment
Intent datasets like clinc_oos
Local JSONL files
Synthetic examples
Manual correction data

But raw source labels do not line up nicely with our five-task schema. So we normalize them.

Technical term:

This is schema normalization.

Plain-English version:

We take many different spreadsheets and convert them into one house format.

Where heuristics come in

Here is an important beginner lesson:

Not every training label has to come from a human manually writing every field.

Sometimes a dataset gives you only one known label. You can infer the others using domain rules.

For example:

if intent is refund, department is probably billing
if intent is pricing_inquiry, lead intent is probably high
if intent is complaint, sentiment is probably negative
if intent is cancellation, churn risk is probably high

That is exactly what this project does.

In plain language:

When we know one strong clue, we can responsibly fill in related labels.

This is not perfect.
But it is often very useful when building a practical system from mixed data sources.

Why synthetic data was necessary

This is one of my favorite parts of the project, because it is very relatable for backend engineers.

Real support data is usually messy in two ways:

it is incomplete
it is uneven

Maybe you have lots of billing messages but not many sales leads.
Maybe you have clean English examples but not Hinglish.
Maybe you do not have enough high-churn refund tickets.

So the pipeline generates synthetic tickets using templates.

Examples of synthetic patterns:

"I want a refund for my subscription"
"Refund nahi mila for my order"
"Can I get a demo for my team?"
"Payment failed but money got deducted"

Then it adds style noise:

typos
shorthand
uppercase
casual phrasing
Hinglish variants

Plain-English version:

We manufacture extra training examples for situations we care about but do not have enough of.

Technical term:

This is synthetic data generation or data augmentation.

Why Hinglish normalization matters

A lot of AI tutorials quietly assume clean English input. Real production systems do not get that luxury.

Users write things like:

refund chahiye
paisa mila nahi
app kharab hai
jaldi fix karo

If you ignore that kind of variation, your model will feel fragile in production.

So this project includes simple but valuable normalization rules that map common Hinglish words to normalized English equivalents:

nahi -> not
paisa -> money
kharab -> broken
chahiye -> want

This is not "full multilingual AI."
It is something more practical:

targeted robustness for the language patterns your users actually type

The corrections loop is the most production-friendly part

This project also supports a corrections.jsonl file.

That means once the model is live, you can capture corrected labels and feed them back into training.

The workflow looks like this:

Model makes a prediction in production
Human or system corrects bad labels
Corrected example gets appended to corrections.jsonl
Next training run boosts those corrections

I love this because it feels very familiar to backend teams. It is not mystical. It is a feedback loop.

You ship.
You observe.
You correct.
You retrain.

That is how production systems grow up.

Training and validation split

After collecting all examples, the pipeline splits them into:

Training data
Validation data

Why do we need validation?

Because if we only measure performance on the same examples the model learned from, the scores can be misleading.

Plain-English version:

Training data is the study material. Validation data is the exam.

The project also tries to stratify by intent when splitting.

That means it attempts to preserve label balance, so the validation set does not accidentally miss important classes.

A simple but important truth

At this point, we still have not talked about embeddings, dense layers, or PyTorch math.

And that is the point.

The AI project already contains a lot of engineering value before the neural network starts training:

Schema design
Label definitions
Heuristics
Dataset normalization
Synthetic example generation
Production corrections
Validation setup

This is why I keep telling backend engineers:

You already have a lot of the mindset needed for AI systems.

Good AI pipelines reward the same habits as good backend systems:

Consistent contracts
Thoughtful data modeling
Clear assumptions
Measurable feedback loops

If you can only remember one thing from this article

Remember this:

Training data is not "whatever text you found."
Training data is a product design decision.

You are deciding:

What the model should notice
What tradeoffs it should care about
What your labels really mean in the business

That is the real beginning of AI work.

What comes next

In Part 3, we will finally answer the question that makes many people feel like AI is magic:

How does text become numbers?

I will explain:

Bag-of-words
Keyword flags
Token IDs
Embeddings
Why this project combines all of them

And I’ll do it in plain language first, then connect each idea to the proper technical terms.

Disclosure: AI was used to frame the article.

Part 1: What We Built - A Tiny AI System for Support Ticket Classification

Prince Raj — Thu, 16 Apr 2026 14:07:48 +0000

Why this series exists

If you are a backend engineer, you already know how to build reliable systems.

You know how requests flow through services.
You know how data gets cleaned before it is useful.
You know how APIs hide complicated internals behind simple contracts.

AI systems are not as magical as they look from the outside.

They are still systems.
They still have inputs, processing stages, outputs, tradeoffs, and production constraints.

In this series, I am going to break down a real project I built:

a tiny support ticket classifier
trained in Python
exported to JSON
served in pure Go
fast enough to run in a few milliseconds on CPU

This is not a "train a giant LLM on a cluster" story.

This is a practical story for backend engineers who want to understand how AI products are actually assembled.

Github Repos:

Inference: Built in pure Go
Trainig on dataset: Python service

What the model does

The input is simple:

one raw text support ticket

The output is richer than a single label. The model predicts five things at once:

department
sentiment
lead_intent
churn_risk
intent

So for one ticket like:

"I was charged twice and need a refund"

the system can produce something like:

department: billing
sentiment: negative
lead_intent: low
churn_risk: high
intent: refund

That makes it a multi-task classifier.

Plain-English version:

We built one small brain that answers five related questions about the same ticket.

The full system in layman terms

Before we get technical, here is the project in everyday language.

Raw ticket
   ->
Clean the text
   ->
Pull out useful clues
   ->
Turn those clues into numbers
   ->
Pass the numbers through a tiny neural network
   ->
Get 5 answers
   ->
Package the result for production use

Now let me expand each block.

Block 1: Raw ticket

This is the message a user writes.

Examples:

"refund nahi mila yet"
"pricing for enterprise plan?"
"app is not working after reset"

At this stage, the text is messy.
People type casually.
They use typos.
They mix Hindi and English.
They write with emotion.

Block 2: Clean the text

The model cannot reason about raw text the way a human does.
So first we normalize it.

That means things like:

convert to lowercase
replace URLs with <url>
replace emails with <email>
replace numbers with <num>
normalize Hinglish words like nahi to not and paisa to money

Plain-English version:

We reduce unnecessary variation so the model sees the same idea in a more consistent form.

Technical term:

This is text preprocessing or normalization.

Block 3: Pull out useful clues

After cleaning the text, we extract signals.

We do not rely on only one trick.
We use a hybrid set of features:

bag-of-words counts
keyword flags
token embeddings

Why three kinds?

Because each one catches something different.

Bag-of-words helps with direct vocabulary signals.
Keyword flags help with business-important phrases like refund, cancel, or not working.
Embeddings help the model capture softer meaning patterns.

Plain-English version:

Instead of asking the model to "just understand everything," we hand it several different kinds of clues.

Block 4: Turn those clues into numbers

Neural networks do not consume text directly.
They consume arrays of numbers.

So the cleaned ticket becomes:

one numeric vector for word counts
one numeric vector for keyword flags
one numeric vector from averaged token embeddings

Then we combine them into one final feature vector.

Technical term:

This is feature engineering plus vectorization.

Block 5: Pass the numbers through a tiny neural network

This project uses a very small network.

At a high level:

Feature vector
   ->
Dense layer
   ->
ReLU
   ->
Dense layer
   ->
ReLU
   ->
5 output heads

Each output head is responsible for one prediction task.

Why this shape?

Because the tasks are related.

For example:

refund often points to billing
angry language can affect sentiment and churn_risk
pricing questions often affect department and lead_intent

So the model first learns a shared internal representation and then each task gets its own small output layer.

Technical term:

This is a shared-base multi-head neural network.

Block 6: Get 5 answers

Each head produces a score.
Then the system converts scores into labels:

softmax for multi-class outputs like department or intent
sigmoid for binary output like churn risk

Plain-English version:

The model does not directly shout "billing." It first scores all options, then picks the most likely one.

Block 7: Package the result for production

Training happens in Python.
Production inference happens in Go.

That design was intentional.

Why?

Because I wanted:

easy training with PyTorch
a simple export format
a lightweight production runtime
low latency
low memory usage
no external ML runtime in production

So the trained model is exported into JSON, and the Go service loads that artifact and runs the forward pass manually.

Plain-English version:

Python is the workshop. Go is the factory floor.

Mapping the real project to these blocks

Here is how the actual project maps to the conceptual flow.

Training side

preprocess.py Cleans text, normalizes Hinglish, replaces URLs/emails/numbers, injects style noise for synthetic data
features.py Builds bag-of-words vocab, embedding vocab, keywords, and encodes text
datasets.py Loads Hugging Face datasets, local JSONL files, corrections, and normalizes everything into one schema
synth.py Generates synthetic support-ticket examples to improve domain coverage
model.py Defines the tiny hybrid neural network
train.py Handles training, validation, class weights, metrics, early stopping, and artifact creation
export.py Writes the trained model to JSON for production

Inference side

features/ Rebuilds the same preprocessing and feature extraction logic in Go
model/ Loads exported JSON and defines dense layers, embeddings, softmax, sigmoid, and validation
quantization/ Supports int8 inference for smaller and faster runtime
inference/ Orchestrates prediction and creates the final result object
benchmark/ Compares the local model against hosted models like GPT-5-mini

Why I did not use an LLM for everything

This question comes up a lot now.

Why build a tiny model at all when an LLM can classify text?

Because production engineering is about fit, not hype.

For this use case, the tiny model has real advantages:

much lower latency
much lower cost
predictable output shape
simpler deployment
easier control over labels
easier offline benchmarking

LLMs are great when you need open-ended reasoning or generation.

But if you need:

narrow labels
stable routing
predictable performance
cheap per-request inference

then a smaller custom model can be the better tool.

The main lesson from this project

The most important idea I want you to take from Part 1 is this:

An AI application is not "a model."
It is a pipeline.

The model matters, yes.
But so do:

your labels
your preprocessing
your synthetic data
your export format
your inference runtime
your benchmark setup

If you only focus on the neural network block, you miss most of the engineering work.

What is coming next

In Part 2, I will go one level deeper into the most underrated part of AI work:

the dataset and label design

That is where this project really starts.
Not in PyTorch.
Not in matrix multiplication.
Not in fancy model architecture.

It starts with deciding:

what we want the system to predict
what "good" labels even mean
how to combine real data, heuristics, and synthetic examples into one usable training set

If you can understand that piece, the rest of the system becomes much easier to follow.

Disclosure: AI was used to frame the article.