<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dev J. Shah 🥑</title>
    <description>The latest articles on Forem by Dev J. Shah 🥑 (@busycaesar).</description>
    <link>https://forem.com/busycaesar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1017716%2F0fdaf54f-6428-41fc-86b1-412236c6a8fe.png</url>
      <title>Forem: Dev J. Shah 🥑</title>
      <link>https://forem.com/busycaesar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/busycaesar"/>
    <language>en</language>
    <item>
      <title>Cracking the Databricks Generative AI Engineer Certification</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Mon, 23 Mar 2026 13:00:00 +0000</pubDate>
      <link>https://forem.com/busycaesar/cracking-the-databricks-generative-ai-engineer-certification-10ga</link>
      <guid>https://forem.com/busycaesar/cracking-the-databricks-generative-ai-engineer-certification-10ga</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;To set expectations, this guide introduces the exam: what it is, the topics it covers, and, if you are interested, some resources to help you prepare for it.&lt;/p&gt;

&lt;p&gt;The Databricks &lt;strong&gt;Generative AI Engineer Associate&lt;/strong&gt; is a certification that lets candidates demonstrate their ability to design and build AI-powered solutions on the Databricks platform. Here is the link to the Databricks page to learn more about it:&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://www.databricks.com/learn/certification/genai-engineer-associate" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.databricks.com%2Fsites%2Fdefault%2Ffiles%2F2022%2F04%2Fdatabricks-certification-and-badging-og.png" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://www.databricks.com/learn/certification/genai-engineer-associate" rel="noopener noreferrer" class="c-link"&gt;
            Databricks Certified Generative AI Engineer Associate | Databricks
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Become a Databricks Certified Generative AI Engineer Associate! Prove your skills in designing and implementing LLM-enabled solutions using Databricks' cutting-edge tools.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.databricks.com%2Fen-website-assets%2Ffavicon-32x32.png%3Fv%3Dc9b9916c3b27dc51866c46b79a6e9b88"&gt;
          databricks.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Here is my certificate, for credibility; feel free to check my score.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://credentials.databricks.com/4828045b-21ec-4f68-92bd-d9a3f02cc612" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcredentials.databricks.com%2Fassets%2Fimages%2Faccredible_logo.png" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://credentials.databricks.com/4828045b-21ec-4f68-92bd-d9a3f02cc612" rel="noopener noreferrer" class="c-link"&gt;
            Accredible • Certificates, Badges and Blockchain
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Home of digital credentials
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcredentials.databricks.com%2F"&gt;
          credentials.databricks.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  Topics Covered
&lt;/h2&gt;

&lt;p&gt;This exam covers six sections, each focusing on a different aspect of building and managing generative AI solutions on Databricks. Below is a brief overview of each section along with its weighting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design Applications – 14%
&lt;/h3&gt;

&lt;p&gt;As the name suggests, this section focuses on designing various aspects of AI solutions: crafting the right prompt, choosing appropriate models and components, translating business requirements into desired inputs and outputs, and deciding which tools to give the model access to.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Preparation – 14%
&lt;/h3&gt;

&lt;p&gt;The first step in creating an AI solution is often preparing the data in a way that helps leverage the model to its full potential. This section includes deciding on chunking strategies, filtering unnecessary data, and choosing the appropriate Python package for extracting data.&lt;/p&gt;
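&lt;p&gt;To make the idea concrete, here is a minimal sketch of one common chunking strategy (fixed-size chunks with overlap). The sizes are illustrative, not values prescribed by the exam:&lt;/p&gt;

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character-based chunks."""
    step = chunk_size - overlap
    assert step > 0, "overlap must be smaller than chunk_size"
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# 450 characters with chunk_size=200 and overlap=50 yields 3 chunks.
pieces = chunk_text("a" * 450, chunk_size=200, overlap=50)
print(len(pieces))
```

&lt;p&gt;Overlap preserves context that would otherwise be cut at chunk boundaries; the right sizes depend on your documents and your embedding model's input limits.&lt;/p&gt;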

&lt;h3&gt;
  
  
  Application Development – 30%
&lt;/h3&gt;

&lt;p&gt;This is the most interesting part. It includes creating tools to extract required data, selecting an orchestration framework like LangChain to create chains of prompt templates and models, crafting prompts by augmenting them with retrieved data, implementing guardrails, reducing hallucinations, and selecting the right models and embedding strategies for your use case.&lt;/p&gt;
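&lt;p&gt;As a sketch of the augmentation step, here is how retrieved chunks can be injected into a prompt template in plain Python. The template wording is illustrative; frameworks like LangChain wrap the same idea in prompt-template objects:&lt;/p&gt;

```python
# Retrieved chunks are formatted into the prompt before it is sent to the model.
RAG_TEMPLATE = """Answer the question using only the context below.
If the context is not sufficient, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, retrieved_chunks):
    # Separate chunks so the model can tell where one document ends.
    context = "\n---\n".join(retrieved_chunks)
    return RAG_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "What is Unity Catalog?",
    ["Unity Catalog is Databricks' governance layer.",
     "Models can be registered to Unity Catalog via MLflow."],
)
print(prompt)
```

&lt;p&gt;The instruction to answer only from the supplied context is itself a simple hallucination-reduction technique that the exam expects you to recognize.&lt;/p&gt;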

&lt;h3&gt;
  
  
  Assembling and Deploying Apps – 22%
&lt;/h3&gt;

&lt;p&gt;Once your application is built, you need to get it out into the world. This section covers deploying model serving endpoints, registering models to Unity Catalog via MLflow, setting up Vector Search indexes, and understanding the end-to-end deployment process for a RAG (Retrieval-Augmented Generation) application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Governance – 8%
&lt;/h3&gt;

&lt;p&gt;This covers how to protect your application and your data. Think guardrails against malicious inputs, data masking, and staying on the right side of legal and licensing requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluation and Monitoring – 12%
&lt;/h3&gt;

&lt;p&gt;Building it is one thing; making sure it keeps working well is another. This section tests your ability to evaluate model performance, monitor live deployments, control costs, and understand the difference between the evaluation and monitoring phases of the AI application lifecycle.&lt;/p&gt;
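&lt;p&gt;To illustrate the evaluation side, here is a toy metric that scores a generated answer against expected keywords. Real evaluation uses richer techniques (for example, LLM-as-a-judge), but the idea of scoring outputs against references is the same:&lt;/p&gt;

```python
def keyword_recall(answer, expected_keywords):
    """Fraction of expected keywords that appear in the answer."""
    answer_lower = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer_lower)
    return hits / len(expected_keywords)

score = keyword_recall(
    "Vector Search serves embeddings for RAG retrieval.",
    ["vector search", "embeddings", "retrieval"],
)
print(score)  # 1.0
```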




&lt;h2&gt;
  
  
  Exam Overview
&lt;/h2&gt;

&lt;p&gt;The exam consists of 45 scored multiple-choice and multiple-selection questions, though you may encounter a few additional unscored questions. Extra time is factored in to account for these. You will have 90 minutes to complete the exam.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;There are no formal prerequisites required to sit for this exam, which makes it accessible to anyone looking to get started with generative AI on Databricks. That said, Databricks recommends coming in with some solid preparation.&lt;/p&gt;

&lt;p&gt;Beyond the coursework, you will want to have a good working knowledge of current LLMs and their capabilities, prompt engineering, and tools like LangChain and Hugging Face Transformers. On the technical side, comfort with Python and its relevant libraries is important, as is familiarity with APIs used for data preparation and model chaining.&lt;/p&gt;




&lt;h2&gt;
  
  
  Study Plan, Resources, and Practice Exams
&lt;/h2&gt;

&lt;p&gt;I used Databricks Academy's &lt;strong&gt;Generative AI Engineering Pathway&lt;/strong&gt; to prepare for the certification. It consists of six courses with a total duration of 10 hours and 35 minutes; I took almost a week (including a long weekend) to complete it. Each course includes a tutorial walkthrough of the topics covered, and completing a course earns you a shareable Databricks badge. Since I had worked with these AI topics before, I did not follow the tutorials hands-on, but depending on your experience level, you may want to. A few days after completing the pathway, you will receive a coupon code via email that you can use to book the exam for free.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://partner-academy.databricks.com/learn/learning-plans/315/generative-ai-engineering-pathway" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;partner-academy.databricks.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;After completing the pathway, it was time to take some practice exams. Since this is a fairly new certification, there are limited resources available online. After some research, I decided to buy practice questions from &lt;strong&gt;SkillCertPro&lt;/strong&gt;. This package contains around 670 questions across 12 mock exams. There were questions in the actual exam that overlapped with these practice tests, so I highly recommend getting this one. &lt;em&gt;(Not sponsored!)&lt;/em&gt;&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://skillcertpro.com/product/databricks-generative-ai-engineer-associate-practice-questions/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fskillcertpro.com%2Fwp-content%2Fuploads%2F2025%2F04%2FDatabricks-Generative-AI-Engineer-Associate-Exam-Questions-2025.jpg" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://skillcertpro.com/product/databricks-generative-ai-engineer-associate-practice-questions/" rel="noopener noreferrer" class="c-link"&gt;
            Databricks Generative AI Engineer Associate Practice Questions 2026
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Databricks Generative AI Engineer Associate Exam Questions 2026. Prepare and pass in first attempt. with valid dumps. 100% real exam questions
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fskillcertpro.com%2Fwp-content%2Fuploads%2F2020%2F06%2Fwp-1592240161910-64x64.png"&gt;
          skillcertpro.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;After completing 1–2 practice exams, I booked my (as always, in-person) exam for the following week. During that week, I worked through one practice exam per day, reviewed the explanations for wrong answers, and used Claude to dig deeper into concepts I wasn't sure about.&lt;/p&gt;

&lt;p&gt;I also came across a couple of blogs that really helped with preparation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@chandadipendu/databricks-generative-ai-engineer-associate-certification-study-guide-part-1-70cf3c483085" rel="noopener noreferrer"&gt;https://medium.com/@chandadipendu/databricks-generative-ai-engineer-associate-certification-study-guide-part-1-70cf3c483085&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/pulse/preparation-guide-databricks-generative-ai-engineer-associate-mane-0toxf" rel="noopener noreferrer"&gt;https://www.linkedin.com/pulse/preparation-guide-databricks-generative-ai-engineer-associate-mane-0toxf&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond these, depending on your weak domains, Databricks also publishes topic-specific blogs that can help deepen your understanding.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Score
&lt;/h2&gt;

&lt;p&gt;Design Applications: 80%&lt;br&gt;
Data Preparation: 57%&lt;br&gt;
Application Development: 71%&lt;br&gt;
Assembling and Deploying Apps: 100%&lt;br&gt;
Governance: 100%&lt;br&gt;
Evaluation and Monitoring: 80%&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To wrap up, most of the exam questions were not just Databricks-specific, but covered real-world Generative AI engineering concepts. So while preparing, make sure you also understand the core concepts of GenAI beyond the Databricks platform. This will help you build transferable skills that go beyond the exam itself.&lt;/p&gt;

&lt;p&gt;Best of luck! Let me know if you have any further questions, and don't forget to let me know once you pass.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>databricks</category>
      <category>rag</category>
    </item>
    <item>
      <title>Hands-on: Azure AI Search &amp; AI Foundry for RAG</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Sun, 08 Mar 2026 12:37:13 +0000</pubDate>
      <link>https://forem.com/busycaesar/hands-on-azure-ai-search-ai-foundry-for-rag-3i9g</link>
      <guid>https://forem.com/busycaesar/hands-on-azure-ai-search-ai-foundry-for-rag-3i9g</guid>
      <description>&lt;h2&gt;
  
  
  Index
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;
Azure Resources

&lt;ul&gt;
&lt;li&gt;Azure AI Search&lt;/li&gt;
&lt;li&gt;Azure AI Foundry&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Code&lt;/li&gt;
&lt;li&gt;Do not forget to Clean the Cloud&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this lab, we will build a full RAG pipeline using Azure. RAG is a technique where, instead of relying solely on a language model's training data, we first retrieve relevant documents from an external knowledge base and then pass them to the model to generate a more accurate and grounded answer.&lt;/p&gt;
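&lt;p&gt;The retrieve-then-generate flow can be sketched like this; &lt;code&gt;embed&lt;/code&gt;, &lt;code&gt;search&lt;/code&gt;, and &lt;code&gt;generate&lt;/code&gt; are local stand-ins for the Azure calls we wire up later in this lab:&lt;/p&gt;

```python
# Toy corpus standing in for the external knowledge base.
DOCS = {
    "doc1": "Azure AI Search can act as a vector database.",
    "doc2": "Azure AI Foundry deploys embedding and generation models.",
}

def embed(text):
    # Stand-in embedding: a bag of lowercase words (a real model returns floats).
    return set(text.lower().split())

def search(query_vec, k=1):
    # Rank documents by word overlap with the query
    # (stand-in for cosine similarity over real embeddings).
    scored = sorted(
        DOCS.items(),
        key=lambda item: len(query_vec.intersection(embed(item[1]))),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def generate(question, context):
    # Stand-in for the generation model call.
    return f"Based on: {context[0]}"

question = "Which service is the vector database?"
answer = generate(question, search(embed(question)))
print(answer)
```

&lt;p&gt;In the real pipeline, only the three stand-in functions change: embedding and generation become calls to the models deployed in Azure AI Foundry, and retrieval becomes a query against Azure AI Search.&lt;/p&gt;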

&lt;p&gt;To do this, we will use two Azure services: Azure AI Search as the vector database to store and retrieve document embeddings, and Azure AI Foundry to deploy the embedding model and the generation model.&lt;br&gt;
By the end of this lab, you will have a working RAG pipeline running on Azure.&lt;/p&gt;
&lt;h2&gt;
  
  
  Azure Resources
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Azure AI Search
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Azure AI Search&lt;/strong&gt; is a cloud search service that supports full-text search, filters, and vector search. In this lab, we are using it as a vector database. We store document embeddings in it and query them using cosine similarity to find the most relevant documents for a given input.&lt;/p&gt;
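&lt;p&gt;Cosine similarity measures the angle between two embedding vectors: the dot product divided by the product of the vector lengths. In code:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (norm(a) * norm(b)) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

&lt;p&gt;Azure AI Search computes this ranking for us at query time; this snippet just shows what "most relevant" means mathematically.&lt;/p&gt;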

&lt;p&gt;To set it up, we need to provision the resource and get two values: &lt;code&gt;VECTOR_SEARCH_ENDPOINT&lt;/code&gt; and &lt;code&gt;VECTOR_SEARCH_KEY&lt;/code&gt;, which will be used as environment variables.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the &lt;a href="https://portal.azure.com" rel="noopener noreferrer"&gt;Azure Portal&lt;/a&gt; and open &lt;code&gt;AI Search&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Fe5e9123f-1e41-4280-a43a-ca9818297f69" class="article-body-image-wrapper"&gt;&lt;img width="1168" alt="Screenshot From 2026-03-07 11-48-30" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Fe5e9123f-1e41-4280-a43a-ca9818297f69" height="902"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;code&gt;Create&lt;/code&gt; to create a new search service.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F2eb36210-035b-4d0d-98c1-379250b44011" class="article-body-image-wrapper"&gt;&lt;img width="1446" alt="Screenshot From 2026-03-07 11-49-39" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F2eb36210-035b-4d0d-98c1-379250b44011" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new resource group or use an existing one. (Suggestion: create a new one so it's easy to delete the resources later.)&lt;/li&gt;
&lt;li&gt;Enter a unique &lt;code&gt;Service name&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Make sure the &lt;code&gt;Pricing tier&lt;/code&gt; is set to free, unless you want to try a paid tier.&lt;/li&gt;
&lt;li&gt;Finally, click the &lt;code&gt;Review + Create&lt;/code&gt; button at the bottom, then click &lt;code&gt;Create&lt;/code&gt; to create the resource.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F98c24fd8-3266-409e-81f7-129c3bcad103" class="article-body-image-wrapper"&gt;&lt;img width="1352" alt="Screenshot From 2026-03-07 11-53-43" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F98c24fd8-3266-409e-81f7-129c3bcad103" height="932"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Next, go to the resource dashboard and copy the &lt;code&gt;Url&lt;/code&gt; from the &lt;code&gt;Essentials&lt;/code&gt; section. This &lt;code&gt;Url&lt;/code&gt; will be used as &lt;code&gt;VECTOR_SEARCH_ENDPOINT&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F181f5efc-c51f-4169-8cac-ec9c6d66decf" class="article-body-image-wrapper"&gt;&lt;img width="1656" alt="Screenshot From 2026-03-07 11-57-59" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F181f5efc-c51f-4169-8cac-ec9c6d66decf" height="932"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To get the &lt;code&gt;VECTOR_SEARCH_KEY&lt;/code&gt;, go to the &lt;code&gt;Keys&lt;/code&gt; tab under the &lt;code&gt;Settings&lt;/code&gt; section in the left navbar. From this screen, copy the &lt;code&gt;Primary admin key&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F3aed742c-cac6-44cd-8287-c122cc47307b" class="article-body-image-wrapper"&gt;&lt;img width="1656" alt="Screenshot From 2026-03-07 11-59-14" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F3aed742c-cac6-44cd-8287-c122cc47307b" height="932"&gt;&lt;/a&gt;&lt;/p&gt;
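&lt;p&gt;As a sketch of how these two values are used, here is a query request built against the Azure AI Search REST API. The index name and api-version are illustrative placeholders, and the request is constructed but not sent:&lt;/p&gt;

```python
import json
import os
import urllib.request

# Fall back to placeholders so the snippet runs without real credentials.
endpoint = os.environ.get("VECTOR_SEARCH_ENDPOINT", "https://my-service.search.windows.net")
key = os.environ.get("VECTOR_SEARCH_KEY", "placeholder-key")

# "my-index" and the api-version are hypothetical values for illustration.
url = f"{endpoint}/indexes/my-index/docs/search?api-version=2023-11-01"
payload = {"search": "*", "top": 3}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "api-key": key},
    method="POST",
)
# urllib.request.urlopen(request) would execute the query against your service.
print(request.full_url)
```

&lt;p&gt;The SDK used in the notebook builds the same authenticated requests under the hood; the endpoint identifies your service and the admin key authenticates every call, which is why both must be kept secret.&lt;/p&gt;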
&lt;h3&gt;
  
  
  Azure AI Foundry
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Azure AI Foundry&lt;/strong&gt; is a platform for deploying and managing AI models on Azure. It lets you deploy base models (like OpenAI models) as endpoints that you can call from your own code. In this lab, we are using it to deploy two models: &lt;code&gt;text-embedding-3-small&lt;/code&gt; to generate embeddings, and a generation model to produce the final answer.&lt;/p&gt;

&lt;p&gt;To set it up, we need to provision the resource and get two values: &lt;code&gt;AZURE_OPEN_API_KEY&lt;/code&gt; and &lt;code&gt;AZURE_OPEN_API_ENDPOINT&lt;/code&gt;, which will be used as environment variables.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="http://ai.azure.com/" rel="noopener noreferrer"&gt;Azure AI Foundry&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;code&gt;Create new&lt;/code&gt; button to create a new project.&lt;/li&gt;
&lt;li&gt;For the resource type, keep the recommended option and click &lt;code&gt;Next&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F5d7e25ed-17f4-4edc-95ea-d70be24616de" class="article-body-image-wrapper"&gt;&lt;img width="1356" alt="Screenshot From 2026-03-07 12-21-42" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F5d7e25ed-17f4-4edc-95ea-d70be24616de" height="932"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Give it a name, keep everything else at the defaults, and click &lt;code&gt;Create&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F58b32bb3-8018-4c6c-87d8-c8670b43a927" class="article-body-image-wrapper"&gt;&lt;img width="1356" alt="Screenshot From 2026-03-07 12-33-28" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F58b32bb3-8018-4c6c-87d8-c8670b43a927" height="932"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Finally, from the project overview page, copy &lt;code&gt;API Key&lt;/code&gt; to use as &lt;code&gt;AZURE_OPEN_API_KEY&lt;/code&gt; and &lt;code&gt;Microsoft Foundry project endpoint&lt;/code&gt; to use as &lt;code&gt;AZURE_OPEN_API_ENDPOINT&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F3940633d-b151-486f-94c8-6c7166306917" class="article-body-image-wrapper"&gt;&lt;img width="1330" alt="Screenshot From 2026-03-07 12-37-15" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F3940633d-b151-486f-94c8-6c7166306917" height="932"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Now from the left navbar, click &lt;code&gt;Models + endpoints&lt;/code&gt; under &lt;code&gt;My assets&lt;/code&gt; section.&lt;/li&gt;
&lt;li&gt;Click &lt;code&gt;Deploy base model&lt;/code&gt;. Now we will deploy an embedding model and a generation model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F09317a2b-d7f0-47d7-83a9-67c8634a8fda" class="article-body-image-wrapper"&gt;&lt;img width="1566" alt="Screenshot From 2026-03-07 12-41-27" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F09317a2b-d7f0-47d7-83a9-67c8634a8fda" height="932"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search for &lt;code&gt;text-embedding-3-small&lt;/code&gt; and click &lt;code&gt;Confirm&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F0f378759-24a6-4c89-9533-8d88bad6fe68" class="article-body-image-wrapper"&gt;&lt;img width="1207" alt="Screenshot From 2026-03-07 12-43-59" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F0f378759-24a6-4c89-9533-8d88bad6fe68" height="932"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Change the &lt;code&gt;Deployment type&lt;/code&gt; to &lt;code&gt;Standard&lt;/code&gt; and click &lt;code&gt;Deploy&lt;/code&gt; to deploy the model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Faf6bba7b-7095-40cb-8d02-58ec6653d393" class="article-body-image-wrapper"&gt;&lt;img width="1207" alt="Screenshot From 2026-03-07 12-45-04" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Faf6bba7b-7095-40cb-8d02-58ec6653d393" height="932"&gt;&lt;/a&gt;&lt;/p&gt;
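&lt;p&gt;Once deployed, the embedding model is just an HTTPS endpoint. Here is a request sketched in the classic Azure OpenAI REST shape; the api-version is a placeholder, the exact path may differ for your resource type, and the request is built but not sent:&lt;/p&gt;

```python
import json
import os
import urllib.request

# Fall back to placeholders so the snippet runs without real credentials.
endpoint = os.environ.get("AZURE_OPEN_API_ENDPOINT", "https://my-resource.openai.azure.com")
key = os.environ.get("AZURE_OPEN_API_KEY", "placeholder-key")
deployment = "text-embedding-3-small"  # the deployment name chosen above

url = f"{endpoint}/openai/deployments/{deployment}/embeddings?api-version=2024-02-01"
body = {"input": "Hello from the RAG lab"}

request = urllib.request.Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json", "api-key": key},
    method="POST",
)
# urllib.request.urlopen(request) would return a JSON body whose
# "data" field contains the embedding vector.
print(request.full_url)
```

&lt;p&gt;The notebook uses the Azure SDK rather than raw HTTP, but this is what the SDK call resolves to: your endpoint, your deployment name, and the API key in a header.&lt;/p&gt;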


&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;a href="https://github.com/busycaesar/Embeddings_And_Cosine_Similarity/blob/Master/AgentCon%20-%20Toronto/main.ipynb" rel="noopener noreferrer"&gt;this notebook&lt;/a&gt; in Google Colab.&lt;/li&gt;
&lt;li&gt;Add the following environment variables by clicking the key button in the left sidebar (shown in the screenshot below), and grant them notebook access:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AZURE_OPEN_API_ENDPOINT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AZURE_OPEN_API_KEY&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;VECTOR_SEARCH_ENDPOINT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;VECTOR_SEARCH_KEY&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F7a0aa861-37bc-406e-accd-73ca673d6938" class="article-body-image-wrapper"&gt;&lt;img width="465" alt="Screenshot From 2026-03-07 13-01-29" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F7a0aa861-37bc-406e-accd-73ca673d6938" height="929"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Finally, run the cells in the notebook.&lt;/li&gt;
&lt;/ol&gt;
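&lt;p&gt;As a sanity check before running the pipeline, a small helper like this fails fast if any of the four values is missing. (This sketch reads plain environment variables; in Colab, secrets granted notebook access are exposed through the notebook's own secrets helper.)&lt;/p&gt;

```python
import os

REQUIRED = [
    "AZURE_OPEN_API_ENDPOINT",
    "AZURE_OPEN_API_KEY",
    "VECTOR_SEARCH_ENDPOINT",
    "VECTOR_SEARCH_KEY",
]

def load_config():
    """Return the four required values, raising if any is unset."""
    missing = [name for name in REQUIRED if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {name: os.environ[name] for name in REQUIRED}

# config = load_config()  # raises with a clear message if a secret is missing
```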


&lt;h2&gt;
  
  
  Clean the Cloud
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;On the &lt;a href="https://portal.azure.com" rel="noopener noreferrer"&gt;Azure Portal&lt;/a&gt;, go to &lt;code&gt;All Resources&lt;/code&gt; and delete all the resources created for this lab.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F914b5ee6-6083-422d-86d4-b9878a90496e" class="article-body-image-wrapper"&gt;&lt;img width="684" alt="Screenshot From 2026-03-07 13-05-17" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F914b5ee6-6083-422d-86d4-b9878a90496e" height="929"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this lab, we set up a full RAG pipeline on Azure using Azure AI Search as the vector database and Azure AI Foundry to deploy the embedding and generation models. Thanks for reading! If you want to understand the math behind how the retrieval step works, check out my other blog on the Math behind Embeddings and Cosine Similarity.&lt;/p&gt;


&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/busycaesar/embeddings-cosine-similarity-4541" class="crayons-story__hidden-navigation-link"&gt;Exploring RAG: Math behind Embeddings &amp;amp; Cosine Similarity&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/busycaesar" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1017716%2F0fdaf54f-6428-41fc-86b1-412236c6a8fe.png" alt="busycaesar profile" class="crayons-avatar__image" width="800" height="436"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/busycaesar" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Dev J. Shah 🥑
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Dev J. Shah 🥑
                
              
              &lt;div id="story-author-preview-content-2724552" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/busycaesar" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1017716%2F0fdaf54f-6428-41fc-86b1-412236c6a8fe.png" class="crayons-avatar__image" alt="" width="800" height="436"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Dev J. Shah 🥑&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/busycaesar/embeddings-cosine-similarity-4541" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jul 29 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/busycaesar/embeddings-cosine-similarity-4541" id="article-link-2724552"&gt;
          Exploring RAG: Math behind Embeddings &amp;amp; Cosine Similarity
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/vectordatabase"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;vectordatabase&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/rag"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;rag&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/busycaesar/embeddings-cosine-similarity-4541" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/busycaesar/embeddings-cosine-similarity-4541#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            10 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


&lt;p&gt;If you want to learn more about AI Engineering and related terminologies, I highly recommend getting the book &lt;a href="https://amzn.to/4kcD2n8" rel="noopener noreferrer"&gt;“AI Engineering” by Chip Huyen&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>rag</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>5 AI Agent Design Patterns Every Developer Needs to Know</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Sun, 01 Mar 2026 21:01:28 +0000</pubDate>
      <link>https://forem.com/busycaesar/5-ai-agent-design-patterns-every-developer-needs-to-know-1n5l</link>
      <guid>https://forem.com/busycaesar/5-ai-agent-design-patterns-every-developer-needs-to-know-1n5l</guid>
      <description>&lt;h2&gt;
  
  
  What is an AI Agent?
&lt;/h2&gt;

&lt;p&gt;To explain this concept, I will use a simple analogy. Humans have two core powers they use to build things: a brain (to think, decide, and plan) and body parts (to act, verify, and get feedback). Similarly, an AI agent is a digital counterpart that uses an LLM (as its brain) and tools (as its body parts) to build things.&lt;/p&gt;

&lt;p&gt;In other words, an AI agent is a software program that does not follow predefined steps. It receives a request, uses the LLM to understand it and to plan the steps needed to fulfill it, and then uses tools (a fancy term for functions) to complete those steps as needed. Essentially, an agent is only as good as the reasoning capability of its LLM and the tools it has access to.&lt;/p&gt;
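&lt;p&gt;The brain-plus-tools idea can be sketched in a few lines of Python. Everything here is a stand-in: the "LLM" is a rule-based stub and get_weather is a hypothetical tool; a real agent would call a model API and real tools.&lt;/p&gt;

```python
# Minimal agent loop: a (stubbed) "brain" picks a tool, the tool executes,
# and the observation is fed back until the brain decides it is done.
def get_weather(city: str) -> str:
    # Hypothetical tool: a real agent would call a weather API here.
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def fake_llm(request: str, observations: list) -> dict:
    # Stand-in for a real LLM call: plans one tool call, then answers.
    if not observations:
        return {"action": "get_weather", "input": "Toronto"}
    return {"action": "finish", "answer": observations[-1]}

def run_agent(request: str, max_steps: int = 5) -> str:
    observations = []
    for _ in range(max_steps):
        decision = fake_llm(request, observations)
        if decision["action"] == "finish":
            return decision["answer"]
        tool = TOOLS[decision["action"]]              # pick the tool (body part)
        observations.append(tool(decision["input"]))  # execute and observe
    return "Gave up after max_steps"

print(run_agent("What is the weather in Toronto?"))  # Sunny in Toronto
```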




&lt;h2&gt;
  
  
  Single Agent
&lt;/h2&gt;

&lt;p&gt;As the name suggests, this is an application where only a single AI agent is used to get the work done.&lt;/p&gt;




&lt;h2&gt;
  
  
  Multi-Agent
&lt;/h2&gt;

&lt;p&gt;When an application uses multiple AI agents, it is considered a multi-agent design pattern. Moreover, within the multi-agent design pattern, there are further sub-types.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sequential
&lt;/h3&gt;

&lt;p&gt;A sequential pattern suits workflows where the agents must work in coordination, in the sense that the input of one agent depends on the output of another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkpbv8ywschv4qoihtj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkpbv8ywschv4qoihtj3.png" alt="Sequential" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For instance, in a resume screening system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent 1: Extract text from resume (PDF → structured data)&lt;/li&gt;
&lt;li&gt;Agent 2: Identify skills &amp;amp; experience&lt;/li&gt;
&lt;li&gt;Agent 3: Match against job description&lt;/li&gt;
&lt;li&gt;Agent 4: Score candidate&lt;/li&gt;
&lt;li&gt;Agent 5: Generate recruiter summary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tasks cannot all be performed simultaneously: each step depends on the output of the previous one, so the agents must be arranged in series.&lt;/p&gt;
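&lt;p&gt;The five steps above can be sketched as a simple pipeline. Each agent is stubbed as a plain function (a real one would wrap an LLM call), and each output feeds the next input:&lt;/p&gt;

```python
# Sequential pattern: each agent's output becomes the next agent's input.
def extract_text(pdf):     return {"raw": f"text of {pdf}"}
def identify_skills(doc):  return {**doc, "skills": ["python", "sql"]}
def match_job(doc):        return {**doc, "matches": 2}
def score(doc):            return {**doc, "score": doc["matches"] * 50}
def summarize(doc):        return f"Candidate scored {doc['score']}/100"

PIPELINE = [extract_text, identify_skills, match_job, score, summarize]

def run_sequential(initial_input):
    result = initial_input
    for agent in PIPELINE:  # strictly in series: step n+1 needs step n's output
        result = agent(result)
    return result

print(run_sequential("resume.pdf"))  # Candidate scored 100/100
```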

&lt;h3&gt;
  
  
  Parallel
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eb0ab194vhqmhayzcun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eb0ab194vhqmhayzcun.png" alt="Parallel" width="601" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, when the workflows can run independently, that is, when the agents can start working simultaneously, the pattern is called the Parallel AI Agent Pattern.&lt;br&gt;
For example, in a research agent tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent 1: Searches web articles&lt;/li&gt;
&lt;li&gt;Agent 2: Searches academic papers&lt;/li&gt;
&lt;li&gt;Agent 3: Searches internal company docs&lt;/li&gt;
&lt;li&gt;Agent 4: Searches forums (e.g., dev discussions)&lt;/li&gt;
&lt;/ul&gt;
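&lt;p&gt;Assuming each search agent is an independent function (stubbed here instead of real API or LLM calls), the fan-out can be sketched with a thread pool:&lt;/p&gt;

```python
# Parallel pattern: independent search agents run concurrently and their
# results are merged at the end.
from concurrent.futures import ThreadPoolExecutor

def search_web(q):     return f"web results for {q}"
def search_papers(q):  return f"papers on {q}"
def search_docs(q):    return f"internal docs about {q}"
def search_forums(q):  return f"forum threads on {q}"

AGENTS = [search_web, search_papers, search_docs, search_forums]

def run_parallel(query):
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = [pool.submit(agent, query) for agent in AGENTS]
        return [f.result() for f in futures]  # preserves submission order

results = run_parallel("vector databases")
print(len(results))  # 4
```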

&lt;h3&gt;
  
  
  Loop
&lt;/h3&gt;

&lt;p&gt;In a loop pattern of AI agents, a loop agent iteratively runs a sequence of AI agents until a specific condition is met. For instance, in a content evaluation agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent 1: Generates the content&lt;/li&gt;
&lt;li&gt;Agent 2: Evaluates the grammar&lt;/li&gt;
&lt;li&gt;Agent 3: Checks if the content is coherent&lt;/li&gt;
&lt;li&gt;Agent 4: Checks if the content addresses all the requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgg9k20rd09s2r5xj70rc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgg9k20rd09s2r5xj70rc.png" alt="Loop" width="800" height="146"&gt;&lt;/a&gt;&lt;/p&gt;
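&lt;p&gt;A minimal sketch of the loop agent, assuming stubbed generate and evaluate agents: the sequence reruns until every check passes or a retry budget is exhausted.&lt;/p&gt;

```python
# Loop pattern: a loop agent re-runs the generate-then-evaluate sequence
# until all checks pass or max_iters is reached. All agents are stubs.
def generate(draft_no):        return f"draft {draft_no}"
def grammar_ok(text):          return True
def coherent(text):            return True
def meets_requirements(text):  # pretend the first draft falls short
    return not text.endswith("1")

def loop_agent(max_iters=5):
    for i in range(1, max_iters + 1):
        content = generate(i)
        checks = [grammar_ok(content), coherent(content),
                  meets_requirements(content)]
        if all(checks):           # exit condition met
            return content, i
    return content, max_iters     # give up after the retry budget

content, iterations = loop_agent()
print(iterations)  # 2, because the second draft passes all checks
```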

&lt;h3&gt;
  
  
  Other Patterns
&lt;/h3&gt;

&lt;p&gt;Other than these major patterns, there are, and can be, several more patterns to design an efficient AI system. For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review and Critique Pattern&lt;/strong&gt;: It has one agent which generates the content and another which criticizes it. Until either the generated response is satisfactory or the maximum number of iterations has been reached, it keeps looping.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iterative Refinement Pattern&lt;/strong&gt;: In this pattern, there can be three agents: one which generates the response, another which evaluates the quality, and if it is not up to the mark, a third agent which enhances the prompt, by adding more instructions, and sends it back to the first agent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Coordinator Pattern&lt;/strong&gt;: As the name suggests, it has one coordinator or master agent which plans the steps needed to resolve the query and assigns each step to a specialized agent based on the task to be completed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
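&lt;p&gt;The first of these, the review-and-critique loop, can be sketched as follows. Both agents are stubs standing in for LLM calls, and the max-iterations cap mirrors the description above.&lt;/p&gt;

```python
# Review-and-critique pattern: a generator agent and a critic agent loop
# until the critic is satisfied or the iteration cap is hit.
def generator(feedback):
    # Stub: a real generator would prompt an LLM with the critique.
    return "short answer" if feedback is None else "longer, revised answer"

def critic(text):
    # Stub critic: returns None when satisfied, otherwise a critique string.
    return None if "revised" in text else "too short, expand it"

def review_and_critique(max_iters=3):
    feedback = None
    for i in range(1, max_iters + 1):
        draft = generator(feedback)
        feedback = critic(draft)
        if feedback is None:   # critic satisfied
            return draft, i
    return draft, max_iters

draft, rounds = review_and_critique()
print(rounds)  # 2
```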




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The introduction of AI agents has started a whole new discipline within software engineering. In the near future, more software developers will work on creating agents and specialized tools, building multi-agent systems, and more. Moreover, the invention of the &lt;a href="https://dev.to/busycaesar/mcp-model-context-protocol-4o1l"&gt;Model Context Protocol (MCP)&lt;/a&gt; has made this easier than ever before.&lt;/p&gt;

&lt;p&gt;If you want to get into the world of AI engineering, I highly recommend getting the book, &lt;a href="https://amzn.to/4kcD2n8" rel="noopener noreferrer"&gt;“AI Engineering” by Chip Huyen&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Citation
&lt;/h2&gt;

&lt;p&gt;This blog is inspired by the Google Cloud Architecture Center documentation: &lt;a href="https://docs.cloud.google.com/architecture/choose-design-pattern-agentic-ai-system" rel="noopener noreferrer"&gt;"Choose a design pattern for your agentic AI system"&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>rag</category>
      <category>design</category>
    </item>
    <item>
      <title>Chunking for context: 6 Strategies Every AI Engineer Should Know</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Sun, 22 Feb 2026 22:22:05 +0000</pubDate>
      <link>https://forem.com/busycaesar/chunking-for-context-6-strategies-every-ai-engineer-should-know-40aa</link>
      <guid>https://forem.com/busycaesar/chunking-for-context-6-strategies-every-ai-engineer-should-know-40aa</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The purpose of chunking is to split data into chunks, convert each chunk into an embedding, and store them in a vector database so they can later be used to provide context to the LLM.&lt;/p&gt;

&lt;p&gt;This data can be split using multiple strategies. The ultimate goal is to ensure that whenever the relevant chunks are fetched, they provide enough context for the LLM to properly address the user query. The following are some of the strategies that can be used to split the data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fixed Size
&lt;/h2&gt;

&lt;p&gt;The most common method of chunking is fixed-size: the text is split into chunks of a fixed number of characters. For instance, if the data contains 5,000 characters and the chunk size is 250 characters, the data is divided into 20 chunks.&lt;/p&gt;

&lt;p&gt;For this strategy, one important parameter is the overlap. Developers often overlap characters to make sure context is maintained across chunks. In simple words, if the overlap is 20 characters, the last 20 characters of one chunk also become the first 20 characters of the next chunk. This helps preserve context across chunk boundaries.&lt;/p&gt;
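&lt;p&gt;A minimal sketch of fixed-size chunking with overlap: each new chunk starts (size minus overlap) characters after the previous one, so neighbouring chunks share their boundary characters.&lt;/p&gt;

```python
# Fixed-size chunking with character overlap.
def fixed_size_chunks(text, size=250, overlap=20):
    if overlap >= size:
        raise ValueError("overlap must be smaller than the chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = fixed_size_chunks("a" * 5000, size=250, overlap=0)
print(len(chunks))  # 20, matching the 5,000-character example above

overlapped = fixed_size_chunks("0123456789", size=6, overlap=2)
print(overlapped)  # ['012345', '456789', '89']
```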




&lt;h2&gt;
  
  
  Recursive
&lt;/h2&gt;

&lt;p&gt;Recursive chunking splits text by using the largest meaningful structure first, and only falls back to smaller ones if the chunk is still too large. This needs to be explained with the help of an example.&lt;/p&gt;

&lt;p&gt;Assuming the following data is to be split.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Recursive chunking splits text step by step while preserving meaning.

In retrieval-augmented generation systems, documents often exceed the context window of large language models. If text is split blindly into fixed sizes, important ideas may be broken across chunks, and retrieval quality suffers.

Recursive chunking solves this by attempting to keep larger semantic units intact first, such as paragraphs. Only when a paragraph exceeds the size limit does the system split it further into sentences or smaller units.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Assume the maximum size of each chunk is 20 tokens (roughly 20 words).&lt;/p&gt;

&lt;p&gt;Following is the order in which recursive chunking divides the text.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Whole document&lt;/li&gt;
&lt;li&gt;Paragraph &lt;code&gt;\n\n&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Sentence &lt;code&gt;.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Token&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Following this order, the splitter first checks if the whole document can fit into a single chunk. However, it exceeds the limit of 20 tokens. Hence, it would divide the content into separate paragraphs.&lt;/p&gt;

&lt;p&gt;P1: 10 tokens&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Recursive chunking splits text step by step while preserving meaning.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;P2: 33 tokens&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In retrieval-augmented generation systems, documents often exceed the context window of large language models. If text is split blindly into fixed sizes, important ideas may be broken across chunks, and retrieval quality suffers.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;P3: 35 tokens&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Recursive chunking solves this by attempting to keep larger semantic units intact first, such as paragraphs. Only when a paragraph exceeds the size limit does the system split it further into sentences or smaller units.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Since P1 fits within the limit of 20 tokens, P1 becomes the first chunk. P2 and P3 exceed the limit and will be further divided into sentences.&lt;/p&gt;

&lt;p&gt;P2, S1: 17 tokens&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In retrieval-augmented generation systems, documents often exceed the context window of large language models.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;P2, S2: 19 tokens&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If text is split blindly into fixed sizes, important ideas may be broken across chunks, and retrieval quality suffers.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;P3, S1: 17 tokens&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Recursive chunking solves this by attempting to keep larger semantic units intact first, such as paragraphs.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;P3, S2: 18 tokens&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Only when a paragraph exceeds the size limit does the system split it further into sentences or smaller units.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;All of these fit within the 20-token limit. Hence, the final list of chunks is P1, P2-S1, P2-S2, P3-S1, and P3-S2.&lt;/p&gt;
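&lt;p&gt;A minimal sketch of this splitter, using a word count as a rough token proxy. Production systems typically rely on a library splitter (for example, LangChain's RecursiveCharacterTextSplitter) rather than hand-rolled code.&lt;/p&gt;

```python
# Minimal recursive splitter: try the largest separator first; only split a
# piece further if it still exceeds the token budget.
SEPARATORS = ["\n\n", ". "]   # paragraph first, then sentence

def n_tokens(text):
    return len(text.split())  # crude word-count proxy for tokens

def recursive_split(text, limit=20, seps=SEPARATORS):
    if n_tokens(text) > limit and seps:
        # still too big: split on the current separator and recurse
        parts = [p for p in text.split(seps[0]) if p.strip()]
        chunks = []
        for part in parts:
            chunks.extend(recursive_split(part, limit, seps[1:]))
        return chunks
    if n_tokens(text) > limit:
        # no separators left: fall back to splitting by raw token count
        words = text.split()
        return [" ".join(words[i:i + limit])
                for i in range(0, len(words), limit)]
    return [text]  # fits within the budget: keep as one chunk

doc = ("Recursive chunking splits text step by step.\n\n"
       "This second paragraph is deliberately long enough to exceed the "
       "twenty token budget. So it gets split again at sentence boundaries.")
chunks = recursive_split(doc)
print(len(chunks))  # 3: the short paragraph plus two sentences
```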


&lt;h2&gt;
  
  
  Document-Based
&lt;/h2&gt;

&lt;p&gt;Document-based chunking splits text along natural document boundaries like sections, subsections, and paragraphs, which is similar to recursive chunking. However, the key difference is that when a section exceeds the token limit, document-based chunking identifies semantically meaningful split points (like topic shifts) rather than mechanically splitting at arbitrary separators. This ensures each chunk contains coherent, self-contained information.&lt;/p&gt;

&lt;p&gt;For instance, the following data,&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Analysis Framework&lt;/span&gt;
We employed a mixed-methods approach combining quantitative regression analysis with qualitative thematic coding. The quantitative component used multiple linear regression to identify predictors of user satisfaction, controlling for demographic variables including age, gender, and geographic location. Model selection involved comparing AIC values across nested models. The qualitative component involved coding open-ended responses using an inductive approach to identify emergent themes. Two independent coders achieved an inter-rater reliability of κ=0.85.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;will split as follows&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Analysis Framework - Quantitative&lt;/span&gt;
We employed a mixed-methods approach combining quantitative regression analysis with qualitative thematic coding. The quantitative component used multiple linear regression to identify predictors of user satisfaction, controlling for demographic variables including age, gender, and geographic location. Model selection involved comparing AIC values across nested models.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Analysis Framework - Qualitative&lt;/span&gt;
The qualitative component involved coding open-ended responses using an inductive approach to identify emergent themes. Two independent coders achieved an inter-rater reliability of κ=0.85.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hierarchical
&lt;/h2&gt;

&lt;p&gt;In this type of chunking strategy, the data is first divided into small semantic units, that is, either sentences or small paragraphs, to form separate chunks. This can be considered a level 3 division (the most granular level). Further, level 3 chunks are grouped together based on similarity, which forms level 2 chunks, and lastly, based on the main topic, level 2 chunks are grouped to become level 1 chunks. This bottom-up aggregation is what makes hierarchical chunking unique.&lt;/p&gt;

&lt;p&gt;In this strategy, the interesting work happens at retrieval time. The user query is matched against the most relevant level 3 chunk, and upon a match, its parent level 2 or level 1 chunk is retrieved to supply broader context.&lt;/p&gt;


&lt;h2&gt;
  
  
  Semantic
&lt;/h2&gt;

&lt;p&gt;In Semantic Chunking, content is divided based on the actual meaning and topic of the text.&lt;/p&gt;

&lt;p&gt;Imagine the algorithm reading through your document sentence by sentence. It starts building a "chunk" and asks itself: "Is this next sentence still talking about the same thing?" As long as the next sentence is semantically related to the previous one, it is added to the current chunk.&lt;/p&gt;

&lt;p&gt;As soon as the topic shifts or if the chunk reaches a maximum token limit, the current chunk is closed, and a new one begins. This ensures that a single idea stays together, making it much easier for an AI to retrieve the right context later.&lt;/p&gt;

&lt;p&gt;The model identifies topic shifts by calculating the mathematical distance between sentence embeddings using cosine similarity. Check out the following blog to learn more about it.&lt;/p&gt;


&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/busycaesar/embeddings-cosine-similarity-4541" class="crayons-story__hidden-navigation-link"&gt;Exploring RAG: Math behind Embeddings &amp;amp; Cosine Similarity&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/busycaesar" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1017716%2F0fdaf54f-6428-41fc-86b1-412236c6a8fe.png" alt="busycaesar profile" class="crayons-avatar__image" width="800" height="436"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/busycaesar" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Dev J. Shah 🥑
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Dev J. Shah 🥑
                
              
              &lt;div id="story-author-preview-content-2724552" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/busycaesar" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1017716%2F0fdaf54f-6428-41fc-86b1-412236c6a8fe.png" class="crayons-avatar__image" alt="" width="800" height="436"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Dev J. Shah 🥑&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/busycaesar/embeddings-cosine-similarity-4541" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jul 29 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/busycaesar/embeddings-cosine-similarity-4541" id="article-link-2724552"&gt;
          Exploring RAG: Math behind Embeddings &amp;amp; Cosine Similarity
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/vectordatabase"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;vectordatabase&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/rag"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;rag&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/busycaesar/embeddings-cosine-similarity-4541" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/busycaesar/embeddings-cosine-similarity-4541#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            10 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
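&lt;p&gt;The topic-shift check at the heart of semantic chunking can be sketched directly: compute the cosine similarity between consecutive sentence embeddings and open a new chunk when it drops. The 3-dimensional vectors below are toy stand-ins for real embeddings produced by an embedding model.&lt;/p&gt;

```python
# Cosine similarity between consecutive sentence embeddings, used to
# decide where one semantic chunk ends and the next begins.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def split_on_drift(embeddings, threshold=0.8):
    # Record a chunk boundary whenever similarity to the previous
    # sentence falls under the threshold (a topic shift).
    boundaries = [0]
    for i in range(1, len(embeddings)):
        if threshold > cosine_similarity(embeddings[i - 1], embeddings[i]):
            boundaries.append(i)
    return boundaries

# Sentences 0 and 1 share a topic; sentence 2 jumps to a new one.
vecs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 0.1, 1.0]]
print(split_on_drift(vecs))  # [0, 2]
```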





&lt;h2&gt;
  
  
  LLM Based
&lt;/h2&gt;

&lt;p&gt;In LLM-based chunking, the model acts as an intelligent editor, choosing precisely where to place dividers.&lt;/p&gt;

&lt;p&gt;Key Considerations for Breaking Content:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Drift:&lt;/strong&gt; It detects subtle transitions in subject matter, asking: "Are we still talking about the same core concept?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conceptual Integrity:&lt;/strong&gt; It ensures an idea is fully explained before cutting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical Grouping:&lt;/strong&gt; It groups related ideas even if they use different wording or span across multiple paragraphs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structural Intelligence:&lt;/strong&gt; It respects the flow of headings and sections but prioritizes the actual narrative flow over rigid formatting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relational Awareness:&lt;/strong&gt; The LLM excels at identifying functional pairs, such as:

&lt;ul&gt;
&lt;li&gt;Cause ↔ Effect&lt;/li&gt;
&lt;li&gt;Definition ↔ Example&lt;/li&gt;
&lt;li&gt;Problem ↔ Solution&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Considerations for Chunking
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Context Window Limits:&lt;/strong&gt; Every embedding model has a maximum token limit (e.g., 512, 8192 tokens). If a text chunk exceeds this limit, the model will simply truncate the text, ignoring any content beyond the limit. This results in incomplete vector representations and lost data. Therefore, your maximum chunk size must always be safely below the embedding model's context window limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Granularity vs. Context:&lt;/strong&gt; A larger context window allows for bigger chunks, capturing more context but potentially diluting specific details. Smaller windows force smaller chunks, which are more precise but may lack surrounding context. The choice of chunk size is a direct trade-off that must align with the capabilities of the specific embedding model you intend to use downstream.&lt;/p&gt;
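&lt;p&gt;The truncation risk described above can be caught with a small guard before embedding. The word-count tokenizer and the 512-token limit below are simplifying assumptions; a real pipeline would use the model's actual tokenizer, such as tiktoken for OpenAI models.&lt;/p&gt;

```python
# Guard against silent truncation: flag any chunk that exceeds the
# embedding model's context window (illustrative 512-token limit).
MODEL_TOKEN_LIMIT = 512

def count_tokens(text):
    return len(text.split())  # crude proxy for a real tokenizer

def validate_chunks(chunks, limit=MODEL_TOKEN_LIMIT, margin=0.9):
    # Keep a safety margin below the hard limit, since word counts
    # underestimate true token counts.
    budget = int(limit * margin)
    return [i for i, c in enumerate(chunks) if count_tokens(c) > budget]

chunks = ["short chunk", "word " * 600]
print(validate_chunks(chunks))  # [1]: the second chunk would be truncated
```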




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing the right chunking strategy depends on the nature of your data, the embedding model you use, and the kind of queries your system needs to handle. Each strategy covered above comes with its own trade-offs between simplicity, semantic accuracy, and computational cost.&lt;/p&gt;

&lt;p&gt;If you want to learn more about AI Engineering and related terminologies, I highly recommend getting the book &lt;a href="https://amzn.to/4kcD2n8" rel="noopener noreferrer"&gt;“AI Engineering” by Chip Huyen&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
    </item>
    <item>
      <title>Model Adaptation: Prompt-Based Techniques vs Fine-Tuning</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Sat, 31 Jan 2026 17:43:24 +0000</pubDate>
      <link>https://forem.com/busycaesar/model-adaptation-prompt-based-techniques-vs-fine-tuning-3131</link>
      <guid>https://forem.com/busycaesar/model-adaptation-prompt-based-techniques-vs-fine-tuning-3131</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Firstly, it is important to understand why we need to adapt a model and what adapting a model actually means.&lt;/p&gt;

&lt;p&gt;Here, a model refers to a foundation model, which is trained on a very large amount of general-purpose data. These models are capable of handling a wide variety of tasks, but they are not optimized for any specific use case by default.&lt;/p&gt;

&lt;p&gt;Adapting a model means customizing a foundation model for a specific use case so that we can leverage its capabilities more effectively and fulfill a defined purpose.&lt;/p&gt;

&lt;p&gt;Let us assume a use case where we need to build a customer support chatbot backed by an LLM. For this, we would start by selecting a foundation model and then customizing it so that it can directly interact with customers, resolve their issues, or take appropriate actions.&lt;/p&gt;

&lt;p&gt;To achieve this, there are two main approaches. These approaches are not alternatives, but instead serve different use cases, depending on the requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompt-Based Techniques
&lt;/h2&gt;

&lt;p&gt;The first technique used to adapt a model is prompt-based adaptation. In this approach, there is a middle layer between the user’s query and the LLM.&lt;/p&gt;

&lt;p&gt;When a user submits a question, this middle layer adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Additional context&lt;/li&gt;
&lt;li&gt;Instructions&lt;/li&gt;
&lt;li&gt;Constraints or rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;along with the original user query. This combined prompt is then sent to the LLM. Based on these instructions and context, the model generates a response that is more aligned with the expected behavior.&lt;/p&gt;
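&lt;p&gt;The middle layer described above can be sketched as a prompt-assembly step. The instruction text, the retrieval stub, and the model call are all illustrative stand-ins, not a real provider SDK.&lt;/p&gt;

```python
# The "middle layer": wrap the raw user query with instructions,
# retrieved context, and constraints before calling the model.
SYSTEM_INSTRUCTIONS = "You are a support agent for Acme Corp."
CONSTRAINTS = "Answer in at most three sentences. Never promise refunds."

def retrieve_context(user_query):
    # Stub for a retrieval step (e.g. RAG over a knowledge base).
    return "Relevant FAQ: password resets are done from Settings, then Account."

def build_prompt(user_query):
    return "\n\n".join([
        SYSTEM_INSTRUCTIONS,
        "Context:\n" + retrieve_context(user_query),
        "Rules:\n" + CONSTRAINTS,
        "User question:\n" + user_query,
    ])

def call_llm(prompt):
    # Stand-in for the actual model call.
    return "(model response based on the assembled prompt)"

prompt = build_prompt("How do I reset my password?")
print(call_llm(prompt))
```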

&lt;p&gt;There are multiple prompt-based techniques, including but not limited to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Zero-shot Prompting&lt;/li&gt;
&lt;li&gt;Few-shot Prompting&lt;/li&gt;
&lt;li&gt;Role Prompting&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/busycaesar/series/29180"&gt;Retrieval-Augmented Generation (RAG)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Tool or Function Calling&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Limitations of Prompt-Based Techniques
&lt;/h3&gt;

&lt;p&gt;One major limitation of prompt-based techniques is inconsistency. This approach does not reliably enforce behavior, which means the output can vary even when similar instructions are provided. As a result, the response is not always guaranteed to follow the expected structure or tone.&lt;/p&gt;

&lt;p&gt;Another limitation is the reduction of the available context window. Since instructions and additional context must be sent with every request, they consume a significant portion of the model’s context window.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;Another strategy to adapt a model is fine-tuning. This approach requires more technical knowledge and high-quality data compared to prompt-based methods.&lt;/p&gt;

&lt;p&gt;In fine-tuning, the weights of the foundation model are updated so that the model learns to behave in a specific way by default. Instead of guiding the model through instructions at runtime, we change how the model responds internally.&lt;/p&gt;

&lt;p&gt;Prompt-based techniques can be compared to giving instructions to a smart generalist, whereas fine-tuning is like sending that generalist back to school to become a specialist. Instead of reminding them what to do every time, their training changes how they think and respond by default.&lt;/p&gt;

&lt;p&gt;Some common fine-tuning methods include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Supervised Fine-Tuning (SFT)&lt;/li&gt;
&lt;li&gt;Reinforcement Learning from Human Feedback (RLHF)&lt;/li&gt;
&lt;li&gt;Instruction Fine-Tuning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With fine-tuning, the model becomes highly reliable. It consistently follows a fixed tone, response structure, and style. Additionally, it can learn company-specific jargon, terminology, and phrasing, making it more suitable for long-term and stable use cases.&lt;/p&gt;
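&lt;p&gt;As a rough illustration, supervised fine-tuning data is commonly prepared as prompt/response pairs and serialized as JSON Lines. The records and field names below are hypothetical; the exact schema depends on the fine-tuning pipeline being used:&lt;/p&gt;

```python
import json

# Hypothetical SFT records: each pair teaches the model the desired tone
# and structure by example, instead of instructing it at inference time.
records = [
    {"prompt": "Summarize our refund policy.",
     "response": "Refunds are issued within 14 days of purchase."},
    {"prompt": "Greet a new customer.",
     "response": "Welcome aboard! How can we help you today?"},
]

# Serialize to JSON Lines, one training example per line.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```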




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In general, prompt-based techniques make more sense when the instructions given to the model change frequently. For example, product FAQs or dynamic content that evolves over time can be efficiently handled using prompt-based methods.&lt;/p&gt;

&lt;p&gt;On the other hand, fine-tuning is more suitable for behaviors that remain consistent over long periods. This includes company policies, tone of communication, customer interaction rules, and compliance requirements.&lt;/p&gt;

&lt;p&gt;In practice, a hybrid approach often works best. Parameters that are expected to remain stable for a long time can be used to fine-tune the foundation model. At the same time, variables that evolve more frequently can be provided dynamically through prompts at inference time.&lt;/p&gt;

&lt;p&gt;This further reinforces the idea that prompt-based adaptation and fine-tuning are not alternatives, but complementary techniques, each with specific and well-defined use cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Citation
&lt;/h2&gt;

&lt;p&gt;This blog is inspired by the book &lt;a href="https://amzn.to/4kcD2n8" rel="noopener noreferrer"&gt;“AI Engineering” by Chip Huyen&lt;/a&gt;. If you want to go beyond high-level concepts and understand how real-world AI systems are designed, adapted, evaluated, and deployed, this book is an excellent resource. It covers model adaptation, system design, data considerations, production AI workflows, etc., making it valuable for developers building practical AI applications.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>promptengineering</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Why Am I a SWE?</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Thu, 16 Oct 2025 13:44:48 +0000</pubDate>
      <link>https://forem.com/busycaesar/why-am-i-a-swe-1246</link>
      <guid>https://forem.com/busycaesar/why-am-i-a-swe-1246</guid>
      <description>&lt;p&gt;I am not a Software Engineer just because I studied Computer Science; I am one because of the journey that led me here, which includes a mix of wrong turns, realizations, courage, and a quiet but persistent voice that never stopped telling me where I truly belonged.&lt;/p&gt;

&lt;p&gt;When I look back, I sometimes wonder how different my life would have been if I hadn’t switched my profession. Considering the breathtaking pace of today’s tech innovations, I can’t even imagine how bad I would feel if I had stayed where I was.&lt;/p&gt;

&lt;p&gt;I wasn’t always a Software Engineer. In fact, I studied Electrical Engineering. To be honest, I didn’t have a passion for it. I chose it mostly out of fear. Back then, people would tell me that computers were constantly evolving and that if I studied computer science, I would have to keep learning something new almost every single day. As a sixteen-year-old, that sounded intimidating. So, I ended up choosing Electrical Engineering instead. But as my studies went on, something kept tugging at me.&lt;/p&gt;

&lt;p&gt;During my diploma, we had a couple of computer-related courses as part of the electrical curriculum. Around the last year, I found out that one of my friends had started his Bachelor’s in CS. That news haunted me; not in a bad way, but in a way that made me question my own path. The thought that I wasn’t doing computer programming kept bothering me. Deep down, something was constantly telling me that I should be in computers.&lt;/p&gt;

&lt;p&gt;Eventually, I gathered the courage to tell my dad that I wanted to switch to computer programming. Fortunately, he supported my decision, and that conversation changed the entire trajectory of my career.&lt;/p&gt;

&lt;p&gt;Looking back even further, the signs were always there. I think my real connection with computers began in 7th or 8th grade, when I was first introduced to HTML. Later, in 10th grade, I was reintroduced to HTML and also learned C programming in school. I was really good at it. During computer lectures at my tuition classes in 10th grade, I would always be the first one to complete the assigned tasks and then immediately crave something more challenging. I still remember that during our final examinations, the topper of the batch copied from my C program. That moment still makes me smile.&lt;/p&gt;

&lt;p&gt;I also remember building a &lt;a href="https://github.com/busycaesar/Photo_Book" rel="noopener noreferrer"&gt;Facebook-like website&lt;/a&gt; using raw HTML and inline CSS back in 2017, a time when learning programming through online tutorials wasn’t as common as it is today. Most of my learning came from pure curiosity and trial and error.&lt;/p&gt;

&lt;p&gt;In hindsight, studying Electrical Engineering wasn’t a waste at all. It taught me logic, discipline, and a way of thinking that still helps me today, not just professionally, but in life. Yet, it was programming that gave me purpose, creativity, and the thrill of building something from nothing.&lt;/p&gt;

&lt;p&gt;So, why am I a software engineer?&lt;/p&gt;

&lt;p&gt;Because, at every step, even when I didn’t realize it, my heart was already wired for it.&lt;/p&gt;

</description>
      <category>career</category>
      <category>devjournal</category>
      <category>motivation</category>
    </item>
    <item>
      <title>How Retrieval Algorithms Shape Better LLM Responses?</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Tue, 26 Aug 2025 00:34:43 +0000</pubDate>
      <link>https://forem.com/busycaesar/how-retrieval-algorithms-shape-better-llm-responses-5beo</link>
      <guid>https://forem.com/busycaesar/how-retrieval-algorithms-shape-better-llm-responses-5beo</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the era of LLMs, specifically in &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;, retrieval algorithms play one of the most important roles. The better the retrieval results, the better the context provided to the LLM, and the better the responses it generates.&lt;/p&gt;

&lt;p&gt;The method of retrieving information is also the backbone of search engines. However, this blog only talks about retrieval specifically for providing context to LLMs.&lt;/p&gt;

&lt;p&gt;The way it works is by ranking documents based on their relevance to the given query. Retrieval algorithms can be classified based on how the &lt;strong&gt;relevance score&lt;/strong&gt; is computed. A relevance score is a numerical measure that indicates how well a piece of information matches a given query.&lt;/p&gt;

&lt;p&gt;The two common retrieval methods are: &lt;strong&gt;Term-based Retrieval&lt;/strong&gt; and &lt;strong&gt;Embedding-based Retrieval&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Term-based Retrieval
&lt;/h2&gt;

&lt;p&gt;As the name suggests, Term-based Retrieval uses the &lt;strong&gt;keywords&lt;/strong&gt; from the query to find the most relevant documents. However, this approach can have issues.&lt;/p&gt;

&lt;p&gt;Many documents may contain the same keyword. Not every document can fit in the LLM’s &lt;strong&gt;context window&lt;/strong&gt;. As a result, the document with the actual useful context might not get included. A simple approach is to include the document that contains the keyword the highest number of times. The number of times a term appears in the document is called &lt;strong&gt;Term Frequency (TF)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A query may contain multiple keywords, out of which some are more important than others. The importance of each keyword is inversely proportional to the number of documents in which it appears. The more documents a keyword appears in, the less important it becomes. This metric is called &lt;strong&gt;Inverse Document Frequency (IDF)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Mathematically, IDF = (Total number of documents) ÷ (Number of documents containing the keyword). In practice, a logarithm is usually applied to this ratio to dampen the effect of very rare terms. Either way, a higher IDF value indicates a more important keyword.&lt;/p&gt;

&lt;p&gt;The well-known algorithm that combines these two metrics, &lt;strong&gt;Term Frequency (TF)&lt;/strong&gt; and &lt;strong&gt;Inverse Document Frequency (IDF),&lt;/strong&gt; is &lt;strong&gt;TF-IDF&lt;/strong&gt;.&lt;/p&gt;
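&lt;p&gt;A minimal sketch of TF-IDF scoring over a toy corpus may make this concrete. The documents and query are made up, and real implementations (and variants such as BM25) add smoothing and normalization beyond what is shown here:&lt;/p&gt;

```python
import math

# Toy corpus for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

def tf(term, doc):
    """Term Frequency: how often the term appears, normalized by length."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    """Inverse Document Frequency, with a log to dampen rare-term spikes."""
    containing = sum(1 for d in corpus if term in d.split())
    return math.log(len(corpus) / (1 + containing)) + 1  # +1 avoids division by zero

def tf_idf_score(query, doc, corpus):
    """Sum the TF-IDF contribution of each query keyword."""
    return sum(tf(t, doc) * idf(t, corpus) for t in query.split())

scores = [tf_idf_score("cat mat", d, docs) for d in docs]
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])  # the first document ranks highest for "cat mat"
```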




&lt;h2&gt;
  
  
  Embedding-based Retrieval
&lt;/h2&gt;

&lt;p&gt;Term-based Retrieval is focused on keywords rather than meaning, which can result in irrelevant documents being retrieved. On the other hand, &lt;strong&gt;Embedding-based Retrieval&lt;/strong&gt; ranks documents based on how closely they align with the query in terms of &lt;strong&gt;semantic meaning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With Embedding-based Retrieval, indexing involves an additional step: converting documents into &lt;strong&gt;embeddings&lt;/strong&gt;. Embeddings are &lt;strong&gt;high-dimensional vectors&lt;/strong&gt; that preserve important properties of the original data. These embeddings are then stored in a specialized database called a &lt;strong&gt;Vector Database&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To learn more about embeddings, I recommend checking out my &lt;a href="https://dev.to/busycaesar/embeddings-cosine-similarity-4541"&gt;other blog&lt;/a&gt;, which explains how text is converted into embeddings and how retrieval is performed using &lt;strong&gt;cosine similarity&lt;/strong&gt;, one of the most common embedding-based retrieval techniques.&lt;/p&gt;
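&lt;p&gt;As a small illustration of the idea, ranking by cosine similarity can be sketched as follows. The three-dimensional vectors are hand-made stand-ins for real embeddings, which typically have hundreds or thousands of dimensions:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings (real ones come from an embedding model).
doc_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
    "pet care tips": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend this embeds "how do I get my money back"

# Rank documents by semantic closeness to the query.
ranked = sorted(doc_vectors,
                key=lambda d: cosine_similarity(query_vector, doc_vectors[d]),
                reverse=True)
print(ranked[0])  # "refund policy" aligns most closely with the query
```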




&lt;h2&gt;
  
  
  Comparing Term-based and Embedding-based Retrieval
&lt;/h2&gt;

&lt;p&gt;Term-based Retrieval is generally faster than Embedding-based Retrieval during both storing (indexing) and fetching (querying). However, Embedding-based Retrieval usually delivers higher retrieval quality, since it matches on meaning rather than on exact keywords.&lt;/p&gt;

&lt;p&gt;Two metrics often used in RAG to evaluate the quality of a retriever are:&lt;/p&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;ContextPrecision=Relevant retrieved documentsAll retrieved documents
Context Precision = \frac{Relevant\ retrieved\ documents}{All\ retrieved\ documents}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mord mathnormal"&gt;tP&lt;/span&gt;&lt;span class="mord mathnormal"&gt;rec&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ll&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;re&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal"&gt;oc&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord 
mathnormal"&gt;m&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;R&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mord mathnormal"&gt;an&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;re&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal"&gt;oc&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;m&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose 
nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;ContextRecall=Relevant retrieved documentsAll relevant documents
Context Recall = \frac{Relevant\ retrieved\ documents}{All\ relevant\ documents}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mord mathnormal"&gt;tR&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ec&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ll&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ll&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;re&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mord mathnormal"&gt;an&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal"&gt;oc&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;m&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord 
mathnormal"&gt;s&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;R&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mord mathnormal"&gt;an&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;re&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal"&gt;oc&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;m&lt;/span&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
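&lt;p&gt;Both metrics are straightforward to compute for a single query once the sets of retrieved and relevant documents are known. The document IDs below are hypothetical:&lt;/p&gt;

```python
# The retriever returned four documents; three of them are truly relevant,
# and the corpus contains five relevant documents in total.
retrieved = {"d1", "d2", "d3", "d7"}
relevant = {"d1", "d2", "d3", "d4", "d5"}

relevant_retrieved = retrieved.intersection(relevant)
context_precision = len(relevant_retrieved) / len(retrieved)  # 3 / 4 = 0.75
context_recall = len(relevant_retrieved) / len(relevant)      # 3 / 5 = 0.6
print(context_precision, context_recall)
```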


&lt;p&gt;Another consideration is cost. Generating embeddings requires compute resources and often involves API costs. In addition, depending on the vector database, both &lt;strong&gt;vector storage&lt;/strong&gt; and &lt;strong&gt;vector search queries&lt;/strong&gt; can also be expensive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Combining Retrieval Methods
&lt;/h2&gt;

&lt;p&gt;Combining both retrieval algorithms is called &lt;strong&gt;Hybrid Search&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There are two common approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sequential Combination&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;First, use Term-based Retrieval to fetch all documents containing the keyword.&lt;/li&gt;
&lt;li&gt;Then, use Embedding-based Retrieval to re-rank those documents based on semantic meaning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Combination&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Both retrieval methods run in parallel.&lt;/li&gt;
&lt;li&gt;Each produces a ranking of documents by relevance.&lt;/li&gt;
&lt;li&gt;The results are then merged or compared to generate a final ranking.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hybrid Search allows leveraging the strengths of both approaches: the speed of keyword search and the semantic depth of embeddings.&lt;/p&gt;
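&lt;p&gt;As one possible sketch of the parallel combination, the two rankings can be merged with Reciprocal Rank Fusion (RRF), a common merging scheme. The document names, rankings, and the constant k=60 below are illustrative:&lt;/p&gt;

```python
# Merge two independently produced rankings with Reciprocal Rank Fusion.
term_ranking = ["doc_a", "doc_b", "doc_c"]       # from keyword search
embedding_ranking = ["doc_b", "doc_d", "doc_a"]  # from vector search

def reciprocal_rank_fusion(rankings, k=60):
    """Score each document by the sum of 1 / (k + rank) across rankings."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = reciprocal_rank_fusion([term_ranking, embedding_ranking])
print(merged[0])  # doc_b ranks first: it scores well in both lists
```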




&lt;p&gt;&lt;strong&gt;Citation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This blog is inspired by the &lt;em&gt;“Retrieval Algorithms”&lt;/em&gt; topic in the book &lt;a href="https://amzn.to/4kcD2n8" rel="noopener noreferrer"&gt;&lt;em&gt;“AI Engineering”&lt;/em&gt; by &lt;strong&gt;Chip Huyen&lt;/strong&gt;&lt;/a&gt;. This is a brief introduction to the topic. To learn more in detail, I recommend referring to the book.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>AI-900 Guide: From Prep to Pass</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Mon, 04 Aug 2025 21:07:00 +0000</pubDate>
      <link>https://forem.com/busycaesar/ai-900-guide-from-prep-to-pass-23kg</link>
      <guid>https://forem.com/busycaesar/ai-900-guide-from-prep-to-pass-23kg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Whatsup everyone? This blog is about Microsoft’s “Azure AI Fundamentals” certificate, also referred to as AI-900, its exam code. This is a fundamentals-level certificate that helps candidates demonstrate understanding of Machine Learning and AI, along with the Microsoft Azure services and tools related to those topics. The following is the link to the Microsoft Learn page with the most up-to-date information about it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/credentials/certifications/azure-ai-fundamentals" rel="noopener noreferrer"&gt;Microsoft Certified: Azure AI Fundamentals - Certifications&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Topics that it covers
&lt;/h2&gt;

&lt;p&gt;The certificate covers the following topics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Artificial Intelligence workloads and considerations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This section includes a brief introduction to Computer Vision workloads, Natural Language Processing workloads, Document Processing workloads, and Generative AI workloads. It also discusses the principles of responsible AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fundamental principles of Machine Learning on Azure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This section talks about various types of Machine Learning, such as Regression, Classification, Clustering, Deep Learning, and Transformer Architecture. Later on, it also teaches how to use Azure tools and services to create ML models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computer Vision workloads on Azure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Computer Vision modules introduce various solutions like Image Classification, Object Detection, Optical Character Recognition, and Facial Detection and Analysis. They then demonstrate how to implement these services on Azure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural Language Processing (NLP) workloads on Azure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this section, various features of Natural Language Processing are introduced, such as Key Phrase Extraction, Entity Recognition, Sentiment Analysis, Language Modeling, Speech Recognition and Synthesis, and Translation. Additionally, it walks through the documentation that helps you set up all these services on Azure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generative AI workloads on Azure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As the name suggests, this section introduces Generative AI workloads on Azure, including its features, scenarios, and considerations, and all the tools provided by Azure in this vertical.&lt;/p&gt;




&lt;h2&gt;
  
  
  How did I prepare for it?
&lt;/h2&gt;

&lt;p&gt;To prepare for the exam, I followed all the modules described in Microsoft Learn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/training/paths/introduction-to-ai-on-azure" rel="noopener noreferrer"&gt;Introduction to AI in Azure - Training&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I went through all the units in each module, reading and trying to understand every topic while making sure to get some hands-on experience through the exercises included in the learning path above. This gave me more exposure to the services and tools discussed in the units. Meanwhile, I also wrote &lt;a href="https://ai900.shahtech.info/" rel="noopener noreferrer"&gt;notes&lt;/a&gt; that explain the topics in simpler terms.&lt;/p&gt;

&lt;p&gt;Alternatively, you can buy a Udemy course or follow a FreeCodeCamp video, as you prefer. Nonetheless, I feel the material provided by Microsoft Learn is good enough.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: There are a couple of places in my notes where I added a detailed explanation of the topic, which might be out of scope for the certification. I researched the content for my understanding and later decided to keep it in my notes so it's easy for me to understand it while I revise, and might also help others.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I would also recommend that you go through all the units in the modules while also making your own notes to help you revise them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Recommendation:&lt;/strong&gt; As a suggestion, try understanding all the topics fundamentally, even if it's not required to pass the exam. The reason behind doing this is that once you understand it from the foundation, it will be stored in your memory, and you will not have to revise it again and again.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How did I revise?
&lt;/h2&gt;

&lt;p&gt;Once you are done with all the units in all the modules, I would suggest that you go ahead and book the test. After booking, you will have a fixed amount of time to revise everything. The following is how I revised for the exam after booking it.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1946916929626591351-9" src="https://platform.twitter.com/embed/Tweet.html?id=1946916929626591351"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;Microsoft Learn provides practice tests that you can take to simulate the exam environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/credentials/certifications/azure-ai-fundamentals/?practice-assessment-type=certification#certification-practice-for-the-exam" rel="noopener noreferrer"&gt;Microsoft Certified: Azure AI Fundamentals - Certifications&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The best thing about these practice tests is that at the end of each test, along with the score, you get a list of the modules in which you performed poorly and need to revise. This helped me identify the modules where I was lacking, so I could focus specifically on those rather than on everything. I took one practice test daily and revisited my notes based on the list of modules provided.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1947614569763115522-754" src="https://platform.twitter.com/embed/Tweet.html?id=1947614569763115522"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;One or two days before the exam, I would suggest that you go through some external websites that provide sample exam questions, such as &lt;a href="https://tutorialsdojo.com/ai-900-microsoft-azure-ai-fundamentals-sample-exam-questions/" rel="noopener noreferrer"&gt;Tutorials Dojo&lt;/a&gt;, &lt;a href="https://app.exampro.co/student/journey/AI-900" rel="noopener noreferrer"&gt;Exam Pro&lt;/a&gt;, etc. Try solving these sample questions; it helps you get fully prepared for your test.&lt;/p&gt;




&lt;h2&gt;
  
  
  How did I appear for the exam?
&lt;/h2&gt;

&lt;p&gt;Talking about booking the exam, it costs USD 99. To get a discount, I would recommend that you go to &lt;a href="https://events.microsoft.com/en-us/mvtd" rel="noopener noreferrer"&gt;Microsoft Virtual Training Days&lt;/a&gt;, browse for AI 900, and enroll for the training day as per your convenience. Once you complete these virtual training day sessions, you will automatically get a 50% discount on the exam fees.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: This discount percentage or availability might vary in the future as decided by Microsoft.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1950860676370477154-855" src="https://platform.twitter.com/embed/Tweet.html?id=1950860676370477154"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;As for the exam mode, you can take the exam from home or at a testing center. I preferred going to a testing center over clearing everything off my desk and tidying my background for online proctoring. Also, since this was my first certification exam, it made more sense to take it at a testing center.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To conclude, the material not only covers the services and tools offered by Azure, but also teaches many fundamental concepts related to AI and ML. Hence, it can be a stepping stone, especially if you want to transition into AI and ML. This certificate may help you establish your credibility, which in turn can help you land the opportunity you have been looking for.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1950914135526371814-90" src="https://platform.twitter.com/embed/Tweet.html?id=1950914135526371814"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;Also, keep in mind that this is a fundamentals-level certificate, so there is still a lot of work and learning ahead of you. Thus, have a next step ready for yourself after this one. Best of luck!!&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Please note that the content and structure of this certification may change in the future. Microsoft Learn can update the modules, add or remove topics, or change how the exam is evaluated. The information shared in this blog is based on what was available at the time I prepared for the exam.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>certification</category>
    </item>
    <item>
      <title>Career Reflection: No More Survival Jobs</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Mon, 04 Aug 2025 12:00:00 +0000</pubDate>
      <link>https://forem.com/busycaesar/dont-read-this-blog-eg3</link>
      <guid>https://forem.com/busycaesar/dont-read-this-blog-eg3</guid>
      <description>&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;p&gt;The following blog was written on 1st October 2024.&lt;/p&gt;




&lt;p&gt;With a miserable heart, I am writing this blog today. I might not even post it anytime. However, I guess expressing my feelings through writing might help me calm down. &lt;/p&gt;

&lt;p&gt;Hello, I am Dev. I recently graduated from Seneca with an Advanced Diploma in Computer Programming and Analysis. Besides the diploma, I have 8 months of co-op experience, 3-4 months of volunteering experience, 2 months of freelancing experience, and I am currently in my 10th month working as a Research Assistant. All of these experiences involved working on software development projects, apart from all the survival jobs I have done. &lt;/p&gt;

&lt;p&gt;I started applying for jobs in Feb/March. However, I have not got anything. I don't mean to have an ego, but I believe not everyone graduates with the length as well as the variety of experience that I have. I could have just avoided the volunteering or freelancing work if I wanted to. There was no significant financial benefit involved based on the work and hours that I kept. But I still chose to work because I genuinely wanted to. With all these experiences, I believe that I have good enough skills to start working as a Developer. This is the last month of my contract, and after that, I don't know what I'm going to do. How am I planning to earn money? &lt;/p&gt;

&lt;p&gt;Frankly, I have almost nothing in my account at the moment. Not that I have not saved while I was earning, but I have also paid 50% of my tuition fees by myself, which sums up to almost CAD 25,000. Additionally, I have been living on my own for almost 3 years now, and I have never ever asked my family for any financial help apart from some food which they send me, and that too only if anyone is coming from India to Canada. &lt;/p&gt;

&lt;p&gt;Coming back to the topic, this blog is not to get any kind of sympathy. If I get nothing, the only option I have is to go back to working survival jobs, and since I am being honest in this blog, I literally hate it. No offence to anyone working at survival jobs, because of course someone has to do it. But I think if I have to do survival jobs, that would be a waste of my skills and a human resource. I love Tim Hortons, but I don't want to work there to serve others coffee. I am not made for that thing. And again, I know that I don't want to do survival jobs because I have done survival jobs for 2 years and 8 months, to be specific, including working at Tim Hortons for 10 months. It would be a curse on me if I had to go back and do survival jobs. &lt;/p&gt;

&lt;p&gt;My duty is to develop software, and as I can call it, my dharma is to develop software for now and not serve food. I want to believe in the famous Bhagwat Geeta quote, &lt;code&gt;धर्मो रक्षति रक्षितः&lt;/code&gt;, which means if you protect your dharma, your dharma will protect you. &lt;/p&gt;

&lt;p&gt;Moving forward, I also got rejections for a few entry-level positions. I took it personally and got offended that these companies think I am not good enough for their entry-level jobs. Again, not to be egotistic, but I bet most graduates do not have the experience that I do. I do not know what to do!&lt;/p&gt;

&lt;p&gt;I shall have no self-respect if I push myself to do a survival job. I understand that one should adapt to the situation and that not everyone gets what they want, but I want to keep my self-respect. &lt;/p&gt;

&lt;p&gt;I shall protect my career, and my career shall protect me. I did write some harsh things. I am feeling a bit relieved but still miserable. A bit less than when I started writing this. Thanks for reading this.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/fUcVUNViUG4"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploring RAG: Math behind Embeddings &amp; Cosine Similarity</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Tue, 29 Jul 2025 12:00:00 +0000</pubDate>
      <link>https://forem.com/busycaesar/embeddings-cosine-similarity-4541</link>
      <guid>https://forem.com/busycaesar/embeddings-cosine-similarity-4541</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This blog will discuss two main components of Retrieval Augmented Generation: the ingestion of data into a vector database and the retrieval of a relevant chunk of data using cosine similarity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Brief About RAG
&lt;/h2&gt;

&lt;p&gt;Before going further, as a prerequisite, here is a brief explanation of Retrieval Augmented Generation for those who are not familiar with this concept. Please feel free to skip to the next sections if you already know RAG.&lt;/p&gt;

&lt;p&gt;This technique was designed to provide context to the LLM when generating responses to domain-specific questions. LLMs are trained on a vast amount of general data available on the internet. Hence, when a user asks a domain-specific question, for example, a medical or legal one, the usual tendency of an LLM is to hallucinate. To resolve this, RAG was introduced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynd05w5vu0vni78mrzcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynd05w5vu0vni78mrzcm.png" alt="RAG Pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The way the RAG technique works is that, first, the domain-specific data is split into various chunks. For example, if the data is in the form of multiple paragraphs, that data is split into each paragraph.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; There are various methods of data splitting, such as by the number of characters, paragraphs, etc.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once the data is split into chunks, each chunk is then converted into an array of numbers. The reason for converting text into numbers is that computers can only understand numbers, not words. Further, this array of numbers, along with its associated data chunk, is stored in a vector database.&lt;/p&gt;

&lt;p&gt;Now, when a user asks the LLM a domain-specific question, the text of this question is also converted into an array of numbers using the same method that was used to convert the data. Further, this array of numbers (of the question’s text) is passed to the vector database. The vector database then uses something known as cosine similarity to search for the most relevant chunk/s of data that can help answer the user’s question and returns these chunk/s. At this point, we have the user’s question and the most relevant chunk/s of data (that can answer it) from the vector database. Subsequently, we can pass the user’s question to the LLM along with the chunk/s of data, as context, which ensures that the LLM’s response is grounded in the actual stored data, making the answer more accurate, up-to-date, and trustworthy. If you want to know more about RAG, please check out the &lt;a href="https://dev.to/busycaesar/series/29180"&gt;RAG Explained&lt;/a&gt; series.&lt;/p&gt;
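&lt;p&gt;The full flow described above can be sketched end to end in a few lines of Python. This is only a toy illustration: the &lt;code&gt;embed&lt;/code&gt; function below fakes an embedding model with hand-picked keyword flags (real systems call a trained embedding model), and the "vector database" is just an in-memory list.&lt;/p&gt;

```python
import math

# Toy stand-in for an embedding model: each dimension is a keyword flag.
# A real RAG system would call a trained embedding model instead.
FLAG_KEYWORDS = [
    {"tribalscale", "company"},                        # Is Company?
    {"think", "best", "known"},                        # Is Opinion?
    {"service", "services", "provides", "providing"},  # Is a service?
    {"toronto", "live"},                               # Is location?
]

def embed(text):
    words = set(text.lower().replace("?", "").replace(".", "").split())
    return [1.0 if keywords & words else 0.0 for keywords in FLAG_KEYWORDS]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Ingestion: store (embedding, chunk) pairs in an in-memory "vector database".
chunks = [
    "I think TribalScale provides the best services.",
    "TribalScale is known for providing high-quality service.",
    "I live in Toronto",
]
store = [(embed(chunk), chunk) for chunk in chunks]

# Retrieval: embed the question the same way and return the closest chunk,
# which would then be passed to the LLM as context.
question = "Which company provides the best services?"
q = embed(question)
best_embedding, best_chunk = max(store, key=lambda pair: cosine(pair[0], q))
```

&lt;p&gt;With these toy flags, the retrieved chunk is one of the two TribalScale sentences, while "I live in Toronto" scores lowest.&lt;/p&gt;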

&lt;p&gt;As you would have noticed, RAG has two core functionalities: Converting text into an array of numbers and retrieving the relevant chunk/s of data using cosine similarity. In the next sections, we are going to learn how these are done and have a good overall understanding.&lt;/p&gt;




&lt;h2&gt;
  
  
  Text into Numbers
&lt;/h2&gt;

&lt;p&gt;First, we will discuss how the text is converted into an array of numbers. But instead of directly discussing the actual technique used, let me also discuss the alternatives. This will give you more perspective on why we use what we use.&lt;/p&gt;

&lt;p&gt;When someone mentions converting text into numbers, the first method that comes to mind is to create a big vocabulary table that stores all the words along with an index. Now, we can refer to this table and easily convert sentences/paragraphs into an array of numbers.&lt;/p&gt;

&lt;p&gt;Consider the example of the following vocabulary table, alphabetically ordered.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Index&lt;/th&gt;
&lt;th&gt;Word&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;best&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;client&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;for&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;high-quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;i&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;is&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;know&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;provide&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;software&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;the&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;think&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;tribalscale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;zod&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Using the above table to convert the sentence “I think TribalScale provides the best services.” (matching each word to its base form in the table), the array of numbers becomes [5, 12, 13, 8, 11, 1, 9].&lt;/p&gt;

&lt;p&gt;Taking one more example, for the sentence “TribalScale is known for providing high-quality service.”, the array of numbers becomes [13, 6, 7, 3, 8, 4, 9].&lt;/p&gt;
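&lt;p&gt;As a minimal sketch, this vocabulary-table encoding is just a dictionary lookup. The &lt;code&gt;VOCAB&lt;/code&gt; dictionary below mirrors the example table, and the input words are assumed to already be normalized to their base form.&lt;/p&gt;

```python
# Hypothetical vocabulary table from the example above: word -> index.
VOCAB = {
    "best": 1, "client": 2, "for": 3, "high-quality": 4, "i": 5,
    "is": 6, "know": 7, "provide": 8, "service": 9, "software": 10,
    "the": 11, "think": 12, "tribalscale": 13,
}

def encode(words):
    # Look up each (pre-normalized) word in the vocabulary table.
    return [VOCAB[word] for word in words]

sentence1 = ["i", "think", "tribalscale", "provide", "the", "best", "service"]
sentence2 = ["tribalscale", "is", "know", "for", "provide", "high-quality", "service"]
print(encode(sentence1))  # [5, 12, 13, 8, 11, 1, 9]
print(encode(sentence2))  # [13, 6, 7, 3, 8, 4, 9]
```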

&lt;p&gt;Great. This sounds like a straightforward approach. However, considering our use case, there are a couple of issues with this method. To start with, without the vocabulary table, these numbers are meaningless. Moreover, we need something that helps us find the relationship between two chunks, so that we can search for the most relevant chunk of data. However, when we look at the arrays of numbers for both of the above sentences without the text,&lt;/p&gt;

&lt;p&gt;Sentence 1: [5, 12, 13, 8, 11, 1, 9]&lt;br&gt;
Sentence 2: [13, 6, 7, 3, 8, 4, 9]&lt;/p&gt;

&lt;p&gt;It's hard to tell if both of these sentences are related to each other. Plus, as the number of words in the vocabulary table increases, we might have indices in the millions. Hence, this would not be an ideal approach, given the compute required and the cost associated with it.&lt;/p&gt;

&lt;p&gt;Therefore, this calls for another approach, one that helps identify whether two sentences are related to each other. This is where ‘Word Embeddings’ come in.&lt;/p&gt;

&lt;p&gt;“Word Embeddings” is a technique that converts text into an array of numbers while also capturing the relationships between chunks. In this method, instead of using a vocabulary table, we have a list of parameters or, in simple words, flags. For example, consider the following list.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameters/Flags&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is Company?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is Opinion?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is Positive?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is a service?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is location?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, based on the sentence, we will answer each flag with a value between 0 and 1, including decimals. If the answer is yes, we use 1; if the answer is no, we use 0; and if the answer is anywhere in between, we use a decimal value in that range. Consider both the above sentences, along with a new sentence.&lt;/p&gt;

&lt;p&gt;Sentence 1: “I think TribalScale provides the best services.”&lt;br&gt;
Sentence 2: “TribalScale is known for providing high-quality service.”&lt;br&gt;
Sentence 3: “I live in Toronto”&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameters/Flags&lt;/th&gt;
&lt;th&gt;Sentence 1&lt;/th&gt;
&lt;th&gt;Sentence 2&lt;/th&gt;
&lt;th&gt;Sentence 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is Company?&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is Opinion?&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is Positive?&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;0.7&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is a service?&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is location?&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is city?&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The values are approximate and used for explanation only.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This array of numbers is called an embedding. Thus, the embeddings for all three sentences become,&lt;/p&gt;

&lt;p&gt;Sentence 1: [   1,      1,   0.9,   0.9,   0.1,   0, …]&lt;br&gt;
Sentence 2: [   1,    0.8,   0.7,   0.9,     0,   0, …]&lt;br&gt;
Sentence 3: [ 0.1,      0,   0.1,   0.2,     1,   1, …]&lt;/p&gt;

&lt;p&gt;This time, when you see the embeddings for all the sentences together, it is evident that, at each index, the values for sentence 1 and sentence 2 are very close to each other, while those for sentence 3 are far away. This indicates that sentences 1 and 2 are more similar to each other than to sentence 3.&lt;/p&gt;
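&lt;p&gt;This closeness can be quantified. As a rough sketch (using total absolute difference as a simple notion of distance, not the measure RAG systems actually use), the gap between sentence 3 and the other two is already obvious:&lt;/p&gt;

```python
# Embeddings from the table above (first six flags).
s1 = [1, 1, 0.9, 0.9, 0.1, 0]
s2 = [1, 0.8, 0.7, 0.9, 0, 0]
s3 = [0.1, 0, 0.1, 0.2, 1, 1]

def total_diff(a, b):
    # Sum of per-index absolute differences between two embeddings.
    return sum(abs(x - y) for x, y in zip(a, b))

print(total_diff(s1, s2))  # ~0.5 -> sentences 1 and 2 are close
print(total_diff(s1, s3))  # ~5.3 -> sentence 3 is far away
```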

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In real life, there are hundreds or thousands of such parameters/flags, which increases the accuracy of embeddings.&lt;/li&gt;
&lt;li&gt;The model comes up with these parameters by itself while it is being trained on a huge amount of text.&lt;/li&gt;
&lt;li&gt;For simplicity, we took the range of 0 to 1 to answer the parameters/flags, but it can vary based on the model used for generating the embeddings. It can be -1 to 1 or -3 to 3, but the fundamental purpose remains the same.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;To revise what we have seen so far: we first had data (paragraphs), which we split into chunks (one per paragraph), then generated embeddings for each chunk, and lastly stored these embeddings along with the text in a vector database.&lt;/p&gt;


&lt;h2&gt;
  
  
  Cosine Similarity to search for relevant chunk/s
&lt;/h2&gt;

&lt;p&gt;Moving towards the next section, since we now have all the data stored in the vector database, it is time to see how the vector database searches for the most relevant data based on the user's question. To understand this, we need to plot some graphs.&lt;/p&gt;

&lt;p&gt;Getting back the embeddings of all the sentences,&lt;/p&gt;

&lt;p&gt;Sentence 1: [   1,      1,   0.9,   0.9,    0.1,   0, …]&lt;br&gt;
Sentence 2: [   1,    0.8,   0.7,   0.9,      0,   0, …]&lt;br&gt;
Sentence 3: [ 0.1,      0,   0.1,   0.2,      1,   1, …]&lt;/p&gt;

&lt;p&gt;Each index in these embeddings represents a coordinate in the graph. So {1, 1, 0.1} (the first index values of the three sentences) becomes the x-axis, {1, 0.8, 0} becomes the y-axis, and so on. The current embeddings have 6 coordinates. Since it is easier for us to see and observe a 3D graph, for explanation purposes, we will only consider the first three indices of each sentence. Hence, the following.&lt;/p&gt;

&lt;p&gt;Sentence 1: [   1,      1,   0.9]&lt;br&gt;
Sentence 2: [   1,    0.8,   0.7]&lt;br&gt;
Sentence 3: [ 0.1,      0,   0.1]&lt;/p&gt;

&lt;p&gt;Upon plotting these values on a graph, we get,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvlv3av03mhvzmc2sspo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvlv3av03mhvzmc2sspo.png" alt="Graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graph clearly shows that sentences 1 and 2 point in the same direction, while sentence 3 goes in another direction. &lt;/p&gt;

&lt;p&gt;Now, let's take a sample user’s question, “Which company provides the best services?”. To find the relevant chunk of data to answer this question, we need to use the same strategy to convert this text into embeddings.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameters/Flags&lt;/th&gt;
&lt;th&gt;User’s Question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is Company?&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is Opinion?&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is Positive?&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is a service?&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is location?&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is city?&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The embeddings for the user’s question become [1, 0.8, 0.8, 1, 0.1, 0.1, …], and to plot it on a graph, consider the first three values [1, 0.8, 0.8]. Upon plotting the user’s question on the same graph as above, we get,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft69x7uus0uqcgztc38fr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft69x7uus0uqcgztc38fr.png" alt="Graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, you can see that the line for the user’s prompt and those of sentences 1 and 2 go in the same direction, whereas that of sentence 3 goes in a different direction. Visually, it's evident that since the line of the user’s prompt goes in the same direction as those of sentences 1 and 2, they are the most relevant chunk/s of data to answer the user’s question. But we need numbers to prove this. This is where cosine similarity comes in.&lt;/p&gt;

&lt;p&gt;Before going into calculations, let me first explain the reason for using cosine similarity. In simple words, cosine similarity means finding the value of cos θ, where θ is the angle between two lines. The value of cos θ can be anywhere between -1 and 1. If the value of cos θ is&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1; it indicates that both lines are perfectly aligned with each other; hence, the angle between them is 0.&lt;/li&gt;
&lt;li&gt;0; it indicates that both lines are perpendicular to each other.&lt;/li&gt;
&lt;li&gt;-1; it indicates that the lines point in opposite directions, having a 180-degree angle between them.&lt;/li&gt;
&lt;/ul&gt;
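&lt;p&gt;These three cases can be checked directly with Python's &lt;code&gt;math&lt;/code&gt; module (which expects angles in radians):&lt;/p&gt;

```python
import math

print(math.cos(0))                       # 1.0 -> perfectly aligned
print(round(math.cos(math.pi / 2), 10))  # 0.0 -> perpendicular
print(math.cos(math.pi))                 # -1.0 -> opposite directions
```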

&lt;p&gt;With this analogy, if we find the angle between the line of the user’s question and that of each of the three sentences and calculate the value of cos θ for those angles, we get a number describing how aligned the line of each sentence is with the line of the user’s question. This is the reason for using cosine similarity.&lt;/p&gt;

&lt;p&gt;The following are the calculated angles between the line of the user’s question and the line of each sentence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3nowq6o7qm7t6c8ovg4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3nowq6o7qm7t6c8ovg4.png" alt="Graph"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3s9rfniq00ej3ylymow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3s9rfniq00ej3ylymow.png" alt="Angle Calculation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we calculate the cos θ of all these three angles, we get,&lt;/p&gt;

&lt;p&gt;Angle between the user’s prompt and sentence 1: 5.38°&lt;br&gt;
cos 5.38°: 0.9955&lt;/p&gt;

&lt;p&gt;Angle between the user’s prompt and sentence 2: 3.33°&lt;br&gt;
cos 3.33°: 0.9983&lt;/p&gt;

&lt;p&gt;Angle between the user’s prompt and sentence 3: 32.55°&lt;br&gt;
cos 32.55°: 0.8429&lt;/p&gt;

&lt;p&gt;Since the values of cos θ for sentences 1 and 2 are very close to 1, it indicates that they are more relevant to the user’s question.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The value of cos θ between sentence 3 and the user’s question is also close to 1 because, for the calculations, we only used the first 3 parameters/flags to plot the graph. In real life, all of the hundreds or thousands of parameters/flags are considered, which increases the accuracy of these calculations.&lt;/p&gt;
&lt;/blockquote&gt;
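&lt;p&gt;These values can be reproduced with a couple of lines of Python:&lt;/p&gt;

```python
import math

# cos of the measured angles between the user's question and each sentence.
angles = {"sentence 1": 5.38, "sentence 2": 3.33, "sentence 3": 32.55}
cosines = {name: math.cos(math.radians(deg)) for name, deg in angles.items()}
print(cosines)  # approximately: sentence 1 -> 0.9956, sentence 2 -> 0.9983, sentence 3 -> 0.8429
```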

&lt;p&gt;Alright, before going further, let me do a quick recap of what we learnt so far.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpx6k2cteypoki1uwsoq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpx6k2cteypoki1uwsoq.png" alt="Explanation Flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we converted the text into numbers, then we plotted a line on the graph using those numbers.&lt;/li&gt;
&lt;li&gt;Second, we converted the text of the user’s question into numbers and plotted those numbers on a graph.&lt;/li&gt;
&lt;li&gt;Lastly, we calculated the angle between the data’s line and the user’s question’s line and used that angle to calculate the value of cos θ, which specifies how aligned each data chunk is to the user’s question.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Perfect. Now, the issue is that for every question the user asks, we cannot plot all these lines on a graph, calculate the angles, and compute cos θ to find the most relevant chunk of data. Therefore, we need a formula that calculates the value of cos θ directly from the embeddings of the sentences. For this purpose, we use the dot product formula.&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;A⃗ . B⃗ =∥A∥ × ∥B∥ × cos θcos θ=A⃗ . B⃗∥A∥ × ∥B∥cos θ=∑i=1nAi×Bi∑i=1nAi2×∑i=1nBi2
\vec{A}\space.\space\vec{B}\space= \Vert A \Vert\space\times\space\Vert B \Vert\space\times\space cos\space\theta
\newline
cos\space\theta = \frac{\vec{A}\space.\space\vec{B}}{\Vert A \Vert\space\times\space\Vert B \Vert}
\newline
cos\space\theta = \frac{\sum\limits_{i=1}^{n}A_{i}\times B_{i}}{\sqrt{\sum\limits_{i=1}^{n}A_{i}^{2}}\times\sqrt{\sum\limits_{i=1}^{n}B_{i}^{2}}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord"&gt;.&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;∥&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mord"&gt;∥&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;∥&lt;/span&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;span class="mord"&gt;∥&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span 
class="mspace"&gt; &lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;cos&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;cos&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;∥&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mord"&gt;∥&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;∥&lt;/span&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;span class="mord"&gt;∥&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span 
class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord"&gt;.&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="overlay"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;cos&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t 
vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span 
class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord 
mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Applying this formula to sentence 2 and the user’s question:&lt;/p&gt;

&lt;p&gt;Sentence 2:        [   1,   0.8,   0.7]&lt;br&gt;
User’s Question: [   1,   0.8,   0.8]&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;cos θ=(1×1)+(0.8×0.8)+(0.7×0.8)12+0.82+0.72×12+0.82+0.82cos θ=2.22.13×2.28cos θ=2.21.459×1.509cos θ=0.9995
cos\space\theta = \frac{(1\times1) + (0.8\times0.8)+(0.7\times0.8)}{\sqrt{1^2 + 0.8^2 + 0.7^2}\times\sqrt{1^2 + 0.8^2 + 0.8^2}}
\newline
cos\space\theta = \frac{2.2}{\sqrt{2.13}\times\sqrt{2.28}}
\newline
cos\space\theta = \frac{2.2}{1.459\times1.509}
\newline
cos\space\theta = 0.9995
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;cos&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;8&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span 
class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;7&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;8&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;8&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;0.8&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.8&lt;/span&gt;&lt;span 
class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;0.7&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.8&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;cos&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2.13&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2.28&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2.2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;cos&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1.459&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1.509&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2.2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace newline"&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;cos&lt;/span&gt;&lt;span class="mspace"&gt; &lt;/span&gt;&lt;span class="mord mathnormal"&gt;θ&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.9995&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The value of cos θ calculated from the embeddings of sentence 2 and the user’s question is very close to the value we estimated by plotting the vectors on the graph. In other words, this formula lets us compute cos θ directly from the embeddings, without any plotting.&lt;/p&gt;
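&lt;p&gt;The whole calculation above can be sketched in a few lines of Python. This is a minimal illustration using the toy embeddings from this example, not output from a real embedding model:&lt;/p&gt;

```python
from math import sqrt

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (|A| * |B|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sentence_2 = [1, 0.8, 0.7]     # toy embedding of sentence 2
user_question = [1, 0.8, 0.8]  # toy embedding of the user's question

print(round(cosine_similarity(sentence_2, user_question), 4))  # prints 0.9983
```

&lt;p&gt;The closer the result is to 1, the more similar the two embeddings are.&lt;/p&gt;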




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Ah, alright, guys. The main purpose of this blog was to understand how these two core functionalities work in a vector database. The good news is that we don't need to do any of these calculations by hand to build a RAG-powered application. Existing frameworks, such as LangChain, handle all of this for us; all you need to do is call the appropriate function as required.&lt;/p&gt;

&lt;p&gt;In case you want to try building a RAG-powered application, &lt;a href="https://github.com/busycaesar/Embeddings_Cosine_Similarity/blob/Master/rag_documentation.md" rel="noopener noreferrer"&gt;here&lt;/a&gt; is the documentation.&lt;/p&gt;

</description>
      <category>vectordatabase</category>
      <category>rag</category>
      <category>ai</category>
    </item>
    <item>
      <title>Linear Regression Model using Math!</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Sat, 31 May 2025 14:00:00 +0000</pubDate>
      <link>https://forem.com/busycaesar/regression-model-405j</link>
      <guid>https://forem.com/busycaesar/regression-model-405j</guid>
      <description>&lt;p&gt;If you are not familiar with basic Machine Learning terms like Features, Label, Training, Inferencing, please check out the following blog as a prerequisite.&lt;/p&gt;


&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/busycaesar/introduction-to-machine-learning-a22" class="crayons-story__hidden-navigation-link"&gt;Machine Learning 101&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/busycaesar" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1017716%2F0fdaf54f-6428-41fc-86b1-412236c6a8fe.png" alt="busycaesar profile" class="crayons-avatar__image" width="800" height="436"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/busycaesar" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Dev J. Shah 🥑
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Dev J. Shah 🥑
                
              
              &lt;div id="story-author-preview-content-2512412" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/busycaesar" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1017716%2F0fdaf54f-6428-41fc-86b1-412236c6a8fe.png" class="crayons-avatar__image" alt="" width="800" height="436"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Dev J. Shah 🥑&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/busycaesar/introduction-to-machine-learning-a22" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 21 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/busycaesar/introduction-to-machine-learning-a22" id="article-link-2512412"&gt;
          Machine Learning 101
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/busycaesar/introduction-to-machine-learning-a22" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;3&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/busycaesar/introduction-to-machine-learning-a22#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            2 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;





&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Regression models are Supervised Machine Learning models that are trained to predict label values based on the training data. In this blog, we will discuss Linear Regression.&lt;/p&gt;




&lt;h2&gt;
  
  
  Linear Regression
&lt;/h2&gt;

&lt;p&gt;Let me first set up the context. Throughout the blog, we will use a practical example to understand the concept. Consider that we have existing data from an ice cream shop. The data has 2 columns: one is the average temperature of each day, and the second is the number of ice creams sold that day.&lt;/p&gt;

&lt;p&gt;We are creating an algorithm using this existing data. The algorithm can then be used to predict the number of ice creams that will be sold, given the average temperature.&lt;/p&gt;

&lt;p&gt;For this use case, we will try using linear regression. Hence, consider the algorithm to be the following:&lt;/p&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y=β0+β1x+ε
y = \beta_0 + \beta_1x + \varepsilon
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord 
mathnormal"&gt;ε&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;In this algorithm, 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;yy&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is the label (number of ice creams), 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;xx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is the feature (average temperature), including 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β0\beta_0&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β1\beta_1&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 as the model’s parameters, while ε represents the random error.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7nmrzblcprr4alkauia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7nmrzblcprr4alkauia.png" alt="Linear Regression Graph" width="683" height="534"&gt;&lt;/a&gt;&lt;/p&gt;
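&lt;p&gt;To make the notation concrete, here is a minimal Python sketch of inference with such a model. The parameter values below are made-up numbers for illustration, not values fitted to any real data:&lt;/p&gt;

```python
# Made-up parameters for illustration only (not fitted to real data).
beta_0 = 20.0  # baseline: ice creams sold when the temperature is 0
beta_1 = 5.0   # additional ice creams sold per extra degree

def predict_ice_creams(avg_temperature):
    # y = beta_0 + beta_1 * x; the error term epsilon captures the noise
    # the model cannot explain, so it is omitted when predicting.
    return beta_0 + beta_1 * avg_temperature

print(predict_ice_creams(30))  # prints 170.0
```

&lt;p&gt;Training is simply the process of finding the values of these parameters that best fit the existing data.&lt;/p&gt;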

&lt;p&gt;Let’s understand how this algorithm is derived.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As the value of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;xx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (the feature) increases, the value of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;yy&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (prediction) increases or decreases at a constant rate, because we assume a linear relationship.&lt;/li&gt;
&lt;li&gt;Hence, 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;yy&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is either directly or inversely proportional to 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;xx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/li&gt;
&lt;li&gt;In our case, the number of ice creams sold will increase as the temperature increases.&lt;/li&gt;
&lt;li&gt;Therefore, 
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y∝xy\propto x&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∝&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;Therefore, 
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y=β1.xy = \beta_1 . x&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;.&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;

where 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β1\beta_1&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 = the constant of proportionality, also called the slope of the line describing the relationship between 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;yy&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;xx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/li&gt;
&lt;li&gt;Further, consider a point at which the value of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;yy&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 starts. In our case, we can call it the base value: the number of ice creams sold regardless of the temperature. This value can also be 0. Let's represent this value by 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β0\beta_0&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. It is the y-intercept of the line.&lt;/li&gt;
&lt;li&gt;Now the equation becomes 
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y=β0+β1x+εy = \beta_0 + \beta_1x + \varepsilon&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span 
class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ε&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;

where 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;ε\varepsilon&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ε&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 = the error term, i.e., the difference between the predicted label and the actual label for a given feature.&lt;/li&gt;
&lt;/ul&gt;
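&lt;p&gt;The derivation above can be sketched in a few lines of Python. The coefficient values below are hypothetical, chosen only to illustrate the form of the equation, not fitted to any data.&lt;/p&gt;

```python
# Minimal sketch of the linear model y = b0 + b1 * x.
# The coefficient values are hypothetical, for illustration only.

def predict(x, b0, b1):
    """Return the predicted label for a feature value x."""
    return b0 + b1 * x

b0 = -50.0  # y-intercept (base value)
b1 = 1.0    # slope (constant of proportionality)

print(predict(70, b0, b1))  # 20.0
```

Note that the error term is not part of the prediction itself; it captures the gap between this prediction and the observed label.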




&lt;h2&gt;
  
  
  Training
&lt;/h2&gt;

&lt;p&gt;The following steps take place while training a linear regression model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;The available data, that is, both features and labels, is randomly split into multiple groups.&lt;/li&gt;
&lt;li&gt;This creates several groups of data that can be used to train the model.&lt;/li&gt;
&lt;li&gt;One group is held back, which can later be used to validate the trained model.&lt;/li&gt;
&lt;/ol&gt;
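&lt;p&gt;A minimal sketch of this split in Python, using hypothetical (temperature, sales) pairs:&lt;/p&gt;

```python
import random

# Sketch of Step 1: randomly split (feature, label) pairs into a
# training set and a held-back validation set. The pairs are hypothetical.
data = [(52, 0), (67, 14), (70, 23), (73, 22), (78, 26), (83, 36)]

random.seed(0)  # fixed seed so the shuffle is reproducible
shuffled = data[:]
random.shuffle(shuffled)

split = int(len(shuffled) * 2 / 3)  # hold back one third for validation
train, validation = shuffled[:split], shuffled[split:]
print(len(train), len(validation))  # 4 2
```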

&lt;h3&gt;
  
  
  Step 2
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Take one dataset from the multiple groups that we created in the previous step.&lt;/li&gt;
&lt;li&gt;Use a regression algorithm such as linear regression to fit the training data into a model. In other words, create a formula, based on the known data, by assuming the values of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β0\beta_0&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β1\beta_1&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 such that it predicts the right label for a given feature.&lt;/li&gt;
&lt;/ol&gt;
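&lt;p&gt;One common way to choose these coefficient values is ordinary least squares, which has a closed-form solution for a single feature. This is a sketch of that approach, not necessarily the exact procedure a given library uses; the numbers are the ice cream observations used later in this post.&lt;/p&gt;

```python
# Sketch of Step 2: fit b0 and b1 by ordinary least squares
# (closed-form solution for a single feature).
xs = [52, 67, 70, 73, 78, 83]  # temperatures
ys = [0, 14, 23, 22, 26, 36]   # actual sales

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x
print(round(b0, 2), round(b1, 2))  # -58.45 1.12
```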

&lt;h3&gt;
  
  
  Step 3
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Use the group of data that we held back to validate the model by letting it predict the labels for those features.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 4
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Compare the known actual labels in the held-back group of data with the labels that the model predicted.&lt;/li&gt;
&lt;li&gt;Then aggregate the differences between the predicted and actual labels to calculate a metric that indicates how accurately the model predicted on the validation data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After each train, validate, and evaluate iteration, you can repeat the process with different algorithms and parameters until an acceptable evaluation metric is achieved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikgos8xei020ynkse2pf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikgos8xei020ynkse2pf.png" alt="Linear Regression Training Method" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Regression evaluation metrics
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Based on the predicted and actual values, you can calculate some common metrics that are used to evaluate a regression model.&lt;/li&gt;
&lt;li&gt;To understand each metric, consider the following observations for the ice cream sales.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Temperature&lt;br&gt;(
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;X\mathcal{X}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathcal"&gt;X&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
)&lt;/th&gt;
&lt;th&gt;Actual sales&lt;br&gt;(
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Y\mathcal{Y}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathcal"&gt;Y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
)&lt;/th&gt;
&lt;th&gt;Predicted sales&lt;br&gt;(
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Y^\hat{\mathcal{Y}}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathcal"&gt;Y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
)&lt;/th&gt;
&lt;th&gt;Absolute Difference&lt;br&gt;(
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;∣Y^−Y∣\lvert\hat{\mathcal{Y}} - \mathcal{Y}\lvert&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;∣&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathcal"&gt;Y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathcal"&gt;Y&lt;/span&gt;&lt;span class="mopen"&gt;∣&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;67&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;73&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Mean Absolute Error (MAE)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The value of MAE is the average of all the absolute differences. Hence, the name &lt;strong&gt;Mean Absolute Error&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;In the ice cream example, the mean (average) of the absolute errors (2, 3, 3, 1, 2, and 3) is 2.33.&lt;/li&gt;
&lt;/ul&gt;
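&lt;p&gt;Computed directly from the absolute differences in the table:&lt;/p&gt;

```python
# MAE: the mean of the absolute differences from the table above.
errors = [2, 3, 3, 1, 2, 3]
mae = sum(errors) / len(errors)
print(round(mae, 2))  # 2.33
```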

&lt;h3&gt;
  
  
  Mean Squared Error (MSE)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The Mean Absolute Error treats every discrepancy between the predicted and actual labels equally. However, it is often more desirable to have a model that consistently makes small errors than one that makes fewer but larger errors.&lt;/li&gt;
&lt;li&gt;One way to obtain a metric that amplifies large errors is to square the individual errors and calculate the mean of the squared values. This metric is known as the Mean Squared Error.&lt;/li&gt;
&lt;li&gt;In our ice cream example, the mean of the squared errors (which are 4, 9, 9, 1, 4, and 9) is 6.&lt;/li&gt;
&lt;/ul&gt;
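&lt;p&gt;The same errors, squared before averaging:&lt;/p&gt;

```python
# MSE: the mean of the squared errors, which amplifies large errors.
errors = [2, 3, 3, 1, 2, 3]
mse = sum(e ** 2 for e in errors) / len(errors)
print(mse)  # 6.0
```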

&lt;h3&gt;
  
  
  Root Mean Squared Error (RMSE)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The Mean Squared Error helps take the magnitude of errors into account, but because it squares the error values, the resulting metric no longer represents the quantity measured by the label.&lt;/li&gt;
&lt;li&gt;To express the error in the same unit as the label, we calculate the square root of the MSE. This produces a metric called the Root Mean Squared Error.&lt;/li&gt;
&lt;li&gt;In this case 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;6\sqrt{6}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;6&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, which is 2.45 (ice creams).&lt;/li&gt;
&lt;/ul&gt;
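&lt;p&gt;In Python:&lt;/p&gt;

```python
import math

# RMSE: the square root of the MSE, expressed in the label's unit.
errors = [2, 3, 3, 1, 2, 3]
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
print(round(rmse, 2))  # 2.45
```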

&lt;h3&gt;
  
  
  Coefficient of determination (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;R2R^2&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;R&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;All the metrics so far measure the discrepancy between the predicted and actual values. However, in reality, there is some natural random variance in the data that a model cannot be expected to capture perfectly.&lt;/li&gt;
&lt;li&gt;To quantify the variation that already exists in the data, we need a reference point. This reference point is the mean of all the actual labels. In this case, the average of the actual sales is approximately 20.167.&lt;/li&gt;
&lt;li&gt;The deviation of each actual value from this mean is approximately 20.167, 6.167, 2.833, 1.833, 5.833, and 15.833. Squaring and summing these deviations gives the total variation in the data, approximately 740.83.&lt;/li&gt;
&lt;li&gt;Next, we measure how much of this variation the model fails to explain. The squared differences between the predicted and actual values (4, 9, 9, 1, 4, and 9) sum to 36.&lt;/li&gt;
&lt;li&gt;Dividing the unexplained variation by the total variation and subtracting the result from 1 gives the proportion of the variation that the model explains: 1 - 36 / 740.83, which is approximately 0.95. This value is called the coefficient of determination, and it ranges between 0 and 1.&lt;/li&gt;
&lt;li&gt;A value of 1 indicates that the model explains all the variation that exists in the data, while a value of 0 indicates that the model does no better than simply predicting the mean.&lt;/li&gt;
&lt;/ul&gt;
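&lt;p&gt;As a quick check, the coefficient of determination can be computed with its standard definition (one minus the ratio of unexplained variation to total variation) on the values from the table:&lt;/p&gt;

```python
# Coefficient of determination (R squared), standard definition:
# 1 - (sum of squared residuals) / (total sum of squares).
actual = [0, 14, 23, 22, 26, 36]
predicted = [2, 17, 20, 23, 28, 33]

mean_y = sum(actual) / len(actual)
ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
ss_tot = sum((y - mean_y) ** 2 for y in actual)
r2 = 1 - ss_res / ss_tot
print(round(r2, 2))  # 0.95
```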



&lt;p&gt;All the metrics explained above are used to evaluate a regression model. A data scientist uses an iterative approach to repeatedly train and evaluate a model, varying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Feature Selection and Preparation&lt;/strong&gt;: Choosing which features to include in the model, and which transformations to apply to them to help ensure a better fit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Algorithm selection&lt;/strong&gt;: There are many regression algorithms to choose from, and some may fit the data better than others.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Algorithm parameters&lt;/strong&gt;: In the case of the linear regression algorithm, the parameters were 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β0,β1\beta_0, \beta_1&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and so on. In general, the parameters are the coefficients that represent the relationship between the features and the predicted labels.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Words
&lt;/h2&gt;

&lt;p&gt;Thank you for reading the blog. I understand that these concepts can be hard to grasp, especially with a limited math and statistics background. If you have any questions or thoughts, feel free to discuss them in the comments or contact me via any of my &lt;a href="https://bento.me/busycaesar" rel="noopener noreferrer"&gt;social profiles&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Citation
&lt;/h2&gt;

&lt;p&gt;This blog post is inspired by a &lt;a href="https://learn.microsoft.com/en-us/training/modules/fundamentals-machine-learning/4-regression" rel="noopener noreferrer"&gt;Microsoft Learn course's module&lt;/a&gt;. While the foundational concepts are based on the course material, I have expanded on them with additional explanations, examples, and insights to better simplify and contextualize the information for readers.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Machine Learning 101</title>
      <dc:creator>Dev J. Shah 🥑</dc:creator>
      <pubDate>Wed, 21 May 2025 22:08:43 +0000</pubDate>
      <link>https://forem.com/busycaesar/introduction-to-machine-learning-a22</link>
      <guid>https://forem.com/busycaesar/introduction-to-machine-learning-a22</guid>
      <description>&lt;p&gt;In very simple terms, Machine Learning is the process of using existing data to create a mathematical function.&lt;/p&gt;




&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;For example, we have an ice cream shop's data that includes the average temperature of the past 30 days, along with the number of ice creams sold on each day. We can analyze this data (average temperature and number of ice creams sold) and can create a mathematical function that fits with the existing data.&lt;/p&gt;

&lt;p&gt;Further, this function can take the average temperature as input and return the prediction of the number of ice creams that can be sold on that day.&lt;/p&gt;




&lt;h2&gt;
  
  
  Terminologies
&lt;/h2&gt;

&lt;p&gt;From the existing data, the data that we use as the condition, in our case, the average temperature, is called a &lt;strong&gt;Feature&lt;/strong&gt;. The main data that we are targeting to predict, in our case, the number of ice creams sold, is called the &lt;strong&gt;Label&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In mathematical terms, the features are referred to using 

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;xx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. Hence, the features can be represented as [
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;x1,x2,x3,...,xnx_1, x_2, x_3, ..., x_n&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;...&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
]; whereas the label is referred to as 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;yy&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/p&gt;

&lt;p&gt;The process of defining the function using the existing data is called &lt;strong&gt;Training&lt;/strong&gt;. This is the step where the model (a general term used for the mathematical function we create) learns the relationship between features and labels.&lt;/p&gt;

&lt;p&gt;Further, the process of using the defined function to get the predicted value is called &lt;strong&gt;Inferencing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In mathematical terms, the function is referred to as 
&lt;code&gt;f(x)&lt;/code&gt;
. Moreover, the value predicted by the function is referred to as 
&lt;code&gt;ŷ = f(x)&lt;/code&gt;.&lt;/p&gt;
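&lt;p&gt;As a sketch of these two steps (the data and the closed-form fit below are invented for illustration, not taken from the course): training estimates the parameters of a simple function from known feature/label pairs, and inferencing applies the trained function to a new feature value.&lt;/p&gt;

```python
# Hypothetical example: "training" fits f(x) = w*x + b to known
# (feature, label) pairs via least squares; "inferencing" then
# applies the trained f to a new, unseen feature value.

# Past observations: feature values xs and known labels ys (invented data)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Training: closed-form simple linear regression
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
den = sum((x - mean_x) ** 2 for x in xs)
w = num / den
b = mean_y - w * mean_x

def f(x):
    """The trained model: returns the predicted label ŷ."""
    return w * x + b

# Inferencing: predict the label for a new feature value
print(round(f(5.0), 2))  # about 9.85 for this data
```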




&lt;p&gt;There are multiple types of Machine Learning, but at its core, Machine Learning can be classified into two broad types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Supervised Machine Learning&lt;/li&gt;
&lt;li&gt;Unsupervised Machine Learning&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Supervised Machine Learning
&lt;/h2&gt;

&lt;p&gt;Supervised Machine Learning is a type of Machine Learning algorithm in which the training data includes both the feature values and the known label values. This type of algorithm is used to train models by determining a relationship between the features and labels in past observations. This helps the model predict unknown labels for new inputs where the feature values are known.&lt;/p&gt;
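&lt;p&gt;For instance, a minimal supervised learner can be sketched as a hypothetical 1-nearest-neighbour classifier: every past observation carries a known label, and a new input receives the label of the closest observation (the data below is invented for illustration).&lt;/p&gt;

```python
# Hypothetical sketch of supervised learning: the training data pairs
# each feature vector with a known label, and the model predicts
# labels for new inputs. Minimal 1-nearest-neighbour classifier.

# Labelled past observations: (features, label) - invented data
training_data = [
    ([1.0, 1.2], "cat"),
    ([0.9, 1.0], "cat"),
    ([3.0, 3.5], "dog"),
    ([3.2, 3.1], "dog"),
]

def distance(a, b):
    # Euclidean distance between two feature vectors
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def predict(features):
    # Inference: the label of the closest known observation
    nearest = min(training_data, key=lambda obs: distance(obs[0], features))
    return nearest[1]

print(predict([1.1, 1.1]))  # near the "cat" observations
print(predict([3.1, 3.3]))  # near the "dog" observations
```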




&lt;h2&gt;
  
  
  Unsupervised Machine Learning
&lt;/h2&gt;

&lt;p&gt;Unsupervised Machine Learning is a type of algorithm in which the training data includes only the feature values, with no known labels. Instead of predicting a label, the model identifies relationships and patterns among the features of the observations in the training data.&lt;/p&gt;
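&lt;p&gt;A hypothetical sketch of this idea: a tiny k-means clustering (k = 2) that groups observations using only their feature values, with no labels involved (the data below is invented for illustration).&lt;/p&gt;

```python
# Hypothetical sketch of unsupervised learning: only feature values,
# no labels. A minimal k-means (k = 2) groups similar observations.

features = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9]  # invented 1-D data

# Start with two initial centroids picked from the data
centroids = [features[0], features[3]]

for _ in range(10):  # a few refinement iterations
    # Assignment step: each point joins its nearest centroid's cluster
    clusters = [[], []]
    for x in features:
        idx = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[idx].append(x)
    # Update step: move each centroid to its cluster's mean
    centroids = [sum(c) / len(c) for c in clusters]

print(sorted(clusters[0]), sorted(clusters[1]))
```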




&lt;h2&gt;
  
  
  Citation
&lt;/h2&gt;

&lt;p&gt;This blog post is inspired by a &lt;a href="https://learn.microsoft.com/en-us/training/modules/fundamentals-machine-learning/2-what-is-machine-learning" rel="noopener noreferrer"&gt;Microsoft Learn course's module&lt;/a&gt;. While the foundational concepts are based on the course material, I have expanded on them with additional explanations, examples, and insights to better simplify and contextualize the information for readers.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
