<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Brittany </title>
    <description>The latest articles on Forem by Brittany  (@brittany_37606c0775530a57).</description>
    <link>https://forem.com/brittany_37606c0775530a57</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3262910%2F68ed7ba7-ff95-4943-b28e-eeb2ed4a2708.png</url>
      <title>Forem: Brittany </title>
      <link>https://forem.com/brittany_37606c0775530a57</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/brittany_37606c0775530a57"/>
    <language>en</language>
    <item>
      <title>How to Build a Beginner-Friendly HR RAG Chatbot with Amazon Bedrock</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Sat, 09 May 2026 03:18:25 +0000</pubDate>
      <link>https://forem.com/brittany_37606c0775530a57/-how-to-build-a-beginner-friendly-hr-rag-chatbot-with-amazon-bedrock-3m1c</link>
      <guid>https://forem.com/brittany_37606c0775530a57/-how-to-build-a-beginner-friendly-hr-rag-chatbot-with-amazon-bedrock-3m1c</guid>
      <description>&lt;p&gt;Learn how Retrieval-Augmented Generation (RAG) works on AWS using Amazon Bedrock, Knowledge Bases, embeddings, and Amazon S3 through a realistic HR chatbot scenario.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Everyone Is Talking About RAG
&lt;/h2&gt;

&lt;p&gt;AI chatbots are becoming a major part of modern businesses.&lt;/p&gt;

&lt;p&gt;Companies want AI assistants that can answer questions about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;internal company documents&lt;/li&gt;
&lt;li&gt;support articles&lt;/li&gt;
&lt;li&gt;training manuals&lt;/li&gt;
&lt;li&gt;policies&lt;/li&gt;
&lt;li&gt;product documentation&lt;/li&gt;
&lt;li&gt;customer knowledge bases&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;Large language models (LLMs) do not automatically know your company’s private information.&lt;/p&gt;

&lt;p&gt;This is where Retrieval-Augmented Generation (RAG) becomes incredibly important.&lt;/p&gt;

&lt;p&gt;Instead of training a custom model from scratch, businesses can allow an AI model to retrieve relevant company information in real time before generating a response.&lt;/p&gt;

&lt;p&gt;This creates a smarter, more cost-effective AI assistant.&lt;/p&gt;

&lt;p&gt;In this article, we’ll walk through a beginner-friendly AWS architecture for building a RAG chatbot using Amazon Bedrock.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is RAG?
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) is an AI architecture pattern where:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A user asks a question&lt;/li&gt;
&lt;li&gt;The system retrieves relevant information from a data source&lt;/li&gt;
&lt;li&gt;The AI model uses that retrieved information to generate a better response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead of relying only on the model’s built-in training knowledge, the model can access updated and company-specific information.&lt;/p&gt;

&lt;p&gt;This helps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce hallucinations&lt;/li&gt;
&lt;li&gt;improve answer accuracy&lt;/li&gt;
&lt;li&gt;provide more current information&lt;/li&gt;
&lt;li&gt;avoid expensive model retraining&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Real-World Business Scenario
&lt;/h2&gt;

&lt;p&gt;Imagine a fictional company called Northstar Health Services.&lt;/p&gt;

&lt;p&gt;Northstar Health Services is a growing healthcare staffing company with over 2,000 employees across multiple states.&lt;/p&gt;

&lt;p&gt;As the company expanded, the HR department started facing a major problem:&lt;/p&gt;

&lt;p&gt;Employees constantly asked repetitive questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“How many PTO days do I receive?”&lt;/li&gt;
&lt;li&gt;“What is the remote work policy?”&lt;/li&gt;
&lt;li&gt;“How do I update my healthcare benefits?”&lt;/li&gt;
&lt;li&gt;“Where can I find onboarding documents?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The HR team spent hours every week responding to the same requests.&lt;/p&gt;

&lt;p&gt;The company wanted a solution that could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provide employees with fast answers&lt;/li&gt;
&lt;li&gt;reduce HR workload&lt;/li&gt;
&lt;li&gt;search internal company documents&lt;/li&gt;
&lt;li&gt;avoid building and training a custom AI model from scratch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To solve this problem, Northstar Health Services decided to build an internal AI chatbot using Amazon Bedrock and a Retrieval-Augmented Generation (RAG) architecture.&lt;/p&gt;

&lt;p&gt;The chatbot retrieves information from company documents stored in Amazon S3 and generates conversational responses for employees.&lt;/p&gt;

&lt;p&gt;This is one of the most common real-world RAG use cases businesses are exploring today.&lt;/p&gt;




&lt;h2&gt;
  
  
  AWS Services Used in This Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Amazon S3
&lt;/h3&gt;

&lt;p&gt;Amazon S3 stores the company documents.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PDFs&lt;/li&gt;
&lt;li&gt;text documents&lt;/li&gt;
&lt;li&gt;policies&lt;/li&gt;
&lt;li&gt;manuals&lt;/li&gt;
&lt;li&gt;FAQ files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;S3 acts as the document storage layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Bedrock
&lt;/h3&gt;

&lt;p&gt;Amazon Bedrock provides access to foundation models without requiring businesses to manage infrastructure.&lt;/p&gt;

&lt;p&gt;This is one reason Bedrock is becoming extremely popular.&lt;/p&gt;

&lt;p&gt;Businesses can use models from providers like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anthropic Claude&lt;/li&gt;
&lt;li&gt;Amazon Titan&lt;/li&gt;
&lt;li&gt;AI21 Labs&lt;/li&gt;
&lt;li&gt;Cohere&lt;/li&gt;
&lt;li&gt;Meta&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bedrock reduces operational overhead because AWS manages the infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Bedrock Knowledge Bases
&lt;/h3&gt;

&lt;p&gt;Knowledge Bases simplify the RAG workflow.&lt;/p&gt;

&lt;p&gt;Instead of manually building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vector databases&lt;/li&gt;
&lt;li&gt;embedding pipelines&lt;/li&gt;
&lt;li&gt;retrieval logic&lt;/li&gt;
&lt;li&gt;indexing systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS can manage much of the workflow automatically.&lt;/p&gt;

&lt;p&gt;This makes RAG architectures more beginner-friendly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Northstar Health Services wants a fully managed AI solution with minimal operational overhead.&lt;/p&gt;

&lt;p&gt;Instead of building custom infrastructure for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vector databases&lt;/li&gt;
&lt;li&gt;model hosting&lt;/li&gt;
&lt;li&gt;GPU management&lt;/li&gt;
&lt;li&gt;retrieval pipelines&lt;/li&gt;
&lt;li&gt;embedding systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The company uses managed AWS AI services to simplify deployment.&lt;/p&gt;

&lt;p&gt;This architecture follows a common enterprise AI workflow pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;documents stored in Amazon S3&lt;/li&gt;
&lt;li&gt;semantic search using embeddings&lt;/li&gt;
&lt;li&gt;retrieval through a knowledge base&lt;/li&gt;
&lt;li&gt;response generation through a foundation model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach helps reduce infrastructure complexity while improving scalability and maintainability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Beginner-Friendly RAG Workflow
&lt;/h2&gt;

&lt;p&gt;Here’s the simplified workflow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Company Documents
        ↓
Amazon S3
        ↓
Bedrock Knowledge Base
        ↓
Embeddings + Vector Search
        ↓
Foundation Model (LLM)
        ↓
Generated Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step-by-Step Explanation of the Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Upload Documents to Amazon S3
&lt;/h3&gt;

&lt;p&gt;The company uploads documents into an S3 bucket.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;employee handbooks&lt;/li&gt;
&lt;li&gt;training documents&lt;/li&gt;
&lt;li&gt;customer support guides&lt;/li&gt;
&lt;li&gt;product information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These documents become the knowledge source for the chatbot.&lt;/p&gt;
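
&lt;p&gt;As a rough sketch, the upload step could look like this with boto3. The bucket name, prefix, and file paths below are made-up placeholders, not values from the article:&lt;/p&gt;

```python
# Sketch: push local HR documents into S3 so a Knowledge Base can ingest them.
# Bucket, prefix, and file names are hypothetical placeholders.
import os

def object_key(prefix, path):
    """Build the S3 object key from a prefix and the file's base name."""
    return prefix.rstrip("/") + "/" + os.path.basename(path)

def upload_documents(bucket, prefix, paths):
    """Upload each local file under the given prefix. Needs AWS credentials."""
    import boto3  # assumes boto3 is installed and credentials are configured
    s3 = boto3.client("s3")
    for path in paths:
        s3.upload_file(path, bucket, object_key(prefix, path))

# Example key for a policy document (no AWS call happens here).
key = object_key("handbooks", "docs/pto_policy.pdf")
```

&lt;p&gt;Calling &lt;code&gt;upload_documents&lt;/code&gt; with a real bucket would then stage the files for ingestion.&lt;/p&gt;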

&lt;h3&gt;
  
  
  Step 2: Create a Bedrock Knowledge Base
&lt;/h3&gt;

&lt;p&gt;Amazon Bedrock Knowledge Bases connect to the S3 bucket.&lt;/p&gt;

&lt;p&gt;AWS then prepares the documents for retrieval.&lt;/p&gt;

&lt;p&gt;This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chunking documents into smaller pieces&lt;/li&gt;
&lt;li&gt;generating embeddings&lt;/li&gt;
&lt;li&gt;indexing the content for semantic search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This process is critical because AI systems work better when information is broken into smaller searchable sections.&lt;/p&gt;
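
&lt;p&gt;Knowledge Bases handle chunking for you, but the idea is easy to see in plain Python. A minimal fixed-size chunker with overlap (sizes chosen arbitrarily for illustration):&lt;/p&gt;

```python
# Illustrative fixed-size chunking with overlap. Bedrock Knowledge Bases do
# this automatically; this sketch only shows what the step accomplishes.
def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping windows of at most size characters."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
    return chunks

# A 500-character document becomes four overlapping chunks.
chunks = chunk_text("x" * 500, size=200, overlap=50)
```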

&lt;h3&gt;
  
  
  Step 3: Convert Data into Embeddings
&lt;/h3&gt;

&lt;p&gt;Embeddings are numerical representations of text.&lt;/p&gt;

&lt;p&gt;They allow the system to understand semantic meaning instead of relying only on keyword matching.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How many PTO days do I receive?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;can still retrieve a document section that says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Employees are eligible for 15 vacation days annually.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even though the wording is different, embeddings help the system understand the meaning.&lt;/p&gt;
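
&lt;p&gt;Under the hood, similarity between embeddings is often measured with cosine similarity. The three-dimensional vectors below are made up purely to illustrate the idea; real embeddings have hundreds or thousands of dimensions:&lt;/p&gt;

```python
# Cosine similarity between toy "embeddings": the PTO question lands close to
# the vacation-days chunk even though the wording differs.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

pto_question = [0.9, 0.1, 0.0]    # "How many PTO days do I receive?"
vacation_chunk = [0.8, 0.2, 0.1]  # "Employees are eligible for 15 vacation days..."
parking_chunk = [0.0, 0.1, 0.9]   # an unrelated document section

similar = cosine(pto_question, vacation_chunk)
dissimilar = cosine(pto_question, parking_chunk)
```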

&lt;h3&gt;
  
  
  Step 4: Retrieve Relevant Information
&lt;/h3&gt;

&lt;p&gt;When the user submits a question, the system searches for the most relevant document chunks.&lt;/p&gt;

&lt;p&gt;This process is called retrieval.&lt;/p&gt;

&lt;p&gt;The retrieved information is then passed to the language model.&lt;/p&gt;
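
&lt;p&gt;Conceptually, retrieval is just “score every chunk against the question and keep the best ones.” A toy sketch with made-up vectors and dot-product scoring:&lt;/p&gt;

```python
# Toy retrieval: rank chunks by dot-product score against the question vector
# and keep the top k. Vectors and texts are made up for illustration.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(question_vec, chunks, k=2):
    ranked = sorted(chunks, key=lambda c: dot(question_vec, c["vector"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

chunks = [
    {"text": "Employees are eligible for 15 vacation days annually.", "vector": [0.8, 0.2]},
    {"text": "Parking passes are issued in the lobby.", "vector": [0.1, 0.9]},
]
best = retrieve([0.9, 0.1], chunks, k=1)
```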

&lt;h3&gt;
  
  
  Step 5: Generate the Final Response
&lt;/h3&gt;

&lt;p&gt;The foundation model uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the user’s question&lt;/li&gt;
&lt;li&gt;the retrieved document context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;to generate a final response.&lt;/p&gt;

&lt;p&gt;This improves accuracy and relevance.&lt;/p&gt;
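
&lt;p&gt;With a Knowledge Base in place, retrieval and generation can be combined in one call through the Bedrock Agent Runtime’s &lt;code&gt;retrieve_and_generate&lt;/code&gt; API. In the sketch below, the knowledge base ID and model ARN are hypothetical placeholders, and the AWS call requires credentials; the prompt-assembly helper shows the manual-RAG variant of the same idea:&lt;/p&gt;

```python
# Sketch of the final generation step. The knowledge-base ID and model ARN
# passed to ask_knowledge_base are hypothetical placeholders.
def build_prompt(question, context_chunks):
    """Combine retrieved context with the user's question (manual-RAG variant)."""
    context = "\n".join(context_chunks)
    return "Context:\n" + context + "\n\nQuestion: " + question

def ask_knowledge_base(question, kb_id, model_arn):
    """One-call RAG via a Bedrock Knowledge Base. Needs AWS credentials."""
    import boto3  # assumes boto3 is installed and credentials are configured
    client = boto3.client("bedrock-agent-runtime")
    resp = client.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    )
    return resp["output"]["text"]

# No AWS call here; just the assembled prompt a model would receive.
prompt = build_prompt("How many PTO days do I receive?",
                      ["Employees are eligible for 15 vacation days annually."])
```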




&lt;h2&gt;
  
  
  Why Businesses Prefer RAG Over Fine-Tuning
&lt;/h2&gt;

&lt;p&gt;One important AWS AI design decision is understanding when to use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RAG architectures&lt;/li&gt;
&lt;li&gt;fine-tuning&lt;/li&gt;
&lt;li&gt;custom model training&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many enterprise chatbot workloads, RAG is often the better starting point because businesses can retrieve company-specific information without retraining a model.&lt;/p&gt;

&lt;p&gt;This reduces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;operational overhead&lt;/li&gt;
&lt;li&gt;training complexity&lt;/li&gt;
&lt;li&gt;infrastructure management&lt;/li&gt;
&lt;li&gt;model retraining costs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benefits of RAG
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Lower operational overhead
&lt;/h4&gt;

&lt;p&gt;Businesses avoid managing large training pipelines.&lt;/p&gt;

&lt;h4&gt;
  
  
  Faster updates
&lt;/h4&gt;

&lt;p&gt;Documents can simply be updated in S3.&lt;/p&gt;

&lt;h4&gt;
  
  
  Lower cost
&lt;/h4&gt;

&lt;p&gt;No expensive retraining process is required.&lt;/p&gt;

&lt;h4&gt;
  
  
  Better for dynamic information
&lt;/h4&gt;

&lt;p&gt;Policies and documentation frequently change.&lt;/p&gt;

&lt;p&gt;RAG allows businesses to update information quickly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Amazon Bedrock is a Strong Choice
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock is becoming one of the most important AWS AI services.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because businesses want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;managed AI services&lt;/li&gt;
&lt;li&gt;less infrastructure management&lt;/li&gt;
&lt;li&gt;faster deployment&lt;/li&gt;
&lt;li&gt;access to multiple foundation models&lt;/li&gt;
&lt;li&gt;enterprise security&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bedrock simplifies AI adoption for many organizations.&lt;/p&gt;

&lt;p&gt;This is especially valuable for companies that do not want to manage GPUs, infrastructure scaling, or custom model hosting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Beginner Mistakes in RAG Architectures
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mistake #1: Trying to Fine-Tune Everything
&lt;/h3&gt;

&lt;p&gt;Many workloads do not require custom model training.&lt;/p&gt;

&lt;p&gt;RAG is often enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #2: Using Massive Documents Without Chunking
&lt;/h3&gt;

&lt;p&gt;Large documents are difficult to retrieve efficiently.&lt;/p&gt;

&lt;p&gt;Chunking improves search quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake #3: Ignoring Cost Considerations
&lt;/h3&gt;

&lt;p&gt;Real-time AI systems can become expensive.&lt;/p&gt;

&lt;p&gt;Businesses should understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inference costs&lt;/li&gt;
&lt;li&gt;storage costs&lt;/li&gt;
&lt;li&gt;retrieval costs&lt;/li&gt;
&lt;li&gt;scaling requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Mistake #4: Treating RAG Like Simple Keyword Search
&lt;/h3&gt;

&lt;p&gt;RAG systems rely heavily on semantic understanding.&lt;/p&gt;

&lt;p&gt;Embeddings are a critical part of the architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  Future Improvements for This Architecture
&lt;/h2&gt;

&lt;p&gt;This beginner architecture could later evolve into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API-based chatbot systems&lt;/li&gt;
&lt;li&gt;customer support assistants&lt;/li&gt;
&lt;li&gt;travel recommendation systems&lt;/li&gt;
&lt;li&gt;ecommerce recommendation engines&lt;/li&gt;
&lt;li&gt;internal enterprise copilots&lt;/li&gt;
&lt;li&gt;voice-enabled AI assistants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additional AWS services could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Lambda&lt;/li&gt;
&lt;li&gt;Amazon API Gateway&lt;/li&gt;
&lt;li&gt;Amazon OpenSearch Service&lt;/li&gt;
&lt;li&gt;Amazon CloudWatch&lt;/li&gt;
&lt;li&gt;Amazon Cognito&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;RAG is rapidly becoming one of the most important AI architecture patterns.&lt;/p&gt;

&lt;p&gt;Businesses want AI systems that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retrieve accurate information&lt;/li&gt;
&lt;li&gt;reduce hallucinations&lt;/li&gt;
&lt;li&gt;work with company documents&lt;/li&gt;
&lt;li&gt;scale efficiently&lt;/li&gt;
&lt;li&gt;reduce operational overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Amazon Bedrock and Knowledge Bases make this process much more approachable for beginners.&lt;/p&gt;

&lt;p&gt;For anyone learning AWS AI, understanding RAG workflows is becoming an essential skill.&lt;/p&gt;

&lt;p&gt;The ability to explain AI systems clearly — especially real-world workflows like this — is becoming just as valuable as building the systems themselves.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;Future AWS AI workflow articles could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-Time vs Batch Inference on AWS&lt;/li&gt;
&lt;li&gt;CI/CD for Amazon SageMaker Models&lt;/li&gt;
&lt;li&gt;Building ML Feature Store Pipelines&lt;/li&gt;
&lt;li&gt;Streaming ML Workflows with Kinesis&lt;/li&gt;
&lt;li&gt;AI Recommendation Systems on AWS&lt;/li&gt;
&lt;li&gt;Multi-Agent AI Architectures with Bedrock&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you enjoyed this article, feel free to connect and follow along as I continue building beginner-friendly AWS AI and Machine Learning workflow tutorials.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>gpt3</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Most ML accuracy issues aren’t model problems.
They’re upstream SQL problems.

JOIN granularity.
Silent NULLs.
Distorted aggregations.

Sometimes the biggest ML improvement isn’t a new model — it’s a better query.</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Sun, 22 Feb 2026 03:06:01 +0000</pubDate>
      <link>https://forem.com/brittany_37606c0775530a57/most-ml-accuracy-issues-arent-model-problems-theyre-upstream-sql-problems-join-4ejo</link>
      <guid>https://forem.com/brittany_37606c0775530a57/most-ml-accuracy-issues-arent-model-problems-theyre-upstream-sql-problems-join-4ejo</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba"&gt;Your ML Model Isn’t Wrong. Your SQL Probably Is.&lt;/a&gt;&lt;/p&gt;
</description>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Your ML Model Isn’t Wrong. Your SQL Probably Is.</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Sun, 22 Feb 2026 02:54:36 +0000</pubDate>
      <link>https://forem.com/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba</link>
      <guid>https://forem.com/brittany_37606c0775530a57/your-ml-model-isnt-wrong-your-sql-probably-is-42ba</guid>
      <description>&lt;p&gt;Your churn model isn’t degrading because the algorithm is weak.&lt;/p&gt;

&lt;p&gt;It might be degrading because of a JOIN.&lt;/p&gt;

&lt;p&gt;I’ve seen teams spend weeks tuning hyperparameters, switching architectures, and debating feature importance — only to discover the real issue was upstream data logic.&lt;/p&gt;

&lt;p&gt;Before you tune the model, check your SQL.&lt;/p&gt;

&lt;h2&gt;The Problem Most Teams Misdiagnose&lt;/h2&gt;

&lt;p&gt;When performance drops, we usually suspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model drift&lt;/li&gt;
&lt;li&gt;hyperparameter tuning&lt;/li&gt;
&lt;li&gt;feature scaling&lt;/li&gt;
&lt;li&gt;algorithm choice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are valid concerns.&lt;/p&gt;

&lt;p&gt;But machine learning models don’t invent patterns.&lt;/p&gt;

&lt;p&gt;They learn from the data we feed them.&lt;/p&gt;

&lt;p&gt;If the dataset is flawed, the model will faithfully learn those flaws.&lt;/p&gt;

&lt;p&gt;Upstream data logic determines downstream model behavior.&lt;/p&gt;

&lt;h2&gt;Scenario: The “Failing” Churn Model&lt;/h2&gt;

&lt;p&gt;A churn prediction model starts underperforming.&lt;/p&gt;

&lt;p&gt;Same architecture.&lt;br&gt;
Same training pipeline.&lt;br&gt;
Same evaluation framework.&lt;/p&gt;

&lt;p&gt;Nothing obvious changed.&lt;/p&gt;

&lt;p&gt;After investigation, the issue wasn’t model complexity.&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It looks harmless.&lt;/p&gt;

&lt;p&gt;It runs fast.&lt;br&gt;
It returns data.&lt;br&gt;
It passes basic tests.&lt;/p&gt;

&lt;p&gt;But customers with multiple orders are duplicated across rows.&lt;/p&gt;

&lt;p&gt;High-activity users become unintentionally overweighted in the training dataset.&lt;/p&gt;

&lt;p&gt;The model didn’t fail.&lt;/p&gt;

&lt;p&gt;It did exactly what we told it to do.&lt;/p&gt;

&lt;h2&gt;Mistake #1: Duplicate Rows from JOINs&lt;/h2&gt;

&lt;p&gt;If your model expects one row per customer but your query returns one row per transaction, you’ve changed the learning problem.&lt;/p&gt;

&lt;p&gt;The issue isn’t SQL skill — it’s granularity awareness.&lt;/p&gt;

&lt;p&gt;A better approach:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    c.customer_id,
    COUNT(o.order_id) AS total_orders
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY c.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Aggregate intentionally before training.&lt;/p&gt;

&lt;p&gt;Define the learning unit.&lt;/p&gt;
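
&lt;p&gt;To make the granularity problem concrete, here is a self-contained sketch using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The tables mirror the queries above; the data is made up:&lt;/p&gt;

```python
# Demonstrates how a JOIN changes granularity: one row per order versus one
# row per customer. In-memory SQLite database with made-up data.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (customer_id INTEGER);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 1), (13, 2);
""")

# Naive join: customer 1 appears three times, overweighting high-activity users.
naive_rows = con.execute(
    "SELECT * FROM customers c JOIN orders o ON c.customer_id = o.customer_id"
).fetchall()

# Aggregated: exactly one row per customer, the intended learning unit.
agg_rows = con.execute(
    "SELECT c.customer_id, COUNT(o.order_id) FROM customers c "
    "LEFT JOIN orders o ON c.customer_id = o.customer_id "
    "GROUP BY c.customer_id ORDER BY c.customer_id"
).fetchall()
```

&lt;p&gt;The naive join yields four rows for two customers, so customer 1 is counted three times; the aggregated query yields exactly one row per customer.&lt;/p&gt;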

&lt;h2&gt;Mistake #2: Silent NULL Handling&lt;/h2&gt;

&lt;p&gt;NULLs rarely crash pipelines.&lt;/p&gt;

&lt;p&gt;They quietly distort them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT income
FROM customers;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If income contains NULLs and you don’t handle them deliberately, the model sees noise.&lt;/p&gt;

&lt;p&gt;Even something simple like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    COALESCE(income, 0) AS income
FROM customers;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;forces you to define intent.&lt;/p&gt;

&lt;p&gt;The important part isn’t the function.&lt;/p&gt;

&lt;p&gt;It’s the decision.&lt;/p&gt;
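
&lt;p&gt;A quick &lt;code&gt;sqlite3&lt;/code&gt; sketch (made-up data) shows the difference between letting NULLs flow through and handling them deliberately:&lt;/p&gt;

```python
# COALESCE makes NULL handling an explicit decision: NULL incomes become 0
# instead of silently flowing into the training set. Made-up data.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customer_id INTEGER, income REAL)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, 50000.0), (2, None), (3, 72000.0)])

raw = [r[0] for r in con.execute("SELECT income FROM customers")]
filled = [r[0] for r in con.execute("SELECT COALESCE(income, 0) FROM customers")]
```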

&lt;h2&gt;Mistake #3: Distorted Aggregations&lt;/h2&gt;

&lt;p&gt;Global averages can hide meaningful segmentation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT AVG(transaction_amount)
FROM transactions;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It works.&lt;br&gt;
It returns a number.&lt;br&gt;
It feels reasonable.&lt;/p&gt;

&lt;p&gt;But a model trained on broad aggregates may underperform in production because it lacks entity-level context.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    AVG(transaction_amount) AS avg_txn
FROM transactions
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Aggregation logic should reflect the model objective — not convenience.&lt;/p&gt;

&lt;p&gt;Aggregation is feature construction.&lt;/p&gt;

&lt;p&gt;Feature construction is model behavior.&lt;/p&gt;
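
&lt;p&gt;The same contrast in a runnable &lt;code&gt;sqlite3&lt;/code&gt; sketch (made-up data): the global average looks reasonable but hides two very different customers:&lt;/p&gt;

```python
# Global average versus per-customer averages: the single global number hides
# that the two customers behave very differently. Made-up data.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (customer_id INTEGER, transaction_amount REAL)")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(1, 10.0), (1, 10.0), (2, 100.0), (2, 100.0)])

global_avg = con.execute(
    "SELECT AVG(transaction_amount) FROM transactions"
).fetchone()[0]

per_customer = con.execute(
    "SELECT customer_id, AVG(transaction_amount) AS avg_txn FROM transactions "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
```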

&lt;h2&gt;The Bigger Pattern&lt;/h2&gt;

&lt;p&gt;Many ML failures blamed on “model accuracy” are actually upstream data logic issues.&lt;/p&gt;

&lt;p&gt;Strong ML systems require strong SQL foundations.&lt;/p&gt;

&lt;p&gt;Data pipelines are part of the model architecture — not just preprocessing.&lt;/p&gt;

&lt;p&gt;Strong models are built on strong data contracts.&lt;/p&gt;

&lt;h2&gt;Before You Tune the Model, Ask:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Are joins intentional?&lt;/li&gt;
&lt;li&gt;Is entity granularity clearly defined?&lt;/li&gt;
&lt;li&gt;Are aggregations aligned with the objective?&lt;/li&gt;
&lt;li&gt;Are NULLs handled deliberately?&lt;/li&gt;
&lt;li&gt;Is the training dataset versioned?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sometimes the biggest ML improvement isn’t a new model.&lt;/p&gt;

&lt;p&gt;It’s a better query.&lt;/p&gt;

&lt;p&gt;If you’d like to see the structured breakdown with examples and commentary, I documented it here:&lt;/p&gt;

&lt;p&gt;👉 GitHub repository:&lt;br&gt;
&lt;a href="https://github.com/brie1807/sql-to-ml-pipeline-mistakes" rel="noopener noreferrer"&gt;https://github.com/brie1807/sql-to-ml-pipeline-mistakes&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>mlops</category>
    </item>
    <item>
      <title>We debate neural networks, but SQL quietly shapes what a model is allowed to believe. A systems-level perspective on ML architecture.</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Mon, 16 Feb 2026 23:30:52 +0000</pubDate>
      <link>https://forem.com/brittany_37606c0775530a57/we-debate-neural-networks-but-sql-quietly-shapes-what-a-model-is-allowed-to-believe-sharing-a-17n2</link>
      <guid>https://forem.com/brittany_37606c0775530a57/we-debate-neural-networks-but-sql-quietly-shapes-what-a-model-is-allowed-to-believe-sharing-a-17n2</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/brittany_37606c0775530a57/machine-learning-starts-with-a-where-clause-52da"&gt;Machine Learning Starts With a WHERE Clause&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    <item>
      <title>Machine Learning Starts With a WHERE Clause</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Mon, 16 Feb 2026 23:26:15 +0000</pubDate>
      <link>https://forem.com/brittany_37606c0775530a57/machine-learning-starts-with-a-where-clause-52da</link>
      <guid>https://forem.com/brittany_37606c0775530a57/machine-learning-starts-with-a-where-clause-52da</guid>
      <description>&lt;p&gt;🧠 Intro (Systems-Level Tone)&lt;/p&gt;

&lt;p&gt;Most people think machine learning starts with a model.&lt;/p&gt;

&lt;p&gt;It doesn’t.&lt;/p&gt;

&lt;p&gt;It starts with a query.&lt;/p&gt;

&lt;p&gt;Before SageMaker trains.&lt;br&gt;
Before scikit-learn fits.&lt;br&gt;
Before hyperparameters are tuned.&lt;/p&gt;

&lt;p&gt;Someone writes a WHERE clause.&lt;/p&gt;

&lt;p&gt;And that clause quietly decides what the model is allowed to learn.&lt;/p&gt;

&lt;h2&gt;🏗️ SQL Is Architectural — Not Just Operational&lt;/h2&gt;

&lt;p&gt;In real ML systems, SQL isn’t just for “getting data.”&lt;/p&gt;

&lt;p&gt;It defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which records are included&lt;/li&gt;
&lt;li&gt;which time windows matter&lt;/li&gt;
&lt;li&gt;which behaviors become features&lt;/li&gt;
&lt;li&gt;which outcomes are excluded&lt;/li&gt;
&lt;li&gt;which bias is unintentionally preserved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(amount) AS lifetime_value,
    MAX(order_date) AS last_purchase
FROM transactions
WHERE order_date &amp;gt;= '2024-01-01'
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That single WHERE clause just decided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the time boundary of learning&lt;/li&gt;
&lt;li&gt;what counts as “recent behavior”&lt;/li&gt;
&lt;li&gt;whether seasonality exists&lt;/li&gt;
&lt;li&gt;whether older patterns are erased&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model hasn’t trained yet.&lt;/p&gt;

&lt;p&gt;But its worldview has already been shaped.&lt;/p&gt;
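
&lt;p&gt;A small &lt;code&gt;sqlite3&lt;/code&gt; sketch (made-up data) makes the boundary visible: one filter erases all 2023 behavior. &lt;code&gt;BETWEEN&lt;/code&gt; is used here in place of a plain date comparison, with the same effect:&lt;/p&gt;

```python
# The WHERE clause sets the model's time boundary: this filter erases all
# 2023 behavior from the training view. Made-up data.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE transactions (customer_id INTEGER, amount REAL, order_date TEXT)"
)
con.executemany("INSERT INTO transactions VALUES (?, ?, ?)", [
    (1, 20.0, "2023-06-01"),  # dropped by the filter below
    (1, 30.0, "2024-02-01"),
    (2, 40.0, "2024-03-01"),
])

filtered = con.execute(
    "SELECT customer_id, COUNT(*) FROM transactions "
    "WHERE order_date BETWEEN '2024-01-01' AND '9999-12-31' "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
```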

&lt;h2&gt;📊 Feature Engineering Happens Before Python&lt;/h2&gt;

&lt;p&gt;Most ML discussions focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;neural networks&lt;/li&gt;
&lt;li&gt;gradient descent&lt;/li&gt;
&lt;li&gt;model selection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But feature engineering often happens inside the database.&lt;/p&gt;

&lt;p&gt;Aggregations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SUM()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AVG()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;COUNT()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;window functions&lt;/li&gt;
&lt;li&gt;time-based grouping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not “data prep steps.”&lt;/p&gt;

&lt;p&gt;They are architectural decisions.&lt;/p&gt;

&lt;p&gt;If you compute &lt;code&gt;AVG(amount)&lt;/code&gt; instead of &lt;code&gt;SUM(amount)&lt;/code&gt;, you change the scale of influence.&lt;/p&gt;

&lt;p&gt;If you group by week instead of month, you change volatility.&lt;/p&gt;

&lt;p&gt;If you filter out NULLs, you may remove entire demographic signals.&lt;/p&gt;

&lt;p&gt;SQL quietly determines signal strength.&lt;/p&gt;
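&lt;p&gt;A tiny stdlib-only sketch (hypothetical order amounts) makes the AVG-versus-SUM difference concrete:&lt;/p&gt;

```python
# Same rows, two aggregations: SUM rewards volume, AVG rewards typical
# order size. A model trained on one feature ranks this customer very
# differently than a model trained on the other.
amounts = [10.0, 10.0, 500.0]  # hypothetical order amounts for one customer

total = sum(amounts)                   # SUM(amount)  -> 520.0
average = sum(amounts) / len(amounts)  # AVG(amount)  -> ~173.3

print(total, round(average, 1))
```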

&lt;h2&gt;⚠️ Data Leakage Is Often a Query Problem&lt;/h2&gt;

&lt;p&gt;Many ML failures aren’t algorithmic.&lt;/p&gt;

&lt;p&gt;They’re temporal mistakes.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SELECT *
FROM training_data
WHERE prediction_date &amp;gt; outcome_date;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If your query accidentally includes future outcomes,&lt;br&gt;
you’ve created a perfect model.&lt;/p&gt;

&lt;p&gt;And a useless one.&lt;/p&gt;

&lt;p&gt;Leakage is rarely a Python issue.&lt;/p&gt;

&lt;p&gt;It’s usually a SQL design issue.&lt;/p&gt;
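&lt;p&gt;A minimal sketch of the kind of temporal check that catches this before training (the field names here are hypothetical):&lt;/p&gt;

```python
# Temporal-leakage check: nothing used at prediction time may carry a
# timestamp later than the prediction date itself.
from datetime import date

rows = [
    {"id": 1, "prediction_date": date(2024, 5, 1), "feature_observed": date(2024, 4, 20)},
    {"id": 2, "prediction_date": date(2024, 5, 1), "feature_observed": date(2024, 5, 10)},  # leak
]

leaks = [r["id"] for r in rows if r["feature_observed"] > r["prediction_date"]]
print(leaks)  # -> [2]
```

&lt;p&gt;Running a check like this against every feature table is cheap; retraining a model that memorized the future is not.&lt;/p&gt;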

&lt;h2&gt;🧠 The System View&lt;/h2&gt;

&lt;p&gt;Machine learning is often presented as:&lt;/p&gt;

&lt;p&gt;Data → Model → Prediction&lt;/p&gt;

&lt;p&gt;In reality, it’s:&lt;/p&gt;

&lt;p&gt;Raw Data → SQL Constraints → Engineered Features → Training Dataset → Model&lt;/p&gt;

&lt;p&gt;SQL is the gatekeeper.&lt;/p&gt;

&lt;p&gt;The model only sees what the query allows.&lt;/p&gt;

&lt;h2&gt;💡 Why This Matters (Cost + Architecture)&lt;/h2&gt;

&lt;p&gt;In AWS environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bad queries increase Athena/Redshift cost&lt;/li&gt;
&lt;li&gt;poor feature aggregation increases training time&lt;/li&gt;
&lt;li&gt;overly wide datasets increase memory usage&lt;/li&gt;
&lt;li&gt;incorrect joins inflate SageMaker compute bills&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SQL decisions scale financially.&lt;/p&gt;

&lt;p&gt;Models amplify whatever SQL defines.&lt;/p&gt;

&lt;h2&gt;🛠 GitHub Companion&lt;/h2&gt;

&lt;p&gt;To make this concrete, a companion repo such as sql-ml-architecture-foundations could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queries.sql (example feature engineering queries)&lt;/li&gt;
&lt;li&gt;a small sample dataset (CSV)&lt;/li&gt;
&lt;li&gt;a README explaining how each query changes model behavior, how SQL affects cost, bias, and drift, and how this ties into ML pipelines (SageMaker, Glue, Feature Store)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The takeaway: SQL is the architectural layer that defines what a model is allowed to believe.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>sql</category>
    </item>
    <item>
      <title>Missing Data Isn’t a Cleanup Problem — It’s a Signal</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Tue, 10 Feb 2026 19:56:29 +0000</pubDate>
      <link>https://forem.com/brittany_37606c0775530a57/missing-data-isnt-a-cleanup-problem-its-a-signal-407n</link>
      <guid>https://forem.com/brittany_37606c0775530a57/missing-data-isnt-a-cleanup-problem-its-a-signal-407n</guid>
      <description>&lt;p&gt;Most machine learning courses teach you how to handle missing data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fill it.&lt;br&gt;
Drop it.&lt;br&gt;
Impute it.&lt;br&gt;
Move on.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And for exams, that’s usually enough.&lt;/p&gt;

&lt;p&gt;But production systems tell a different story.&lt;/p&gt;

&lt;p&gt;In the real world, missing data isn’t just something to fix —&lt;br&gt;
it’s often the first signal that something upstream is breaking.&lt;/p&gt;

&lt;p&gt;This is where the gap between passing exams and building durable ML systems begins.&lt;/p&gt;

&lt;h2&gt;What Exams Teach About Missing Data&lt;/h2&gt;

&lt;p&gt;In exam scenarios, missing values are treated as a technical inconvenience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace with the mean or median&lt;/li&gt;
&lt;li&gt;Forward-fill or backward-fill&lt;/li&gt;
&lt;li&gt;Drop rows with too many nulls&lt;/li&gt;
&lt;li&gt;Use models that tolerate missing values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These techniques are valid.&lt;/p&gt;

&lt;p&gt;They’re also context-free.&lt;/p&gt;

&lt;p&gt;The exam assumes the data problem already happened —&lt;br&gt;
your job is just to make the model run.&lt;/p&gt;

&lt;p&gt;Production doesn’t care that your model runs.&lt;br&gt;
It cares that it keeps running.&lt;/p&gt;

&lt;h2&gt;What Production Systems Teach Instead&lt;/h2&gt;

&lt;p&gt;In production, missing data usually shows up for a reason.&lt;/p&gt;

&lt;p&gt;And that reason matters more than the fix.&lt;/p&gt;

&lt;p&gt;Missing values often mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A pipeline failed silently&lt;/li&gt;
&lt;li&gt;An upstream service timed out&lt;/li&gt;
&lt;li&gt;A schema changed without notice&lt;/li&gt;
&lt;li&gt;A feature stopped being generated&lt;/li&gt;
&lt;li&gt;A data source degraded slowly over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are modeling problems.&lt;br&gt;
They’re system problems.&lt;/p&gt;

&lt;p&gt;If you immediately impute and move on, the model may keep producing outputs —&lt;br&gt;
but now it’s learning from broken assumptions.&lt;/p&gt;

&lt;p&gt;That’s how models degrade quietly.&lt;/p&gt;

&lt;h2&gt;Missing Data as a Diagnostic Signal&lt;/h2&gt;

&lt;p&gt;Missing values are often symptoms, not errors.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;p&gt;“How do I fill this?”&lt;/p&gt;

&lt;p&gt;Production systems force you to ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why did this feature go missing?&lt;/li&gt;
&lt;li&gt;Is the missingness random or systematic?&lt;/li&gt;
&lt;li&gt;Did this appear suddenly or gradually?&lt;/li&gt;
&lt;li&gt;Does missing data correlate with certain users, times, or regions?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those questions don’t show up on exams.&lt;/p&gt;

&lt;p&gt;They do decide whether a system survives in the real world.&lt;/p&gt;
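&lt;p&gt;One way to probe that last question, sketched with stdlib Python over hypothetical records:&lt;/p&gt;

```python
# Missingness diagnostic: if nulls cluster in one segment, the cause is
# systematic (an upstream system problem), not random noise.
from collections import defaultdict

records = [
    {"region": "us", "income": 52000},
    {"region": "us", "income": 48000},
    {"region": "eu", "income": None},
    {"region": "eu", "income": None},
    {"region": "eu", "income": 61000},
]

counts = defaultdict(lambda: [0, 0])  # region -> [missing, total]
for r in records:
    counts[r["region"]][0] += r["income"] is None
    counts[r["region"]][1] += 1

rates = {region: missing / total for region, (missing, total) in counts.items()}
print(rates)  # here the nulls are concentrated in "eu": investigate upstream
```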

&lt;h2&gt;Why Simple Methods Sometimes Win&lt;/h2&gt;

&lt;p&gt;This is why simpler techniques often outperform complex ones in production.&lt;/p&gt;

&lt;p&gt;Not because they’re smarter —&lt;br&gt;
but because they’re more stable when assumptions break.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean imputation is predictable&lt;/li&gt;
&lt;li&gt;Dropping features is transparent&lt;/li&gt;
&lt;li&gt;Rule-based fallbacks are debuggable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Complex models can hide data issues by adapting too well —&lt;br&gt;
until performance suddenly collapses weeks later.&lt;/p&gt;

&lt;h2&gt;The Real Skill Gap&lt;/h2&gt;

&lt;p&gt;Passing exams proves you know what to do when data is missing.&lt;/p&gt;

&lt;p&gt;Building durable ML systems requires knowing when missing data is trying to tell you something.&lt;/p&gt;

&lt;p&gt;That’s the gap.&lt;/p&gt;

&lt;p&gt;Exams ask: “What’s the correct technique?”&lt;br&gt;
Production asks: “Why is this happening now?”&lt;/p&gt;

&lt;p&gt;Exams optimize for correctness&lt;/p&gt;

&lt;p&gt;Production optimizes for awareness&lt;/p&gt;

&lt;p&gt;And awareness is what keeps models alive.&lt;/p&gt;

&lt;h2&gt;Final Thought&lt;/h2&gt;

&lt;p&gt;Missing data isn’t just a preprocessing step.&lt;/p&gt;

&lt;p&gt;It’s feedback.&lt;/p&gt;

&lt;p&gt;If you listen to it early, you fix pipelines.&lt;br&gt;
If you ignore it, you retrain models that are already drifting.&lt;/p&gt;

&lt;p&gt;And that’s where the real difference between learning ML&lt;br&gt;
and operating ML begins.&lt;/p&gt;


</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>featureengineering</category>
      <category>ai</category>
    </item>
    <item>
      <title>Missing Data in Machine Learning: A Practical Step-by-Step Approach</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Sat, 07 Feb 2026 19:15:07 +0000</pubDate>
      <link>https://forem.com/brittany_37606c0775530a57/missing-data-in-machine-learning-a-practical-step-by-step-approach-1p85</link>
      <guid>https://forem.com/brittany_37606c0775530a57/missing-data-in-machine-learning-a-practical-step-by-step-approach-1p85</guid>
      <description>&lt;p&gt;Missing data breaks more machine learning models than bad algorithms — not because it’s hard to detect, but because it’s easy to overthink.&lt;/p&gt;

&lt;p&gt;When datasets contain NaNs, sparse features, or incomplete records, the default reaction is often to add complexity.&lt;br&gt;
In practice, stability usually matters more than sophistication.&lt;/p&gt;

&lt;p&gt;Here’s a practical, step-by-step way to think about missing data in real ML systems.&lt;/p&gt;

&lt;h2&gt;Step 1: Assume Missing Data Is Normal&lt;/h2&gt;

&lt;p&gt;In real systems, missing data isn’t an edge case.&lt;/p&gt;

&lt;p&gt;It comes from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partially filled forms&lt;/li&gt;
&lt;li&gt;dropped logs or sensors&lt;/li&gt;
&lt;li&gt;schema changes over time&lt;/li&gt;
&lt;li&gt;merged datasets from different sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you treat missing values as rare exceptions, you’ll design fragile pipelines.&lt;br&gt;
Instead, assume they’re part of the data distribution.&lt;/p&gt;

&lt;p&gt;Goal: design preprocessing that keeps working as systems evolve.&lt;/p&gt;

&lt;h2&gt;Step 2: Identify Why the Data Is Missing (Not Just Where)&lt;/h2&gt;

&lt;p&gt;Not all missing data is random.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did users skip a field?&lt;/li&gt;
&lt;li&gt;Did a service fail?&lt;/li&gt;
&lt;li&gt;Did a logging or schema change occur?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When missingness is tied to behavior or infrastructure, it carries information — but it also introduces risk.&lt;br&gt;
Models trained on one missingness pattern may fail when that pattern changes.&lt;/p&gt;

&lt;p&gt;Goal: avoid baking temporary assumptions into your model.&lt;/p&gt;

&lt;h2&gt;Step 3: Start With the Simplest Stable Baseline&lt;/h2&gt;

&lt;p&gt;Before reaching for advanced techniques, establish a stable baseline.&lt;/p&gt;

&lt;p&gt;Simple imputation methods (mean or median):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce variance&lt;/li&gt;
&lt;li&gt;preserve feature scale&lt;/li&gt;
&lt;li&gt;behave consistently over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They don’t adapt. They don’t infer.&lt;br&gt;
That predictability is exactly what makes them reliable in production.&lt;/p&gt;

&lt;p&gt;Goal: maximize stability before optimizing accuracy.&lt;/p&gt;
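&lt;p&gt;A minimal baseline along these lines (stdlib only; the fill value is computed once on training data and then frozen, which is exactly what makes it stable):&lt;/p&gt;

```python
# Median-imputation baseline: learn one fill value from training data,
# then reuse it unchanged at inference time.
from statistics import median

train = [3.0, None, 7.0, 5.0, None, 9.0]

fill = median(v for v in train if v is not None)  # fixed at training time

def impute(values, fill_value):
    """Replace missing entries with the frozen training-time fill value."""
    return [fill_value if v is None else v for v in values]

print(impute(train, fill))  # -> [3.0, 6.0, 7.0, 5.0, 6.0, 9.0]
```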

&lt;h2&gt;Step 4: Be Careful With “Smart” Solutions&lt;/h2&gt;

&lt;p&gt;Advanced imputers, PCA, and neural networks can accidentally learn the pattern of missingness, not the underlying signal.&lt;/p&gt;

&lt;p&gt;Common failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;great validation metrics&lt;/li&gt;
&lt;li&gt;poor generalization&lt;/li&gt;
&lt;li&gt;silent performance decay after deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Complexity increases sensitivity to distribution shifts — especially when missing data is involved.&lt;/p&gt;

&lt;p&gt;Goal: avoid solutions that look good during training but fail quietly later.&lt;/p&gt;

&lt;h2&gt;Step 5: Use PCA and Deep Learning Only When the Pipeline Is Stable&lt;/h2&gt;

&lt;p&gt;Advanced techniques work best when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missingness is minimal or well-understood&lt;/li&gt;
&lt;li&gt;feature definitions are consistent&lt;/li&gt;
&lt;li&gt;training data matches production patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PCA is useful for noise reduction — not for “fixing” missing values.&lt;br&gt;
Deep learning handles missing data well only when designed explicitly for it.&lt;/p&gt;

&lt;p&gt;Goal: earn complexity after stability is proven.&lt;/p&gt;

&lt;h2&gt;Step 6: Treat Missing Data as System Feedback&lt;/h2&gt;

&lt;p&gt;Missing values often signal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;broken pipelines&lt;/li&gt;
&lt;li&gt;misaligned teams&lt;/li&gt;
&lt;li&gt;shifting assumptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feature stores help by enforcing consistent definitions and freshness.&lt;br&gt;
Monitoring helps detect when missingness patterns change.&lt;/p&gt;

&lt;p&gt;Fixing the system upstream is often more effective than adding intelligence downstream.&lt;/p&gt;

&lt;p&gt;Goal: solve the root cause, not just the symptom.&lt;/p&gt;

&lt;h2&gt;Step 7: Optimize for Long-Term Behavior, Not Short-Term Metrics&lt;/h2&gt;

&lt;p&gt;A slightly less accurate model that behaves predictably will outperform a fragile one over time.&lt;/p&gt;

&lt;p&gt;This is why simple preprocessing approaches persist in production systems:&lt;br&gt;
they survive real-world variability.&lt;/p&gt;

&lt;p&gt;Goal: choose approaches that fail gracefully.&lt;/p&gt;

&lt;h2&gt;Final Takeaway&lt;/h2&gt;

&lt;p&gt;When handling missing data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;assume it’s normal&lt;/li&gt;
&lt;li&gt;understand why it exists&lt;/li&gt;
&lt;li&gt;start simple&lt;/li&gt;
&lt;li&gt;earn complexity&lt;/li&gt;
&lt;li&gt;prioritize stability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Machine learning systems don’t fail because they’re not smart enough.&lt;br&gt;
They fail because they’re not stable enough.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Why Categorical Data Can Quietly Break Your ML Model</title>
      <dc:creator>Brittany </dc:creator>
      <pubDate>Fri, 06 Feb 2026 02:07:48 +0000</pubDate>
      <link>https://forem.com/brittany_37606c0775530a57/why-categorical-data-can-quietly-break-your-ml-model-53na</link>
      <guid>https://forem.com/brittany_37606c0775530a57/why-categorical-data-can-quietly-break-your-ml-model-53na</guid>
      <description>&lt;p&gt;Your model didn’t fail because of the algorithm.&lt;br&gt;
It failed because of how your data was represented.&lt;/p&gt;

&lt;p&gt;One of the easiest ways to break a machine learning model isn’t choosing the wrong algorithm.&lt;/p&gt;

&lt;p&gt;It’s feeding the model categorical data without thinking about how the model actually interprets numbers.&lt;/p&gt;

&lt;p&gt;This problem shows up constantly in real ML pipelines—especially when models perform well during training but behave unpredictably in production.&lt;/p&gt;

&lt;h2&gt;The Hidden Problem with Categorical Data&lt;/h2&gt;

&lt;p&gt;Machine learning models don’t understand categories.&lt;/p&gt;

&lt;p&gt;They understand numbers.&lt;/p&gt;

&lt;p&gt;When you pass categorical values like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;country&lt;/li&gt;
&lt;li&gt;product type&lt;/li&gt;
&lt;li&gt;customer segment&lt;/li&gt;
&lt;li&gt;status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you’re forced to decide how those categories are represented numerically.&lt;/p&gt;

&lt;p&gt;That decision matters more than many people realize.&lt;/p&gt;

&lt;h2&gt;Why “Just Assigning Numbers” Is Dangerous&lt;/h2&gt;

&lt;p&gt;A common mistake is encoding categories like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Red → 1&lt;/li&gt;
&lt;li&gt;Blue → 2&lt;/li&gt;
&lt;li&gt;Green → 3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To a human, these are just labels.&lt;/p&gt;

&lt;p&gt;To a model, they look like ordered values.&lt;/p&gt;

&lt;p&gt;The model now assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Green &amp;gt; Blue &amp;gt; Red&lt;/li&gt;
&lt;li&gt;the “distance” between categories has meaning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in most real-world problems, that relationship doesn’t exist.&lt;/p&gt;

&lt;p&gt;This can quietly distort model behavior without throwing errors or warnings.&lt;/p&gt;

&lt;h2&gt;What One-Hot Encoding Actually Fixes&lt;/h2&gt;

&lt;p&gt;One-hot encoding removes false relationships.&lt;/p&gt;

&lt;p&gt;Instead of a single numeric column, each category becomes its own binary feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Red → [1, 0, 0]&lt;/li&gt;
&lt;li&gt;Blue → [0, 1, 0]&lt;/li&gt;
&lt;li&gt;Green → [0, 0, 1]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now the model sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no ordering&lt;/li&gt;
&lt;li&gt;no implied distance&lt;/li&gt;
&lt;li&gt;each category as an independent signal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why one-hot encoding is often the default choice in many ML pipelines.&lt;/p&gt;
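&lt;p&gt;A minimal sketch of the idea (stdlib only; the fixed vocabulary and the all-zeros handling of unseen categories are illustrative choices, not a library API):&lt;/p&gt;

```python
# One-hot encoder: the category vocabulary is learned once from training
# data and then frozen, so training and production rows encode identically.
def one_hot(value, vocabulary):
    """Encode a category as a binary indicator vector over the vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]

vocabulary = ["Red", "Blue", "Green"]  # fixed at training time

print(one_hot("Blue", vocabulary))    # -> [0, 1, 0]
print(one_hot("Green", vocabulary))   # -> [0, 0, 1]
print(one_hot("Purple", vocabulary))  # unseen category -> [0, 0, 0]
```

&lt;p&gt;Freezing the vocabulary is the part that matters in production: if training and inference derive their columns independently, the same category can land in different positions.&lt;/p&gt;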

&lt;h2&gt;When One-Hot Encoding Helps Most&lt;/h2&gt;

&lt;p&gt;One-hot encoding works best when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;categories have no natural order&lt;/li&gt;
&lt;li&gt;models assume numeric relationships (e.g., linear models)&lt;/li&gt;
&lt;li&gt;you want to avoid injecting unintended bias&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’ll often see it used with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;linear regression&lt;/li&gt;
&lt;li&gt;logistic regression&lt;/li&gt;
&lt;li&gt;feature engineering pipelines before training&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;When One-Hot Encoding Creates New Problems&lt;/h2&gt;

&lt;p&gt;One-hot encoding isn’t free.&lt;/p&gt;

&lt;p&gt;It introduces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;high dimensionality&lt;/li&gt;
&lt;li&gt;sparse data&lt;/li&gt;
&lt;li&gt;increased memory and compute cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This becomes an issue when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;categories have high cardinality (thousands of values)&lt;/li&gt;
&lt;li&gt;you’re working with large datasets&lt;/li&gt;
&lt;li&gt;you’re deploying models with tight performance constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, encoding strategy becomes a system design decision, not just preprocessing.&lt;/p&gt;

&lt;h2&gt;Why This Matters in Real ML Systems&lt;/h2&gt;

&lt;p&gt;Encoding choices affect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model performance&lt;/li&gt;
&lt;li&gt;training time&lt;/li&gt;
&lt;li&gt;inference cost&lt;/li&gt;
&lt;li&gt;data consistency between training and production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model may look accurate in experiments and still fail quietly after deployment if encoding isn’t handled consistently.&lt;/p&gt;

&lt;p&gt;This is why many ML failures aren’t algorithm failures.&lt;/p&gt;

&lt;p&gt;They’re data representation failures.&lt;/p&gt;

&lt;h2&gt;The Bigger Takeaway&lt;/h2&gt;

&lt;p&gt;Feature engineering decisions shape how a model understands the world.&lt;/p&gt;

&lt;p&gt;One-hot encoding isn’t just a technical detail—it’s a way of protecting your model from learning relationships that don’t exist.&lt;/p&gt;

&lt;p&gt;If a model behaves strangely, don’t start by changing the algorithm.&lt;/p&gt;

&lt;p&gt;Start by asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How is this data represented?&lt;/li&gt;
&lt;li&gt;What assumptions does this encoding introduce?&lt;/li&gt;
&lt;li&gt;Is the model learning real patterns—or artificial ones?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most ML issues begin there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
