<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sandeep Salwan</title>
    <description>The latest articles on Forem by Sandeep Salwan (@sandeep_salwan).</description>
    <link>https://forem.com/sandeep_salwan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3504722%2F7f2bf869-381d-40f3-90ef-400e832ec35f.png</url>
      <title>Forem: Sandeep Salwan</title>
      <link>https://forem.com/sandeep_salwan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sandeep_salwan"/>
    <language>en</language>
    <item>
      <title>Testing My New sandeeps.tech Publish Flow</title>
      <dc:creator>Sandeep Salwan</dc:creator>
      <pubDate>Sun, 19 Apr 2026 20:20:07 +0000</pubDate>
      <link>https://forem.com/sandeep_salwan/testing-my-new-sandeepstech-publish-flow-43jk</link>
      <guid>https://forem.com/sandeep_salwan/testing-my-new-sandeepstech-publish-flow-43jk</guid>
      <description>&lt;p&gt;Testing My New sandeeps.tech Publish Flow&lt;/p&gt;

&lt;p&gt;I just finished wiring a new publishing flow for sandeeps.tech.&lt;/p&gt;

&lt;p&gt;This is a real live test from the new setup.&lt;/p&gt;

&lt;p&gt;The goal is simple.&lt;br&gt;
Write once.&lt;br&gt;
Publish cleanly.&lt;br&gt;
Keep one source of truth.&lt;/p&gt;

&lt;p&gt;Right now this test proves a few things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A plain text draft can become a blog post quickly.&lt;/li&gt;
&lt;li&gt;The post metadata is generated automatically.&lt;/li&gt;
&lt;li&gt;The same post can be sent to dev.to and Hashnode from the repo.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;More real writing soon.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>blogging</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why your AI assistant lies to you (and how to fix it)</title>
      <dc:creator>Sandeep Salwan</dc:creator>
      <pubDate>Thu, 11 Dec 2025 03:16:36 +0000</pubDate>
      <link>https://forem.com/sandeep_salwan/why-your-ai-assistant-lies-to-you-and-how-to-fix-it-322a</link>
      <guid>https://forem.com/sandeep_salwan/why-your-ai-assistant-lies-to-you-and-how-to-fix-it-322a</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live web version:&lt;/strong&gt; &lt;a href="https://ai-accuracy.vercel.app/" rel="noopener noreferrer"&gt;ai-accuracy.vercel.app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Podcast and presentation:&lt;/strong&gt; &lt;a href="https://drive.google.com/drive/u/0/folders/1fYrX0Ll6zR31Sxj0x8M4xMyObbMukrqQ" rel="noopener noreferrer"&gt;Google Drive folder&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You ask your AI assistant a simple history question about the 184th president of the United States. The model does not hesitate. It does not pause to consider that there have only been 47 presidents. Instead, it confidently generates a credible name and a fake inauguration ceremony.&lt;/p&gt;

&lt;p&gt;That behavior is called hallucination. It is still the biggest obstacle preventing AI systems from being truly reliable in high-stakes fields like healthcare, law, finance, and enterprise operations. To understand how to reduce hallucinations, we first need to understand why they happen at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scale of the Problem
&lt;/h2&gt;

&lt;p&gt;It is easy to assume these failures are rare or that the major labs have mostly solved them by now. The data says otherwise.&lt;/p&gt;

&lt;p&gt;Recent studies tested major AI models on difficult medical questions and found false information in 50% to 82% of answers. Even when researchers used prompting strategies designed to improve reliability, nearly half of the responses still included fabricated details.&lt;/p&gt;

&lt;p&gt;That creates a huge hidden business cost. A 2024 survey found that 47% of enterprise users made business decisions based on hallucinated AI-generated content. Employees now spend about 4.3 hours every week fact-checking model outputs and cleaning up mistakes from tools that were supposed to automate their work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Machine Lies
&lt;/h2&gt;

&lt;p&gt;To fix the problem, you have to understand the mechanism behind it. Large language models do not store truth the way people imagine. They are not internal databases of facts. They are prediction engines.&lt;/p&gt;

&lt;p&gt;When you ask a question, the model examines your words and estimates the most likely next word. Then it repeats that process over and over again. In a sense, it is a highly advanced version of your phone's autocomplete.&lt;/p&gt;

&lt;p&gt;If you ask about the 184th president, the model does not stop to check a history book. It identifies the pattern of a presidential biography, predicts words that sound like a biography, and prioritizes fluent language over factual accuracy.&lt;/p&gt;

&lt;p&gt;This gets worse with what researchers call long-tail knowledge deficits. If a fact appears rarely in the training data, the model struggles to recall it correctly. Researchers found that when a fact appears only once in training data, the model is statistically guaranteed to hallucinate it at least 20% of the time. Because the model is trained to be helpful, it often guesses instead of refusing to answer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Illustration source: &lt;a href="https://www.ssw.com.au/" rel="noopener noreferrer"&gt;SSW&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Old Theory Was Wrong
&lt;/h2&gt;

&lt;p&gt;For a long time, the default solution was simple: build bigger models.&lt;/p&gt;

&lt;p&gt;The assumption was that a larger model would make fewer mistakes. That turns out to be incomplete at best. Recent benchmarks show that larger and more reasoning-heavy models can still hallucinate at meaningful rates. OpenAI's &lt;code&gt;o3&lt;/code&gt; model showed a hallucination rate of 33% on specific tests. The smaller &lt;code&gt;o4-mini&lt;/code&gt; reached 48%.&lt;/p&gt;

&lt;p&gt;Raw intelligence does not automatically produce honesty. That is why engineers are moving away from brute force scaling and toward system design choices that force models to stay grounded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution 1: The Open Book Test With RAG
&lt;/h2&gt;

&lt;p&gt;One of the most practical solutions today is Retrieval-Augmented Generation, or RAG.&lt;/p&gt;

&lt;p&gt;RAG gives the model an open-book test instead of a closed-book one. Instead of guessing from memory, the system pauses, searches a trusted set of documents, retrieves the most relevant evidence, and generates a response based on that material.&lt;/p&gt;

&lt;p&gt;That makes a huge difference because the model is no longer relying only on fuzzy patterns from training. It is being asked to answer from evidence it just read.&lt;/p&gt;

&lt;p&gt;RAG is not magic, though. If the retrieved documents are outdated, incomplete, or wrong, the system will still produce confident but flawed responses. Garbage in, garbage out still applies.&lt;/p&gt;
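&lt;p&gt;The retrieve-then-generate loop is easy to sketch. Below is a minimal toy illustration in Python; the bag-of-words retriever is a stand-in for a real embedding model, and the function returns the grounded prompt rather than calling an actual LLM.&lt;/p&gt;

```python
# Minimal sketch of Retrieval-Augmented Generation (RAG).
# The bag-of-words "embedding" is a toy stand-in for a real
# embedding model; a real system would send the prompt to an LLM.

def embed(text):
    # Toy embedding: word counts over lowercase tokens.
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def similarity(a, b):
    # Overlap score between two bag-of-words vectors.
    return sum(min(a[w], b[w]) for w in a if w in b)

def retrieve(query, documents, k=2):
    # Rank trusted documents by similarity to the query.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: similarity(q, embed(d)), reverse=True)
    return ranked[:k]

def answer_with_rag(query, documents):
    evidence = retrieve(query, documents)
    # The prompt forces the model to answer from the evidence it just read.
    prompt = "Answer ONLY from this evidence:\n" + "\n".join(evidence) + "\nQ: " + query
    return prompt  # in a real system: llm.generate(prompt)

docs = [
    "There have been 47 US presidents.",
    "The Transformer was introduced in 2017.",
    "RAG retrieves documents before generating.",
]
print(answer_with_rag("How many US presidents have there been?", docs))
```

&lt;p&gt;A production system would swap the toy retriever for dense vector search, but the shape of the loop is the same: retrieve first, then generate from evidence.&lt;/p&gt;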

&lt;h2&gt;
  
  
  Solution 2: Multi-Agent Verification
&lt;/h2&gt;

&lt;p&gt;Another promising direction is to use multiple models or multiple agent roles together.&lt;/p&gt;

&lt;p&gt;In a multi-agent verification system, one model acts as the writer and another acts as the critic. The writer produces a draft. The critic looks for factual gaps, logical errors, or unsupported claims. If the critic finds a problem, it rejects the draft and forces another pass.&lt;/p&gt;

&lt;p&gt;This kind of adversarial review starts to resemble peer review in human systems. Instead of trusting one model's first answer, you build a process that expects failure and actively tries to catch it before the output reaches the user.&lt;/p&gt;

&lt;p&gt;Recent research from Yang and colleagues suggests this approach can improve accuracy on complex reasoning tasks compared to relying on a single model alone.&lt;/p&gt;
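&lt;p&gt;The writer-critic loop can be sketched in a few lines. This is a minimal outline, not any lab's published system; &lt;code&gt;call_writer&lt;/code&gt; and &lt;code&gt;call_critic&lt;/code&gt; are hypothetical stand-ins for two separate model calls.&lt;/p&gt;

```python
# Sketch of a writer/critic verification loop.
# call_writer() and call_critic() are hypothetical stand-ins
# for two separate model calls (or two roles of one model).

def call_writer(question, feedback=None):
    # Would prompt a model to draft an answer, revising
    # if the critic supplied feedback on the last pass.
    raise NotImplementedError

def call_critic(question, draft):
    # Would prompt a second model to list unsupported claims.
    # Returns an empty list when the draft passes review.
    raise NotImplementedError

def verified_answer(question, max_rounds=3,
                    writer=call_writer, critic=call_critic):
    feedback = None
    for _ in range(max_rounds):
        draft = writer(question, feedback)
        problems = critic(question, draft)
        if not problems:
            return draft       # critic approved the draft
        feedback = problems    # reject and force another pass
    return None                # refuse rather than bluff
```

&lt;p&gt;The key design choice is the final line: when the critic never approves, the system refuses instead of shipping an unverified answer.&lt;/p&gt;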

&lt;h2&gt;
  
  
  Solution 3: Calibration and Honest Uncertainty
&lt;/h2&gt;

&lt;p&gt;The most interesting shift is not just architectural. It is behavioral.&lt;/p&gt;

&lt;p&gt;Most mainstream models have been trained with reinforcement learning from human feedback, or RLHF. In practice, that often rewards answers that sound polished, useful, and confident. The side effect is obvious: confidence is rewarded more consistently than calibrated honesty.&lt;/p&gt;

&lt;p&gt;The fix is to change the reward system. Engineers can heavily penalize incorrect confident answers while giving small rewards when the model honestly admits uncertainty or refuses unsupported claims. That pushes the system toward better calibration, where its internal confidence lines up more closely with actual accuracy.&lt;/p&gt;

&lt;p&gt;This requires real human infrastructure. Companies like Scale AI employ very large reviewer networks to evaluate outputs, label failures, and teach models when they should refuse to answer instead of bluffing.&lt;/p&gt;
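&lt;p&gt;The asymmetric reward idea can be made concrete with a toy calculation. The numbers below are illustrative assumptions of mine, not a published recipe; the point is that a heavy penalty for confident wrong answers makes honest abstention the rational choice at low confidence.&lt;/p&gt;

```python
# Toy reward scheme for calibration. The exact values are
# illustrative assumptions: confident-and-wrong is punished far
# harder than an honest "I don't know" is rewarded.

def reward(answered, correct):
    if not answered:
        return 0.1   # small reward for honest abstention
    if correct:
        return 1.0   # full reward for a correct answer
    return -4.0      # heavy penalty for a confident wrong answer

def expected_reward(p_correct, answered):
    # Expected reward if the model believes it is right with
    # probability p_correct.
    if not answered:
        return reward(False, False)
    return p_correct * reward(True, True) + (1 - p_correct) * reward(True, False)

# Answering only pays off when the model is fairly sure it is right:
for p in (0.5, 0.8, 0.9):
    print(p, expected_reward(p, answered=True), expected_reward(p, answered=False))
```

&lt;p&gt;With these example values, answering beats abstaining only when the model's chance of being right is above roughly 82 percent, which is exactly the calibrated behavior the training is trying to induce.&lt;/p&gt;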

&lt;h2&gt;
  
  
  What You Can Do Right Now
&lt;/h2&gt;

&lt;p&gt;For now, the practical answer is not blind trust. It is workflow design.&lt;/p&gt;

&lt;p&gt;Treat AI output like a first draft, not a final answer. Verify important claims. Follow citations back to the source. Prefer tools and systems that expose evidence directly. If you are using AI in professional work, assume you need guardrails, review layers, and verification before the output is safe.&lt;/p&gt;

&lt;p&gt;The goal is not to eliminate hallucinations entirely. With current architectures, that is not realistic. The goal is to build systems that catch the lies before they reach you.&lt;/p&gt;

&lt;p&gt;We are not teaching the machine to be perfect. We are teaching it that saying "I don't know" is better than pretending it does.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Analysis: "Attention Is All You Need"</title>
      <dc:creator>Sandeep Salwan</dc:creator>
      <pubDate>Fri, 10 Oct 2025 15:10:33 +0000</pubDate>
      <link>https://forem.com/sandeep_salwan/analysis-attention-is-all-you-need-i9i</link>
      <guid>https://forem.com/sandeep_salwan/analysis-attention-is-all-you-need-i9i</guid>
      <description>&lt;p&gt;"Attention Is All You Need" introduced the Transformer architecture which is the foundation for modern language models. Its communication style shows the values of the AI research community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building Ethos&lt;/strong&gt;&lt;br&gt;
The paper lists eight authors from Google Brain, Google Research, and the University of Toronto. A note states that the author order is random, highlighting the researchers' focus on teamwork rather than one-upping each other. They establish authority through their affiliations and the presence of well-known researchers, so the paper never needs to argue for its own credibility. A footnote on the first page details each author's contribution; for example, it credits Noam Shazeer with proposing scaled dot-product attention. That footnote is remarkable because it reinforces authority with transparency, closely detailing each person's role, from designing the first models to accelerating research with a new codebase, and it builds trust in a community that values open collaboration. The authors do not need to boast about their credentials; their affiliations and the paper's venue do that work for them. The paper was presented at NIPS 2017, one of the field's most prestigious venues, and publication there signals that the work passed rigorous peer review, giving it an immediate stamp of approval.&lt;br&gt;
&lt;strong&gt;Purpose, Audience, and Content Level&lt;/strong&gt;&lt;br&gt;
The text informs and persuades. It presents the new Transformer architecture while arguing that this model is better than older, previously state-of-the-art methods like recurrent neural networks. The audience is machine learning experts: the paper uses technical terms (almost as dense as Jameson's postmodernism) like "sequence transduction" and "auto-regressive," and it is a challenging read without a solid understanding of linear algebra and neural networks. This specialized language allows efficient communication between researchers, though even the authors could not have made clear at the time how beneficial their model would be to the AI community.&lt;br&gt;
Additionally, the paper is written for an audience with limited time, allowing readers to skip directly to the results. The introduction follows a straightforward narrative, starting with the community's problem: the sequential nature of RNNs "precludes parallelization," signaling the bottleneck in the dominant technology. This helped readers see why the new architecture is vital. Math is the primary tool of explanation because it reads as more credible and shows that the work is tested.&lt;br&gt;
&lt;strong&gt;Context and Sources&lt;/strong&gt;&lt;br&gt;
The authors cite many sources, such as the paper on the Adam optimizer used for training. There are no ads surrounding the text. The paper's persuasive power comes from its problem-solution structure. The introduction establishes a clear problem, highlighting the "inherently sequential nature" of RNNs as a "fundamental constraint." This language frames the old method as a barrier to progress and situates the work within existing research. They treat sources as a foundation for their own ideas, citing "residual dropout" and the "Adam optimizer" as well as competing or alternative approaches. The body of the paper then delivers a solution to the problem RNNs have, focusing heavily on clarity to prevent ambiguity. Citing both foundational work and competing models like ByteNet and ConvS2S gives the paper more ethos. The conclusion is also unique: rather than a typical summary, it ends with an agenda for future research, stating, "We are excited about the future of attention-based models and plan to apply them to other tasks." The authors presented the paper so other researchers could figure out how to build on it.&lt;br&gt;
&lt;strong&gt;Format and Language&lt;/strong&gt;&lt;br&gt;
The paper follows a typical scientific structure, moving in a clear order through sections for the model, training, and results. Each part is labeled and easy to follow. The tone stays formal and focused; the writing is tight and exact. The authors use the active voice and write as "we," keeping the focus on their methods and results. The style feels deliberate, confident, and built around precision. A sample sentence: "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely." The text avoids figurative devices like metaphors and similes because the authors want the results to be reproducible. The abstract is essential, acting as a high-density executive summary; it offers proof like beating the old "28.4 BLEU" with a new state-of-the-art score of "41.8." Section headings such as "3.1 Encoder and Decoder Stacks" let readers go directly to the information they need. This reliance on quantitative benchmarks is a key rhetorical strategy because AI research establishes authority through measurable, reproducible progress. The researchers persuade by presenting hard numbers as proof of success, which is more persuasive than any descriptive language. The title "Attention Is All You Need" is atypical of academic paper titles; it makes the paper more accessible and signals that the researchers are providing a comprehensive solution.&lt;br&gt;
&lt;strong&gt;Visuals and Mathematics&lt;/strong&gt;&lt;br&gt;
Visuals are critical to the paper's argument. Figure 1 provides the famous schematic of the Transformer architecture, often referenced and discussed in AI courses. Figure 2 illustrates scaled dot-product and multi-head attention, the paper's core mathematical mechanisms. Table 2 compares the Transformer's performance to previous state-of-the-art models on BLEU scores and training cost. Figure 3 makes tricky concepts easier to grasp by visualizing what linguistic structure the model is learning. The LaTeX equation "Attention(Q, K, V) = softmax((QK^T) / sqrt(d_k)) V" also functions rhetorically, signaling to readers that the proposed mechanism is like a fundamental truth.&lt;br&gt;
&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
"Attention Is All You Need" shows the communication style of the AI research community: claims backed by empirical proof and grounded in prior work. The authors inform their audience about a new architecture and persuade readers with performance data. They even released a public code repository, a gesture of confidence that helped make the paper so foundational. The paper's dense writing prioritizes extreme precision. In this field, arguments are won with better models and superior results, as the current LLM race demonstrates. This paper presented both.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Transformer Architecture</title>
      <dc:creator>Sandeep Salwan</dc:creator>
      <pubDate>Mon, 15 Sep 2025 19:37:28 +0000</pubDate>
      <link>https://forem.com/sandeep_salwan/transformer-architecture-4eo1</link>
      <guid>https://forem.com/sandeep_salwan/transformer-architecture-4eo1</guid>
<description>&lt;p&gt;Before Transformers, sequence models were built on RNNs. Transformers are better because they solve RNNs' core problems: they are easy to parallelize, and they avoid the vanishing and exploding gradient issues.&lt;/p&gt;

&lt;p&gt;Line 1: “The person executed the swap because it was trained to do so.”&lt;br&gt;
Line 2: “The person executed the swap because it was an effective hedge.”&lt;/p&gt;

&lt;p&gt;Look carefully at those two lines. Notice how in line 1, “it” refers to the "person".&lt;br&gt;
In line 2, “it” refers to the "swap".&lt;/p&gt;

&lt;p&gt;Transformers figure out what “it” refers to entirely through numbers by discovering how related the word pairs are.&lt;/p&gt;

&lt;p&gt;These numbers are stored in tensors: a vector is a 1D tensor, a matrix is a 2D tensor, and higher-dimensional arrays are ND tensors. Embeddings for the input are based on frequency and co-occurrence of other words.&lt;/p&gt;

&lt;p&gt;This architecture relies on three key inputs: the Query matrix, the Key matrix, and the Value matrix.&lt;/p&gt;

&lt;p&gt;Imagine you are a detective. The Query is like your list of questions (Who or what is “it”?). The Key is the evidence each word carries (what every word offers as a clue). When you multiply Query by Key, you get a set of attention scores (numbers showing which clues are most relevant).&lt;/p&gt;

&lt;p&gt;A lot of math happens here: the scores are scaled (to keep them stable), normalized with softmax (so they become probabilities that sum to 1), and then used as weights.&lt;/p&gt;

&lt;p&gt;Finally, the Value is the actual content of the evidence (the meaning of each word e.g. person is living and swap is an action). Multiplying the attention weights by the Value matrix gives the final information the model carries forward to make the right decision about “it.”&lt;/p&gt;
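&lt;p&gt;The detective story above maps directly onto a few lines of NumPy. This is a toy sketch with random matrices, not a trained model, but it is the same scaled dot-product attention computation.&lt;/p&gt;

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # how relevant each clue is
    # Softmax: each row becomes a probability distribution.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights            # blend the evidence by relevance

# Toy example: 3 words, 4-dimensional embeddings (random for illustration).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, weights = scaled_dot_product_attention(Q, K, V)

# Each row of weights is a probability distribution over the 3 words.
print(weights.sum(axis=-1))
```

&lt;p&gt;In a real model, Q, K, and V come from multiplying the input embeddings by three learned weight matrices, and the whole thing runs once per attention head.&lt;/p&gt;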

&lt;p&gt;All of these abstract (Q, K, V) matrix numbers are learned through backpropagation. Training works by predicting an output, comparing it to the true label, measuring the loss (the larger the gap between predicted and actual output, the higher the loss), calculating gradients (slopes showing how much each weight contributed to that error), and then updating each weight in the opposite direction of its slope (e.g., if the gradient of the loss with respect to a weight is +2, the weight gets nudged in the -2 direction).&lt;/p&gt;
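&lt;p&gt;That recipe can be shown on the smallest possible model: a single weight. This toy example (my own illustration, not code from a real Transformer) runs the predict, measure, differentiate, update loop until the weight converges.&lt;/p&gt;

```python
# One gradient-descent step on a single weight, matching the recipe
# above: predict, measure loss, compute the slope, move against it.

def step(w, x, target, lr=0.1):
    prediction = w * x
    loss = (prediction - target) ** 2      # squared error (unused, shown for clarity)
    grad = 2 * (prediction - target) * x   # d(loss)/dw
    return w - lr * grad                   # move opposite the slope

w = 0.0
for _ in range(50):
    w = step(w, x=2.0, target=6.0)   # learn w so that w * 2 = 6
print(round(w, 3))  # approaches 3.0
```

&lt;p&gt;Real training does exactly this, just across billions of weights at once, with the gradients delivered by backpropagation.&lt;/p&gt;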

&lt;p&gt;Now you know, at a high level, how Transformers (used by today's top LLMs) work: they're just predicting the next word in a sequence.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>fullstack</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
