<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tuntufye Mwakalasya </title>
    <description>The latest articles on Forem by Tuntufye Mwakalasya  (@tmwakalasya).</description>
    <link>https://forem.com/tmwakalasya</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1507219%2F922bc9e4-0ae5-4990-87fc-768591a3493f.jpeg</url>
      <title>Forem: Tuntufye Mwakalasya </title>
      <link>https://forem.com/tmwakalasya</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tmwakalasya"/>
    <language>en</language>
    <item>
      <title>DISTRIBUTED SYSTEMS</title>
      <dc:creator>Tuntufye Mwakalasya </dc:creator>
      <pubDate>Wed, 17 Dec 2025 16:35:35 +0000</pubDate>
      <link>https://forem.com/tmwakalasya/distributed-systems-202a</link>
      <guid>https://forem.com/tmwakalasya/distributed-systems-202a</guid>
      <description>&lt;p&gt;Distributed vs Decentralized systems. &lt;/p&gt;

&lt;p&gt;The most important distinction between the two is whether the system arises from an integrative view or an expansive view.&lt;br&gt;
The integrative view applies when existing computer systems need to be connected to each other.&lt;/p&gt;

&lt;p&gt;The expansive view applies when an existing system needs to be extended with additional computers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distributed vs. Decentralized Systems
&lt;/h2&gt;

&lt;p&gt;A key distinction between distributed and decentralized systems lies in their conceptual origin, which can be understood through two primary perspectives: the &lt;strong&gt;integrative view&lt;/strong&gt; and the &lt;strong&gt;expansive view&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Concepts: Integrative vs. Expansive Views
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Integrative View:&lt;/strong&gt; This perspective arises from the need to connect and integrate pre-existing, often autonomous, computer systems. The goal is to make separate systems work together while respecting their administrative boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expansive View:&lt;/strong&gt; This perspective emerges when an existing, centrally-managed system needs to be extended or scaled out by adding more computers. The goal is to enhance the system's capabilities, such as performance or fault tolerance, while presenting a unified, single-system image to the user.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Decentralized Systems: The Integrative View
&lt;/h3&gt;

&lt;p&gt;Decentralized systems are typically formed when the processes and resources of a networked system are necessarily split across multiple, often administratively separate, computers. They are born from the desire to connect existing systems that must remain independent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Federated Learning&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In traditional machine learning (ML), massive datasets are brought to a central High-Performance Computing (HPC) cluster for model training. However, when data must remain within an organization's boundaries due to privacy or legal constraints, the training must be brought to the data.&lt;/p&gt;

&lt;p&gt;Federated learning enables this by running multiple, parallel training sessions on separate, localized datasets. Each session produces a "local model." These local models are then aggregated (e.g., through model weight averaging) to build a more generalized "global model." This approach contrasts with centralized techniques where all datasets are merged into a single location for one large training session.&lt;/p&gt;
&lt;/blockquote&gt;
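
&lt;p&gt;The aggregation step described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming each local model is a flat list of weights and that aggregation is a plain average weighted by local dataset size. The function name &lt;code&gt;federated_average&lt;/code&gt;, the participant names, and all numbers are hypothetical and do not come from any specific framework.&lt;/p&gt;

```python
def federated_average(local_weights, sample_counts):
    """Aggregate local models into a global model by weighted averaging.

    local_weights: one list of floats per participant (hypothetical flat models).
    sample_counts: number of training samples each participant used.
    """
    total = sum(sample_counts)
    n_params = len(local_weights[0])
    global_weights = [0.0] * n_params
    for weights, count in zip(local_weights, sample_counts):
        for i in range(n_params):
            # Each participant contributes in proportion to its data size.
            global_weights[i] += weights[i] * count / total
    return global_weights


# Two organizations train locally; only the weights leave their boundaries.
hospital_a = [1.0, 2.0]  # hypothetical local model trained on 100 samples
hospital_b = [3.0, 6.0]  # hypothetical local model trained on 300 samples
global_model = federated_average([hospital_a, hospital_b], [100, 300])
print(global_model)
```

&lt;p&gt;The raw datasets never appear in this code: only the locally trained weights and the sample counts are shared, which is the point of bringing the training to the data.&lt;/p&gt;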

&lt;h3&gt;
  
  
  Distributed Systems: The Expansive View
&lt;/h3&gt;

&lt;p&gt;A distributed system is a networked computer system where processes and resources are split across multiple computers to achieve scalability and reliability, while appearing to users as a single, coherent system. These systems are typically associated with the expansive view.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Google Mail&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Consider an email service like Gmail. A user configures their client with server addresses like imap.gmail.com and smtp.gmail.com, giving the impression of interacting with just two machines.&lt;/p&gt;

&lt;p&gt;In reality, with billions of users, the service is supported by a massive, complex system spread across countless computers in data centers worldwide. This distributed system is designed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ensure Scalability:&lt;/strong&gt; Handle the immense load of millions of concurrent users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provide Fault Tolerance:&lt;/strong&gt; Minimize the risk of losing mail due to hardware or software failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain Transparency:&lt;/strong&gt; Hide the underlying complexity from the end-user, who only sees a simple, unified service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system expands or shrinks based on user demand and dependability requirements.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Key Challenges in Distributed &amp;amp; Decentralized Systems
&lt;/h2&gt;

&lt;p&gt;Understanding the distinction between these systems is crucial because both face a unique set of complex challenges that are not present in single-machine systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partial Failures:&lt;/strong&gt; Unlike a centralized system that either works or fails completely, these systems can experience partial failures where one component fails while others continue to run. This makes error detection and recovery incredibly complex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Dynamism:&lt;/strong&gt; Nodes (participating computers) can join and leave the network frequently and unpredictably. This dynamism requires sophisticated, automated management and maintenance protocols.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Vulnerabilities:&lt;/strong&gt; Because these systems are networked, used by many applications, and often span multiple administrative domains, they are inherently vulnerable to a wide range of security attacks.&lt;/li&gt;
&lt;/ul&gt;
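
&lt;p&gt;To make the idea of a partial failure concrete, here is a small sketch in which one server is down while another keeps working. It assumes replicas can be modeled as plain Python callables and that a failure shows up as a &lt;code&gt;ConnectionError&lt;/code&gt;. The names &lt;code&gt;call_with_failover&lt;/code&gt;, &lt;code&gt;crashed_server&lt;/code&gt;, and &lt;code&gt;healthy_server&lt;/code&gt; are hypothetical. Real systems also have to deal with timeouts and slow nodes, which this sketch ignores.&lt;/p&gt;

```python
def call_with_failover(replicas, request):
    """Try each replica in turn and return the first successful reply.

    replicas: list of callables standing in for remote servers (hypothetical).
    Raises ConnectionError only if every replica fails.
    """
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # One component failed; the rest of the system may still be fine.
            errors.append(str(exc))
    raise ConnectionError("all replicas failed: " + "; ".join(errors))


def crashed_server(request):
    # Stands in for a node that is currently unreachable.
    raise ConnectionError("server 1 unreachable")


def healthy_server(request):
    # Stands in for a node that is still running.
    return "stored: " + request


reply = call_with_failover([crashed_server, healthy_server], "mail #42")
print(reply)
```

&lt;p&gt;The caller still gets an answer even though part of the system has failed. Masking the failure this way is simple here only because the sketch can tell immediately that a replica is down.&lt;/p&gt;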

&lt;h2&gt;
  
  
  Perspectives for Studying Distributed Systems
&lt;/h2&gt;

&lt;p&gt;To fully grasp their complexity, we study distributed systems from several different perspectives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Architectural View:&lt;/strong&gt; Focuses on the common organizational styles and patterns to understand how components interact and what dependencies exist between them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process View:&lt;/strong&gt; Examines the different forms of processes that form the software backbone, including threads, virtualization, clients, and servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Communication View:&lt;/strong&gt; Concerns the mechanisms and protocols that systems provide for exchanging data between processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coordination View:&lt;/strong&gt; Describes the fundamental coordination tasks (e.g., consensus, leader election) that happen "under the hood" to allow applications to execute correctly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Naming View:&lt;/strong&gt; Explores how processes, resources, and other entities are named and located. Effective naming schemes are crucial for accessing any entity in the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency &amp;amp; Replication View:&lt;/strong&gt; To achieve high performance and dependability, resources are often replicated. This view analyzes the challenges of keeping all copies of a resource consistent, especially after updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault Tolerance View:&lt;/strong&gt; Dives into the means for masking failures and enabling recovery. This is one of the toughest aspects, as it involves numerous trade-offs, and completely masking all failures is provably impossible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security View:&lt;/strong&gt; Focuses on how to ensure authorized access to resources and protect the system's integrity and confidentiality.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Core Goals of Distributed Systems
&lt;/h2&gt;

&lt;p&gt;Building a distributed system is complex and should only be undertaken when necessary. The primary motivations are typically centered around two goals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource Sharing:&lt;/strong&gt; A fundamental goal is to make it easy for users and applications to access and share remote resources, which can be anything from hardware (printers, disks) to software and data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distribution Transparency:&lt;/strong&gt; A key objective is to hide the complexity of the system's distribution from users and applications. The system should appear as a single, unified computing environment, masking the fact that its processes and resources are physically separated. This transparency is achieved through middleware.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A decentralized system is a networked computer system in which processes and resources are necessarily split across multiple computers.&lt;/p&gt;

&lt;p&gt;A distributed system is a networked computer system in which processes and resources are split sufficiently across multiple computers.&lt;/p&gt;

&lt;p&gt;Decentralized systems are normally associated with the integrative view of networked systems. They come into being because we want to connect systems but may be hindered by administrative boundaries. AI is a good example: training models requires massive amounts of data, and normally that data is brought to high-performance computing (HPC) clusters. When the data must stay within the boundaries of an organization, the training has to be brought to the data instead. This is called federated learning.&lt;br&gt;
Federated learning is an ML approach in which multiple separate training sessions run in parallel across large boundaries, for example geographical ones, and their results are aggregated into a generalized model (the global model). More specifically, each training session uses its own dataset and produces its own local model. The local models from the different training sessions are aggregated (for example, by model weight aggregation) into a global model during the training process. This approach stands in contrast to centralized ML techniques, where datasets are merged for one training session.&lt;/p&gt;

&lt;p&gt;Distributed systems are associated with the expansive view of networked systems. &lt;/p&gt;

&lt;p&gt;A well-known example is the use of e-mail services, such as Google Mail. What often happens is that a user logs into the system through a Web interface to read and send mail. More often, however, users configure their personal computer (such as a laptop) to make use of a specific mail client. To that end, they need to configure a few settings, such as the incoming and outgoing server. In the case of Google Mail, these are &lt;a href="http://imap.gmail.com/" rel="noopener noreferrer"&gt;imap.gmail.com&lt;/a&gt; and &lt;a href="http://smtp.gmail.com/" rel="noopener noreferrer"&gt;smtp.gmail.com&lt;/a&gt;, respectively. Logically, it seems as if these two servers handle all your mail. However, with an estimate of close to 2 billion users as of 2022, it is unlikely that only two computers can handle all their e-mail (which was estimated to be more than 300 billion messages per year, that is, some 10,000 mails per second). Behind the scenes, of course, the entire Google Mail service has been implemented and spread across many computers, jointly forming a distributed system. That system has been set up to make sure that so many users can process their mail (i.e., it ensures scalability), but also that the risk of losing mail because of failures is minimal (i.e., the system ensures fault tolerance). To the user, however, the image of just two servers is kept up (i.e., the distribution itself is highly transparent to the user). The distributed system implementing an e-mail service, such as Google Mail, typically expands (or shrinks) as dictated by dependability requirements, which in turn depend on the number of its users.&lt;/p&gt;

&lt;p&gt;Why do we make the distinction between decentralized and distributed systems?&lt;/p&gt;

&lt;p&gt;First, there are many, often unexpected, dependencies that hinder understanding the behavior of these systems. For example, distributed and decentralized systems always suffer from partial failures.&lt;/p&gt;

&lt;p&gt;Second, the participating nodes in the network come and go, which makes these systems very dynamic. This requires forms of automated management and maintenance, which in turn increase the complexity of the systems.&lt;/p&gt;

&lt;p&gt;Lastly, the fact that these systems are networked, used by many users and applications across multiple administrative zones makes them vulnerable to security attacks. Therefore understanding their behavior and the systems as a whole requires that we understand how they can be and are secured.&lt;/p&gt;

&lt;p&gt;We will study the systems from different perspectives:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Architectural view: what are the common organizations and styles? This teaches us how the various components of existing systems interact and depend on each other.&lt;/li&gt;
&lt;li&gt;Process view: the different forms of processes that occur in distributed systems, such as threads, virtualization of hardware, clients, and servers. Processes form the software backbone of distributed systems.&lt;/li&gt;
&lt;li&gt;Communication view: the facilities that distributed systems provide to exchange data between processes.&lt;/li&gt;
&lt;li&gt;Coordination view: what happens under the hood, on top of which applications are executed. It describes the fundamental coordination tasks that need to be carried out as part of the system.&lt;/li&gt;
&lt;li&gt;Naming view: to access processes and resources, we need naming, and in particular naming schemes that lead to the process, resource, or other type of entity being named.&lt;/li&gt;
&lt;li&gt;Consistency and replication view: a critical aspect of distributed systems is that they perform well in terms of efficiency and in terms of dependability. The key instrument for both aspects is replicating resources. The only problem with replication is that updates may happen, implying that all copies of a resource need to be updated as well. It is here that keeping up the appearance of a nondistributed system becomes challenging.&lt;/li&gt;
&lt;li&gt;Fault tolerance view: the means for masking failures and recovering from them. This is one of the toughest perspectives for understanding distributed systems, partly because of the many trade-offs involved and partly because completely masking failures and their recovery is provably impossible.&lt;/li&gt;
&lt;li&gt;Security view: there is no such thing as a nonsecured distributed system. This view focuses on how to ensure authorized access to resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Just because a distributed system can be built does not mean it should be. These systems are very complex, so building one is only worthwhile when it serves specific goals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Resource sharing&lt;/p&gt;

&lt;p&gt;An important goal of a distributed system is to make it easy for users and applications to access and share remote resources. Resources can be virtually anything.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Distribution Transparency&lt;/p&gt;

&lt;p&gt;A distributed system tries to hide the fact that its processes and resources are physically spread across multiple computers, possibly separated by large distances. In other words, it tries to make the distribution of its processes and resources transparent.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Inside ChatGPT: Deconstructing "Attention Is All You Need" (Part 1)</title>
      <dc:creator>Tuntufye Mwakalasya </dc:creator>
      <pubDate>Fri, 21 Nov 2025 21:49:32 +0000</pubDate>
      <link>https://forem.com/tmwakalasya/inside-chatgpt-deconstructing-attention-is-all-you-need-part-1-34ap</link>
      <guid>https://forem.com/tmwakalasya/inside-chatgpt-deconstructing-attention-is-all-you-need-part-1-34ap</guid>
      <description>&lt;p&gt;To understand how modern Large Language Models (LLMs) like ChatGPT work, we must first understand the architecture that changed everything: the Transformer. Before we dive into the complex layers, we need to establish why we moved away from previous methods and how the model initially processes language.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Predecessor: Recurrent Neural Networks (RNNs) and Their Limitations
&lt;/h3&gt;

&lt;p&gt;Before the "Attention Is All You Need" paper, the standard for processing sequential data (like text) was the Recurrent Neural Network (RNN). In an RNN, data is processed sequentially. We give the network an initial state (State 0) along with an input x1 to produce an output y1 and a hidden state. This hidden state is passed forward to the next step, allowing the network to "remember" previous inputs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnuo5bxbcze354rq8ki51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnuo5bxbcze354rq8ki51.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
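
&lt;p&gt;The step-by-step flow described above can be sketched as a toy RNN with a single scalar hidden state. This is an illustration only: the weights &lt;code&gt;w_x&lt;/code&gt;, &lt;code&gt;w_h&lt;/code&gt;, and &lt;code&gt;w_y&lt;/code&gt; are arbitrary values picked for the example, and a real RNN uses weight matrices that are learned during training.&lt;/p&gt;

```python
import math


def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, w_y=1.0):
    """One step of a toy scalar RNN (weights are arbitrary illustrative values)."""
    h = math.tanh(w_x * x + w_h * h_prev)  # new hidden state mixes input and memory
    y = w_y * h                            # output computed from the hidden state
    return y, h


def run_rnn(inputs, h0=0.0):
    """Process inputs strictly one after another, carrying the hidden state forward."""
    h = h0
    outputs = []
    for x in inputs:
        y, h = rnn_step(x, h)
        outputs.append(y)
    return outputs, h


outputs, final_state = run_rnn([1.0, 0.0, 0.0])
print(outputs)
```

&lt;p&gt;Only the first input is non-zero, yet the later outputs are non-zero too, because the hidden state carries information forward. The loop also shows why RNNs are slow on long sequences: step 3 cannot start until step 2 has finished.&lt;/p&gt;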

&lt;h1&gt;
  
  
  The Vanishing Gradient Problem
&lt;/h1&gt;

&lt;p&gt;While intuitive, RNNs suffer from severe limitations, specifically slow computation for long sequences and the &lt;strong&gt;vanishing&lt;/strong&gt; or &lt;strong&gt;exploding gradient problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To understand this, let's look at calculus, specifically the Chain Rule.&lt;/p&gt;

&lt;p&gt;If we have a composite function \(F(x) = (f \circ g)(x)\), the derivative is:&lt;/p&gt;

&lt;div class="katex-element"&gt;
\[F'(x) = f'(g(x)) \cdot g'(x)\]
&lt;/div&gt;


&lt;p&gt;In a deep neural network, backpropagation involves multiplying gradients layer by layer (like the chain rule). If we have many layers, we are essentially multiplying many numbers together.&lt;/p&gt;

&lt;p&gt;Imagine multiplying fractions like:&lt;/p&gt;


&lt;div class="katex-element"&gt;
\[\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} \times \cdots\]
&lt;/div&gt;


&lt;p&gt;As the number of layers (or time steps) increases, the product becomes vanishingly small when the factors are smaller than 1 (the gradient "vanishes") or massively large when the factors are larger than 1 (the gradient "explodes"). This makes it incredibly difficult for the model to access or learn from information that appeared early in a long sequence.&lt;/p&gt;
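
&lt;p&gt;A quick numerical check makes the effect visible. The sketch below assumes, for simplicity, that every layer or time step contributes the same gradient factor. The helper &lt;code&gt;gradient_product&lt;/code&gt; and the factors 0.5 and 1.5 are chosen purely for illustration.&lt;/p&gt;

```python
def gradient_product(factor, steps):
    """Multiply the same per-step gradient factor `steps` times (chain rule)."""
    product = 1.0
    for _ in range(steps):
        product *= factor
    return product


for steps in (1, 10, 50):
    shrinking = gradient_product(0.5, steps)  # factors below 1 shrink toward zero
    growing = gradient_product(1.5, steps)    # factors above 1 grow without bound
    print(steps, shrinking, growing)
```

&lt;p&gt;After 10 steps, the product with factor 0.5 is already below one thousandth. After 50 steps, it is far too small to drive a meaningful weight update, while the product with factor 1.5 has grown into the hundreds of millions.&lt;/p&gt;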

&lt;h2&gt;
  
  
  2. The Transformer Architecture
&lt;/h2&gt;

&lt;p&gt;The Transformer abandons recurrence entirely, relying instead on attention within an &lt;strong&gt;Encoder-Decoder&lt;/strong&gt; architecture. It processes the entire sequence at once, which addresses the slow sequential computation and long-term dependency issues of RNNs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Input Matrix
&lt;/h3&gt;

&lt;p&gt;Let's look at how data enters the model.&lt;/p&gt;

&lt;p&gt;If we have an input sentence of length 6 (Sequence Length) and a model dimension (\(d_{model}\)) of 512, our input is a matrix of size \((6, 512)\).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frq1og6xg2j7sugx3yobr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frq1og6xg2j7sugx3yobr.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each row represents a word, and the columns (length 512) represent that word as a vector. You might ask: &lt;strong&gt;Why 512 dimensions?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We need high-dimensional space to capture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Meaning&lt;/strong&gt;: What the word actually means.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syntactic Role&lt;/strong&gt;: Is it a noun, verb, or adjective?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships&lt;/strong&gt;: How it relates to other words (e.g., "King" vs "Queen").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context&lt;/strong&gt;: Multiple contexts the word can appear in.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Input Embedding
&lt;/h3&gt;

&lt;p&gt;Computers don't understand strings; they understand numbers. We take our original sentence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Your cat is a lovely cat"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, we map each word to an Input ID (its position in the vocabulary).&lt;br&gt;
We then map each ID to a vector of size 512. Note that these vectors are &lt;strong&gt;not fixed&lt;/strong&gt;; they are learned parameters that change during training to better represent the word's meaning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jatm5ou1q877p6xv9wr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jatm5ou1q877p6xv9wr.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
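
&lt;p&gt;The two mapping steps can be sketched in plain Python. This is a simplified illustration under several assumptions: the vocabulary is built from the example sentence alone, words are lowercased, and the IDs are therefore hypothetical rather than those of a real tokenizer. Random numbers stand in for the embedding values that training would learn.&lt;/p&gt;

```python
import random

random.seed(0)

D_MODEL = 512
sentence = "your cat is a lovely cat".split()

# Hypothetical toy vocabulary: each distinct word gets an integer ID.
vocab = {}
for word in sentence:
    if word not in vocab:
        vocab[word] = len(vocab)

# Embedding table: one row of D_MODEL numbers per vocabulary entry.
# Random values stand in for parameters that training would adjust.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(D_MODEL)] for _ in vocab
]

# Step 1: words to Input IDs. Step 2: IDs to rows of the embedding table.
input_ids = [vocab[word] for word in sentence]
input_matrix = [embedding_table[token_id] for token_id in input_ids]

print(input_ids)
print(len(input_matrix), len(input_matrix[0]))
```

&lt;p&gt;The result has 6 rows of 512 values each, matching the (6, 512) input matrix. Both occurrences of "cat" share the same ID and therefore the same embedding row, so the embedding alone carries no information about where a word appears in the sentence.&lt;/p&gt;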
&lt;h2&gt;
  
  
  3. Positional Encoding
&lt;/h2&gt;

&lt;p&gt;Since the Transformer processes words in parallel (not sequentially like an RNN), it has no inherent concept of "order." It doesn't know that "Your" comes before "cat." We must inject this information manually using &lt;strong&gt;Positional Encodings&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We want the model to treat words that appear close to each other as "close" mathematically. To do this, we use trigonometric functions because they represent continuous, periodic patterns. The paper's authors hypothesized that this would let the model learn to attend by relative position and extrapolate to sequences longer than those seen in training.&lt;/p&gt;

&lt;p&gt;We add this positional vector to our embedding vector. The formula used in the paper is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For even embedding dimensions (\(2i\)):&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="katex-element"&gt;
  PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;For odd dimension indices (2i+1):&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="katex-element"&gt;
  PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
&lt;/div&gt;


&lt;p&gt;This ensures that every position has a unique encoding that is consistent across training and inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Self-Attention: The Core Mechanism
&lt;/h2&gt;

&lt;p&gt;This is the "magic" of the architecture. Self-attention allows the model to relate words to each other within the same sentence. It determines how much "focus" the word "lovely" should have on the word "cat."&lt;/p&gt;

&lt;p&gt;The formula for Scaled Dot-Product Attention is:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
&lt;/div&gt;


&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Q&lt;/strong&gt; (Query): What I am looking for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;K&lt;/strong&gt; (Key): What I contain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V&lt;/strong&gt; (Value): The actual content I will pass along.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Matrix Math
&lt;/h3&gt;

&lt;p&gt;For a sequence length of 6 and dimension 512:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We multiply &lt;strong&gt;Q&lt;/strong&gt; (6 × 512) by &lt;strong&gt;K^T&lt;/strong&gt; (512 × 6).&lt;/li&gt;
&lt;li&gt;This results in a (6 × 6) matrix of scores, which we divide by √d_k (here √512 ≈ 22.6) to keep the values in a reasonable range.&lt;/li&gt;
&lt;li&gt;We apply the Softmax function to each row. This turns each row of scores into probabilities (summing to 1).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This (6 × 6) matrix captures the interaction between every word and every other word. When we multiply this by &lt;strong&gt;V&lt;/strong&gt;, we get a weighted sum of the values, where the weights are determined by the compatibility of the Query and Key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Benefits of Self-Attention:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
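&lt;strong&gt;Permutation Equivariant&lt;/strong&gt;: On its own, attention has no notion of order. It treats the sequence as a set of relationships rather than a strict list, so shuffling the input words simply shuffles the outputs the same way (which is why positional encodings are needed).&lt;/li&gt;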
&lt;li&gt;
&lt;strong&gt;Parameter Efficiency&lt;/strong&gt;: Pure self-attention requires no learnable parameters (though the linear layers surrounding it do).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-range Dependencies&lt;/strong&gt;: Words at the start of a sentence can attend to words at the end just as easily as adjacent words.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary &amp;amp; Looking Ahead
&lt;/h2&gt;

&lt;p&gt;We have successfully moved away from the sequential limitations of RNNs and embraced the parallel nature of Transformers. We've learned how to convert text into meaningful vector spaces, inject order using positional encoding, and, most importantly, derive the mathematical foundation of how words "pay attention" to each other using Queries, Keys, and Values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But there is a catch.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The mechanism we just described, a single pass of softmax(QK^T / √d_k)V, tends to focus on one type of relationship at a time. For example, it might focus heavily on syntactic relationships (such as subject-verb agreement) but miss semantic nuances (like sarcasm or references).&lt;/p&gt;

&lt;p&gt;Real-world language is too complex for a single "gaze." To build a model like ChatGPT, we need it to look at the sentence through multiple lenses simultaneously.&lt;/p&gt;

&lt;p&gt;In Part 2, we will take the self-attention mechanism and clone it, creating &lt;strong&gt;Multi-Head Attention&lt;/strong&gt;. We will then see how the attention outputs are processed through &lt;strong&gt;Feed-Forward Networks&lt;/strong&gt; to finally construct the complete Transformer block.&lt;/p&gt;

&lt;p&gt;Stay tuned.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>ai</category>
      <category>deeplearning</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>The Two Lists That Define Every Software Project</title>
      <dc:creator>Tuntufye Mwakalasya </dc:creator>
      <pubDate>Fri, 14 Nov 2025 21:43:22 +0000</pubDate>
      <link>https://forem.com/tmwakalasya/the-two-lists-that-define-every-software-project-2nhk</link>
      <guid>https://forem.com/tmwakalasya/the-two-lists-that-define-every-software-project-2nhk</guid>
      <description>&lt;p&gt;If you’ve ever been near a software developer, you’ve probably heard a frustrated groan followed by the classic phrase: &lt;strong&gt;"But it worked on my machine!"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This, and a million other frustrations like &lt;code&gt;File Not Found&lt;/code&gt; or &lt;code&gt;Symbol Not Found&lt;/code&gt;, often boil down to one of the most misunderstood parts of software engineering. It’s not a bug in the code, but a problem with the &lt;em&gt;lists&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The problem is that a computer is not a mind reader. It’s an incredibly fast, precise, and literal-minded robot. To get it to build your software, you have to give it two separate things: a &lt;strong&gt;Recipe&lt;/strong&gt; and a &lt;strong&gt;Shopping List&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And the central conflict of all software development is that &lt;strong&gt;the robot &lt;em&gt;never&lt;/em&gt; reads the Recipe to figure out the Shopping List.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  👨‍🍳 The Metaphor: The Robot Chef
&lt;/h2&gt;

&lt;p&gt;Imagine you have a robot chef. Its job is to bake a cake.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Recipe:&lt;/strong&gt; This is your &lt;strong&gt;source code&lt;/strong&gt;. It's the "ground truth" of what needs to be done. It might say "Step 1: Mix flour, eggs, and sugar."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Shopping List:&lt;/strong&gt; This is your &lt;strong&gt;build file&lt;/strong&gt; (a &lt;code&gt;Makefile&lt;/code&gt;, &lt;code&gt;BUILD.bazel&lt;/code&gt;, &lt;code&gt;package.json&lt;/code&gt;, etc.). It's the list of ingredients you &lt;em&gt;claim&lt;/em&gt; are needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Robot Chef:&lt;/strong&gt; This is your &lt;strong&gt;build tool&lt;/strong&gt; (like &lt;code&gt;make&lt;/code&gt;, &lt;code&gt;Bazel&lt;/code&gt;, or &lt;code&gt;npm&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The robot's process is simple and unforgiving:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; It reads &lt;em&gt;only&lt;/em&gt; your &lt;strong&gt;Shopping List&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; It goes to the store and gathers &lt;em&gt;every single item&lt;/em&gt; on that list.&lt;/li&gt;
&lt;li&gt; It returns to the kitchen and tries to follow the &lt;strong&gt;Recipe&lt;/strong&gt; using &lt;em&gt;only&lt;/em&gt; the items it just bought.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This simple process can fail in two major ways.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scenario 1: The Broken Build (The Missing Ingredient)
&lt;/h3&gt;

&lt;p&gt;This is the most common error.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Shopping List:&lt;/strong&gt; You write "flour, sugar."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Robot's Action:&lt;/strong&gt; The robot fetches flour and sugar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Recipe (Code):&lt;/strong&gt; "Step 1: Mix flour, eggs, and sugar."&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Result: FAILURE.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The robot stops, drops the bowl, and reports &lt;code&gt;FATAL ERROR: Ingredient 'eggs' not found.&lt;/code&gt; It doesn't matter that "eggs" are obviously needed. They weren't on the list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is a Missing Dependency.&lt;/strong&gt; In technical terms, the &lt;strong&gt;Actual Dependency Graph&lt;/strong&gt; (we'll call it Ga) included an edge from &lt;code&gt;Cake&lt;/code&gt; to &lt;code&gt;Eggs&lt;/code&gt;, but the &lt;strong&gt;Declared Dependency Graph&lt;/strong&gt; (Gd) did not.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The build fails because your declared list was not a perfect representation of reality. You told the build tool a lie, and the compiler caught it.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;h3&gt;
  
  
  Scenario 2: The Slow Build (The Useless Ingredient)
&lt;/h3&gt;

&lt;p&gt;This is a more subtle but equally important problem.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Shopping List:&lt;/strong&gt; You write "flour, sugar, eggs... and cabbage."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Robot's Action:&lt;/strong&gt; The robot fetches flour, sugar, eggs, and a head of cabbage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Recipe (Code):&lt;/strong&gt; "Step 1: Mix flour, eggs, and sugar..." The cake bakes perfectly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Result: SUCCESS... but.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The build &lt;em&gt;succeeded&lt;/em&gt;. The cake is delicious. But you now have a head of cabbage rotting on the counter, and the robot's shopping trip took twice as long.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is Overapproximation.&lt;/strong&gt; Your build is "correct" because every actual dependency is declared (Ga is a subset of Gd). But it's inefficient. You've introduced &lt;strong&gt;build bloat&lt;/strong&gt;. The build tool wasted time and resources fetching, compiling, and linking a library (&lt;code&gt;cabbage&lt;/code&gt;) that was never used. In a large project, unused dependencies like this can add up to the difference between a 2-minute build and a 40-minute build.&lt;/p&gt;
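&lt;p&gt;Scenarios 1 and 2 boil down to two set differences between the actual dependencies (Ga) and the declared ones (Gd). Here is a minimal Go sketch of that comparison. The function names &lt;code&gt;diff&lt;/code&gt; and &lt;code&gt;checkDeps&lt;/code&gt; are hypothetical, and real tools work on a whole graph of targets rather than one flat list.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// diff returns the items present in a but absent from b, sorted for stable output.
func diff(a, b []string) []string {
	inB := make(map[string]bool, len(b))
	for _, item := range b {
		inB[item] = true
	}
	var out []string
	for _, item := range a {
		if !inB[item] {
			out = append(out, item)
		}
	}
	sort.Strings(out)
	return out
}

// checkDeps compares what the code uses (Ga) with what the build file lists (Gd).
// missing = Ga minus Gd (broken build); unused = Gd minus Ga (build bloat).
func checkDeps(actual, declared []string) (missing, unused []string) {
	return diff(actual, declared), diff(declared, actual)
}

func main() {
	actual := []string{"flour", "eggs", "sugar"} // what the recipe really uses

	missing, unused := checkDeps(actual, []string{"flour", "sugar"})
	fmt.Println("scenario 1 missing:", missing, "unused:", unused)

	missing, unused = checkDeps(actual, []string{"flour", "sugar", "eggs", "cabbage"})
	fmt.Println("scenario 2 missing:", missing, "unused:", unused)
}
```

&lt;p&gt;Scenario 1 reports &lt;code&gt;eggs&lt;/code&gt; as missing, and Scenario 2 reports &lt;code&gt;cabbage&lt;/code&gt; as unused. A healthy build file leaves both lists empty.&lt;/p&gt;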




&lt;h3&gt;
  
  
  Scenario 3: The "It Works On My Machine" Nightmare
&lt;/h3&gt;

&lt;p&gt;This is the most complex problem, and it’s where our metaphor gets &lt;em&gt;really&lt;/em&gt; useful.&lt;/p&gt;

&lt;p&gt;Let's say you're making two things: a &lt;strong&gt;Chocolate Cake&lt;/strong&gt; and &lt;strong&gt;Brownies&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Chocolate Cake:&lt;/strong&gt; Your Shopping List correctly lists "Cake Mix." This "Cake Mix" &lt;em&gt;happens&lt;/em&gt; to include a bag of "Cocoa Powder" inside.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Brownies:&lt;/strong&gt; Your Recipe for brownies &lt;em&gt;actually&lt;/em&gt; needs "Cocoa Powder," but you &lt;strong&gt;forget&lt;/strong&gt; to put it on the Brownie Shopping List.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You build the Chocolate Cake first. The robot buys the "Cake Mix" and leaves the "Cocoa Powder" on the counter.&lt;br&gt;
Then you build the Brownies. The robot (which &lt;em&gt;should&lt;/em&gt; fail) finds no "Cocoa Powder" on the Brownie Shopping List, but then sees the leftover "Cocoa Powder" on the counter from the &lt;em&gt;last&lt;/em&gt; build. It shrugs, uses it, and the build succeeds!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then, the disaster:&lt;/strong&gt; Your co-worker, trying to be efficient, switches the Chocolate Cake recipe to "Vanilla Cake Mix."&lt;/p&gt;

&lt;p&gt;Suddenly, your &lt;strong&gt;Brownie build breaks&lt;/strong&gt;. You're staring at your screen, shouting "But I didn't even &lt;em&gt;touch&lt;/em&gt; the Brownie code!"&lt;/p&gt;

&lt;p&gt;You were relying on a "ghost." This is a &lt;strong&gt;Transitive Dependency&lt;/strong&gt; nightmare.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your &lt;code&gt;Brownie_App&lt;/code&gt; (Recipe) had an &lt;em&gt;actual&lt;/em&gt; dependency on &lt;code&gt;Cocoa_Library&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;You never &lt;em&gt;declared&lt;/em&gt; it.&lt;/li&gt;
&lt;li&gt;It only worked because you declared a dependency on &lt;code&gt;Cake_Mix_Library&lt;/code&gt;, which &lt;em&gt;transitively&lt;/em&gt; depended on &lt;code&gt;Cocoa_Library&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The moment &lt;code&gt;Cake_Mix_Library&lt;/code&gt; no longer needed &lt;code&gt;Cocoa_Library&lt;/code&gt;, your build failed. Modern build systems like Bazel are designed to prevent this. They enforce &lt;strong&gt;strict dependency checking&lt;/strong&gt;, essentially "cleaning the counter" between every single step to ensure you're not using ingredients you didn't explicitly ask for.&lt;/p&gt;
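&lt;p&gt;To make the "ghost" visible, a checker can distinguish dependencies that are declared directly from ones that are only reachable through another library. The Go sketch below is a simplified illustration of that idea, not how Bazel actually implements strict dependency checking. The &lt;code&gt;graph&lt;/code&gt; type and the &lt;code&gt;classify&lt;/code&gt; function are invented for this example.&lt;/p&gt;

```go
package main

import "fmt"

// graph maps each target to the dependencies its build file lists directly.
type graph map[string][]string

// reachable reports whether dep can be reached from target by following declared edges.
func reachable(g graph, target, dep string, seen map[string]bool) bool {
	if seen[target] {
		return false
	}
	seen[target] = true
	for _, d := range g[target] {
		if d == dep || reachable(g, d, dep, seen) {
			return true
		}
	}
	return false
}

// classify describes how target obtains dep: "direct", "transitive only", or "missing".
func classify(g graph, target, dep string) string {
	for _, d := range g[target] {
		if d == dep {
			return "direct"
		}
	}
	if reachable(g, target, dep, map[string]bool{}) {
		return "transitive only"
	}
	return "missing"
}

func main() {
	declared := graph{
		"Brownie_App":      {"Cake_Mix_Library"},
		"Cake_Mix_Library": {"Cocoa_Library"},
	}
	fmt.Println(classify(declared, "Brownie_App", "Cocoa_Library"))

	// The co-worker's change: the cake mix no longer needs cocoa.
	declared["Cake_Mix_Library"] = nil
	fmt.Println(classify(declared, "Brownie_App", "Cocoa_Library"))
}
```

&lt;p&gt;The first call reports "transitive only", which is the fragile state: the build works by accident. After the co-worker's change the same dependency becomes "missing". A strict checker would reject the first state too, forcing &lt;code&gt;Brownie_App&lt;/code&gt; to declare &lt;code&gt;Cocoa_Library&lt;/code&gt; itself.&lt;/p&gt;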




&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Writing software isn't just about the &lt;strong&gt;Recipe&lt;/strong&gt; (code). It's about meticulously maintaining the &lt;strong&gt;Shopping List&lt;/strong&gt; (build file).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you miss an item, your build &lt;strong&gt;breaks&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If you add extra items, your build &lt;strong&gt;slows down&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If you "borrow" items from another recipe's list, your build becomes a &lt;strong&gt;fragile, ticking time bomb&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good developer is a good chef. A &lt;em&gt;great&lt;/em&gt; developer writes a perfect shopping list every single time.&lt;/p&gt;

</description>
      <category>buildtools</category>
      <category>devops</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Building a Mini Build System in Go: Understanding How Bazel Works Under the Hood</title>
      <dc:creator>Tuntufye Mwakalasya </dc:creator>
      <pubDate>Sat, 08 Nov 2025 00:27:40 +0000</pubDate>
      <link>https://forem.com/tmwakalasya/building-a-mini-build-system-in-go-understanding-how-bazel-works-under-the-hood-3gp6</link>
      <guid>https://forem.com/tmwakalasya/building-a-mini-build-system-in-go-understanding-how-bazel-works-under-the-hood-3gp6</guid>
      <description>&lt;p&gt;Imagine you're running a restaurant kitchen. You have recipes (build targets), ingredients (source files), and some dishes that need other dishes to be ready first (dependencies). How do you organize this chaos? That's exactly what build systems like Bazel, Make, and our Mini-Bazel do for software.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Build Systems Matter (The Busy Kitchen Problem)
&lt;/h2&gt;

&lt;p&gt;Picture this: You're making a sandwich. You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bread (must be toasted first)&lt;/li&gt;
&lt;li&gt;Butter (must be softened)&lt;/li&gt;
&lt;li&gt;Cheese (must be sliced)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the bread is already toasted from earlier, why toast it again? &lt;br&gt;
If someone changed the cheese type, you need to remake the sandwich.&lt;br&gt;
This is exactly what build systems figure out for code!&lt;/p&gt;

&lt;p&gt;In our Mini-Bazel example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;my_app&lt;/code&gt; is like the final sandwich&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;utils.a&lt;/code&gt; is like the toasted bread (a prerequisite)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;.go&lt;/code&gt; files are our raw ingredients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's break down our &lt;code&gt;BUILD.mini&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Think of this as a recipe book where each recipe has:
# - A name (what we're making)
# - Instructions (the command to run)
# - Ingredients (source files)
# - Other dishes needed first (dependencies)

- name: "utils.a"          # Like "Toasted Bread"
  cmd: "go build..."       # "Put bread in toaster for 2 min"
  srcs: ["utils/greet.go"] # "You need: sliced bread"
  deps: []                 # "No other dishes needed first"

- name: "my_app"           # Like "Complete Sandwich"
  cmd: "go build..."       # "Assemble all parts"
  srcs: ["main.go"]        # "You need: plate, lettuce"
  deps: ["utils.a"]        # "First make: toasted bread"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Two Big Questions Every Build System Answers
&lt;/h2&gt;

&lt;p&gt;Remember our kitchen? Every build system (chef) must answer:&lt;/p&gt;

&lt;h3&gt;
  
  
  Question 1: "What order should I cook things?" (The Scheduler)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Option A - The Prep Cook (Topological/Make-style):&lt;/strong&gt;&lt;br&gt;
"I'll list everything needed for dinner, sort them by dependencies, &lt;br&gt;
then cook in that exact order."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pro: Simple and predictable&lt;/li&gt;
&lt;li&gt;Con: Can't handle surprises (dynamic dependencies)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Option B - The Adaptive Chef (Restarting/Excel-style):&lt;/strong&gt;&lt;br&gt;
"I'll start cooking, and if I realize I need something not ready, &lt;br&gt;
I'll switch to making that first."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pro: Handles surprises well&lt;/li&gt;
&lt;li&gt;Con: Might restart dishes multiple times&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Option C - The Multitasker (Suspending/Shake-style):&lt;/strong&gt;&lt;br&gt;
"I'll start multiple dishes, pause any that need something else, &lt;br&gt;
and resume when ready."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pro: Very efficient&lt;/li&gt;
&lt;li&gt;Con: More complex to implement&lt;/li&gt;
&lt;/ul&gt;
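&lt;p&gt;To make Option A concrete, here is a minimal sketch of a topological scheduler using a depth-first search. The &lt;code&gt;Target&lt;/code&gt; struct and the &lt;code&gt;topoOrder&lt;/code&gt; function are assumptions made for illustration (only the name and deps fields of a BUILD.mini entry are modeled), not the final Mini-Bazel code.&lt;/p&gt;

```go
package main

import "fmt"

// Target mirrors one BUILD.mini entry. Only the fields needed for
// ordering are included; the struct itself is an assumption.
type Target struct {
	Name string
	Deps []string
}

// topoOrder returns target names so that every dependency appears
// before the targets that need it. It reports an error on a cycle
// or on a dependency that has no rule.
func topoOrder(targets []Target) ([]string, error) {
	byName := make(map[string]Target, len(targets))
	for _, t := range targets {
		byName[t.Name] = t
	}

	const (
		unvisited = iota
		inProgress
		done
	)
	state := make(map[string]int, len(targets)) // zero value is unvisited
	var order []string

	var visit func(name string) error
	visit = func(name string) error {
		t, ok := byName[name]
		if !ok {
			return fmt.Errorf("no rule for target %q", name)
		}
		switch state[name] {
		case done:
			return nil
		case inProgress:
			return fmt.Errorf("dependency cycle through %q", name)
		}
		state[name] = inProgress
		for _, dep := range t.Deps {
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = done
		order = append(order, name)
		return nil
	}

	for _, t := range targets {
		if err := visit(t.Name); err != nil {
			return nil, err
		}
	}
	return order, nil
}

func main() {
	targets := []Target{
		{Name: "my_app", Deps: []string{"utils.a"}},
		{Name: "utils.a"},
	}
	order, err := topoOrder(targets)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(order) // [utils.a my_app]
}
```

&lt;p&gt;The whole order is computed before anything runs, which is why this style is simple and predictable, and also why it cannot react to a dependency that is only discovered while a command is running.&lt;/p&gt;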

&lt;h3&gt;
  
  
  Question 2: "How do I know if something needs remaking?" (The Rebuilder)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Check the Timer (Timestamp/Make-style):&lt;/strong&gt;&lt;br&gt;
"If ingredients arrived after the dish was made, remake it."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Taste Test (Verifying Traces/Shake-style):&lt;/strong&gt;&lt;br&gt;
"I remember exactly what went into this dish. If any of those &lt;br&gt;
ingredients changed, remake it."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recipe Card System (Constructive Traces/Bazel-style):&lt;/strong&gt;&lt;br&gt;
"I keep cards saying 'these exact ingredients make this exact dish'. &lt;br&gt;
If someone else made it already, just use theirs!"&lt;/p&gt;
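&lt;p&gt;The recipe-card idea can be sketched as a cache keyed by a hash of the command plus the contents of its source files. This is only an illustration: &lt;code&gt;traceKey&lt;/code&gt;, the in-memory &lt;code&gt;cache&lt;/code&gt; map, and the file contents are made up for the example, and the command string is the same placeholder used in BUILD.mini above. It is not how Bazel itself computes its keys.&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// traceKey hashes a command together with the contents of its inputs.
// Two builds with the same key used the same recipe and ingredients.
func traceKey(cmd string, inputs map[string][]byte) string {
	names := make([]string, 0, len(inputs))
	for name := range inputs {
		names = append(names, name)
	}
	sort.Strings(names) // map order is random, so fix an order first

	h := sha256.New()
	// Length prefixes keep adjacent fields from running together.
	fmt.Fprintf(h, "cmd:%d:%s\n", len(cmd), cmd)
	for _, name := range names {
		data := inputs[name]
		fmt.Fprintf(h, "src:%d:%s:%d:", len(name), name, len(data))
		h.Write(data)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// key -> stored output; a stand-in for a shared artifact store.
	cache := map[string]string{}

	v1 := map[string][]byte{"utils/greet.go": []byte("package utils // v1")}
	cache[traceKey("go build...", v1)] = "utils.a built from v1"

	// Same ingredients: the key matches, so the cached dish is reused.
	_, hit := cache[traceKey("go build...", v1)]
	fmt.Println("unchanged source, cache hit:", hit) // true

	// Changed ingredient: a different key, so the target must be rebuilt.
	v2 := map[string][]byte{"utils/greet.go": []byte("package utils // v2")}
	_, hit = cache[traceKey("go build...", v2)]
	fmt.Println("edited source, cache hit:", hit) // false
}
```

&lt;p&gt;Because the key depends only on content, a card written by one machine can be reused by another, which a timestamp comparison cannot offer.&lt;/p&gt;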

&lt;h2&gt;
  
  
  Here's how our Mini-Bazel starts (Part 1: Loading recipes)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
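&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Needs the os and log packages plus a YAML library for yaml.Unmarshal.
func main() {
    // Step 1: Read the recipe book
    file, err := os.ReadFile("BUILD.mini")
    if err != nil {
        log.Fatalf("reading BUILD.mini: %v", err)
    }

    // Step 2: Understand what each recipe means
    var targets []Target // Our list of recipes
    if err := yaml.Unmarshal(file, &amp;amp;targets); err != nil { // Parse the recipes
        log.Fatalf("parsing BUILD.mini: %v", err)
    }

    // Step 3: Now we know what we CAN cook
    // (Part 2 will figure out what we SHOULD cook)
}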
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Coming in Part 2: The Cooking Phase
&lt;/h2&gt;

&lt;p&gt;Now that our Mini-Bazel can read recipes, we need to teach it to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a dependency graph (figure out cooking order)&lt;/li&gt;
&lt;li&gt;Detect what needs rebuilding (check if ingredients changed)&lt;/li&gt;
&lt;li&gt;Execute builds in the right order (actually cook!)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We'll implement a simple scheduler (topological sort) and a basic &lt;br&gt;
rebuilder (timestamp checking), creating a Make-like build system &lt;br&gt;
in Go!&lt;/p&gt;

</description>
      <category>go</category>
      <category>systemdesign</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Understanding the CAP Theorem Through a Hands-On Simulation in Golang</title>
      <dc:creator>Tuntufye Mwakalasya </dc:creator>
      <pubDate>Mon, 16 Dec 2024 01:29:34 +0000</pubDate>
      <link>https://forem.com/tmwakalasya/understanding-the-cap-theorem-through-a-hands-on-simulation-in-golang-372h</link>
      <guid>https://forem.com/tmwakalasya/understanding-the-cap-theorem-through-a-hands-on-simulation-in-golang-372h</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Distributed systems are the backbone of modern software, enabling scalability and fault tolerance across networks. However, designing such systems comes with challenges, especially when ensuring reliability during failures. One fundamental principle in distributed systems is the CAP Theorem, which highlights the trade-offs every system must make.&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore the CAP Theorem through a practical simulation in Golang, showing how consistency, availability, and partition tolerance interact when a cluster loses connectivity.&lt;/p&gt;

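&lt;p&gt;&lt;strong&gt;What is the CAP Theorem?&lt;/strong&gt;&lt;br&gt;
The CAP Theorem, proposed as a conjecture by Eric Brewer in 2000 and proved by Seth Gilbert and Nancy Lynch in 2002, states that a distributed system can guarantee at most two of the following three properties at the same time:&lt;/p&gt;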

&lt;p&gt;&lt;strong&gt;Consistency (C):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every read returns the most recent write, so all nodes appear to hold the same data at any moment.&lt;br&gt;
Example: After you update a database, a read from any replica reflects that update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Availability (A):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every request to a working node receives a non-error response, even during failures, though the response may not reflect the latest write.&lt;br&gt;
Example: A service that always returns a result, even if it’s stale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition Tolerance (P):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system continues operating despite network partitions.&lt;br&gt;
Example: Nodes can still process requests independently if communication between them is lost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; Partitions cannot be ruled out in a real network, so the practical choice arises during a partition: the system must give up either Consistency or Availability, because it cannot keep both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building the Simulation&lt;/strong&gt;&lt;br&gt;
To understand CAP trade-offs, we will build a simple simulation in Golang. The system consists of:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nodes:&lt;/strong&gt; Represent individual components of the system, each with a counter.&lt;br&gt;
&lt;strong&gt;Cluster:&lt;/strong&gt; Manages the nodes and synchronizes their state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node Struct&lt;/strong&gt;&lt;br&gt;
Each node has:&lt;br&gt;
A name for identification.&lt;br&gt;
A counter to hold data.&lt;br&gt;
A partitioned flag to indicate if the node is disconnected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Node struct {
    name        string
    counter     int
    partitioned bool
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cluster Struct&lt;/strong&gt;&lt;br&gt;
The cluster organizes multiple nodes and provides synchronization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type Cluster struct {
    nodes []*Node
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Core Methods&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Write Operation&lt;/strong&gt;&lt;br&gt;
The Write method updates a node’s counter only if it is not partitioned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (n *Node) Write(value int) {
    if !n.partitioned {
        n.counter = value
        fmt.Printf("Node %s updated to %d\n", n.name, n.counter)
    } else {
        fmt.Printf("Node %s is partitioned, Write failed\n", n.name)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Synchronization&lt;/strong&gt;&lt;br&gt;
The syncNodes method propagates an update from one node to the others in the cluster. It skips the node that was just updated and any node that is partitioned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
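&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (c *Cluster) syncNodes(updatedNode *Node) {
    for _, node := range c.nodes {
        if node.name == updatedNode.name {
            continue // Skip the node that was just written
        }
        if node.partitioned {
            // Report the skip so the output shows which nodes fell behind
            fmt.Printf("Node %s is partitioned, sync skipped\n", node.name)
            continue
        }
        node.counter = updatedNode.counter
        fmt.Printf("Node %s synchronized to %d\n", node.name, updatedNode.counter)
    }
}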
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Simulating CAP Trade-Offs&lt;/strong&gt;&lt;br&gt;
Scenario 1: Consistency&lt;br&gt;
When all nodes are connected, synchronization ensures that every node has the same data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodeA.Write(10)
cluster.syncNodes(nodeA)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Node Node A updated to 10
Node Node B synchronized to 10
Node Node C synchronized to 10
Node Node D synchronized to 10

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenario 2: Partition Tolerance&lt;br&gt;
When a node is partitioned, it cannot participate in synchronization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodeB.partitioned = true
nodeA.Write(20)
cluster.syncNodes(nodeA)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Node Node A updated to 20
Node Node B is partitioned, sync skipped
Node Node C synchronized to 20
Node Node D synchronized to 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



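&lt;p&gt;Scenario 3: Availability&lt;br&gt;
In this simulation a partitioned node rejects writes, so it gives up availability rather than risk diverging from the rest of the cluster. Its counter also stays at the last value it synchronized, which is now stale:&lt;br&gt;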
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodeB.Write(30)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Node Node B is partitioned, Write failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
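&lt;p&gt;For contrast, here is a sketch of the opposite choice, where a partitioned node stays available and accepts the write locally. &lt;code&gt;WriteLocal&lt;/code&gt; is a hypothetical method added only for this comparison; it is not part of the simulation above, and the starting counters are assumed values.&lt;/p&gt;

```go
package main

import "fmt"

// Node matches the struct used in the simulation.
type Node struct {
	name        string
	counter     int
	partitioned bool
}

// WriteLocal is a hypothetical AP-style alternative to Write: it always
// accepts the value, even when the node is cut off from its peers.
func (n *Node) WriteLocal(value int) {
	n.counter = value
	if n.partitioned {
		fmt.Printf("Node %s accepted %d locally (not replicated)\n", n.name, value)
		return
	}
	fmt.Printf("Node %s updated to %d\n", n.name, value)
}

func main() {
	// Assumed starting state: A has a newer value, B is cut off.
	nodeA := Node{name: "Node A", counter: 20}
	nodeB := Node{name: "Node B", counter: 10, partitioned: true}

	nodeB.WriteLocal(30) // the partitioned node still answers

	// The two nodes now disagree: availability was kept, consistency was not.
	fmt.Printf("A=%d B=%d diverged=%t\n",
		nodeA.counter, nodeB.counter, nodeA.counter != nodeB.counter)
}
```

&lt;p&gt;The request succeeds, but the cluster now holds two different values. That divergence is the price of staying available during a partition.&lt;/p&gt;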



&lt;p&gt;&lt;strong&gt;Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From this simulation, we observed:&lt;br&gt;
Consistency vs. Availability: Synchronization keeps the connected nodes consistent, but a partitioned node rejects writes, so availability is sacrificed for that node.&lt;br&gt;
Partition Tolerance: The rest of the cluster keeps working during a partition, while the partitioned node holds stale data until it rejoins.&lt;br&gt;
Trade-Offs Are Inevitable: Designing distributed systems requires clear prioritization based on the use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next Steps&lt;/strong&gt;&lt;br&gt;
This simulation is just the beginning. Potential extensions include:&lt;/p&gt;

&lt;p&gt;Asynchronous Synchronization: Use goroutines with artificial delays to simulate real-world latencies.&lt;br&gt;
Recovery Mechanisms: Handle nodes rejoining the cluster after partitions.&lt;br&gt;
Monitoring: Add metrics for latency, synchronization rates, and failure handling.&lt;/p&gt;
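&lt;p&gt;As a starting point for the recovery extension, here is one possible sketch. The &lt;code&gt;heal&lt;/code&gt; method is hypothetical: it reconnects a node and copies the counter from the first connected peer, which assumes the connected side always holds the value to keep.&lt;/p&gt;

```go
package main

import "fmt"

// Node and Cluster match the structs used in the simulation.
type Node struct {
	name        string
	counter     int
	partitioned bool
}

type Cluster struct {
	nodes []*Node
}

// heal is a hypothetical recovery step: it reconnects the named node and
// copies the counter from the first connected peer it finds.
func (c *Cluster) heal(name string) bool {
	var target, source *Node
	for _, n := range c.nodes {
		if n.name == name {
			target = n
		} else if !n.partitioned {
			if source == nil {
				source = n
			}
		}
	}
	if target == nil || source == nil {
		return false // unknown node, or no connected peer to copy from
	}
	target.partitioned = false
	target.counter = source.counter
	fmt.Printf("Node %s rejoined and caught up to %d\n", target.name, target.counter)
	return true
}

func main() {
	cluster := Cluster{nodes: []*Node{
		{name: "Node A", counter: 20},
		{name: "Node B", counter: 10, partitioned: true}, // missed the write of 20
	}}
	cluster.heal("Node B") // prints: Node Node B rejoined and caught up to 20
}
```

&lt;p&gt;If partitioned nodes were allowed to accept writes, a plain copy like this would silently discard them. A fuller version would need version numbers or a merge rule to decide which value wins.&lt;/p&gt;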

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
The CAP Theorem encapsulates the complexity of distributed systems. Through this Golang simulation, we gained hands-on experience with its principles and trade-offs. Whether building databases or scalable services, understanding CAP is key to making informed architectural decisions.&lt;/p&gt;

&lt;p&gt;What are your thoughts on the CAP Theorem? Have you faced similar trade-offs in your projects? Let me know in the comments!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
