<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: rajeevrajeshuni</title>
    <description>The latest articles on Forem by rajeevrajeshuni (@rajeevrajeshuni).</description>
    <link>https://forem.com/rajeevrajeshuni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F551067%2Fc0e047d4-606b-46ad-b2b6-55d508f063f6.png</url>
      <title>Forem: rajeevrajeshuni</title>
      <link>https://forem.com/rajeevrajeshuni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rajeevrajeshuni"/>
    <language>en</language>
    <item>
      <title>Introduction to Bloom Filters</title>
      <dc:creator>rajeevrajeshuni</dc:creator>
      <pubDate>Sun, 01 Mar 2026 05:03:53 +0000</pubDate>
      <link>https://forem.com/rajeevrajeshuni/introduction-to-bloom-filters-da3</link>
      <guid>https://forem.com/rajeevrajeshuni/introduction-to-bloom-filters-da3</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Bloom Filters&lt;/li&gt;
&lt;li&gt;Operations on the Bloom Filter&lt;/li&gt;
&lt;li&gt;Example&lt;/li&gt;
&lt;li&gt;Advantages of a Bloom Filter Over a Set Data Structure&lt;/li&gt;
&lt;li&gt;Choosing a Hash Function&lt;/li&gt;
&lt;li&gt;Tuning a Bloom Filter&lt;/li&gt;
&lt;li&gt;Example Use Cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction to Bloom Filters
&lt;/h2&gt;

&lt;p&gt;A Bloom filter is a probabilistic data structure used to solve the set membership problem. Its relaxed constraint is that false positives are tolerated up to an acceptable rate.&lt;/p&gt;

&lt;p&gt;A Bloom filter records the presence of items in a bit array, using hash functions so that it never has to store the items themselves.&lt;/p&gt;

&lt;p&gt;Bloom filters can give false positives, but never false negatives: if the filter returns positive, the element may be present in the set; if it returns negative, the element is definitely not present.&lt;/p&gt;

&lt;p&gt;Now let us define a few variables before jumping into the algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;n&lt;/code&gt; - The number of elements to be added to the set.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;k&lt;/code&gt; - The number of hash functions we use.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;m&lt;/code&gt; - The length of the bit array we will use.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let us assume we have &lt;code&gt;n&lt;/code&gt; elements to insert into the set. Before inserting, we create a bit array of size &lt;code&gt;m&lt;/code&gt; with all values set to 0, and &lt;code&gt;k&lt;/code&gt; hash functions, each returning values in the range 0 to &lt;code&gt;m&lt;/code&gt;-1.&lt;/p&gt;
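&lt;p&gt;Putting the setup together as a minimal Go sketch (the seeded FNV-1a construction is just one illustrative way to obtain &lt;code&gt;k&lt;/code&gt; independent hash functions; the names are ours):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// newHashFunctions builds k hash functions, each returning an index in
// the range 0 to m-1, by seeding FNV-1a with a distinct prefix byte.
func newHashFunctions(k, m int) []func(string) int {
	fns := make([]func(string) int, k)
	for i := range fns {
		seed := byte(i)
		fns[i] = func(item string) int {
			h := fnv.New64a()
			h.Write([]byte{seed})
			h.Write([]byte(item))
			return int(h.Sum64() % uint64(m))
		}
	}
	return fns
}

func main() {
	m := 16
	bitarray := make([]byte, m) // all bits start at 0
	hashFunctions := newHashFunctions(3, m)
	fmt.Println(len(bitarray), len(hashFunctions)) // prints "16 3"
}
```

&lt;p&gt;The &lt;code&gt;bitarray&lt;/code&gt; and &lt;code&gt;hashFunctions&lt;/code&gt; variables here play the roles assumed by the add and lookup snippets below.&lt;/p&gt;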

&lt;h2&gt;
  
  
  Operations on the Bloom Filter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Steps for Adding an Element
&lt;/h3&gt;

&lt;p&gt;For each element, we run the &lt;code&gt;k&lt;/code&gt; hash functions and use each result as an index into the bit array, setting the bit at that index to 1. We also store the element itself in secondary storage for bookkeeping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;addElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="n"&gt;SetItem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Iterate over the array of hash functions&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashFunction&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;hashFunctions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;hashFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;bitarray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c"&gt;// Here you'd save the item to secondary storage.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Steps for Looking Up an Element
&lt;/h3&gt;

&lt;p&gt;Similar to adding, we run the &lt;code&gt;k&lt;/code&gt; hash functions on the element and use each result as an index into the bit array. If the bit at every one of those indexes is 1, the Bloom filter returns true.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;lookUpElement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="n"&gt;SetItem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Iterating over the array of hash functions&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashFunction&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;hashFunctions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;hashFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bitarray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The bits at those indexes could have been set to 1 by different elements, producing a false positive, but that is the tradeoff a Bloom filter makes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Steps for Deleting an Element
&lt;/h3&gt;

&lt;p&gt;You cannot delete an element from a standard Bloom filter. This is a significant drawback, but it is the price paid for the reduced space usage.&lt;/p&gt;
&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;We choose three hash functions and a bit array of size 16 with all bits as 0. Let us add elements &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Flow for Adding the Elements
&lt;/h3&gt;

&lt;p&gt;We take the element and get the hash value from the three functions and mark the corresponding bits as 1 in the bit array.&lt;/p&gt;

&lt;p&gt;Let the hash values for &lt;code&gt;A&lt;/code&gt; be 7, 1, and 14. So the array will change to:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Values&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Index&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let the hash values for &lt;code&gt;B&lt;/code&gt; be 8, 1, and 15. So the array will change to:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Values&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;0&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Index&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Flow for Looking Up an Item
&lt;/h3&gt;

&lt;p&gt;Let's say we are searching for the element &lt;code&gt;C&lt;/code&gt;, whose hash values are 1, 15, and 9. The bits at indexes 1 and 15 are 1, but the bit at index 9 is 0, so we can safely say that &lt;code&gt;C&lt;/code&gt; is not part of the set.&lt;/p&gt;
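&lt;p&gt;The worked example can be replayed in a few lines of Go; the hash values below are hard-coded from the tables above, not computed by real hash functions:&lt;/p&gt;

```go
package main

import "fmt"

// Hash values taken from the worked example (illustrative only).
var exampleHashes = map[string][3]int{
	"A": {7, 1, 14},
	"B": {8, 1, 15},
	"C": {1, 15, 9},
}

func main() {
	bitarray := make([]int, 16)
	// Add A and B: set the bit at each of their hash values.
	for _, name := range []string{"A", "B"} {
		for _, v := range exampleHashes[name] {
			bitarray[v] = 1
		}
	}
	// Look up C: present only if every probed bit is 1.
	present := true
	for _, v := range exampleHashes["C"] {
		if bitarray[v] == 0 {
			present = false
		}
	}
	fmt.Println(present) // prints "false": C is definitely not in the set
}
```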
&lt;h3&gt;
  
  
  Why Array Lengths in Powers of 2?
&lt;/h3&gt;

&lt;p&gt;When hashing, why do we usually choose array lengths that are powers of 2? Because we take the hash value modulo the array size to find an index, and when the size is a power of 2, the modulo reduces to a bitwise AND with &lt;code&gt;m - 1&lt;/code&gt;, i.e. keeping the low bits. For example, 254 is &lt;code&gt;11111110&lt;/code&gt; in binary, so &lt;code&gt;254 % 16&lt;/code&gt; is just its last 4 bits, &lt;code&gt;1110&lt;/code&gt;, which is 14.&lt;/p&gt;
&lt;h2&gt;
  
  
  Advantages of a Bloom Filter Over a Set Data Structure
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimized Space Usage&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Usually a set data structure is maintained using a self-balancing binary tree, hash table, array, linked list, or trie.&lt;/li&gt;
&lt;li&gt;These data structures must store at least the elements themselves (a trie being a partial exception), whether the items are small integers or large strings.&lt;/li&gt;
&lt;li&gt;In contrast, a Bloom filter uses the same amount of space per item, regardless of the item's length.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Constant Insertion and Lookup Time&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bloom filters have the advantage that insertion time is a constant, &lt;code&gt;O(k)&lt;/code&gt;. The same is true for lookup time.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Parallelizing Hash Functions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;k&lt;/code&gt; hash functions are independent of each other, so their execution can be parallelized.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Choosing a Hash Function
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Requirements for the Hash Function&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast computation&lt;/li&gt;
&lt;li&gt;Uniform distribution&lt;/li&gt;
&lt;li&gt;Cryptographic security is not needed; resistance to reverse engineering is not a concern here.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Common Hash Functions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In line with the above requirements, Bloom filters typically use hash functions such as MurmurHash, the Jenkins hash, and the FNV hash.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Tuning a Bloom Filter
&lt;/h2&gt;

&lt;p&gt;Intuitively, as we keep adding elements to a Bloom filter, collisions increase and the false positive rate rises with them; in the worst case, the filter returns true for every element. Since all the elements are also saved in secondary storage, we can increase &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;k&lt;/code&gt; and build a fresh Bloom filter from them.&lt;/p&gt;

&lt;p&gt;We can see that there is a tradeoff between the false positive probability &lt;code&gt;p&lt;/code&gt;, the number of elements &lt;code&gt;n&lt;/code&gt;, the number of bits &lt;code&gt;m&lt;/code&gt; in the Bloom filter &lt;code&gt;T&lt;/code&gt;, and the number of hash functions &lt;code&gt;k&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For simplicity, let us assume that for any element &lt;code&gt;e&lt;/code&gt; and hash function &lt;code&gt;h_i&lt;/code&gt;, &lt;code&gt;h_i(e)&lt;/code&gt; is distributed independently and uniformly over the range &lt;code&gt;1&lt;/code&gt; to &lt;code&gt;m&lt;/code&gt;. Let us also assume that the events that the different bits &lt;code&gt;T[l]&lt;/code&gt; of the filter are set to &lt;code&gt;1&lt;/code&gt; are independent.&lt;/p&gt;

&lt;p&gt;For any &lt;code&gt;1 ≤ l ≤ m&lt;/code&gt;, we evaluate the probability that &lt;code&gt;T[l] = 0&lt;/code&gt; after inserting all &lt;code&gt;n&lt;/code&gt; elements into the filter.&lt;/p&gt;

&lt;p&gt;For each &lt;code&gt;e ∈ S&lt;/code&gt; and &lt;code&gt;1 ≤ i ≤ k&lt;/code&gt;, the probability that &lt;code&gt;h_i(e) ≠ l&lt;/code&gt; is &lt;code&gt;(1 - 1/m)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The probability that &lt;code&gt;T[l] = 0&lt;/code&gt; after inserting all &lt;code&gt;n&lt;/code&gt; elements with &lt;code&gt;k&lt;/code&gt; probes each is simply &lt;code&gt;(1 - 1/m)^{kn}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Thus, the probability that an entry &lt;code&gt;T[l] = 1&lt;/code&gt; after &lt;code&gt;n&lt;/code&gt; insertions is simply &lt;code&gt;1 - (1 - 1/m)^{kn}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A false positive occurs when &lt;code&gt;e' ∉ S&lt;/code&gt; but &lt;code&gt;T[h_i(e')] = 1&lt;/code&gt;, for all &lt;code&gt;1 ≤ i ≤ k&lt;/code&gt;. Hence:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;p = (1 - (1 - 1/m)^{kn})^k ≈ (1 - e^{-kn/m})^k&lt;/code&gt; (Equation 1)&lt;/p&gt;



&lt;p&gt;For a given &lt;code&gt;n&lt;/code&gt; and &lt;code&gt;m&lt;/code&gt;, one can find the &lt;code&gt;k&lt;/code&gt; that minimizes the probability &lt;code&gt;p&lt;/code&gt; by differentiating the right-hand side of Equation 1 with respect to &lt;code&gt;k&lt;/code&gt; and setting it to zero. Solving for &lt;code&gt;k&lt;/code&gt; provides the following.&lt;/p&gt;


&lt;p&gt;&lt;code&gt;k ≈ (ln 2) · (m/n)&lt;/code&gt; (Equation 2)&lt;/p&gt;


&lt;p&gt;Plugging in the value of &lt;code&gt;k&lt;/code&gt; from Equation 2 into Equation 1, we get the following relation.&lt;/p&gt;


&lt;p&gt;&lt;code&gt;p ≈ 2^{-k} ≈ 2^{-(m/n) ln 2}&lt;/code&gt;&lt;/p&gt;


&lt;p&gt;You can explore &lt;a href="https://hur.st/bloomfilter/" rel="noopener noreferrer"&gt;https://hur.st/bloomfilter/&lt;/a&gt; to get these values.&lt;/p&gt;
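&lt;p&gt;Rearranging the last relation gives &lt;code&gt;m = -n ln p / (ln 2)^2&lt;/code&gt;, so the filter can also be sized directly for a target false positive rate. A small Go sketch (the function name is ours):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// bloomParams returns the bit-array size m and hash-function count k
// for n elements and a target false-positive probability p, using the
// approximations derived above.
func bloomParams(n int, p float64) (m, k int) {
	ln2 := math.Ln2
	mf := -float64(n) * math.Log(p) / (ln2 * ln2) // m = -n ln p / (ln 2)^2
	m = int(math.Ceil(mf))
	k = int(math.Round(ln2 * mf / float64(n))) // k ≈ (ln 2) · (m/n)
	return m, k
}

func main() {
	m, k := bloomParams(1000000, 0.01)
	fmt.Println(m, k) // about 9.59 million bits and 7 hash functions
}
```

&lt;p&gt;For one million elements at a 1% false positive rate, this works out to roughly 1.2 MB of bits, far less than storing the elements themselves.&lt;/p&gt;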

&lt;h2&gt;
  
  
  Example Use Cases
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Akamai's Caching Strategy&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Akamai uses a Bloom filter to avoid caching one-hit wonders: an item is cached only when it is accessed a second time.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>bloomfilter</category>
      <category>distributedsystems</category>
      <category>database</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Unpacking the Google File System Paper: A Simple Breakdown</title>
      <dc:creator>rajeevrajeshuni</dc:creator>
      <pubDate>Mon, 15 Dec 2025 18:31:56 +0000</pubDate>
      <link>https://forem.com/rajeevrajeshuni/google-file-system-gfs-keep-it-simple-scale-big-3277</link>
      <guid>https://forem.com/rajeevrajeshuni/google-file-system-gfs-keep-it-simple-scale-big-3277</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquxw6ek7a3yl2mm4sjyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquxw6ek7a3yl2mm4sjyk.png" alt="Google File System Architecture" width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Philosophy: Keep It Simple, Scale Big
&lt;/h2&gt;

&lt;p&gt;GFS prioritizes simplicity, high throughput, and fault tolerance over strict consistency. Key ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relaxed consistency&lt;/strong&gt;: Trades strict data guarantees for performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single master&lt;/strong&gt;: Simplifies metadata management but risks bottlenecks (minimized by design).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload focus&lt;/strong&gt;: Optimized for large files and sequential access, not small files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  GFS Architecture: Masters, Chunkservers, Clients
&lt;/h2&gt;

&lt;p&gt;GFS has three main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Master&lt;/strong&gt;: Stores metadata (file structure, chunk locations) in memory for speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunkservers&lt;/strong&gt;: Store 64 MB data chunks, replicated (usually 3x) on local disks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clients&lt;/strong&gt;: Applications that get metadata from the master and data directly from chunkservers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The single master simplifies the design while staying lightweight by handling only metadata operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Choice #1: Big 64 MB Chunks
&lt;/h2&gt;

&lt;p&gt;GFS uses 64 MB chunks instead of the typical 4 KB blocks found in most file systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fewer master queries&lt;/strong&gt;: Large chunks mean less metadata communication overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable connections&lt;/strong&gt;: Long-lived TCP connections to chunkservers reduce network overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small metadata footprint&lt;/strong&gt;: Fewer chunks keep the master's memory usage low.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Downsides
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small files waste space&lt;/strong&gt;: A tiny file still consumes an entire 64 MB chunk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hotspots&lt;/strong&gt;: Popular small files can overload individual chunkservers (mitigated with extra replicas).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Design Choice #2: Split Control and Data Planes
&lt;/h2&gt;

&lt;p&gt;GFS separates metadata management (master) from data storage (chunkservers).&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Master handles file namespace and chunk location mapping.&lt;/li&gt;
&lt;li&gt;Clients communicate directly with chunkservers for actual data transfer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight master&lt;/strong&gt;: No data handling keeps the master fast and responsive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High throughput&lt;/strong&gt;: Direct client-chunkserver communication maximizes bandwidth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple design&lt;/strong&gt;: Clear separation of concerns makes GFS easier to manage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture supports thousands of concurrent clients without bottlenecking.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Reads Work
&lt;/h2&gt;

&lt;p&gt;Reading in GFS follows a simple pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Client converts file name and byte offset into a chunk index (offset ÷ 64 MB).&lt;/li&gt;
&lt;li&gt;Client requests chunk handle and chunkserver locations from master.&lt;/li&gt;
&lt;li&gt;Master responds with metadata; client caches this information.&lt;/li&gt;
&lt;li&gt;Client directly requests data from the appropriate chunkserver.&lt;/li&gt;
&lt;li&gt;Chunkserver returns the requested byte range.&lt;/li&gt;
&lt;/ol&gt;
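&lt;p&gt;Step 1 is plain integer arithmetic; as a tiny Go sketch (the function name is illustrative):&lt;/p&gt;

```go
package main

import "fmt"

const chunkSize = 64 * 1024 * 1024 // 64 MB

// chunkIndex converts a byte offset within a file into the chunk
// index that the client sends to the master.
func chunkIndex(offset int64) int64 {
	return offset / chunkSize
}

func main() {
	// An offset of 200 MB falls in the fourth chunk (index 3).
	fmt.Println(chunkIndex(200 * 1024 * 1024)) // prints "3"
}
```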

&lt;h2&gt;
  
  
  Design Choice #3: Two Write Patterns
&lt;/h2&gt;

&lt;p&gt;GFS handles writes differently depending on the operation type:&lt;/p&gt;

&lt;h3&gt;
  
  
  Concurrent Writes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Master designates one replica as the primary for each chunk.&lt;/li&gt;
&lt;li&gt;Clients send writes to the primary, which coordinates with secondary replicas.&lt;/li&gt;
&lt;li&gt;Primary ensures all replicas apply writes in the same order.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Concurrent Appends
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Still uses the primary replica for coordination and ordering.&lt;/li&gt;
&lt;li&gt;GFS (via the primary) automatically selects the append offset to avoid client coordination overhead.&lt;/li&gt;
&lt;li&gt;Clients receive the actual offset where data was written.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lock-free&lt;/strong&gt;: Multiple clients can append simultaneously without coordination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic Append Guarantees&lt;/strong&gt;: Each append operation is guaranteed to complete atomically.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Design Choice #4: Relaxed Consistency Model
&lt;/h2&gt;

&lt;p&gt;GFS deliberately avoids strict consistency guarantees.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Simplifies the overall system design and improves performance.&lt;/li&gt;
&lt;li&gt;Aligns with Google's append-heavy, read-mostly application patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implications
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistent metadata&lt;/strong&gt;: File creation and deletion operations are strongly consistent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relaxed data consistency&lt;/strong&gt;: Concurrent writes may result in interleaved data; clients must identify valid data regions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This trade-off works well for Google's specific use cases but requires application-level awareness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fault Tolerance Mechanisms
&lt;/h2&gt;

&lt;p&gt;GFS ensures high availability through several techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write-Ahead Logging (WAL)&lt;/strong&gt;: Master operations are logged to disk before acknowledgment, ensuring metadata durability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shadow masters&lt;/strong&gt;: Backup master servers provide read-only access and failover capability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified recovery&lt;/strong&gt;: Only namespace and file-to-chunk mappings are persisted; chunk locations are rediscovered by querying chunkservers at startup.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Integrity: Checksums
&lt;/h2&gt;

&lt;p&gt;GFS employs &lt;strong&gt;checksums&lt;/strong&gt; to detect data corruption (critical with thousands of commodity disks):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chunkservers verify data integrity during both reads and writes.&lt;/li&gt;
&lt;li&gt;Essential for maintaining reliability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges and Limitations
&lt;/h2&gt;

&lt;p&gt;GFS has some inherent limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single master bottleneck&lt;/strong&gt;: A potential chokepoint, though rarely an issue in practice thanks to the lightweight design and shadow replicas for reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small file inefficiency&lt;/strong&gt;: 64 MB chunks are suboptimal for small files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency complexity&lt;/strong&gt;: Clients must handle potentially inconsistent data regions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;GFS revolutionized distributed storage by demonstrating how to build massively scalable systems through careful trade-offs. Its large chunk size, separated control/data planes, and relaxed consistency model deliver exceptional performance and fault tolerance for specific workloads.&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>dataengineering</category>
      <category>systemdesign</category>
      <category>pwl</category>
    </item>
  </channel>
</rss>
