<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sourav Roy</title>
    <description>The latest articles on Forem by Sourav Roy (@souravroyetl).</description>
    <link>https://forem.com/souravroyetl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3881107%2F974a5223-aa12-4fcf-819d-290a24e33e79.jpeg</url>
      <title>Forem: Sourav Roy</title>
      <link>https://forem.com/souravroyetl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/souravroyetl"/>
    <language>en</language>
    <item>
      <title>Building a Vectorized SQL Engine from Scratch in C++ — No Dependencies</title>
      <dc:creator>Sourav Roy</dc:creator>
      <pubDate>Wed, 15 Apr 2026 19:11:28 +0000</pubDate>
      <link>https://forem.com/souravroyetl/building-a-vectorized-sql-engine-from-scratch-in-c-no-dependencies-5a07</link>
      <guid>https://forem.com/souravroyetl/building-a-vectorized-sql-engine-from-scratch-in-c-no-dependencies-5a07</guid>
<description>

&lt;p&gt;I built an embedded analytical database engine (&lt;a href="https://github.com/SouravRoy-ETL/slothdb" rel="noopener noreferrer"&gt;https://github.com/SouravRoy-ETL/slothdb&lt;/a&gt;) entirely from scratch in C++20 — no external libraries, not even for SQL parsing or Parquet file reading. Here's how each subsystem works under the hood.&lt;/p&gt;

&lt;h2&gt;The Execution Model: Why Vectorized Beats Row-at-a-Time&lt;/h2&gt;

&lt;p&gt;Traditional databases process one row at a time. Each row travels through every operator (scan → filter → project → aggregate) individually. The overhead per row is massive: virtual function calls, branch mispredictions, poor cache utilization.&lt;/p&gt;

&lt;p&gt;Vectorized execution processes batches of 2,048 values at once. Instead of calling filter(row) 10 million times, you call filter(vector_of_2048) about 5,000 times. Each call processes a tight loop over a flat array — perfect for CPU caches and SIMD auto-vectorization.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Row-at-a-time (slow)
for (each row) {
    if (row.salary &amp;gt; 100000) emit(row);  // virtual call, branch, cache miss
}
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;// Vectorized (fast)
auto *salaries = vector.GetData();  // flat array, cache-friendly
auto *output = result.GetData();
for (idx_t i = 0; i &amp;lt; 2048; i++) {
    output[i] = salaries[i] &amp;gt; 100000;  // tight loop, auto-vectorizes
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The Vector class is the core data structure. It stores a flat data_ptr_t buffer, a ValidityMask (bitvector for NULLs — 1 bit per value, 256 bytes for 2,048 values), and supports four modes: FLAT, CONSTANT, DICTIONARY, and SEQUENCE. The CONSTANT mode is key — if an entire vector has the same value (common in WHERE x = 5), we store it once and skip the loop entirely.&lt;/p&gt;
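&lt;p&gt;To make the CONSTANT fast path concrete, here's a minimal sketch (the class and enum names here are mine, not the engine's actual API): a comparison over a CONSTANT vector collapses to a single test, while a FLAT vector takes the tight loop.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not SlothDB's real classes.
enum class VectorMode { FLAT, CONSTANT, DICTIONARY, SEQUENCE };

struct Int32Vector {
    VectorMode mode;
    std::vector<int32_t> data;  // FLAT: one slot per row; CONSTANT: data[0] only
    std::size_t count;          // logical row count
};

// Evaluate "v > threshold" into a boolean output, exploiting CONSTANT mode.
std::vector<bool> GreaterThan(const Int32Vector &v, int32_t threshold) {
    if (v.mode == VectorMode::CONSTANT) {
        // One comparison covers every row: no per-row loop at all.
        return std::vector<bool>(v.count, v.data[0] > threshold);
    }
    std::vector<bool> out(v.count);
    for (std::size_t i = 0; i < v.count; i++) {  // tight loop over flat array
        out[i] = v.data[i] > threshold;
    }
    return out;
}
```

&lt;p&gt;The same dispatch pattern applies to every operator: check the mode once per batch, not once per row.&lt;/p&gt;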

&lt;h2&gt;The String Problem: 16-Byte Inline Strings&lt;/h2&gt;

&lt;p&gt;String comparison is the #1 bottleneck in analytical queries. Most engines store strings as heap-allocated std::string objects — each comparison requires chasing a pointer, causing cache misses.&lt;/p&gt;

&lt;p&gt;I use a 16-byte inline string representation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;struct string_t {  // exactly 16 bytes, fits in one cache line
    uint32_t length;
    char data_[12];  // strings ≤ 12 bytes stored entirely inline
    // For longer strings: first 4 bytes = prefix, next 8 bytes = pointer
};
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Strings of 12 bytes or fewer (which covers ~80% of real-world data: country codes, status flags, short names) never touch the heap. For longer strings, the 4-byte prefix enables fast inequality comparisons without dereferencing the pointer — if the prefixes differ, you already know the result.&lt;/p&gt;
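&lt;p&gt;A simplified sketch of that prefix fast path (field names are hypothetical, and the 12-byte inline case is collapsed into an always-present pointer for brevity): equality first rejects on length, then on the 4-byte prefix, and only chases the pointer as a last resort.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Illustrative layout only; the real string_t packs the 12 data bytes as
// described above.
struct string_t {
    uint32_t length;
    char prefix[4];   // first 4 bytes, always materialized (zero-padded)
    const char *ptr;  // full data; points into an external buffer
};

string_t Make(const std::string &s) {
    string_t out{};  // zero-init so short prefixes compare cleanly
    out.length = static_cast<uint32_t>(s.size());
    std::memcpy(out.prefix, s.data(), s.size() < 4 ? s.size() : 4);
    out.ptr = s.data();  // caller must keep `s` alive
    return out;
}

bool Equals(const string_t &a, const string_t &b) {
    if (a.length != b.length) return false;                    // cheap reject
    if (std::memcmp(a.prefix, b.prefix, 4) != 0) return false; // prefix reject,
                                                               // no pointer chase
    return std::memcmp(a.ptr, b.ptr, a.length) == 0;           // full compare
}
```

&lt;p&gt;Only the third check touches memory outside the 16-byte struct, so most mismatches resolve within a single cache line.&lt;/p&gt;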

&lt;h2&gt;Writing a SQL Parser with Zero Dependencies&lt;/h2&gt;

&lt;p&gt;Most databases use parser generators (Yacc, Bison, ANTLR). I hand-wrote a recursive descent parser instead. Why? Zero dependencies, full control over error messages, and better performance (no generated code bloat).&lt;/p&gt;

&lt;p&gt;The tokenizer converts SQL text into tokens in a single pass. Keywords are looked up in a hash map (std::unordered_map with ~100 entries). The parser is a set of mutually recursive functions following the grammar precedence:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ParseExpression → ParseOr → ParseAnd → ParseNot → ParseComparison
→ ParseAddSub → ParseMulDiv → ParseUnary → ParsePrimary
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each function handles one precedence level. ParsePrimary handles atoms: constants, column references, function calls, parenthesized expressions, CASE WHEN, CAST, EXISTS subqueries, and window functions.&lt;/p&gt;
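&lt;p&gt;The shape of that chain can be sketched with just two of the levels over a toy arithmetic grammar (an illustration, not the engine's actual parser): each function consumes its own operators and delegates tighter-binding ones to the level below.&lt;/p&gt;

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Minimal two-level recursive descent sketch: integer literals, + - * /.
struct Parser {
    std::string src;
    std::size_t pos = 0;

    char Peek() { return pos < src.size() ? src[pos] : '\0'; }

    long ParsePrimary() {  // atoms: integer literals only, in this toy grammar
        long v = 0;
        while (std::isdigit(static_cast<unsigned char>(Peek()))) {
            v = v * 10 + (src[pos++] - '0');
        }
        return v;
    }
    long ParseMulDiv() {  // binds tighter than + and -
        long lhs = ParsePrimary();
        while (Peek() == '*' || Peek() == '/') {
            char op = src[pos++];
            long rhs = ParsePrimary();
            lhs = (op == '*') ? lhs * rhs : lhs / rhs;
        }
        return lhs;
    }
    long ParseAddSub() {  // entry point: lowest precedence in this sketch
        long lhs = ParseMulDiv();
        while (Peek() == '+' || Peek() == '-') {
            char op = src[pos++];
            long rhs = ParseMulDiv();
            lhs = (op == '+') ? lhs + rhs : lhs - rhs;
        }
        return lhs;
    }
};
```

&lt;p&gt;Precedence falls out of the call structure: ParseAddSub calls ParseMulDiv for both operands, so multiplication is always grouped first.&lt;/p&gt;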

&lt;p&gt;The trickiest part was LEFT and RIGHT — they're both SQL keywords (for joins) AND function names (LEFT(s, 3)). The parser checks if the keyword is followed by ( and routes accordingly.&lt;/p&gt;

&lt;p&gt;A depth counter prevents stack overflow from deeply nested expressions (limit: 256 levels).&lt;/p&gt;

&lt;h2&gt;Implementing Parquet Without Apache Arrow&lt;/h2&gt;

&lt;p&gt;This was the hardest part. Parquet is a complex columnar file format:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File starts and ends with magic bytes PAR1&lt;/li&gt;
&lt;li&gt;Footer at the end contains schema, row group metadata, column statistics&lt;/li&gt;
&lt;li&gt;Data is organized in row groups → column chunks → pages&lt;/li&gt;
&lt;li&gt;Each page contains encoded data (PLAIN, RLE, DICTIONARY, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most implementations use Apache's C++ Parquet library, which pulls in Arrow, Thrift, Boost, and 50+ transitive dependencies. I implemented the format from scratch.&lt;/p&gt;

&lt;p&gt;The reader:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reads the 4-byte footer size from file_end - 8&lt;/li&gt;
&lt;li&gt;Seeks back and reads the footer (binary metadata with column names, types, offsets)&lt;/li&gt;
&lt;li&gt;For each column chunk, reads data at the specified offset&lt;/li&gt;
&lt;li&gt;Decodes values based on the physical type (INT32, INT64, DOUBLE, BYTE_ARRAY)&lt;/li&gt;
&lt;/ol&gt;
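&lt;p&gt;Step 1 can be sketched like this, assuming the file is already in memory (the function name is mine; the 8-byte tail layout, a 4-byte little-endian footer size followed by the magic bytes "PAR1", comes from the Parquet spec). Note the bounds check: the length field comes from the file and is never trusted.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Recover the footer length from an in-memory Parquet file image.
// Returns false on any inconsistency rather than trusting file contents.
bool ReadFooterLength(const std::vector<uint8_t> &file, uint32_t &footer_len) {
    // Smallest plausible file: leading PAR1 + footer + 4-byte size + PAR1.
    if (file.size() < 12) return false;
    const uint8_t *tail = file.data() + file.size() - 8;
    if (std::memcmp(tail + 4, "PAR1", 4) != 0) return false;  // trailing magic
    std::memcpy(&footer_len, tail, 4);  // assumes a little-endian host
    // The claimed footer must actually fit inside the file.
    return footer_len <= file.size() - 12;
}
```

&lt;p&gt;With the length validated, the reader can safely seek back footer_len + 8 bytes from the end and decode the metadata.&lt;/p&gt;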

&lt;p&gt;Row group statistics (min/max per column) enable predicate pushdown — if you query WHERE salary &amp;gt; 200000 and a row group's max salary is 150000, we skip the entire row group without reading it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;bool ParquetReader::RowGroupMightMatch(idx_t rg_idx, idx_t col_idx,
                                       const std::string &amp;amp;op, const Value &amp;amp;val) {
    auto &amp;amp;col = meta_.row_groups[rg_idx].columns[col_idx];
    if (!col.has_stats) return true;  // can't skip, read it
    if (op == "&amp;gt;") return !(col.max_value &amp;lt; val);  // max &amp;lt; target → skip
    // ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Dictionary-Encoded Execution: The 10x String Speedup&lt;/h2&gt;

&lt;p&gt;When a column has low cardinality (few unique values — think department, country, status), dictionary encoding replaces strings with integer codes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Original:    ["Engineering", "Sales", "Engineering", "Sales", "Marketing"]
Dictionary:  {0: "Engineering", 1: "Sales", 2: "Marketing"}
Codes:       [0, 1, 0, 1, 2]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The key insight: all operations (GROUP BY, JOIN, filter, hash) work on the integer codes, not the strings. Comparing two 4-byte integers is ~10x faster than comparing two variable-length strings. Hashing an integer is O(1); hashing a string is O(n).&lt;/p&gt;

&lt;p&gt;The DictionaryVector class automatically encodes columns when cardinality &amp;lt; 50% of row count. GROUP BY on a dictionary-encoded column becomes a simple integer-to-bucket mapping.&lt;/p&gt;
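&lt;p&gt;A sketch of why that mapping is cheap (the types and names here are illustrative, not the engine's API): a COUNT(*) per group never touches a string, only the code array.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Dictionary-encoded column: one string per unique value, one code per row.
struct DictColumn {
    std::vector<std::string> dict;  // code -> unique value
    std::vector<uint32_t> codes;    // one code per row
};

// GROUP BY column, COUNT(*): the codes index directly into the bucket array,
// so the hot loop is pure integer arithmetic.
std::vector<std::size_t> CountPerGroup(const DictColumn &col) {
    std::vector<std::size_t> counts(col.dict.size(), 0);
    for (uint32_t c : col.codes) counts[c]++;
    return counts;
}
```

&lt;p&gt;The strings are only consulted once, at output time, to label each bucket.&lt;/p&gt;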

&lt;h2&gt;GPU Acceleration Architecture&lt;/h2&gt;

&lt;p&gt;The GPU engine has a clean abstraction:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class GPUEngine {
    virtual int64_t SumInt32(const int32_t *data, idx_t count) = 0;
    virtual vector&amp;lt;idx_t&amp;gt; Filter(const int32_t *data, idx_t count,
                                  const string &amp;amp;op, int32_t value) = 0;  // matching row indices
    virtual AggResult HashAggregate(const int32_t *keys,
                                    const int32_t *values, idx_t count) = 0;
};
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Three backends: CUDAEngine (NVIDIA), MetalEngine (Apple Silicon), CPUFallbackEngine. The engine auto-detects the available GPU at startup. Operations below 100K rows use the CPU (PCIe transfer overhead makes the GPU slower for small data).&lt;/p&gt;
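&lt;p&gt;The dispatch logic amounts to a size check. A sketch (the function and enum are hypothetical; the 100K threshold and the unified-memory exception are from the surrounding text):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>

// Below this row count, the PCIe round-trip dominates and the CPU path wins.
constexpr std::size_t kGpuRowThreshold = 100'000;

enum class Backend { CPU, GPU };

Backend ChooseBackend(std::size_t row_count, bool gpu_available,
                      bool unified_memory) {
    if (!gpu_available) return Backend::CPU;
    // Unified memory (Apple Silicon) skips the transfer entirely, so the
    // GPU can pay off even on small batches.
    if (unified_memory) return Backend::GPU;
    return row_count >= kGpuRowThreshold ? Backend::GPU : Backend::CPU;
}
```

&lt;p&gt;Keeping this decision in one place means every operator gets the same small-data fallback for free.&lt;/p&gt;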

&lt;p&gt;The CUDA kernels use parallel reduction for SUM (shared memory + warp shuffle), atomic operations for hash aggregation, and radix sort for ORDER BY.&lt;/p&gt;

&lt;p&gt;Apple Metal has a unique advantage: unified memory. No PCIe transfer needed — the GPU reads the same physical memory as the CPU. This makes GPU acceleration viable even for smaller datasets on Apple Silicon.&lt;/p&gt;

&lt;h2&gt;Security: What I Got Wrong the First Time&lt;/h2&gt;

&lt;p&gt;A security audit found 22 vulnerabilities in my initial implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parquet footer parsing: reading field lengths from the file without bounds checking → buffer overflow from malicious files. Fixed with a safe_read wrapper.&lt;/li&gt;
&lt;li&gt;GENERATE_SERIES(0, 9223372036854775807): attempted to allocate trillions of rows. Fixed with a 10M row limit.&lt;/li&gt;
&lt;li&gt;Extension loading: LOAD '/etc/malicious.so' → arbitrary code execution. Fixed by restricting to relative paths.&lt;/li&gt;
&lt;li&gt;Regex DoS: REGEXP_MATCHES(col, '(a+)+$') → catastrophic backtracking in std::regex. Noted for future RE2 migration.&lt;/li&gt;
&lt;li&gt;CSV formula injection: cells starting with = written unquoted → Excel command execution. Fixed by auto-quoting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The lesson: every value read from a file is attacker-controlled input. Every memcpy from file data needs a bounds check.&lt;/p&gt;
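&lt;p&gt;A bounds-checked read in the spirit of that safe_read wrapper might look like this (the exact signature is a guess): every length or offset that came from the file goes through it instead of a raw memcpy.&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy `len` bytes from `file` at `offset` into `dst`, refusing any read
// that would run past the end of the buffer. The check is written so that
// a huge attacker-supplied offset or length cannot overflow the arithmetic.
bool safe_read(const std::vector<uint8_t> &file, std::size_t offset,
               void *dst, std::size_t len) {
    // Overflow-safe form of "offset + len <= file.size()".
    if (offset > file.size() || len > file.size() - offset) return false;
    std::memcpy(dst, file.data() + offset, len);
    return true;
}
```

&lt;p&gt;The caller handles the false return (typically by rejecting the file), so a malicious length field fails loudly instead of corrupting memory.&lt;/p&gt;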

&lt;h2&gt;Numbers&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;~33,000 lines of C++20&lt;/li&gt;
&lt;li&gt;325 tests, 131,000+ assertions&lt;/li&gt;
&lt;li&gt;130+ SQL features&lt;/li&gt;
&lt;li&gt;7 file format readers (all from scratch)&lt;/li&gt;
&lt;li&gt;Single binary: ~700KB stripped&lt;/li&gt;
&lt;li&gt;Compiles with MSVC, GCC, and Clang&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/SouravRoy-ETL/slothdb" rel="noopener noreferrer"&gt;https://github.com/SouravRoy-ETL/slothdb&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>database</category>
      <category>sql</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
